-> THE HUMAN GENOME
-> HUMGEN1...know which way is the best to access the human genome data? (last update Mar. 2, 2006)
The corresponding chapter of the main page contains detailed but separate
descriptions of the major genome databases like UCSC, Ensembl, and NCBI. In
the following text, I will try to distill some important parts in order to
provide quick access to this topic.
1. Sequence search - graphical genome browsers:
Tip! The UCSC genome bioinformatics site
provides an easy-to-use but extremely powerful graphical genome
browser, which I would strongly recommend. The BLAT
search is a very efficient and extremely fast way to search the
genome using a query (DNA or protein) sequence. BLAT on
DNA is designed to quickly find sequences of 95% and greater similarity
of length 40 bases or more, meaning that BLAT is not as sensitive as
BLAST but sufficient to quickly identify the genomic position of human
query sequences. At the output page, there are 2 links:
"browser" and "details". The first one leads to
a graphical genome browser which is highly customizable,
meaning that you can select or de-select all entries (like ESTs,
RefSeqs, SNPs, UniGene clusters, repeats, human-mouse-homology items,
and many more). For example, this browser is by far the best one to
display the positions of EST sequences. The link "details" displays the
sequence alignment of your query sequence to the genomic sequence, in a
very user-friendly way, from 5' to 3', in contrast to BLAST,
which displays the matching fragments in descending order of their
degree of identity. The link "DNA" lets you download the
sequences currently shown in your browser window, again allowing many
user-definable sequence formatting options (like upper/lower-case or
number of letters
per line). The link "Convert" converts coordinates
from one draft of the human genome to another. The link "PDF/PS"
creates PDF and PostScript images of the current browser window,
which can be used by other applications (and which is of better quality
than a screenshot).
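
To make the difference between the two display orders more concrete, here is a minimal sketch in Python. It is not how BLAT or BLAST work internally; the alignment blocks, their coordinates, and the function names are hypothetical and only illustrate sorting by position (5' to 3') versus sorting by degree of identity.

```python
# Hypothetical alignment blocks: (query_start, query_end, percent_identity).
# These values are made up for illustration only.
blocks = [
    (310, 450, 99.3),
    (1, 120, 96.7),
    (121, 309, 100.0),
]


def order_by_position(blocks):
    """Order blocks along the query from 5' to 3' (ascending start)."""
    return sorted(blocks, key=lambda b: b[0])


def order_by_identity(blocks):
    """Order blocks by descending percent identity."""
    return sorted(blocks, key=lambda b: b[2], reverse=True)


if __name__ == "__main__":
    print("5' to 3':")
    for start, end, identity in order_by_position(blocks):
        print(f"  {start}-{end}  {identity}%")
    print("By identity:")
    for start, end, identity in order_by_identity(blocks):
        print(f"  {start}-{end}  {identity}%")
```

With the position-based order, the blocks can be read like the query sequence itself, which is what makes the "details" view easy to follow.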
The Ensembl BLASTView is a very good tool if you have a series of
sequences (max. 30) you want to search "in-batch", and if you want to
have full and
easy-to-understand control over all available options, like E-values,
dust filtering, repeats and more. You can follow the status of every
sequence individually. You can even define the columns which will be
displayed in the summary output table. The link "Contig view"
will reveal the region of interest in the Ensembl graphical genome
browser, which is also highly customizable but has some drawbacks,
such as the display of EST sequences. On the other hand, there are some
very nice features like the positions of individual Affymetrix
ProbeSets, a sub-classification of repeats, multiple export options,
and a direct connection to the other major genome browsers, UCSC and
NCBI, via the "Jump to" button.
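
If you have more than 30 query sequences, you need to split them into several submissions. The following Python sketch shows one way to do this before pasting the sequences into the web form; the function name and the sequence names are hypothetical, and only the limit of 30 is taken from the text above.

```python
def batches(items, size=30):
    """Split a list into consecutive chunks of at most `size` items.

    The default of 30 mirrors the per-submission limit mentioned in the text.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # Hypothetical sequence names, for illustration only.
    sequences = [f"seq{i}" for i in range(1, 66)]
    for number, chunk in enumerate(batches(sequences), start=1):
        print(f"submission {number}: {len(chunk)} sequences")
```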
The VEGA (Vertebrate Genome Annotation) database is a central
repository for high-quality, frequently updated, manual annotation
of vertebrate finished genome sequence. The website
is built upon code from the Ensembl
project and therefore the "look and feel" is very similar. The VEGA Human Genome Browser
is the human section of the VEGA database. This site shows data on
human chromosomes where a first-pass manual annotation has been
completed.
Currently (May 2005) there are fourteen chromosomes in the Vega human
database, comprising nearly 50% of the genome. The Sitemap
displays the different ways to access the Vega Human Database, which
are quite similar to those of the Ensembl database. As an example, the VEGA
BLASTView offers the possibility to search the VEGA database
with up to 30 query sequences.
H-Invitational
Database (H-InvDB) is a human gene database opened to the
public in April 2004, which is hosted by the Japan Biological
Information Research Center (JBIRC) and by the
DNA Databank of Japan (DDBJ),
with contributions from more than 40 institutes worldwide, such as the
German DKFZ. The aim of H-InvDB is to provide an integrative
annotation of full-length cDNA clones available from high
throughput cDNA sequencing projects. The database generates cDNA
clusters describing their gene structures, novel alternative
splicing isoforms, non-coding functional RNAs, functional domains,
sub-cellular localizations, metabolic pathways, predictions of protein
3D structure, mapping of SNPs and microsatellite repeat motifs in
relation to orphan diseases, gene expression profiling, and comparative
results with mouse full-length cDNAs in the context of molecular
evolution. You may simply BLAST the H-InvDB using your query sequence
and see the cluster it belongs to.
The link to G-integra provides a graphical view of the
cDNA alignments to the genomic sequence, thereby comparing the H-Inv
cDNAs to RefSeqs (NCBI) and Ensembl transcripts. Several features like
PolyA sites, repetitive elements, or SNPs are displayed. Please note
that the design of the genome viewer itself is not as good as that of
the UCSC browser, but the individual gene-specific database entries
provide a good point of data integration. Please also refer to the
H-InvDB section at the Data Integration page for a detailed description!
2. Text (keyword) search:
At the UCSC Genome
Browser Gateway, you can perform keyword searches on
different data freezes, using chromosome positions, GenBank accessions,
RefSeq IDs, LocusLink IDs, or keywords like "zinc finger". You
just choose the species, assembly date, and enter your query. The
keyword option may produce (long) lists of RefSeqs and other mRNA
sequences associated with this term. Please note that this
system is designed mainly to quickly locate the chromosome positions of
specific genes; it is not meant to deal with complex "SRS-like"
queries. Please refer to questions like RET2 or RET3
for that purpose.
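
Chromosome positions are usually entered in the form "chromosome:start-end". If you want to check or re-use such positions in your own scripts, a small parser is handy. The following Python sketch is my own illustration, not UCSC code; the function name is hypothetical, and it assumes that commas inside the numbers are only thousands separators.

```python
import re

# Pattern for positions such as "chr7:127,471,196-127,495,720".
_POSITION = re.compile(r"^(chr[0-9A-Za-z_]+):([\d,]+)-([\d,]+)$")


def parse_position(text):
    """Split a position string into (chromosome, start, end).

    Raises ValueError if the string does not look like a position
    or if start is larger than end.
    """
    match = _POSITION.match(text.strip())
    if match is None:
        raise ValueError(f"not a chromosome position: {text!r}")
    chrom = match.group(1)
    # int() raises ValueError on an empty string, e.g. for input like "chr1:,-,".
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not be larger than end")
    return chrom, start, end


if __name__ == "__main__":
    # The coordinates below are arbitrary example values.
    print(parse_position("chr7:127,471,196-127,495,720"))
```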
Tip! The Ensembl TextView is the tool for free-text searches in the
Ensembl database. You can choose not only the species but also one of the
indices to search through, like gene, mRNA, protein, UniGene, SNP,
and more. At the output page, you will get
all important accessions like LocusLink, RefSeq, MIM, GO,
Affymetrix ProbeSet IDs, and many more, thereby providing additional
value compared to the UCSC interface.
The VEGA TextView is the tool for free-text searches in the VEGA
database. VEGA TextView is very similar to the Ensembl TextView.
The NCBI MapViewer supports search and display of
genomic information by chromosomal position. Regions of interest can be
retrieved by text queries (e.g. gene or marker name) or by sequence
alignment (BLAST). View results at the whole genome level, and select
what to display in more detail. Multiple options exist to configure
your display, download data, navigate to related data, and analyze
supporting information using the tools provided. If you search
for "zinc finger" in the MapViewer of the human genome, you will
retrieve over 600 entries, sorted by chromosome. Clicking an entry
shows this gene in the graphical genome browser (which in my opinion
is not as good as the UCSC browser).
Tip! The NCBI Entrez search page has been re-designed and
re-structured recently, now providing a single interface to query ALL
Entrez databases at once, including PubMed, OMIM, GenBank, Structures,
SNP, UniGene, GEO (microarray expression data!), and more. You can enter
one or more search terms, and you will retrieve a first "overview"
results page, displaying the number of hits in each of the databases.
Then, you can see the individual results. Alternatively, you can first
choose the database of interest, and then place your query. Note that,
sticking to the "zinc finger" example, the OMIM output is the one whose
number of hits (over 500) is most similar to the UCSC and
Ensembl results.
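
The "choose a database first, then place your query" route can also be scripted. The following Python sketch only builds a query URL for the NCBI E-utilities "esearch" service, which is not described on this page and is brought in here purely as an assumption for illustration. The function name is hypothetical, and the sketch does not send the request or interpret any result.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "esearch" service (an assumption of
# this sketch; the text above only describes the Entrez web page).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term):
    """Build a search URL for one Entrez database and one search term."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    # Same example term as in the text, restricted to the OMIM database.
    print(esearch_url("omim", "zinc finger"))
```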
3. Complex data mining:
Tip! The
UCSC Gene Sorter
is an excellent resource for exploring gene families and the
relationships among genes. This tool displays a table of genes
within a selected genome that are related to one another. Several
different relationships may be explored: protein-level homology,
similarity of gene expression profiles, or genomic proximity. The
Gene Sorter supports searches on a variety of terms and phrases,
including the gene name, the SwissProt
protein name, a GenBank accession, or a word or phrase present in
a gene's description. In the "Sort by" field, you can choose
e.g.
"Expression (GNF)", which looks for all datasets in this database that
show an expression pattern similar to that of your gene of interest.
The gene family display is highly configurable,
allowing the user to control the order and number of columns, the
number of rows, and the genes displayed. The tool provides several output
formats, including a
simple tab-delimited format that may be imported into a spreadsheet or
a relational database. In addition, the sequences of the
displayed genes can be downloaded: cDNA, protein, genomic and promoter
(!) sequences, with user-defined upstream and downstream
regions. Example: An important use of the Gene Sorter is to
gather together a collection of genes that share similar properties
for statistical analysis. For instance, one might want to examine
promoter regions of genes that share a similar expression pattern or
look for protein sequence motifs in genes that share similar
GO annotations. BUT keep in mind: You always start with only
ONE gene; you cannot provide lists of genes. The program itself generates
"lists of genes with similar expression pattern", taken from specific
expression databases. Please also refer to the UCSC Gene Sorter main section
for details!
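
The "start with ONE gene, get a sorted list back" principle can be sketched in a few lines of Python. Please note that this is only an illustration: the distance measure actually used by the Gene Sorter is not described here, so the sketch simply assumes a Euclidean distance, and the gene names and expression values are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_expression(seed, profiles, top_n=None):
    """Sort all genes by the distance of their profile to the seed gene.

    `profiles` maps gene names to lists of expression values. The seed gene
    itself comes first because its distance to itself is zero.
    """
    seed_profile = profiles[seed]
    ranked = sorted(profiles, key=lambda g: euclidean(seed_profile, profiles[g]))
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical genes and expression values, for illustration only.
    profiles = {
        "GENE_A": [1.0, 2.0, 3.0],
        "GENE_B": [1.1, 2.1, 2.9],
        "GENE_C": [3.0, 1.0, 0.5],
        "GENE_D": [1.0, 2.5, 3.5],
    }
    print(rank_by_expression("GENE_A", profiles))
```

The input is a single gene name, and the list of similar genes is produced by the program, just as described above.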
Tip! BioMart is a very powerful data mining tool. Please refer to
questions RET1 and RET3, as well as the BioMart chapter at the main
page for detailed descriptions. Important: BioMart is not
searched via "free-text". Instead, it works by filtering
the whole set of genes in a genome via lists of accession numbers or IDs
(like Entrez Gene, RefSeq, MIM, InterPro, PDB, GO, Affymetrix, and many
more), via their expression (cell types, developmental stages),
via their homology to other species, or via the occurrence of SNPs. The
output can be highly customizable tables of accessions and links, as
well as sequence files.
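
The difference between free-text search and filtering may become clearer with a small sketch. The following Python code does not use BioMart at all; it only mimics the idea of narrowing down a gene set with filters and then choosing the output columns. All records, identifiers, field names, and function names are hypothetical.

```python
# Hypothetical gene records; none of these identifiers are real.
genes = [
    {"gene_id": "G1", "refseq": "NM_EXAMPLE1", "go": {"GO:EX1", "GO:EX2"}, "has_snp": True},
    {"gene_id": "G2", "refseq": "NM_EXAMPLE2", "go": {"GO:EX2"}, "has_snp": False},
    {"gene_id": "G3", "refseq": "NM_EXAMPLE3", "go": {"GO:EX1"}, "has_snp": True},
]


def filter_genes(genes, refseq_ids=None, go_term=None, has_snp=None):
    """Keep only the genes that pass every filter that was given."""
    result = []
    for gene in genes:
        if refseq_ids is not None and gene["refseq"] not in refseq_ids:
            continue
        if go_term is not None and go_term not in gene["go"]:
            continue
        if has_snp is not None and gene["has_snp"] != has_snp:
            continue
        result.append(gene)
    return result


def to_table(genes, columns):
    """Return the chosen columns of each gene as a list of rows."""
    return [[gene[column] for column in columns] for gene in genes]


if __name__ == "__main__":
    selected = filter_genes(genes, go_term="GO:EX1", has_snp=True)
    for row in to_table(selected, ["gene_id", "refseq"]):
        print("\t".join(row))
```

Each filter narrows down the full gene set, and the column selection at the end corresponds to the customizable output table.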
Tip! The
UCSC
Table Browser provides a powerful and flexible graphical
interface for querying and manipulating the UCSC Genome Browser annotation
tables. The Table Browser lets you retrieve the DNA
sequence data or annotation data underlying Genome Browser
tracks for the entire genome, a specified coordinate range, or a set of
accessions. Because the Table Browser uses the same database as the
UCSC Genome Browser, the two views are always consistent. Thus,
the main section of the
UCSC Table Browser is located "close" to the UCSC Genome Browser,
in the main section "Human Genome Databases".
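
Retrieving annotation data for "a specified coordinate range" or "a set of accessions" boils down to selecting rows of a table. The following Python sketch illustrates both selections on a tiny in-memory table. It is not Table Browser code: the rows and column names are hypothetical, and the sketch assumes that each feature occupies the half-open interval from start (included) to end (excluded).

```python
# Hypothetical annotation rows; names and coordinates are made up.
rows = [
    {"name": "featA", "chrom": "chr1", "start": 100, "end": 200},
    {"name": "featB", "chrom": "chr1", "start": 500, "end": 600},
    {"name": "featC", "chrom": "chr2", "start": 150, "end": 250},
]


def select_range(rows, chrom, start, end):
    """Return the rows on `chrom` that overlap the interval [start, end)."""
    return [
        row for row in rows
        if row["chrom"] == chrom and row["start"] < end and row["end"] > start
    ]


def select_names(rows, names):
    """Return the rows whose name is in the given set of accessions."""
    wanted = set(names)
    return [row for row in rows if row["name"] in wanted]


if __name__ == "__main__":
    for row in select_range(rows, "chr1", 150, 400):
        print(row["name"], row["chrom"], row["start"], row["end"])
```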