Bioinformatics World FAQ Center

HUMGEN1...know which way is the best to access the human genome data ? (last update Mar. 2, 2006)

    The corresponding chapter of the main page contains detailed but separate descriptions of the major genome databases, such as UCSC, Ensembl, and NCBI. In the following text, I will try to distill the most important parts in order to provide quick access to this topic.
   
1. Sequence search - graphical genome browsers:

    Tip! The UCSC genome bioinformatics site provides an easy-to-use but extremely powerful graphical genome browser, which I would strongly recommend. The BLAT search is a very efficient and extremely fast way to search the genome using a query (DNA or protein) sequence. BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more, meaning that BLAT is not as sensitive as BLAST but is sufficient to quickly identify the genomic position of human query sequences. On the output page, there are 2 links: "browser" and "details". The first one leads to a graphical genome browser which is highly customizable, meaning that you can select or de-select all entries (like ESTs, RefSeqs, SNPs, UniGene clusters, repeats, human-mouse-homology items, and many more). For example, this browser is by far the best one for displaying the positions of EST sequences. The link "details" displays the alignment of your query sequence to the genomic sequence in a very user-friendly way, from 5' to 3', in contrast to BLAST, which displays the matching fragments in descending order of their degree of identity. The link "DNA" lets you download the sequences currently shown in your browser window, again with many user-definable sequence formatting options (like upper/lower-case or number of letters per line). The link "Convert" converts coordinates from one draft of the human genome to another. The link "PDF/PS" creates PDF and PostScript images of the current browser window, which can be used by other applications (and which are of better quality than a screenshot).
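
    The formatting options mentioned for the "DNA" link (upper/lower-case, number of letters per line) are easy to reproduce locally on a sequence you have already downloaded. The following is only a minimal sketch; the function name format_fasta and its parameters are my own assumptions and are not part of the UCSC site.

```python
def format_fasta(name, sequence, width=50, case="upper"):
    """Return one FASTA record with a fixed number of letters per line.

    `name`, `width` and `case` are illustrative parameters, not UCSC options.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    if case == "upper":
        sequence = sequence.upper()
    elif case == "lower":
        sequence = sequence.lower()
    elif case != "keep":
        raise ValueError("case must be 'upper', 'lower' or 'keep'")
    # Cut the sequence into consecutive chunks of `width` letters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + name] + lines)


if __name__ == "__main__":
    # Made-up sequence, 8 letters per line, converted to upper case.
    print(format_fasta("example_region", "acgtacgtacgtacgtacgt", width=8))
```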

    The Ensembl BLASTView is a very good tool if you have a series of sequences (max. 30) that you want to search as a batch, and if you want full and easy-to-understand control over all available options, like E-values, dust filtering, repeats and more. You can follow the status of every sequence individually. You can even define the columns which will be displayed in the summary output table. The link "Contig view" will reveal the region of interest in the Ensembl graphical genome browser, which is also highly customizable but has some drawbacks, such as the way EST sequences are displayed. On the other hand, there are some very nice features like the positions of individual Affymetrix ProbeSets, a sub-classification of repeats, multiple export options, and a direct connection to the other major genome browsers, UCSC and NCBI, via the "Jump to" button.
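
    If you have more than 30 query sequences, you will have to submit them in several rounds. A small sketch of how a multi-FASTA file could be split into batches of at most 30 records before pasting them into BLASTView; the helper names read_fasta and split_into_batches are hypothetical, and the limit of 30 is simply the number quoted above.

```python
def read_fasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_batches(records, batch_size=30):
    """Split records into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]


if __name__ == "__main__":
    # 65 made-up records end up in three batches.
    text = "\n".join(">seq%d\nACGTACGT" % i for i in range(65))
    for number, batch in enumerate(split_into_batches(read_fasta(text)), 1):
        print("batch", number, "contains", len(batch), "sequences")
```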

    The VEGA (Vertebrate Genome Annotation) database is a central repository for high quality, frequently updated, manual annotation of finished vertebrate genome sequence. The website is built upon code from the Ensembl project and therefore the "look and feel" is very similar. The VEGA Human Genome Browser is the human section of the VEGA database. This site shows data on human chromosomes where a first-pass manual annotation has been completed. Currently (May 2005) there are fourteen chromosomes in the VEGA human database, comprising nearly 50% of the genome. The Sitemap displays the different ways to access the VEGA human database, which are quite similar to those of the Ensembl database. As an example, the VEGA BLASTView offers the possibility to search the VEGA database with up to 30 query sequences.

    H-Invitational Database (H-InvDB) is a human gene database opened to the public in April 2004, which is hosted by the Japan Biological Information Research Center (JBIRC) and by the DNA Databank of Japan (DDBJ), with contributions from more than 40 institutes worldwide, like the German DKFZ. The scope of H-InvDB is to provide an integrative annotation of full-length cDNA clones available from high throughput cDNA sequencing projects. The database generates cDNA clusters describing their gene structures, novel alternative splicing isoforms, non-coding functional RNAs, functional domains, sub-cellular localizations, metabolic pathways, predictions of protein 3D structure, mapping of SNPs and microsatellite repeat motifs in relation to orphan diseases, gene expression profiling, and comparative results with mouse full-length cDNAs in the context of molecular evolution. You may simply BLAST the H-InvDB using your query sequence and see the cluster it belongs to. The link to G-integra provides a graphical view of the cDNA alignments to the genomic sequence, thereby comparing the H-Inv cDNAs to RefSeqs (NCBI) and Ensembl transcripts. Several features like PolyA sites, repetitive elements, or SNPs are displayed. Please note that the design of the genome viewer itself is not as good as the UCSC browser, but the individual gene-specific database entries provide a good site of data integration. Please also refer to the H-InvDB section at the Data Integration page for a detailed description!

2. Text (keyword) search:

    At the UCSC Genome Browser Gateway, you can perform keyword searches on different data freezes, using chromosome positions, GenBank accessions, RefSeq IDs, LocusLink IDs, or keywords like "zinc finger". You just choose the species and assembly date, and enter your query. The keyword option may produce (long) lists of RefSeqs and other mRNA sequences associated with this term. Please note that this system is essentially designed to quickly locate the chromosome positions of specific genes; it is not meant to deal with complex "SRS-like" queries. Please refer to questions like RET2 or RET3 for that purpose.
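
    To make the different query types more concrete, here is a rough sketch of how such a query string could be classified on your own side before you submit it. The position format "chrN:start-end" and the simplified RefSeq pattern (covering only the NM_/NR_/NP_/XM_/XR_/XP_ prefixes) are my assumptions; this is not how the Gateway itself parses queries.

```python
import re

# Assumed formats, for illustration only.
POSITION = re.compile(r"^(chr[0-9A-Za-z_]+):([\d,]+)-([\d,]+)$")
REFSEQ = re.compile(r"^[NX][MRP]_\d+(\.\d+)?$")


def classify_query(query):
    """Return ("position", (chrom, start, end)), ("refseq", id) or ("keyword", text)."""
    query = query.strip()
    match = POSITION.match(query)
    if match:
        # Thousands separators such as 127,471,196 are allowed and removed.
        start = int(match.group(2).replace(",", ""))
        end = int(match.group(3).replace(",", ""))
        if start > end:
            raise ValueError("start must not be larger than end")
        return ("position", (match.group(1), start, end))
    if REFSEQ.match(query):
        return ("refseq", query)
    # Everything else is treated as a free keyword.
    return ("keyword", query)


if __name__ == "__main__":
    for example in ("chr7:127,471,196-127,495,720", "NM_000546", "zinc finger"):
        print(example, "->", classify_query(example))
```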

    Tip! The Ensembl TextView is the tool to perform free-text searches in the Ensembl database. You can choose not only the species but also one of the indices to search through, like gene, mRNA, protein, UniGene, SNP, and more. On the output page, you will get all important accessions like LocusLink, RefSeq, MIM, GO, Affymetrix ProbeSet IDs, and many more, which is an added value compared to the UCSC interface.

    The VEGA TextView is the tool to perform free-text search in the VEGA database. VEGA TextView is very similar to the Ensembl TextView.

    The NCBI MapViewer supports search and display of genomic information by chromosomal position. Regions of interest can be retrieved by text queries (e.g. gene or marker name) or by sequence alignment (BLAST). You can view the results at the whole genome level and select what to display in more detail. Multiple options exist to configure your display, download data, navigate to related data, and analyze supporting information using the tools provided. If you search for "zinc finger" in the MapViewer of the human genome, you will retrieve over 600 entries, sorted by chromosome. Clicking on an entry shows this gene in the graphical genome browser (which in my opinion is not as good as the UCSC browser).

    Tip! The NCBI Entrez search page has been re-designed and re-structured recently, now providing a single interface to query ALL Entrez databases at once, including PubMed, OMIM, GenBank, Structures, SNP, UniGene, GEO (microarray expression data!), and more. You can enter one or more search terms, and you will retrieve a first "overview" results page, displaying the number of hits in each of the databases. From there, you can open the individual results. Alternatively, you can first choose the database of interest, and then place your query. Note that, when sticking to the "zinc finger" example, the number of hits in OMIM (over 500) is the one most similar to the UCSC and Ensembl results.
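
    The same "one term, several databases" idea can also be scripted. The sketch below only builds request URLs for the NCBI E-utilities "esearch" service, which is not mentioned in the text above and is my own addition; the database names used (pubmed, omim, gds) and the choice of parameters are assumptions you should check against the current NCBI documentation. No request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service (assumed to be current).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=0):
    """Build an esearch URL for one database.

    With retmax=0 the intention is to ask only for the hit count,
    not for the list of record IDs.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


def overview_urls(term, databases=("pubmed", "omim", "gds")):
    """Return one URL per database, mimicking the Entrez overview page."""
    return {db: esearch_url(db, term) for db in databases}


if __name__ == "__main__":
    for db, url in overview_urls("zinc finger").items():
        print(db, url)
```

    Each URL can then be opened in a browser or fetched with any HTTP client; please respect the NCBI usage guidelines if you automate this.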

3. Complex data mining:

    Tip! The UCSC Gene Sorter is an excellent resource for exploring gene families and the relationships among genes. This tool displays a table of genes within a selected genome that are related to one another. Several different relationships may be explored: protein-level homology, similarity of gene expression profiles, or genomic proximity. The Gene Sorter supports searches on a variety of terms and phrases, including the gene name, the SwissProt protein name, a GenBank accession, or a word or phrase present in a gene's description. In the "Sort by" field, you can choose e.g. "Expression (GNF)", which lists the genes in this database whose expression pattern is similar to that of your gene of interest.
    The gene family display is highly configurable, allowing the user to control the order and number of columns, the number of rows, and the genes displayed. The tool provides several output formats, including a simple tab-delimited format that may be imported into a spreadsheet or a relational database. In addition, the sequences of the displayed genes can be downloaded: cDNA, protein, genomic and promoter (!) sequences, with user-defined upstream and downstream regions. Example: An important use of the Gene Sorter is to gather together a collection of genes that share similar properties for statistical analysis. For instance, one might want to examine promoter regions of genes that share a similar expression pattern or look for protein sequence motifs in genes that share similar GO annotations. BUT keep in mind: You always start with only ONE gene; you cannot provide lists of genes. The program itself generates "lists of genes with similar expression pattern", taken from specific expression databases. Please refer also to the UCSC Gene Sorter main section for details!
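
    As a small illustration of what "importing the tab-delimited format" can look like outside a spreadsheet, the following sketch reads such a table into Python. The column names and the rows in the example are made up; the real columns depend on how you configured the display.

```python
import csv
import io


def read_tab_table(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def column(rows, name):
    """Return the values of one column, in table order."""
    return [row[name] for row in rows]


if __name__ == "__main__":
    # Hypothetical export with two columns and two made-up genes.
    sample = (
        "name\tdescription\n"
        "GENE_A\tmade-up zinc finger protein\n"
        "GENE_B\tmade-up kinase\n"
    )
    rows = read_tab_table(sample)
    print(column(rows, "name"))
```

    The resulting list of gene names could then serve as the collection of similar genes for a later statistical analysis.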

    Tip! BioMart is a very powerful data mining tool. Please refer to questions RET1 and RET3, as well as the BioMart chapter at the main page for detailed descriptions. Important: BioMart is not searched via "free-text". In fact, the functionality is based on filtering the whole set of genes in a genome via lists of accession numbers, IDs (like Entrez Gene, RefSeq, MIM, InterPro, PDB, GO, Affymetrix, and many more), or via their expression (cell types, developmental stages), or via their homology to other species, or via the occurrence of SNPs. The output may be highly customizable tables of accessions and links, as well as sequence files.
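
    The difference between free-text search and filtering may be easier to see in a toy example. The sketch below is NOT the BioMart interface; it only mimics the principle on a few made-up records: start from the whole gene set, restrict it by a list of IDs and by attribute values, then choose the output columns. All names, IDs and attributes are hypothetical.

```python
def filter_genes(genes, ids=None, **attributes):
    """Keep records whose 'id' is in `ids` (if given) and that match all attributes."""
    wanted = set(ids) if ids is not None else None
    result = []
    for gene in genes:
        if wanted is not None and gene["id"] not in wanted:
            continue
        if all(gene.get(key) == value for key, value in attributes.items()):
            result.append(gene)
    return result


def select_columns(genes, columns):
    """Return one tuple per gene containing only the requested columns."""
    return [tuple(gene.get(name) for name in columns) for gene in genes]


if __name__ == "__main__":
    # Made-up gene set standing in for "all genes in a genome".
    genes = [
        {"id": "G1", "chromosome": "7", "has_snp": True},
        {"id": "G2", "chromosome": "7", "has_snp": False},
        {"id": "G3", "chromosome": "X", "has_snp": True},
    ]
    hits = filter_genes(genes, ids=["G1", "G2", "G3"], chromosome="7", has_snp=True)
    print(select_columns(hits, ["id", "chromosome"]))
```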

    Tip! The UCSC Table Browser provides a powerful and flexible graphical interface for querying and manipulating the UCSC Genome Browser annotation tables. The Table Browser lets you retrieve the DNA sequence data or annotation data underlying Genome Browser tracks for the entire genome, a specified coordinate range, or a set of accessions. Because the Table Browser uses the same database as the UCSC Genome Browser, the two views are always consistent. For this reason, the main description of the UCSC Table Browser is located "close" to that of the UCSC Genome Browser, in the main section "Human Genome Databases".
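
    Restricting annotation data to "a specified coordinate range" boils down to an interval overlap test. A minimal sketch on rows you have already downloaded, assuming each row is a tuple (chromosome, start, end, name) with 0-based, half-open coordinates; both the layout and the coordinate convention are assumptions, so check them against the table you actually export.

```python
def overlaps(feature, chrom, start, end):
    """True if the feature shares at least one base with [start, end) on chrom."""
    f_chrom, f_start, f_end = feature[0], feature[1], feature[2]
    return f_chrom == chrom and f_start < end and f_end > start


def features_in_range(features, chrom, start, end):
    """Return all features overlapping the given coordinate range."""
    return [f for f in features if overlaps(f, chrom, start, end)]


if __name__ == "__main__":
    # Made-up annotation rows.
    rows = [
        ("chr1", 100, 200, "feature_a"),
        ("chr1", 200, 300, "feature_b"),
        ("chr2", 100, 200, "feature_c"),
    ]
    print(features_in_range(rows, "chr1", 150, 250))
```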
         