Bioinformatics World FAQ Center - Index
  FAQ Index -> DATA INTEGRATION
                -> DAT1...get a quick, yet comprehensive overview of data concerning my gene / protein of interest ? (last update Aug. 26, 2005)
                -> DAT2...get the full-length sequence and annotation from an EST sequence or accession number? (last update Sep. 14, 2005)
                -> DAT3...retrieve a large number of sequences and write one common FASTA sequence file ? -> see RET3 !
                -> DAT4...know which genes of a specific dataset are associated with a disease ? -> see GENOM5 !
                -> DAT5...know all genes associated with cardiovascular diseases having a described polymorphism in the promoter region ? -> see GENOM7 !
                -> DAT6...convert a list of IDs (like microarray probe IDs) into other IDs (like gene names) ? (last update Jun. 27, 2006)
                -> DAT7...see the functions and pathways where my gene set of interest is involved ? -> see PATH1 !
                -> DAT8...produce a general annotation table for a gene set of interest ? (last update Dec. 21, 2005)
                 
                            
                             

                                                   
DAT1...get a quick, yet comprehensive overview of data concerning my gene / protein of interest ? (last update Aug. 26, 2005)
                     
    The major question when dealing with a new gene is how to get a quick overview of what is already known about its cDNA sequence, genomic organization, protein domains and structure, localization, expression, and finally function. There are a few databases which address this need. NOTE: If you want to produce an annotation table "in-batch" for a whole set of genes, please refer to DAT8 !
   
    Tip! The Bioinformatic Harvester, available at EMBL, is a kind of "data super - integration" tool, as it provides information from already "integrative" databases (like SOURCE, Ensembl, UCSC, and NCBI Entrez,...) in ONE SINGLE webpage. This is achieved by displaying database hits which are rich in graphical information as “iframes”. “iframes” provide the user the latest information from the original database server. Note that this is extremely convenient as each “iframe” can be manipulated individually. There is a multitude of query options (but not via EST acc.). NOTE that Harvester works at the moment exclusively on HUMAN proteins, contained in the UniProt collection. Other species might be included in the future. Please also refer to the Harvester description on the main page.

    Tip! The NCBI Entrez search page has been re-designed and re-structured recently, now providing a single interface to query ALL Entrez-databases at once, including PubMed, OMIM, GenBank, Structures, SNP, UniGene, GEO (microarray expression data !), and more. You can enter one or more search terms, and you will retrieve a first "overview" results page, displaying the number of hits in each of the databases. Then, you can see the individual results. Alternatively, you can first choose the database of interest, and then place your query.
    Entrez Gene is one of the Entrez databases; it was constructed in 2004 to replace the widely known and used LocusLink database. Entrez Gene integrates information from LocusLink and from genes annotated on Reference Sequences from completely sequenced genomes. You can query on names, symbols, accessions, publications, GO terms, chromosome numbers, E.C. numbers, and many other attributes associated with genes and the products they encode. Because Gene is now an Entrez database, all the familiar and useful functions are available, including Preview/Index, History, and LinkOut. The full functionality of LocusLink was maintained in Entrez Gene, and even extended by additional links, e.g. to the GEO database of expression data. Please note that the Entrez Gene IDs are identical to the LocusLink IDs; therefore it is relatively easy to convert LocusLink-specific files to Gene-specific ones by exchanging the corresponding URLs. A great feature of Entrez Gene is the option to restrict the query to specific fields, making it possible to search for e.g. gene name only ! A query for the gene "TNF" yields 295 (!) entries in LocusLink, even when restricting to human entries, whereas in Entrez Gene, one specifically retrieves 1 human entry by using "TNF[Gene Name] AND homo[Organism]".
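
    The field-restricted query described above can also be assembled programmatically. The following is only a minimal sketch: it assumes the NCBI E-utilities "esearch" endpoint (which is not mentioned in this FAQ) and merely builds the URL without sending a request. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities "esearch" endpoint; it is not part of this FAQ.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_term(value, field):
    """Return an Entrez term restricted to one field, e.g. TNF[Gene Name]."""
    return f"{value}[{field}]"


def gene_query_url(symbol, organism):
    """Build (but do not send) an esearch URL for one gene in one organism."""
    term = " AND ".join([
        fielded_term(symbol, "Gene Name"),
        fielded_term(organism, "Organism"),
    ])
    # urlencode escapes the brackets and spaces of the query term.
    return ESEARCH_URL + "?" + urlencode({"db": "gene", "term": term})


# The query string quoted in the text above, rebuilt from its two parts.
term_example = " AND ".join([
    fielded_term("TNF", "Gene Name"),
    fielded_term("homo", "Organism"),
])
print(term_example)
print(gene_query_url("TNF", "homo"))
```

    Keeping the field tags in a small helper makes it easy to restrict other attributes in the same way.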
   
    Tip! The UCSC Gene Sorter is an excellent resource for exploring gene families and the relationships among genes. This tool displays a table of genes within a selected genome that are related to one another. Several different relationships may be explored: protein-level homology, similarity of gene expression profiles, or genomic proximity. The Browser supports searches on a variety of terms and phrases, including the gene name, the SwissProt protein name, a GenBank accession, or a word or phrase present in a gene's description. At the "Sort by" field, you can choose e.g. "Expression (GNF)", which looks for all datasets in this database which show a similar expression pattern to your gene of interest.
    The gene family display is highly configurable, allowing the user to control the order and number of columns, the number of rows, and the genes displayed. The tool provides several output formats, including a simple tab-delimited format that may be imported into a spreadsheet or a relational database. In addition, the sequences of the displayed genes can be downloaded: cDNA, protein, genomic and promoter (!) sequences, with user-defined upstream and downstream regions. Example: An important use of the Browser is to gather together a collection of genes that share similar properties for statistical analysis. For instance, one might want to examine promoter regions of genes that share a similar expression pattern or look for protein sequence motifs in genes that share similar GO annotations. BUT keep in mind: You always start with only ONE gene; you cannot provide lists of genes. The program itself generates "lists of genes with similar expression pattern", taken from specific expression databases. Please refer also to the UCSC Gene Sorter main section for details!
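
    As a hedged illustration of importing a tab-delimited export into a relational database, the sketch below loads such a table into an in-memory SQLite database. The column names and the gene rows are invented placeholders, not the actual Gene Sorter output.

```python
import csv
import io
import sqlite3

# Hypothetical export: column names and rows are placeholders for illustration.
EXPORT = (
    "name\tdescription\texpression_distance\n"
    "GENE_A\tplaceholder description A\t0.00\n"
    "GENE_B\tplaceholder description B\t0.12\n"
    "GENE_C\tplaceholder description C\t0.87\n"
)


def load_tsv(conn, text, table="gene_sorter"):
    """Create a table from the header row and insert all remaining rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    return header


conn = sqlite3.connect(":memory:")
load_tsv(conn, EXPORT)

# Select the genes whose (hypothetical) expression distance is below 0.5.
rows = conn.execute(
    "SELECT name FROM gene_sorter "
    "WHERE CAST(expression_distance AS REAL) < 0.5 ORDER BY name"
).fetchall()
print(rows)
```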

    Tip! The UCSC Proteome Browser Gateway provides fast access to protein-specific data for a gene of interest. You simply enter a gene symbol or a Swiss-Prot/TrEMBL protein ID into the query field. A list will be displayed showing proteins from several species which match your query. Each protein entry shows in a concise manner data like polarity, hydrophobicity, cysteines, amino acid frequencies and anomalies, pI, molecular weight, InterPro and Pfam domains, and predicted comparative 3D structures. In addition, links to the related UCSC databases Genome Browser and Gene Sorter, as well as to the Gene Details Page are included, allowing a very efficient navigation between these resources.
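
    Two of the simpler properties listed above, the cysteine count and the amino acid frequencies, can be computed directly from a sequence. This is only a sketch of the arithmetic, not the Proteome Browser's own method, and the peptide used is a made-up toy sequence.

```python
from collections import Counter


def composition(sequence):
    """Return (cysteine count, {residue: relative frequency}) for a protein."""
    residues = "".join(sequence.split()).upper()
    if not residues:
        raise ValueError("empty sequence")
    counts = Counter(residues)
    total = len(residues)
    frequencies = {aa: n / total for aa, n in sorted(counts.items())}
    return counts["C"], frequencies


# Toy peptide, invented for illustration only.
cysteines, frequencies = composition("MKCAAC")
print(cysteines)
for residue, fraction in frequencies.items():
    print(residue, round(fraction, 3))
```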

    The Stanford Online Universal Resource for Clones and ESTs (SOURCE) compiles information from several publicly accessible databases, including UniGene, dbEST, Swiss-Prot, GeneMap99, RHdb, GeneCards and LocusLink. The mission of SOURCE is to provide a unique scientific resource that pools publicly available data commonly sought after for any clone, GenBank accession number, or gene. You can query using GB IDs, LocusLink IDs, UniGene IDs, or gene names. Currently 3 species (human, mouse, rat) are available. The output comprises a very informative, yet concise description of the gene of interest, including all important links for data integration (like LocusLink or GeneCards), with emphasis on gene expression data. When available, there is a link to published microarray expression data. Please note that not all microarray data stored in the SMD (Stanford Microarray Database) are retrievable via SOURCE; therefore you may also directly search SMD (basic or advanced) for datasets via lists of specific experimental setups. Note that the link "Authors' webpage" offers direct access to the primary databases holding the expression data. In addition, you will find EST data, normalized expression distribution in tissues according to EST data, and SAGE data, an expression analysis method at the NCBI. SOURCE Batch Search is the batch extract interface for SOURCE. You can input a list of GenBank Accessions (! including ESTs !), dbEST cloneIDs, UniGene ClusterIDs, UniGene gene names, or UniGene gene symbols and retrieve data from a check list, where you may choose e.g. UniGene Name, Symbol, ClusterID, LocusLinkID, RefSeq mRNA, RefSeq Protein, GO Annotations, and 2 fields extracted from Swiss-Prot: Enzymatic Function and Subcellular Location. Note that the input list must be a text file consisting of a single column, or pasted from e.g. EXCEL as a single column.
NOTE: Accession numbers are case-sensitive, UniGene ClusterIDs MUST include the "species-prefix" (like "Hs."), and gene names are not case-sensitive. You may copy the output text and paste it into WORD (and then convert the text into a table), or in EXCEL choose "Edit", "Paste Special", "Text". NOTE: This tool is especially useful if you want to convert large lists of accession numbers, e.g. from LocusLink to UniGene.
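
    The input rules above (single column, species prefix on UniGene ClusterIDs, case preserved) can be checked before uploading a list. The helper below is a hypothetical sketch; the prefixes "Mm." and "Rn." for mouse and rat are assumed by analogy with the "Hs." example, and the cluster numbers are placeholders.

```python
# Assumption: prefixes for the 3 species SOURCE supports (human, mouse, rat).
SPECIES_PREFIXES = ("Hs.", "Mm.", "Rn.")


def prepare_unigene_column(ids):
    """Return single-column text; raise if a ClusterID lacks its prefix."""
    cleaned = []
    for raw in ids:
        item = raw.strip()
        if not item:
            continue  # skip blank entries, e.g. from pasted spreadsheet cells
        if not item.startswith(SPECIES_PREFIXES):
            raise ValueError(f"missing species prefix: {item!r}")
        cleaned.append(item)  # case is left untouched on purpose
    return "\n".join(cleaned) + "\n"


# Placeholder cluster numbers, for illustration only.
column_text = prepare_unigene_column([" Hs.1 ", "", "Mm.2"])
print(column_text, end="")
```

    The returned text can be written to a plain text file or pasted as a single column.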

    GeneCards is a database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others. GeneCards can be searched using gene symbols like "BRCA1" or keywords including GenBank accession numbers, UniGene clusters, SNP Id., clone identifier (ATCC, IMAGE) and others. The output includes related sequences, SNPs, medical news, disorders and mutations, MIPS PEDANT Viewer report (!), links to UCSC and Ensembl, PubMed search and many more. In addition, links to gene expression data (Arrays, ESTs, links to SOURCE) are available.

    GeneLynx is a portal to a collection of hyperlinks for each human gene. You can access the information about a particular human gene by providing any reasonable identifier - just type a keyword, ANY accession number or ID, or submit a related protein or nucleotide sequence on the BLAST search page. You can also perform a more refined keyword search on the Text search page. NOTE that it is possible to perform batch retrieval of GeneLynx entries. GeneLynx enables you to paste or upload a list of accession numbers* (GB ID, LocusLink ID,  Ensembl ID, OMIM ID, UniGene cluster (without "Hs.") and more) to the Batch GeneLynx page and obtain a list in which these identifiers are associated with appropriate GeneLynx IDs and descriptions. Simply paste the IDs (or select a plain text file containing the IDs, separated by spaces or line-breaks, to upload). The output-HTML page can also be saved locally for later use. * Note that you may even use a list of EST accession numbers (pick the file format "Nucleotide Accession"), which automatically retrieves the corresponding UniGene cluster and GeneLynx data !!! Nevertheless you might encounter differences in UniGene accessions at GeneLynx and at NCBI depending on the database build currently used.

    Tip! H-Invitational Database (H-InvDB) is a human gene database opened to the public in April 2004, which is hosted by the Japan Biological Information Research Center (JBIRC) and by the DNA Databank of Japan (DDBJ), with contributions from more than 40 institutes worldwide, like the German DKFZ. Please note that H-InvDB is based essentially on cDNA sequences; it is not a genome sequence repository, although it provides a graphical genome viewer (i.e. chromosome regions of matching cDNAs). The scope of H-InvDB is to provide an integrative annotation of full-length cDNA clones available from high throughput cDNA sequencing projects. The database generates cDNA clusters describing their gene structures, novel alternative splicing isoforms, non-coding functional RNAs, functional domains, sub-cellular localizations, metabolic pathways, predictions of protein 3D structure, mapping of SNPs and microsatellite repeat motifs in relation with orphan diseases, gene expression profiling, and comparative results with mouse full-length cDNAs in the context of molecular evolution. Please refer to the H-InvDB section at the Data Integration page for a detailed description !

    HPRD - Human Protein Reference Database is yet another good starting point if you are interested in human proteins. HPRD represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data. HPRD is a joint project between Pandey Lab, Johns Hopkins School of Medicine, Baltimore, and the IOB, Institute of Bioinformatics, Bangalore, India. At first sight, HPRD may seem like "yet another protein database", but there are some features which are really worth mentioning. In particular, HPRD offers very nice query options, e.g. you may limit the search to certain molecular classes, cellular components, domains, motifs, tissue and cell type-specific expression, diseases, or length of the protein sequence or molecular weight. Please refer to the HPRD section at the Pathways page for a detailed description !
           


     
DAT2...get the full-length sequence and annotation from an EST sequence or accession number? (last update Sep. 14, 2005)
             
    There are several databases involved in this question, and often it is necessary to scan through more than one of them.

1. Query via EST accession numbers:

    Tip! First, I would recommend a keyword search using the GenBank accession number of the EST against the NCBI-UniGene Database containing EST-Clusters. Even more conveniently, you may perform this search using the NCBI ENTREZ cross-database query form. Each UniGene cluster contains sequences that represent a unique gene (annotated sequences and ESTs), as well as related information such as the tissue types in which the gene has been expressed and map location. You have to select the species and type in the accession number. If annotated cDNAs are already available, you will find entries in the part "mRNA/GENE SEQUENCES". Often, you will find a link to the NCBI-LocusLink database (LocusLink was replaced by Entrez Gene in 2004, so the link may now point there), which provides a very valuable link page for all information related to a certain gene, e.g. links to the NCBI-OMIM database, containing "mini-reviews" about what was already published.

    Tip! If you know the EST accession number, you can use it to query the SOURCE database (choose "GenBank Accession" as search option). There you will find all available links, like UniGene or LocusLink (Entrez Gene), and many additional links especially to expression data.
  
2. Query via EST sequences:

    Tip! If you are dealing with a sequence from a "well characterized" species (like human, mouse, rat), possibly the best (and fastest !) way to get to the full-length sequence and annotation is to perform a BLAT search at UCSC against the corresponding genome. BLAT is extremely fast, and lists the desired gene at the top of the output page.

3. Query via EST accession numbers or sequences:

    Another good access point is the TIGR Gene Indices. TIGR holds information on ESTs not only from human but from a variety of species (animal, plant, protist, and fungal). You can either BLAST these databases, or query by keywords, like EST accession number, ID, tissue, cDNA library name, gene product name, and many more. TIGR clusters are named "TC" (Tentative Consensus; "THC" stands for Tentative Human Consensus). Note that you can also search for a gene using the GenBank accession number of an annotated cDNA in the database. The main advantage of TIGR is that ESTs belonging to a single gene / cluster are not only listed but assembled into a contig sequence, and represented as a graphical image showing corresponding sizes and positions. In addition, you will find an expression summary, and links to the genomic organization of the gene. The latter is very useful, as it shows the TCs from all related species in one image, together with a link to the Ensembl database.
                     
    H-Invitational Database (H-InvDB) is a human gene database opened to the public in April 2004, which is hosted by the Japan Biological Information Research Center (JBIRC) and by the DNA Databank of Japan (DDBJ), with contributions from more than 40 institutes worldwide, like the German DKFZ. The scope of H-InvDB is to provide an integrative annotation of full-length cDNA clones available from high throughput cDNA sequencing projects. The database generates cDNA clusters describing their gene structures, novel alternative splicing isoforms, non-coding functional RNAs, functional domains, sub-cellular localizations, metabolic pathways, predictions of protein 3D structure, mapping of SNPs and microsatellite repeat motifs in relation with orphan diseases, gene expression profiling, and comparative results with mouse full-length cDNAs in the context of molecular evolution. You may simply BLAST the H-InvDB using your query sequence and see the cluster it belongs to. Please also refer to the H-InvDB section at the Data Integration page for a detailed description !

    Tip! ECgene (gene prediction by EST clustering) predicts genes by combining genome-based EST clustering and a transcript assembly procedure in a coherent and consistent fashion. Specifically, ECgene takes alternative splicing events into consideration. The positions of splice sites (i.e. exon-intron boundaries) in the genome map are utilized as critical information in the whole procedure. Sequences that share splice sites in the genomic alignment are grouped together to define an EST cluster. Transcript assembly, based on graph theory, produces gene models and clone evidence, which is essentially identical to sub-clustering according to splice variants. ECgene is available for human, mouse, and rat genomes. ECgene can be queried either by EST accession numbers or via BLAST search. Example: EST acc. AI700705. Note that at first, the complete Gene Summary page is shown, where you will find the input EST in the list of all ESTs. The link Alignment Viewer will present a graphic which shows the position of the input EST, meaning the exon(s) to which it corresponds. Thus, ECgene is an excellent resource to identify the potential transcript variants of the gene to which your EST belongs ! Note that when performing a BLAST search, you will retrieve the ECgene Transcript ID, which you can then use to query ECgene. Please also refer to the ECgene section at the Expression page for a detailed description !
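
    The grouping rule mentioned above ("sequences that share splice sites are grouped together") can be illustrated with a small union-find sketch. This is a deliberately simplified illustration, not ECgene's actual implementation: it ignores the genomic alignment and the sub-clustering by splice variant, and the EST names and coordinates are invented.

```python
def cluster_by_splice_sites(est_sites):
    """Group ESTs that share at least one splice-site coordinate.

    est_sites maps an EST name to a set of splice-site positions.
    Returns a sorted list of clusters, each a sorted list of EST names.
    """
    parent = {est: est for est in est_sites}

    def find(node):
        # Follow parent links to the cluster root, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    first_est_at_site = {}
    for est, sites in est_sites.items():
        for site in sites:
            if site in first_est_at_site:
                # Shared splice site: merge the two clusters.
                parent[find(est)] = find(first_est_at_site[site])
            else:
                first_est_at_site[site] = est

    clusters = {}
    for est in est_sites:
        clusters.setdefault(find(est), []).append(est)
    return sorted(sorted(members) for members in clusters.values())


# Invented ESTs and coordinates, for illustration only.
example = {
    "est1": {100, 250},
    "est2": {250, 400},
    "est3": {400},
    "est4": {900},
}
clusters = cluster_by_splice_sites(example)
print(clusters)
```

    Note that est1 and est3 end up in the same cluster although they share no splice site directly; they are linked through est2.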
                   
                 


      
DAT3...retrieve a large number of sequences and write one common FASTA sequence file ? -> see RET3 !

    This question involves resources of high-throughput data retrieval, which are described in FAQ RET3. This FAQ relates to the main section "High-throughput Data Retrieval".
                       
                              

           
DAT4...know which genes of a specific dataset are associated with a disease ? -> see GENOM5 !

    This question involves resources of high-throughput data retrieval specifically concerning gene-disease correlations, which are described in FAQ GENOM5. This FAQ relates to the main section "Disease-centered Data Integration".
                       
                   

 
DAT5...know all genes associated with cardiovascular diseases having a described polymorphism in the promoter region ? -> see GENOM7 !

    This question involves resources in the fields of SNPs and diseases, which are also capable of batch handling and filtering of gene datasets. This FAQ relates to the main section "Disease-centered Data Integration".
                       
                   

                                                 
DAT6...convert a list of IDs (like microarray probe IDs) into other IDs (like gene names) ? (last update Jun. 27, 2006)
                     
    Many web-based applications require the input of one specific type of ID or allow you to select among a group of IDs, e.g. if you want to analyze a set of genes derived from a certain experiment like a microarray analysis. Therefore, it is often necessary to convert lists of IDs into other corresponding ID types. There are some resources for this purpose, but they show very different functionalities and support different kinds of IDs.

    Tip! DAVID - The Database for Annotation, Visualization and Integrated Discovery integrates functional genomic annotations with intuitive graphical summaries. DAVID provides a comprehensive set of tools for investigators to visually summarize annotation from large lists of genes, including those derived from microarray and proteomic studies. DAVID is provided at NCI-Frederick and was developed to support the bioinformatic needs at the National Institute of Allergy and Infectious Diseases (NIAID). DAVID is composed of several tools for the functional annotation and classification of large gene sets. NOTE: There are no individual URLs for the individual applications. All have to be started from the DAVID main page. Please refer also to the DAVID section at the main page.
    The Gene ID Conversion Tool of DAVID is a very nice, quick and easy to use tool for such purposes !!! This tool converts lists of gene IDs/accessions to others of your choice, using a comprehensive gene ID mapping repository. Ambiguous accessions in the list can also be determined. For example, the user may provide a list of Affymetrix identifiers as input in order to retrieve the corresponding gene symbols. There are 2 options for the output:
    In the "normal" ID list, the hyperlinks on gene names link out to the GeneCards database at Weizmann Institute. The resulting list can be easily downloaded as txt file and imported into programs like Excel. Note that the "conversion summary" is also very helpful for large datasets, in order to quickly identify those entries which could not be converted !
    In the "Show Gene List" option (new), the hyperlinks on gene names link to the "internal" DAVID gene database. Each gene report displays the most important database links like GenBank, RefSeq, OMIM, GeneRIFs, Entrez Gene, UniProt, and more. In addition, the link "RG" (related genes) scans the input gene set for related genes, and presents a list with color-coded similarity scores (see above!). Note that it is also possible to search the whole genome for related genes and to highlight the ones included also in the input dataset in this list!

NOTE: As DAVID produces only ONE line per gene, it is a better resource for this purpose than BioMart, which produces redundant lines (see discussion below) !
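
    The idea behind a "conversion summary" (which entries could not be converted, and which are ambiguous) can be reproduced locally on any downloaded two-column result. The sketch below assumes the result is available as (input ID, converted ID) pairs; this layout and all IDs shown are hypothetical and do not describe DAVID's actual file format.

```python
def conversion_summary(input_ids, pairs):
    """Return (unconverted, ambiguous) input IDs for a conversion result.

    pairs is an iterable of (input_id, converted_id) tuples.
    An input ID is ambiguous if it maps to more than one converted ID.
    """
    targets = {}
    for source, target in pairs:
        targets.setdefault(source, set()).add(target)
    unconverted = [i for i in input_ids if i not in targets]
    ambiguous = [i for i in input_ids if len(targets.get(i, ())) > 1]
    return unconverted, ambiguous


# Placeholder probe IDs and symbols, for illustration only.
submitted = ["probe_1", "probe_2", "probe_3"]
result = [
    ("probe_1", "SYMBOL_A"),
    ("probe_2", "SYMBOL_B"),
    ("probe_2", "SYMBOL_C"),
]
unconverted, ambiguous = conversion_summary(submitted, result)
print("not converted:", unconverted)
print("ambiguous:", ambiguous)
```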


    ID Mapping is a tool for ID conversion provided by PIR, a member of the UniProt consortium, which is capable of converting lists of IDs (like GenBank AC, gi, RefSeq, TIGR, Pfam, Prints, PROSITE, KEGG pathway ID, GO, BIND, Gene name, Entrez Gene ID, OMIM, PubMed, and more) to and from UniProtKB ID or AC. Note that this tool can therefore also be used for questions like how to get all proteins belonging to a certain pathway. NOTE: The conversion only works to and from UniProtKB ID or AC, it is not possible to e.g. convert Entrez Gene IDs into gene names !
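
    Because every conversion passes through UniProtKB, two separately exported result tables (some ID type to UniProtKB, and UniProtKB to another ID type) could in principle be joined on the user's own computer. The following is a hypothetical sketch of such a join; it is not a feature of the PIR tool, and the table shapes and all IDs are invented placeholders.

```python
def chain(first, second):
    """Compose two one-to-many mappings: A -> B and B -> C give A -> C."""
    combined = {}
    for source, intermediates in first.items():
        finals = set()
        for intermediate in intermediates:
            finals.update(second.get(intermediate, ()))
        combined[source] = sorted(finals)
    return combined


# Placeholder tables, assumed to come from two separate mapping runs.
gene_id_to_uniprot = {
    "gene_id_1": ["ACC_1"],
    "gene_id_2": ["ACC_2", "ACC_3"],
    "gene_id_3": [],
}
uniprot_to_name = {
    "ACC_1": ["NAME_A"],
    "ACC_2": ["NAME_B"],
    "ACC_3": ["NAME_B"],
}
gene_id_to_name = chain(gene_id_to_uniprot, uniprot_to_name)
print(gene_id_to_name)
```

    Entries without a UniProtKB match simply end up with an empty list, which makes gaps in the intermediate step visible.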
  
    A very powerful resource for data retrieval and also for ID list conversion is BioMart (formerly known as "EnsMart"). NOTE that there are different web interfaces for BioMart, please refer to the BioMart main section for details. BioMart is a data retrieval tool that generates lists of biological objects (e.g. genes, SNPs) from data held in the Ensembl (and other) databases. In fact, the functionality is based on filtering the whole set of genes in a genome via lists of accession numbers, IDs (like Entrez Gene, RefSeq, MIM, InterPro, PDB, GO, Affymetrix, and many more). BioMart can generate a number of different types of output, including sequence and tabulated list data. Multiple output formats, including HTML, text and Microsoft Excel, are also supported. Note that EXCEL sheets maintain all hyperlinks !
    In order to convert a list of IDs, you may perform the following steps. At the Start Page you have to select the organism and the database. At the Filter Page, have a look at the dropdown menu "Entries with following IDs", where you can provide your own list of e.g. LocusLink IDs, MIM IDs, RefSeq IDs, Affymetrix ProbeSet IDs (!), and many more. The entries can be separated by commas or line breaks. At the Output Page, you should choose the Features Page, where you can select those IDs which you want to appear in the final table. NOTE: BioMart is transcript-centered, so if a gene has multiple transcripts, you will get redundant lines (meaning duplicates, triplicates etc. of genes) in your final EXCEL-sheet. Unfortunately there is no way to remove these automatically. So, if you need a gene list where each gene is represented just once, you have to manually delete these rows. Note that BioMart supports a long list of different IDs, although some are still missing. A test run showed that there is no option to search using mouse gene symbols or to output mouse gene symbols (in contrast this is no problem with human HUGO IDs).
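
    If the table is exported as tab-delimited text instead of EXCEL, the redundant rows can be removed with a few lines of script outside BioMart. The sketch below keeps the first row seen for each gene; the column names and IDs are hypothetical placeholders, and the information in the dropped transcript rows is lost.

```python
import csv
import io

# Hypothetical transcript-centered export with placeholder IDs.
EXPORT = (
    "gene_id\ttranscript_id\tsymbol\n"
    "GENE_1\tTRANSCRIPT_1a\tSYMBOL_1\n"
    "GENE_1\tTRANSCRIPT_1b\tSYMBOL_1\n"
    "GENE_2\tTRANSCRIPT_2a\tSYMBOL_2\n"
)


def one_row_per_gene(tsv_text, key="gene_id"):
    """Keep only the first row for each value of the key column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    kept = []
    for row in reader:
        if row[key] in seen:
            continue  # a further transcript of a gene already kept
        seen.add(row[key])
        kept.append(row)
    return reader.fieldnames, kept


header, genes = one_row_per_gene(EXPORT)
print("\t".join(header))
for gene in genes:
    print("\t".join(gene[name] for name in header))
```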

    TOUCAN is a workbench for regulatory sequence analysis, especially for detecting significant transcription factor binding sites across species. Please refer to the TOUCAN section at the main page for details. Among many functions, TOUCAN provides a module to generate annotation tables of gene lists. From the TOUCAN menu, choose "Get_Seq", "From Ensembl", "Get Info", and hit the "Update" button. This starts the process of retrieving the information and creating the table, which can be followed in the progress bar. Each row represents one gene of your dataset and each column represents one kind of database identifier. This includes basic gene, mRNA and protein IDs as well as sites of "data integration" like Ensembl Gene, EMBL, HUGO, UniGene, UniProt, RefSeq, and LocusLink. In addition, IDs in the fields of protein domains and motifs (Interpro), as well as protein structure (PDB) are listed. As the number of microarray platforms increases rapidly, the number of columns in TOUCAN displaying chip-specific identifiers is also constantly growing. Not only the world leader in this technology, Affymetrix, is covered, but also other platforms. Finally, the table also presents IDs which are related to functional information of each gene, like the involvement in biological processes, or the localization to specific cellular compartments (GO, MIM). To look at this information in detail, you can highlight all rows and copy / paste the data into programs like MS EXCEL. Conclusion: TOUCAN supports a long list of ID types as input / output (including gene names, microarray IDs, and many more). There is also the positive feature that the table is non-redundant, displaying only one row per gene. Nevertheless, the table is "static"; there are no hyperlinks.

    The NCBI Entrez search page provides a single interface to query ALL Entrez-databases at once, including PubMed, OMIM, GenBank, Structures, SNP, UniGene, GEO (microarray expression data !), and more. NCBI Entrez shows very limited functionality concerning the question of ID list conversion. First, only a few Entrez databases support batch queries, like "Nucleotide" or "Protein", where you can query using a list of IDs. Second, there is no option to generate tables of IDs comparable to those of BioMart. Finally, there is no way to use gene symbols "in batch" to query Entrez Gene, nor to use microarray IDs like Affymetrix ProbeSets as input.
   
    The Stanford Online Universal Resource for Clones and ESTs (SOURCE) compiles information from several publicly accessible databases, including UniGene, dbEST, Swiss-Prot, GeneMap99, RHdb, GeneCards and LocusLink. The mission of SOURCE is to provide a unique scientific resource that pools publicly available data commonly sought after for any clone, GenBank accession number, or gene. SOURCE Batch Search is the batch extract interface for SOURCE. You can choose between 3 organisms (human, mouse, rat). You can input a list of IDs (! including ESTs !), dbEST cloneIDs, UniGene ClusterIDs, UniGene gene names, UniGene gene symbols, or LocusLink (now Entrez Gene) IDs, and retrieve data from a check list, where you may choose e.g. UniGene Name, Symbol, ClusterID, LocusLinkID, RefSeq mRNA, RefSeq Protein, GO Annotations, and 2 fields extracted from Swiss-Prot: Enzymatic Function and Subcellular Location. Note that the input list must be a text file consisting of a single column, or pasted from e.g. EXCEL as a single column. NOTE: Accession numbers are case-sensitive, UniGene ClusterIDs MUST include the "species-prefix" (like "Hs."), and gene names are not case-sensitive. You may copy the output text and paste it into WORD (and then convert the text into a table), or in EXCEL choose "Edit", "Paste Special", "Text". Conclusion: Only 3 species are supported, and the list of possible IDs is very limited (e.g. no option to use microarray IDs...). Also, the output is a "static" text; there are no hyperlinks.

    The program Resourcerer at the TIGR webpage can also be used for ID conversion. RESOURCERER provides annotation based on the TIGR Gene Indices (TGI) for commonly available microarray resources, including widely used clone sets and Affymetrix GeneChip Arrays. If you have a list of accession numbers, click on the link "Batch Search" and either upload a *.txt file containing your accession numbers, like UniGene, RefSeq, GenBank incl. EST Acc., LocusLink (separated by spaces), or simply type the numbers into the text box. You will retrieve a table containing links to UniGene, LocusLink, the human, mouse, and rat TIGR indices, and to the GO database. You can save the output page in *.html format using your Browser's "Save page..." function and open it in applications like WORD or EXCEL, with all the hyperlinks fully intact. If the list is very long, it will be separated into multiple files !!! OR: If you use the "Download Virtual"-function of Resourcerer, you will get a tab-delimited txt-file of the table (single file irrespective of its length) which you can import into EXCEL, but which has NO hyperlinks. Please note that the output file lists the array names which contain a gene of interest but NOT the individual identifiers (like ProbeSet IDs in case of Affymetrix arrays); please refer to the BioMart description for this purpose. Conclusion: Only 4 species are supported (human, mouse, rat, zebrafish), and only 4 types of ID are supported as input (GenBank, UniGene, LocusLink, RefSeq). There is no way to input gene names or microarray IDs.
           
  


DAT7...see the functions and pathways where my gene set of interest is involved ? -> see PATH1 !

    This question involves resources which are capable of handling large datasets, like those coming from microarray analyses, in a high-throughput manner in order to predict the biological function or any other "over-represented" annotation terms of such sets. Thus, these resources build a kind of bridge between the main sections "Data Integration" and "Pathways, Interactions, Functions".
                       
                   

                                                 
DAT8...produce a general annotation table for a gene set of interest ? (last update Dec. 21, 2005)
                     
    When performing e.g. microarray experiments, the user ends up with lists of gene names or certain types of accession numbers and is confronted with the question of annotating this gene set. Thus, this question somehow represents the more general approach to the matter discussed in DAT7, which focuses on functions and pathways. Also, this question represents the "multi-gene version" of DAT1 which deals with the annotation of single genes. At the main pages, these resources are mainly described in section "High-Throughput Data Retrieval".

    Tip! A very powerful resource for data retrieval is BioMart (formerly known as "EnsMart"). NOTE that there are different web interfaces for BioMart, please refer to the BioMart main section for details. BioMart is a data retrieval tool that generates lists of biological objects (e.g. genes, SNPs) from data held in the Ensembl (and other) databases. In fact, the functionality is based on filtering the whole set of genes in a genome via lists of accession numbers, IDs (like Entrez Gene, RefSeq, MIM, InterPro, PDB, GO, Affymetrix, and many more). BioMart can generate a number of different types of output, including sequence and tabulated list data. Multiple output formats, including HTML, text and Microsoft Excel, are also supported. Note that EXCEL sheets maintain all hyperlinks !
    In order to produce a general annotation table, you may perform the following steps. At the Start Page you have to select the organism and the database. At the Filter Page, have a look at the dropdown menu "Entries with following IDs", where you can provide your own list of e.g. LocusLink IDs, MIM IDs, RefSeq IDs, Affymetrix ProbeSet IDs (!), and many more. The entries can be separated by commas or line breaks. At the Output Page, you should choose the Features Page, where you can select those IDs which you want to appear in the final table. You should select items like general gene accessions (Ensembl Gene ID, External Gene ID, Description), function-related items (GO ID and description, MIM ID), disease attributes (OMIM), protein domain attributes (Interpro, Pfam, Prosite), and protein family IDs.
    NOTE:
BioMart is transcript-centered, so if a gene has multiple transcripts, you will get redundant lines (duplicates, triplicates etc. of genes) in your final EXCEL sheet. Unfortunately, BioMart itself offers no option to remove these automatically. So, if you need a gene list in which each gene is represented just once, you have to delete these rows yourself. Note that BioMart supports a long list of different IDs, although some are still missing. A test run showed that there is no option to search by mouse gene symbols or to output them (in contrast, this is no problem with human HUGO IDs).
    Conclusion: In general, a versatile data retrieval tool. In contrast to e.g. DAVID, there are no pathway related items (like KEGG or Biocarta). When exported into MS EXCEL format, all hyperlinks to the different databases remain active ! Problem of redundancy of entries.
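    NOTE: BioMart itself does not merge these rows, but if the table is exported as tab-delimited text, the redundant rows can be collapsed afterwards with a small script. The following Python lines are only a minimal sketch under assumptions: the function name collapse_by_gene, the column header "Ensembl Gene ID", and the example values are made up for illustration and have to be adapted to the header line of your own export file. Values which differ between the transcripts of one gene are joined by ";" instead of being discarded.

```python
import csv
import io


def collapse_by_gene(tsv_text, key="Ensembl Gene ID"):
    """Merge all rows sharing the same gene ID into one row.

    The column name in `key` is an assumption; use the header of your own file.
    Distinct values found in a column are joined with ';' in input order.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fields = reader.fieldnames
    merged = {}  # gene ID -> {column name -> list of distinct values}
    for row in reader:
        gene = merged.setdefault(row[key], {f: [] for f in fields})
        for f in fields:
            value = row[f]
            if value and value not in gene[f]:
                gene[f].append(value)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for gene in merged.values():
        writer.writerow([";".join(gene[f]) for f in fields])
    return out.getvalue()


# Placeholder table (invented IDs): GENE_A has two transcripts, GENE_B has one.
example = (
    "Ensembl Gene ID\tEnsembl Transcript ID\tGO ID\n"
    "GENE_A\tTRANSCRIPT_1\tGO:0000001\n"
    "GENE_A\tTRANSCRIPT_2\tGO:0000001\n"
    "GENE_B\tTRANSCRIPT_3\tGO:0000002\n"
)
print(collapse_by_gene(example))
```

    In this example the two GENE_A rows end up as one row whose transcript column reads "TRANSCRIPT_1;TRANSCRIPT_2". Joining the values rather than keeping only the first row was chosen so that no transcript-specific annotation is lost.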
                                      
    Tip! DAVID - The Database for Annotation, Visualization and Integrated Discovery integrates functional genomic annotations with intuitive graphical summaries. DAVID provides a comprehensive set of tools for investigators to visually summarize annotation from large lists of genes, including those derived from microarray and proteomic studies. DAVID is provided at NCI-Frederick and was developed to support the bioinformatic needs at the National Institute of Allergy and Infectious Diseases (NIAID). DAVID is composed of several tools for the functional annotation and classification of large gene sets. NOTE: There are no separate URLs for the individual applications. All have to be started from the DAVID main page. Please refer also to the DAVID section at the main page.
   
    The Functional Annotation Tool of DAVID can be used effectively to produce an annotation table of a gene set of interest. You simply paste a list of identifiers of your gene set, like Entrez Gene, Affymetrix, RefSeq, UniProt, GenBank, or UniGene. DAVID automatically determines the input species and produces the annotation summary results within a short time. NOTE: Pop-up blockers should be turned off in order to ensure that the program runs properly ! The user then selects from a long list of items (accessions) which ones to display in the final output table. Note that this is quite similar to the "Filter" page selection at BioMart, but includes additional / different items like pathway databases (KEGG, Biocarta,...). Examples include: Main acc. (Entrez Gene, Affy, GenBank, RefSeq,...); Other acc. (MEROPS, MGI,...); Gene Ontology (GO terms can be selected from different GO levels, and from the 3 branches: biological process, molecular function, cellular component); Protein Domains (Interpro, Pfam, SMART, COG, BLOCKS, PDB,...); Pathways (KEGG, Biocarta, EC number); General Annotations (gene name, symbol, OMIM,...); Functional Categories (COG Ontology, PIR keywords,...); Protein Interactions (BIND, DIP, TRANSFAC,...); Literature (PubMed, GeneRIF,...). The option "Export Selected Annotation as Table" generates either a txt file or an xls file of the selected items. Note that the table is easier to read if the function "Add hyperlinks" is NOT selected !
    Conclusion:
In contrast to BioMart, only ONE ROW is generated for ONE input gene, thereby avoiding the redundancy of rows produced by BioMart. On the other hand, there are no active hyperlinks in DAVID tables. In Excel, it is possible to simply sort the annotation list by e.g. KEGG pathways or GO terms related to biological processes in order to get an impression of the biological function of the dataset.
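    The same kind of overview can be obtained without Excel from the exported txt file. The sketch below counts how often each term occurs in one column of a tab-delimited table. The function name count_terms, the column headers, the comma as separator of several terms within one cell, and the example values are assumptions for illustration only and are not taken from the DAVID documentation.

```python
import csv
import io
from collections import Counter


def count_terms(tsv_text, column, sep=","):
    """Count the terms found in one column of a tab-delimited table.

    `column` and `sep` are assumptions; check the header line and the cell
    layout of your own export file before use.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        cell = row.get(column) or ""  # empty or missing cells count nothing
        for term in cell.split(sep):
            term = term.strip()
            if term:
                counts[term] += 1
    return counts.most_common()  # most frequent term first


# Invented example table: three genes, two hypothetical pathway names.
example = (
    "Gene\tPathway\n"
    "gene1\tpathway X, pathway Y\n"
    "gene2\tpathway X\n"
    "gene3\t\n"
)
for term, n in count_terms(example, "Pathway"):
    print(term, n)
```

    Terms at the top of the list are those shared by most genes of the dataset. Note that this is a plain count and NOT a statistical test for over-representation.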

    Tip! The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. Proteins have been classified according to families and subfamilies, molecular functions, biological processes, and pathways. The high-throughput analysis tools suitable for gene datasets are based on a PANTHER-specific gene ontology as well as on pathway maps. The PANTHER portal consists of several tools. Please refer to the PANTHER main section for detailed information !
    The Batch ID Search tool allows you to find PANTHER-classified genes, transcripts, and proteins via a list of IDs. A list of IDs like gene symbols, gene IDs, protein accessions, and more can be uploaded or pasted. The result list presents a kind of annotation table which is mainly based on the PANTHER-specific GO terms (PANTHER Biological Process, PANTHER Molecular Function), and on the PANTHER Pathways. The result list can be displayed in various formats, like gene list, transcripts/proteins list, PANTHER Ontology terms, Families, Pathways, and Pathway Components. Note that this is also a quick option to display all genes corresponding to a specific pathway or biological process ! This list can be saved as a txt file which is best opened in MS Excel, but which is static (no hyperlinks). Alternatively, you may simply copy/paste the whole list into e.g. WORD, which maintains all hyperlinks ! Note that via the "Species Filter" tabs, it is possible to quickly generate the ortholog gene datasets derived from the supported species !

    TOUCAN is a workbench for regulatory sequence analysis, especially for detecting significant transcription factor binding sites across species. Please refer to the TOUCAN section at the main page for details. Among many functions, TOUCAN provides a module to generate annotation tables of gene lists. From the TOUCAN menu, choose "Get_Seq", "From Ensembl", "Get Info", and hit the "Update" button. This starts the process of retrieving the information and creating the table, which can be followed in the progress bar. Each row represents one gene of your dataset and each column represents one kind of database identifier. This includes basic gene, mRNA and protein IDs as well as sites of "data integration" like Ensembl Gene, EMBL, HUGO, UniGene, UniProt, RefSeq, and LocusLink. In addition, IDs in the fields of protein domains and motifs (Interpro), as well as protein structure (PDB) are listed. As the number of microarray platforms increases rapidly, the number of columns in TOUCAN displaying chip-specific identifiers is also constantly growing. Thus, not only the world leader in this technology, Affymetrix, is covered but also other platforms. Finally, the table also presents IDs which are related to functional information on each gene, like the involvement in biological processes, or the localization to specific cellular compartments (GO, MIM). To look at this information in detail, you can highlight all rows and copy / paste the data into programs like MS EXCEL.
    Conclusion:
TOUCAN supports a long list of ID types as input / output (including gene names, microarray IDs, and many more). Another positive feature is that the table is non-redundant, displaying only one row per gene. Nevertheless, the table is "static", there are no hyperlinks.

   
    The Stanford Online Universal Resource for Clones and ESTs (SOURCE) compiles information from several publicly accessible databases, including UniGene, dbEST, Swiss-Prot, GeneMap99, RHdb, GeneCards and LocusLink. The mission of SOURCE is to provide a unique scientific resource that pools publicly available data commonly sought after for any clone, GenBank accession number, or gene. SOURCE Batch Search is the batch extract interface for SOURCE. You can choose between 3 organisms (human, mouse, rat). You can input a list of IDs (! including ESTs !), dbEST cloneIDs, UniGene ClusterIDs, UniGene gene names, UniGene gene symbols, or LocusLink (now Entrez Gene) IDs, and retrieve data from a check list, where you may choose e.g. UniGene Name, Symbol, ClusterID, LocusLinkID, RefSeq mRNA, RefSeq Protein, GO Annotations, and 2 fields extracted from Swiss-Prot: Enzymatic Function and Subcellular Location. Note that the input list must be a text file consisting of a single column, or pasted from e.g. EXCEL as a single column. NOTE: Accession numbers are case-sensitive, UniGene ClusterIDs MUST include the "species-prefix" (like "Hs."), gene names are not case-sensitive. You may copy the output text and paste it into WORD (and then convert the text into a table), or in EXCEL choose "Edit", "Paste Special", "Text".
    Conclusion:
Only 3 species are supported, and the list of possible IDs is very limited (e.g. no option to use microarray IDs...). Also, the output is a "static" text, there are no hyperlinks.
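    As the input rules are rather strict (single column, "species-prefix" for UniGene ClusterIDs, case-sensitive accession numbers), it may help to check a list before pasting it into the Batch Search form. The following is a minimal Python sketch, assuming that a ClusterID consists of a short letter prefix, a dot, and a number (as in "Hs." followed by digits). The function name prepare_cluster_ids, the pattern, and the example IDs are made up for illustration and are not part of SOURCE.

```python
import re

# Assumed layout of a UniGene ClusterID: letters, a dot, digits.
CLUSTER_ID = re.compile(r"^[A-Za-z]+\.\d+$")


def prepare_cluster_ids(raw_ids):
    """Return (single_column_text, rejected_ids).

    IDs are stripped of surrounding blanks but their case is left untouched.
    Duplicates and empty entries are dropped; IDs without prefix are rejected.
    """
    accepted, rejected = [], []
    for raw in raw_ids:
        cluster_id = raw.strip()
        if not cluster_id:
            continue
        if CLUSTER_ID.match(cluster_id):
            if cluster_id not in accepted:
                accepted.append(cluster_id)
        else:
            rejected.append(cluster_id)
    # One ID per line = a single column, ready to paste or save as a txt file.
    text = "".join(cluster_id + "\n" for cluster_id in accepted)
    return text, rejected


# Invented example IDs; "12345" lacks the species prefix and is rejected.
text, rejected = prepare_cluster_ids(["Hs.12345", " Mm.678 ", "12345", "", "Hs.12345"])
print(text, end="")
print("rejected:", rejected)
```

    The rejected IDs are returned rather than silently repaired, because guessing the species prefix could assign an ID to the wrong organism.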

    The program Resourcerer at the TIGR webpage can also be used for ID conversion. RESOURCERER provides annotation based on the TIGR Gene Indices (TGI) for commonly available microarray resources, including widely used clone sets and Affymetrix GeneChip Arrays. If you have a list of accession numbers (UniGene, RefSeq, GenBank incl. EST Acc., or LocusLink), click the link "Batch Search" and either upload a *.txt file containing the accession numbers separated by spaces, or simply type the numbers into the text box. You will retrieve a table containing links to UniGene, LocusLink, the human, mouse, and rat TIGR indices, and to the GO database. You can save the output page in *.html format using your browser's "Save page..." function and open it in applications like WORD or EXCEL, with all hyperlinks fully intact. NOTE: If the list is very long, the output will be split into multiple files ! Alternatively, the "Download Virtual" function of Resourcerer delivers the table as a single tab-delimited txt file (irrespective of its length), which you can import into EXCEL but which has NO hyperlinks. Please note that the output file lists the names of the arrays which contain a gene of interest but NOT the individual identifiers (like ProbeSet IDs in the case of Affymetrix arrays); please refer to the BioMart description for this purpose.
    Conclusion:
Only 4 species are supported (human, mouse, rat, zebrafish), only 4 types of ID are supported as input (GenBank, UniGene, LocusLink, RefSeq). There is no way to input gene names or microarray IDs.
           