Bioinformatics World    
Pathways and Interactions Databases
NOTE: This section lists resources which store information on metabolic and regulatory pathways as well as on protein-protein interactions, as these fields are often not separated. At the FAQ pages, the question of predicting the pathways and biological functions of specific gene datasets is treated mainly in FAQ PATH1.
BIND - Biomolecular Interaction Network Database
(Blueprint, Toronto, Canada)
BIND (Biomolecular Interaction Network Database) is hosted by the Blueprint Initiative in Toronto, Canada. BIND is a database designed to store full descriptions of interactions, molecular complexes and pathways. Interactions between any two molecules composed of proteins, nucleic acids and small molecules are described. The database can be used to study networks of interactions, to map pathways across taxonomic branches and to generate information for kinetic simulations.
NOTE that many of the described interactions involve yeast and Drosophila proteins, and only a small fraction involve human proteins, so it may be worth checking for homologs of your protein of interest!

There are several ways to query the BIND database.
1. BINDBlast: BLAST the BIND database using a protein sequence. You will see whether your sequence or homologous sequences are contained in the database. You will get links to the interacting protein(s), the detection method (such as a two-hybrid screen), abstracts, etc. Note that each interaction is described by a unique BIND interaction ID. Interactions may be navigated visually using a Java applet called "BIND Interaction Viewer"; simply select the viewer from the menu associated with each interaction report. Alternatively, you may display protein 3D structures associated with interactions by launching the NCBI Cn3D Viewer (please refer to the section "Visualization Software" in main section "3D structures").
2. Search or Browse the BIND database. You can perform a simple text query, or search via diverse accession numbers. You may also browse the whole BIND database for described interactions.
3. Search PreBIND: PreBIND is a data mining tool that helps researchers locate biomolecular interaction information in the scientific literature. You can enter the name or accession number (RefSeq) or a PubMed ID (PMID) of a protein and PreBIND will return a list of papers that talk about the other molecules that interact with that protein. These papers are found using a list of synonyms that the protein is known by so you can find papers that talk about the protein by names other than the one that you entered. The fact that a paper describes interaction information is determined by a supervised learning algorithm called a support vector machine (SVM); for this reason PreBIND may return papers that could not be easily retrieved simply by doing literature searches for keywords such as "interaction".
NOTE that you will retrieve literature hits ONLY if the protein is already part of the BIND database!

Typical BIND accessions: refer to section BIND IDs.
BioCarta Pathways
(BioCarta, Inc., San Diego, USA)
BioCarta is a company which develops reagents and assays for biopharmaceutical research. The BioCarta website serves as an interactive web-based resource for life scientists. Its information falls into four categories: gene function, proteomic pathways, ePosters, and research reagents.

The BioCarta Pathways section includes expert-curated interactive graphic models of many pathways from diverse fields like apoptosis, cell cycle, cell signalling, development, immunology, neuroscience, adhesion, and metabolism. There is a keyword search option for pathway names or gene names. You may also perform a multi-gene search, limiting the output to pathways that include all of the query genes.
NOTE that clicking on an individual gene in a pathway map reveals a table comprising all important links like LocusLink, UniGene, KEGG, OMIM, GeneCard, PubMed and more.
NOTE: Although BioCarta offers a "Multi-Gene Search" option, tests showed that it is NOT suitable to analyze large gene datasets for common pathways.
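
The "all query genes" restriction of a multi-gene search can be pictured as a subset test. The following is only a minimal sketch in Python, assuming a hypothetical in-memory mapping of pathway names to member genes; the pathway names and gene sets are illustrative placeholders, not BioCarta content or a BioCarta interface.

```python
# Hypothetical pathway-to-gene mapping; placeholder values, not BioCarta data.
PATHWAYS = {
    "pathway_A": {"TP53", "BAX", "CASP3"},
    "pathway_B": {"TP53", "CDKN1A", "CCND1"},
    "pathway_C": {"PTGS2", "IL1A", "SELE"},
}

def pathways_with_all_genes(query_genes, pathways=PATHWAYS):
    """Return the names of pathways whose gene set contains every query gene."""
    query = {gene.upper() for gene in query_genes}
    # "query <= genes" is the subset test: every query gene must be a member.
    return sorted(name for name, genes in pathways.items() if query <= genes)
```

Under these assumptions, each additional query gene can only shrink the result list, so a long gene list quickly returns no pathway at all.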

NOTE: BioCarta pathways are also searchable via the portal BioCarta Pathways on CGAP. As "additional value", CGAP has linked each human gene in BioCarta and each human enzyme in KEGG to its CGAP Gene Info page. Please refer to the main section of BioCarta Pathways on CGAP for further information and for examples!
BioCyc
(SRI International)
BioCyc is a collection of 205 Pathway/Genome Databases plus the BioCyc Open Chemical Database. Each Pathway/Genome Database in the BioCyc collection describes the genome and metabolic pathways (NOT signaling pathways!) of a single organism, with the exception of the MetaCyc database, which is a reference source on metabolic pathways from many organisms. The BioCyc databases are divided into three tiers according to the amount of curation they have received (intensive, moderate, or none).
Query: ALL databases (including MetaCyc and HumanCyc) can be queried from the BioCyc Query page. The user may query for pathways, reactions, compounds, genes, and proteins, or browse ontologies or screen through lists of database entries.

MetaCyc is a database of nonredundant, experimentally elucidated metabolic pathways. MetaCyc contains 700 pathways from more than 600 different organisms and is curated from the experimental scientific literature. It stores pathways involved in both primary metabolism (including photosynthesis) and secondary metabolism, as well as associated compounds, enzymes, and genes.
Query: MetaCyc pathways can be browsed from a list, from ontologies, or queried directly when searching for pathways, proteins, reactions or compounds.

HumanCyc is a bioinformatics database that describes the human metabolic pathways and the human genome. By presenting metabolic pathways as an organizing framework for the human genome, HumanCyc provides the user with an extended dimension for functional analysis of H. sapiens at the genomic level.
Users can query the HumanCyc database through the BioCyc web site. On the BioCyc Query page, it is possible to choose one of several options, such as browsing lists of all stored pathways, genes, proteins, or compounds.

- Note that HumanCyc (like all other BioCyc databases) does not contain signaling pathways. Some data, like alternative splicing and tissue specificity, are not yet supported.
- BioCyc contains many more comments describing individual metabolic pathways than KEGG does.
- KEGG maps are much larger than BioCyc maps, and are mosaics that combine reactions from many organisms, whereas BioCyc maps describe single pathways elucidated in single organisms.
- Taken together, BioCyc is of at least equal quality to KEGG in the field of metabolic pathways.

Typical BioCyc accessions: refer to section BioCyc IDs.
DIP - Database of Interacting Proteins
The DIP database, hosted by the University of California, catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated both manually by expert curators and automatically using computational approaches that utilize knowledge about the protein-protein interaction networks extracted from the most reliable, core subset of the DIP data.
NOTE that registration is required to gain access to some of the DIP features. Registration is free for members of the academic community.

There are several ways to query the DIP database.
1. Node: Search the database for matches within the various fields describing protein (node) entries. The results are returned as a list of proteins that fulfill the criteria specified.
NOTE: If you do not retrieve any hits when using a protein name of interest, you may try searching for your protein in SwissProt and/or PIR databases. Once you have found your protein there you can come back to DIP as both of those databases provide cross-references to DIP.
2. BLAST: Search the database for sequences matching a particular sequence or its fragment. The results are returned as a list of proteins sorted by the BLAST significance score (p-value).
3. Motif: Search the database for proteins containing a motif specified as a Prosite ID or as a user-specified regular expression.
4. Article: Search the database for interactions described in selected article(s). The results are returned as a list of interactions that were described by at least one experiment from the selected article(s).
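
The motif search in option 3 can be illustrated by translating a PROSITE-style pattern into a regular expression. This is only a sketch in Python covering a subset of the pattern syntax (x, [..], {..}, repetition counts, and the < and > anchors); the function names and the test sequence are made up for this example, and the code does not represent how DIP implements its search.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset of the syntax) into a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")   # "<" anchors at the N-terminus
        anchor_end = element.endswith(">")       # ">" anchors at the C-terminus
        element = element.strip("<>")
        # Split off an optional repetition such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":                          # x = any residue
            core = "."
        elif core.startswith("{"):               # {P} = any residue except P
            core = "[^" + core[1:-1] + "]"
        # [ST] and single letters are already valid regex fragments.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)

def find_motif(sequence, pattern):
    """Return (start index, matched text) for each non-overlapping motif hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]
```

For example, the pattern N-{P}-[ST]-{P} becomes the regular expression N[^P][ST][^P], which can then be scanned against any protein sequence string.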

A query result page shows 2 links.
- "Node": Interactions are displayed as color graphs (top-right link "Graph") containing nodes. Click on the "Legend" in the bottom right corner to obtain a description of the quality and reliability of each interaction (e.g. indicated by the thickness of each line).
- "Links": displays a list of these interacting proteins.
HiMAP - Human Interactome Map
(University of Michigan)
HiMAP is a dynamic browser for the human protein-protein interaction map, provided by the University of Michigan. HiMAP allows users to begin with a single protein or a set of proteins and explore both known and predicted protein-protein interactions. Literature-confirmed interactions come from the Human Protein Reference Database (HPRD), yeast-two-hybrid-defined interactions come from two recent publications in Nature and Cell, and predicted interactions were generated by a Bayesian Analysis published in Nature Biotechnology. Note: The Y2H dataset of the Cell publication is also available for query at the MDC Berlin-Buch under the name PPI Database.
Note: Registration is required; it is free for academic institutes. You do not need to register again if you have already registered for the related resource Oncomine (see Oncomine section).

1. Single gene query (see command "Tools"):
Search by gene symbol, gene name, LocusLink ID or UniGene ID. In order to draw protein interaction networks, the user may select between different methods of protein-protein interaction determination / prediction, like specific yeast two-hybrid datasets, the HPRD dataset, literature-confirmed interactions, or "pure" predictions. Note: If more than one option is chosen, it is possible to highlight interactions derived from a specific method by selecting the "Highlight edges" checkbox. Note: It is possible to set the node colors in the graph based on molecular function or cellular localization. The link "Legend" in the graph window explains the different colors, like "kinase", "transcription factor", "receptor" or "nucleus" and "cytoplasm". The interaction network is drawn based on the selected methods. Each "connecting line" (edge) between 2 genes is clickable and opens an information box showing the 2 gene names, the evidence type, and the associated PubMed references in which this "interaction" is described.

2. Batch gene search / Gene list upload (see command "Tools"): Use 'Batch Gene Search' for temporary lists of fewer than 100 proteins. Use 'Gene List Upload' for larger lists and to store lists. Go to 'My Gene List' to view and analyze uploaded lists. Example: It is possible to take a group of co-clustering genes from a microarray experiment and use this gene list as input for HiMAP. The created map presents a very informative overview of the potential "relationships" of the proteins within that group. NOTE that "relations" do not necessarily mean direct protein-protein interactions, as in the example of TTP (ZFP36) and PTGS2 (COX2), which is classified as "Shared reference into function" based on PMID: 12578839. This paper demonstrates that TTP binds to the 3'-UTR of the PTGS2 mRNA. Another example is the relation between IL1A and SELE (E-selectin), which is also classified as "Shared reference into function", based on PMID: 12673844, which shows that Interleukin 1 upregulates the expression of SELE. Another example is the relation between IL8 and VEGF, which is based on several literature citations pointing to co-expression in certain situations (also not necessarily meaning protein-protein interaction).
Thus, as HiMAP is based not only on confirmed interactions but also on a predictive model to find protein-protein interactions, it actually includes a wide range of "Interaction types" beyond real protein-protein interaction, like relations based on co-expression (based on co-citations), enriched domain pairs (based on InterPro protein domains), or shared biological process (based on GO terms). Therefore, HiMAP is also mentioned in section "Integrated Functional Data Mining".
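
Because an edge in such a map may rest on very different kinds of evidence, it can help to separate edges by evidence type before interpreting them. The sketch below, in Python, assumes a simple edge list; the set of labels treated as "direct" evidence, the "Co-expression" label and the placeholder pair PROT1/PROT2 are assumptions made for illustration and are not taken from HiMAP output.

```python
from collections import namedtuple

Edge = namedtuple("Edge", ["gene_a", "gene_b", "evidence"])

# Labels treated here as evidence of a physical interaction
# (an assumption for this sketch, not HiMAP's own vocabulary).
DIRECT_EVIDENCE = {"Y2H", "HPRD"}

EDGES = [
    Edge("ZFP36", "PTGS2", "Shared reference into function"),
    Edge("IL1A", "SELE", "Shared reference into function"),
    Edge("IL8", "VEGF", "Co-expression"),  # label paraphrased from the text
    Edge("PROT1", "PROT2", "Y2H"),         # hypothetical placeholder pair
]

def split_by_evidence(edges, direct_labels=DIRECT_EVIDENCE):
    """Separate edges backed by direct-interaction evidence from all others."""
    direct = [e for e in edges if e.evidence in direct_labels]
    indirect = [e for e in edges if e.evidence not in direct_labels]
    return direct, indirect
```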

NOTE: The Adobe SVG viewer plugin is needed in order to display the protein interaction networks! Tests showed that this plugin works much better in MS Internet Explorer than in Netscape, at least in Netscape version 7.2!
HPRD - Human Protein Reference Database
(Johns Hopkins School of Medicine, Baltimore, and Institute of Bioinformatics, Bangalore, India)
HPRD - Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data. HPRD is a joint project between Pandey Lab, Johns Hopkins School of Medicine, Baltimore, and the IOB, Institute of Bioinformatics, Bangalore, India. At first sight, HPRD may seem like "yet another protein database", but there are some features which are really worth mentioning.

1. Query HPRD:
Besides the "usual" options (accession numbers, gene names), you may limit the search to certain molecular classes, cellular components, domains, motifs, tissue and cell type-specific expression, diseases, or length of the protein sequence or molecular weight.
Example: You may also retrieve ALL proteins related to a specific term, like "Expression": "Testis" by simply leaving all other search fields empty. Note that HPRD is strongly "graphically oriented", meaning that proteins are depicted with their domain structures and sites of post-translational modification. This delivers a good impression concerning the organization of individual proteins at the cost of options concerning data download. Note: Some terms like "endothelial cells" are listed under "vascular endothelial cells" or "vascular endothelium".
In general, this type of interface allows a large number of specific combinations of search terms and thus may be used to address diverse types of questions like the following examples:
- I want to know all N-glycosylated proteins localized to the endoplasmic reticulum (Note: "Glycosylation" is found under "PTMs", post-translational modifications). Result: 30 proteins.
- I want to know all transcription factors having a HMG domain. Result: 22 proteins.
- I want to know all proteins related to atherosclerosis. Result: 3 proteins. Note: The term "arteriosclerosis" will retrieve no hits.
- I want to know all transcription factors expressed in vascular endothelial cells. NOTE: This example demonstrates the limitations of this system, as the query yields only 4 proteins, which is definitely far fewer than the actual number. The system will only retrieve those entries which have been manually annotated as being expressed in a certain tissue or cell type. This limitation probably applies to each of the above questions.

2. Browse HPRD:
You may also browse the sections of the HPRD database, and retrieve all corresponding protein entries. The main sections are: Molecule Class, Domains, Motifs, PTMs, and Localization. This feature is especially useful if you want to get ALL entries of a specific type, and you do not need to make any combinations of search terms.

3. Pathways:
This section displays selected pathways, like EGF receptor, B cell receptor, Wnt, IL2 receptor, or TNF receptor 1. The pathway diagrams have been drawn based on the protein-protein interaction data contained in HPRD. The pages show static JPEG images by default. If you click the 'view SVG format' button at the bottom of a diagram, you might be prompted to download a plug-in for SVG. You should accept and download the plug-in, as it will allow you to visualize the data in SVG format in a new pop-up window. In this window, you will be able to link directly to any molecule page by clicking on the name of any molecule in the interaction network. You will also be able to search for a protein by name by right-clicking and using the 'Find' command on the SVG page.

4. Single protein entries:
Each protein entry in HPRD is composed of several "tabs" which correspond to specific data:
- "Summary": shows data like localization, domains and literature references relating to expression data.
- "Sequence": shows protein and nt sequence. Note: A useful feature is the color-coding of protein domains within the sequence!
- "Alternate Names": self-explanatory. Note: Sometimes, the official gene symbol is not contained at all, like for the gene IKK alpha (official: CHUK).
- "Diseases": mainly a link to the OMIM database entry.
- "PTMs and Substrates": Sites of post-translational modifications and known substrates of a certain protein.
- "Interactions": all known interaction partners of a specific protein. NOTE: This is a very useful and valuable component of the HPRD database (and the reason why HPRD is categorized under the section "Pathways and Interactions"). The related PubMed abstracts are linked with each interaction and the type (in vitro, in vivo) is indicated. This is an example of the interaction partners of the protein IKK alpha.
NOTE: These HPRD-based interaction data are also available when performing a query at the STRING database, and choosing the view "Experiments" at the output overview; see also STRING section.

Typical HPRD accessions: refer to section HPRD IDs.
iHOP - information Hyperlinked Over Proteins
(Protein Design Group, National Center of Biotechnology, Madrid, Spain)
iHOP - information Hyperlinked Over Proteins generates a network built of co-citations of genes and proteins in public literature. iHOP is a public service provided by the Protein Design Group (PDG), National Center of Biotechnology (CNB), Madrid, Spain. By employing genes and proteins as hyperlinks between sentences and abstracts, iHOP converts the information in PubMed into one navigable resource. Note that protein-protein interactions which are experimentally verified are specifically highlighted.

Query iHOP:
The search for literature information about a particular gene (GENE X) is the starting point in iHOP. You may limit the search to specific fields or to individual organisms.

This is an example of the search result for the gene Ptgs2 (COX2). First, all entries in different organisms are displayed, where you can choose the species of interest. The result will be shown on one page, containing general information about the gene and all sentences that associate the gene with others. Gene symbols within sentences are hyperlinks to their corresponding information pages. For example, clicking on gene symbol (GENE Y) will open the overview page for the gene (GENE Y). Sentences that associate the current gene (GENE Y) with the previous gene (GENE X) will be shown at the top of the page and are separated from other sentences by a line. For every sentence the original abstract is available via a specific icon, and there is another icon which adds interesting sentences to the so-called Gene Model, which serves as a logbook and provides a graphic representation of your findings. All the associations between collected sentences are represented as a graph. Nodes in this graph represent genes. Edges correspond to sentences that associate two genes with each other. The intensity of edges increases with the number of sentences that describe an association.
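
The graph described above (genes as nodes, sentences as edges, edge intensity growing with the number of sentences) can be sketched as a simple co-occurrence count. The Python example below is only an illustration: the sentences and gene symbols are invented placeholders, and plain token matching stands in for iHOP's actual gene recognition, which is not described here.

```python
import re
from collections import Counter
from itertools import combinations

def cocitation_edges(sentences, gene_symbols):
    """Count, for every gene pair, the sentences that mention both genes."""
    symbols = set(gene_symbols)
    edges = Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        mentioned = sorted(tokens & symbols)   # sorted so pairs have a fixed order
        for pair in combinations(mentioned, 2):
            edges[pair] += 1                   # edge weight = number of sentences
    return edges

# Invented placeholder sentences and gene symbols, for illustration only.
SENTENCES = [
    "GENEX binds GENEY in vitro.",
    "GENEY is phosphorylated by GENEX.",
    "GENEX and GENEZ are co-expressed.",
    "GENEZ was not studied further.",
]
EDGES = cocitation_edges(SENTENCES, {"GENEX", "GENEY", "GENEZ"})
```

In this toy input the pair GENEX/GENEY is supported by two sentences and would therefore be drawn with a more intense edge than GENEX/GENEZ.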

- "Show Overview": displays all associated genes as a list, indicating the number of sentences with the "base gene" for each one. NOTE that this list may also contain synonyms of the base gene, which are not explicitly marked.
- "Find in this page": opens a small input box which lets you search for specific terms within the page. NOTE that in case you are using a browser like Netscape, you may use the "built-in" function "Find text as you type", which immediately highlights any text that you type (but not within the URL field).
- "Filter and Options": "Highlighting associative verbs": In general more than 90% of all active relations between proteins in the literature are expressed syntactically as "protein verb protein". In iHOP all verbs that describe interactions between proteins (e.g. 'bind', 'phosphorylate', 'inhibit', 'activate', etc.) and occur between two proteins can be highlighted. "Show official symbols": Select this option to show the official gene symbol beside the identified synonym, e.g. "PC7" -> "[PCSK7] PC7". "Show only experimentally confirmed associations": iHOP allows the user to overlay experimental data on the literature network, so that sentences that include proteins whose interaction has experimental evidence will be highlighted and ranked first. Experimental data are derived from the IntAct database, and include techniques like Y2H (Yeast-2-hybrid screens), TAP (Tandem Affinity Purification), and HMS (High-throughput mass spectrometric protein complex identification).
- "Definitions for gene X": shows definitions according to the functional context of each literature reference.
- "Enhanced PubMed/Google query": It is a well-known fact that one will retrieve a different set of literature references depending on which of the synonyms of a certain gene is used in the query. Enhanced queries can include synonyms and orthographic variations for genes or MeSH terms. The user may choose which synonyms to include in the query. You can search either in Google or in the PubMed database.
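
The idea behind an enhanced query, combining all chosen synonyms of a gene into one search, can be sketched as a Boolean OR expression. The helper below is a hypothetical Python illustration (the function name and its parameters are not part of iHOP); it only builds a query string and does not contact PubMed or Google.

```python
def build_synonym_query(synonyms, extra_terms=()):
    """Combine gene synonyms with OR and optional extra terms with AND."""
    unique = list(dict.fromkeys(synonyms))   # drop duplicates, keep the order
    query = "(" + " OR ".join('"%s"' % name for name in unique) + ")"
    for term in extra_terms:
        query += ' AND "%s"' % term
    return query
```

For example, build_synonym_query(["PCSK7", "PC7"]) yields ("PCSK7" OR "PC7"), which retrieves references that use either name.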

NOTE: iHOP is also included in, and automatically run by, the "data super-integration" tool Bioinformatic Harvester at EMBL!

IntAct
IntAct provides a freely available, open source database system and analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions.
IntAct can be queried using diverse types of identifiers, like gene name ("Ptgs2"), IntAct accession number ("EBI-298933"), UniProt acc. ("Q05769"), UniProt ID ("PGH2_MOUSE"), InterPro acc., GO acc., and PubMed ID.
The output displays lists of interaction partners, links to PubMed references, experimental techniques to verify interactions, graphical displays of interaction networks, and more.
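
When a gene list mixes several of these identifier types, a small heuristic can sort them before querying. The Python sketch below guesses the type from the identifier's shape; the regular expressions reflect the commonly documented formats of these accessions as far as known to the editor, the function name is hypothetical, and the heuristic is not part of IntAct.

```python
import re

# Ordered list of (label, pattern); the first full match wins.
ID_PATTERNS = [
    ("IntAct accession", re.compile(r"EBI-\d+")),
    ("UniProt ID", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
    ("UniProt accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("GO accession", re.compile(r"GO:\d{7}")),
    ("InterPro accession", re.compile(r"IPR\d{6}")),
    ("PubMed ID", re.compile(r"\d+")),
]

def guess_identifier_type(identifier):
    """Guess the identifier type from its format; fall back to a gene name."""
    for label, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "gene name or free text"
```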

Note: IntAct data are also linked by other resources like the Stanford SOURCE database (section "SwissProt Information", "Miscellaneous" in individual gene-centered entries). Note that only the acc. numbers are listed, there are no direct links to IntAct.

Typical IntAct accessions: refer to section IntAct IDs.
KEGG - Kyoto Encyclopedia of Genes and Genomes
(Kyoto University)
KEGG is part of the GenomeNet project of Kyoto University. KEGG, which was initiated in 1995, turns sequence information from a number of organisms into metabolic or regulatory pathways. This site makes it easy to place genes into a functional context, and to look for as yet unknown genes that might exist in an organism. A good place to start is the KEGG table of contents. KEGG consists of 4 main databases: PATHWAY, GENES, LIGAND, and BRITE.

1. Main KEGG Databases and associated tools:

1.1. PATHWAY:
The KEGG PATHWAY database is a graphical catalogue of manually drawn pathway maps for metabolic pathways (like glycolysis or ATP synthesis) and regulatory pathways (like apoptosis or cell cycle), which can be browsed by topic. Pathway maps are based on an extensive survey of the published literature. If available, different organisms are compared. All components of the maps are clickable, leading to detailed information. Maps are available both as GIF files and in an XML version. These KEGG Markup Language (KGML) files provide graph information that can be used to computationally reproduce and manipulate KEGG pathway maps.
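
As an illustration of how KGML graph information might be read programmatically, the following Python sketch parses a small hand-written KGML-like document with the standard library. The element and attribute names (entry, graphics, relation, subtype) follow the editor's understanding of the KGML layout; the sample content, including all identifiers, is an invented toy and not a real KEGG map.

```python
import xml.etree.ElementTree as ET

# Hand-written toy document in KGML style; not an actual KEGG pathway file.
SAMPLE_KGML = """\
<pathway name="path:toy00000" org="toy" number="00000" title="Toy pathway">
  <entry id="1" name="toy:geneA" type="gene">
    <graphics name="GENEA" type="rectangle"/>
  </entry>
  <entry id="2" name="toy:geneB" type="gene">
    <graphics name="GENEB" type="rectangle"/>
  </entry>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
</pathway>
"""

def read_kgml(xml_text):
    """Return (title, entries, relations) extracted from a KGML-style string."""
    root = ET.fromstring(xml_text)
    entries = {}
    for entry in root.findall("entry"):
        graphics = entry.find("graphics")
        # Prefer the display label; fall back to the raw entry name.
        label = graphics.get("name") if graphics is not None else entry.get("name")
        entries[entry.get("id")] = (entry.get("type"), label)
    relations = []
    for rel in root.findall("relation"):
        subtypes = [s.get("name") for s in rel.findall("subtype")]
        relations.append((entries[rel.get("entry1")][1],
                          entries[rel.get("entry2")][1],
                          rel.get("type"), subtypes))
    return root.get("title"), entries, relations
```

The returned entries and relations correspond to the nodes and edges of the pathway graph and could be passed on to any graph library.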

The PATHWAY database (as well as the others) is searchable in different ways:
- DBGET: see description below. Note that this interface allows searches for pathway names like "MAPK" or "cytokine" but not a search for e.g. a gene name of interest (like "DUSP4")!
- Search Objects in KEGG Pathways: This site allows batch searches of gene lists against the KEGG PATHWAY database. Example: A cluster of gene names from a microarray experiment can be analyzed to display all pathways which are involved and to graphically highlight the input genes within the pathway maps. Different identifiers (gene names, EC accessions, KEGG gene identifiers, and more) may be used as the query. The output shows the list of pathways and the corresponding genes, but there is no statistical summary of how many genes of your input list are found in which pathway.
Note: If a gene is not found in this search, it might still be present in the KEGG GENES database without being assigned to a pathway (yet)!
Note: If you want to perform such a search while including the KEGG annotation "vocabulary" (KO), please refer to section KAAS below.
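
Since the output lacks a summary of how many input genes fall into each pathway, such a tally can be computed afterwards from the returned pathway/gene lists. The Python sketch below assumes the results have already been collected into a dictionary of pathway name to gene set; the function name, pathway names and memberships are hypothetical placeholders, not KEGG data.

```python
def pathway_hit_summary(input_genes, pathways):
    """Tally input genes per pathway and list genes found in no pathway."""
    query = set(input_genes)
    rows = []
    for name, members in pathways.items():
        hits = sorted(query & members)
        if hits:
            rows.append((name, len(hits), len(members), hits))
    # Pathways with the most hits first, ties broken alphabetically.
    rows.sort(key=lambda row: (-row[1], row[0]))
    unassigned = sorted(query - set().union(*pathways.values()))
    return rows, unassigned

# Placeholder pathway memberships, for illustration only.
TOY_PATHWAYS = {
    "map_X": {"GENE1", "GENE2", "GENE3"},
    "map_Y": {"GENE2", "GENE4"},
}
```

Each result row holds the pathway name, the number of input genes found, the pathway size and the matching genes; genes in the second return value were not assigned to any pathway.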

NOTE: KEGG pathways are also searchable via the portal KEGG Pathways on CGAP. As "additional value", CGAP has linked each human gene in BioCarta and each human enzyme in KEGG to its CGAP Gene Info page. Please refer to the main section of KEGG Pathways on CGAP for further information and for examples!

1.2. GENES:
The KEGG GENES database is a collection of gene catalogs for all complete genomes and selected partial genomes, generated from publicly available resources, mostly NCBI RefSeq.
All genomes in KEGG GENES are subject to SSDB computation (precomputed protein similarities, similar to BLink at NCBI). SSDB (Sequence Similarity Database) contains the information about amino acid sequence similarities among all protein-coding genes in the complete genomes, which is computationally generated from the GENES database in KEGG. SSDB is a huge graph consisting of protein-coding genes as its nodes and similarities as its edges. In addition, all KEGG GENES entries are given manual KO assignments (see below).   
Query KEGG GENES: Each GENES entry contains cross-reference information to outside databases, including NCBI gi numbers, Entrez Gene IDs, and UniProt accession numbers. KEGG provides automatic ID conversion, enabling the use of such outside identifiers to access KEGG GENES and then the other KEGG databases.

1.3. LIGAND: comprises "the universe of chemical reactions involving metabolites and other biochemical compounds, as well as drugs and xenobiotic compounds". KEGG LIGAND is a composite database consisting of the COMPOUND, GLYCAN, REACTION, RPAIR, and ENZYME databases. ENZYME is derived from the Enzyme Nomenclature, but the other four are developed and maintained at KEGG in a common relational database.
NOTE: The Reaction Classification (RC) system is described below!

- Compound: collection of chemical compounds that are related to various cellular processes.
- Reaction: collection of reactions, mostly enzymatic reactions, involving those compounds.    
- Enzyme: collection of enzyme entries based on the Enzyme Nomenclature by IUBMB and IUPAC.

Query - Test:
Note: Please refer also to the other resources described in section "Small Molecules Databases" for comparison !
- All 3 test gene names (PTGS2, TP53, SELE) are not found in KEGG LIGAND, BUT they are retrievable using KEGG GENES. Example: The human PTGS2 entry links to the Enzyme ("EC:") entry which then links to e.g. inhibitors like Diclofenac, stored in KEGG COMPOUND database. Note: The term "SELE" retrieves many hits, so it is better to use the alternate gene symbol "ELAM".
- All 3 test drug names (Aspirin, Diclofenac, Celebrex) are found, using the KEGG Compound database via the search field "Name". Note: Celebrex is only found via its synonym "Celecoxib".
Note: If available, there are direct links to other compound databases like PubChem and ChEBI.
- All 3 test disease names (Atherosclerosis, Alzheimer, Inflammation) are not found in KEGG, although the KEGG BRITE database has a section "Disease Genes, Genomes and Pathways" which also has a "Text Search" option.

1.4. BRITE had been a separate database for many years, but it was formally included in KEGG in release 34.0 (April 2005) to establish a logical foundation for the KEGG project. KEGG BRITE is a collection of hierarchies, with the main objective to allow automated functional annotations of datasets. BRITE focuses both on hierarchical structuring of gene-based knowledge via the KEGG Orthology (KO) system, as well as on structuring knowledge on biochemical reactions via the Reaction Classification (RC) system, and finally on structuring knowledge based on compounds and drugs, the Chemical Ontology (CO).

General remark on Gene Ontology (GO) - KEGG Orthology (KO): (from Mao et al., Bioinformatics 2005)
In recent years, high-throughput technologies such as DNA sequencing and microarrays have created the need for the automated annotation and analyses of large sets of genes, including whole genomes. To this end, an ontology, which is defined as a specification of a conceptualization, provides a common controlled vocabulary to facilitate electronic communication and sharing of information across different research groups and enables comparison of annotations across different genomes and different gene sets. Several ontologies have been developed for genome annotation and expression analysis such as the Gene Ontology (GO). The Gene Ontology organizes functional terms into three top-level categories: molecular function, biological process and cellular component. Each category is structured as a directed acyclic graph (DAG) in which a term may have more than one parent and more than one child. The Gene Ontology has been used in the annotation of many genome databases. Researchers annotating these databases use a combination of automation and manual curation to assign GO terms to genes in these genomes.

While GO offers tremendous value, it also has certain limitations:
- Firstly, the GO hierarchy has highly varied depths along different branches, from two levels (e.g. GO:0001662 behavioral fear response) to 15 levels (e.g. GO:0030607 mitotic spindle orientation). Some of the variation is inherent in different functional families, while some may be an artifact of the uneven contribution by different groups participating in GO's development and may affect the reliability of statistical significance tests of GO terms.
- Secondly, because GO was originally developed for the annotation of eukaryotic genomes, the functional categorization in GO, and genome annotation using GO, is not as accurate for some prokaryotes as for eukaryotes.
- Thirdly, because GO terms do not correspond directly to known pathways, it is difficult to identify pathways directly from GO annotations.
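
The first limitation is easier to see with a concrete structure: in a DAG a term can have several parents, so its "depth" depends on which path to the root is followed. The Python sketch below computes the shortest and longest path for each term of a toy DAG; the term names are invented placeholders and are not real GO terms.

```python
# Toy DAG: term -> list of parent terms. Invented names, not real GO terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["A"],
    "C": ["B", "root"],  # two parents: one deep path, one shallow path
    "D": ["C"],
}

def depth_range(term, parents=PARENTS):
    """Return (shortest, longest) number of steps from the term to the root."""
    if not parents[term]:
        return (0, 0)
    ranges = [depth_range(parent, parents) for parent in parents[term]]
    return (1 + min(r[0] for r in ranges), 1 + max(r[1] for r in ranges))
```

Here term "C" lies one step from the root along one path and three steps along the other, which illustrates why depth-based comparisons across branches need care.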

Mao et al. demonstrate that KO is effective as a controlled vocabulary for automated annotation of sets of sequences, including whole genomes, and since KO links directly to known pathways, KO annotations enable concurrent pathway identification. Historically, enzyme commission (EC) numbers were used to describe common gene products in metabolic pathways. The ortholog identifiers were later introduced to overcome limitations in the enzyme nomenclature. The KEGG Orthology is a further extension of the ortholog identifiers, and is structured as a DAG hierarchy of four flat levels.
- The top level consists of the following five categories: metabolism, genetic information processing, environmental information processing, cellular processes and human diseases.
- The second level divides the five functional categories into finer sub-categories.
- The third level corresponds directly to the KEGG pathways.
- The fourth level consists of the leaf nodes, which are the functional terms.
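NOTE: The four levels can be pictured as a nested mapping. In the sketch below only the two top-level category names come from the text; the sub-categories, pathways and "K_demo" identifiers are hypothetical placeholders, as are the function names. Since a functional term may belong to more than one pathway, a lookup returns a list of paths rather than a single one.

```python
# Four levels: category -> sub-category -> pathway -> functional terms (leaves).
# Only the top-level names come from the text; all lower entries are hypothetical.
KO_HIERARCHY = {
    "metabolism": {
        "sub-category M1": {
            "pathway P1": ["K_demo1", "K_demo2"],
            "pathway P2": ["K_demo2", "K_demo3"],
        },
    },
    "cellular processes": {
        "sub-category C1": {
            "pathway P3": ["K_demo3"],
        },
    },
}

def pathways_for(term, hierarchy=KO_HIERARCHY):
    """List (category, sub-category, pathway) paths whose leaves include the term."""
    hits = []
    for category, subcategories in hierarchy.items():
        for subcategory, pathways in subcategories.items():
            for pathway, terms in pathways.items():
                if term in terms:
                    hits.append((category, subcategory, pathway))
    return hits

def pathway_counts(terms, hierarchy=KO_HIERARCHY):
    """Count how many of the given terms fall into each pathway."""
    counts = {}
    for term in terms:
        for _, _, pathway in pathways_for(term, hierarchy):
            counts[pathway] = counts.get(pathway, 0) + 1
    return counts

print(pathways_for("K_demo3"))
print(pathway_counts(["K_demo1", "K_demo2"]))
```

This is the reason why an annotation with such terms identifies pathways at the same time: every leaf already carries its pathway as parent.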

1.4.1. The KEGG ORTHOLOGY (KO) system organizes knowledge about orthologous genes and paralogous genes. KEGG KO is a pathway-based classification of orthologous gene groups. The KO database contains a manually curated set of orthologous gene groups in the complete genomes, which are linked to the nodes (boxes) indicating gene products (mostly proteins) in the KEGG pathway maps. Thus, the KO database accession number, or the K number, represents the common identifier of the pathway node of the KEGG PATHWAY database and the ortholog group of the KEGG GENES database. The ultimate goal of this system is the ability to assign KO identifiers to all genes from a newly sequenced genome via comparison to already characterized genes with assigned KO, and therefore also to predict pathways in this newly sequenced genome.

1.4.2. KAAS - KEGG Automatic Annotation Server : KAAS provides functional annotation of genes by BLAST comparisons against the manually curated KEGG GENES database. The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways.
KAAS accepts a multi-FASTA sequence set (default: protein) as input. KO assignments are based on results from BLASTP. Check the "Nucleotide" checkbox if queries are nucleotide sequences representing a set of EST contigs or ESTs. In this case, KO assignments are based on results from BLASTX and TBLASTN. KO assignment methods may be performed based on the bi-directional best hit (BBH, default) of BLAST or single-directional best hit (SBH). The computation time of the BBH method is about twice that of SBH. However, the method based on BBH will be more accurate than SBH, if the number of query sequences is large enough (genome scale). If the number of query sequences is small, then the SBH method should suffice (and save time).
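NOTE: The difference between SBH and BBH can be sketched as follows. This is a simplified illustration only; the actual KAAS procedure applies further score thresholds and rules that are not described here. The score tables and the function names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject (single-directional best hit)."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}

def bidirectional_best_hits(forward, reverse):
    """Keep only pairs where query and subject are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {q: s for q, s in fwd.items() if rev.get(s) == q}

# Hypothetical BLAST bit scores: query genes against reference genes ...
forward = {
    "q1": {"refA": 250.0, "refB": 90.0},
    "q2": {"refA": 200.0, "refB": 60.0},
}
# ... and the reverse search, reference genes against the query set.
reverse = {
    "refA": {"q1": 245.0, "q2": 198.0},
    "refB": {"q1": 88.0, "q2": 61.0},
}

print(best_hits(forward))                         # SBH keeps both queries
print(bidirectional_best_hits(forward, reverse))  # BBH keeps only the reciprocal pair
```

The reverse search is the extra step that roughly doubles the computation time of BBH. It only makes sense when the query set is large enough for the reverse best hit to be meaningful, which is why SBH is suggested for small query sets.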
The URL to access the results is sent by Email. The results are available in several formats:
- KO list: presents a list of the input sequences.
- KO hierarchy: lists the different metabolic and regulatory pathways together with the genes of the user-dataset which play a role in this specific pathway.
- Pathway map: first indicates how many genes of your dataset correspond to a certain pathway and then presents graphical images of the pathways, highlighting the user-submitted genes ! Note: This is a very nice tool to identify pathways which are induced in a certain gene set, which may e.g. be derived from a microarray study analyzing a certain biological stimulus.
- Download: produces a simple list of the input dataset together with the assigned "K" numbers.
Note: If you want to compare the output to a "normal" batch search against the Pathway database, please refer to section PATHWAY above !

1.4.3. The Reaction Classification (RC) system in the "chemical space" is the counterpart of the KO system in the "genomic space". It represents the attempt to organize knowledge on chemical reactions by categorizing chemical structure transformation patterns.

2. General database retrieval system:

2.1. DBGET: DBGET is a simple database retrieval system to perform keyword searches in flat-file databases, which in this context means text data, GIF images for KEGG pathways, Java graphics for genome maps and expression profiles, and 3D graphics for protein structures. This is accomplished by treating a collection of HTML files as a database.
Query DBGET: The Web version of DBGET provides the choice of the bfind mode (default) and the bget mode. In the bfind mode, keywords and optional characters can be entered in the search box. You then get a list of entries that contain matching keywords. By selecting an entry name in the list, you obtain the database entry. When you know beforehand the entry name (or the primary accession number or the primary gene name) of your interest, it is much faster to retrieve the entry by switching to the bget mode. Just enter the entry name (or the accession number or the gene name) in the search box. Once you get an entry in either mode, you can further retrieve related entries in different databases by clicking on marked items. In order to obtain all related entries at a time, click on LinkDB at the top line or the marked entry name. This invokes the LinkDB search or the blink mode search.    
Databases (and thus accession numbers) included are (selection): RefSeq, GenBank, EMBL, UniProt, PDB, EPD, Pfam, PROSITE, BLOCKS, and OMIM.

3. Specialized entry points to KEGG:

3.1. EXPRESSION: KEGG EXPRESSION is a Web-based system for integrated analysis of gene expression profiles together with KEGG pathways and KEGG genomes. As a database, it contains microarray data for Synechocystis PCC6803, Bacillus subtilis, and Escherichia coli obtained by the Japanese research community. The browser portion of this system may also be used to analyze your own data.

3.1.1. KegArray is a standalone Java application for integrated analysis of gene expression profiles together with KEGG pathways and KEGG genomes. KegArray runs on Mac and Windows and is made freely available to both academic and non-academic users. The use of KegArray is recommended because the web-based system is no longer supported.

Typical KEGG accessions: refer to section KEGG IDs.
PharmGKB - the "Pharmacogenetics and Pharmacogenomics Knowledgebase", is a research tool developed by Stanford University which integrates data on genes, diseases, drugs, and pathways.
PharmGKB Pathways are drug centered, gene based, interactive pathways which aim to highlight candidate genes and gene groups and associated genotype and phenotype data of relevance for pharmacogenetic and pharmacogenomic studies.

Please refer to the PharmGKB main section for details.
(CSHL, EBI, GO Consortium)


Reactome (formerly known as "Genome KnowledgeBase") is a collaboration among Cold Spring Harbor Laboratory, The European Bioinformatics Institute, and The Gene Ontology Consortium to develop a curated resource of core pathways and reactions in human biology. The information in this database is authored by biological researchers with expertise in their field, maintained by editorial staff, and cross-referenced with PubMed, GO, and the sequence databases at NCBI, Ensembl and UniProt. In addition to curated human events, inferred orthologous events in 21 non-human species including mouse, rat, chicken, fugu fish, worms, fly, yeast and E.coli are also available.

You can search/browse by biological processes like DNA replication, translation, glucose metabolism etc., and you will obtain a hierarchical structure of topics and subtopics, including literature and sequence links. 

SkyPainter is a program which is part of Reactome, which can (at least by definition) be used to graphically visualize "over-represented" pathways which are turned on by a given set of genes.

Query - 2 options:

1. identifiers only: the gene list is pasted using one of several types of identifiers. Identifiers which can be used are UniProt accession numbers and ids, GenBank/EMBL/DDBJ protein ids, RefPep, RefSeq, EntrezGene, MIM, InterPro, Affymetrix, Agilent and Ensembl protein, transcript and gene identifiers. All purely numeric identifiers, such as those from MIM and EntrezGene, have to have the abbreviated database name and a colon prepended to them, e.g. MIM:602544, EntrezGene:55718.
Note: To see an example of the Reactome reaction map painted using a test set of identifiers, click on the hyper-linked word "identifiers" in the data entry box.

2. identifiers and numerical values: (like Affymetrix IDs and signal values of a time-course experiment)
If the identifiers are followed (separated by space or tab) by a numeric value, the colouring will be done according to the average of the numeric values of all identifiers linked to the reaction. A time series can be displayed as an animation by providing multiple values (on the same line, separated by a single space or tab) for each identifier. This feature can be used, for example, to produce a "movie" on the basis of a micro-array expression analysis of a time series.
Note: To see an example of the reaction map painted using a test set of identifiers with values (producing a "movie") click on the words "identifiers with values" in the data entry box. Please note that the overrepresentation analysis is not performed in this case.
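NOTE: Both query options boil down to preparing one text line per identifier. The following sketch shows one way to do this; the function names are hypothetical, and the averaging function only mirrors the colouring rule described above, it is not taken from the SkyPainter code.

```python
def normalize_identifier(identifier, database=None):
    """Prefix purely numeric identifiers with 'Database:' as the tool requires."""
    identifier = identifier.strip()
    if identifier.isdigit():
        if database is None:
            raise ValueError(f"numeric identifier {identifier!r} needs a database name")
        return f"{database}:{identifier}"
    return identifier

def format_line(identifier, values=(), database=None):
    """Build one input line: identifier, then optional tab-separated values."""
    fields = [normalize_identifier(identifier, database)]
    fields.extend(str(v) for v in values)
    return "\t".join(fields)

def average_per_time_point(rows):
    """Average the values of several identifiers column-wise (one column per time point)."""
    return [sum(column) / len(column) for column in zip(*rows)]

# Option 1: identifiers only.
print(format_line("602544", database="MIM"))
# Option 2: identifier followed by the values of a time series.
print(format_line("55718", [1.5, 2.0], database="EntrezGene"))
# Two identifiers linked to the same reaction, two time points each.
print(average_per_time_point([[1.0, 3.0], [3.0, 5.0]]))
```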

- Mapping from submitted identifiers to Reactions is shown in tab-delimited format, providing a quick overview which of the input identifiers are included in Reactome's pathways.
- A graphical map of "total pathway events" is presented, which is color-coded to visualize pathways which are influenced by the user-set of genes. The colour of each reaction arrow on the reaction map indicates the number of genes in the submitted list that participate in the reaction.
- Statistically over-represented pathways ("events") are also listed in tabular form.

Personal remarks:
- Note that gene names have to be used in the format "ATF3_HUMAN", whereas "ATF3" alone will NOT find any hits.
- There is strong heterogeneity in the display of individual pathways: some images are from publications, some are newly produced for Reactome, and some pathways do not contain any diagram at all.
- NOTE: Test runs using a gene set derived from a microarray experiment showed that only a fraction of the genes was represented in the database, and therefore only some specific pathways were highlighted.
Conclusion: By comparison, the KEGG KAAS tool yielded the more comprehensive results !

Typical Reactome accessions: refer to section Reactome IDs.
STRING is a database of known and predicted protein-protein interactions.
The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources: genomic context, high-throughput experiments, coexpression, and previous knowledge (PubMed). STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable.

You may query using an identifier of your gene / protein of interest, or you may paste a protein sequence. You can choose the prediction methods, the level of confidence, and the number of interactors shown.

The output displays a list of potential functional associations (a list of proteins) and the prediction method used in each case. These data can also be displayed in several views, e.g. the "Summary Network" which shows a graph containing all the proteins listed and a color-code of their "relationships".
Note: If you want to see protein-protein interactions which are experimentally verified, you may either click at the respective column to see one specific pair, or you hit the button "Experiments" which lists ALL interactions. You can also see the source database for each interaction entry, like BIND or HPRD.

NOTE: STRING is also included and automatically performed in the "data super-integration" tool Bioinformatic Harvester at EMBL !
(BIOBASE, Germany)
TRANSPATH is an information system on gene-regulatory pathways. There is a public version of TRANSPATH available. It focuses on pathways involved in the regulation of transcription factors. Elements of the relevant signal transduction pathways like hormones, enzymes, complexes and transcription factors are stored together with information about their interaction. All data is extracted by experts from the scientific literature. It is an extension module to the TRANSFAC database on transcription factors and their binding sites. 
NOTE: TRANSPATH "professional" is accessible via registration at BIOBASE , BUT it is NOT free anymore for non-profit organizations.
Selected Pathways
TNFa / NF-kB pathway
(Cellzome, Heidelberg, Germany)
The TNFa / NF-kB pathway was published as Nature Cell Biology paper by Cellzome, a company based in Heidelberg. This Interaction Network is based on a comprehensive Tandem Affinity Purification-Mass Spectrometry (TAP-MS) analysis of proteins implicated in TNFα/NF-κB signal transduction pathways.
The web interface allows a simple keyword search, by entering the gene name (or parts) of interest (e.g. IKK or TRAF) to search the database for protein interactions. The output shows a list of potential interactors and the "level of reliability" of each interaction.
In addition, an interactive map is available, providing a direct "click here to get" functionality.
NOTE: Enzymes represent a large group of proteins that act as catalysts in mediating and speeding a specific chemical reaction. There are several databases which selectively store data on enzymes with links to related reactions and pathways.
Typical Enzyme accessions: refer to section Enzymes IDs.
(Cologne University, Germany)
BRENDA was started in 1987 as development of an enzyme data information system as part of the protein-design activities then at the GBF in Braunschweig. The data base is implemented in a relational data base and covers some 40 data fields with information about nomenclature, reaction and specificity, enzyme structure, isolation / preparation, stability, literature references and cross references to sequence and 3D-structure data banks.

Note: Although BRENDA intends to give a representative overview on the molecular variability of each enzyme the data base is not a compendium. The user will have to go to the cited primary literature for more detailed information.
ENZYME is a repository of information relative to the nomenclature of enzymes. It is primarily based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) and it describes each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided.
You may query by EC number, enzyme class, description, cofactor, and more.

Typical ENZYME accessions: refer to section ENZYME IDs.
IntEnz is the name for the Integrated relational Enzyme database and is the most up-to-date version of the Enzyme Nomenclature. The Enzyme Nomenclature comprises the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) on the Nomenclature and Classification of Enzyme-Catalysed Reactions. IntEnz is supported by NC-IUBMB and contains enzyme data curated and approved by this committee.

Typical IntEnz accessions: refer to section IntEnz IDs.
GO Portals
NOTE: The section "GO Portals" lists major sites dedicated to the creation of Gene Ontologies, standardized vocabularies to describe gene and protein functions in cells.
At the FAQ pages, the question of predicting the pathways and biological functions of specific gene datasets is treated mainly in FAQ PATH1, which therefore serves as "anchor point" to compare these different resources.
Gene Ontology Consortium

Gene Ontology Consortium: The goal of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing. The three organizing principles of GO are molecular function, biological process and cellular component. A gene product can have one or more molecular functions, be used in one or more biological processes and may be associated with one or more cellular components.

NOTE: The GO Consortium site also provides an excellent linkpage to GO tools, which are all categorized and described in a concise manner. Also, users can see immediately if these tools are web-based or available as downloadable programs.

NOTE: Mappings of external classification systems to GO: This page lists "correlations" between GO terms and equivalent external database entries like InterPro, Pfam, PROSITE, SMART domains, TIGR families, and many more.

NOTE: The following good remark is taken from the 2can tutorial site of EBI: "It is important to distinguish the ontologies (developed as organism-independent structured vocabularies) from the annotations (organism-specific and describing gene product molecular function, biological process and cellular component using the GO). In some sense, the GO provides the tree on which biologists can hang their organism's gene products." 
GOA (EBI): GOA is a project run by the EBI that aims to provide assignments of gene products to the Gene Ontology (GO) resource. In the GOA project, this vocabulary will be applied to a non-redundant set of proteins described in the Swiss-Prot, TrEMBL and Ensembl databases that collectively provide complete proteomes for Homo sapiens and other organisms.

In the first stage of this project, GO assignments have been applied to a data set representing the human proteome by a combination of electronic mappings and manual curation. Subsequently, GO assignments for all complete and incomplete proteomes that exist in Swiss-Prot and TrEMBL have been provided.

NOTE: GOA also provides a  good overview table of tools to access the Gene Ontology "vocabulary" AND the genes/proteins associated with individual terms.
OBO - Open Biomedical Ontologies

OBO - Open Biomedical Ontologies is an umbrella web address for well-structured controlled vocabularies for shared use across different biological and medical domains. OBO is developed and maintained at SourceForge, the world's largest Open Source software development web site. OBO contains ontologies and points to some other efforts within the community. Ideally we see a range of ontologies being designed for biomedical domains. Some of these will be generic and apply across all organisms and others will be more restricted in scope, for example to specific taxonomic groups.
Access: The user may use the OBO Ontology Browser to browse the ontologies or view the ontologies in table form.

NOTE: You will find that some ontologies are in early stages of development, whereas others, like GO, are being developed and used by a number of large databases.
NOTE: OBO provides a concise overview of ontology projects in the biomedical field and thus serves as an excellent entry point to the field.
GO Browsers
NOTE: The section "GO Browsers" lists resources which allow you to browse / search the GO (Gene Ontology) term vocabularies and hierarchies. Note that some of these resources also provide a function for annotating a query gene set with GO terms, but the majority of these programs are described in section "GO Annotators".
At the FAQ pages, the question of predicting the pathways and biological functions of specific gene datasets is treated mainly in FAQ PATH1, which therefore serves as "anchor point" to compare these different resources.
(GO Consortium)
AmiGO, the GO browser of the Gene Ontology Consortium, can be searched either by terms to display GO hierarchies (like QuickGO) AND a list of all genes associated with this term. In addition, you can filter the results by their source databases, or by different "quality levels" of evidence for association. You may also download the Fasta sequences of the selected genes.

AmiGO can also be searched using a list of gene names, meaning you can directly look for your genes of interest and display their assigned GO terms. For this purpose, you should select the "advanced query", paste your list of gene names, select "gene products" and the species used.
Tests showed that it is quite tricky to choose the best way to query, as gene names like "GEM" not only pick the specific gene but also others which contain "GEM" in their name like "GEMI4". Otherwise, when selecting the option "Exact match", there is no hit at all, as we would have to use "GEM_HUMAN" for this purpose. Thus, it is quite hard to retrieve only the desired genes without having to manually check the whole results list.
In contrast to KAAS, each gene is listed separately in the result file, listing all associated GO terms, BUT there is NO examination of "common over-represented" pathways which are induced.
(Expression Profiler, EBI)
The EP:GO browser is built into EBI's Expression Profiler, a set of tools for clustering, analysis and visualization of gene expression and other genomic data. With it, you can search for GO terms and identify gene associations for a node, with or without associated subnodes, for the organism of your choice. 

NOTE: Choose "Show a single GO category and associations" in order to view the individual genes /proteins !
With QuickGO (EBI), a GO browser integrated into InterPro at the EBI, you can search for a GO term to see its relationships and definition, as well as any available mappings to SWISS-PROT keywords, to the Enzyme Classification or Transport Classification databases, or to InterPro entries.

Typical GO accessions: refer to section GO IDs.
Gene Ontology Assignments
At the TIGR Human Gene Index, there is a function to browse the Gene Ontology assignments and to retrieve all human TIGR clusters that list a certain GO annotation.
The interface is very user-friendly ! You can also search by GO Id or by keyword (like "helicase").
GO Annotators
NOTE: The section "GO Annotators" lists resources which allow you to map GO terms to single genes or to gene sets. The major difference to section "GO Data Mining" is that there is no "follow-up analysis" like statistical examination of over-represented terms or functional comparison of gene clusters. As some of these resources also generate annotation tables for whole datasets, their main description may be contained in section "High-Throughput Data Retrieval". Note that some of these GO annotators are described in section "GO Browsers".
At the FAQ pages, the question of predicting the pathways and biological functions of specific gene datasets is treated mainly in FAQ PATH1, which therefore serves as "anchor point" to compare these different resources.
BioMart and GO data retrieval: BioMart is a very powerful data retrieval tool provided by the EBI, please see also the BioMart chapter at the main page.
With respect to GO data, BioMart can be used in several ways:

1. BioMart can be used to download lists of genes which are associated with specific GO terms. For this purpose, you have several options at the "Filter" page. In the area "Gene", you can "limit the output to genes with the GO ID(s)", where you can paste your own list of GO accessions. Or you can first find/browse GO terms in the "Gene Ontology" area, which opens the QuickGO browser, and then limit the BioMart gene list to genes showing these GO terms.
Note that in BioMart, you can perform powerful combinations of searches, e.g. the GO topic "inflammation" combined with the field "Expression", e.g. in the "Cell type" "Endothelium".

2. BioMart can also be used to display all GO terms associated with all individual genes of a gene set of interest. For this purpose, you have to select the GO-specific items at the "Output" page ("Features" tab) of a BioMart session, in order to include them in the final annotation table.
DAVID - The Database for Annotation, Visualization and Integrated Discovery provides a comprehensive set of tools for investigators to visually summarize annotation from large lists of genes, including those derived from microarray and proteomic studies. DAVID is composed of several tools for the functional annotation and classification of large gene sets. DAVID provides annotation and statistical analysis for GO terms, pathway assignments, COGs, KOGs, and more.
The "Functional Annotation Tool" of DAVID can be used effectively as GO annotator.

Please refer to the DAVID main section for further information !
GO Data Mining
NOTE: The section "GO Data Mining" lists resources which not only allow you to map GO terms to single genes or to gene sets, but which also provide tools for "follow-up analysis" like statistical examination of over-represented terms or functional comparison of gene clusters.
At the FAQ pages, the question of predicting the pathways and biological functions of specific gene datasets is treated mainly in FAQ PATH1, which therefore serves as "anchor point" to compare these different resources.
DAVID - The Database for Annotation, Visualization and Integrated Discovery provides a comprehensive set of tools for investigators to visually summarize annotation from large lists of genes, including those derived from microarray and proteomic studies. DAVID is composed of several tools for the functional annotation and classification of large gene sets. DAVID provides annotation and statistical analysis for GO terms, pathway assignments, COGs, KOGs, and more.
The "Functional Classification Tool" of DAVID can be used effectively for GO data mining. Also, the "Functional Annotation Tool" of DAVID has features like "Export Selected Annotation as Chart" which allow a statistical analysis of GO terms, pathways, and more.

Please refer to the DAVID main section for further information !
(Vanderbilt University, Tennessee)

GOTM (GOTree Machine) is a web-based platform for interpreting microarray data or other interesting gene sets using Gene Ontology. Features include a user friendly web-based interface, an expandable tree for browsing the GO hierarchy, fixed tree as HTML output for archive, bar chart for publication, a statistic analysis indicating GO terms with relatively enriched gene numbers and suggesting biological areas that warrant further study, and finally retrieving subsets of genes by GO term or keyword searching. Note: A free registration is needed in order to use GOTM.

- Select the ID type of your input file. Most major commercial microarray platforms are supported.
- You can choose either single gene list analysis or interesting gene list vs. reference gene list analysis. For single gene list analysis, you only need to upload the file containing the interesting gene list, and you will get a GO Tree for this gene list. For interesting gene list vs. reference gene list analysis, you need to upload the file containing the interesting gene list, and either choose an existing reference gene list (all major array types available) or upload a file containing your own reference gene list. In any case, for the statistical analysis, the reference list should contain all the genes of the interesting gene list. You will get a GO Tree for the interesting gene list, and GO terms with relatively enriched gene numbers in the interesting gene list compared to the reference gene list.
- Interesting gene list file upload: The file should include the appropriate ID (required) and the corresponding microarray ratio (optional), separated by tab. One ID per row. Don't use Excel files directly. If you have an Excel file containing the gene list, save it as a tab-delimited text file first. GOTM only accepts plain text files.
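NOTE: Before uploading, it may be worth checking the gene list file locally. The sketch below reads the format described above (one ID per row, optional ratio after a tab) and reports IDs that are missing from the reference list. The function names and the sample IDs are hypothetical, and header rows are not handled.

```python
import csv
import io

def read_gene_list(handle):
    """Parse tab-delimited rows of 'ID' or 'ID<TAB>ratio' into a dict."""
    genes = {}
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if len(row) > 2:
            raise ValueError(f"line {line_number}: expected at most two columns")
        gene_id = row[0].strip()
        # The ratio column is optional; store None when it is absent.
        ratio = float(row[1]) if len(row) == 2 and row[1].strip() else None
        genes[gene_id] = ratio
    return genes

def missing_from_reference(interesting, reference):
    """Return IDs of the interesting list that the reference list lacks."""
    return sorted(set(interesting) - set(reference))

# Made-up example content; a real file would be opened with open(path, newline="").
interesting = read_gene_list(io.StringIO("1234\t2.5\n5678\n"))
reference = read_gene_list(io.StringIO("1234\n9999\n"))
print(interesting)
print(missing_from_reference(interesting, reference))
```

An empty result of the second function means that the reference list covers the interesting gene list, as required for the statistical analysis.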

Output (interesting gene list vs. reference gene list option):
- A tree of GO category folders is presented which can be expanded or collapsed, highlighting in red those GO categories which are statistically overrepresented in the given gene set. A click onto a term reveals the corresponding gene sub-list. Note that there is currently a transition from LocusLink IDs to Entrez Gene IDs. The letter code is the following: O:Observed gene number in the GO category; E:Expected gene number in the GO category; R:Ratio of enrichment for the GO category; P:Significance of enrichment for the GO category.
- The link "GO categories which are relatively enriched" produces a complete list, which may also be displayed via the buttons "Tree View" and "DAG View".
- "Tree View" presents a very instructive hierarchical tree of all over-represented GO terms including all significance values, which can be easily pasted into e.g. WORD.
- "DAG View" produces a very nice GIF-image including all over-represented GO terms. Both views allow a very good insight into the GO term hierarchy levels!
- "Bar Chart" presents a very instructive bar chart comparing the observed and expected numbers of genes corresponding to each GO term. Note that this view presents only GO terms from a certain level of GO hierarchy which can be selected at best using the DAG View !!! NOTE: You have to tell Netscape that it is a bmp-type of image, whereas Internet Explorer automatically determines the file type. Strangely, a test file produced with IE had about 900 kb, but the Netscape version only 22 kb (at same quality !).
- "Export GOTree" produces a text output of GOtree, which is at first sight similar to Tree View, but which shows ALL GO terms of ALL 3 GO branches (biological process, molecular function, cellular component), still highlighting the over-represented terms. Note that it may take very long to produce this file!
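NOTE: The letter code O, E, R, P can be reproduced with a few lines of code. The text above does not state which statistical test is applied; the sketch below assumes a hypergeometric model, which is a common choice for this kind of enrichment analysis. The function and parameter names are hypothetical.

```python
from math import comb

def enrichment(category_in_list, list_size, category_in_reference, reference_size):
    """Return (O, E, R, P) for one GO category under a hypergeometric model.

    Assumes category_in_reference > 0, so that the expected number is not zero.
    """
    observed = category_in_list
    # Expected number if the interesting list were a random draw from the reference.
    expected = list_size * category_in_reference / reference_size
    ratio = observed / expected
    # P(X >= observed) when drawing list_size genes without replacement.
    upper = min(list_size, category_in_reference)
    p_value = sum(
        comb(category_in_reference, k)
        * comb(reference_size - category_in_reference, list_size - k)
        for k in range(observed, upper + 1)
    ) / comb(reference_size, list_size)
    return observed, expected, ratio, p_value

# Made-up numbers: 8 of 100 interesting genes fall into a category
# that holds 200 of 10000 reference genes.
o, e, r, p = enrichment(8, 100, 200, 10000)
print(f"O={o} E={e:.1f} R={r:.1f} P={p:.3g}")
```

Since many GO categories are tested at once, the resulting P values should be read with multiple testing in mind.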

Search Functions:
- GO Term Search: If you know exactly the GO term you are looking for, GO TERM SEARCH is good for you. You can search for GO categories using one or several GO terms (separated by ",", no space between terms!)
- Keyword Search: This is a fuzzy keyword search in case you don't know the exact GO term. You can search for GO categories using a keyword (one word or phrase per search)

NOTE: Your GO trees will be stored for 1 month, during this period of time, you can come back and retrieve your GO tree using the analysis name. This is an advantage as compared to DAVID which deletes data if the browser is idle for 30 minutes !
!!! NOTE: There is a kind of enhanced version of this tool incorporated in the toolkit WebGestalt, which allows the selection of different settings in order to generate the GO Tree. You may want to compare the performance of these 2 tools. Please refer to the WebGestalt section for details ! Example of additional information: The DAG graph shows the numbers of genes and respective P-values !

- GOTM is an excellent resource for GO based data mining as it presents data in various forms, both allowing quantitative analysis (like using the Bar Chart or Tree View) as well as qualitative analysis (like using the DAG View to produce a very good, concise overview of the whole dataset).
- Nevertheless, it may be advantageous to try the enhanced version incorporated in the toolkit WebGestalt !

Integrated Functional Data Mining
NOTE: The section "Integrated Functional Data Mining" lists resources which integrate several systems for "follow-up analysis" concerning functional annotation of gene datasets. Thus, statistical examination of over-represented terms or functional comparison of gene clusters is based on 2 or more different fields, like GO assignments, pathway maps, protein domain tables, and cocitation networks.
At the FAQ pages, the question of predicting the pathways and biological functions of specific gene datasets is treated mainly in FAQ PATH1, which therefore serves as "anchor point" to compare these different resources.
BiblioSphere PathwayEdition
(Genomatix Inc., Munich, Germany)
BiblioSphere PathwayEdition is part of the commercial Genomatix suite of products. This program deals with such cocitations of genes on different levels, like 2 genes are cocited within an abstract, or within the same sentence, or within the same sentence together with a "functional term" (regulation, inhibit,...) or via a direct connection via such a functional term. Text from Genomatix webpage: "BiblioSphere PathwayEdition uses the world's largest database of biological networks created from millions of individually modeled relationships between genes, proteins, complexes, cells and tissues. A unique combination of hand curation from biological experts and up to date text mining techniques for automated knowledge extraction provides you with the best data quality available. BiblioSphere PathwayEdition allows a view on your data, integrated in biological networks according to different biological context."
BiblioSphere also provides links to other resources like GO (Gene Ontology) or BIND. BiblioSphere produces graphical maps which display the relationships between the genes of your "genes of interest" dataset, like gene citations in PubMed, Gene-Gene co-citations, or transcription factor citations.
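NOTE: The idea of cocitation levels can be illustrated with a small sketch that counts gene pairs mentioned within the same abstract and within the same sentence. This is not the Genomatix method, which relies on curated data and dedicated text mining; the sample abstracts are made up, the function name is hypothetical, and the sentence splitting is deliberately naive.

```python
from collections import Counter
from itertools import combinations
import re

def cocitations(abstracts, genes):
    """Count gene pairs co-cited per abstract and per sentence."""
    by_abstract, by_sentence = Counter(), Counter()

    def mentioned(text):
        # Exact, case-sensitive matching of whole alphanumeric words.
        words = set(re.findall(r"[A-Za-z0-9]+", text))
        return sorted(g for g in genes if g in words)

    for abstract in abstracts:
        for pair in combinations(mentioned(abstract), 2):
            by_abstract[pair] += 1
        # Naive sentence split on '.', '!' or '?'; real text mining needs more care.
        for sentence in re.split(r"[.!?]", abstract):
            for pair in combinations(mentioned(sentence), 2):
                by_sentence[pair] += 1
    return by_abstract, by_sentence

# Made-up example abstracts.
abstracts = [
    "TNF activates RELA. IKBKB is required.",
    "TNF and IKBKB were studied together.",
]
by_abstract, by_sentence = cocitations(abstracts, {"TNF", "RELA", "IKBKB"})
print(by_abstract)
print(by_sentence)
```

Sentence-level cocitation is the stricter criterion: two genes named in the same sentence are more likely to be functionally related than two genes which merely share an abstract.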

1. Installation:
There are a few steps to take in order to use BiblioSphere:
- Install Java 1.4.2 or higher on your computer
- Register for a free evaluation account to get access to Genomatix products. NOTE that you only have 20 free analyses per month !!! Note that there is not only a limitation in the number of analyses but also in the functionality of the obtained data !
- Download & install the BiblioSphere Analysis Kit.

2. Project management / Query:
- When launched, hit the green arrow to connect to the server. The program will open the Project Manager interface. Continue with "New Project". You can now create a new project by defining a title and an optional description.
- Go on with "New Analysis". At the "Enter Query Parameters" page, you can choose the species, decide whether you want to analyze a single gene or a group of genes and search for co-cited transcription factors only (default) or include all genes. You can enter gene symbols, locus identifiers, RefSeq identifiers or Affymetrix identifiers.

3. Output:
- On the right side you can now retrieve the single gene BiblioSphere for each of your input genes. To view a single gene output "costs" 1 of the 20 analyses.
- Alternatively, you can click the link "View Networks" at the top right, which generates the Cluster Centered BiblioSphere with all interconnections of your input gene set. This "costs" 3 of the 20 analyses. NOTE: It also "costs" 1 of the 20 analyses if you just want to re-open an already existing network of your input dataset that you created some time ago !!!
- The default view of BiblioSphere Pathway Edition is the "Pathway view". You may select the checkbox "Signal Transduction Pathways" in order to see the names of pathways involved. Mouse-over a certain line reveals the number of co-citations of these 2 connected genes. In case you see a green line, there is additional evidence that a transcription factor directly binds within the promoter region of the other gene. Via the button "JPG" you may download a JPG-image of the network. Tests showed that it was not possible to generate error-free images of large networks. Note that you may also mouse-drag single components of the network to other positions in order to reduce the image size.
- The "Cocitation Browser" lists all abstracts which correspond to the gene-gene link selected in the Pathway view.
- The "Genes" view produces a kind of compact annotation table of the gene list, indicating e.g. whether a gene is annotated as a transcription factor. This table can be copied into e.g. Excel. Note: The numbered button in the first column produces the complete co-citation list for the specific gene. This "costs" 2 of the 20 analyses !!! Note: Via "Save" only the entries of the active window are saved as html-page. You have to go through all pages and save them individually if you want to store all PubMed entries !
- "Gene-Gene Connections" view presents a table listing all "relations" between the genes of the input dataset. This table can be copied into e.g. Excel.
- "TF Analysis" displays the result of MatInspector analysis for TFBS in the promoters of cocited genes and links to promoter analysis with GEMS launcher.
DAVID - The Database for Annotation, Visualization and Integrated Discovery integrates functional genomic annotations with intuitive graphical summaries. DAVID provides a comprehensive set of tools for investigators to visually summarize annotation from large list of genes, including those derived from microarray and proteomic studies. DAVID is provided at NCI-Frederick and was developed to support the bioinformatic needs at the National Institute of Allergy and Infectious Diseases (NIAID). DAVID is composed of several tools for the functional annotation and classification of large gene sets.
NOTE: There are no individual URLs for the individual applications. All have to be started from the DAVID main page.

1. Functional Annotation Tool:
1.1. Scope: The scope of this tool is twofold, first to generate an annotation table of a gene set of interest, and second, to determine "over-represented" terms like pathways or GO terms in order to predict the biological processes affected in a specific dataset.
1.2. Input: You simply paste a list of identifiers of your gene set, like Entrez Gene, Affymetrix, RefSeq, UniProt, GenBank, or UniGene. DAVID automatically determines the input species and produces the annotation summary results within a short time. NOTE: Pop-up blockers should be turned off in order to ensure that the program is running properly !
1.3. Filter: The user then selects from a long list of items (accessions) which ones to display in the final output table. Note that this is quite similar to the "Filter" page selection at BioMart. Examples include: Main acc. (Entrez Gene, Affy, GenBank, RefSeq,...); Other acc. (MEROPS, MGI,...); Gene Ontology (GO terms can be selected from different GO levels, and from the 3 branches: biological process, molecular function, cellular component); Protein Domains (Interpro, Pfam, SMART, COG, BLOCKS, PDB,...); Pathways (KEGG, Biocarta, EC number); General Annotations (gene name, symbol, OMIM,...); Functional Categories (COG Ontology, PIR keywords,...); Protein Interactions (BIND, DIP, TRANSFAC,...); Literature (PubMed, GeneRIF,...).
1.4. Output:
- "Export Selected Annotation as Table": This option generates either a txt file or an xls file of the selected items. Note that the table is easier to read without selecting the function "Add hyperlinks" ! NOTE: In contrast to BioMart, only ONE ROW is generated for ONE input gene, thereby avoiding the redundancy of rows produced by BioMart. On the other hand, there are no active hyperlinks in DAVID tables. In Excel, it is possible to simply sort the annotation list by e.g. KEGG pathways or GO terms related to biological processes in order to get an impression of the biological function of the dataset. Note that there is no statistics applied in the Table report.
- "Export Selected Annotation as Chart": NOTE that the chart options on top of the filter page have to be selected in order to produce a chart ! Please refer also to the attached Help file for detailed information on the statistical analysis ! NOTE that this is an extremely nice function, as it reports a list of terms which are "over-represented" in the categories selected at the filter page. This means that e.g. all GO terms (separated by the 3 branches), all pathways, all PIR keywords, InterPro domains etc. are included in a statistical analysis, each with a P-value representing the probability that such a term is found in a dataset of this size by chance. Again, the function "Add hyperlinks" should not be selected when downloading the file.
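NOTE: To get a feeling for what such an "over-representation" P-value means, the following minimal Python sketch computes a plain hypergeometric tail probability. This is an illustration only: the statistic actually used by DAVID is described in its Help file and may differ, and the function name and the example numbers are assumptions made for this sketch.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Probability of seeing at least k annotated genes by chance.

    k: genes in the input list that carry the term
    n: size of the input list
    K: genes in the background that carry the term
    N: size of the background (e.g. all genes on the array)
    """
    total = comb(N, n)  # number of ways to draw n genes from the background
    upper = min(n, K)   # the list cannot contain more annotated genes than this
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 8 of 100 input genes carry a term that is
    # attached to 200 of 20000 background genes.
    print(hypergeom_enrichment_p(8, 100, 200, 20000))
```

The smaller the resulting value, the less likely it is that the term accumulated in the gene list by chance alone.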

2. Functional Classification Tool:
2.1. Scope: The Functional Classification Tool provides a rapid means to organize large lists of genes into functionally related groups to help unravel the biological content captured by high throughput technologies. The Functional Classification Tool generates a gene-to-gene similarity matrix based on shared functional annotation using over 75,000 terms from 14 functional annotation sources. Tools are provided to further explore each functional gene cluster including listing of the “consensus terms” shared by the genes in the cluster, display of enriched terms, and heat map visualization of gene-to-term relationships. 
2.2. Input: Same as for the Functional Annotation Tool. NOTE: A submitted job expires in the DAVID server session if you have not done anything for 30 min. Then you have to re-submit your gene list.
- Options: The general classification stringency can be set from "lowest" to "highest", or set to "custom" values. Please refer to the attached help-file for advanced options.
2.3. Output: In case that functional groups are detected, these gene lists are shown. Several buttons allow different types of analyses:
   - "Heatmap view":
A global view of ALL (in contrast to "2D View") cluster-to-cluster relationships is provided using a fuzzy heat map visualization. Fuzzy heat map visualization allows genes and terms to appear multiple times within the heat map providing a much clearer view of gene-to-term relationships within a cluster of related genes and a much clearer view of cluster-to-cluster relationships. The attached help-file presents a very good introduction to the field. Horizontal grey lines separate areas of genes belonging to the different functional groups. Vertical grey lines separate areas of annotation terms belonging to the different functional groups. Thus, when large datasets are analyzed, a global image is generated highlighting the primary genes-terms patterns of all groups in diagonal. The patterns above or below the diagonal are to answer the question "how are other functional groups' terms similar to the primary patterns?". The user can either click on individual areas or even mouse-zoom sub-areas to see the gene names or terms in the lower box. Individual genes/terms can be highlighted within the images. The button "Detailed View" produces a 2D View of the selected area (Adobe SVG Viewer needed).
NOTE: Tests showed that this Heatmap view works much better in MS Internet Explorer than in Netscape ! The Adobe SVG Viewer plugin is needed.

   - "Enriched Terms": A statistical chart is presented showing "over-represented" terms which may be sorted by P-value, by count or by category&term. These terms include GO terms, Interpro domains, PIR keywords, and more.
   - "2D View": A gene-term 2D heat map view presents a graph showing all known associations of terms (like GO terms) and the genes from the input dataset. It allows users to see the gene members and their associated annotation terms in a heatmap type of view, so that the user can further explore the gene-gene and term-term relationships within a group. The terms displayed in the map have to pass the term frequency setting in the options section (by default, a term must be associated with 50% of the genes).
   - "Related Genes": This tool scans the input gene set for related genes, and presents a list with color-coded similarity scores. Any given gene is associated with a set of annotation terms. If genes share a similar set of those terms, they are most likely involved in similar biological mechanisms. The algorithm uses kappa statistics to quantitatively measure the degree to which two genes share similar annotation terms. The kappa result ranges from 0 to 1. The higher the value of kappa, the stronger the agreement. A kappa of more than 0.7 typically indicates that the agreement between two genes is strong. Kappa values greater than 0.9 are considered excellent. Please refer also to the attached Help-file for details. Note that it is also possible to search the whole genome for related genes and to highlight the ones included also in the input dataset in this list!
NOTE: This tool is very good, but it would be great if one point was considered: unfortunately, many of the output tables and images are hard to save or to print from your browser as they are Java windows. BUT NOTE: Tables like "Enriched Terms" can be transferred via COPY/PASTE to EXCEL, maintaining the cells-columns-rows structure ! Only sometimes, the only solution is to make a screenshot and print this image via a graphics program, which is very inconvenient.
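NOTE: The kappa statistic mentioned under "Related Genes" can be illustrated with a small Python sketch. It computes the textbook Cohen's kappa for two genes, treating each annotation term as a yes/no observation. The function name and the toy term lists are assumptions for this sketch; the exact procedure used by DAVID is documented in its Help-file and is not reproduced here.

```python
def kappa_score(terms_a, terms_b, all_terms):
    """Cohen's kappa for the agreement of two genes over a set of terms.

    terms_a, terms_b: sets of annotation terms attached to gene A / gene B
    all_terms: every term that is considered (the columns of the matrix)
    """
    n = len(all_terms)
    a_yes = sum(1 for t in all_terms if t in terms_a)
    b_yes = sum(1 for t in all_terms if t in terms_b)
    both = sum(1 for t in all_terms if t in terms_a and t in terms_b)
    neither = sum(1 for t in all_terms if t not in terms_a and t not in terms_b)

    observed = (both + neither) / n          # observed agreement
    pa, pb = a_yes / n, b_yes / n
    expected = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    if expected == 1:                         # both genes are annotated identically everywhere
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    terms = [f"t{i}" for i in range(10)]     # hypothetical term identifiers
    gene_a = {"t0", "t1", "t2", "t3"}
    gene_b = {"t0", "t1", "t2", "t4"}
    print(round(kappa_score(gene_a, gene_b, terms), 3))
```

Note that the textbook kappa becomes negative when two genes agree less often than expected by chance, so values below 0 can occur in this sketch.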

3. Gene ID Conversion Tool:
3.1. Scope: This tool converts a list of gene IDs/accessions to other ID types of your choice, using a comprehensive gene ID mapping repository. Ambiguous accessions in the list can also be identified. NOTE: This is a very nice, quick and easy to use tool for such purposes !!!
3.2. Input: E.g. a list of Affymetrix identifiers, where you want to retrieve the corresponding gene symbols.
3.3. Output - 2 options:
- "normal" ID list: Note that the hyperlinks on gene names link out to the GeneCards database at Weizmann Institute. The resulting list can be easily downloaded as txt file and imported into programs like Excel. Note that the "conversion summary" is also very helpful for large datasets, in order to quickly identify those entries which could not be converted !
- "Show Gene List" option (new):
Note that the hyperlinks on gene names link to the "internal" DAVID gene database. Each gene report displays the most important database links like GenBank, RefSeq, OMIM, GeneRIFs, Entrez Gene, UniProt, and more. In addition, the link "RG" (related genes) scans the input gene set for related genes, and presents a list with color-coded similarity scores (see above!). Note that it is also possible to search the whole genome for related genes and to highlight the ones included also in the input dataset in this list! 
HiMAP - Human Interactome Map
(University of Michigan)
HiMAP is a dynamic browser for the human protein-protein interaction map, provided by the University of Michigan. Because of this definition, the main section of HiMAP is located under "Pathways and Interactions Databases".
HiMAP actually includes a wide range of "Interaction types" beyond real protein-protein interaction, like relation based on co-expression (based on co-citations), enriched domain pairs (based on InterPro protein domains), or shared biological process (based on GO terms).
NOTE: Although HiMAP does not present statistical values and does not create annotation tables for gene sets, it is well suited and highly recommendable if you want to generate a quick overview about the potential "relationships" within your gene set of interest !

Please refer to the main section of HiMAP for details !
(Applied Biosystems)



The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. Proteins have been classified according to families and subfamilies, molecular functions, biological processes, pathways.
NOTE: The performance of PANTHER was much better in tests when using MS Internet Explorer than when using Netscape, especially concerning Java applications.

1. Prowler - Browse PANTHER:
The Prowler tool is a very efficient way to filter the PANTHER databases according to specific criteria. Example: Biological Process: Apoptosis + Pathway: Apoptosis signaling pathway + Species: NCBI:H.sapiens. This selection will retrieve a list of 62 genes which can be displayed and downloaded (see details below under "Batch ID Search").
Note: This is a very convenient and quick option to display all genes corresponding to a specific pathway or biological process (according to the PANTHER annotation)!

2. Search the PANTHER databases:
2.1. Genes: the database of genes, transcripts and proteins in the following species: Human, Mouse, Rat, and Drosophila melanogaster.
2.2. Families and HMMs: A library (PANTHER/LIB) of protein families and subfamilies and associated data such as phylogenetic trees, multiple sequence alignments and HMMs.
2.3. Pathways: a database of regulatory and metabolic pathways mapped to protein sequences, viewable using the CellDesignerLite tool.
2.4. Ontologies: a collection of terms (PANTHER/X) describing protein molecular functions and biological processes. Note that there are tools available which map the PANTHER Ontology terms to GO terms and vice versa.

3. Batch ID Search:
The Batch ID Search tool lets you find PANTHER-classified genes, transcripts, and proteins by uploading a list of IDs. A list of IDs like gene symbol, gene ID, protein accessions, and more can be uploaded or pasted. The result list presents a kind of annotation table which is mainly based on the PANTHER-specific GO terms (PANTHER Biological Process, PANTHER Molecular Function), and on the PANTHER Pathways.
The result list can be displayed in various formats, like gene list, or transcripts/proteins list, or PANTHER Ontology terms, Families, Pathways, and Pathway Components.
Note: These lists link the respective genes of the input gene set to specific terms, but there is no statistical evaluation of "overrepresented terms". For this purpose, you have to refer to the section "Tools" !
Note: This is also a quick option to display all genes corresponding to a specific pathway or biological process !
Note: This list can be saved as txt file which is best opened in MS Excel, but which is static (no hyperlinks). Alternatively, you may simply copy/paste the whole list into e.g. WORD, which maintains all hyperlinks!
Note: Via the "Species Filter" tabs, it is possible to quickly generate the ortholog gene datasets derived from the supported species !!!

4. Tools:
The section "Tools" contains programs for "follow up analysis" concerning the functional annotation of gene datasets.

4.1. Gene Expression Data Analysis:
The Gene Expression Data Analysis tool allows in-depth analyses of gene datasets concerning over-representation of specific PANTHER Ontology terms or pathways. It is somewhat similar to the DAVID Functional Annotation and Functional Classification tools.
The pathway visualization tool will display your experimental results on detailed diagrams of the relationships between genes/proteins in known pathways.
There are 2 different options:

4.1.1. Compare gene lists: Map lists of genes to a PANTHER ontology. For pathways, you can then view the gene expression values overlaid on top of a pathway diagram, where genes will be colored differently for different clusters of genes. Use the binomial statistics tool to compare classifications of multiple clusters of lists to a reference list to statistically determine over- or under- representation of PANTHER classification categories. Each list is compared to the reference list using the binomial test for each molecular function (MF), biological process (BP), or pathway term in PANTHER.
- Create your gene list in a suitable format, like an Excel column of EntrezGene IDs or gene symbols, saved as txt file and upload this file. For example, each selected list may be a cluster of co-expressed genes under a particular set of conditions. You can e.g. upload two lists, one of up-regulated genes and one of down-regulated genes, from a differential mRNA microarray experiment.
- Select the reference list:
For example, the reference list may be the set of all genes in the experiment, or the set of all genes in the genome (human, mouse, rat, Drosophila) being analyzed.
- Select one of 3 different search options: Pathways, Biological Process, Molecular Function.
- Output: A list of pathways (or ontology terms) is displayed, showing the corresponding number of reference genes, the number of genes in the user dataset, the "expected value", and finally a P-value describing the probability that this pathway etc. is listed "by chance" (the smaller the better).
Note: The table can be sorted simply by clicking onto the column headers (default: sorted by P-values)!
Note: This list can be saved as txt file (via the "Export Results" tab) which is best opened in MS Excel, but which is static (no hyperlinks). Alternatively, you may simply copy/paste the whole list into e.g. WORD, which maintains all hyperlinks!
- Pathway diagrams: these diagrams highlight the genes from the input dataset in different colors. Note: Tests showed that the retrieval of these diagrams did not always work.
- Chart displays: there are multiple chart types which can be selected to visualize the dataset, like "Bar chart of gene count", and "Bar chart of difference". These images can be saved to your local machine.
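NOTE: The comparison of a gene list against a reference list with the binomial test can be sketched in a few lines of Python. This is a generic illustration of the binomial test and of the "expected value" shown in the result table; it does not claim to reproduce PANTHER's implementation in detail (e.g. any correction for multiple testing), and all names and numbers are assumptions made for this sketch.

```python
from math import comb


def binomial_representation(k, n, ref_in_category, ref_total):
    """Compare one gene list with a reference list for one category.

    k: genes of the uploaded list that fall into the category
    n: size of the uploaded list
    ref_in_category / ref_total: the same counts for the reference list
    Returns (expected count, "+" for over- or "-" for under-representation, P-value).
    """
    p = ref_in_category / ref_total  # chance that a random gene is in the category
    expected = n * p

    def pmf(i):
        return comb(n, i) * p ** i * (1 - p) ** (n - i)

    if k >= expected:
        # over-representation: probability of k or more hits by chance
        return expected, "+", sum(pmf(i) for i in range(k, n + 1))
    # under-representation: probability of k or fewer hits by chance
    return expected, "-", sum(pmf(i) for i in range(0, k + 1))


if __name__ == "__main__":
    # Hypothetical example: 12 of 150 up-regulated genes belong to a pathway
    # that contains 100 of 20000 genes of the reference list.
    expected, direction, p_value = binomial_representation(12, 150, 100, 20000)
    print(expected, direction, p_value)
```

Each uploaded list (e.g. up-regulated and down-regulated genes) would be tested separately against the same reference list.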

4.1.2. Analyze a list of genes with expression values: Upload a list of genes and their corresponding fold-change values from a differential expression experiment.
- File of genes + expression data must be tab-delimited, and must contain a gene, transcript, or protein identifier and a corresponding expression data value. Example: A cluster from an Affymetrix array experiment displayed as Excel sheet of 2 columns, first is the gene symbol, second is the "Signal" value. The table may contain a header line (which has to be specified in the input form). This file has to be exported as tab-delimited text file.
- For pathways, you can view the gene expression values overlaid on top of a pathway diagram, where genes are colored according to the expression value. The user can also specify how the pathway diagrams will be colored according to the gene expression values.
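NOTE: If the gene list is not prepared in Excel, a tab-delimited file of the kind described above can also be written with a short Python script. This is a minimal sketch; the column names, gene symbols, values, and the output file name are made-up examples, not requirements of PANTHER.

```python
import csv
import io


def to_tab_delimited(rows, header=None):
    """Return gene/value pairs as tab-delimited text, one gene per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    if header is not None:
        writer.writerow(header)  # optional header line (declare it in the input form)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical cluster: gene symbol in column 1, expression value in column 2
    cluster = [("TP53", 2.5), ("MYC", -1.8), ("EGFR", 3.1)]
    text = to_tab_delimited(cluster, header=("gene", "signal"))
    with open("cluster_for_upload.txt", "w", newline="") as handle:
        handle.write(text)
```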

4.2. HMM Scoring:
Search a single protein sequence against the HMMs in the PANTHER library Version 6.0. The top scoring HMM is reported, along with the E-value (the number of false-positive hits expected) and the alignment is shown.
This tool lets you use protein sequences as input to access the PANTHER databases, in case you do not know a gene-specific identifier (like gene name).

4.3. cSNP Analysis:
This tool estimates the likelihood that a particular nonsynonymous coding SNP will cause a functional impact on the protein. It calculates the subPSEC (substitution position-specific evolutionary conservation) score based on an alignment of evolutionarily related proteins, as described in Thomas et al., 2003 and Thomas & Kejariwal, 2004.
- Simply paste a protein sequence and enter the substitution(s) relative to this input sequence in the standard amino acid substitution format, e.g. A265V. Multiple substitutions should be separated by a tab, space, or return. You may use the integrated example protein APOE by clicking onto the "?" to get an impression how the program works.
- Output: A table lists the subPSEC scores, which describe the probability that a certain substitution will have a deleterious effect or may even have a gain-of-function effect. The user can click on the link on the number of the multiple sequence alignment (MSA) position to view the column in the MSA where the substitution occurs (red color).
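NOTE: Before pasting a longer list of substitutions, it can be useful to check locally that each entry follows the standard format (e.g. A265V) and that the stated wild-type residue really occurs at that position of the input sequence. The following Python sketch does this; the function name is hypothetical, and the residue check is an assumption of this sketch about what makes an entry valid, not a documented behaviour of the PANTHER tool.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SUBSTITUTION = re.compile(rf"^([{AMINO_ACIDS}])([1-9]\d*)([{AMINO_ACIDS}])$")


def parse_substitutions(text, sequence):
    """Parse entries like 'A265V' separated by tabs, spaces, or returns.

    Returns a list of (wild_type, position, variant) tuples; positions are
    1-based relative to the given protein sequence.
    """
    parsed = []
    for token in text.split():  # split() handles tabs, spaces, and line breaks
        match = SUBSTITUTION.match(token.upper())
        if match is None:
            raise ValueError(f"not in standard substitution format: {token}")
        wild_type, position, variant = match.group(1), int(match.group(2)), match.group(3)
        if position > len(sequence) or sequence[position - 1] != wild_type:
            raise ValueError(f"{token}: residue {wild_type} not found at position {position}")
        parsed.append((wild_type, position, variant))
    return parsed


if __name__ == "__main__":
    toy_sequence = "MAGK"  # made-up 4-residue sequence, for illustration only
    print(parse_substitutions("A2V\tK4R", toy_sequence))
```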
(Vanderbilt University, Tennessee)

WebGestalt is a "WEB-based GEne SeT AnaLysis Toolkit". WebGestalt incorporates information from different public resources and provides an easy way for biologists to make sense out of large sets of genes. It enables biologists to manipulate integrated information and find patterns that are not detectable otherwise. WebGestalt is designed for functional genomic, proteomic, and large scale genetic studies from which high-throughput data are continuously produced. It currently works for human and mouse. WebGestalt is free for academic use after registration. NOTE: If you have already registered for GOTM, you can use this login !

- Select the method of upload (e.g. "from file") for your gene set.
- Select the organism and ID type of your input file. Most major commercial microarray platforms are supported.

1. Gene set information retrieval tool:
This tool extracts an annotation table of your gene set of interest, similar to BioMart or DAVID - Functional Annotation tool. Note that from the general ID types, only the major ones are represented (like gene name, Entrez Gene, RefSeq, UniGene, Ensembl, Swissprot), but there is a focus on function-related annotations (like KEGG, Biocarta, GO, OMIM, PubMed, Gene RIF). The tool automatically generates an EXCEL sheet as output ! Note: Although the table is "static" (no active hyperlinks), this is one of the best tools for gene set annotation, as it includes the most relevant databases, while avoiding the generation of line redundancy seen in BioMart output tables.

2. Gene set organization tool:
2.1. By Function:
- GO Tree: The input for this tool is similar to GOTM, but includes 3 additional fields: Statistical method, significance level, and minimum number of genes. Please refer to the explanations that come with the input page. Note that prints of large graphs are hardly legible because of the small font size. In general, many functions are known from GOTM (like Bar Chart, Export GO Tree, and the Search functions). Example of additional information as compared to GOTM: The "Enriched DAG" graph shows the numbers of genes and respective P-values !
- KEGG Table and Maps: Input is practically identical to GO Tree. The KEGG Table shows KEGG pathways associated with the gene set, the number of genes in each pathway and the Entrez Gene IDs for the genes. Also, the parameters for the enrichment of the KEGG pathway are shown. Each pathway name in the KEGG Table is hyperlinked to the KEGG Map, in which genes in the gene set are highlighted in red. Each Entrez Gene ID is hyperlinked to the gene information record. Note: The complete page including the KEGG Table can be saved as htm file.
- Biocarta Table and Maps: functions very similar to the KEGG description.
- Protein Domain Table: Input is similar to the others. The Protein Domain Table organizes genes based on the Pfam Protein Domains. The table shows the name of the Pfam domains associated with the gene set, the number of genes having each domain and the Entrez Gene IDs for the genes. The 4th column gives the parameters for the enrichment of the protein domain. Each domain name is linked to the Conserved Domain Database (CDD) of the NCBI, where domain function, structure and sequence are available. Each Entrez Gene ID is linked to the Conserved Domain Summary of the NCBI, where a graphical view of domains on the gene is available.
NOTE: Obviously, the "Protein Domain Table" shows fewer hits in the results file as compared to the extraction of protein domains via the "Gene set information retrieval tool", even when selecting "1" as "minimum number of genes" ! Thus, if you want to get ALL protein domains, you should use the latter tool.

2.2. By Tissue Expression:
- Tissue Expression Bar Chart:
Each tissue is represented by a bar in the chart. The height of the bar represents the number of genes that are expressed in the tissue. Click on individual bars to view the list of genes expressed in the tissue. The last column of the table lists the p value indicating the over/under-representation of each gene in the respective tissue. Click on individual p values to view the expression pattern of the gene in various tissue types. In this view, each tissue is represented by two bars. The height of the red bar represents the observed number of EST sequences for the selected gene in the tissue. The height of the green bar represents the expected number of EST sequences.
Note: This is an excellent tool to produce a "quick-view" of statistical analysis of single gene-tissue expression, although it is purely based on EST expression data (no SAGE or microarray data).

2.3. By Chromosome Location:
- Chromosome Distribution Chart:
Each gene is represented by a red cross symbol and located on the chromosome based on its location.

2.4. By Publication:
- GRIF Table:
The GRIF Table shows the co-occurrence of the genes in publications, based on the gene-publication association information retrieved from Gene Reference Into Function (GeneRIF). The table shows the PubMed IDs for the publications associated with the gene set, GRIF comments, the number of genes in each publication and the Entrez Gene IDs for the genes. Each PubMed ID is hyperlinked to the corresponding PubMed record, where the abstract for the paper is available. Each Entrez Gene ID is hyperlinked to the gene information record.
- PubMed Table: The PubMed Table shows the co-occurrence of the genes in publications, based on the gene-publication association information retrieved from PubMed. The table shows the PubMed IDs for the publications associated with the gene set, the number of genes in each publication and the Entrez Gene IDs for the genes. Each PubMed ID is hyperlinked to the corresponding PubMed record, where the abstract for the paper is available. Each Entrez Gene ID is hyperlinked to the gene information record.
NOTE: References including very high numbers of genes are most probably publications describing large-scale sequencing projects, which do not describe a functional connection between these genes !
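NOTE: The principle behind such a publication table can be sketched in Python: gene-to-publication associations are inverted into a publication-to-genes table, and references with very many genes can be filtered out as probable large-scale sequencing papers. All names, the cutoff, and the IDs below are made-up placeholders for illustration; this is not WebGestalt's own code.

```python
from collections import defaultdict


def cooccurrence_table(gene_to_pmids, min_genes=2, max_genes=None):
    """Build rows of (PubMed ID, number of genes, sorted gene IDs).

    min_genes: keep only publications that mention at least this many genes
    max_genes: optionally drop publications that mention more genes than this
    """
    pmid_to_genes = defaultdict(set)
    for gene, pmids in gene_to_pmids.items():
        for pmid in pmids:
            pmid_to_genes[pmid].add(gene)

    rows = []
    for pmid, genes in pmid_to_genes.items():
        if len(genes) < min_genes:
            continue
        if max_genes is not None and len(genes) > max_genes:
            continue
        rows.append((pmid, len(genes), sorted(genes)))
    rows.sort(key=lambda row: (-row[1], row[0]))  # most genes first, then by ID
    return rows


if __name__ == "__main__":
    # Placeholder gene IDs and publication IDs, not real records
    associations = {"G1": [101, 102], "G2": [101], "G3": [101, 102, 103]}
    for row in cooccurrence_table(associations):
        print(row)
```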

3. Comparison of 2 different gene datasets:
- "Boolean Operation" allows the comparison of 2 datasets, meaning that simply the gene lists are either "merged", "subtracted" or the "intersection" between the 2 sets is calculated.
NOTE: This feature can be highly useful when comparing large clusters derived from microarray experiments! The user can quickly identify the degree of overlapping genes, and estimate the biological processes involved.
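NOTE: The three Boolean operations correspond directly to set operations, so the same comparison can be reproduced locally, e.g. in Python. A minimal sketch; the function name and the gene symbols are arbitrary examples.

```python
def boolean_operation(list_a, list_b, operation):
    """Combine two gene lists with 'merge', 'subtract', or 'intersection'."""
    a, b = set(list_a), set(list_b)
    if operation == "merge":
        result = a | b   # genes found in either list
    elif operation == "subtract":
        result = a - b   # genes of list A that are not in list B
    elif operation == "intersection":
        result = a & b   # genes found in both lists
    else:
        raise ValueError(f"unknown operation: {operation}")
    return sorted(result)


if __name__ == "__main__":
    cluster_1 = ["TP53", "MYC", "EGFR"]
    cluster_2 = ["MYC", "KRAS"]
    for op in ("merge", "subtract", "intersection"):
        print(op, boolean_operation(cluster_1, cluster_2, op))
```

The size of the intersection relative to the sizes of the two lists gives a quick measure of the degree of overlap between two clusters.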

4. Get Human / Mouse Orthologs:
This button very quickly generates the ortholog dataset of your input genes derived from the "other" species.

- WebGestalt delivers "additional value" as compared to GOTM, which is purely based on Gene Ontology terms, and is therefore rather comparable to resources like DAVID - Functional Annotation Tool.
- In general, save and download options are more versatile than in e.g. DAVID.
- WebGestalt incorporates practically ALL FIELDS of functional annotation: GO, Pathways, and Co-citation Networks !!!
- Altogether, WebGestalt is an excellent resource for functional annotation of gene datasets.