-> DATA INTEGRATION
-> DAT1...get a quick, yet comprehensive overview of data concerning my gene / protein of interest ? (last update Aug. 26, 2005)
-> DAT2...get the full-length sequence and annotation from an EST sequence or accession number? (last update Sep. 14, 2005)
-> DAT3...retrieve a large number of sequences and write one common FASTA sequence file ? -> see RET3 !
-> DAT4...know which genes of a specific dataset are associated with a disease ? -> see GENOM5 !
-> DAT5...know all genes associated with cardiovascular diseases having a described polymorphism in the promoter region ? -> see GENOM7 !
-> DAT6...convert a list of IDs (like microarray probe IDs) into other IDs (like gene names) ? (last update Jun. 27, 2006)
-> DAT7...see the functions and pathways where my gene set of interest is involved ? -> see PATH1 !
-> DAT8...produce a general annotation table for a gene set of interest ? (last update Dec. 21, 2005)
DAT1...get a quick, yet comprehensive overview of data concerning my gene / protein of interest ? (last update Aug. 26, 2005)
The major question when dealing with a new gene is
how to get a quick overview of what is already known about its
cDNA sequence, genomic organization, protein domains and structure,
localization, expression, and finally function. There are a few
databases which address this need. NOTE: If you want to produce
an annotation table "in-batch" for a whole set of genes, please
refer to DAT8 !
Tip! The Bioinformatic Harvester,
available at EMBL, is a kind of "data super-integration" tool, as it
provides information from already "integrative" databases (like
SOURCE, Ensembl, UCSC, and NCBI Entrez) on ONE SINGLE webpage. This is
achieved by displaying database hits which are rich in graphical
information as “iframes”, which provide the user with the latest
information from the original database server. This is extremely
convenient, as each “iframe” can be manipulated individually. There is a
multitude of query options (but not via EST acc.). NOTE that
Harvester currently works exclusively on HUMAN proteins contained in
the UniProt collection. Other species might be included in the
future. Please also refer to the Harvester description on the main
page.
Tip! The NCBI Entrez search page has been re-designed and
re-structured recently, now
providing a single interface to query ALL Entrez-databases
at once, including PubMed, OMIM, GenBank, Structures, SNP,
UniGene, GEO (microarray expression data !), and more. You can enter
one or more search terms, and you will retrieve a first "overview"
results page, displaying the number of hits in each of the databases.
Then, you can see the individual results. Alternatively, you can first
choose the database of interest, and then place your query.
Entrez Gene is one of the Entrez databases; it was constructed in 2004
to replace the widely known and used LocusLink database.
Entrez Gene integrates information from LocusLink and from genes
annotated on Reference Sequences from completely sequenced genomes.
You can query on names, symbols, accessions, publications, GO terms,
chromosome numbers, E.C. numbers,
and many other attributes associated with genes and the products they
encode. Because Gene is now an Entrez database, all the
familiar and useful functions are available, including Preview/Index,
History, and LinkOut. The full functionality
of LocusLink was maintained in Entrez Gene,
and even extended by additional links, e.g. to the GEO database of
expression data. Please note that the Entrez Gene IDs are identical
to the LocusLink IDs, therefore it is relatively easy to convert
LocusLink-specific files to Gene-specific ones by exchanging the
corresponding URLs. A great feature of Entrez Gene is the option
to restrict the query to specific fields, making it possible to
search for e.g. gene name only ! A query for the gene "TNF" yields 295
(!) entries in LocusLink, even when restricting to human entries,
whereas in Entrez Gene, one specifically retrieves 1 human entry by
using "TNF[Gene Name] AND homo[Organism]".
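A field-restricted query like the TNF example can also be assembled in a
script. The following is a minimal sketch, assuming the query is sent to
NCBI's E-utilities "esearch" service rather than typed into the web form;
the helper names build_gene_query and build_esearch_url are hypothetical,
and the code only builds the URL without contacting NCBI:

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service (assumption: the query is
# submitted via E-utilities instead of the Entrez web page).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(symbol, organism):
    """Combine two field-restricted terms with AND, as in the TNF example."""
    return f"{symbol}[Gene Name] AND {organism}[Organism]"


def build_esearch_url(term, db="gene"):
    """Return the esearch URL for a query term (no network access here)."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


term = build_gene_query("TNF", "homo")
url = build_esearch_url(term)
print(term)
print(url)
```

The square brackets and spaces are URL-encoded by urlencode, so the term
can be passed on exactly as it would be entered in the search box.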
Tip! The UCSC Gene Sorter
is an excellent resource for exploring gene families and the
relationships among genes. This tool displays a table of genes
within a selected genome that are related to one another. Several
different relationships may be explored: protein-level homology,
similarity of gene expression profiles, or genomic
proximity. The Browser supports searches on a variety of terms
and phrases, including the gene name, the SwissProt protein
name, a GenBank accession, or a word or phrase present in a gene's
description. At the "Sort by" field, you can choose e.g.
"Expression
(GNF)", which looks for all datasets in this database which show a similar
expression pattern to your gene of interest.
The gene family display is highly configurable,
allowing the user to control the
order and number of columns, the number of rows, and the genes
displayed. The tool provides several output formats, including
a simple tab-delimited format that may be imported into a spreadsheet
or a relational database. In addition, the sequences of the
displayed genes can be downloaded: cDNA, protein, genomic and promoter
(!) sequences, allowing a user-definition of upstream and downstream
regions. Example: An important use of the Browser is to gather
together a collection of genes that share similar properties
for statistical analysis. For instance, one might want to examine
promoter regions of genes that share a similar expression pattern or
look for protein sequence motifs in genes that share similar
GO annotations. BUT keep in mind: You always start with only ONE
gene; you cannot provide lists of genes. The program itself generates
"lists of genes with similar expression pattern", taken from specific
expression databases. Please refer also to the UCSC Gene Sorter main section
for details!
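The idea of sorting genes by expression similarity to ONE query gene can
be illustrated with a small sketch. It ranks genes by the Pearson
correlation of their expression profiles with the query gene. The gene
names and values are made-up toy data, and the distance measure actually
used by the Gene Sorter may differ:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_by_similarity(query, profiles):
    """Return all other genes, most similar expression pattern first."""
    scores = {
        gene: pearson(profiles[query], values)
        for gene, values in profiles.items()
        if gene != query
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical expression values across four tissues.
profiles = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.5],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
    "GENE_D": [1.0, 3.0, 2.0, 4.0],
}
ranking = rank_by_similarity("GENE_A", profiles)
print(ranking)
```

As in the tool itself, the input is a single gene and the list of similar
genes is generated from the expression data.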
Tip! The UCSC Proteome
Browser Gateway provides fast access to protein-specific
data for a gene of interest. You simply enter a gene symbol or a
Swiss-Prot/TrEMBL protein ID into the query field. A list will be
displayed showing proteins from several species which match your
query. Each protein entry concisely shows data like
polarity, hydrophobicity, cysteines, amino acid frequencies and
anomalies, pI, molecular weight, InterPro and Pfam domains, and
predicted comparative 3D structures. In addition, links to the
related UCSC databases Genome Browser and Gene Sorter,
as well as to the Gene Details Page, are included,
allowing very efficient navigation between these resources.
The Stanford Online Universal
Resource for Clones and ESTs (SOURCE)
compiles information from several publicly accessible databases,
including UniGene, dbEST, Swiss-Prot, GeneMap99,
RHdb, GeneCards and LocusLink. The
mission of SOURCE is to provide a unique scientific resource that
pools publicly available data commonly sought after for any clone,
GenBank accession number, or gene. You can query using GB IDs,
LocusLink IDs, UniGene ID, gene name. Currently 3 species (human,
mouse, rat) are available.
The output comprises a very informative, yet concise
description of the gene of interest, including all important links for
data integration (like LocusLink or GeneCards), including emphasis
on gene expression data. When available, there is a link to published
microarray expression data. Please note that not all microarray data
stored in the SMD (Stanford Microarray Database) are retrievable via
SOURCE, therefore you may also directly search SMD (basic or advanced)
for datasets via lists of specific experimental setups. Note that the
link "Authors' webpage" offers direct access to the primary databases
holding the expression data. In addition, you will find EST data, the
normalized expression distribution in tissues according to EST data, and
SAGE data (SAGE being an expression analysis method) held at the NCBI.
SOURCE Batch Search is the batch extract interface for SOURCE.
You can input a list of GenBank Accessions (! including ESTs !),
dbEST cloneIDs, UniGene ClusterIDs, UniGene gene names, or UniGene gene
symbols and retrieve
data from a check list, where you may choose e.g. UniGene Name, Symbol,
ClusterID, LocusLinkID, RefSeq mRNA, RefSeq Protein, GO Annotations,
and 2 fields extracted from Swiss-Prot: Enzymatic Function and
Subcellular Location. Note that the input list must be a text
file consisting of a single column, or pasted from e.g. EXCEL
as a single column. NOTE: Accession numbers are case sensitive,
UniGene ClusterIDs MUST include the "species-prefix" (like
"Hs."), gene names are not case-sensitive. You may copy the output text
and paste it into WORD (and then convert the text into a table), or in
EXCEL choose "Edit", "Paste Special", "Text". NOTE: This tool is
especially useful if you want to convert large lists of accession
numbers, like from LocusLink to UniGene.
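The input rules for SOURCE Batch Search (single column, species prefix on
UniGene ClusterIDs, case-sensitive accessions) can be checked before
uploading. This is a minimal sketch, not part of SOURCE itself: the
function name prepare_source_batch and the id_type labels are
hypothetical, and the IDs in the usage lines are illustrative only:

```python
import re

# UniGene ClusterID carrying a species prefix, e.g. "Hs.241570".
CLUSTER_WITH_PREFIX = re.compile(r"^[A-Z][a-z]{1,2}\.\d+$")


def prepare_source_batch(ids, id_type, species_prefix="Hs."):
    """Return the IDs as single-column text, one ID per line.

    id_type is "accession", "unigene_cluster" or "gene_name"
    (labels invented for this sketch).
    """
    cleaned = []
    for raw in ids:
        value = raw.strip()
        if not value:
            continue  # skip empty cells copied from a spreadsheet
        if id_type == "unigene_cluster":
            if value.isdigit():
                # Bare cluster number: add the required species prefix.
                value = species_prefix + value
            elif not CLUSTER_WITH_PREFIX.match(value):
                raise ValueError(f"not a UniGene ClusterID: {value!r}")
        # Accessions are case sensitive, so their case is never changed.
        if value not in cleaned:
            cleaned.append(value)
    return "\n".join(cleaned)


clusters = prepare_source_batch(["241570", " Hs.2 ", "", "241570"], "unigene_cluster")
accessions = prepare_source_batch(["AI700705", "ai700705"], "accession")
print(clusters)
print(accessions)
```

The returned text can be saved as a plain text file or pasted into the
input box as a single column.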
GeneCards
is a database of human genes, their products and their involvement in
diseases. It offers concise information about the functions of
all human genes that have an approved symbol, as well as selected
others. GeneCards can be searched using gene symbols like
"BRCA1" or keywords including GenBank accession numbers,
UniGene clusters, SNP Id., clone identifier (ATCC, IMAGE) and others.
The output includes related sequences, SNPs, medical news,
disorders and mutations, MIPS PEDANT Viewer report (!), links to UCSC
and Ensembl, PubMed search and many more. In addition, links to gene
expression data (Arrays, ESTs, links to SOURCE) are available.
GeneLynx
is a portal to a collection of hyperlinks for each human gene.
You can access the information about a particular human gene by
providing any reasonable identifier - just type a keyword, ANY
accession number or ID, or submit a related
protein or nucleotide sequence on the BLAST search page. You can
also perform a more refined keyword search on the Text search
page. NOTE that it is possible to perform batch
retrieval of GeneLynx entries. GeneLynx enables you to paste or
upload a list of accession numbers* (GB ID, LocusLink ID,
Ensembl ID, OMIM ID, UniGene cluster (without "Hs.") and more) to the Batch GeneLynx page
and obtain a list in which these identifiers are associated with
appropriate GeneLynx IDs and descriptions. Simply paste the IDs (or
select a plain text file containing the IDs, separated by spaces or
line-breaks, to upload). The output HTML page can also be saved locally
for later use. * Note that you may even use a list of
EST accession numbers (pick the file format "Nucleotide Accession"),
which automatically retrieves the corresponding UniGene cluster and
GeneLynx data !!! Nevertheless, you might encounter differences between
UniGene accessions at GeneLynx and at NCBI, depending on the database
build currently used.
Tip! H-Invitational Database (H-InvDB) is a human gene database opened
to the public in April 2004, which is hosted by the Japan Biological
Information Research Center (JBIRC) and by the DNA Data Bank of Japan
(DDBJ), with contributions from more than 40 institutes worldwide, like
the German DKFZ.
Please note that H-InvDB is based essentially on cDNA
sequences; it is not a genome sequence repository, although it
provides a graphical genome viewer (i.e. chromosome regions of
matching cDNAs). The scope of H-InvDB is to provide an integrative
annotation of full-length cDNA clones available from high throughput
cDNA sequencing projects. The database generates cDNA clusters
describing their gene structures, novel alternative splicing isoforms,
non-coding functional RNAs, functional domains, sub-cellular
localizations, metabolic pathways, predictions of protein 3D structure,
mapping of SNPs and microsatellite repeat motifs in relation with
orphan diseases, gene expression profiling,
and comparative results with mouse full-length cDNAs in the context of
molecular evolution. Please refer to the H-InvDB section at the Data Integration page for
a detailed description !
HPRD - Human Protein Reference Database is yet another good starting
point if you are interested in human proteins. HPRD represents a
centralized platform to visually depict and integrate information
pertaining to domain architecture, post-translational modifications,
interaction networks and disease association for each protein in the
human proteome. All the information in HPRD has been manually extracted
from the literature by
expert biologists who read, interpret and analyze the published data.
HPRD is a joint project between Pandey Lab,
Johns Hopkins School of Medicine, Baltimore, and the IOB, Institute of
Bioinformatics, Bangalore, India. At first sight, HPRD may seem like
"yet another protein database", but there are some features which are
really worth mentioning. In particular, HPRD offers very nice query
options, e.g. you may
limit the search to certain molecular classes, cellular components,
domains, motifs, tissue and cell type-specific expression, diseases, or
length of the protein sequence or molecular weight. Please refer
to the HPRD section at the Pathways page for
a detailed description !
DAT2...get the full-length sequence and annotation from an EST sequence or accession number? (last update Sep. 14, 2005)
There are several databases involved in this
question, and often it is necessary to scan through more than one of
them.
1. Query via EST accession numbers:
Tip! First, I would recommend a keyword search using the GenBank
accession number of the EST against the NCBI-UniGene
Database containing EST-Clusters. Even more conveniently, you
may perform this search using the NCBI ENTREZ cross-database
query form. Each UniGene cluster contains sequences that
represent a unique gene (annotated sequences and ESTs), as well as
related information such as the tissue types in which the gene has been
expressed and map location. You have to select the species and type in
the accession number. If annotated cDNAs are already available,
you will find entries in the part "mRNA/GENE
SEQUENCES". Often, you will find a link to the NCBI-LocusLink
database (probably replaced by a link to Entrez
Gene in 2004), which provides a very valuable link page for
all information related to a certain gene, e.g. links to the NCBI-OMIM
database, which contains
"mini-reviews" about what was already published.
Tip! If you know the EST accession number, you can use it to query the
SOURCE database (choose "GenBank Accession" as search option). There you
will find all available links, like UniGene or LocusLink (Entrez Gene),
and many additional links, especially to expression data.
2. Query via EST sequences:
Tip! If you are dealing with a sequence from a "well characterized"
species (like human, mouse, rat), possibly the best (and fastest !) way
to get to the full-length sequence and annotation is to perform a BLAT
search at UCSC against the corresponding genome. BLAT is
extremely fast, and lists the desired gene on top of the output page.
3. Query via EST accession numbers or sequences:
Another good access point is the TIGR Gene Indices.
TIGR holds information on ESTs not only from human but from a variety
of species (animal, plant, protist, and fungal). You can either BLAST these databases, or
query by keywords, like EST-accession number, ID, tissue, cDNA library
name, gene product name, and many more. TIGR clusters are named as "TC"
(Tentative Consensus; "THC" stands for Tentative Human Consensus). Note
that you can also search for a gene using the GenBank accession number
of an annotated cDNA in the database. The main advantage of TIGR is
that ESTs belonging to a single gene / cluster are not only listed but
assembled into a contig sequence, and represented as a graphical
image showing corresponding sizes and positions. In addition, you will
find an expression summary, and links to the genomic organization of
the gene. The latter is very useful by showing the TCs from all
related species in one image, and a link to the Ensembl database.
H-Invitational Database (H-InvDB) is a human gene database opened to
the public in April 2004, which is hosted by the Japan Biological
Information Research Center (JBIRC) and by the
DNA Data Bank of Japan (DDBJ), with contributions
from more than 40 institutes worldwide, like the German DKFZ. The
scope of H-InvDB is to provide an integrative annotation of
full-length cDNA clones available from high throughput cDNA sequencing
projects. The database generates cDNA clusters describing their
gene structures, novel alternative splicing isoforms, non-coding
functional RNAs, functional
domains, sub-cellular localizations, metabolic pathways, predictions of
protein 3D structure, mapping of SNPs and microsatellite repeat motifs
in relation with orphan diseases, gene expression profiling, and
comparative results with mouse full-length cDNAs in the context of
molecular evolution. You may simply BLAST
the H-InvDB using your query sequence and see the cluster it belongs to.
Please also refer to the H-InvDB section at the Data Integration page
for a detailed description !
Tip! ECgene (gene
prediction by EST clustering) predicts genes by combining genome-based
EST clustering and a transcript assembly procedure in a coherent
and consistent fashion. Specifically, ECgene takes alternative splicing
events into consideration. The positions of splice sites (i.e.
exon-intron boundaries)
in the genome map are utilized as critical information in the whole
procedure. Sequences that share splice sites in the genomic alignment
are grouped together to define an EST cluster. Transcript
assembly, based
on graph theory, produces gene models and clone evidence, which is
essentially identical to sub-clustering according to splice variants.
ECgene is available for human, mouse, and rat
genomes. ECgene can be queried either by EST accession numbers or via
BLAST search. Example: EST acc. AI700705. Note that at
first, the complete Gene Summary page is shown, where you will
find the input EST in the list of all ESTs. The link Alignment
Viewer will present a graphic which shows the position of the input
EST, meaning the exon(s) to which it corresponds. Thus, ECgene is an
excellent resource to identify the potential transcript variants
of the gene to which your EST belongs ! Note that when performing
a BLAST search, you will retrieve the ECgene Transcript ID, which you
can then use to query ECgene. Please also refer to the ECgene
section at the Expression page for a detailed
description !
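The clustering principle described for ECgene (sequences sharing splice
sites are grouped together) can be pictured with a toy example. The
sketch below is not ECgene's implementation; it assumes that sharing a
single splice site is enough to link two ESTs and that links are
transitive, and it uses invented EST names and coordinates:

```python
def cluster_by_splice_sites(est_sites):
    """Group ESTs that share a splice site, directly or via other ESTs."""
    parent = {est: est for est in est_sites}

    def find(est):
        # Follow parent links to the cluster representative.
        while parent[est] != est:
            parent[est] = parent[parent[est]]  # path halving
            est = parent[est]
        return est

    first_seen = {}  # splice site -> first EST observed with that site
    for est, sites in est_sites.items():
        for site in sites:
            if site in first_seen:
                # Same splice site already seen: merge the two clusters.
                parent[find(est)] = find(first_seen[site])
            else:
                first_seen[site] = est

    clusters = {}
    for est in est_sites:
        clusters.setdefault(find(est), set()).add(est)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical ESTs with splice sites given as (chromosome, position).
est_sites = {
    "EST1": {("chr1", 1200), ("chr1", 1800)},
    "EST2": {("chr1", 1800), ("chr1", 2500)},
    "EST3": {("chr1", 2500)},
    "EST4": {("chr2", 400)},
    "EST5": set(),  # unspliced EST, stays on its own in this sketch
}
clusters = cluster_by_splice_sites(est_sites)
print(clusters)
```

EST1 and EST3 share no splice site with each other, but both are linked
to EST2 and therefore end up in the same cluster.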
DAT3...retrieve a large number of sequences and write one common FASTA sequence file ? -> see RET3 !
This question involves resources of high-throughput
data retrieval, which are described in FAQ RET3. This FAQ relates to
the main section
"High-throughput Data Retrieval".
DAT4...know which genes of a specific dataset are associated with a disease ? -> see GENOM5 !
This question involves resources of high-throughput
data retrieval specifically concerning gene-disease correlations, which
are described in FAQ GENOM5. This FAQ relates to the main section
"Disease-centered Data Integration".
DAT5...know all genes associated with cardiovascular diseases having a described polymorphism in the promoter region ? -> see GENOM7 !
This question involves resources in the fields of
SNPs and diseases, which are also capable of batch handling and
filtering of gene datasets. This FAQ relates to the main section
"Disease-centered Data Integration".
DAT6...convert a list of IDs (like microarray probe IDs) into other IDs (like gene names) ? (last update Jun. 27, 2006)
Many web-based applications require the input of one
specific type of ID or allow you to select among a group of
IDs, e.g. if you want to analyze a set of genes derived from a certain
experiment like a microarray analysis. Therefore, it is often necessary
to convert lists of IDs into other corresponding ID types. There are
some resources for this purpose, but they differ considerably in
functionality and in the kinds of IDs they support.
Tip! DAVID - The Database for Annotation, Visualization and Integrated
Discovery integrates functional genomic
annotations with intuitive graphical summaries. DAVID provides a
comprehensive set of tools for investigators to visually summarize annotation
from large lists of genes, including those derived from
microarray and proteomic studies. DAVID is provided at NCI-Frederick and was developed to
support the bioinformatic needs at the . DAVID is
composed of several tools for the functional annotation and
classification of large gene sets. NOTE: There are no individual
URLs for the individual
applications. All have to be started from the DAVID main page. Please
refer also to the DAVID section at the
main page.
The Gene ID Conversion Tool of DAVID is
a very nice, quick and easy to use
tool for such purposes !!! This tool converts lists of gene
IDs/accessions to others of your choice, using the most comprehensive
gene ID mapping repository. Ambiguous accessions in the list can also be
identified. For example, the user may provide a list of Affymetrix
identifiers as input in order to retrieve the corresponding gene
symbols. There are 2 options for the output:
In the "normal" ID list, the hyperlinks on
gene names link out to the GeneCards
database at Weizmann Institute. The resulting list can be easily
downloaded as txt file and imported into programs like Excel. Note that
the "conversion summary" is also very helpful for large datasets, in
order to quickly identify those entries which could not be converted !
In the "Show Gene List" option (new), the hyperlinks
on gene names link to the "internal" DAVID gene database. Each gene
report displays the most important database links like GenBank, RefSeq,
OMIM, GeneRIFs, Entrez Gene, UniProt, and more. In addition, the link
"RG" (related genes) scans the input gene
set for related genes, and presents a list with color-coded similarity
scores (see above!). Note that it is also possible to
search the whole genome for related genes and to highlight the
ones included also in the input dataset in this list!
NOTE: As DAVID produces only ONE line per gene, it is a better resource
for this purpose than BioMart, which produces redundant lines (see
discussion below) !
ID Mapping
is a tool for ID conversion provided by PIR,
a member of the UniProt consortium, which is capable of converting
lists of IDs (like GenBank AC, gi,
RefSeq, TIGR, Pfam, Prints, PROSITE, KEGG pathway ID, GO, BIND, Gene
name, Entrez Gene ID, OMIM, PubMed, and more) to and from UniProtKB ID
or AC. Note that this tool can therefore also be used for
questions like how to get all proteins belonging to a certain pathway. NOTE:
The conversion only works to and from UniProtKB ID
or AC, it is not possible to e.g. convert Entrez Gene IDs into
gene names !
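Because every conversion passes through UniProtKB, a conversion between
two non-UniProtKB ID types would need two steps whose results are joined
afterwards. The sketch below only illustrates that joining step, assuming
two separate mapping tables have already been obtained; the function name
chain_mappings is hypothetical, the small tables are illustrative
stand-ins for real tool output, and the last input ID is invented to show
the unmapped case:

```python
def chain_mappings(ids, to_uniprot, from_uniprot):
    """Map each input ID to target values by going through UniProtKB ACs."""
    result = {}
    for source_id in ids:
        targets = []
        for accession in to_uniprot.get(source_id, []):
            for target in from_uniprot.get(accession, []):
                if target not in targets:
                    targets.append(target)
        result[source_id] = targets  # empty list = could not be converted
    return result


# Step 1 (assumed output of a first run): Entrez Gene ID -> UniProtKB AC.
entrez_to_uniprot = {"7124": ["P01375"], "7157": ["P04637"]}
# Step 2 (assumed output of a second run): UniProtKB AC -> gene name.
uniprot_to_name = {"P01375": ["TNF"], "P04637": ["TP53"]}

names = chain_mappings(["7124", "7157", "999999999"],
                       entrez_to_uniprot, uniprot_to_name)
print(names)
```

IDs that are lost in either step come back with an empty list, which
makes it easy to spot entries that could not be converted.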
A very powerful resource for data retrieval and also for ID list
conversion is BioMart
(formerly known as "EnsMart"). NOTE that there are different
web interfaces for BioMart, please refer to the BioMart main section for details.
BioMart is a data retrieval tool that
generates lists of biological objects (e.g. genes, SNPs) from data held
in the Ensembl (and other) databases. In fact, the functionality is
based on filtering the whole set of genes in a genome via lists of
accession numbers, IDs (like Entrez Gene, RefSeq, MIM, InterPro, PDB,
GO, Affymetrix, and many more). BioMart can generate a number of different
types of output, including sequence and tabulated list data. Multiple
output formats, including HTML, text and Microsoft Excel, are also
supported. Note that EXCEL sheets maintain all hyperlinks !
In order to convert a list of IDs,
you may perform the following steps. At the Start Page you have
to select the organism and the database.
At the Filter Page, have a look at the dropdown menu "Entries
with following IDs", where you can provide your own list of
e.g. LocusLink IDs, MIM IDs, RefSeq IDs, Affymetrix
ProbeSet IDs (!), and many more. The entries can be
separated by commas or line breaks. At the Output Page, you
should choose the Features Page, where you can select those IDs which
you want to appear in the final table. NOTE: BioMart
is transcript-centered, so if a gene has multiple transcripts,
you will get redundant lines (meaning duplicates, triplicates
etc. of genes) in your final EXCEL-sheet. Unfortunately there is no way
to remove these automatically within BioMart. So, if you need a gene
list where each gene is represented just once, you have to delete these
rows manually. Note that BioMart supports a long list of different IDs,
although some are still missing. A test run showed that there is no
option to search using mouse gene symbols or to output mouse gene
symbols (in contrast, this is no problem with human HUGO IDs).
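Outside of BioMart, the redundant transcript rows can be dropped with a
few lines of script when the table is exported as tab-delimited text.
This is a minimal sketch under that assumption; the column names and IDs
in the example are invented, and only the first row seen for each gene is
kept:

```python
import csv
import io


def one_row_per_gene(tsv_text, key_column):
    """Keep only the first row seen for each value of key_column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    rows = []
    for row in reader:
        if row[key_column] not in seen:
            seen.add(row[key_column])
            rows.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical export: gene G1 appears once per transcript.
export = (
    "Gene ID\tTranscript ID\tSymbol\n"
    "G1\tT1\tABC\n"
    "G1\tT2\tABC\n"
    "G2\tT3\tXYZ\n"
)
deduplicated = one_row_per_gene(export, "Gene ID")
print(deduplicated)
```

Note that the transcript-specific columns of the dropped rows are lost,
so this only suits tables where one representative row per gene is
sufficient.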
TOUCAN is a workbench for
regulatory sequence analysis, especially for detecting significant
transcription factor binding sites across species. Please refer to the TOUCAN section at the main page for
details. Among many functions, TOUCAN provides a module to generate annotation
tables of gene lists. From
the TOUCAN menu, choose "Get_Seq", "From Ensembl", "Get Info", and hit
the "Update" button. This starts the process of retrieving the
information and creating the table, which can be followed in the
progress bar. Each row represents one gene of your dataset and each
column represents one kind of database identifier. This includes basic
gene, mRNA and protein IDs as well as sites of "data integration" like
Ensembl Gene, EMBL, HUGO, UniGene, UniProt, RefSeq, and LocusLink. In
addition, IDs in the fields of protein domains and motifs (Interpro),
as well as protein structure (PDB) are listed. As the number of
microarray platforms increases rapidly, the number of columns in
TOUCAN displaying chip-specific identifiers is also constantly growing.
Not only the world leader in this technology, Affymetrix, is
covered, but also other platforms. Finally, the table also presents IDs
which are related to functional information of each gene, like the
involvement in biological processes, or the localization to specific
cellular compartments (GO, MIM).
To look at this information in detail, you can highlight
all rows and copy / paste the data into programs like MS EXCEL. Conclusion:
TOUCAN supports a long list of ID types as input / output (including
gene names, microarray IDs, and many more). Another positive
feature is that the table is non-redundant, displaying only one row per
gene. Nevertheless, the table is "static", there are no hyperlinks.
The NCBI Entrez search page provides a single interface to query ALL
Entrez-databases
at once, including PubMed, OMIM, GenBank, Structures, SNP,
UniGene, GEO (microarray expression data !), and more. NCBI Entrez
shows very limited functionality concerning the question of ID
list conversion. First, only a few Entrez databases support batch
queries, like "Nucleotide" or "Protein" where you can query using a
list of IDs. Second, there is no option to generate tables of IDs
comparable to BioMart. Finally, there is no way to use gene
symbols "in batch" to query Entrez Gene, nor to use microarray IDs like
Affymetrix ProbeSets as input.
The Stanford Online Universal
Resource for Clones and ESTs (SOURCE)
compiles information from several publicly accessible databases,
including UniGene, dbEST, Swiss-Prot, GeneMap99,
RHdb, GeneCards and LocusLink. The
mission of SOURCE is to provide a unique scientific resource that
pools publicly available data commonly sought after for any clone,
GenBank accession number, or gene. SOURCE
Batch Search is the batch extract interface for SOURCE.
You can choose between 3 organisms (human, mouse, rat). You
can input a list of GenBank Accessions (! including ESTs !),
dbEST cloneIDs, UniGene ClusterIDs, UniGene gene names, UniGene gene
symbols, or LocusLink (now Entrez Gene) IDs, and retrieve
data from a check list, where you may choose e.g. UniGene Name, Symbol,
ClusterID, LocusLinkID, RefSeq mRNA, RefSeq Protein, GO Annotations,
and 2 fields extracted from Swiss-Prot: Enzymatic Function and
Subcellular Location. Note that the input list must be a text
file consisting of a single column, or pasted from e.g. EXCEL
as a single column. NOTE: Accession numbers are case sensitive,
UniGene ClusterIDs MUST include the "species-prefix" (like
"Hs."), gene names are not case-sensitive. You may copy the output text
and paste it into WORD (and then convert the text into a table), or in
EXCEL choose "Edit", "Paste Special", "Text". Conclusion: Only
3 species are supported, and the list of possible IDs is very limited
(e.g. no option to use microarray IDs...). Also, the output is a
"static" text, there are no hyperlinks.
The program Resourcerer
at the TIGR webpage can also be
used for ID conversion. Resourcerer provides annotation based on the TIGR Gene Indices (TGI)
for commonly available microarray resources, including widely used
clone sets and Affymetrix GeneChip Arrays. If you have a list of
accession numbers, click on the link "Batch
Search" and either upload a *.txt file containing your accession
numbers, like UniGene, RefSeq, GenBank incl. EST Acc., LocusLink
(separated by spaces), or simply type the numbers into the text box.
You will retrieve a table containing links to UniGene,
LocusLink, the human, mouse, and rat TIGR indices, and to the GO
database. You can save the output page in
*.html format using your Browser's "Save page..."
function and open it in applications like WORD or
EXCEL, having all the hyperlinks fully intact.
If the list is very long, it will be separated into multiple files !!!
OR: If you use the "Download Virtual"-function of
Resourcerer, you will get a tab-delimited txt-file of the table (single
file irrespective of its length) which you can import into EXCEL, but
which has NO hyperlinks. Please note that the output file
lists the array names which contain a gene of interest but NOT the
individual identifiers (like ProbeSet IDs in case of Affymetrix arrays);
please refer to the BioMart description for this purpose. Conclusion:
Only 4 species are supported (human, mouse, rat, zebrafish), and only
4 types of ID are
supported as input (GenBank, UniGene, LocusLink, RefSeq). There is no
way to input gene names or microarray IDs.
DAT7...see the functions and pathways where my gene set of interest is involved ? -> see PATH1 !
This question involves resources which are capable
of handling large datasets, like those coming from microarray
analyses, in a high-throughput manner in order to predict the biological
function or any other "over-represented" annotation terms of such
sets. Thus, these resources build a kind of bridge between the main
sections "Data Integration" and "Pathways, Interactions, Functions".
DAT8...produce a general annotation table for a gene set of interest ? (last update Dec. 21, 2005)
When performing e.g. microarray experiments, the
user ends up with lists of gene names or certain types of accession
numbers and is confronted with the question of annotating this gene
set. Thus, this question somehow represents the more general approach
to the matter discussed in DAT7, which focuses on
functions and pathways. Also, this question represents the "multi-gene
version" of DAT1 which deals with the annotation
of single genes. At the main pages, these resources are mainly
described in section "High-Throughput
Data Retrieval".
Tip! A very powerful resource for data retrieval is BioMart
(formerly known as "EnsMart"). NOTE that there are different
web interfaces for BioMart, please refer to the BioMart main section for details.
BioMart is a data retrieval tool that
generates lists of biological objects (e.g. genes, SNPs) from data held
in the Ensembl (and other) databases. In fact, the functionality is
based on filtering the whole set of genes in a genome via lists of
accession numbers, IDs (like Entrez Gene, RefSeq, MIM, InterPro, PDB,
GO, Affymetrix, and many more). BioMart can generate a number of different
types of output, including sequence and tabulated list data. Multiple
output formats, including HTML, text and Microsoft Excel, are also
supported. Note that EXCEL sheets maintain all hyperlinks !
In order to produce a general annotation table,
you may perform the following steps. At the Start Page you have
to select the organism and the database.
At the Filter Page, have a look at the dropdown menu "Entries
with following IDs", where you can provide your own list of
e.g. LocusLink IDs, MIM IDs, RefSeq IDs, Affymetrix
ProbeSet IDs (!), and many more. The entries can be
separated by commas or line breaks. At the Output Page, you
should choose the Features
Page, where you can select those IDs which you want to appear in
the final table. You should select items like general gene accessions
(Ensembl Gene ID, External Gene ID, Description), function-related
items (GO ID and description, MIM ID), disease attributes (OMIM),
protein domain attributes (Interpro, Pfam, Prosite), and protein family
IDs.
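The ID list pasted at the Filter Page can be tidied up beforehand with
a small script. The following Python sketch is only an illustration:
the function name normalize_id_list and the example accessions are
assumptions made here and are not part of BioMart. It splits a pasted
list on commas or line breaks, trims whitespace, and drops empty
entries and repeated IDs while keeping the original order.

```python
import re


def normalize_id_list(raw: str) -> list[str]:
    """Split a pasted ID list on commas or line breaks and clean it up."""
    seen = set()
    ids = []
    for token in re.split(r"[,\r\n]+", raw):
        token = token.strip()
        # Skip blanks and IDs that were already collected.
        if token and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


# Example accessions, used here purely as placeholders.
raw_ids = "NM_000546, NM_007294\nNM_000059,,NM_000546\n"

# One ID per line, ready to be pasted into the ID list box.
print("\n".join(normalize_id_list(raw_ids)))
```

The cleaned list is printed with one ID per line; a comma-separated
version would work equally well, since both separators are accepted.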
NOTE: BioMart
is transcript-centered, so if a gene has multiple transcripts,
you will get redundant lines (meaning duplicates, triplicates
etc. of genes) in your final EXCEL-sheet. Unfortunately, BioMart itself
offers no way to remove these automatically. So, if you need a gene list
where each gene is represented just once, you have to manually delete these
rows. Note that BioMart supports a long list of different IDs,
although some are still missing. A test run showed that there is no
option to search using mouse gene symbols or to output mouse gene
symbols (in contrast, this is no problem with human HUGO IDs).
Conclusion: In general, a versatile data
retrieval tool. In contrast to e.g. DAVID, there are no pathway-related
items (like KEGG or Biocarta). When exported into MS EXCEL format, all hyperlinks
to the different databases remain active ! A drawback is the
redundancy of entries.
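As an alternative to deleting the redundant rows by hand, the export
can be collapsed outside BioMart. The following is a minimal Python
sketch, assuming the table was exported as tab-delimited text. The
helper collapse_by_gene is hypothetical (not a BioMart function), and
the column names and IDs are placeholders that must be adapted to the
actual export. Rows sharing the same gene ID are merged into one row,
and differing values within a column (e.g. several transcript IDs) are
joined into a single cell.

```python
import csv
import io


def collapse_by_gene(rows, gene_column, sep="; "):
    """Merge rows that share a gene ID into one row per gene."""
    merged = {}  # gene ID -> {column name -> distinct values, first-seen order}
    for row in rows:
        columns = merged.setdefault(row[gene_column], {})
        for column, value in row.items():
            values = columns.setdefault(column, [])
            if value and value not in values:
                values.append(value)
    return [
        {column: sep.join(values) for column, values in columns.items()}
        for columns in merged.values()
    ]


# Placeholder export: GENE_A has two transcripts and therefore two rows.
export = (
    "Ensembl Gene ID\tEnsembl Transcript ID\tGO ID\n"
    "GENE_A\tTRANSCRIPT_1\tTERM_1\n"
    "GENE_A\tTRANSCRIPT_2\tTERM_1\n"
    "GENE_B\tTRANSCRIPT_3\tTERM_2\n"
)

rows = list(csv.DictReader(io.StringIO(export), delimiter="\t"))
for row in collapse_by_gene(rows, "Ensembl Gene ID"):
    print(row)
```

Joining the differing values keeps the transcript-level information
instead of silently discarding it. Note that a plain text export does
not carry the hyperlinks of the EXCEL output.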
Tip! DAVID
- The
Database for Annotation, Visualization and Integrated
Discovery integrates functional genomic
annotations with intuitive graphical summaries. DAVID provides a
comprehensive set of tools for investigators to visually summarize annotation
from large lists of genes, including those derived from
microarray and proteomic studies. DAVID is provided at NCI-Frederick and was developed to
support the bioinformatic needs at the . DAVID is
composed of several tools for the functional annotation and
classification of large gene sets. NOTE: There are no individual
URLs for the individual
applications. All have to be started from the DAVID main page. Please
refer also to the DAVID section at the
main page.
The Functional Annotation Tool
of DAVID can be used effectively to produce an annotation table of
a gene set of interest. You simply paste a list of identifiers of
your gene set, like Entrez Gene, Affymetrix, RefSeq, UniProt, GenBank,
or UniGene. DAVID automatically determines the input species and
produces the annotation summary results within a short time. NOTE:
Pop-up blockers should be turned off in order to ensure that the
program is running properly ! The user then selects from a long
list of
items (accessions) which ones to display in the final output table.
Note that this is quite similar to the selection of items at the
BioMart Output Page, but includes additional / different items like pathway
databases (KEGG, Biocarta,...). Examples include: Main acc.
(Entrez Gene, Affy,
GenBank, RefSeq,...); Other acc. (MEROPS, MGI,...); Gene Ontology (GO
terms can be selected from different GO levels, and from the 3
branches: biological process, molecular function, cellular component);
Protein Domains (Interpro, Pfam, SMART, COG, BLOCKS, PDB,...); Pathways
(KEGG, Biocarta, EC number); General Annotations (gene name, symbol,
OMIM,...); Functional Categories (COG Ontology, PIR keywords,...);
Protein Interactions (BIND, DIP, TRANSFAC,...); Literature (PubMed,
GeneRIF,...). The option "Export Selected Annotation as
Table" generates
either a txt file or an xls file of the selected items. Note
that the table is easier to read without selecting the function "Add
hyperlinks" !
Conclusion: In contrast to BioMart, only ONE ROW is
generated for ONE input gene, thereby avoiding the redundancy of rows
produced by BioMart. On the other hand, there are no active hyperlinks
in DAVID tables. In Excel, it is possible to simply sort the annotation
list by e.g. KEGG pathways or GO terms related to biological processes
in order to get an impression of the biological function of the
dataset.
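The same kind of overview can be obtained from the exported txt file
with a few lines of Python. This is a sketch under assumptions: the
column names ("ID", "KEGG_PATHWAY"), the comma as separator between
several terms in one cell, and the example rows are all hypothetical
and have to be adapted to the actual DAVID export. The helper
genes_per_term groups the input genes by annotation term and lists the
terms covering the most genes first.

```python
import csv
import io
from collections import defaultdict


def genes_per_term(rows, id_column, term_column, sep=","):
    """Group gene IDs by annotation term, most frequent terms first."""
    groups = defaultdict(list)
    for row in rows:
        cell = row.get(term_column) or ""
        for term in (part.strip() for part in cell.split(sep)):
            # Ignore empty cells and avoid listing a gene twice per term.
            if term and row[id_column] not in groups[term]:
                groups[term].append(row[id_column])
    # Sort by number of genes (descending), then alphabetically by term.
    return dict(sorted(groups.items(), key=lambda item: (-len(item[1]), item[0])))


# Placeholder table standing in for a tab-delimited export.
table = (
    "ID\tKEGG_PATHWAY\n"
    "GENE_1\tPATHWAY_A, PATHWAY_B\n"
    "GENE_2\tPATHWAY_B\n"
    "GENE_3\t\n"
)

rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))
for term, genes in genes_per_term(rows, "ID", "KEGG_PATHWAY").items():
    print(term, len(genes), ", ".join(genes))
```

This only counts how often a term occurs in the list; it is a quick
impression of the dataset, not a statistical test for
over-representation.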
Tip! The PANTHER (Protein ANalysis THrough Evolutionary
Relationships) Classification
System was designed to classify proteins (and their genes) in
order to facilitate high-throughput analysis. Proteins have been
classified according to families and subfamilies, molecular functions,
biological processes, and pathways. The high-throughput analysis tools
suitable for gene datasets are based on a PANTHER-specific gene
ontology as well as on pathway maps. There are several tools which
make up the PANTHER portal. Please refer to the PANTHER main section for detailed
information !
The Batch ID Search tool allows you to find
PANTHER-classified genes, transcripts, and proteins by uploading a list
of IDs. A list of IDs like gene symbols, gene IDs, protein accessions,
and more can be uploaded or pasted. The result list presents a kind of annotation
table which is mainly based on the PANTHER-specific GO
terms (PANTHER Biological Process, PANTHER Molecular Function), and on
the PANTHER Pathways. The result list can be displayed
in various formats,
like gene list, or transcripts/proteins list, or PANTHER Ontology
terms, Families, Pathways, and Pathway Components. Note that
this is also a quick option to display all genes corresponding to a
specific pathway or biological process ! This
list can be saved as txt file which is
best opened in MS Excel, but which is static (no hyperlinks).
Alternatively, you may simply copy/paste the whole list into
e.g. WORD, which maintains all hyperlinks! Note that via the "Species
Filter" tabs, it is possible to quickly generate the ortholog gene
datasets derived from the supported species !!! Please refer
to the PANTHER main section for
detailed information !
TOUCAN
is a workbench for
regulatory sequence analysis, especially for detecting significant
transcription factor binding sites across species. Please refer to the TOUCAN section at the main page for
details. Among many functions, TOUCAN provides a module to generate annotation
tables of gene lists. From
the TOUCAN menu, choose "Get_Seq", "From Ensembl", "Get Info", and hit
the "Update" button. This starts the process of retrieving the
information and creating the table, which can be followed in the
progress bar. Each row represents one gene of your dataset and each
column represents one kind of database identifier. This includes basic
gene, mRNA and protein IDs as well as sites of "data integration" like
Ensembl Gene, EMBL, HUGO, UniGene, UniProt, RefSeq, and LocusLink. In
addition, IDs in the fields of protein domains and motifs (Interpro),
as well as protein structure (PDB) are listed. As the number of
microarray platforms increases rapidly, the number of columns in
TOUCAN displaying chip-specific identifiers is also constantly growing.
Not only the world leader in this technology, Affymetrix, is
covered, but also other platforms. Finally, the table also presents IDs
which are related to functional information of each gene, like the
involvement in biological processes, or the localization to specific
cellular compartments (GO, MIM).
To look at this information in detail, you can highlight
all rows and copy / paste the data into programs like MS EXCEL.
Conclusion:
TOUCAN supports a long list of ID types as input / output (including
gene names, microarray IDs, and many more). A further positive
feature is that the table is non-redundant, displaying only one row for
one gene. Nevertheless, the table is "static"; there are no
hyperlinks.
The Stanford Online Universal
Resource for Clones and ESTs (SOURCE)
compiles information from several publicly accessible databases,
including UniGene, dbEST, Swiss-Prot, GeneMap99,
RHdb, GeneCards and LocusLink. The
mission of SOURCE is to provide a unique scientific resource that
pools publicly available data commonly sought after for any clone,
GenBank accession number, or gene. SOURCE
Batch Search is the batch extract interface for SOURCE.
You can choose between 3 organisms (human, mouse, rat). You
can input a list of IDs (! including ESTs !),
dbEST cloneIDs, UniGene ClusterIDs, UniGene gene names, UniGene gene
symbols, or LocusLink (now Entrez Gene) IDs, and retrieve
data from a check list, where you may choose e.g. UniGene Name, Symbol,
ClusterID, LocusLinkID, RefSeq mRNA, RefSeq Protein, GO Annotations,
and 2 fields extracted from Swiss-Prot: Enzymatic Function and
Subcellular Location. Note that the input list must be a text
file consisting of a single column, or pasted from e.g. EXCEL as a
single column. NOTE: Accession numbers are case-sensitive,
UniGene ClusterIDs MUST include the "species-prefix" (like
"Hs."), and gene names are not case-sensitive. You may copy the output text
and paste it into WORD (and then convert the text into a table), or in
EXCEL choose "Edit", "Paste Special", "Text".
Conclusion: Only
3 species are supported, and the list of possible IDs is very limited
(e.g. no option to use microarray IDs...). Also, the output is a
"static" text, there are no hyperlinks.
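The input requirements for UniGene ClusterIDs (single column, species
prefix, case sensitivity) can be checked before submitting a list. The
Python sketch below is an illustration only: prepare_cluster_ids is a
hypothetical helper, not part of SOURCE, and the cluster numbers are
made up. It assumes the UniGene prefixes "Hs.", "Mm." and "Rn." for
human, mouse and rat, adds the prefix to bare numbers, and sets aside
entries that do not match the selected organism.

```python
import re

# UniGene species prefixes for the three organisms offered by SOURCE.
UNIGENE_PREFIXES = {"human": "Hs.", "mouse": "Mm.", "rat": "Rn."}


def prepare_cluster_ids(values, organism):
    """Return (cleaned, rejected) UniGene ClusterIDs for one organism."""
    prefix = UNIGENE_PREFIXES[organism]
    pattern = re.compile(re.escape(prefix) + r"\d+")
    cleaned, rejected = [], []
    for value in values:
        value = value.strip()
        if not value:
            continue  # skip empty lines
        if value.isdigit():
            value = prefix + value  # bare number: add the species prefix
        # The match is case-sensitive, so "hs.1" is not accepted as "Hs.1".
        if pattern.fullmatch(value):
            cleaned.append(value)
        else:
            rejected.append(value)
    return cleaned, rejected


# Made-up cluster numbers, e.g. copied from one spreadsheet column.
pasted = ["Hs.1234", " 5678 ", "", "Mm.42", "hs.9"]
cleaned, rejected = prepare_cluster_ids(pasted, "human")

# One ID per line gives the required single-column text.
print("\n".join(cleaned))
print("Please check:", rejected)
```

Entries in the rejected list are reported rather than corrected
automatically, because a wrong prefix may also mean that the wrong
organism was selected.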
The program Resourcerer
at the TIGR webpage can also be
used for ID conversion. RESOURCERER provides annotation based on the TIGR Gene Indices (TGI)
for commonly available microarray resources, including widely used
clone sets and Affymetrix GeneChip Arrays. If you have a list of
accession numbers, click on the link "Batch
Search" and either upload a *.txt file containing your accession
numbers, like UniGene, RefSeq, GenBank incl. EST Acc., LocusLink
(separated by spaces), or simply type the numbers into the text box.
You will retrieve a table containing links to UniGene,
LocusLink, the human, mouse, and rat TIGR indices, and to the GO
database. You can save the output page in
*.html format using your Browser's "Save page..."
function and open it in applications like WORD or
EXCEL, having all the hyperlinks fully intact.
If the list is very long, it will be separated into multiple files !!!
OR: If you use the "Download Virtual"-function of
Resourcerer, you will get a tab-delimited txt-file of the table (single
file irrespective of its length) which you can import into EXCEL, but
which has NO hyperlinks. Please note that the output file
lists the array names which contain a gene of interest but NOT the
individual identifiers (like ProbeSet IDs in case of Affymetrix arrays);
please refer to the BioMart description for this purpose.
Conclusion: Only 4 species are supported (human, mouse, rat,
zebrafish), and only 4 types of ID are supported as input (GenBank,
UniGene, LocusLink, RefSeq). There is no way to input gene names or
microarray IDs.