
-> GENOMICS
-> GENOM1...scan my gene
(set) of interest for the presence of
SNPs
(Single Nucleotide Polymorphisms)? (last update Feb. 22, 2006)
-> GENOM2...search for
(also distant) orthologs/homologs of my gene/protein of interest ?
(last update May 5, 2006)
-> GENOM3...see which regulatory
elements are conserved in a set of orthologous promoters (Phylogenetic
Footprinting) ? ->
see GEN5 !
-> GENOM4...get all human proteins
present in Drosophila but not in C. elegans ? (last update Apr. 14, 2004)
-> GENOM5...know which genes of a
specific dataset are associated
with a disease ? (last
update Jun. 7, 2005)
-> GENOM6...identify Conserved Non-coding Sequences (CNS) and
conserved transcription factor binding sites in large
genomic
regions via comparative genomics ? (last update Mar. 14, 2006)
-> GENOM7...know all genes associated
with cardiovascular diseases having a described polymorphism in the
promoter region ? (last
update Jun. 7, 2005)
-> GENOM8...know the genes and drugs
related to diseases like atherosclerosis ? -> see CHEM2 !
-> GENOM9...analyze the expression of a
gene set of interest in cancer tissues ? (last
update Feb. 13, 2006)
-> GENOM10...determine the expression
profiles of normal vs. cancer tissues ? (last
update Feb. 13, 2006)
GENOM1...scan my gene (set) of interest for the presence of
SNPs
(Single Nucleotide Polymorphisms)? (last update Feb. 22, 2006)
Today, various databases store information on Single Nucleotide
Polymorphisms and make these data retrievable. In general, these
resources can be subdivided according to the type of source sequence
from which the SNPs are derived: coding sequences, whole transcript
sequences, or genomic sequences. In addition, they can be compared
by whether they support only single-gene queries or also provide a
batch-retrieval option for high-throughput dataset analyses.
1. SNPs in transcript sequences:
Tip!
Perhaps the best way to query for SNPs in a single transcript is to
use the following NCBI resources. Query Entrez
Gene with the gene name of interest, optionally specifying the
species of interest via the "Limits" tab, in order to retrieve the
specific Entrez Gene report (example: GeneID:5743,
PTGS2). The sidebar contains, among many others, links to SNP data of
this gene. You may either go directly to "SNP: GeneView" or first
open the link "SNP" and then hit "GeneView" in one of the SNP hits
(example: GeneView
of human PTGS2).
Alternatively, you may query Entrez SNP
with the gene name of interest which produces the same result page.
The GeneView can be easily modified in order to display either
only those SNPs located within the coding region or all SNPs located
within the gene (including the UTRs of the transcript), which are
distinguishable via a simple color code. Note that the promoter
region or other flanking genomic regions are not displayed
in GeneView. Please refer to section 2 below for these regions.
Entrez SNP incorporates the NCBI
SNP database into the Entrez database system. This database
stores all
SNP detail information, like method, submitter, a variation
summary, and a validation summary. The example Reference SNP rs20426
shows a SNP affecting the first amino acid of human PTGS2 (Met/Ile).
In general, you can also BLAST
the NCBI SNP database with your query sequence. You can thereby
choose the chromosome to BLAST against. The output gives a good
overview of SNPs found, and shows their precise position and
surrounding sequence.
If you want to get a "quick-view" of the SNPs
located in the
cDNA of your gene of interest, you may also query Ensembl for the
specific
gene name. When you open the respective Ensembl Gene Report
(example: ENSG00000073756
for PTGS2), you may open the link "Gene variation info" which opens the
GeneSNPView (example: PTGS2).
This view is somewhat similar to the SNP-GeneView at NCBI, displaying a
list of all SNPs in the transcript region. Note that a small
part of the flanking promoter region might be included in the SNP list
here.
In the Ensembl Gene Report, there is also a link to
the Ensembl Transcript Report (example: ENST00000186982
for PTGS2). At the cDNA sequence section, the user can now
select to
display also exons, codons, translations, and SNPs. There is a simple
color code in order to distinguish between SNPs which are located in
UTRs and those which are located in coding regions. The latter are
divided into those producing synonymous and those producing
non-synonymous changes in the coding sequence. Note that it is
not possible to click onto individual SNPs here in order to retrieve
detailed information, but you may use the link "Export SNP Info in
Region" in order to use BioMart to
generate the respective data file (see also below, as BioMart can also
be used for high-throughput retrieval of SNP data).
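Note: The distinction between synonymous and non-synonymous coding
SNPs mentioned above can be illustrated with a minimal Python sketch.
It assumes the standard genetic code and simply compares the amino
acid encoded by a reference codon with the one encoded by the variant
codon. The function name classify_coding_snp and the example codons
are chosen here for illustration only; they are not part of Ensembl,
and the ATG to ATA change is a generic Met/Ile example, not the
documented alleles of a specific SNP record.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed for codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Return (class, reference amino acid, variant amino acid)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    print(classify_coding_snp("ATG", "ATA"))  # Met -> Ile
    print(classify_coding_snp("CTT", "CTC"))  # Leu -> Leu
```

The sketch only covers SNPs inside the coding region; SNPs in UTRs
have no codon context and are therefore colored separately in the
Ensembl view.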
Glovar
is a project from the
Sanger Institute, which shows sequence variation in a genomic context.
Glovar aims to compare all public human reads in the trace repository
to the current genome build and shows the read alignments and SNPs
discovered plus other public SNPs from dbSNP,
alongside the latest Vega and Ensembl genes. Currently, Glovar is
available for the human
genome. The Glovar interfaces have an "Ensembl look and feel", like
the option to query using a multitude of identifiers. It is also
possible to browse single chromosomes. The output page (example: PTGS2)
is very similar to the GeneSNPView described above. Note: The
SNP data stored in Glovar are also available via
queries at the VEGA
Human Genome Browser. Please refer to the VEGA section for details.
H-Invitational
Database (H-InvDB) is a human gene database opened to the
public
in April 2004, which is hosted by the Japan Biological Information
Research Center (JBIRC)
and by the DNA Databank of Japan (DDBJ). The scope of
H-InvDB is to provide an integrative annotation of full-length cDNA
clones available from high throughput cDNA sequencing projects.
If you want to scan a cDNA sequence of interest for SNPs, you
may
perform a simple
keyword
query, and then look at the "cDNA view", which lists all
SNPs within the "cDNA information" section (example: PTGS2).
The links directly connect
to the NCBI SNP database. Please refer to the H-InvDB section at the Data Integration page for additional
information.
2. SNPs in genomic sequences (including transcript regions):
Note that the promoter
region or other flanking genomic regions are not displayed in
NCBI's GeneView of SNP records. If you are interested in these regions,
several resources are available.
The
UCSC Genome
Browser is an excellent site to retrieve all kinds of
genome-related data, also SNP data. Simply select the species of
interest and enter a gene name or any other identifier which is
supported, also chromosomal positions (example: human
PTGS2 gene). In order to display SNPs, simply set the option "SNPs"
within section "Variation and Repeats" to "full" or "pack" and hit the
"Refresh" button. All Reference SNPs will be shown in the selected
sequence window, color coded to reflect SNPs within transcribed regions
vs. those in intergenic regions and to reflect SNPs producing
synonymous vs.
non-synonymous changes in the coding sequence. Each SNP can be selected
to display all relevant database links, including a cross-reference to
NCBI dbSNP. You can easily analyze the promoter region of a gene by
adding a few kb at the respective end of the sequence window via the
"Position" field of the browser. It is also possible to highlight the
nt-positions of SNPs by selecting the "DNA" link for sequence
download and choosing the option "SNP" in the "Extended DNA Case/Color
Options". Finally, a tab-delimited table listing all SNPs in the
sequence window can be generated by using the Table
Browser and choosing the group "Variation and Repeats", track
"SNPs".
Note that the UCSC Genome Browser View of a
specific gene is also retrievable via the Entrez Gene entry at
NCBI, following the link "UCSC" in the sidebar.
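Note: Which end of the sequence window has to be extended in order to
include the promoter depends on the strand of the gene. The following
Python sketch shows the arithmetic under simple assumptions: 1-based
coordinates, a window that initially covers exactly the gene, and a
user-chosen number of upstream bases. The function name
extend_for_promoter and the coordinates in the example are
hypothetical and do not refer to a real gene.

```python
def extend_for_promoter(chrom, start, end, strand, upstream=2000):
    """Return a 'chrom:start-end' string that includes the upstream region.

    For a gene on the '+' strand the promoter lies before the start
    coordinate; for a gene on the '-' strand it lies after the end.
    """
    if strand == "+":
        start = max(1, start - upstream)  # never go below position 1
    elif strand == "-":
        end = end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return f"{chrom}:{start}-{end}"


if __name__ == "__main__":
    # Hypothetical gene on the minus strand of chr1.
    print(extend_for_promoter("chr1", 5000, 9000, "-"))
```

The resulting string can be pasted into the "Position" field of a
genome browser. The sketch does not check the chromosome length, so
very large extensions near a chromosome end should be verified by eye.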
Tip! The
Ensembl genome
browser may also be used for this purpose. Simply query for the
specific
gene name. When you open the respective Ensembl Gene Report
(example: ENSG00000073756
for PTGS2), you can retrieve the coordinates of the genomic location
which also links to the Ensembl genome browser view. In this browser
view, you can simply add a few kb within the position number fields in
order to display the (potential) promoter region. Then, you have to
select "SNP" from the "Features" drop down menu which will display all
SNPs color-coded which are located within the sequence region. Each SNP
can be displayed in detail in the so-called SNPView of Ensembl.
Again, you may use the link "Export SNP Info in Region" in
order to use BioMart
to generate the respective data file (see also below, as BioMart can
also be used for high-throughput retrieval of SNP data). A very
useful feature of this option is that BioMart generates tables
which can be opened in Excel and which maintain all hyperlinks. Note:
"Example 2" described under point 3 for sets of genes can also be
used for single genes.
3. High-throughput retrieval of SNP data:
Tip! A
very powerful resource for data retrieval is BioMart. BioMart
is a data retrieval tool that
generates lists of biological objects (e.g. genes, SNPs) from data held
in the Ensembl (and other) databases. Please refer to the BioMart section at the main page for
general details.
Example 1: BioMart can be used to retrieve
datasets of e.g. SNPs from large genomic regions. There are two
ways to do this. The first is to retrieve the
genomic region you are interested in via the species-specific genome
browser view starting at the Ensembl home page,
example: human
genome browser, where you can type in the from-to coordinates of
your region or search for a specific gene. Having the genomic region
displayed in the browser window, simply use the link "Export SNP
Info in Region" in order to proceed directly to the BioMart Output
page to generate the respective table. In BioMart, the user can
customize the types of SNP attributes which will appear in the final
table, like type of validation, genomic region and amino acid changes.
The second way is to start a "de novo"
BioMart session,
select "SNP" as dataset and the species like "Homo sapiens". At the Filter
page, you may now select the chromosome and genomic from-to positions,
as well as general SNP filter options, which include "validated
SNPs only", or "only SNPs with allele frequency data from north
american population". In addition, gene associated SNP filters
may be set like SNPs located in coding regions or in introns or in
5'flanking regions etc. Finally, a filter can be added to search only
for SNPs affecting amino acid changes or affecting splice sites, etc.
At the Output page, the options are the same as described
above. Note that if you want to download a multi-sequence
file of all SNPs you should select "Sequences" in the menu
"Attribute Page" on top of the Output page.
Example 2: BioMart can also be used to
retrieve SNP data from a set of genes (gene names, Entrez Gene
IDs, etc) which do not have to be located in a common genomic context.
For this purpose, start a BioMart session and
select e.g. "Ensembl" " Homo sapiens genes". At the Filter
page, paste a list of gene specific identifiers within the field "ID
list limit". In addition, you may filter for SNPs in specific regions
(coding etc), SNPs which result in aa changes, splice site changes etc,
and you may filter for validated SNPs only. Note that the
number of hits now relates to the number of input genes and not to the
number of SNPs found! Select "SNPs" in the menu "Attribute Page" on top
of the Output page. Here, you may finally select the attributes of the gene
associated SNPs which you want to retrieve. Note that
"gene associated" SNPs apparently also include SNPs located up to about
5 kb upstream of the transcribed region, meaning that SNPs affecting
the proximal promoter region should be contained in the final output.
Of course, this also works for SINGLE genes!
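Note: Because the hit count refers to genes and not to SNPs, it can be
helpful to regroup a downloaded table by gene. The following Python
sketch assumes that the table was saved as tab-delimited text with a
header line. The column names "gene_id" and "snp_id" and all
identifiers in the example are placeholders; the real column headers
depend on the attributes selected at the Output page.

```python
import csv
import io
from collections import defaultdict


def snps_per_gene(tsv_text, gene_col="gene_id", snp_col="snp_id"):
    """Group SNP identifiers by gene from a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(set)
    for row in reader:
        if row[snp_col]:  # skip genes listed without any SNP
            grouped[row[gene_col]].add(row[snp_col])
    return {gene: sorted(snps) for gene, snps in grouped.items()}


if __name__ == "__main__":
    # Made-up rows, only to show the expected layout.
    table = (
        "gene_id\tsnp_id\n"
        "GENE_A\trs1\n"
        "GENE_A\trs2\n"
        "GENE_B\trs3\n"
    )
    for gene, snps in snps_per_gene(table).items():
        print(gene, len(snps), ",".join(snps))
```

Genes without any SNP row do not appear in the result, so comparing
the keys with the original input list shows which genes returned no
SNPs.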
Tip! HapMart is a
data mining tool for retrieving data from the HapMap
database, storing data on genetic variations like SNPs. It is based on
the Biomart interface. Please refer to the main section of
HapMart for details concerning SNP databases, filter settings, and
output / download options.
4. SNP - Disease correlations: (please refer also to FAQs GENOM5 and GENOM7)
You may search
the HGMD, the Human Gene
Mutation Database, maintained by the Univ. of Wales (Cardiff) for
SNP-Disease correlations. HGMD
stores not only mutations in human genes but also curated
polymorphisms showing clear phenotypes, meaning polymorphisms
extracted from literature which have a significant effect. HGMD is also
directly linked via NCBI Entrez Gene entries. HGMD supports only single
queries. The output not only lists nucleotide substitutions affecting
coding regions but also affecting splice sites and regulatory
sequences.
Tip! GAD - Genetic Association
Database is an archive of human genetic association studies
of complex diseases
and disorders. GAD is maintained by the NIH
- National Institutes of Health. The goal of this database is to allow
the user to
rapidly identify medically relevant polymorphism from the large
volume
of polymorphism and mutational data, in the context of standardized
nomenclature. The data is from published scientific papers.
Study data is recorded in
the context of official human gene nomenclature with additional
molecular reference numbers and links. It is gene centered. That is,
each record is a record of a gene or marker. Please also refer
to the GAD section at the Data Integration page for a general
description.
GAD provides an extremely useful Batch
Search option, if you want to
quickly analyze whole sets of genes derived from e.g. microarray data
in order to see which of these genes are associated with a known
disease. You enter human official gene symbols (max. 300) as a list
(comma or space or tab
or new line separated) in the text area. If you have other types
of identifiers (accession, UniGene, etc.), you can go to batch
SOURCE to get the identifiers translated to official Gene Symbols.
Refer also to the SOURCE
Batch Search
section for instructions! Finally, you will retrieve a table
displaying the known disease
correlations of your genes of interest (categorized by "Aging",
"Cancer", "Cardiovascular", "Development", "Immune", and others) where
you may click on all
individual records. Single records, like the example
of TP53 and colorectal cancer, display a lot of links to
diverse databases, like allele description, polymorphism class, PubMed
links, pathway data, population data, as well as links to SNP databases
(NCBI SNP, HapMap, Map View). In addition, links to "experts in the
field" are provided. Note: If you choose the option Positive
Only, then the output list of your query
is reduced to those records describing a positive association
with a certain disease. Note: In general, GAD also stores
negative association data (often published only in obscure scientific
journals).
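Note: Gene lists derived from microarray data are often longer than
the 300 symbols accepted per batch search and may contain duplicates.
The following Python sketch shows one possible way to prepare such a
list: it splits the raw text on commas and any whitespace, removes
duplicates while keeping the original order, and cuts the list into
batches of at most 300 symbols. The function name prepare_gad_batches
is hypothetical and the sketch does not contact GAD itself.

```python
import re

MAX_SYMBOLS = 300  # batch limit stated for the GAD Batch Search


def prepare_gad_batches(raw_text, batch_size=MAX_SYMBOLS):
    """Split a raw gene symbol list into de-duplicated batches."""
    # Accept comma, space, tab or new line as separators.
    symbols = [s for s in re.split(r"[,\s]+", raw_text.strip()) if s]
    unique = list(dict.fromkeys(symbols))  # keeps first occurrence order
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


if __name__ == "__main__":
    batches = prepare_gad_batches("TP53, PTGS2\tAPOE\nTP53")
    for batch in batches:
        print(" ".join(batch))
```

Each batch can then be pasted into the text area separately. Symbols
are deliberately not converted to upper case, since some official
symbols contain lower-case letters.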
5. Prediction of functional impacts of SNPs:
Tip! The
tool cSNP
Analysis is part of the PANTHER
(Protein
ANalysis THrough
Evolutionary Relationships)
Classification
System. This tool estimates the likelihood that
a particular nonsynonymous coding SNP will cause a functional
impact on the protein. It calculates the subPSEC
(substitution position-specific evolutionary conservation) score based
on an alignment of evolutionarily related proteins, as described in Thomas
et al., 2003 and Thomas & Kejariwal, 2004. As input
simply paste a protein sequence and enter the substitution(s) relative
to this input sequence in the standard amino acid substitution format,
e.g. A265V. Multiple substitutions should be separated by a tab, space,
or return. You may use the integrated example protein APOE by
clicking on the "?" to get an impression of how the program works. The output
is a table containing the subPSEC scores which describe the
probability that a certain substitution will have a deleterious effect
or may even have a gain-of-function effect. The
user can click on the link on the number of the multiple sequence
alignment (MSA) position to view the column in the MSA where the
substitution occurs (red color).
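Note: Since the substitutions have to be given relative to the pasted
protein sequence, it is worth checking the list before submission. The
following Python sketch parses substitutions in the format described
above (e.g. A265V, separated by tab, space, or return) and verifies
that the stated original residue is really present at that position.
The function name parse_substitutions and the short toy sequence are
illustrative assumptions; the sketch does not compute subPSEC scores.

```python
import re

AMINO = "ACDEFGHIKLMNPQRSTVWY"
SUBSTITUTION = re.compile(rf"^([{AMINO}])(\d+)([{AMINO}])$")


def parse_substitutions(text, protein):
    """Return a list of (original, position, replacement) tuples.

    Positions are 1-based and refer to the given protein sequence.
    """
    parsed = []
    for token in text.split():  # splits on tab, space and new line
        match = SUBSTITUTION.match(token.upper())
        if match is None:
            raise ValueError(f"not a valid substitution: {token!r}")
        ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
        if pos < 1 or pos > len(protein):
            raise ValueError(f"{token}: position outside the sequence")
        if protein[pos - 1].upper() != ref:
            raise ValueError(
                f"{token}: sequence has {protein[pos - 1]} at position {pos}"
            )
        parsed.append((ref, pos, alt))
    return parsed


if __name__ == "__main__":
    toy_protein = "MKAV"  # toy sequence, not a real protein
    print(parse_substitutions("M1I\tA3V", toy_protein))
```

A mismatch between the stated residue and the sequence usually means
that the substitution was numbered against a different isoform or
against a precursor sequence.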
GENOM2...search for
(also distant) orthologs/homologs of my gene/protein of interest ?
(last update May 5, 2006)
Please note that in general, there are several
approaches to deal with this question. Of course, you can perform simple
homology searches like BLAST using your nucleotide (BLASTN) or your
protein (BLASTP or
TBLASTN) query along with the appropriate databases, and then
"manually" screen the results. In addition, there exist some strategies
for the detection of distant homologies, which are described
below. Related programs are also found in the section "Sequence
Similarity" at the main pages. Finally, there are a number of
databases
which "pre-compute" these sequence similarities, provide
cross-species
information on orthologous/homologous genes/proteins, and try to
create protein families. These sites, which are described below, are
categorized either under "Genomics"
("Comparative Genomics", "Phylogenetics") or under "Proteins" ("Domains, Families") at the main
pages.
Note: If you want a
very quick and easy approach to this question, I would recommend the UCSC Gene Sorter,
described under "Protein Families" below!
1. Refined homology detection via multiple sequence alignments:
1.1. Starting with SINGLE protein sequences:
A well known strategy for the detection of distant
homologies is to search not only with one query sequence but to first
generate a multiple sequence alignment of close sequence
neighbours and then perform database searches using the consensus
derived from such alignments. This procedure emphasizes the importance
of conserved residues in the sequence and assumes that distantly
related proteins also somehow preserve this conservation. In a second
step (see below) the retrieved sequence homologs may again (manually)
be aligned to a (now larger) sequence alignment.
Tip! PSI-BLAST
is a Protein-BLAST derivative available at the NCBI BLAST page.
PSI-BLAST takes a single protein sequence as input. Position-Specific
Iterative (PSI)
BLAST is a program based on the BLAST 2.0 algorithm that is designed to
detect weak relationships between the query and
members of the database not necessarily detectable by standard BLAST
searches. The added sensitivity of this program over regular BLAST
comes from the use of a profile
that is constructed (automatically) from a multiple alignment
of the highest scoring hits in the initial BLAST search. The profile
is generated by calculating position-specific scores for every position
in the alignment. A highly conserved position will receive a high score
and weakly conserved positions receive scores near zero. The profile is
then used to perform additional BLAST searches (called iterations)
and the results of each iteration are used to refine the profile. You
may
also check the BLAST
guide for information. Usually, after a few iterations, no new
sequences (marked in order to be quickly identified) are added to the
list of reliable homologs which show E-values below a certain
threshold (default E=0.005).
If you want to, you may now select those "reliable"
entries, and download this sequence set in FASTA format via the
"Get selected sequences" button at the bottom of the output page.
Note: Please refer to the main PSI-BLAST section for details on
multiple alignment construction, score matrices, sequence weights, and
BLAST applied to a PSSM.
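Note: The idea of position-specific scores can be illustrated with a
strongly simplified Python sketch. It builds a profile from a small
gap-free multiple alignment by computing, for every column, the
log-odds of each amino acid against a uniform background, using a
fixed pseudocount. These are simplifying assumptions made here for
illustration: PSI-BLAST itself uses sequence weighting, realistic
background frequencies and a more elaborate pseudocount scheme (see
the main PSI-BLAST section). The toy alignment and the function names
build_profile and score are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, segment):
    """Sum the position-specific scores of a segment of profile length."""
    return sum(profile[i][aa] for i, aa in enumerate(segment))


if __name__ == "__main__":
    toy = ["MKV", "MRV", "MQV", "MEV"]  # columns 1 and 3 fully conserved
    prof = build_profile(toy)
    print(round(prof[0]["M"], 2), round(prof[1]["K"], 2))
    print(round(score(prof, "MKV"), 2), round(score(prof, "AKA"), 2))
```

In the toy alignment the fully conserved first column gives a clearly
higher score for M than the variable second column gives for any of
its residues, which is the behaviour described above.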
BLASTView
of Ensembl provides an integrated platform for sequence similarity
searches
against Ensembl databases. As
there
is a very user-friendly selection of the search sensitivity, ranging
from
"exact matches" to "distant homologies", it is possible to use
this
program also for high-sensitivity searches. Nevertheless, the usage of
NCBI's PSI-BLAST is much clearer and the output offers better options
of data analysis and download.
The BLOCKS
server
is a service for biological sequence analysis at the Fred Hutchinson Cancer Research Center
in Seattle, Washington, USA. "Blocks" are short multiply aligned
ungapped segments
corresponding to the most highly conserved regions of proteins. BLOCKS
server provides several methods to search a single sequence
against the BLOCKS and Prints databases.
Block
Searcher
compares a protein or DNA sequence to the current database of protein
blocks. Typically, a group of proteins has more than one region in
common and
their relationship is represented as a series of blocks separated by
unaligned regions. If a second block for a group also scores highly in
the search, the evidence that the sequence is related to the group is
strengthened. It is advisable to search both Blocks+
and Prints. Blocks+ has automatically generated blocks, while
Prints has hand-crafted blocks.
IMPALA
(Integrating Matrix
Profiles And Local Alignments) searches a protein query sequence
against a multiple alignment database represented as a collection of
PSI-BLAST checkpoint files. IMPALA has been implemented on the Blocks
Server to search a blocks database, such as Blocks+. Although the
Blocks Searcher performs a
similar
type of search, there are differences between IMPALA and the Blocks
Searcher in the PSSMs used, in the alignments reported, and in the
calculation of statistics that can lead to somewhat different results.
Therefore, any marginal similarity detected with one searching program
should be confirmed using the other. Both programs generally detect
true positive hits but they tend to report different false positives,
and so any hit not detected by both searching programs should be
regarded with caution.
RPS-BLAST
(Reversed
Position Specific BLAST) is one of the BLAST series of searching
programs
from
NCBI. RPS-Blast uses the query sequence to search a database of
pre-calculated PSSMs (in this case PSI-BLAST checkpoint files made from
multiple alignments of protein families) and report significant hits in
a single pass. The role of the PSSM has changed from "query" to
"subject", hence the term "reverse" in RPS-Blast. Here the checkpoint
files are made from the Blocks and Prints alignments,
and are the same files searched with the similar IMPALA
searching program.
1.2. Starting with MULTIPLE unaligned protein sequences (if you already
know the closest relatives to your sequence):
Tip! Block
Maker, which is also part of the BLOCKS server, finds conserved
blocks in a group of two or
more unaligned protein sequences, which are assumed to be related,
using two different algorithms. At least two
related protein sequences (and a
maximum of 250) must be provided to
make blocks. Each sequence must have a unique name of 10
characters or less. If you have the accession numbers of some sequences
you would like to use, NCBI Entrez
or similar programs can create a file for you in FASTA format.
The main difference to programs like ClustalW
is that Block Maker does not perform a global alignment covering the
complete sequences but selects for the sequence blocks which are best
conserved. Block Maker output allows a series of follow-up
analyses, like the display of phylogenetic trees, or sequence
logos (see below).
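Note: The input requirements of Block Maker (between 2 and 250
sequences, each with a unique name of 10 characters or less) can be
checked before submission. The following Python sketch does this for
FASTA-formatted text. It assumes that the sequence name is the first
whitespace-delimited word of each header line; the function name
check_block_maker_input is hypothetical and not part of the BLOCKS
server.

```python
def check_block_maker_input(fasta_text, max_name_len=10,
                            min_seqs=2, max_seqs=250):
    """Return a list of problems found in the FASTA headers (empty if none)."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")

    problems = []
    if not min_seqs <= len(names) <= max_seqs:
        problems.append(
            f"expected {min_seqs}-{max_seqs} sequences, found {len(names)}"
        )
    for name in names:
        if not name or len(name) > max_name_len:
            problems.append(f"name {name!r} must have 1-{max_name_len} characters")
    for name in sorted({n for n in names if names.count(n) > 1}):
        problems.append(f"name {name!r} is not unique")
    return problems


if __name__ == "__main__":
    example = ">seq_one\nMKVALLAG\n>seq_two\nMRVALIAG\n"
    print(check_block_maker_input(example) or "input looks fine")
```

Long database headers, as produced by many download tools, typically
fail the name length check and should be shortened by hand.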
Note: If you already have generated a
multiple
alignment by programs like ClustalW, please use the Blocks
Multiple
Alignment Processor instead (see below), as test runs showed
that this further improved the significance of the extracted blocks.
1.3. Starting with a pre-made sequence ALIGNMENT of related protein
sequences:
Tip! You
may, for example, take the sequence set produced by PSI-BLAST and
perform a ClustalW
multiple sequence alignment. Please refer also to FAQ SIM3 for this purpose. You can now
investigate conserved regions either via the "built-in" Java applet JalView
or by coloring the alignment using Boxshade
(see FAQ GRAPH1). Alternatively,
you may submit the ClustalW alignment to the Blocks
Multiple
Alignment Processor which carves out ungapped regions between
the requested minimum and maximum widths of the submitted multiple
alignment and converts them to Blocks format, thus again providing all
the follow-up analyses as described below.
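Note: The principle of carving ungapped regions out of a gapped
multiple alignment can be sketched in a few lines of Python. The
sketch below collects runs of consecutive columns that contain no gap
character in any sequence, splits runs that are wider than the
maximum width, and discards pieces narrower than the minimum width.
It is a simplified illustration and not the algorithm of the Blocks
Multiple Alignment Processor, which may apply additional criteria.
The function name ungapped_blocks, the default widths and the toy
alignment are assumptions.

```python
def ungapped_blocks(alignment, min_width=3, max_width=60, gap_chars="-."):
    """Return (start column, list of segments) for each ungapped region.

    Start columns are 0-based positions in the alignment.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    blocks = []
    run_start = None
    for col in range(length + 1):  # one extra step closes a run at the end
        gap_free = col < length and all(
            seq[col] not in gap_chars for seq in alignment
        )
        if gap_free:
            if run_start is None:
                run_start = col
        elif run_start is not None:
            # Split the run into pieces of at most max_width columns.
            for start in range(run_start, col, max_width):
                end = min(start + max_width, col)
                if end - start >= min_width:
                    blocks.append((start, [seq[start:end] for seq in alignment]))
            run_start = None
    return blocks


if __name__ == "__main__":
    toy = ["MKV-LLAG", "MRV-LIAG", "MQVALL-G"]
    for start, segments in ungapped_blocks(toy, min_width=2):
        print(start, segments)
```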
1.4. Follow-up analyses using conserved multiple aligned
protein sequences:
The output of programs like Block Searcher, IMPALA,
RPS-BLAST, Block Maker, or Blocks Multiple
Alignment Processor is twofold.
First, there are several ways to display
the conserved sequence blocks and the general relationships between
the sequences. The blocks may be shown in different textual formats as
well as in a graphical map. The conservation profile of blocks can be
visualized as sequence logos
(please refer to the main section
"Sequence Logos" and to FAQ SIM4
!). Also, phylogenetic
trees may be constructed from the input sequences. Finally, 3D Blocks is a search
and display tool which allows you to view blocks on a
protein structure (if available).
Second, the extracted blocks themselves may
be used to search protein sequence databases for additional
homologs. This can be done by either using the whole blocks (multiple
alignments) as query, like in the programs LAMA and MAST, or by using
the deduced consensus sequence, as in the program COBBLER.
LAMA
(Local Alignment of
Multiple
Alignments) is a program for comparing protein multiple sequence
alignments with each other. The program can search databases of
such multiple alignments. The search is for sequence similarities
between conserved regions of protein families. The method is sensitive,
detecting weak sequence relationships between protein families.
Sequence similarities beyond the range of conventional sequence
database searches can be detected by the method.
COBBLER
means COnsensus
Biasing
By Locally Embedding Residues. A single sequence is selected from a set
of blocks and enriched by replacing the conserved regions delineated by
the blocks by consensus residues derived from the blocks. Embedding
consensus residues improved performance with readily available single
sequence query searching programs, such as BLAST and FASTA, in
comprehensive tests. This is especially useful in PSI-BLAST
searches.
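Note: The embedding step can be pictured with the following minimal
Python sketch. For every block it derives a consensus by taking the
most frequent residue in each column and then overwrites the
corresponding region of the selected sequence with this consensus.
The plain majority vote is a stand-in chosen here for simplicity and
is an assumption; the way COBBLER actually selects the sequence and
derives consensus residues is described in the BLOCKS documentation.
The function names and the toy data are hypothetical.

```python
from collections import Counter


def consensus(segments):
    """Most frequent residue per column of an ungapped block."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*segments)
    )


def embed_consensus(sequence, blocks):
    """Replace block regions of `sequence` by block consensus residues.

    `blocks` is a list of (start, segments) pairs, where `start` is the
    0-based position of the block within `sequence`.
    """
    residues = list(sequence)
    for start, segments in blocks:
        cons = consensus(segments)
        residues[start:start + len(cons)] = cons
    return "".join(residues)


if __name__ == "__main__":
    toy_sequence = "MKVALLAG"
    toy_blocks = [(0, ["MKV", "MRV", "MRV"]), (4, ["LL", "LI", "LL"])]
    print(embed_consensus(toy_sequence, toy_blocks))
```

Regions outside the blocks are left unchanged, so the resulting
sequence can be used as an ordinary single-sequence query.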
2. Comparative Genomics, Phylogenetics:
Tip! BLink
(BLAST Link) is an extremely useful source of information
concerning protein homologs and orthologs across
multiple species. BLink is not available as a stand-alone program
but as a link in each protein record stored in NCBI Entrez, see example
report. Note that in case there is no direct link to BLink
from Entrez
Gene, you may first open the respective HomoloGene
link, and then jump to BLink from the protein records displayed in
HomoloGene. BLink entries are based on pre-computed sequence
alignments, generated from routine all-against-all BLAST comparisons
performed at NCBI. The best 200 of these alignments can be displayed.
BLink reports are highly customizable. Conserved protein domains
are shown on top of the alignment, with links to the NCBI CDD
database. The alignments are depicted graphically and are
color-coded on the basis of taxonomic origins (having some "look and
feel" of the COG
database). Each alignment is displayed in NCBI's BLAST 2 sequences
format. All protein hits can also be displayed in their
specific BLink reports. The "Best Hits" format button displays
ONLY THE BEST HIT IN EACH SPECIES, allowing a very quick access to the
potential orthologs of
a protein in other species. The "Common Tree" button displays
the
BLAST hits along the branches of the taxonomic tree, allowing for
selection of individual species. The "Taxonomy Report" button
lists the BLink results as a BLAST taxonomy report. The "3D
structures" button limits the output to those sequences derived
from structure records (linked via the colored dots in the "3D
structures" display). The "CDD search" button links to a
pre-computed conserved domain display for the query sequence.
Note that BLink is also integrated in the "data
super-integration tool" Bioinformatic
Harvester of the EBI.
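Note: If you work with your own list of BLAST hits, the effect of a
"best hit per species" filter can be reproduced with a short Python
sketch. It keeps, for every species, only the hit with the highest
score. The record layout (dictionaries with the keys "id", "species"
and "score") and all values in the example are made up for
illustration and do not reflect the BLink output format.

```python
def best_hit_per_species(hits):
    """Return {species: best hit}, judged by the highest score."""
    best = {}
    for hit in hits:
        current = best.get(hit["species"])
        if current is None or hit["score"] > current["score"]:
            best[hit["species"]] = hit
    return best


if __name__ == "__main__":
    # Invented hits, only to show the data layout.
    hits = [
        {"id": "prot_1", "species": "mouse", "score": 950},
        {"id": "prot_2", "species": "mouse", "score": 610},
        {"id": "prot_3", "species": "zebrafish", "score": 720},
    ]
    for species, hit in best_hit_per_species(hits).items():
        print(species, hit["id"], hit["score"])
```

Keep in mind that the best hit in a species is only a candidate
ortholog; a reciprocal search from the hit back to the query species
is a common additional check.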
Tip! HomoloGene
is an NCBI resource of curated and calculated orthologs for
genes as represented by UniGene or by annotation of genomic sequences.
The calculated homologs are the result of nucleotide
sequence comparisons between each pair of organisms, maintained
by the database. For the comparisons, EST and mRNA sequences from
UniGene are used, as well as transcripts extracted from annotation
of genomic sequences. The list of available organisms is shown
at the HomoloGene start page. HomoloGene can be queried via keyword
search (gene names, symbols, accessions), but not via sequence
search. HomoloGene entries are also linked from the Entrez Gene pages of
individual genes. HomoloGene is now integrated in NCBI's Entrez database system.
The NCBI databases COG (Clusters of
Orthologous Groups of proteins) and KOG
(Eukaryotic orthologous groups) were constructed by comparing protein
sequences encoded in
a list of complete prokaryotic (COG) and eukaryotic (KOG)
genomes. KOG covers species like human, Drosophila, C.elegans,
Arabidopsis, and S.cerevisiae. In general, please also refer to
the COG and KOG database descriptions
at the main page for detailed information. The underlying
premise is that orthologs are more similar to each other than
they are to any other protein from the respective genomes ("reciprocal
best hits"). In multiple-genome comparisons, pairs of potential
orthologs can be joined to form clusters of orthologs. Note
that a COG by definition consists of proteins from at least 3
sufficiently distant species ("3 clades").
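Note: The "reciprocal best hits" premise can be made concrete with a
small Python sketch. Given pairwise similarity scores for genome A
searched against genome B and vice versa, it reports those pairs in
which each protein is the best hit of the other. This only illustrates
the pairwise step under simple assumptions (one score per protein
pair, higher is better); the construction of COGs from such pairs
across at least three species is not covered. The function names and
the protein names in the example are invented.

```python
def best_hits(scores):
    """Map each query to its best-scoring subject.

    `scores` is a dict {(query, subject): score}, higher is better.
    """
    best = {}
    for (query, subject), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (subject, value)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in forward.items() if backward.get(b) == a
    )


if __name__ == "__main__":
    # Invented scores for two small "genomes".
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 90, ("a2", "b1"): 150}
    b_vs_a = {("b1", "a1"): 205, ("b1", "a2"): 140, ("b2", "a1"): 95}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the example, a2 has b1 as its best hit, but b1 prefers a1, so only
the pair a1/b1 is reported as a potential ortholog pair.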
If we just concentrate on the KOG database,
there are several ways to access these data using
your protein query sequence. One which might be already known from
other questions is to perform a CD-Search
at the NCBI database CDD
(Conserved Domain database). Previously, CDD contained protein domains
from
Smart, Pfam, and NCBI-specific data, but later was
updated to also show similarities to existing COGs or KOGs.
The output links reveal multiple alignments, as well as direct access
to the COG and KOG database entries.
The Eukaryotic
Gene Orthologs (EGO), previously called TIGR Orthologous Gene
Alignments (TOGA), is a database for orthologous genes in eukaryotes.
EGO is generated by pair-wise comparison between the Tentative
Consensus (TC) sequences (contig sequences of individual EST clusters)
that comprise the TIGR Gene
Indices from individual organisms. The EGO database can be accessed
through the SEARCH
function. You can perform a BLAST search or search using gene names or
TIGR accessions. If available, you will retrieve a "tentative
ortholog" accession, which groups predicted orthologs from a list
of species. In addition, a ClustalW multiple sequence alignment of
these orthologous cDNA sequences is displayed.
Please note that a special feature of EGO is the
search for "Orthologs
of human disease genes". For this purpose, human disease genes in the
Online Mendelian Inheritance in Man (OMIM)
database were matched to a TIGR Human Gene Index accession (THC number),
and orthologs of these disease genes were identified using the EGO
database. You can query using OMIM or LocusLink ID, gene name and
various types of accession numbers.
PhyloBLAST
is a program to perform
a molecular phylogenetics analysis of a protein sequence.
PhyloBLAST accepts only protein sequences. PhyloBLAST uses
BLASTP to find related amino acid sequences in the Swiss-Prot database.
The first result is a "BLAST style" graphic including all pairwise
alignments. You may select those sequences desired, for a full
phylogenetic analysis, starting with a ClustalW multiple sequence
alignment. A choice of Phylip programs, including parsimony,
UPGMA, neighbor joining and distance matrix methods, produces a phylogenetic tree.
3. Protein Families:
Tip! The
UCSC Gene Sorter
is an excellent resource for exploring gene families and the
relationships among genes. This tool displays a table of genes
within a selected genome that are related to one another. Several
different relationships may be explored: protein-level homology,
similarity of gene
expression profiles, or genomic proximity. The Browser supports
searches on a variety of terms and phrases, including
the gene name, the SwissProt protein name, a GenBank accession, or a
word or phrase present in a gene's description. The gene
family display is highly configurable, allowing the user
to control the order and number of columns, the number of rows,
and the genes displayed. Please refer also to the UCSC Gene Sorter main section
for details!
Concerning this specific question, you have
3 options to sort by protein homology
(BLASTP, Rankprop, PSI-BLAST), and then you can easily download the
sequences via the "sequence" button. NOTE: UCSC Gene Sorter
primarily looks for homologs within the SAME SPECIES! BUT: Via
"configure" it is possible to
display
the gene orthologs (best BLASTP hits in Ensembl) of a list of
species,
like mouse, zebrafish, drosophila, C.elegans, and yeast. The tool
provides
several output formats, including a simple tab-delimited
format that may be imported into a spreadsheet or a relational
database. In addition, the sequences of the displayed genes
can be downloaded: cDNA, protein, genomic and promoter (!)
sequences,
allowing a user-definition of upstream and downstream regions.
Tip! The
Ensembl project also
provides tools to look if your query belongs to a predicted protein
family. First, you have to get the data file of your gene of
interest, which you can do either using TextView or by BLAST (sequence)
search. Note that for this step, it is best to search "All" indices,
and not only the "Family" index. At the "Ensembl Gene Report",
you
mostly will find "gene(s) that have been identified as putative
homologues by reciprocal BLAST analysis" from other species. At the "Transcript
Summary", you often will see a link to a predicted "Protein
Family", having a unique "ENSF..." accession number. Following
this
link, you will get lists and multiple alignments of these protein
sequences, offering many download options (ClustalW, FASTA,
MSF,...). Note that in contrast to the UCSC Gene Sorter, you
will retrieve essentially the orthologs ("reciprocal best
hits") and homologs with high sequence similarity, but not other, more
distantly related genes/proteins.
Pfam
is a large collection of multiple sequence alignments and
hidden Markov models covering many common protein domains. For each
family in Pfam you can: Look at multiple alignments, view
protein domain architectures, examine species distribution, follow
links to other databases, and view known protein structures. Pfam
is a database of two parts. Pfam-A is the curated part
of Pfam containing over 7255 protein families. To give Pfam a more
comprehensive coverage of known proteins a supplement called Pfam-B
is automatically generated. This contains a large number of small
families taken from the PRODOM
database that do not overlap with Pfam-A. Although of lower quality,
Pfam-B families can be useful when no Pfam-A families are found. There
are several ways to query Pfam, like Protein
sequence search, DNA
sequence search, or Keyword
Search. You will retrieve lists of matching Pfam-A and -B hits, as
well as pairwise alignments to your query sequence.
TIGRFAMs
are a collection of protein families featuring curated multiple
sequence alignments, Hidden Markov Models (HMMs) and associated
information designed to support the automated functional
identification of proteins by sequence homology. Classification by
equivalog family, where achievable, complements classification by
orthologs, superfamily, domain or motif. For each TIGRFAM, you can view the curated
seed alignment, the full alignment of all family
members and the cutoff scores for inclusion. You can
query TIGRFAMs by Text Search or by sequence using the option Sequence Search. By default,
both databases (TIGRFAMs and PFAM) are searched. TIGRFAMs are also
automatically searched when performing InterPro
queries. PFAM is a collection of HMM
models of protein families complementary to TIGRFAMs. PFAM models are
constrained to be non-overlapping with one another and thus are
more likely to describe domains rather than full-length
proteins.
Superfamily
is a server that provides structural (and hence implied functional)
assignments to protein sequences at the superfamily level. This
server does not attempt (at present) to distinguish between families
within superfamilies, but is able to detect the broader and more
distant relationships at the superfamily level. A superfamily contains
all proteins for which there is structural evidence of a common
evolutionary ancestor. The server can be entered in three ways: begin
with a sequence (search the library); begin with a superfamily
(select from SCOP); or begin with a genome (select from list).
The output for the first case will give you a list of hits
that your sequences make to models belonging to superfamilies, their alignments
to the model, and assigned genome sequences (a very instructive list
of genomes in which a certain superfamily has already been
described !).
GENOM3...see which regulatory
elements are conserved in a set of orthologous promoters (Phylogenetic
Footprinting) ? ->
see GEN5 !
The prediction of Transcription
Factor Binding Sites (TFBS) in a single promoter produces many
false positives. This can be drastically improved by comparing this
promoter to the corresponding (orthologous) promoters of the same
gene in other species. Those sites which are conserved in evolution
are most likely to have functional importance. As many overlaps exist,
this question is treated along with FAQ GEN5.
GENOM4...get all human proteins
present in Drosophila but not in C. elegans ? (last update Apr. 14, 2004)
This question addresses the general matter of
identifying protein sets which are expressed specifically in certain
species but not in others. A prerequisite to find a solution is the
construction of databases which reliably cluster orthologous proteins,
in order to be able to search which clusters of proteins contain
specific species while excluding others. Of course, this is not a
trivial task, as clearly discussed in Tatusov et al., 2003.
The NCBI databases COG (Clusters of
Orthologous Groups of proteins) and KOG
(Eukaryotic orthologous groups) were constructed by comparing protein
sequences encoded in
a list of complete prokaryotic (COG)
and eukaryotic (KOG) genomes. KOG covers species like human,
Drosophila, C.elegans, Arabidopsis, and S.cerevisiae. The
underlying premise is that orthologs are more similar to each
other than
they are to any other protein from the respective genomes ("reciprocal
best hits"). In multiple-genome comparisons, pairs of potential
orthologs can be joined to form clusters of orthologs. Note
that a COG by definition consists of proteins from at least 3
sufficiently
distant species ("3 clades"). In general, please also refer
to the COG and KOG database descriptions
at the main page for detailed information, and to FAQ GENOM2 for additional remarks.
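The "reciprocal best hits" premise can be sketched in a few lines. This is only an illustration under stated assumptions: the similarity scores are given as plain dictionaries (in practice they would come from all-against-all BLAST runs), the protein names are hypothetical, and the real COG/KOG procedure includes further steps (e.g. merging pairs into clusters across at least 3 clades) that are not shown here.

```python
def best_hits(scores: dict[tuple[str, str], float]) -> dict[str, str]:
    """Map each query protein to its highest-scoring hit in the other genome."""
    best: dict[str, tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: dict[tuple[str, str], float],
    b_vs_a: dict[tuple[str, str], float],
) -> list[tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores: "hs*" = human proteins, "dm*" = Drosophila proteins.
human_vs_fly = {
    ("hsA", "dmA"): 250.0,
    ("hsA", "dmB"): 90.0,
    ("hsB", "dmB"): 180.0,
    ("hsC", "dmB"): 120.0,
}
fly_vs_human = {
    ("dmA", "hsA"): 245.0,
    ("dmB", "hsB"): 175.0,
    ("dmB", "hsC"): 110.0,
}
ortholog_pairs = reciprocal_best_hits(human_vs_fly, fly_vs_human)
```

In this toy example, `hsC` finds `dmB` as its best hit, but `dmB` prefers `hsB`, so `hsC` is not reported as a potential ortholog.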
The list of KOGs
is represented by tables, which display the numbers of proteins present
in the different eukaryotic species by a letter- and color-code, and
the deduced number of KOGs in the last column. If we are looking for
human proteins present in Drosophila but not in C. elegans, we have to
look at the respective rows marked by "H" and "D" but not "C". The
corresponding links reveal lists of proteins (KOGs) present in the
selected species. More precisely, there is a table called TWOGs,
which directly
lists clusters represented by only 2 species. It should be
noted that
the interfaces could still be improved;
e.g. there is still no option to simply restrict the output lists to specific
combinations of 2 species.
Please note that there is also a comparable tool for
prokaryotic genomes, named phylogenetic
search tool. The choices are "dc" ("don't care"): COG may
or may not contain this organism; "yes": COG must
contain this organism; "no": COG must not contain this
organism. The list of results will be the subset of COGs that fits the
pattern indicated.
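The "yes" / "no" / "dc" logic of such a phylogenetic pattern search is easy to reproduce on a downloaded cluster table. The sketch below assumes that each cluster is already available as a set of one-letter species codes (H, D, C as in the KOG tables; "A" for Arabidopsis is an assumption here), and the cluster IDs are hypothetical placeholders.

```python
def matches_pattern(species_in_cluster: set[str], pattern: dict[str, str]) -> bool:
    """True if the cluster fits the pattern; species absent from the pattern are "dc"."""
    for species, requirement in pattern.items():
        if requirement not in ("yes", "no", "dc"):
            raise ValueError(f"unknown requirement: {requirement!r}")
        present = species in species_in_cluster
        if requirement == "yes" and not present:
            return False
        if requirement == "no" and present:
            return False
    return True


def select_clusters(clusters: dict[str, set[str]], pattern: dict[str, str]) -> list[str]:
    """IDs of all clusters that fit the pattern, sorted alphabetically."""
    return sorted(cid for cid, members in clusters.items() if matches_pattern(members, pattern))


# Hypothetical clusters; the IDs are placeholders, not real KOG accessions.
toy_clusters = {
    "KOG_x1": {"H", "D", "C"},
    "KOG_x2": {"H", "D"},
    "KOG_x3": {"H", "D", "A"},
    "KOG_x4": {"D", "C"},
}
# Human proteins present in Drosophila but not in C. elegans.
selected = select_clusters(toy_clusters, {"H": "yes", "D": "yes", "C": "no"})
```

Because every species not named in the pattern is treated as "don't care", `KOG_x3` is selected even though it also contains a fourth species.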
GENOM5...know which genes of a
specific dataset are associated
with a disease ? (last
update Jun. 7, 2005)
Many high-throughput methods like microarrays
finally produce lists of "hot candidates". One of the most interesting
questions,
when working on such datasets, concerns the potential involvement of
these genes
in a disease. In principle, there are two ways of retrieving
gene-disease information. The first one is to look for already known
and
described gene-disease correlations. The second one applies to
"orphan" diseases, where the responsible gene is not known: here one may at
least try to correlate the genomic positions of genes and the respective
diseases. Finally, when analyzing large datasets, it is highly
favorable to have resources which allow batch submissions of gene names
to quickly identify those genes with known disease correlations;
so in order to answer this specific question, proceed to section 3 !
1. Resources based on single-gene / single-disease queries:
Tip! The
OMIM
database is a catalog of human genes and genetic disorders. The
database contains textual information (like "mini-reviews" !),
pictures, and reference information. It also contains links to
NCBI's Entrez database of MEDLINE articles and sequence information.
Therefore, OMIM is also automatically searched when performing an ENTREZ "cross-database"
query using any keyword. Please note that when choosing "Limits", you
may restrict your search to specific chromosomes, or to individual
fields of the OMIM database. A different approach to access the data is
to use the OMIM
Gene Map. The OMIM gene map presents the cytogenetic map location
of
disease genes (in chromosomal order) and other expressed genes
described in OMIM, thereby providing links to both the NCBI Map Viewer
(genomic map), as well as to the OMIM entries which describe the
disorders.
NOTE: See the BioMart section below if you want to perform batch
queries of gene lists to reveal their OMIM data !!!
The Eukaryotic Gene
Orthologs (EGO), previously called TIGR Orthologous Gene
Alignments (TOGA), is a database for orthologous genes in eukaryotes.
Human disease genes from the Online Mendelian Inheritance in Man (OMIM)
database were matched to TIGR Human Gene Index accessions (THC numbers),
and their orthologs were then identified using the EGO
database. You can query using OMIM or LocusLink ID, gene name and
various types of accession numbers.
2. Resources based on single-gene / single-disease queries
including orphan disease prediction:
Tip! The
DiseaseInfo viewer is a tool which is integrated into the H-Invitational
Database (H-InvDB), which provides an integrative
annotation of full-length cDNA clones. This viewer displays
information on known disease-related genes via links to OMIM,
LocusLink, and GeneLynx, but also shows co-localized orphan
diseases. Orphan disease (here) means a disease mapped on the
chromosomal region, but whose responsible gene has not been identified
yet. Co-localization does not mean direct relationships between gene
and
disease; however, genes that are cytogenetically co-localized with a
disease could be possible candidate genes of that disease. You first
have to get the specific database entry of your gene of interest,
either
via BLAST
(sequence) search or via keyword search,
and then look for the specific disease link within the so-called "Locus
view". Please also refer to the H-InvDB section at the Data Integration page for a detailed
description.
3. Resources allowing batch queries of gene datasets:
Tip! GAD - Genetic Association
Database is an archive of human genetic association studies
of complex diseases
and disorders. GAD is maintained by the NIH
- National Institutes of Health. The goal of this database is to allow
the user to
rapidly identify medically relevant polymorphism from the large
volume
of polymorphism and mutational data, in the context of standardized
nomenclature. The data is from published scientific papers.
Study data is recorded in
the context of official human gene nomenclature with additional
molecular reference numbers and links. It is gene centered. That is,
each record is a record of a gene or marker. Please also refer
to the GAD section at the Data Integration page for a general
description.
GAD provides an extremely useful Batch
Search option, if you want to
quickly analyze whole sets of genes derived from e.g. microarray data
in order to see which of these genes are associated with a known
disease. You enter human official gene symbols (max. 300) as a list
(comma or space or tab
or new line separated) in the text area. If you have other types
of identifiers (Accession, Unigene, etc.), you can go to batch
SOURCE to get the identifiers translated to official Gene Symbols.
Refer also to the SOURCE
Batch Search
section for instructions! Finally, you will retrieve a table
displaying the known disease
correlations of your genes of interest (categorized by "Aging",
"Cancer", "Cardiovascular", "Development", "Immune", and others) where
you may click on all
individual records. Single records, like the example
of TP53 and colorectal cancer, display a lot of links to
diverse databases, like allele description, polymorphism class, PubMed
links, pathway data, population data, as well as links to SNP databases
(NCBI SNP, HapMap, Map View). In addition, links to "experts in the
field" are provided. Note: If you choose the option Positive
Only, then the output list of your query
is reduced to those records describing a positive association
with a certain disease. Note: In general, GAD also stores
negative association data (often published only in obscure scientific
journals).
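Before pasting a microarray-derived list into the Batch Search text area, it can help to clean it up locally. The following sketch only prepares the input; it does not contact GAD. It assumes the separators named above (comma, space, tab, new line) and the stated limit of 300 symbols per submission; the function names are hypothetical.

```python
import re

MAX_SYMBOLS = 300  # per-submission limit stated in the text above


def parse_gene_symbols(raw: str) -> list[str]:
    """Split on commas and/or whitespace and drop duplicates, keeping first-seen order."""
    tokens = [token for token in re.split(r"[,\s]+", raw) if token]
    return list(dict.fromkeys(tokens))


def batches(symbols: list[str], size: int = MAX_SYMBOLS) -> list[list[str]]:
    """Cut a long symbol list into chunks of at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [symbols[i:i + size] for i in range(0, len(symbols), size)]


symbols = parse_gene_symbols("TP53, BRCA1\tAPOE\nTP53  NOS3")
submission_text = "\n".join(symbols)  # one symbol per line, ready to paste
```

Lists longer than 300 symbols can be passed through `batches` and submitted chunk by chunk.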
Tip! A
very powerful resource for data retrieval is BioMart. BioMart
is a data retrieval tool that
generates lists of biological objects (e.g. genes, SNPs) from data held
in the Ensembl (and other) databases. Please refer to the BioMart section at the main page for
general details. There are different options in BioMart to filter a set
of genes for those having a known disease context. You may filter your
gene list at the "Filter" page by choosing "Disease Genes ONLY". You
may also filter by specific diseases using the "Expression" section at
the "Filter" page, in particular the "Pathology" fields (e.g.
atherosclerosis, asthma, diabetes). Alternatively, you may skip the
filtering of your gene list and display ALL genes in the output table
BUT include a disease-specific column by selecting "Disease OMIM ID"
and "Disease description" at the Output-"Features" tab.
GENOM6...identify Conserved Non-coding Sequences (CNS) and
conserved transcription factor binding sites in large
genomic
regions via comparative genomics ?
(last update Mar. 14, 2006)
Conserved Non-coding
Sequences (CNS) are believed to be highly reliable
candidates for regulatory regions in genomes, as the general assumption
states that regions of conservation within otherwise dissimilar
sequences are very likely to be functional. CNS can not only be found
in "proximal" promoter sequences but also in distal regions
like "enhancers". Thus, as more and more genome sequences
become available, comparative genomics develops into a quickly
expanding bioinformatics field. The comparison of large genomic
sequences demands the availability of special alignment algorithms, as
pre-existing ones like BLAST or BLAT are not suitable in this respect.
There is already a series of bioinformatics tools available for these
purposes, and I am trying to describe and compare those in the
following paragraphs. Most of these tools were developed by 2
different institutes, the
Lawrence Berkeley
Lab. and the Lawrence Livermore
Nat. Lab., as described in individual sections at the main page (LBL, LLL). Likewise, I will cite these
sources while describing individual programs.
1. Identification / extraction of genomic regions of interest:
If you are, for example, interested in the whole
intergenic region between your human gene of interest and its "upstream
neighbouring" gene, the best way to get this region is the UCSC Genome Browser.
You can query this browser using various kinds of accession numbers
(GenBank, RefSeq, EST, LocusLink), HUGO gene symbols, as well as
genomic positions (if you know them already). Note that you
have to be careful when selecting genomic positions as they
vary between different versions (freezes) of genome assemblies! For
example: The human version "hg15" is identical to the UCSC freeze of
April 2003, and the most recent version (hg16) corresponds to July
2003.
If you want to query by sequence (like a
cDNA sequence), the best way is to perform a BLAT
search at the UCSC, quickly identifying the corresponding
genomic region. Both options finally yield an image of the desired
region within the genome browser. Here you may move left or right, zoom
in and out, or directly enter position coordinates, in order to display
the region of
interest. Finally, you may either simply write down the position
coordinates (like "chr11:2,314,818-2,374,484") or extract the DNA
sequence using the "DNA" link on top of the browser window. You may
afterwards query tools for comparative genomics using one of these two
input types.
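If you write down position coordinates for later use in other tools, a small parser avoids copy errors caused by the thousands separators. This is a minimal sketch; it assumes that browser position strings are 1-based and inclusive at both ends, and the names `Region`, `parse_position` and `region_length` are hypothetical.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# Accepts strings such as "chr11:2,314,818-2,374,484" or "chrX:1-100000".
_POSITION = re.compile(r"^(chr[0-9A-Za-z_]+):([\d,]+)-([\d,]+)$")


def parse_position(text: str) -> Region:
    """Parse a browser-style position string into chromosome, start and end."""
    match = _POSITION.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid position string: {text!r}")
    chrom, start_text, end_text = match.groups()
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return Region(chrom, start, end)


def region_length(region: Region) -> int:
    """Number of bases, assuming 1-based inclusive coordinates."""
    return region.end - region.start + 1


region = parse_position("chr11:2,314,818-2,374,484")
plain = f"{region.chrom}:{region.start}-{region.end}"  # comma-free form
```

Remember that such coordinates are only meaningful together with the assembly version (freeze) they were taken from.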
2. Identification of CNS via pre-computed alignments:
Test runs showed that in many cases, it is not only
the fastest but also a highly accurate way to use pre-computed
whole-genome alignments to reveal conserved regions, instead of
performing the alignment yourself. For this purpose, several similar
browsers are available.
VISTA Browser
(LBL)
is a very nice Java applet, which allows the user to examine pre-computed
alignments of whole genome assemblies. Pairwise and
multiple alignments are available. This tool is tightly connected
to the UCSC
Genome
browser. To browse whole-genome alignments, just select
a base genome and enter a RefSeq gene name or a position (e.g.
chrX:1-100000) on this genome. Note that
in order to use this browser, Java 2 must be
installed
on your computer. This Java applet lets you zoom, move, and
analyze the genomic alignments very nicely. You may select / deselect
individual organisms, zoom highly conserved regions, and also directly
jump to
the UCSC browser where you can e.g. download the sequence region. You
may also directly perform rVISTA (transcription factor binding
sites) analyses, by hitting the "i" (Alignment Details) button,
which reveals a page with all pairwise alignments and rVISTA analyses
links. Please refer also to the rVISTA description below ! VISTA
browser
is similar to other programs like K-Browser.
Personally, I would recommend the VISTA browser which seems more
"up-to-date" and shows a higher versatility.
Tip! ECR Browser (LLL) is a dynamic whole-genome
navigation tool for visualizing and studying evolutionary
relationships between vertebrate genomes and for analyzing sequence
conservation profiles.
In order to run
the ECR Browser, select
a base organism and indicate the name of a gene or a
chromosomal location (chr1:from-to format). Visually ECRs (Evolutionary
Conserved Regions) are represented as colored peaks
on a graph, with the x-axis representing positions in the base
genome and the y-axis representing % identity between the base and
aligned
genomes at that specified position. ECRs are color-coded
differently according to the properties of the underlying sequence of
the base genome. This allows the user to visually distinguish between
ECRs that correspond to coding exons (blue), untranslated regions
(UTRs, yellow) and noncoding elements (red if they are intergenic or
pink if they lie within an intron). Green bars on the bottom axis of
the plot show the position of repetitive elements in
the base genome and this annotation is shaded to the top of the plot in
gray. Annotated genes are
depicted as a horizontal blue line above the graph, with
strand/transcriptional orientation indicated by the inclined vertical
lines.
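The idea behind such a conservation plot (% identity along the base sequence) can be illustrated with a sliding window over a pairwise alignment. This is only a simplified sketch and not the ECR Browser algorithm: positions are alignment columns rather than base-genome coordinates, gaps count as mismatches, and the window size and threshold are free parameters (the toy window below is far smaller than the windows of 100 bp or more typically used).

```python
def window_identity(base: str, other: str, window: int, step: int = 1) -> list[tuple[int, float]]:
    """(0-based window start, percent identity) for each window of two aligned sequences."""
    if len(base) != len(other):
        raise ValueError("aligned sequences must have equal length")
    if window <= 0 or window > len(base):
        raise ValueError("window must be between 1 and the alignment length")
    # 1 where both sequences carry the same non-gap base, else 0.
    matches = [1 if a == b and a != "-" else 0 for a, b in zip(base.upper(), other.upper())]
    profile = []
    for start in range(0, len(base) - window + 1, step):
        identical = sum(matches[start:start + window])
        profile.append((start, 100.0 * identical / window))
    return profile


def conserved_windows(profile: list[tuple[int, float]], threshold: float) -> list[int]:
    """Start positions of all windows reaching at least `threshold` percent identity."""
    return [start for start, percent in profile if percent >= threshold]


# Hypothetical toy alignment of two 10-bp sequences.
profile = window_identity("ACGTACGTAC", "ACGTTCGAAC", window=5)
peaks = conserved_windows(profile, threshold=77.0)
```

Plotting the second element of each tuple in `profile` against the first gives a curve comparable in spirit to the peaks shown in the browser.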
In addition,
the ECR browser is equipped with a 'Grab ECR' feature that
allows users to rapidly extract sequences. A mouse click on the 'Grab
ECR' button, followed by a second click on any colored peak (ECR)
on the plot results in appearance of a new web page describing the ECR
corresponding to that peak. NOTE that this only works
when pop-up blockers are switched off ! Chromosomal location, length,
percent
identity of the pairwise alignment, and GC content of the ECR are
given.
In addition the full alignment is visualized. Sequences and alignments
from other species can be obtained by using the "Grab ECR" feature to
retrieve a peak from the conservation plot depicting alignments with
the genome of that species. An additional link can be used
to forward the ECR alignment directly to rVISTA (see below).
Additional features can be accessed via the
commands at the top of the ECR Browser window. "Base genome"
lets you quickly switch between different species selected as base
genome. "Browser Settings" allows customized displays, like
selection of species, graph type, number and height of layers, and
stringency settings to detect ECRs. In addition, there is the option
to display pre-computed conserved transcription factor binding sites
directly in ECR Browser, without having to run "Grab ECR" and rVISTA
first. This is a static "quick-view" generated using default settings. "Highlight core ECRs"
displays only those ECRs which show at least 77 % conservation in a
window of 350 bp (see also corresponding reference). "ECRs"
displays a list of
the identified ECRs in a genomic region and all sequences. "DNA"
produces a fasta sequence file of the complete genomic region of the
base genome. "Synteny/Alignments" produces a list of all the
syntenic regions / sequences
of the other species. You may then directly view the rVISTA analyses
(conserved TFBS) for all pairs of sequences. NOTE: You
may also send ALL selected sequences to Mulan to
generate phylogenetic trees and identify multi-species
transcription factor binding sites (please refer to the MULAN description). "SNPs"
produces a list of all Single-Nucleotide Polymorphisms within the
individual ECRs.
3. Identification of CNS via self-computed alignments:
3.1. All sequences submitted by the user:
mVISTA (main VISTA; LBL)
is a program for visualizing alignments of an arbitrary number of
genomic sequences from different species. VISTA is especially designed
to display alignments of orthologous genes / regulatory regions
of up to 100 species. Note that it is not possible to
paste sequences; instead, you have to save them as FASTA files
in plain-text (*.txt) format, e.g. from
MS WORD. In addition you may provide an annotation file for the first
(base) sequence, specifying the positions of exons, UTRs, etc. This
annotation file can also be written as a simple txt-file, see the instructions
page for an example. If you provide an annotation for the first
sequence, then this will also be applied for the homologous regions of
the second sequence. You may now choose between different
alignment programs in
mVISTA: AVID, which produces global pair-wise alignments
(sequences can be *finished or draft*); LAGAN, which produces
global *multiple* alignments of finished sequences; and Shuffle-LAGAN,
which produces glocal pair-wise alignment of finished sequences and is
capable of *detecting rearrangements*.
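Since the sequences have to be uploaded as plain-text FASTA files rather than pasted, a short script can produce such files reliably. This is a minimal sketch: the file name, the sequence name and the line width of 60 characters are assumptions chosen for illustration, not requirements documented for mVISTA.

```python
import tempfile
from pathlib import Path


def format_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Return a FASTA record: a '>' header line followed by wrapped sequence lines."""
    if width <= 0:
        raise ValueError("width must be positive")
    cleaned = "".join(sequence.split())  # remove blanks and line breaks
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([f">{name}", *lines]) + "\n"


def write_fasta(path: Path, name: str, sequence: str, width: int = 60) -> Path:
    """Write one FASTA record to a plain-text file and return its path."""
    path.write_text(format_fasta(name, sequence, width), encoding="ascii")
    return path


# Example with a hypothetical 80-bp sequence, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = write_fasta(Path(tmp) / "human_region.txt", "human_region", "acgt" * 20)
    written = out.read_text(encoding="ascii")
```

Writing the file from a script also avoids hidden formatting characters that a word processor may add.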
Note that you may choose the option to directly
analyze the results with rVISTA (regulatory VISTA) to reveal
conserved Transcription Factor Binding Sites (TFBS). If so, you
can choose individual TFBS and select "stringency values" (core and
matrix similarity). Please refer to rVISTA below for output details.
The
Output
comprises different sections: TextBrowser displays input and output files for
visualization and download, including text files listing the
conserved regions of the 2 sequences that meet the specified criteria
(default: 75% identity within 100bp). Another novel (April 2005)
feature is the rankVISTA
conservation plot, which depicts evolutionarily conserved
segments
in pairwise or multiple alignments as a bar graph, where the heights
scale with statistical significance [-log10(P-value)]. For
example, a height of 4 indicates that the probability of seeing that
level of conservation by chance in a neutrally-evolving 10-kb segment
of the base sequence is less than 10^-4. VISTA
Image provides the VISTA plot
of the alignment(s) in PDF format. Dynamic Visualization
links the results to
VISTA Browser which provides multiple novel analysis options. NOTE that
you will get an Email containing the link
to the directory of output files which can be downloaded.
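The relation between the bar height in a rankVISTA plot and the underlying P-value, height = -log10(P), can be checked with two small helper functions. The function names are hypothetical; only the formula itself is taken from the description above.

```python
import math


def height_from_p(p_value: float) -> float:
    """Bar height for a given P-value: -log10(P)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("P-value must be in (0, 1]")
    return -math.log10(p_value)


def p_from_height(height: float) -> float:
    """P-value bound corresponding to a given bar height: 10 ** (-height)."""
    return 10.0 ** (-height)


bound = p_from_height(4)  # a height of 4 corresponds to P = 10^-4
```

Each additional unit of height thus corresponds to a tenfold smaller chance probability.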
zPicture (LLL) is the most
convenient way to align the 2 input sequences. zPicture is a
dynamic alignment and visualization tool based on the blastz
alignment program utilized by PipMaker. There are several input
options for zPicture, like copy/paste, fasta-files, NCBI
accessions, or Upload
sequence and gene annotation from the UCSC
Genome Browser. Optionally, you can provide annotations for
the input sequences.
The output of zPicture includes several file formats and a dynamic
visualization
tool that graphically displays the conserved regions and allows for
user-defined parameter settings. In addition, there is
a direct link to submit the alignment to rVISTA analysis ! NOTE: multi-zPicture is
a multi-sequence version of zPicture alignment and
visualization tool. Please note that it is not possible to
submit
multi-zPicture alignments to rVista yet. Nevertheless, all other options
are fully functional.
If we try a direct comparison between mVISTA and
zPicture, we may list the following points. mVISTA accepts up to 100
input sequences (and all pairwise alignments with the base sequence
can be analyzed with rVISTA afterwards), zPicture accepts only 2
input sequences (more sequences in multi-zPicture but these
alignments can not
be sent to rVISTA afterwards). In mVISTA, you can (or "must", however
you
see it) define "a priori" the parameters which define a CNS region,
like 70 % identity in a window of 100 bp. In zPicture, you may try
different parameter settings to define an ECR in the so-called "dynamic
visualization tool" at the output page, but there is no way to use
other settings than the
default one for a subsequent rVISTA analysis. This is important because
rVISTA
differentiates between "conserved" (meaning aligned AND within an
evolutionary
conserved region), "aligned" (aligned but NOT within an evolutionary
conserved
region) and "all" TFBS (Transcription Factor Binding Sites). Concerning
the
parameters which define the minimal requirements for a TFBS to
be
listed, mVISTA allows the user to set both "core similarity" and
"matrix similarity",
whereas zPicture allows only the definition of matrix similarity.
Anyway,
both programs provide pre-defined settings to reduce the output list of
TFBS
to hopefully specific hits. In mVISTA, these options are listed as
"minimize
false positives / negatives", or "minimize the sum of both error
rates",
in zPicture there is an option "optimized for function", along with
"use
only high-specificity matrices". It is interesting to note that the
option
"minimize false positives" in mVISTA seems much more stringent than the
others as this often reduces the output list to just a handful of
entries (or even none). In mVISTA, results arrive as an
Email link (which is advantageous as results are stored for one month at
the server !), whereas in zPicture
results are displayed directly in the browser window.
3.2. User submits only the base sequence:
GenomeVISTA
(LBL) lets you compare your sequences with several whole genome
assemblies. It will automatically find the ortholog, obtain
the alignment and VISTA plot. You will also be able to compare
your alignment with pre-computed alignments of other species in
the same base genome interval. As input just paste your sequence and choose
the base genome. The results can be displayed through the VISTA
text browser or the graphical VISTA browser. Note that these GenomeVISTA
analyses take quite long, so in many cases it is much faster to
retrieve the regions via the pre-computed alignments in VISTA
Browser
!
Genome Alignment
(LLL) lets you
align your FASTA sequence from any organism to either
(meaning only
ONE of these species at a time !) human, mouse, rat, chicken, fugu or drosophila genome. The output
list contains direct links to zPicture, ECR Browser, and rVISTA.
4. Prediction of conserved Transcription Factor Binding Sites
(TFBS) within CNS:
Tip! rVISTA (regulatory
Vista, LBL) combines transcription factor binding sites database search
with a comparative sequence analysis. It can be used directly or
through mVISTA,
Genome VISTA,
or
VISTA Browser.
Anyway, if you have 2 un-aligned sequences, you have to submit them to
an alignment program (mVISTA, MAVID, Advanced
PipMaker) prior to using rVISTA. Note that rVISTA still
only runs on and compares 2 input sequences, there is no
"multiple version" yet. rVISTA reveals conserved
TFBS. You can choose individual TFBS to visualize and select
"stringency values" (core and matrix similarity). This point
is actually critical, producing either huge lists of potential TFBS or
very short ones if the settings are too stringent. The default stringency
settings are quite "loose" (Core 0.75 and Matrix 0.7). There are
also options "minimize false positives / negatives" but these might be
too stringent (try !!!). As output, TFBS are graphically
visualized along the
sequence. If you want to get the exact position numbers and the
exact sequences, use the link "Summary of data" (easily
overlooked
!!!) at the bottom of the output. Please refer also to the rVISTA (LBL) chapter at the main
page.
Tip! rVISTA
(regulatory
VISTA, LLL) is quite similar to the
version at LBL. Again, rVISTA works only for 2 input
sequences. rVISTA at LLL offers even more options to run the
program starting from different applications. These
applications (which
can be also used as individual programs !) are zPicture, ECR Browser, precalculated blastz alignments
developed in Webb
Miller's lab.,
GALA, and Genome Alignment.
Please refer also to the rVISTA
(LLL) chapter at the main page for extensive descriptions of these
programs.
Tip! multiTF identifies
transcription
factor binding sites conserved across multiple species. There
are
2 different ways to initiate a multiTF search, and I would
suggest
to use MULAN, as this
program
is integrated in the same web-portal. Multiple sequence alignments
generated
by MULAN can be automatically submitted to multiTF from the results web
page.
The "handling" and output of multiTF is very similar to rVISTA, e.g.
the
user can set the parameters for detection of TFBS (like matrix
similarity,
individual TF selection). TFBS can be dynamically visualized along the
sequences
(similar display as in rVISTA but for multiple species). It is
possible
to list and display either ALL TFBS or only those which are conserved
across ALL species. You may also highlight individual TFBS positions in
the alignment.
Taken together, MULAN and the interconnected tool multiTF
somehow represent the "multi-species" equivalent to the system
mVISTA-rVISTA, where rVISTA is based on the TF prediction for 2
aligned species (2 sequences). Please refer also to the main
chapter describing different other programs of the Lawrence
Livermore
National Lab for comparative genomics.
Note that MULAN / multiTF can also
be used in connection with the ECR Browser (see above). ECR
Browser is a powerful tool to display large genomic regions of synteny
between several species and to extract the individual DNA sequences.
These sequences can then be used as input for MULAN and finally multiTF
can display all transcription factor binding sites which are conserved
across ALL species. This
is done by using the
link "Synteny/Alignments" in ECR Browser, which sends ALL
selected sequences to MULAN to generate phylogenetic trees and identify
multi-species transcription factor binding sites.
NOTE: If you are specifically looking for a TFBS which is not
contained in the TF database used (like TRANSFAC) but for which you
have a consensus sequence (like WWCAAWG), you may scan the MULAN
alignment for this pattern by using the option "User-defined consensus
sequences" within the multiTF input window "Defining transcription
factor binding sites".
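To make clear what a consensus such as WWCAAWG stands for, here is a minimal Python sketch that expands the IUPAC ambiguity codes (W = A or T) into a regular expression and scans a plain, ungapped sequence on the forward strand. It is only an illustration of the pattern matching and not a description of how multiTF works internally; the function names and the test sequence are made up for this example.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus (e.g. WWCAAWG) into a regex string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_sites(sequence, consensus):
    """Return 0-based start positions of all matches, including overlapping ones.

    Only the forward strand of an ungapped sequence is scanned.
    """
    # A lookahead is used so that overlapping matches are not skipped.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [match.start() for match in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical test sequence, not taken from any database.
    print(find_sites("GGTTCAAAGCC", "WWCAAWG"))
```

To cover both strands, the same scan can be repeated on the reverse complement of the sequence.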
EEL - Enhancer Element Locator is a tool for locating distal
gene enhancer elements in mammalian genomes by comparative genomics
and for identifying conserved TFBS in predicted enhancers.
EEL is described in Hallikas
et al., Cell 2006. Please refer also to the main section of EEL.
In order to address this specific question, you may
try a search in EELweb, the database of precomputed EEL alignments.
EELweb stores precomputed alignments between orthologous genes from
human and many other species. The data are regularly updated and
synchronized with the ENSEMBL database, which is
used as source of genomic information. EELweb can be searched for
conserved TFBS (all or selected from a list) in the 100 kb upstream
and downstream regions of a specific gene
(set) of interest. A list of Ensembl Gene IDs can be used as
query to search for precomputed TFBS in predicted enhancer regions of
these genes. Select the suitable species comparison. The Ensembl IDs
must correspond to the chosen organism. The option "Sites in the
module" restricts the query by requiring certain types of
transcription factor binding
sites to be conserved in the elements. Note that the maximum number of
results listed is 1000. If you want to retrieve more results,
you have to install the local version of EEL.
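Since the Ensembl IDs must correspond to the chosen organism, it can help to check a gene list before submitting it. The following Python sketch relies on the general Ensembl stable ID convention (human gene IDs start with "ENSG", mouse gene IDs with "ENSMUSG", each followed by 11 digits); the function name and the organism keys are assumptions made for this example and are not part of EELweb.

```python
import re

# Ensembl stable gene ID patterns for two organisms (illustrative subset).
GENE_ID_PATTERNS = {
    "human": re.compile(r"^ENSG\d{11}$"),
    "mouse": re.compile(r"^ENSMUSG\d{11}$"),
}


def split_by_organism(gene_ids, organism):
    """Split IDs into those that match the chosen organism and those that do not."""
    pattern = GENE_ID_PATTERNS[organism]
    matching, other = [], []
    for gene_id in gene_ids:
        gene_id = gene_id.strip()
        if not gene_id:
            continue  # ignore empty lines
        (matching if pattern.match(gene_id) else other).append(gene_id)
    return matching, other


if __name__ == "__main__":
    ids = ["ENSG00000153563", "ENSG00000172116", "ENSMUSG00000000001"]
    ok, rejected = split_by_organism(ids, "human")
    print("usable for a human query:", ok)
    print("wrong organism or malformed:", rejected)
```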
Remarks: A test run using the human genes
CD8A (ENSG00000153563) and CD8B
(ENSG00000172116) did not produce any enhancer regions / TFBS, although
this locus is well documented concerning functional enhancers. Thus, it
may be questioned whether the EEL database really holds a comprehensive
list of genes / enhancers.
GENOM7...know all genes associated
with cardiovascular diseases having a described polymorphism in the
promoter region ? (last
update Jun. 7, 2005)
This question actually combines 2 different
questions, namely to find resources which list genes associated
with certain diseases (which may also be others, like cancer, immune
diseases,...) AND to automatically select "by one click" those genes
having a polymorphism in a certain sequence region (which may also be
3' untranslated, or coding sequence, or other).
Tip! GAD - Genetic Association
Database is an archive of human genetic association studies
of complex diseases and disorders. GAD is maintained by the NIH
- National Institutes of Health. The goal of this database is to allow
the user to rapidly identify medically relevant polymorphisms from the
large volume of polymorphism and mutational data, in the context of
standardized nomenclature. The data are taken from published
scientific papers.
Study data are recorded in
the context of official human gene nomenclature with additional
molecular reference numbers and links. GAD is gene centered, i.e.
each record is a record of a gene or marker. Please also refer
to the GAD section at the Data Integration page for a general
description.
In order to address the specific question here, the
option Advanced Search allows you to query by complex combinations of
keywords using fields like "Reference", "Submitter", "Entrez GeneID",
"UniGene cluster", "Ensembl", and others. In particular, it is
possible to select polymorphisms (field "Polymorphism Class")
specifically related to genomic
regions like "5'promoter", "3'untranslated", or "coding sequence" and
to restrict the output to those genes related to cardiovascular
diseases (field "Disease Class"). Note that if you select the
option Positive Only, then the output list of your query
is reduced to those records describing a positive association
with a certain disease. Note: In general, GAD also stores
negative association data (often published only in obscure scientific
journals).
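The logic of this combined query (polymorphism class AND disease class, optionally positive associations only) can be sketched in a few lines of Python, e.g. for post-processing a table of association records that you have exported yourself. The record layout, the key names, the value "CARDIOVASCULAR" and the placeholder genes are all assumptions made for this illustration; they are not the actual GAD export format and not real association data.

```python
def select_records(records, polymorphism_class, disease_class, positive_only=True):
    """Keep records matching both classes; optionally only positive associations."""
    selected = []
    for record in records:
        if record["polymorphism_class"] != polymorphism_class:
            continue
        if record["disease_class"] != disease_class:
            continue
        if positive_only and not record["positive_association"]:
            continue
        selected.append(record)
    return selected


def gene_symbols(records):
    """Return the sorted, non-redundant gene symbols of a record list."""
    return sorted({record["gene"] for record in records})


if __name__ == "__main__":
    # Placeholder records (hypothetical genes and values).
    toy_records = [
        {"gene": "GENE_A", "polymorphism_class": "5'promoter",
         "disease_class": "CARDIOVASCULAR", "positive_association": True},
        {"gene": "GENE_B", "polymorphism_class": "coding sequence",
         "disease_class": "CARDIOVASCULAR", "positive_association": True},
        {"gene": "GENE_C", "polymorphism_class": "5'promoter",
         "disease_class": "CARDIOVASCULAR", "positive_association": False},
    ]
    hits = select_records(toy_records, "5'promoter", "CARDIOVASCULAR")
    print(gene_symbols(hits))
```

Setting positive_only=False corresponds to leaving the option Positive Only unselected, so negative association records are kept as well.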
Single records, like the example
of IL6 and coronary heart disease, display a lot of links to
diverse databases, like allele description, polymorphism class, PubMed
links, pathway data, population data, as well as links to SNP databases
(NCBI SNP, HapMap, Map View). In addition, links to "experts in the
field" are provided.
A very powerful resource for data retrieval is BioMart. BioMart
is a data retrieval tool that
generates lists of biological objects (e.g. genes, SNPs) from data held
in the Ensembl (and other) databases. Please refer to the BioMart
section at the main page for general details. There are different
options in BioMart to filter a
set of genes (or ALL genes of a genome) for those having a known disease
context (please refer to FAQ GENOM5!). BUT NOTE
that it is not so easy to filter for cardiovascular diseases using
BioMart as compared to GAD! Nevertheless, BioMart offers several
options for filtering related to SNP data. You may select
genes having SNPs in the coding region, 5'UTR, 5'flanking, intronic,
and more.
Overall, there is no direct link between SNPs and
diseases in BioMart, making GAD the better resource for this purpose.
GENOM8...know the genes and drugs
related to diseases like atherosclerosis ? -> see CHEM2 !
This question involves resources which correlate
diseases with the use of specific drugs. Thus, it is located in the
section "Cheminformatics".
GENOM9...analyze the expression of a
gene set of interest in cancer tissues ? (last
update Feb. 13, 2006)
1. Microarray, SAGE, and EST data:
Tip! CGAP - Cancer Genome
Anatomy Project is an NCBI resource which offers a
comprehensive molecular characterization of normal, precancerous, and
malignant cells. It contains genomic data for humans and mouse,
including transcript sequence, gene expression patterns, SNPs,
clone resources, and cytogenetic information. Please refer also to the CGAP main section for details.
In order to address this specific question, CGAP
provides at the Gene Finder page
the option to use the Batch Gene Finder.
In order to use the Batch Gene Finder, prepare a text file containing
the list of (human OR mouse)
gene symbols, UniGene
clusters, accession numbers, protein accession numbers, UniProt
(SwissProt) protein accessions, UniProt (SwissProt) protein identifiers
(like "ACTB_HUMAN") or Entrez Gene numbers. The text file must list the
identifiers in a vertical column, e.g. export a one-column Excel sheet
in txt (tab-delimited) format. The created gene list displays
the query
ID, gene name and symbol, and RefSeq accessions, as well as links to
the individual Gene Info pages (see CGAP main section for
details). In addition, the link "Common View" allows you to create
a table displaying all GO
terms, pathways (KEGG and Biocarta), motifs, SNPs, and cytogenetic
locations
for the complete input gene set. In addition, the expression of
the whole gene set can be viewed as a colored graph within the NCI60
panel of cancer cell lines (please refer to the NCI60 section for background). The
link "SAGE Summary" displays the SAGE counts of the input gene
set in a series of normal and cancer tissues.
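If the gene list comes from a script rather than from a spreadsheet, the one-column text file can be written directly. The following Python sketch only illustrates the file layout described above (one identifier per line); the function name and the file name are arbitrary choices for this example, and removing duplicates and empty entries is a convenience of the sketch, not a documented requirement of the Batch Gene Finder.

```python
def write_identifier_column(identifiers, path):
    """Write identifiers to a plain text file, one per line.

    Empty entries and repeated identifiers are dropped; the input order is kept.
    Returns the list of identifiers actually written.
    """
    seen = set()
    cleaned = []
    for identifier in identifiers:
        identifier = identifier.strip()
        if identifier and identifier not in seen:
            seen.add(identifier)
            cleaned.append(identifier)
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for identifier in cleaned:
            handle.write(identifier + "\n")
    return cleaned


if __name__ == "__main__":
    # Gene symbols used purely as example input.
    written = write_identifier_column(["ACTB", "IL6", " IL6 ", ""], "gene_list.txt")
    print(len(written), "identifiers written to gene_list.txt")
```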
2. Microarray data:
Oncomine
is a resource of the University of Michigan for
examining gene expression in cancer. The goal of the project is
to
collect, standardize, analyze, and deliver published cancer gene
expression data to the research community. Probe the expression
of a
gene across thousands of cancer samples.
Note that the "Gene Search" option
allows SINGLE gene queries (there is NO batch query using gene
sets) using several
types of identifiers, like gene name, gene symbol, Entrez Gene ID,
Affymetrix ProbeSet IDs, and more. As a first result,
a gene overview is presented showing the gene name and aliases.
As "expression overview", the "Differential activity map" summarizes
significant differential expression of a gene of interest grouped by
tissue type and analysis type. Three types of analyses
are summarized on the
Summary page: Normal vs Normal, Cancer vs Normal, and Cancer vs Cancer.
Note: Please refer to the main section of Oncomine for
detailed descriptions of the other analysis modules!
3. SAGE and EST data:
Tip!
ECgene (gene
prediction by EST clustering) predicts genes by combining genome-based
EST clustering and a transcript assembly procedure in a coherent
and consistent fashion. Specifically, ECgene takes alternative splicing
events
into consideration. The positions of splice sites (i.e.
exon-intron boundaries)
in the genome map are utilized as critical information in the whole
procedure. Sequences that share splice sites in the genomic alignment
are grouped together to define an EST cluster. ECgene is
available for human, mouse, and rat
genomes. Please refer also to the main
section of ECgene for information !
In order to address this specific question, the
module ECexpression
is relevant. ECexpression is the expression data viewer of ECgene.
ECexpression utilizes the extensive expression data from EST
and SAGE sources to develop a queriable expression ontology.
EST or SAGE libraries are prepared from known tissue samples and the
origin of these samples was carefully documented to allow standardized
querying of expression. There are 4 divisions: anatomical site,
pathology, developmental stage, and sex. Note: A very useful
feature is that normal and cancer libraries are divided
and also displayed separately in the graphs. Therefore, this layout
makes it easy to find any tissue-specific or cancer-specific
isoforms !
Note: There is also a "stand-alone" query form,
which allows you to set the reliability level, select the graph type,
retrieve different types of SAGE tags, select the tag redundancy, and
choose whether to include all EST libraries or only the non-normalized
ones. The reason is that many cDNA libraries are normalized or
subtracted in
order to find genes with low expression level, which in turn renders
expression data qualitative rather than quantitative.
4. EST data:
DigiNorthern, provided
by the Bioinformatics Group
of the Roswell Park Cancer Institute, is a tool for
virtually displaying the expression profile of query genes
(currently it only accepts DNA sequences as input) based on the
EST sequences currently available at NCBI GenBank. There are
currently two versions of this program. DN1 takes one
sequence as query gene, lists all the cell lines/tissues/organs
that express the gene, and displays the relative expression levels of
the gene based on the number of matched ESTs vs the total number of
ESTs for related libraries. Wherever available, a comparison is also
made between the same tissue/organ in normal and cancer
status. DN2 takes two
sequences as query genes and compares their expression profiles
side by side. DigiNorthern is currently available for human
and mouse.
NOTE: There is no batch submission
of gene datasets in DigiNorthern.
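The relative expression level used in such a "digital Northern" is simply the number of ESTs matching the query divided by the total number of ESTs in the corresponding libraries. The Python sketch below illustrates this ratio for a normal vs. cancer comparison; it is not DigiNorthern's own code, and the tissue name and counts are hypothetical.

```python
def relative_expression(matched_ests, total_ests):
    """Fraction of ESTs in a library (or library group) that match the query gene."""
    if total_ests <= 0:
        raise ValueError("total number of ESTs must be positive")
    return matched_ests / total_ests


def normal_vs_cancer(counts):
    """Compute relative expression per tissue and state.

    counts maps tissue -> {"normal": (matched, total), "cancer": (matched, total)}.
    """
    table = {}
    for tissue, states in counts.items():
        table[tissue] = {
            state: relative_expression(matched, total)
            for state, (matched, total) in states.items()
        }
    return table


if __name__ == "__main__":
    # Hypothetical EST counts for one tissue.
    toy_counts = {"tissue_x": {"normal": (5, 10000), "cancer": (20, 10000)}}
    for tissue, levels in normal_vs_cancer(toy_counts).items():
        for state, level in levels.items():
            print(tissue, state, f"{level * 1e6:.0f} matched ESTs per million")
```

As mentioned for ECgene above, such ratios are only meaningful for non-normalized libraries.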
GENOM10...determine the expression
profiles of normal vs. cancer tissues ? (last
update Feb. 13, 2006)
1. Microarray, SAGE, and EST data:
Tip! CGAP - Cancer Genome
Anatomy Project is an NCBI resource which offers a
comprehensive molecular characterization of normal, precancerous, and
malignant cells. It contains genomic data for humans and mouse,
including transcript sequence, gene expression patterns, SNPs,
clone resources, and cytogenetic information. Please refer also to the CGAP main section for details. In order
to address this specific question, CGAP
provides several tools for the analysis of cDNA (EST) expression
in normal and cancer tissues.
The cDNA xProfiler is
a tool that compares gene expression between two pools of
libraries. For a
gene to be "present" in a library pool, there must be at least one EST
sequence found in the UniGene cluster for that gene. This tool allows
you to generate datasets of "unique" ESTs via comparison of e.g.
different
tissues or between normal and cancer libraries of the same tissue.
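The presence/absence logic of such a comparison can be written as plain set operations. The Python sketch below follows the rule stated above (a gene is "present" in a pool if at least one EST is found); the input layout (gene mapped to pooled EST count), the function names and the placeholder genes are assumptions for this illustration and do not reproduce the xProfiler itself.

```python
def present_genes(est_counts):
    """Genes with at least one EST in the pool (presence/absence view)."""
    return {gene for gene, count in est_counts.items() if count >= 1}


def compare_pools(pool_a, pool_b):
    """Return genes unique to pool A, unique to pool B, and shared by both."""
    genes_a = present_genes(pool_a)
    genes_b = present_genes(pool_b)
    return genes_a - genes_b, genes_b - genes_a, genes_a & genes_b


if __name__ == "__main__":
    # Hypothetical pooled EST counts per gene.
    cancer_pool = {"GENE_A": 12, "GENE_B": 1, "GENE_C": 0}
    normal_pool = {"GENE_A": 3, "GENE_C": 7}
    only_cancer, only_normal, shared = compare_pools(cancer_pool, normal_pool)
    print("only in cancer pool:", sorted(only_cancer))
    print("only in normal pool:", sorted(only_normal))
    print("in both pools:", sorted(shared))
```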
The GLS - Gene
Library Summarizer finds all the genes expressed in a single
cDNA library or group of cDNA libraries. It then classifies the
genes as unique
or
non-unique, and then further identifies the genes in each of these
groups as known
or unknown.
The DGED
(cDNA Digital Gene Expression Displayer) is a tool that compares
gene expression between two pools of libraries. In contrast to the
xProfiler, the DGED treats the presence of a gene in a library pool as
a matter of degree. It compares the "degree" of presence of a
gene in pool A with its "degree" of presence in pool B.
This comparison is reduced to two numbers: the
sequence odds
ratio and measure
of significance.
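To illustrate what "degree of presence" means, the following Python sketch computes an odds ratio from EST counts in two pools. The formula used here, (a / (A - a)) / (b / (B - b)) with a and b being the ESTs of the gene and A and B the total ESTs in pools A and B, is the textbook definition of an odds ratio and is an assumption of this example; please check the CGAP documentation for the exact definition used by the DGED. The measure of significance is not reproduced here, and the counts are hypothetical.

```python
def sequence_odds_ratio(gene_a, total_a, gene_b, total_b):
    """Odds of drawing an EST of the gene in pool A relative to pool B.

    gene_a / gene_b: ESTs assigned to the gene in pool A / pool B
    total_a / total_b: all ESTs in pool A / pool B
    """
    if not (0 <= gene_a < total_a and 0 <= gene_b < total_b):
        raise ValueError("counts must satisfy 0 <= gene count < pool total")
    odds_a = gene_a / (total_a - gene_a)
    odds_b = gene_b / (total_b - gene_b)
    if odds_b == 0:
        # Gene absent from pool B: the ratio is undefined (0/0) or unbounded.
        return float("nan") if odds_a == 0 else float("inf")
    return odds_a / odds_b


if __name__ == "__main__":
    # Hypothetical counts: 30 of 10000 ESTs in pool A, 10 of 10000 in pool B.
    ratio = sequence_odds_ratio(30, 10000, 10, 10000)
    print(f"sequence odds ratio A:B = {ratio:.2f}")
```

A ratio above 1 indicates a higher degree of presence in pool A, a ratio below 1 a higher degree of presence in pool B.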
The SAGE
DGED (SAGE Digital Gene Expression Displayer) is a tool that
identifies those genes that are expressed at significantly different
levels (as defined by the user) in two pools of human libraries,
based on SAGE tag analysis. The algorithm takes into account
the differences in sample size between Pools A and B, which can be
large.
The user selects a value for statistical significance (P value) and a
value for the difference in the level of expression (F value) between
the two pools. The results are based on the sequence odds ratio
and measure of significance.
NOTE: These tools can not only be used for cancer
vs. normal
but also for normal vs. normal comparisons !
2. Microarray data:
Tip! Oncomine
is a resource of the University of Michigan for
examining gene expression in cancer. The goal of the project is
to
collect, standardize, analyze, and deliver published cancer gene
expression data to the research community. The user may explore genes,
processes,
and pathways deregulated in a particular type of cancer.
"Profile Search" allows you to query using keywords like
cancer types, tissue types, clinical parameters, and more.
Alternatively, you may browse all cancer profiles by clicking
the icon and then use the filters to find the profile of interest.
This search first presents a list of
studies filtered by certain criteria. These include source tissue
(like breast or prostate) and several analysis types (like cancer vs.
normal, cancer
vs. cancer etc.). For each study, the number (and percentage) of up-,
down-, and
differentially expressed genes is indicated.
Note: Please refer to the main section of
Oncomine for detailed descriptions of the advanced analysis modules!