| Genomics Linkpages |
|
| MedDB (Medical University of Vienna, Austria) |
MedDB,
created at the Medical University of Vienna, Austria, provides a very good,
concise, compact overview of bioinformatics resources, listed in
well-structured categories. MedDB also contains a section "Genomics", with
sub-sections like "Pathways", "Polymorphism", or "Transcription".
Please refer to this main section for additional information. |
| Comparative Genomics | |
| NOTE: This section also includes programs which are especially designed for the alignment of 2 (or sometimes more than 2) large genomic sequences, preferably syntenic chromosomal regions of different species, like DBA, LAGAN, PipMaker and zPicture. In contrast, programs for "general" pairwise or multiple alignment are listed in the section "Sequence Similarity". | |
| DBA
- DNA Block Aligner (EBI) |
Dna Block Aligner
(DBA),
which is part of the Wise2
package at EBI,
aligns two sequences under the assumption that the sequences
share a number of
colinear blocks of conservation separated by potentially large
and varied lengths of DNA in the two sequences. DBA works well on
syntenic regions of noncoding DNA between, say, mouse and human, for
example, the upstream regions of a gene from mouse and human, or the
conserved intron of a human - chicken gene. The conserved blocks may be
regions important for regulation of the gene. The conserved blocks
may have one or two gaps in them. To use the input form, paste your two sequences or use the file upload feature. An ASCII rendering of the alignment is returned. The underlying model is a probabilistic finite state machine (a pair-HMM) that aligns the two sequences.
NOTE: Other programs for this purpose include PipMaker and zPicture (see their descriptions below).
NOTE: DBA alignments can take a very long time! In one test, 2 sequences of approx. 100 kb each had not finished "by the end of the day", while zPicture performed the same task within a minute! |
| Dcode.org - Tools for comparative genomics (Lawrence Livermore Nat. Lab) including: rVISTA, zPicture, ECR Browser, eShadow, Genome Alignment, MULAN, multiTF. Note: Tools of Dcode.org which are described elsewhere: CREME, SynoR |
The DCODE.org
Comparative
Genomics Center at
Lawrence Livermore National Lab
is working on the development of tools for comparative sequence
analysis. DCODE.org
is a publicly available resource for
regulatory genome data mining. It provides tools for evolutionary
comparisons, sequence alignments, and detection of functional sequence
patterns. Note that the rVISTA tool provided here, although quite similar in "look and feel", is independent of the rVISTA tool provided by Lawrence Berkeley Lab.
1. rVISTA (regulatory VISTA) combines a transcription factor binding site (TFBS) database search with a comparative sequence analysis, thereby reducing the number of predicted transcription factor binding sites by several orders of magnitude. For example, when comparing the promoter sequence of a human gene with that of its orthologous mouse counterpart, it is possible to extract those TFBS that are conserved between the 2 species and are therefore expected to be functionally significant. Note that, in contrast to mVISTA, rVISTA works only for 2 input sequences.
1.1. rVISTA input: There are 5 different ways to run rVISTA (which can also be used as individual programs!), which are explained at the rVISTA start page:
1.1.1. zPicture is the most convenient way to align the 2 input sequences. zPicture is a dynamic alignment and visualization tool based on the blastz alignment program utilized by PipMaker. zPicture alignments can be automatically (!) submitted to rVISTA to identify conserved transcription factor binding sites. Note: There is a very good, concise Help for using zPicture. There are several input options for zPicture, like copy/paste, FASTA files, NCBI accessions, or uploading sequence and gene annotation from the UCSC Genome Browser. Optionally, you can provide annotations for the input sequences. The output of zPicture includes several file formats and a dynamic visualization tool that graphically displays the conserved regions and allows for user-defined parameter settings. In addition, there is a direct link to submit the alignment to rVISTA analysis! NOTE: multi-zPicture is a multi-sequence version of the zPicture alignment and visualization tool. Please note that it is not yet possible to submit multi-zPicture alignments to rVISTA. Nevertheless, all other options are fully functional.
1.1.2.
ECR Browser: If you know the location or the gene name of your sequence in the human, mouse, rat, or fugu genome, you can fetch precalculated alignments for rVISTA processing from the ECR (Evolutionary Conserved Regions) Browser. To run the ECR Browser, select a base organism and enter the name of a gene or a chromosomal location (chr1:from-to format). NOTE: A good approach is to first extract the exact genomic coordinates of the region you are interested in using the UCSC Genome Browser; MAKE SURE YOU ARE USING THE SAME DATA FREEZE AS IN ECR BROWSER! The ECR Browser is a dynamic whole-genome navigation tool for visualizing and studying evolutionary relationships between vertebrate genomes and for analyzing sequence conservation profiles. ECRs are represented as colored peaks on a graph, with the x-axis representing positions in the base genome and the y-axis representing % identity between the base and aligned genomes at that position. ECRs are color-coded according to the properties of the underlying sequence of the base genome. This allows the user to visually distinguish between ECRs that correspond to coding exons (blue), untranslated regions (UTRs, yellow) and noncoding elements (red if they are intergenic or pink if they lie within an intron). Green bars on the bottom axis of the plot show the positions of repetitive elements in the base genome, and this annotation is shaded in gray up to the top of the plot. Annotated genes are depicted as a horizontal blue line above the graph, with strand/transcriptional orientation indicated by the inclined vertical lines. Note the "Flip" button (2 arrows), which lets you quickly reverse the strand orientation of the base sequence! In addition, the ECR Browser is equipped with a 'Grab ECR' feature that allows users to rapidly extract sequences.
A mouse click on the 'Grab ECR' button (which changes its color), followed by a second click on any colored peak (ECR) on the plot, opens a new web page describing the ECR corresponding to that peak. NOTE that this only works when pop-up blockers are switched off! Chromosomal location, length, percent identity of the pairwise alignment, and GC content of the ECR are given. In addition, the full alignment is visualized. Sequences and alignments from other species can be obtained by using the "Grab ECR" feature to retrieve a peak from the conservation plot depicting alignments with the genome of that species. An additional link can be used to forward the ECR alignment to rVISTA. In addition to these functions, links to the oligo/primer design tool are provided for the base and the second sequences. Additional features can be accessed via the commands at the top of the ECR Browser window:
- "Base genome" lets you quickly switch between different species selected as base genome.
- "Browser Settings" allows customized displays, like selection of species, graph type, number and height of layers, and stringency settings to detect ECRs. In addition, there is the option to display pre-computed conserved transcription factor binding sites directly in the ECR Browser, without having to run "Grab ECR" and rVISTA first. This is a static "quick-view" generated using default settings.
- "Highlight core ECRs" displays only those ECRs which show at least 77 % conservation in a window of 350 bp (see also corresponding reference).
- "ECRs" displays a list of the identified ECRs in a genomic region and all sequences.
- "DNA" produces a FASTA sequence file of the complete genomic region of the base genome.
- "Synteny/Alignments" produces a list of all the syntenic regions / sequences of the other species. You may then directly view the rVISTA analyses (conserved TFBS) for all pairs of sequences.
NOTE: You may also send ALL selected sequences to Mulan to generate phylogenetic trees and identify multi-species transcription factor binding sites (please refer to the MULAN description below).
- "SNPs" produces a list of all Single-Nucleotide Polymorphisms within the individual ECRs.
1.1.3. Precalculated blastz alignments: blastz alignments can be obtained using several tools developed in Webb Miller's lab.
1.1.4. GALA (Genome Alignment and Annotation Database) now automatically forwards genome scans to rVISTA 2.0.
1.1.5. Genome Alignment (see below!) lets you align your FASTA sequence from any organism to the human, mouse, rat, chicken, fugu or drosophila genome. The output list contains direct links to zPicture, ECR Browser, and rVISTA!
1.2. Output of rVISTA (note that you can also save the output link and return to the stored results later): Note that if an annotation file was supplied, rVISTA will identify aligned binding sites predicted only for the noncoding regions. You can search the alignment either for conserved BIOBASE (TRANSFAC professional) matrices or for self-defined consensus sequences. You may search for single transcription factors or for all available matrices. You may specify the values for core similarity and matrix similarity. Note: TRANSFAC searches are performed using core similarity values of 0.8 and matrix similarity values of 0.85.
- Summary tables: display the sequences, scores, and positions of all selected TFBS within the two query sequences.
- Visualization and clustering module: a web interface to graphically display the positions of conserved TFBS. You may also perform clustering analyses.
- TFBS families in the alignment: Here, you can choose a single TFBS family (like Sp1 or AP2) to visualize in the alignment, meaning that you will see the conserved positions directly in a sequence alignment of the 2 sequences.
- Positions of conserved / aligned / all TFBS: shows a list of these. "Conserved" means aligned AND within an evolutionary conserved region; "aligned" sites are those in the two species that are connected by the alignment but are not required to be locally conserved (not within an ECR).
2. eShadow is a program for phylogenetic shadowing of closely related species. NOTE: Compared to standard phylogenetic footprinting tools, eShadow is aimed at detecting relatively long putative regulatory elements (~100 bp), while other tools usually detect smaller evolutionarily conserved elements of the size of a transcription factor binding site. The program performs comparative sequence analysis of multiple closely related nucleotide or protein sequences. Based on the assumption that mutations in different lineages accumulated independently of each other during evolution, eShadow analyzes ClustalW multiple sequence alignments of sequences that are indistinguishably similar in pairwise comparisons and identifies regions that accumulated few mutations throughout evolution. eShadow implements two different complementary approaches, Hidden Markov Model Islands (HMMI) and Divergence Threshold (DT), to distinguish between functional and neutrally evolving regions.
2.1. eShadow input: eShadow requires a set of orthologous sequences in FASTA format. Every sequence should have a header line with the name of the species, followed by the nucleotide sequence, in a format similar to the example provided at the eShadow instructions page.
2.2. eShadow output: eShadow generates graphical images displaying the conserved regions between the supplied sequences, showing these regions in different colors according to the method used. In addition, ORF and exon predictions are performed.
3. Genome Alignment: Automatically align your FASTA sequence from any organism to the human, mouse, rat, chicken, fugu or drosophila genome. Note that there is no multiple alignment option, just alignment of 2 species.
The output list contains direct links to zPicture, ECR Browser, and rVISTA! Note that the rVISTA link analyses the whole sequences for TFBS, but there is an option to display only those located within evolutionary conserved regions.
4. MULAN: Mulan is a MUltiple sequence Local AligNment and conservation visualization tool that is based on the refine and tba programs created by Webb Miller. Mulan performs full local alignment of multiple nucleotide sequences. It can be applied to the analysis of distantly related species (such as humans and fish) or closely related species (such as humans and primates, or mice and rats). NOTE: MULAN and the interconnected tool multiTF represent, in a sense, the "multi-species" equivalent of the mVISTA-rVISTA system, where rVISTA is based on the TF prediction for 2 aligned species (2 sequences). Mulan can be run in 2 different ways: tba-based alignments for "finished"-quality sequences, or refine-based alignments for "draft"-quality sequences (which are represented as a set of separate contigs).
4.1. Input: Simply select the number of species, and then paste the sequences. NOTE: If you choose the option to upload the sequence from the UCSC Genome Browser, then the gene annotation is automatically uploaded as well. This is very convenient, as it is a big advantage to see the positions of exons and introns in the final image! NOTE: It is now also possible to first locate your region of interest in the ECR Browser, extract the pre-made alignments with other species, and finally, via the link "Synteny/Alignments", send ALL selected sequences to MULAN to generate phylogenetic trees and identify multi-species transcription factor binding sites.
4.2. Output:
- First, a phylogenetic tree is suggested based on the sequences, which can be modified by the user.
- Dynamic graphical visualization of conservation profiles.
- Pairwise dot-plots displaying alignment blocks.
- Dynamic batch detection of ECRs (Evolutionarily Conserved Regions) by varying ECR parameters for each of the pairwise alignments.
- Portal to the multiTF tool for detection of cross-species TFBS. (Available only for "finished" Mulan alignments.)
5. multiTF: multiTF identifies transcription factor binding sites conserved across multiple species.
5.1. Query: there are 2 different ways to initiate a multiTF search:
- MULAN multiple sequence alignments can be automatically submitted to multiTF from the results web page.
- GALA (Genome Alignment and Annotation Database) automatically forwards genome alignment data to Mulan for the subsequent multiTF post-processing.
- NOTE: It is now also possible to first locate your region of interest in the ECR Browser, extract the pre-made alignments with other species, and finally, via the link "Synteny/Alignments", send ALL selected sequences to MULAN to generate phylogenetic trees and identify multi-species transcription factor binding sites.
5.2. Output:
- Very similar to rVISTA, the user can set the parameters for detection of TFBS (like matrix similarity, individual TF selection).
- TFBS can be dynamically visualized along the sequences (similar display as in rVISTA but for multiple species).
- List and display either ALL TFBS or only those which are conserved across ALL species.
- Highlight individual TFBS positions in the alignment. |
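NOTE: The "core ECR" criterion mentioned for the ECR Browser above (at least 77 % conservation in a window of 350 bp) can be illustrated with a small sliding-window scan over a pairwise alignment. This is only a minimal sketch: the function names `percent_identity` and `conserved_windows` are hypothetical, the way gap columns are counted is an assumption, and the actual ECR Browser implementation may work quite differently.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of alignment columns where both rows carry the same base.

    Assumption: gap columns ('-') never count as matches, but they do
    count towards the window length.
    """
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def conserved_windows(aln_a: str, aln_b: str, window: int = 350,
                      min_identity: float = 77.0):
    """Return merged (start, end) column ranges (0-based, end exclusive)
    covered by windows whose identity meets the threshold."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    regions = []
    for start in range(0, len(aln_a) - window + 1):
        end = start + window
        if percent_identity(aln_a[start:end], aln_b[start:end]) >= min_identity:
            if regions and start <= regions[-1][1]:
                # Window overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

Calling `conserved_windows(row_human, row_mouse)` on the two rows of a pairwise alignment uses the 350 bp / 77 % values quoted above; lowering `window` or `min_identity` corresponds loosely to relaxing the stringency settings in the browser.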
| DiAlign and DiAlignTF (Genomatix Inc., Munich, Germany) |
1. DiAlign
performs local multiple alignment. DiAlign is a novel
multiple alignment tool
using a proprietary algorithm (based neither on the Needleman-Wunsch algorithm
nor on its Smith-Waterman local variant). DiAlign is based on
segment comparison and uses no gap penalty, thus avoiding
several problems of conventional alignments. One of the most important
features of DiAlign is its independence of sequence length over a wide
range, allowing detection of short conserved stretches directly from a
multiple alignment of longer sequences.
2. DiAlignTF displays transcription factor (TF) binding site (TFBS) matches within a multiple alignment. It is possible to display all TF binding site matches, TF binding site matches common to all or a subset of the input sequences, or common TF binding site matches that are located in aligned regions. The TF binding sites are visualized in the alignment as colored boxes. The input sequences are aligned with the multiple alignment program DiAlign. TF binding site matches are identified by MatInspector.
NOTE: DiAlign and DiAlignTF are part of the "GEMS Launcher" section of the commercial Genomatix Suite.
NOTE: Genomatix calls the free academic access an "evaluation account". In general, there is a limit not only on the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also on the functionality of the obtained data!
NOTE: The DiAlign - DiAlignTF system is comparable to the (FREE) MULAN - multiTF system at Dcode.org. |
| Genome
Atlas Database (CBS, Denmark) |
The
Genome
Atlas
Database, provided by the CBS, Denmark, is a method of
visualising
structural features
within large regions of DNA. It was originally designed for analysis of
complete genomes, but can also be used quite readily for analysis of
regions of DNA as small as a few thousand bp in length. |
| LAGAN Alignment Toolkit (Stanford) including: LAGAN, Multi-LAGAN, Shuffle-LAGAN, and CHAOS (Berkeley) |
The LAGAN Alignment
Toolkit
is a set of alignment programs for comparative genomics. The
three main components are a pairwise aligner (LAGAN), a multiple
aligner (M-LAGAN), and a glocal aligner (Shuffle-LAGAN). All 3 are
based on the CHAOS local alignment tool and combine speed (regions up
to several megabases can be aligned within minutes) with high accuracy.
The results can be visualized using the VISTA server, as well as the
novel Phylo-VISTA
tool.
1. CHAOS is a pairwise local aligner optimized for non-coding and other poorly conserved regions of the genome. It uses both exact matching and degenerate seeds, and is able to find homology in the presence of gaps.
2. LAGAN is a highly parametrizable pairwise global alignment program. It takes local alignments generated by CHAOS as anchors, and limits the search area of the Needleman-Wunsch algorithm around these anchors. The input to LAGAN consists of two sequence files in FASTA format. The program RepeatMasker is used to mask repetitive elements. You can also ask the server to reverse-complement the second sequence if you suspect there is homology on the opposite DNA strand. You can request a visualization of your alignment by the VISTA server. Phylo-VISTA visualization is only available for M-LAGAN, not for pairwise LAGAN alignments (see below for details), but you can submit two sequences as an M-LAGAN job to see the result in Phylo-VISTA.
3. Multi-LAGAN (M-LAGAN) is a generalization of the pairwise algorithm to multiple sequence alignment. M-LAGAN performs progressive pairwise alignments, guided by a user-specified phylogenetic tree. Alignments are aligned to other alignments using the sum-of-pairs metric. The input to M-LAGAN consists of several sequence files in FASTA format. You must provide a name for each sequence, and specify a phylogenetic tree that relates all of them. You can request a visualization of your alignment by the VISTA server. You can also get a Phylo-VISTA visualization of the alignment by following a link included in your e-mail. Phylo-VISTA is a novel interactive tool for multiple alignment visualization that uses the phylogenetic relationship between the sequences to better display the sequence similarities.
4. Shuffle-LAGAN is a novel glocal alignment algorithm that is able to find rearrangements (inversions, transpositions and some duplications) in a global alignment framework.
It uses CHAOS local alignments to build a map of the rearrangements between the sequences, and LAGAN to align the regions of conserved synteny. NOTE: LAGAN can visualize your alignments through the VISTA server. Note that LAGAN and VISTA are not affiliated: if you request an alignment directly on the VISTA page it will be done using the AVID aligner, not LAGAN. |
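NOTE: If you prefer to reverse-complement the second sequence yourself before submitting it (instead of using the server option mentioned for LAGAN above), a few lines of Python are enough. This is a minimal sketch; the helper name `reverse_complement` is hypothetical, and only the plain bases A, C, G, T and N are handled (other IUPAC ambiguity codes are passed through unchanged).

```python
# Translation table for the four bases plus N, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence string."""
    # Complement every base, then read the result backwards.
    return seq.translate(_COMPLEMENT)[::-1]
```

Apply it to the sequence lines only, not to the FASTA header line.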
| MAVID (Berkeley) and K-Browser (Berkeley) |
1. MAVID is a multiple
alignment program that is suitable for alignments of large numbers
of DNA sequences. The sequences can be small mitochondrial genomes or
large genomic regions up to megabases long. The MAVID server integrates
MAVID with various phylogenetic tree construction programs and
visualization tools.
- The sequence file should contain the sequences as a multi-FASTA text file.
- The results page contains the MAVID-generated multiple alignment in various formats, the phylogenetic tree constructed from the sequences, and VISTA pictures of the alignments.
- The output is organized into two main categories: Download and View. In the View section, you will find the phylogenetic tree, the VISTA plots as PDF files, and a Boxshade-colored alignment as PDF. Downloadable files include the multiple alignment in CLUSTAL, PHYLIP or multi-FASTA format, along with the phylogenetic tree. If just two sequences are submitted, the pairwise alignment is available for download in AVID format. Please note that the alignment files are not compressed, and can be large. The phylogenetic tree generated from the MAVID alignment can be downloaded directly in Newick format, or viewed using the ATV applet (the tree is rooted using the midpoint method).
NOTE: MAVID is quite similar to VISTA (see VISTA chapter). Both programs use the global aligner avid, in contrast to PipMaker, which is based on the local aligner blastz.
Note: MAVID is currently a DNA sequence alignment program and cannot align protein sequences.
Note: There is a compact user guide explaining all you need to know.
Note: MAVID uses DUST as a pre-processing step, but repeats are only identified, not masked. Repeat information is therefore used to reduce alignment errors, but the repeats themselves are still aligned.
Note: MAVID runs on the sequences as they are given, so you have to reverse-complement sequences which are in the "wrong direction".
2. K-Browser is a multiple genome browser, displaying pre-computed whole-genome alignments of several species. Currently it is set up for human, mouse, and rat. You have to select one of the species as the "primary" genome for browsing, along which the other 2 are aligned. K-Browser is heavily based on the UCSC Genome Browser.
NOTE that you have to be careful when selecting genomic positions, as they vary between different versions (freezes)! For example, the version "Human (hg15)" in K-Browser is identical to the UCSC freeze of April 2003, and NOT to the most recent version of July 2003 (= hg16)! As in the UCSC interface, you may query using diverse identifiers, like gene name etc. Output: All 3 genomic regions as "UCSC-like" graphics; the alignment in multi-FASTA or PHYLIP format. Note: K-Browser is similar to the VISTA browser (see VISTA chapter). Personally, I would recommend the VISTA browser, which seems more "up-to-date" and more versatile. |
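NOTE: MAVID expects all input sequences in one multi-FASTA text file (one header line starting with ">" per sequence, followed by the sequence lines). If your sequences come from different sources, a small script can assemble such a file. This is a minimal sketch; the function name `to_multi_fasta`, the record names, and the line width of 60 are illustrative assumptions, not requirements stated by the MAVID server.

```python
def to_multi_fasta(records, width: int = 60) -> str:
    """Format (name, sequence) pairs as multi-FASTA text.

    Each record gets a '>name' header line; the sequence is wrapped
    at `width` characters per line.
    """
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Example usage with made-up toy sequences.
if __name__ == "__main__":
    text = to_multi_fasta([("human", "ACGTACGT"), ("mouse", "ACGTTCGT")])
    with open("input_sequences.txt", "w") as handle:
        handle.write(text)
```

The same plain-text file layout can be reused wherever this page mentions FASTA or multi-FASTA input.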
| PipMaker
and MultiPipMaker (Penn State) |
PipMaker was developed by the Penn State Bioinformatics
Group. PipMaker computes alignments of similar
regions in two
DNA sequences. Note that PipMaker is based on the local aligner
called blastz, while Vista utilizes the global aligner avid.
The resulting alignments are summarized with
a "percent identity plot", or "pip" for short. MultiPipMaker allows the user to see relationships among more than two sequences. All pairwise alignments with the first sequence are computed and then returned as interleaved pips. Moreover, MultiPipMaker can be requested to compute a true multiple alignment of the input sequences and return a nucleotide-level view of the results. PipMaker generates graphical output as a PDF document by default, or optionally as a PostScript document. Note: PipMaker is also one of the alignment tools which can be used as the initial step for an rVISTA analysis (please refer to the specific chapter). If you have precalculated alignments using Advanced PipMaker, you can submit them using the form at the bottom of the rVISTA start page. There is a specific help file with PipMaker instructions for rVISTA users. |
| Selected Comparative Genomics
Databases including: CVCGD - Cardiovascular Comparative Genomics Database (Lawrence Berkeley Laboratory) |
1. Cardiovascular
Comparative Genomics Database (CVCGD) is part of the so-called PGA (Programs for Genomic
Applications) suite of the Lawrence
Berkeley Laboratory. This database includes
well-studied CV genes, for which an understanding of regulation should
provide insights into CV relevant biological issues. You may search the CVCGD via gene names, browse via alphabetical lists, or search by categories (groups of diseases) like atherosclerosis, thrombosis, vascular development, and more. Search results are displayed as tables, showing links to OMIM, RefSeq, and GenBank. In particular, you may analyze human-mouse-rat whole genome alignments via the VISTA Genome Browser (see also VISTA chapter !), showing positions of conserved exons but also of conserved non-coding regions (CNS). Note: The VISTA Genome Browser is fully "interactive" allowing many options (zoom, walk, and direct connection to the UCSC Genome Browser for sequence download etc.). |
| VISTA - Visualization Tools for
Alignments (Lawrence Berkeley Laboratory) including: mVISTA, rVISTA, GenomeVISTA, PhyloVISTA, VISTA Browser, Whole Genome rVISTA |
VISTA is
implemented in the so-called PGA (Programs
for Genomic Applications) suite of the Lawrence Berkeley Laboratory. VISTA is
a set of tools for comparative genomics. It was designed to visualize
long sequence alignments of DNA from two or more species with
annotation information. It has a clean output, allowing for easy
identification of sequence similarities and differences, and is easily
configurable, enabling the visualization of alignments of various
lengths at different levels of resolution.
1. VISTA Servers (Self-computed alignments):
1.1. mVISTA (main VISTA): program for visualizing alignments of an arbitrary number of genomic sequences from different species. Note that VISTA is especially designed to display alignments of orthologous genes / regulatory regions of up to 100 species.
1.1.1. Input: Note that it is not possible to paste sequences; you have to save them as FASTA files in plain-text (*.txt) format, e.g. using MS Word. In addition, you may provide an annotation file for the first (base) sequence, specifying the positions of exons, UTRs, etc. This annotation file can also be written as a simple txt file; see the instructions page for an example. If you provide an annotation for the first sequence, it will also be applied to the homologous regions of the second sequence. You may now choose between different alignment programs in mVISTA: AVID, which produces global pair-wise alignments (sequences can be *finished or draft*); LAGAN, which produces global *multiple* alignments of finished sequences; and Shuffle-LAGAN, which produces glocal pair-wise alignments of finished sequences and is capable of *detecting rearrangements*. Note that you may choose the option to directly analyze the results with rVISTA to reveal conserved Transcription Factor Binding Sites (TFBS). If so, you can choose individual TFBS and select "stringency values" (core and matrix similarity). Please refer to rVISTA below for output details.
1.1.2. Output:
- TextBrowser: input and output files for visualization and download, including text files listing the conserved regions of the 2 sequences that meet the specified criteria (default: 75% identity within 100 bp). Another novel (April 2005) feature is RankVISTA: RankVISTA conservation plots depict evolutionarily conserved segments in pairwise or multiple alignments as a bar graph, where the heights scale with statistical significance [-log10(P-value)].
For example, a height of 4 indicates that the probability of seeing that level of conservation by chance in a neutrally-evolving 10-kb segment of the base sequence is less than 10^-4.
- VISTA Image: VISTA plot of the alignment(s) in PDF format.
- Dynamic Visualization: VISTA Browser. This provides multiple novel analysis options.
NOTE that you will get an e-mail containing the link to the directory of output files, which can be downloaded.
1.2. rVISTA (regulatory VISTA): combines a transcription factor binding site database search with a comparative sequence analysis. It can be used directly or through mVISTA, GenomeVISTA, or VISTA Browser. In any case, if you have 2 unaligned sequences, you have to submit them to an alignment program (mVISTA, MAVID, Advanced PipMaker) prior to using rVISTA! Note that rVISTA still only runs on and compares 2 input sequences; there is no "multiple version" yet. rVISTA reveals conserved Transcription Factor Binding Sites (TFBS). You can choose individual TFBS to visualize and select "stringency values" (core and matrix similarity). This point is critical: loose settings produce huge lists of potential TFBS, while overly stringent settings produce very short ones.
NOTE: TFBS are graphically visualized along the sequence. If you want to get the exact position numbers and the exact sequences, use the link "Summary of data" (easily overlooked!) at the bottom of the output!
NOTE: The programs MAVID and PipMaker are described in separate sections!
NOTE: The default stringency settings are quite "loose" (core 0.75 and matrix 0.7). There are also options "minimize false positives / negatives", but these might be too stringent (try them!).
NOTE: "Conserved" TFBS means aligned AND within an evolutionary conserved region; "aligned" sites are those in the two species that are connected by the alignment but are not required to be locally conserved (not within a CNS/ECR).
1.3.
GenomeVISTA: lets you compare your sequences with several whole genome assemblies. It will automatically find the ortholog and obtain the alignment and VISTA plot. You will also be able to compare your alignment with pre-computed alignments of other species in the same base genome interval. Input: Just paste your sequence and choose the base genome. Results can be displayed through the VISTA text browser or the graphical VISTA browser. NOTE: These GenomeVISTA analyses take quite long, so in many cases it is much faster to retrieve the regions via the pre-computed alignments in VISTA Browser (see below)!
1.4. PhyloVISTA: is an interactive tool for analyzing multiple DNA sequence alignments by visualizing a similarity measure for DNA sequences from different species while considering their phylogenetic relationships. Features include multiple-resolution visualization for examining an alignment and easy comparison of any subtree of sequence data within the complete alignment dataset. NOTE: Prior to using PhyloVISTA, you have to produce a multiple alignment file in multi-FASTA format.
2. Pre-computed whole genome alignments:
2.1. VISTA Browser: is a very nice Java applet which allows the user to examine pre-computed alignments of whole genome assemblies. Pairwise and multiple alignments are available. This tool is tightly connected to the UCSC Genome Browser. To browse whole-genome alignments, just select a base genome and enter a RefSeq gene name or a position (e.g. chrX:1-100000) on this genome. Please read the remarks in the MAVID section concerning genomic positions and genomic freezes. 2 output options:
5a) VISTA Browser itself: This Java applet lets you zoom, move, and analyze the genomic alignments very nicely. You may select / deselect individual organisms, zoom into highly conserved regions, and also jump directly to the UCSC browser, where you can e.g. download the sequence region.
5b) VISTA tracks within UCSC Browser: displays the VISTA tracks within the UCSC browser window allowing nice comparison of conserved regions with exons / introns / "strange" ESTs / repeats, etc. NOTE: To use this browser, Java 2 must be installed on your computer. NOTE: VISTA browser is similar to other programs like K-Browser (described in the MAVID-section). Personally, I would recommend the VISTA browser which seems more "up-to-date" and shows a higher versatility. 2.2. Whole Genome rVISTA (beta version): This Web site provides access to the computational tool that allows for evaluation of which transcription factor binding sites (TFBS) are over-represented in upstream regions in a group of genes. This beta-version of the tool has been developed for the mm4 version of the Mouse genome (October 2003). A database of all TFBS in mm4 conserved in the alignment with the Human July 2003 (hg16) using rVISTA (regulatory VISTA) method was created. Input: genes of your interest using locus link IDs or RefSeq names (only Mouse !!!). The programs will calculate which TFBS located in 5Kbp upstream regions of these genes are over-represented (at the P-value cutoff 0.006) using all 5Kbp upstream regions of mm4 RefSeq genes as outgroup. You can also get a list of TFBS overrepresented in each individual gene of interest. NOTE: Test runs showed that the program tends to produce quite long lists of "over-represented" TFBS. It is hard to estimate which of these are actually biologically meaningful. NOTE: This program is similar to TELiS, but uses conserved TFBS between 2 species instead of TFBS from a single species, and counts the number of TFBS relative to all mouse genes (RefSeq) whereas TELiS specifically sets the TFBS of the sample in relation to the TFBS of all genes present on the microarray platform used. |
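The over-representation test described for Whole Genome rVISTA can be illustrated with a small sketch. The description above only states that a P-value cutoff of 0.006 is applied against all RefSeq upstream regions; the exact statistic is not given, so the hypergeometric tail used here is an assumption, and the function names and toy counts are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N genes, K of which carry the TFBS.

    ASSUMPTION: a one-sided hypergeometric test; the statistic actually used
    by Whole Genome rVISTA is not stated in the text.
    """
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def overrepresented(sample_hits, n_sample, background_hits, n_background, cutoff=0.006):
    """Return the TFBS names whose P-value is at or below the cutoff.

    sample_hits / background_hits map a TFBS name to the number of genes whose
    upstream region contains at least one conserved site (hypothetical input).
    """
    selected = []
    for tf, k in sample_hits.items():
        p = hypergeom_tail(k, n_sample, background_hits[tf], n_background)
        if p <= cutoff:
            selected.append(tf)
    return sorted(selected)


# Toy numbers, invented for illustration only.
sample = {"TF_X": 10, "TF_Y": 1}
background = {"TF_X": 50, "TF_Y": 100}
hits = overrepresented(sample, 20, background, 1000)
```

With many TFBS profiles tested at once, a fixed cutoff of this kind lets some profiles through by chance, which fits the observation above that the lists of "over-represented" TFBS tend to be long.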
| Phylogenetics | |
| Phylogenetics Resources
- Linkpage 1 |
A listing of Phylogenetic
resources from UCMP
(Univ. of California Museum of Paleontology) including societies,
publications, databases, and software. |
| Phylogenetics Resources
- Linkpage 2 |
The University of
Washington provides a
list of over 100 (!) phylogeny
software packages.
|
| Phylogenetics Resources -
Linkpage 3 |
The Center for Genomics and
Bioinformatics (CGB) was launched in 1997 as a new academic
department at the Karolinska Institute. The public
WWW servers at CGB provide a series of tools
/databases, with some focus on the field of phylogenetics / comparative
genomics, like ConSite, OrthoSeq, Orthostrapper, and Inparanoid. |
| Phylogenetics Resources -
Linkpage 4 |
The Pasteur Institute
offers a linkpage to many resources involved in phylogenetic
analyses. |
| BLink (NCBI) |
BLink
(BLAST Link) is an
extremely useful source of information concerning protein homologs
and orthologs across multiple species. BLink is not available as a
stand-alone program but as a link in each protein record
stored in NCBI Entrez.
Note that in case there is no direct link to BLink from Entrez
Gene, you may first open the respective HomoloGene
link, and then jump to BLink from the protein records displayed in
HomoloGene. As
BLink is not available as "direct link", here is an example
protein record. NOTE: BLink is also integrated in the "data super-integration tool" Bioinformatic Harvester of the EBI. BLink entries are based on pre-computed sequence alignments, generated from routine all-against-all BLAST comparisons performed at NCBI. The best 200 of these alignments can be displayed. BLink reports are highly customizable, some of the options are described below. - Conserved protein domains are shown on top of the alignment, with links to the NCBI CDD database. - The alignments are depicted graphically and are color-coded on the basis of taxonomic origins (having some "look and feel" of the COG database). - Each alignment is displayed in NCBIs BLAST 2 sequences format. - All protein hits can also be displayed in their specific BLink reports. - the "Best Hits" format button displays ONLY THE BEST HIT IN EACH SPECIES, allowing a very quick access to the potential orthologs of a protein in other species. - the "Common Tree" button displays the BLAST hits along the branches of the taxonomic tree, allowing for selection of individual species. - the "Taxonomy Report" button lists the BLink results as a BLAST taxonomy report. - the "3D structures" button limits the output to those sequences derived from structure records (linked via the colored dots in the "3D structures" display). - the "CDD search" button links to a pre-computed conserved domain display for the query sequence. |
| COG
and KOG (NCBI) |
1. Clusters
of
Orthologous Groups of proteins (COGs) were delineated
by
comparing protein sequences encoded in a list of complete prokaryotic
genomes. The proteins that comprise each COG are assumed to have
evolved from an ancestral protein, and are therefore either orthologs
or paralogs. 1.1. Background: - Orthologs are proteins from different species that evolved by vertical descent (speciation), and typically retain the same function as the original. A genome - specific best hit (BeT) is the protein in a target genome which is most similar to a given protein from the query genome. The underlying premise is that orthologs are more similar to each other than they are to any other protein from the respective genomes ("reciprocal best hits"). In multiple-genome comparisons, pairs of potential orthologs identified via BeTs can be joined to form clusters of orthologs. Note that a COG is built by definition by proteins from at least 3 sufficiently distant species ("3 clades"). - Paralogs are proteins from within a given species that are derived from gene duplication, and may evolve new functions that are related to the original. 1.2. Ways to access the COG database: 1.2.1. Protein/Gene name or text search (COG start page): You may use the protein/gene name or "free text". However, this method is not foolproof, since some genes may be known by alternative names. 1.2.2. COGnitor: A more robust method to query COG is to paste the appropriate query protein sequence into COGnitor. 1.2.3. CD-Search: CD-Search (also refer to descriptions at the "Protein" and "Structures" pages !) uses your query sequence to search the NCBI CDD (Conserved Domain Database), incl. Smart, and Pfam. As an update, CD-Search now also looks for COGs and KOGs matching your query sequence, making this option very "user-friendly" ! 1.2.4. Phylogenetic pattern search tool: This tool provides a means for finding COGs that contain or exclude a selected organism. To find all COGs that contain or exclude a particular organism, simply indicate the desired choice for each listed species and submit the query. To make a selection for the entire column, click the appropriate choice at the top. 
The choices are "dc" ("don't care"): COG may or may not contain this organism; "yes": COG must contain this organism; "no": COG must not contain this organism. The list of results will be the subset of COGs that fits the pattern indicated. Please note that this tool may be used to address highly interesting questions; refer to 1.3. COG Output !
1.3. COG Output: Briefly, the COG database provides three kinds of information:
- Annotation of proteins: Known functions (and two- or three-dimensional structures) of one COG member can often be directly attributed to the other members of the COG.
- Phylogenetic patterns: These show the presence or absence of proteins from a given organism in a specific COG. Used systematically, such patterns can help to determine, e.g., whether a particular metabolic pathway exists in an organism, or to identify potential targets (genes specifically contained in one genome) for highly selective anti-bacterial agents.
- Multiple alignments: Each COG page includes a link to a multiple alignment of COG members, which can be used to identify conserved sequence residues and analyze evolutionary relationships between member proteins.
2. Eukaryotic orthologous groups (KOGs) are the eukaryotic counterpart to COGs, comprising species like human, Drosophila, C. elegans, Arabidopsis, and S. cerevisiae.
2.1. Ways to access the KOG database:
2.1.1. KOGnitor: KOGnitor is the counterpart to the prokaryotic COGnitor. Note that the "BeTs to clades" button allows you to change the stringency of the search, to insist that any COG to which the query protein is assigned must be composed of at least the indicated number of clades. The default is three, which is the number used to define a minimal COG.
2.1.2. CD-Search: CD-Search (also refer to descriptions at the "Protein" and "Structures" pages !) uses your query sequence to search the NCBI CDD (Conserved Domain Database), incl. Smart, and Pfam. 
As an update, CD-Search now also looks for COGs and KOGs matching your query sequence, making this option very "user-friendly" !
2.1.3. Phylogenetic pattern search via KOGs: Similar to the COG page, this site allows you to choose KOGs (genes) which are specifically contained in species of interest. E.g. it is possible to select for KOGs present in yeast, C. elegans, and Arabidopsis, but not in human.
2.1.4. Phylogenetic pattern search via TWOGs: By definition, KOGs are built from orthologous proteins from at least 3 species. Nevertheless, a site (TWOGs) is available listing genes that are present in only 2 species, and might therefore be specifically important for these species.
2.1.5. LSEs (Lineage Specific Expansions): Note that many proteins belong to LSEs, i.e., have evolved via duplication(s) after the divergence of the compared species. In these cases it is often very hard to define (co)orthologous relationships, and therefore clusters including such expansions should be treated with caution.
2.2. KOG Output: In principle, the general remarks on the COG output above apply here as well.
- The first view is a tabular list indicating the number of proteins comprising a single KOG, including a phylogenetic tree, where the species are color-coded.
- Clicking on individual protein links produces multiple alignments with the respective protein on top, along with the other KOG members (marked by "="), and proteins from related KOGs. Note that individual sequence lengths can be longer than the section displayed.
- On top of a multiple alignment there are several links, like:
- COGs: shows relationships between KOGs and COGs (prokaryotic proteins).
- unmask: unmasks the "low-complexity regions", which are masked by stretches of "XXX" by default.
- Genbank: retrieves the Genbank entry of the protein on top of the alignment. 
- BLink: BLink ("BLAST Link") displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain. BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database. The output includes the positions of up to 200 BLAST hits on the query sequence, scores, and alignments. There is a variety of display options, including the distribution of hits by taxonomic grouping, the best hit to each organism, the protein domains in the query sequence, similar sequences that have known 3-D structures, and more. NOTE: BLink also offers Additional options which allow you to specify which taxa you would like to exclude, increase or decrease the BLAST cutoff score, or filter the BLAST hits to show only those from a specific source database, such as RefSeq or Swiss-Prot. |
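Two ideas from the COG / KOG description above lend themselves to a short sketch: reciprocal best hits as candidate orthologs, and the "yes" / "no" / "dc" phylogenetic pattern search. This is a minimal illustration, not the NCBI implementation; the input dictionaries, protein names, and COG identifiers are invented for the example.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Return (a, b) pairs where b is a's best hit in genome B and vice versa.

    scores_ab[a][b] is a similarity score (e.g. a BLAST bit score) of protein a
    from genome A against protein b from genome B; scores_ba is the reverse
    search. Both structures are hypothetical inputs for this sketch.
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def matches_pattern(species_in_cog, pattern):
    """Check one COG against a phylogenetic pattern.

    pattern maps a species to "yes" (must be present), "no" (must be absent)
    or "dc" (don't care), mirroring the choices described above.
    """
    for species, choice in pattern.items():
        present = species in species_in_cog
        if choice == "yes" and not present:
            return False
        if choice == "no" and present:
            return False
    return True


# Toy data, invented for illustration only.
ab = {"a1": {"b1": 300.0, "b2": 120.0}, "a2": {"b1": 250.0, "b2": 90.0}}
ba = {"b1": {"a1": 310.0, "a2": 240.0}, "b2": {"a2": 95.0, "a1": 80.0}}
pairs = reciprocal_best_hits(ab, ba)

cogs = {"COG_A": {"E. coli", "B. subtilis"}, "COG_B": {"E. coli"}}
pattern = {"E. coli": "yes", "B. subtilis": "no", "M. jannaschii": "dc"}
selected = sorted(name for name, sp in cogs.items() if matches_pattern(sp, pattern))
```

In the toy data, a2's best hit is b1, but b1's best hit is a1, so only the a1 / b1 pair is reciprocal. The pattern search keeps COG_B because it contains E. coli but not B. subtilis.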
| EGO
- Eukaryotic
Gene Orthologs (TIGR) |
The Eukaryotic
Gene Orthologs (EGO), previously called TIGR Orthologous Gene
Alignments (TOGA), is a database for orthologous genes in eukaryotes.
EGO is generated by pair-wise comparison between the Tentative
Consensus (TC) sequences that comprise the TIGR Gene Indices from
individual organisms. The EGO database can be accessed through the SEARCH function. You can perform a BLAST search or search using gene names or TIGR accessions. |
| HomoloGene (NCBI) |
HomoloGene
is
a resource of curated and calculated orthologs for genes as
represented by UniGene or by annotation of genomic sequences. The
calculated homologs are the result of nucleotide
sequence comparisons between each pair of organisms, maintained by the
database. For the comparisons, EST and mRNA sequences from UniGene are used, as well as transcripts extracted from annotation of genomic sequences. The list of available organisms is shown at the HomoloGene start page. HomoloGene can be queried via keyword search. HomoloGene entries are also linked at ENTREZ Gene pages of individual genes. NOTE: HomoloGene entries were augmented in 2004 with homology and phenotype information drawn from sources like OMIM and COG. NOTE: HomoloGene is now integrated in NCBI's ENTREZ database system. |
| PHYLIP (Washington University, Seattle) WebPHYLIP (Nebraska-Lincoln University) PHYLIP (Pasteur) |
1. PHYLIP
is a free package of programs for
inferring phylogenies, provided by the University of Washington,
Seattle. It is distributed as source code,
documentation files, and a number of different types of executables.
The Phylip homepage contains information on PHYLIP and ways to transfer
the executables, source code and documentation to your computer. 2. WebPHYLIP is a Web version of the original PHYLIP package, provided by the Bioinformatics Core Facility (BCF) at the Center for Biotechnology, University of Nebraska-Lincoln. Both the original programs and documentation were modified to reflect the WebPHYLIP interface. 3. PHYLIP at Pasteur: The PHYLIP package of phylogeny programs, along with comprehensive documentation, is also available as individual interfaces at the Pasteur Institute. |
| PhyloBLAST (University of British Columbia, Vancouver, Canada) |
PhyloBLAST
was developed as part of the Pathogenomics Project at
the University of British Columbia, Vancouver,
Canada. PhyloBLAST performs molecular phylogenetic analysis of a protein sequence. Enter the sequence at the site. PhyloBLAST uses BLASTP to find related amino acid sequences in the Swiss-Prot database. Select the desired sequences for a full phylogenetic analysis, which starts with a ClustalW multiple sequence alignment. A choice of PHYLIP programs, including parsimony, UPGMA, neighbor joining and distance matrix methods, then produces a phylogenetic tree. |
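Of the tree-building methods named for PhyloBLAST, UPGMA is the simplest to sketch. The code below is a minimal illustration that assumes a pairwise distance matrix has already been computed from the alignment. It is not the PHYLIP code, it ignores branch lengths, and the function name and toy distances are hypothetical.

```python
def upgma(names, matrix):
    """Build a rooted tree (nested tuples) by UPGMA from a symmetric distance matrix.

    names: list of sequence labels; matrix[i][j]: distance between names[i] and names[j].
    Branch lengths are not tracked in this sketch; only the topology is returned.
    """
    trees = {i: names[i] for i in range(len(names))}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                       # cluster id -> number of leaves
    dist = {(i, j): matrix[i][j] for i in trees for j in trees if i < j}
    next_id = len(names)

    def key(a, b):
        return (min(a, b), max(a, b))

    while len(trees) > 1:
        i, j = min(dist, key=dist.get)                  # closest pair of clusters
        new = next_id
        next_id += 1
        for k in trees:
            if k in (i, j):
                continue
            # Size-weighted average distance from the merged cluster to cluster k.
            merged = (sizes[i] * dist[key(i, k)] + sizes[j] * dist[key(j, k)]) / (sizes[i] + sizes[j])
            dist[(k, new)] = merged
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        trees[new] = (trees[i], trees[j])
        sizes[new] = sizes[i] + sizes[j]
        for old in (i, j):
            del trees[old]
            del sizes[old]
    return next(iter(trees.values()))


# Toy distances, invented for illustration only.
tree = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```

A and B are the closest pair and are joined first; C then joins the merged cluster at the root.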
| Phylogenetic
Footprinting (1) - ConSite and JASPAR (CGB, Karolinska) |
1. Phylofoot.org is
a resource page for programs involved in matters of phylogenetic
footprinting. Phylogenetic footprinting is a method that
identifies putative regulatory elements in DNA sequences. It identifies
regions of DNA that are unusually well conserved across a set of
orthologous sequences. For example, if a region is found to be
conserved between a human and an orthologous genomic sequence from a
distantly related organism, it is extremely likely to have a
biological role. 2. ConSite was developed at the Center for Genomics and Bioinformatics (CGB), a department of the Karolinska Institute. ConSite is a program (and web interface) that couples phylogenetic footprinting with regulatory site detection (mainly promoter comparison). ConSite is designed to compare 2 orthologous sequences and report conserved TFBS (Transcription Factor Binding Sites). NOTE that ConSite uses the JASPAR database of TF profiles (please see below) that was newly built from literature data, and that is therefore "independent" from existing databases like TRANSFAC. 2.1. At the start page, you have 3 different options: - Analyze orthologous pairs of genomic sequences: Here, you can e.g. paste 2 promoter sequences of the same gene from human and mouse, and the program will generate the alignment. - Analyze an existing alignment of 2 genomic sequences: Here, you can use your pre-made alignment (CLUSTALW format) directly for the TFBS analysis. - Analyze a single sequence: A single promoter is analyzed for TFBS (without performing cross-species comparison). This option is comparable to the TESS system, but utilizes the JASPAR profile collection. Note that optionally, you may in addition provide a cDNA sequence, and determine whether the program shall not display TFBS within exon regions (in the case that the genomic sequence is not only 5' flanking region but also contains transcribed sequence). 2.2. Select transcription factors: Now, you may select individual TFs (by name, species, or domain) or the complete set for screening your 2 input sequences. The program then scans the 2 sequences for potential TFBS and compares the sites between the aligned sequences. Only those sites that are present in both sequences and, more importantly, are located in equivalent positions, are reported in the output. Note that there is a parameter "with a minimum specificity of x bits", where x is set at a default value. 
Information content, in terms of bits of information, is often used in bioinformatics to describe the overall specificity of a profile. This means that setting this parameter to higher values will exclude poorly characterized TFBS profiles from the subsequent analysis. Note that the option "Analyze a single sequence" of ConSite can also be used to see if individual TFBS, which can be selected from a list, are present in a query sequence. In addition, there is the option to scan the sequence for the presence of a user-defined profile (raw counts matrix or position weight matrix), but not of a user-defined consensus sequence.
2.3. Output - 3 modes:
- Graphical View: graphical display of the 2 sequences, their conserved regions, and the predicted conserved TFBS.
- Alignment View: displays the sequence alignment of the 2 sequences, and the positions of conserved TFBS.
- Table View: displays a table of all predicted TFBS, their sequences and the positions within the 2 input sequences. Note that the TF labels are equipped with a mouse-over function, which displays the PWM and a very instructive sequence logo as a pop-up window !!!
2.4. Output - 3 parameters: Note that these parameters can be set in all 3 analysis modes !
- Conservation cutoff: This is the percentage of sequence identity within the window for the definition of conserved regions. There is no fixed default value; instead, the program determines the top 10 % of conserved windows (conserved regions in the alignment of the 2 sequences). Note that the first output may therefore appear too stringent, and you may wish to lower this value to include additional peak areas in the analysis.
- Window size: This defines the size of the sliding window (default 50 nt), i.e. the region which is used for the calculation of sequence similarity between the 2 sequences. 
- TF score threshold: This value defines the relative matrix score threshold (default 80 %), meaning the similarity of a sequence stretch in a promoter compared with the "matrix consensus" of each individual TFBS. This value is thus comparable to the "Matrix similarity" used by MatInspector. Note that when lowering this value to e.g. 70 % (less stringent), you will retrieve more potential TFBS, but you should compare the actual sequences in the promoters (in Alignment View) with the individual sequence logos presented in the pop-up windows of each TFBS.
3. JASPAR is a collection of transcription factor DNA-binding preferences, modelled as position-specific weight matrices (PSSMs). JASPAR has developed into a meta-database composed of the following sub-databases:
3.1. JASPAR CORE database contains a curated, non-redundant set of 123 profiles (2006) from published articles. All profiles are derived from published collections of experimentally defined transcription factor binding sites for multicellular eukaryotes. The database represents a curated collection of target sequences. The binding sites were determined either in SELEX experiments, or by the collection of data from the experimentally determined binding regions of actual regulatory regions; this distinction is clearly marked in the profiles' annotation. As far as possible, the collection is non-redundant (i.e., it avoids several models describing one transcription factor). The prime differences to similar resources (TRANSFAC, TESS etc.) are the open data access, non-redundancy and quality: JASPAR CORE is a smaller set that is non-redundant and curated. The JASPAR Help-section describes in detail how the profiles are generated.
3.2. JASPAR FAM database consists of models describing shared binding properties of structural classes of transcription factors. These types of models can be called "familial profiles", "consensus matrices" or metamodels. 
The models have two prime benefits: 1) Since many factors have similar target sequences, we often experience multiple predictions at the same locations that correspond to the same site. This type of model reduces the complexity of the results. 2) The models can be used to classify newly derived profiles (or to predict which structural class the cognate transcription factor belongs to).
3.3. JASPAR PHYLOFACTS database consists of 174 profiles that were extracted from phylogenetically conserved gene upstream elements. The JASPAR PHYLOFACTS matrices are a mix of known and as yet undefined motifs. They are useful when one expects that other factors might determine promoter characteristics, such as structural aspects and tissue specificity. They are highly complementary to the JASPAR CORE matrices, so are best used in combination with this matrix set. Please refer to this "JASPAR in brief" page for details.
NOTE: As access to TRANSFAC has been commercialized, and only the "public" version (which has not been updated for some years) is available for free, the open-access JASPAR database is a highly valuable resource.
Access JASPAR at Start Page (and select one of the three sub-databases):
- Browse by ID, Name, Species, Class, or Taxonomic group.
- Search by ID, Name, Species, Class, or Type.
- Upload a set of profiles: If saved in an earlier session, a file with a list of profile IDs can be submitted for viewing.
- Compare custom profile to database profile: A user-created profile can be submitted and compared to the profiles of TFBMs in the JASPAR database, using a Needleman-Wunsch algorithm. This can be useful if you have measured the binding activity of a novel TF (not present in JASPAR) with a series of oligonucleotides, are able to build a profile from these sequences, and want to check whether this binding profile is similar to that of a TF already present in JASPAR. Please refer to the JASPAR Help-section to see what profiles may look like. 
Note: Here, only profiles (and not consensus sequences) can be used ! If you want to use a consensus sequence, please refer to the Profile comparison tool at the "alternative" JASPAR site ! List Page of TFBMs: - The list page, produced by using "Browse" or "Search", presents a list of selected profiles. For each retrieved profile, basic attributes are displayed (ID, Name, species, class) together with a sequence logo for visual inspection. For detailed information regarding any profile, press the view link. Subsets of profiles can be exported, either as a simple list of IDs for simplified retrieval in subsequent database session or for usage in the CONSITE system for phylogenetic footprinting. - JASPAR also provides a "quick and easy" way of analyzing a promoter sequence (pasted in the field on the right side) for the presence of individual TFBMs, which have to be selected from the list first. Note that there is no option "Select all" which means that this feature is designed to just show the positions of single (or a small group of) TFBMs in a query sequence. For more complex analyses, the CONSITE system should be used. - A "detailed information page" is revealed as Javascript window when viewing individual TFBMs. This window displays a sequence logo, the actual count profile, and detailed information like protein sequence or PubMed links. Note: If you want to open such a page as "normal" browser window, you may select this URL and change the matrix accession "MA..." to the ID you are interested in. NOTE: The ZLAB list of JASPAR sites contains the JASPAR profiles reformatted to the TRANSFAC style, which therefore can be directly used as input for programs like Cluster-Buster (see Cluster-Buster main section). Note that the original JASPAR format displays the 4 nucleotides as lines (which lets you read the consensus from left to right) instead of columns in TRANSFAC (which lets you read the consensus top-down) ! 
Data Download: You may download the JASPAR database for free. You may, as example, download all sequences which contribute to the generation of individual TFBMs. 4. NOTE: Interestingly, there is another JASPAR website, but the site described above contains much more information, like detailed description of individual TFBMs. Nevertheless, there are some useful pages at this site, like: - Browse familial binding profiles: As many TFBMs are highly overlapping, this page presents the Logos of TF families and lists the individual members. - Profile comparison tool: This tool lets you compare either a profile or a consensus sequence (like WWCAAWG) with the JASPAR TFBMs. Typical JASPAR accessions: refer to section JASPAR IDs. |
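The two profile-related parameters mentioned for ConSite and JASPAR above, the minimum specificity in bits and the relative matrix score threshold (default 80 %), can be made concrete with a small sketch. It assumes a uniform background (0.25 per base), a pseudocount of 1, and a relative score defined as (score - minimum) / (maximum - minimum). These are common conventions, not details confirmed by the text, and the count matrix is invented. Only the forward strand is scanned, and the sequence must contain only A, C, G and T.

```python
from math import log2

BASES = "ACGT"


def information_content(counts):
    """Total information content in bits (uniform background, no small-sample correction)."""
    bits = 0.0
    for pos in range(len(counts["A"])):
        column = [counts[b][pos] for b in BASES]
        n = sum(column)
        entropy = -sum((c / n) * log2(c / n) for c in column if c > 0)
        bits += 2.0 - entropy
    return bits


def to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert a raw count profile (rows = bases, as in the JASPAR layout) to log-odds."""
    pwm = {b: [] for b in BASES}
    for pos in range(len(counts["A"])):
        n = sum(counts[b][pos] for b in BASES) + 4 * pseudocount
        for b in BASES:
            freq = (counts[b][pos] + pseudocount) / n
            pwm[b].append(log2(freq / background))
    return pwm


def scan(pwm, seq, threshold=0.8):
    """Return (start, relative_score) for windows scoring at or above the threshold."""
    width = len(pwm["A"])
    best = sum(max(pwm[b][i] for b in BASES) for i in range(width))
    worst = sum(min(pwm[b][i] for b in BASES) for i in range(width))
    hits = []
    for start in range(len(seq) - width + 1):
        score = sum(pwm[seq[start + i]][i] for i in range(width))
        relative = (score - worst) / (best - worst)
        if relative >= threshold:
            hits.append((start, relative))
    return hits


# Hypothetical 3-bp profile built from 10 binding sites that all read "ACG".
profile = {"A": [10, 0, 0], "C": [0, 10, 0], "G": [0, 0, 10], "T": [0, 0, 0]}
bits = information_content(profile)
hits = scan(to_log_odds(profile), "TTACGTT")
```

Lowering `threshold` from 0.8 to 0.7 admits windows that deviate further from the consensus, which is why the text recommends checking such hits against the sequence logo.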
| Phylogenetic Footprinting (2) -
FootPrinter (Washington University) |
FootPrinter
was developed by the Computational Molecular Biology Group at the
University of Washington. It is available at the respective software site
either as a web
service or as a downloadable
program. Note that there is also a very good FootPrinter
manual, explaining e.g. all the input parameters. PLEASE NOTE that FootPrinter is also implemented in the TOUCAN software package; please refer to the TOUCAN chapter for details.
1. Scope: FootPrinter is a program that performs phylogenetic footprinting. It takes as input a set of unaligned orthologous sequences from various species, together with a phylogenetic tree relating these species. It then searches for short regions of the sequences that are highly conserved, according to a parsimony criterion. The regions identified are good candidates for regulatory elements. By default, the program searches for regions that are well conserved across all of the input sequences, but this can be relaxed to allow finding regions conserved in only a subset of the species.
NOTE: FootPrinter is quite different from ConSite as it - searches multiple orthologous input sequences (not only 2), and - reports conserved motifs and not transcription factor binding sites. (But the motifs can be searched against specific databases to see whether they are known TFBS, see e.g. the TOUCAN chapter: MotifSampler).
2. Input (selected notes):
2.1. FootPrinter takes as input a set of orthologous or paralogous sequences in which you want to find motifs. The input sequences must all be listed in the same file, in FASTA format. Important: The name of each sequence (one word, after ">") must correspond to the name of some species in the phylogenetic tree. Each sequence must have a name that is different from that of the other sequences. Notice that this does not prevent you from having several sequences from the same species; all you need to do is to name them differently (e.g. mouse1, mouse2...). However, in this case, you will have to provide your own phylogenetic tree, with mouse1, mouse2... as leaves. The phylogenetic tree is given in postfix notation (the standard bracket format, also used by the PHYLIP package). Only leaves are labeled, with the name of the species. 
The phylogeny can contain more species than just those specified in the sequence file. In that case, unused species will be removed from the tree and branch lengths will be adjusted accordingly. Note: FootPrinter uses a "pre-made" standard phylogenetic tree ("Tree of Life") by default, which contains all commonly used species. ONLY if you use more than one sequence from one species do you have to provide your own tree. The FootPrinter manual contains examples of how to write a tree.
2.2. Maximum parsimony score: This is the parameter which most drastically influences the size of the output. It basically reflects the maximum number of mutations (this is also the description at the FootPrinter input page) allowed for the motifs to be reported. If the maximum parsimony score allowed is small, FootPrinter will run quickly and report only the most conserved motifs. Test runs show that even a value of 2 produces a huge list of "conserved" motifs, therefore it is often best to choose values of 1 or even zero.
2.3. Maximum number of mutations per branch: Allows at most a fixed number of mutations per branch of the tree. Normally, it is best to set this number to a small constant like one or two.
2.4. Subregion size and Subregion change cost: In some cases, you may want to penalize motifs whose position in the input sequences varies too much. This is done using these 2 options. The subregion size option (typical values: 20 to 200; in general, about one tenth of the sequence length) divides the input sequences into subregions of the given size, and penalizes motifs whose position (subregion) varies too much from species to species. Typical values for the subregion change cost are between 0 and 2. Note that even an increase from 1 to 1.5 can drastically reduce the size of the output ! 
This is especially useful if you are looking for motifs in promoter sequences which could be TFBS (Transcription Factor Binding Sites), and if you want to select for TFBS that appear at a more or less constant distance relative to the TSS (Transcription Start Site).
2.5. Allow regulatory element losses incl. Spanned tree significance level, and Motif loss cost: FootPrinter can find regulatory elements even if some species do not contain those regulatory elements. When regulatory element losses are allowed, the parsimony score of a motif is compared to the length of the tree spanned by the species containing the motif. You have the choice of three significance levels: "Very significant" will only report motifs that are the most significant, while "Significant" (default) and "Somewhat significant" will report more motifs. When regulatory element losses are allowed, a cost can be given to losing a particular motif along some branch of the tree (Motif loss cost).
3. Output: There are several output formats (HTML, PostScript, Text, Motif list).
3.1. The most instructive is the HTML output, which allows the interactive selection of motifs within the sequences, which are then displayed individually as colored boxes along the sequence graphs. The number of mutations of a motif is shown through the size of the font: the larger the font, the fewer mutations the motif contains. When you move the mouse over the highlighted regions, the score, position, evolutionary span and significance score of the region appear in the lower left corner of the browser. A high significance score means that the motif is unexpectedly well conserved. Note: In order to return to the full graph of ALL motifs, re-click on the previously selected motif. Only one motif at a time can be highlighted.
3.2. Text output: The text output file *.seq.txt contains almost the same information as the HTML file, but in a simpler representation. 
The first line below the sequence corresponds to the number of mutations found in the best motif overlapping with that position. The second line below the sequence corresponds to the identity of the motif (the corresponding motif in other sequences will have the same identity number). The third line gives the parsimony score. Note: At the time of writing, a test run displayed only 2 lines below the sequence; the third one, which should display the parsimony score, was missing. |
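The parsimony score that drives the FootPrinter output can be illustrated for a single candidate motif. The sketch below applies Fitch's small-parsimony algorithm column by column to one k-mer per species on a binary tree, which gives the minimum number of substitutions needed to explain those k-mers. It is a simplified illustration under these assumptions, not FootPrinter's search procedure (which explores all substrings and supports per-branch limits, position penalties and motif losses). The tree, species names and motifs are invented.

```python
def fitch_column(tree, leaf_base):
    """Fitch pass for one motif column; returns (candidate states, substitutions).

    tree: a species name (leaf) or a 2-tuple of subtrees (binary trees only,
    an assumption of this sketch). leaf_base: species name -> base at this column.
    """
    if isinstance(tree, str):
        return {leaf_base[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch_column(left, leaf_base)
    right_states, right_cost = fitch_column(right, leaf_base)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, motifs):
    """Minimum number of substitutions explaining one k-mer per species on the tree."""
    lengths = {len(m) for m in motifs.values()}
    if len(lengths) != 1:
        raise ValueError("all motifs must have the same length")
    k = lengths.pop()
    total = 0
    for i in range(k):
        column = {species: motif[i] for species, motif in motifs.items()}
        total += fitch_column(tree, column)[1]
    return total


# Hypothetical tree and motifs, for illustration only.
tree = (("human", "mouse"), ("chicken", "fugu"))
motifs = {"human": "ACGT", "mouse": "ACGT", "chicken": "ACGA", "fugu": "ACTA"}
score = parsimony_score(tree, motifs)
```

With a maximum parsimony score of 1, this hypothetical motif (score 2) would not be reported, which shows how strongly that parameter restricts the output.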
| Phylogenetic Footprinting (3) -
PhyloCon (Washington University in St.Louis) |
PhyloCon
(Phylogenetic
Consensus) is an algorithm that takes into
account both conservation among orthologous genes from different
species (Phylogenetic Footprinting), and co-regulation
of genes within a species. The first approach may also be called
"single gene, multiple species" and the second one "multiple genes,
single species" (as used e.g. for promoter sequences of co-regulated
genes identified by microarray profiling). PhyloCon can be regarded as a
"multiple genes, multiple species" approach. NOTE: There is currently no web interface for PhyloCon, but the program can be downloaded as a Linux executable. PhyloCon first aligns (by use of the program Wconsensus) conserved regions of orthologous sequences into multiple sequence alignments, or profiles, and then compares profiles representing non-orthologous sequences (as e.g. in clusters derived from microarray data). Motifs are found as unusually well conserved substrings by comparative genomic analysis. Note that PhyloCon does not need to know the length of the motif a priori. |
| Phylogenetic Trees (1) - Trees
from BLOCKS |
Trees from BLOCKS
displays phylogenetic trees from Multiple Sequence Alignments. Please refer to this main section for details. |
| Taxonomy | |
| Taxonomy 1 - Taxonomy Browser (NCBI) |
The Taxonomy
Browser provided by the NCBI is an easy to use, search
and navigate tool to retrieve species-related information. There are
also direct links to
some of the organisms commonly used in molecular research
projects. |
| Taxonomy 2 - Tree of Life (consortium) |
The Tree of Life is
a collaborative web project, produced by biologists from around the
world. On more than 2000 World Wide Web pages, the Tree of Life
provides information about the diversity of organisms on Earth, their
history, and characteristics. |
| Genomic
Mapping, Cytogenetics |
|
| e-PCR (NCBI) and UniSTS (NCBI) |
1. Electronic
PCR (e-PCR) allows you to search your DNA sequence for sequence
tagged sites (STSs), which have been used as landmarks in various
types of genomic maps. e-PCR looks for potential STSs in DNA
sequences by searching for subsequences that closely match the PCR
primers and have the correct order, orientation, and spacing that could
represent the PCR primers
used to generate known STSs. 1.1. Forward e-PCR: compares the query sequence against data in NCBI's UniSTS database, and reports all STSs which are found in the sequence. 1.2. Reverse e-PCR: compares your input PCR primer pair to the database of known STSs. The search is limited to 10 STSs per request. NOTE: Although one might assume otherwise, e-PCR is NOT a tool for general matching of PCR primer pairs onto genomes (see chapter "Primers" for that purpose !), but looks ONLY for known STS entries ! 2. UniSTS is a comprehensive database of sequence tagged sites (STSs) derived from STS-based maps and other experiments. STSs are defined by PCR primer pairs and are associated with additional information such as genomic position, genes, and sequences. UniSTS is also integrated in the NCBI ENTREZ cross-database search. |
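The matching criterion described for e-PCR (primers in the correct order, orientation, and spacing) can be illustrated with a minimal Python sketch. This is not NCBI's implementation: the function names, the `size_margin` tolerance, and the simple mismatch counting are assumptions made for illustration, and the sketch checks a single known STS primer pair on one strand only.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_sts_hits(sequence, forward_primer, reverse_primer,
                  expected_size, size_margin=50, max_mismatches=0):
    """Return (start, end, product_size) tuples for candidate STS hits.

    A hit needs the forward primer on the given strand, followed
    downstream by the reverse complement of the reverse primer, such
    that the product size lies within expected_size +/- size_margin.
    """
    sequence = sequence.upper()
    fwd = forward_primer.upper()
    rev_rc = reverse_complement(reverse_primer)
    # The product can never be shorter than the two primers together.
    low = max(expected_size - size_margin, len(fwd) + len(rev_rc))
    high = expected_size + size_margin
    hits = []
    for start in range(len(sequence) - len(fwd) + 1):
        if mismatches(sequence[start:start + len(fwd)], fwd) > max_mismatches:
            continue
        for size in range(low, high + 1):
            end = start + size
            if end > len(sequence):
                break
            window = sequence[end - len(rev_rc):end]
            if mismatches(window, rev_rc) <= max_mismatches:
                hits.append((start, end, size))
    return hits
```

Swapping the two primers, or placing them too far apart, yields no hit, which mirrors the "order, orientation, and spacing" requirement. A real search would also scan the opposite strand and compare against every STS in the database.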
| SNPs, Mutations, Disease | |
| NOTE:
This section covers
resources dealing with genomic variation data
of several kinds, such as Single Nucleotide Polymorphisms (SNPs), specific
repeat types, mutations, and associated diseases.
Please note that databases in the field of cancer research are described in section "Cancer". |
|
| cSNP Analysis (Applied Biosystems) |
The
tool cSNP
Analysis is part of the PANTHER
(Protein ANalysis
THrough Evolutionary
Relationships)
Classification
System. This tool estimates the likelihood that
a particular nonsynonymous coding SNP will cause a functional
impact on the protein. Please refer to the main section of cSNP Analysis for details ! |
| dbSNP and Entrez SNP (NCBI) |
The SNP database (dbSNP)
of
NCBI is a central, public repository for information related to
Single
Nucleotide Polymorphisms in the genomes of various species. dbSNP is
mainly composed of single nucleotide substitutions (99%), but also
contains small insertion/deletion polymorphisms, or microsatellite
repeats. The scope of dbSNP includes disease-causing clinical
mutations, as well as (in contrast to e.g. HGMD) neutral
polymorphisms. Therefore, SNP markers with unknown selective effect
are the majority of submitted records. NOTE: The philosophy of dbSNP is NOT to annotate the detailed biochemical or phenotypic consequences of a variation, but rather to provide links to external databases which address this question. Access dbSNP: - dbSNP can be queried directly using SNP Id, Locus Id, gene name or symbol. - SNP BLAST is a specific BLAST type that lets you scan your query sequence against the SNP database. - Entrez SNP is a part of the NCBI Entrez database system which incorporates the dbSNP database into Entrez. Thereby, SNP data can be queried using the same approach as the other Entrez databases such as PubMed and GenBank. - There is also a direct link from the Entrez Gene report of a gene to the SNP database, by following the "SNP" entries within the "Links" field. dbSNP records: The report pages of individual SNPs contain all detailed SNP information, such as method, submitter, a variation summary, a validation summary, a link to Entrez Gene, descriptions of the population containing the variation and frequency information by population or individual genotype. In addition, links to the major genome browsers are included: NCBI MapViewer, Ensembl Viewer and the UCSC Viewer. Typical dbSNP accessions: refer to section dbSNP IDs. |
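Because Entrez SNP is queried like any other Entrez database, a text query can also be expressed as a URL. The sketch below assumes NCBI's E-utilities `esearch.fcgi` endpoint with `db=snp`, which the description above does not mention; the helper name `build_snp_search_url` is hypothetical, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: the NCBI E-utilities "esearch" service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term: str, retmax: int = 20) -> str:
    """Build an Entrez search URL against the SNP database.

    term   -- free-text query, e.g. a gene symbol
    retmax -- maximum number of record identifiers to return
    """
    params = {"db": "snp", "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_snp_search_url("BRCA1", retmax=5))
```

Changing the `db` value would send the same style of query to another Entrez database, which is the point of the shared Entrez approach.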
| DiseaseInfo Viewer (H-InvDB) |
The
DiseaseInfo viewer is
a tool which is integrated into the H-Invitational
Database (H-InvDB), which provides an integrative
annotation of full-length cDNA clones. This
viewer displays information on known disease-related genes via
links to OMIM, LocusLink, and GeneLynx, but also shows
co-localized orphan diseases. Orphan disease (here)
means a disease mapped on the chromosomal region, but whose responsible
gene has
not been identified yet. Co-localization does not mean direct
relationships between gene and disease; however, genes that are
cytogenetically
co-localized with a disease could be possible candidate genes of that
disease. You first have to get the specific database entry of your gene of interest, either via BLAST (sequence) search or via keyword search, and then look for the specific disease link within the so-called "Locus view". Please also refer to the H-InvDB section at the Data Integration page for a detailed description ! |
| EGO (TIGR) |
The Eukaryotic Gene
Orthologs (EGO), previously called TIGR Orthologous Gene Alignments
(TOGA), is a database for orthologous genes in eukaryotes. Please refer also to the EGO main section. Please note that a special feature of EGO is the search for "Orthologs of human disease genes": human disease genes in the Online Mendelian Inheritance in Man (OMIM) database were matched to TIGR Human Gene Index accessions (THC numbers), and orthologs of these human disease genes were then identified using the EGO database. You can query using OMIM or LocusLink ID, gene name and various types of accession numbers. |
| GAD - Genetic Association Database (NIH) |
GAD
- Genetic Association
Database is an archive of human genetic association studies
of complex diseases
and disorders. GAD is maintained by the NIH
- National Institutes of Health. The goal of this database is to allow
the user to
rapidly identify medically relevant polymorphism from the large
volume
of polymorphism and mutational data. NOTE: As GAD integrates data from several resources (dbSNP, Ensembl, PubMed, pathway data, population data, NCBI SNP, HapMap, Map View, and others) and as GAD provides methods of high-throughput data retrieval (Batch Search), the main section of GAD is located at the main page "Data Integration"! |
| Glovar (Sanger) |
Glovar is a project from the
Sanger Institute, which shows sequence variation in a genomic context.
Glovar aims to compare all public human reads in the trace repository to the current
genome build and shows the read alignments and SNPs discovered plus
other public SNPs from dbSNP,
alongside the latest Vega and Ensembl genes. Currently, Glovar is
available for the human
genome. The Glovar interfaces have an "Ensembl look and feel", like
the option to query using a multitude of identifiers. It is also
possible to browse single chromosomes. Note: The SNP data stored in Glovar are also available via queries at the VEGA Human Genome Browser. Please refer to the VEGA section for details. Note: Links to Glovar are also available in the Entrez Gene records of genes: simply follow the "Links" button, and then "LinkOut", which produces a list of external providers (like Glovar) which are themselves responsible for maintaining these links. |
| HapMap (CSHL and many others) including: HapMart |
The
International HapMap
(Haplotype Map) project is a partnership of scientists and funding
agencies from Canada, China, Japan, Nigeria, the United Kingdom and the
United States to develop a public resource that will help researchers
find genes associated with human disease and response to
pharmaceuticals. 1. HapMap Background: The HapMap Overview page of the NHGRI provides a good introduction to the project. The haplotype map, or "HapMap," will be a tool that will allow researchers to find genes and genetic variations that affect health and disease. In human genetics, association studies aim to identify loci that contribute to disease susceptibility by comparing patterns of genetic variation between people with a disease (cases) and those without (controls). The DNA sequence of any two people is 99.9 percent identical. The variations, however, may greatly affect an individual's disease risk. Sites in the DNA sequence where individuals differ at a single DNA base are called single nucleotide polymorphisms (SNPs). Without any prior knowledge of which genes could be affected, researchers would have to screen about 10 million SNPs where the less common allele (MAF, see below) has a frequency of at least 1%. BUT there was an important observation: Sets of nearby SNPs on the same chromosome are inherited in blocks (see LD). This pattern of SNPs on a block is a haplotype. Blocks may contain a large number of SNPs, but a few SNPs are enough to uniquely identify the haplotypes in a block. The HapMap is a map of these haplotype blocks and the specific SNPs that identify the haplotypes are called tag-SNPs. Linkage Disequilibrium (LD) is the statistical association of nearby loci. The higher the LD, the higher the probability that 2 nearby SNPs are "inherited together" (meaning not separated by recombination events). Human recombination is concentrated into short (1-2 kb) hotspots that occur every 100-200 kb. Such recombination hotspots are often (not always) coincident with a breakdown of allelic association. The goal of HapMap is mapping this structure of allelic association across the human genome.
The HapMap should be valuable by reducing the number of SNPs required to examine the entire genome for association with a phenotype from the 10 million SNPs that exist to roughly 500,000 tag-SNPs. This will make genome scan approaches to finding regions with genes that affect diseases much more efficient and comprehensive, since effort will not be wasted typing more SNPs than necessary and all regions of the genome can be included. The initial aim of HapMap was to genotype one SNP every 5 kb in the human genome across 270 individuals from 4 geographic populations: - YRI: Yoruba people of Ibadan, Nigeria - CEU: people from the CEPH project in Utah - CHB: Chinese Han population in Beijing - JPT: individuals of Japanese ancestry from the Tokyo area. In total, over one million SNPs have been typed across these genomes in HapMap phase I, completed in Oct. 2005. 2. HapMart - High-throughput retrieval of HapMap data: HapMart is a data mining tool for retrieving data from the HapMap database. It is based on the BioMart interface. Please refer also to the BioMart section for general information concerning BioMart. HapMart is organized as a series of pages which have to be filled in successively. 2.1. Start: The HapMap data is organized as "datasets" for efficient querying. The data is available as individual population datasets for retrieving population based data. The "all population" dataset is typically used to query data across populations. Select one of the datasets and press "next" to proceed to the filter page. 2.2. Filter: Please refer to the Help page for a full list of descriptions. Some excerpts: - Minor Allele Frequency (MAF): The frequency at which the less abundant (or minor) allele of a SNP is present in a population. The MAF for a SNP to be considered common is usually above 1%. You can restrict data retrieval to SNPs above a selected MAF cut-off. - Monomorphic SNPs: A SNP for which only a single form or allele can be identified in the population of interest.
For the HapMap, a SNP is considered monomorphic in a particular population if not a single heterozygous individual could be found in the population sampled. - Gene: Filter to restrict the query to a list of known genes. A list of genes (comma separated) can be entered. - ENCODE Regions: Filter based on the 10 ENCODE regions selected for studies in the HapMap data set. Please refer also to the ENCODE section for general information. - Second Dataset: This option is provided to use filter attributes of a different population to restrict the retrieval of data from the main dataset. For example: list all the SNPs in the Han Chinese, Beijing population for ENCODE region ENm010 which are not monomorphic in the Japanese, Tokyo population. Please note that the data from the second dataset is not retrieved but used as a constraint on the main dataset. For this kind of query the summary section will not show the count. 2.3. Output: The output page offers the choice of downloading the genotype, frequency, assay and LD details. Various output formats can be selected, including fixed-column text, comma-separated text, tab-separated text and MS Excel. The output can be saved as a compressed file. - Genotype: A series of options define which data are included in the SNP list. Note that these options do not affect the number of SNPs added to the list but define the details to be included in the output. Some of the standard output details, available for all output options, are: marker name, population code, chromosome, position, strand, allele, reference allele. The all population dataset will have options to download the sample ids and the genotypes whereas the individual populations will have the sample ids as header for the genotype data just as in the HapMap genotype dump outputs. - SNP Frequency: All the frequency related attributes can be selected for download. These are organized into separate sections: snp details, snp allele frequency, snp genotype, snp genotype counts, snp genotype frequency.
- SNP Assays: Download the SNP Assay details including the genotyping center, genotyping protocol, assay LSID, protocol LSID. - LD details: LD values like D-prime, R-squared and LOD values can be downloaded for each population dataset. These are calculated for a 500kb window for each chromosome and are the same files available for bulk download from the HapMap web site. 3. Visualization of HapMap data via the UCSC Genome Browser: The UCSC Genome Browser provides several user-defined options in order to visualize HapMap data along the genome. When displaying a specific genomic region in the browser window, it is possible to display HapMap data by selecting the following options within the section "Variation and Repeats": - SNP Recombination Rates: This track shows recombination rates measured in centiMorgans per Megabase. It is based on the HapMap Phase I data, release 16a. - SNP Recombination Hotspots: This track shows the location of recombination hotspots detected from patterns of genetic variation. It is based on the HapMap Phase I data, release 16a. - HapMap LD: It is possible to display the HapMap linkage disequilibrium for several populations. Linkage disequilibrium (LD) is the association of alleles on chromosomes. It measures the difference between the observed frequency of a two-locus allele combination and its expected frequency, which is the product of the two single allele frequencies. When LD is low, the two loci tend to be inherited in a nearly random manner. This track shows three different measures of linkage disequilibrium (D', r2, and LOD, the log odds) between pairs of SNPs as genotyped by the HapMap consortium. LD is useful for understanding the associations between genetic variants throughout the genome, and can be helpful in selecting SNPs for genotyping. NOTE: Please refer also to the concise Help files, reachable via the links associated with the titles of the individual settings. |
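The quantities used in the HapMap description (minor allele frequency, the monomorphic criterion, and the LD measures D' and r2) can be made concrete with a small Python sketch. It uses the standard textbook formulas for two biallelic loci and is not taken from the HapMap or UCSC code; the function names and input formats are assumptions for illustration, D' is reported as an absolute value, and the LOD score is not computed.

```python
def minor_allele_frequency(allele_counts):
    """MAF of a biallelic SNP from a dict mapping allele -> observed count."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles observed")
    if len(allele_counts) < 2:
        return 0.0  # only one allele seen in the sample
    return min(allele_counts.values()) / total


def is_monomorphic(genotypes):
    """Apply the criterion quoted above: no heterozygous individual found.

    genotypes -- two-character strings such as "AA" or "AG".
    """
    return all(g[0] == g[1] for g in genotypes)


def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r2) for two biallelic loci.

    p_ab -- frequency of the haplotype carrying allele A at locus 1
            and allele B at locus 2
    p_a  -- frequency of allele A at locus 1
    p_b  -- frequency of allele B at locus 2
    """
    # D: observed haplotype frequency minus the product expected
    # if the two loci were independent.
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, d_prime, r_squared
```

For example, `ld_measures(0.5, 0.5, 0.5)` describes two SNPs whose alleles always occur together, while `ld_measures(0.25, 0.5, 0.5)` describes two SNPs that are inherited independently.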
| HGMD - The Human Gene Mutation Database (University of Wales) |
HGMD
is
maintained by the Institute of Medical Genetics, Cardiff, University of
Wales and is the leading database storing
not only mutations in human genes but also curated polymorphisms
showing clear phenotypes. HGMD is also directly linked via NCBI Entrez
Gene entries. Please note that the public version of this database is free only for users from academic institutions/non-profit organisations, and requires registration. Both commercial and academic/non-profit users wishing to access the most up-to-date version of the database may license HGMD and the accompanying programs from BIOBASE. Note that new records are available for exactly one year via the commercial version only, and after this period also via the public one. 1. Background: HGMD comprises various types of mutation within the coding regions of human nuclear genes causing inherited disease. Somatic mutations and mutations in the mitochondrial genome are thus not included, although in the latter case, links to Mitomap are now provided. Silent mutations within the coding region which do not alter the encoded amino acid are also not recorded, unless such mutations are known to adversely affect mRNA splicing or gene expression, or have been reported in significant association with disease, in which case they may be included. HGMD does not usually include mutations lacking obvious phenotypic consequences. Concerning polymorphisms, in order to be included, there must be a convincing association of the polymorphism with the phenotype. 2. Search: HGMD can be searched either by disease, gene name or gene symbol. 3. Locus-specific mutation databases: A considerable number of locus-specific mutation databases (concerning individual single genes/diseases) have been constructed and made publicly available. Many of the lesions present in these databases are included in HGMD. However, the locus-specific databases may contain additional unpublished material. Typical HGMD accessions: refer to section HGMD IDs. |
| HGVbase (EBI and others) |
HGVbase (Human Genome
Variation Database) is an attempt to summarize all known sequence
variations in the human genome, to facilitate research into how
genotypes affect common diseases, drug responses, and other complex
phenotypes. Sequence variations are presented with details of how they are physically and functionally related to the closest neighbouring gene. Records include SNPs, indels, simple tandem repeats, and other sequence alternatives, regardless of location, allele frequencies, or known effect on phenotype. HGVbase is the product of a European consortium involving the Karolinska Institute (Sweden), the European Bioinformatics Institute (UK), and the European Molecular Biology Laboratory (Germany). NOTE: You can perform either text searches or FASTA3 sequence searches. |
| OMIM (NCBI) |
OMIM
is a catalog
of human genes and genetic disorders. The database contains
textual information (like "mini-reviews" !),
pictures, and reference information. It also contains links to
NCBI's Entrez database of MEDLINE articles and sequence information.
Derived from the biomedical literature, OMIM is written and edited at
Johns Hopkins University with input from scientists around the world.
Each OMIM entry has a full-text summary of a genetically determined
phenotype and/or gene and has numerous links to other databases (HUGO,
GenBank, UniGene, RefSeq, Entrez Gene). At the end of each entry a list
of credits and edit history is presented, listing author names
and creation / update dates. 1. Search OMIM via Entrez: "Entrez"-style search. Please note that when choosing "Limits", you may restrict your search to specific chromosomes, or to individual fields of the OMIM database. 2. Search the OMIM Gene Map: The OMIM gene map presents the cytogenetic map location of disease genes (in chromosomal order) and other expressed genes described in OMIM. 3. Search the OMIM Morbid Map: This is a catalog of genetic diseases and their cytogenetic map locations arranged alphabetically by disease. Typical OMIM accessions: refer to section OMIM IDs. |
| SNP Consortium (CSHL and many others) |
The SNP
Consortium Ltd. is a non-profit foundation organized for the
purpose of providing public genomic data. Its mission is to develop up
to 300,000 SNPs (Single Nucleotide Polymorphisms)
distributed evenly throughout the human genome and to make the
information related to these SNPs available to the public without
intellectual property restrictions. This page, maintained at the Cold Spring Harbor Laboratory,
provides information on the
submitter, detailed method protocols, and
detailed sequence
information on every individual SNP. There are various ways to access the data, such as table search, graphical search, gene/contig search, GenBank search, and text search. |
| SNPInspector (Genomatix Inc., Munich, Germany) |
SNPInspector
checks whether a SNP (Single Nucleotide Polymorphism) in your own
sequence has any effects on transcription factor binding sites
(TF binding sites that are lost or created due to the SNP). The
analysis is based on MatInspector
and Genomatix' library of matrix descriptions for transcription factor
binding sites. MatInspector is started using default parameters. NOTE: SNPInspector is part of the "GEMS Launcher"-section of the commercial Genomatix Suite. NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is not only a limitation in the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also in the functionality of the obtained data ! |
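The idea of a binding site being "lost or created" by a SNP can be sketched with a toy position weight matrix. This is not the MatInspector algorithm or a Genomatix matrix: the count matrix, the pseudocount, the uniform background, and the score threshold are all hypothetical values chosen for illustration, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

# Hypothetical count matrix for a 4-bp motif with consensus TGAC.
HYPOTHETICAL_COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds scores."""
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def best_site_score(sequence, matrix, snp_index):
    """Best score among all windows that overlap the SNP position."""
    width = len(matrix)
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(sequence) - width)
    scores = [
        sum(matrix[k][sequence[start + k]] for k in range(width))
        for start in range(first, last + 1)
    ]
    return max(scores) if scores else None


def classify_snp(sequence, snp_index, alt_base, matrix, threshold):
    """Return 'lost', 'created', or 'unchanged' for one SNP and one matrix."""
    alt_sequence = sequence[:snp_index] + alt_base + sequence[snp_index + 1:]
    ref_score = best_site_score(sequence, matrix, snp_index)
    alt_score = best_site_score(alt_sequence, matrix, snp_index)
    ref_hit = ref_score is not None and ref_score >= threshold
    alt_hit = alt_score is not None and alt_score >= threshold
    if ref_hit and not alt_hit:
        return "lost"
    if alt_hit and not ref_hit:
        return "created"
    return "unchanged"
```

With the toy matrix above and a threshold of 5.0, changing the G of `AATGACAA` (index 3) to T removes the only TGAC match, so the site is classified as lost; the reverse substitution creates it.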
| SNP - links from Ensembl | NOTE: The Ensembl database uses the same SNP accession numbers as the NCBI database. It is therefore possible to query the Ensembl SNP database with these numbers, which yields comprehensive report pages for every SNP, including links to the NCBI database and to the SNP Consortium. |
| Cancer | |
| NOTE:
This section
specifically lists resources in the field of cancer
research. As different types of data are collected here, like cytogenetic and expression
information, this section builds a bridge between the main sections "GENOMICS" and "EXPRESSION".
Please note that databases dealing with genomic variation data in general are described in section "SNPs, Mutations, Disease". |
|
| Cancer
Chromosomes (NCBI Entrez) |
Cancer
Chromosomes is an NCBI-Entrez database which was compiled
from three databases, the NCI/NCBI SKY/M-FISH & CGH
Database, the NCI Mitelman
Database of Chromosome Aberrations in Cancer, and the NCI Recurrent
Aberrations in Cancer. Search for cytogenetic, clinical, and/or reference information. Queries are performed using the same approach as for other Entrez databases such as PubMed and Nucleotide. Search using any of three methods: Entrez Query Box, Simple Search (see icon), or Advanced Search. NOTE: There is a very user-friendly "Simple Search" form, which allows a quick-search using chromosome positions (like "5q31"), or selection by affected tissue and / or diagnosis (type of cancer). |
| CGAP - Cancer Genome Anatomy Project (NCI) including: CGAP Genes CGAP Gene Finder CGAP Batch Gene Finder CGAP GO Browser CGAP Nucleotide BLAST CGAP Chromosomes SNP500Cancer CGAP Tissues GLS cDNA xProfiler DGED SAGE Genie SAGE DGED SAV CGAP Pathways BioCarta Pathways on CGAP KEGG Pathways on CGAP RNAi on CGAP CGAP Tools |
CGAP - Cancer Genome
Anatomy Project is an NCI
(National Cancer Institute) resource which offers a
comprehensive molecular characterization of normal, precancerous, and
malignant cells. It contains genomic data for humans and mouse,
including transcript sequence, gene expression patterns, SNPs,
clone resources, and cytogenetic information. Informatics tools are
provided to
query and analyze the data. Interconnected modules provide access to all CGAP data, bioinformatic analysis tools, and biological resources allowing the user to find "in silico" answers to biological questions: 1. CGAP Genes: 1.1. CGAP Gene Finder: 1.1.1. Single gene search: Genes can be searched via gene symbol and several types of accession numbers. The created (single) gene list displays the query ID, gene name and symbol, and RefSeq accessions, as well as links to the specific Gene Info page. Note: See 1.2. for a description of the NCI60 and SAGE expression data links on the Gene List page. The Gene Info page contains the following data: - database links, like Entrez Gene, Unigene, as well as specialized databases like SNPViewer, SNP500Cancer, RNAi, and more. - expression data: SAGE and NCI60. - cytogenetic location, including Mitelman breakpoint data. - protein domains and motifs. - homologs (from Homologene). - associated GO terms. - Pathway data (Biocarta and KEGG). 1.1.2. Gene list filters: the user may filter the whole database by tissue expression (cDNA expression data), GO terms, cytogenetic location, and keywords. For example, it is possible to generate a gene list of all human genes expressed in vascular tissue (tissue type) and having a function in inflammatory response (GO term). NOTE: there is a very nice integrated GO term help, which lets you easily find the GO terms of interest. Example: type "transcription*" to get all GO terms including transcription factor activity. 1.2. CGAP Batch Gene Finder: In order to use the Batch Gene Finder, prepare a text file containing the list of (human OR mouse) gene symbols, UniGene clusters, accession numbers, protein accession numbers, UniProt (SwissProt) protein accessions, UniProt (SwissProt) protein identifiers (like "ACTB_HUMAN") or Entrez Gene numbers. The text file must list the identifiers in a vertical column, e.g. export a one-column Excel sheet in txt (tab-delimited) format.
The created gene list displays the query ID, gene name and symbol, and RefSeq accessions, as well as links to the individual Gene Info pages (see above for details). In addition, the following links are available: - "Common View" lets you create a table displaying all GO terms, Pathways (KEGG and Biocarta), motifs, SNPs, and cytogenetic locations for the complete input gene set. Common aspects of the listed genes are highlighted. This is a very convenient and quick way to create an annotation table for a gene set of interest containing the most important function-related data, which is per se independent of the expression in cancer situations ! This table can also be saved as tab-delimited text. - In addition, the expression of the whole gene set can be viewed as a colored graph within the NCI60 panel of cancer cell lines (please refer to the NCI60 section for background). - The link "SAGE Summary" displays the SAGE counts of the input gene set in a series of normal and cancer tissues. 1.3. CGAP GO Browser: The Gene Ontology (GO) Consortium is developing a dynamic controlled vocabulary that can be used to annotate all eukaryotic genes. GOA (GO Annotation@EBI) is a project run by the European Bioinformatics Institute that aims to provide assignments of gene products to the Gene Ontology (GO) resource. CGAP has designed the Gene Ontology (GO) Browser to allow you to look through this hierarchical vocabulary, and to find the known human and mouse genes assigned to each term. Clicking on a hyperlinked number in parentheses after a term produces a list of the genes within this section, with links to the individual Gene Info pages. You may want to refer to this help file for details. 1.4. CGAP Nucleotide BLAST: CGAP provides an alternative interface to the BLAST tool to specifically ask whether a DNA sequence is similar to any UniGene clusters. If a gene similarity is found, a link to the Gene Info page is provided. 2. CGAP Chromosomes: 2.1.
The Mitelman Database of Chromosome Aberrations in Cancer relates chromosomal aberrations to tumor characteristics, based either on individual cases or associations. All the data have been manually culled from the literature by Felix Mitelman, Bertil Johansson, and Fredrik Mertens. CGAP has developed five web search tools to help you analyze the information within the Mitelman Database. Note: This database is also searchable via the Entrez Cancer Chromosomes page ! 2.2. SNP500Cancer: The SNP500Cancer is specifically designed to generate resources for the identification and characterization of genetic variation in genes important in cancer. The SNP500Cancer data represents one of several initiatives within the Genetic Annotation Initiative (CGAI), designed to characterize variation, as a resource for applying genetic approaches to understanding the etiology of different cancers as well as related phenotypes. Query: You can search for SNPs using SNP identifier, gene symbol, gene alias, chromosome location, or gene ontology pathway. Alternatively, a list of ALL genes with analyzed SNPs can be browsed. Example: Search using "BRCA1" produces a list of SNPs, separated into those showing a variation in cancer populations and those showing no variation. Tools are provided for further SNP data analysis. 2.3. FISH-mapped BACs: CGAP is generating a set of BAC clones (available to the public) that have been mapped cytogenetically by FISH and physically by STSs to the human genome. The BAC data is integrated into various CGAP and NCBI databases to provide related clinical, histopathologic, genetic, and genomic information. 2.4. SNP Maps: show the genetic and physical locations of confirmed, validated, and predicted SNPs per individual chromosome. 3. CGAP Tissues: 3.1. cDNA Library Finder: The Library Finder tool can find a single cDNA library or a group of libraries, depending on the search criteria selected.
The search first returns a Library List page, from which each library is linked to its own Library Info page. This page contains sequence and clone information and details of the library's preparation. 3.2. GLS - Gene Library Summarizer: The GLS Tool finds all the genes expressed in a single cDNA library or group of cDNA libraries. It then classifies the genes as unique or non-unique, and then further identifies the genes in each of these groups as known or unknown. 3.3. The cDNA xProfiler is a tool that compares gene expression between two pools of libraries. For a gene to be "present" in a library pool, there must be at least one EST sequence found in the UniGene cluster for that gene. This tool allows to generate datasets of "unique" ESTs via comparison of e.g. different tissues or between normal and cancer libraries of the same tissue. 3.4. DGED - cDNA Digital Gene Expression Displayer is a tool that compares gene expression between two pools of libraries. In contrast to the xProfiler, the DGED treats the presence of a gene in a library pool as a matter of degree. It compares the "degree" of presence of a gene in pool A with its "degree" of presence in pool B. This comparison is reduced to two numbers: the sequence odds ratio and measure of significance. 4. SAGE Genie: The SAGE Genie website provides highly intuitive, visual displays of human and mouse gene expression, based on a unique analytical process that reliably matches SAGE tags, 10 or 17 nucleotides in length, to known genes. 4.1. SAGE DGED: The SAGE Digital Gene Expression Displayer is a tool that identifies those genes that are expressed at significantly different levels (as defined by the user) in two pools of human libraries, based on SAGE tag analysis. The algorithm takes into account the differences in sample size between Pools A and B, which can be large. 
The user selects a value for statistical significance (P value) and a value for the difference in the level of expression (F value) between the two pools. The results are based on the sequence odds ratio and a measure of significance.
4.2. SAV - SAGE Anatomic Viewer: displays gene expression in human normal and malignant tissues by shading each organ in one of ten colors, each representing a different level of gene expression. Gene expression levels are based on the analysis of counts of SAGE tags, which are either "short" (10 bp), including "extracted short" (10 bp extracted from a 17 bp tag), or "long" (17 bp). SAV can be used to:
4.2.1. find the best tag for a gene / accession number: NOTE: SAV is an excellent resource to examine the reliability of individual SAGE tags, meaning the probability that a tag is "unique" or that it matches more than one gene (which makes expression data analysis very difficult). The best tags are color coded.
- In addition, the LTV (Ludwig Transcript Viewer) display, showing shorter alternative polyadenylated and internally primed transcripts, supports the prediction of reliable SAGE tags. The tag link enables the user to see which other gene(s) may be represented by the particular tag and the reliability of each mapping.
- The Digital Northern (DN) display shows the expression of a particular gene (SAGE tag count) in each individual SAGE library as a color-coded tag count.
- The SAGE Anatomic Viewer itself displays the SAGE tag expression count as colored organ images which are hyperlinked to a Digital Northern displaying the tag expression in each individual library.
4.2.2. find the best gene for a tag: this is the reverse of the search in 4.2.1.
5. CGAP Pathways: Pathways on the CGAP web site have been obtained directly from BioCarta (to create BioCarta Pathways on CGAP) and KEGG (to create KEGG Pathways on CGAP). 
In addition, CGAP has linked each human gene in BioCarta and each human enzyme in KEGG to its CGAP Gene Info page, and each intermediary metabolite in KEGG to a CGAP Compound Info page. Example: The NF-kB signaling pathway from BioCarta. NOTE: The "Genes" link produces a Gene List of all genes seen in this pathway diagram, including all the options for further data analysis as described in the section "CGAP Batch Gene Finder" above!
6. RNAi on CGAP: The NCI is part of the consortium supporting the preparation of human and mouse libraries containing RNAi constructs that target cancer-relevant and other genes. The clones produce small RNA molecules, called shRNA (short hairpin RNA), and are available to the public from a commercial distributor. A tool called RNAi Gene Searcher is provided which searches for genes containing RNAi constructs.
7. CGAP Tools: All the NCI tools presented on the CGAP web site are listed here, organized by biological function. |
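The difference between the xProfiler (presence or absence) and the DGED (degree of presence) described in the CGAP entry above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the odds ratio uses the textbook form rather than CGAP's documented formula (which the text does not give), and a one-sided Fisher exact test stands in for the unspecified "measure of significance".

```python
from math import comb


def xprofiler_unique(pool_a, pool_b):
    """Return (genes present only in pool A, genes present only in pool B).

    pool_a / pool_b map gene name -> EST count. A gene counts as
    "present" when at least one EST was found for it.
    """
    present_a = {gene for gene, n in pool_a.items() if n >= 1}
    present_b = {gene for gene, n in pool_b.items() if n >= 1}
    return present_a - present_b, present_b - present_a


def sequence_odds_ratio(count_a, total_a, count_b, total_b):
    """Textbook odds ratio (assumed form, not CGAP's documented one).

    Assumes each count is smaller than its pool total.
    """
    odds_a = count_a / (total_a - count_a)
    odds_b = count_b / (total_b - count_b)
    return float("inf") if odds_b == 0 else odds_a / odds_b


def fisher_one_sided(count_a, total_a, count_b, total_b):
    """One-sided Fisher exact p-value (assumed stand-in for significance).

    Probability that at least count_a of the gene's sequences fall into
    pool A if the gene were equally frequent in both pools.
    """
    n = total_a + total_b          # all sequences in both pools
    k = count_a + count_b          # all sequences of this gene
    upper = min(k, total_a)
    favourable = sum(
        comb(k, x) * comb(n - k, total_a - x)
        for x in range(count_a, upper + 1)
    )
    return favourable / comb(n, total_a)
```

For example, a gene with 8 of 1000 sequences in pool A and 2 of 1000 in pool B is simply "present in both" for the xProfiler, whereas the DGED-style functions report how strongly it is enriched in pool A.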
| NCI60 - Cancer Microarray Project (Stanford and others) |
NCI60 Cancer
Microarray Project: Survey of gene expression in a panel of 60
NCI cancer cell lines exhibiting patterns related to their
tissue of origin. Tissue categories analyzed include hematopoietic,
epithelial, melanocytic, and mesenchymal types. (Ross et al. 2000
Nature Genetics 24: 227-34). Query: On the Search Page you can perform keyword or BLAST searches, which lead to graphical images showing the expression in various cancer tissues relative to the non-diseased tissues (red = upregulated, green = downregulated). NOTE: This NCI60 dataset can also be queried at several other sites, for example within CGAP. Please refer to the CGAP section for details! |
| Oncomine (University of Michigan) |
Oncomine
is a resource for
examining gene expression in cancer. The goal of the project is
to
collect, standardize, analyze, and deliver published cancer gene
expression data to the research community. Probe the expression of a
gene across thousands of cancer samples or explore genes, processes,
and pathways deregulated in a particular type of cancer. Oncomine
pre-computes cancer profiles, clusters, and gene set modules so the
user can
focus on discovery. Oncomine was developed by physicians, scientists,
and software engineers at the University of Michigan. Note: Registration is required; it is free for academic institutions.
1. Search / Browse Oncomine:
1.1. "Gene Search" allows you to query using several types of identifiers, like gene name, gene symbol, Entrez Gene ID, Affymetrix ProbeSet IDs, and more.
1.2. "Profile Search" allows you to query using keywords like cancer types, tissue types, clinical parameters, and more. Alternatively, you may browse all cancer profiles by clicking the icon and then use the filters to find the profile of interest.
1.3. "Tools" -> "Catalog" presents lists of all PubMed entries which were used to build Oncomine, categorized by source tissue. You may, for example, get a quick overview of papers related to breast cancer by selecting "breast" as tissue type. Those studies which allow further data analysis in Oncomine contain an "Analyze" link.
2. Results:
2.1. "Gene Search": Try "CDK4" as an example. First, an overview is presented showing the gene name and aliases. Links to the following data are shown:
- Summary: presents the "Differential activity map". This map summarizes significant differential expression of a gene of interest grouped by tissue type and analysis type. Three types of analyses are summarized on the Summary page: Normal vs Normal, Cancer vs Normal, and Cancer vs Cancer. These analysis types provide an overview of the tissues in which the query gene is most over-expressed. Each colored box signifies significant differential expression. Red signifies over-expressed, blue signifies under-expressed, and the color intensity signifies the level of significance. The number indicates the number of significant Oncomine profiles at the selected p-value. The "P-Value Threshold" (default: 1e-4) can be changed by the user: "+" increases the threshold and displays more results, "-" decreases the threshold and displays fewer results.
- Annotation: contains detailed annotation from various databases for the gene of interest, and the respective links (e.g. Entrez Gene, UniGene, Swiss-Prot, orthologs, drug targets, HiMAP interactions, and pathway information from BioCarta, KEGG, and Reactome).
- Diff/Ex: the "Differential Expression Module" tab appears on every page in the Gene module and allows the user to click and navigate between sub-modules based on the selected gene. There are several ways to filter and sort the presented studies, for example to display only those studies where the gene of interest is significantly over- or under-expressed. An icon links to a so-called box plot which shows the expression levels as normalized expression units.
- Co/Ex: The Co/Ex module provides co-expression results for the query gene. Co-expression results are based on standard average linkage hierarchical clustering performed on each Oncomine study. Clusters that include the query gene are sorted by their intra-correlation; only clusters with more than 10 genes are displayed. "Count" is the number of genes in the cluster. The clickable HeatMap icon will open a pop-up window and draw a correlation heatmap displaying the co-expression of the query gene and the cluster genes. The clickable GeneList icon will open a gene list in a pop-up window listing the genes most co-expressed with the query gene. A scatter plot can be drawn for each combination of the query gene and a co-expressed gene!
- Outlier: presents the gene outliers.
2.2. Profile Search: This search first presents a list of studies filtered by certain criteria. These include source tissue (like breast or prostate), and several analysis types (like cancer vs. normal, cancer vs. cancer etc.).
- For each study, the number (and percentage) of up-, down-, and differentially expressed genes is indicated. Up signifies the number of genes significantly over-expressed at a q-value cutoff of 0.05. 
For two-class analyses, Up refers to genes over-expressed in class 2 relative to class 1. In multi-class analyses, Up refers to genes correlated with the progression from class 1 -> class 2 -> class 3, etc. The HeatMap icon will open a pop-up window and draw a heatmap displaying the genes most differentially expressed in a profile. The GeneList icon will open a gene list in a pop-up window displaying the genes most differentially expressed in a profile.
- The Advanced Analysis icon links to advanced analysis options for a specific dataset, including:
- Enrichment of GO terms, KEGG pathways, BioCarta pathways, InterPro domains, TRANSFAC matrices (-1000 bp), conserved promoter motifs, conserved UTR motifs, PicTar predicted miRNA target genes, or HPRD interaction sites in this dataset. NOTE that this is a very powerful tool for in-depth analysis of cancer expression datasets!
- Interactome: draws all known protein-protein interactions for the top genes of a specific dataset. Interaction data are taken from the HiMAP database.
- Pathway: lists BioCarta pathways that contain the top genes of a dataset.
NOTE: The Adobe SVG viewer plugin is needed in order to display some of the graphical maps! Tests showed that this plugin works much better in MS Internet Explorer than in Netscape, at least in NS version 7.2! |
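Two of the Oncomine filters described above (the p-value threshold on the Summary page and the display rule for Co/Ex clusters) can be pictured with a small Python sketch. This is not Oncomine code: the function names, the dictionary layout of a cluster, the use of "less than or equal" for the threshold, and the descending sort order are all assumptions made for illustration.

```python
def count_significant_profiles(p_values, threshold=1e-4):
    """Count profiles whose p-value passes the threshold.

    Raising the threshold (the "+" control) can only keep or increase
    the count; lowering it (the "-" control) can only keep or reduce it.
    """
    return sum(1 for p in p_values if p <= threshold)


def displayed_clusters(clusters, query_gene, min_genes=10):
    """Keep clusters that contain the query gene and have more than
    min_genes members, sorted by intra-correlation (highest first,
    which is an assumed order).

    Each cluster is a dict with the hypothetical keys "genes" (list of
    gene names) and "intra_correlation" (float).
    """
    kept = [
        c for c in clusters
        if query_gene in c["genes"] and len(c["genes"]) > min_genes
    ]
    return sorted(kept, key=lambda c: c["intra_correlation"], reverse=True)
```

With p-values of 1e-6, 5e-5, and 2e-3, the default threshold of 1e-4 keeps two profiles, and relaxing it to 1e-2 keeps all three.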