Bioinformatics World FAQ Center
  FAQ Index -> GENOMICS
                -> GENOM1...scan my gene (set) of interest for the presence of SNPs (Single Nucleotide Polymorphisms)? (last update Feb. 22, 2006)
                -> GENOM2...search for (also distant) orthologs/homologs of my gene/protein of interest ? (last update May 5, 2006)
                -> GENOM3...see which regulatory elements are conserved in a set of orthologous promoters (Phylogenetic Footprinting) ? -> see GEN5 !
                -> GENOM4...get all human proteins present in Drosophila but not in C. elegans ? (last update Apr. 14, 2004)
                -> GENOM5...know which genes of a specific dataset are associated with a disease ? (last update Jun. 7, 2005)
                -> GENOM6...identify Conserved Non-coding Sequences (CNS) and conserved transcription factor binding sites in large genomic regions via comparative genomics ? (last update Mar. 14, 2006)
                -> GENOM7...know all genes associated with cardiovascular diseases having a described polymorphism in the promoter region ? (last update Jun. 7, 2005)
                -> GENOM8...know the genes and drugs related to diseases like atherosclerosis ? -> see CHEM2 !      
                -> GENOM9...analyze the expression of a gene set of interest in cancer tissues ? (last update Feb. 13, 2006)       
                -> GENOM10...determine the expression profiles of normal vs. cancer tissues ? (last update Feb. 13, 2006)        
      
      
                

                  
GENOM1...scan my gene (set) of interest for the presence of SNPs (Single Nucleotide Polymorphisms)? (last update Feb. 22, 2006)

    Today, various databases store information on Single Nucleotide Polymorphisms (SNPs) and make these data retrievable. In general, we may subdivide these resources according to the type of source sequence from which the SNPs are derived: coding sequences, whole transcript sequences, or genomic sequences. In addition, they may be compared by whether they support only single-gene queries or also provide a batch-retrieval option for high-throughput dataset analyses.

1. SNPs in transcript sequences:

    Tip! Maybe the best way to query for single transcript-specific SNPs is to use the following NCBI resources. Query Entrez Gene with the gene name of interest, maybe also specifying the species of interest via the "Limits" tab, in order to retrieve the specific Entrez Gene-report, example GeneID:5743 (PTGS2). The sidebar contains, among many others, links to SNP data of this gene. You may either directly go to "SNP: GeneView" or you first open the link "SNP" and then hit "GeneView" in one of the SNP hits (example: GeneView of human PTGS2). Alternatively, you may query Entrez SNP with the gene name of interest which produces the same result page. The GeneView can be easily modified in order to display either only those SNPs located within the coding region or all SNPs located within the gene (including the UTRs of the transcript), which are distinguishable via a simple color code. Note that the promoter region or other flanking genomic regions are not displayed in GeneView. Please refer to the section below for information.
    Entrez SNP incorporates the NCBI SNP database into the Entrez database system. This database stores all SNP detail information, like method, submitter, a variation summary, and a validation summary. The example Reference SNP rs20426 shows a SNP affecting the first amino acid of human PTGS2 (Met/Ile).
    In general, you can also BLAST the NCBI SNP database with your query sequence. You can thereby choose the chromosome to BLAST against. The output gives a good overview of SNPs found, and shows their precise position and surrounding sequence.

    If you want to get a "quick-view" of the SNPs located in the cDNA of your gene of interest, you may also query Ensembl for the specific gene name. When you open the respective Ensembl Gene Report (example: ENSG00000073756 for PTGS2), you may follow the link "Gene variation info", which opens the GeneSNPView (example: PTGS2). This view is somewhat similar to the SNP GeneView at NCBI and displays a list of all SNPs in the transcript region. Note that a small part of the flanking promoter region might be included in this SNP list.
    In the Ensembl Gene Report, there is also a link to the Ensembl Transcript Report (example: ENST00000186982 for PTGS2). At the cDNA sequence section, the user can now select to display also exons, codons, translations, and SNPs. There is a simple color code in order to distinguish between SNPs which are located in UTRs and those which are located in coding regions. The latter are divided into those producing synonymous and those producing non-synonymous changes in the coding sequence. Note that it is not possible to click onto individual SNPs here in order to retrieve detailed information, but you may use the link "Export SNP Info in Region" in order to use BioMart to generate the respective data file (see also below, as BioMart can also be used for high-throughput retrieval of SNP data).
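
    As a minimal sketch of the distinction this color code draws between synonymous and non-synonymous coding SNPs, the following Python snippet translates the affected codon before and after a substitution using the standard genetic code. The function name classify_coding_snp and the short example sequence are hypothetical and not part of Ensembl; positions are assumed to be 0-based within the coding sequence.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(cds, position, alt_base):
    """Classify a single-base substitution in a coding sequence.

    cds      -- coding sequence starting at the first base of the start codon
    position -- 0-based position of the substituted base within cds
    alt_base -- the alternative allele (one of A, C, G, T)
    Returns (kind, reference amino acid, alternative amino acid).
    """
    offset = position % 3
    codon_start = position - offset
    ref_codon = cds[codon_start:codon_start + 3].upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


# Hypothetical example: a change in the third base of the start codon
# (ATG -> ATA) replaces Met by Ile; a change CTC -> CTG leaves Leu unchanged.
print(classify_coding_snp("ATGCTC", 2, "A"))
print(classify_coding_snp("ATGCTC", 5, "G"))
```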

    Glovar is a project from the Sanger Institute which shows sequence variation in a genomic context. Glovar aims to compare all public human reads in the trace repository to the current genome build and shows the read alignments and the SNPs discovered, plus other public SNPs from dbSNP, alongside the latest Vega and Ensembl genes. Currently, Glovar is available for the human genome. The Glovar interfaces have an "Ensembl look and feel", like the option to query using a multitude of identifiers. It is also possible to browse single chromosomes. The output page (example: PTGS2) is very similar to the GeneSNPView described above. Note: The SNP data stored in Glovar are also available via queries at the VEGA Human Genome Browser. Please refer to the VEGA section for details.

     H-Invitational Database (H-InvDB) is a human gene database opened to the public in April 2004, which is hosted by the Japan Biological Information Research Center (JBIRC) and by the DNA Databank of Japan (DDBJ). The scope of H-InvDB is to provide an integrative annotation of full-length cDNA clones available from high throughput cDNA sequencing projects. If you want to scan a cDNA sequence of interest for SNPs, you may perform a simple keyword query, and then look at the "cDNA view", which lists all SNPs within the "cDNA information" section (example: PTGS2). The links directly connect to the NCBI SNP database. Please refer to the H-InvDB section at the Data Integration page for additional information. 

2. SNPs in genomic sequences (including transcript regions):

    Note that the promoter region or other flanking genomic regions are not displayed in NCBI's GeneView of SNP records. If you are interested in these regions, several resources are available.

    The UCSC Genome Browser is an excellent site to retrieve all kinds of genome-related data, also SNP data. Simply select the species of interest and enter a gene name or any other identifier which is supported, also chromosomal positions (example: human PTGS2 gene). In order to display SNPs, simply set the option "SNPs" within section "Variation and Repeats" to "full" or "pack" and hit the "Refresh" button. All Reference SNPs will be shown in the selected sequence window, color coded to reflect SNPs within transcribed regions vs. those in intergenic regions and to reflect SNPs producing synonymous vs. non-synonymous changes in the coding sequence. Each SNP can be selected to display all relevant database links, including a cross-reference to NCBI dbSNP. Note that you may easily analyze the promoter region of a gene by adding a few kb at the respective end of the sequence window via the "Position" field of the browser. Note that it is possible to highlight the nt-positions of SNPs when selecting the "DNA" link for sequence download and select the option "SNP" in the "Extended DNA Case/Color Options". Note that it is also possible to generate a tab-delimited table listing all SNPs in the sequence window, by using the Table Browser and choosing the group "Variation and Repeats", track "SNPs".
    Note that the UCSC Genome Browser View of a specific gene is also retrievable via the Entrez Gene entry at NCBI, following the link "UCSC" in the sidebar.
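
    The step of "adding a few kb at the respective end of the sequence window" depends on the strand of the gene: for a gene on the plus strand the promoter lies before the start coordinate, for a gene on the minus strand it lies after the end coordinate. A minimal sketch, assuming 1-based coordinates; the function name window_with_promoter and the coordinates used are made up for illustration and are not the real position of any gene.

```python
def window_with_promoter(chrom, tx_start, tx_end, strand, upstream=2000):
    """Return a 'chrom:start-end' string that also covers the upstream region.

    tx_start/tx_end -- transcript coordinates on the chromosome (start < end)
    strand          -- '+' or '-'
    upstream        -- number of bases to add on the promoter side
    """
    if strand == "+":
        # Promoter lies at lower coordinates; never go below position 1.
        start, end = max(1, tx_start - upstream), tx_end
    elif strand == "-":
        # Promoter lies at higher coordinates.
        start, end = tx_start, tx_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return f"{chrom}:{start}-{end}"


# Hypothetical coordinates; the result can be pasted into a position field.
print(window_with_promoter("chr1", 10000, 20000, "-", upstream=2000))
```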

    Tip! The Ensembl genome browser may also be used for this purpose. Simply query for the specific gene name. When you open the respective Ensembl Gene Report (example: ENSG00000073756 for PTGS2), you can retrieve the coordinates of the genomic location which also links to the Ensembl genome browser view. In this browser view, you can simply add a few kb within the position number fields in order to display the (potential) promoter region. Then, you have to select "SNP" from the "Features" drop down menu which will display all SNPs color-coded which are located within the sequence region. Each SNP can be displayed in detail in the so-called SNPView of Ensembl. Again, you may use the link "Export SNP Info in Region" in order to use BioMart to generate the respective data file (see also below, as BioMart can also be used for high-throughput retrieval of SNP data). Note that a very nice feature of this option is that BioMart generates tables which can be opened in Excel and which maintain all hyperlinks ! NOTE: The "example 2" described under point 3. for sets of genes can also be used for SINGLE genes !!!

3. High-throughput retrieval of SNP data:

    Tip! A very powerful resource for data retrieval is BioMart. BioMart is a data retrieval tool that generates lists of biological objects (e.g. genes, SNPs) from data held in the Ensembl (and other) databases. Please refer to the BioMart section at the main page for general details.
    Example 1: BioMart can be used to retrieve datasets of e.g. SNPs from large genomic regions. There are 2 different ways for this purpose. The first is to retrieve the genomic region you are interested in via the species-specific genome browser view starting at the Ensembl home page, example: human genome browser, where you can type in the from-to coordinates of your region or search for a specific gene. Having the genomic region displayed in the browser window, simply use the link "Export SNP Info in Region" in order to proceed directly to the BioMart Output page to generate the respective table. In BioMart, the user can customize the types of SNP attributes which will appear in the final table, like type of validation, genomic region and amino acid changes.
    The second way is to start a "de novo" BioMart session, select "SNP" as dataset and the species like "Homo sapiens". At the Filter page, you may now select the chromosome and genomic from-to positions, as well as general SNP filter options, which include "validated SNPs only", or "only SNPs with allele frequency data from north american population". In addition, gene associated SNP filters may be set, like SNPs located in coding regions, in introns, or in 5' flanking regions etc. Finally, a filter can be added to search only for SNPs affecting amino acid changes or affecting splice sites, etc. At the Output page, the options are the same as described above. NOTE that if you want to download a multi-sequence file of all SNPs you should select "Sequences" in the menu "Attribute Page" on top of the Output page.
    Example 2: BioMart can also be used to retrieve SNP data from a set of genes (gene names, Entrez Gene IDs, etc.) which do not have to be located in a common genomic context. For this purpose, start a BioMart session and select e.g. "Ensembl" and "Homo sapiens genes". At the Filter page, paste a list of gene-specific identifiers into the field "ID list limit". In addition, you may filter for SNPs in specific regions (coding etc.), SNPs which result in amino acid changes, splice site changes etc., and you may filter for validated SNPs only. Note that the number of hits now relates to the number of input genes and not to the number of SNPs found! Select "SNPs" in the menu "Attribute Page" on top of the Output page. Here, you may finally select the attributes of the gene associated SNPs which you want to retrieve. Note that "gene associated" SNPs apparently also include SNPs which are located up to about 5 kb upstream of the transcribed region, meaning that SNPs affecting the proximal promoter region should be contained in the final output. Of course, this also works for SINGLE genes!

    Tip! HapMart is a data mining tool for retrieving data from the HapMap database, storing data on genetic variations like SNPs. It is based on the Biomart interface. Please refer to the main section of HapMart for details concerning SNP databases, filter settings, and output / download options.

4. SNP - Disease correlations: (please refer also to FAQs GENOM5 and GENOM7)
   
    You may search the HGMD, the Human Gene Mutation Database, maintained by the Univ. of Wales (Cardiff) for SNP-Disease correlations. HGMD stores not only mutations in human genes but also curated polymorphisms showing clear phenotypes, meaning polymorphisms extracted from literature which have a significant effect. HGMD is also directly linked via NCBI Entrez Gene entries. HGMD supports only single queries. The output not only lists nucleotide substitutions affecting coding regions but also affecting splice sites and regulatory sequences.

    Tip! GAD - Genetic Association Database is an archive of human genetic association studies of complex diseases and disorders. GAD is maintained by the NIH - National Institutes of Health. The goal of this database is to allow the user to rapidly identify medically relevant polymorphism from the large volume of polymorphism and mutational data, in the context of standardized nomenclature. The data is from published scientific papers. Study data is recorded in the context of official human gene nomenclature with additional molecular reference numbers and links. It is gene centered. That is, each record is a record of a gene or marker. Please also refer to the GAD section at the Data Integration page for a general description.    
    GAD provides an extremely useful Batch Search option if you want to quickly analyze whole sets of genes, derived from e.g. microarray data, in order to see which of these genes are associated with a known disease. You enter official human gene symbols (max. 300) as a list (comma, space, tab, or new line separated) in the text area. If you have other types of identifiers (Accession, UniGene, etc.), you can go to batch SOURCE to get the identifiers translated to official Gene Symbols. Refer also to the SOURCE Batch Search section for instructions! Finally, you will retrieve a table displaying the known disease correlations of your genes of interest (categorized by "Aging", "Cancer", "Cardiovascular", "Development", "Immune", and others) where you may click on all individual records. Single records, like the example of TP53 and colorectal cancer, display a lot of links to diverse databases, like allele description, polymorphism class, PubMed links, pathway data, population data, as well as links to SNP databases (NCBI SNP, HapMap, Map View). In addition, links to "experts in the field" are provided. Note: If you choose the option Positive Only, then the output list of your query is reduced to those records describing a positive association with a certain disease. Note: In general, GAD also stores negative association data (often published only in obscure scientific journals).
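
    The input rules of the Batch Search (at most 300 official gene symbols, separated by comma, space, tab, or new line) can be prepared locally before pasting. The following is only a sketch: the function name prepare_symbol_batches is hypothetical, it does not contact GAD, and removing duplicates is an added convenience rather than a documented requirement.

```python
import re


def prepare_symbol_batches(raw_text, max_per_batch=300):
    """Split free-form text into de-duplicated batches of gene symbols.

    Symbols may be separated by commas, spaces, tabs, or new lines.
    The original order is kept; repeated symbols are dropped.
    """
    symbols = []
    seen = set()
    for token in re.split(r"[,\s]+", raw_text.strip()):
        if token and token not in seen:
            seen.add(token)
            symbols.append(token)
    # Cut the list into consecutive chunks of at most max_per_batch symbols.
    return [
        symbols[i:i + max_per_batch]
        for i in range(0, len(symbols), max_per_batch)
    ]


# Each batch can be pasted into the text area as one comma-separated line.
for batch in prepare_symbol_batches("TP53, PTGS2\tAPOE\nTP53"):
    print(",".join(batch))
```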

5. Prediction of functional impacts of SNPs:

    Tip! The tool cSNP Analysis is part of the PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System. This tool estimates the likelihood that a particular nonsynonymous coding SNP will cause a functional impact on the protein. It calculates the subPSEC (substitution position-specific evolutionary conservation) score based on an alignment of evolutionarily related proteins, as described in Thomas et al., 2003 and Thomas & Kejariwal, 2004. As input simply paste a protein sequence and enter the substitution(s) relative to this input sequence in the standard amino acid substitution format, e.g. A265V. Multiple substitutions should be separated by a tab, space, or return. You may use the integrated example protein APOE by clicking onto the "?" to get an impression how the program works. The output is a table containing the subPSEC scores which describe the probability that a certain substitution will have a deleterious effect or may even have a gain-of-function effect. The user can click on the link on the number of the multiple sequence alignment (MSA) position to view the column in the MSA where the substitution occurs (red color).
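
    Before submitting, it can help to check that each substitution in the standard format (e.g. A265V) actually matches the pasted protein sequence. A minimal sketch, assuming 1-based residue numbering relative to the input sequence; the function name parse_substitutions and the toy sequence are hypothetical and unrelated to the PANTHER implementation.

```python
import re

# One-letter reference residue, 1-based position, one-letter new residue.
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitutions(text, protein):
    """Parse substitutions such as 'A265V' and validate them against protein.

    text    -- substitutions separated by tab, space, or return
    protein -- the protein sequence the positions refer to
    Returns a list of (reference, position, replacement) tuples.
    """
    results = []
    for token in text.split():
        match = SUBSTITUTION.match(token.upper())
        if match is None:
            raise ValueError(f"not in the form A265V: {token!r}")
        ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
        if pos < 1 or pos > len(protein):
            raise ValueError(f"position {pos} is outside the sequence")
        if protein[pos - 1].upper() != ref:
            raise ValueError(
                f"{token}: sequence has {protein[pos - 1]} at position {pos}"
            )
        results.append((ref, pos, alt))
    return results


# Toy sequence for illustration only.
print(parse_substitutions("K2R\tA5V", "MKVLA"))
```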
                           
                                


GENOM2...search for (also distant) orthologs/homologs of my gene/protein of interest ? (last update May 5, 2006)

    Please note that in general, there are several approaches to deal with this question. Of course, you can perform simple homology searches like BLAST using your nucleotide (BLASTN) or your protein (BLASTP or TBLASTN) query along with the appropriate databases, and then "manually" screen the results. In addition, there are some strategies for the detection of distant homologies, which are described below. Related programs are also found in the section "Sequence Similarity" at the main pages. Finally, there are a number of databases which "pre-compute" these sequence similarities, provide cross-species information on orthologous/homologous genes/proteins, and try to create protein families. These sites, which are described below, are categorized either under "Genomics" ("Comparative Genomics", "Phylogenetics") or under "Proteins" ("Domains, Families") at the main pages.
    Note: If you want a very quick and easy approach to this question, I would recommend the UCSC Gene Sorter, described under "Protein Families" below!

1. Refined homology detection via multiple sequence alignments:

1.1. Starting with SINGLE protein sequences:

    A well known strategy for the detection of distant homologies is to search not only with one query sequence but to first generate a multiple sequence alignment of close sequence neighbours and then perform database searches using the consensus derived from such alignments. This procedure emphasizes the importance of conserved residues in the sequence and assumes that distantly related proteins also somehow preserve this conservation. In a second step (see below) the retrieved sequence homologs may again (manually) be aligned to a (now larger) sequence alignment.

    Tip! PSI-BLAST is a Protein-BLAST derivative available at the NCBI BLAST page. PSI-BLAST takes a single protein sequence as input. Position-Specific Iterative (PSI) BLAST is a program based on the BLAST 2.0 algorithm that is designed to detect weak relationships between the query and members of the database not necessarily detectable by standard BLAST searches. The added sensitivity of this program over regular BLAST comes from the use of a profile that is constructed (automatically) from a multiple alignment of the highest scoring hits in the initial BLAST search. The profile is generated by calculating position-specific scores for every position in the alignment. A highly conserved position will receive a high score and weakly conserved positions receive scores near zero. The profile is then used to perform additional BLAST searches (called iterations) and the results of each iteration are used to refine the profile. You may also check the BLAST guide for information. Usually, after a few iterations, no new sequences (marked in order to be quickly identified) are added to the list of reliable homologs which show E-values below a certain threshold (default E=0.005).
    If you want to, you may now select those "reliable" entries, and download this sequence set in FASTA format via the "Get selected sequences" button at the bottom of the output page.
    Note: Please refer to the main PSI-BLAST section for details on multiple alignment construction, score matrices, sequence weights, and BLAST applied to a PSSM.
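
    The idea of a profile with position-specific scores can be illustrated with a strongly simplified sketch. It is not the PSI-BLAST algorithm: it assumes a gap-tolerant column count, a uniform background frequency, a flat pseudocount, and no sequence weighting, all of which are simplifications chosen here for brevity. The names build_pssm and score_sequence are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build log-odds scores per alignment column (simplified profile).

    alignment -- list of equally long aligned protein sequences
    Returns one dict per column mapping amino acid -> log2 odds score.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(
            row[col] for row in alignment if row[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, sequence):
    """Sum the column scores for an ungapped sequence of profile length."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, sequence))


# Columns 1 and 2 are fully conserved, column 3 is variable.
profile = build_pssm(["ACD", "ACE", "ACF", "ACG"])
print(round(profile[0]["A"], 2), round(profile[2]["D"], 2))
```

    In this toy profile the conserved residue in the first column scores clearly higher than any residue in the variable third column, which is the behaviour the text describes.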

    BLASTView of Ensembl provides an integrated platform for sequence similarity searches against Ensembl databases. As there is a very user-friendly selection of the search sensitivity, ranging from "exact matches" to "distant homologies", it is possible to use this program also for high-sensitivity searches. Nevertheless, the usage of NCBI's PSI-BLAST is much clearer and the output offers better options of data analysis and download.

    The BLOCKS server is a service for biological sequence analysis at the Fred Hutchinson Cancer Research Center in Seattle, Washington, USA. "Blocks" are short multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. BLOCKS server provides several methods to search a single sequence against the BLOCKS and Prints databases.
    Block Searcher compares a protein or DNA sequence to the current database of protein blocks. Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened. It is advisable to search both Blocks+ and Prints. Blocks+ has automatically generated blocks, while Prints has hand-crafted blocks.
    IMPALA (Integrating Matrix Profiles And Local Alignments) searches a protein query sequence against a multiple alignment database represented as a collection of PSI-BLAST checkpoint files. IMPALA has been implemented on the Blocks Server to search a blocks database, such as Blocks+. Although the Blocks Searcher performs a similar type of search, there are differences between IMPALA and the Blocks Searcher in the PSSMs used, in the alignments reported, and in the calculation of statistics that can lead to somewhat different results. Therefore, any marginal similarity detected with one searching program should be confirmed using the other. Both programs generally detect true positive hits but they tend to report different false positives, and so any hit not detected by both searching programs should be regarded with caution.
    RPS-BLAST (Reverse Position-Specific BLAST) is one of the BLAST series of searching programs from NCBI. RPS-BLAST uses the query sequence to search a database of pre-calculated PSSMs (in this case PSI-BLAST checkpoint files made from multiple alignments of protein families) and reports significant hits in a single pass. The role of the PSSM has changed from "query" to "subject", hence the term "reverse" in RPS-BLAST. Here the checkpoint files are made from the Blocks and Prints alignments, and are the same files searched with the similar IMPALA searching program.

1.2. Starting with MULTIPLE unaligned protein sequences (if you already know the closest relatives of your sequence):

    Tip! Block Maker, which is also part of the BLOCKS server, finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms. At least two related protein sequences (and a maximum of 250) must be provided to make blocks. Each sequence must have a unique name of 10 characters or less. If you have the accession numbers of some sequences you would like to use, NCBI Entrez or similar programs can create a file for you in FASTA format. The main difference to programs like ClustalW is that Block Maker does not perform a global alignment covering the complete sequences but selects for the sequence blocks which are best conserved. Block Maker output allows a series of follow-up analyses, like the display of phylogenetic trees, or sequence logos (see below).
    Note: If you already have generated a multiple alignment by programs like ClustalW, please use the Blocks Multiple Alignment Processor instead (see below), as test runs showed that this further improved the significance of the extracted blocks.

 
1.3. Starting with a pre-made sequence ALIGNMENT of related protein sequences:   

    Tip! You may, for example, take the sequence set produced by PSI-BLAST and perform a ClustalW multiple sequence alignment. Please refer also to FAQ SIM3 for this purpose. You can now investigate conserved regions either via the "built-in" Java applet JalView or by coloring the alignment using Boxshade (see FAQ GRAPH1). Alternatively, you may submit the ClustalW alignment to the Blocks Multiple Alignment Processor which carves out ungapped regions between the requested minimum and maximum widths of the submitted multiple alignment and converts them to Blocks format, thus again providing all the follow-up analyses as described below.                
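
    The carving of ungapped regions out of a multiple alignment can be sketched as follows. This is an illustration only, not the Blocks Multiple Alignment Processor itself: how the real tool treats runs wider than the maximum is not stated here, so the sketch assumes such runs are split into consecutive pieces and that pieces narrower than the minimum are dropped. The function name ungapped_blocks is hypothetical.

```python
def ungapped_blocks(alignment, min_width=3, max_width=60, gap_chars="-."):
    """Extract ungapped column runs from a multiple alignment.

    alignment -- list of equally long aligned sequences
    Returns a list of (start_column, rows) with 0-based start columns.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks = []
    run_start = None
    # Iterate one column past the end so a trailing run is closed as well.
    for col in range(length + 1):
        gapped = col == length or any(
            row[col] in gap_chars for row in alignment
        )
        if not gapped:
            if run_start is None:
                run_start = col
            continue
        if run_start is not None:
            # Assumption: split over-long runs into pieces of max_width.
            for start in range(run_start, col, max_width):
                end = min(start + max_width, col)
                if end - start >= min_width:
                    rows = [row[start:end] for row in alignment]
                    blocks.append((start, rows))
            run_start = None
    return blocks


# Toy alignment: only the first three columns form a block of width >= 3.
print(ungapped_blocks(["MKV-LAGT", "MKVALA-T"], min_width=3))
```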

1.4. Follow-up analyses using conserved multiple aligned protein sequences:

The output of programs like Block Searcher, IMPALA, RPS-BLAST, Block Maker, or Blocks Multiple Alignment Processor is twofold.

    First, there are several ways to display the conserved sequence blocks and the general relationships between the sequences. The blocks may be shown in different textual formats as well as in a graphical map. The conservation profile of blocks can be visualized as sequence logos (please refer to the main section "Sequence Logos" and to FAQ SIM4 !). Also, phylogenetic trees may be constructed from the input sequences. Finally, 3D Blocks is a search and display tool which allows you to view blocks on a protein structure (if available).
         
    Second, the extracted blocks themselves may be used to search protein sequence databases for additional homologs. This can be done by either using the whole blocks (multiple alignments) as query, like in the programs LAMA and MAST, or by using the deduced consensus sequence, as in the program COBBLER.
    LAMA (Local Alignment of Multiple Alignments) is a program for comparing protein multiple sequence alignments with each other. The program can search databases of such multiple alignments for sequence similarities between conserved regions of protein families. The method is sensitive and can detect weak relationships between protein families that lie beyond the range of conventional sequence database searches.
    COBBLER means COnsensus Biasing By Locally Embedding Residues. A single sequence is selected from a set of blocks and enriched by replacing the conserved regions delineated by the blocks by consensus residues derived from the blocks. Embedding consensus residues improved performance with readily available single sequence query searching programs, such as BLAST and FASTA, in comprehensive tests. This is especially useful in PSI-BLAST searches.
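
    The embedding step can be pictured with a small sketch. It is deliberately simplified and is not the COBBLER program: the consensus is taken here as a plain majority vote per block column, and the sequence to enrich is supplied by the caller instead of being selected automatically. The names block_consensus and embed_consensus are hypothetical.

```python
from collections import Counter


def block_consensus(block_rows):
    """Return the majority residue of each column of an ungapped block."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*block_rows)
    )


def embed_consensus(sequence, blocks):
    """Replace block regions of a sequence by the block consensus.

    sequence -- the single sequence to enrich
    blocks   -- list of (start, rows); start is the 0-based position of
                the block within sequence, rows are the aligned segments
    """
    residues = list(sequence)
    for start, rows in blocks:
        consensus = block_consensus(rows)
        if start < 0 or start + len(consensus) > len(residues):
            raise ValueError("block does not fit into the sequence")
        residues[start:start + len(consensus)] = consensus
    return "".join(residues)


# Toy data: the block consensus TGY replaces TAY at positions 3 to 5.
print(embed_consensus("MKTAYIAKQR", [(2, ["TAY", "TGY", "TGY"])]))
```

    The enriched sequence could then serve as a single query for programs such as BLAST or PSI-BLAST, as described above.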
                  

2. Comparative Genomics, Phylogenetics:

    Tip! BLink (BLAST Link) is an extremely useful source of information concerning protein homologs and orthologs across multiple species. BLink is not available as a stand-alone program but as a link in each protein record stored in NCBI Entrez, see example report. Note that in case there is no direct link to BLink from Entrez Gene, you may first open the respective HomoloGene link, and then jump to BLink from the protein records displayed in HomoloGene. BLink entries are based on pre-computed sequence alignments, generated from routine all-against-all BLAST comparisons performed at NCBI. The best 200 of these alignments can be displayed. BLink reports are highly customizable. Conserved protein domains are shown on top of the alignment, with links to the NCBI CDD database. The alignments are depicted graphically and are color-coded on the basis of taxonomic origins (having some "look and feel" of the COG database). Each alignment is displayed in NCBI's BLAST 2 sequences format. All protein hits can also be displayed in their specific BLink reports. The "Best Hits" format button displays ONLY THE BEST HIT IN EACH SPECIES, allowing a very quick access to the potential orthologs of a protein in other species. The "Common Tree" button displays the BLAST hits along the branches of the taxonomic tree, allowing for selection of individual species. The "Taxonomy Report" button lists the BLink results as a BLAST taxonomy report. The "3D structures" button limits the output to those sequences derived from structure records (linked via the colored dots in the "3D structures" display). The "CDD search" button links to a pre-computed conserved domain display for the query sequence. Note that BLink is also integrated in the "data super-integration tool" Bioinformatic Harvester of the EBI.
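
    If you have exported a list of hits yourself, the effect of the "Best Hits" view (only the best hit in each species) can be reproduced with a few lines. This is a sketch under assumptions: the record layout with the keys "species", "accession", and "score" is hypothetical and does not correspond to a BLink download format.

```python
def best_hit_per_species(hits):
    """Keep only the highest-scoring hit for every species.

    hits -- iterable of dicts with the keys 'species' and 'score'
    Returns a dict mapping species -> best hit record.
    """
    best = {}
    for hit in hits:
        current = best.get(hit["species"])
        if current is None or hit["score"] > current["score"]:
            best[hit["species"]] = hit
    return best


# Hypothetical hit list with made-up accessions and scores.
hits = [
    {"species": "Mus musculus", "accession": "P1", "score": 950},
    {"species": "Mus musculus", "accession": "P2", "score": 610},
    {"species": "Danio rerio", "accession": "P3", "score": 720},
]
for species, hit in best_hit_per_species(hits).items():
    print(species, hit["accession"], hit["score"])
```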

    Tip! HomoloGene is an NCBI resource of curated and calculated orthologs for genes as represented by UniGene or by annotation of genomic sequences. The calculated homologs are the result of nucleotide sequence comparisons between each pair of organisms maintained by the database. For the comparisons, EST and mRNA sequences from UniGene are used, as well as transcripts extracted from annotation of genomic sequences. The list of available organisms is shown at the HomoloGene start page. HomoloGene can be queried via keyword search (gene names, symbols, accessions), but not via sequence search. HomoloGene entries are also linked at Entrez Gene pages of individual genes. HomoloGene is now integrated in NCBI's Entrez database system.

    The NCBI databases COG (Clusters of Orthologous Groups of proteins) and KOG (Eukaryotic orthologous groups) were constructed by comparing protein sequences encoded in a list of complete prokaryotic (COG) and eukaryotic (KOG) genomes. KOG covers species like human, Drosophila, C. elegans, Arabidopsis, and S. cerevisiae. In general, please also refer to the COG and KOG database descriptions at the main page for detailed information. The underlying premise is that orthologs are more similar to each other than they are to any other protein from the respective genomes ("reciprocal best hits"). In multiple-genome comparisons, pairs of potential orthologs can be joined to form clusters of orthologs. Note that a COG is built by definition by proteins from at least 3 sufficiently distant species ("3 clades").
    If we just concentrate on the KOG database, there are several ways to access these data using your protein query sequence. One which might be already known from other questions is to perform a CD-Search at the NCBI database CDD (Conserved Domain database). Previously, CDD contained protein domains from Smart, Pfam, and NCBI-specific data, but later was updated to also show similarities to existing COGs or KOGs. The output links reveal multiple alignments, as well as direct access to the COG and KOG database entries.
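
    The "reciprocal best hits" premise behind COG and KOG can be sketched for two genomes. The sketch assumes you already have pairwise similarity scores from two searches (genome A against B and B against A); the function names and the protein identifiers are hypothetical, and the clustering across three or more species is not shown.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores -- dict mapping (query, subject) -> similarity score
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in forward.items() if backward.get(b) == a
    )


# Made-up scores: hsA and dmX pick each other, hsB has no reciprocal partner.
a_vs_b = {("hsA", "dmX"): 200, ("hsA", "dmY"): 50, ("hsB", "dmX"): 120}
b_vs_a = {("dmX", "hsA"): 190, ("dmY", "hsB"): 40}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```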

    The Eukaryotic Gene Orthologs (EGO), previously called TIGR Orthologous Gene Alignments (TOGA), is a database for orthologous genes in eukaryotes. EGO is generated by pair-wise comparison between the Tentative Consensus (TC) sequences (contig sequences of individual EST clusters) that comprise the TIGR Gene Indices from individual organisms. The EGO database can be accessed through the SEARCH function. You can perform a BLAST search or search using gene names or TIGR accessions. If available, you will retrieve a "tentative ortholog" accession, which groups predicted orthologs from a list of species. In addition, a ClustalW multiple sequence alignment of these orthologous cDNA sequences is displayed.
    Please note that a special feature of EGO is the search for "Orthologs of human disease genes". For this purpose, human disease genes in the Online Mendelian Inheritance in Man (OMIM) database were matched to a TIGR Human Gene Index accession (THC number), and orthologs of these genes were then identified using the EGO database. You can query using an OMIM or LocusLink ID, a gene name, or various types of accession numbers.

    PhyloBLAST is a program to perform a  molecular phylogenetics analysis of a protein sequence. PhyloBLAST accepts only protein sequences. PhyloBLAST uses BLASTP to find related amino acid sequences in the Swiss-Prot database. The first result is a "BLAST style" graphic including all pairwise alignments. You may select those sequences desired, for a full phylogenetic analysis, starting with a ClustalW multiple sequence alignment. A choice of Phylip programs, including parsimony, UPGMA, neighbor joining and distance matrix methods, produces a phylogenetic tree.


3. Protein Families:

    Tip! The UCSC Gene Sorter is an excellent resource for exploring gene families and the relationships among genes. This tool displays a table of genes within a selected genome that are related to one another. Several different relationships may be explored: protein-level homology, similarity of gene expression profiles, or genomic proximity. The Browser supports searches on a variety of terms and phrases, including the gene name, the SwissProt protein name, a GenBank accession, or a word or phrase present in a gene's description. The gene family display is highly configurable, allowing the user to control the order and number of columns, the number of rows, and the genes displayed. Please refer also to the UCSC Gene Sorter main section for details!
    Concerning this specific question, you have 3 options to sort by protein homology (BLASTP, Rankprop, PSI-BLAST), and then you can easily download the sequences via the "sequence" button. NOTE: UCSC Gene Sorter primarily looks for homologs within the SAME SPECIES! BUT: Via "configure" it is possible to display the gene orthologs (best BLASTP hits in Ensembl) of a list of species, like mouse, zebrafish, drosophila, C.elegans, and yeast. The tool provides several output formats, including a simple tab-delimited format that may be imported into a spreadsheet or a relational database. In addition, the sequences of the displayed genes can be downloaded: cDNA, protein, genomic and promoter (!) sequences, allowing a user-definition of upstream and downstream regions.

    Tip! The Ensembl project also provides tools to look if your query belongs to a predicted protein family. First, you have to get the data file of your gene of interest, which you can do either using TextView or by BLAST (sequence) search. Note that for this step, it is best to search "All" indices, and not only the "Family" index. At the "Ensembl Gene Report", you mostly will find "gene(s) that have been identified as putative homologues by reciprocal BLAST analysis" from other species. At the "Transcript Summary", you often will see a link to a predicted "Protein Family", having a unique "ENSF..." accession number. Following this link, you will get lists and multiple alignments of these protein sequences, offering many download options (ClustalW, FASTA, MSF,...). Note that in contrast to the UCSC Gene Sorter, you will retrieve essentially the orthologs ("reciprocal best hits") and homologs with high sequence similarity, but not other, more distantly related genes/proteins.

    Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. For each family in Pfam you can: Look at multiple alignments, view protein domain architectures, examine species distribution, follow links to other databases, and view known protein structures. Pfam is a database of two parts. Pfam-A is the curated part of Pfam containing over 7255 protein families. To give Pfam a more comprehensive coverage of known proteins a supplement called Pfam-B is automatically generated. This contains a large number of small families taken from the PRODOM database that do not overlap with Pfam-A. Although of lower quality Pfam-B families can be useful when no Pfam-A families are found. There are several ways to query Pfam, like Protein sequence search, DNA sequence search, or Keyword Search. You will retrieve lists of matching Pfam-A and -B hits, as well as pairwise alignments to your query sequence.

    TIGRFAMs are a collection of protein families featuring curated multiple sequence alignments, Hidden Markov Models (HMMs) and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family, where achievable, complements classification by orthologs, superfamily, domain or motif. Use this page to see the curated seed alignment for each TIGRFAM, the full alignment of all family members and the cutoff scores for inclusion in each of the TIGRFAMs. You can query TIGRFAMs by Text Search or query by sequence using the option Sequence Search. Note that by default, both databases (TIGRFAMs and PFAM) are searched, and that TIGRFAMs are also searched automatically when performing InterPro queries. PFAM is a collection of HMM models of protein families complementary to TIGRFAMs. PFAM models are constrained to be non-overlapping with one another and thus are more likely to describe domains rather than full-length proteins.

    Superfamily is a server that provides structural (and hence implied functional) assignments to protein sequences at the superfamily level. This server does not attempt (at present) to distinguish between families within superfamilies, but is able to detect the broader and more distant relationships at the superfamily level. A superfamily contains all proteins for which there is structural evidence of a common evolutionary ancestor. The server can be entered in three ways: begin with a sequence (search the library); begin with a superfamily (select from SCOP); or begin with a genome (select from list). The output for the first case will give you a list of hits that your sequences make to models belonging to superfamilies, their alignments to the model, and assigned genome sequences (a very instructive list of  genomes in which a certain superfamily has already been described !).
   
   

         
GENOM3...see which regulatory elements are conserved in a set of orthologous promoters (Phylogenetic Footprinting) ? -> see GEN5 !
 
    The prediction of Transcription Factor Binding Sites (TFBS) in a single promoter produces many false positives. This can be drastically improved by comparing this promoter to the corresponding (orthologous) promoters of the same gene in other species. Those sites which are conserved in evolution are most likely to have functional importance. As many overlaps exist, this question is treated along with FAQ GEN5.
        
                     

                
GENOM4...get all human proteins present in Drosophila but not in C. elegans ? (last update Apr. 14, 2004)

    This question addresses the general matter of identifying protein sets which are present in certain species but not in others. A prerequisite for a solution is the construction of databases which reliably cluster orthologous proteins, so that one can search for clusters which contain specific species while excluding others. Of course, this is not a trivial task, as clearly discussed in Tatusov et al., 2003.

    The NCBI databases COG (Clusters of Orthologous Groups of proteins) and KOG (Eukaryotic Orthologous Groups) were constructed by comparing protein sequences encoded in a list of complete prokaryotic (COG) and eukaryotic (KOG) genomes. KOG covers species like human, Drosophila, C. elegans, Arabidopsis, and S. cerevisiae. The underlying premise is that orthologs are more similar to each other than they are to any other protein from the respective genomes ("reciprocal best hits"). In multi-genome comparisons, pairs of potential orthologs can be joined to form clusters of orthologs. Note that, by definition, a COG is built from proteins of at least 3 sufficiently distant species ("3 clades"). In general, please also refer to the COG and KOG database descriptions at the main page for detailed information, and to FAQ GENOM2 for additional remarks.
    The list of KOGs is represented by tables, which display the numbers of proteins present in the different eukaryotic species by a letter- and color-code, and the deduced number of KOGs in the last column. If we are looking for human proteins present in Drosophila but not in C. elegans, we have to look at the respective rows marked by "H" and "D" but not "C". The corresponding links reveal lists of proteins (KOGs) present in the selected species. Even more precise, there is a table called TWOGs, which directly lists clusters represented by only 2 species. It should be noted that a few improvements could be made concerning the functionalities of the interfaces; e.g. there is still no option to simply restrict the output lists to specific combinations of 2 species.
    Please note that there is also a comparable tool for prokaryotic genomes, named phylogenetic search tool. The choices are "dc" ("don't care"): COG may or may not contain this organism; "yes": COG must contain this organism; "no": COG must not contain this organism. The list of results will be the subset of COGs that fits the pattern indicated.   
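
    If you have downloaded a table of clusters and the species they contain, the "yes" / "no" / "dc" logic described above is easy to reproduce yourself. The following Python sketch is only an illustration: the cluster names and the table layout are hypothetical, and the letter "A" is just a placeholder for an additional species; only the letters "H", "D" and "C" are taken from the KOG description above.

```python
def matches_pattern(species_in_cluster, pattern):
    """Check a cluster against a phyletic pattern.

    pattern maps a species code to "yes" (must be present),
    "no" (must be absent) or "dc" (don't care).
    """
    for species, rule in pattern.items():
        if rule not in ("yes", "no", "dc"):
            raise ValueError(f"unknown rule {rule!r} for species {species!r}")
        present = species in species_in_cluster
        if rule == "yes" and not present:
            return False
        if rule == "no" and present:
            return False
    return True


# Hypothetical clusters: name -> set of species codes found in that cluster
clusters = {
    "cluster1": {"H", "D", "C"},
    "cluster2": {"H", "D"},
    "cluster3": {"H", "D", "A"},
    "cluster4": {"D", "C"},
}

# Human and Drosophila required, C. elegans excluded, "A" may or may not occur
pattern = {"H": "yes", "D": "yes", "C": "no", "A": "dc"}

selected = sorted(name for name, species in clusters.items()
                  if matches_pattern(species, pattern))
print(selected)
```

    Note that a "dc" entry has the same effect as leaving the species out of the pattern altogether; it is only listed here to mirror the three choices of the web form.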
     
     

       
GENOM5...know which genes of a specific dataset are associated with a disease ? (last update Jun. 7, 2005)

    Many high-throughput methods like microarrays finally produce lists of "hot candidates". One of the most interesting questions, when working on such datasets, concerns the potential involvement of these genes in a disease. In principle, there are two ways of retrieving gene-disease information. The first one is to look for already known and described gene-disease correlations. The second one applies to "orphan" diseases, where the responsible gene is not known: here one may at least try to correlate the genomic positions of genes and the respective diseases. Finally, when analyzing large datasets, it is highly favorable to have resources which allow batch submissions of gene names to quickly identify those genes with known disease correlations; so in order to answer this specific question, proceed to section 3 !

1. Resources based on single-gene / single-disease queries:

    Tip! The OMIM database is a catalog of human genes and genetic disorders. The database contains textual information (like "mini-reviews" !), pictures, and reference information. It also contains  links to NCBI's Entrez database of MEDLINE articles and sequence information. Therefore, OMIM is also automatically searched when performing an ENTREZ "cross-database" query using any keyword. Please note that when choosing "Limits", you may restrict your search to specific chromosomes, or to individual fields of the OMIM database. A different approach to access the data is to use the OMIM Gene Map. The OMIM gene map presents the cytogenetic map location of disease genes (in chromosomal order) and other expressed genes described in OMIM, thereby providing links to both the NCBI Map Viewer (genomic map), as well as to the OMIM entries which describe the disorders. NOTE: See the BioMart section below if you want to perform batch queries of gene lists to reveal their OMIM data !!!
                
    The Eukaryotic Gene Orthologs (EGO), previously called TIGR Orthologous Gene Alignments (TOGA), is a database for orthologous genes in eukaryotes. Human disease genes from the Online Mendelian Inheritance in Man (OMIM) database were matched to TIGR Human Gene Index accessions (THC numbers), and orthologs of these disease genes were identified using the EGO database. You can query using OMIM or LocusLink ID, gene name and various types of accession numbers.

2. Resources based on single-gene / single-disease queries including orphan disease prediction:
  
    Tip! The DiseaseInfo viewer is a tool which is integrated into the  H-Invitational Database (H-InvDB), which provides an integrative annotation of full-length cDNA clones. This viewer displays information on known disease-related genes via links to OMIM, LocusLink, and GeneLynx, but also shows co-localized orphan diseases. Orphan disease (here) means a disease mapped on the chromosomal region, but whose responsible gene has not been identified yet. Co-localization does not mean direct relationships between gene and disease; however, genes that are cytogenetically co-localized with a disease could be possible candidate genes of that disease. You first have to get the specific database entry of your gene of interest, either via BLAST (sequence) search or via keyword search, and then look for the specific disease link within the so-called "Locus view". Please also refer to the H-InvDB section at the Data Integration page for a detailed description.              

3. Resources allowing batch queries of gene datasets:

    Tip! GAD - Genetic Association Database is an archive of human genetic association studies of complex diseases and disorders. GAD is maintained by the NIH - National Institutes of Health. The goal of this database is to allow the user to rapidly identify medically relevant polymorphism from the large volume of polymorphism and mutational data, in the context of standardized nomenclature. The data is from published scientific papers. Study data is recorded in the context of official human gene nomenclature with additional molecular reference numbers and links. It is gene centered. That is, each record is a record of a gene or marker. Please also refer to the GAD section at the Data Integration page for a general description.    
    GAD provides an extremely useful Batch Search option, if you want to quickly analyze whole sets of genes derived from e.g. microarray data in order to see which of these genes are associated with a known disease. You enter official human gene symbols (max. 300) as a list (comma, space, tab or new line separated) in the text area. If you have other types of identifiers (Accession, Unigene, etc.), you can go to batch SOURCE to get the identifiers translated to official Gene Symbols. Refer also to the SOURCE Batch Search section for instructions! Finally, you will retrieve a table displaying the known disease correlations of your genes of interest (categorized by "Aging", "Cancer", "Cardiovascular", "Development", "Immune", and others) where you may click on all individual records. Single records, like the example of TP53 and colorectal cancer, display a lot of links to diverse databases, like allele description, polymorphism class, PubMed links, pathway data, population data, as well as links to SNP databases (NCBI SNP, HapMap, Map View). In addition, links to "experts in the field" are provided. Note: If you choose the option Positive Only, then the output list of your query is reduced to those records describing a positive association with a certain disease. Note: In general, GAD also stores negative association data (often published only in obscure scientific journals).
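
    As the batch form is limited to 300 gene symbols, longer gene lists have to be split before pasting. A minimal Python sketch for this preparation step follows; the function name is hypothetical, and the script only prepares text for copy/paste and does not talk to the GAD server in any way.

```python
def build_batches(symbols, max_per_batch=300):
    """Clean a list of gene symbols and split it into newline-separated batches."""
    # strip whitespace, drop empty entries, remove duplicates but keep the order
    cleaned = (s.strip() for s in symbols)
    unique = list(dict.fromkeys(s for s in cleaned if s))
    return ["\n".join(unique[i:i + max_per_batch])
            for i in range(0, len(unique), max_per_batch)]


# Small example list with a duplicate and an empty entry
genes = ["TP53", "BRCA1", " TP53 ", "", "APOE"]
batches = build_batches(genes)
print(batches[0])
```

    Each element of the returned list can be pasted into the text area as one query.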

    Tip! A very powerful resource for data retrieval is BioMart. BioMart is a data retrieval tool that generates lists of biological objects (e.g. genes, SNPs) from data held in the Ensembl (and other) databases. Please refer to the BioMart section at the main page for general details. There are different options in BioMart to filter a set of genes for those having a known disease context. You may filter your gene list at the "Filter" page by choosing "Disease Genes ONLY". You may also filter by specific diseases using the "Expression" section at the "Filter" page, in particular the "Pathology" fields (e.g. atherosclerosis, asthma, diabetes). Alternatively, you may skip the filtering of your gene list and display ALL genes in the output table BUT include a disease-specific column by selecting "Disease OMIM ID" and "Disease description" at the Output-"Features" tab.
                         
                     

       
GENOM6...identify Conserved Non-coding Sequences (CNS) and conserved transcription factor binding sites in large genomic regions via comparative genomics ? (last update Mar. 14, 2006)

    Conserved Non-coding Sequences (CNS) are believed to be highly reliable candidates for regulatory regions in genomes, as the general assumption states that regions of conservation within otherwise dissimilar sequences are very likely to be functional. CNS can not only be found in "proximal" promoter sequences but also in distal regions like "enhancers". Thus, as more and more genome sequences become available, comparative genomics develops into a quickly expanding bioinformatics field. The comparison of large genomic sequences demands the availability of special alignment algorithms, as pre-existing ones like BLAST or BLAT are not suitable in this respect. There is already a series of bioinformatics tools available for these purposes, and I am trying to describe and compare those in the following paragraphs. Most of these tools were developed by 2 different institutes, the Lawrence Berkeley Lab. and the Lawrence Livermore Nat. Lab., as described in individual sections at the main page (LBL, LLL). Likewise, I will cite these sources while describing individual programs.

1. Identification / extraction of genomic regions of interest:

    If you are, for example, interested in the whole intergenic region between your human gene of interest and its "upstream neighbouring" gene, the best way to get this region is the UCSC Genome Browser. You can query this browser using various kinds of accession numbers (GenBank, RefSeq, EST, LocusLink), HUGO gene symbols, as well as genomic positions (if you know them already). Note that you have to be careful when selecting genomic positions as they vary between different versions (freezes) of genome assemblies! For example: The human version "hg15" is identical to the UCSC freeze of April 2003, and the most recent version (hg16) corresponds to July 2003.
    If you want to query by sequence (like a cDNA sequence), the best way is to perform a BLAT search at the UCSC, quickly identifying the corresponding genomic region. Both options finally yield an image of the desired region within the genome browser. Here you may move left or right, zoom in and out, or directly enter position coordinates, in order to display the region of interest. Finally, you may either simply write down the position coordinates (like "chr11:2,314,818-2,374,484") or extract the DNA sequence using the "DNA" link on top of the browser window. You may afterwards query tools for comparative genomics using one of these two input types.
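
    If you collect such position coordinates for many regions, it can help to parse them programmatically before passing them on to other tools. The sketch below is a hypothetical helper (not part of any UCSC software) and assumes that the coordinates are written exactly as displayed in the browser position box, i.e. 1-based with both ends included.

```python
import re

# chromosome name, then start and end, optionally with thousands separators
POSITION_RE = re.compile(r"^(chr\w+):([\d,]+)-([\d,]+)$")


def parse_position(text):
    """Parse a string like 'chr11:2,314,818-2,374,484' into (chrom, start, end)."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid position string: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not be larger than end")
    return chrom, start, end


def region_length(start, end):
    """Length of a 1-based region with both ends included."""
    return end - start + 1


chrom, start, end = parse_position("chr11:2,314,818-2,374,484")
print(chrom, start, end, region_length(start, end))
```

    Remember that the parsed numbers are only meaningful together with the assembly version (freeze) they were taken from.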

2. Identification of CNS via pre-computed alignments:

    Test runs showed that in many cases, it is not only the fastest but also a highly accurate way to use pre-computed whole-genome alignments to reveal conserved regions, instead of performing the alignment yourself. For this purpose, several similar browsers are available.

    VISTA Browser (LBL) is a very nice Java applet, which allows the user to examine pre-computed alignments of whole genome assemblies. Pairwise and multiple alignments are available. This tool is tightly connected to the UCSC Genome browser. To browse whole-genome alignments, just select a base genome and enter a RefSeq gene name or a position (e.g. chrX:1-100000) on this genome. Note that in order to use this browser, Java 2 must be installed on your computer. This Java applet lets you zoom, move, and analyze the genomic alignments very nicely. You may select / deselect individual organisms, zoom highly conserved regions, and also directly jump to the UCSC browser where you can e.g. download the sequence region. You may also directly perform rVISTA (transcription factor binding sites) analyses, by hitting the "i" (Alignment Details) button, which reveals a page with all pairwise alignments and rVISTA analyses links. Please refer also to the rVISTA description below ! VISTA browser is similar to other programs like K-Browser. Personally, I would recommend the VISTA browser which seems more "up-to-date" and shows a higher versatility.  

    Tip! ECR Browser (LLL) is a dynamic whole-genome navigation tool for visualizing and studying evolutionary relationships between vertebrate genomes and for analyzing sequence conservation profiles. In order to run the ECR Browser, select a base organism and indicate the name of a gene or a chromosomal location (chr1:from-to format). Visually, ECRs (Evolutionary Conserved Regions) are represented as colored peaks on a graph, with the x-axis representing positions in the base genome and the y-axis representing % identity between the base and aligned genomes at that specified position. ECRs are color-coded according to the properties of the underlying sequence of the base genome. This allows the user to visually distinguish between ECRs that correspond to coding exons (blue), untranslated regions (UTRs, yellow) and noncoding elements (red if they are intergenic or pink if they lie within an intron). Green bars on the bottom axis of the plot show the positions of repetitive elements in the base genome, and this annotation is shaded to the top of the plot in gray. Annotated genes are depicted as a horizontal blue line above the graph, with strand/transcriptional orientation indicated by the inclined vertical lines.
    In addition, the ECR browser is equipped with a 'Grab ECR' feature that allows users to rapidly extract sequences. A mouse click on the 'Grab ECR' button, followed by a second click on any colored peak (ECR) on the plot, results in the appearance of a new web page describing the ECR corresponding to that peak. NOTE that this only works when pop-up blockers are switched off ! Chromosomal location, length, percent identity of the pairwise alignment, and GC content of the ECR are given. In addition, the full alignment is visualized. Sequences and alignments from other species can be obtained by using the "Grab ECR" feature to retrieve a peak from the conservation plot depicting alignments with the genome of that species. An additional link can be used to forward the ECR alignment directly to rVISTA (see below).
    Additional features can be accessed via the commands at the top of the ECR Browser window. "Base genome" lets you quickly switch between different species selected as base genome. "Browser Settings" allows customized displays, like selection of species, graph type, number and height of layers, and stringency settings to detect ECRs. In addition, there is the option to display pre-computed conserved transcription factor binding sites directly in ECR Browser, without having to run "Grab ECR" and rVISTA first. This is a static "quick-view" generated using default settings. "Highlight core ECRs" displays only those ECRs which show at least 77 % conservation in a window of 350 bp (see also corresponding reference). "ECRs" displays a list of the identified ECRs in a genomic region and all sequences. "DNA" produces a fasta sequence file of the complete genomic region of the base genome. NOTE that the link "Synteny/Alignments" produces a list of all the syntenic regions / sequences of the other species. You may then directly view the rVISTA analyses (conserved TFBS) for all pairs of sequences. NOTE: You may also send ALL selected sequences to Mulan to generate phylogenetic trees and identify multi-species transcription factor binding sites (please refer to the MULAN description). "SNPs" produces a list of all Single-Nucleotide Polymorphisms within the individual ECRs.
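
    To get a feeling for what a criterion like "77 % conservation in a window of 350 bp" means, the following Python sketch scans a pairwise alignment with a sliding window and merges all windows that reach the threshold. This is a simplified illustration under my own assumptions and not the algorithm actually used by the ECR Browser: the two input strings are assumed to be rows of one pairwise alignment (same length, gaps written as "-"), gap columns count as mismatches, and the returned coordinates are alignment columns (0-based, end excluded). The function name is hypothetical.

```python
def conserved_regions(aln_a, aln_b, window=350, min_identity=0.77):
    """Return merged (start, end) column ranges whose windows reach min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    n = len(aln_a)
    if n < window:
        return []
    # 1 for an identical, non-gap column, 0 otherwise
    match = [1 if a == b and a != "-" else 0
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # slide the window by one column
            count += match[start + window - 1] - match[start - 1]
        if count / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                # overlaps or touches the previous region: extend it
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy alignment: first 10 columns identical, last 10 columns all different
seq_a = "ACGTACGTAC" + "AAAAAAAAAA"
seq_b = "ACGTACGTAC" + "CCCCCCCCCC"
toy = conserved_regions(seq_a, seq_b, window=5, min_identity=0.7)
print(toy)
```

    With the toy settings (window of 5 columns, 70 % identity) the conserved block extends one column into the diverged part, which shows how window-based criteria blur region boundaries.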
 
3. Identification of CNS via self-computed alignments:

3.1. All sequences submitted by the user:

    mVISTA (main VISTA; LBL) is a program for visualizing alignments of an arbitrary number of genomic sequences from different species. VISTA is especially designed to display alignments of orthologous genes / regulatory regions of up to 100 species. Note that it is not possible to paste sequences; you have to save them as FASTA files in plain-text (*.txt) format (e.g. when preparing them in MS Word). In addition you may provide an annotation file for the first (base) sequence, specifying the positions of exons, UTRs, etc. This annotation file can also be written as a simple txt-file, see the instructions page for an example. If you provide an annotation for the first sequence, then this will also be applied to the homologous regions of the second sequence. You may now choose between different alignment programs in mVISTA: AVID, which produces global pair-wise alignments (sequences can be *finished or draft*); LAGAN, which produces global *multiple* alignments of finished sequences; and Shuffle-LAGAN, which produces glocal pair-wise alignments of finished sequences and is capable of *detecting rearrangements*.
    Note that you may choose the option to directly analyze the results with rVISTA (regulatory VISTA) to reveal conserved Transcription Factor Binding Sites (TFBS). If so, you can choose individual TFBS and select "stringency values" (core and matrix similarity). Please refer to rVISTA below for output details.
    The Output comprises different sections: TextBrowser displays input and output files for visualization and download, including text files listing the conserved regions of the 2 sequences that meet the specified criteria (default: 75% identity within 100bp). Another novel feature (April 2005) is the rankVISTA conservation plot, which depicts evolutionarily conserved segments in pairwise or multiple alignments as a bar graph, where the heights scale with statistical significance [-log10(P-value)]. For example, a height of 4 indicates that the probability of seeing that level of conservation by chance in a neutrally-evolving 10-kb segment of the base sequence is less than 10^-4. VISTA Image provides the VISTA plot of the alignment(s) in PDF format. Dynamic Visualization links the results to VISTA Browser which provides multiple novel analysis options. NOTE that you will get an Email containing the link to the directory of output files which can be downloaded.
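
    The relation between the bar height in a rankVISTA plot and the underlying P-value is just a base-10 logarithm, so it can be converted in both directions. The two helper functions below are hypothetical names used for illustration only; they restate the formula [-log10(P-value)] given above and are not taken from the VISTA software.

```python
import math


def rank_height(p_value):
    """Bar height for a given P-value: -log10(P)."""
    if not 0 < p_value <= 1:
        raise ValueError("P-value must be in the interval (0, 1]")
    return -math.log10(p_value)


def p_from_height(height):
    """P-value threshold corresponding to a given bar height: 10^(-height)."""
    return 10 ** (-height)


print(rank_height(1e-4))   # a P-value of 0.0001 gives a bar of height 4
print(p_from_height(6))    # a bar of height 6 corresponds to P = 0.000001
```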

   
    zPicture (LLL) is the most convenient way to align the 2 input sequences. zPicture is a dynamic alignment and visualization tool based on the blastz alignment program utilized by PipMaker. There are several input options for zPicture, like copy/paste, fasta-files, NCBI accessions, or upload of sequence and gene annotation from the UCSC Genome Browser. Optionally, you can provide annotations for the input sequences. The output of zPicture includes several file formats and a dynamic visualization tool that graphically displays the conserved regions and allows for user-defined parameter settings. In addition, there is a direct link to submit the alignment to rVISTA analysis ! NOTE: multi-zPicture is a multi-sequence version of the zPicture alignment and visualization tool. Please note that it is not possible to submit multi-zPicture alignments to rVISTA yet. Nevertheless, all other options are fully functional.

    If we try a direct comparison between mVISTA and zPicture, we may list the following points. mVISTA accepts up to 100 input sequences (and all pairwise alignments with the base sequence can be analyzed with rVISTA afterwards), zPicture accepts only 2 input sequences (more sequences in multi-zPicture, but these alignments cannot be sent to rVISTA afterwards). In mVISTA, you can (or "must", however you see it) define "a priori" the parameters which define a CNS region, like 70 % identity in a window of 100 bp. In zPicture, you may try different parameter settings to define an ECR in the so-called "dynamic visualization tool" at the output page, but there is no way to use other settings than the default one for a subsequent rVISTA analysis. This is important because rVISTA differentiates between "conserved" (meaning aligned AND within an evolutionary conserved region), "aligned" (aligned but NOT within an evolutionary conserved region) and "all" TFBS (Transcription Factor Binding Sites). Concerning the parameters which define the minimal requirements for a TFBS to be listed, mVISTA allows the user to set both "core similarity" and "matrix similarity", whereas zPicture allows only the definition of matrix similarity. Anyway, both programs provide pre-defined settings to reduce the output list of TFBS to hopefully specific hits. In mVISTA, these options are listed as "minimize false positives / negatives", or "minimize the sum of both error rates"; in zPicture there is an option "optimized for function", along with "use only high-specificity matrices". It is interesting to note that the option "minimize false positives" in mVISTA seems much more stringent than the others, as this often reduces the output list to just a handful of entries (or even none). In mVISTA, results arrive as an Email link (which is advantageous as results are stored for one month at the server !), in zPicture results are displayed directly in the browser window.
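
    The three rVISTA categories can be restated as a small decision rule. The Python sketch below is my own illustration, not rVISTA code: the function name is hypothetical, and I assume that a binding site counts as lying "within" a conserved region only if it is completely contained in it. The function returns the most specific category that applies to a site.

```python
def classify_site(site_start, site_end, aligned, conserved_regions):
    """Return "conserved", "aligned" or "all" for one predicted binding site.

    aligned           -- True if the site is aligned between the 2 sequences
    conserved_regions -- list of (start, end) ranges of conserved regions
    """
    if not aligned:
        # only reported when ALL predicted sites are listed
        return "all"
    for region_start, region_end in conserved_regions:
        # assumption: the site must lie completely inside the region
        if region_start <= site_start and site_end <= region_end:
            return "conserved"
    return "aligned"


regions = [(100, 200)]
print(classify_site(120, 130, True, regions))   # inside a conserved region
print(classify_site(250, 260, True, regions))   # aligned, but outside
print(classify_site(120, 130, False, regions))  # not aligned at all
```

    This makes clear why the definition of the conserved regions matters: changing the identity / window thresholds moves sites between the "conserved" and the "aligned" list.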

3.2. User submits only the base sequence:

    GenomeVISTA (LBL) lets you compare your sequences with several whole genome assemblies. It will automatically find the ortholog and produce the alignment and the VISTA plot. You will also be able to compare your alignment with pre-computed alignments of other species in the same base genome interval. As input just paste your sequence and choose the base genome. The results can be displayed through the VISTA text browser or the graphical VISTA browser. Note that these GenomeVISTA analyses take quite long, so in many cases it is much faster to retrieve the regions via the pre-computed alignments in VISTA Browser !

    Genome Alignment (LLL) lets you align your FASTA sequence from any organism to either (meaning only ONE of these species at a time !) human, mouse, rat, chicken, fugu or drosophila genome. The output list contains direct links to zPicture, ECR Browser, and rVISTA.

4. Prediction of conserved Transcription Factor Binding Sites (TFBS) within CNS:

    Tip! rVISTA (regulatory VISTA, LBL) combines a transcription factor binding sites database search with a comparative sequence analysis. It can be used directly or through mVISTA, GenomeVISTA, or VISTA Browser. Anyway, if you have 2 un-aligned sequences, you have to submit them to an alignment program (mVISTA, MAVID, Advanced PipMaker) prior to using rVISTA. Note that rVISTA still only runs on and compares 2 input sequences, there is no "multiple version" yet. rVISTA reveals conserved TFBS. You can choose individual TFBS to visualize and select "stringency values" (core and matrix similarity). This point is actually critical, producing either huge lists of potential TFBS or very short ones if the settings are too stringent. The default stringency settings are quite "loose" (Core 0.75 and Matrix 0.7). There are also options "minimize false positives / negatives" but these might be too stringent (try !!!). As output, TFBS are graphically visualized along the sequence. If you want to get the exact position numbers and the exact sequences, use the link "Summary of data" (easily overlooked !!!) at the bottom of the output. Please refer also to the rVISTA (LBL) chapter at the main page.

    Tip! rVISTA (regulatory VISTA, LLL) is quite similar to the version at LBL. Again, rVISTA works only for 2 input sequences. rVISTA at LLL offers even more options to run the program starting from different applications. These applications (which can be also used as individual programs !) are zPicture, ECR Browser, precalculated blastz alignments developed in Webb Miller's lab., GALA, and Genome Alignment. Please refer also to the rVISTA (LLL) chapter at the main page for extensive descriptions of these programs.

    Tip! multiTF identifies transcription factor binding sites conserved across multiple species. There are 2 different ways to initiate a multiTF search, and I would suggest using MULAN, as this program is integrated in the same web portal. Multiple sequence alignments generated by MULAN can be automatically submitted to multiTF from the results web page. The "handling" and output of multiTF are very similar to rVISTA, e.g. the user can set the parameters for detection of TFBS (like matrix similarity, individual TF selection). TFBS can be dynamically visualized along the sequences (similar display as in rVISTA but for multiple species). It is possible to list and display either ALL TFBS or only those which are conserved across ALL species. You may also highlight individual TFBS positions in the alignment. Taken together, MULAN and the interconnected tool multiTF represent the "multi-species" equivalent to the system mVISTA-rVISTA, where rVISTA is based on the TF prediction for 2 aligned species (2 sequences). Please refer also to the main chapter describing the other comparative genomics programs of the Lawrence Livermore National Lab.
    Note that MULAN / multiTF can also be used in connection with the ECR Browser (see above). ECR Browser is a powerful tool to display large genomic regions of synteny between several species and to extract the individual DNA sequences. These sequences can then be used as input for MULAN and finally multiTF can display all transcription factor binding sites which are conserved across ALL species. This is done by using the link "Synteny/Alignments" in ECR Browser, which sends ALL selected sequences to MULAN to generate phylogenetic trees and identify multi-species transcription factor binding sites.
    NOTE: If you are specifically looking for a TFBS which is not contained in the TF database used (like TRANSFAC) but for which you have a consensus sequence (like WWCAAWG), you may scan the MULAN alignment for this pattern by using the option "User-defined consensus sequences" within the multiTF input window "Defining transcription factor binding sites".
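
    What such a consensus scan does can be sketched in a few lines of Python. This is not the multiTF implementation. It only expands the standard IUPAC ambiguity codes (W = A or T, and so on) into a regular expression and searches the forward strand of a made-up example sequence. The function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate a consensus such as WWCAAWG into a regular expression."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_sites(sequence, consensus):
    """Return (0-based start, matched sequence) pairs on the forward strand.

    The lookahead lets overlapping matches be reported as well.
    """
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Made-up sequence for illustration.
sites = find_sites("GGATCAAAGTTTTCAATGCC", "WWCAAWG")
print(sites)
```

    To cover the reverse strand as well, the same scan would have to be repeated on the reverse complement of the sequence.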

    EEL - Enhancer Element Locator is a tool for locating distal gene enhancer elements in mammalian genomes by comparative genomics and for identifying conserved TFBS in predicted enhancers. EEL is described in Hallikas et al., Cell 2006. Please refer also to the main section of EEL.
    In order to address this specific question, you may try a search in the EEL Database of precomputed EEL alignments. EELweb stores precomputed alignments between orthologous genes from human and many other species. The data are regularly updated and synchronized with the Ensembl database, which is used as the source of genomic information. EELweb can be searched for conserved TFBS (all or selected from a list) in the 100 kb upstream and downstream regions of a specific gene (set) of interest. A list of Ensembl Gene IDs can be used as a query to search for precomputed TFBS in predicted enhancer regions of these genes. Select the suitable species comparison. The Ensembl IDs must correspond to the chosen organism. The option "Sites in the module" restricts the query by requiring certain types of transcription factor binding sites to be conserved in the elements. Note that the maximum number of results listed is 1000. If you need more results, you have to install the local version of EEL.
    Remarks: A test run using the human genes CD8A (ENSG00000153563) and CD8B (ENSG00000172116) did not produce any enhancer regions / TFBS, although this locus is well documented concerning functional enhancers. Thus, it may be questioned whether the EEL database really holds a comprehensive list of genes / enhancers.
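
    Since the Ensembl IDs must correspond to the chosen organism, it can help to check a gene list before pasting it into the query form. The following Python sketch assumes the usual Ensembl gene ID layout (a species prefix followed by 11 digits). The function name `split_ids` is hypothetical and the mouse-style ID is only an illustration.

```python
import re

# Ensembl gene ID prefixes for a few species.
PREFIXES = {"human": "ENSG", "mouse": "ENSMUSG", "rat": "ENSRNOG"}


def split_ids(ids, species):
    """Separate IDs that match the species prefix from those that do not."""
    pattern = re.compile(PREFIXES[species] + r"\d{11}")
    ok, bad = [], []
    for raw in ids:
        ident = raw.strip()
        (ok if pattern.fullmatch(ident) else bad).append(ident)
    return ok, bad


query = ["ENSG00000153563", "ENSG00000172116", "ENSMUSG00000053977"]
ok, bad = split_ids(query, "human")
print("usable for a human query:", ok)
print("wrong organism or malformed:", bad)
```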
                                        
                     

      
GENOM7...know all genes associated with cardiovascular diseases having a described polymorphism in the promoter region ? (last update Jun. 7, 2005)
                   
    This question actually combines 2 different questions, namely to find resources which list genes associated with certain diseases (which may also be others, like cancer, immune diseases,...) AND to automatically select "by one click" those genes having a polymorphism in a certain sequence region (which may also be 3' untranslated, or coding sequence, or other).

    Tip! GAD - Genetic Association Database is an archive of human genetic association studies of complex diseases and disorders. GAD is maintained by the NIH - National Institutes of Health. The goal of this database is to allow the user to rapidly identify medically relevant polymorphism from the large volume of polymorphism and mutational data, in the context of standardized nomenclature. The data is from published scientific papers. Study data is recorded in the context of official human gene nomenclature with additional molecular reference numbers and links. It is gene centered. That is, each record is a record of a gene or marker. Please also refer to the GAD section at the Data Integration page for a general description.    
    In order to address the specific question here, the option Advanced Search allows you to query by complex combinations of keywords using fields like "Reference", "Submitter", "Entrez GeneID", "UniGene cluster", "Ensembl", and others. In particular, it is possible to select polymorphisms (field "Polymorphism Class") specifically related to genomic regions like "5'promoter", "3'untranslated", or "coding sequence" and to restrict the output to those genes related to cardiovascular diseases (field "Disease Class"). Note that if you select the option Positive Only, the output list of your query is reduced to those records describing a positive association with a certain disease. Note: In general, GAD also stores negative association data (often published only in obscure scientific journals).
    Single records, like the example of IL6 and coronary heart disease, display a lot of links to diverse databases, like allele description, polymorphism class, PubMed links, pathway data, population data, as well as links to SNP databases (NCBI SNP, HapMap, Map View). In addition, links to "experts in the field" are provided.
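
    The logic of such a combined query can be sketched in Python as follows. The records are made up and the dictionary keys merely mirror the GAD fields "Disease Class" and "Polymorphism Class" mentioned above. This is not an interface to GAD, and `select_genes` is a hypothetical helper.

```python
# Made-up records for illustration only.
records = [
    {"gene": "GENE_A", "disease_class": "CARDIOVASCULAR",
     "polymorphism_class": "5'promoter", "positive": True},
    {"gene": "GENE_B", "disease_class": "CARDIOVASCULAR",
     "polymorphism_class": "coding sequence", "positive": True},
    {"gene": "GENE_C", "disease_class": "CANCER",
     "polymorphism_class": "5'promoter", "positive": True},
    {"gene": "GENE_D", "disease_class": "CARDIOVASCULAR",
     "polymorphism_class": "5'promoter", "positive": False},
]


def select_genes(records, disease_class, polymorphism_class, positive_only=False):
    """Return the sorted gene names that satisfy all chosen criteria."""
    genes = set()
    for rec in records:
        if rec["disease_class"] != disease_class:
            continue
        if rec["polymorphism_class"] != polymorphism_class:
            continue
        if positive_only and not rec["positive"]:
            continue
        genes.add(rec["gene"])
    return sorted(genes)


all_hits = select_genes(records, "CARDIOVASCULAR", "5'promoter")
positive_hits = select_genes(records, "CARDIOVASCULAR", "5'promoter",
                             positive_only=True)
print(all_hits, positive_hits)
```

    The second call corresponds to ticking Positive Only: records describing a negative association drop out of the list.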

    A very powerful resource for data retrieval is BioMart. BioMart is a data retrieval tool that generates lists of biological objects (e.g. genes, SNPs) from data held in the Ensembl (and other) databases. Please refer to the BioMart section at the main page for general details. There are different options in BioMart to filter a set of genes (or ALL genes of a genome) for those having a known disease context (please refer to FAQ GENOM5!). BUT NOTE that it is not as easy to filter for cardiovascular diseases in BioMart as it is in GAD! Nevertheless, BioMart offers several options for filtering related to SNP data. You may select genes having SNPs in the coding region, 5'UTR, 5'flanking, intronic, and more.
    Overall, there is no direct link between SNPs and diseases in BioMart, making GAD the better resource for this purpose.
                                  
                       

       
GENOM8...know the genes and drugs related to diseases like atherosclerosis ? -> see CHEM2 !                

    This question involves resources which correlate diseases with the use of specific drugs. Thus, it is located in the section "Cheminformatics".
                       
                   

      
GENOM9...analyze the expression of a gene set of interest in cancer tissues ? (last update Feb. 13, 2006)

1. Microarray, SAGE, and EST data:
                  
    Tip! CGAP - Cancer Genome Anatomy Project is an NCBI resource which offers a comprehensive molecular characterization of normal, precancerous, and malignant cells. It contains genomic data for humans and mouse, including transcript sequence, gene expression patterns, SNPs, clone resources, and cytogenetic information. Please refer also to the CGAP main section for details.
    In order to address this specific question, CGAP provides the Batch Gene Finder at the Gene Finder page. To use the Batch Gene Finder, prepare a text file containing the list of (human OR mouse) gene symbols, UniGene clusters, accession numbers, protein accession numbers, UniProt (SwissProt) protein accessions, UniProt (SwissProt) protein identifiers (like "ACTB_HUMAN"), or Entrez Gene numbers. The text file must list the identifiers in a vertical column, e.g. export a one-column Excel sheet in txt (tab-delimited) format. The created gene list displays the query ID, gene name and symbol, and RefSeq accessions, as well as links to the individual Gene Info pages (see CGAP main section for details). In addition, the link "Common View" creates a table displaying all GO terms, pathways (KEGG and Biocarta), motifs, SNPs, and cyto locations for the complete input gene set. Furthermore, the expression of the whole gene set can be viewed as a colored graph within the NCI60 panel of cancer cell lines (please refer to the NCI60 section for background). The link "SAGE Summary" displays the SAGE counts of the input gene set in a series of normal and cancer tissues.
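
    If the gene list already exists in a script rather than in a spreadsheet, the one-column text file can be written directly. A minimal Python sketch follows. The helper names and the file name `gene_list.txt` are hypothetical, and the symbols are simply examples used elsewhere on this page.

```python
from pathlib import Path


def format_id_list(identifiers):
    """Return one identifier per line, dropping blanks and duplicates."""
    seen = []
    for ident in identifiers:
        ident = ident.strip()
        if ident and ident not in seen:
            seen.append(ident)
    return "".join(ident + "\n" for ident in seen)


def write_id_list(identifiers, path):
    """Write the identifier column to a plain text file."""
    Path(path).write_text(format_id_list(identifiers), encoding="ascii")


text = format_id_list(["IL6", " CD8A", "CD8B", "", "IL6"])
print(text, end="")
# To create the upload file: write_id_list(["IL6", "CD8A", "CD8B"], "gene_list.txt")
```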

2. Microarray data:

    Oncomine is a resource of the University of Michigan for examining gene expression in cancer. The goal of the project is to collect, standardize, analyze, and deliver published cancer gene expression data to the research community. Probe the expression of a gene across thousands of cancer samples.
    Note that the "Gene Search" option allows SINGLE gene queries (there is NO batch query using gene sets) using several types of identifiers, like gene name, gene symbol, Entrez Gene ID, Affymetrix ProbeSet IDs, and more. As a first result, a gene overview is presented showing the gene name and aliases. As "expression overview", the "Differential activity map" summarizes significant differential expression of a gene of interest grouped by tissue type and analysis type. Three types of analyses are summarized on the Summary page: Normal vs Normal, Cancer vs Normal, and Cancer vs Cancer. Note: Please refer to the main section of Oncomine for detailed descriptions of the other analysis modules!

3. SAGE and EST data:

    Tip! ECgene (gene prediction by EST clustering) predicts genes by combining genome-based EST clustering and a transcript assembly procedure in a coherent and consistent fashion. Specifically, ECgene takes alternative splicing events into consideration. The positions of splice sites (i.e. exon-intron boundaries) in the genome map are utilized as critical information in the whole procedure. Sequences that share splice sites in the genomic alignment are grouped together to define an EST cluster. ECgene is available for human, mouse, and rat genomes. Please refer also to the main section of ECgene for information !
    In order to address this specific question, the module ECexpression is relevant. ECexpression is the expression data viewer of ECgene. ECexpression utilizes the extensive expression data from EST and SAGE sources to develop a queriable expression ontology. EST or SAGE libraries are prepared from known tissue samples and the origin of these samples was carefully documented to allow standardized querying of expression. There are 4 divisions: anatomical site, pathology, developmental stage, and sex. Note: A very useful feature is that normal and cancer libraries are divided and also displayed separately in the graphs. Therefore, this layout makes it easy to find any tissue-specific or cancer-specific isoforms !
    Note: There is also a "stand-alone" query form, which allows you to set the reliability level, select the graph type, retrieve different types of SAGE tags, select the tag redundancy, and choose whether to include all or only non-normalized EST libraries. The reason for the last option is that many cDNA libraries are normalized or subtracted in order to find genes with low expression levels, which in turn renders expression data qualitative rather than quantitative.
             
4. EST data:

   
    DigiNorthern, provided by the Bioinformatics Group of the Roswell Park Cancer Institute, is a tool for virtually displaying the expression profile of query genes (it currently accepts only DNA sequences as input) based on the EST sequences currently available at NCBI GenBank. There are currently two versions of this program. DN1 takes one sequence as the query gene, lists all the cell lines/tissues/organs that express the gene, and displays the relative expression levels of the gene based on the number of matched ESTs vs. the total number of ESTs for the related libraries. Wherever available, a comparison is also made between the same tissue/organ in normal and cancer status. DN2 takes two sequences as query genes and compares their expression profiles side by side. DigiNorthern is currently available for human and mouse.
    NOTE: There is no batch submission of gene datasets in DigiNorthern.
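
    The idea of a relative expression level based on matched ESTs vs. total ESTs can be sketched as follows. The counts are invented, and the scaling to "ESTs per million" is an assumption made here for readability, not necessarily the unit DigiNorthern reports.

```python
def ests_per_million(matched, total):
    """Scale the matched EST count to a library size of one million ESTs."""
    if total <= 0:
        raise ValueError("library contains no ESTs")
    return matched * 1_000_000 / total


# (matched ESTs, total ESTs) per library; invented numbers.
libraries = {
    ("colon", "normal"): (12, 40_000),
    ("colon", "cancer"): (45, 50_000),
}

levels = {key: ests_per_million(*counts) for key, counts in libraries.items()}
fold = levels[("colon", "cancer")] / levels[("colon", "normal")]
print(levels, "cancer/normal =", fold)
```

    Dividing by the library size matters because EST libraries differ greatly in depth. Raw counts alone (45 vs. 12 here) would overstate or understate the difference.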
                                      
                       

      
GENOM10...determine the expression profiles of normal vs. cancer tissues ? (last update Feb. 13, 2006)    

1. Microarray, SAGE, and EST data:
                  
    Tip! CGAP - Cancer Genome Anatomy Project is an NCBI resource which offers a comprehensive molecular characterization of normal, precancerous, and malignant cells. It contains genomic data for humans and mouse, including transcript sequence, gene expression patterns, SNPs, clone resources, and cytogenetic information. Please refer also to the CGAP main section for details. In order to address this specific question, CGAP provides several tools for the analysis of cDNA (EST) expression in normal and cancer tissues.
    The cDNA xProfiler is a tool that compares gene expression between two pools of libraries. For a gene to be "present" in a library pool, there must be at least one EST sequence found in the UniGene cluster for that gene. This tool can generate datasets of "unique" ESTs by comparing, e.g., different tissues or normal and cancer libraries of the same tissue.
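
    The presence/absence logic described for the xProfiler amounts to simple set operations. A minimal Python sketch with invented library names, gene names, and EST counts (it does not query CGAP):

```python
def genes_present(pool):
    """A gene is 'present' in a pool if any library holds at least one EST."""
    present = set()
    for est_counts in pool.values():
        present.update(gene for gene, n in est_counts.items() if n >= 1)
    return present


# Each pool maps a library name to EST counts per gene (invented data).
normal_pool = {
    "lib_N1": {"GENE_A": 3, "GENE_B": 1},
    "lib_N2": {"GENE_C": 2, "GENE_D": 0},
}
cancer_pool = {
    "lib_C1": {"GENE_B": 4, "GENE_D": 1, "GENE_E": 2},
}

normal = genes_present(normal_pool)
cancer = genes_present(cancer_pool)
only_normal = normal - cancer   # "unique" to the normal pool
only_cancer = cancer - normal   # "unique" to the cancer pool
shared = normal & cancer
print(sorted(only_normal), sorted(only_cancer), sorted(shared))
```
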
    The GLS - Gene Library Summarizer finds all the genes expressed in a single cDNA library or group of cDNA libraries. It then classifies the genes as unique or non-unique, and then further identifies the genes in each of these groups as known or unknown.
    The DGED (cDNA Digital Gene Expression Displayer) is a tool that compares gene expression between two pools of libraries. In contrast to the xProfiler, the DGED treats the presence of a gene in a library pool as a matter of degree. It compares the "degree" of presence of a gene in pool A with its "degree" of presence in pool B. This comparison is reduced to two numbers: the sequence odds ratio and measure of significance.
    The SAGE DGED (SAGE Digital Gene Expression Displayer) is a tool that identifies those genes that are expressed at significantly different levels (as defined by the user) in two pools of human libraries, based on SAGE tag analysis. The algorithm takes into account the differences in sample size between Pools A and B, which can be large. The user selects a value for statistical significance (P value) and a value for the difference in the level of expression (F value) between the two pools. The results are based on the sequence odds ratio and measure of significance.
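
    The two numbers reported by the DGED tools can be illustrated with a small Python sketch. Several points are assumptions made here and are not confirmed by the text above: the odds ratio is taken as (count in A / size of A) divided by (count in B / size of B), the measure of significance is computed as a two-sided Fisher exact test, and the F value is read as a minimal fold difference in either direction. All function names and counts are hypothetical.

```python
from math import comb


def sequence_odds_ratio(gene_a, total_a, gene_b, total_b):
    """(gene_a / total_a) / (gene_b / total_b), which corrects for pool size."""
    if gene_b == 0:
        return float("inf")
    return (gene_a * total_b) / (gene_b * total_a)


def fisher_two_sided(gene_a, total_a, gene_b, total_b):
    """Two-sided Fisher exact P value for the 2x2 table of gene vs. other tags."""
    n_total = total_a + total_b
    k = gene_a + gene_b                      # all tags of this gene
    denom = comb(n_total, total_a)

    def prob(x):                             # hypergeometric probability
        return comb(k, x) * comb(n_total - k, total_a - x) / denom

    p_obs = prob(gene_a)
    lo, hi = max(0, k - total_b), min(k, total_a)
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def is_differential(odds, p, f_min=2.0, p_max=0.05):
    """Apply a fold-difference cutoff (F) and a significance cutoff (P)."""
    return p <= p_max and (odds >= f_min or odds <= 1 / f_min)


# Invented counts: 8 of 20 tags in pool A, 2 of 20 tags in pool B.
odds = sequence_odds_ratio(8, 20, 2, 20)
p_value = fisher_two_sided(8, 20, 2, 20)
print(odds, p_value, is_differential(odds, p_value))
```

    With pools this small, even a fourfold difference need not be significant, which is why both cutoffs should be considered together.
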
    NOTE: These tools can not only be used for cancer vs. normal but also for normal vs. normal comparisons !

2. Microarray data:

    Tip! Oncomine is a resource of the University of Michigan for examining gene expression in cancer. The goal of the project is to collect, standardize, analyze, and deliver published cancer gene expression data to the research community. The user may explore genes, processes, and pathways deregulated in a particular type of cancer.
    "Profile Search" allows you to query using keywords like cancer types, tissue types, clinical parameters, and more. Alternatively, you may browse all cancer profiles by clicking the icon and then use the filters to find the profile of interest. This search first presents a list of studies filtered by certain criteria. These include the source tissue (like breast or prostate) and several analysis types (like cancer vs. normal, cancer vs. cancer, etc.). For each study, the number (and percentage) of up-, down-, and differentially expressed genes is indicated.
   
    Note: Please refer to the main section of Oncomine for detailed descriptions of the advanced analysis modules!
                                     