Bioinformatics World FAQ Center
  FAQ Index -> GENES
                -> GEN1...know if a stretch of genomic sequence contains a potential promoter region ? (last update May 28, 2006)
                -> GEN2...know which transcription factor binding sites, TF modules, or user-defined patterns and profiles are present in my promoter region ? (last update May 30, 2006)
                -> GEN3...know if there are repetitive elements in my DNA sequence ? (last update Nov. 18, 2005)
                -> GEN4...know which promoters or enhancers in a whole genome contain a binding site for a single or a combination of transcription factors (Motif Matching; Module Scanners) ? (last update Mar. 14, 2006)
                -> GEN5...know which regulatory elements are common in a set of promoter sequences and check if these motifs are known transcription factor binding sites (Motif Discovery) ? (last update Mar. 30, 2006)
                -> GEN6...quickly extract potential promoter sequences for a batch of human genes ? (last update May 29, 2006) 
                -> GEN7...quickly see the binding site profiles of individual transcription factors ? (last update May 18, 2005) 
                -> GEN8...detect regulatory elements in UTRs (UnTranslated Regions) in a whole-genome approach ? -> see RNA1 !
                -> GEN9...get the promoter/protein sequences of all proteins homologous to my query within a certain species ? -> see RET8 !
                -> GEN10...check how often a specific motif is present in a randomly generated sequence set ? (last update Jun. 3, 2005)      
  
                     
                         


                   
GEN1...know if a stretch of genomic sequence contains a potential promoter region ? (last update May 28, 2006)
   
    Several web-based programs scan DNA sequences for potential promoter regions and/or transcription start sites (TSS). They predict not only TATA-containing but also TATA-less promoters. As always, each method has advantages and disadvantages, so it is best to run several of these programs in parallel and compare the results.
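
    Comparing the output of several programs can be done by hand, but a small script helps for longer sequences. The following is a minimal sketch in Python, assuming you have already collected the predicted TSS positions of each program into simple lists. The tool names, the positions, and the tolerance of 100 bp are hypothetical values chosen for illustration only.

```python
def consensus_predictions(predictions, tolerance=100, min_tools=2):
    """Group nearby predictions and keep groups supported by several tools.

    predictions: dict mapping a tool name to a list of predicted positions.
    Positions closer than `tolerance` to their neighbour are chained into
    one cluster (single linkage). Returns (start, end, tools) tuples.
    """
    tagged = sorted(
        (pos, tool) for tool, positions in predictions.items() for pos in positions
    )
    clusters, current = [], []
    for pos, tool in tagged:
        # Start a new cluster when the gap to the previous position is too large.
        if current and pos - current[-1][0] > tolerance:
            clusters.append(current)
            current = []
        current.append((pos, tool))
    if current:
        clusters.append(current)

    supported = []
    for cluster in clusters:
        tools = sorted({tool for _, tool in cluster})
        if len(tools) >= min_tools:
            positions = [pos for pos, _ in cluster]
            supported.append((min(positions), max(positions), tools))
    return supported


# Hypothetical predictions from three programs on the same input sequence.
example_predictions = {
    "toolA": [1000, 5000],
    "toolB": [1050, 9000],
    "toolC": [1120],
}
for region in consensus_predictions(example_predictions):
    print(region)
```

    Regions reported by only one program are dropped here; whether that is desirable depends on how stringent the individual programs are.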

1. Resources performing predictions of promoter / TSS position:

    The program PromoterInspector predicts promoter regions in mammalian genomic sequences. PromoterInspector is now part of the commercial GenomatixSuite, which means that there are limitations for the use of the program. The output only shows the predicted sequence position, not the transcription factor details. Please refer to the main section of PromoterInspector for details.

    NNPP (Promoter Prediction by Neural Network), provided in the context of the Berkeley Drosophila Genome Project, is a widely used and often cited method that finds eukaryotic and prokaryotic promoters in a DNA sequence. The output is simply a list of predicted (core) promoter sequences with the predicted TSS indicated. Note that NNPP predicts very short sequences (only 50 bp) in proximity to the TSS, and there are no further options for follow-up analyses. Test runs showed that NNPP is significantly less stringent than other promoter prediction programs, which results in a higher number of potential promoter regions. Please refer to the main section of NNPP for details.

    Eponine, developed at the Sanger Institute, is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence. Results are presented in GFF format as a simple list of TSS positions together with the predicted strand. Please refer to the main section of Eponine for details.
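
    GFF output is tab-separated text and is easy to post-process. The following Python sketch collects position, strand, and score from GFF lines. It assumes the standard GFF column order (seqname, source, feature, start, end, score, strand, frame, attributes); the two example records are made up for illustration and are not actual Eponine output.

```python
def parse_gff_tss(gff_text):
    """Return (seqname, start, end, strand, score) tuples from GFF text."""
    hits = []
    for line in gff_text.splitlines():
        line = line.strip()
        # Skip empty lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a complete GFF record
        seqname, _source, _feature, start, end, score, strand = cols[:7]
        score_value = None if score == "." else float(score)
        hits.append((seqname, int(start), int(end), strand, score_value))
    return hits


# Hypothetical GFF records (invented values).
example = (
    "##gff-version 2\n"
    "seq1\tEponine\tTSS\t1200\t1200\t0.998\t+\t.\n"
    "seq1\tEponine\tTSS\t5400\t5410\t0.995\t-\t.\n"
)
for hit in parse_gff_tss(example):
    print(hit)
```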

    Tip! FirstEF (First Exon Finder), provided at Cold Spring Harbor Labs, is a 5' terminal exon and promoter prediction program. It consists of different discriminant functions structured as a decision tree. The probabilistic models are optimized to find potential first donor sites and CpG-related and non-CpG-related promoter regions based on discriminant analysis. For every potential first donor site (GT) and an upstream promoter region, FirstEF decides whether or not the intermediate region can be a potential first exon, based on a set of quadratic discriminant functions. FirstEF calculates the a posteriori probabilities of exon, donor, and promoter for a given GT and an upstream window of length 570 bp. Taken together, FirstEF shows predicted positions of promoters, first exons, and CpG islands. NOTE: FirstEF predictions are also presented in the UCSC Genome Browser display ("Expression and Regulation" tracks)! Please refer to the main section of FirstEF for details.

2. Resources performing predictions of promoter / TSS position and Transcription Factor Binding Sites (TFBS):

    Tip! PromoterScan predicts promoter regions via comparison to eukaryotic Pol II promoter sequences. It has the advantage that it also shows the names and positions of significant transcription factor binding sites within the sequence. The results show the location of predicted promoter sequences. Predicted sequence regions are regions of DNA that contain a significant number and type of transcriptional elements (TEs) that are usually associated with Pol II promoter sequences. Reported putative promoters are those regions of your sequence that score past a predetermined cutoff score set to recognize 70% of primate promoter sequences in the Eukaryotic Promoter Database. Please refer to the main section of PromoterScan for details.

    Another program is DRAGON Promoter Finder, which is part of the DRAGON Genome Explorer portal of the Institute for Infocomm Research (I2R), Singapore. The program attempts to recognize the exact location of the transcription start site (TSS), i.e. the +1 position. Therefore, the first output is a list of potential TSS. The program also includes very useful follow-up analyses, like BLAST against the EPD (Eukaryotic Promoter Database) or prediction of TF sites. For the latter option, the Match program is used. Please refer to the main section of Dragon Promoter Finder for details.
                    
                          

GEN2...know which transcription factor binding sites, TF modules, or user-defined patterns and profiles are present in my promoter region ? (last update May 30, 2006)

    This is in general a quite tricky field, not because you don't find any hits but because you find too many of them. In most cases it is therefore necessary to manually screen through the lists of potential transcription factor binding sites in a sequence and try to pick those of "highest interest". Nevertheless, there are different databases and resources that can be compared.
    Please note that the general subject of "motif matching" in a whole-genome approach (instead of single promoters) is discussed in FAQ GEN4 C), but the programs discussed there are usually also applicable to single sequences !
    Please note that the number of false positive predictions of TFBS may be drastically reduced by using comparative genomics approaches, as discussed in FAQ GENOM6 !

1. Resources based on TRANSFAC database:

    TRANSFAC is the "classical" transcription factor database. Please refer to the TRANSFAC section at the main page for a more detailed description. You can either search through lists of transcription factors or their binding sites, or you can analyze your own input sequence for the presence of these motifs. For the latter, there are different possibilities. The "gold-standard application" is MatInspector, available at Genomatix, Inc. (see below). This program has been largely commercialized; only a limited number of free runs is provided for academic use. As an alternative, the programs Match and P-Match are provided for free directly at the BIOBASE portal, which ensures very good cross-referencing between the predicted TFBS in the output and the corresponding TF database entries in TRANSFAC ! Note that registration at BIOBASE (free for non-profit use) is required in order to use Match and P-Match. It is the same registration that is needed to access the public TRANSFAC database at this portal !

    Tip! Match is one of the programs at the BIOBASE gene regulation portal. Match is designed for searching nucleotide sequences for potential transcription factor binding sites (TF binding sites). Match uses a library of mononucleotide weight matrices from TRANSFAC 6.0. Please refer to the main section of Match for a detailed description.

    Tip! P-Match (combined Pattern-Matrix search) is a new tool for identifying transcription factor binding sites (TF binding sites) in DNA sequences. It combines pattern matching and weight matrix approaches thus providing higher accuracy of recognition than each of the methods alone. P-Match uses a library of mononucleotide weight matrices from TRANSFAC 6.0 along with the site alignments associated with these matrices. Note: In general, P-Match "looks" very similar to Match, please refer to the main section of P-Match for a detailed description.
   
    The improved version of the "historical" public version of MatInspector is called MatInspector professional, which significantly reduces the number of false positives and negatives. This program is now part of the Genomatix Suite and is in principle free of charge for academic users after registration. However, you are restricted to a maximum of 20 analyses (sequences) per month ! The user may select the values for core similarity and matrix similarity (in both cases "1" means a perfect match). Note that there is a highly user-friendly option when choosing the matrix similarity threshold, called "optimized". With this option the program automatically chooses the optimal value for each matrix, which minimizes the number of false positives. This optimized value is defined such that a minimum number of matches is found in non-regulatory test sequences. In the MatInspector output list, you can easily compare the optimized matrix threshold with the actual matrix similarity for each site. Note that you should use MS Internet Explorer (and not Netscape) in order to make use of the Adobe SVG Viewer, which allows you to handle the TF site diagrams interactively. If you have analyzed a multiple sequence file, you will find a button called "Search for common TF sites" at the bottom of the output page. It opens an SVG window where you can restrict the display to those TF sites present in x of y total input sequences. Via right-click and "Copy SVG", you can paste the image into e.g. Corel Photopaint in order to save it in any file format.

    NOTE: If you know the binding activity of a novel TF (not present in TRANSFAC) with a series of oligonucleotides and you want to build a profile from these sequences, you may use the program MatDefine. MatDefine is a tool for fully automatic definition and evaluation of weight matrices from a set of short DNA sequences. The resulting weight matrices can be used by MatInspector to scan nucleic acid sequences for matches to the described binding site. In "automatic mode" (default), the weight matrix is generated without any user interaction, and a protocol describing the matrix definition process is delivered. In "interactive mode" ("More options"), the user can modify all parameters which are used in automatic mode. NOTE: MatDefine is part of the "GEMS Launcher" section of the commercial Genomatix Suite. NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is a limitation not only in the number of analyses (max. 20 GEMS analyses (sequences) per month !) but also in the functionality of the obtained data !
    NOTE that you have to register at BIOBASE (free for non-profit organizations) in order to gain access to the individual transcription factor information files !
   
    Another resource in this field is TESS, which stands for Transcription Element Search Software. TESS is a set of software for locating and displaying transcription factor binding sites in a DNA sequence. TESS uses an older version (4.0) of the TRANSFAC database (public version, not the updated "professional" version !) as its store of transcription factors and their binding sites. In fact, a combined search in various databases is performed (TRANSFAC site, TRANSFAC matrix, CBIL matrices, IMD..). By the way, all these databases can also be queried using keywords. Click on the link "Combined" to get to the "Combined Search Page". If you are not sure how to handle the different input parameters, simply click on the button "Analyze using the default settings !". You can choose between various forms of output, including colour coding of consensus mismatches (very useful !!), tables to show the significance of hits (!), and Java applets to show the binding sites on the sequence. NOTE: Within the tabular results page, you can click on the header of every column to sort the output (e.g. a click on "Sm" will sort the output by matrix similarity). NOTE: "===" within the "annotated sequence view" indicates hits above the secondary threshold, whereas "---" indicates hits below it. NOTE: The only disadvantage is the upper limit of 1 kb for input sequences !

If you specifically want to search for TFBS-modules (combinations of regulatory elements) in SINGLE sequences, you may use one of the following programs. Please NOTE that programs which predict such TFBS modules in a whole SET of sequences are discussed in FAQ GEN5: B2). Some of these may also be suitable for the analysis of SINGLE sequences !

    CompelPatternSearch lets you search a query sequence for the presence of potential composite regulatory elements. This tool is based on COMPEL, a database on composite regulatory elements affecting gene transcription in eukaryotes, which in turn is based on TRANSFAC. COMPEL collects information about composite regulatory elements (CEs), i.e. pairs of closely situated sites and the transcription factors binding to them. You may define the maximal number of mismatched nucleotides in core positions of the 2 different binding sites and the possible variation of the distance between the two sites (in %). Example: If you analyze 500 bp of upstream genomic sequence of the human IL6 gene, you will retrieve 1 single CE showing NO mismatch (C00152: NF-kB and C/EBPbeta). NOTE: CompelPatternSearch can be used with SINGLE sequences only !
 
2. Resources based on JASPAR database:

    Tip! ConSite is a program (and web interface) that couples phylogenetic footprinting with regulatory site detection (mainly promoter comparison), meaning that ConSite is primarily designed to compare 2 orthologous sequences and report conserved TFBS (Transcription Factor Binding Sites). Note that ConSite uses a database ("JASPAR") of TF profiles (PWMs) that was newly built from literature data, and that is therefore "independent" from existing databases like TRANSFAC. At the ConSite start page, you have 3 different options, one of them is "Analyze a single sequence", which lets you analyze a single promoter for TFBS (without performing cross-species comparison). This option is comparable to the TESS system, but utilizes the JASPAR profile collection instead of TRANSFAC.

    Note that the option "Analyze a single sequence" of ConSite can also be used to see if individual TFBS, which can be selected from a list, are present in a query sequence. In addition, there is the option to scan the sequence for the presence of a user-defined profile (raw counts matrix or position weight matrix), but not of a user-defined consensus sequence. This can be useful if you know the binding activity of a novel TF (not present in JASPAR) with a series of oligonucleotides and you are able to build a profile from these sequences (see also description of MatDefine above), and finally want to scan a promoter sequence for the presence of this binding site (profile).
    Please refer to FAQ GEN4 C) for additional programs in the field of Motif Matching !

    JASPAR also provides a "quick and easy" way of analyzing a promoter sequence (pasted in the field on the right side) for the presence of individual TFBMs, which have to be selected from the list first. Note that there is no option "Select all" which means that this feature is designed to just show the positions of single (or a small group of) TFBMs in a query sequence. For more complex analyses, the ConSite system should be used.

3. Resources based on TRANSFAC and JASPAR databases:

    Tip! MotifScanner, which is also part of the TOUCAN package, is available as individual program from the software page of the bioinformatics group at the department of electrical engineering (ESAT) at the Katholieke Universiteit Leuven (Belgium). In contrast to TOUCAN, the output is not displayed graphically but as a simple list of TRANSFAC matrices which match to your input sequences, according to the criteria specified by the user. You may upload a multiple FASTA sequence file but there will be no comparisons of the TF composition between sequences. NOTE: The web interface of MotifScanner does not support the JASPAR matrices, in contrast to MotifScanner implemented in TOUCAN ! On the other hand, it allows the selection of individual TRANSFAC TFBMs to scan your query sequence. Please refer also to the MotifScanner section at the main page !
                      
                       


GEN3...know if there are repetitive elements in my DNA sequence ? (last update Nov. 18, 2005)

1. Screen for (longer) repetitive elements (like LINES and SINES):
   
    Tip! A program which is often used in that context is RepeatMasker. RepeatMasker screens DNA sequences in fasta (or raw) format against a library of repetitive elements and returns a masked query sequence ready for database searches as well as a table annotating the masked regions. Simply paste in your sequence, select the DNA source (species) and further options if you like. The output table nicely lists groups of repetitive elements (like SINES, LINES, LTR elements,...) and their occurrence within the sequence. Please refer to the RepeatMasker section at the main page for details ! Note: RepeatMasker uses the Repbase database of repetitive elements, but possibly the most recent versions of Repbase are used by CENSOR (see below) !

    Tip! CENSOR is a software tool which screens query sequences against a reference collection of repeats and "censors" (masks) homologous portions with masking symbols, as well as generating a report classifying all found repeats. Thus, CENSOR is somewhat similar to RepeatMasker. In general, the CENSOR output is very informative as it presents data in several formats: The graphical SVG Viewer gives a very good impression of the positions and the sizes of individual repeats. Note that the SVG Viewer works better in MS Internet Explorer than in Netscape. The summary table lists all elements, very similar to RepeatMasker. The masked sequence shows the query sequence with all repeats replaced by "N". In addition, all masked segments are listed as separate fasta sequences. All pairwise alignments of the query and the repeat sequences are shown. The database entries of all repeats are shown.
    Please note that although sequence analysis using CENSOR is not restricted, the viewing of individual repeat database entries is dependent on registration (free for academic use) ! NOTE: As CENSOR is provided by the same site which maintains the Repbase Update database (GIRI), one can be sure to use the most recent version of this database for the analysis ! If you want to get detailed information about individual repeats, you may browse or search the Repbase Update database.
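
    The masking step itself is simple to reproduce if you already have the repeat coordinates, e.g. from a summary table. The following Python sketch replaces the given regions by "N". The coordinates are assumed to be 1-based and inclusive, and the example sequence and interval are invented; the sketch does not detect repeats and is not the CENSOR or RepeatMasker implementation.

```python
def mask_repeats(sequence, repeat_intervals, mask_char="N"):
    """Replace every position inside the given intervals by mask_char.

    repeat_intervals: list of (start, end) pairs, 1-based and inclusive.
    Intervals reaching beyond the sequence are clipped.
    """
    masked = list(sequence)
    for start, end in repeat_intervals:
        first = max(start, 1) - 1          # convert to 0-based index
        last = min(end, len(masked))       # exclusive upper bound
        for i in range(first, last):
            masked[i] = mask_char
    return "".join(masked)


# Hypothetical example: one repeat annotated at positions 3 to 5.
print(mask_repeats("ACGTACGTAC", [(3, 5)]))
```

    A masked sequence of this kind can then be used for database searches without the repeats producing spurious hits.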
   
    Another alternative is the program Repeat, which in addition includes a nice graphical output showing the positions of repeats in the query. The masked output sequence is ready for Copy/Paste. Note that Repeat has an upper limit of 31 kb of input sequence !

    H-Invitational Database (H-InvDB) is a human gene database opened to the public in April 2004, which is hosted by the Japan Biological Information Research Center (JBIRC) and by the DNA Databank of Japan (DDBJ). The scope of H-InvDB is to provide an integrative annotation of full-length cDNA clones available from high throughput cDNA sequencing projects. If you want to scan a cDNA sequence of interest for repetitive elements, you may perform a simple keyword query, and then look at the "cDNA view", which contains a link concerning repetitive elements within the "cDNA information" section. This leads to the display of the Repeat Mask Viewer, which nicely shows the position of the repeat within the input sequence as lower case letters. Please refer to the H-InvDB section at the Data Integration page for a detailed description !
                   
2. Screen for (short) nucleotide repeats:
   
    There is a whole list of EMBOSS tools for this purpose available also as web-interfaces at the Pasteur Institute. Repeats scans a DNA sequence, looking for tandemly repeated patterns where the period of the repeat has a user specified size from 1 to 32 nucleotides. Einverted finds DNA inverted repeats. Equicktandem finds tandem repeats. Etandem looks for tandem repeats in a nucleotide sequence for the repeat size equicktandem suggests. Palindrome looks for inverted repeats in a nucleotide sequence.  
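
    To illustrate what a tandem repeat search looks for, here is a minimal Python sketch that finds exact, directly adjacent copies of a unit of a given size using a regular expression with a back-reference. It is only an illustration of the concept: the EMBOSS programs use their own algorithms and, unlike this sketch, may tolerate mismatches. The function name and the example sequence are hypothetical.

```python
import re


def find_tandem_repeats(sequence, unit_size, min_copies=2):
    """Find exact tandem repeats of a unit with the given size.

    Returns (start, end, unit, copies) tuples with 1-based inclusive
    coordinates. Only perfect, directly adjacent copies are detected.
    """
    seq = sequence.upper()
    # ([ACGT]{k}) captures one unit, \1{n,} requires n further identical copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_size, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        copies = (match.end() - match.start()) // unit_size
        hits.append((match.start() + 1, match.end(), match.group(1), copies))
    return hits


# Hypothetical example: a (CA)4 dinucleotide repeat.
print(find_tandem_repeats("GGCACACACATT", unit_size=2, min_copies=3))
```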
   
    REPFIND is a program to find clustered, exact repeats in nucleotide sequences. For each repeat cluster that it finds, it calculates a P-value, which indicates the probability of finding such a concentration of that particular repeat just by chance. Of the many possible clusters for each repeated word, REPFIND selects the one with the most significant P-value. REPFIND only accepts single sequences as input (no batch submission). Please note one point concerning the "Low Complexity Filter". You should carefully select / deselect this option when looking for motifs like AU-rich elements, which are similar to a "low complexity" sequence, and which therefore would be masked out (hidden) prior to the analysis. Please also refer to the REPFIND description at the main page.

 

GEN4...know which promoters or enhancers in a whole genome contain a binding site for a single or a combination of transcription factors (Motif Matching; Module Scanners) ? (last update Mar. 14, 2006)

    This question addresses the problem that it is not very useful to BLAST e.g. the whole human genome using a short sequence stretch like "AACAATG". NOTE: The first step, the identification of the binding site for an individual transcription factor, is described in question GEN7 !

There are several ways to deal with this question (Tip: use options 3 and 4 !):

1. Databases of curated promoter sequences:

    These databases contain sequence data of experimentally verified promoter sequences. This means that they are of high quality, but normally contain only a small fraction of promoters/genes of a species' genome.

    The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Note that it is only possible to perform SRS-like keyword searches in the EPD database but NO sequence searches (like BLAST). BUT you can download the promoter sequences into a simple FASTA file. You can then - after removing the line breaks in a text editor like WORD - simply search this text file via the WORD search function using the nucleotide string as query. Of course, the more elegant version is to analyze the sequence set with the RSAT tool DNA-pattern, also allowing to search for complex patterns including IUPAC-codes (see description below). Note that in principle you can BLAST the EPD at the swiss Embnet, but it does not accept short oligo sequences as input.
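
    The "remove line breaks, then search" procedure can also be scripted. The following Python sketch reads a FASTA file content, joins the sequence lines of each record, and reports the positions of an exact nucleotide string. It searches the given strand only and does not handle IUPAC codes; the record names and sequences are invented for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a dict: record name -> sequence without line breaks."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].split()[0]   # first word of the header line
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def find_string(records, query):
    """Return record name -> list of 1-based start positions of query."""
    query = query.upper()
    hits = {}
    for name, seq in records.items():
        positions = []
        start = seq.find(query)
        while start != -1:
            positions.append(start + 1)
            start = seq.find(query, start + 1)  # allow overlapping hits
        if positions:
            hits[name] = positions
    return hits


# Hypothetical two-record file; the motif in p1 spans a line break.
promoters = read_fasta(">p1 example\nGGAACA\nATGCC\n>p2\nTTTTTT\n")
print(find_string(promoters, "AACAATG"))
```

    Joining the lines first matters because a motif that is split across two lines of the file, as in the example, would otherwise be missed.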
  
2. Databases of in silico predicted/extracted promoter sequences:

    These databases contain "precomputed" sets of promoters, which normally were derived from genome databases by extracting 5'-upstream regions of cDNA starts. Note that the size and the quality of these databases depends on the type of cDNA sequence used (RefSeqs, other curated mRNAs or ESTs) AND on the definition of the reference position (position +1; see e.g. RSAT).
 
    PRESTA is a tool/database that combines EST databases and putative GenBank/EMBL promoters to yield datasets of predicted promoters at high accuracy. A high-stringency BLAST search reveals ESTs that assist in transcription start site verification. PRESTA is therefore very useful for promoter verification by mapping EST 5' ends. In comparison with EPD, PRESTA adds EST sequence information to the promoter data, therefore not only verifying transcription start sites but also providing expression data. As with EPD, you can download the complete sets of human and mouse promoters into simple FASTA text files, which are then searchable in a WORD-like text editor. BUT note that in PRESTA, too, many genes cannot be found or retrieved.

    Alternatively, a good tool to address this question is part of the package RSAT - Regulatory Sequence Analysis Tools. The module of this suite which can be used in this context is called Genome-scale pattern matching. You can search e.g. the whole (meaning NOT restricted to EPD promoters !) human genome using the consensus binding site of a certain TF ("Genome-scale DNA pattern") or use a matrix-based pattern of a TF ("Genome-scale Patser"). Please refer to the corresponding chapter at the main page for further instructions. Note: In general, RSAT does not provide a selection of TF matrices from a database but YOU have to know the pattern or matrix of a TF of interest ! If you do not know this information, you may want to have a look at FAQ GEN7 first ! NOTE: Like in RSAT-Retrieve Sequence, there is now an option to select the mRNA start as "reference position" for upstream sequence retrieval in both programs (see this section for comments) !
    In general, you may also combine the powers of different programs to achieve a good result, see next part !
            
3. "Do-it-yourself" whole genome promoter extraction and Motif Matching: Tip!

    This is a feasible way to deal with this question, as it circumvents the weakness of RSAT in correctly extracting promoter sequences, especially for vertebrates. The following text shows an example for human promoters. The next section also discusses some of the limitations which are still present.

3.1. "Do-it-yourself" whole genome promoter extraction:

    The first step is to use BioMart to extract the complete set of human promoters; please also refer to the BioMart chapter in question GEN6. At the start page, choose the human genome, and select "Ensembl Genes". At the filter page, you may either deselect all boxes, meaning you will retrieve all genes, or you may limit the output to genes that are at least "a little characterized", by choosing "Genes with Entrez Gene IDs" or "with RefSeq IDs". At the output page, choose the "Sequences" page, where you can select 5' flanking regions. Note that you should select "Genes - transcript information ignored (one output per gene)", so that you receive only the upstream region of the LONGEST 5'-cDNA of each gene. A range of 1000 bp of 5'-flanking sequence is possibly a good start. Select "Text, fasta" as output format. After a few minutes, you will retrieve a (long) txt-file of promoter sequences, which can be saved and used for further analyses.

3.2. Consensus-based (pattern-driven) Motif Matching:

    Note that there is sometimes confusion when talking about definitions of words like "consensus", "pattern", "profile", or "motif". There is a good, concise introduction in a paper describing the server WeederWeb, which actually is used for Motif Discovery. In consensus-based (or pattern-driven) methods, the different oligonucleotides recognized by a given transcription factor are described by their consensus, representing, for each position, the nucleotide that appears most frequently in the binding sites. Profile-based (or alignment-driven) methods, on the other hand, work with profile matrices instead of consensus sequences (see below).
    Please note that the subject of motif matching on single promoter sequences is also discussed in FAQ GEN2 !
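
    The definition of a consensus given above can be written down in a few lines. The following Python sketch takes a set of aligned binding sites of equal length and returns, for each position, the most frequent nucleotide. The four example sites are invented, and ties are broken alphabetically, which is an arbitrary choice of this sketch (real consensus sequences would rather use IUPAC ambiguity codes in such cases).

```python
from collections import Counter


def consensus_from_sites(sites):
    """Return the most frequent nucleotide at each position of aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    consensus = []
    for column in zip(*sites):
        counts = Counter(base.upper() for base in column)
        # sorted() makes ties resolve to the alphabetically first base.
        best = max(sorted(counts), key=lambda base: counts[base])
        consensus.append(best)
    return "".join(consensus)


# Hypothetical aligned binding sites of one transcription factor.
example_sites = ["GATAAG", "GATAAC", "GATTAG", "CATAAG"]
print(consensus_from_sites(example_sites))
```

    Note how the consensus hides the variation at positions 1, 4, and 6; this loss of information is what profile matrices try to avoid.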

    Tip! The RSAT program DNA-Pattern (Strings) is an example of pattern-driven motif matching. Patterns can contain spacers of fixed length (e.g. CGGn{11}CCG) or variable length (GATAAGn{0,60}GATAAG). In general, several patterns (separated by breaks) can be searched at once against several sequences, meaning that you can search for several TF sites at once against all promoters ! Note that matches will be displayed independently for each pattern, unless you combine several patterns to one common one. Example: If you screen a set of sequences of 1000 bp in length with 2 patterns and you want to know only those sequences which contain BOTH of them, you may create a query pattern like "CGGACGn{0,1000}GATTTT". Note that you should also include a query pattern of "opposite order": "GATTTTn{0,1000}CGGACG". Note that if you know a "maximum allowed distance" between 2 patterns like 2 transcription factor binding sites you may specify this value in the query, like "CGGACGn{0,60}GATTTT".
The first word of each line is the string description of the pattern, the second word is an identifier for this pattern (if you do not enter an identifier, then the string (sequence) of the pattern is also used as identifier). Example: Type the following text in the Query pattern(s) box: GATAAG  Gata_Box. You can display the results also graphically, please refer to the tutorial for further information. Using "Browse" you select the "whole genome promoter - txt file" that you created in the first step. Note that when scanning large sequence sets, one might be interested in counting the number of matches, rather than returning their precise positions. This can be done by deselecting the checkbox match positions and selecting match counts instead, and by specifying a threshold. Threshold means that only those hits (promoters) are returned having more matches than the specified threshold (e.g. promoters showing at least 2 or 3 copies of a TF site).
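
    The pattern syntax described above maps quite directly onto regular expressions. The following Python sketch is not the RSAT implementation; it only illustrates the idea. It translates a pattern with IUPAC codes and n{...} spacers into a regular expression, counts matches per sequence, and tests whether two sites occur in either order within a maximum distance. The function names and example sequences are hypothetical, only the given strand is searched, matches are counted without overlap, and "at least `threshold` matches" is an assumption of this sketch.

```python
import re

# IUPAC nucleotide codes expressed as regular expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pattern_to_regex(pattern):
    """Translate e.g. 'CGGn{11}CCG' into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "{":
            end = pattern.index("}", i)
            parts.append(pattern[i:end + 1])   # copy the repeat count as is
            i = end + 1
        else:
            parts.append(IUPAC[pattern[i].upper()])
            i += 1
    return re.compile("".join(parts))


def count_matches(sequences, pattern, threshold=1):
    """Return name -> match count for sequences with at least `threshold` matches."""
    regex = pattern_to_regex(pattern)
    counts = {}
    for name, seq in sequences.items():
        n = sum(1 for _ in regex.finditer(seq.upper()))
        if n >= threshold:
            counts[name] = n
    return counts


def contains_both(sequence, first, second, max_gap):
    """True if both sites occur, in either order, at most max_gap bases apart."""
    seq = sequence.upper()
    for a, b in ((first, second), (second, first)):
        if pattern_to_regex("%sn{0,%d}%s" % (a, max_gap, b)).search(seq):
            return True
    return False


# Hypothetical promoter set.
example_promoters = {
    "p1": "TTGATAAGCCGATAAGTT",
    "p2": "ACGTGATAAGAC",
    "p3": "ACGTACGT",
}
print(count_matches(example_promoters, "GATAAG", threshold=2))
print(contains_both("AAGATTTTCCCGGACGAA", "CGGACG", "GATTTT", max_gap=60))
```

    Testing both orders inside one function corresponds to submitting the two query patterns "A n{0,x} B" and "B n{0,x} A" mentioned above.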

    FUZZNUC is a program for nucleic acid pattern matching, using the typical "user-friendly" EMBOSS interface-style. You simply paste your set of sequences (or upload a local file in case of very large datasets like "all human promoters"), which you want to search with your pattern and the pattern (consensus sequence) itself. Note that patterns for fuzznuc are based on the format of pattern used in the PROSITE database (amended to refer to nucleic acid sequences, not proteins). Please refer to the FUZZNUC main section for details. Note that the matching hits are listed only as text one sequence after the other, there is no graphical output, which is a drawback when screening large sequence sets as it is not easy to get an overview of match counts and match positions.

3.3. Profile-based (or alignment-driven) Motif Matching:

    Profile-based (or alignment-driven) methods, on the other hand, work with profile matrices instead of consensus sequences. Briefly, a profile matrix is not a "one-line sequence", but a table (a matrix) indicating the "importance" of each nucleotide at each position via specific values.
    Please note that the subject of motif matching on single promoter sequences is also discussed in FAQ GEN2 !
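
    To make the idea of a profile matrix more concrete, the following Python sketch builds a count matrix from aligned binding sites, converts it into log-odds weights, and scans a sequence with it. This is a generic illustration and not the scoring scheme of Patser, Match, or MatInspector. The example sites, the pseudocount of 1, the uniform background of 0.25, and the score cutoff are all assumptions made for this sketch.

```python
import math


def count_matrix(sites):
    """Count how often each nucleotide occurs at each position of the sites."""
    length = len(sites[0])
    matrix = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site.upper()):
            matrix[base][i] += 1
    return matrix


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts into log2(frequency / background) weights."""
    n_sites = sum(counts[base][0] for base in "ACGT")
    length = len(counts["A"])
    total = n_sites + 4 * pseudocount
    return {
        base: [
            math.log2(((counts[base][i] + pseudocount) / total) / background)
            for i in range(length)
        ]
        for base in "ACGT"
    }


def scan(sequence, weights, min_score):
    """Return (position, window, score) for windows scoring at least min_score."""
    seq = sequence.upper()
    width = len(weights["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in weights for base in window):
            continue  # skip windows containing N or other symbols
        score = sum(weights[base][i] for i, base in enumerate(window))
        if score >= min_score:
            hits.append((start + 1, window, round(score, 2)))
    return hits


# Hypothetical aligned binding sites used to build the matrix.
example_sites = ["GATAAG", "GATAAC", "GATTAG", "CATAAG"]
weights = log_odds_matrix(count_matrix(example_sites))
print(scan("TTGATAAGTT", weights, min_score=5.0))
```

    In contrast to a consensus string, a site with a less frequent nucleotide at one position still receives a score and can pass the cutoff, depending on how the threshold is chosen.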

    Tip! The RSAT program Patser (Matrices) is an example of profile-based motif matching. Patser allows you to scan a set of DNA sequences with a profile matrix, which can be in Transfac, Gibbs, or Consensus format. Please refer to the Patser main section for information on the different formats.

    Motif Matcher was developed by Jim Kent at UCSC and is part of the page "cis-Site Seeker". Motif Matcher is a program for finding where a given motif occurs in sequence data. The Motif Matcher help gives a very good, concise introduction on the "nature" of a motif, making it quite easy to convert a consensus sequence into a motif. Please refer also to the Motif Matcher main section for details on "motif construction". The output presents the highlighted motif matches along the sequences and a graphical summary of the motif positions, which is comparable to the RSAT "feature map" produced after performing a "DNA-pattern" or "Patser" search. Note that Motif Matcher is not suitable for screening very large datasets (like "all human promoters"), as there is no file upload option (only copy/paste the sequence set).

3.4. Annotating the lists of matching genes / promoters:

    In another step (optional), you may feed the accession numbers of the output table of step 2 into TOUCAN and look if other transcription factors are also over-represented in this set of promoters, potentially revealing TFs which might be co-regulated with your TF of interest. Please refer to the TOUCAN chapter at the main page for detailed instructions (Sequence Retrieval, MotifScanner, Statistics, ModuleSearcher). Note that in TOUCAN, you can also quickly extract a complete annotation table for your set of genes ! Alternatively, you can feed your RSAT list of potential target genes back into BioMart (at the "Filter" Page, "Limit to Genes with these IDs") in order to achieve an annotation table, which maintains all hyperlinks actively in an EXCEL sheet (at the "Output" Page, choose the "Features" you would like to see).

3.5. "Functional Clustering" of matching genes / promoters:

    In a final step (optional), if the RSAT list should be very long, you may perform a "functional clustering" of the genes. For this purpose, you may use the KEGG tool KAAS - KEGG Automatic Annotation Server. Please refer to FAQ PATH1 for details.

4. Strategies based on Comparative Genomics and / or combinations of TFBS ("Module Scanners"): Tip!

    It has to be clearly stated that strategies like the one described in the previous section often produce long lists of potential hits, which means that the background or "dust" is quite high. Also, these approaches are restricted to "proximal" promoter regions, say 1 kb upstream of the TSS (Transcription Start Site). Especially in higher eukaryotes, data have been emerging which strongly suggest that it is not sufficient to concentrate on these proximal regions if one wants to get a comprehensive insight into a gene's expression regulation. Regulatory elements like enhancers, silencers, or insulators can be found 50-100 kb upstream or downstream of a gene, and introns of neighbouring genes can also be "hot candidates" for regulatory elements. On the bioinformatics side, comparative genomics has proven to be THE method of choice for predicting regulatory regions, as these are normally highly conserved. The selection of species is critical for this purpose, and it has been shown that comparisons of moderately related species, like human and mouse, are ideal.
    In addition, transcription factors often do not act alone but in combination with other factors ("modules"), or they have multiple binding sites within regulatory regions. There are a few tools available which perform "whole-genome approaches" to identify potential target genes of TF combinations.

    In FAQ GEN5, strategies are described for extracting potential combinations of TFs from a set of genes, using programs like CREME or ModuleSearcher. If you now want to know which regulatory regions in the whole human genome also contain this particular combination of TFs, you may use one of the following programs, which collectively might be called Module Scanners. These programs are also described at the main page, section "TF Module Matching".

    Tip! SynoR performs genome-wide scans for clusters of evolutionary conserved transcription factor binding sites (cTFBS) in user-specified spatial configurations. SynoR is part of the portal Dcode.org for comparative genomics and gene regulation. The current version of this program scans human and mouse genomes for TFBS conserved in comparisons with either other mammals, chicken, frog, or fish. The identified cTFBS modules and corresponding genes go through several steps of functional annotation. (1) cTFBS modules are classified as promoters (regions 1.5 kb upstream of TSS), UTRs, introns, intergenic, or coding exons depending on their relationship to "UCSC known genes". (2) Interspecies conservation is performed for all the identified modules to describe the evolutionary history of different modules. (3) Gene Ontology (GO) characterization is performed for genes bracketing the identified noncoding modules. (4) GNF Expression Atlas 2 analysis is performed for these genes, thus allowing the prediction of tissue specificity of the identified modules. Please refer to the SynoR section at the main page for details.   

    Tip! ModuleScanner, integrated in the TOUCAN package, performs genomic searches with a predicted CRM (cis-regulatory module) or with a user-defined CRM known from the literature to find possible target genes. Please also refer to the TOUCAN chapter at the main page for additional information, like general program setup. Starting from a blank page in TOUCAN, choose "Motifs", "ModuleScanner".
    You have to choose one of the databases which all comprise pre-computed sets of CNS - conserved noncoding regions (minimally 75% identity within 100bp windows) between 2 species within 10kb upstream of the coding regions, like CNS between human- mouse or human - zebrafish or human - chicken.  Also, you can choose to display either the regions of the "primary" or the "second" species in the output. Finally, you can choose between the TRANSFAC or the JASPAR matrices of TFBS to visualize.
    Then, you have to select the transcription factor matrices from a list, which you would like to use for the scan, or you enter them manually as a string, separated by commas, like e.g. "[M00052-V$NFKAPPAB65_01,M00189-V$AP2_Q6]". You may enter also more than 2 TFs, or you may even look for clusters of only ONE special TF, by using e.g. "[M00052-V$NFKAPPAB65_01,M00052-V$NFKAPPAB65_01]". You may change the number of "top hits" to return, which is based on the score of the ModuleScanner.
    The output lists CNS regions where the chosen TF combination is found. Note: The numberings at each CNS in the output indicate the position relative to the coding sequence of the gene. Note that, ONLY conserved regions between the 2 species selected are scanned, BUT a displayed TF site is taken from the "primary" sequence only (the one indicated in the database selection list), and is not necessarily conserved in the second species. Note that also all other predicted TFs (using MotifScanner prior 0.2) are displayed as colored boxes, but you may selectively choose the "modules" by highlighting the "Mod..." entries in the left column and hitting "Enter".
    If you want to know which genes correspond to the Ensembl GeneIDs, you may use the TOUCAN annotation tool via "Get_Seq", "From Ensembl", "Get Info". You may also paste the Ensembl GeneIDs into the Ensembl query field or other data retrieval tools like BioMart, in order to perform advanced annotation and data retrieval (see also BioMart chapter at the main page).
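
    The CNS criterion mentioned above (minimally 75% identity within 100bp windows) can be pictured with a small Python sketch that slides a window along a pairwise alignment and reports the window starts that meet the identity cutoff. How the TOUCAN databases were actually computed is not described here, so the function name, the gap handling, and the input format (two gapped, equal-length alignment strings) are assumptions.

```python
def conserved_windows(aligned_a, aligned_b, window=100, min_identity=0.75):
    """Return start positions of alignment windows meeting the identity cutoff."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 where both rows carry the same nucleotide, 0 for mismatches and gaps.
    matches = [
        int(a == b and a != "-")
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    starts = []
    current = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new one, drop the old one.
            current += matches[start + window - 1] - matches[start - 1]
        if current / window >= min_identity:
            starts.append(start)
    return starts


if __name__ == "__main__":
    # Toy alignment with a short window so the example stays readable.
    human = "AAAAAAAAAATTTTTTTTTT"
    mouse = "AAAAAAAAAACCCCCCCCCC"
    print(conserved_windows(human, mouse, window=10, min_identity=0.75))
```

    Overlapping windows that pass the cutoff would normally be merged into one conserved region; that step is left out to keep the sketch short.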

    Tip! DBTSS - Search for TF Binding Site is a "sub-program" of the DBTSS database, accessible via the links in the left frame at the start site. DBTSS - Search for TF Binding Site can search for promoters containing putative binding sites of particular transcription factors (TFs). There are 3 sequence databases which allow a "whole-genome" search for TFBS modules: human, mouse, and human-mouse conserved. The human and mouse sequence databases contain 1.2 kb for each gene (1.0 kb upstream of TSS and 0.2 kb downstream of TSS). This means that DBTSS focuses on proximal promoters, but not on distal regulatory elements like enhancers or silencers. Analysis of evolutionary conservation is therefore restricted to this region. Please refer to the DBTSS - Search for TF Binding Site section at the main page for details.     
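
    For readers who want to reproduce a comparable promoter window on their own annotation, the following Python sketch derives strand-aware coordinates for 1.0 kb upstream and 0.2 kb downstream of a TSS. It is only an illustration of the window definition quoted above: the function name, the 1-based inclusive coordinates, and the decision to count the TSS base as the first downstream base are assumptions, not DBTSS conventions.

```python
def promoter_window(tss, strand, upstream=1000, downstream=200):
    """Return 1-based inclusive (start, end) genomic coordinates around a TSS.

    The TSS base itself is counted as the first downstream base (assumption).
    """
    if strand == "+":
        start = tss - upstream
        end = tss + downstream - 1
    elif strand == "-":
        # On the minus strand, "upstream" means higher genomic coordinates.
        start = tss - downstream + 1
        end = tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip at the chromosome start; clipping at the chromosome end is omitted.
    return max(1, start), end


if __name__ == "__main__":
    for strand in "+-":
        start, end = promoter_window(5000, strand)
        print(strand, start, end, "length:", end - start + 1)
```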

     Target Explorer automates the entire process from the creation of a customized library of binding sites for known transcription factors through the prediction and annotation of putative target genes that are potentially regulated by these factors. Target Explorer was specifically designed for the well-annotated Drosophila melanogaster genome, but some options can be used for any sequence of interest. A free registration is needed in order to use the Target Explorer programs. Specific options can be used to scan ANY kind of sequence set (not only Drosophila sequences) for modules of TFBS. Unlike programs like SynoR, Target Explorer does not take any kind of evolutionary conservation of TFBS into account. Please refer to the Target Explorer section at the main page for details.   

     ModelInspector is a commercial program to scan sequence databases for the presence of TFBS modules (which can also be self-created using the program FastM). FastM is a method to develop user defined models of transcriptional regulatory DNA units (e.g. promoters). These models can be built using various individual elements (like transcription factor binding sites, repeats, hairpins) and their sequential order. Thus, IUPAC sequence elements can be successfully combined with different types of weight matrices and structural elements (e.g. hairpins) in the assessment of match quality. Between each pair of elements for a model a distance range has to be defined. ModelInspector utilizes either a library of predefined models or models generated by FastM or FrameWorker to scan your own DNA sequences or sequence databases for new regulatory units matching the model. Examples of databases which may be scanned are: GPD (Genomatix Promoter Database), ElDorado Genomes, EPD (Eukaryotic Promoter Database), RefSeq, and various GenBank sections.
    The Genomatix Promoter Database (GPD) is part of the commercial Genomatix suite of products. With GPD, Genomatix claims to offer the "most complete eukaryotic promoter database" and the "only one containing promoters for alternative transcripts". Promoter extraction via GPD is available for entire organisms or for microarray platforms (like Affymetrix arrays). There are three possible quality levels (gold-silver-bronze) assigned to each transcript which is associated with a promoter. GPD also offers pre-made annotation for promoter modules (combinations of TFBS) (EXCEL sheet) as well as module descriptions and TFBS matrix descriptions (txt files). Access to the GPD is exclusively commercial, and not part of the free academic "evaluation account". 
    Note: FastM and ModelInspector form a system to help you create your own model of regulatory elements; it is NOT a system to extract a model (co-occurring sites) from a given set of sequences (like FrameWorker, the TOUCAN program ModuleSearcher, or CREME) ! Rather, ModelInspector is similar to the TOUCAN program ModuleScanner. NOTE: FastM and ModelInspector are part of the "GEMS Launcher" section of the commercial Genomatix Suite. NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is not only a limitation in the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also in the functionality of the obtained data !

    EEL - Enhancer Element Locator is a tool for locating distal gene enhancer elements in mammalian genomes by comparative genomics and for identifying conserved TFBS in predicted enhancers. EEL is described in Hallikas et al., Cell 2006. Please refer also to the main section of EEL.
    In order to address this specific question, you may try a search in the EEL Database of precomputed EEL alignments. EELweb stores precomputed alignments between orthologous genes from human and many other species. The data is regularly updated and synchronized with the ENSEMBL database, which is used as the source of genomic information. EELweb can be searched for conserved TFBS in 100 kb upstream and downstream regions of ALL genes (whole-genome approach). For this purpose, simply leave the field for the Ensembl Gene IDs empty (which means that you want to use ALL Ensembl IDs). Note: this option works only with the "any site" selection of TFBS. NOTE: Output of the web-based database version is restricted to max. 1000 hits, so if you expect more hits you may have to use the download version of the EEL program !
                       



GEN5...know which regulatory elements are common in a set of promoter sequences and check if these motifs are known transcription factor binding sites (Motif Discovery) ? (last update Mar. 30, 2006)
   
    This question demands a multi-step procedure of bioinformatic tools, where each step can be performed either separately or as part of a complete software suite. I am going to describe both "All-in-one packages" and "Individual single-step tools".
    Please note that this question also includes programs / databases for phylogenetic footprinting, meaning procedures to compare promoters from different species and thereby refine the results, based on the assumption that functionally important elements are conserved in evolution. An important prerequisite, of course, is the availability and reliability of orthologous promoter sequences. In addition, an essential step is the choice of species used for comparison. If the evolutionary distance is too large, then the chance of retrieving conserved motifs rapidly decreases. On the other hand, if the species are too closely related (like e.g. mouse and rat), it is often hard to distinguish regions conserved by evolutionary pressure from the "overall" sequence similarity. In any case, you should try different parameter settings at the input forms of the different programs and compare the results. Note that programs integrating phylogenetic footprinting are listed under headings containing "...Multiple (or 2) species...".

1. "All-in-one packages":

1.1. Genomatix Suite (commercial !):
   
    There is a very nice tutorial addressing that question available at the Genomatix webserver. It describes a process of different steps for the analysis of DNA expression array clusters, thereby using software modules of the Genomatix suite (Chip2Promoter, GEMS Launcher, and ElDorado). Note that there are two different (FREE) registrations, one for the package ElDorado, Chip2promoter and GEMS, and the other for MatInspector professional. Please also note that the amount of free runs is very small so you should consider one of the numerous options of subscription if you want to use these databases regularly.
    Anyway, the first step of the procedure is the clustering of the microarray data, which is performed by other programs like Cluster, TreeView or Expression Profiler. Please refer to the section "Gene Expression and Pathways" at the main page.
    The second step is the automatic in batch extraction of the promoter sequences belonging to all the cDNAs of the cluster, performed by Chip2Promoter. For this purpose, different kinds of accession numbers can be used as input, including Affymetrix ProbeSet IDs (separated by spaces). Currently, Chip2Promoter is available for human, mouse, rat, and Arabidopsis. The extracted promoter sequences can be downloaded for further use. Of course, this process could be performed "by-hand", like using the UCSC Genome Browser, for a limited number of sequences, but will be highly time-consuming for large data sets.
    The third step is the search for common transcription factor (TF) sites in this set of promoters. For this purpose, two buttons are available at the output file of Chip2Promoter. The link "Show common TF sites" shows a graphical and tabular representation of TFs common to a user-defined percentage of sequences (like 80 % or 90 %). The link "Definition of common framework" performs a more sophisticated analysis as it screens for conserved pairs (termed "modules") of TFs (termed "elements"), with a user-defined maximum distance between two elements. This procedure reveals a list of potentially significant pairs of TFs involved in the regulation of the whole cluster of genes.
    Steps 4 and 5 describe a series of methods to evaluate the significance of the modules. In step 3, it is possible to save the predicted models at the FrameWorker output page. In steps 4 and 5, these files can be used to perform comparisons with promoters of the EPD (Eukaryotic Promoter Database), or to compare the predicted modules of one cluster with another cluster. Please refer to the tutorial for detailed instructions.
    Finally, step 6 describes refined literature analyses to estimate the functional significance of the found modules. For this purpose, the Genomatix package ElDorado can be used. This tool creates cocitation networks of transcription factors and genes of interest.  
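
    The "common TF sites" idea from step 3 above (TFs present in a user-defined percentage of the promoters) can be sketched in a few lines of Python. This is not Genomatix code; the function name, the input format (one list of TF names per promoter), and the toy data are assumptions for illustration.

```python
from collections import Counter


def common_sites(hits_per_sequence, min_fraction=0.8):
    """Return TFs found in at least min_fraction of the sequences.

    hits_per_sequence maps a promoter name to the TF names predicted in it.
    The result maps each retained TF to the fraction of promoters containing it.
    """
    n_sequences = len(hits_per_sequence)
    if n_sequences == 0:
        return {}
    presence = Counter()
    for sites in hits_per_sequence.values():
        # A TF counts once per promoter, however many sites it has there.
        presence.update(set(sites))
    return {
        tf: count / n_sequences
        for tf, count in presence.items()
        if count / n_sequences >= min_fraction
    }


if __name__ == "__main__":
    toy_hits = {  # hypothetical predictions, for illustration only
        "prom1": ["NFKB", "AP2", "SP1"],
        "prom2": ["NFKB", "AP2", "AP2"],
        "prom3": ["NFKB", "AP2"],
        "prom4": ["NFKB", "AP2", "SP1"],
        "prom5": ["NFKB"],
    }
    print(common_sites(toy_hits, min_fraction=0.8))
```

    Note that such a filter says nothing about significance: a frequent TF site can pass the cutoff simply because it is common in all promoters, which is why the later steps compare against reference sets.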

1.2. RSAT:

   
    A suitable (and FREE) alternative for the Genomatix suite is a software package termed RSAT - Regulatory Sequence Analysis Tools. RSAT consists of a series of modular computer programs specifically designed for the detection of regulatory signals in intergenic sequences. The only input required is a list of genes of interest (e.g. a family of co-regulated genes). From this information, you can retrieve the upstream sequences over a desired distance, discover putative regulatory signals, search the matching positions for these signals in your original dataset or in whole genomes, and display the results graphically in the form of a feature map. Each tool is presented as a form to fill. For each form, a help page provides detailed information about the parameters. Please refer to the corresponding chapter at the main page for detailed instructions.   

1.3. TOUCAN:
 
  
Tip! Another excellent (and FREE) alternative for the Genomatix suite is TOUCAN, which is a platform independent, standalone Java application that is tightly linked with Ensembl. TOUCAN was developed by the bioinformatics group at the department of electrical engineering (ESAT) at the Katholieke Universiteit Leuven (Belgium).  Please also refer to the TOUCAN chapter at the main page for detailed instructions (Sequence Retrieval, MotifScanner, Statistics, ModuleSearcher). I will give here a brief overview about the program's capabilities. First, you can perform in-batch extraction of promoter sequences using a list of diverse identifiers (like LocusLink, Ensembl, RefSeq and many more), which works similarly to the BioMart sequence extraction, yielding one output (promoter) per gene. You may also simultaneously extract promoters of orthologous genes from human, mouse and rat, potentially raising the functional relevance of common transcription factor binding sites (TFBS). The tools MotifScanner and MotifLocator detect TFBS in your sequence set and graphically display these sites. The Statistics tool detects over-represented features in your set, thereby pointing at significant TFBS. The ModuleSearcher scans your sequences for high-scoring combinations of TFBS. The ModuleScanner checks predicted modules against whole-genome CNS (Conserved Non-coding Sequence) regions. The MotifSampler detects over-represented patterns in your sequence set, which then might turn out to be known or even unknown TF binding sites. The AVID tool displays regions of high similarity between 2 sequences, which in turn can be saved as a sequence sublist and be analyzed with e.g. MotifScanner. FootPrinter is a program that performs phylogenetic footprinting. It takes as input a set of unaligned orthologous sequences from various species, together with a phylogenetic tree relating these species. It then searches for short regions of the sequences that are highly conserved. 
Via the tool "Consensus match", you can scan your sequences using your own patterns like "AAAGGTAA" or "WWRYAATC{1,5}NCA". Finally, you can also quickly extract a complete annotation table for your set of genes, using "Get_Seq", "From Ensembl", "Get Info", "Update" !
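
    Patterns such as "WWRYAATC{1,5}NCA" combine IUPAC ambiguity codes with a repeat range. As a sketch of how such a pattern could be matched outside TOUCAN, the Python code below translates it into a regular expression. The interpretation of "{1,5}" as "repeat the preceding symbol 1 to 5 times" follows regular-expression convention and is an assumption about the TOUCAN syntax; the function name is hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(pattern):
    """Compile an IUPAC consensus (with optional {m,n} repeats) into a regex."""
    pattern = pattern.upper()
    parts = []
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "{":
            # Copy a repeat range such as {1,5} through unchanged.
            close = pattern.index("}", i)
            parts.append(pattern[i:close + 1])
            i = close + 1
        elif char in IUPAC:
            parts.append(IUPAC[char])
            i += 1
        else:
            raise ValueError(f"unsupported symbol in pattern: {char!r}")
    return re.compile("".join(parts))


if __name__ == "__main__":
    regex = consensus_to_regex("WWRYAATC{1,5}NCA")
    print(regex.pattern)
    match = regex.search("GGATGCAATCCCGCATT")
    if match:
        print("match at position", match.start(), ":", match.group())
```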

2. "Individual single-step tools":

2.1. Automated in-batch promoter extraction:
   
    Please refer to question GEN6, A) for a detailed description of the different tools, especially PromoSer, BioMart, and TOUCAN.

2.2. Multiple genes, single species, TFBS and TFBS modules search (Co-regulation):
   
    MotifScanner is not only implemented in TOUCAN, but also available as stand-alone web interface. This program detects TFBS in your sequence set and graphically displays these sites. At the input window you have to supply one common FASTA sequence file, and select the motif model (e.g. for human sequences you may choose the option "TRANSFAC 6.0 public - Vertebrates") and the type (species) and order of "Background Model" (3rd order models are fine in most cases). In addition, it is quite useful to "play around" with the "prior"-value (probability of finding copy of a motif:), which replaces the "core and matrix similarity values" of MatInspector, meaning a lower prior (like 0.1) is more stringent than a higher one (like 0.9). In contrast to TOUCAN, the output is not displayed graphically but as a simple list of TRANSFAC matrices which match to your input sequences, according to the criteria specified by the user. You may upload a multiple FASTA sequence file but there will be no comparisons of the TF composition between sequences.

    Using MatInspector professional (commercial, limited number of free runs !) available at Genomatix, you can search a multiple FASTA sequence file for the presence of TF sites. Note that you should use MS Internet Explorer (and not Netscape) in order to make use of the functionality of the Adobe SVG Viewer, which allows you to handle the TF site diagrams interactively. At the bottom of the output page there is a button called "Search for common TF sites", which opens an SVG window where you can restrict the display to those TF sites present in x of y total input sequences. Via right-click and "Copy SVG", you can paste the image into e.g. Corel Photopaint in order to save it in any file format.

    Tip! TELiS is a very fast and very easy-to-use system to find transcription factor binding motifs (TFBMs) that are over-represented in promoters of differentially expressed genes. Modern high-throughput methods like microarrays often generate lists of genes which show a different expression pattern under different experimental conditions like biological stimuli. TELiS is capable of extracting very quickly the over-represented TFBMs in such promoter datasets. This is done by pre-solving the most computationally intensive part of the problem, scanning large nucleotide sequences for multiple TFBMs. Thus, the TELiS database contains information on the prevalence of TFBMs in the promoters of all human, mouse, and rat genes. 3 different promoter sizes have been extracted from genome databases: 300 or 600 bases upstream of the Transcription Start Sites (TSS), or a region from -1000 to +200, all corresponding to mRNA sequences from the NCBI RefSeq database. The program MatInspector was used to determine the TFBMs in these sequences, using 3 different stringency values: Matrix similarity 0.8 ("low"), 0.9 ("high"), and 0.95 ("extreme"). 2 different TFBM databases can be used, the public TRANSFAC database version 3.2, or the open-access JASPAR database. Note: Although TRANSFAC has a higher number of TFBS, this public version is not updated, in contrast to JASPAR, which is a smaller set that is non-redundant and curated. Note: If you want to get detailed information on individual TFBMs, please refer to FAQ GEN7 ! Note that also in TELiS, the matrices of all TRANSFAC TFBMs and JASPAR TFBMs can be browsed one after the other, which is still less convenient than using other options listed in FAQ GEN7.
    The only input required from the user is a list of HUGO Gene Symbols, separated by tabs, spaces, or line breaks, and to choose one of the promoter sizes and one of the stringency values. The TELiS publication states that analyses of short promoter sequences (300 bases) with moderate stringency (0.90) provided optimal signal detection, whereas analyses using longer sequences or lower stringency produced poorer signal-to-noise ratios. Finally the user has to select the microarray platform which was used in the personal experiment. The last point is necessary because the TFBMs identified in the selected genes are compared to the TFBMs pre-identified in all genes contained in the experimental platform as a reference population, in order to determine over- or under-representation (please also refer to the TELiS background page for additional discussion). NOTE: If your array is not listed, you may simply select "All human / mouse / rat genes" at the bottom of the dropdown-list, meaning that ALL genes of a species are taken as reference for the analysis. Also note that the best results are achieved with sets of 100 or more genes/promoters, whereas the analytic sensitivity drops significantly for samples <20. Nevertheless, Incidence analysis p-values are described to remain accurate for any sample size.
    There are 2 different ways to analyze a dataset:
    1. "Differential expression analysis": Here, a list of the top-scoring TFBMs is produced, which are color-coded to allow easy identification of over-represented (dark blue), indifferent (grey) and under-represented (red) TFBMs. The "Incidence" indicates the number (n) and the percent ("Sample mean") of promoters which contain at least one binding site, which can be easily compared to the percent of total promoters of the platform containing this site ("Population mean"), and the resulting "Ratio" between promoters in the personal dataset and total promoters of the array.
    2. "Get raw data": Using this option, the data can be downloaded as *.td format, which is best opened with programs like EXCEL (which maintains the tabular structure of the output), not with word processors like WORD. This table shows the number of each binding site in each of the gene's promoters.
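
    The relationship between "Sample mean", "Population mean" and "Ratio" described above can be worked through with a short Python sketch. The binomial upper-tail probability added at the end is only one plausible way to judge over-representation; it is an assumption for this example and not necessarily the test that TELiS applies. Function and key names are hypothetical.

```python
from math import comb


def incidence_summary(sample_hits, sample_size, population_mean):
    """Summarise the incidence of one TFBM in a gene list.

    sample_hits      promoters in the list with at least one site (n)
    sample_size      number of promoters in the list
    population_mean  fraction of reference promoters with at least one site
    """
    sample_mean = sample_hits / sample_size
    ratio = sample_mean / population_mean
    # Chance of seeing sample_hits or more positive promoters if each promoter
    # carried the site with probability population_mean (binomial assumption).
    p_over = sum(
        comb(sample_size, k)
        * population_mean ** k
        * (1 - population_mean) ** (sample_size - k)
        for k in range(sample_hits, sample_size + 1)
    )
    return {"sample_mean": sample_mean, "ratio": ratio, "p_over": p_over}


if __name__ == "__main__":
    # Hypothetical numbers: 30 of 100 promoters carry the site, versus 15 %
    # of all promoters on the array.
    print(incidence_summary(30, 100, 0.15))
```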

    OTFBS, developed at the Institute of Bioinformatics, Tsinghua University, Beijing, is a method which can detect over-represented motifs of known transcription factors from a set of related sequences. In particular, promoters of the same gene family or from the same tissue can be submitted for analysis. Promoters of putative co-regulated genes clustered with gene expression data should also be good candidates for analysis. OTFBS currently uses TRANSFAC Matrix Release 6.0. Simply submit the upstream regulatory regions of a group of related genes (max. 200), e.g. genes clustered together with microarray data, or the genes of the same functional protein from a series of related species. Note: There is no option to adjust any parameters. The output consists of a simple list of over-represented TFBS, and the positions of all TFBS in all input sequences. Note that only the TRANSFAC Matrix accession numbers are listed (like "M00086"), NOT the names of the TFBS !!! If you want to know the identity of these TFBS matrices you have to query the individual accessions at TRANSFAC (see TRANSFAC section for instruction !).

    Tip! CREME, which is part of the Dcode programs provided by the Lawrence Livermore Nat. Lab, is a web-server for identifying and visualizing cis-regulatory modules in the promoter regions of a given set of potentially co-regulated human genes. Eukaryotic genes are often regulated by several transcription factors, whose binding sites are spatially clustered and form cis-regulatory modules. CREME relies on a database of putative transcription factor binding sites that have been carefully annotated across the human genome using evolutionary conservation with the mouse and rat genomes. Promoter extraction was done by mapping RefSeq mRNAs onto the genome assemblies, and by taking 1.5 kb upstream of the TSS, or up to the next neighbouring gene. The CREME database is built of TFBS which are conserved in all 3 species (human, mouse, rat), and which show PWM similarity scores of 0.8 and above. This means that CREME can be queried using a set of HUMAN genes, but the predicted TFBS modules are pre-computed according to the fact whether they are conserved in the 3 species or not. Please refer to the CREME chapter at the main page for details. 

    Tip! ModuleSearcher is integrated in the TOUCAN package, and scans your sequences for high-scoring combinations of TFBS. Please also refer to the TOUCAN chapter at the main page for detailed instructions. Note that if you want to know if a predicted CRM (cis-regulatory module) is found in Human-Mouse CNS (Conserved Non-coding Sequences) in a "whole-genome approach", you should use the tool ModuleScanner ! Please refer to FAQ GEN4, which specifically deals with this kind of "whole-genome approach" !

   
FrameWorker is a complex software tool that allows users to extract a common framework of elements from a set of DNA sequences. These elements are usually transcription factor binding sites since this tool is designed for the comparative analysis of promoter sequences. FrameWorker returns the most complex models that are common to the input sequences (and satisfy the user parameters). These are all elements that occur in the same order and in a certain distance range in all (or a subset of) the input sequences. Typical input datasets may be, for instance, a set of promoters from orthologous genes (Phylogenetic footprinting) or a set of promoters from different genes which have been found to be co-regulated by cluster analysis of expression array data (Co-regulation). NOTE: FrameWorker is part of the "GEMS Launcher" section of the commercial Genomatix Suite. NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is not only a limitation in the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also in the functionality of the obtained data !
           
2.3. Multiple genes, single species, motif search (Co-regulation):

    Note that there is a whole section at the main page called "Motif Discovery", where you will find additional descriptions of the numerous programs in this field. Here, I will try to concentrate on those which have user-friendly and easy-to-use web interfaces.

    MotifSampler, which is also part of the TOUCAN package, is available as individual program from the software page of the bioinformatics group at the department of electrical engineering (ESAT) at the Katholieke Universiteit Leuven (Belgium). MotifSampler tries to find over-represented motifs in the upstream region of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. You simply paste a multiple FASTA sequence file. The output gives nice graphical representations of the over-represented motifs, and displays their positions along the input sequences. Please note that the MotifSampler is a Gibbs sampling implementation, implying that it is a stochastic algorithm, thus returning different results each time ! Of course, if a certain motif is really over-represented, the same or similar motifs should be found in each run, depending on the parameters you've set.

    The RSAT tool Oligo-Analysis can be used as individual program for the detection of over-represented oligo sequences of defined length. Please also refer to the RSAT chapter at the main page ! You should select the oligo size, the background model and the organism. The results of the analysis are displayed in a table. Each row corresponds to one oligonucleotide, and each column to one statistical criterion. The E-value represents the number of patterns with the same level of over-representation which would be expected by chance alone. E.g., an E-value of the order of 1e-6 indicates that, if random sequences were submitted to the program, such a level of over-representation would be expected once in every 1,000,000 trials. NOTE: At the bottom of the result page, click on the button Pattern matching (dna-pattern); then hit "GO" and then click the Feature map button, which will produce a graphical image of the results.
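
    To show how an E-value of this kind can arise, the Python sketch below counts all oligonucleotides of a given length, computes a binomial P-value for each observed count, and multiplies it by the number of possible oligos to correct for multiple testing. It is a simplified stand-in, not the RSAT code: the uniform background (every oligo equally likely), the treatment of a single strand only, and all names are assumptions, whereas Oligo-Analysis works with organism-specific background models.

```python
from collections import Counter
from math import comb


def oligo_evalues(sequences, k=6):
    """Return {oligo: (occurrences, p_value, e_value)} under a uniform background."""
    counts = Counter()
    positions = 0
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # ignore windows with N etc.
                counts[word] += 1
                positions += 1
    prior = 0.25 ** k        # chance of a given oligo at one position
    n_patterns = 4 ** k      # number of distinct oligos that are tested
    results = {}
    for word, occ in counts.items():
        # P(X >= occ) = 1 - P(X < occ); precision is limited to about 1e-16.
        below = sum(
            comb(positions, j) * prior ** j * (1 - prior) ** (positions - j)
            for j in range(occ)
        )
        p_value = max(0.0, 1.0 - below)
        results[word] = (occ, p_value, p_value * n_patterns)
    return results


if __name__ == "__main__":
    toy = ["CCGATAAGCC", "TTGATAAGTT", "AAGATAAGAA", "GGGATAAGGG", "ACGATAAGCA"]
    occ, p_value, e_value = oligo_evalues(toy)["GATAAG"]
    print("GATAAG:", occ, "occurrences, E-value", e_value)
```

    The multiplication by the number of tested patterns is what turns a per-oligo probability into an expected number of false positives per analysis.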

    Tip! The MEME (Multiple Em for Motif Elicitation) system allows you to discover motifs (highly conserved regions) in groups of related DNA or protein sequences using MEME (or MEME mirror at Pasteur Inst.) and search sequence databases using motifs using MAST (Motif Alignment and Search Tool). Simply provide a FASTA-file of your promoter sequence set, and set the number of motifs to be extracted. In addition, you may set a "minimum/maximum motif width", e.g. for TF binding sites you may choose 6 and 9, meaning the program will extract only motifs ranging from 6 to 9 bp. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs. MEME sends 3 mail-messages: a confirmation, the MEME results and the MAST results. The MEME results include very nice multi-colored diagrams, and consensus sequences of extracted over-represented motifs, and links to perform additional analyses like BLOCKS, MAST, and MetaMEME.

    WeederWeb is one of the tools developed in the lab of Graziano Pesole, Milan University. WeederWeb is a web interface to Weeder, a program for finding novel motifs (like transcription factor binding sites) conserved in a set of regulatory regions of related genes. WeederWeb has a very user-friendly web interface. You simply paste the sequences, check if you want to include the reverse strand, select your "guess" how many sequences will share a motif, and choose the "speed" of the analysis. Note that using the "quick scan" option, only short motifs (6 to 8 bases) are reported, whereas "normal" and "thorough" modes scan for motifs from 6 to 12 bases ! Note that you should use the extended input form if you want to choose an exact motif length, or if you want to precisely define how many variations may be accepted. There is also a special option if you are using human 5' or 3'UTRs as input sequences ! Results arrive as a text file via e-mail, which also contains a hyperlink that displays the output in a "MEME-like" fashion (including a "Sequence Logo" - representation). The result also contains a user-friendly line "Interesting motifs seem to be:".

    Tip! YMF is one of the tools developed at the computational molecular biology group, University of Washington. YMF is a program that detects statistically overrepresented words (motifs) in DNA sequences. The user may specify the characteristics of the motifs to be detected. A motif here is a short string of nucleotides, degenerate symbols, and spacers. 'Motif size' is the number of non-spacer characters in a motif. Spacers ('N's) are constrained to be in the center of the motif. Degenerate symbols allowed in a motif are R (purine - A or G), Y (pyrimidine - C or T), W (A or T), and S (C or G). YMF uses a very clear, user-friendly web interface. You simply choose the motif size, the maximum number of spacers and degenerate symbols (IUPACs), and the organism. Note that although the page states "Total uploaded sequence data should be < 10000 characters", a test run using a much larger sequence set seemed to work without problems (and results were delivered very quickly!). The output contains a simple text file listing the motifs in "descending order of reliance", graphical plotting of "Top-scoring motifs" (works only with IE6.0+, and NOT with Netscape), and the option "FindExplanators". FindExplanators is a program that extracts from the set of significant motifs reported by YMF a smaller set of "real" motifs. More specifically, given a set of DNA sequences P, and a set of motifs M (such as those reported by YMF), it extracts a subset E of motifs in M, such that given the occurrences of the motifs of E in the sequences P, the remaining motifs in M are not statistically significant.
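
    If you want to check by yourself where such a degenerate motif occurs in a sequence, the symbols listed above can be translated into a regular expression. The following Python sketch is my own helper (not part of YMF); it only knows the symbols mentioned above (A, C, G, T, R, Y, W, S, and N as spacer) and scans the given strand only.

```python
import re

# Degenerate symbols as described above; N is the spacer.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
         "N": "[ACGT]"}

def motif_to_regex(motif):
    """Compile a motif into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[c] for c in motif.upper())
    return re.compile("(?=(" + body + "))")

def find_motif(motif, seq):
    """Return the 0-based start positions of all matches on the given strand."""
    return [m.start() for m in motif_to_regex(motif).finditer(seq.upper())]
```

    For example, find_motif("ACGNNCGT", "TTACGTTCGTAA") reports a single hit starting at position 2. To cover the reverse strand as well, the same function can be applied to the reverse complement of the sequence.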
              
2.4. Single gene, 2 species, TFBS search (Phylogenetic Footprinting):

    Tip! rVISTA (regulatory VISTA) combines transcription factor binding sites (TFBS) database search with a comparative sequence analysis, thereby  reducing the number of predicted transcription factor binding sites by several orders of magnitude. As example, when comparing promoter sequences of a human gene and its orthologous mouse counterpart, it is possible to extract those TFBS that are conserved between the 2 species and therefore are expected to be functionally significant. Note that, in contrast to mVISTA, rVISTA works only for 2 input sequences. Note that you may access rVISTA at 2 different sites, at Lawrence Berkeley Lab. and at Lawrence Livermore Nat. Lab., as described in individual sections at the main page (LBL, LLL). At LLL, there are at least 3 different ways to run rVISTA (which can be also used as individual programs !), which are explained at the rVISTA start page (zPicture, ECR Browser, Genome Alignment). Please refer to the rVISTA chapter at LLL at the main page for detailed instructions on these programs and the rVISTA output. Please also refer to FAQ GENOM6 for further information concerning comparative genomics analyses.

    Tip! ConSite is a program (and web interface) that couples phylogenetic footprinting with regulatory site detection (mainly promoter comparison). ConSite is designed to compare 2 orthologous sequences and report conserved TFBS (Transcription Factor Binding Sites). Note that ConSite uses a database ("JASPAR") of TF profiles (PWMs) that was newly built from literature data, and that is therefore "independent" from existing databases like TRANSFAC. At the ConSite start page, you have 3 different options. Analyze orthologous pairs of genomic sequences lets you e.g. paste 2 promoter sequences of the "same" (orthologous) gene from human and mouse, and the program will generate the alignment. Analyze an existing alignment of 2 genomic sequences lets you use your pre-made alignment (CLUSTALW format) directly for the TFBS analysis. Analyze a single sequence lets you analyze a single promoter for TFBS (without performing cross-species comparison). This option is comparable to the TESS system, but utilizes the JASPAR profile collection instead of TRANSFAC. Please refer to the ConSite chapter at the main page for detailed instructions about parameters and options.
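
    The basic idea behind rVISTA and ConSite (keep only those sites that are found at the same place in both aligned sequences) can be sketched in a few lines of Python. Note that this is a strong simplification and my own assumption of how to illustrate it: the real programs use PWMs and conservation cut-offs, whereas here a site is a plain regular expression and "conserved" simply means "starts at the same alignment column in both sequences". Gaps are expected as "-".

```python
import re

def ungapped_to_column(aligned):
    """For each residue of the ungapped sequence, its alignment column."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]

def hits(pattern, seq):
    """0-based starts of all (possibly overlapping) matches."""
    return [m.start() for m in re.finditer("(?=" + pattern + ")", seq)]

def conserved_hits(pattern, aligned_a, aligned_b):
    """Alignment columns where the pattern starts in BOTH sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    cols_a = ungapped_to_column(aligned_a)
    cols_b = ungapped_to_column(aligned_b)
    seq_a = aligned_a.replace("-", "").upper()
    seq_b = aligned_b.replace("-", "").upper()
    start_cols_a = {cols_a[i] for i in hits(pattern, seq_a)}
    start_cols_b = {cols_b[i] for i in hits(pattern, seq_b)}
    return sorted(start_cols_a & start_cols_b)
```

    A site that matches in only one of the two species is dropped, which is exactly why the number of predicted sites shrinks so much compared to a single-sequence scan.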

2.5. Single gene, multiple species, TFBS search (Phylogenetic Footprinting):

    Tip! multiTF identifies transcription factor binding sites conserved across multiple species. There are 2 different ways to initiate a multiTF search, and I would suggest using MULAN, as this program is integrated in the same web-portal. Multiple sequence alignments generated by MULAN can be automatically submitted to multiTF from the results web page. The "handling" and output of multiTF is very similar to rVISTA, e.g. the user can set the parameters for detection of TFBS (like matrix similarity, individual TF selection). TFBS can be dynamically visualized along the sequences (similar display as in rVISTA but for multiple species). It is possible to list and display either ALL TFBS or only those which are conserved across ALL species. You may also highlight individual TFBS positions in the alignment. Taken together, MULAN and the interconnected tool multiTF somehow represent the "multi-species" equivalent to the system mVISTA-rVISTA, where rVISTA is based on the TF prediction for 2 aligned species (2 sequences). Please refer also to the main chapter describing the other comparative genomics programs of the Lawrence Livermore National Lab. NOTE: Now it is also possible to first locate your region of interest in ECR Browser, extract the pre-made alignments with other species, and finally, via the link "Synteny/Alignments", you may send ALL selected sequences to MULAN to generate phylogenetic trees and identify multi-species transcription factor binding sites via multiTF. NOTE: If you are specifically looking for a TFBS which is not contained in the TF database used (like TRANSFAC) but for which you have a certain consensus sequence (like WWCAAWG), you may scan the MULAN alignment for this pattern by using the option "User-defined consensus sequences" within the multiTF input window "Defining transcription factor binding sites".

    DiAlignTF displays transcription factor (TF) binding site (TFBS) matches within a multiple alignment. It is possible to display all TF binding site matches, TF binding site matches common to all or a subset of the input sequences, or common TF binding site matches that are located in aligned regions. The TF binding sites are visualized in the alignment as colored boxes. The input sequences are aligned with the multiple alignment program DiAlign. TF binding site matches are identified by MatInspector. DiAlign and DiAlignTF are part of the "GEMS Launcher"-section of the commercial Genomatix Suite. NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is not only a limitation in the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also in the functionality of the obtained data !

2.6. Single gene, multiple species, motif search (Phylogenetic Footprinting):

    Tip! FootPrinter was developed by the Computational Molecular Biology Group at the University of Washington. It is available at the respective software site either as a web service or as a downloadable program. Note that there is also a very good FootPrinter manual, explaining e.g. all the input parameters. Please note that FootPrinter is also implemented in the TOUCAN software package, please refer to the TOUCAN chapter for details. FootPrinter is a program that performs phylogenetic footprinting. It takes as input a set of unaligned orthologous sequences from various species, together with a phylogenetic tree relating these species. It then searches for short regions of the sequences that are highly conserved, according to a parsimony criterion. The regions identified are good candidates for regulatory elements. By default, the program searches for regions that are well conserved across all of the input sequences, but this can be relaxed to allow finding regions conserved in only a subset of the species. Please refer to the FootPrinter chapter at the main page for detailed instructions on input parameters, construction of phylogenetic trees, and on handling the different output formats. If you want to check whether the derived motifs correspond to known TFBS, please refer to section 2.9 below.

2.7. Multiple genes, 2 species, TFBS search (Co-regulation AND Phylogenetic Footprinting):

    Whole Genome rVISTA (beta version): This Web site provides access to the computational tool that allows for evaluation of which transcription factor binding sites (TFBS) are over-represented in upstream regions in a group of genes. This beta-version of the tool has been developed for the mm4 version of the Mouse genome (October 2003). A database of all TFBS in mm4 conserved in the alignment with the Human July 2003 (hg16) using rVISTA (regulatory VISTA) method was created. Input: genes of your interest using locus link IDs or RefSeq names (only Mouse !!!). The programs will calculate which TFBS located in 5Kbp upstream regions of these genes are over-represented (at the P-value cutoff 0.006) using all 5Kbp upstream regions of mm4 RefSeq genes as outgroup. You can also get a list of TFBS overrepresented in each individual gene of interest. NOTE: Test runs showed that the program tends to produce quite long lists of "over-represented" TFBS. It is hard to estimate which of these are actually biologically meaningful. NOTE: This program is similar to TELiS, but uses conserved TFBS between 2 species instead of TFBS from a single species, and counts the number of TFBS relative to all mouse genes (RefSeq) whereas TELiS specifically sets the TFBS of the sample in relation to the TFBS of all genes present on the microarray platform used.
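
    To get a feeling for what "over-represented at a P-value cutoff" means, here is a small Python sketch using the hypergeometric distribution. Please note that the site does not state here which statistical test is actually used, so the choice of the hypergeometric test is my assumption, and the numbers in the example are invented.

```python
from math import comb

def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` genes are picked without replacement
    from `population` genes, of which `successes` carry the TFBS."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1))
    return favourable / comb(population, draws)

# Invented example: 20000 reference genes, 1500 of them with the site;
# 12 of my 40 genes of interest carry it (about 3 expected by chance).
p = hypergeom_tail(20000, 1500, 40, 12)
is_overrepresented = p < 0.006
```

    Keep in mind that such a test is run for every TFBS in the database, so without a correction for multiple testing quite long lists of "over-represented" sites are to be expected.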

2.8. Multiple genes, multiple species, motif search (Co-regulation AND Phylogenetic Footprinting):

    Using TRES you can simultaneously search up to 20 promoter sequences (of maximum 1000 bp each) for known transcription factor binding sites, cis-acting elements, palindromic motifs or conserved k-tuples (phylogenetic footprints). This is useful for comparative promoter sequence analysis to elucidate common themes (modules) in functionally or phylogenetically related promoters. Please note that this program nevertheless does not cover the full functionality of others in the field of phylogenetic footprinting, therefore results using a sequence set of different species have to be taken with care. TF binding sites are searched from TRANSFAC database, from ooTFD database and also plant cis-acting elements from PLACE database. When searching for TRANSFAC weight matrices, you have to select a matrix cut-off, meaning "1.0" would be a perfect match, and reasonable values are not less than 0.90. There is also a nice function "Report sites only when present in at least 70, 80, 90 or 100 % of the sequences". 
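
    The meaning of the matrix cut-off and of the "present in at least x % of the sequences" filter can be sketched as follows. The exact scoring formula used by TRES is not described here, so I assume a simple linear rescaling of the summed matrix weights between the worst and the best possible site (1.0 = perfect match); the function names are made up for this example.

```python
def matrix_similarity(site, pwm):
    """Rescale the summed weights of `site` to 0..1 (1.0 = best possible).
    `pwm` is a list of columns, each a dict with keys A, C, G, T; it must
    not be completely flat, otherwise best and worst coincide."""
    score = sum(col[base] for base, col in zip(site, pwm))
    best = sum(max(col.values()) for col in pwm)
    worst = sum(min(col.values()) for col in pwm)
    return (score - worst) / (best - worst)

def scan(seq, pwm, cutoff=0.90):
    """Return (start, similarity) for every window reaching the cut-off."""
    width = len(pwm)
    found = []
    for i in range(len(seq) - width + 1):
        sim = matrix_similarity(seq[i:i + width], pwm)
        if sim >= cutoff:
            found.append((i, sim))
    return found

def shared_fraction(seqs, pwm, cutoff=0.90):
    """Fraction of sequences that contain at least one hit."""
    return sum(1 for s in seqs if scan(s, pwm, cutoff)) / len(seqs)
```

    A site would then be reported only if shared_fraction(...) reaches the chosen level, e.g. 0.7 for "at least 70 % of the sequences".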

    PhyloCon (Phylogenetic Consensus) is an algorithm that takes into account both conservation among orthologous genes from different species (Phylogenetic Footprinting), and co-regulation of genes within a species. PhyloCon first aligns (by use of the program Wconsensus) conserved regions of orthologous sequences into multiple sequence alignments, or profiles, and then compares profiles representing non-orthologous sequences (as e.g. in clusters derived from microarray data). Motifs are found as unusually well conserved substrings by comparative genomic analysis. Note that PhyloCon does not need the length of the motif a priori. There is currently no web interface for PhyloCon but the program can be downloaded as Linux executable at the Washington University.

2.9. Check if a predicted motif corresponds to a known TFBS:

    If you want to know if the over-represented motif is a known TF binding site, you should best check all of the following options: You may try a search against the TF-site table at TRANSFAC, see "Search TRANSFAC" in the TRANSFAC chapter ! As wildcard, use "*" (NOT "n" !). The retrieved hits are actually promoters, where this site is included, and under "BF" you will find "Binding Factors" that bind to this site. Alternatively you may perform this search at TESS.  Note that TESS uses older versions of TRANSFAC.  As "Search Field" choose "Sequence", and enter your pattern in the "text" field. Carefully analyze the individual hits. You may also feed MatInspector professional with short input sequences, but tests showed that these should be at least >10-12 bases long. If so, this option is possibly the best one ! Yet another option is the so-called Profile Comparison Tool provided at the site of the "alternative" JASPAR website of TF profiles (PWMs) which was newly built from literature data, and is therefore "independent" from existing databases like TRANSFAC. You may either paste your consensus sequence or your binding matrix and search for similar motifs in the JASPAR database. The output is very user-friendly displaying all hits as instructive multi-colored sequence logos. If you have a user-created profile instead of a consensus sequence, you may submit this profile and compare it to the profiles of TFBMs in the JASPAR database, using the option "Compare custom profile to database profile". Please refer to the JASPAR Help-section to see how profiles may look like.

2.10. Verify that a predicted motif is not "over-represented" in a randomly generated sequence set:

    There is a separate FAQ addressing this question, please refer to GEN10 !
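
    Just to illustrate the principle, here is a minimal Python sketch of such a control. It is not one of the programs described in GEN10: I simply assume that the random set is produced by shuffling each input sequence (which keeps length and base composition), and that the motif is given as a regular expression.

```python
import random
import re

def count_hits(pattern, seqs):
    """Total number of (possibly overlapping) matches in all sequences."""
    regex = re.compile("(?=" + pattern + ")")
    return sum(len(regex.findall(s)) for s in seqs)

def shuffled(seq, rng):
    """Random permutation of a sequence (same length and composition)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def empirical_p(pattern, seqs, trials=1000, seed=0):
    """Observed hit count, and the fraction of shuffled sets with at
    least as many hits (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = count_hits(pattern, seqs)
    as_extreme = sum(
        count_hits(pattern, [shuffled(s, rng) for s in seqs]) >= observed
        for _ in range(trials))
    return observed, (as_extreme + 1) / (trials + 1)
```

    A motif that occurs just as often in the shuffled sets (value close to 1) is most probably only a consequence of the base composition of your sequences.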

2.11. Produce sequence logos for over-represented motifs in the sequence set:
   
    This is actually a matter which is tightly associated with multiple sequence alignments, so please refer to question SIM4 for that purpose.




GEN6...quickly extract potential promoter sequences for a batch of human genes ? (last update May 29, 2006)
   
    This is in general a tricky question, because from many genes, the translation start (ATG) is known BUT NOT the exact transcription start site (TSS), which of course is the reference position to extract the promoter sequence, let's say to take -800 bp upstream and +100 bp downstream of genomic sequence. Today, several tools are available that address this specific question. There are 3 groups of programs:

1. Supporting batch-queries:
   
      Tip! PromoSer is a service for promoter extraction for human, mouse, and rat genes provided as part of the Gene Regulation Tools of the Zlab, which belongs to the Boston University Bioinformatics. PromoSer comes with a compact, but very instructive Help-file describing all the different options, making PromoSer one of the best tools for this purpose. As input, you can use lists of GenBank accession numbers (RefSeqs, mRNAs, and ESTs). There is no option to use e.g. Affymetrix IDs. You then define the region upstream and downstream of the TSS (Transcription Start Site) which you want to extract. Then, choose the "Quality" and the "Support" levels. The TSS "Quality" is a rating system (between 0 and 4) which describes the composition of the sequences that support this TSS (described in the Help-file). The extraction of alternative promoters is in fact a great feature allowing the user to select which of the mRNA sequences to define as reference for the location of the TSS. The option "only the one that is best supported and is 5' most" defines the TSS at the position which is best supported by RefSeq, mRNAs and ESTs. Otherwise, you may choose to extract only the promoter that starts 5' most (most aggressive extension). In the case of the presence of ESTs containing "5'-upstream first exons" as compared to the RefSeq, a totally different promoter may be extracted. The option "ignore all extension info and return the immedite upstream region" extracts the 5'-flanking genomic region relative to the supplied accession number, meaning that also single ESTs can be defined as reference point for the promoter definition.
    As output, PromoSer first presents the extracted sequences in the form of a table which is highly instructive as it lists the exact genomic positions, chromosome number, the quality level, the number of supporting sequences, and the "genomic extension", which means the amount of genomic sequence added at 5' (positive value) relative to the accession number provided. If the promoter is extracted at a downstream (3') position, a negative value is indicated. Finally, the promoter sequences can also be displayed (copied) as a FASTA sequence file, and thereby be transferred to other applications (like e.g. TOUCAN).
    A very nice example to test the different options is the human gene MMP26; just see what happens when you use the RefSeq NM_021801 or the EST accession BG189720, as input along with the different options of alternative promoters. You may directly use the extracted FASTA sequences for a BLAT search at UCSC, quickly revealing the genomic position of the individual sequences.

      Tip! BioMart: BioMart is a data retrieval tool that generates lists of biological objects (e.g. genes, SNPs) from data held in the Ensembl (and other) databases. NOTE that there are different web interfaces for BioMart, please refer to the BioMart main section for details. BioMart contains a 'query builder' interface to allow users to specify genomic regions, and refine the result set using filters. BioMart can generate a number of different types of output, including sequence and tabulated list data. Multiple output formats, including HTML, text and Microsoft Excel, are also supported.
    In order to retrieve potential promoter sequences, you may perform the following steps. At the Start Page you may select "Ensembl Genes". Then, you can provide your own list of e.g. Entrez Gene IDs, MIM IDs, RefSeq IDs, Affymetrix ProbeSet IDs (!), and many more at the Filter Page. Finally, at the Output Page, you then choose the "Sequences Page", where you have different options to select for 5' upstream regions (potential proximal promoter regions).
    Taken together, advantages of BioMart are batch submission and nice query options. But there is no method comparable to the versatility of e.g. PromoSer. In addition, there is no support from promoter prediction programs (which by the way is not a major drawback), and no integration of curated data from the EPD (Eukaryotic Promoter Database).

      Tip! TOUCAN: If you have installed this program package, then you can perform in-batch extraction of promoter sequences using a list of diverse identifiers (like LocusLink, Ensembl, RefSeq and many more), which works similarly to the BioMart sequence extraction (meaning the "conservative" way based on Ensembl Genes), yielding one output (promoter) per gene. By the way, this option therefore suffers from similar limitations as the ones mentioned at the BioMart description. This tool is especially useful if you want to make comparative analyses of transcription factor binding sites. Anyway, you may also only download the sequences as FASTA-file. Note that in version 2 (released Aug. 2004), you may download the orthologous promoter regions from MULTIPLE species "in-batch" !

    TRASER: The Transcript Sequence Retriever of Stanford University provides rapid (in a true sense, the program is VERY EASY TO HANDLE !) retrieval of transcript and upstream (putative promoter-containing) sequences for predicted human genome mRNAs. The underlying database is built using the human genome annotation files provided by the NCBI. The program accepts ONLY LocusLink IDs as input but allows batch-submission !  You can choose the length of sequence to retrieve. Note that the database is solely based on RefSeq sequences (no ESTs included), but is able to retrieve more than one upstream region for a gene in cases where several RefSeqs exist. NOTE that the output sequences follow the UPPER/lower case model for EXON1/upstream sequences. NOTE that there are 2 output formats, as FASTA sequence file, or as tab-delimited text (making it possible to e.g. paste the sequences into an EXCEL sheet of pre-existing data !). Again, there is no support from promoter prediction programs, and no integration of curated data from the EPD.

    Chip2Promoter: The module "Chip2Promoter" of the Genomatix suite performs automatic in-batch extraction of promoter sequences. For this purpose, different kinds of accession numbers can be used as input, including Affymetrix ProbeSet IDs (separated by spaces). The extracted promoter sequences can be downloaded for further use. Currently, Chip2Promoter is available for human, mouse, rat, and Arabidopsis. A big advantage is the possibility to perform batch-retrievals. Note that you only have 5 free runs (= 5 accession numbers) per month, as registered academic user. Genomatix has a "3 level rating system" of promoters (gold = experimentally verified: promoter described in the Eukaryotic Promoter Database (EPD) or promoter derived from mapping of full length cDNAs; silver = supported by PromoterInspector prediction; bronze = upstream region, 500 bp upstream and 100 bp downstream of an annotated transcript). Note that the program does NOT consider EST sequences !
       
   
    RSAT-Retrieve Sequence: allows the automatic extraction of 5'-flanking sequences (potential promoters) for your genes of interest. You have to choose the organism; in the case of human there are 2 different databases, "Homo sapiens" and "Homo sapiens EnsEMBL". In test runs, there was no big difference in the output between these two. The gene names must be separated by carriage returns, because only the first word of each line is considered as a query. Genes can be specified either by the systematic ORF identifier or by a common name. Synonyms are also supported.  Note that the option "prevent overlap with upstream ORFs" should be inactivated when working with eukaryotes. "From To" describes the limits of the region to retrieve. For upstream sequences, the default reference position is the ORF start* (and NOT the transcription start !). Negative coordinates are used to indicate sequences located upstream of the start codon; a reasonable pair of values could be: From -800 to -1. Note that you might want to re-check the obtained sequence via BLAT search at UCSC. *Please note that for genes which do NOT have the start ATG in the first exon, the correct promoter retrieval might be a problem, because in these cases the tool will retrieve sequence from the first intron, and NOT the promoter sequence !!! BUT NOW, the user can choose between different "Feature types", like CDS (Coding Sequence), mRNA, tRNA, etc. The advantage of using mRNA is that, if the mRNA is complete (which is not always the case), the upstream regions are retrieved relative to the transcription start site (TSS), rather than the start codon!!! If you want to see a nice example, you can try to extract the upstream sequence (e.g. -1000 to -1) of the gene "SELE" (E-Selectin), and compare the output when choosing "CDS" versus "mRNA" as "feature type".
    
2. Single queries but automated sequence extraction:
   
      Tip! DBTSS (Database of Transcriptional Start Sites) stores human sequences which were produced by the oligo-capping method to obtain full-length cDNAs. Sequence comparison between DBTSS and  reference sequence database, RefSeq, revealed that 34.2 % of RefSeq sequences should be extended towards the 5' ends. DBTSS (2006) contains 1.359.000 clones corresponding to 19.753 human RefSeqs. After clustering (of splice variants), these data correspond to 15.262 genes. For comparison, EPD (release 82) contains promoters for 1.767 human genes. DBTSS data suggest that approx. 55% of the human loci have two promoters or more. Therefore, it is essential to address the topic of Alternative Promoters (APs). DBTSS includes such predictions of APs in locus-specific result views. In addition, mutually homologous genes between human and mouse were determined and their promoters could be compared with each other. Using this information, DBTSS enables users to investigate what kind of sequence elements are contained in the promoters of their genes of interest and which of them are conserved between human and mouse. Also, users can search for promoters containing putative binding sites of particular transcription factors (TFs). Please refer to the section DBTSS - Search for TF Binding Site for details !
    DBTSS offers versatile query options: RefSeq ID, UniGene ID, EntrezGene ID, Gene Symbol, Ensembl Transcript ID, and more. The output consists of very instructive graphs showing the positions of RefSeqs and Ensembl-transcripts in relation to the positions of individual Oligo-capped cDNAs. The user then can select "the favourite reference position" for the TSS, either RefSeq, ENST, or the longest Oligo-capped cDNA, and download the potential promoter region. The only disadvantages are that there is NO batch query option, and ESTs are not included. In addition, not all of the human genes are supported by "oligo-capped cDNAs".

    FIE (version 2.0) is another tool to retrieve the region upstream and/or downstream of the 'start of exon 1' (Transcription Start Site, TSS) for a particular gene. This user-specified region requires the LocusLink  ID or Gene/Protein Name and Organism Type as well as the Upstream and Downstream length with respect to the 'start of exon 1'. This reference position is determined by the longest annotated mRNA (RefSeqs, which also include un-characterized potentially full-length mRNAs like 'DKFZ', 'KIAA', or 'FLJ'). Note that version 2.0 is considerably improved, as it lists all mRNA sequences individually, so the user can decide which upstream region to extract (which was not the case in version 1.1). "Ordinary" ESTs are not considered. NO batch retrieval option. Currently only available for human genes.

    PRESTA is a tool/database that combines EST databases and putative GenBank/EMBL promoters to yield datasets of predicted promoters at high accuracy.  A high stringency BLAST-search reveals ESTs that assist in transcription start-site verification. In principle, PRESTA would therefore be useful for promoter verification by mapping EST 5' ends. BUT: Limited query options (NO LocusLink IDs, NO RefSeq IDs etc.), NO batch query, NO user-definition of region to extract, many genes simply NOT included.  Solely based on ESTs, RefSeqs are not considered.

3. Single queries, "by hand" sequence extraction:
   
    Of course, there is also the possibility to extract promoter sequences "by hand"; for this purpose I would recommend the UCSC Genome Browsers (Human, Mouse, Rat). The best way is to start from the NCBI Entrez Gene entry of your gene of interest, and click on the UCSC-link at the top of the page. The position of the RefSeq sequence is shown in the genome browser window. Move left or right (via the "<" and ">" buttons) to see if the RefSeq or another mRNA or EST has the longest 5'-sequence. Click on this sequence, and write down the "Start on chromosome" (if the gene lies on the + strand) or the "End on chromosome" (if the gene lies on the minus strand). Go back to the browser window and enter this number (minus e.g. 1000 or plus e.g. 1000, respectively) in the "position" field on top of the page, along with the corresponding other end of the sequence to extract. Then, hit the link "DNA" on top of the browser window to retrieve the sequence. Don't forget to select "Reverse Complement" if the gene lies on the minus strand. In addition, you may try the "Extended case/color options".
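
    The coordinate arithmetic described above (minus for the + strand, plus for the minus strand, and reverse complement for the minus strand) is easy to get wrong, so here is a small Python sketch. I assume 1-based, inclusive chromosome positions (as typed into the "position" field) and that "tss" is the position of the first transcribed base; the function names are my own.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse complement of a DNA string (case is preserved)."""
    return seq.translate(COMPLEMENT)[::-1]

def promoter_interval(tss, strand, upstream=1000, downstream=0):
    """1-based inclusive (start, end) on the chromosome for the region
    from -upstream to +downstream around the TSS."""
    if strand == "+":
        return tss - upstream, tss + downstream - 1
    if strand == "-":
        # On the minus strand, "upstream" means higher coordinates.
        return tss - downstream + 1, tss + upstream
    raise ValueError("strand must be '+' or '-'")

def extract_promoter(chrom_seq, tss, strand, upstream=1000, downstream=0):
    """Cut the region out of a chromosome string, in 5'->3' gene direction."""
    start, end = promoter_interval(tss, strand, upstream, downstream)
    region = chrom_seq[start - 1:end]
    return reverse_complement(region) if strand == "-" else region
```

    For a + strand gene with the TSS at position 5000, promoter_interval(5000, "+", 800, 100) gives (4200, 5099); for a minus strand gene with the TSS at 5000, promoter_interval(5000, "-", 800, 100) gives (4901, 5800).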



       
GEN7...quickly see the binding site profiles of individual transcription factors ? (last update May 18, 2005)       

    In general, transcription factor binding sites (TFBS) are defined by a consensus sequence or, more accurately, by a Position-Weight Matrix (PWM), which represents the occurrence of each individual nucleotide at each position. TFBS are usually short stretches of sequence (mostly between 6 and 12 bases). Transcription factors are normally members of protein families that share a similar DNA binding specificity. Therefore, it is a good start to know the binding profiles of the major TF families, and then also the subtle differences between individual members.
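
    To make the difference between a consensus sequence and a matrix more concrete, here is a small Python sketch that builds a simple frequency matrix from a set of aligned binding sites and derives a consensus from it. The example sites and the 50 % threshold for calling a base are invented for illustration; databases like TRANSFAC or JASPAR use their own rules.

```python
from collections import Counter

def frequency_matrix(sites):
    """One dict per position with the relative frequency of A, C, G, T."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        matrix.append({b: counts[b] / len(sites) for b in "ACGT"})
    return matrix

def consensus(matrix, threshold=0.5):
    """Most frequent base per position, or N if none reaches `threshold`."""
    out = []
    for col in matrix:
        base, freq = max(col.items(), key=lambda kv: kv[1])
        out.append(base if freq >= threshold else "N")
    return "".join(out)

sites = ["TGACGTCA", "TGACGTAA", "TGATGTCA", "TTACGTCA"]
matrix = frequency_matrix(sites)
```

    Here consensus(matrix) gives "TGACGTCA", but the matrix additionally tells you that e.g. position 2 is a G in only 3 of the 4 sites, which is exactly the information that gets lost in a plain consensus.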

1. Resources based on TRANSFAC database:

    The cis-element information page is provided as part of the Gene Regulation Tools of the Zlab, which belongs to the Boston University Bioinformatics. The PWMs of the major classes of cis-regulatory elements and a short description of the respective transcription factors are listed.

   
    TRANSFAC is the most comprehensive database on eukaryotic transcription factors, their genomic binding sites and DNA-binding profiles. Please note that you have to register at BIOBASE (free for non-profit organizations) in order to gain access to the individual transcription factor information files. NOTE that still, there is no access to the latest versions of TRANSFAC which are only commercially available ! In order to Search TRANSFAC for the binding matrix of individual TFs, you should choose the Matrix-table, whereas if you want to gain information on a specific Transcription Factor, search in the TF-factor table. Don't forget to set the "table field to search in" to "All Fields" if you are not sure which field might correspond to your search term.

    Tip! The different TRANSFAC databases are also searchable via public SRS (Sequence Retrieval System) servers (without need to register !). Please also refer to other SRS descriptions as in RET2 or RET5. There are 5 different libraries related to TRANSFAC, all starting with "TF...". If you have a look at this list, you realize that the content of the TFMATRIX databases varies according to the different SRS servers, so choose one of these databases. Using the yellow "Search" button at the top right corner opens the "Search" - form, where you can simply enter keywords like "MAF". A list of hits in TFMATRIX is displayed which contain the word "MAF". A TFMATRIX entry displays the consensus binding site of a transcription factor as a matrix which often has been compiled from a series of experiments (like binding assays). This matrix shows the probability for each nucleotide to be present at each position of the sequence. Also, links to the transcription factors themselves (accessions starting with "T...") and individual binding sites ("R...") are available. By the way, this is a good example of why it is important to always look precisely at database entries. MAF_Q6 obviously has nothing to do with e.g. MAF_01 and seems to be a completely different protein. Note that if you retrieve no hits in the first run, you may select "*all entries*" as search field which will produce a list of ALL entries in the database. You can display the complete list in one window by adjusting the "Display Options" on the left side. You can then scan this list for your factor / matrix of interest.

    Alternatively, you may query TRANSFAC at TESS, but note that TESS uses older versions of the TRANSFAC database. The link "Matrices" allows a query against the TRANSFAC matrix table similar to the one described above.

2. Resources based on JASPAR database:

    Tip! JASPAR is a collection of transcription factor DNA-binding preferences, modelled as position-specific scoring matrices (PSSMs). The prime difference to similar resources (TRANSFAC, TESS etc) lies in non-redundancy and quality: JASPAR is a smaller set that is non-redundant and curated. All profiles are derived from published collections of experimentally defined transcription factor binding sites for multicellular eukaryotes. The database represents a curated collection of target sequences. The JASPAR Help-section describes in detail how the profiles are generated. NOTE: As access to TRANSFAC has been commercialized, and only the "public" version (which has not been updated for some years) is available for free, the open-access JASPAR database is a highly valuable resource. You may quickly identify the individual profiles by using the "Browse" or "Search" functions at the JASPAR start page ! Please refer also to the JASPAR section at the main page !
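
    To make the idea of a scoring matrix more concrete, here is a minimal sketch of how a count profile can be turned into log-odds scores and slid along a sequence. It is a generic textbook approach, not necessarily the procedure used by JASPAR or any tool built on it; the counts, the uniform background of 0.25, the pseudocount and all function names are assumptions made for this example.

```python
import math

ALPHABET = "ACGT"


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position counts into log2(probability / background) scores."""
    pwm = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        pwm.append(
            [math.log2(((c + pseudocount) / total) / background) for c in row]
        )
    return pwm


def best_hit(pwm, sequence):
    """Slide the matrix along the forward strand; return (score, start, site).

    Assumes the sequence contains only A, C, G and T.
    """
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][ALPHABET.index(base)] for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best


# Hypothetical counts (rows = positions, columns = A, C, G, T).
counts = [
    [12, 0, 3, 5],
    [0, 0, 20, 0],
    [1, 18, 0, 1],
    [10, 2, 2, 6],
]
pwm = log_odds_matrix(counts)
best = best_hit(pwm, "TTAGCATT")
print(best)
```

    The sketch scans the forward strand only; a real scan would normally also check the reverse complement and apply a score threshold.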

    A concise page summarizing familial binding models for major transcription factor classes is provided by the "alternative" JASPAR website. A multi-colored "Logo"-representation is shown allowing a quick impression of individual binding site profiles.

3. Resources based on TRANSFAC and JASPAR databases:

    TELiS is a very fast and very easy-to-use system to find transcription factor binding motifs (TFBMs) that are over-represented in promoters of differentially expressed genes. Two different TFBM databases can be used: the public TRANSFAC database version 3.2, or the open-access JASPAR database. Note that in TELiS, too, the matrices of all TRANSFAC and JASPAR TFBMs can be browsed, but only one after the other, which is less convenient than using the other options listed above. Please refer also to the TELiS section at the main page !
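
    What "over-represented" means can be illustrated with a simple binomial tail probability: if a motif occurs in a fraction p of all promoters, how likely is it to find it in at least k of n selected promoters by chance ? This is only a generic sketch and is not claimed to be the statistic TELiS computes; the numbers and the function name are invented for illustration.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical numbers: the motif is found in 10% of all promoters (p = 0.1),
# and in 8 of the 20 promoters of the differentially expressed genes.
p_value = binomial_tail(8, 20, 0.1)
print(f"P(at least 8 of 20 by chance) = {p_value:.2e}")
```

    A small tail probability suggests that the motif is more frequent in the gene set than the background frequency alone would explain; when many motifs are tested, a correction for multiple testing is needed as well.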
            


     
GEN8...detect regulatory elements in UTRs (UnTranslated Regions) in a whole-genome approach ? -> see RNA1 !           
           
     

     
GEN9...get the promoter/protein sequences of all proteins homologous to my query within a certain species ? -> see RET8 !
             
               

                            
GEN10...check how often a specific motif is present in a randomly generated sequence set ? (last update Jun. 3, 2005)       

      Often, when analyzing a group of genes (such as a cluster from a microarray experiment) for over-represented sequence patterns (Motif Discovery, GEN5), or when scanning this set for one specific motif (Motif Matching, GEN4), it is unclear whether the match count of a motif is really higher than in a random sequence set of the same size. To evaluate this, it is necessary to generate such random sequence sets without bias towards certain criteria, which normally creeps into "manual random" selections.
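
      The comparison described above can be sketched as an empirical p-value: count the motif in the query set, count it in many random sets of the same size, and ask how often the random sets reach the observed count. The sequences, the motif and the function names below are hypothetical, and the random sets use equiprobable nucleotides purely to keep the example short.

```python
import random


def count_matches(sequences, motif):
    """Count sequences containing the motif at least once (forward strand only)."""
    return sum(motif in seq for seq in sequences)


def random_set(n_seqs, length, rng):
    """Generate n_seqs random sequences with equiprobable nucleotides."""
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n_seqs)]


def empirical_p_value(observed, n_seqs, length, motif, n_sets=200, seed=0):
    """Fraction of random sets whose match count reaches the observed count."""
    rng = random.Random(seed)
    at_least = sum(
        count_matches(random_set(n_seqs, length, rng), motif) >= observed
        for _ in range(n_sets)
    )
    # Add-one correction so that the estimate is never exactly zero.
    return (at_least + 1) / (n_sets + 1)


# Hypothetical query set: three short sequences of equal length.
query = ["ACGTTGACTCAGG", "TTTTGACTCATTT", "ACGTACGTACGTA"]
motif = "TGACTCA"
observed = count_matches(query, motif)
p = empirical_p_value(observed, len(query), len(query[0]), motif)
print(observed, p)
```

      The smallest p-value this procedure can report is 1 / (n_sets + 1), so the number of random sets limits the resolution of the estimate.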

        Tip! Random Sequence, a tool which is integrated in the RSAT portal of regulatory sequence analysis, generates random DNA sequences according to various probabilistic models (Markov chains or independently distributed nucleotides). This tool is very useful if you want to verify the significance of results obtained by programs of Motif Discovery like Oligo-Analysis or programs of Motif Matching like DNA-Pattern. You can easily generate a random sequence set corresponding to your "query dataset", simply by selecting the same number of sequences and the same length. In addition, you may choose between 3 different models: "Equiprobable nucleotides" is the simplest model, where all nucleotides have the same prior probability. In "Independent nucleotides with distinct probabilities", a specific prior probability can be assigned to the nucleotides (AT and CG are grouped). This probability is constant over the sequence, i.e. each nucleotide is generated independently of the preceding and succeeding nucleotides. In "Markov chains (calibrated on intergenic frequencies)", the random sequence has the same oligonucleotide composition as observed in the intergenic regions of the selected organism. This is obtained by a Markov chain process, where nucleotide probabilities vary at each position, depending on the preceding nucleotides. Note that "oligonucleotide size" determines which expected oligonucleotide calibration table is used. The Markov chain order is this value minus one. For example, calibrating with hexanucleotides (oligonucleotide length = 6) means that the nucleotide at each position depends on the 5 preceding nucleotides. This is thus a Markov chain of order 5. Calibrating on single nucleotides (oligo length = 1) means that each nucleotide is chosen independently of the preceding one. This is thus a Bernoulli model (or Markov chain of order 0).
    NOTE:
The output sequence list provides direct links to follow-up procedures like Pattern discovery and Pattern matching ! In this way, the random sequence set can directly be scanned for the presence of a specific pattern, or used to predict "over-represented" patterns.
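
The three models described above can be sketched as follows. This is a minimal illustration and not RSAT's implementation: the Markov table is calibrated on a short made-up training string instead of real intergenic frequencies, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def equiprobable(length, rng):
    """All four nucleotides have the same probability."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def independent(length, at_fraction, rng):
    """A and T share at_fraction equally; C and G share the remainder."""
    cg = (1 - at_fraction) / 2
    weights = [at_fraction / 2, cg, cg, at_fraction / 2]  # order: A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def train_markov(training_seq, order):
    """Count which nucleotide follows each prefix of `order` nucleotides."""
    table = defaultdict(lambda: defaultdict(int))
    for i in range(len(training_seq) - order):
        table[training_seq[i:i + order]][training_seq[i + order]] += 1
    return table


def markov(length, table, order, rng):
    """Generate a sequence in which each base depends on the `order` previous ones.

    Assumes the table is non-empty (training sequence longer than `order`).
    """
    seq = list(rng.choice(list(table)))  # start from a randomly chosen observed prefix
    while len(seq) < length:
        prefix = "".join(seq[-order:]) if order else ""
        followers = table.get(prefix)
        if not followers:  # prefix never seen in training: uniform fallback
            seq.append(rng.choice("ACGT"))
            continue
        bases = list(followers)
        seq.append(rng.choices(bases, weights=[followers[b] for b in bases])[0])
    return "".join(seq[:length])


rng = random.Random(42)
training = "ACGTTGCAAGCTAGCTAGGATCCGATCGATTTACGCGCGATATAT"  # made-up training data
oligo_length = 3
order = oligo_length - 1  # Markov order = oligonucleotide length minus one
table = train_markov(training, order)
print(equiprobable(20, rng))
print(independent(20, 0.6, rng))
print(markov(20, table, order, rng))
```

With order 0 the table has a single empty prefix, so the Markov generator reduces to independent draws from the single-nucleotide frequencies of the training data, i.e. the Bernoulli case.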

        Tip! Random Genes, a tool which is integrated in the RSAT portal of regulatory sequence analysis, performs a random selection among the genes of a selected organism. The selection can be performed with or without replacement (when this option is activated, a gene can appear several times in the list). This program is useful for estimating the rate of false positives of pattern discovery programs. The program can also generate several groups of random genes, which can be used to simulate the results of clustering. The output is a two-column text. The first column gives the gene identifier (like "ENST00000248553" for Ensembl transcripts), the second column the group identifier (useful when several groups are exported). In addition, a link to Retrieve Sequences is provided, allowing you to extract e.g. 1 kb of upstream promoter sequence for each gene. Note that you may select different labels, like gene name, gene ID, both, or full identifier.
    NOTE:
The output sequence list provides direct links to follow-up procedures like Pattern discovery and Pattern matching ! In this way, the random sequence set can directly be scanned for the presence of a specific pattern, or used to predict "over-represented" patterns.
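
The random gene selection described above can be sketched as follows. This is not the RSAT code: the gene identifiers are placeholders, the function name is hypothetical, and the sketch assumes that "without replacement" applies within each group (the same gene may still turn up in two different groups).

```python
import random


def random_gene_groups(gene_ids, group_size, n_groups=1, replacement=False, seed=None):
    """Return (gene_id, group_id) rows for n_groups random gene selections."""
    rng = random.Random(seed)
    rows = []
    for g in range(1, n_groups + 1):
        if replacement:
            picked = rng.choices(gene_ids, k=group_size)  # a gene may repeat
        else:
            picked = rng.sample(gene_ids, group_size)  # each gene at most once
        rows.extend((gene, f"group{g}") for gene in picked)
    return rows


# Placeholder identifiers standing in for the genes of a selected organism.
genes = [f"GENE_{i:04d}" for i in range(1, 101)]
rows = random_gene_groups(genes, 5, n_groups=2, seed=7)
for gene, group in rows:
    print(f"{gene}\t{group}")  # two-column output: gene identifier, group identifier
```

Running a pattern discovery program on several such random groups, and counting how often it reports a "significant" pattern, gives a rough estimate of its false positive rate.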
         