Bioinformatics World    
 Main Index -> GENES
                -> Genes Linkpages
                -> Gene Prediction
                -> S/MAR Prediction
                -> Promoter Prediction
                -> Promoter Extraction
                -> Regulatory Sequence Analysis Workbenches
                -> Transcription Factor and Motif Databases
                -> TF Module Databases
                -> Regulatory Unit Databases
                -> TFBS Matching
                -> Motif Matching
                -> TF Module Matching
                -> Selected TF-Target Databases
                -> TFBS Discovery
                -> Motif Discovery
                -> TF Module Discovery
                -> Motif Design
                -> Repetitive Elements
Genes Linkpages
Linkpage 2 - WebGene
WebGene, provided by the Institute of Biomedical Technologies (ITB), is a collection of gene identification utility sites, including tools for polyA site and repetitive element prediction.
Linkpage 3 - Gene Prediction (Pasteur Institute)
The Pasteur site "Gene Prediction" is a highly recommended linkpage for gene prediction analysis, including GENSCAN, GRAIL, EMBOSS tools, and more.
Linkpage 4 - Zlab Gene Regulation Hub (Zlab, Boston University)
The Zlab Gene Regulation Hub is an excellent linkpage of resources related to bioinformatics of gene regulation (mostly transcriptional regulation). At the top you can find resources developed locally by members of the Zlab, and below external resources grouped by category (like Promoter Databases, Promoter Prediction, Motif Search, and Motif Cluster Search). 

Gene Prediction
NOTE: This section contains programs which predict gene structures in a 'de novo' fashion, meaning purely based on the analysis of genomic sequences. In contrast, resources which perform gene prediction via cDNA-genomic alignments (like AceView or ECgene) are described elsewhere.
GeneMark is a program for gene prediction (exons/introns etc.) available at EBI.
The GeneMark program assesses the coding potential of DNA sequences by using Markov models of coding and non-coding regions within a sliding window. This local approach is sensitive to local variations of coding potential and is able to show details of the coding potential distribution along with gene identification.
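The sliding-window idea can be sketched as follows. This is only a simplified illustration, not GeneMark's actual implementation: it assumes first-order Markov models trained on two toy sequences, and the names `train_markov` and `window_scores`, the window size, and the step are all hypothetical choices.

```python
import math

BASES = "ACGT"

def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | prev)."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in seqs:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
            for a in BASES}

def window_scores(seq, coding, noncoding, window=12, step=3):
    """Log-odds (coding vs. non-coding) for each window along the sequence."""
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        score = sum(math.log(coding[p][n] / noncoding[p][n])
                    for p, n in zip(w, w[1:]))
        scores.append((start, score))
    return scores

# Toy training data (assumption): GC-rich "coding", AT-rich "non-coding".
coding_model = train_markov(["GCGCGCGCGCGC"])
noncoding_model = train_markov(["ATATATATATAT"])
scores = window_scores("ATATATATATATGCGCGCGCGCGC", coding_model, noncoding_model)
for start, score in scores:
    print(start, round(score, 2))
```

Windows with a positive score look more like the "coding" training data, windows with a negative score more like the "non-coding" data.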
GENSCAN, provided by MIT, is one of the best-known programs for gene prediction. GENSCAN not only predicts exons and introns within a target sequence with high accuracy, but also predicts promoter regions.
Mirror: GENSCAN at DKFZ - unregistered use     
(CBS, Denmark)
NetGene2 performs prediction of intron splice sites in human, C. elegans and A. thaliana DNA. Output is a list of Splice donor and acceptor sites.
(CBS, Denmark)
The NetStart WWW server produces neural network predictions of translation start in vertebrate and Arabidopsis thaliana nucleotide sequences. NetStart has been trained on cDNA-like sequences and will therefore presumably have better performance for cDNAs and ESTs. It is not tested on genome data which may contain introns adjacent to the start codon.
Softberry - Gene prediction programs
The Softberry gene prediction site contains programs for splice site prediction, protein-coding exon prediction, gene model construction, and recognition of promoter and polyA regions; it includes programs like FGENES, FGENESH, BESTORF etc.

S/MAR Prediction
NOTE: This section contains programs which predict S/MARs (Scaffold/Matrix Attachment Regions) within DNA sequences. 
(Futuresoft Corp., USA)
The MAR-Wiz tool, provided by Futuresoft Corp., aims at discovering the presence of Matrix Association Regions, or MARs, within DNA sequences.
MARs are interspersed in genomes at an average spacing of 50-100 kb, and serve to anchor chromatin loops to the nuclear matrix. MARs have been shown to facilitate long-range chromatin remodeling and accessibility. MARs in general constitute a significant functional block within sequences and facilitate the processes of differential gene expression and DNA replication.

There are 2 different versions of MAR-Wiz:
MAR-Wiz 1.0 can process DNA sequences up to 100,000 bp long, and produces a static GIF plot.
MAR-Wiz 1.5 can process DNA sequences up to 500,000 bp long, and displays the result both as a GIF plot and as a dynamic Java applet.

NOTE: A free registration is needed in order to run this tool.
NOTE: In test runs, the "file upload" option for sequences worked much better than the "copy/paste" option !!!
(Genomatix Inc., Munich, Germany)
SMARTest is a software tool that utilizes a proprietary S/MAR-associated model based on weight matrices to test genomic DNA sequences for the occurrence of potential regions of S/MARs (Scaffold/Matrix Attachment Regions). Training sequences for generation of the S/MAR matrix library of SMARTest were selected from the EMBL database, from literature and from the S/MAR database S/MARt DB.
Note: the SMARTest-library contains only weight matrices that are associated with the AT-rich class of S/MARs. Therefore, the current version of SMARTest can only predict this class of S/MARs.  

NOTE: SMARTest is part of the "GEMS Launcher" section of the commercial Genomatix Suite.
NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is not only a limitation in the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also in the functionality of the obtained data!

Promoter Prediction
Dragon Promoter Finder 
(KEL, Singapore)
DRAGON Promoter Finder is part of the portal DRAGON Genome Explorer of the Knowledge Extraction Lab (KEL), Singapore. It searches for and locates promoters in anonymous genomic size DNA sequences. The program attempts to recognize the exact location of the transcription start site (TSS), i.e. the +1 position relative to the TSS. The analysis is strand specific. The main difference to previous versions in the algorithm is that the models are now specialized for C+G-rich and for C+G-poor sequences. In one of the additional features that allow for the analysis of the selected region around the predicted TSS, the TFSEARCH  tool is replaced by a tool based on the weight matrices from the TRANSFAC 6.0 database.

The search for the potential TSSs is made from the 5’ end toward the 3’ end of a DNA sequence for either the direct or the reverse complement strand. A data-window containing 200 nt (or 250 nt for very high accuracy setting) slides along the sequence. The content of the data-window is assessed by the program as either belonging to a promoter region or a non-promoter region. DPF assumes that the TSS is a reference point representing the nucleotide on position 0. This nucleotide is the first nucleotide transcribed. The algorithm requires at least 150 nt (or 200 nt for very high accuracy setting) in a sequence upstream of the TSS and at least 49 nt downstream of the TSS. For these reasons any TSS located in the first 150 nt (or 200 nt for a very high specificity setting) of the sequence and the last 49 nt of the sequence cannot be precisely detected by the algorithm. Thus the algorithm makes no predictions of such TSSs. A sequence analyzed has to be at least 200 nt (or 250 nt for a very high specificity setting) in length.
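The length constraints above translate into a simple range check. The following is a minimal sketch, assuming 1-based positions and the flank sizes stated in the text; the function name `predictable_tss_range` is hypothetical and not part of DPF.

```python
def predictable_tss_range(seq_len, upstream=150, downstream=49):
    """Return (first, last) 1-based positions at which a TSS can be predicted,
    or None if the sequence is too short for the data-window."""
    first = upstream + 1          # needs `upstream` nt before the TSS
    last = seq_len - downstream   # needs `downstream` nt after the TSS
    if last < first:
        return None
    return first, last

print(predictable_tss_range(200))                 # minimal sequence length
print(predictable_tss_range(199))                 # too short
print(predictable_tss_range(1000, upstream=200))  # stricter setting
```

With the default flanks, a 200 nt sequence leaves exactly one candidate position (151), which matches the stated minimum sequence length.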

The output file contains the locations of the predicted TSSs. The locations of the predicted TSSs on the reverse complement strand are given with the sign '-'. The number shown indicates the nucleotide position counted from the 5’ end of the direct strand.

The program also includes very nice follow-up analysis methods, like BLAST against the EPD, prediction of TF sites, and a scan for ATGs.
Eponine, developed at Sanger Institute, is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognizing specific sequence motifs. Each of these is associated with a position distribution relative to the transcription start site.
These models consist of 4 elements (see also figure on Eponine web page):
- a diffuse preference for CpG enrichment downstream of the TSS (Transcription Start Site). This corresponds with the observation that promoters are often associated with a CpG island.
- a TATAAA motif with focused distribution centered at position -30 relative to the TSS (corresponding to the well-known TATA box).
- 2 GC-rich matrices (GCGCG and GC) closely flanking the TATA box.

The Eponine web server only accepts sequences shorter than 1Mb. Due to edge effects, the program is likely to miss start sites within 200 bases of the start and end of the sequence.

Results are presented in GFF format. A simple list of TSS positions, together with the predicted strand is shown.
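A GFF result list of this kind can be read with a few lines of code. The sketch below assumes the standard tab-separated GFF columns (seqname, source, feature, start, end, score, strand, frame); the example line is invented for illustration and is not real Eponine output, and both function names are hypothetical. The second helper marks the 200-base margins in which start sites are likely to be missed.

```python
def parse_gff_line(line):
    """Parse one tab-separated GFF line into a small dictionary."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand = fields[:7]
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
    }

def blind_regions(seq_len, margin=200):
    """1-based intervals near the sequence ends where TSSs may be missed."""
    return [(1, margin), (seq_len - margin + 1, seq_len)]

# Hypothetical example line, for illustration only.
record = parse_gff_line("seq1\tEponine\tTSS\t1234\t1234\t0.998\t+\t.")
print(record["start"], record["strand"], record["score"])
print(blind_regions(5000))
```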

Like PromoterInspector, Eponine is largely strand-independent (although the strand is indicated in the output list), as promoter predictions in general show poor strand specificity.
Note: Data show that the TATA box alone has little or no predictive power for detecting TSSs in genomic DNA. The Eponine model suggests that it is the combination of a TATA box with CG-rich "flanking" signals and an overall enrichment in CpG dinucleotides which gives the best indication that a TSS may be present.
Personal remark: It is well known that a considerable number of promoters do NOT contain a TATA box. It may be a short-sighted approach to focus on the presence of the TATA box.

FirstEF (First Exon Finder), provided at Cold Spring Harbor Labs, is a 5' terminal exon and promoter prediction program. It consists of different discriminant functions structured as a decision tree. The probabilistic models are optimized to find potential first donor sites and CpG-related and non-CpG-related promoter regions based on discriminant analysis. For every potential first donor site (GT) and an upstream promoter region, FirstEF decides whether or not the intermediate region can be a potential first exon, based on a set of quadratic discriminant functions. FirstEF calculates the a posteriori probabilities of exon, donor, and promoter for a given GT and an upstream window of length 570 bp. Taken together, FirstEF shows predicted positions of promoters, first exons, and CpG islands.

NOTE: FirstEF predictions are also presented in the UCSC Genome Browser display ("Expression and Regulation" tracks):
Three types of predictions are displayed: exon, promoter and CpG window. If two consecutive predictions are separated by less than 1000 bp, FirstEF treats them as one cluster of alternative first exons that may belong to the same gene. The cluster number is displayed in the parentheses of each item. For example, "exon(405-)" represents the exon prediction in cluster number 405 on the minus strand. The exon, promoter and CpG window are interconnected by this cluster number. Alternative predictions within the same cluster are denoted by "#N" where "N" is the serial number of an alternative prediction in the cluster.
Each predicted exon is either CpG-related or non-CpG-related, based on a score of the frequency of CpG dinucleotides. An exon is classified as CpG-related if the CpG score is greater than a threshold value, and non-CpG-related if less than the threshold. If an exon is CpG-related, its associated CpG-window is displayed. The UCSC browser displays features with higher scores in darker shades of gray/black.
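The clustering rule (consecutive predictions less than 1000 bp apart share a cluster) can be illustrated with a short sketch. It simplifies each prediction to a single position, which is an assumption; real predictions are intervals, and the function name `cluster_predictions` is hypothetical.

```python
def cluster_predictions(positions, max_gap=1000):
    """Group sorted positions; a gap below max_gap keeps the same cluster."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

print(cluster_predictions([100, 900, 2500, 3400, 10000]))
```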
NNPP - Neural Network Promoter Prediction
NNPP, provided in the context of the Berkeley Drosophila Genome Project,  is a widely used method that finds eukaryotic and prokaryotic promoters in a DNA sequence.

NNPP uses a novel technique that combines neural networks with weight pruning. A neural network is trained to recognize promoter elements (TATA-box, GC-box, CAAT-box, Transcription Start Site) until it reaches a local minimum. Then the pruning procedure deletes those weights in the network that add the lowest predictive value to the overall prediction. After pruning, the neural network is retrained until it is stuck again in a minimum. This procedure is repeated until a defined error level is reached. Eventually, the pruned neural network gives clues about the importance of specific positions in the promoter element by studying the remaining weights.
NNPP combines these single predictions for each element using time-delay neural networks for a complete promoter site prediction. TDNNs are appropriate for recognizing promoter elements because they are able to combine multiple features, even those that appear at different relative positions in different sequences. Subsequent analysis of the weight matrices in these TDNNs reveals the importance of the various elements.

Output: simply a list of predicted (core) promoter sequences with the predicted TSS indicated. Note that NNPP predicts very short sequences (only 50 bp) in proximity to the TSS. There are no further options for follow-up analyses.
Note that test runs showed that NNPP is significantly less stringent than other promoter prediction programs, which results in a higher number of potential promoter sequence regions.
(CBS, Denmark)
Promoter2.0 predicts transcription start sites of vertebrate PolII promoters in DNA sequences. It has been developed as an evolution of simulated transcription factors that interact with sequences in promoter regions. It builds on principles that are common to neural networks and genetic algorithms.

NOTE: Test runs showed that Promoter 2.0 seems to be very stringent, as in total only very few promoters were predicted.
(Genomatix Inc., Munich, Germany)
PromoterInspector performs prediction of promoter regions in mammalian genomic sequences. Prediction is based on context specific features which were extracted from training sequences (all mammalian sequences) by a heuristic free approach. The novel idea of the PromoterInspector approach is the way of feature definition: Features are defined by equivalence classes of IUPAC groups which allow a fuzzy description of the promoter context. A prediction is based on the analysis of feature frequencies. PromoterInspector is the first tool which is able to predict promoter regions with high specificity in large genomic sequences.

Output: The output only shows the sequence position, not the transcription factor details.

NOTE: As PromoterInspector is now part of the GenomatixSuite, limitations for the use of the program have now changed to 10 analyses per month. Genomatix has termed the free academic access "evaluation account". Note that there is not only a limitation in the number of analyses but also in the functionality of the obtained data !
Promoter Scan
Promoter Scan performs prediction of promoter regions via comparison to eukaryotic Pol II promoter sequences. PROMOTER SCAN is best used to find regions in primary DNA sequence that might be good candidate regions to further test for promoter functionality.

The results show the location of predicted promoter sequences. Predicted sequence regions are regions of DNA that contain a significant number and type of transcriptional elements (TEs) that are usually associated with Pol II promoter sequences. Reported putative promoters are those regions of your sequence that score past a predetermined cutoff score set to recognize 70% of primate promoter sequences in the Eukaryotic Promoter Database. At this cutoff score, false positive predictions occur at a rate of approximately one in every 14,000 single strand bases. These predictive estimates are based upon experimental test sets of promoter and non-promoter sequences; you may find different results.
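The stated false positive rate gives a quick back-of-the-envelope estimate for a query sequence. This sketch assumes that both strands are scanned, so that each base pair counts as two single-strand bases; the function name is hypothetical.

```python
def expected_false_positives(seq_len, strands=2, bases_per_false_positive=14000):
    """Rough expected number of false positive promoter predictions."""
    return seq_len * strands / bases_per_false_positive

print(expected_false_positives(70000))  # 70 kb scanned on both strands
```

As the text notes, these rates come from specific test sets, so the estimate is only a rough guide.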

Promoter Scan, if it finds a putative promoter sequence, reports the sequence range in which the putative promoter is found. It then reports if a TATA box was found, and if so makes an estimate of the Transcription Start Site (TSS) position from the TATA position. Both the TATA box location and the estimated TSS are reported. In test sets, 72% of promoters recognized by PROMOTER SCAN have a recognized TATA box, and in those cases the reported TSS is within ±10 bases of the actual TSS.

Significant signals (most of them transcriptional elements) are also reported. The transcription factor name (or in its absence, the Ghosh site name) is reported, as well as the TFD # (Ghosh TFD database reference number), strand, position, and significance weighting. It is important to realize that the signal weight DOES NOT reflect the quality of the signal; instead it is a relative weighting based upon that particular signal's ability to discriminate promoter from non-promoter sequences, and is based upon the relative frequency with which that signal is found in promoter versus non-promoter sequences. For example, in the sample sequence, at position 79 there is an NF-kB site reported with a weight of 1.094000. This reflects the fact that NF-kB is found approximately 1.094 times more often in promoter sequences than in non-promoter sequences; in other words, it is not a very useful discriminator. On the other hand, an Sp1 site at position 108 has a weight of > 6. That particular definition of Sp1 is found about 6 times more frequently in promoter than in non-promoter sequences. A weight of 50 means that the signal is found ONLY in promoter sequences (in the test sets used so far). The relationship between a signal's weight and the quality of the signal is not known.

You will also find multiple sites of the same binding factor at the same location with different weights. These reflect the different consensus or specific signals used in the signal database. For example, a signal for a TFIID site might be "TATA", while another TFIID site definition might be "ATATAAT". Both TFIID site definitions would be reported for the sequence "GATATAATC", however only the first definition would be reported for the sequence "GTATAC".
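The TFIID example can be reproduced with plain substring matching. This is only an illustration of why one location can yield several hits; the dictionary `SITE_DEFINITIONS` and the function name are hypothetical, positions are 0-based, and Promoter Scan's own matching and weighting are not modelled.

```python
# Two alternative definitions of the same factor, as in the example above.
SITE_DEFINITIONS = {
    "TFIID (TATA)": "TATA",
    "TFIID (ATATAAT)": "ATATAAT",
}

def matching_definitions(seq):
    """Return {definition name: [0-based start positions]} for all hits."""
    hits = {}
    for name, pattern in SITE_DEFINITIONS.items():
        positions = [i for i in range(len(seq) - len(pattern) + 1)
                     if seq[i:i + len(pattern)] == pattern]
        if positions:
            hits[name] = positions
    return hits

print(matching_definitions("GATATAATC"))  # both definitions match
print(matching_definitions("GTATAC"))     # only the shorter definition matches
```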

PromoterScan on Zeon       

Promoter Extraction
Promoter extraction tools embedded in integrated sequence analysis suites
Several program packages designed for promoter analysis contain modules for the in-batch extraction of promoter sequences from databases. These packages are described in detail at other places, but are mentioned here to complete the list.
1. TOUCAN: please refer to the TOUCAN chapter, in particular to the point "Promoter Sequence Retrieval".
2. RSAT: please refer to the RSAT chapter, in particular RSAT-Retrieve Sequence.
3. BioMart: please refer to the BioMart chapter.

NOTE: Please also refer to the FAQ GEN6 for a detailed comparison of the different tools for promoter extraction.
DBTSS - Database of Transcriptional Start Sites
(Tokyo University)
DBTSS contains exact information on the genomic positions of the transcriptional start sites and the adjacent promoters for most of the annotated human and mouse genes. Most of the cDNA sequences stored in current databases lack precise information on the 5' termini. To overcome this difficulty, DBTSS stores human sequences which were produced by the oligo-capping method to obtain full-length cDNAs. Sequence comparison between DBTSS and the reference sequence database RefSeq revealed that 4,802 (34.2 %) of RefSeq sequences should be extended towards the 5' ends.
DBTSS (2006) contains 1,359,000 clones corresponding to 19,753 human RefSeqs. After clustering (of splice variants), these data correspond to 15,262 genes. For comparison, EPD (release 82) contains promoters for 1,767 human genes.

Special features of DBTSS:
- DBTSS data suggest that approx. 55% of the human loci have two promoters or more. Therefore, it is essential to address the topic of Alternative Promoters (APs). DBTSS includes such predictions of APs in locus-specific result views.
- In addition, mutually homologous genes between human and mouse were determined and their promoters could be compared with each other. Using this information, DBTSS enables users to investigate what kind of sequence elements are contained in the promoters of their genes of interest and which of them are conserved between human and mouse.
- Also, users can search for promoters containing putative binding sites of particular transcription factors (TFs). Please refer to the section DBTSS - Search for TF Binding Site for details !

DBTSS offers very good query options: RefSeq ID, UniGene ID, LocusLink ID, Gene Symbol, Ensembl Transcript ID, and more.
The output consists of very nice graphs showing the positions of RefSeqs and Ensembl-transcripts in relation to the positions of individual Oligo-capped cDNAs. You then can select "your favourite reference position" for the TSS, either RefSeq, ENST, or the longest Oligo-capped cDNA, and download the potential promoter region.

Note that DBTSS does NOT support batch queries !
Note that ESTs are NOT represented in DBTSS !   
Mirror: DBTSS mirror in Germany.
EPD - Eukaryotic Promoter Database
The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries.
The recent (2006, release 83) version of EPD now includes "preliminary promoter entries" as a new class of EPD entries, in order to cope with the growing size of datasets containing TSS (Transcription Start Site) information. As a consequence, EPD entries now are not solely derived from peer-reviewed paper articles, but are computationally generated applying built-in quality controls. Note that still, gene coverage is very low in EPD, as compared to resources like DBTSS or PromoSer !

Query EPD:
- It is only possible to perform SRS-like keyword searches directly at the EPD database site, but NO sequence searches (like BLAST).
- But there is a link to BLAST the EPD at
- It is possible to download the promoter sequences into a simple FASTA file.

- Note that EPD promoter files usually contain only the sequence range -500 to +100 relative to the TSS.
- The annotation part of an entry includes description of the initiation site mapping data, cross-references to other databases, and bibliographic references.
FIE - 5'-End Information Extraction
(KEL, Singapore)
FIE, which is part of the portal DRAGON Genome Explorer of the Institute for Infocomm Research (I2R), Singapore, is a tool to retrieve the region upstream and/or downstream of the 'start of exon 1' (Transcription Start Site, TSS) for a particular gene. This user-specified region requires the LocusLink  ID or Gene/Protein Name and Organism Type as well as the Upstream and Downstream length with respect to the 'start of exon 1'. This reference position is determined by the longest annotated mRNA (RefSeq AND others). ESTs are not considered. NO batch retrieval option. Currently only available for human genes.
NOTE: Version 2.0 is considerably improved, as it lists all mRNA sequences (RefSeqs, which also include uncharacterized potentially full-length mRNAs like 'DKFZ', 'KIAA', or 'FLJ') individually, so the user can decide which upstream region to extract.
(Genomatix Inc., Munich, Germany)
Gene2Promoter is part of the commercial Genomatix suite of products. Gene2Promoter allows for automated extraction of groups of promoters from a list of accession numbers or gene IDs. Gene2Promoter is an optional module for ElDorado. If you need large scale extraction of promoter sequences, please have a look at GPD (the Genomatix Promoter Database).

You can query Gene2Promoter using human DNA accession numbers, like genomic sequences including a potential promoter region, but also with cDNAs  like RefSeq accessions. It is NOT possible to use copy/paste sequences as query.
NOTE that you only have 5 free runs (with at most 5 accession numbers each) per month !!! Genomatix has termed the free academic access "evaluation account". Note that there is not only a limitation in the number of analyses but also in the functionality of the obtained data !
GPD - Genomatix Promoter Database
(Genomatix Inc., Munich, Germany)
Genomatix Promoter Database (GPD) is part of the commercial Genomatix suite of products. With GPD, Genomatix claims to offer the "most complete eukaryotic promoter database" and the "only one containing promoters for alternative transcripts". Promoter extraction via GPD is available for entire organisms or for microarray platforms (like Affymetrix arrays). There are three possible quality levels (gold-silver-bronze) assigned to each transcript which is associated with a promoter.

Options include:
- pre-made annotation for transcription factor binding sites (EXCEL sheet)
- pre-made annotation for promoter modules (combinations of TFBS) (EXCEL sheet)
- module descriptions and TFBS matrix descriptions (txt files)

NOTE: Access to the GPD is exclusively commercial, and not part of the free academic "evaluation account".
If you only need small groups of promoter sequences you may want to use Gene2Promoter instead.
PRESTA - PRomoter EST Association
(Academy of Sciences of the Czech Republic)
PRESTA is a tool/database that combines EST databases and putative GenBank/EMBL promoters to yield datasets of predicted promoters at high accuracy.  PRESTA was developed at the Institute of Entomology, Academy of Sciences of the Czech Republic.
A high-stringency BLAST search reveals ESTs that assist in transcription start site verification.
In principle, PRESTA would therefore be useful for promoter verification by mapping EST 5' ends.  BUT: Limited query options (NO LocusLink IDs, NO RefSeq IDs etc.), NO batch query, NO user-definition of region to extract, many genes simply NOT included.  Solely based on ESTs, RefSeqs are not considered.

PRESTA can be either used as a 1) standalone Windows programme or as a 2) searchable public database, described under the section "About".
2.1. Searchable public database: The PRESTA algorithm has been used on the complete sets of human and mouse promoters to extract databases of curated promoters. Subsets of these databases can be extracted via EST quality parameters, via tissue origin, or via gene name, GB accession numbers or EST accessions. 
2.2. Download PRESTA databases: You can download the complete sets of human and mouse promoters into simple FASTA text files. NOTE that these entries only comprise the "pure" upstream 5'-flanking regions and not (what PRESTA calls) the downstream sequence tags (which are the first bases of the transcribed sequence). In addition, the entries just have one number or word in the definition line (referring to the "LOCUS" of an entry), so if you want to know more about one entry you may:
- Search with this term within PRESTA, at the "alternate query" page under "Gene GenBank/EMBL Acc" to retrieve the full database entry, including the downstream sequence tag and the list of relevant 5' ESTs.
- Search at NCBI-ENTREZ, under "Search Nucleotide for:"

NOTE: Compare with EPD - Eukaryotic Promoter Database !
(Zlab, Boston University)
PromoSer is a service for promoter extraction for human, mouse, and rat genes provided as part of the Gene Regulation Tools of the Zlab, which belongs to the Boston University Bioinformatics.

PromoSer comes with a compact, but very instructive Help-file describing all the different options, making PromoSer one of the best tools for this purpose.

1. Query:

- You can use lists of GenBank accession numbers as input (RefSeqs, mRNAs, and ESTs). There is no option to use e.g. Affymetrix IDs. 
- Define the region upstream and downstream of the TSS (Transcription Start Site) which you want to extract.
- Choose the "Quality" and the "Support" levels. The TSS "Quality" is a rating system (between 0 and 4) which describes the composition of the sequences that support this TSS (described in the Help-file).
- Extraction of alternative promoters: This is in fact a great feature allowing the user to select which of the mRNA sequences to define as reference for the location of the TSS. The option "only the one that is best supported and is 5' most" defines the TSS at the position which is best supported by RefSeq, mRNAs and ESTs. Otherwise, you may choose to extract only the promoter that starts 5' most (most aggressive extension). If there are ESTs containing "5'-upstream first exons" as compared to the RefSeq, a totally different promoter may be extracted. The option "ignore all extension info and return the immediate upstream region" extracts the 5'-flanking genomic region relative to the supplied accession number, meaning that single ESTs can also be defined as reference point for the promoter definition.

2. Output:
- Result table: First, the extracted sequences are presented in the form of a table which is highly instructive as it lists the exact genomic positions, chromosome number, the quality level, the number of supporting sequences, and the "genomic extension", which means the amount of genomic sequence added at 5' (positive value) relative to the accession number provided. In case that the promoter is extracted at a downstream (3') position, a negative value is indicated.
- FASTA sequences: Finally, the promoter sequences can also be displayed (copied) as a FASTA sequence file, and thereby be transferred to other applications (e.g. TOUCAN).
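The region definition used in the query step (a number of bases upstream and downstream of the TSS) can be sketched generically. This is not PromoSer's code; it assumes a 1-based TSS position, counts the TSS base as the first downstream base, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def promoter_region(chrom_seq, tss, strand, upstream, downstream):
    """Extract `upstream` bases before and `downstream` bases from the TSS on.

    tss is the 1-based genomic position of the first transcribed base.
    """
    if strand == "+":
        start = max(tss - 1 - upstream, 0)
        end = min(tss - 1 + downstream, len(chrom_seq))
        return chrom_seq[start:end]
    # Minus strand: upstream lies at higher genomic coordinates.
    start = max(tss - downstream, 0)
    end = min(tss + upstream, len(chrom_seq))
    return reverse_complement(chrom_seq[start:end])

toy = "AACCGGTTAC"
print(promoter_region(toy, 5, "+", upstream=2, downstream=3))
print(promoter_region(toy, 5, "-", upstream=2, downstream=3))
```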
The Transcript Sequence Retriever (TRASER) provides rapid retrieval of transcript and upstream (putative promoter-containing) sequences for predicted human genome mRNAs. The underlying database is built using the human genome annotation files provided by the NCBI.

The program accepts ONLY LocusLink IDs as input but allows batch submission! You can choose the length of sequence to retrieve.
NOTE that the database is solely based on RefSeq sequences, but is able to retrieve more than one upstream region for a gene in cases where several RefSeqs exist.
NOTE that the output sequences follow the UPPER/lower case model for EXON1/upstream sequences.
NOTE that there are 2 output formats, as FASTA sequence file, or as tab-delimited text (making it possible to e.g. paste the sequences into an EXCEL sheet of pre-existing data !).
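Output that follows the UPPER/lower case model can be split back into its two parts. The sketch assumes that the lowercase upstream sequence comes first and the uppercase exon 1 sequence follows; the function name and the example sequence are hypothetical.

```python
def split_upstream_exon(seq):
    """Split 'acgt...ACGT...' into (lowercase upstream, UPPERCASE exon 1)."""
    for i, base in enumerate(seq):
        if base.isupper():
            return seq[:i], seq[i:]
    return seq, ""  # no uppercase part found

upstream, exon1 = split_upstream_exon("acgtacGTTGCA")
print(upstream, exon1)
```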
Regulatory Sequence Analysis Workbenches
NOTE: This section lists resources which provide a common interface for the in-depth analysis of regulatory sequences using several software modules. Although these modules may be categorized under different sections, they are described here as part of a common workbench. 
Genomatix -  Overview
(Genomatix Inc., Munich, Germany)
Genomatix is a company based in Munich, Germany, which offers at its Productmap software, databases and services aimed at understanding gene regulation at the molecular level.
The Genomatix Suite offers a wide range of tools and databases to predict and analyze promoter regions, to compare patterns of transcription factor binding sites within regulatory regions, or even to provide a complete portal to the human genome including other gene information, like functions, literature, cross-citations and many more !

Genomatix has termed the free academic access "evaluation account". Note that there is not only a limitation in the number of analyses but also in the functionality of the obtained data ! ALSO NOTE THAT THE AMOUNT OF FREE RUNS FOR ElDorado OR Chip2promoter IS MUCH TOO LOW TO USE THESE DATABASES REGULARLY !
RSAT - Regulatory Sequence Analysis Tools
(SCMBB, Brussels University)


Regulatory Sequence Analysis Tools (RSAT) was created at the SCMBB, Brussels University. The start page also lists RSAT mirrors (for example in Sweden), in case one site is down.

RSAT consists of a series of modular computer programs (excellent tutorials available) specifically designed for the detection of regulatory signals in intergenic sequences. The only input required is a list of genes of interest (e.g. a family of co-regulated genes). From this information, you can retrieve the upstream sequences over a desired distance, discover putative regulatory signals, search the matching positions for these signals in your original dataset or in whole genomes, and display the results graphically in the form of a feature map. Each tool is presented as a form to fill. For each form, a help page provides detailed information about the parameters.

1. RSAT-Retrieve Sequence: allows the automatic extraction of 5'-flanking sequences (potential promoters) for your genes of interest. You have to choose the organism; in the case of human there are 2 different databases, "Homo sapiens" (NCBI RefSeq sequences) and "Homo sapiens EnsEMBL" (EnsEMBL sequences). In test runs, there was no big difference in the output between these two.
The gene names must be separated by carriage returns, because only the first word of each line is considered as a query. Genes can be specified either by the systematic ORF identifier or by a common name. Synonyms are also supported. Note that the option "prevent overlap with upstream ORFs" should be inactivated when working with eukaryotes. "From To" describes the limits of the region to retrieve. For upstream sequences, the default reference position is the ORF start* (and NOT the transcription start !). Negative coordinates are used to indicate sequences located upstream of the start codon; a reasonable pair of values could be: From -800 to -1.
NOTE that you might want to re-check the obtained sequence via BLAT search at UCSC.
*Please note that for genes which do NOT have the start ATG in the first exon, correct promoter retrieval might be a problem, because in these cases the tool will retrieve sequence from the first intron, and NOT the promoter sequence !!! BUT NOW, the user can choose between different "Feature types", like CDS (Coding Sequence), mRNA, tRNA, etc.
The advantage of using mRNA is that, if the mRNA is complete (which is not always the case), the upstream regions are retrieved relative to the transcription start site (TSS), rather than the start codon !!! If you want to see a nice example, you can try to extract the upstream sequence (e.g. -500 to -1) of the gene "SELE" (E-Selectin), and compare the output when choosing "CDS" versus "mRNA" as "feature type".
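
The difference between the two reference positions can be made concrete with a small sketch. It only does the coordinate arithmetic; the function name upstream_region and the toy gene coordinates are assumptions for illustration and are not part of RSAT.

```python
def upstream_region(ref_pos, strand, start=-800, end=-1):
    """Return 1-based inclusive genomic (left, right) coordinates of an upstream window.

    ref_pos    : genomic position of the reference point (start codon or TSS)
    strand     : '+' or '-'
    start, end : negative offsets relative to the reference (e.g. -800 and -1)
    """
    if strand == "+":
        return ref_pos + start, ref_pos + end
    # On the minus strand, "upstream" lies at higher genomic coordinates.
    return ref_pos - end, ref_pos - start


# Hypothetical plus-strand gene: TSS at 10,000, start codon in exon 2 at 12,500.
tss, cds_start = 10_000, 12_500
print(upstream_region(cds_start, "+", -500, -1))  # window falls into intron 1
print(upstream_region(tss, "+", -500, -1))        # window covers the promoter
```

With the start codon as reference the window ends directly in front of the ATG, which for this toy gene lies 2.5 kb downstream of the promoter; with the TSS as reference the window is the region usually meant by "promoter".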

2. Pattern matching: You know the regulatory motif (e.g. the consensus for a transcriptional factor), you know the genes: you look for the matching positions within the upstream regions from your genes.

2.1. DNA-Pattern (Strings):
Patterns can contain spacers of fixed length (e.g. CGGn{11}CCG) or variable length (e.g. GATAAGn{0,60}GATAAG). Several patterns (separated by line breaks) can be searched at once against several sequences. The first word of each line is the string description of the pattern, the second word is an identifier for this pattern. Example: Type the following text in the Query pattern(s) box: GATAAG  Gata_Box.  You can also display the results graphically; please refer to the tutorial for further information.
Note that one might be interested in counting the number of matches, rather than returning their precise positions. This can be done by deselecting the checkbox match positions and selecting the checkbox match count table instead. Even better (especially  for large lists !) : select match counts and specify a threshold !
Notice that positions are returned in negative coordinates, relative to the end of the sequence (the last nucleotide has position -1). This behaviour was selected with the "Origin" option in the dna-pattern form (Origin=end). This option is particularly useful for analyzing regulatory sequences, but it can be inactivated in other cases.
Note: "Substitutions": imperfect matches can be allowed, with a given number of substitutions. BUT insertions and deletions are not supported.  
Note: "Threshold": Only return sequences having more matches than the specified threshold (e.g. promoters showing at least 2 or 3 copies of a TF site).                   
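
The string-based search described above can be approximated with regular expressions. The following is a minimal sketch, not the dna-pattern implementation: it assumes patterns use only A, C, G, T and n with optional {..} spacers, it does not support substitutions, and it reports a palindromic pattern once per strand. All function names are made up for illustration.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def pattern_to_regex(pattern):
    """Turn a string such as 'CGGn{11}CCG' into regex source text."""
    # 'n' means any nucleotide; '{11}' and '{0,60}' are already valid regex quantifiers.
    return pattern.upper().replace("N", "[ACGT]")


def match_positions(seq, pattern):
    """Return (start, end, strand) with coordinates relative to the sequence end.

    The last nucleotide has position -1, mimicking Origin=end.
    Strand 'D' is the direct strand, 'R' the reverse complement.
    """
    seq = seq.upper()
    length = len(seq)
    # The lookahead lets overlapping matches be reported as well.
    regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))")
    hits = []
    for m in regex.finditer(seq):
        hits.append((m.start(1) - length, m.end(1) - 1 - length, "D"))
    for m in regex.finditer(reverse_complement(seq)):
        # Convert coordinates on the reverse strand back to the direct strand.
        hits.append((-m.end(1), -m.start(1) - 1, "R"))
    return sorted(hits)


def sequences_with_min_matches(seqs, pattern, threshold):
    """Keep sequences with at least `threshold` matches (match counts plus threshold)."""
    counts = {name: len(match_positions(s, pattern)) for name, s in seqs.items()}
    return {name: c for name, c in counts.items() if c >= threshold}


print(match_positions("AAGATAAGTTCTTATCGG", "GATAAG"))
```

Whether the threshold is meant as "at least" or "more than" should be checked on the form's help page; the sketch uses "at least".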

2.2. Patser (Matrices):
Patser scans a set of DNA sequences with a profile matrix, which can be in TRANSFAC, Gibbs, or Consensus format.
Patser scans a set of sequences using ONE pattern, but NOT a specific combination of TFBS. If you want to perform such searches, you may choose other programs like Target Explorer (see Target Explorer section).

Example - TRANSFAC format (NF-kappaB binding site): There are simply spaces between each number and a line break between each line.
PO      A      C      G      T
01 0 0 18 0 G
02 0 0 18 0 G
03 0 0 18 0 G
04 2 0 16 0 G
05 16 1 0 1 A
06 0 0 3 15 T
07 0 7 1 10 Y
08 0 16 0 2 C
09 0 18 0 0 C
10 0 17 1 0 C
Example - Consensus format (should also work without the "|" separators): There is simply one space between each number and a line break between each line.
A |   0   0   0   2  16   0   0   0   0   0
C |   0   0   0   0   1   0   7  16  18  17
G |  18  18  18  16   0   3   1   0   0   1
T |   0   0   0   0   1  15  10   2   0   0
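
The two matrix layouts above contain the same counts, only transposed. A small sketch of the conversion, assuming the TRANSFAC columns are in the order A, C, G, T as in the "PO" header line (the function names are illustrative only):

```python
NFKB_TRANSFAC = """\
PO      A      C      G      T
01 0 0 18 0 G
02 0 0 18 0 G
03 0 0 18 0 G
04 2 0 16 0 G
05 16 1 0 1 A
06 0 0 3 15 T
07 0 7 1 10 Y
08 0 16 0 2 C
09 0 18 0 0 C
10 0 17 1 0 C
"""


def transfac_to_rows(text):
    """Parse TRANSFAC-style count lines into a dict: base -> counts per position."""
    rows = {"A": [], "C": [], "G": [], "T": []}
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the 'PO A C G T' header and any other annotation line
        for base, value in zip("ACGT", fields[1:5]):
            rows[base].append(int(value))
    return rows


def format_consensus(rows):
    """Write one line per base, as in the Consensus-format example."""
    return "\n".join(
        f"{base} | " + " ".join(f"{v:3d}" for v in rows[base]) for base in "ACGT"
    )


print(format_consensus(transfac_to_rows(NFKB_TRANSFAC)))
```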

3. Genome-scale pattern matching: You know the regulatory motif, but you do not know the genes: you search the genome for all genes possibly regulated by your element.

3.1. Genome-scale DNA-pattern:
Example:  finding genes that are regulated by the TF GATA which binds to a GATA-box, whose consensus is GATAAG. An interesting property is that a single GATA-box is insufficient to affect the transcription. All genes effectively controlled by these elements possess at least one group of 2 to 4 closely associated GATA-boxes. Therefore, besides the usual parameters, select "match counts" in the "Return" option, and change the "Threshold" to e.g. "3".
NOTE: Like above, here is an example pattern derived from the human E-selectin (SELE) promoter region, which can be used to verify that 2 different relative positions (from-to) are extracted depending on the feature type you select (CDS or mRNA): TCAGCTGTTCTTGGCTGACTTCACATC. (Choose "Homo sapiens EnsEMBL" and -500 to -1 as "upstream" region).
NOTE: If you want to screen for combinations of different patterns, please refer to FAQ GEN4 for details !!!
NOTE: Like in RSAT-Retrieve Sequence above, there is now an option to select the mRNA start as "reference position" for upstream sequence retrieval (see above for comments) !

3.2. Genome-scale Patser:
When the pattern is relatively large and degenerate (which is not the case for GATA boxes), matrix-based genome-scale pattern matching (with patser) provides better results than string-based pattern counts. However, even in such a case, there is a trade-off between sensitivity and coverage.
NOTE: You can obtain a TF matrix by searching the section Matrix table within TRANSFAC database (see below, Registration at BIOBASE etc....).
NOTE that you can obtain matrices from different sources (TRANSFAC, CBIL-GIBBS, IMD) very nicely at TESS.
NOTE: In general, the question how to obtain TF matrices is treated in FAQ GEN7 !
NOTE: If you want to screen for combinations of different patterns, please refer to FAQ GEN4 for details !!!
NOTE: Like in RSAT-Retrieve Sequence above, there is now an option to select the mRNA start as "reference position" for upstream sequence retrieval (see above for comments) !

4. Pattern discovery: You know the genes, but you do not know the regulatory motif: you look for elements shared by a set of functionally related upstream sequences, which could reveal unknown regulatory sites.
NOTE: Please note the general "WARNINGS" for pattern discovery with mammalian genomes !!!

4.1. String-based pattern discovery:      
4.1.1. Oligo-analysis:
This is a simple and fast method to extract over-represented elements, and works for DNA, protein, or even "any" type of "sequences", meaning that you can even screen any type of text for over-represented motifs.
NOTE: The program simply looks for "words", not knowing if these are known TF-binding sites !  You should select the oligo size, the background model and the organism.
The results of the analysis are displayed in a table. Each row corresponds to one oligonucleotide, and each column to one statistical criterion. The E-value represents the number of patterns with the same level of over-representation which would be expected by chance alone. E.g., an E-value of the order of 1e-6 indicates that, if we submitted random sequences to the program, such a level of over-representation would be expected about once in every 1,000,000 trials.
NOTE: In the bottom of the result page, click on the button Pattern matching (dna-pattern); then hit "GO" and then click the Feature map button, which will produce a graphical image of the results.
NOTE: If you want to quickly identify which TF might bind to your "word of high frequency", you can search the TF- site table within TRANSFAC database (see below, Registration at BIOBASE etc....). Alternatively you may perform this search at TESS. Make sure to select the search field "sequence" ! This question is also treated in the last section of FAQ GEN5!!!
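
To make the E-value more tangible, here is a strongly simplified sketch of an over-representation test for oligonucleotides. It assumes independent nucleotides with fixed frequencies (a Bernoulli background), counts a single strand, and ignores overlap corrections; RSAT's own background models and statistics are more elaborate, so the numbers will differ. All names are illustrative.

```python
from collections import Counter
from math import exp, lgamma, log, log1p


def count_oligos(seqs, k):
    """Count every k-letter word and the number of positions examined."""
    counts, positions = Counter(), 0
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
            positions += 1
    return counts, positions


def binomial_tail(trials, k, p):
    """P(X >= k) for X ~ Binomial(trials, p) with 0 < p < 1, summed in log space."""
    total = 0.0
    for i in range(k, trials + 1):
        log_pmf = (lgamma(trials + 1) - lgamma(i + 1) - lgamma(trials - i + 1)
                   + i * log(p) + (trials - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


def oligo_analysis(seqs, k, base_freq):
    """Return (oligo, occurrences, p_value, e_value) rows, most significant first."""
    counts, positions = count_oligos(seqs, k)
    n_tests = 4 ** k  # number of possible words = number of tests performed
    rows = []
    for oligo, occ in counts.items():
        if set(oligo) - set("ACGT"):
            continue  # skip words containing N or other ambiguity codes
        prior = 1.0
        for base in oligo:
            prior *= base_freq[base]
        p_value = binomial_tail(positions, occ, prior)
        rows.append((oligo, occ, p_value, p_value * n_tests))
    return sorted(rows, key=lambda r: r[3])


freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # assumed background frequencies
for row in oligo_analysis(["TTGATAAGCAGATAAGTT", "CCGATAAGAA"], 6, freqs)[:3]:
    print(row)
```

The E-value is simply the p-value multiplied by the number of words tested, which is why it can be read as "expected number of such words by chance".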
4.1.2. Dyad-analysis: TFs like Zn cluster proteins have two distant points of contact with DNA. Each contact point imposes a specificity over 3 base pairs, but there is an intermediate region of fixed width but variable content. The width of the spacing is transcription factor-specific. Dyad analysis is a specific algorithm to extract such motifs.

4.2. Matrix-based pattern discovery:
4.2.1. Gibbs Sampler: Detects matrices in a set of (DNA or protein) sequences.  The main parameter for evaluating the result is the information per parameter (IPP). The higher the IPP, the more significant is the matrix. Beware that Gibbs sampling is a stochastic process. In consequence, each run of the program can return a different result, which can be disappointing at first sight. Good practice is to run the Gibbs sampler repeatedly, and check which motifs are selected frequently.
4.2.2. Consensus: This is a matrix-based pattern discovery for DNA or protein sequence sets developed by Jerry Hertz. By default, the program returns the 4 most significant matrices. These matrices are often slight variants of the same motif. The most informative statistic for estimating the significance of the discovered matrices is the expected frequency. E.g., an expected value of 0.07 indicates the number of matrices with a higher or equal significance which would be obtained by chance alone.

5. Pattern-Assembly: This is a tool to assemble a set of oligonucleotides or dyads into clusters of overlapping patterns (the assembly). Patterns can be either oligonucleotides (e.g. CACGTG, CACGTT) or dyads (e.g. CGGn{11}CCG). Patterns can be entered either in the text area, or by specifying a file on your machine (upload). Each pattern must appear as the first word of a line. Lines starting with a semicolon (;) are ignored. Strand sensitive or insensitive assembly is possible. With the strand insensitive option, patterns can be used either in direct or reverse complement orientation for assembly. For each pattern, the orientation which offers the best match is chosen.
The program returns clusters of aligned oligonucleotides. The information associated with each oligo in the input file is returned beside the same oligo in the output file. The row "best consensus" indicates the consensus sequence of all patterns.

Note: A concise introduction about Position-Specific Scoring Matrices (PSSMs) is available here !

6. Random dataset generation (for estimation of significance):

6.1. Random Sequence: This tool generates random DNA sequences according to various probabilistic models (Markov chains or independently distributed nucleotides). This tool is very useful if you want to verify the significance of results obtained by programs of Motif Discovery like Oligo-Analysis or programs of Motif Matching like DNA-Pattern. You can easily generate a random sequence set corresponding to your "query dataset", simply by selecting the same sequence number and the same length. In addition, you may choose between 3 different models:
- Equiprobable nucleotides: This is the simplest model : all nucleotides have the same prior probability.
- Independent nucleotides with distinct probabilities: A specific prior probability can be attached to nucleotides (AT and CG are grouped). This probability is constant over the sequence, i.e. each nucleotide is generated independently of the preceding and succeeding nucleotides.
- Markov chains (calibrated on intergenic frequencies): The random sequence has the same oligonucleotide composition as observed in the intergenic regions of the selected organism. This is obtained by a Markov chain process, where nucleotide probabilities vary at each position, depending on the preceding nucleotides. Note that "oligonucleotide size" determines which expected oligonucleotide calibration table has to be used. The Markov chain order is this value minus one. For example, calibrating with hexanucleotides (oligonucleotide length = 6) means that the nucleotide at each position depends on the 5 preceding nucleotides. This is thus a Markov chain of order 5. Calibrating on single nucleotides (oligo length = 1) means that each nucleotide is chosen independently of the preceding one. This is thus a Bernoulli model (or Markov chain of order 0).
NOTE: The output sequence list provides direct links to follow-up procedures like Pattern discovery and Pattern matching !!! Thereby, the random sequence set can directly be scanned for the presence of a specific pattern or for predicting "over-represented" patterns.
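
The Markov chain model listed above can be sketched in a few lines. This is an illustration only: instead of RSAT's precomputed intergenic calibration tables, the transition counts are estimated from whatever training sequences you pass in, and the handling of unseen contexts is a simplification.

```python
import random
from collections import Counter, defaultdict


def train_markov(seqs, order):
    """Count which nucleotide follows each context of `order` preceding nucleotides."""
    transitions = defaultdict(Counter)
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            transitions[s[i - order:i]][s[i]] += 1
    return transitions


def random_sequence(transitions, order, length, rng):
    """Generate one random sequence of the given length from the trained chain."""
    # Start from a randomly chosen observed context (order 0 has the empty context '').
    seq = rng.choice(sorted(transitions))
    while len(seq) < length:
        context = seq[len(seq) - order:] if order else ""
        options = transitions.get(context)
        if not options:
            # Unseen context: fall back to the counts of a random observed context.
            options = transitions[rng.choice(sorted(transitions))]
        bases = sorted(options)
        seq += rng.choices(bases, weights=[options[b] for b in bases])[0]
    return seq[:length]


oligo_length = 6              # as chosen on the form
order = oligo_length - 1      # Markov order 5
training = ["ACGTTGCAACGTAGCTAGCTAGGATCCGATCGATTTAGC"]  # stand-in for intergenic DNA
model = train_markov(training, order)
rng = random.Random(42)       # fixed seed so that the sketch is reproducible
print([random_sequence(model, order, 30, rng) for _ in range(3)])
```

Generating as many sequences of the same length as in your query dataset gives a control set that can be passed through the same discovery or matching step.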

6.2. Random Genes: This tool performs a random selection among the genes of a selected organism. The selection can be performed with or without replacement (when this option is activated, a gene can appear several times in the list). This program is useful for estimating the rate of false positives for pattern discovery programs. The program can also generate several groups of random genes, which can be used to simulate the results of clustering.
The output is a two-column text. The first column gives the gene identifier (like "ENST00000248553" for Ensembl transcripts), the second column the group identifier (useful when several groups are exported). In addition, a link to Retrieve Sequences is provided, allowing you to extract e.g. 1 kb of upstream promoter sequence for each gene. Note that you may select different labels, like gene name, gene ID, both, or full identifier.
The output sequence list provides direct links to follow-up procedures like Pattern discovery and Pattern matching !!! Thereby, the random sequence set can directly be scanned for the presence of a specific pattern or for predicting "over-represented" patterns.
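
The selection itself is easy to reproduce offline if you already have a list of gene identifiers. A minimal sketch, assuming the identifiers are available as a Python list (the IDs and the function name below are placeholders, not real output of the tool):

```python
import random


def random_gene_groups(gene_ids, group_size, n_groups=1, replacement=False, seed=None):
    """Return (gene_id, group_id) rows, mimicking the two-column output."""
    rng = random.Random(seed)
    rows = []
    for g in range(1, n_groups + 1):
        if replacement:
            # A gene may be drawn several times within one group.
            picked = [rng.choice(gene_ids) for _ in range(group_size)]
        else:
            picked = rng.sample(gene_ids, group_size)
        rows.extend((gene, f"group_{g}") for gene in picked)
    return rows


genes = [f"gene_{i:03d}" for i in range(1, 51)]  # placeholder identifiers
for gene, group in random_gene_groups(genes, 5, n_groups=2, seed=7):
    print(gene, group, sep="\t")
```

Running a pattern discovery program on several such random groups gives a rough idea of how often "significant" motifs turn up without any biological signal.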
TOUCAN
(Katholieke Universiteit Leuven, Belgium)
Consensus Match
TOUCAN is a workbench for regulatory sequence analysis, especially for detecting significant transcription factor binding sites across species, and for detecting cis-regulatory modules (combinations of binding sites) in sets of coexpressed/coregulated genes. It is a platform-independent, standalone Java application that is tightly linked with Ensembl, and was built using the BioJava package. TOUCAN was developed by the bioinformatics group at the department of electrical engineering (ESAT) at the Katholieke Universiteit Leuven (Belgium). In Aug. 2004, version 2 was released, which has some very nice improved features.

0. General setup notes:
Note that in order to run TOUCAN, you must have J2SE (Java 2 Platform, Standard Edition) installed on your machine (free download), notably the Java Runtime Environment (JRE) version including Java Web Start. It might also be necessary to install the latest Windows Service Pack first.
For the first-time use, you have to start TOUCAN from its homepage, but later you may start the program directly from its desktop icon or from the "Start, Programs" menu (in case you had chosen the appropriate setup options).
It is also highly recommended to have a look at the TOUCAN tutorial and at the TOUCAN manual.

1. Sequence and annotation retrieval:
1.1. Promoter sequence retrieval from Ensembl:
From the TOUCAN menu, choose "Get_Seq", "From Ensembl", "New/Add". In the next window you can either paste a comma - separated ID-list, or browse to a local file containing a column of IDs (like EXCEL). Note that you may use a lot of different ID types, like LocusLink, RefSeq, HUGO, Interpro, Affymetrix, Ensembl, and many more.  You can also specify the sequence region to extract and choose if you want to include homologous sequences (like mouse, rat) for your analysis.  In V2, you may choose multiple species "in-batch" ! 

1.2. Promoter upload from local FASTA file:
Especially in cases when automatic retrieval does NOT extract the correct promoter sequence (which DOES occur !), you may want to get the sequence via other methods, which normally produces a FASTA sequence file, saved on your local machine.
Have a look at the FAQ page: GEN6 for an overview of methods for promoter extraction !
NOTE that you may e.g. use WORD for this purpose: first remove all line-breaks within your sequence (just leave the one between the ID-line and the sequence), save in "txt+linebreaks" format, and afterwards rename the file to "*.fasta", ignoring the warning that the file may be unusable after the name change. Finally, upload the sequence in TOUCAN via "File", "Add Seq".
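
As an alternative to the manual route via a word processor, the same clean-up can be scripted. This is a minimal sketch under the assumption that the input is a text file with ">" ID-lines and wrapped sequence lines; the file names in the usage comment are placeholders.

```python
def clean_fasta(text):
    """Join wrapped sequence lines so each record is one ID-line plus one sequence line."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            # Drop spaces and position numbers that sequence viewers often add.
            chunks.append("".join(c for c in line if c.isalpha()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return "\n".join(f"{h}\n{s}" for h, s in records) + "\n"


# Usage with placeholder file names:
# with open("promoter_raw.txt") as src, open("promoter.fasta", "w") as dst:
#     dst.write(clean_fasta(src.read()))
print(clean_fasta(">SELE_promoter\n  1 ACGT ACGT\n  9 TTGA\n"))
```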

1.3. Annotation table of the active sequence set:
TOUCAN provides a module to generate annotation tables of gene lists. From the TOUCAN menu, choose "Get_Seq", "From Ensembl", "Get Info", and hit the "Update" button. This starts the process of retrieving the information and creating the table, which can be followed in the progress bar. Each row represents one gene of your dataset and each column represents one kind of database identifier. This includes basic gene, mRNA and protein IDs as well as sites of "data integration" like Ensembl Gene, EMBL, HUGO, UniGene, UniProt, RefSeq, and LocusLink. In addition, IDs in the fields of protein domains and motifs (Interpro), as well as protein structure (PDB) are listed. As the number of microarray platforms increases rapidly, the number of columns in TOUCAN displaying chip-specific identifiers is also constantly growing. Not only the world leader in this technology, Affymetrix, is covered, but also other platforms. Finally, the table also presents IDs which are related to functional information on each gene, like the involvement in biological processes, or the localization to specific cellular compartments (GO, MIM). To look at this information in detail, you can highlight all rows and copy / paste the data into programs like MS EXCEL.

2. Sequence manipulations:
After download, the main window displays the "active sequence set". Right click on a gene to remove, copy, or reverse complement (which is recommended for genes lying on the minus strand, showing the CDS below the horizontal line !) a sequence.
NOTE: In version 2 of TOUCAN, there is a very convenient way to reverse - complement ALL sequences which lie on the reverse strand at once, via "Tools", "RevCompl Negatives" !

3. General File Saving and Export:
Using "File", "Save List" you can save the sequences in e.g. FASTA format*.
Further export options:
- "File", "Export without Ns" is similar.
- "File", "Export Figure" lets you save the figure and TF list as a JPEG image.
- "File", "Export Frequencies" creates a tab-delimited txt-file with the names and frequencies of the current features. NOTE that this file can be used as a reference in the Statistics tool (BUT NOT for the SAME dataset from which it was derived; this would not make sense !).
- "File", "Export GFF" generates a special file format, which is easily opened in e.g. EXCEL, representing all features in table format (incl. e.g. the sequences of TF binding sites !).
- "File", "Export Matrix" generates a txt-file with all features in columns and all sequences in rows, which can be used after splitting the sequences using the "Windows" functionality.
- "File", "Export Separate Fasta" generates a folder containing all sequences individually in Fasta format.
IMPORTANT: When you choose the ".embl" format for "File", "Save List", then annotations like exon positions, TF sites from MotifScanner etc. are SAVED along with the sequences, and can be recovered in TOUCAN using "File", "Load" at a later time-point !!! Note that the indicated IDs are Ensembl GeneIDs.
GFF is a format for describing genes and other features associated with DNA, RNA and Protein sequences.
A GFF-format description is available at Sanger Institute.

4. MotifScanner - Detection of TF binding sites:
MotifScanner is a program that can be used to screen DNA sequences with precompiled motif models, like transcription factor binding sites.
Choose "Motifs",  "MotifScanner". NOTE: MotifScanner is also available as individual program. The "stand-alone" version does not support the matrices from the JASPAR database, but the version integrated in TOUCAN does.

In the input window you may choose:
- "PWM database" is the database of transcription factor binding site profiles you want to use (Position-Weight-Matrices), which is quite self-explanatory, e.g. for human sequences you may choose the option "TRANSFAC 6.0 public - Vertebrates" or you may choose the independent JASPAR database;
- "Background Model" lists orders of Markov Models, 3rd order models are fine in most cases.
In a 1st order background (bg) model, the 'genomic' frequencies are calculated for each dinucleotide (AA, AT, etc.), i.e. the bg score of the bp being scored depends on the 1 bp preceding it. In a 2nd order bg model, the score of a nucleotide is based on the frequency of the trinucleotide ending at that position (e.g. AAT if T is being scored).
- "Prior": In addition, it is quite useful to "play around" with the "prior"-value (which replaces the "core and matrix similarity values" of MatInspector), meaning a lower prior (like 0.1) is more stringent than a higher one (like 0.9). Prior-Examples: 0.1-0.2 for sequences smaller than 300 bp, 0.9 for sequences larger than 1500 bp.
The result* is a list and a color-code of TF sites and a visualization of these sites along the input sequences.  You can right-click on individual TFs to show/hide/or re-color.

NOTE: In order to quickly visualize single features, select the feature and hit the Enter-button (also works with CDS, exons, ...) !
NOTE: Don't forget to expand the two information windows at the bottom of the screen ! When hitting a feature like a TF-site, you will see a description of this element in these windows, like name, binding site, and matching score !
NOTE: MotifScanner can also be run on sub-lists of sequences (sequence parts) only, please refer to the "Sub-lists" description below !
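
The background model orders mentioned in the MotifScanner settings can be illustrated with a short sketch. It estimates conditional nucleotide probabilities from (order+1)-mers of a training set and scores a stretch of sequence with them. This is an assumption-laden toy (no pseudocounts, one strand only, illustrative function names), not the way MotifScanner builds or stores its models.

```python
from collections import Counter
from math import log


def background_model(seqs, order):
    """Estimate P(base | preceding `order` bases), keyed by the (order+1)-mer."""
    kmer_counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - order):
            kmer = s[i:i + order + 1]
            if set(kmer) <= set("ACGT"):
                kmer_counts[kmer] += 1
    model = {}
    for kmer, count in kmer_counts.items():
        context = kmer[:-1]
        total = sum(kmer_counts[context + b] for b in "ACGT")
        model[kmer] = count / total
    return model


def site_log_prob(seq, start, width, model, order):
    """Log-probability of seq[start:start+width] under the background model.

    Requires start >= order; unseen k-mers raise KeyError because the
    sketch uses no pseudocounts.
    """
    total = 0.0
    for i in range(start, start + width):
        total += log(model[seq[i - order:i + 1]])
    return total


model1 = background_model(["AACAAG"], 1)   # 1st order: dinucleotide based
print(model1["AA"], model1["AC"])          # P(A | A) and P(C | A)
print(site_log_prob("AACAAG", 1, 2, model1, 1))
```

A motif scanner compares such a background probability with the probability of the same stretch under the motif matrix; the higher the order, the more local sequence context the background captures.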

5. MotifLocator - Detection of TF binding sites:
Choose "Motifs", "MotifLocator". Settings are very similar to MotifScanner except that the "prior" is replaced by "threshold" which represents the matrix similarity known from MatInspector (good value range 0.75 - 0.95).
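
What a similarity threshold does can be shown with a simplified normalised matrix score. The sketch below rescales the summed counts of a window between the worst and the best possible score of the matrix; this is an assumption chosen for illustration and is neither MatInspector's exact matrix similarity formula nor MotifLocator's implementation. The counts are those of the NF-kappaB example matrix shown in the Patser section.

```python
NFKB_COUNTS = {
    "A": [0, 0, 0, 2, 16, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 1, 0, 7, 16, 18, 17],
    "G": [18, 18, 18, 16, 0, 3, 1, 0, 0, 1],
    "T": [0, 0, 0, 0, 1, 15, 10, 2, 0, 0],
}


def similarity_scan(seq, counts, threshold=0.85):
    """Return (start, site, similarity) for windows scoring at or above the threshold."""
    width = len(counts["A"])
    best = sum(max(counts[b][i] for b in "ACGT") for i in range(width))
    worst = sum(min(counts[b][i] for b in "ACGT") for i in range(width))
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if set(window) - set("ACGT"):
            continue  # skip windows containing N
        score = sum(counts[base][i] for i, base in enumerate(window))
        similarity = (score - worst) / (best - worst)
        if similarity >= threshold:
            hits.append((start, window, round(similarity, 3)))
    return hits


print(similarity_scan("AAGGGGATTCCCAA", NFKB_COUNTS, threshold=0.85))
```

Lowering the threshold towards 0.75 returns more, weaker sites; raising it towards 0.95 keeps only near-consensus matches.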

6. Statistics - Find over-represented features:
Choose "Motifs", "Statistics". If you have previously run an analysis with e.g. MotifScanner, the "Statistics" tool evaluates which elements are over-represented and might therefore be significant. For this purpose, you have to provide a local "reference file" (in the field "Expected Freq. File"), calculated from a previous analysis of this species' promoters stored in databases; comparing it to your own dataset allows an estimation of the significance of single TF sites.
- Note: You may create your own expected frequencies reference file, using "File", "Export Frequencies" from a DIFFERENT promoter dataset (NOT the same set, but the same species, see also remarks under  3)). Example: If one wants to discover over-represented binding sites within a sequence set (TEST), as compared to another sequence set (REF), then one should first score REF with MotifScanner (choose a certain prior), then choose File->export->frequencies to save the frequency file. Then load TEST, and score with MotifScanner using the same prior. In the Statistics window, now point to the locally saved frequency file.
- Alternatively, you may download a frequencies file from the TOUCAN FTP-site,  like for human promoters the file "epd_homo-sapiens_prior0.1.freq", save as *.txt, and use this file for the field "Expected Freq. File". Just make sure you use the files with the SAME PRIOR ! 

The output is a table showing 3 values for each TF site:
n: number of times this feature (TF site) appears in your sequence set. NOTE that this does not tell you in how many of your sequences, because a feature might appear more than once in one sequence ! This feature is currently NOT available; compare the Genomatix suite which tells you e.g. "The following TFs are present in 80% of your sequences".  IMPORTANT: This means that TOUCAN indicates over-representation of a feature if it occurs in a certain number of  base pairs, and not in a certain number of sequences.
p (probability) value: probability to find even more occurrences than n in this number of base pairs. When analyzing only one feature, a p-value smaller than 0.05 could be selected as being over-represented. In case of multiple features, it is better to use the sig-value.
sig (significance) value: One expects to find at random one pattern with sig >= 1 every ten families, one with sig >= 2 every 100 families, and one with sig >= s every 10^s families. Generally, negative sig values mean not significant.
NOTE that you can easily copy/paste the output table into e.g. EXCEL-sheets !      
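
How the three output values relate to each other can be sketched as follows. The statistical model here is an assumption: occurrences are treated as Poisson-distributed with a mean of (expected frequency per bp) x (number of bp), and sig is taken as -log10 of the p-value multiplied by the number of features tested, which matches the "every 10^s families" reading above. TOUCAN's exact formulas may differ.

```python
from math import exp, lgamma, log, log10


def poisson_tail(n, lam):
    """P(X >= n) for X ~ Poisson(lam), summed upward from n to avoid cancellation."""
    if n <= 0:
        return 1.0
    term = exp(-lam + n * log(lam) - lgamma(n + 1))
    total, i = 0.0, n
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def sig_value(p_value, n_features):
    """sig = -log10(p * number of features tested); negative means not significant."""
    return -log10(p_value * n_features)


# Hypothetical numbers: a site seen 12 times in 20,000 bp, expected 0.0002 per bp,
# with 300 different matrices scored in the same run.
n, total_bp, expected_per_bp, n_features = 12, 20_000, 0.0002, 300
p = poisson_tail(n, expected_per_bp * total_bp)
print(p, sig_value(p, n_features))
```

The example also shows why the sig-value is preferable when many features are tested: a p-value that looks small on its own can still correspond to a sig-value near or below zero.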

7. ModuleSearcher - Find combinations of TF binding sites:
The transcriptional regulation of a metazoan gene depends on the cooperative action of multiple transcription factors that bind to cis-regulatory modules (CRMs). ModuleSearcher is designed to find CRMs in a set of coexpressed or coregulated genes.
NOTE that you should first run a program like MotifScanner or MotifLocator to display all relevant TF sites. Then, choose "Motifs", "ModuleSearcher". In the next window, you have to select the "Feature Source", meaning you can select between all results of previous analyses (like MotifScanner sites), which ModuleSearcher shall now scan. You may leave the other parameters as they are and see what happens.  The "modules" will be displayed in TOUCAN, the names starting with the prefix "Mod" in the left frame (Feature List).   Again, if you want to exclusively display the modules, mark them and hit the Enter-key !
NOTE: If, in the first run, ModuleSearcher finds no hits, you may change these  parameters:
- lower the number of elements (meaning number of TF sites that have to appear "in common"), e.g. 2 or 3.
- increase the length (sequence stretch that includes the elements), e.g. 500bp.
- use A* (takes longer)  in the field "Algorithm".
- increase the number of generations when using "Genetic Algorithm".       
NOTE: ModuleSearcher at first only returns the best (highest scoring) combination of TFs. If you want to retrieve additional (second, third best) modules, you simply have to exclude (mask) the TFs of the first run, by entering their accessions (like "M00272-V$P53_02") in the field "Exclude matrices (comma separated)" of the ModuleSearcher input window.  Then repeat this process as often as you want.
NOTE: If you want to know  if a predicted CRM is found in Human-Mouse CNS (Conserved Non-coding sequences) in a "whole-genome approach", you should use the tool ModuleScanner !
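
The role of the two parameters "number of elements" and "length" can be illustrated on a single sequence. The sketch below simply lists windows that contain enough different TF sites within the allowed span. It is not the ModuleSearcher algorithm, which scores combinations across a whole sequence set using A* or a genetic algorithm; the site positions are made up.

```python
def find_modules(sites, max_length=200, min_elements=3):
    """List candidate modules in ONE sequence.

    sites : list of (position, tf_name) tuples, e.g. taken from a GFF export
    Returns (start, end, tf_names) for every window that starts at a site,
    spans at most max_length bp and contains at least min_elements distinct TFs.
    """
    sites = sorted(sites)
    modules = []
    for i, (start, _) in enumerate(sites):
        names, end = set(), start
        for pos, name in sites[i:]:
            if pos - start > max_length:
                break
            names.add(name)
            end = pos
        if len(names) >= min_elements:
            modules.append((start, end, sorted(names)))
    return modules


# Made-up site positions on one promoter.
sites = [(10, "NFKB"), (50, "AP1"), (120, "SP1"), (900, "NFKB")]
print(find_modules(sites, max_length=200, min_elements=3))
print(find_modules(sites, max_length=200, min_elements=2))
```

Lowering min_elements or raising max_length makes the search more permissive, which corresponds to the first two suggestions in the list above.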

8. ModuleScanner - check predicted modules against whole-genome CNS regions:
ModuleScanner performs genomic searches with a predicted CRM or with a user-defined CRM known from the literature to find possible target genes.
- Input: Starting from a blank page, choose "Motifs", "ModuleScanner". You have to choose one of the databases which all comprise pre-computed sets of CNS - conserved noncoding regions (minimally 75% identity within 100bp windows) between 2 species within 10kb upstream of the coding regions, like CNS between human- mouse or human - zebrafish or human - chicken.  Also, you can choose to display either the regions of the "primary" or the "second" species in the output. Finally, you can choose between the TRANSFAC or the JASPAR matrices of TFBS to visualize.
Then, you have to select the transcription factor matrices from a list, which you would like to use for the scan, or you enter them manually as a string, separated by commas, like e.g. "[M00052-V$NFKAPPAB65_01,M00189-V$AP2_Q6]". You may enter also more than 2 TFs, or you may even look for clusters of only ONE special TF, by using e.g. "[M00052-V$NFKAPPAB65_01,M00052-V$NFKAPPAB65_01]". You may change the number of "top hits" to return, which is based on the score of the ModuleScanner.
- The output lists CNS regions where the chosen TF combination is found. Note: The numberings at each CNS in the output indicate the position relative to the coding sequence of the gene. Note that, ONLY conserved regions between the 2 species selected are scanned, BUT a displayed TF site is taken from the "primary" sequence only (the one indicated in the database selection list), and is not necessarily conserved in the second species. Note that also all other predicted TFs (using MotifScanner prior 0.2) are displayed as colored boxes, but you may selectively choose the "modules" by highlighting the "Mod..." entries in the left column and hitting "Enter".
If you want to know which genes correspond to the Ensembl GeneIDs, you may use the TOUCAN annotation tool via "Get_Seq", "From Ensembl", "Get Info". You may also paste the Ensembl GeneIDs into the Ensembl query field or other data retrieval tools like BioMart, in order to perform advanced annotation and data retrieval (see also BioMart chapter).
9. MotifSampler - Find over-represented motifs:
The purpose of this tool is NOT the detection of pre-defined motif models (like TF sites) but the general detection of "over-represented" motifs, which might turn out to be known or even unknown TF binding sites. Note: MotifSampler is also available as individual program.
Choose "Motifs", "MotifSampler". You have to choose a background model (see above), and you may change e.g. the number of motifs the program shall return, and the length of the motifs (hint: try lengths from 6 to 9 bp). The "motifs" will be displayed in TOUCAN, the names starting with the prefix "box" in the left frame (Feature List).
NOTE: The MotifSampler is a Gibbs sampling implementation, implying that it is a stochastic algorithm, thus returning different results each time ! Of course, if a certain motif is really over-represented, the same or similar motifs should be found in each run, depending on the parameters you've set !

MotifSampler - Example:  
The program returns a motif like "box_1_1_TTTGTTnT".  
Question: Is this a known TF site or an unknown regulatory motif ?
Answer: You should check the motif in several ways (NOTE: This question is also treated in the last section of FAQ GEN5!!!):
- You may try a search against the TF-site table at TRANSFAC, see "Search TRANSFAC" in the TRANSFAC chapter ! As wildcard, use "*" (NOT "n" !). The retrieved hits are actually promoters, where this site is included, and under "BF" you will find "Binding Factors" that bind to this site.
- Alternatively you may perform this "Query TRANSFAC" search at TESS.  Note that TESS uses older versions of TRANSFAC.  As "Search Field" choose "Sequence", and enter your pattern in the "text" field. Carefully analyze the individual hits.
- You can also use your motif as input in a TESS sequence search as if it was a large promoter sequence and see if TFs may bind to it.
- Best option for sequences >10-12 bp: You may also feed MatInspector professional with short input sequences, but tests showed that these should be longer than 10-12 bases !
- Yet another option is the so-called Profile Comparison Tool provided at the site of the JASPAR database of TF profiles (PWMs) which was newly built from literature data, and is therefore "independent" from existing databases like TRANSFAC. You may either paste your consensus sequence or your binding matrix and search for similar motifs in the JASPAR database. The output is very user-friendly displaying all hits as instructive multi-colored sequence logos.

10. Alignments, conserved regions:
10.1. AVID - Finding highly conserved sequence regions:
This tool is especially suitable to compare promoter regions of orthologous genes (like the "same" gene in man, mouse, and rat). Note: AVID is also available as individual program, and there is a similar one called MAVID, see also MAVID section at the "GENOMICS" page !
If you want to, you can automatically retrieve ortholog sequences in TOUCAN (see point 1) above).
Then choose "Alignment", "2 Seq", "AVID". You have to select the 2 sequences that you want to compare, and you may choose the minimal percent identity (default: 75%) and the window length (default: 100bp), meaning a region is defined as "conserved" if in a window of 100bp the sequence identity is at least 75%. Conserved regions will be displayed by open rectangles along the sequences.
NOTE: In V2, it is possible to perform AVID on ALL homologous pairs "in-batch", choosing "Alignment", "All Pairs" ! This works only on pairs (like man-mouse), not on triplets (like man-mouse-rat).
NOTE: Alternatively, I recommend trying rVISTA, see also the VISTA chapter !
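
Sketch: The window criterion described for AVID above (a region counts as "conserved" if the identity within a 100 bp window is at least 75%) can be illustrated with a few lines of Python. This is NOT code from AVID or TOUCAN; the function name is made up for illustration, and it assumes that the two sequences are already aligned (same length, "-" for gaps).

```python
def conserved_windows(aln_a, aln_b, window=100, min_identity=75.0):
    """Return (start, end, percent_identity) for every window meeting the threshold.

    aln_a and aln_b are the two rows of a pairwise alignment (equal length,
    '-' marks a gap). Positions are 0-based alignment columns, end exclusive.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    a, b = aln_a.upper(), aln_b.upper()
    # 1 where both rows carry the same base; gap columns never count as identical
    same = [1 if x == y and x != "-" else 0 for x, y in zip(a, b)]
    hits = []
    if len(same) < window:
        return hits
    count = sum(same[:window])
    for start in range(len(same) - window + 1):
        if start > 0:
            # slide the window by one column
            count += same[start + window - 1] - same[start - 1]
        identity = 100.0 * count / window
        if identity >= min_identity:
            hits.append((start, start + window, identity))
    return hits
```

Overlapping windows that pass the threshold would normally be merged into one region before display; that step is left out here to keep the sketch short.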

10.2. LAGAN - Finding highly conserved sequence regions:
LAGAN is an example of a pairwise global alignment program. As a standalone program, LAGAN is also part of the LAGAN Alignment Toolkit available at Stanford University, described in this main section. In TOUCAN, it is used in much the same way as AVID described above.

10.3. BLASTZ - Finding highly conserved sequence regions:
BLASTZ is an algorithm for pairwise local alignment, which is also used in other programs like PipMaker or zPicture. In TOUCAN, it is used in much the same way as AVID described above.

11. FootPrinter:
FootPrinter is a program that performs phylogenetic footprinting. It takes as input a set of unaligned orthologous sequences from various species, together with a phylogenetic tree relating these species. It then searches for short regions of the sequences that are highly conserved, according to a parsimony criterion. The regions identified are good candidates for regulatory elements.
Note that, similar to MotifSampler, FootPrinter predicts conserved motifs (not TFBS), BUT with the additional feature that the phylogenetic relationship between orthologous sequences from different species is considered.
NOTE that FootPrinter is probably NOT suitable for the analysis of a "heterogeneous" set of e.g. co-regulated promoters; it is designed for the analysis of ONE gene / promoter, extracted from a series of different species (i.e. orthologous sequences) !

NOTE: FootPrinter is also available as a standalone web application. Please refer to the FootPrinter chapter for detailed instructions, especially concerning input parameters !
NOTE: Unlike the web version, TOUCAN does NOT refer to a "pre-made" standard phylogenetic tree, but sets all input sequences at the same hierarchical level (only ONE pair of brackets spanning all sequences). If you have e.g. a human promoter and its orthologous mouse and rat promoters, you have to supply your own tree in the form: (humanID,(mouseID,ratID)).
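
Sketch: The bracket notation above can be written by hand, but for larger sets it may be easier to generate it. The following Python lines are only an illustration (the function name is made up, and the IDs are the placeholders from the example above); whether TOUCAN expects any further decoration, such as a terminating semicolon, is not covered here.

```python
def to_bracket_tree(tree):
    """Render a nested tuple of sequence IDs as a bracketed tree string."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_bracket_tree(child) for child in tree) + ")"


# All sequences on one level (what TOUCAN assumes when no tree is supplied)
flat = to_bracket_tree(("humanID", "mouseID", "ratID"))

# Mouse and rat grouped together, human as the outgroup
nested = to_bracket_tree(("humanID", ("mouseID", "ratID")))

print(flat)
print(nested)
```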

12. Select Regions - Sublists:
Example: You can create sublists of sequence parts that are highly homologous and analyze these stretches independently. After performing AVID, right-click on a box (a homologous region), and hit "Cut". Choose the exact feature, or specify the exact bases to be cut, and add this stretch to the sublist (upper left window). You can save this sublist via "File", "Save Sublist", and then reload it as a new gene list. 

NOTE: MotifScanner can also be run on sub-lists of sequences (sequence parts) only. Example:
- Retrieve a set of (putatively) co-regulated human genes; Reverse complement those sequences where the CDS is below the black line;
- Run AVID on "All pairs"; Right-click on a CNS for the cut sequence dialog. Choose a CNS as feature and choose "all features with same source on all sequences". All CNS sequences are added to the sublist.
- Run MotifScanner, choosing "use SubList" for the fastA sequences.
- The results of this scoring will be annotated on the original set.

13. Consensus Match - Detect user-defined patterns in the sequence set:
NOTE: This tool is roughly equivalent to the "DNA-pattern" part of RSAT (see also RSAT description).
Choose "Motifs", "Consensus match". This tool quickly identifies the presence of a user-defined string, like "AAAGGTAA" or even "WWRYAATC{1,5}NCA" in the active sequence set. This way, you can immediately see which of the sequences contains "your personal" TF site of interest.
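
Sketch: If you want to reproduce such a consensus search outside TOUCAN, one simple approach is to translate the IUPAC ambiguity codes into a regular expression. The Python lines below are an illustration only, not TOUCAN's implementation; the function names are made up, only the forward strand is searched, and it is assumed that a repeat like "{1,5}" applies to the preceding symbol, as in regular expressions.

```python
import re

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus into a regular expression string."""
    # Digits, commas and curly braces (repeat counts) pass through unchanged
    return "".join(IUPAC.get(ch, ch) for ch in consensus.upper())


def find_consensus(consensus, sequences):
    """Return {name: [(0-based start, matched text), ...]} for each sequence."""
    pattern = re.compile(consensus_to_regex(consensus))
    hits = {}
    for name, seq in sequences.items():
        hits[name] = [(m.start(), m.group()) for m in pattern.finditer(seq.upper())]
    return hits
```

Sequences with an empty hit list are the ones that do not contain the site, which gives the same quick overview as the TOUCAN display.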

14. CpG Annotation - Detect CpG islands:
Choose "Tools", "CpG Island-Human", to quickly identify CpG regions within your sequence set.  CpG islands are longer than 200 bp, have over 50% G+C content, and a certain statistical frequency as compared to all human promoters stored in the EPD.
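
Sketch: The length and G+C criteria above are easy to check by hand. The Python lines below compute them for a single sequence stretch. Note the assumptions: the function names are made up, and the third criterion is replaced by the commonly used observed/expected CpG ratio (threshold 0.6), which is NOT necessarily the statistic TOUCAN derives from the EPD promoters.

```python
def cpg_stats(seq):
    """Return (G+C percentage, observed/expected CpG ratio) for a sequence."""
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")
    gc_percent = 100.0 * (c + g) / n if n else 0.0
    # observed/expected CpG = (number of CpG * length) / (number of C * number of G)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_percent, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=50.0, min_obs_exp=0.6):
    """Apply the three thresholds to one sequence stretch."""
    gc_percent, obs_exp = cpg_stats(seq)
    # min_obs_exp is an assumption, see the note above the code
    return len(seq) > min_len and gc_percent > min_gc and obs_exp > min_obs_exp
```

To scan a longer promoter, the same test would be applied to a window sliding along the sequence.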

15. GFF Annotation - Use any GFF format file:
Choose "File", "Annot GFF", to upload a file in GFF format that contains features for one or more sequences in the active sequence set. Note that using this option, you can "merge" a sequence set and a previously saved GFF file, like one containing the results of an AVID analysis.
NOTE: For more information on the structure of GFF files, have a look at the TOUCAN manual !
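
Sketch: A GFF file is plain text with one feature per line and tab-separated columns (sequence name, source, feature, start, end, score, strand, frame, and an optional attributes column). The Python lines below read such a file into simple records, e.g. to check a file before uploading it. This is a generic illustration, not TOUCAN code; which values TOUCAN itself writes into the individual columns is described in the TOUCAN manual and is not assumed here.

```python
from collections import namedtuple

GffFeature = namedtuple(
    "GffFeature",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff(text):
    """Parse GFF text into a list of GffFeature records."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment / header lines
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a feature line
        cols += [""] * (9 - len(cols))  # the attributes column is optional
        seqname, source, feature, start, end, score, strand, frame, attrs = cols[:9]
        # score is kept as text because "." is allowed for "no score"
        features.append(GffFeature(seqname, source, feature, int(start), int(end),
                                   score, strand, frame, attrs))
    return features
```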
Transcription Factor and Motif Databases
NOTE: This section comprises databases of transcription factors and their binding site profiles. Programs for the matching of such sites to promoter sequences are contained in other sections, like TFBS Matching. Please note that resources which consider transcription factors in an "integrated view" of regulatory elements are listed in section Gene Regulation-centered Data Integration.       
JASPAR
(CGB, Karolinska)
JASPAR is an open-access database of transcription factors and their binding site profiles. It has gained tremendous importance since TRANSFAC, the "mother" database for TF information, has been largely commercialized.

NOTE: The main section of JASPAR is located at the "GENOMICS" page, together with the description of the program ConSite, which is a program that couples phylogenetic footprinting with regulatory site detection (mainly promoter comparison), meaning that ConSite is primarily designed to compare 2 orthologous sequences and report conserved TFBS (Transcription Factor Binding Sites).
TRANSFAC
(BIOBASE, Germany)
TRANSFAC is the most comprehensive transcription factor database. TRANSFAC is maintained by BIOBASE, Germany. TRANSFAC contains data of transcription factors, transcription factor matrices, and their binding sites. TRANSFAC deals essentially with single factor-site interactions, in contrast to TRANSCompel which deals with composite elements, consisting of 2 or more neighbouring binding sites. 
Note that there is a public version of TRANSFAC, which is older and somewhat outdated compared to the commercial version. Nevertheless, you have to register at BIOBASE (free for non-profit organizations) in order to gain access to this public version (unless you search via SRS servers) !

TRANSFAC primary data are stored in different tables:
- Example: If you want to quickly identify which TF might bind to your "word of high frequency" (like AGGTCA), you can search the TFSITE table of TRANSFAC. Make sure to select the search field "sequence" ! As wildcard, use "*" (NOT "n" !). The retrieved hits ("R...") are actually short sequences taken from promoters, where this site is included, and under "BF" you will find "Binding Factors" that bind to this site. The relative position to the transcription start site is indicated. Links to the TFGene entry and to the binding factor(s) are contained. The TFSITE table also contains so-called artificial binding sites, mostly from random oligonucleotide selection assays.
- If you want to gain information on a specific Transcription Factor, search in the TFFACTOR table. This entry ("T...") shows the protein sequence of a TF, literature references, functional annotation, cell specificity, and links to the orthologs in other species, as well as to the binding site entries.
- If you want to obtain the complete binding matrix of a TF, search in the TFMATRIX table. A TFMatrix Table entry ("M...") shows a matrix, which is built by a series of individual binding sites related to a specific TF. Often, binding sites of several species are "merged". Such TF matrices are used by several programs to predict TFBS in sequences.
Note: Please refer to this Matrix documentation page for details concerning the matrix accessions !
- A TFGENE Table entry ("G...") shows a certain gene, where the promoter was experimentally analyzed and the binding TFs ("T...") and TF binding sites ("R...") were characterized.

NOTE: At the BIOBASE web portal, several programs are available to predict TFBS in query sequences, like MATCH and PATCH ! Please refer to the section "TFBS Matching" for details !

2. Access TRANSFAC via SRS:
The different TRANSFAC databases are also searchable via public SRS (Sequence Retrieval System) servers (without need to register !). Please also refer to other SRS descriptions as in RET2 or RET5. There are 5 different libraries related to TRANSFAC, all starting with "TF...". If you have a look at this list, you realize that the content of the TFMATRIX databases varies according to the different SRS servers, so choose one of these databases. Using the yellow "Search" button at the top right corner opens the "Search" - form, where we can simply enter search terms. Note that if you retrieve no hits in the first run, you may select "*all entries*" as search field which will produce a list of ALL entries in the database. You can display the complete list in one window by adjusting the "Display Options" on the left side. This list you can scan for your factor / matrix of interest. Please refer to FAQ GEN7 for further details.

3. Access TRANSFAC Professional:
At this site, commercial offers are shown in case you are interested in the latest, so-called "professional" version of the TRANSFAC database.

4. If you want to search the TRANSFAC database for tissue/cell specificity of factors, please refer to FAQ RET4.

5. Typical TRANSFAC accessions: refer to section TRANSFAC IDs.
TF Module Databases
NOTE: This section specifically comprises databases of transcription factor modules (functionally relevant combinations of transcription factors). Note that programs for the matching of pre-specified TF-modules to promoter sequences are described in section TF Module Matching, and programs for the ab initio discovery of modules are listed under TF Module Discovery.
COMPEL
(GRESA group, ICG, Novosibirsk and GBF, Braunschweig, Germany)

TRANSCompel
(BIOBASE, Germany)

COMPEL (TRANSCompel since August 2000) is a database on composite regulatory elements affecting gene transcription in eukaryotes. COMPEL collects information about composite regulatory elements (CEs) - pairs of closely situated sites and transcription factors binding to them. The factors that cooperate at an individual CE mostly belong to different classes with respect to the structure of protein domains, namely DNA-binding and activation domain. The factors also differ in their functional properties: cell-specificity, inducibility and others. 

There are 3 different databases:
1. COMPEL 3.0, January 1999, contains 178 composite elements.
- The section "About" gives a concise introduction of the concept, and presents a very nice table showing the most frequent types of Composite Elements (CEs) and the corresponding COMPEL accessions (like "C00165").
- "Field description" explains all fields contained in COMPEL database entries.
- "Browse" lets you browse the COMPEL database by DNA binding domains of factors involved.
- "Search" lets you search the COMPEL database via gene / factor names, DNA-binding domains, organism and more.

2. TRANSCompel 6.0 - public, January 2002, contains 256 composite elements and is free for non-commercial users. Commercial users have to license TRANSCompel at BIOBASE, Germany. The table "Search TRANSCompel" is similar to the table "Search TRANSFAC" (see also TRANSFAC section), and both are available via the same registration at BIOBASE. Example: The term "NF-kB" retrieves all COMPEL accessions (composite elements) which contain NF-kB as one of the components.

3. TRANSCompel Professional 7.1, January 2003, contains 366 composite elements, new versions are regularly issued. TRANSCompel Professional can only be used in conjunction with the TRANSFAC Professional database (both commercial).

NOTE: COMPEL and TRANSCompel are databases of transcription factor modules (CEs); there is no option to scan a user sequence for the presence of such CEs. For this purpose, other programs are used, like CompelPatternSearch, see this main section.

Regulatory Unit Databases
NOTE: This section lists databases which store information on whole transcriptional regulatory units, including promoters, enhancers, silencers, and locus control regions.
TRRD
(ICG, Novosibirsk)

TRRD (Transcription Regulatory Regions Database) is an informational resource containing an integrated description of the gene transcription regulation. Each TRRD entry corresponds to a particular gene and contains descriptions of several hierarchical levels of transcription regulation, including transcription factor binding sites, regulatory units (promoters, enhancers, and silencers), regulatory regions (5'-and 3'- regulatory regions, exons, and introns), and locus control regions. Description of each regulatory level may contain both its structural characteristics (sequence, localization, etc.) and functional properties (effect on transcription activity of a gene; cell type, tissue, or organ specificity; cell-cycle phase or ontogenetic stage specificity, etc.).
TRRD is regularly updated in accordance with new experimental evidence. All information is entered into the database by expert biologists, based on the analysis and annotation of papers reporting experimental data.

Data in TRRD are contained in the following databases: TRRDGENES (general gene description), TRRDLCR (locus control regions); TRRDUNITS (regulatory regions: promoters, enhancers, silencers, etc.), TRRDSITES (transcription factor binding sites), TRRDFACTORS (transcription factors), TRRDEXP (expression patterns), and TRRDBIB (experimental publications).

1. Access TRRD via SRS: The different TRRD databases are also searchable via public SRS (Sequence Retrieval System) servers. Please also refer to other SRS descriptions as in RET2 or RET5. There are several libraries related to TRRD, all starting with "TRRD...". Using the "Search" button at the top right corner opens the "Search" - form, where you can simply enter search terms. Note that you may select "*all entries*" as search field which will produce a list of ALL entries in the database. You can display the complete list in one window by adjusting the "Display Options" on the left side.

Example: You may search the TRRDGENES database for gene names like "ELAM". Choose "GeneName" as search field. You will retrieve one entry, which contains links to other databases like TRRDSITES ("Binding Sites"), TRRDFACTORS ("Transcription Factors"), TRRDEXP ("Gene Expression Regulation"), and TRRDBIB ("Bibliography"). In addition, there may be links to the TRANSFAC database,  as well as to databases of data integration like SOURCE, Ensembl, or GeneCards. Most importantly, a link to the entry in TRRDUNITS is given ("Regulatory Unit"), which collects a series of TRRDSITES entries ("S..."). These entries list individual regulatory sites which have been described as functionally important for the regulation of the respective gene.

2. Access TRRD at GeneNetWorks: SRS is used as a basic tool for navigating and searching TRRD and integrating it with external informational and software resources.
- TRRD Viewer: a visualization tool which presents the information in the form of maps of gene regulatory regions.
- Browse Genes by Name: shows a complete list of the genes stored in TRRD.
- Browse Genes by Species: lists genes by species names.
- TRRD Sections: groups genes by specific functional context, like "Apoptosis genes" or "Cell cycle genes".
TFBS Matching
NOTE: This section lists programs for the prediction or the "matching" of single or multiple pre-selected transcription factor binding sites (TFBS) to promoter sequences. Thus, TFBS Matching is in a sense a "special case" of Motif Matching. Please note that resources which perform this task using specific combinations of TFBS ("modules") are contained in section TF Module Matching.
ConSite
(CGB, Karolinska)
ConSite is a program that couples phylogenetic footprinting with regulatory site detection (mainly promoter comparison), meaning that ConSite is primarily designed to compare 2 orthologous sequences and report conserved TFBS (Transcription Factor Binding Sites). Note that ConSite uses a special database ("JASPAR") of TF profiles (PWMs).
Therefore, the main section of ConSite and JASPAR is located at the "GENOMICS" page !
At the ConSite start page, you have one option "Analyze a single sequence", which lets you analyze a single promoter for TFBS (without performing cross-species comparison).



Match is one of the programs at the BIOBASE portal of gene regulation. Match is designed for searching potential binding sites for transcription factors (TF binding sites) in nucleotide sequences. Match uses a library of mononucleotide weight matrices from TRANSFAC 6.0. Note that a registration is required in order to use Match, which is free for non-profit use. There is a good Help-page explaining the full functionality of Match.

1. Input:

1.1. Sequence: the sequence is entered in one of several formats.

1.2. Matrix selection: Note: Refer also to the Matrix section of the online Help !

- matrices are grouped according to different taxonomic entities like vertebrates, fungi, plant, or bacteria.

- Cut-offs for core and matrix similarity: The matrix similarity is a score that describes the quality of a match between a matrix and an arbitrary part of the input sequences. Analogously, the core similarity denotes the quality of a match between the core sequence of a matrix (i.e. the five most conserved positions within a matrix) and a part of the input sequence. A match has to contain the "core sequence" of a matrix, i.e. the core sequence has to match with a score higher than or equal to the core similarity cut-off. In addition, only those matches which score higher than or equal to the matrix similarity threshold appear in the output.
Note: A value of 1.0 means "perfect match". Reasonable values usually are Core similarity cut-off: 0.9-0.95 and Matrix similarity cut-off: 0.75-0.85.
Note: If the user defines these cut-offs by exact values, the cut-off values are used for ALL matrices of the selected group.
The appropriate cut-off selection is very important and depends largely on the user's objectives. Exact matches between matrix and sequence can lack any biological relevance and, conversely, some transcription factors have low-affinity binding sites of biological significance.
Therefore, additional pre-calculated datasets are available, each addressing a different purpose:

- Cut-off to minimize false-positive matches (minFP):
An algorithm is applied to second exon and third exon sequences, because these sequences are presumed to contain no biologically relevant TF binding sites. For every matrix the lowest cut-off for which no match is found in the set of exon sites is considered to be the minFP cut-off.
When a minFP cut-off is applied for searching a DNA sequence, the algorithm will find a relatively low number of matches per nucleotide. In the output the user will only find putative sites with a good similarity to the weight matrix; however, some known genomic binding sites may not be recognized.

- Cut-off to minimize false-negative matches (minFN): minFN cut-offs are set to a value that provides recognition of at least 90% of oligonucleotides in tests using generated oligonucleotides, meaning that a 10% false-negative rate is tolerated.
Note: Applying the minFN cut-offs, the user will find most genomic binding sites, but in this case a high rate of false positives should be taken into account as well. The minFN cut-offs are useful for the detailed analysis of relatively short DNA fragments.

- Cut-off to minimize the sum of both error rates (minSUM):
A sum of both error rates is computed to find cut-offs that give an optimal number of false positives and false negatives. To do so, the number of matches found in the exon sequences for each matrix is computed, using a cut-off allowing 10% of false negative matches (minFN10). This number is defined as 100% of false positives. The sum of corresponding percentages for false positives and false negatives is then computed for every cut-off ranging from minFN10 to minFP. The cut-off that gives the minimum sum is referred to as the minSUM cut-off.

- "Use high quality matrices only": The high quality criterion denotes the following: When using a matrix with a cut-off which allows a false negative rate of 50%, the frequency of matches found in exon3 sequences (false positive rate) must drop below a certain threshold. This threshold is defined so that the matrices which produce the highest number of false positive matches are defined as low quality matrices (about 30% of the TRANSFAC matrices).
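
Sketch: The way the minFP, minFN and minSUM cut-offs described above relate to each other can be illustrated with a small Python example. This is only a reading of the description given here, NOT the BIOBASE implementation: the function names, the grid of candidate cut-offs (0.50 to 1.00 in steps of 0.01) and the input lists (similarity scores of known binding sites, and scores of matches found in exon sequences) are assumptions made for illustration.

```python
def count_at_or_above(scores, cutoff):
    """Number of scores that would be reported as matches at this cut-off."""
    return sum(1 for s in scores if s >= cutoff)


def derive_cutoffs(site_scores, exon_scores, fn_percent=10, grid=None):
    """Return the minFN, minFP and minSUM cut-offs for one matrix."""
    if grid is None:
        grid = [i / 100 for i in range(50, 101)]  # assumed candidate cut-offs
    grid = sorted(grid)
    n_sites = len(site_scores)
    if n_sites == 0:
        raise ValueError("at least one known site score is required")

    # minFN: highest cut-off that still recognizes (100 - fn_percent)% of the sites
    recognizing = [c for c in grid
                   if count_at_or_above(site_scores, c) * 100 >= (100 - fn_percent) * n_sites]
    # minFP: lowest cut-off for which no match is found in the exon set
    clean = [c for c in grid if count_at_or_above(exon_scores, c) == 0]
    if not recognizing or not clean:
        raise ValueError("the grid does not cover the required cut-offs")
    min_fn, min_fp = max(recognizing), min(clean)

    # The number of exon matches at minFN is defined as 100% false positives
    fp_ref = count_at_or_above(exon_scores, min_fn)
    if fp_ref == 0:
        return {"minFN": min_fn, "minFP": min_fp, "minSUM": min_fn}

    def error_sum(c):
        fp_pct = 100.0 * count_at_or_above(exon_scores, c) / fp_ref
        fn_pct = 100.0 * (n_sites - count_at_or_above(site_scores, c)) / n_sites
        return fp_pct + fn_pct

    candidates = [c for c in grid if min_fn <= c <= min_fp]
    return {"minFN": min_fn, "minFP": min_fp, "minSUM": min(candidates, key=error_sum)}
```

In this reading, minSUM always lies between minFN and minFP, which matches the intended use: minFP for few but reliable hits, minFN for short fragments where no site should be missed, and minSUM as a compromise.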

1.3. Profile selection:

The term "profile" is used in this context for a specific subset of weight matrices from the TRANSFAC library with core similarity cut-off values and matrix similarity cut-off values for each matrix. You can either use one of the predefined profiles or one you have created yourself on the Match Profiler page.

- Predefined profiles provided by Match:
mainly tissue-specific profiles. Groups of transcription factors known to be active in a particular tissue have been collected for each profile with the help of information from the TRANSFAC database. Matrices linked to these transcription factors in TRANSFAC were then collected (examples: immune cell, cell cycle, muscle, liver).

- Match Profiler: Please refer to the Match Profiler Help section for details !
The user can select the matrices to be included in the profile from the total list of available matrices.

2. Results page:

The results page displays a listing of all matches found in the input sequence. The output of the program is limited to 500 000 matches per sequence. The results are presented in a table with the following columns:
- matrix identifier: each identifier is linked to the TRANSFAC entry of its matrix or, if it is a user-defined matrix, the respective matrix is displayed.
- position of the match in the input sequence and the strand ((+) or (-)) in which it can be found.
- score for core and matrix similarity.
- matching sequence: capital letters indicate the positions in the sequence which match the core sequence of the matrix, while lower-case letters refer to the remaining positions of the matrix.
- name of the factor whose binding site is represented by the matrix. If an entry exists, a factor name is linked to a selection of the TRANSFAC factor table, showing all entries of this factor mentioned in the respective matrix entry (TRANSFAC matrix table).
It is also possible to view a graphic output of the results. Here the identifier of a matching matrix is "aligned" to the sequence being searched. When you use the "Back" button of your browser to return to the Match page, please press the Reset button. Then the new results can be found in the lists of your results.

P-Match (combined Pattern-Matrix search) is a new tool for identifying transcription factor binding sites (TF binding sites) in DNA sequences. It combines pattern matching and weight matrix approaches thus providing higher accuracy of recognition than each of the methods alone. P-Match uses a library of mononucleotide weight matrices from TRANSFAC 6.0 along with the site alignments associated with these matrices.

Note: In general, P-Match "looks" very similar to Match, so please refer to the information above.
Additional features:
- more predefined profiles (sets of TFs) available, corresponding to different tissue specific TF sets.

Patch (formerly known as PatSearch):
Search for potential transcription factor binding sites in your own sequences with the pattern search program using TRANSFAC 6.0 public sites. The patterns, which Patch uses for searching, are TF binding sites of the TRANSFAC Professional database and the consensus sequences of weight matrices of TRANSFAC Professional.
Note: In general, much information described at  the Match program can also be applied here, so please refer to the information above. Please refer also to the Patch help page for details.

Note: The main difference is that there is no selection of cut-off values for matrix similarities (as used in Match and P-Match), but a selection of:
- the minimal length of sites (default: 4):
only sites longer than or equal to 4 will appear in the output.
- the maximum number of mismatches: it specifies how many positions may differ when comparing a binding site (search pattern) with some part of the input sequence.
- mismatch penalty:
When comparing a binding site (search pattern) with some part of the input sequence, each mismatching position will receive a mismatch penalty. This penalty value will have a negative influence on the overall score for the match between the whole site (search pattern) and the input sequence. Each matching nucleotide receives a bonus weight of 100. So, the default value for the mismatch penalty is also 100, and the negative influence of a mismatching position corresponds to the positive influence of a matching position. If you reduce the mismatch penalty, you will receive high scoring sites containing mismatches in the Patch output. If you increase this parameter, high scoring sites are not likely to contain mismatches.
- Lower score boundary:
The lower score boundary is a cut-off, which defines which matches between a site (search pattern) and the input sequence will be listed in the output. The score which is estimated for every match has to be higher than or equal to this cut-off. The default value for the lower score boundary is 87.5. 
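
Sketch: How the mismatch penalty and the lower score boundary interact can be tried out with a few lines of Python. Important: this is NOT the Patch algorithm. Only the match bonus (100), the default mismatch penalty (100) and the default lower score boundary (87.5) are taken from the description above; the function names and, in particular, the way the score is averaged over the site length are assumptions made purely for illustration.

```python
def patch_like_score(site, window, match_bonus=100.0, mismatch_penalty=100.0):
    """Score one site against one sequence window of the same length."""
    if len(site) != len(window):
        raise ValueError("site and window must have the same length")
    matches = sum(1 for a, b in zip(site.upper(), window.upper()) if a == b)
    mismatches = len(site) - matches
    # ASSUMPTION: bonus minus penalty, averaged over the site length (100 = exact match)
    return (match_bonus * matches - mismatch_penalty * mismatches) / len(site)


def scan(site, sequence, max_mismatches=1, mismatch_penalty=100.0,
         lower_score_boundary=87.5):
    """Return (0-based position, window, score) for every reported match."""
    hits = []
    k = len(site)
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        mismatches = sum(1 for a, b in zip(site.upper(), window.upper()) if a != b)
        if mismatches > max_mismatches:
            continue
        score = patch_like_score(site, window, mismatch_penalty=mismatch_penalty)
        if score >= lower_score_boundary:
            hits.append((i, window, score))
    return hits
```

Under these assumptions, a short site with one mismatch only shows up in the output when both the mismatch penalty and the lower score boundary are reduced, which is the trade-off described above.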
MatInspector  ("professional" and "classic") 
(Genomatix Inc., Munich, Germany)
1. NOTE: The free ("classic") version of MatInspector has been discontinued ! MatInspector in general predicts potential transcription factor binding sites. Usually the output list is quite long and it is hard to differentiate between potentially specific and non-specific hits, which becomes easier when using the "professional" version.

2. The MatInspector professional program significantly reduces the number of false positives and negatives.
This program (to be accessed here) is now part of the Genomatix Suite, meaning it is in principle free of charge for academic users, just register here ! However, you are restricted to max. 20 analyses (sequences) per month !
NOTE: Genomatix has termed the free academic access "evaluation account". Note that there is not only a limitation in the number of analyses but also in the functionality of the obtained data !
Note that there is a highly user-friendly option when choosing the matrix similarity threshold, called "optimized". This means that the program automatically chooses the optimal value for each matrix, which minimizes the number of false positives. This optimized value is defined in a way that a minimum number of matches is found in non-regulatory test sequences. At the MatInspector output list, you can easily compare the difference between the optimized matrix threshold and the actual matrix similarity for each site.
Note that you should use MS Internet Explorer (and not Netscape) in order to make use of the functionality of the Adobe SVG Viewer, which allows you to handle the TF site diagrams interactively. If you have analyzed a multiple sequence file, you will find a button called "Search for common TF sites" at the bottom of the output page, opening an SVG window, where you can adjust the display to see only those TF sites present in x of total y input sequences. Via right-click and "Copy SVG", you can paste the image into e.g. Corel Photopaint in order to save it in any file format.

3. MatInspector professional is also part of the "GEMS Launcher"-section of the Genomatix Suite. Select "Search for transcription factor binding sites". NOTE that again, you are restricted to max. 20 GEMS analyses (sequences) per month !
Signal Scan
Signal Scan searches for transcription factor binding sites in TRANSFAC and TFD databases. Note that NO FURTHER information on individual factors is provided, so it may be advisable to use other systems like TESS or MatInspector instead.
TESS - Transcription Element Search Software
(University of Pennsylvania)
TESS is provided by the University of Pennsylvania, Philadelphia. TESS is a set of software for locating and displaying transcription factor binding sites in DNA sequence. TESS uses the Transfac database as its store of transcription factors and their binding sites.

1. Analyze your promoter sequence:
A combined search in various databases is performed (TRANSFAC site, TRANSFAC matrix, CBIL matrices, IMD, ...).
You can choose between various forms of output, including colour coding of consensus mismatches (very useful !!), tables to show the significance of hits (!), Java applets to show the binding sites on the sequence, ...
NOTE: Within the tabular results page, you can click at the header of every column to sort the output (e.g. click on "Sm" will sort the output by matrix similarity). 
NOTE: "===" within the "annotated sequence view" indicates hits above the secondary threshold, whereas "---" indicates below.
NOTE: The only disadvantage is the upper limit of 1 kb for input sequences !

2. NOTE that in addition to analyzing a sequence, you have a multitude of query options, like in the:
2.1. Query Transfac section:
Here you can query the TRANSFAC sites, factors, matrices and other tables. E.g., you can query the Transfac site table using a short potential regulatory sequence (Choose "Sequence" under "Search this field"), and see which known factors bind to your short sequence stretch.
2.2. Query matrices section:
You can obtain matrices from different sources (TRANSFAC, CBIL-GIBBS, IMD) very nicely at this site.
TFSEARCH
(CBRC, Japan)
TFSEARCH is maintained at the Computational Biology Research Center (CBRC) in Japan. TFSEARCH searches highly correlated sequence fragments against TFMATRIX transcription factor binding site profile database in the 'TRANSFAC' databases.
TFSEARCH produces a nice graphical output, but seems a little "outdated" !
MotifScanner
(Katholieke Universiteit Leuven, Belgium)

MotifScanner is a program that can be used to screen DNA sequences with precompiled motif models, like transcription factor binding sites.

NOTE: As MotifScanner is also part of the TOUCAN program for regulatory sequence analysis, please refer to this main section of MotifScanner for details.
The "stand-alone" version does not support the matrices from the JASPAR database, but the version integrated in TOUCAN does.

Motif Matching
NOTE: Motif matching (also known as pattern matching) is used to locate a specific pattern or profile matrix within a set of sequences. The section "Motif Design" describes how to define motifs. Motif matching is also widely used in promoter analyses, hence there are also close relations between these two topics. At the FAQ page, Motif Matching is mainly discussed in FAQ GEN4 (set of sequences), but also in GEN2 (single sequences).
Consensus Match
Consensus Match is part of the TOUCAN program for regulatory sequence analysis. Consensus Match is not available as a standalone application. This tool quickly identifies the presence of a user-defined string, like "AAAGGTAA" or even "WWRYAATC{1,5}NCA" in a set of sequences.
Please refer to this main description above for detailed information.
ConSite
(CGB, Karolinska)
ConSite is a program (and web interface) that couples phylogenetic footprinting with regulatory site detection (mainly promoter comparison), meaning that ConSite is primarily designed to compare 2 orthologous sequences and report conserved TFBS (Transcription Factor Binding Sites). At the ConSite start page, you have 3 different options. Note that the option "Analyze a single sequence" can also be used to scan the (single) sequence for the presence of a user-defined profile (raw counts matrix or position weight matrix), but not of a user-defined consensus sequence.
Please refer to this main section of ConSite for details.
(EMBOSS, Pasteur)
FUZZNUC is a program for nucleic acid pattern matching, using the typical "user-friendly" EMBOSS interface-style.
You simply paste the set of sequences you want to search, together with the pattern (consensus sequence) itself.
Note that patterns for fuzznuc are based on the pattern format used in the PROSITE database (amended to refer to nucleic acid sequences, not proteins), with the difference that the terminating dot '.' and the hyphens '-' between the characters are optional.
Example: [CG](5)TG{A}N(1,5)C
This searches for "C or G" 5 times, followed by T and G, then anything except A, then any base (1 to 5 times) before a C.
You can use ambiguity codes for nucleic acid searches but not within [ ] or {} as they expand to bracketed counterparts. For example, "s" is expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is illegal.
Note the use of X is reserved for proteins. You must use N for nucleic acids to refer to any base.
The search is case-independent, so 'AAA' matches 'aaa'.
Output: The matching hits are listed only as text, one sequence after the other. There is no graphical output, which is a drawback when screening large sequence sets, as it is not easy to get an overview of match counts and match positions.
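The pattern rules above can be tried out locally with a small sketch that translates a fuzznuc-style pattern into a Python regular expression. This is not fuzznuc's own code; the function names `fuzznuc_to_regex` and `find_hits` are made up for illustration, and mismatch handling is not covered.

```python
import re

# IUPAC nucleotide codes expanded to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def as_class(bases):
    """Render a set of bases as a regex atom: 'A' stays 'A', 'CG' becomes '[CG]'."""
    return bases if len(bases) == 1 else "[" + bases + "]"

def fuzznuc_to_regex(pattern):
    """Translate a PROSITE-style nucleotide pattern into a regex string."""
    # Hyphens and the terminating dot are optional, so drop them first.
    pattern = pattern.upper().replace("-", "").rstrip(".")
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":      # [CG]: any one of the listed bases
            j = pattern.index("]", i)
            out.append(as_class(pattern[i + 1:j]))
        elif ch == "{":    # {A}: any base except the listed ones
            j = pattern.index("}", i)
            excluded = pattern[i + 1:j]
            out.append(as_class("".join(b for b in "ACGT" if b not in excluded)))
        elif ch == "(":    # (5) or (1,5): repeat count for the preceding element
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
        else:              # a single base or an ambiguity code such as N
            j = i
            out.append(as_class(IUPAC[ch]))
        i = j + 1
    return "".join(out)

def find_hits(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping hit."""
    regex = re.compile(fuzznuc_to_regex(pattern), re.IGNORECASE)
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

print(fuzznuc_to_regex("[CG](5)TG{A}N(1,5)C"))
```

The search is case-independent through `re.IGNORECASE`, in line with the behaviour described above. Ambiguity codes inside [ ] or {} are not expanded by this sketch, which matches the restriction that they are not allowed there.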
Motif Matcher
Motif Matcher was developed by Jim Kent at UCSC and is part of the page "cis-Site Seeker". Motif Matcher is a program for finding where a given motif occurs in sequence data. You may paste up to six motifs into the small text areas and the sequence set to search into the field below.

Please note that this program does not accept consensus sequences (like in the RSAT program DNA-Pattern) as input (e.g. CsTATsGG), but only motifs. The Motif Matcher help gives a very good, concise introduction on the "nature" of a motif, making it quite easy to convert a consensus into a motif.
A motif is something that can recognize a consensus sequence which may not be completely conserved. The numbers represent the probability of finding each nucleotide at the corresponding position of a motif. Motif Matcher can also take motifs that are in the form of counts rather than probabilities. This lets you construct motifs from conserved areas of multiple alignments easily. For instance you could represent the multiple alignment:


as the motif ("consensus" format):

A 0 4 0 0 3 0 1 3 0
C 4 0 1 1 0 0 3 2 0
G 0 0 0 3 1 0 0 0 0
T 0 0 3 0 0 4 0 0 4

NOTE: This "Motif" definition would be equivalent to "profile matrix" in Patser (RSAT). In the motif, there is simply one space between each number and a line break between each line. See the Patser section for general information about motif formats !
NOTE: This format would be compatible with the "consensus" format in the program Patser (RSAT). BUT note that the Patser output is much longer and contains many more hits than the Motif Matcher output, at least when using the default parameters.
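As a rough illustration of the relation between the counts form and the probability form of a motif, the following sketch normalizes the count matrix shown above column by column and reads off the most frequent base per position. The helper names are hypothetical and this is not Motif Matcher's own code.

```python
# Count matrix copied from the Motif Matcher example above (rows A, C, G, T).
COUNTS = {
    "A": [0, 4, 0, 0, 3, 0, 1, 3, 0],
    "C": [4, 0, 1, 1, 0, 0, 3, 2, 0],
    "G": [0, 0, 0, 3, 1, 0, 0, 0, 0],
    "T": [0, 0, 3, 0, 0, 4, 0, 0, 4],
}

def counts_to_probabilities(counts):
    """Divide each count by its column total so that every column sums to 1."""
    length = len(counts["A"])
    totals = [sum(counts[b][i] for b in "ACGT") for i in range(length)]
    return {b: [counts[b][i] / totals[i] for i in range(length)] for b in "ACGT"}

def most_frequent_bases(counts):
    """Most frequent base per column (ties resolved in A, C, G, T order)."""
    length = len(counts["A"])
    return "".join(max("ACGT", key=lambda b: counts[b][i]) for i in range(length))

probabilities = counts_to_probabilities(COUNTS)
print(most_frequent_bases(COUNTS))
```

Normalizing per column means that columns with different totals are still handled; in the matrix as printed above, the eighth column sums to 5 while the others sum to 4.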

The output presents the highlighted motif matches along the sequences and a graphical summary of the motif positions, which is comparable to the RSAT "feature map" produced after performing a "DNA-pattern" or "Patser" search.
RSAT - Tools for pattern matching



1. DNA-Pattern (Strings) is part of the (RSAT) package of regulatory sequence analysis, which was created at the SCMBB, Brussels University. 
This method is used for situations where you know the regulatory motif (e.g. the consensus for a transcriptional factor), and you know the genes (set of sequences). Please refer to this main description above for detailed information.

2. Patser (Matrices) scans a DNA sequence with a profile matrix, which can be in Transfac, Gibbs, or Consensus format. Please refer to this main description above for detailed information.
NOTE: Patser scans a set of sequences with ONE pattern at a time, NOT with a specific combination of TFBS. For such searches, choose another program such as Target Explorer (see Target Explorer section).
3. Genome-scale DNA-pattern is used when you know the regulatory motif but not the genes: you search the genome for all genes possibly regulated by your element. Please refer to this main description above for detailed information.

4. Genome-scale Patser: When the pattern is relatively large and degenerate (which is not the case for GATA boxes), matrix-based genome-scale pattern matching (with Patser) provides better results than string-based pattern counts. Please refer to this main description above for detailed information.
TF Module Matching
NOTE: This section lists programs for the prediction or the "matching" of pre-selected combinations of transcription factor binding sites (TF modules) to promoter or enhancer sequences. The number of targets can range from single sequences to whole-genome searches ("TF-target prediction"). At the FAQ page, TF-target prediction is mainly discussed in FAQ GEN4. Also, there are overlaps with the more general question of "Motif Matching".
(Zlab, Boston University)
Cluster-Buster is already the third generation program (after Cister and Comet) developed at Zlab for finding clusters of pre-specified motifs in nucleotide sequences. The main application is detection of sequences that regulate gene transcription, such as enhancers and silencers, but other types of biological regulation may be mediated by motif clusters too.
Cluster-Buster may be used via this web server, or downloaded (Linux executable) for use on your local computer. NOTE: Large-scale analyses (like whole-genome approaches) are only possible using the downloaded version, while the web server accepts only single sequences of 100 kb max. length !

1. Input: A DNA sequence (pasted or uploaded) of 100 kb max. and a selection of TFBS which you want to use as a "module" to search the sequence. Some common TFBS can be selected from a list, which may also be combined with additional TFBS selected from a Zlab list of JASPAR sites, from the JASPAR homepage, and from TRANSFAC (see TRANSFAC section for retrieval of TRANSFAC matrices). See also the NOTE below concerning formats !
NOTE: This means that Cluster-Buster works with pre-made TFBS matrices AND / OR self-made matrices !

Cis-elements can be entered as TRANSFAC-style matrices, which look like this:
NA   AML-1a
DE runt-factor AML-1
BF T02256; AML1a; Species: human, Homo sapiens.
P0 A C G T
01 5 1 2 49 T
02 2 2 52 1 G
03 4 14 1 38 T
04 0 0 57 0 G
05 1 0 55 1 G
06 1 4 0 52 T
You can copy-and-paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the nucleotide frequency lines (beginning with digits) are ignored and not required. The name line is required, and should be above the base frequency lines. Alternatively, each cis-element can have a title line starting with ">" and then the name of the element, followed by 4 numbers per line describing nucleotide frequencies at each position (A-C-G-T) in the cis-element. For example:
0 4 2 14
12 0 0 8
8 0 1 11
20 0 0 0
13 1 1 5
These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines. The two formats can be mixed.
The Zlab list of JASPAR sites contains the JASPAR profiles reformatted to the TRANSFAC style, which can therefore be used directly as input for Cluster-Buster. Note that the original JASPAR format displays the 4 nucleotides as rows (which lets you read the consensus from left to right) instead of columns as in TRANSFAC (which lets you read the consensus top-down)!
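The formats described above can be converted into each other with a few lines of code. The sketch below keeps only the 'NA' line and the frequency lines of a TRANSFAC-style entry, writes the ">" title-line form, and transposes the matrix into the row-per-nucleotide layout. The function names are illustrative only and are not part of Cluster-Buster.

```python
# TRANSFAC-style entry copied from the example above.
AML1A = """NA   AML-1a
DE runt-factor AML-1
BF T02256; AML1a; Species: human, Homo sapiens.
P0 A C G T
01 5 1 2 49 T
02 2 2 52 1 G
03 4 14 1 38 T
04 0 0 57 0 G
05 1 0 55 1 G
06 1 4 0 52 T"""

def parse_transfac(text):
    """Return (name, rows); each row holds the A, C, G, T counts of one position."""
    name, rows = None, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NA":            # name line
            name = fields[1]
        elif fields[0].isdigit():        # frequency line: position, A, C, G, T
            rows.append([int(x) for x in fields[1:5]])
        # every other line (DE, BF, P0, ...) is ignored
    return name, rows

def to_title_line_format(name, rows):
    """Write the alternative form: '>' title line, then 4 numbers per line."""
    return "\n".join([">" + name] + [" ".join(map(str, row)) for row in rows])

def to_row_layout(rows):
    """Transpose so that each nucleotide (A, C, G, T) becomes one row."""
    return [list(column) for column in zip(*rows)]

name, rows = parse_transfac(AML1A)
print(to_title_line_format(name, rows))
```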

Also, there are several parameters that may be tuned in Cluster-Buster:
- Gap Parameter: indicates the average distance between motifs within a cluster (default: 35). Low values enhance tight clusters of weak motifs, whereas high values enhance loose clusters of strong motifs.
- Residue Abundance Range: specifies how far either side of each point in the sequence to count residue frequencies.
- Filter low-complexity regions:
Dust is a program that masks (replaces with 'n') regions of low compositional complexity in the sequence. Such regions, e.g. tandem repeats, may harbor very strong motif clusters, which might be spurious artefacts of the repetitive nature of the sequence, or then again they might be functional. It's your call.
- Filter lowercase letters: Check this option to exclude lowercase letters in the sequence from matching motifs. Lowercase letters are sometimes used to indicate repetitive sequences.

2. Output:
The first diagram shows an overview of motif cluster locations in the sequence. The height of individual clusters is proportional to the log likelihood ratio score. Next, detailed information for each cluster is printed.  Motifs on the forward strand are drawn above the central line and those on the reverse strand are drawn below. The scores are log likelihood ratios. The cluster score is log [ prob(cluster sequence given that it's a cluster of real sites) / prob(cluster sequence given that it's random DNA) ]. The motif score is log [ prob(motif sequence given that it's a real site) / prob(motif sequence given that it's random DNA) ]. The higher the better.
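To make the motif score formula more concrete, here is a minimal sketch that computes a log likelihood ratio for a single candidate site from a count matrix. Several details are assumptions and not taken from Cluster-Buster: the natural logarithm, a uniform background of 0.25 per base, and a pseudocount of 1 to avoid log(0). The cluster score, which combines several motifs, is not reproduced here.

```python
import math

# Count matrix from the second format example above (one row per position, A C G T).
ROWS = [
    [0, 4, 2, 14],
    [12, 0, 0, 8],
    [8, 0, 1, 11],
    [20, 0, 0, 0],
    [13, 1, 1, 5],
]

def motif_log_likelihood_ratio(site, rows, background=None, pseudocount=1.0):
    """log[ P(site | motif model) / P(site | background) ], summed over positions."""
    if len(site) != len(rows):
        raise ValueError("site and matrix must have the same length")
    if background is None:
        # Assumption: uniform base composition for random DNA.
        background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    score = 0.0
    for base, row in zip(site.upper(), rows):
        total = sum(row) + 4 * pseudocount
        p_site = (row["ACGT".index(base)] + pseudocount) / total
        score += math.log(p_site / background[base])
    return score

print(motif_log_likelihood_ratio("TAAAA", ROWS))
print(motif_log_likelihood_ratio("GCCCG", ROWS))
```

A site that follows the most frequent bases of the matrix gets a positive score, while a site made of rare bases gets a negative one, which is the sense of "the higher the better".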
DBTSS - Search for TF Binding Site
(Tokyo University)
DBTSS - Search for TF Binding Site is a "sub-program" of the DBTSS database, accessible via the links in the left frame at the start site. DBTSS contains exact information of the genomic positions of the transcriptional start sites and the adjacent promoters for most of the annotated human and mouse genes.
Please refer also to the DBTSS main section for details !

DBTSS - Search for TF Binding Site can search for promoters containing putative binding sites of particular transcription factors (TFs). There are 3 sequence databases which allow a "whole-genome" search for TFBS modules: human, mouse, and human-mouse conserved. The human and mouse sequence databases contain 1.2 kb for each gene (1.0 kb upstream of TSS and 0.2 kb downstream of TSS). This means that DBTSS focuses on proximal promoters, but not on distal regulatory elements like enhancers or silencers. Analysis of evolutionary conservation is therefore restricted to this region.    

- Selection of target sequences: human, mouse, or human-mouse conserved.
- Selection of up to 5 different TFBS. NOTE: This program allows a combined search of TFBS matrices (taken from TRANSFAC 6.0) with user-defined consensus patterns, which may also include IUPAC codes! This also lets you search for binding sites which are not represented in the TRANSFAC database, or which you have just determined experimentally in the lab!
- in case of Transfac TFBS, select a core similarity and matrix similarity (stringency of consensus matching; "1" means perfect match).
- Choose a target region: TSS designated as 0; Available: -1000 to 200.

List of genes where the selected combination of patterns is found in the promoter region, with links to:
- Individual gene / promoter reports, displaying different types of cDNA (RefSeq, Ensembl transcripts, AND the DBTSS-specific oligo-capped cDNAs), together with features like SNPs, CpG islands, and repetitive elements. Also, a detailed view of the promoter region is shown, displaying the positions of the query patterns (TFBS and user-defined), and the positions of individual entries of the TRANSFAC TFSITE table, showing information on individual experimentally verified TF-DNA interactions. Please refer to the TRANSFAC section for details how to access these databases.
- Comparative View of the human / mouse promoters, aligned by LALIGN, showing the matching regions and the actual sequence alignments, together with the positions of the query patterns (TFBS and user-defined) within the 2 species.
EEL - Enhancer Element Locator
(University of Helsinki)
EEL - Enhancer Element Locator is a tool for locating distal gene enhancer elements in mammalian genomes by comparative genomics and for identifying conserved TFBS in predicted enhancers. EEL is described in Hallikas et al., Cell 2006.

EEL can be used in 2 different ways:
1. The EEL software is available for download in various forms (Windows, Mac, LINUX) under the GNU General Public Licence.

2. EEL Database of precomputed EEL alignments: EELweb stores precomputed alignments between orthologous genes from human and many other species. The data is regularly updated with some synchronization with ENSEMBL database, which is used as source of genomic information. Again, EEL database can be used to address 2 different questions:
2.1. Search for conserved TFBS (all or selected from a list) in 100 kb upstream and downstream regions of a specific gene (set) of interest:
- A list of Ensembl Gene IDs can be used as query to search for precomputed TFBS in predicted enhancer regions of these genes.
- Select the suitable species comparison. The Ensembl IDs must correspond to the chosen organism.
- Sites in the module: Restrict the query by requiring certain types of transcription factor binding sites to be conserved in the elements. Using this option requires restricting the query also by Gene IDs, because of the database server computation power.
- Note that the maximum number of results listed is 1000. If you want to produce higher numbers, you have to install the local version of EEL.
2.2. Search for conserved TFBS in 100 kb upstream and downstream regions of ALL genes (whole-genome approach):
It is possible to use EEL in a whole-genome approach. For this purpose, simply leave the field for the Ensembl Gene IDs empty (which means that you want to use ALL Ensembl IDs).
- Note: works only with the "any site" selection of TFBS.
- Output of the web-based database version is restricted to max. 1000 hits.

- A test run using the human genes CD8A (ENSG00000153563) and CD8B (ENSG00000172116) did not produce any enhancer regions / TFBS, although this locus is well documented concerning functional enhancers. Thus, it may be questioned whether the EEL database really holds a comprehensive list of genes / enhancers. 
FastM and ModelInspector
(Genomatix Inc., Munich, Germany)
FastM is a method to develop user defined models of transcriptional regulatory DNA units (e.g. promoters). These models can be built using various individual elements (like transcription factor binding sites, repeats, hairpins) and their sequential order. Thus, IUPAC sequence elements can be successfully combined with different types of weight matrices and structural elements (e.g. hairpins) in the assessment of match quality. Between each pair of elements for a model a distance range has to be defined.

ModelInspector utilizes either a library of predefined models or models generated by FastM or FrameWorker to scan your own DNA sequences or sequence databases for new regulatory units matching the model. Examples of databases which may be scanned are: GPD (Genomatix Promoter Database), ElDorado Genomes, EPD (Eukaryotic Promoter Database), RefSeq, and various GenBank sections.

Note: FastM and ModelInspector form a system that helps you create your own model of regulatory elements; they are NOT a system for extracting a model (co-occurring sites) from a given set of sequences (as FrameWorker, the TOUCAN program ModuleSearcher, or CREME do)! Rather, ModelInspector is similar to the TOUCAN program ModuleScanner.

NOTE: FastM and ModelInspector are part of the "GEMS Launcher" section of the commercial Genomatix Suite.
NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is a limitation not only in the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also in the functionality of the obtained data!

ModuleScanner is part of the TOUCAN program for regulatory sequence analysis. ModuleScanner performs whole-genome searches with a predicted CRM (cis-regulatory module = combination of transcription factor binding sites) or with a user-defined CRM known from the literature to find possible target genes. ModuleScanner restricts the search to regions that are evolutionarily conserved between 2 species in order to increase the chance of picking biologically meaningful CRMs.

Please refer to this main description  for detailed information.
(CGB, Karolinska)
MSCAN was developed at the Center for Genomics and Bioinformatics (CGB), a department of the Karolinska Institute. MSCAN is an algorithm that detects clusters of pre-specified transcription factor binding sites in genomic sequences.

1. Input:
- A set of one or multiple DNA sequences of interest. Note: The maximum size of the sequence set is 10 Mb, which is much more than e.g. in Cluster-Buster, but is still not enough for whole-genome approaches (A test  set of "all" human promoters (1 kb each) contained about 20 Mb). Sequences may contain all IUPAC characters.

- A set of transcription factor binding sites represented as position frequency matrices.
The binding site models of the transcription factors (TFs) must be given as position frequency matrices (PFMs). The position weight matrices (PWMs) used by MSCAN will be calculated from the PFMs. A PFM here has four rows and summarizes a collection of aligned binding site sequences for the TF; each column in the PFM gives the observed number of nucleotides (A, C, G, and T) at the corresponding position in the alignment. E.g. the following alignment
will give the matrix:
           A:[ 12  3  0  0  4  0 ]
           C:[  0  0  0 11  7  0 ]
           G:[  0  9 12  0  0  0 ]
           T:[  0  0  0  1  1 12 ]
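How a PFM is counted from aligned sites can be sketched in a few lines. The four sites used below are hypothetical and chosen only for illustration; they are not the alignment behind the matrix shown above, and the function names are not part of MSCAN. The sketch assumes that sites contain only A, C, G, and T.

```python
def alignment_to_pfm(sites):
    """Count, for every alignment column, how often each nucleotide occurs."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for position, base in enumerate(site.upper()):
            pfm[base][position] += 1
    return pfm

def format_pfm(pfm):
    """Print the PFM with one row per nucleotide, as in the layout above."""
    return "\n".join(
        "%s:[ %s ]" % (base, " ".join("%2d" % n for n in pfm[base]))
        for base in "ACGT"
    )

# Hypothetical aligned binding sites, for illustration only.
example_sites = ["AGGCCT", "AGGCAT", "AAGCCT", "AGGTAT"]
print(format_pfm(alignment_to_pfm(example_sites)))
```

Every column of the resulting PFM sums to the number of aligned sites, which is a quick sanity check before submitting a matrix.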

Each PFM must be preceded by a FASTA format ID line. The resulting format that is required for input is thus as follows:
0  5 13  0  0  0  0  0  0
3 11  0  5  0  0  0  1  1
0  0  2  1  1 17  0 16 15
15 2  3 12 17  1 18  1  2
18 6 16 18  3  0  5
0  0  0  0  9  4  2
0  2  0  0  2  2  0
0 10  2  0  4 12 11

The input page also offers the possibility to retrieve matrices from 2 databases. Matrices can be obtained from the JASPAR database (see also JASPAR section), an open access database with over a hundred binding profiles. Alternatively, matrices can be obtained from the TRANSFAC database (see TRANSFAC section for details). The matrices from JASPAR can be obtained either by selecting them from the menu or by uploading a list of JASPAR ids. The TRANSFAC matrices can only be obtained by uploading a list of TRANSFAC ids. The format for the files containing ids is as follows. The data should be organized in 2 columns. The first column contains the name of the database, JASPAR or TRANSFAC, and the second column contains the identifier. E.g.
JASPAR            MA0004
JASPAR            MA0006
One file may contain both JASPAR and TRANSFAC ids.

2. Parameters:
- Window size: The window size is the number of basepairs of the sliding window. The minimum value is 111 basepairs. This number can be adjusted to reflect biological knowledge about the number of sites that are expected in a module and the spacing between the sites. If no prior knowledge about the module is available, 200 bp is a good default value.
- Minimum number of hits: When looking for modules, this parameter should be set to 2 or higher. Note: When set to 1, MSCAN will also output regions that contain a single high scoring binding site.
- Threshold score (most important parameter): MSCAN calculates a significance value for each window, based on the various sets of binding sites that can be found in the window. The significance value reflects how often a window with the found cluster would occur in a random sequence. Whenever the significance value of the window is below the given threshold, it is regarded as a regulatory region. If this parameter is set to a very low value, only clusters with a very high significance will be output.
- Replace lowercase with N: Sequences that contain low complexity regions or repeated regions may give rise to high scoring modules that are caused by the repeats. The biological function of such modules may be questionable. Repeated and low complexity regions can be detected by programs such as Repeat Masker and Dust. These non-informative sequences may be given as N's or as lowercase characters. In the latter case, checking the "replace lowercase with N" option will cause MSCAN to ignore these regions in the input sequences.

3. Output:
The DNA sequences are scanned for the occurrence of clusters of transcription factor binding sites. This is done as follows:
A window of a given number of base pairs is slid over the DNA and evaluated for the presence of transcription factor binding sites. The score for each match of a transcription factor binding site is expressed as a P-value and is calculated based on the base pair frequency in the window that is evaluated.
The score for an entire window that contains multiple binding sites is a function of the scores for the individual transcription factor binding sites. This score is also expressed as a P-value. This P-value indicates the likelihood of observing a window with the same cluster of binding sites in a random sequence with the same basepair frequency. When the window scores below a user given score threshold, it is indicated as a regulatory region.
If multiple overlapping  high scoring windows are found, these windows are fused to create a larger regulatory region, consisting of multiple windows. The score of the best window within this regulatory region is shown in the output.
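The last two steps (thresholding windows and fusing overlapping ones) can be sketched as follows. The P-value computation itself is not reproduced; the sketch assumes that each window already carries a P-value, that window ends are exclusive, and that "overlapping" means sharing at least one base. The function name is made up for illustration.

```python
def fuse_significant_windows(windows, threshold):
    """windows: iterable of (start, end, p_value) with end exclusive.

    Keeps windows whose P-value is below the threshold, fuses overlapping
    ones into larger regions, and reports the best (lowest) P-value of the
    windows inside each region.
    """
    significant = sorted(w for w in windows if w[2] < threshold)
    regions = []
    for start, end, p_value in significant:
        if regions and start < regions[-1][1]:      # overlaps the previous region
            prev_start, prev_end, prev_p = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end), min(prev_p, p_value))
        else:
            regions.append((start, end, p_value))
    return regions

# Illustrative windows of 200 bp with made-up P-values.
example = [(0, 200, 1e-5), (50, 250, 1e-7), (100, 300, 0.5), (400, 600, 1e-4)]
print(fuse_significant_windows(example, threshold=1e-3))
```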
(Lawrence Livermore National Lab)
Note: Many programs of this portal, like rVISTA, zPicture, and ECR Browser, are described in the section "Comparative Genomics"!
SynoR performs genome-wide scans for clusters of evolutionary conserved transcription factor binding sites (cTFBS) in user-specified spatial configurations. SynoR is part of the portal for comparative genomics and gene regulation. The current version of this program scans human and mouse genomes for TFBS conserved in comparisons with either other mammals, chicken, frog, or fish. The identified cTFBS modules and corresponding genes go through several steps of functional annotation. (1) cTFBS modules are classified as promoters (regions 1.5 kb upstream of TSS), UTRs, introns, intergenic, or coding exons depending on their relationship to "UCSC known genes". (2) Interspecies conservation is performed for all the identified modules to describe the evolutionary history of different modules. (3) Gene Ontology (GO) characterization is performed for genes bracketing the identified noncoding modules. (4) GNF Expression Atlas 2 analysis is performed for these genes, thus allowing the prediction of tissue specificity of the identified modules.

1. Input: The user has to select individual TFBS which build a module and optionally may determine the count (if one TFBS occurs more than once in the module), and the directionality (+ or - strand for each TFBS). Also, the user may fix the order of the TFBS (in cases where a TF module is only functional when the TFBS are arranged in a specific order along the chromosome). The user may then specify the maximal distance (between 3 and 1000 bps) between neighbouring binding sites, and finally choose the base genome and the comparison genome which are then used to extract conserved TFBS and conserved TF modules.
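The kind of query described above (a fixed order of TFBS with a maximal distance between neighbouring sites) can be illustrated with a small sketch. It is not SynoR's implementation: strand, site counts, and evolutionary conservation are ignored, distance is measured between site start positions, and the TF names are placeholders.

```python
def find_ordered_modules(sites, order, max_distance):
    """sites: list of (position, tf_name); order: TF names in the required order.

    Returns one list of positions per module in which the TFBS occur in the
    given order and each neighbouring pair is at most max_distance apart.
    """
    sites = sorted(sites)
    modules = []

    def extend(chain, next_index):
        if next_index == len(order):
            modules.append(chain)
            return
        for position, name in sites:
            gap = position - chain[-1]
            if name == order[next_index] and 0 < gap <= max_distance:
                extend(chain + [position], next_index + 1)

    for position, name in sites:
        if name == order[0]:
            extend([position], 1)
    return modules

# Placeholder site predictions on one sequence.
predicted = [(100, "TF_A"), (150, "TF_B"), (900, "TF_B"), (180, "TF_C"), (2000, "TF_A")]
print(find_ordered_modules(predicted, ["TF_A", "TF_B", "TF_C"], max_distance=100))
```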

Please refer also to the SynoR instructions page for further details !

2. Results:
2.1. Detailed modules' annotation in table and text form.
- The gene name where the TF module was detected "in proximity". NOTE: There is no direct gene related information in SynoR, but you may select the ECR Browser link and then hit the RefSeq sequence (blue line) to retrieve these data from UCSC Genome Browser.
- The type of genomic region (promoter, UTR, CDS, intron).
- Visualization of the TF module region in ECR Browser.
- TFBS Map: shows graphically the order and spatial distance between the individual TFBS.
- Fasta Sequence: sequence of the respective genomic region.
2.2. Functional annotation of genes corresponding to noncoding modules:
- Enrichment in Gene Ontology categories: This function allows a prediction of the biological function of a certain combination of TFBS via the GO terms which have been attributed to the potential target genes. The program determines whether certain GO terms in the set are overrepresented as compared to the occurrence of GO terms across ALL genes.
- Tissue-specificity of identified genes: Here, the user can quickly identify if the potential target genes are selectively expressed in certain tissues, implying that the combination of TFBS may be important for specific tissues. This is accomplished by integrating expression data from the GNF Expression Atlas (version 2). Please refer also to this main section for details !
2.3. Chromosomal distribution: an image is presented which lets you quickly estimate whether the potential target genes are clustered in certain chromosomal regions.
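The idea behind the GO over-representation check in 2.2 can be illustrated with a hypergeometric test, a common choice for this kind of question. Which statistic SynoR actually applies is not stated here, so the sketch below is an assumption, and the function name and the numbers in the example call are illustrative only.

```python
from math import comb

def enrichment_p_value(all_genes, annotated_genes, target_genes, annotated_targets):
    """Probability of seeing at least `annotated_targets` genes with a GO term
    among `target_genes` genes drawn at random from `all_genes` genes, of which
    `annotated_genes` carry the term (hypergeometric upper tail)."""
    upper = min(annotated_genes, target_genes)
    favourable = sum(
        comb(annotated_genes, k) * comb(all_genes - annotated_genes, target_genes - k)
        for k in range(annotated_targets, upper + 1)
    )
    return favourable / comb(all_genes, target_genes)

# Illustrative numbers: 20000 genes, 200 with the term, 50 targets, 5 of them annotated.
print(enrichment_p_value(20000, 200, 50, 5))
```

A small P-value suggests that the GO term occurs more often among the potential target genes than expected from its frequency across all genes.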

NOTE: If you do not know a combination of TFBS a priori and want to extract such modules from the promoters of a set of co-regulated genes, you may use the tool CREME for this purpose. Please refer to the main section of CREME for information !
Target Explorer
(Columbia University)
Target Explorer automates the entire process from the creation of a customized library of binding sites for known transcription factors through the prediction and annotation of putative target genes that are potentially regulated by these factors.

NOTE: Target Explorer was specifically designed for the well-annotated Drosophila melanogaster genome, but some options can be used for any sequence of interest.
NOTE: A free registration is needed in order to use the Target Explorer programs.
NOTE: Target Explorer comes with a very well-written, informative manual, which also explains terms like "alignment matrix" or "weight matrix".

1. New weight matrix creation for a TFBS:
This tool generates a positional weight matrix representation of a TFBS starting from a series of  binding sites (experimentally determined or taken from the literature). You may enter your set of at least 5 sequences that contain known binding sites in fasta format. The sequences do not have to be aligned. The sequences must not contain any IUPAC-codes (as the purpose is the generation of a matrix from individual binding sites, it should be possible to enter the exact sequences of these sites) ! The user may choose the length of the binding site or leave this choice up to the program. Motif recognition in unaligned sequences is based on programs CONSENSUS and WCONSENSUS by Jerry Hertz. The resulting weight matrix can be saved either to the public domain or to the private library, where it can be recalled for use with the different tools.
2. Search for putative binding sites and their clusters: When choosing this link, the program first shows the list of matrices which have been saved. The user can choose one or several matrices, and can select whether to search for each matrix separately or to search only for combinations of sites by applying a user-defined maximum distance in potential target promoters. Scoring of potential binding sites is based on program PATSER by Jerry Hertz.
The user can choose between different target sequences:
- copy/paste a list of sequences in fasta format.
- upload a list of sequences in fasta (txt) format. Note: This works very similarly to the interfaces of the RSAT tools DNA-pattern or Patser.
- Select the whole Drosophila genome or parts of it (like individual chromosomes).
NOTE: The first 2 options can be used to scan ANY kind of sequence set (not only Drosophila sequences) for modules of TFBS. BUT: Test runs showed very long processing times when using larger sequence sets (like whole-genome promoter sets), and upload of a fasta sequence file did not work at all. In general, graphical output should be deselected for larger sequence sets. 
Unlike programs like SynoR, Target Explorer does not take any kind of evolutionary conservation of TFBS into account.
For comparison: The RSAT tool Patser scans a set of sequences using ONE pattern, but NOT using a specific combination of TFBS.
- list of clusters in potential target sequences.
- graph for individual clusters.
- graphical map of binding sites (works only if your search template is a single sequence).
3. Additional analyses restricted to Drosophila genome:
- Annotation of genes near predicted binding sites. Annotation is based on sequence annotation for Drosophila Genome.
- Analysis of clusters conserved between 2 Drosophila species.
- Map of binding sites with genes directly linked to the annotations of the Drosophila genome database.     
Selected TF-Target Databases
NOTE: The section "TF Module Matching" contains programs which try to predict the promoters where a certain TF or a combination of TFs could actually play a functional role. This section here lists databases which store the results of such analyses.
CREB Target Gene Database
(Salk Institute, La Jolla, CA)
The CREB Target Gene Database is dedicated to categorizing CREB target genes in a comprehensive and easy-to-search way. A multi-layered approach is used to predict, validate and characterize CREB target genes. The database is described in Zhang et al., PNAS 2005. The cyclic AMP response element (CRE)-binding protein (CREB) family of activators (CREB1, CREM, ATF1) functions in diverse physiological processes, including the control of cellular metabolism, growth-factor-dependent cell survival, and development and plasticity of neurons. A diverse range of signals, including cAMP, calcium, stress and mitogenic stimuli, can activate CREB and promote target gene expression.

For each gene, you can search for:
1. CREB binding on the promoter (ChIP-on-chip data),
2. Presence of CRE (cAMP-response element) on the promoter (promoter sequence analysis result),
3. Induction by cAMP in specific tissues (Affymetrix gene profiling data).

To start, enter the gene symbol (e.g. HDAC6), gene ID (e.g. 10013), or gene name (e.g. histone deacetylase 6) and click search. You should use the official gene symbol or gene ID (locuslink number) from NCBI. You can use Entrez Gene to get the symbol or gene ID of your favorite gene. The CREB Target Gene Database is now an NCBI LinkOut resource provider, so you can go directly from NCBI Entrez Gene to this database using the LinkOut feature. Once you've found your favorite genes in NCBI Entrez Gene, choose LinkOut in the display options or in the "Links" dropdown menu. If that gene is in the CREB Target database, a link will be shown.

Note: To browse a table within the CREB Target Database, type "all" for query and all genes will be shown at 30 genes/page. If you only want to look at the best CREB target genes in a table, type the magic word "CREBtarget" as query.                 
TFBS Discovery
NOTE: This section lists programs which "discover" TFBS which are over-represented in a given sequence set. This means that TFBS are not pre-selected by the user in order to be matched onto the sequences, but are predicted by the programs as statistically significant for this set. 
OTFBS - Over-represented Transcription Factor Binding Site Prediction Tool
(Institute of Bioinformatics, Tsinghua University, Beijing, China)
OTFBS, developed at the Institute of Bioinformatics, Tsinghua University, Beijing, is a method which can detect over-represented motifs of known transcription factors in a set of related sequences. In particular, promoters of the same gene family or from the same tissue can be submitted for analysis. Promoters of putatively co-regulated genes clustered by gene expression data are also good candidates for analysis. OTFBS currently uses TRANSFAC Matrix Release 6.0.

Input: Simply submit the upstream regulatory regions of a group of related genes (max. 200), e.g. genes clustered together by microarray data, or the genes of the same functional protein from a series of related species. Note: There is no option to adjust any parameters.
Output: A simple list of overrepresented TFBS, and the positions of all TFBS in all input sequences. Note that only the TRANSFAC Matrix accession numbers are listed (like "M00086"), NOT the names of the TFBS !!! If you want to know the identity of these TFBS matrices you have to query the individual accessions at TRANSFAC (see TRANSFAC section for instruction !).
TELiS - Transcription Element Listening System
(UCLA, Los Angeles)
TELiS is a very fast and very easy-to-use system to find transcription factor binding motifs (TFBMs) that are over-represented in promoters of differentially expressed genes. Modern high-throughput methods like microarrays often generate lists of genes which show a different expression pattern under different experimental conditions like biological stimuli. TELiS is capable of extracting very quickly the over-represented TFBMs in such promoter datasets. This is done by pre-solving the most computationally intensive part of the problem, scanning large nucleotide sequences for multiple TFBMs. Thus, the TELiS database contains information on the prevalence of TFBMs in the promoters of all human, mouse, and rat genes.
- 3 different promoter sizes have been extracted from genome databases: 300 or 600 bases upstream of the Transcription Start Sites (TSS), or a region from -1000 to +200, all corresponding to mRNA sequences from the NCBI RefSeq database.
- The program MatInspector was used to determine the TFBMs in these sequences, using 3 different stringency values: Matrix similarity 0.8 ("low"), 0.9 ("high"), and 0.95 ("extreme").
- 2 different TFBM databases can be used, the public TRANSFAC database version 3.2, or the open-access JASPAR database. Note: Although TRANSFAC has a higher number of TFBS, this public version is not updated, in contrast to JASPAR, which is a smaller set that is non-redundant and curated. Note: If you want to get detailed information on individual TFBMs, please refer to FAQ GEN7 ! Note that also in TELiS, the matrices of all TRANSFAC TFBMs and JASPAR TFBMs can be browsed one after the other, which is still less convenient than using other options listed in FAQ GEN7.

The only input required from the user is a list of HUGO Gene Symbols, separated by tabs, spaces, or line breaks, and to choose one of the promoter sizes and one of the stringency values. The TELiS publication states that analyses of short promoter sequences (300 bases) with moderate stringency (0.90) provided optimal signal detection, whereas analyses using longer sequences or lower stringency produced poorer signal-to-noise ratios.
Finally, the user has to select the microarray platform which was used in the personal experiment. This last point is necessary because the TFBMs identified in the selected genes are compared to the TFBMs pre-identified in all genes contained in the experimental platform as a reference population, in order to determine over- or under-representation (please also refer to the TELiS background page for additional discussion). NOTE: If your array is not listed, you may simply select "All human / mouse / rat genes" at the bottom of the dropdown-list, meaning that ALL genes of a species are taken as reference for the analysis.
Also note that the best results are achieved with sets of 100 or more genes/promoters, whereas the analytic sensitivity drops significantly for samples <20. Nevertheless, Incidence analysis p-values are described to remain accurate for any sample size.

There are 2 different ways to analyze a dataset:
1. "Differential expression analysis": Here, a list of the top-scoring TFBMs is produced, which are color-coded to allow easy identification of over-represented (dark blue), indifferent (grey) and under-represented (red) TFBMs. The "Incidence" indicates the number (n) and the percent ("Sample mean") of promoters which contain at least one binding site, which can be easily compared to the percent of total promoters of the platform containing this site ("Population mean"), and the resulting "Ratio" between promoters in the personal dataset and total promoters of the array.

2. "Get raw data": Using this option, the data can be downloaded as *.td format, which is best opened with programs like EXCEL (which maintains the tabular structure of the output), not with word processors like WORD. This table shows the number of each binding site in each of the gene's promoters.
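The "Incidence" columns described above can be recomputed by hand for a single TFBM. The following Python sketch is only an illustration: the function name and the example counts are hypothetical, and the hypergeometric upper tail used for the p-value is an assumption, not necessarily the statistic that TELiS itself applies.

```python
from math import comb

def incidence_summary(sample_hits, sample_size, population_hits, population_size):
    """Return (sample mean, population mean, ratio, one-sided p-value).

    *_hits is the number of promoters that contain at least one binding
    site for the motif in question.
    """
    sample_mean = sample_hits / sample_size
    population_mean = population_hits / population_size
    ratio = sample_mean / population_mean
    # Assumed test: hypergeometric upper tail, i.e. the chance of drawing at
    # least `sample_hits` motif-containing promoters when `sample_size`
    # promoters are picked at random from the reference platform.
    total = comb(population_size, sample_size)
    tail = sum(
        comb(population_hits, k)
        * comb(population_size - population_hits, sample_size - k)
        for k in range(sample_hits, min(sample_size, population_hits) + 1)
    )
    return sample_mean, population_mean, ratio, tail / total

# Hypothetical example: 30 of 100 selected promoters carry the motif,
# compared with 1500 of 10000 promoters on the whole array.
stats = incidence_summary(30, 100, 1500, 10000)
print(stats)
```

A ratio above 1 corresponds to an over-represented TFBM, a ratio below 1 to an under-represented one.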
(Leuven Katholieke Universiteit, Belgium)

The "Statistics" module integrated in TOUCAN can be used to determine those TFBS in a sequence set which are statistically over-represented. Note that you first have to run a program which detects TFBS in your set, and THEN you may run the "Statistics" tool.

Please refer to the main section of Statistics-TOUCAN and MotifScanner.
TRES - Transcription Regulatory Element Search
(Singapore University)
Using TRES you can simultaneously search up to 20 promoter sequences (of maximum 1000 bp each) for known transcription factor binding sites, cis-acting elements, palindromic motifs or conserved k-tuples (phylogenetic footprints). This is useful for comparative promoter sequence analysis to elucidate common themes (modules) in functionally or phylogenetically related promoters.

TF binding sites are searched from TRANSFAC database, from ooTFD database and also plant cis-acting elements from PLACE database.
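The idea of conserved k-tuples and palindromic motifs can be illustrated with a few lines of Python. This is a minimal sketch under simple assumptions (exact matches only, toy sequences, hypothetical function names); it is not the algorithm used by TRES.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def shared_kmers(sequences, k):
    """Return the k-tuples that occur (exactly) in every input sequence."""
    kmer_sets = [
        {seq[i:i + k] for i in range(len(seq) - k + 1)} for seq in sequences
    ]
    return set.intersection(*kmer_sets)

def is_palindromic(motif):
    """A DNA motif is palindromic if it equals its own reverse complement."""
    return motif == reverse_complement(motif)

# Toy promoter fragments (hypothetical data)
promoters = ["TTGAATTCAG", "CCGAATTCTT", "AGAATTCGGA"]
common = shared_kmers(promoters, 6)
print(sorted(common))
print([m for m in sorted(common) if is_palindromic(m)])
```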

Motif Discovery
NOTE: Motif Discovery means the ab-initio prediction of over-represented motifs from a sequence set. Many computational motif discovery tools work both on sets of DNA and PROTEIN sequences. The prediction of over-represented motifs is widely used in promoter analyses, hence there are also close relations between these two topics. At the FAQ page, Motif Discovery is mainly discussed in FAQ GEN5.
Linkpage 1 - Assessment of Computational Motif Discovery Tools
(Washington University)
This page of the Washington University provides an assessment of motif discovery programs, applied to the specific task of discovering novel transcription factor binding sites in DNA sequences. The purpose of the assessment is to provide some guidance regarding the accuracy of currently available tools in various settings, and to provide a benchmark of data sets for assessing future tools.
Programs that participated in the assessment are AlignACE, ANN-Spec, Consensus, GLAM, Improbizer, MEME, MITRA, MotifSampler, Oligo/Dyad-Analysis, QuickScore, SeSiMCMC, Weeder, and YMF.
(Harvard University)
AlignACE (Aligns Nucleic Acid Conserved Elements) is a program which finds sequence elements conserved in a set of DNA sequences. You may obtain AlignACE for local use from this Download Site, there is NO web-server. Two versions are offered, Linux and Microsoft. The Linux version is newer, faster, and is highly recommended. It is the only version that is being actively supported. 
(CBS, Denmark)
ANN-Spec (Artificial Neural Network to determine the Specificity of DNA-binding proteins) is a UNIX program for discovering motifs in biological sequences. It tries to solve the general problem of finding ungapped local multiple sequence alignments.
You may obtain ANN-Spec for local use from this Download Site, there is NO web-server.
The program is primarily intended for finding transcription factor binding sites in promoter regions though it can be applied to strings of arbitrary alphabets including protein sequences.
(Washington University in St.Louis)


1. Consensus is a matrix-based pattern discovery for DNA, RNA or protein sequence sets. The Stormo Lab located at Washington University in St.Louis developed a website where Consensus can be run on a set of user sequences, and the output is shown in graphic interfaces. The query interface is very user-friendly and therefore also suitable for "beginners".
NOTE: You may also download Consensus for Linux systems.
Input: You simply paste your set of FASTA-formatted sequences, choose the length of the motif to search, and select if you want to have the reverse strand included. Note that it is highly recommended to receive the results via Email, otherwise the "live" process may slow down your computer considerably !
NOTE: Test runs using an example cluster of promoter sequences showed a quite "non-specific" result list of top-scoring patterns, like "AAAAAAAA" or "TTTTTTTT". It should be analyzed thoroughly if specific settings could overcome this problem.

2. This Consensus interface is provided with the RSAT package of regulatory sequence analysis.
Please refer to this main description above for detailed information.
Gibbs Motif Sampler
(Wadsworth Center, New York)

Gibbs Sampler


1. The Gibbs Motif Sampler program is provided by the Wadsworth Center, New York. The Gibbs Motif Sampler allows you to identify motifs (conserved regions) in DNA or protein sequences. There are many parameters which can be set, but for a start you may simply select either the "prokaryotic defaults" or the "eukaryotic defaults". Results are transferred via Email (you have to be patient!).

2. Gibbs Sampler is a widely-used program to detect matrices in a set of DNA or protein sequences. This interface is provided with the RSAT package of regulatory sequence analysis, which was created at the SCMBB, Brussels University.
Please refer to this main description above for detailed information.

3. MotifSampler is part of the TOUCAN program for regulatory sequence analysis. MotifSampler is based on Gibbs Sampling.
Please refer to this main description above for detailed information.

4. GIBBS motif sampling is also available as web interface at Pasteur Institute.
(Zlab, Boston University)
GLAM is a program for discovering functional motifs shared by a set of nucleotide sequences. Examples of functional motifs include transcription factor binding sites, mRNA splicing control elements, signals for mRNA 3'-cleavage and polyadenylation.
GLAM attempts to find these motifs by obtaining the best possible gapless, multiple alignment of segments of the sequences. The 'best' alignment is the one that maximizes the value of a certain formula.
You may obtain GLAM for local use (LINUX, Sun, SGI/IRIX) from this Download Site, there is NO web-server.
Improbizer was developed by Jim Kent at UCSC and is part of the page "cis-Site Seeker". Improbizer searches for motifs in DNA or RNA sequences that occur with improbable frequency (to be just chance) using a variation of the expectation maximization (EM) algorithm.
Improbizer comes with a very user-friendly web interface, very clear-structured, and all parameters are very well and concisely explained.

NOTE: Longer sequences take more time than the same number of bases in short sequences. 100 sequences of 100 base pairs each will take about a minute. Unfortunately, the input sequence set is therefore restricted to quite small sizes, otherwise a message is displayed "Sorry, this job is too big for our web server". 

The MEME (Multiple Em for Motif Elicitation) system allows you to discover motifs (highly conserved regions) in groups of related DNA or protein sequences using MEME, and to search sequence databases with these motifs using MAST (Motif Alignment and Search Tool).

Note: There is also a MEME mirror at Pasteur Institute.

Example - motifs in promoter sets: Simply provide a FASTA-file of your promoter sequence set, and set the number of motifs to be extracted. In addition, you may set a "minimum/maximum motif width", e.g. for TF binding sites you may choose 6 and 9, meaning the program will extract only motifs ranging from 6 to 9 bp. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.
Output: MEME sends 3 mail-messages: a confirmation, the MEME results and the MAST results. The MEME results include very nice multi-colored diagrams, and consensus sequences of extracted over-represented motifs, and links to perform additional analyses like BLOCKS, MAST, and MetaMEME.

1. Oligo-analysis is part of the (RSAT) package of regulatory sequence analysis, which was created at the SCMBB, Brussels University.

This is a simple and fast method to extract over-represented elements, and works for DNA, protein, or even "any" type of "sequences", meaning that you can even screen any type of text for over-represented motifs.
Please refer to this main description above for detailed information.

2. Dyad-analysis is similar to Oligo-analysis, but specifically screens for "double" motifs separated by a spacer of defined length.
Please refer to this main description above for detailed information.
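The difference between counting oligos and counting dyads can be shown with a short Python sketch. It only counts occurrences in toy sequences; the comparison against expected frequencies that determines over-representation is left out, and the function names are hypothetical rather than part of RSAT.

```python
from collections import Counter

def count_oligos(sequences, k):
    """Count every overlapping k-mer in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def count_dyads(sequences, half, spacer):
    """Count pairs of `half`-mers separated by exactly `spacer` bases."""
    counts = Counter()
    span = 2 * half + spacer
    for seq in sequences:
        for i in range(len(seq) - span + 1):
            left = seq[i:i + half]
            right = seq[i + half + spacer:i + span]
            counts[(left, right)] += 1
    return counts

# Toy sequences (hypothetical): both contain CGG, a 5-base spacer, then CCG
seqs = ["CGGAAAAACCG", "TTCGGTTTTTCCGA"]
print(count_oligos(seqs, 3).most_common(3))
print(count_dyads(seqs, half=3, spacer=5)[("CGG", "CCG")])
```

Note that the spacer content differs between the two sequences, so a plain oligo count would not detect the shared CGG...CCG pattern, whereas the dyad count does.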
(Pesole-Lab, Milan University, Italy)
WeederWeb is one of the tools developed in the lab of Graziano Pesole, Milan University. WeederWeb is a web interface to Weeder, a program for finding novel motifs (like transcription factor binding sites) conserved in a set of regulatory regions of related genes.
WeederWeb is a very user-friendly web interface. You simply paste the sequences, check if you want to include the reverse strand, select your "guess" how many sequences will share a motif, and choose the "speed" of the analysis. Note that using the "quick scan" option, only short motifs (6 to 8 bases) are reported, whereas "normal" and "thorough" modes scan for motifs from 6 to 12 bases !

Note that you should use the extended input form if you want to exactly choose a certain motif length, or if you want to precisely define how many variations may be accepted. There is also a special option if you are using human 5' or 3'UTRs as input sequences !

Results arrive as a text file via Email, which also contains a hyperlink that displays the output in a "MEME-like" fashion (including a "Sequence Logo" representation). The result also contains a user-friendly line "Interesting motifs seem to be:".
(University of Washington)
YMF is one of the tools developed at the computational molecular biology group, University of Washington. YMF is a program that detects statistically overrepresented words (motifs) in DNA sequences. The user may specify the characteristics of the motifs to be detected. A motif here is a short string of nucleotides, degenerate symbols, and spacers. 'Motif size' is the number of non-spacer characters in a motif. Spacers ('N's) are constrained to be in the center of the motif. Degenerate symbols allowed in a motif are R (purine - A or G), Y (pyrimidine - C or T), W (A or T), and S (C or G).
YMF uses a very clear, user-friendly web interface. You simply choose the motif size, the maximum number of spacers and degenerate symbols (IUPACs), and the organism.
The reason for selecting the organism is that YMF reports motifs that are statistically overrepresented in the input sequences with respect to a "Background model" that captures the promoter regions of all genes from the organism. YMF has precomputed these background models for several common genomes.
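To make the motif definition concrete, the following Python sketch translates a motif with degenerate symbols and central spacers into a regular expression and counts its occurrences. It only illustrates the motif alphabet described above; the statistical scoring against a background model is not reproduced, and all names and sequences are hypothetical.

```python
import re

# Symbols allowed in a motif as described above, plus the spacer N
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile a motif into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[symbol] for symbol in motif)
    return re.compile("(?=(" + body + "))")

def motif_size(motif):
    """Motif size = number of non-spacer characters."""
    return sum(1 for symbol in motif if symbol != "N")

def count_occurrences(motif, sequences):
    pattern = motif_to_regex(motif)
    return sum(len(pattern.findall(seq)) for seq in sequences)

motif = "CGGNNNCCR"          # hypothetical motif: 6 symbols, 3 spacers
seqs = ["ATCGGTTACCGAT", "CGGAAACCATT"]
print(motif_size(motif), count_occurrences(motif, seqs))
```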
NOTE: Although the page states "Total uploaded sequence data should be < 10000 characters", a test run using a much larger sequence set seemed to work without problems (and results were delivered very quickly!).
- simple text file listing the motifs in "descending order of reliance".
- graphical plotting of "Top-scoring motifs". NOTE: Works only with IE6.0+, and NOT with Netscape !
- Run "FindExplanators" on the output set (see below).

FindExplanators is a program that extracts from the set of significant motifs reported by YMF, a smaller set of "real" motifs. More specifically, given a set of DNA sequences P, and a set of motifs M (such as those reported by YMF), it extracts a subset E of motifs in M, such that given the occurrences of the motifs of E in the sequences P, the remaining motifs in M are not statistically significant.

NOTE: YMF and FindExplanators may also be downloaded.
TF Module Discovery
NOTE: This section lists programs which "discover" TF modules which are significant in a given sequence set or in a single sequence. This means that TF modules are not pre-selected by the user in order to be matched onto the sequences, but are predicted by the programs as statistically significant for this set. 
(GRESA group , ICG, Novosibirsk and GBF, Braunschweig, Germany)
CompelPatternSearch: lets you discover potential composite regulatory elements (CEs or modules) in your query sequence. You may define the maximal number of mismatched nucleotides in core positions of the 2 different binding sites and the possible variation of the distance between two sites (in %). Example: If you analyze 500 bp of upstream genomic sequence of the human IL6 gene, you will retrieve 1 single CE showing NO mismatch (C00152: NF-kB and C/EBPbeta).
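The two search parameters (mismatches and distance variation) can be illustrated as follows. This Python sketch is an assumption-laden simplification: mismatches are counted over the whole site rather than only in core positions, the site strings and the reference distance are made up, and none of it is taken from the CompelPatternSearch implementation.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def match_positions(seq, site, max_mismatches):
    """Start positions where `site` matches with at most `max_mismatches`."""
    return [
        i for i in range(len(seq) - len(site) + 1)
        if hamming(seq[i:i + len(site)], site) <= max_mismatches
    ]

def find_composite_elements(seq, site_a, site_b, ref_distance,
                            max_mismatches=0, distance_variation=0.0):
    """Find site_a followed by site_b at roughly the reference distance.

    distance_variation is a fraction (0.5 = 50 %) of ref_distance.
    """
    low = ref_distance * (1 - distance_variation)
    high = ref_distance * (1 + distance_variation)
    hits = []
    for a in match_positions(seq, site_a, max_mismatches):
        for b in match_positions(seq, site_b, max_mismatches):
            distance = b - (a + len(site_a))
            if low <= distance <= high:
                hits.append((a, b, distance))
    return hits

# Hypothetical query sequence and site strings
query = "AAGGGACTTTCCAAAAAATTGCGCAATCC"
print(find_composite_elements(query, "GGGACTTTCC", "TTGCGCAAT", ref_distance=6))
```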

NOTE: CompelPatternSearch can be used with SINGLE sequences only !
(Lawrence Livermore National Lab)
Note: many programs of this portal, like rVISTA, zPicture, and ECR Browser are described under ""  in section "Comparative Genomics" !
CREME is a web-server for identifying and visualizing cis-regulatory modules in the promoter regions of a given set of potentially co-regulated human genes.

1. Scope:
Eukaryotic genes are often regulated by several transcription factors, whose binding sites are spatially clustered and form cis-regulatory modules.  CREME relies on a database of putative transcription factor binding sites that have been carefully annotated across the human genome using evolutionary conservation with the mouse and rat genomes. Promoter extraction was done by mapping RefSeq mRNAs onto the genome assemblies, and by taking 1.5 kb upstream of the TSS, or up to the next neighbouring gene. The CREME database is built of TFBS which are conserved in all 3 species (human, mouse, rat), and which show PWM similarity scores of 0.8 and above.
NOTE: The scope of CREME is very similar to the program ModuleSearcher, which is part of the TOUCAN program (see TOUCAN chapter for details), although ModuleSearcher is not "biased" to conserved (human-mouse-rat) TFBS.

2. Input:
Simply paste a list of accession numbers describing a set of co-expressed (or co-regulated) genes (mRNA accession numbers or LocusLink IDs).

- Hit threshold (TRANSFAC matrix score): this value determines the minimal specificity of a TF-binding consensus to be reported.
- Maximum module length: means the maximum allowed space between 2 TFs to be still reported as "module" (between 50 and 500 bp).
- Maximum number of TFs per module: between 2 and 4.
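How the last two parameters interact can be sketched in Python. The example below simply enumerates combinations of distinct factors whose sites fall within the maximum module length in one promoter; the statistical evaluation across a gene set that CREME performs is not included, and the function name, site positions, and the reading of "module length" as the distance between the first and last site are assumptions.

```python
from itertools import combinations

def find_modules(sites, max_length, max_factors=2):
    """Enumerate candidate modules in one promoter.

    sites: list of (factor_name, position) tuples.
    A module is a combination of 2..max_factors distinct factors whose
    first and last sites lie at most `max_length` bp apart.
    """
    ordered = sorted(sites, key=lambda site: site[1])
    modules = []
    for size in range(2, max_factors + 1):
        for combo in combinations(ordered, size):
            names = {name for name, _ in combo}
            span = combo[-1][1] - combo[0][1]
            if len(names) == size and span <= max_length:
                modules.append(combo)
    return modules

# Hypothetical predicted sites (positions relative to the promoter start)
sites = [("NFKB", 120), ("CEBPB", 190), ("SP1", 600), ("AP1", 650)]
for module in find_modules(sites, max_length=150, max_factors=3):
    print(module)
```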

3. Output:
An efficient search algorithm is applied to this data set to identify combinations of transcription factors, whose binding sites tend to co-occur in close proximity within the promoter regions of the input gene set. These combinations are statistically evaluated, and significant combinations are reported and visualized.
- Enriched PWMs: are reported which are over-represented TF matrices, as compared to a background set.
- Graphic display of modules: Each promoter is depicted as horizontal line, which in fact represents the distance from each gene's transcription start site (located at the left end). For clarity, only the first 900 bp upstream of the TSS are shown.
NOTE: Several test runs using known co-regulated clusters suggested that the best results were produced when using a loose stringency concerning the TF matrix threshold (like 0.8), but a quite stringent maximum module length (like 150 bp). Of course, results may vary depending on the dataset used.

NOTE: If you now want to scan the whole genome for the occurrence of a specific TFBS module, you may use the tool SynoR. Please refer to the main section of SynoR for information !
(Genomatix Inc., Munich, Germany)
FrameWorker is a complex software tool that allows users to extract a common framework of elements from a set of DNA sequences. These elements are usually transcription factor binding sites since this tool is designed for the comparative analysis of promoter sequences. FrameWorker returns the most complex models that are common to the input sequences (and satisfying the user parameters). These are all elements that occur in the same order and in a certain distance range in all (or a subset of) the input sequences. Typical input datasets may be, for instance, a set of promoters from orthologous genes (Phylogenetic footprinting) or a set of promoters from different genes which have been found to be co-regulated by cluster analysis of expression array data (Co-regulation).
NOTE: FrameWorker is part of the "GEMS Launcher" section of the commercial Genomatix Suite. NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is not only a limitation in the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also in the functionality of the obtained data !

ModuleSearcher is part of the TOUCAN program for regulatory sequence analysis. ModuleSearcher is designed to predict cis-regulatory modules (CRMs) in a set of coexpressed or coregulated genes.

Please refer to this main section of ModuleSearcher  for detailed information.
Motif Design
NOTE: This section collects programs which are designed to create a common motif from a set of (short) sequences, which is similar but still different from "Motif Discovery". Both "kinds of motif" are considered here: matrices and consensus sequences (see also section "Motif Matcher" for discussion). In addition, tools for the design of regulatory sequences are considered here.
(Genomatix Inc., Munich, Germany)
MatDefine is a tool for fully automatic definition and evaluation of weight matrices from a set of short DNA sequences. The resulting weight matrices can be used by MatInspector to scan nucleic acid sequences for matches to the described binding site. In "automatic mode" (default), the weight matrix is generated without any user interaction. A protocol describing the matrix definition process is delivered. In "interactive mode" ("More options"), the user can modify all parameters which are used in automatic mode.
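The general principle of deriving a weight matrix from a set of aligned short sequences can be sketched in Python. This is a generic log-odds construction with a pseudocount and a uniform background, both of which are assumptions; it is not the procedure implemented in MatDefine, and the binding sites are made-up examples.

```python
from math import log2

def position_frequency_matrix(sites):
    """Count A/C/G/T at each position of equally long, aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    return [
        {base: sum(site[i] == base for site in sites) for base in "ACGT"}
        for i in range(length)
    ]

def log_odds_matrix(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background."""
    matrix = []
    for column in pfm:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix

def score(matrix, word):
    """Sum the column scores for a word of the same length as the matrix."""
    return sum(column[base] for column, base in zip(matrix, word))

# Hypothetical aligned binding sites
sites = ["TGACGTCA", "TGACGTAA", "TGACGCAA", "TTACGTCA"]
pwm = log_odds_matrix(position_frequency_matrix(sites))
print(score(pwm, "TGACGTCA"), score(pwm, "AAAAAAAA"))
```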

NOTE: MatDefine is part of the "GEMS Launcher" section of the commercial Genomatix Suite. NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is not only a limitation in the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also in the functionality of the obtained data !

1. Pattern-assembly is a tool to assemble a set of oligonucleotides or dyads into clusters of overlapping patterns (the assembly). Pattern-assembly is part of the (RSAT) package of regulatory sequence analysis, which was created at the SCMBB, Brussels University. 
Please refer to this main description above for detailed information.
(Genomatix Inc., Munich, Germany)
SequenceShaper is a software tool developed for the design of regulatory sequences. It allows the specific generation and deletion of transcription factor binding sites. SequenceShaper provides two different functions: generating new transcription factor binding sites or deleting existing transcription factor binding sites. In both cases the results can be restricted to mutations that do not influence the sequence context (other elements).
NOTE: SequenceShaper is part of the "GEMS Launcher" section of the commercial Genomatix Suite. NOTE: Genomatix has termed the free academic access "evaluation account". Note that in general, there is not only a limitation in the number of analyses (max. 20 GEMS analyses (sequences) per month!) but also in the functionality of the obtained data !
Target Explorer - program module "weight matrix creation"
(Columbia University)
Target Explorer automates the entire process from the creation of a customized library of binding sites for known transcription factors through the prediction and annotation of putative target genes that are potentially regulated by these factors.

Target Explorer was specifically designed for the well-annotated Drosophila melanogaster genome, but some options can be used for any sequence of interest, like the module which allows the user-defined creation of a positional weight matrix representation of a TFBS starting from a series of binding sites.

Please refer to the Target Explorer section for details !
Repetitive Elements
Internal sequence repeats (1)
(EMBOSS, Pasteur)
1. REPEATS: scans a DNA sequence, looking for tandemly repeated patterns where the period of the repeat has a user-specified size from 1 to 32 nucleotides.

2. EINVERTED: Finds DNA inverted repeats.

3. EQUICKTANDEM: Finds tandem repeats.

4. ETANDEM: Looks for tandem repeats in a nucleotide sequence for the repeat size equicktandem suggests.
5. PALINDROME: Looks for inverted repeats in a nucleotide sequence.      
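The difference between tandem repeats and inverted repeats can be made concrete with a small Python sketch. It finds exact repeats only and uses hypothetical function names and toy sequences; the EMBOSS programs listed above have their own scoring schemes and parameters, which are not reproduced here.

```python
import re

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def tandem_repeats(seq, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact, directly adjacent repeats."""
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]

def inverted_repeats(seq, stem=4, max_gap=10):
    """Return (left_start, right_start, left_arm) where the right arm is the
    exact reverse complement of the left arm, at most `max_gap` bases apart."""
    hits = []
    for i in range(len(seq) - 2 * stem + 1):
        left = seq[i:i + stem]
        target = reverse_complement(left)
        for gap in range(max_gap + 1):
            j = i + stem + gap
            if seq[j:j + stem] == target:
                hits.append((i, j, left))
    return hits

print(tandem_repeats("GGCACACACACTT"))
print(inverted_repeats("TTGACCAAAATGGTCTT"))
```

A tandem repeat is the same unit copied head to tail (CA CA CA CA), whereas an inverted repeat pairs a stretch with its reverse complement further downstream (GACC ... GGTC), which is the arrangement that can form a hairpin.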


Repbase Update is the most commonly used database of repetitive DNA elements and is used, for example, by the widely used program RepeatMasker (see RepeatMasker section for details). Repbase is maintained at the Genetic Information Research Institute (GIRI), CA, USA. It is possible to browse the Repbase alphabetical list of elements or to search using a search string (like "Tigger" or "MER2"). Please note that a registration (free for academic use) is needed in order to open the individual Repbase database entries.

CENSOR: CENSOR is a software tool which screens query sequences against a reference collection of repeats and "censors" (masks) homologous portions with masking symbols, as well as generating a report classifying all found repeats. Thus, CENSOR is somewhat similar to RepeatMasker.
In general, the CENSOR output is very informative as it presents data in several formats:
- the graphical SVG Viewer gives a very good impression about the positions and the sizes of individual repeats. Note that the SVG Viewer works better in MS Internet Explorer than in Netscape.
- the summary table lists all elements, very similar to RepeatMasker.
- the masked sequence masks the query sequence in a way that all repeats are replaced by "N". In addition, all masked segments are listed as separate fasta sequences.
- all pairwise alignments of the query and the repeat sequences are shown.
- the database entries of all repeats are shown.
Please note that although sequence analysis using CENSOR is not restricted, the viewing of individual repeat database entries is dependent on registration (see above) !

NOTE: Access Repbase via SRS:
Repbase is also searchable via public SRS (Sequence Retrieval System) servers (without need to register !). Please also refer to other SRS descriptions as in RET2 or RET5. If you have a look at this list, you will find different SRS servers providing Repbase, so choose one of these servers. Using the yellow "Search" button at the top right corner opens the "Search" - form, where we can simply enter search terms. Note that if you retrieve no hits in the first run, you may select "*all entries*" as search field which will produce a list of ALL entries in the database. You can display the complete list in one window by adjusting the "Display Options" on the left side. This list you can scan for your repeat of interest. NOTE that the Repbase version at GIRI is much more up-to-date than the ones found via these SRS-linked servers ! Example: look for the repeat "Tigger" and see how many elements you will retrieve.   
Internal sequence repeats (2) -

(Zlab, Boston University)
REPFIND is a program to find clustered, exact repeats in nucleotide sequences. For each repeat cluster that it finds, it calculates a P-value, which indicates the probability of finding such a concentration of that particular repeat just by chance. Of the many possible clusters for each repeated word, REPFIND selects the one with the most significant P-value.
Note that REPFIND is especially useful to detect regulatory signals in 3'-UTR sequences of mRNAs which often consist of repeat clusters (like e.g. AU rich elements known to be involved in mRNA decay). Please also refer to FAQ GEN8 !

1. Input:
- Sequence: REPFIND only accepts single sequences as input (no batch submission), either in raw sequence format or as GenBank accession number.
- P value cutoff: Only repeat clusters with P-values lower than this cutoff will be displayed.
- Minimum repeat length: If you already know the length of the repeat you are looking for, use this option as it drastically reduces the output length !
- Low Complexity Filter: Real nucleotide sequences often contain so-called low complexity sequence, meaning tracts of predominantly one nucleotide, dinucleotide tandem repeats, and the like. Since they probably do not correspond to the type of signal that you are looking for with REPFIND, they may be masked out with the program dust, which is widely used with other sequence analysis tools, such as BLAST, for similar reasons.
Please NOTE that you have to be cautious when looking for motifs like AU-rich elements, which are similar to a "low complexity" sequence, and which therefore would be masked out (hidden) in advance !
- Statistical Background: To calculate how unlikely a repeat is, REPFIND needs to know how abundant the nucleotides within the repeated word are. By default these abundances are obtained from the input sequence. However, they may also be obtained from databases of Xenopus, human, and S. cerevisiae 3' UTRs. By default, the abundances of dinucleotides are used, thus accounting for, e.g., reduced abundance of CpG relative to C and G. Alternatively, you may select the use of mononucleotides up to hexanucleotides, by selecting a Markov model of order zero to five. Since a 5th order Markov model requires frequencies of 4^6 = 4096 hexanucleotides, the dataset used for counting them should contain many more than this number of basepairs to get meaningful results.
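The role of the Markov order in the statistical background can be illustrated with a short Python sketch that estimates a model from a background sequence and computes the probability of a repeated word. The function names and the toy background are hypothetical, and the sketch does not reproduce how REPFIND turns such probabilities into cluster P-values.

```python
from collections import Counter

def markov_word_probability(word, background, order=1):
    """Probability of `word` under an order-`order` Markov model.

    order=0 uses mononucleotide frequencies, order=1 dinucleotides, and so
    on. Frequencies are estimated from the `background` string.
    """
    k = order
    kmer_counts = Counter(
        background[i:i + k] for i in range(len(background) - k + 1)
    )
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(len(background) - k):
        context_counts[background[i:i + k]] += 1
        transition_counts[background[i:i + k + 1]] += 1
    # Probability of the first k bases, taken as their k-mer frequency
    prob = kmer_counts[word[:k]] / sum(kmer_counts.values()) if k else 1.0
    for i in range(k, len(word)):
        context = word[i - k:i]
        if context_counts[context] == 0:
            return 0.0
        prob *= transition_counts[word[i - k:i + 1]] / context_counts[context]
    return prob

def words_to_estimate(order):
    """Number of (order+1)-mers whose frequencies the model needs."""
    return 4 ** (order + 1)

background = "ACGT" * 50   # toy background, not real UTR data
print(markov_word_probability("ACGT", background, order=1))
print(words_to_estimate(5))
```

The second function makes the data requirement explicit: each increase in order multiplies the number of frequencies to be estimated by four.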

2. Output:
- REPFIND produces a graphical display of the repeats that it finds, followed by a textual summary of each repeat cluster. The colored bars indicate those repeats that form the strongest cluster. The different colors serve to distinguish repeats that are close together and have no meaning beyond that. You can save the image by right clicking on it and selecting the appropriate menu option.
- A P-value of 1e-05 means that such a concentration of that particular repeated word would be expected to occur by chance only one time in 10^5 clusters of the individual repeat being examined.
(ISB, Seattle)
RepeatMasker is an excellent program that screens DNA sequences for interspersed repeats (Alu, SINEs, LINEs...) and low complexity DNA sequences. RepeatMasker is maintained at the Institute for Systems Biology (ISB), Seattle.

1. Input:
- Sequence, species
- Lineage annotation options: If your query sequence is mammalian, RepeatMasker can determine if a repeat instance is expected to be present in one or more other mammalian species. This information can be used to annotate the RepeatMasker output or control the masking process.
- Advanced options: e.g. alignments (repeats aligned with query sequence), masking of repeats in query sequence, contamination check, repeat options, and more.

2. Output:
2.1. The first page is a summary table listing all repeat types that were identified within the query sequence. Also, all links to further output files are presented.
2.2. The link "Repeat Annotations" lists detailed annotation of the repeats that are present in the query sequence, which includes:
- SW: Smith-Waterman score of the match, usually complexity adjusted. The SW scores are not always directly comparable. Sometimes the complexity adjustment has been turned off, and a variety of scoring-matrices are used. Example: 1306. In general: The higher the better.
- perc div: Example: 15.6 = % substitutions in matching region compared to the consensus. The lower the better.
- perc del: Example: 6.2 = % of bases opposite a gap in the query sequence (deleted bp). The lower the better.
- perc ins: Example: 0.3 = % of bases opposite a gap in the repeat consensus (inserted bp). The lower the better.
- name of query sequence, and exact position of the repeat in the query sequence.
- the repeat type, the family it belongs to, and the part of the repeat which matches to the query sequence (from-to (left)). The latter point is worth looking at as it gives an impression if only a small fraction or almost the complete repeat actually is present in the query sequence.
2.3. "Masked Sequence" shows a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns).
2.4. "Alignments" shows all pairwise alignments of the repeats with the query sequence.
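The masking step itself is simple and can be sketched in Python: given the positions of annotated repeats, the corresponding bases are replaced by "N". The function names and the coordinate convention (1-based, inclusive start and end) are assumptions made for this illustration, and the sketch does not parse any actual RepeatMasker output file.

```python
def mask_repeats(sequence, annotations, mask_char="N"):
    """Replace annotated repeat regions with `mask_char`.

    annotations: iterable of (start, end) pairs, assumed 1-based inclusive.
    """
    chars = list(sequence)
    for start, end in annotations:
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)

def masked_fraction(masked_sequence, mask_char="N"):
    """Fraction of the sequence that has been masked."""
    return masked_sequence.count(mask_char) / len(masked_sequence)

masked = mask_repeats("ACGTACGTAC", [(3, 5)])
print(masked, masked_fraction(masked))
```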

NOTE: The interspersed repeat databases screened by RepeatMasker are based on the repeat databases (Repbase) copyrighted by the Genetic Information Research Institute (G.I.R.I.). The Repbase Update database contains annotation of most repeats with respect to divergence level, affiliation, etc. The nomenclature of the interspersed repeats in the output of RepeatMasker is nearly identical to that of the reference database which in most cases corresponds to that in the literature.
Please refer to the Repbase section for details!

Alternative RepeatMasker: BCM Search Launcher: Sequence Utilities