
-> GENES
-> GEN1...know if a stretch of genomic sequence contains a potential promoter region ? (last update May 28, 2006)
-> GEN2...know which transcription factor binding sites, TF modules, or user-defined patterns and profiles are present in my promoter region ? (last update May 30, 2006)
-> GEN3...know if there are repetitive elements in my DNA sequence ? (last update Nov. 18, 2005)
-> GEN4...know which promoters or enhancers in a whole genome contain a binding site for a single or a combination of transcription factors (Motif Matching; Module Scanners) ? (last update Mar. 14, 2006)
-> GEN5...know which regulatory elements are common in a set of promoter sequences and check if these motifs are known transcription factor binding sites (Motif Discovery) ? (last update Mar. 30, 2006)
-> GEN6...quickly extract potential promoter sequences for a batch of human genes ? (last update May 29, 2006)
-> GEN7...quickly see the binding site profiles of individual transcription factors ? (last update May 18, 2005)
-> GEN8...detect regulatory elements in UTRs (UnTranslated Regions) in a whole-genome approach ? -> see RNA1 !
-> GEN9...get the promoter/protein sequences of all proteins homologous to my query within a certain species ? -> see RET8 !
-> GEN10...check how often a specific motif is present in a randomly generated sequence set ? (last update Jun. 3, 2005)
GEN1...know if a stretch of
genomic sequence contains a potential promoter region ? (last update
May 28, 2006)
There are several programs on the web which scan
DNA
sequences for potential promoter regions and / or Transcription
Start Sites (TSS). Both
TATA-containing and TATA-less promoters are predicted. As
always, each method has advantages and disadvantages. It is best to use
several of these programs in parallel and compare the results.
1. Resources performing predictions of promoter / TSS position:
The program PromoterInspector
predicts promoter regions in mammalian genomic sequences. PromoterInspector is now part of the commercial
GenomatixSuite, which means that there are limitations
for
the use of the program. The output
only shows the predicted sequence position, not the
transcription factor details.
Please refer
to the main section of
PromoterInspector for details.
NNPP
(Promoter Prediction by Neural Network), provided in the
context of the Berkeley
Drosophila Genome Project, is a widely used (often cited
!) method that finds eukaryotic and prokaryotic promoters in a DNA
sequence. The output
is simply a list
of predicted (core) promoter sequences with the predicted TSS
indicated. Note that NNPP predicts very short sequences (only 50 bp) in
proximity to the TSS. There are no further options for follow-up
analyses. Note that test runs showed that NNPP is significantly
less
stringent than other promoter prediction programs, which results in
a higher number of potential promoter sequence regions. Please refer to
the main section of NNPP for details.
Eponine,
developed at Sanger Institute, is a probabilistic method for detecting
transcription start sites (TSS) in mammalian genomic sequence.
Results are presented in GFF format: a
simple list of TSS positions together with the predicted strand. Please
refer to the main section of Eponine
for details.
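Since GFF is plain tab-separated text, the TSS positions and strands can be pulled out of such a result with a few lines of Python. The following is only a minimal sketch, assuming the common column layout (seqname, source, feature, start, end, score, strand, frame, optional attributes). The function name and the example records are invented for illustration and are not real Eponine output.

```python
def parse_gff_tss(gff_text):
    """Return (seqname, start, end, strand, score) tuples from GFF text."""
    hits = []
    for line in gff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a complete GFF record
        seqname, _source, _feature, start, end, score, strand = cols[:7]
        score_val = None if score == "." else float(score)
        hits.append((seqname, int(start), int(end), strand, score_val))
    return hits


# Invented example records (assumption: real output may differ in detail).
example = (
    "##gff-version 2\n"
    "myseq\tEponine\tTSS\t1204\t1204\t0.998\t+\t.\n"
    "myseq\tEponine\tTSS\t8790\t8802\t0.996\t-\t.\n"
)

for hit in parse_gff_tss(example):
    print(hit)
```

A list like this is easy to compare with the predictions of the other programs in this section, for example by checking which predicted TSS fall within a few hundred bp of each other.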
Tip!
FirstEF
(First Exon Finder), provided at Cold Spring
Harbor Labs, is a 5' terminal exon and promoter
prediction program. It consists of different discriminant functions
structured
as a decision tree. The probabilistic models are optimized to find
potential
first donor sites and CpG-related and non-CpG-related
promoter regions based on
discriminant analysis. For every potential first donor site (GT)
and an upstream
promoter region, FirstEF decides whether or not the intermediate
region can be
a potential first exon, based on a set of quadratic
discriminant functions.
FirstEF calculates the a posteriori probabilities of exon,
donor, and
promoter for a given GT and an upstream window of length 570 bp. Taken
together, FirstEF shows predicted positions of promoters, first
exons, and CpG islands. NOTE: FirstEF predictions are also
presented in the UCSC
Genome Browser display ("Expression and Regulation" tracks)!
Please refer
to the main section of FirstEF for
details.
2. Resources performing predictions of promoter / TSS
position and Transcription Factor Binding Sites (TFBS):
Tip!
PromoterScan
predicts promoter regions via comparison to eukaryotic Pol II promoter
sequences. It has the advantage that it also shows the names and
positions of significant transcription factor binding sites within the
sequence. The results show the location of predicted promoter
sequences. Predicted
sequence regions are regions of DNA that contain a significant number
and
type of transcriptional elements (TEs) that are usually
associated with Pol II promoter sequences. Reported putative promoters
are those regions of your sequence that score past a predetermined
cutoff score set to recognize 70% of primate promoter sequences in the
Eukaryotic Promoter Database. Please refer to the main section of PromoterScan for
details.
Another program is called DRAGON
Promoter Finder which is
part of the portal DRAGON
Genome Explorer of the Institute for Infocomm Research
(I2R),
Singapore. The program attempts to recognize the exact location of the
transcription start site (TSS), i.e. the +1 position.
Therefore, the first output is a list of potential TSS. The
program also includes very nice follow-up analyses, like BLAST against
the EPD (Eukaryotic Promoter Database), or prediction of TF sites. For the latter
option, the Match program is used. Please refer to the main section of Dragon
Promoter Finder for details.
GEN2...know which
transcription factor binding sites, TF modules, or user-defined
patterns and
profiles are present in my promoter region ? (last
update May 30, 2006)
This is in general a quite tricky field, not because
you don't find any hits but because you find too many of them. In most
cases it is therefore necessary to manually screen through the lists of
potential transcription factor binding sites in a sequence and try to
pick those of "highest interest". Nevertheless, there are different
databases and resources that can be compared.
Please note that
the general subject of "motif
matching" in a whole-genome approach (instead of single
promoters) is discussed in FAQ GEN4 C) , but the
programs discussed there usually are also applicable to single
sequences !
Please note
that the number of false positive predictions of TFBS may be
drastically reduced by using comparative genomics approaches,
as discussed in FAQ GENOM6 !
1. Resources based on TRANSFAC database:
TRANSFAC
is the "classical" transcription factor database. Please refer to the TRANSFAC section at the main page for
a more detailed description. You can either search
through lists of transcription
factors or their binding sites, or you can analyze your own input
sequence for the presence of these motifs. For the latter, there are
different possibilities. The "gold-standard application" is MatInspector,
available at Genomatix, Inc. (see below). This program has been largely
commercialized; only a limited number of free runs is provided for
academic use. In contrast, the programs Match and P-Match
are provided for free directly at the BIOBASE portal, which ensures
very good database cross-referencing between the predicted TFBS in the
output and the corresponding TF database entries in TRANSFAC ! Note that a registration
at BIOBASE, which is free for
non-profit use, is
required in order to use Match and P-Match.
It is the same registration that is needed to access the TRANSFAC
public database at this portal !
Tip! Match
is one of the programs at the
BIOBASE portal of gene regulation. Match is designed for searching
potential binding sites for transcription factors (TF binding sites)
in nucleotide sequences. Match uses a library of mononucleotide weight
matrices from TRANSFAC 6.0. Please refer to the main section of Match for a detailed
description.
Tip!
P-Match
(combined Pattern-Matrix search) is a new tool
for identifying transcription factor binding sites (TF binding sites)
in DNA sequences. It combines pattern matching and weight matrix
approaches thus providing higher accuracy of recognition than each
of
the methods alone. P-Match uses a library of mononucleotide weight
matrices from TRANSFAC 6.0 along with the site alignments associated
with these matrices. Note: In general, P-Match "looks" very
similar to Match. Please refer to the main
section of P-Match for a detailed description.
The improved version of the "historical" public
version of MatInspector is called MatInspector
professional, which significantly reduces the number of false
positives and negatives. This program is
now part of the Genomatix Suite,
which is in principle free of charge for academic users after registration.
However, you are restricted to max. 20 analyses (sequences) per month
! The user may select the values for core similarity and matrix
similarity (in both cases "1" means perfect match). Note that
there is a highly user-friendly option when
choosing the matrix similarity threshold, called "optimized".
This means that the program automatically chooses the optimal value
for each matrix, which minimizes the number of false positives.
This optimized value is defined in a way that a minimum number of
matches is found in non-regulatory test sequences. At the MatInspector
output list, you can easily compare the difference between the
optimized matrix threshold and the actual matrix similarity for each
site. Note that you should use MS Internet Explorer (and
not Netscape) in order to make use of the functionality of the Adobe
SVG Viewer, which allows interactive handling of the TF site
diagrams. If you have analyzed a multiple sequence file, you will find
a button called "Search for common TF sites" at the bottom of the output
page, opening an SVG window, where you can adjust the display to show only those TF
sites present in x of total y input sequences. Via right-click and
"Copy SVG", you can paste the image into e.g. Corel Photopaint in order
to save it in any file format.
NOTE: If you know the binding activity
of a novel
TF (not present in TRANSFAC) with a series of oligonucleotides
and you want to build a
profile from these sequences, you may use the program MatDefine.
MatDefine is a tool for fully automatic definition and evaluation
of weight
matrices
from a set of short DNA sequences. The resulting weight
matrices can be used by MatInspector to scan
nucleic acid sequences for matches to the described binding
site. In "automatic mode" (default), the weight matrix is
generated without any user interaction. A protocol describing the
matrix definition process is delivered. In "interactive mode"
("More options"), the user can modify all parameters which are used in
automatic mode. NOTE: MatDefine is part of the "GEMS
Launcher"-section
of the commercial Genomatix
Suite. NOTE: Genomatix has termed
the free
academic access "evaluation
account". Note that in general, there is not only a limitation
in the number of analyses (max. 20 GEMS
analyses (sequences) per month!) but
also in the functionality
of the obtained data !
NOTE that you have to register at BIOBASE (free for
non-profit organizations) in order to gain access to the individual
transcription factor information files !
Another resource in this field is TESS, which stands for
Transcription Element Search Software. TESS is a set of software for
locating and displaying transcription factor binding sites in a DNA
sequence. TESS uses an older version (4.0) of the TRANSFAC
database (public version, not
the updated "professional" version !) as its store of transcription
factors and their binding sites. In fact, a combined search in various
databases is performed (TRANSFAC site, TRANSFAC matrix, CBIL matrices,
IMD..). By the way, all these databases can also be queried using
keywords. Click on the link "Combined" to get to the "Combined Search
Page". If you are not sure how to handle the different input
parameters, simply click on the button "Analyze using the default
settings !". You can choose between various forms of output, including
colour coding of consensus mismatches (very useful !!), tables to show
the significance of hits (!), and Java applets to show the binding
sites on the sequence. NOTE: Within the tabular results page, you can
click on the header of every column to sort the output (e.g. a click on
"Sm" will sort the output by matrix similarity). NOTE: "==="
within the
"annotated sequence view" indicates hits above the secondary threshold,
whereas "---" indicates hits below it. NOTE: The only disadvantage is
the upper
limit of 1 kb for input sequences !
If you specifically want to search for TFBS-modules (combinations
of regulatory elements) in SINGLE sequences, you may use one of
the following programs. Please NOTE that
programs which predict such TFBS modules in a whole SET of sequences
are discussed in FAQ GEN5: B2). Some of
these may also be suitable for the analysis of SINGLE sequences !
CompelPatternSearch
lets you search a query sequence for the presence of potential
composite regulatory
elements. This tool is based on COMPEL, a
database on composite
regulatory elements affecting gene transcription in eukaryotes,
which in turn is based on TRANSFAC. COMPEL collects information
about composite regulatory elements (CEs) - pairs of closely
situated sites and transcription factors binding to them. You may define the maximal
number of mismatched
nucleotides in core positions of the 2 different binding sites and the
possible variation of the distance between two sites (in %). Example:
If you analyze 500 bp of upstream genomic sequence of the human IL6
gene, you will retrieve 1 single CE showing NO mismatch (C00152:
NF-kB and C/EBPbeta). NOTE: CompelPatternSearch can be used
with SINGLE
sequences only !
2. Resources based on JASPAR database:
Tip! ConSite is
a program (and web interface) that couples phylogenetic footprinting
with regulatory site detection (mainly promoter
comparison), meaning that ConSite is primarily designed to compare 2
orthologous sequences and report conserved TFBS (Transcription Factor
Binding Sites). Note that ConSite uses a database ("JASPAR") of TF profiles
(PWMs) that was newly built from literature data, and that is therefore
"independent" from existing databases like TRANSFAC. At the ConSite
start page, you have 3 different options, one of them is "Analyze
a single sequence", which lets you analyze a single
promoter for TFBS (without performing cross-species comparison). This
option is comparable to the TESS
system, but utilizes the JASPAR profile collection instead of TRANSFAC.
Note that the option "Analyze
a single sequence" of ConSite can also be used to see if individual
TFBS,
which can be selected from a list, are present in a query sequence. In
addition, there is the option to scan the sequence for the presence of
a user-defined profile (raw counts matrix or position weight
matrix), but not of a user-defined consensus sequence. This can be
useful if you know the binding activity of a novel TF (not present in
JASPAR) with a series of oligonucleotides and you are able to build a
profile from these sequences (see also description of MatDefine
above), and finally want to scan a promoter
sequence for the presence of this binding site (profile).
Please refer to FAQ
GEN4 C) for additional programs in the field of Motif Matching
!
JASPAR also provides a "quick
and easy" way of analyzing
a promoter sequence (pasted in the field on the right side) for the
presence of individual TFBMs, which have to be selected from
the list
first. Note that there is no option "Select all" which means that this
feature is designed to just show the positions of single (or a small
group of) TFBMs in a query sequence. For more complex analyses, the
ConSite system should be used.
3. Resources based on TRANSFAC and JASPAR databases:
Tip! MotifScanner,
which is also part of the TOUCAN
package, is available as an individual program from the software
page of the bioinformatics
group at the department of electrical engineering (ESAT) at the Katholieke Universiteit Leuven
(Belgium). In contrast to TOUCAN, the output is not
displayed graphically but as a simple list of TRANSFAC matrices
which match your input sequences, according to the
criteria specified by the user. You may upload a multiple FASTA
sequence file but there will be no comparisons of the
TF composition between sequences. NOTE: The web interface of
MotifScanner does not support the JASPAR matrices, in contrast
to MotifScanner implemented in TOUCAN ! On the other hand, it
allows the selection of individual TRANSFAC TFBMs to scan your
query sequence. Please refer also to the MotifScanner section at the main page
!
GEN3...know if there are
repetitive elements in my DNA sequence ? (last update Nov. 18,
2005)
1. Screen for (longer) repetitive elements (like LINES
and SINES):
Tip! A
program which is often used in that context is RepeatMasker. RepeatMasker
screens DNA sequences in fasta (or raw) format against a library of
repetitive elements and returns a masked query sequence ready for
database searches as well as a table annotating the masked regions.
Simply paste in your sequence, select the
DNA source (species) and further options if you like. The output table
nicely lists groups of repetitive
elements (like SINES, LINES, LTR elements,...) and their occurrence
within the sequence. Please refer to the RepeatMasker section at the main
page for details ! Note: RepeatMasker uses the Repbase database
of repetitive elements, but possibly the most recent versions
of Repbase are used by CENSOR (see below) !
Tip! CENSOR
is a software tool which screens query sequences against a
reference collection of repeats and "censors" (masks) homologous
portions with masking symbols, as well as generating a report
classifying all found repeats. Thus, CENSOR is somewhat similar to
RepeatMasker. In general, the CENSOR output is very informative
as it presents data in several formats: The graphical SVG Viewer
gives a very good impression of the positions and the sizes of
individual repeats. Note that the SVG Viewer works better in MS
Internet Explorer than in Netscape. The summary table lists
all elements, very similar to RepeatMasker. The masked sequence
shows the query sequence with all repeats replaced by "N".
In addition, all masked segments are listed as separate fasta
sequences. All pairwise alignments of the query and the repeat
sequences are shown. The database entries of all repeats are
shown.
Please note that although
sequence analysis using CENSOR is not
restricted, the viewing of individual repeat database entries
requires registration (free for academic use) ! NOTE:
As CENSOR is provided by the same site which maintains the Repbase Update
database (GIRI), one can be sure
to use the most recent version of this database for the
analysis ! If you want to get detailed information about individual
repeats, you may browse or search the Repbase Update database.
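To illustrate what a "masked" sequence means in practice, here is a minimal Python sketch. It assumes that the repeat annotation is available as a list of 1-based, inclusive (start, end) coordinates, for example copied from a summary table. The function name and the example data are hypothetical, and this is not the algorithm used by CENSOR or RepeatMasker; it only shows the masking step itself.

```python
def mask_repeats(sequence, intervals, soft=False):
    """Mask 1-based, inclusive (start, end) intervals in a sequence.

    Hard masking replaces repeat bases with "N"; soft masking
    converts them to lower case instead.
    """
    seq = list(sequence.upper())
    for start, end in intervals:
        # Clip the interval to the sequence and convert to 0-based indices.
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            seq[i] = seq[i].lower() if soft else "N"
    return "".join(seq)


query = "ACGTACGT" + "T" * 8 + "ACGT"
repeats = [(9, 16)]  # hypothetical repeat annotation

print(mask_repeats(query, repeats))             # repeats replaced by N
print(mask_repeats(query, repeats, soft=True))  # repeats in lower case
```

Hard masking ("N") is the safer choice before database searches, whereas lower-case masking keeps the original bases readable.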
Another alternative is the program Repeat,
which in addition includes a nice graphical output showing the
positions of repeats in the query. The masked output sequence is ready
for Copy/Paste. Note that Repeat has an upper limit of 31 kb
of input sequence !
H-Invitational
Database (H-InvDB) is a human gene database opened to the
public in
April 2004, which is hosted by the Japan Biological Information
Research
Center (JBIRC)
and by the DNA Databank of Japan (DDBJ). The scope of
H-InvDB is to provide an integrative annotation of full-length cDNA
clones available from high throughput cDNA sequencing projects. If you
want to scan a cDNA sequence of interest for repetitive
elements, you may
perform a simple
keyword
query, and then look at the "cDNA view", which contains a
link
concerning repetitive elements within the "cDNA information" section.
This
leads to the display of the Repeat Mask Viewer, which nicely
shows
the position of the repeat within the input sequence as lower case
letters.
Please refer to the H-InvDB
section at the Data Integration page
for a detailed description !
2. Screen for (short) nucleotide repeats:
There is a whole list of EMBOSS tools for
this purpose available also as web-interfaces at the Pasteur Institute.
Repeats
scans a DNA sequence, looking for tandemly repeated patterns where the
period of the repeat has a user specified size from 1 to 32
nucleotides. Einverted
finds DNA inverted repeats. Equicktandem
finds tandem repeats. Etandem
looks for tandem repeats in a nucleotide sequence, using the repeat size
that Equicktandem suggests. Palindrome
looks for inverted repeats in a nucleotide sequence.
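The idea behind a tandem repeat search with a user-specified period can be sketched in a few lines of Python. Note that this is only an illustration under simplifying assumptions: it finds exact copies only, the function names and the minimum number of copies are invented, and it does not reproduce the scoring of the EMBOSS programs.

```python
def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter word."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d)
                   for d in range(1, n))


def find_tandem_repeats(seq, min_period=1, max_period=32, min_copies=3):
    """Return (start, period, copies, unit) for exact tandem repeats.

    start is 0-based; only perfect, directly adjacent copies are detected.
    """
    seq = seq.upper()
    n = len(seq)
    results = []
    for period in range(min_period, max_period + 1):
        i = 0
        while i + period * min_copies <= n:
            unit = seq[i:i + period]
            if not is_primitive(unit):
                i += 1  # e.g. "AA" is already covered by period 1
                continue
            copies = 1
            while seq[i + copies * period:i + (copies + 1) * period] == unit:
                copies += 1
            if copies >= min_copies:
                results.append((i, period, copies, unit))
                i += copies * period  # continue after the repeat block
            else:
                i += 1
    return results


example = "TT" + "CAG" * 4 + "AA"
print(find_tandem_repeats(example))
```

For real data, where repeat copies are rarely perfect, the dedicated programs listed above are the better choice.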
REPFIND
is a program to find clustered, exact repeats in nucleotide
sequences. For each repeat cluster that it finds, it calculates a P-value,
which indicates the probability of finding such a concentration of that
particular repeat just by chance. Of the many possible clusters for
each repeated word, REPFIND selects the one with the most significant
P-value. REPFIND only accepts single sequences as input (no
batch
submission). Please note one point concerning the "Low
Complexity Filter". You should carefully select / deselect this
option
when looking for motifs like AU-rich elements, which are similar to a
"low complexity" sequence, and which therefore would be masked out
(hidden)
prior to the analysis. Please also refer to the REPFIND description at the main page.

GEN4...know
which promoters or enhancers in a whole
genome contain a binding site for a single or a combination of
transcription factors (Motif
Matching; Module Scanners) ? (last
update Mar. 14, 2006)
This question
addresses the problem that it is not very useful to BLAST e.g. the
whole
human genome using a short sequence stretch like "AACAATG". NOTE:
The first step, the identification of the binding site
for an individual transcription factor, is described in question
GEN7 !
There are several ways to deal with this question (Tip: use options 3 and 4 !):
1. Databases of curated promoter sequences:
These databases contain sequence data of experimentally
verified
promoter sequences. This means that they are of high quality, but
normally contain only a small fraction of promoters/genes of a species'
genome.
The Eukaryotic Promoter Database (EPD)
is an annotated non-redundant collection of eukaryotic POL II
promoters, for which the transcription start site has been determined experimentally.
Note that it is only possible to perform SRS-like keyword searches
in the EPD database but NO sequence searches (like BLAST). BUT
you can download
the promoter sequences into a simple FASTA file. You can then - after
removing the line breaks in a word processor like WORD - simply search
this text file via the
WORD search function using the nucleotide string as query. Of course,
the more elegant version is to analyze the sequence
set with the RSAT tool DNA-pattern,
which also
allows searching for complex patterns including IUPAC codes (see
description below). Note that
in principle you can BLAST the EPD
at the Swiss EMBnet, but it does not accept short
oligo sequences as input.
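If you are comfortable with a little scripting, the "remove line breaks and search" step can also be done in Python. The sketch below reads FASTA text, joins the wrapped sequence lines, and looks for a short word on both strands. It is a hedged illustration only: the function names and the example records are invented, and a real downloaded file would be read from disk instead of from a string.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_word(records, word):
    """Return {header: [(position, strand), ...]}.

    Positions are 1-based and always refer to the leftmost base of the
    hit on the given (forward) sequence, also for "-" strand hits.
    """
    word = word.upper()
    rc = reverse_complement(word)
    hits = {}
    for name, seq in records.items():
        found = []
        for strand, query in (("+", word), ("-", rc)):
            pos = seq.find(query)
            while pos != -1:
                found.append((pos + 1, strand))
                pos = seq.find(query, pos + 1)
        if found:
            hits[name] = sorted(found)
    return hits


# Invented records; the motif in promoter_A spans a line break on purpose.
fasta = (
    ">promoter_A\nGGGAACA\nATGCCC\n"
    ">promoter_B\nTTTCATTGTTAAA\n"
    ">promoter_C\nACGTACGT\n"
)
print(find_word(read_fasta(fasta), "AACAATG"))
```

In contrast to a plain text search, this also reports hits on the opposite strand, which are easily missed when searching by hand.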
2. Databases of in silico predicted/extracted promoter
sequences:
These databases contain "precomputed" sets
of promoters, which
normally were derived from genome databases by extracting 5'-upstream
regions of cDNA starts. Note that the size and the quality of
these databases depend on the type of cDNA sequence used (RefSeqs,
other curated mRNAs or ESTs) AND on the definition of the reference
position (position +1; see e.g. RSAT).
PRESTA is a
tool/database that combines EST databases and putative
GenBank/EMBL promoters to yield datasets of predicted promoters
at high accuracy. A high-stringency BLAST search reveals ESTs that
assist in transcription start site verification. PRESTA is therefore
very useful for promoter verification by mapping EST 5' ends. In
comparison with EPD, PRESTA offers additional value by
adding EST sequence information to promoter data, therefore not only
verifying transcription start sites but also providing expression data.
Similarly to EPD, you can download the
complete sets of human and mouse promoters into simple FASTA text
files, which then are searchable in a WORD-like text editor. BUT
note that in PRESTA, too, many genes cannot be found or
retrieved.
Alternatively, a good tool to address this question
is part of the package RSAT -
Regulatory Sequence Analysis Tools. The module of this suite
which can be used in this context is called Genome-scale pattern
matching. You can search e.g. the whole (meaning NOT restricted to
EPD promoters !) human genome using the consensus binding site of a
certain TF ("Genome-scale
DNA pattern") or use a matrix-based pattern of a TF ("Genome-scale
Patser"). Please refer to the corresponding
chapter at the main page for further instructions. Note: In
general, RSAT does not provide a selection of TF matrices from a
database but YOU have to know the pattern or matrix of a TF of interest
! If you do not know this information, you may want to have a look at FAQ GEN7 first !
NOTE: As in RSAT-Retrieve Sequence, there is now an option to select the mRNA start
as "reference position"
for upstream sequence retrieval in both programs (see this section for
comments) !
In general,
you may also combine the powers of different programs to achieve a good
result, see next part
!
3. "Do-it-yourself" whole genome promoter
extraction and Motif Matching: Tip!
This is a feasible way to deal with this question,
as it
circumvents the weakness of RSAT in correctly extracting
vertebrate promoter sequences in particular. An example for human
promoters is shown in the following text. Note, however, that the next section discusses
some of the limitations which are still present.
3.1. "Do-it-yourself" whole genome promoter
extraction:
The first step is to use BioMart to
extract the
complete set of human promoters, please also refer to the BioMart
chapter in question GEN6. At the start page,
choose the human genome, and select "Ensembl Genes". At the filter
page, you may either deselect all boxes, meaning you will retrieve all
genes, or you may limit the output to genes that are at least "a little characterized",
by choosing "Genes with Entrez Gene IDs" or "with RefSeq IDs".
At
the output page, choose the "Sequences Page", where you can select
5' flanking regions. Note that you should select "Genes -
transcript information ignored (one output per gene)", so that you
receive only the upstream region of each LONGEST 5'-cDNA. A range of
1000 bp of 5'-flanking sequence is possibly a good start. Select "Text,
fasta" as
output format. After a few minutes, you will retrieve a (long)
txt-file of promoter sequences, which can be saved and used for further
analyses.
3.2. Consensus-based (pattern-driven) Motif Matching:
Note that there is sometimes confusion when
talking about definitions of words like "consensus", "pattern",
"profile", or "motif". There is a good, concise introduction in a paper
describing the server WeederWeb, which actually is used for
Motif Discovery. In consensus-based (or pattern-driven)
methods, the different oligonucleotides recognized by a given
transcription factor are described by their consensus,
representing, for each position, the nucleotide that appears most
frequently in the binding sites. Profile-based (or alignment-driven)
methods, on the other hand, work with profile matrices instead
of consensus sequences (see below).
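To make the term "consensus" concrete: given a set of aligned binding sites of equal length, the consensus is simply the most frequent nucleotide in each column. A minimal Python sketch follows; the binding sites are made up for illustration, and ties are resolved alphabetically, which is an arbitrary choice of this example.

```python
from collections import Counter


def consensus(sites):
    """Most frequent nucleotide per column of equal-length binding sites."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    result = []
    for column in zip(*(s.upper() for s in sites)):
        counts = Counter(column)
        # Ties are broken alphabetically to keep the output deterministic.
        best = min(counts, key=lambda base: (-counts[base], base))
        result.append(best)
    return "".join(result)


sites = ["GATAAG", "GATAAG", "GATTAG", "CATAAG", "GATAAC"]  # made-up sites
print(consensus(sites))
```

Note how the consensus hides the fact that some positions are more variable than others. This is exactly the information that a profile matrix keeps.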
Please note that the subject of motif matching
on single promoter sequences is also discussed in FAQ GEN2 !
Tip! The RSAT program DNA-Pattern
(Strings) is an example of pattern-driven motif matching.
Patterns can contain spacers of fixed length (e.g.
CGGn{11}CCG) or variable length (GATAAGn{0,60}GATAAG). In general,
several patterns (separated
by line breaks) can be searched at once against several sequences, meaning
that you can search for several TF sites at once against all promoters
! Note that matches will be displayed independently for each
pattern, unless you combine several patterns into a common one. Example:
If you screen a set of sequences of 1000 bp in length with 2 patterns
and you want to know only those sequences which contain BOTH of them,
you may create a query pattern like "CGGACGn{0,1000}GATTTT".
Note that you should also include a query pattern of "opposite order": "GATTTTn{0,1000}CGGACG". Note that
if you know a "maximum allowed distance" between 2 patterns like 2
transcription factor binding sites you may specify this value in the
query, like "CGGACGn{0,60}GATTTT".
The first word of each line
is the string description of the pattern,
the second word is an identifier for this pattern (if you do not enter
an identifier, then the string (sequence) of the pattern is also used
as
identifier). Example: Type the following text in the Query pattern(s)
box: GATAAG Gata_Box. You can
also display the results graphically, please refer to the tutorial
for further information. Using "Browse" you select the "whole genome
promoter - txt file" that you created in the first step. Note
that when scanning large sequence sets, one might be interested in counting
the number of matches, rather than returning their precise
positions. This can be done by deselecting the checkbox match
positions and selecting match counts instead, and by
specifying a threshold. Threshold means that
only those hits (promoters) are returned which have at least the
specified number of matches (e.g. promoters showing at least 2 or 3 copies of a
TF site).
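The pattern syntax described above (IUPAC codes plus spacers like n{11} or n{0,60}) maps almost directly onto regular expressions, so the same kind of search can be sketched in Python. This is a hedged illustration and not the RSAT implementation: the function names are invented, only the given strand is scanned, and a "match" is counted per start position.

```python
import re

# IUPAC nucleotide codes translated to regular expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pattern_to_regex(pattern):
    """Translate e.g. 'CGGn{11}CCG' into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "{":
            close = pattern.index("}", i)
            parts.append(pattern[i:close + 1])  # keep {n} or {min,max} as is
            i = close + 1
        else:
            parts.append(IUPAC[ch.upper()])
            i += 1
    return re.compile("".join(parts))


def count_matches(regex, seq):
    """Count start positions with a match (overlapping matches included)."""
    lookahead = re.compile("(?=" + regex.pattern + ")")
    return sum(1 for _ in lookahead.finditer(seq.upper()))


def promoters_with_pattern(promoters, pattern, threshold=1):
    """Return {name: count} for sequences with at least `threshold` matches."""
    regex = pattern_to_regex(pattern)
    counts = {name: count_matches(regex, seq) for name, seq in promoters.items()}
    return {name: n for name, n in counts.items() if n >= threshold}


# Made-up promoter sequences for illustration.
promoters = {
    "p1": "TTGATAAGCCGATAAGTT",
    "p2": "TTGATAAGCC",
    "p3": "ACGTACGT",
}
print(promoters_with_pattern(promoters, "GATAAG", threshold=2))
print(promoters_with_pattern(promoters, "GATAAGn{0,60}GATAAG"))
```

To find two different sites in either order, run the search once for each order of the combined pattern and merge the two result lists.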
FUZZNUC
is a program for nucleic acid pattern matching, using the
typical "user-friendly" EMBOSS interface-style. You simply paste your
set of sequences (or upload a local file in case of very large datasets
like "all human promoters"), which you want to search with your pattern
and the pattern (consensus sequence) itself. Note that patterns
for fuzznuc are based on the format of pattern used in the
PROSITE database (amended
to refer to nucleic acid sequences, not proteins). Please refer to the FUZZNUC main section for details. Note
that the matching hits are listed only as text one sequence after
the other, there is no graphical output,
which is a drawback when screening large sequence sets as it is not
easy to get an overview of match counts and match positions.
3.3. Profile-based (or alignment-driven) Motif Matching:
Profile-based (or alignment-driven) methods, on the
other hand, work with profile matrices instead of consensus
sequences. Briefly, a profile matrix is not a "one-line sequence", but
a table (a matrix) indicating the "importance" of each nucleotide at
each position via specific values.
Please note that the subject of motif
matching on single promoter sequences is also discussed in FAQ GEN2 !
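To give an impression of what "scanning with a profile matrix" means, here is a minimal Python sketch. It builds a log-odds weight matrix from a few aligned sites and scores every window of a sequence. Please treat it as an assumption-laden illustration: the pseudocount, the uniform background of 0.25, the function names and the example sites are choices made for this example, and the scoring is not identical to that of Patser, Match or MatInspector.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(seq, pwm):
    """Return (position, score) for every window, with 1-based positions."""
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[j][base] for j, base in enumerate(window))
        hits.append((i + 1, score))
    return hits


sites = ["GATAAG", "GATAAG", "GATTAG", "CATAAG", "GATAAC"]  # made-up sites
pwm = build_pwm(sites)
hits = scan("TTTGATAAGTTT", pwm)
best = max(hits, key=lambda hit: hit[1])
print(best)
```

In contrast to a consensus search, every window receives a score, so it is up to the user to choose a cutoff that balances missed sites against false positives.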
Tip! The RSAT
program Patser
(Matrices) is an example of profile-based motif matching.
Patser allows you to scan a set of DNA
sequences with a profile matrix, which
can be in Transfac, Gibbs, or Consensus
format. Please refer to the Patser main
section for information on the different formats.
Motif
Matcher was developed by Jim Kent at UCSC and is part of the
page "cis-Site
Seeker". Motif Matcher is a program for finding where a given motif
occurs in sequence data. The Motif
Matcher help gives a very good, concise introduction on the
"nature" of a motif, making it quite easy to convert a consensus
sequence into a
motif. Please refer also to the Motif
Matcher main section for details on "motif construction". The output
presents the highlighted motif matches along the
sequences and a graphical summary of the motif positions, which is
comparable to the RSAT "feature map" produced after performing a "DNA-pattern"
or "Patser"
search. Note that Motif Matcher is not suitable for
screening very large datasets (like "all human promoters"), as
there is no file upload option (only copy/paste the sequence set).
3.4. Annotating the lists of matching genes / promoters:
In another step (optional), you may
feed the accession numbers of the output table of step 2 into TOUCAN
and look if other transcription factors are also over-represented in
this set of promoters, potentially revealing TFs which might be
co-regulated with your TF of interest. Please refer to the TOUCAN chapter at the main
page for detailed instructions (Sequence
Retrieval, MotifScanner, Statistics, ModuleSearcher). Note that
in TOUCAN, you can also quickly extract a complete annotation table
for your set
of genes ! Alternatively, you can feed your RSAT list of potential
target genes back into BioMart
(at the "Filter" Page, "Limit to Genes with these IDs") in order to
obtain an annotation table that keeps all hyperlinks
active in an EXCEL sheet (at the "Output" Page, choose the "Features"
you would like to see).
3.5. "Functional Clustering" of matching genes / promoters:
In a final step (optional), if the RSAT
list should be very long, you may perform a "functional clustering"
of the genes. For this purpose, you may use the KEGG tool KAAS - KEGG Automatic
Annotation Server. Please refer to FAQ
PATH1 for details.
4. Strategies based on Comparative Genomics
and / or combinations of TFBS ("Module Scanners"):
Tip!
It has to be clearly stated that strategies like the
one described in the previous section often produce long lists of
potential hits which means at the same time that the background
or "dust" is quite high. Also, these approaches definitely are
restricted to "proximal" promoter regions, let's say 1 kb
upstream of the TSS (Transcription Start Sites). Especially in higher
eukaryotes, data have been emerging which strongly suggest that it is
not sufficient to concentrate on these proximal regions if one wants to
get a comprehensive insight into a gene's expression regulation.
Regulatory elements like enhancers, silencers, or insulators can be
found 50-100 kb upstream or downstream of a gene, and also introns of
neighbouring genes can be "hot candidates" for regulatory elements. On
the bioinformatics side, comparative genomics has proven to be
THE
method of choice for predicting regulatory regions, as these are
normally highly conserved. The selection of species is critical for this
purpose, and it has been shown that comparisons of moderately
related species, like human and mouse, are ideal.
In addition,
transcription factors often do not act alone but in combination with
other factors ("modules"), or they have multiple binding
sites within regulatory regions. There are a few tools available
which perform "whole-genome approaches" to identify potential
target
genes of TF combinations.
In FAQ GEN5,
strategies are
described for extracting potential combinations of TFs from a set of
genes, using programs like CREME or ModuleSearcher. If
you now want to know which regulatory regions in the whole human
genome
also contain this special combination of TFs, you may use one of the
following programs, which collectively might be called Module
Scanners. These programs are also described at the main page, section "TF Module Matching".
Tip! SynoR performs genome-wide scans
for clusters of evolutionary conserved
transcription
factor binding sites (cTFBS) in user-specified spatial
configurations. SynoR is part of the portal Dcode.org for comparative genomics
and gene regulation.
The current version of this program scans human and mouse genomes
for TFBS conserved in comparisons with either other mammals, chicken,
frog, or fish. The identified cTFBS modules
and corresponding genes go through several steps of functional
annotation. (1) cTFBS modules are classified as
promoters (regions 1.5 kb upstream of TSS), UTRs, introns, intergenic,
or coding exons depending on
their relationship to "UCSC known genes". (2) Interspecies
conservation
is performed for all the identified modules to describe the
evolutionary history of different modules. (3) Gene Ontology
(GO)
characterization is performed for genes bracketing the identified
noncoding
modules. (4) GNF Expression Atlas 2 analysis is performed for
these
genes, thus allowing the prediction of tissue specificity
of the identified modules. Please refer to the SynoR section at the main page for
details.
Tip! ModuleScanner,
integrated in the TOUCAN
package, performs genomic
searches with a predicted CRM (cis-regulatory
module) or with a user-defined CRM known from
the
literature to find possible target genes. Please also refer to the TOUCAN
chapter at the main page for additional information, like general
program setup. Starting from
a blank page in TOUCAN, choose "Motifs", "ModuleScanner".
You
have to
choose one of the databases which all comprise pre-computed
sets of CNS - conserved noncoding regions
(minimally 75%
identity within 100bp windows) between 2 species within 10kb
upstream of the coding regions, like CNS between human- mouse or human
- zebrafish or human - chicken. Also, you can choose to display
either the regions of the "primary" or the "second" species in the
output. Finally, you can choose between the TRANSFAC or the JASPAR
matrices of TFBS to visualize.
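The CNS criterion mentioned above (minimally 75% identity within 100bp windows) can be illustrated with a small Python sketch. It assumes that the two sequences have already been aligned (equal length, "-" for gaps); the function name and the treatment of gaps as mismatches are my own assumptions and do not describe how the TOUCAN databases were actually computed.

```python
def conserved_windows(aligned_a, aligned_b, window=100, min_identity=0.75):
    """Return (start, identity) for alignment windows meeting the threshold.

    Both inputs are rows of one pairwise alignment (same length, '-' = gap).
    Gap columns are counted as mismatches (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    results = []
    if len(matches) < window:
        return results
    current = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide by one column: add the new column, drop the old one.
            current += matches[start + window - 1] - matches[start - 1]
        identity = current / window
        if identity >= min_identity:
            results.append((start, identity))
    return results
```

For example, `conserved_windows("ACGTACGT", "ACGTTTTT", window=4)` keeps only the first two windows, since the identity drops below 75% further downstream.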
Then, you have to
select the transcription factor matrices from a list, which you
would
like to use for the scan, or you
enter them manually as a string, separated by commas, like e.g.
"[M00052-V$NFKAPPAB65_01,M00189-V$AP2_Q6]". You may enter also more
than 2 TFs, or you may even look for clusters of only ONE special TF,
by using e.g. "[M00052-V$NFKAPPAB65_01,M00052-V$NFKAPPAB65_01]". You
may change the number of "top hits" to return, which is based on
the score of the ModuleScanner.
The output lists CNS regions
where the chosen TF combination is
found. Note: The numberings at each CNS in the
output indicate the position relative to the coding sequence of the
gene. Note that, ONLY conserved regions between the 2 species
selected
are
scanned, BUT a displayed TF site is taken from the "primary" sequence
only (the one indicated in the database selection list),
and is not necessarily conserved in the second species. Note
that also all other predicted TFs (using
MotifScanner prior 0.2) are displayed as colored boxes, but you may
selectively choose the "modules" by highlighting the "Mod..." entries
in the left column and hitting "Enter".
If you want to know which genes correspond
to the Ensembl
GeneIDs, you may use the TOUCAN annotation tool via "Get_Seq", "From
Ensembl",
"Get Info". You may also paste the Ensembl GeneIDs into the Ensembl query field or other data
retrieval tools like BioMart,
in order to perform advanced annotation and data retrieval (see also BioMart chapter at the main page).
Tip! DBTSS - Search for TF Binding
Site is a "sub-program" of the DBTSS database, accessible via
the links in the left frame at the start site. DBTSS - Search for TF
Binding Site can search for
promoters containing
putative binding sites of particular transcription factors (TFs). There
are 3 sequence databases which allow
a
"whole-genome" search for TFBS modules:
human, mouse, and human-mouse conserved. The human and mouse sequence
databases contain 1.2 kb for each gene
(1.0 kb upstream of TSS and 0.2 kb downstream of TSS). This means that
DBTSS focuses on proximal promoters, but not on distal
regulatory elements like enhancers or silencers. Analysis of evolutionary conservation is therefore
restricted to this region. Please refer to the DBTSS - Search for TF
Binding Site section at the main page for details.
Target
Explorer automates the entire process from the creation of
a customized library of binding sites for known transcription
factors through the prediction and annotation of putative target
genes that are potentially regulated by these factors. Target
Explorer was specifically designed
for the well-annotated Drosophila melanogaster genome, but some
options can be used for any sequence of interest. A free registration
is needed in order to
use the Target Explorer programs. Specific options can be used
to scan ANY kind of sequence set
(not only Drosophila
sequences) for modules of TFBS. Unlike programs like
SynoR, Target Explorer does not take any kind of evolutionary
conservation of TFBS into account. Please refer to the Target Explorer section at the main
page for details.
ModelInspector
is a commercial program to scan sequence databases for the
presence of TFBS modules (which can also be self-created using
the
program FastM).
FastM
is a method to develop user defined models of transcriptional
regulatory DNA units (e.g. promoters). These models can be built using various
individual elements (like transcription factor binding sites,
repeats, hairpins) and their sequential order. Thus, IUPAC
sequence elements can be successfully combined
with different types of weight matrices and structural elements (e.g.
hairpins) in the assessment of match quality. Between each
pair of elements for a model a distance
range has to be defined. ModelInspector
utilizes
either a library
of predefined models
or models generated by FastM
or FrameWorker to scan
your own DNA sequences or sequence databases for new regulatory units
matching the
model. Examples of databases
which may be scanned are: GPD
(Genomatix Promoter Database), ElDorado
Genomes, EPD (Eukaryotic
Promoter Database), RefSeq, and various GenBank sections.
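The basic idea of such a model (individual elements in a defined order, with a distance range between each pair) can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions: the elements are given as regular expressions instead of weight matrices, and the function names are hypothetical. It is NOT the FastM / ModelInspector algorithm.

```python
import re

def element_hits(sequence, pattern):
    """Return (start, end) of all matches of one element, including overlapping ones."""
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start(1), m.end(1)) for m in regex.finditer(sequence)]

def find_module(sequence, first, second, min_gap, max_gap):
    """Find ordered pairs of elements separated by min_gap..max_gap bases.

    The gap is counted from the end of the first element to the start
    of the second one (an assumption of this sketch).
    """
    pairs = []
    for first_start, first_end in element_hits(sequence, first):
        for second_start, second_end in element_hits(sequence, second):
            gap = second_start - first_end
            if min_gap <= gap <= max_gap:
                pairs.append(((first_start, first_end), (second_start, second_end)))
    return pairs

# Toy promoter: a TATA-like element followed by a GC-box-like element.
toy_sequence = "GGTATAAACCCCCGGGCGGTT"
module_hits = find_module(toy_sequence, "TATAAA", "GGGCGG", min_gap=3, max_gap=10)
```

Tightening the distance range (e.g. `max_gap=4`) makes the toy module disappear, which shows why the distance range is such an important part of a model definition.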
The Genomatix
Promoter Database (GPD) is part of the commercial
Genomatix suite
of
products. With GPD, Genomatix claims to offer
the "most complete eukaryotic promoter database" and the "only one
containing promoters for alternative transcripts". Promoter
extraction via GPD is available for entire organisms or for microarray
platforms (like Affymetrix arrays). There are three
possible quality levels (gold-silver-bronze) assigned to each
transcript which is associated with a promoter. GPD also offers
pre-made annotation for promoter modules (combinations of TFBS)
(EXCEL sheet) as well as module descriptions and TFBS matrix
descriptions (txt files). Access to the GPD is exclusively
commercial, and not part of the free academic "evaluation account".
Note:
FastM and ModelInspector together form a system to help create
your own model of regulatory elements; it is NOT a system to extract
a model (co-occurring sites) from a given set of sequences (like in FrameWorker
or in the TOUCAN program ModuleSearcher
or in CREME) ! Rather, ModelInspector
is similar
to the TOUCAN program ModuleScanner.
NOTE: FastM and ModelInspector are part of the "GEMS
Launcher"-section
of the commercial Genomatix
Suite. NOTE: Genomatix has termed
the free
academic access "evaluation
account". Note that in general, there is not only a limitation
in the number of analyses (max. 20 GEMS
analyses (sequences) per month!) but
also in the functionality
of the obtained data !
EEL
- Enhancer Element Locator is a tool for locating distal
gene enhancer elements in mammalian genomes by comparative genomics
and to identify conserved TFBS in predicted enhancers.
EEL is described in Hallikas
et al., Cell 2006. Please refer also to the main section of EEL.
In order to address this specific question, you may
try a search in EEL
Database of precomputed EEL alignments. EELweb stores
precomputed alignments between orthologous genes from human and many
other species. The data is regularly updated with some synchronization
with the ENSEMBL database, which is used as the source of genomic information. EELweb
can be searched for conserved TFBS
in 100 kb upstream and downstream
regions of ALL genes (whole-genome
approach). For this purpose,
simply leave the field for the Ensembl Gene IDs empty (which means that
you want to use ALL Ensembl IDs). Note: this option works only
with the "any site" selection of TFBS. NOTE: Output of the
web-based database version is restricted to max. 1000 hits, so if you
expect more hits you may have to use the download version of the EEL
program !

GEN5...know which
regulatory elements are common in a set of promoter sequences and check
if these motifs are known transcription factor binding sites (Motif Discovery) ? (last
update Mar.
30, 2006)
This question demands a multi-step procedure
using bioinformatic tools, where each step can be performed either
separately or as part of a complete software suite. I am going to
describe both "All-in-one packages" and "Individual single-step tools".
Please note that this question also includes
programs /databases for phylogenetic footprinting, meaning
procedures to compare promoters from different species and thereby
refine the results, based on the assumption that functionally important
elements are conserved in evolution. An important prerequisite, of
course, is the availability and reliability of orthologous promoter
sequences. In addition, an essential step is the choice of species used
for comparison. If the evolutionary distance is too large, then
the chance of retrieving conserved motifs rapidly decreases. On the
other hand, if the species are too closely related (like e.g. mouse and
rat), it is often hard to distinguish regions conserved by evolutionary
pressure from the "overall" sequence similarity. In any case, you
should try different parameter settings at the input forms of the
different
programs and compare the results. Note that programs
integrating
phylogenetic footprinting are listed under headings containing
"...Multiple
(or 2) species...".
1. "All-in-one packages":
1.1. Genomatix Suite (commercial !):
There is a very nice tutorial
addressing that question available at the Genomatix webserver. It
describes a process of different steps for the analysis of DNA
expression array clusters, thereby using software modules of the Genomatix
suite (Chip2Promoter, GEMS Launcher, and ElDorado). Note
that there are two different (FREE) registrations, one for the
package ElDorado, Chip2promoter and GEMS, and the other for
MatInspector professional. Please also note that the amount of free
runs is very small so you should consider one of the numerous
options of subscription if you want to use these databases regularly.
Anyway, the first step of the procedure is
the clustering of the microarray data, which is performed by
other programs like Cluster,
TreeView or Expression Profiler.
Please refer to the section "Gene Expression and Pathways" at the main
page.
The second step is the automatic in batch extraction
of the promoter sequences belonging to all the cDNAs of the
cluster, performed by Chip2Promoter. For this purpose,
different kinds of accession numbers can be used as input, including Affymetrix ProbeSet
IDs (separated by spaces). Currently, Chip2Promoter is available for
human, mouse, rat, and Arabidopsis. The extracted promoter sequences
can be downloaded for further use. Of course, this process could be
performed "by-hand", like using the UCSC Genome
Browser, for a limited number of sequences, but will be highly
time-consuming for large data sets.
The third step is the search for common
transcription factor (TF) sites in this set
of promoters. For this purpose, two buttons are available at the output
file of Chip2Promoter. The link "Show common TF sites" shows a
graphical and tabular representation of TFs common to a user-defined
percentage of sequences (like 80 % or 90 %). The link "Definition of
common framework" performs a more sophisticated analysis as it screens
for conserved pairs (termed "modules") of TFs (termed
"elements"), with a user-defined maximum distance between two elements.
This procedure reveals a list of potentially significant pairs of TFs
involved in the regulation of the whole cluster of genes.
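The principle behind "TFs common to a user-defined percentage of sequences" is easy to reproduce on your own data. The following Python sketch assumes that you already have, for each promoter, the list of TF sites predicted by some scanning program; the promoter and TF names are invented for illustration, and the code is not the Genomatix implementation.

```python
from collections import Counter

def common_sites(sites_per_promoter, min_fraction=0.8):
    """Return the TF names found in at least min_fraction of the promoters."""
    counts = Counter()
    for sites in sites_per_promoter.values():
        counts.update(set(sites))  # count each TF once per promoter
    needed = min_fraction * len(sites_per_promoter)
    return sorted(tf for tf, n in counts.items() if n >= needed)

# Hypothetical scan results for a cluster of five promoters.
example = {
    "promoter1": ["SP1", "NFKB", "AP2", "SP1"],
    "promoter2": ["SP1", "NFKB"],
    "promoter3": ["SP1", "NFKB", "AP2"],
    "promoter4": ["SP1", "NFKB"],
    "promoter5": ["SP1"],
}
common_80 = common_sites(example, 0.8)
common_90 = common_sites(example, 0.9)
```

Note that a TF which is frequent in your cluster may simply be frequent in ALL promoters, so a comparison with a background set is still needed to judge significance.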
Steps 4 and 5 describe a series of methods
to evaluate the significance of the modules. In step 3, it is
possible to save the predicted models at the FrameWorker output page.
In steps 4 and 5, these files can be used to perform comparisons with
promoters of the EPD (Eukaryotic Promoter Database), or to compare the
predicted modules of one cluster with another cluster. Please refer to
the tutorial
for detailed instructions.
Finally, step 6 describes refined
literature analyses to estimate the functional significance of the
found modules. For this purpose, the Genomatix package ElDorado can be
used. This tool creates cocitation networks of transcription factors
and genes of interest.
1.2. RSAT:
A suitable (and FREE) alternative for the Genomatix
suite is a software package termed RSAT - Regulatory Sequence
Analysis Tools. RSAT
consists of a series of modular computer programs specifically designed
for the detection of regulatory signals in intergenic sequences.
The only input
required is a list of genes of interest (e.g. a family of co-regulated
genes). From this information, you can retrieve the upstream sequences
over a desired distance, discover putative regulatory signals, search
the matching positions for these signals in your original dataset or in
whole genomes, and display the results graphically in the form of a
feature map. Each tool is presented as a form to fill. For each form, a
help page provides detailed information about the parameters. Please
refer to the corresponding chapter at the
main page for detailed instructions.
1.3. TOUCAN:
Tip!
Another excellent (and FREE) alternative for the Genomatix suite is TOUCAN,
which is a platform
independent, standalone Java application that is
tightly linked with Ensembl. TOUCAN was developed by the
bioinformatics group at the department of electrical engineering (ESAT) at the Katholieke Universiteit Leuven
(Belgium). Please also refer to the TOUCAN chapter at the main page for
detailed instructions (Sequence Retrieval, MotifScanner, Statistics,
ModuleSearcher). I will give here a brief overview about the
program's capabilities. First, you can perform in-batch extraction
of promoter sequences using a list of diverse identifiers (like
LocusLink, Ensembl, RefSeq and many more), which works similarly to the
BioMart sequence
extraction, yielding one output (promoter) per gene. You may also
simultaneously extract promoters of orthologous genes from human, mouse
and rat, potentially raising the functional relevance of common
transcription factor binding sites (TFBS). The tools MotifScanner
and MotifLocator detect TFBS in your sequence set and
graphically display these sites. The Statistics tool detects
over-represented features in your set, thereby pointing at significant
TFBS. The ModuleSearcher scans your sequences for high-scoring
combinations of TFBS. The ModuleScanner checks
predicted modules against whole-genome CNS (Conserved Non-coding
Sequence) regions. The MotifSampler detects over-represented patterns
in your sequence set,
which then might turn out to
be known or even unknown TF binding sites. The AVID
tool
displays regions of high similarity between 2 sequences, which in turn
can be saved as a sequence sublist and be analyzed with e.g.
MotifScanner. FootPrinter is a program that performs
phylogenetic footprinting.
It takes as input a set of unaligned orthologous sequences from various
species,
together with a phylogenetic tree relating these species. It then
searches
for short regions of the sequences that are highly conserved. Via the
tool
"Consensus match", you can scan your sequences using your own
patterns like "AAAGGTAA" or "WWRYAATC{1,5}NCA". Finally, you
can also quickly extract a complete annotation table for your
set of genes, using "Get_Seq", "From Ensembl", "Get Info", "Update" !
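If you want to run such consensus patterns outside of TOUCAN, they can be translated into regular expressions. The Python sketch below expands the IUPAC ambiguity codes and assumes that a suffix like "{1,5}" has the usual regular-expression meaning (the preceding symbol repeated 1 to 5 times); whether the TOUCAN tool interprets it in exactly the same way is an assumption on my side. Only the forward strand is searched.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Translate IUPAC letters; copy digits, braces and commas unchanged."""
    return "".join(IUPAC.get(symbol, symbol) for symbol in pattern.upper())

def find_matches(pattern, sequence):
    """Return (position, matched text) for non-overlapping forward-strand hits."""
    regex = re.compile(iupac_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence.upper())]

converted = iupac_to_regex("WWRYAATC{1,5}NCA")
hits = find_matches("WWRYAATC{1,5}NCA", "GGATGCAATCCCGCATT")
```

To cover the reverse strand as well, you would run the same search on the reverse complement of the sequence.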
2. "Individual single-step tools":
2.1. Automated in-batch promoter extraction:
Please refer to question GEN6,
A) for a detailed description of the different tools, especially PromoSer,
BioMart, and TOUCAN.
2.2. Multiple genes, single species, TFBS and
TFBS modules search
(Co-regulation):
MotifScanner
is not only implemented in TOUCAN, but also available as stand-alone web
interface. This program detects TFBS in your sequence set and
graphically displays these sites. At the input window you have to
supply one common FASTA sequence file, and select the motif model (e.g.
for human sequences you may choose the option "TRANSFAC 6.0 public -
Vertebrates") and the type (species) and order of "Background Model"
(3rd order models are fine in most cases). In addition, it is quite
useful to "play around" with the "prior" value (the probability
of finding a copy of a motif),
which replaces the "core and matrix similarity values" of MatInspector,
meaning a lower prior (like 0.1) is more stringent than a higher one
(like 0.9). In contrast to TOUCAN, the output is not
displayed graphically but as a simple list of TRANSFAC matrices which
match to your input sequences, according to the criteria specified by
the user. You may upload a multiple FASTA sequence file but there will
be no comparisons of the TF composition between sequences.
Using MatInspector
professional (commercial, limited number of free runs !)
available at Genomatix, you can search a multiple FASTA sequence file
for the presence of TF sites. Note that you should use MS Internet
Explorer (and not Netscape) in order to make use of the functionality
of the Adobe SVG Viewer, which allows you to handle the TF site
diagrams interactively. At the bottom of the output page
there is a button called "Search for common TF sites", opening
an SVG window, where you can adjust the display to show only those TF sites present
in x of total y input sequences. Via right-click and "Copy SVG", you
can paste the image into e.g. Corel Photopaint in order to save it in any
file format.
Tip! TELiS is a very
fast and very easy-to-use system to find transcription factor
binding motifs (TFBMs) that are over-represented
in promoters of differentially expressed genes. Modern high-throughput
methods like microarrays often generate lists of genes which show a
different expression pattern under different experimental conditions
like biological stimuli. TELiS is capable of extracting very quickly
the over-represented TFBMs in such promoter datasets. This is done by
pre-solving the most computationally
intensive part of the problem, scanning large nucleotide sequences for
multiple TFBMs. Thus, the TELiS database contains information on the
prevalence of TFBMs in the promoters of all human, mouse, and rat
genes. 3 different promoter sizes have been extracted from
genome
databases: 300 or 600 bases upstream of the Transcription Start Sites
(TSS), or a region from -1000 to +200, all corresponding to mRNA
sequences from the NCBI RefSeq
database. The program MatInspector
was used to determine the TFBMs in these sequences, using 3
different stringency values: Matrix similarity 0.8 ("low"), 0.9
("high"), and 0.95 ("extreme"). 2 different TFBM databases
can be used, the public TRANSFAC
database version 3.2, or the open-access JASPAR database. Note:
Although TRANSFAC has a higher number of TFBS, this public version is
not updated, in contrast to JASPAR, which is a smaller set that is
non-redundant and curated. Note: If you want to get detailed
information on individual TFBMs, please refer to FAQ
GEN7 ! Note that also in TELiS, the matrices of all TRANSFAC TFBMs
and JASPAR
TFBMs can be browsed one after the other, which is still less
convenient than using other options listed in FAQ GEN7.
The only input required from the user is a
list of HUGO Gene
Symbols,
separated by tabs, spaces, or line breaks, and to choose one of the
promoter sizes and one of the stringency values. The TELiS
publication
states that analyses of short promoter sequences (300 bases) with
moderate stringency (0.90) provided optimal signal detection, whereas
analyses using longer sequences or lower stringency produced poorer
signal-to-noise ratios. Finally the user has to select the
microarray platform which was used in the personal experiment. The last
point is
necessary because the TFBMs identified in the selected genes are
compared to the TFBMs pre-identified in all genes contained in
the experimental platform as a reference population, in order to
determine over- or under-representation (please also refer to the TELiS background
page for additional discussion). NOTE: If your array is not
listed, you may simply select "All
human / mouse / rat genes" at the bottom of the dropdown-list, meaning
that ALL genes of a species are taken as reference for the analysis.
Also note that the best
results are achieved with sets of 100 or more
genes/promoters, whereas the analytic sensitivity drops significantly
for samples <20. Nevertheless, Incidence analysis p-values are
described to remain accurate for any sample size.
There are 2 different ways to analyze a dataset:
1. "Differential
expression analysis":
Here, a list of the top-scoring TFBMs is produced, which are
color-coded to allow easy identification of over-represented (dark
blue), indifferent (grey) and under-represented (red) TFBMs. The
"Incidence" indicates the number (n) and the percent ("Sample mean") of
promoters which contain at least one binding site, which can be easily
compared to the percent of total promoters of the platform containing
this site ("Population mean"), and the resulting "Ratio" between
promoters in the personal dataset and total promoters of the array.
2. "Get raw data":
Using this option, the data can be downloaded as *.td format, which is
best opened with programs like EXCEL (which maintains the tabular
structure of the output), not with word processors like WORD. This
table shows the number of each binding site in each of the gene's
promoters.
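The "Incidence" columns described above are simple to recompute by hand, which helps when interpreting the output. The Python sketch below derives "Sample mean", "Population mean" and "Ratio" from raw counts. The binomial tail probability is added only as a rough stand-in for a significance estimate; it is my own assumption and not necessarily the test TELiS uses, and all numbers are invented.

```python
from math import comb

def incidence(sample_hits, sample_size, population_hits, population_size):
    """Return (sample mean, population mean, ratio) of promoters with a site."""
    sample_mean = sample_hits / sample_size
    population_mean = population_hits / population_size
    return sample_mean, population_mean, sample_mean / population_mean

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); suitable for small n only."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Invented example: 30 of 100 selected promoters contain the site,
# compared with 1500 of 10000 promoters on the whole array.
sample_mean, population_mean, ratio = incidence(30, 100, 1500, 10000)
p_value = binomial_upper_tail(30, 100, population_mean)
```

Here the site is found twice as often in the selected promoters as in the reference population, and such an excess would rarely arise by chance under this simple model.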
OTFBS,
developed at Institute of
Bioinformatics, Tsinghua University, Beijing, is a method which can detect
over-represented motifs of known transcription factors from a set
of related sequences. In particular, promoters of the same gene family
or from the same tissue can be submitted for analysis. Promoters
of putatively co-regulated genes, clustered by gene expression data,
should also be good candidates for analysis. The version of TRANSFAC
Matrix OTFBS currently uses is Release 6.0. Simply submit the upstream
regulatory regions of a
group of related genes (max. 200), e.g. genes clustered together
by microarray data, or the genes encoding the same functional protein
in a series of related species. Note: There is no option to
adjust any parameters. The Output consists of a simple list of
overrepresented TFBS, and the
positions of all TFBS in all input sequences. Note that only
the TRANSFAC Matrix accession numbers are listed (like "M00086"), NOT
the names of the TFBS !!! If you want to know the identity of these
TFBS matrices you have to query the individual accessions at TRANSFAC
(see TRANSFAC section for
instruction !).
Tip! CREME, which is part of the
Dcode programs provided by the Lawrence Livermore Nat. Lab,
is a web-server for identifying and visualizing cis-regulatory
modules in the promoter regions of a given set of potentially
co-regulated human genes. Eukaryotic genes are often regulated
by several transcription
factors, whose binding sites are spatially clustered and form cis-regulatory
modules. CREME relies on a database of putative transcription
factor
binding sites that have been carefully annotated across the human
genome
using evolutionary conservation with the mouse and rat genomes.
Promoter
extraction was done by mapping RefSeq mRNAs onto the genome assemblies,
and
by taking 1.5 kb upstream of the TSS, or up to the next neighbouring
gene.
The CREME database is built of TFBS which are conserved in
all
3 species (human, mouse, rat), and which show PWM similarity scores
of
0.8 and above. This means that CREME can be queried using a set of
HUMAN
genes, but the predicted TFBS modules are pre-computed based on
whether they are conserved in the 3 species or not. Please refer to the
CREME chapter at the main page for
details.
Tip! ModuleSearcher
is integrated in the TOUCAN
package, and scans your sequences for high-scoring
combinations of TFBS. Please also refer to the TOUCAN chapter at the main page for
detailed instructions. Note that, if you want to know
if a predicted CRM (cis-regulatory module) is found in Human-Mouse CNS
(Conserved Non-coding
sequences) in a "whole-genome approach", you should use the tool ModuleScanner
! Please refer to FAQ GEN4, which specifically
deals with this kind of "whole-genome approach" !
FrameWorker
is a complex software tool that allows users to extract a common
framework of elements from a set of DNA sequences. These elements
are usually transcription factor binding sites since this tool is
designed for the comparative analysis of promoter sequences.
FrameWorker returns the most complex models that are common to the
input sequences (and satisfying the user parameters). These are all
elements that occur in the same order and in a certain distance range
in all (or a subset of) the input sequences. Typical input datasets may
be, for instance, a set of promoters from orthologous genes
(Phylogenetic footprinting) or a set of promoters from different genes
which have been found to be co-regulated by cluster analysis of
expression array data (Co-regulation). NOTE: FrameWorker is
part of the "GEMS
Launcher"-section
of the commercial Genomatix
Suite. NOTE: Genomatix has termed
the free
academic access "evaluation
account". Note that in general, there is not only a limitation
in the number of analyses (max. 20 GEMS
analyses (sequences) per month!) but
also in the functionality
of the obtained data !
2.3. Multiple genes, single species, motif
search (Co-regulation):
Note that there
is a whole section at the main page called "Motif Discovery", where you
will find additional descriptions of the numerous programs in this
field. Here, I will try to concentrate on those which have
user-friendly and easy-to-use web interfaces.
MotifSampler,
which is also part of the TOUCAN
package, is available as individual program from the software
page of the bioinformatics
group at the department of electrical engineering (ESAT) at the Katholieke Universiteit Leuven
(Belgium). MotifSampler tries to find over-represented motifs in
the upstream region of a set of co-regulated genes. This motif finding
algorithm uses Gibbs sampling to find the position probability matrix
that represents the motif. You simply paste a multiple FASTA sequence
file. The output gives nice graphical representations of the
over-represented motifs, and displays their positions along the input
sequences. Please note that the MotifSampler is a Gibbs
sampling implementation, implying that it is a stochastic algorithm,
thus returning different results each time ! Of course, if a
certain motif is really over-represented, the same or similar motifs
should be found in each run, depending on the parameters you've set.
The RSAT
tool Oligo-Analysis
can be used as individual program for the detection of over-represented
oligo sequences of defined length. Please also refer to the RSAT chapter at the main page ! You
should select the oligo size, the background model and the organism.
The results of the analysis are displayed in a table. Each row
corresponds to one oligonucleotide, and each column to one statistical
criterion. The E-value
represents the number of patterns with the same level of
over-representation which would be expected by chance alone. E.g., an
E-value of the order of 10^-6 indicates that, if we submitted
random sequences to the program, such a level of over-representation
would be expected about once every 1,000,000 trials. NOTE: In the bottom
of the result page, click on the button Pattern matching
(dna-pattern); then hit "GO" and then click the Feature map
button, which will produce a graphical image of the results.
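The relation between oligo counts, P-values and E-values can be illustrated with a small Python sketch. It counts all oligos of a given size, estimates a binomial P-value for each count, and multiplies it by the number of possible oligos to get an E-value. The uniform background (every oligo equally likely) and the function names are my own simplifying assumptions; the real Oligo-Analysis program uses calibrated, organism-specific background models.

```python
from collections import Counter
from math import exp, lgamma, log

def count_oligos(sequences, k):
    """Count all overlapping k-mers made of A, C, G, T (forward strand only)."""
    counts = Counter()
    positions = 0
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
                positions += 1
    return counts, positions

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    total = 0.0
    for i in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_term)
    return min(total, 1.0)

def e_values(sequences, k):
    """E-value = P-value x number of possible oligos (uniform background assumed)."""
    counts, positions = count_oligos(sequences, k)
    prior = 1 / 4 ** k
    tested = 4 ** k
    return {word: binomial_tail(n, positions, prior) * tested
            for word, n in counts.items()}

toy_counts, toy_positions = count_oligos(["ACGTACGTACGT", "TTACGTAA"], 4)
toy_e_values = e_values(["ACGTACGTACGT", "TTACGTAA"], 4)
```

In this toy set, "ACGT" occurs four times and gets an E-value far below 1, whereas an oligo seen only once gets an E-value above 1, meaning that such a count is fully expected by chance.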
Tip! The
MEME (Multiple Em
for Motif Elicitation) system allows you to
discover motifs (highly conserved regions) in groups of related DNA or
protein sequences using MEME
(or
MEME mirror at
Pasteur Inst.) and search sequence databases using motifs using MAST (Motif
Alignment and Search Tool).
Simply provide a FASTA-file of your promoter sequence set, and set the
number
of motifs to be extracted. In addition, you may set a "minimum/maximum
motif width", e.g. for TF binding sites you may choose 6 and 9, meaning
the program will extract only motifs ranging from 6 to 9 bp. Individual
MEME motifs do not
contain gaps. Patterns with variable-length gaps are split by MEME into
two or more separate motifs. MEME sends 3 mail-messages: a
confirmation, the MEME results and the MAST results. The MEME results
include very nice multi-colored diagrams, and consensus sequences of
extracted over-represented motifs, and links to perform additional
analyses like BLOCKS, MAST, and MetaMEME.
WeederWeb
is one
of the tools
developed in the lab of Graziano
Pesole, Milan University.
WeederWeb is a web interface to Weeder, a program for finding
novel motifs (like transcription factor binding sites) conserved in a
set of regulatory regions of related genes. WeederWeb is a very user-friendly
web interface. You simply paste
the sequences, check if you want to include the reverse strand, select
your "guess" how many sequences will share a motif, and choose the "speed"
of the analysis. Note that using the "quick scan"
option, only short motifs (6 to 8 bases)
are reported, whereas "normal" and "thorough" modes
scan for motifs from 6 to 12 bases ! Note that you should use
the extended
input
form
if you want to exactly choose a certain motif length, or if you
want to precisely define how many variations may be accepted. There is
also a special option if you are using human 5' or 3'UTRs as
input sequences ! Results arrive as a text file via e-mail,
also containing
a hyperlink which displays the output in a "MEME-like" fashion
(including a "Sequence Logo" - representation). The result also
contains a user-friendly line "Interesting motifs seem to be:".
Tip! YMF
is one of the tools
developed at the computational
molecular biology group, University of Washington. YMF is a program
that
detects statistically
overrepresented words (motifs) in DNA sequences. The user may specify
the characteristics of the motifs to be
detected. A motif here is a short string of nucleotides, degenerate
symbols, and spacers. 'Motif size' is the number of non-spacer
characters in a motif. Spacers ('N's) are constrained to be in
the center of the motif. Degenerate symbols allowed in a
motif are R (purine - A or G), Y (pyrimidine - C or T),
W (A or T), and S (C or G). YMF
uses a very clear, user-friendly web interface. You simply
choose the motif size, the maximum number of spacers and degenerate
symbols (IUPACs), and the organism. Note that although
the page states "Total uploaded sequence data should be < 10000
characters", a test run using a much larger sequence set seemed to work
without problems (and results were delivered very quickly!). The Output contains a simple text
file listing the motifs in "descending order of
reliance", graphical plotting of "Top-scoring motifs" (works
only
with IE6.0+, and NOT with Netscape), and the option "FindExplanators".
FindExplanators
is a program that extracts from the set of significant motifs
reported by YMF, a smaller set of "real"
motifs. More specifically, given a set of DNA sequences P, and a
set of motifs M (such as those reported by YMF), it extracts
a subset E of motifs in M, such that given the occurrences of the
motifs of E in the sequences P, the remaining motifs in M are not
statistically significant.
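The YMF motif alphabet described above (nucleotides, the degenerate symbols R, Y, W, S, and N as spacer) can be searched with a regular expression. The following is only a sketch for checking where such a motif occurs in your own sequences; it does not reproduce the YMF statistics, and the function names are hypothetical.

```python
import re

# Symbols listed above for YMF motifs, plus N as the spacer character.
YMF_SYMBOLS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "N": "[ACGT]",
}

def motif_size(motif):
    """Number of non-spacer characters ('Motif size' as defined above)."""
    return sum(1 for c in motif.upper() if c != "N")

def motif_to_regex(motif):
    """Translate a degenerate motif into a regular expression string."""
    return "".join(YMF_SYMBOLS[c] for c in motif.upper())

def find_motif(motif, seq):
    """Return 0-based start positions of all (also overlapping) matches."""
    # The lookahead makes overlapping occurrences visible to finditer.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [m.start() for m in pattern.finditer(seq.upper())]
```

For example, find_motif("ACRNNNYGT", "TTACGAAACGTTT") reports position 2, and motif_size("ACRNNNYGT") is 6. Only the given strand is scanned; the reverse strand would have to be searched separately.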
2.4. Single gene, 2 species, TFBS search (Phylogenetic Footprinting):
Tip! rVISTA
(regulatory VISTA)
combines transcription factor binding sites
(TFBS) database search with a comparative sequence analysis,
thereby reducing the number of predicted transcription factor
binding sites by several orders of magnitude. As an example, when
comparing promoter sequences of a human gene and its orthologous mouse
counterpart, it is possible to extract those TFBS that are conserved
between the 2 species and therefore are expected to be functionally significant. Note that, in
contrast to mVISTA,
rVISTA works only for 2 input sequences. Note that you may access
rVISTA at 2 different sites, at Lawrence Berkeley Lab.
and at Lawrence Livermore Nat. Lab.,
as described in individual sections at the main page (LBL, LLL). At LLL, there are at
least 3 different ways to run rVISTA (which can be also used as
individual programs !), which are explained at the rVISTA start page (zPicture, ECR Browser, Genome Alignment).
Please refer to the rVISTA chapter
at LLL at the main page for detailed instructions on these programs
and the rVISTA output. Please also refer to FAQ GENOM6 for further information
concerning comparative genomics analyses.
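The principle of keeping only those predicted sites that are conserved between 2 aligned sequences can be sketched as follows. This is a strong simplification and not the rVISTA method itself: sites are described by a regular expression instead of TRANSFAC matrices, and a site counts as "conserved" only if it matches in both sequences in exactly the same alignment columns. The function names are hypothetical.

```python
import re

def column_map(aligned):
    """Map each ungapped sequence index to its alignment column."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]

def conserved_hits(aligned_a, aligned_b, pattern):
    """Return alignment columns where `pattern` starts in BOTH sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    regex = re.compile("(?=(" + pattern + "))")
    hits = []
    for aligned in (aligned_a, aligned_b):
        cols = column_map(aligned)
        ungapped = aligned.replace("-", "").upper()
        # Store each hit as the tuple of alignment columns it occupies.
        hits.append({
            tuple(cols[m.start():m.start() + len(m.group(1))])
            for m in regex.finditer(ungapped)
        })
    # Conserved = the same columns are matched in both species.
    return sorted(h[0] for h in hits[0] & hits[1])
```

With the made-up alignment "AATTGACAGG--CTTGACATT" / "AATTGACAGGAACT-GACATT" and the pattern "TGACA", each sequence contains 2 hits, but only the one starting at column 3 is retained, because the second hit is shifted by a gap in one of the sequences.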
Tip! ConSite is
a program (and web interface) that couples phylogenetic
footprinting with regulatory site detection (mainly
promoter comparison). ConSite is designed to compare 2
orthologous sequences and report conserved TFBS (Transcription
Factor Binding Sites). Note that ConSite uses a database ("JASPAR") of TF profiles
(PWMs) that was newly built from literature data, and that is
therefore "independent" from existing databases like TRANSFAC. At the
ConSite start page, you have 3 different options. Analyze
orthologous pairs of genomic sequences lets you e.g. paste 2
promoter sequences of the "same" (orthologous) gene from human and
mouse, and the program will generate the alignment. Analyze
an existing alignment of 2 genomic sequences lets you use your
pre-made alignment (CLUSTALW format) directly for the TFBS analysis.
Analyze a single sequence lets you analyze a single promoter
for TFBS (without performing cross-species comparison). This option
is comparable to the TESS
system, but utilizes the JASPAR profile collection instead of TRANSFAC.
Please refer to the ConSite
chapter at the main page for detailed instructions about parameters
and options.
2.5. Single gene, multiple species, TFBS search (Phylogenetic
Footprinting):
Tip! multiTF identifies
transcription
factor binding sites conserved across multiple species. There
are
2 different ways to initiate a multiTF search, and I would
suggest
to use MULAN, as this
program
is integrated in the same web-portal. Multiple sequence alignments
generated
by MULAN can be automatically submitted to multiTF from the results web
page.
The "handling" and output of multiTF is very similar to rVISTA, e.g.
the
user can set the parameters for detection of TFBS (like matrix
similarity,
individual TF selection). TFBS can be dynamically visualized along the
sequences
(similar display as in rVISTA but for multiple species). It is
possible
to list and display either ALL TFBS or only those which are conserved
across ALL species. You may also highlight individual TFBS positions in
the alignment.
Taken together, MULAN and the interconnected tool multiTF
somehow represent the "multi-species" equivalent to the system
mVISTA-rVISTA, where rVISTA is based on the TF prediction for 2
aligned species (2 sequences). Please refer also to the main
chapter describing different other programs of the Lawrence
Livermore
National Lab for comparative genomics. NOTE: Now it is also
possible to first
locate your region of interest in ECR Browser, extract the
pre-made alignments with other species, and finally, via the link
"Synteny/Alignments", you may send ALL selected
sequences to MULAN to generate phylogenetic trees and identify multi-species
transcription factor binding sites via multiTF. NOTE: If you
are specifically looking for a TFBS which is not contained in the TF
database used (like TRANSFAC) but for which you have a consensus
sequence (like WWCAAWG), you may scan the MULAN alignment for this
pattern by using the option "User-defined consensus sequences" within
the multiTF input window "Defining transcription factor binding sites".
DiAlignTF
displays transcription factor (TF) binding site (TFBS)
matches within a multiple alignment. It is possible to display all
TF binding site matches, TF binding site matches common to all or
subset of the input sequences, or common TF binding site matches that
are located in aligned regions. The TF binding sites are visualized in
the alignment as colored boxes. The input sequences are aligned with
the multiple alignment program DiAlign.
TF binding site matches are identified by MatInspector.
DiAlign and DiAlignTF are part of the "GEMS
Launcher"-section
of the commercial Genomatix
Suite. NOTE: Genomatix has termed
the free
academic access "evaluation
account". Note that in general, there is not only a limitation
in the number of analyses (max. 20 GEMS
analyses (sequences) per month!) but
also in the functionality
of the obtained data !
2.6. Single gene, multiple species, motif search
(Phylogenetic Footprinting):
Tip! FootPrinter
was developed by the Computational Molecular Biology Group at the
University of Washington. It is available at the respective software site
either as a web
service or as a downloadable
program. Note that there is also a very good FootPrinter
manual, explaining e.g. all the input parameters. Please note
that FootPrinter is also implemented in the TOUCAN software package,
please refer to the TOUCAN chapter
for details. FootPrinter is a program that performs phylogenetic
footprinting. It takes as input a set of unaligned orthologous
sequences from various species, together with a phylogenetic
tree relating these species. It then searches for short regions
of the sequences that are highly conserved, according to a parsimony
criterion. The regions identified are good candidates for regulatory
elements. By default, the program searches for regions that are well
conserved across all of the input sequences, but this can
be relaxed to allow finding regions conserved in only a subset
of the species. Please refer to the FootPrinter chapter at the main
page for detailed instructions on input parameters, construction of
phylogenetic trees, and on handling the different output formats. If
you want to check whether the derived motifs correspond to known TFBS,
please refer to the chapter below.
2.7. Multiple genes, 2 species, TFBS search
(Co-regulation
AND Phylogenetic Footprinting):
Whole Genome
rVISTA (beta version):
This Web site provides access to the computational tool that allows for
evaluation of which transcription factor binding sites (TFBS)
are over-represented in upstream regions in a group of genes.
This beta-version of the tool has been developed for the mm4
version of the Mouse genome (October 2003). A database of all
TFBS in mm4 conserved in the alignment with the Human July 2003 (hg16)
using rVISTA (regulatory VISTA) method was created.
Input: genes of your interest using locus link IDs or RefSeq
names (only Mouse !!!). The programs will calculate which TFBS
located in 5Kbp upstream
regions of these genes are over-represented
(at the P-value cutoff 0.006) using all 5Kbp upstream regions of mm4
RefSeq genes as outgroup. You can also get a list of TFBS
overrepresented in each individual gene of interest. NOTE: Test
runs showed that the program tends to produce quite
long lists of "over-represented" TFBS. It is hard to estimate which of
these are actually biologically meaningful. NOTE: This program
is similar to TELiS,
but uses
conserved TFBS between 2 species instead of TFBS from a single
species, and counts the number of TFBS relative to all
mouse genes (RefSeq) whereas TELiS specifically sets the TFBS of the
sample in relation to the TFBS of all genes present on the microarray
platform used.
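The idea of testing whether a TFBS is over-represented in a gene group relative to all genes can be sketched with a hypergeometric test. Note that this is an assumption for illustration: the statistical test actually used by Whole Genome rVISTA is not described here, and the function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_pvalue(hits_in_group, group_size, hits_in_background, background_size):
    """P(X >= hits_in_group) when group_size genes are drawn from the background.

    hits_in_background = number of background genes whose upstream region
    contains the TFBS; the background is assumed to include the gene group.
    """
    upper = min(group_size, hits_in_background)
    total = comb(background_size, group_size)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, group_size - k)
        for k in range(hits_in_group, upper + 1)
    )
    return tail / total
```

For instance, if a site occurs upstream of 50 out of 1000 background genes but upstream of 8 out of 10 genes of interest, the resulting P-value lies far below a cutoff such as 0.006. When many TFBS are tested at once, long lists of "significant" hits are to be expected unless some correction for multiple testing is applied.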
2.8. Multiple genes, multiple species, motif search (Co-regulation
AND Phylogenetic Footprinting):
Using TRES you can
simultaneously search up to 20 promoter sequences (of
maximum 1000 bp each) for known transcription factor binding sites,
cis-acting elements, palindromic motifs or conserved k-tuples
(phylogenetic footprints).
This is useful for comparative promoter sequence analysis to elucidate
common themes (modules) in functionally or phylogenetically related promoters. Please note that this
program nevertheless does not cover the full functionality of others in
the field of phylogenetic footprinting, therefore results using a sequence set
of different species have to be taken with care. TF binding sites are
searched from TRANSFAC
database,
from ooTFD database and also plant
cis-acting elements
from PLACE database.
When searching for TRANSFAC weight matrices, you have to select a
matrix cut-off, meaning "1.0" would be a perfect match, and reasonable
values are not less than 0.90. There is also a nice function "Report
sites only when present in at least 70, 80, 90 or 100 % of the
sequences".
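The meaning of a matrix cut-off and of the "present in at least x % of the sequences" filter can be sketched as follows. The normalisation used here (score rescaled between the worst and the best possible score of the matrix) is an assumed, simplified formula and not necessarily the one used by TRES; the function names and the example matrix are hypothetical.

```python
def matrix_similarity(matrix, site):
    """Normalised score in [0, 1]; 1.0 = top-scoring base at every position."""
    score = sum(col[base] for col, base in zip(matrix, site))
    best = sum(max(col.values()) for col in matrix)
    worst = sum(min(col.values()) for col in matrix)
    return (score - worst) / (best - worst)

def scan(matrix, seq, cutoff=0.90):
    """Return 0-based start positions whose similarity reaches the cut-off."""
    width = len(matrix)
    seq = seq.upper()
    return [
        i for i in range(len(seq) - width + 1)
        if set(seq[i:i + width]) <= set("ACGT")
        and matrix_similarity(matrix, seq[i:i + width]) >= cutoff
    ]

def present_in_fraction(matrix, seqs, cutoff=0.90, min_fraction=0.8):
    """True if at least min_fraction of the sequences contain a site."""
    with_site = sum(1 for s in seqs if scan(matrix, s, cutoff))
    return with_site / len(seqs) >= min_fraction

# Made-up count matrix for a 5 bp site (one dict per position).
EXAMPLE_MATRIX = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 8, "C": 0, "G": 0, "T": 2},
]
```

With this example matrix, TGACA scores 1.0 and TGACT scores 0.875, so the latter is reported at a cut-off of 0.85 but not at 0.90. This shows why lowering the cut-off quickly increases the number of hits.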
PhyloCon
(Phylogenetic Consensus) is an algorithm that takes into account both
conservation among orthologous genes from different species (Phylogenetic
Footprinting), and co-regulation of genes within a species.
PhyloCon first aligns (by use of the program Wconsensus)
conserved regions of orthologous sequences into multiple sequence
alignments, or profiles, and then compares profiles representing
non-orthologous sequences (as e.g. in clusters derived from microarray
data). Motifs are found as unusually well conserved substrings by
comparative genomic analysis. Note that PhyloCon does not need
the length of the motif a priori. There is currently no web
interface for PhyloCon but the program can be downloaded as Linux
executable at
the Washington University.
2.9. Check if a predicted motif corresponds to a
known TFBS:
If you want to know if the over-represented
motif is a known TF binding site, you should best
check all of the following options: You may try a search
against the TF-site
table at TRANSFAC, see "Search TRANSFAC" in the TRANSFAC
chapter ! As wildcard, use "*" (NOT "n" !). The retrieved hits are
actually promoters, where this site is included, and under "BF" you
will find "Binding Factors" that bind to this site. Alternatively
you may perform this search at TESS.
Note that TESS uses older versions of TRANSFAC. As
"Search Field" choose "Sequence", and enter your pattern in the "text"
field. Carefully analyze the individual hits. You may also feed MatInspector
professional with short input sequences, but tests showed that
these should be at least 10-12 bases long. If so, this option is
possibly the best one ! Yet another option is the so-called Profile
Comparison
Tool provided at the site of the "alternative" JASPAR website of TF
profiles (PWMs) which was newly built from literature data, and is
therefore "independent" from existing databases like TRANSFAC. You may
either paste your consensus sequence or your binding matrix and search
for similar motifs in the JASPAR database. The output is very
user-friendly displaying all hits as instructive multi-colored sequence
logos. If you have a user-created profile instead of a
consensus sequence, you may submit
this profile and
compare it to the profiles of TFBMs in the JASPAR database, using the option
"Compare custom profile to database profile". Please refer to the JASPAR Help-section to see what profiles may look
like.
2.10. Verify that a predicted motif is not "over-represented"
in a randomly generated sequence set:
There is a separate FAQ addressing this question,
please refer to GEN10 !
2.11. Produce sequence logos for over-represented motifs in the
sequence set:
This is actually a matter which is tightly
associated with multiple sequence alignments, so please refer
to question SIM4 for that purpose.

GEN6...quickly extract potential promoter sequences
for a batch
of human genes ? (last update May
29, 2006)
This is in general a tricky question, because for
many genes, the translation start (ATG) is known BUT NOT the exact
transcription start site (TSS), which of course is the reference
position to extract the promoter sequence, let's say to take -800 bp
upstream and +100 bp downstream of genomic sequence. Today, several
tools are available that address this specific question. There are
3 groups of programs:
1. Supporting batch-queries:
Tip! PromoSer is a
service for promoter extraction for human, mouse, and rat genes
provided as part of the Gene
Regulation Tools of the Zlab, which belongs to
the Boston University Bioinformatics.
PromoSer comes with a compact, but very instructive Help-file
describing all the different options, making PromoSer one of the
best tools for this purpose. As input, you can use lists of GenBank
accession numbers (RefSeqs, mRNAs, and ESTs). There is no option to
use e.g. Affymetrix IDs. You then define the region upstream and
downstream of the TSS (Transcription Start Site) which you want to
extract. Then, choose the "Quality" and the "Support" levels.
The TSS "Quality" is a rating system (between 0 and 4) which describes
the composition
of the sequences that support this TSS (described in the Help-file).
The extraction of alternative promoters is in fact a great
feature allowing the user to select which of the mRNA sequences to
define as reference for the location of the TSS. The option "only
the one that is best supported and is 5' most" defines the TSS at
the position which is best supported by RefSeq, mRNAs and ESTs.
Otherwise, you may choose to extract only the promoter that starts
5' most (most aggressive extension). In the case of the presence of
ESTs containing "5'-upstream first exons" as compared to the RefSeq, a
totally different promoter may be extracted. The option "ignore all
extension info and return the immediate upstream region"
extracts the 5'-flanking genomic region relative to the supplied
accession number, meaning that also single ESTs can be defined as
reference point for the promoter definition.
As output, PromoSer first presents the
extracted sequences in the form of a table which is highly
instructive as it lists the exact genomic positions, chromosome number,
the quality level, the number of supporting sequences, and the "genomic
extension", which means the amount of genomic sequence added at 5'
(positive value) relative to the accession number provided. In case
that the promoter is extracted at a downstream (3') position, a
negative value is indicated. Finally, the promoter sequences can also
be displayed (copied) as a FASTA sequence file, and thereby be
transferred to other applications (like e.g. TOUCAN).
A very nice example to test the
different options is the human gene MMP26; just see what
happens
when you use the RefSeq NM_021801 or the EST accession BG189720, as
input along with the different options of alternative promoters. You
may directly use the extracted FASTA sequences for a BLAT
search at UCSC, quickly revealing the genomic position of the
individual sequences.
Tip!
BioMart: BioMart
is a data retrieval tool that generates lists of
biological objects (e.g. genes, SNPs) from data held in the Ensembl
(and other) databases. NOTE
that there are different web interfaces for BioMart, please
refer to the BioMart main section
for details. BioMart contains a 'query builder' interface to allow
users
to specify genomic regions, and refine the result set using filters.
BioMart can generate a number of different types of output,
including sequence and tabulated list data. Multiple
output formats, including HTML, text and Microsoft Excel, are
also supported.
In order to retrieve potential promoter sequences,
you may perform the following steps. At the Start Page you may
select "Ensembl Genes". Then, you
can provide your own list of e.g. Entrez Gene IDs, MIM IDs,
RefSeq IDs, Affymetrix ProbeSet IDs (!), and many more at the Filter
Page. Finally, at the
Output Page, you then choose the "Sequences Page", where
you have different options to select for 5' upstream regions (potential
proximal promoter regions).
Taken together, advantages of BioMart are
batch submission and nice query options. But there is no method
comparable to the versatility of e.g. PromoSer.
In addition, there is no support from promoter prediction programs
(which by the way is not a major drawback), and no integration of
curated data from the EPD (Eukaryotic Promoter Database).
Tip!
TOUCAN:
If you have installed this program package, then you can perform in-batch
extraction of promoter sequences using a list of diverse
identifiers (like LocusLink, Ensembl, RefSeq and many more), which
works similarly to the BioMart
sequence extraction (meaning the "conservative" way based on Ensembl
Genes), yielding one output (promoter) per gene. By the way, this
option therefore suffers from similar limitations as the ones mentioned
at the BioMart description. This tool is especially useful if you want
to make comparative analyses of transcription factor binding sites.
Anyway, you may also only download the sequences as FASTA-file. Note
that in version 2 (released Aug. 2004), you may download the
orthologous promoter regions from MULTIPLE species "in-batch" !
TRASER:
The Transcript Sequence Retriever of
Stanford University provides rapid (in a true sense, the
program is VERY EASY TO HANDLE !) retrieval of transcript and upstream
(putative promoter -containing) sequences for predicted human
genome mRNAs. The underlying database is built using the human genome
annotation files provided by the NCBI. The program accepts ONLY
LocusLink IDs as input but allows batch-submission !
You can choose the length of sequence to retrieve. Note
that the database is solely based on RefSeq sequences (no
ESTs included), but is able to retrieve more than one upstream region
for a gene in cases where several RefSeqs exist. NOTE that the
output sequences follow the UPPER/lower case model for EXON1/upstream
sequences. NOTE that there are 2 output formats, as
FASTA sequence file, or as tab-delimited text (making it possible to
e.g. paste the
sequences into an EXCEL sheet of pre-existing data !). Again,
there is no support from promoter prediction programs, and
no integration of curated data from the EPD.
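If you have saved such a FASTA output and want to separate the upstream part from exon 1 yourself, the UPPER/lower case convention can be exploited with a few lines of code. This is a minimal sketch assuming that the upstream sequence is written in lower case, followed by exon 1 in upper case; the function names and the example headers are hypothetical, and the exact TRASER header format is not reproduced.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def split_upstream_exon(seq):
    """Split at the first upper-case base: lower case = upstream, upper = exon 1."""
    for i, ch in enumerate(seq):
        if ch.isupper():
            return seq[:i], seq[i:]
    return seq, ""

def to_tab_delimited(text):
    """One line per record: header, upstream sequence, exon 1 sequence."""
    rows = []
    for header, seq in parse_fasta(text):
        upstream, exon = split_upstream_exon(seq)
        rows.append("\t".join((header, upstream, exon)))
    return "\n".join(rows)
```

The tab-delimited result can be pasted into a spreadsheet, with the upstream and exon 1 sequences ending up in separate columns.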
Chip2Promoter:
The module "Chip2Promoter" of the Genomatix
suite performs automatic in batch extraction of the
promoter sequences. For this purpose, different kinds of accession
numbers can be used as input, including Affymetrix ProbeSet
IDs (separated by spaces). The extracted promoter sequences can be
downloaded for further use. Currently,
Chip2Promoter is available for human, mouse, rat,
and Arabidopsis. A big advantage is the possibility to perform batch-retrievals.
Note that you only have 5 free runs (= 5 accession numbers) per
month, as registered academic
user. Genomatix has a "3 level rating system" of promoters (gold =
experimentally verified: promoter described in the Eukaryotic Promoter
Database (EPD) or promoter derived from mapping of full length cDNAs;
silver = supported by PromoterInspector
prediction; bronze = upstream region, 500 bp upstream and 100 bp
downstream of an annotated transcript). Note that the program
does NOT consider
EST sequences !
RSAT-Retrieve
Sequence: allows the automatic extraction of 5'-flanking
sequences (pot. promoters) for your genes of interest. You have to
choose the organism,
in the case of human there are 2 different databases "Homo sapiens" and
"Homo sapiens EnsEMBL". In test runs, there was no big difference in
the output between these two. The gene
names must be separated by carriage returns, because only the first
word of each line is considered as a query. Genes can be specified
either by the systematic ORF identifier or by a common name. Synonyms
are also supported. Note that the option "prevent
overlap with upstream ORFs" should be inactivated when working with
eukaryotes. "From To" describes the limits of the region to
retrieve. For upstream sequences, the default reference position is the
ORF
start* (and NOT the transcription start !). Negative
coordinates are used to indicate sequences located upstream of the start
codon; a reasonable pair of values could be: From -800 to -1. Note
that you might want to re-check the obtained
sequence via BLAT
search at UCSC. *Please
note that for genes which do NOT have the start ATG
in the first exon the correct promoter retrieval might be a problem
because in these cases the
tool will retrieve sequence from the first intron, and NOT the promoter
sequence !!! BUT NOW, the user can choose between different "Feature
types", like CDS (Coding Sequence), mRNA, tRNA, etc.
The advantage of using mRNA is that, if the mRNA is complete (which is
not always the case), the upstream regions are retrieved relative to
the
transcription start site (TSS), rather than the start codon!!! If you
want to see a nice example,
you can try to extract the upstream sequence (e.g. -1000 to -1) of the
gene "SELE" (E-Selectin), and compare the output when choosing "CDS"
versus "mRNA" as "feature type".
2. Single queries but automated sequence extraction:
Tip!
DBTSS (Database of
Transcriptional Start Sites) stores human sequences which were produced
by the oligo-capping
method to obtain full-length cDNAs. Sequence comparison between
DBTSS and reference sequence database, RefSeq,
revealed that 34.2 % of RefSeq sequences should be extended towards
the 5' ends.
DBTSS (2006) contains 1,359,000 clones corresponding to 19,753
human RefSeqs. After clustering (of splice variants), these data
correspond to 15,262 genes. For comparison, EPD (release 82)
contains promoters for 1,767 human genes. DBTSS data suggest that
approx. 55% of the human loci have two
promoters or more. Therefore, it is essential to address the topic
of Alternative Promoters (APs). DBTSS includes such predictions
of APs in locus-specific result views. In addition, mutually
homologous genes between human and
mouse were determined and their
promoters could be compared with each other. Using this information,
DBTSS
enables users to investigate what kind of sequence elements are
contained in
the promoters of their genes of interest and which of them are conserved
between human and mouse. Also, users can search for promoters
containing
putative binding sites of particular transcription factors (TFs).
Please refer to the section
DBTSS
- Search for TF Binding Site for details !
DBTSS offers versatile query options: RefSeq
ID, UniGene ID, EntrezGene ID, Gene Symbol, Ensembl Transcript ID, and
more. The output
consists of very instructive graphs showing the positions
of
RefSeqs and Ensembl-transcripts in relation to the positions of
individual Oligo-capped cDNAs. The user then can select "the favourite
reference position" for the TSS, either RefSeq, ENST, or the longest
Oligo-capped cDNA, and download the potential promoter region.
The only disadvantages are that there is NO batch query
option, and ESTs are not included. In
addition, not all of the human genes are supported by "oligo-capped
cDNAs".
FIE
(version 2.0) is another tool to retrieve the region upstream and/or
downstream of the 'start of exon 1' (Transcription Start Site, TSS)
for a particular gene. This user-specified region requires the LocusLink
ID or Gene/Protein Name and Organism Type as well as the
Upstream and Downstream
length with respect
to the 'start of
exon
1'. This reference position
is determined by the longest annotated mRNA (RefSeqs, which
also include un-characterized potentially full-length mRNAs like
'DKFZ', 'KIAA', or 'FLJ'). Note
that version 2.0
is considerably improved, as it lists all mRNA
sequences individually, so the user can decide which upstream
region to extract (which was not the case in version 1.1). "Ordinary"
ESTs are not considered. NO batch retrieval option. Currently
only available for human genes.
PRESTA
is a tool/database that combines EST databases and putative
GenBank/EMBL promoters to yield datasets of predicted promoters
at high accuracy. A high stringency BLAST-search reveals ESTs that
assist in transcription start-site verification. In principle, PRESTA
would therefore be useful for promoter verification by mapping EST 5'
ends. BUT: Limited query options (NO LocusLink IDs, NO RefSeq IDs
etc.), NO batch query, NO user-definition of region to extract, many
genes simply NOT included. Solely based on ESTs, RefSeqs are not
considered.
3. Single queries, "by hand" sequence extraction:
Of course, there is also the possibility to extract
promoter sequences "by hand", for this purpose I would recommend
the UCSC
Genome Browsers (Human, Mouse, Rat). The best way is to start
from the NCBI
Entrez Gene entry of your gene of interest, and click on the
UCSC-link at the top of the page. The position of the RefSeq sequence
is shown in the genome browser window. Move left or right (via the
"<" and ">" buttons) to see if the RefSeq or another mRNA or EST
has the longest 5'-sequence. Click on this sequence, write down
the "Start on chromosome" (if gene lies on the + strand) or the "End on
chromosome" (if gene lies on the minus strand). Go back to the browser
window and enter this number (minus e.g. 1000 or plus e.g. 1000,
respectively) in the
"position" field on top of the page, along with the according other end
of the sequence to extract. Then, hit the link
"DNA" on top of the browser window, to retrieve the sequence.
Don't forget to select "Reverse Complement" if the gene lies on the
minus strand. In addition, you may try the "Extended case/color
options".
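The coordinate arithmetic of the "by hand" procedure is easy to get wrong for genes on the minus strand, so here is a minimal sketch. It assumes 1-based, inclusive chromosome coordinates, with the TSS given as the position of the first transcribed base (i.e. "Start on chromosome" for + strand genes, "End on chromosome" for minus strand genes). The function names are hypothetical, and the genome browser's own coordinate conventions should be re-checked before relying on the numbers.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (case is preserved)."""
    return seq.translate(COMPLEMENT)[::-1]

def promoter_interval(tss, strand, upstream=800, downstream=100):
    """Return (start, end), 1-based inclusive, on the chromosome.

    The window holds `upstream` bases before the TSS and `downstream`
    bases beginning with the TSS itself.
    """
    if strand == "+":
        return tss - upstream, tss + downstream - 1
    if strand == "-":
        # On the minus strand, "upstream" means HIGHER chromosome coordinates.
        return tss - downstream + 1, tss + upstream
    raise ValueError("strand must be '+' or '-'")

def extract_promoter(chrom_seq, tss, strand, upstream=800, downstream=100):
    """Cut the window out of a chromosome string, oriented 5' to 3' of the gene."""
    start, end = promoter_interval(tss, strand, upstream, downstream)
    if start < 1 or end > len(chrom_seq):
        raise ValueError("window exceeds the available sequence")
    window = chrom_seq[start - 1:end]
    return window if strand == "+" else reverse_complement(window)
```

For a TSS at position 1000, the default window (-800 / +100) is 200 to 1099 on the + strand, but 901 to 1800 on the minus strand, where the extracted sequence additionally has to be reverse complemented.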

GEN7...quickly see the binding site profiles of individual
transcription factors ? (last update May 18, 2005)
In general, transcription factor binding sites
(TFBS) are defined by a consensus sequence, or, more
accurately, by a Position-Weight Matrix (PWM), representing the occurrence
of each individual nucleotide at each position. TFBS are usually short
stretches of sequence (mostly between 6 and 12 bases). Transcription
factors normally are part of protein families, that share a similar
DNA binding specificity. Therefore, it is a good start to know the
binding
profiles of the major TF families, but finally also the subtle
differences
between individual members.
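How a matrix and a consensus sequence are derived from a collection of binding sites can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the sites are assumed to be pre-aligned, of equal length and to contain only A/C/G/T, and databases like TRANSFAC or JASPAR apply their own curation and scoring schemes.

```python
def count_matrix(sites):
    """Count each nucleotide at each position of equally long, aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = [{b: 0 for b in "ACGT"} for _ in range(width)]
    for site in sites:
        for i, base in enumerate(site.upper()):
            matrix[i][base] += 1
    return matrix

def frequency_matrix(sites, pseudocount=0.0):
    """Convert counts to per-position nucleotide frequencies."""
    counts = count_matrix(sites)
    n = len(sites) + 4 * pseudocount
    return [{b: (col[b] + pseudocount) / n for b in "ACGT"} for col in counts]

def consensus(sites):
    """Most frequent nucleotide at each position (ties: first of A, C, G, T)."""
    return "".join(max("ACGT", key=lambda b: col[b]) for col in count_matrix(sites))
```

With the made-up sites TGACTCA, TGAGTCA, TGACTCA and TGACTAA, the consensus is TGACTCA, while the matrix additionally records that position 4 is C in 3 of 4 sites and G in 1 of 4. This is exactly the information that is lost when only a consensus sequence is used.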
1. Resources based on TRANSFAC database:
The cis-element information page is provided as part of the Gene Regulation Tools of
the Zlab, which
belongs to the Boston
University
Bioinformatics. The PWMs of the major classes of cis-regulatory
elements and a short description of the respective transcription
factors are listed.
TRANSFAC
is the most comprehensive database on eukaryotic
transcription factors, their genomic binding sites and DNA-binding
profiles. Please
note that you have to register at BIOBASE (free for
non-profit organizations) in
order to gain access to the individual transcription factor information
files. NOTE that
still, there is no access to the latest versions of TRANSFAC
which are only commercially available ! In order to Search
TRANSFAC for the binding matrix of individual TFs, you should
choose the Matrix-table, whereas if you want to gain information on a
specific Transcription Factor, search in the TF-factor table. Don't
forget
to set the "table field to search in" to "All Fields" if you are not
sure
which field might correspond to your search term.
Tip! The different TRANSFAC
databases are also
searchable via public
SRS (Sequence Retrieval System) servers (without need to
register !). Please also refer to other SRS
descriptions as in RET2 or RET5. There are 5 different libraries
related to TRANSFAC, all starting with "TF...". If you have a look at
this list, you realize that the content of the TFMATRIX databases
varies according to the different SRS servers, so choose one of these
databases.
Using the yellow "Search" button at the top right corner opens the
"Search" - form, where we can simply enter keywords like "MAF". A list
of hits in TFMATRIX is displayed which contain the word "MAF". A
TFMATRIX entry displays the consensus binding site of a transcription
factor as a matrix which often has been compiled from a series of
experiments (like binding assays). This matrix shows the probability
for each nucleotide to be present at each position of the sequence.
Also, links to the transcription factors themselves (accessions
starting with "T...") and individual binding sites ("R...") are
available.
By the way, this is a good example of why it is important to always look
closely at database entries. MAF_Q6 apparently has nothing to do with
e.g. MAF_01 and seems to describe a completely different protein. Note
that if you retrieve no hits in the first run, you may select
"*all
entries*" as search field, which will produce a list of ALL entries in
the database. You can display the complete list in one window by
adjusting the "Display Options" on the left side. You can then
scan this list for your factor / matrix of interest.
Alternatively,
you may query
TRANSFAC at TESS, but note that TESS uses older versions of the
TRANSFAC database. The link "Matrices"
allows a query against the TRANSFAC matrix table similar to the one
described above.
2. Resources based on JASPAR database:
Tip! JASPAR
is a collection of
transcription
factor DNA-binding preferences, modelled as position-specific scoring
matrices (PSSMs). The
main difference
from similar resources (TRANSFAC, TESS etc.) lies in non-redundancy
and quality: JASPAR is a smaller set that is non-redundant and
curated. All
profiles are derived from published collections of experimentally
defined transcription factor binding sites for multicellular
eukaryotes. The database represents a curated collection of target
sequences. The JASPAR
Help-section
describes in detail how the profiles are generated. NOTE: As
the access to TRANSFAC has been commercialized,
and only the "public" version (which has not been updated for some
years) is available for free, the open-access JASPAR database is a
highly valuable resource. You may quickly identify the individual
profiles by using the "Browse" or "Search" functions at
the JASPAR start page ! Please refer also to the JASPAR section at the main page !
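As an illustration of how a matrix profile of this kind can be used, the following Python sketch converts nucleotide counts into log-odds scores and slides the resulting matrix along a sequence. The count matrix, the uniform background of 0.25, the pseudocount of 1 and both function names are assumptions made for this example; they are not the procedure documented in the JASPAR Help-section.

```python
import math

def counts_to_pssm(counts, background=0.25, pseudocount=1.0):
    """Convert per-position nucleotide counts into log2-odds scores."""
    pssm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pssm.append({
            nt: math.log2(((column[nt] + pseudocount) / total) / background)
            for nt in "ACGT"
        })
    return pssm

def best_match(pssm, sequence):
    """Score every window of the sequence; return (score, offset) of the best one."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    scored = [
        (sum(pssm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Hypothetical count matrix for a 4 bp site with consensus TGAC
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 8},
    {"A": 0, "C": 0, "G": 8, "T": 0},
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
]
pssm = counts_to_pssm(counts)
score, offset = best_match(pssm, "AATGACGG")
print(f"best window starts at offset {offset} with score {score:.2f}")
```

The sketch only handles upper-case A, C, G and T on one strand; a real scan would also check the reverse complement and apply a score threshold.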
A concise page summarizing familial
binding models for major transcription factor classes is
provided
by the "alternative" JASPAR
website. A multi-colored "Logo"-representation is shown allowing a
quick
impression
of individual binding site profiles.
3. Resources based on TRANSFAC and JASPAR databases:
TELiS
is a very fast and very easy-to-use system to find transcription
factor binding motifs (TFBMs) that are over-represented
in promoters of differentially expressed genes. Two different TFBM databases
can be used: the public TRANSFAC
database version 3.2, or the open-access JASPAR database. Note that
in TELiS, too, the matrices of all TRANSFAC TFBMs
and JASPAR
TFBMs can be browsed one after the other, although this is less
convenient than using the other options listed above. Please refer also to the TELiS section at the main page !
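What "over-represented" means can be illustrated with a generic hypergeometric test, sketched below in Python. This is a textbook statistic chosen for illustration and is not necessarily what TELiS computes internally; the function name and all numbers in the example are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are drawn without replacement
    from `population` items, of which `successes` carry the motif."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total

# Hypothetical numbers: 20000 promoters in the genome, 2000 of them contain
# the motif; 15 of 50 differentially expressed genes contain it.
p_value = hypergeom_upper_tail(20000, 2000, 50, 15)
print(f"P(at least 15 of 50 by chance) = {p_value:.2e}")
```

A small value indicates that the motif occurs in the gene set more often than expected from its genome-wide prevalence.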
GEN8...detect regulatory elements in UTRs (UnTranslated
Regions) in a whole-genome approach ? -> see RNA1 !
GEN9...get the
promoter/protein sequences of all proteins homologous to my query
within a certain species ? ->
see RET8 !
GEN10...check how often a specific motif is present in a
randomly generated sequence set ? (last update Jun. 3, 2005)
Often, when analyzing a group of genes, like a
cluster from a microarray experiment, for the presence of
over-represented sequence patterns (Motif Discovery, GEN5), or when scanning this set for the presence of
one specific motif (Motif Matching, GEN4),
it is unclear whether the match count of a motif is really higher than
in a random sequence set of the same size. In order to
evaluate this, it is necessary to generate such random
sequence sets without bias towards certain criteria, a bias which
normally creeps into "manual random" selections.
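The comparison described above can be sketched in a few lines of Python: count the motif in the query set, generate many random sets of the same size, and see how often the random sets reach the same count. The function names, the equiprobable nucleotide model and the example sequence are assumptions for illustration; the tools described below offer more realistic background models.

```python
import random

def count_matches(sequences, motif):
    """Count (possibly overlapping) motif occurrences on the given strand."""
    total = 0
    for seq in sequences:
        start = seq.find(motif)
        while start != -1:
            total += 1
            start = seq.find(motif, start + 1)
    return total

def empirical_p_value(query, motif, n_sets=1000, seed=0):
    """Fraction of random sets with at least as many matches as the query set."""
    rng = random.Random(seed)
    observed = count_matches(query, motif)
    hits = 0
    for _ in range(n_sets):
        # same number of sequences and same lengths as the query set
        random_set = ["".join(rng.choice("ACGT") for _ in seq) for seq in query]
        if count_matches(random_set, motif) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_sets + 1)

# Hypothetical query set containing the motif three times
query_set = ["TGACTCATGACTCATGACTCA"]
observed, p = empirical_p_value(query_set, "TGACTCA", n_sets=200)
print(f"observed matches: {observed}, empirical p-value: {p:.4f}")
```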
Tip! Random Sequence,
a tool which is integrated in the RSAT portal of regulatory
sequence analysis, generates random DNA sequences
according to various probabilistic
models (Markov chains or independently distributed nucleotides). This
tool is very useful if you want to verify the significance of results
obtained by programs of Motif Discovery like Oligo-Analysis or programs
of Motif Matching like DNA-Pattern. You can easily generate a random
sequence set corresponding to your "query dataset", simply by selecting
the same sequence number and the same length. In addition, you may
choose between 3 different models: "Equiprobable
nucleotides" is the simplest model, where all nucleotides have the
same prior probability. In "Independent nucleotides with distinct
probabilities" a specific prior probability can be attached to
nucleotides (AT and CG
are grouped). This probability is constant over the sequence, i.e. each
nucleotide is generated independently of the preceding and succeeding
nucleotides. In "Markov chains (calibrated on intergenic
frequencies)", the random sequence has the same oligonucleotide
composition as
observed in the intergenic regions of the selected organism. This is
obtained by a Markov chain process, where nucleotide probabilities
vary at each position, depending on the preceding nucleotides. Note
that "oligonucleotide size" determines which expected
oligonucleotide calibration table has to be used. The Markov chain
order is this value minus one. For example, calibrating with
hexanucleotides (oligonucleotide length = 6) means
that the nucleotide at each position depends on the 5 preceding
nucleotides. This is thus a Markov chain of order 5. Calibrating
on single nucleotides (oligo length = 1) means that
each nucleotide is chosen independently of the preceding ones. This is
thus a Bernoulli model (or Markov chain of order 0).
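The relation between oligonucleotide length and Markov order can be sketched in Python as follows. This is a simplified illustration and not the RSAT implementation: the transition counts are calibrated on a hypothetical training string rather than on real intergenic frequency tables, and the function names are made up for this example.

```python
import random
from collections import Counter, defaultdict

def train_markov(training_seq, order):
    """Count which nucleotide follows each prefix of length `order`
    (this corresponds to oligonucleotide length = order + 1)."""
    transitions = defaultdict(Counter)
    for i in range(len(training_seq) - order):
        prefix = training_seq[i:i + order]
        transitions[prefix][training_seq[i + order]] += 1
    return transitions

def generate(transitions, order, length, rng):
    """Generate a sequence in which each nucleotide depends on the
    `order` preceding nucleotides."""
    seq = list(rng.choice(sorted(transitions)))  # start from an observed prefix
    while len(seq) < length:
        prefix = "".join(seq[len(seq) - order:])
        options = transitions.get(prefix)
        if not options:
            # prefix never seen in training: fall back to equiprobable nucleotides
            seq.append(rng.choice("ACGT"))
            continue
        nucleotides, weights = zip(*sorted(options.items()))
        seq.append(rng.choices(nucleotides, weights=weights)[0])
    return "".join(seq[:length])

# Hypothetical training sequence standing in for intergenic DNA
training = "ATGCGATATATTAGCGCGATATTTAAAGCGC" * 3
rng = random.Random(42)
model = train_markov(training, order=2)  # trinucleotide calibration
print(generate(model, order=2, length=40, rng=rng))
```

With `order=0` the prefix is empty and every nucleotide is drawn from the same single-nucleotide distribution, which is the Bernoulli case described above.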
NOTE: The output sequence list provides direct
links to follow-up procedures like Pattern discovery
and Pattern matching
!!! Thereby, the random sequence set can directly be scanned for the
presence of a specific pattern or for predicting "over-represented"
patterns.
Tip! Random Genes,
a tool which is integrated in the RSAT
portal of regulatory
sequence analysis, performs a random selection among the genes
of a selected
organism. The selection can be performed with or without
replacement (when this option is activated, a gene can appear several
times in the list). This program is useful for estimating the rate of
false positives
for pattern discovery programs.
The program can also generate several groups of random genes, which can
be used to simulate the results of clustering. The output is a
two-column text. The first column gives the gene
identifier (like "ENST00000248553" for Ensembl transcripts), the second
column the group identifier (useful when
several groups are exported). In addition, a link to Retrieve
Sequences
is provided, allowing you to extract e.g. 1 kb of upstream promoter
sequence for each gene. Note that you may select different labels,
like gene name, gene ID, both, or full identifier.
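The selection logic described for Random Genes can be mimicked with a short Python sketch that produces the same kind of two-column output. It is an illustration only, not the RSAT program: the gene identifiers are invented, the function name is hypothetical, and "without replacement" is assumed here to apply within each group.

```python
import random

def random_gene_groups(gene_ids, group_size, n_groups=1, replacement=False, seed=None):
    """Return (gene_id, group_id) rows for n_groups random gene selections."""
    rng = random.Random(seed)
    rows = []
    for g in range(1, n_groups + 1):
        if replacement:
            # a gene may appear several times in the same group
            picked = rng.choices(gene_ids, k=group_size)
        else:
            # each gene appears at most once per group
            picked = rng.sample(gene_ids, group_size)
        rows.extend((gene, f"group{g}") for gene in picked)
    return rows

# Hypothetical gene identifiers standing in for a real genome annotation
genes = [f"GENE{i:04d}" for i in range(1, 101)]
rows = random_gene_groups(genes, group_size=5, n_groups=3, seed=1)
for gene, group in rows:
    print(f"{gene}\t{group}")
```

Running a pattern discovery program on several such random groups gives an impression of how many "significant" patterns appear by chance alone.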
NOTE: The output sequence list provides direct
links to follow-up procedures like Pattern discovery
and Pattern matching
!!! Thereby, the random sequence set can directly be scanned for the
presence of a specific pattern or for predicting "over-represented"
patterns.