
-> RNA
-> RNA1...detect regulatory elements in UTRs (UnTranslated Regions) in a
whole-genome approach ? (last update Sep. 7, 2005)
-> RNA2...get a structural prediction for the 3'-UTR sequence of my RNA
of interest ? (last update Sep. 6, 2005)
-> RNA3...get detailed information about a regulatory microRNA
called miR-16 ? (last update Oct. 31, 2005)
-> RNA4...predict the potential targets of a microRNA like
miR-16 ? (last update Nov. 2, 2005)
-> RNA5...predict if a specific mRNA of interest may be the
target of microRNAs ? (last update Nov. 2, 2005)
RNA1...detect regulatory elements in UTRs (UnTranslated
Regions) in a whole-genome approach ? (last update Sep. 7, 2005)
In general, the untranslated regions of genes
(mRNAs) have not been investigated with the same enthusiasm as
other regions like promoter sequences. Likewise, the number and
size of databases storing information on regulatory elements in UTRs
are still quite low, and there is no direct analysis tool
allowing batch submission of thousands of sequences. Nevertheless, a
specific strategy can be developed to address this question. I will try
to describe the different approaches via the example of identifying
so-called AU-rich elements (ARE) in 3'-UTRs of human genes,
which are involved in accelerated mRNA decay, mediated by interaction
with specific ARE-recognizing proteins.
The UTResource
is a collection of internet resources for sequence analysis of 5' and
3' untranslated regions of eukaryotic mRNAs. Note that a registration
form has to be
completed to use these programs. This site contains links to the
following databases/programs.
UTRdb is a specialized, non-redundant collection of 5' and 3'
UTR sequences from eukaryotic mRNAs. UTRSite is a collection
of functional sequence patterns located in 5' or 3' UTR sequences. UTRScan
searches user-submitted query sequences for the patterns defined in the
UTRSite collection. UTRScan is designed to analyze single sequences and
does not allow batch submission, so it is not the appropriate tool to screen
whole genomes for certain elements. A second limitation is worth noting:
although the mRNA of the well-characterized gene
TNF-alpha is known to contain functional ARE sites in its 3'-UTR,
UTRScan does not detect them. Apparently the settings are quite
stringent, producing a hit only for a direct tandem repeat
of the sequence ATTTA and not allowing for short "nucleotide spacers".
REPFIND
is a program to find clustered, exact repeats in nucleotide
sequences. For each repeat cluster that it finds, it calculates a P-value,
which indicates the probability of finding such a concentration of
that particular repeat just by chance. REPFIND detects any kind of
clustered repeats, which makes it especially useful for regulatory signals
in 3'-UTR sequences of mRNAs, as these often consist of repeat clusters.
REPFIND nicely extracts the ARE sites within the
TNF-alpha 3'-UTR as the best-scoring hit, but only when the "Low Complexity
Filter" is turned off. You should therefore choose this option carefully
when looking for motifs like AU-rich elements, which resemble
"low complexity" sequence and would otherwise be masked out
(hidden) prior to the analysis. Like UTRScan, REPFIND only accepts single
sequences as input (no batch submission), meaning that it is not
suitable for a whole-genome approach. Please also refer to the REPFIND description at the main page.
Tip! A suitable strategy for
a whole-genome approach is already described in FAQ GEN4,
part C), and it is also applicable to this question. Briefly, as a first
step you can use BioMart
to extract the complete set of human 3'-UTR sequences. At the start
page, choose the human genome, and select "Ensembl Genes". At the
filter page, you may either deselect all boxes, meaning you will
retrieve all genes, or you may limit the output to genes that are at least
"a little characterized" by choosing "Genes with
LocusLink IDs" or "with RefSeq IDs". At the output page, choose the
"Sequences Page", where you can select 3'-UTR
regions only. Select "Text, fasta" as output format. After a few
minutes, you will receive a (long) txt-file of FASTA sequences, which
can be saved directly from your browser window and used for further
analyses.
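Before screening, it may be worth checking what the downloaded file actually contains. The following is a minimal Python sketch, not part of BioMart. It assumes that the export is a plain multi-FASTA file and that transcripts without an annotated 3'-UTR carry a placeholder text containing the word "unavailable" instead of a sequence; please verify this against your own file. The helper names read_fasta and usable_utrs are made up for this illustration.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}; sequences are upper-cased.

    If a header occurs twice, the later record replaces the earlier one.
    """
    records: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts).upper() for h, parts in records.items()}


def usable_utrs(records: Dict[str, str], min_length: int = 1) -> Dict[str, str]:
    """Keep only records that hold a real nucleotide sequence."""
    kept = {}
    for header, seq in records.items():
        if "UNAVAILABLE" in seq:  # assumed placeholder text, see lead-in
            continue
        if len(seq) < min_length:
            continue
        kept[header] = seq
    return kept


# Toy input standing in for the downloaded file (headers are invented).
demo = [">geneA|transcriptA", "ATTTA", "ggcc",
        ">geneB|transcriptB", "Sequence unavailable"]
records = read_fasta(demo)
print(len(records), "records,", len(usable_utrs(records)), "usable")

# With a real file (the file name is hypothetical):
# with open("utr3_human.fasta") as handle:
#     records = usable_utrs(read_fasta(handle))
```

Dropping empty or placeholder records first keeps them from being counted as sequences "without ARE" in the later screening step.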
The second step is performed using the RSAT program DNA-Pattern
(Strings). When looking
for ARE elements, we may define 3 patterns with increasing stringency.
ATTTAn{0,6}ATTTAn{0,6}ATTTA allows spacers of up to 6
nucleotides composed of any base. ATTTAw{0,6}ATTTAw{0,6}ATTTA allows
spacers of up to 6 nucleotides composed of only A or T.
ATTTAw{0,2}ATTTAw{0,2}ATTTA allows spacers of max. 2 nucleotides
composed of A or T. Using
"Browse" you select the "whole genome 3'-UTR - txt file" that you
created in the first step for screening with your defined patterns.
Note that when scanning large sequence sets, one might be
interested in counting the number of matches, rather than
returning their precise positions. This can be done by deselecting the
checkbox match positions and selecting match counts
instead, and by specifying a threshold. With a threshold, only
those sequences are returned that have more matches than the specified
value (RSAT describes this for promoters, e.g. those showing at
least 2 or 3 copies of a TF site; here the sequences are 3'-UTRs). As expected, the 3 patterns produce
a decreasing number of hits in the genome. Please note that in FAQ GEN4, part C) additional
programs for Motif
Matching are described in detail !
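For readers who prefer to run this screening step locally, the three patterns above can be sketched with Python regular expressions. This is an illustration only and not the RSAT implementation: it scans the given strand only (which seems appropriate for mRNA-derived UTR sequences), counts non-overlapping matches, and treats the threshold as "at least this many matches", which is an assumption. The names ARE_PATTERNS, compile_pattern, count_matches and screen, as well as the toy sequences, are invented for the example.

```python
import re
from typing import Dict, List

# The three patterns from the text (n = any base, w = A or T).
ARE_PATTERNS = {
    "any_spacer_0_6": "ATTTAn{0,6}ATTTAn{0,6}ATTTA",
    "AT_spacer_0_6": "ATTTAw{0,6}ATTTAw{0,6}ATTTA",
    "AT_spacer_0_2": "ATTTAw{0,2}ATTTAw{0,2}ATTTA",
}


def compile_pattern(pattern: str) -> "re.Pattern[str]":
    """Translate the lower-case codes n and w into regex character classes."""
    regex = pattern.replace("n", "[ACGT]").replace("w", "[AT]")
    return re.compile(regex, re.IGNORECASE)


def count_matches(regex: "re.Pattern[str]", sequence: str) -> int:
    """Count non-overlapping matches on the given strand."""
    return sum(1 for _ in regex.finditer(sequence))


def screen(records: Dict[str, str], pattern: str, threshold: int = 1) -> List[str]:
    """Return the IDs of all sequences with at least `threshold` matches."""
    regex = compile_pattern(pattern)
    return [seq_id for seq_id, seq in records.items()
            if count_matches(regex, seq) >= threshold]


# Toy sequences (made up) that differ only in the spacer between ATTTA copies.
toy = {
    "tandem": "GG" + "ATTTA" * 3 + "CC",
    "at_spacer": "ATTTA" + "TTTT" + "ATTTA" + "AAAA" + "ATTTA",
    "gc_spacer": "ATTTA" + "GC" + "ATTTA" + "GC" + "ATTTA",
    "none": "GGCCGGCCGGCC",
}

for name, pattern in ARE_PATTERNS.items():
    print(name, screen(toy, pattern))
```

On the toy input, the first pattern reports three sequences, the second two, and the third only the direct tandem repeat, mirroring the decreasing number of hits described above. The returned IDs can serve as the gene list for the annotation step.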
Finally, you can
feed your RSAT list of potential target genes back into BioMart (at the
"Filter" Page, "Limit to Genes with these IDs") in order to obtain an annotation
table that keeps all hyperlinks active in an Excel sheet
(at the "Output" Page, choose the "Features" you would like to see).
In this way,
you can select target genes which are also relevant in the specific
biological context.
Tip! If the search for
regulatory elements is limited to so-called AU-rich elements (ARE)
in 3'-UTRs of human genes, a specific database may be used called ARED, the AU-Rich Element
Database. ARED is maintained at the King Faisal Specialist Hospital
& Research Centre (KFSH&RC)
in Riyadh, Saudi Arabia. ARED contains GenBank entries where the 3'UTR
matches the ARE motif, a 13-bp pattern WWWUAUUUAUWW (W=A/U), which was
computationally derived from a list of functionally labile
ARE-containing mRNAs. ARED demonstrates that ARE-mRNAs
represent as much as 5-8% of human genes. However, ARED contains computationally
predicted ARE-mRNAs, and there is no evidence of how many of them are
actually regulated by this mechanism. AREs are known to be
recognized by specific proteins and/or
small regulatory RNAs which dramatically influence the stability of the
mRNA. Most of them are negative regulators (like ZFP36) which promote
mRNA decay, but positive regulators which stabilize the
target mRNA also exist (like HuR). Known examples of ARE-dependent regulation are
the mRNAs of TNFalpha, PTGS2 (COX2), CSF2 (GMCSF), and IL3. Accordingly,
several diseases, such as chronic inflammatory conditions, are
known to be caused by stabilized ARE-mRNAs.
There are already 3 different versions of
the ARED database. While v1 and v2 only support single queries
(gene names, IDs, mRNA acc., RefSeq, UniGene, etc.), v3 also
supports batch queries using e.g. a list of gene names from a
microarray experiment. NOTE: The list has to be pasted in column format
(as copied from Excel), not as space-delimited text! ARED will produce
a table which presents all genes with predicted AREs in their 3'UTRs
that are stored in the ARED database. Note
that the actual sequences are NOT shown, only the "Class" and
the "Cluster" of the respective AREs. Note: The result table may be
saved as a tab-delimited txt-file, which can easily be opened in Excel.
Note: When using the "Advanced search" option,
you may also browse the (long) lists of ARE-mRNAs by selecting
an ARE cluster and leaving all other fields empty. ARE-mRNAs are
clustered according to
the length of the individual AREs: Cluster 1 mRNAs contain 5 continuous
AREs, Cluster 2 mRNAs contain 4, and Cluster 5 mRNAs contain 1 ARE in a 13-bp ARE
context. So, in order to answer the
question of a "whole-genome" human approach, you may simply
select "ALL" ARE Clusters, and download the table of 2476 (ARED v3)
human mRNAs containing AREs.
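Since the batch query expects one identifier per line, a gene list that is only available as space- or comma-separated text has to be converted first. A small Python sketch of that conversion follows; the function name to_column_format is invented, and the gene symbols are the examples mentioned above.

```python
import re
from typing import List


def to_column_format(raw: str) -> str:
    """Turn a gene list separated by whitespace, commas or semicolons
    into one identifier per line, dropping duplicates but keeping order."""
    seen = set()
    ordered: List[str] = []
    for token in re.split(r"[\s,;]+", raw.strip()):
        if token and token not in seen:
            seen.add(token)
            ordered.append(token)
    return "\n".join(ordered)


# The result can be pasted into the batch query field.
print(to_column_format("PTGS2 CSF2, IL3 PTGS2"))
```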
The authors of ARED state that the database is
available as a single GenBank flat file (i.e. nucleotide sequence
with annotation) upon request.
RNA2...get a
structural prediction for the 3'-UTR sequence of my RNA of interest ?
(last update Sep. 6, 2005)
It is known that a growing list of genes is
regulated by influencing the mRNA stability via so-called AU-rich
elements (ARE) in 3'-UTRs (see also FAQ RNA1).
The consensus ARE was computationally derived as the 13bp-pattern
WWWUAUUUAUWW. It is also known that specific regulators which recognize
AREs, like the positive regulator HuR, bind to the ARE only
when the pattern is in a single-stranded conformation within the RNA
secondary structure, meaning that it is located in one of the "loops";
see Meisner et al., 2004 for details. It might therefore be interesting
to investigate other RNAs for the presence and structure of AREs within
their 3'-UTRs. For this purpose, a resource which predicts the secondary
structure of RNAs is needed. NOTE: Structure prediction of RNAs in
general is NOT a trivial task. In many cases, there will be many
"sub-optimal" structures which are only slightly less favorable than
the "best" structure. Results have to be treated with caution.
The Vienna RNA Package
was developed for the prediction and comparison of RNA secondary
structures at the Theoretical Biochemistry Group (TBI) of the University of
Vienna, Austria. The package is free software and can be downloaded as
C source code that should be easy to compile on almost any flavor of Unix
and Linux. Note: This package was developed for UNIX
command-line use; there are no graphical user interfaces. Nevertheless,
the Vienna RNA secondary
structure server offers access to the most popular features of
the Vienna RNA Package via easy-to-use web interfaces. RNAfold
is the web interface to the RNAfold program. This server predicts secondary
structures of single-stranded
RNA or DNA sequences. RNAfold thus provides the most basic and
most widely used function of the package. The output presents the predicted mfe
(minimum free energy)
structure both as a string in bracket notation and as links to the plots
generated for visualization. Plots are produced in PostScript format. A
suitable alternative is the newer Scalable Vector Graphics (SVG) standard,
for which the browser has to be equipped with an SVG plugin
(typically from Adobe).
Tip! A very easy-to-use and
straightforward approach would be the following: The UCSC Genome Bioinformatics site
provides pre-computed structures of 5'- and 3'-UTR regions of
all RNAs, which were produced using RNAfold. The estimated
folding energy is in kcal/mol. The more negative the
energy, the more secondary structure the RNA is likely to have. As
there are no stable URLs of individual gene entries in UCSC, you may
follow this example: Open the UCSC Gene Sorter
and search for the human gene PTGS2. You will retrieve a table where
you can access the specific gene entry via the link "Description" (last
column). There are several sections in this PTGS2-specific file, one of
them is "mRNA Secondary Structure of 3' and 5' UTRs". There are several
display formats for the predicted structures: "Picture"
produces a PDF file of the structure, for which you need a program
capable of displaying PDF files, like Adobe Acrobat. "PostScript" produces
a PS file of the structure, for which you need a program
capable of displaying PS files, like GSview. "Text" produces the
structure as a "string in bracket notation". Interestingly, you can see
that the "core motif" AUUUA is indeed present in some of the predicted
loop structures of this mRNA.
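The "Text" output can also be checked programmatically instead of by eye. The following Python sketch assumes that you have the sequence and the bracket-notation string of the same length, where "(" and ")" mark paired bases and "." marks unpaired ones. It reports every occurrence of the core motif whose bases are all unpaired. Note that "unpaired" here also includes unpaired stretches outside of any hairpin, so a hit is a candidate only. The function names and the toy inputs are invented for this illustration.

```python
from typing import List


def check_balanced(structure: str) -> None:
    """Raise ValueError if the bracket string is not well-formed."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure: ')' without '('")
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if depth != 0:
        raise ValueError("unbalanced structure: unclosed '('")


def unpaired_motif_starts(sequence: str, structure: str,
                          motif: str = "AUUUA") -> List[int]:
    """Return 0-based start positions where the motif occurs and
    every one of its bases is unpaired ('.') in the structure."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure differ in length")
    check_balanced(structure)
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input too
    hits = []
    start = seq.find(motif)
    while start != -1:
        if set(structure[start:start + len(motif)]) == {"."}:
            hits.append(start)
        start = seq.find(motif, start + 1)
    return hits


# Toy hairpin with the motif in the loop (not a real UTR).
print(unpaired_motif_starts("GGGGAUUUACCCC", "((((.....))))"))
# Toy hairpin with the motif buried in the stem.
print(unpaired_motif_starts("GAUUUAGCAAAGCUAAAUC", "((((((((...))))))))"))
```

Because sub-optimal structures may differ only slightly in energy from the best one, a motif that is unpaired in the single reported structure should be read as a hint rather than as proof of accessibility.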
RNA3...get detailed
information about a regulatory microRNA called miR-16 ?
(last update Oct. 31, 2005)
This is an example of how to retrieve data concerning a
non-protein coding microRNA which is known from the literature to be
involved in regulatory processes like the destabilisation of the RNA of
the inflammatory mediator COX2.
Rfam is a joint project involving researchers
based at the Wellcome Trust Sanger
Institute and Washington
University, St. Louis (which also provides an Rfam mirror site). Rfam is a large
collection of multiple
sequence alignments and covariance models covering many common non-coding
RNA families. For each family in Rfam you can
view and download multiple sequence alignments, read family annotation,
examine species distribution of family members, and follow links to
other databases.
In order to address this specific question, Rfam
provides a simple keyword
search that accepts any keyword, like "miR-16". You will
retrieve not only the sequences from different species, but also the
consensus secondary structure for family mir-16. In addition, you may
produce multiple sequence alignments and view literature references.
Tip! miRBase is the
new home for microRNA data, incorporating the database and gene naming roles
previously provided by the miRNA Registry, and including the new
miRBase Target database. miRBase contains 3 main sections; one of them
is miRBase Sequences, which contains all published miRNA sequences, genomic
locations and associated annotation. Each entry in
the miRBase Sequence database
represents a predicted hairpin portion of a miRNA transcript (termed
mir in the database), with information on the location and sequence of
the mature miRNA sequence (termed miR). Both hairpin and mature
sequences are available for searching using
BLAST and SSEARCH, and entries can also be retrieved by name, keyword,
references and annotation. All sequence and annotation data are also available
for download.
Note that when searching for "miR-16" here, a
longer list of molecules is presented, from various species, allowing a
batch retrieval of either mature or precursor sequences, in
multi-FASTA format or in ClustalW format. A single entry like the one
for miR-16 also provides both the stem-loop sequence and the mature one
as individual accession numbers in one common page.
Tip! microRNAs are also
displayed within the UCSC Genome Browser.
You may try the example "miR-16" as input in the query field. You can
see that miR-16-1 is located on chromosome 13, within an intron
of a protein-coding gene and in close proximity to another microRNA
called miR-15a. NOTE that it is essential not to use the name
miR16 (without "-"), as this query will retrieve a totally different
(protein-coding) RNA located on chromosome 16!
RNA4...predict the
potential targets of a microRNA like miR-16 ?
(last update Nov. 2, 2005)
Tip! miRBase
contains all published miRNA sequences, genomic
locations
and associated annotation. In addition, miRBase provides links to
databases which predict the potential targets of microRNAs.
Thus, the easiest way to address this question is to look for the database
entry of miR-16, and then jump to the referenced URLs of these
target databases:
miRBase
Targets, a part of miRBase, is a web
resource developed by the Enright Lab at the
Wellcome Trust Sanger Institute
containing computationally predicted targets for microRNAs
across many species. The miRNA sequences are obtained from the miRBase Sequence
database
and most genomic sequence from EnsEMBL.
This resource aims
to provide the most up-to-date and accurate predictions of miRNA
targets and hence this
resource will be updated regularly to incorporate new miRNAs or EnsEMBL
sequences.
The predicted
targets of miR-16 in miRBase comprise a list of more than 300
genes. Note that it is possible to rank this list
according to different values, like best P-value from all target sites
in a transcript (default), or total number of conserved target sites,
or number of conserved species for which a target site is found, or
number of different miRNAs predicted to hit this transcript. Each
target gene entry provides a very nice viewer (in HTML or in Java)
which displays the miRNAs along the potential target sequences and
which shows a multiple sequence alignment, making it possible to estimate very
quickly the evolutionary conservation of a specific target region. Note
that the miR-16 microRNA was shown to be involved in
the mRNA destabilization of inflammatory mediators like TNFalpha and
PTGS2 (COX2), as shown in Jing
et al., Cell 2005, but neither of these targets is listed
in this miRBase Target entry.
PicTar is an algorithm for
the identification of microRNA targets. This searchable website
provides details (3' UTR alignments with predicted sites, links to
various public databases etc) regarding microRNA target predictions in
vertebrates and microRNA
target predictions across seven Drosophila species. PicTar can be used BOTH for
predicting the targets of a
certain microRNA OR for predicting the microRNAs which may target a
specific mRNA of interest.
The
predicted targets for miR-16 comprise a list of over 700 human genes. A list of potential target
genes is ranked by a specific PicTar score, with links to RefSeq and to
the custom view of UCSC Genome Browser, displaying the PicTar miRNA
prediction sites.
TargetScan
is a portal at MIT storing several datasets of predictions of
microRNA targets, either targeting only the 3'-UTRs or also
targeting the ORF regions. TargetScan
can be used BOTH for
predicting the targets of a certain microRNA OR for predicting the
microRNAs which may target a specific mRNA of interest. The user may choose a microRNA
family (like "miR-15/16/195") in order to predict the targets of this
family. The output shows a list of potential target genes ranked by an
EFDR (estimated false discovery rate) score,
with links to the NCBI sequence database and to the UCSC Genome Browser. Note
that this list contains quite detailed summaries of the individual
genes' functions.
RNA5...predict if a
specific mRNA of interest may be the target of microRNAs ?
(last update Nov. 2, 2005)
Tip! miRBase
Targets, a part of miRBase, is a web
resource developed by the Enright Lab at the
Wellcome Trust Sanger Institute
containing computationally predicted targets for microRNAs
across many species. The miRNA sequences are obtained from the miRBase Sequence
database
and most genomic sequence from EnsEMBL.
This resource aims
to provide the most up-to-date and accurate predictions of miRNA
targets and hence this
resource will be updated regularly to incorporate new miRNAs or EnsEMBL
sequences.
In order to look for a specific target gene of
interest, like PTGS2 (COX2), simply enter this term at the "Search"
page, within the field "Gene name". A list of all species is presented
which contain information on the specific gene. All genes are
referenced via their Ensembl Transcript IDs, like ENST00000186982.
Each target gene entry provides a very nice viewer (in HTML or in Java)
which displays the miRNAs along the potential target sequences and
which shows a multiple sequence alignment, making it possible to estimate very
quickly the evolutionary conservation of a specific target region. Note
that the miR-16 microRNA was shown to be involved in
the mRNA destabilization of inflammatory mediators like PTGS2 (COX2),
as shown in Jing
et al., Cell 2005. However, this miRNA is not shown along the
PTGS2 mRNA; a different miRNA (mmu-miR-350) is shown instead.
PicTar is an algorithm for
the identification of microRNA targets. PicTar can be used BOTH for
predicting the targets of a
certain microRNA OR for predicting the microRNAs which may target a
specific mRNA of interest. The user may enter a certain gene ID (like
PTGS2) for which the potential matching microRNAs shall be predicted.
The output presents a multiple species alignment of the cDNA of the
chosen gene, highlighting the positions of individual predicted miRNA
sites.
TargetScan
is a portal at MIT storing several datasets of predictions of
microRNA targets, either targeting only the 3'-UTRs or also
targeting the ORF regions. TargetScan
can be used BOTH for
predicting the targets of a certain microRNA OR for predicting the
microRNAs which may target a specific mRNA of interest. The user may enter a certain human
EntrezGene ID (like 5743 for human PTGS2) for
which the potential matching microRNAs shall be predicted. NOTE:
In practice, a search for the gene name (PTGS2) was successful here, whereas a search
using "5743" was NOT! The output is a tabular list of microRNA
families matching the mRNA of interest, with links to Rfam and to the UCSC Genome
Browser.