-> PROTEINS
-> PROT1...know which domains and motifs can be found in my protein query sequence ? (last update May 15, 2006)
-> PROT2...know all proteins which contain a certain domain or motif in their sequence ? -> see RET1 !
-> PROT3...get a structural prediction for my protein of interest ? (last update Jul. 21, 2005)
-> PROT4...know which protein family my protein belongs to ? -> see GENOM2 !
-> PROT5...screen a batch of protein sequences for transmembrane regions ? (last update Mar. 30, 2006)
-> PROT6...predict the subcellular localization or retrieve experimental localization data of my protein of interest ? (last update May 15, 2006)
-> PROT7...know which protein domains are present / overrepresented in my gene set of interest ? (last update Jan. 24, 2006)
PROT1...know which domains and motifs can be found in my protein query sequence ? (last update May 15, 2006)
1. Domains and Motifs - integrated search:
Tip! A very good starting point is InterPro. InterPro is a valuable
tool that searches simultaneously in Pfam, PRINTS, ProDom, PROSITE,
SMART, SWISS-PROT, TIGRFAMs, PIRSF (PIR Superfamily), and Superfamily
for domains, families, repeats and short sequence motifs. To perform a
sequence search, click on the option InterProScan, and either enter (or
cut and paste) your protein sequence into the text box, or, if you have
the sequence in a file on your computer (e.g. in *.txt format), click
the 'Browse' button to upload it directly. Make sure to enter your
Email address, even if you would like to receive your results by
"interactive run". You will receive a table of hits in the different
databases and a graphical representation of their positions within the
query sequence. In PFAM, you will find alignments of all known members
sharing a certain protein domain. In SMART, you can produce alignments
in diverse formats, generate an alignment consensus sequence, group by
species, make a subcellular localization prediction, or even produce a
FASTA-formatted sequence file of the proteins of choice.
Tip! If you are focused on human proteins, you may use HPRD - Human
Protein Reference Database. You may "Query HPRD" using either a protein
name, gene symbol, or accession number (like Entrez Gene or
Swiss-Prot). The output will present - among a lot of other
information - a graphical image of the protein, showing the positions
of domains, motifs, transmembrane regions, signal peptides, and sites
of post-translational modification. NOTE: Always keep in mind that the
output will show only those features which have been manually annotated
in a certain context, as HPRD is strongly focused on manual annotation
from literature references !!! Please refer also to the HPRD section at
the Pathways page for information ! Please note that there should also
be a "BLAST HPRD" option, which was not functional at the time of
testing.
2. Domain search:
Please note
that by using one of the databases listed in InterPro
individually, you may be able to retrieve other / more distantly
related matches
by personally adjusting the sensitivity parameters ! If you are looking
for protein domains only, you may use one
of the following databases.
Tip! Pfam is
a large collection of multiple sequence alignments and hidden Markov
models covering many common protein domains. For each family in
Pfam you can: Look at multiple alignments,
view protein domain architectures, examine species distribution, follow
links to other databases, and view known protein structures.
In order to address this question, the user may
perform a Protein
name or sequence search. Note that only a UniProt
name or accession number is
accepted as input! If you do not know these IDs, first query UniProt or
use the "Sequence Search" option. A protein sequence in FASTA format
can be used as query. The output offers a multitude of follow-up
analysis tools; please refer to the Pfam main section for details !
SMART
(Simple Modular Architecture Research Tool) allows the identification
and annotation of genetically mobile domains and the analysis of domain
architectures. More than 500 domain families found in signalling,
extracellular and chromatin-associated proteins are detectable. These
domains are extensively annotated with respect to phyletic
distributions, functional class, tertiary structures and functionally
important residues.
Tip! CDD
(Conserved Domain Database) is another valuable resource,
provided by the NCBI. The CD-Search
service is a very user-friendly program to identify the
conserved domains present in a protein sequence. CDD can be searched
either by a query protein sequence or by keyword searches. CDD
currently contains domains derived from two popular collections, SMART
and PFAM, plus contributions from the NCBI. The source
databases also provide descriptions and links to citations. Since
conserved domains correspond to compact structural units, CDs contain
links to 3D-structure via Cn3D whenever possible.
CDART
determines the domain architecture of a protein sequence by
comparison to a database of conserved domain alignments, CDD,
using RPS-BLAST. It then compares the protein's domain architecture to
that of other proteins in NCBI's non-redundant sequence database, nr.
Related sequences are identified as those proteins which share one or
more similar domains. CDART displays these sequences using a graphical
summary showing the types and locations of domains
identified within each sequence, with links to the individual sequences
and to further information on their domain architectures. CDART
searches the domain databases SMART and PFAM. Note that the first
output can be very large. You can, however, restrict it to sequences
containing only the domains you are interested in by ticking the
checkboxes at the bottom of the results pages and pressing "Subset by
selected domains".
3. Motif Search:
3.1. Motifs - integrated search:
In general, predictions of short linear motifs in protein sequences
have to be taken with even more caution than those of large globular
domains, as represented in databases like Pfam and SMART. Historically,
there was one major database collecting short protein motifs, PROSITE.
In the meantime, other databases addressing this topic have emerged,
like ELM and Scansite.
There are several ways to scan your query sequence for PROSITE
patterns. The PROSITE homepage provides a simple query form, whereas
the program ScanPROSITE allows you to scan a protein sequence using
advanced options, or to search protein databases with a user-entered
pattern. You can also use the tool Motif Scan, which searches
simultaneously in PROSITE profiles, PROSITE patterns, and Pfam. It has
a nice multi-color output and a zoomable graphical display of the
matches, and significant matches are specially marked. PPSEARCH,
provided by the EBI, is yet another tool which scans a sequence against
the PROSITE protein profile database (and allows a graphical output).
Tip! ELM (Eukaryotic Linear Motif
Resource) has developed into the largest collection of linear
protein motifs, followed by PROSITE and Scansite. ELM is a resource
for predicting functional
sites in eukaryotic proteins. Putative functional sites
are identified by patterns (regular
expressions), which have a slightly different syntax than PROSITE
patterns. ELM is easy to query: you enter either a valid
SwissProt/TrEMBL ID or AC, or a protein sequence. You may also specify
the species and the cellular compartment, if known, and thereby
"activate" filters which are designed to reduce the number of false
positive hits. Please refer to the ELM chapter at the main page for
additional details.
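As a rough illustration of how PROSITE pattern syntax relates to
ordinary regular expressions, the following Python sketch translates a
PROSITE-style pattern into a regular expression and scans a sequence
with it. The function names and the test sequence are made up for this
example, and only the most common pattern elements are handled (x,
[..], {..}, repeat counts, and the < and > anchors). It is not the code
used by ScanPROSITE or ELM.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only the common elements: 'x' (any residue), [ABC] (one of),
    {ABC} (none of), repeat counts such as x(2) or x(2,4), and the
    '<' / '>' anchors for the N- and C-terminus.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Separate an optional repeat count, e.g. "x(2,4)" -> "x" and "2,4"
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

def scan(sequence, pattern):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]

# N-glycosylation site pattern from PROSITE; the sequence is invented.
print(prosite_to_regex("N-{P}-[ST]-{P}"))       # N[^P][ST][^P]
print(scan("MKNGSAANPTQ", "N-{P}-[ST]-{P}"))    # [(3, 'NGSA')]
```

Such short patterns match by chance very often, which is one reason why
the filters and the caution mentioned above matter.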
Scansite
(at MIT) is a database of motifs within proteins that are likely to be phosphorylated
by specific protein kinases or bind to domains such as SH2
domains, 14-3-3 domains or PDZ domains. The program MotifScan
can be queried by either a protein accession/ID or sequence. The
program then reports the percentile ranking of the candidate motif
with respect to all potential motifs in the proteins of a reference
database. So, the smaller the percentage value, the better the
identified hit. Note that you can choose between three stringency
levels (high, medium, low) ! A high stringency setting limits the
motif scanner to candidate motifs whose scores fall in the top 0.2% of
scores within the whole SWISS-PROT vertebrate database. Medium
stringency has a threshold of the top 1%, while low stringency has a
threshold of the top 5%. Database search using a Scansite motif: You
may also search databases (Swiss-Prot, TrEMBL, Ensembl) for proteins
bearing a certain motif. From the output list, you can directly
perform a MotifScan of the proteins of interest.
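The relation between percentile values and stringency levels can be
sketched in a few lines of Python. This is only an illustration of the
cut-offs quoted above; the function name is made up, and treating the
cut-offs as inclusive is an assumption, not documented Scansite
behaviour.

```python
# Percentile cut-offs quoted above (top 0.2 %, 1 % and 5 % of scores).
THRESHOLDS = [("high", 0.2), ("medium", 1.0), ("low", 5.0)]

def strictest_level(percentile):
    """Return the strictest stringency setting that would still report a
    hit with the given percentile (in percent), or None if even the low
    setting would hide it. Inclusive cut-offs are an assumption."""
    for level, cutoff in THRESHOLDS:
        if percentile <= cutoff:
            return level
    return None

for value in (0.15, 0.8, 3.2, 7.5):
    print(value, strictest_level(value))
```

A hit at 0.15% would thus be shown at all three settings, whereas a hit
at 3.2% would only appear in a low stringency scan.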
3.2. Motifs - specialized searches:
In addition to the "global" motif databases, there
are many sites that collect information and provide prediction
tools for individual motifs, like in the fields of post-translational
modifications, and protein targeting / localization. Please
refer to the respective chapters at the main page to see detailed
lists: Motifs 3 -
Modification and Motifs
4 - Localization (also refer to FAQ PROT6 !).
PROT2...know all proteins which contain a certain domain or motif in their sequence ? -> see RET1 !
This is actually a matter of "Sequence retrieval", so please refer to
RET1.
PROT3...get a structural prediction for my protein of interest ? (last update Jul. 21, 2005)
This question essentially covers programs
for secondary structure prediction of proteins, although the
borders between secondary and tertiary structure prediction are often
not very sharp (as you will see below in e.g. PredictProtein). There
are a few tools for this purpose, which are also summarized at the
appropriate
ExPASy linkpage for
Secondary Structure Prediction. Note that
additional information for 3D prediction
will be given in the section "3D
Structures". Also note that the prediction
of transmembrane regions is described individually in FAQ PROT5.
Tip! Jpred takes either a single protein sequence or a multiple
alignment of protein sequences, and predicts secondary structure
(helices, sheets, turns, coiled coils, transmembrane regions, solvent
accessibility...). It works by combining a number of modern, high
quality prediction methods to form a consensus (!). With a single
click, it runs a series of programs such as PHD, PREDATOR, NNSSP,
MULPRED, ZPRED, JNET, COILS, MULTICOIL, and PHDhtm (TM prediction).
In general, predictions work better for multiple alignments than for
single sequences. Therefore, single sequences are first used to create
automatic multiple alignments with the best hits in the non-redundant
database, and the prediction algorithms are then run on this
alignment. The output is quite "compact" and is presented in many
different formats (HTML, Postscript, Java). NOTE: Input sequences
should be in UPPER CASE letters, as some test runs using lower case
letters did not function properly !!!
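If your sequence file contains lower case letters, it can be converted
before submission. The following Python sketch (the function name is
made up for this example) upper-cases the sequence lines of FASTA text
and leaves the header lines starting with ">" untouched.

```python
def uppercase_fasta(text):
    """Upper-case the sequence lines of FASTA text; keep '>' headers as they are."""
    lines = []
    for line in text.splitlines():
        lines.append(line if line.startswith(">") else line.upper())
    return "\n".join(lines)

example = ">my_protein test sequence\nmktayiakqr\nqisfvkshfs"
print(uppercase_fasta(example))
```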
Tip! PredictProtein, provided by Columbia University, New York, is a
program which predicts the secondary structure of proteins (helices,
sheets, solvent accessibility, PROSITE motifs, low-complexity regions)
AND performs similarity searches to identify related sequences from
databases. Similar to Jpred, PredictProtein runs a series of programs
with a single click, such as PHDsec, PHDacc, PHDhtm, PHDtopology,
PHDthreader, MaxHom, and EvalSec. Note that even the primary result is
quite long, and it takes some time to extract the essential
information. The program also performs a series of "additional
features" like ProDom domain search, transmembrane region prediction,
and GLOBE prediction of globularity. In addition, there is an
"intermediate result page" which allows the submission of your sequence
via a single-page interface to a variety of other servers, either by
using the so-called META PP submission page or by manually choosing
individual sites. As an example, you can send your sequence to
SwissModel, in order to try a tertiary structure prediction by sequence
homology to known protein structures (homology modelling).
Alternatively, you may perform a 3D prediction by threading, using
Loopp and Superfamily. Note that Jpred is also available here.
Tip! META PP, also provided by Columbia University, New York, allows a
"one-button" submission of your sequence via a single-page interface to
a variety of servers, for the purpose of secondary and tertiary
structure prediction. The linked servers include SWISS-MODEL,
Superfamily, DAS, JPRED, PHD, PROF and more. You will receive
individual Emails containing the results of these predictions.
PROT4...know which protein family my protein belongs to ? -> see GENOM2 !
PROT5...screen a batch of protein sequences for transmembrane regions ? (last update Mar. 30, 2006)
There is a whole list of programs which deal with the prediction of
transmembrane (TM) regions in proteins, and some of them also perform
such an analysis "in-batch", meaning that you can analyze many protein
sequences at once. In general, if you have just a handful of
sequences, I would recommend using several programs and comparing the
results. If you want to use a "quick test sequence", you may enter
that of the erythrocyte anion exchanger, showing 12 TM helices: RefSeq
NP_000333.
A personal series of test runs yielded the following result: TMHMM
produced the best predictions (being quite "conservative"), followed by
SOSUI, whereas TMAP, TMPred, and TopPred are all less stringent and
proposed too many TM regions.
1. Programs supporting batch queries:
Tip! The
SOSUI
system, provided by the Tokyo Univ. of Agriculture and Technology, is a
tool for secondary structure prediction of membrane proteins
from a protein sequence. The basic idea of prediction
is based on the physicochemical properties of amino acid sequences such
as hydrophobicity and charges. The system deals with three
types of prediction: discrimination of membrane proteins from
soluble ones, prediction of existence of transmembrane helices and
determination of transmembrane helical regions. SOSUI has a very
nice graphical output which shows the hydropathy profile, the
"helical wheel representation", and the possible membrane topology. Note
that there is also an interface for in-batch
sequence submission, allowing the input of a multi-FASTA file. In
this case, there is no graphical output but a simple table listing the
number and positions of potential TM regions. Note that SOSUI
is also integrated in the "data super-integration tool" Bioinformatic Harvester of the
EBI.
Tip! TMHMM, provided by the CBS, Denmark, is a program for the
prediction of transmembrane helices in proteins, providing the option
to submit many proteins at once (!) in one FASTA file. Please limit
each submission to at most 4000 proteins, and note that you should
de-select the graphical output when submitting many sequences to speed
up processing. TMHMM produces a very nice graphical and tabular
output, and distinguishes the "inside" and "outside" regions flanking
each helix.
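If your dataset exceeds the 4000-protein limit mentioned above, the
multi-FASTA file has to be split before submission. The following
Python sketch shows one way to do this; the function names are made up
for this example, and it assumes plain FASTA text in which every record
starts with a ">" header line.

```python
def read_fasta(text):
    """Split multi-FASTA text into a list of (header, sequence) records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]

def chunk_records(records, max_per_file=4000):
    """Yield successive lists holding at most max_per_file records each."""
    for start in range(0, len(records), max_per_file):
        yield records[start:start + max_per_file]

def write_fasta(records):
    """Format records as FASTA text, one sequence line per record."""
    return "".join(f"{header}\n{seq}\n" for header, seq in records)

# Tiny demonstration with a limit of 2 instead of 4000.
demo = read_fasta(">a\nMKLV\n>b\nAAGG\n>c\nGGTT\n")
for number, chunk in enumerate(chunk_records(demo, max_per_file=2), start=1):
    print(f"--- batch {number} ---")
    print(write_fasta(chunk), end="")
```

Each batch can then be saved to its own file and submitted separately.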
TopPred2,
provided by Pasteur Inst. as EMBOSS tool, is a program for prediction
of transmembrane helices in proteins. It provides a lot of
control options like different hydrophobicity scales, and many output
formats, like membrane topology, hydropathy profile, lists of
hydrophobicity values, and more. Note that, although not explicitly
stated, TopPred2 can also be queried using a multiple sequence FASTA
file as input for in-batch analysis.
Tip! A quite different approach would be to use the BioMart data
retrieval system for this purpose. Please refer to the BioMart
description at the main page for detailed information. In this case,
you would need a list of accession numbers of your proteins (like
Entrez Gene, RefSeq or many others), and then filter the output list to
show only proteins with transmembrane regions (at the "Filter" page,
section "Protein"). Note that this procedure would not make a "de novo"
TM prediction but would be based on information stored in the database
for each protein. Ensembl uses TMHMM (see above) for the prediction of
TM regions.
2. Programs working only on single sequences:
TMPred,
provided by the Swiss EMBnet, is a program for prediction of transmembrane
helices in proteins and their orientation. TMPred works only on single
sequences, and provides
different possible models of TM count and orientations.
TMAP,
provided by the Karolinska Inst., Sweden, predicts transmembrane
helices from multiple sequence alignments or from single
sequences. The alignment should be in GCG format (.MSF). Note
the difference between multiple alignments (meaning that the sequences
have to be homologous and have to be pre-aligned), compared with
multiple single sequences
(see e.g. TMHMM) as input !!!
PROT6...predict the subcellular localization or retrieve experimental localization data of my protein of interest ? (last update May 15, 2006)
In principle, there are three different ways to approach this
question. The first one uses programs which aim to predict the
localization of a protein within the cell according to specific
sequence motifs and patterns. The second uses literature data mining
to retrieve localization information. The third approach employs
databases which store the results of large-scale protein localization
experiments as microscopy image files and / or tabular data.
1. Protein localization prediction:
The programs which are relevant here are described
at the Main Index within the section Motifs 4 -
Localization. You may want to refer to this section to try
additional resources. Note that the start page of PSORT.org (see below) is also a
general linkpage to resources for protein localization prediction.
Tip! PSORT is one of the best known programs for the analysis of
protein sorting signals and the prediction of subcellular
localization. PSORT takes an amino acid sequence and its source
origin as inputs. It then analyzes the input sequence by applying
stored rules for various sequence features of known protein sorting
signals. Finally, it reports the possibility for the input protein to
be localized at each candidate site, together with additional
information. PSORT.org provides links to the PSORT family of programs
for subcellular localization prediction. PSORT2 is the current version
of the "standard" PSORT program. PSORT only accepts single sequences
as input. Note that PSORT II comes with a highly instructive user
manual explaining the diverse predictions. WoLF PSORT is a recently
updated version of PSORT II for predictions on eukaryotic sequences.
Also note that PSORT provides a quite detailed output as compared to
e.g. LOCtarget. Note that PSORT II is also available at the Pasteur
Institute, and that PSORT2 is also integrated in the "data
super-integration tool" Bioinformatic Harvester of the EBI.
Tip! LOCtarget
is a database
of predicted subcellular localization for potential targets for
structural genomics from TargetDb. You may either search or browse
the LOCtarget database,
or you may submit your own FASTA protein sequence for
localization prediction. Subcellular localization is currently
predicted using four different methods: predictNLS (nuclear
localization signal), LOChom (using homology), LOCkey (using keywords)
and LOCtree (prediction based on hierarchical support vector machines).
The reported localization is taken from the method which predicts the
localization of a given protein with the highest confidence. Note: Up
to 100 protein sequences can be submitted at a time. If more than 10
sequences are submitted, the job is run in low priority mode.
2. Protein localization by literature data mining:
Tip! HPRD - Human Protein Reference Database represents a centralized
platform to visually depict and integrate information pertaining to
domain architecture, post-translational modifications, interaction
networks and disease association for each protein in the human
proteome. All the information in HPRD has been manually extracted from
the literature by expert biologists who read, interpret and analyze the
published data. Each protein entry in HPRD is composed of several
"tabs" which correspond to specific data. The "Summary" tab includes
data like subcellular localization, including a link to the
corresponding reference.
3. Protein localization and tissue expression databases:
The resources which are relevant here are described at the Main Index
within the Protein Localization Databases. Note that this part of the
FAQ is somewhat related to FAQ EXP20, which lists resources that store
RNA localization images based on in situ hybridization experiments.
Tip! GFP-cDNA is an ongoing project of localising novel GFP-tagged
human cDNA products to subcellular compartments of the eukaryotic
cell. This information provides an entry point for many other
downstream functional assays that are designed and implemented for the
subsets of new proteins localising to defined subcellular organelles.
Images of all localised proteins and their bioinformatic analysis can
be viewed via the ‘Results Table’ or ‘Results Images’ buttons. In
addition, a search window can be used to find proteins containing
features or motifs of particular interest to you that have been
localised in this project. Note that Protein Localization images are
also integrated in the data super-integration tool Bioinformatic
Harvester; please refer to the main section of this tool for details !
The names of GFP-cDNA entries are clone names, which mostly give no
hint about the nature of the proteins. If you want to extract the
complete list of all localized proteins via the Bioinformatic
Harvester, you may use the following "trick": enter "pepperkok" as
search term (derived from Rainer Pepperkok, one of the two project
heads, together with Jeremy Simpson). You may also perform combined
searches like "pepperkok golgi" or "pepperkok endoplasmic", and select
the checkbox "AND search".
Tip! HPR - the Human Protein Atlas, contains hundreds of thousands of
images of protein expression in normal human tissues and cancer cells.
Note that only tissue expression is shown, NOT subcellular
localization !
The Swedish Human Proteome Resource (HPR) program, funded by the Knut
and Alice Wallenberg Foundation, has been set up to allow the
systematic exploration of the human proteome with Affinity (Antibody)
Proteomics, combining high-throughput generation of affinity-purified
(mono-specific) antibodies with protein profiling using tissue arrays.
The basic concept of this resource centre is to produce specific
antibodies to human target proteins using a high-throughput method
that involves the cloning and expression of protein epitope signature
tags.
At the top of the page, you'll find information about HPR,
descriptions and annotations, as well as useful information on
image-usage policies. Available proteins (genes) can be reached
through a specific search (by gene/protein name/id or classification,
such as kinase or protease) or by browsing the individual chromosomes.
The data are presented as high-resolution images representing
immunohistochemically stained tissue sections.
The final goal is to produce datasets for all of the about 22,000
different proteins, one for each human gene. The vision, as indicated
on the Human Protein Atlas site, is “...to enable the systematic
generation of quality assured antibodies to all non-redundant human
proteins and to use these reagents to functionally explore human
proteins, protein variants and protein interactions.” An example (human
PTGS2) can be seen in section HPR IDs.
PROT7...know which protein domains
are present / overrepresented in my gene set of interest ? (last update Jan. 24, 2006)
This question is in a sense the "batch version" of FAQ PROT1. It refers
to larger datasets, like a cluster of genes from a microarray
experiment, where the user wants to extract the protein domains which
are involved. In addition, this FAQ lists programs which test for an
overrepresentation of protein domains as compared to a reference
gene/protein set. Therefore, this FAQ links to resources which are
described in FAQ section "Pathways, Interactions, Functions",
especially in FAQ PATH1.
1. Annotation without statistical overrepresentation:
BioMart, a powerful data retrieval tool, can also be used to extract
the protein domains for large gene/protein datasets. Please refer also
to the BioMart section at the main page for information. You may, for
example, paste a list of Entrez Gene IDs / Affymetrix IDs etc.
corresponding to the genes selected from a microarray experiment into
the field "ID list limit" at the "Filter" page of BioMart. Then, at the
"Features" page of the output, you may choose to selectively display
the protein domain data in the final result table, like InterPro ID,
InterPro short description, and PFAM ID. If you are interested in
protein motifs, you may select PROSITE ID instead. You can choose
between different output formats: Text, html, or MS EXCEL. Note: Each
gene is listed separately in the result file, together with all
associated protein domains, BUT there is NO examination of
"overrepresented" domains.
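A simple per-domain summary can be derived from such a result table
afterwards. The following Python sketch assumes that the "Text" output
is tab-delimited with one gene-domain pair per row; the column names
"Gene ID" and "PFAM ID" and all identifiers in the demonstration are
placeholders, so please adjust them to the header of your own export.
It only counts genes per domain and does not test for
overrepresentation.

```python
import csv
import io
from collections import defaultdict

def genes_per_domain(tsv_text, gene_col="Gene ID", domain_col="PFAM ID"):
    """Map each domain ID to the set of genes annotated with it.

    The column names are placeholders. Rows without a domain entry are
    skipped.
    """
    domains = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        domain = (row.get(domain_col) or "").strip()
        if domain:
            domains[domain].add(row[gene_col].strip())
    return dict(domains)

# Invented identifiers, for demonstration only.
table = "Gene ID\tPFAM ID\ng1\tPF_A\ng1\tPF_B\ng2\tPF_A\ng3\t\n"
summary = genes_per_domain(table)
for domain in sorted(summary, key=lambda d: len(summary[d]), reverse=True):
    print(domain, len(summary[domain]), sorted(summary[domain]))
```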
2. Annotation and statistical overrepresentation:
Tip! WebGestalt is a "WEB-based GEne SeT AnaLysis Toolkit". WebGestalt
incorporates information from different public resources and provides
an easy way for biologists to make sense out of large sets of genes. It
enables biologists to manipulate integrated information and find
patterns that are not detectable otherwise. WebGestalt is designed for
functional genomic, proteomic, and large scale genetic studies from
which high-throughput data are continuously produced. It currently
supports human and mouse. WebGestalt is free for academic use after
registration. NOTE: If you have already registered for GOTM, you can
use this login ! In general, save and download options are more
versatile than in e.g. DAVID. WebGestalt incorporates practically ALL
FIELDS of functional annotation: GO, Pathways, Co-citation Networks,
and even protein domain data and expression data !!! Taken together,
WebGestalt is an excellent resource for functional annotation of gene
datasets. Please refer to the WebGestalt main section for details !
Regarding this specific question, WebGestalt offers two different
approaches. After uploading and analyzing the gene dataset, the Gene
set information retrieval tool generates a user-defined annotation
table, somewhat similar to BioMart. In order to include protein domain
information, the attribute "Function Info -> Domain" has to be
selected. In contrast, the Protein Domain Table, which is part of the
Gene set organization tool, lists those protein domains which are
overrepresented in the query gene set as compared to a premade
reference file. If, for example, a gene set is derived from an
Affymetrix HG-U133A array experiment, the reference set
"WEBGESTALT_HG_U133A" should be used. In addition, the user may select
a significance level and may choose a minimum number of genes for the
enriched PFAM protein domains. PFAM protein domains with fewer genes
will not be reported as enriched. The default setting is 2.
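To give an idea of what such an overrepresentation analysis computes,
here is a minimal Python sketch. The text above does not state which
statistic WebGestalt applies; the hypergeometric test used here is a
common choice for this kind of question and is an assumption, as are
the function names, the default significance level and all numbers in
the demonstration. The gene set is assumed to be a subset of the
reference set.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k domain-carrying genes in a set of
    n genes drawn from a reference of N genes, K of which carry the domain."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def enriched_domains(set_counts, ref_counts, set_size, ref_size,
                     alpha=0.01, min_genes=2):
    """Return {domain: p-value} for domains that pass both filters:
    at least min_genes genes in the set and a p-value of at most alpha."""
    hits = {}
    for domain, k in set_counts.items():
        if k < min_genes:
            continue  # mirrors the "minimum number of genes" option
        p = hypergeom_pvalue(k, set_size, ref_counts[domain], ref_size)
        if p <= alpha:
            hits[domain] = p
    return hits

# Invented counts: 20 genes in the set, 1000 genes in the reference.
set_counts = {"PF_A": 5, "PF_B": 1, "PF_C": 2}
ref_counts = {"PF_A": 10, "PF_B": 1, "PF_C": 500}
print(enriched_domains(set_counts, ref_counts, set_size=20, ref_size=1000))
```

In this made-up example only PF_A is reported: PF_B is dropped by the
minimum gene filter, and PF_C is too common in the reference set to be
called overrepresented.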
NOTE: Obviously, the "Protein Domain Table" shows fewer hits in the
results file than the extraction of protein domains via the "Gene set
information retrieval tool", even when selecting "1" as "minimum number
of genes" ! Thus, if you want to get ALL protein domains, you should
use the latter tool.