| Proteins Linkpages | |
| NOTE: Databases and tools which provide integrated views and high-throughput retrieval of protein data are described in the "Data Integration" section ! | |
| Linkpage 1 - ExPASy (SIB) |
ExPASy
(Expert Protein Analysis System) is a
proteomics server provided by the SIB
(Swiss
Institute of Bioinformatics), dedicated to the
analysis of protein sequences and structures as well
as 2-D PAGE. The ExPASy Site Map
shows a graphical display which links to all important resources
maintained at ExPASy. Note: ExPASy can also be accessed via these ExPASy mirror sites. Tools are provided for topics such as similarity searches, pattern and profile searches, post-translational modification prediction, primary, secondary and tertiary structure prediction, transmembrane region detection, sequence alignment, and many more. |
| Linkpage 2 - Pattern search (Pasteur Institute) |
The Pasteur Institute provides a highly
recommended
linkpage for pattern
searches, including many EMBOSS
tools. |
| Linkpage 3 - Hidden Markov Models (SWBIC) |
This HMM linkpage
is provided by the SWBIC (Southwest Biotechnology
and Informatics Center), covering
databases/applications using HMMs, as well as introductory
papers in
this field. |
| Primary
Structure |
|
| Linkpage 1 - ExPASy (SIB) |
ExPASy (Expert Protein
Analysis System) is a
proteomics server provided by the SIB
(Swiss
Institute of Bioinformatics). Among many other links, it also provides a very good list
of
links concerning protein primary structure
(sequence) analysis,
including graphical tools. Please also have a look at the ExPASy main section. |
| 2ZIP (Max Planck Institute for Molecular Genetics) |
2ZIP is a
program
for predicting leucine zippers in proteins, provided by the Max Planck Institute for
Molecular Genetics in Berlin. 2ZIP combines a
standard coiled coil prediction algorithm with an approximate search
for the characteristic leucine repeat. No further information from
homologues is required for prediction. This approach improves
significantly over existing methods, especially in that the coiled coil
prediction turns out to be highly informative and avoids large numbers
of false positives. |
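The "characteristic leucine repeat" mentioned above can be illustrated with a very small pattern search. This is only a sketch under stated assumptions: it looks for four leucines spaced exactly seven residues apart, which is stricter than the approximate search the entry describes, and it does not reproduce 2ZIP's own algorithm or its coiled coil step. The function name `find_leucine_repeats` is hypothetical.

```python
import re

# Assumed pattern: four leucines at a spacing of seven residues
# (L-x6-L-x6-L-x6-L). The lookahead allows overlapping matches.
LEUCINE_REPEAT = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_repeats(sequence):
    """Return 1-based start positions of strict heptad leucine repeats."""
    seq = sequence.upper()
    return [match.start() + 1 for match in LEUCINE_REPEAT.finditer(seq)]
```

A hit from such a pattern alone says little; the entry above stresses that combining the repeat search with a coiled coil prediction is what keeps false positives down.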
| ANTIGENIC
(EMBOSS, Pasteur) |
Antigenic is an EMBOSS tool that finds antigenic sites in proteins. |
| PEPSTATS (EMBOSS, Pasteur) |
Pepstats
outputs
a report of simple protein sequence statistics, such as molecular
weight, number of residues, charge, isoelectric
point, the number, molar percent and DayhoffStat of each amino acid type,
molar extinction coefficient
(A280), and more.
|
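To make the kind of report described above more concrete, the following sketch computes three of the listed values: number of residues, molar percent per amino acid, and an average molecular weight. It is not the Pepstats implementation. The function name `simple_stats` is hypothetical, and the table holds commonly quoted average residue masses, which may differ slightly from the values a given tool uses.

```python
from collections import Counter

# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free chain termini


def simple_stats(sequence):
    """Return residue count, average molecular weight and molar percent."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    unknown = set(counts) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    weight = sum(RESIDUE_MASS[aa] * n for aa, n in counts.items()) + WATER
    molar_percent = {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}
    return {
        "residues": len(seq),
        "molecular_weight": weight,
        "molar_percent": molar_percent,
    }
```

Calling `simple_stats("GGAG")` returns a dictionary whose `molar_percent` entry maps `G` to 75.0 and `A` to 25.0.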
| pI/MW (ExPASy) |
The pI/MW
tool, provided by ExPASy, performs a determination
of the
isoelectric point and the molecular weight of protein sequences. |
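How an isoelectric point can be estimated from a sequence is sketched below: the net charge is computed with the Henderson-Hasselbalch relation and the pH at which it crosses zero is found by bisection. This is an illustration, not the ExPASy implementation. The pKa values are assumptions (published sets differ, and so do the resulting pI values), and the function names are hypothetical.

```python
# Assumed pKa values for ionisable groups; other published sets exist.
POSITIVE_PKA = {"nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    seq = sequence.upper()
    charge = 0.0
    for group, pka in POSITIVE_PKA.items():
        count = 1 if group == "nterm" else seq.count(group)
        charge += count / (1.0 + 10 ** (ph - pka))
    for group, pka in NEGATIVE_PKA.items():
        count = 1 if group == "cterm" else seq.count(group)
        charge -= count / (1.0 + 10 ** (pka - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisect between pH 0 and 14 for the pH where the net charge is zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0
```

Bisection works here because the net charge only decreases as the pH rises, so there is exactly one zero crossing in the interval.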
| SAPS (SIB) |
SAPS,
provided by the SIB, performs statistical
analysis of protein sequences (amino acid composition, charge
distribution, hydrophobic segments, cysteine spacings, repetitive
structures, etc.). Note: SAPS is also available as an EMBOSS tool. |
| Secondary Structure | |
| Linkpage 1 - ExPASy (SIB) |
ExPASy (Expert Protein
Analysis System) is a
proteomics server provided by the SIB
(Swiss
Institute of Bioinformatics). Among many other links, it also provides
a very good list of
links concerning protein secondary structure
prediction. Please also have a look at the ExPASy
main section. |
| HelixTurnHelix (EMBOSS, Pasteur) |
Helixturnhelix
uses the
method of Dodd and Egan and finds helix-turn-helix nucleic acid
binding motifs in proteins, such as those found in many transcription
factors. The helix-turn-helix motif was originally identified as the DNA-binding domain of phage repressors. One alpha-helix lies in the wide groove of DNA; the other lies at an angle across DNA. |
| JPRED2 (EBI) |
Jpred
takes either a single protein sequence or a multiple
alignment of
protein sequences, and predicts secondary structure (helices,
sheets, turns, coiled coils, transmembrane regions, solvent
accessibility...). It works by combining a number of modern,
high quality prediction methods to form a consensus (!). Jpred
is provided by the Barton
Group at Dundee, which also offers a FAQs
page and a "Hints
and Tips" site. With a single button click it runs a series of programs, including PHD, PREDATOR, NNSSP, MULPRED, ZPRED, JNET, COILS, MULTICOIL, and PHDhtm (TM prediction). NOTE: Predictions work better for multiple alignments than for single sequences. Single sequences are therefore first used to build automatic multiple alignments with the best hits in the non-redundant database, and the prediction algorithms are then run on this alignment. NOTE: Input sequences should be in UPPER CASE letters, as some test runs using lower case letters did not work properly! |
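Because of the upper case note above, it can be convenient to normalise a FASTA file before pasting it into the form. A minimal sketch, assuming plain FASTA text with ">" header lines; the function name `uppercase_fasta` is hypothetical.

```python
def uppercase_fasta(text):
    """Upper-case sequence lines of FASTA text, leaving header lines as-is."""
    lines = []
    for line in text.splitlines():
        # Header lines start with ">" and may legitimately contain lower case.
        lines.append(line if line.startswith(">") else line.upper())
    return "\n".join(lines)
```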
| PepCoil (EMBOSS, Pasteur) |
PepCoil
predicts coiled
coil regions in proteins. Coiled coils are
formed by two or three alpha helices in parallel
and in register that cross at an angle of approximately 20 degrees, are
strongly amphipathic and display a pattern of hydrophilic and
hydrophobic
residues that is repeated every seven residues. The seven positions
of the heptad repeat are designated a through g, a and d being
generally
hydrophobic, while the others are hydrophilic. The parallel
two-stranded alpha-helical
coiled coil is the most frequently encountered subunit-oligomerization
motif in proteins. |
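The heptad repeat described above can be made concrete with a few lines of code that label residues a through g and check how often positions a and d are hydrophobic. This is not the PepCoil algorithm, only an illustration of the repeat. The set of residues counted as hydrophobic and both function names are assumptions.

```python
HEPTAD = "abcdefg"
HYDROPHOBIC = set("AILMFVWY")  # assumed definition of "hydrophobic"


def heptad_register(sequence, start="a"):
    """Label every residue with its heptad position, beginning at `start`."""
    offset = HEPTAD.index(start)
    return "".join(HEPTAD[(offset + i) % 7] for i in range(len(sequence)))


def hydrophobic_core_fraction(sequence, start="a"):
    """Fraction of a and d positions occupied by hydrophobic residues."""
    register = heptad_register(sequence, start)
    core = [aa for aa, pos in zip(sequence.upper(), register) if pos in "ad"]
    if not core:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in core) / len(core)
```

For the idealised repeat `"LKELEDK" * 3`, read in register a, every a and d position holds a leucine, so the fraction is 1.0.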
| PredictProtein and META PP (Columbia University, New York) |
PredictProtein
is a program to predict secondary
structure of proteins (helices, sheets, solvent accessibility,
PROSITE motifs, low-complexity regions, AND similarity searches
to identify related sequences from databases). Similar to Jpred, PredictProtein runs a series of programs with a single button click, including PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, and EvalSec. META PP allows a "one-button" submission of your sequence via a single-page interface to a variety of servers, for the purpose of secondary and tertiary structure prediction. The linked servers include SWISS-MODEL, Superfamily, DAS, JPRED, PHD, PROF and more. You will receive individual emails containing the results of these predictions. |
| SOSUI (Tokyo University of Agriculture and Technology) |
The SOSUI
system
is a tool for secondary structure prediction of membrane proteins
from a protein sequence. The basic idea of prediction is based on the
physicochemical properties of amino acid sequences such as hydrophobicity
and charges. The system deals with three types of prediction:
discrimination of membrane proteins from soluble ones, prediction of
existence of transmembrane helices and determination of transmembrane
helical regions. For a good example of a SOSUI sequence query, enter the sequence of the erythrocyte anion exchanger, which shows 12 TM helices: RefSeq NP_000333. SOSUI produces a graphical output showing the hydropathy profile, the "helical wheel representation", and the possible membrane topology. There is also an interface for batch sequence submission, which accepts a multi-FASTA file. In this case there is no graphical output, only a simple table listing the number and positions of potential TM regions. NOTE: SOSUI is also integrated in the "data super-integration tool" Bioinformatic Harvester of the EBI. |
| TMAP (Karolinska Institute, Sweden) |
TMAP predicts transmembrane
helices from multiple
sequence alignments
or from single
sequences. The alignment should be in GCG format
(.MSF). NOTE the difference between multiple alignments and multiple single sequences !!! |
| TMHMM (CBS, Denmark) |
TMHMM is a
program for prediction of transmembrane helices in proteins.
NOTE: You can submit many proteins at once (!!!) in one
fasta file. Please limit each submission to at most 4000 proteins.
TMHMM produces a very nice graphical and tabular output, and
discriminates between "inside" and "outside" helices. |
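To stay within the limit of 4000 proteins per submission, a large multi-FASTA file can be split into smaller files first. A minimal sketch, assuming plain FASTA text in which every record starts with a ">" line; the function name `split_fasta` is hypothetical.

```python
def split_fasta(text, max_records=4000):
    """Split multi-FASTA text into chunks of at most `max_records` records."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            # A new header closes the record collected so far.
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return [
        "\n".join(records[i:i + max_records]) + "\n"
        for i in range(0, len(records), max_records)
    ]
```

Each returned string can be saved as its own file and submitted separately.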
| TMPred (EMBnet.ch) |
TMPred
is a
program for prediction of transmembrane helices in proteins and
their orientation. TMPred works only on single
sequences, and provides different possible models of TM count and
orientations. |
| TopPred2 (EMBOSS, Pasteur) |
TopPred2
is a
program for prediction of transmembrane helices in proteins. It
provides a lot of control options like different hydrophobicity scales,
and many output formats, like membrane topology, hydropathy profile,
lists of hydrophobicity values, and more. Note that, although not explicitly stated, TopPred2 can also be queried with a multi-sequence FASTA file as input for batch analysis. |
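A hydropathy profile of the kind mentioned above is a sliding-window average over a hydrophobicity scale. The sketch below uses the Kyte-Doolittle scale, one widely used scale that is not necessarily the default of TopPred2. The window of 19 residues is an assumption often used when looking for membrane-spanning segments, and the function name `hydropathy_profile` is hypothetical.

```python
# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy for every full window along the sequence."""
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        return []
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]
```

Stretches where the averaged value stays clearly positive over roughly twenty residues are candidates for transmembrane helices; where exactly to place the cut-off depends on the scale and the tool.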
| Domains, Families | |
| BLOCKS
1 - BLOCKS server and Logos
from mult. seq.
alignments or blocks (FHCRC) |
The BLOCKS server
is a service for biological sequence analysis at the Fred Hutchinson Cancer Research Center
in Seattle, Washington, USA. There is also a Blocks mirror site at the Weizmann Institute of Science in Israel. A sequence logo is a graphical representation of aligned sequences where at each position the size of each residue is proportional to its frequency in that position and the total height of all the residues in the position is proportional to the conservation (information content) of the position. Please note that there is a whole section about Logos, and a detailed description about Logos from BLOCKS. |
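The "conservation (information content)" that sets the total stack height in a logo can be computed per alignment column as the maximum possible entropy minus the observed entropy. The sketch below assumes a 20-letter protein alphabet, ignores gap characters, and omits the small-sample correction that logo programs may apply. The function name `column_information` is hypothetical.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20):
    """Information content in bits for each column of an alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    information = []
    for column in zip(*alignment):
        residues = [c for c in column if c not in "-."]
        if not residues:
            information.append(0.0)
            continue
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in Counter(residues).values()
        )
        information.append(math.log2(alphabet_size) - entropy)
    return information
```

Within a column, the height of each residue letter is then its frequency multiplied by the column's information content, which matches the description above.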
| BLOCKS 2 - Trees from mult. seq. alignments or blocks (FHCRC) |
Trees
from Blocks means that a set of blocks can also be used
to construct and display a neighbor-joining tree for examination of
possible subfamily relationships. Because blocks represent the most
highly conserved regions of proteins, misaligned regions are avoided,
and so trees from blocks should be of high quality. ! NOTE the
difference to Trees from global alignments like those included with the
Biology Workbench !
The procedure is the same as the one
described for Logos (e.g. starting with Block
Maker and a set of input FASTA sequences), except that the link
"tree" on the Blocks output
has to be used. |
| BLOCKS 3 - 3D Blocks (FHCRC) |
3D
Blocks
is a search and
display tool which allows you to view blocks on a protein structure. It
uses a block (either stored or generated by Block Maker) as the query
sequence for a MAST
search of the PDB database. It
then determines which regions on the high-scoring protein sequences
correspond to the blocks, and creates output files to display its
results. Again, a direct link to 3D Blocks is available at the Blocks
output generated by Block
Maker. |
| BLOCKS 4 - Block Searcher (FHCRC) |
As an aid to
detection and
verification of protein sequence homology, the Block
Searcher
compares a protein or DNA sequence to the current database of protein
blocks. Blocks are short multiply aligned ungapped segments
corresponding to the most highly conserved regions of proteins.
Typically, a group of proteins has more than one region in common and
their relationship is represented as a series of blocks separated by
unaligned regions. If a second block for a group also scores highly in
the search, the evidence that the sequence is related to the group is
strengthened. It is advisable to search both Blocks+ and Prints. Blocks+ contains automatically generated blocks, while Prints contains hand-crafted blocks. !!! SEE ALSO THE REMARK IN THE BLOCKS 5 SECTION !!! |
| BLOCKS 5 - IMPALA (FHCRC) |
IMPALA
(Integrating Matrix
Profiles And Local Alignments) searches a protein query sequence
against a multiple alignment database represented as a collection of
PSI-BLAST checkpoint files. IMPALA has been implemented on the Blocks
Server to search a blocks database, such as Blocks+.
Although the Blocks Searcher performs a similar type of search, there are differences between IMPALA and the Blocks Searcher in the PSSMs used, in the alignments reported, and in the calculation of statistics that can lead to somewhat different results. Therefore, any marginal similarity detected with one searching program should be confirmed using the other. Both programs generally detect true positive hits but they tend to report different false positives, and so any hit not detected by both searching programs should be regarded with caution (!!!). Whereas the Block Searcher scores individual blocks separately and then combines the scores for blocks in a family, IMPALA scores the set of blocks for a family as a whole so a hit is for the whole family, not for an individual block. Since IMPALA scores not only the blocks but also regions between them, its alignments may extend beyond the blocks. The resulting BLAST-like output gives scores and expected values. |
| BLOCKS 6 - RPS-BLAST (FHCRC) |
Reversed
Position Specific
Blast is one of the BLAST series of searching programs
from
NCBI. RPS-Blast uses the query sequence to search a database of
pre-calculated PSSMs (in this case PSI-BLAST checkpoint files made from
multiple alignments of protein families) and report significant hits in
a single pass. The role of the PSSM has changed from "query" to
"subject", hence the term "reverse" in RPS-Blast. Here the checkpoint
files are made from the Blocks and Prints alignments,
and are the same files searched with the similar IMPALA
searching program. Position specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile (or position specific scoring matrix, PSSM) is constructed (automatically) from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" used to refine the profile. This iterative searching strategy results in increased sensitivity. |
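The idea of position-specific scores can be illustrated with a simple log-odds matrix built from an ungapped alignment. This is a teaching sketch, not the procedure used by PSI-BLAST: the uniform background frequency, the flat pseudocount and both function names are assumptions, and sequence weighting is left out entirely.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

With the alignment `["ACD", "ACE", "ACD", "AMD"]`, the fully conserved first column gives a clearly positive score for A and negative scores for residues never observed there.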
| BLOCKS 7 - Block Maker and Blocks Multiple Alignment Processor (FHCRC) |
Block
Maker finds conserved
blocks in a group of two or
more unaligned protein sequences, which are assumed to be related,
using two different algorithms. Blocks are short multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If you already have a multiple alignment (e.g. in ClustalW format), please use the Blocks Multiple Alignment Processor instead. At least two related protein sequences (and a maximum of 250) must be provided to make blocks. Each sequence must have a unique name of 10 characters or fewer. If you have the accession numbers of some sequences you would like to use, NCBI Entrez or similar programs can create a file for you in FASTA format. NOTE: The main difference from programs like ClustalW is that Block Maker does not perform a global alignment covering the complete sequences but selects the sequence blocks that are best conserved. NOTE: Block Maker output allows a series of follow-up analyses, such as the display of phylogenetic trees or sequence logos (see individual sections). |
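The input rules stated above (2 to 250 sequences, unique names of at most 10 characters) can be checked before submission. A minimal sketch, assuming that the sequence name is the first whitespace-delimited token of each FASTA header line; the function name `check_block_maker_input` is hypothetical.

```python
from collections import Counter


def check_block_maker_input(fasta_text, min_seqs=2, max_seqs=250, max_name_len=10):
    """Return a list of problems; an empty list means the rules are met."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            names.append(parts[0] if parts else "")
    problems = []
    if not min_seqs <= len(names) <= max_seqs:
        problems.append(
            f"expected {min_seqs}-{max_seqs} sequences, found {len(names)}"
        )
    for name in names:
        if not name or len(name) > max_name_len:
            problems.append(f"name '{name}' must have 1-{max_name_len} characters")
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"name '{name}' is used {count} times")
    return problems
```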
| BLOCKS 8 - LAMA (FHCRC) |
LAMA
(Local Alignment of
Multiple
Alignments) is a program for comparing protein multiple sequence
alignments with each other. The program can search databases of
such multiple alignments. The search is for sequence similarities
between conserved regions of protein families. The method is sensitive,
detecting weak sequence relationships between protein families.
Sequence similarities beyond the range of conventional sequence
database searches can be detected by the method. LAMA compares multiple sequence alignments of proteins. If you have only a single protein sequence you first need to find other members of its family. The protein sequences also need to be multiply aligned in BLOCKS format. |
| BLOCKS 9 - COBBLER (FHCRC) |
COBBLER
means COnsensus
Biasing
By Locally Embedding Residues. A single sequence is selected from a set of blocks and enriched by replacing the conserved regions delineated by the blocks with consensus residues derived from the blocks. In comprehensive tests, embedding consensus residues improved performance with readily available single-sequence query searching programs such as BLAST and FASTA; it is especially useful in PSI-BLAST searches! Block Maker makes a COBBLER sequence automatically, but this page can be used if you want to embed your blocks in a different sequence. |
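The embedding step described above amounts to overwriting the block regions of one sequence with consensus residues. The sketch below shows only that replacement. How COBBLER selects the sequence and derives the consensus is not covered, and the function name `embed_consensus` and the (start, consensus) input format are assumptions.

```python
def embed_consensus(sequence, blocks):
    """Overwrite block regions of `sequence` with consensus residues.

    `blocks` is a list of (start, consensus) pairs with 0-based starts.
    """
    residues = list(sequence)
    for start, consensus in blocks:
        end = start + len(consensus)
        if start < 0 or end > len(residues):
            raise ValueError(f"block at {start} does not fit the sequence")
        # Same-length slice replacement keeps all other positions unchanged.
        residues[start:end] = consensus
    return "".join(residues)
```

For example, `embed_consensus("MKTAYIAKQR", [(2, "LLL"), (7, "GG")])` returns `"MKLLLIAGGR"`; the residues outside the two blocks are untouched.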
| BLOCKS 10
- CODEHOP (FHCRC) |
CODEHOP
designs PCR
primers
from protein multiple-sequence alignments. The program is intended
for cases where the protein sequences are distant from each
other and degenerate primers are needed. The multiple-sequence alignments should be in the Blocks Database format, such as in Block Maker output. The Block Maker output contains links that send the resulting blocks directly to the CODEHOP page. The results of the CODEHOP program are suggested degenerate DNA primer sequences that you can use for PCR. You then have to choose appropriate primer pairs, have them synthesized, and perform the PCR. |
| BLOCKS 11 - SIFT (FHCRC) |
SIFT is a
sequence
homology-based
tool that sorts intolerant from tolerant amino acid substitutions
and predicts whether an amino acid substitution in a protein will have a
phenotypic effect. SIFT is based on the premise that protein
evolution is correlated with protein function. Positions important for
function should be conserved in an alignment of the protein family,
whereas unimportant positions should appear diverse in an
alignment. In addition, if you have mutant proteins with single amino acid substitutions, SIFT will predict which mutants may have a phenotypic effect before you carry out your functional assays. NOTE: SIFT is most efficient if you use multiple sequence alignments as input (or BLOCKS, generated by the Blocks Multiple Alignment Processor) ! NOTE: Your query sequence is the sequence that you would like prediction on (i.e. the sequence in which you have introduced the amino acid substitutions). The query sequence should be first in the alignment. The alignment must correspond to the length of your query sequence (i.e. no gaps in your query sequence in the alignment). Partial sequences should be flanked by Xes at the beginning and end of the sequence so that those positions are not considered gaps ! |
| CDART - Conserved Domain Architecture
Retrieval Tool (NCBI) |
CDART
determines the domain
architecture of a protein sequence by comparison to a database of
conserved domain alignments, CDD,
using RPS-BLAST. It then compares the protein's domain architecture to
that of other proteins in NCBI's non-redundant sequence database, nr.
Related sequences are identified as those proteins which share one or
more similar domains. CDART displays these sequences using a graphical
summary showing the types and locations of domains
identified within each sequence, with links to the individual sequences
and to further information on their domain architectures. CDART searches the domain databases SMART and PFAM. NOTE: The initial output can be very large. You can, however, restrict it to sequences containing only the domains you are interested in by ticking the checkboxes at the bottom of the results pages and pressing "Subset by selected domains". |
| CDD - Conserved Domain Database (NCBI) | The CD-Search
service is
a very user-friendly program to identify the conserved
domains present in a protein sequence. CDD can be searched either
by a query protein sequence or by keyword searches. CDD currently
contains domains derived from two popular collections, Smart
and Pfam, plus contributions from the NCBI. The source
databases also provide descriptions and links to citations. Since
conserved domains correspond to compact structural units, CDs contain
links to 3D-structure via Cn3D whenever possible.
NOTE: Which hits are reliable? Hits with an E-value lower than 0.01! NOTE: If you run CD-Search not from the main page but from the query page, you can specify parameters such as E-value or search mode, which may significantly alter your output! NOTE: CD-Search is now run by default in parallel with protein BLAST searches. UPDATE: CD-Search was extended to also screen for similarities to known COG and KOG clusters of orthologous proteins. Please refer to the COG / KOG chapter for details. |
| FPS - Family
Pairwise Search (SDSC) |
Family
Pairwise search (FPS),
provided by the San Diego Supercomputer
Center,
allows you to search a library of protein families with a
protein sequence, GCG profile or BLAST checkpoint file that you
provide. You can choose between the following protein family libraries: SCOP, PROSITE, PFAM. E-values let you easily determine the significance of a hit. The hits are directly linked to the SCOP database. |
| InterPro (EBI) |
InterPro is
an integrated documentation resource of protein families,
domains, and functional sites. InterPro was created to integrate
the major protein signature databases: Pfam, PRINTS, ProDom,
PROSITE,
SMART, TIGRFAMs, PIRSF (PIR Superfamily), Superfamily, CATH, and
PANTHER. Signatures are manually integrated into InterPro
entries that are curated to provide biological and functional
information. Annotation is provided in an abstract, Gene
Ontology mapping, and links to specialized databases. InterPro covers
over 78% of all proteins in the Swiss-Prot and TrEMBL components of
UniProt.
1. Search InterPro:
1.1. Text Search:
- Search entries: InterPro is searched for name, abstract, method name, method accession and InterPro entry accession.
- Find protein matches: InterPro's proteins are searched for protein ID (UniProt) short name and protein (UniProt) accession. The result of the search will be the overview match pages for the proteins which match the query.
- IPR: Go to the InterPro entry with the given accession number. Accession numbers are of the form IPRxxxxxx where the x's are digits.
1.2. InterProScan (Sequence Search): This form allows you to query your protein or DNA sequence against InterPro. The user can select the applications to run on the query sequence.
2. InterPro output:
2.1. General: Each InterPro entry is described by one or more signatures, and corresponds to a biologically meaningful family, domain, repeat, or site. A site may be a post-translational modification (PTM) site, an active site or a binding site. InterPro entries are annotated with a name, an abstract, mapping to Gene Ontology (GO) terms, and links to specialized databases. InterPro groups all protein sequences matching related signatures into entries. All hits of the protein signatures in InterPro against a composite of Swiss-Prot and TrEMBL components of UniProt are precomputed.
2.2. Protein match view: For each protein signature, a list of proteins in UniProt that it matches is precomputed. The match lists may be viewed in different formats. Where structures are available, links to the corresponding PDB entries are shown. In addition, the graphical view indicates where the protein signatures correspond to structural chains. Protein structures can be visualized using the AstexViewer Java applet.
2.3. Domain architecture (IDA) viewer: a graphical representation of protein domain architecture, shown as a series of non-overlapping domains. Note that clicking on the count of proteins retrieves all proteins sharing a common architecture.
2.4. Taxonomy viewer: provides an overview of the taxonomic range of the sequences associated with each InterPro entry.
2.5. 3D structural information: Mapping between UniProt and PDB entries can be many-to-many, so the "Structure" link displays all the PDB entries associated with that particular protein. The user is able to see the residue-by-residue mapping between UniProt and a PDB chain of interest.
Typical InterPro accessions: refer to section InterPro IDs. |
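The accession format given above, IPRxxxxxx with digits for the x's, can be checked with a regular expression before a text search. The sketch assumes exactly six digits, read off the six x's in the description; the function name `is_interpro_accession` is hypothetical.

```python
import re

# "IPR" followed by six digits, as in the form IPRxxxxxx described above.
IPR_PATTERN = re.compile(r"IPR\d{6}")


def is_interpro_accession(text):
    """Return True if `text` looks like an InterPro entry accession."""
    return IPR_PATTERN.fullmatch(text.strip()) is not None
```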
| Pfam (Sanger) including: iPfam PfamAlyzer (CGB, Karolinska) |
Pfam is a
large collection
of multiple
sequence alignments and hidden Markov models covering many common protein
domains and families. For each family in Pfam you can: look at multiple alignments, view protein domain architectures, examine species distribution, follow links to other databases, and view known protein structures. Pfam can be used to view the domain organisation of proteins. Notice that a single protein can belong to several Pfam families. 75% (May 2006) of protein sequences have at least one match to Pfam. This number is called the sequence coverage.
1. Pfam database organization: Pfam is a database of two parts. Pfam-A is the curated part of Pfam, containing over 7973 (May 2006) protein families. To give Pfam a more comprehensive coverage of known proteins, a supplement called Pfam-B is automatically generated. This contains a large number of small families taken from the PRODOM database that do not overlap with Pfam-A. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found.
2. Search Pfam:
2.1. Protein name or sequence search:
- Note that only a UniProt name or accession number is accepted as input! If you do not know these IDs, first query UniProt or use the "Sequence Search" option.
- A protein sequence in FASTA format can be used as query.
- Note: To do large scale sequence searching against Pfam, you can upload a TEXT file (not Word) in FASTA format. The results are emailed "within the next few days".
2.2. Keyword Search: This page allows you to query a variety of databases (Pfam, PROSITE, UniProt) for keywords. You can enter multiple words separated by whitespace into the box and these will be implicitly joined with a logical AND.
2.3. DNA sequence search: This form allows you to compare your DNA sequence against the whole of Pfam using the Wise2 software package.
2.4. Domain query: The domain query is a simple way to find proteins with certain combinations of domains. For example you can get all proteins with a CBS domain and IMPDH domain with the query 'CBS and IMPDH'.
Or if you want to find IMPDH proteins without a CBS domain use the query 'IMPDH not CBS'.
2.5. Taxonomy query: This is an easy way to restrict the output to Pfam entries specific for certain species.
3. Browse Pfam:
3.1. Browse by Pfam family ID: Note that if the family you are interested in does not appear here, then try searching Pfam for it. The name of the family may have changed or you may know the family by a different name than Pfam.
3.2. Browse by Genomes:
- Compare Genomes: Pfam families for 2 or more species can be compared by clicking the 'Compare genomes' select box beside the species then clicking the 'Compare selected genomes' button at the top or bottom of the page.
- Single species: More detailed information about the Pfam families in a species can be obtained by clicking on the species name.
3.3. Browse by Pfam clans: A clan contains two or more Pfam families that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship is usually determined by similar tertiary structures, or when structures are not available, by common sequence motifs. Clans have been introduced as some protein families are very divergent, thereby making it very difficult to represent the family with a single HMM. These families are closely related, so sequences may significantly hit more than one member of the clan. Note: All of the clan information and a list of the Pfam families that are members of the clan is contained in Pfam-C, an additional release flatfile.
3.4. Browse by Interaction (iPfam): iPfam is a resource that describes domain-domain interactions that are observed in PDB entries. The domains are defined by Pfam. When two or more domains occur in a single structure, the domains are analysed to see if they form an interaction. If the domains are close enough to form an interaction, the bonds forming the interaction are calculated. More information on how the bonds are calculated can be found in the help section.
The interaction information is re-calculated at each Pfam release, so as Pfam changes, the information within iPfam is kept up to date. You can access the information in iPfam from each domain family page, or you can browse by domain interaction. The browse page also allows a search by domain name or accession.
4. Pfam Domain or Family Output: A Pfam domain or family entry typically provides the basic description of the entry and several options for further data analysis.
- Pfam Clan: related Pfam families are grouped into clans ("superfamilies") (see below for details).
- Description of the entry including PubMed references.
- Gene Ontology information via QuickGO.
- Alignment: several formats to display family alignments and HMM logos.
- Domain organisation: view other proteins sharing the same domain organisation (meaning the same composition of domains in the same order along the protein sequence).
- Species distribution: view alignments and domain organisation by species. The species tree can be displayed using several depth values (sizes). Values in brackets represent the number of proteins containing the domain in the respective families.
- Phylogenetic Tree: The trees are generated using Quicktree.
- 3D Structures: links to related 3D structures from PDB with the possibility to view these structures.
5. Pfam Clan Output:
- A Pfam clan summary page contains the description, annotation and membership of the clan. From this page, several data are retrievable:
- Family relationship diagram: Relationships between families are represented by solid lines (significant profile-profile comparison score) or dashed lines (non-significant profile-profile comparison score). Beside each line, the profile-profile comparison E-value score is indicated. This score is directly linked to a visualization of the profile-profile comparison alignment.
- Clan multiple sequence alignment: contains all of the clan members' seed alignments. The alignments are colored using Jalview.
6. Special Pfam Tools:
6.1. PfamAlyzer: PfamAlyzer, which is only available at the Karolinska mirror of Pfam, is a graphical, user-friendly interface to Pfam. It contains most of the Pfam functionalities and features an extended domain query based on an intuitive graphical query language. PfamAlyzer adds taxonomic analysis functionality to the domain query.
Selected Pfam mirror servers:
- St. Louis (USA)
- Karolinska Institutet (Sweden)
Typical Pfam accessions: refer to section Pfam IDs. |
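The logic behind domain queries such as 'CBS and IMPDH' or 'IMPDH not CBS' can be sketched with set operations on a local table of domain assignments. This does not call Pfam or parse its query syntax. The function name `domain_query`, the protein names and the example table are hypothetical.

```python
# Hypothetical domain assignments, for illustration only.
EXAMPLE_PROTEINS = {
    "protA": ["CBS", "IMPDH"],
    "protB": ["IMPDH"],
    "protC": ["CBS"],
}


def domain_query(proteins, required=(), excluded=()):
    """Names of proteins with all `required` and none of the `excluded` domains."""
    required, excluded = set(required), set(excluded)
    return sorted(
        name
        for name, domains in proteins.items()
        if required <= set(domains) and not excluded & set(domains)
    )
```

With the example table, `domain_query(EXAMPLE_PROTEINS, required=["CBS", "IMPDH"])` corresponds to 'CBS and IMPDH' and returns `["protA"]`, while `domain_query(EXAMPLE_PROTEINS, required=["IMPDH"], excluded=["CBS"])` corresponds to 'IMPDH not CBS' and returns `["protB"]`.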
| PRINTS (Manchester University) including: SPRINT |
PRINTS
is a compendium of protein fingerprints, provided by
Manchester University.
A fingerprint
is a group of conserved motifs used to characterise a protein family;
its diagnostic power is refined by iterative scanning of a SWISS-PROT /
TrEMBL composite. Usually the motifs do not overlap, but are separated
along a sequence, though they may be contiguous in 3D-space. Note that PRINTS is somewhat similar to BLOCKS, as it also creates blocks of conserved / related protein sequences. You can access PRINTS in various ways, e.g. by accession number, sequence, text, author, and more. If you want to screen your sequence for PRINTS motifs, you may use FingerPRINTScan. Note that PRINTS is also integrated in InterPro, but if you also want to see "non-significant" matches, you have to search directly at PRINTS! SPRINT (Search PRINTS-S) provides an interface to the PRINTS-S database. PRINTS-S is the relational cousin of the PRINTS data bank of protein family fingerprints. For the search options to work effectively, JavaScript must be enabled. Typical PRINTS accessions: refer to section PRINTS IDs. |
| SCOP (MRC, Cambridge, UK) |
SCOP (Structural
Classification
of Proteins), provided by the Medical
Research Council, Cambridge, aims to provide a detailed and
comprehensive
description of the structural and evolutionary relationships
between all proteins whose structure is known. As such, it provides a
broad survey of all known protein folds, detailed information about the
close relatives of any particular protein, and a framework for future
research and classification. The SCOP database is organised as a tree structure. Entering at the top of the hierarchy the user can navigate through the levels of Class, Fold, Superfamily, Family and Species to the leaves of the tree which are structural domains of individual PDB entries. SCOP contains the domains of all PDB entries available at the time of the current release's construction. For each of these entries a coordinate file is available and can be displayed via the various graphical interfaces, like Chime or RasMol. NOTE: SCOP is more or less designed like a big encyclopedia. For protein family searches using a query sequence, please refer to other programs like FPS or Superfamily ! |
| SMART (EMBL) |
SMART (a Simple
Modular
Architecture Research Tool) allows the identification and annotation of
genetically mobile domains and the analysis of domain architectures.
More than 500 domain families found in signalling,
extracellular and chromatin-associated proteins are detectable. These
domains are extensively annotated with respect to phyletic
distributions, functional class, tertiary structures and functionally
important residues. 1. SMART database organization: The basic data of SMART are high-quality, manually derived alignments of protein domain families. Converted into hidden Markov models, these alignments allow protein domains to be identified in sequence databases. These results are stored in the SMART database. The source databases are Swiss-Prot, SP-TrEMBL, and Ensembl. 2. Query SMART: 2.1. "Normal Mode": You may use either the Swiss-Prot/SP-TrEMBL sequence identifier/accession number (SwissProt, SwissProtNew, SpTrembl, SpTremblNew and Ensembl) or the protein sequence itself (COPY/PASTE) to query the SMART service. Alternatively, you can search for proteins with combinations of specific domains in different species or taxonomic ranges (like "TyrKc AND SH3 AND NOT SH2"). 2.2. "Genomic Mode": The main difference from "normal mode" is the underlying protein database. In genomic mode, only the proteins from 170 completely sequenced genomes are included. This database has minimal redundancy and is therefore particularly useful for whole-genome studies of domain architectures or single domain distributions. 3. SMART output: 3.1. Domain architecture: SMART also offers very nice tools to explore other proteins with similar domain composition. On the output page, a graphical image showing the identified domains is displayed. In addition, two links are provided, namely "Display all proteins with similar domain organisation" and "Display all proteins with similar domain composition" (note the difference !). A taxonomic tree is shown, where you can easily select those species and those proteins that you want to investigate further. You may either display the domain architecture or get a FASTA-formatted sequence file of these entries. 3.2. Catalytic amino acids: Protein sequences can be scanned for the presence of important catalytic amino acids, as essential catalytic sites are annotated for all enzymatic domains in SMART. 3.3. 
Protein interaction data: these are imported from the STRING database, in which known and predicted protein-protein associations are integrated from a variety of sources. The interactions include physical binding interactions as well as functional associations. 3.4. Prediction of protein disorder: DisEMBL predictions of intrinsic protein disorder are included in SMART's analysis methods. NOTE: The SMART database is also integrated in other search routines like the NCBI-CDD Search and the InterPro search. Typical SMART accessions: refer to section SMART IDs. |
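Note: The logic behind a combined domain query like "TyrKc AND SH3 AND NOT SH2" can be sketched in a few lines of Python. This is only an illustration of the Boolean filter, not SMART's own implementation; the function names and the toy protein names are hypothetical, and real domain assignments would have to be taken from a SMART search.

```python
def match_architecture(domains, required, excluded=()):
    """True if all required domains are present and no excluded domain is."""
    domains = set(domains)
    return set(required) <= domains and domains.isdisjoint(excluded)


def query_proteins(proteins, required, excluded=()):
    """Return the sorted names of all proteins matching the domain query."""
    return sorted(
        name
        for name, domains in proteins.items()
        if match_architecture(domains, required, excluded)
    )


# Toy data (hypothetical protein names, domain names as used by SMART).
toy_proteins = {
    "protA": ["SH3", "TyrKc"],
    "protB": ["SH3", "SH2", "TyrKc"],
    "protC": ["SH3"],
}

# Equivalent of "TyrKc AND SH3 AND NOT SH2".
hits = query_proteins(toy_proteins, required=["TyrKc", "SH3"], excluded=["SH2"])
print(hits)
```

Here only "protA" is reported, because "protB" carries the excluded SH2 domain and "protC" lacks TyrKc.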
| Superfamily (MRC, Cambridge, UK) |
The
purpose of Superfamily
is to
provide structural (and hence implied functional) assignments to
protein sequences at the superfamily level. This server does not
attempt (at present) to distinguish between families within
superfamilies, but is able to detect the broader and more distant
relationships at the superfamily level. A superfamily contains all
proteins for which there is structural evidence of a common
evolutionary ancestor. The server can be entered in three ways: 1. Begin with a sequence (search the library). 2. Begin with a superfamily (select from SCOP). 3. Begin with a genome (select from list). The output for 1. will give you a list of hits that your sequences make to models belonging to superfamilies, their alignments to the model, and assigned genome sequences (a very instructive list of genomes in which a certain superfamily has already been described !) |
| TIGRFAMs (TIGR) |
TIGRFAMs
are
a collection of protein families featuring curated multiple
sequence alignments, Hidden Markov Models (HMMs) and associated
information designed to support the automated functional
identification of proteins by sequence homology. Classification by
equivalog family, where achievable, complements classification by
orthologs,
superfamily, domain or motif. Use this page to see the curated seed alignment for each TIGRFAM, the full alignment of all family members and the cutoff scores for inclusion in each of the TIGRFAMs. You can query TIGRFAMs by Text Search or query by sequence using the option Sequence Search. Note that by default, both databases (TIGRFAMs and PFAM) are searched. Note: TIGRFAMs are automatically searched when performing InterPro queries (see InterPro description). Note: PFAM is a collection of HMM models of protein families complementary to TIGRFAMs. PFAM models are constrained to be non-overlapping with one another and thus are more likely to describe domains rather than full-length proteins. |
| Motifs 1 - Integrated Search | |
| ELM
- Eukaryotic Linear Motif Resource (EMBL and others) |
ELM is the
largest collection of linear protein motifs, followed by
PROSITE and Scansite. ELM is a resource for predicting functional
sites in eukaryotic proteins. Putative functional sites are
identified by patterns (regular
expressions). Please note that "regular
expressions" are similar to the PROSITE
patterns, but have a slightly different syntax. Please refer to the
ELM-Help for information ! ELM is a co-operation between EMBL and
several other institutions. ELM is easy to query: you enter either a valid SwissProt/TrEMBL ID or AC, or a protein sequence. You may also specify the species and the cellular compartment, if known. In general, predictions of short linear motifs in protein sequences have to be taken with even more caution than those of large globular domains, as represented in databases like Pfam and SMART. NOTE the following statement at the ELM homepage:"...This means that most matches shown are more likely to be false positives than true matches. We hope that ELM server results will prove useful as guides to experimentation but they should not be treated as factual findings." Therefore, it is and will remain essential to improve the significance of the results by using filters. ELM currently has 3 filters: - Taxonomic filter: each ELM is annotated with one or more NCBI taxonomy identifiers to indicate its known phylogenetic distribution. If the user provides a query species, all ELMs which do not belong to this lineage are filtered out. - Cell compartment filter: If the user specifies the compartment in which the query protein functions, all ELMs not found in this compartment are filtered out. - Globular domain filter: This filter is based on the fact that many functional motifs are found in disordered (unstructured) regions of proteins, but not in globular domains, which are collected in Pfam or SMART. Therefore, all matches within predicted Pfam or SMART domains are filtered out. However, some patterns like phosphorylation sites are often found in exposed loops of globular domains, so users should always also have a look at the unfiltered ELM results. Note: If you want to scan your query sequence for disordered (unstructured) regions, you may have a look at the programs DisEMBL and GlobPlot. |
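Note: The combined effect of the three ELM filters can be sketched as follows. This is a minimal illustration under stated assumptions, not the ELM server's code: the record layout (dictionaries with "start", "end", "taxa" and "compartments"), the function name and the toy matches are all hypothetical, and "within a domain" is assumed here to mean that the match lies completely inside a predicted domain interval.

```python
def filter_matches(matches, lineage=None, compartment=None, domains=()):
    """Apply taxonomic, compartment and globular-domain filters to motif matches.

    lineage     -- NCBI taxonomy ids of the query species and its ancestors
    compartment -- cellular compartment of the query protein
    domains     -- (start, end) intervals of predicted globular domains
    """
    kept = []
    for match in matches:
        # Taxonomic filter: the motif must be annotated for the query lineage.
        if lineage is not None and not set(match["taxa"]) & set(lineage):
            continue
        # Cell compartment filter: the motif must be known in this compartment.
        if compartment is not None and compartment not in match["compartments"]:
            continue
        # Globular domain filter: drop matches lying inside a predicted domain.
        if any(s <= match["start"] and match["end"] <= e for s, e in domains):
            continue
        kept.append(match)
    return kept


# Toy matches (hypothetical); 2759 = Eukaryota, 4751 = Fungi, 9606 = Homo sapiens.
toy_matches = [
    {"name": "m1", "start": 10, "end": 15, "taxa": [2759], "compartments": ["nucleus"]},
    {"name": "m2", "start": 50, "end": 55, "taxa": [4751], "compartments": ["nucleus"]},
    {"name": "m3", "start": 100, "end": 105, "taxa": [2759], "compartments": ["cytosol"]},
    {"name": "m4", "start": 210, "end": 215, "taxa": [2759], "compartments": ["nucleus"]},
]

filtered = filter_matches(
    toy_matches,
    lineage=[2759, 33208, 9606],
    compartment="nucleus",
    domains=[(200, 300)],
)
print([m["name"] for m in filtered])
```

Calling filter_matches(toy_matches) without any filter arguments returns all four matches, which corresponds to the unfiltered view that should always be inspected as well.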
| Motif
Scan (MyHits, SIB) |
Motif Scan is part of MyHits, an extension of Hits, a free database devoted to protein domains. Motif Scan searches simultaneously for profiles and patterns in PROSITE profiles, PROSITE patterns, Pfam and the Gribskov collection. The output is multi-colored, with a zoomable graphical display of the matches, and significant matches are specially marked. |
| PIR (Georgetown University Medical Center, Washington) including: Pattern Search Peptide Match |
The Protein Information
Resource (PIR)
produces the PIR-International Protein Sequence Database (PIR-PSD)
- a comprehensive, non-redundant, expertly annotated, fully classified
and extensively cross-referenced protein sequence database in
the public domain. It provides an integration of sequences, functional,
and structural information to support genomics and proteomics research.
Pattern Search of the PIR database performs two different tasks. First, Pattern Search can search the iProClass database for proteins sharing a certain user-defined pattern. There is also a comprehensive help file on how to write a peptide pattern. In the output list, each entry can be selected individually, and there are options to generate a common FASTA sequence file as well as ClustalW multiple alignments and domain architecture images. Second, Pattern Search can search your query sequence for known patterns against the PROSITE database. Peptide Match is a program which retrieves protein sequences via exact peptide match (meaning that you know the exact peptide sequence without IUPAC codes etc.). Peptide Match searches either UniProt or UniRef databases for matching protein sequences. The output and download options are similar to Pattern Search. |
| PPSearch (EBI) |
PPSearch, provided by
the
EBI, scans a sequence against the PROSITE
protein profile database allowing also a graphical output. |
| PROSITE (ExPASy) including ScanPROSITE |
PROSITE database
consists of a large collection of biologically meaningful signatures
that are described as patterns or profiles. Each signature is linked to
a documentation that provides useful biological information on the
protein family, domain or functional site identified by the signature.
This helps to reliably identify to
which known protein family (if any) a new sequence belongs. 1. Signature types used in PROSITE: Patterns or regular expressions are useful tools to identify short and well-conserved regions, such as catalytic sites, binding sites, post-translational modifications (PTMs) or zinc fingers. Note: Patterns need to be updated regularly to introduce new variabilities in the regular expression as new sequences are added. 2. ScanPROSITE allows you to scan protein sequence(s) (either from UniProt Knowledgebase (Swiss-Prot/TrEMBL) or PDB or provided by the user) for the occurrence of patterns, profiles and rules (motifs) stored in the PROSITE database, or to search protein database(s) for hits by specific motif(s). This means that ScanPROSITE does not only scan a user sequence but may also be used to retrieve ALL proteins sharing a certain domain or motif. Note: The PROSITE pattern syntax rules are described in this help section. Note: Further documentation on how to write a peptide pattern is available at the PIR database along with the documentation of the program Pattern Search. Typical PROSITE accessions: refer to section PROSITE IDs. |
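Note: To illustrate how a PROSITE-style pattern relates to an ordinary regular expression, here is a minimal Python sketch. It only covers the basic syntax elements (single residues, "x" for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) repeats, and "<" / ">" anchors at the ends of the pattern); the function name is hypothetical and the sketch is not a replacement for ScanPROSITE.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat such as (3) or (2,4).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


print(prosite_to_regex("N-{P}-[ST]-{P}"))
```

The resulting expression can be used with re.finditer on a one-letter protein sequence. More advanced syntax (for example ">" inside square brackets) is deliberately not handled and raises a ValueError.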
| Scansite (MIT) |
Scansite
searches for motifs within
proteins that are likely to be phosphorylated by specific
protein kinases or bind to domains such as SH2 domains, 14-3-3
domains or PDZ domains. 1) The program MotifScan
utilizes
an entropy approach that assesses the probability of a site matching
the
motif using the selectivity values and sums the logs of the probability
values for each amino acid in the candidate sequence. The program then
indicates the percentile ranking of the candidate motif with respect to
all
potential motifs in proteins of a protein database. So, the smaller the
percentage
value, the better the identified hit. You may also scan by accession number or ID. 2) Database search
using a Scansite motif:
You may also search databases (Swiss-Prot, TrEMBL, Ensembl) for
proteins bearing a certain motif. From the output list, you can
directly perform a MotifScan of the proteins of interest.
|
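Note: The scoring idea described above (summing the logs of per-position probability values, then ranking the candidate against all potential sites in a database) can be sketched as follows. This is a simplified illustration, not Scansite's actual scoring function: the function names, the toy probability matrix, the floor value for unlisted residues and the convention that a higher log-sum is better are all assumptions.

```python
import math


def log_score(site, position_probs, floor=1e-3):
    """Sum the log-probabilities of each residue at its motif position.

    position_probs holds one dictionary (residue -> probability) per position;
    residues missing from a dictionary get the small floor probability.
    """
    return sum(
        math.log(probs.get(residue, floor))
        for residue, probs in zip(site, position_probs)
    )


def percentile_rank(score, background_scores):
    """Percentage of background sites scoring at least as well as this site.

    As in the description above, a smaller percentage means a better hit.
    """
    at_least_as_good = sum(1 for s in background_scores if s >= score)
    return 100.0 * at_least_as_good / len(background_scores)


# Toy two-position motif (hypothetical values).
toy_matrix = [{"R": 0.8, "K": 0.2}, {"S": 1.0}]
print(log_score("RS", toy_matrix))
print(percentile_rank(-1.0, [-0.5, -1.0, -2.0, -3.0]))
```

In practice the background scores would be computed for every candidate site in a protein database, which is what makes the percentile meaningful.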
| Motifs
2 - Motif Discovery |
| NOTE:
Many computational motif discovery tools work both on sets of DNA and PROTEIN
sequences. Therefore, ALL of these programs are listed in one section "Motif Discovery" which is part
of the "Gene Analysis" main page. The
prediction of over-represented motifs is widely used in promoter
analyses, hence there are also close relations between these two
topics. |
| Motifs
3 - Modification |
|
| Linkpage 1 - ExPASy (SIB) |
ExPASy (Expert Protein
Analysis System) is a
proteomics server provided by the SIB
(Swiss
Institute of Bioinformatics). Among many other links, it also provides
a very good list of
links concerning protein post-translational
modification prediction. Please also have a look at the ExPASy main section. |
| NetChop (CBS, Denmark) |
The NetChop
WWW
server produces neural network predictions for cleavage sites
of the human proteasome. |
| NetNglyc (CBS, Denmark) |
The NetNglyc
WWW server
predicts N-Glycosylation sites in human proteins using
artificial neural networks that examine the sequence context of
Asn-Xaa-Ser/Thr sequons. |
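Note: Before submitting a sequence to NetNglyc, you may want to list the candidate Asn-Xaa-Ser/Thr sequons yourself. The following minimal Python sketch does only that; it assumes that Xaa is any residue except proline, and the function name is hypothetical. Whether a listed sequon is actually glycosylated depends on its sequence context, which is exactly what the neural networks evaluate.

```python
import re

# Asn-Xaa-Ser/Thr, with Xaa assumed to be any residue except proline.
# The lookahead allows overlapping sequons to be reported as well.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(sequence):
    """Return (1-based position of Asn, sequon) for each candidate sequon."""
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq)]


print(find_sequons("MNGSANPTNNTS"))
```

For this toy sequence the candidates are NGS at position 2, NNT at position 9 and NTS at position 10; NPT at position 6 is skipped because of the proline.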
| NetOglyc (CBS, Denmark) |
NetOglyc
performs prediction of
potential sites for O-glycosylation in mammalian proteins. |
| NetPhos (CBS, Denmark) |
The NetPhos
WWW
server produces neural network predictions for serine, threonine and
tyrosine phosphorylation sites in eukaryotic proteins. |
| PESTfind Analysis Webtool (EMBnet Austria) |
PESTfind
Analysis Webtool is provided by the Austrian EMBnet
node. The PESTfind
algorithm allows rapid and objective identification of PEST motifs
in protein target sequences. Briefly, the PEST hypothesis was based on
a literature survey that combined information on protein stability
with protein primary sequence information. Initially, the study
relied on 12 short-lived proteins with well-known properties, but it was
continually extended later. Although these proteins had quite
different cellular functions, it became apparent that they shared high
local concentrations of the amino acids proline (P),
glutamic acid (E), serine (S),
threonine (T)
and to a lesser extent aspartic acid (D). From that it was concluded
that PEST motifs reduce the half-lives of proteins dramatically and
hence, that they target proteins for proteolytic degradation. |
| PESTFIND
and EPESTFIND (EMBOSS, Pasteur) |
PESTFIND
finds PEST
motifs
as potential proteolytic cleavage sites in proteins. EPESTFIND allows rapid and objective identification of PEST motifs in protein target sequences. Briefly, the PEST hypothesis was based on a literature survey that combined information on protein stability with protein primary sequence information. The initial group of proteins studied included E1A, c-myc, p53, c-fos, v-myb, and others. Although these proteins had quite different cellular functions, it became apparent that they shared high local concentrations of the amino acids proline (P), glutamic acid (E), serine (S), threonine (T) and, to a lesser extent, aspartic acid (D). From that it was concluded that PEST motifs reduce the half-lives of proteins dramatically and hence that they target proteins for proteolytic degradation. |
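Note: The observation behind the PEST hypothesis, i.e. a high local concentration of P, E, S, T (and D), can be made visible with a simple sliding window. The sketch below is NOT the PESTfind / EPESTFIND algorithm and does not reproduce its scores; the function name, the window length of 12 and the threshold of 0.5 are arbitrary assumptions chosen for illustration.

```python
PEST_RESIDUES = set("PESTD")


def pest_rich_windows(sequence, window=12, threshold=0.5):
    """Return (1-based start, segment, fraction) for P/E/S/T/D-rich windows."""
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        # Fraction of residues in this window that are P, E, S, T or D.
        fraction = sum(residue in PEST_RESIDUES for residue in segment) / window
        if fraction >= threshold:
            hits.append((start + 1, segment, round(fraction, 2)))
    return hits


print(pest_rich_windows("KKKKPESTPESTPESTKKKK", window=12, threshold=1.0))
```

With a threshold of 1.0 only the window consisting entirely of P, E, S and T residues is reported; lowering the threshold also reports the overlapping neighbouring windows.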
| RESID
Database (EBI and NCIFCRF) |
The RESID Database of Protein
Modifications is a comprehensive collection of annotations
and structures for protein
modifications and
cross-links including pre-, co-, and post-translational modifications.
The database provides: systematic and alternate names, atomic formulas
and masses, enzymatic activities that generate the modifications,
keywords, literature citations, Gene Ontology (GO) cross-references,
protein sequence database feature table annotations, structure
diagrams, and molecular models. Each RESID Database entry presents a chemically unique modification and shows how that modification is currently annotated in the protein sequence databases, Swiss-Prot and the Protein Information Resource (PIR). The RESID Database provides a table of corresponding equivalent feature annotations that is used in UniProt. The RESID Database can be searched via keywords (like "palmitoylation" or "palmitate") using an "SRS-based" interface. NOTE: The RESID Database is not a site where you can screen a query sequence for potential modifications, but it is the largest catalog available in this field, which also strongly aims at a "standardized vocabulary" for protein modifications. RESID is also available as SRS-database. Typical RESID accessions: refer to section RESID IDs. |
| SignalP (CBS, Denmark) |
The SignalP
World
Wide Web server predicts the presence and location of signal
peptide cleavage sites in amino acid sequences from different
organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and
eukaryotes. The method incorporates a prediction of cleavage sites and
a signal peptide/non-signal peptide prediction based on a combination
of several artificial neural networks. Note: There is a good explanation of the SignalP output available. |
| Motifs 4 - Localization | |
| NOTE: This section lists programs which predict protein localization based on sequence motifs, in contrast to databases which store localization data of proteins based on laboratory experiments. These are described in the section "Protein Localization Databases" ! | |
| Linkpage 1 - CUBIC: Columbia
Univ. Bioinf. Center (Columbia University, New York) |
The CUBIC (Columbia
Univ. Bioinformatics Center) provides a
whole list of services and databases with emphasis on prediction of
subcellular localization and the structural prediction of
proteins. |
| ER-GolgiDB (Columbia University, New York) |
ER-GolgiDB
is a database of predictions for Endoplasmic Reticulum and
Golgi Apparatus localization based on sequence homology to
experimentally
annotated proteins. The assigned localization is inferred from the
homologue
that most accurately predicts localization for the protein and the
accuracy based on HSSP distance threshold is provided. Subsets of membrane
and lumenal predictions are also provided. You may either search specific parts of the database (like ER membrane subset) or download the whole database. Unfortunately, only lists of proteins can be displayed (incl. links to individual entries); there is no option like in-batch FASTA sequence download. |
| LOCtarget (Columbia University, New York) |
LOCtarget
is a database
of predicted subcellular localization for potential targets for
structural genomics from TargetDb. You may either search or browse the LOCtarget database, or you may submit your own FASTA protein sequence for localization prediction. Subcellular localization is currently predicted using four different methods: predictNLS (nuclear localization signal), LOChom (using homology), LOCkey (using keywords) and LOCnet (neural network based prediction). The reported localization is based on the method which predicts localization of a given protein with the highest confidence. Please note that the output is emailed to you and is quite short, without giving a lot of details. |
| NESbase (CBS, Denmark) |
NESbase
is a
database of proteins in which the presence of a leucine-rich nuclear
export signal (NES) has been experimentally verified. It is
curated from literature. The link "Database in HTML" lists all entries (proteins) that are contained in this database. Note that there is NO option to predict a NES in a user's query sequence (in contrast to PredictNLS for nuclear localization signals). The widely accepted NES consensus is: L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI]. So, you may scan your sequence "by hand" for the presence of this motif. |
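Note: Scanning "by hand" for the NES consensus is easily automated. The following minimal Python sketch translates the consensus given above into a regular expression and reports all candidate positions; the function name is hypothetical. A match only means that the sequence fits the consensus, not that the protein carries a functional NES.

```python
import re

# NES consensus from the text: L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI]
# "x" becomes ".", "x(2,3)" becomes ".{2,3}"; the lookahead wrapper
# lets candidates that overlap each other be reported too.
NES_REGEX = re.compile(r"(?=(L.{2,3}[LIVFM].{2,3}L.[LI]))")


def find_nes_candidates(sequence):
    """Return (1-based start, matched segment) for each consensus match."""
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in NES_REGEX.finditer(seq)]


print(find_nes_candidates("GGLALKLAGLDIGG"))
```

For this toy sequence a single candidate, LALKLAGLDI starting at position 3, is reported. At most one match is listed per start position.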
| NMP-db (Columbia University, New York) |
NMP-db
is a database of nuclear
matrix associated proteins. You may either search
or download the whole or specific subsets of the database. |
| PredictNLS (Columbia University, New York) |
PredictNLS
is an automated tool for the analysis and determination of Nuclear
Localization Signals (NLS). You submit a protein sequence
or a potential NLS. PredictNLS then predicts whether your
protein is nuclear, or checks whether your potential NLS
is found in its database. |
| ProtComp (Softberry) |
ProtComp is a program for the prediction of subcellular localizations of proteins. |
| PSORT (Tokyo University) |
PSORT
is one of
the best known programs for analysis of protein sorting signals
and prediction of subcellular localization. PSORT
receives the information of an amino acid sequence
and its source origin as inputs. Then, it analyzes the input sequence
by applying the stored rules for various
sequence features of known protein sorting signals. Finally, it reports
the
possibility for the input protein to be localized at each candidate site
with additional information. 1. PSORT.org provides links to the PSORT family of programs for subcellular localization prediction. 2. PSORT II is the current version of the "standard" PSORT program. PSORT only accepts single sequences as input. Note that PSORT II comes with a highly instructive user manual explaining the diverse predictions. 3. WoLF PSORT is a recently updated version of PSORT II for the localization prediction of eukaryotic sequences. Note: PSORT provides a quite detailed output as compared to e.g. LOCtarget. Note that PSORT II is also available at the Pasteur Institute. Note that PSORT II is also integrated in the "data super-integration tool" Bioinformatic Harvester of the EBI. 4. iPSORT is a program for classification of eukaryotic N-terminal sorting signals. Given a protein sequence, it will predict whether it contains a Signal Peptide (SP), Mitochondrial Targeting Peptide (mTP), or Chloroplast Transit Peptide (cTP). |
| TargetP (CBS, Denmark) |
TargetP
predicts the subcellular location of eukaryotic protein
sequences. The
subcellular location assignment is based on the predicted presence of
any of the N-terminal presequences chloroplast transit peptide (cTP),
mitochondrial targeting peptide (mTP) or secretory pathway signal
peptide (SP). |
| TOPO
Viewer (H-InvDB) |
TOPO
Viewer is a Java
applet for viewing both sub-cellular targeting signals
predicted by PSORT II and
TargetP, as well as the presence of trans-membrane helices predicted by
SOSUI and
TMHMM. Please note that test runs showed that the TOPO Viewer works well with MS Internet Explorer, but not with Netscape 7.0 ! TOPO Viewer is a tool which is integrated into the H-Invitational Database (H-InvDB), which provides an integrative annotation of full-length cDNA clones. You first have to get the specific database entry of your gene of interest, either via BLAST (sequence) search or via keyword search, and then look for the section "Prediction of subcellular localization" within the so-called "cDNA view". You will get a tabular summary of the results of the different prediction programs as well as a link to open the TOPO Viewer. Please also refer to the H-InvDB section at the Data Integration page for a detailed description ! |
| Protein
Localization Databases |
|
| NOTE:
This section lists databases which store localization
data of proteins based on laboratory experiments, in contrast to
programs which predict
protein localization based on sequence motifs.
These are described in the section "Motifs
4 - Localization" ! NOTE: This section is related to the section "RNA Localization Databases" listing resources which store RNA localization images based on in situ hybridization experiments. |
|
| GFP-cDNA (EMBL and DKFZ) |
GFP-cDNA is an ongoing project of localising
novel GFP-tagged
human cDNA products to subcellular compartments of the eukaryotic
cell.
This information provides an entry point for many other downstream
functional assays that are designed and implemented for the subsets of
new proteins localising to defined subcellular organelles. Images of all localised proteins and their bioinformatic analysis can be viewed via the ‘Results Table’ or ‘Results Images’ buttons. In addition, a search window can be used to find proteins containing features or motifs of particular interest to you that have been localised in this project. NOTE: Protein Localization images are also integrated in the data super-integration tool Bioinformatic Harvester; please refer to the main section of this tool for details ! NOTE: The names of GFP-cDNA entries are clone names, which mostly give no hint about the nature of the proteins. If you want to extract the complete list of all localized proteins via the Bioinformatic Harvester, you may use the following "trick": enter "pepperkok" as search term (derived from Rainer Pepperkok, one of the two project heads, together with Jeremy Simpson). You may also perform combined searches like "pepperkok golgi" or "pepperkok endoplasmic", and select the checkbox "AND search". |
| HPR - Human Protein Atlas (HPR program, Sweden) |
HPR
- the Human Protein Atlas, contains hundreds of
thousands of images of protein expression in normal
human tissues and cancer cells. The Swedish Human Proteome
Resource (HPR)
program, funded by the Knut and Alice Wallenberg Foundation,
has been set up to allow the systematic exploration of the human
proteome with Affinity (Antibody)
Proteomics, combining high-throughput generation of affinity-purified
(mono-specific) antibodies with
protein profiling using tissue arrays. The basic concept of this
resource centre is to produce specific antibodies to human target
proteins using a high-throughput method that involves the cloning
and
expression of protein epitope signature tags. Query: At the top of the page, you'll find information about HPR, descriptions and annotations, as well as useful information on image-usage policies. Available proteins (genes) can be reached through a specific search (by gene/protein name/id or classification, such as kinase or protease) or by browsing the individual chromosomes. Output: The data are presented as high-resolution images representing immunohistochemically stained tissue sections. The final goal is to produce datasets for all of the about 22,000 different proteins, one for each human gene. The vision, as indicated on the Human Protein Atlas site, is “...to enable the systematic generation of quality assured antibodies to all non-redundant human proteins and to use these reagents to functionally explore human proteins, protein variants and protein interactions.” Typical HPR accessions: refer to section HPR IDs. |