Bioinformatics World    
         
Proteins Linkpages
NOTE: Databases and tools that provide integrated views and high-throughput retrieval of protein data are described in the "Data Integration" section.
Linkpage 1 - ExPASy
(SIB)
ExPASy (Expert Protein Analysis System) is a proteomics server provided by the SIB (Swiss Institute of Bioinformatics), dedicated to the analysis of protein sequences and structures as well as 2-D PAGE. The ExPASy Site Map shows a graphical display which links to all important resources maintained at ExPASy.
Note: ExPASy can also be accessed via these ExPASy mirror sites.

Tools are provided for topics such as similarity searches, pattern and profile searches, post-translational modification prediction, primary, secondary and tertiary structure prediction, transmembrane region detection, sequence alignment and many more.
Linkpage 2 - Pattern search
(Pasteur Institute)
The Pasteur Institute provides a highly recommended linkpage for pattern searches, including many EMBOSS tools.
Linkpage 3 - Hidden Markov Models
(SWBIC)
This HMM linkpage is provided by the SWBIC (Southwest Biotechnology and Informatics Center), covering databases/applications using HMMs, as well as introductory papers in this field.

 
Primary Structure
Linkpage 1 - ExPASy
(SIB)
ExPASy (Expert Protein Analysis System) is a proteomics server provided by the SIB (Swiss Institute of Bioinformatics). Among many other links, it also provides a very good list of links concerning protein primary structure (sequence) analysis, including graphical tools.
Please also have a look at the ExPASy main section.

2ZIP
(Max Planck Institute for Molecular Genetics)
2ZIP is a program for predicting leucine zippers in proteins, provided by the Max Planck Institute for Molecular Genetics in Berlin. 2ZIP combines a standard coiled coil prediction algorithm with an approximate search for the characteristic leucine repeat. No further information from homologues is required for prediction. This approach improves significantly over existing methods, especially in that the coiled coil prediction turns out to be highly informative and avoids large numbers of false positives.
ANTIGENIC
(EMBOSS, Pasteur)
Antigenic is an EMBOSS tool that finds antigenic sites in proteins.
PEPSTATS
(EMBOSS, Pasteur)
Pepstats outputs a report of simple protein sequence information: molecular weight, number of residues, charge, isoelectric point, molar extinction coefficient (A280), and, for each type of amino acid, its number, molar percent and DayhoffStat, among others.
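As a rough illustration of the per-residue part of such a report, the following Python sketch counts residues and computes molar percent. It is not the Pepstats implementation; the function name `composition` is made up for this example, and values such as DayhoffStat or the isoelectric point are deliberately left out.

```python
from collections import Counter


def composition(seq):
    """Return {residue: (count, molar_percent)} for a protein sequence.

    Hypothetical helper for illustration only, not part of EMBOSS.
    """
    # Remove whitespace and normalise case before counting.
    seq = "".join(seq.split()).upper()
    counts = Counter(seq)
    total = len(seq)
    return {aa: (n, 100.0 * n / total) for aa, n in sorted(counts.items())}


if __name__ == "__main__":
    for aa, (n, pct) in composition("MKTAYIAKQR").items():
        print(f"{aa}\t{n}\t{pct:.1f}")
```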

pI/MW
(ExPASy)
The pI/MW tool, provided by ExPASy, performs a determination of the isoelectric point and the molecular weight of protein sequences. 
SAPS
(SIB) 
SAPS, provided by the SIB, performs statistical analysis of protein sequences (amino acid composition, charge distribution, hydrophobic segments, cysteine spacings, repetitive structures, etc.).

Note: SAPS is also available as an EMBOSS tool.


Secondary Structure
Linkpage 1 - ExPASy
(SIB) 
ExPASy (Expert Protein Analysis System) is a proteomics server provided by the SIB (Swiss Institute of Bioinformatics). Among many other links, it also provides a very good list of links concerning protein secondary structure prediction. Please also have a look at the ExPASy main section.
HelixTurnHelix
(EMBOSS, Pasteur)
Helixturnhelix uses the method of Dodd and Egan and finds helix-turn-helix nucleic acid binding motifs in proteins, as e.g. found in many transcription factors.
The helix-turn-helix motif was originally identified as the DNA-binding domain of phage repressors. One alpha-helix lies in the wide groove of DNA; the other lies at an angle across DNA. 
JPRED2
(EBI)
Jpred takes either a single protein sequence or a multiple alignment of protein sequences, and predicts secondary structure (helices, sheets, turns, coiled coils, transmembrane regions, solvent accessibility...). It works by combining a number of modern, high quality prediction methods to form a consensus. Jpred is provided by the Barton Group at Dundee, which also offers a FAQs page and a "Hints and Tips" site.
With a single click it runs a series of programs, including PHD, PREDATOR, NNSSP, MULPRED, ZPRED, JNET, COILS, MULTICOIL, PHDhtm (TM prediction), ...
NOTE: Predictions work better for multiple alignments than for single sequences. Therefore single sequences are first used to create automatic multiple alignments with the best hits in the non-redundant database. Then the prediction algorithms are run on this alignment.
NOTE: Input sequences should be in UPPER CASE letters, as some test runs using lower case letters did not work properly.
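A small Python sketch of how sequences could be converted to upper case before submission is shown here. It assumes plain FASTA input; the helper name `uppercase_fasta` is invented for this example and has nothing to do with Jpred itself.

```python
def uppercase_fasta(text):
    """Upper-case the sequence lines of FASTA text, leaving headers as they are."""
    out = []
    for line in text.splitlines():
        # Header lines start with '>' and are kept unchanged.
        out.append(line if line.startswith(">") else line.upper())
    return "\n".join(out)


if __name__ == "__main__":
    print(uppercase_fasta(">seq1 example\nmktayiakqr\nqisfvkshfs"))
```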
PepCoil
(EMBOSS, Pasteur)
 
PepCoil predicts coiled coil regions in proteins. Coiled coils are formed by two or three alpha helices in parallel and in register that cross at an angle of approximately 20 degrees, are strongly amphipathic and display a pattern of hydrophilic and hydrophobic residues that is repeated every seven residues. The seven positions of the heptad repeat are designated a through g, a and d being generally hydrophobic, while the others are hydrophilic. The parallel two-stranded alpha-helical coiled coil is the most frequently encountered subunit-oligomerization motif in proteins.
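The heptad pattern described above can be made concrete with a short Python sketch. It only labels positions a through g and checks how many a and d positions are hydrophobic; it is not the scoring scheme PepCoil uses. The hydrophobic residue set and the function names are assumptions chosen for this example.

```python
# Assumption: a simple hydrophobic set; other definitions are possible.
HYDROPHOBIC = set("AILMFVWY")
HEPTAD = "abcdefg"


def heptad_positions(seq, start="a"):
    """Label each residue with its heptad position, given the register of residue 1."""
    offset = HEPTAD.index(start)
    return [HEPTAD[(offset + i) % 7] for i in range(len(seq))]


def core_hydrophobic_fraction(seq, start="a"):
    """Fraction of residues at positions a and d that are hydrophobic."""
    positions = heptad_positions(seq, start)
    core = [aa for aa, pos in zip(seq.upper(), positions) if pos in "ad"]
    if not core:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in core) / len(core)


if __name__ == "__main__":
    # Idealised repeat with leucine at a and d.
    print(core_hydrophobic_fraction("LKELEDK" * 4))
```

A high fraction in one register and a low fraction in the others is what the heptad description predicts for a coiled coil; a real predictor scores all registers over a sliding window.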
PredictProtein and META PP
(Columbia University, New York)
PredictProtein predicts protein secondary structure (helices, sheets) and solvent accessibility, reports PROSITE motifs and low-complexity regions, and runs similarity searches to identify related sequences in databases.
Similar to Jpred, PredictProtein runs a series of programs with a single click, including PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, and EvalSec.

META PP allows a "one-button" submission of your sequence via a single-page interface to a variety of servers, for the purpose of secondary and tertiary structure prediction. The linked servers include SWISS-MODEL, Superfamily, DAS, JPRED, PHD, PROF and more. You will receive individual emails containing the results of these predictions.
SOSUI
(Tokyo University of Agriculture and Technology)  
The SOSUI system  is a tool for secondary structure prediction of membrane proteins from a protein sequence. The basic idea of prediction is based on the physicochemical properties of amino acid sequences such as hydrophobicity and charges. The system deals with three types of prediction: discrimination of membrane proteins from soluble ones, prediction of existence of transmembrane helices and determination of transmembrane helical regions.

For a nice example of a SOSUI sequence query, enter the sequence of the erythrocyte anion exchanger, which shows 12 TM helices: RefSeq NP_000333.
SOSUI has a very nice graphical output which shows the hydropathy profile, the "helical wheel representation", and the possible membrane topology.
Note that there is also an interface for in-batch sequence submission, allowing the input of a multi-FASTA file. In this case, there is no graphical output but a simple table listing the number and positions of potential TM regions.
NOTE: SOSUI is also integrated in the "data super-integration tool" Bioinformatic Harvester of the EBI.  
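To give an idea of what a hydropathy profile is, here is a minimal Python sketch of a sliding-window average. It uses the Kyte-Doolittle scale and a window of 19 residues as assumptions; the scale and parameters that SOSUI itself applies are not stated here, so this is not a reimplementation of SOSUI.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy for every window of the given length along the sequence."""
    seq = seq.upper()
    if len(seq) < window:
        return []
    # Unknown symbols (e.g. X) are scored as 0.0 in this sketch.
    scores = [KD.get(aa, 0.0) for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


if __name__ == "__main__":
    profile = hydropathy_profile("MKT" + "LIVALLAIVGLLVAFIAL" + "RKDEQS", window=9)
    print([round(v, 2) for v in profile])
```

Stretches where the averaged value stays high are candidates for membrane-spanning helices; the threshold to apply is a separate choice that this sketch does not make.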
TMAP
(Karolinska Institute, Sweden)
TMAP predicts transmembrane helices from multiple sequence alignments or from single sequences. The alignment should be in GCG format (.MSF).
NOTE: A multiple alignment is not the same as a set of multiple unaligned single sequences.
TMHMM
(CBS, Denmark)
TMHMM is a program for prediction of transmembrane helices in proteins. NOTE: You can submit many proteins at once in one FASTA file. Please limit each submission to at most 4000 proteins. TMHMM produces a very nice graphical and tabular output, and discriminates between "inside" and "outside" helices.
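If a proteome file holds more than 4000 proteins, it has to be split before submission. The following Python sketch shows one way to do that; the function name `split_fasta` is made up for this example and the code only assumes standard FASTA headers starting with ">".

```python
def split_fasta(text, max_records=4000):
    """Split multi-FASTA text into chunks of at most max_records sequences."""
    records, current = [], []
    for line in text.splitlines():
        # A new header closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return ["\n".join(records[i:i + max_records]) + "\n"
            for i in range(0, len(records), max_records)]


if __name__ == "__main__":
    demo = ">a\nMK\n>b\nLL\n>c\nGG\n"
    for n, chunk in enumerate(split_fasta(demo, max_records=2), start=1):
        print(f"--- chunk {n} ---")
        print(chunk, end="")
```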
TMPred
(EMBnet.ch)
TMPred is a program for prediction of transmembrane helices in proteins and their orientation.  TMPred works only on single sequences, and provides different possible models of TM count and orientations.
TopPred2
(EMBOSS, Pasteur)
TopPred2 is a program for prediction of transmembrane helices in proteins. It provides many control options, like different hydrophobicity scales, and many output formats, like membrane topology, hydropathy profile, lists of hydrophobicity values, and more.
Note that, although not explicitly stated, TopPred2 can also be queried using a multiple sequence FASTA file as input for in-batch analysis.


Domains, Families
BLOCKS 1 - BLOCKS server and Logos from multiple sequence alignments or blocks
(FHCRC)
The BLOCKS server is a service for biological sequence analysis at the Fred Hutchinson Cancer Research Center in Seattle, Washington, USA.
There is also a Blocks mirror site at the Weizmann Institute of Science in Israel.

A sequence logo is a graphical representation of aligned sequences where at each position the size of each residue is proportional to its frequency in that position and the total height of all the residues in the position is proportional to the conservation (information content) of the position.
Please note that there is a whole section about Logos, and a detailed description of Logos from BLOCKS.
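The logo definition above (letter size proportional to frequency, stack height proportional to information content) can be written down in a few lines of Python. This is a simplified sketch: it ignores gaps, sequence weighting and the small-sample correction that logo programs usually apply, and the function name `logo_heights` is an assumption for this example.

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=20):
    """Letter heights in bits for one alignment column.

    Stack height = log2(alphabet_size) - Shannon entropy of the column;
    each letter gets a share of the stack proportional to its frequency.
    """
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    info = math.log2(alphabet_size) - entropy
    return {aa: (c / n) * info for aa, c in counts.items()}


if __name__ == "__main__":
    alignment = ["GLSDG", "GLTDG", "GISEG"]
    for i, column in enumerate(zip(*alignment), start=1):
        print(i, {aa: round(h, 2) for aa, h in logo_heights(column).items()})
```

A fully conserved protein column reaches the maximum of log2(20) bits, while a column with all residues equally frequent has height zero.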
BLOCKS 2 - Trees from multiple sequence alignments or blocks
(FHCRC)
Trees from Blocks uses a set of blocks to construct and display a neighbor-joining tree for examining possible subfamily relationships. Because blocks represent the most highly conserved regions of proteins, misaligned regions are avoided, and so trees from blocks should be of high quality. NOTE: This differs from trees built from global alignments, like those included with the Biology Workbench.

The procedure is the same as the one described for Logos (e.g. starting with Block Maker and a set of input FASTA sequences), except that the link "tree" on the Blocks output has to be used.
NOTE: Trees needs alignments of at least 4 input sequences.

BLOCKS 3 - 3D Blocks
(FHCRC)
3D Blocks is a search and display tool which allows you to view blocks on a protein structure. It uses a block (either stored or generated by Block Maker) as the query sequence for a MAST search of the PDB database. It then determines which regions on the high-scoring protein sequences correspond to the blocks, and creates output files to display its results. Again, a direct link to 3D Blocks is available at the Blocks output generated by Block Maker.
BLOCKS 4 - Block Searcher
(FHCRC)
As an aid to detection and verification of protein sequence homology, the Block Searcher compares a protein or DNA sequence to the current database of protein blocks. Blocks are short multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened.
It is advisable to search both Blocks+ and Prints. Blocks+ has automatically generated blocks, while Prints has hand-crafted blocks.
NOTE: See also the remark in the BLOCKS 5 section.
BLOCKS 5 - IMPALA
(FHCRC)
IMPALA (Integrating Matrix Profiles And Local Alignments) searches a protein query sequence against a multiple alignment database represented as a collection of PSI-BLAST checkpoint files. IMPALA has been implemented on the Blocks Server to search a blocks database, such as Blocks+.

Although the Blocks Searcher performs a similar type of search, there are differences between IMPALA and the Blocks Searcher in the PSSMs used, in the alignments reported, and in the calculation of statistics that can lead to somewhat different results. Therefore, any marginal similarity detected with one searching program should be confirmed using the other. Both programs generally detect true positive hits but they tend to report different false positives, and so any hit not detected by both searching programs should be regarded with caution (!!!).

Whereas the Block Searcher scores individual blocks separately and then combines the scores for blocks in a family, IMPALA scores the set of blocks for a family as a whole so a hit is for the whole family, not for an individual block. Since IMPALA scores not only the blocks but also regions between them, its alignments may extend beyond the blocks. The resulting BLAST-like output gives scores and expected values.

BLOCKS 6 - RPS-BLAST
(FHCRC)
Reverse Position Specific BLAST (RPS-BLAST) is one of the BLAST series of searching programs from NCBI. RPS-BLAST uses the query sequence to search a database of pre-calculated PSSMs (in this case PSI-BLAST checkpoint files made from multiple alignments of protein families) and reports significant hits in a single pass. The role of the PSSM has changed from "query" to "subject", hence the term "reverse" in RPS-BLAST. Here the checkpoint files are made from the Blocks and Prints alignments, and are the same files searched with the similar IMPALA searching program.

Position specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile (or position specific scoring matrix, PSSM) is constructed (automatically) from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" used to refine the profile. This iterative searching strategy results in increased sensitivity.
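The idea of a PSSM can be sketched in Python as follows. This is a strongly simplified illustration and not what PSI-BLAST computes: it assumes an ungapped alignment, a uniform background frequency, a fixed pseudocount and no sequence weighting. All names (`build_pssm`, `score`) are invented for this example.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds scores (base 2) per column of an ungapped alignment.

    Assumptions: uniform background of 1/20 and a flat pseudocount.
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, segment):
    """Sum of position-specific scores for a segment as long as the PSSM."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(["ACD", "ACE", "ACD"])
    print(round(score(pssm, "ACD"), 2), round(score(pssm, "WWW"), 2))
```

Residues seen often in a column score above zero and unseen residues score below zero. In an iterative search the alignment would be rebuilt from the new hits and the matrix recomputed for the next round.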
BLOCKS 7 - Block Maker and Blocks Multiple Alignment Processor
(FHCRC)
Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms.
Blocks are short multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions.
If you already have a multiple alignment (like in ClustalW format), please use the Blocks Multiple Alignment Processor instead.
At least two related protein sequences (and a maximum of 250) must be provided to make blocks. Each sequence must have a unique name of 10 characters or fewer. If you have the accession numbers of some sequences you would like to use, NCBI Entrez or similar programs can create a file for you in FASTA format.

NOTE: The main difference to programs like ClustalW is that Block Maker does not perform a global alignment covering the complete sequences but selects for the sequence blocks which are best conserved.
NOTE: Block Maker output allows a series of follow-up analyses, like the display of phylogenetic trees, or sequence logos (see individual sections).
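The input rules stated above (2 to 250 sequences, unique names, at most 10 characters per name) can be checked before submission. The Python sketch below is only a local pre-check written for this page; the function name is hypothetical and the Blocks server may apply further rules not covered here.

```python
def check_block_maker_input(names, min_seqs=2, max_seqs=250, max_name_len=10):
    """Return a list of problems found in a list of sequence names."""
    problems = []
    if not min_seqs <= len(names) <= max_seqs:
        problems.append(
            f"need {min_seqs} to {max_seqs} sequences, got {len(names)}")
    for name in names:
        if len(name) > max_name_len:
            problems.append(
                f"name longer than {max_name_len} characters: {name}")
    seen = set()
    for name in names:
        # Report every repeated occurrence of a name.
        if name in seen:
            problems.append(f"duplicate name: {name}")
        seen.add(name)
    return problems


if __name__ == "__main__":
    print(check_block_maker_input(["AE1_HUMAN", "AE2_HUMAN", "AE1_HUMAN"]))
```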
BLOCKS 8 - LAMA
(FHCRC)
LAMA (Local Alignment of Multiple Alignments) is a program for comparing protein multiple sequence alignments with each other. The program can search databases of such multiple alignments. The search is for sequence similarities between conserved regions of protein families. The method is sensitive, detecting weak sequence relationships between protein families. Sequence similarities beyond the range of conventional sequence database searches can be detected by the method.
LAMA compares multiple sequence alignments of proteins. If you have only a single protein sequence you first need to find other members of its family. The protein sequences also need to be multiply aligned in BLOCKS format.
BLOCKS 9 - COBBLER
(FHCRC)
COBBLER means COnsensus Biasing By Locally Embedding Residues. 
A single sequence is selected from a set of blocks and enriched by replacing the conserved regions delineated by the blocks with consensus residues derived from the blocks. In comprehensive tests, embedding consensus residues improved performance with readily available single-sequence query searching programs, such as BLAST and FASTA. It is especially useful in PSI-BLAST searches.
Block Maker makes a COBBLER sequence automatically, but this page can be used if you want to embed your blocks in a different sequence.
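The embedding step can be pictured with the following Python sketch. It is a simplification made for illustration: the consensus here is just the most frequent residue per column, and block positions are given as 0-based offsets in the chosen sequence. How COBBLER actually selects the sequence and derives consensus residues is not reproduced, and the function names are assumptions.

```python
from collections import Counter


def consensus(segments):
    """Most frequent residue in each column of equally long block segments."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*segments))


def embed_consensus(sequence, blocks):
    """Replace block regions of a sequence by the block consensus.

    blocks: list of (start, segments), where start is the 0-based offset
    of the block in `sequence` and segments are the aligned block rows.
    """
    residues = list(sequence)
    for start, segments in blocks:
        cons = consensus(segments)
        residues[start:start + len(cons)] = cons
    return "".join(residues)


if __name__ == "__main__":
    block = (3, ["ACD", "ACE", "GCE"])
    print(embed_consensus("MKTGCDLLV", [block]))
```

The resulting sequence keeps its original length and its non-block regions, so it can be used directly as a query for single-sequence search programs.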
BLOCKS 10 - CODEHOP
(FHCRC)
CODEHOP designs PCR primers from protein multiple-sequence alignments. The program is intended for cases where the protein sequences are distant from each other and degenerate primers are needed.
The multiple-sequence alignments should be in the Blocks Database format, such as in Block Maker output. The output of this program contains links that send the resulting blocks directly to the CODEHOP page.

The results of the CODEHOP program are suggested degenerate sequences of DNA primers that you can use for PCR. You have to choose appropriate primer pairs, have them synthesized and perform the PCR.
BLOCKS 11 - SIFT
(FHCRC)
SIFT is a sequence homology-based tool that sorts intolerant from tolerant amino acid substitutions and predicts whether an amino acid substitution in a protein will have a phenotypic effect. SIFT is based on the premise that protein evolution is correlated with protein function. Positions important for function should be conserved in an alignment of the protein family, whereas unimportant positions should appear diverse in an alignment. 
In addition, if you have mutant proteins with single amino acid substitutions, SIFT will predict which mutants may have a phenotypic effect before you carry out your functional assays.
NOTE: SIFT is most efficient if you use multiple sequence alignments as input (or BLOCKS, generated by the Blocks Multiple Alignment Processor).
NOTE: Your query sequence is the sequence that you would like a prediction on (i.e. the sequence in which you have introduced the amino acid substitutions). The query sequence should be first in the alignment. The alignment must correspond to the length of your query sequence (i.e. no gaps in your query sequence in the alignment). Partial sequences should be flanked by X characters at the beginning and end of the sequence so that those positions are not considered gaps.
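The alignment requirements in the note above can be checked locally with a short Python sketch. It assumes the alignment is held as a list of equally long strings with the query first and "-" or "." as gap symbols; both function names are made up for this example and are not part of SIFT.

```python
def pad_with_x(fragment, offset, full_length):
    """Flank a partial sequence with X so it spans the full alignment length."""
    right = full_length - offset - len(fragment)
    if offset < 0 or right < 0:
        raise ValueError("fragment does not fit into the alignment")
    return "X" * offset + fragment + "X" * right


def check_sift_alignment(rows):
    """Return a list of problems; rows[0] is taken to be the query sequence."""
    problems = []
    query = rows[0]
    if "-" in query or "." in query:
        problems.append("query sequence contains gaps")
    if any(len(row) != len(query) for row in rows):
        problems.append("rows differ in length from the query")
    return problems


if __name__ == "__main__":
    rows = ["MKTAYIA", pad_with_x("TAY", 2, 7)]
    print(rows[1], check_sift_alignment(rows))
```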
CDART - Conserved Domain Architecture Retrieval Tool 
(NCBI)
CDART determines the domain architecture of a protein sequence by comparison to a database of conserved domain alignments, CDD, using RPS-BLAST. It then compares the protein's domain architecture to that of other proteins in NCBI's non-redundant sequence database, nr. Related sequences are identified as those proteins which share one or more similar domains. CDART displays these sequences using a graphical summary showing the types and locations of domains identified within each sequence, with links to the individual sequences and to further information on their domain architectures. 
DART searches the domain databases SMART and PFAM

NOTE: The initial output can be very large. You can restrict it to sequences containing only the domains you are interested in by clicking the checkboxes at the bottom of the results pages and pressing "Subset by selected domains".
CDD - Conserved Domain Database
(NCBI)
The CD-Search service is a very user-friendly program to identify the conserved domains present in a protein sequence. CDD can be searched either by a query protein sequence or by keyword searches. CDD currently contains domains derived from two popular collections, Smart and Pfam, plus contributions from the NCBI. The source databases also provide descriptions and links to citations. Since conserved domains correspond to compact structural units, CDs contain links to 3D-structure via Cn3D whenever possible.

NOTE: Which hits are reliable? Hits with an E-value lower than 0.01.
NOTE: If you run CD-Search from the query page rather than from the main page, you can specify parameters like E-value or search mode, which may significantly alter your output.
NOTE: CD-Search is now run by default in parallel with protein BLAST searches.

UPDATE: CD-Search was extended to also screen for similarities to known COG and KOG clusters of orthologous proteins. Please refer to the COG / KOG chapter for details.
FPS - Family Pairwise Search
(SDSC)
Family Pairwise search (FPS), provided by the San Diego Supercomputer Center, allows you to search a library of protein families with a protein sequence, GCG profile or BLAST checkpoint file that you provide.
You can choose between the following protein family libraries: SCOP, PROSITE, PFAM.
E-values let you easily determine the significance of a hit. The hits are directly linked to the SCOP database.
InterPro
(EBI) 
InterPro is an integrated documentation resource of protein families, domains, and functional sites. InterPro was created to integrate the major protein signature databases: Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs, PIRSF (PIR Superfamily), Superfamily, CATH, and PANTHER. Signatures are manually integrated into InterPro entries that are curated to provide biological and functional information. Annotation is provided in an abstract, Gene Ontology mapping, and links to specialized databases. InterPro covers over 78% of all proteins in the Swiss-Prot and TrEMBL components of UniProt.

1. Search InterPro:
1.1. Text Search:
- Search entries: InterPro is searched for: name, abstract, method name, method accession and InterPro entry accession
- Find protein matches: InterPro's proteins are searched for: protein ID (UniProt) short name and protein (UniProt) accession. The result of the search will be the overview match pages for the proteins which match the query.
- IPR: Go to the InterPro entry with the given accession number. Accession numbers are of the form IPRxxxxxx where the x's are digits.
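The accession format described in the last item (IPR followed by six digits) is easy to check before running a lookup. A minimal Python sketch, with a function name invented for this example:

```python
import re

# "IPR" followed by exactly six digits, as described above.
IPR_PATTERN = re.compile(r"IPR\d{6}")


def is_interpro_accession(text):
    """True if text has the form IPRxxxxxx, where each x is a digit."""
    return IPR_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["IPR000001", "IPR12345", "PF00069"]:
        print(candidate, is_interpro_accession(candidate))
```

Note that this only checks the form of the identifier, not whether such an entry exists in InterPro.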
1.2. InterProScan (Sequence Search):
This form allows you to query your protein or DNA sequence against InterPro. The user can select the applications to run on the query sequence.

2. InterPro output:
2.1. General: Each InterPro entry is described by one or more signatures, and corresponds to a biologically meaningful family, domain, repeat, or site. A site may be a PTM - post-translational modification, an active site or a binding site. InterPro entries are annotated with a name, an abstract, mapping to Gene Ontology (GO) terms, and links to specialized databases. InterPro groups all protein sequences matching related signatures into entries. All hits of the protein signatures in InterPro against a composite of Swiss-Prot and TrEMBL components of UniProt are precomputed.
2.2. Protein match view: For each protein signature, a list of proteins in UniProt that it matches is precomputed. The match lists may be viewed in different formats. Where structures are available, links to the corresponding PDB entries are shown. In addition, the graphical view indicates where the protein signatures correspond to structural chains. Protein structures can be visualized using the AstexViewer Java applet.
2.3. Domain architecture (IDA) viewer: is a graphical representation of protein domain architecture, shown as a series of non-overlapping domains. Note that clicking on the count of proteins retrieves all proteins sharing a common architecture.
2.4. Taxonomy viewer: provides an overview of the taxonomic range of the sequences associated with each InterPro entry.
2.5. 3D structural information: Mapping between UniProt and PDB entries can be many-to-many, so the "Structure" link displays all the PDB entries associated with that particular protein. The user is able to see the residue-by-residue mapping between UniProt and a PDB chain of interest.

Typical InterPro accessions: refer to section InterPro IDs.
Pfam
(Sanger)
including: iPfam and PfamAlyzer (CGB, Karolinska)
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
For each family in Pfam you can: Look at multiple alignments, view protein domain architectures, examine species distribution, follow links to other databases, and view known protein structures.
Pfam can be used to view the domain organisation of proteins. Notice that a single protein can belong to several Pfam families. 75% (May 2006) of protein sequences have at least one match to Pfam. This number is called the sequence coverage.
 
1. Pfam database organization:
Pfam is a database of two parts. Pfam-A is the curated part of Pfam, containing over 7973 protein families (May 2006). To give Pfam a more comprehensive coverage of known proteins, a supplement called Pfam-B is automatically generated. It contains a large number of small families taken from the ProDom database that do not overlap with Pfam-A. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found.

2. Search Pfam:
2.1. Protein name or sequence search:
- Note that only a UniProt name or accession number is accepted as input! If you do not know these IDs, first query UniProt or use the "Sequence Search" option.
- A protein sequence in FASTA format can be used as query.
- Note: To do large scale sequence searching against Pfam, you can upload a TEXT file (Not Word) in FASTA format. The results are emailed "within the next few days".
2.2. Keyword Search: This page allows you to query a variety of databases (Pfam, PROSITE, UniProt) for keywords. You can enter multiple words separated by whitespace into the box and these will be implicitly joined with a logical AND.
2.3. DNA sequence search:  This form allows you to compare your DNA sequence against the whole of Pfam using the Wise2 software package.
2.4. Domain query: The domain query is a simple way to find proteins with certain combinations of domains. For example you can get all proteins with a CBS domain and IMPDH domain with the query 'CBS and IMPDH'. Or if you want to find IMPDH proteins without a CBS domain use the query 'IMPDH not CBS'.
2.5. Taxonomy query: This is an easy to use way to restrict the output to Pfam entries specific for certain species.
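The logic behind domain queries such as 'CBS and IMPDH' or 'IMPDH not CBS' (item 2.4) is plain set logic. The Python sketch below illustrates it on a tiny invented table; the protein names and domain assignments are placeholders and not real Pfam data, and the function is not part of any Pfam interface.

```python
def domain_query(proteins, required, excluded=()):
    """Names of proteins that have all required domains and none of the excluded ones."""
    required, excluded = set(required), set(excluded)
    return sorted(
        name for name, domains in proteins.items()
        if required <= set(domains) and not excluded & set(domains)
    )


if __name__ == "__main__":
    # Hypothetical example data, for illustration only.
    proteins = {
        "protA": ["IMPDH", "CBS", "CBS"],
        "protB": ["IMPDH"],
        "protC": ["CBS"],
    }
    print(domain_query(proteins, ["CBS", "IMPDH"]))             # 'CBS and IMPDH'
    print(domain_query(proteins, ["IMPDH"], excluded=["CBS"]))  # 'IMPDH not CBS'
```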
                       
3. Browse Pfam:
3.1. Browse by Pfam family ID: Note that if the family you are interested in does not appear here, then try searching Pfam for it. The name of the family may have changed or you may know the family by a different name than Pfam.
3.2. Browse by Genomes:
- Compare Genomes: Pfam families for 2 or more species can be compared by clicking the 'Compare genomes' select box beside the species then clicking the 'Compare selected genomes' button at the top or bottom of the page.
- Single species: More detailed information about the Pfam families in a species can be obtained by clicking on the species name.

3.3. Browse by Pfam clans: A clan contains two or more Pfam families that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship is usually determined by similar tertiary structures, or when structures are not available, by common sequence motifs. Clans have been introduced as some protein families are very divergent, thereby making it very difficult to represent the family with a single HMM. These families are closely related, so sequences may significantly hit more than one member of the clan.
Note: All of the clan information and a list of the Pfam families that are members of the clan is contained in Pfam-C, an additional release flatfile.
3.4. Browse by Interaction: iPfam
iPfam is a resource that describes domain-domain interactions that are observed in PDB entries. The domains are defined by Pfam. When two or more domains occur in a single structure, the domains are analysed to see if they form an interaction. If the domains are close enough to form an interaction, the bonds forming the interaction are calculated. More information on how the bonds are calculated can be found in the help section. The interaction information is re-calculated at each Pfam release, so as Pfam changes, the information within iPfam is kept up to date. You can access the information in iPfam from each domain family page, or you can browse by domain interaction. The browse page also allows a search by domain name or accession.

4. Pfam Domain or Family Output:
A Pfam domain or family entry typically provides the basic description of the entry and several options for further data analysis.
- Pfam Clan: related Pfam families are grouped into clans ("superfamilies") (see below for details).
- Description of the entry including PubMed references.
- Gene Ontology information via QuickGO.
- Alignment: several formats to display family alignments and HMM logos.
- Domain organisation: view other proteins sharing the same domain organisation (meaning the same composition of domains in the same order along the protein sequence).
- Species distribution: view alignments and domain organisation by species. The species tree can be displayed using several depth values (sizes). Values in brackets represent the number of proteins containing the domain in the respective families.
- Phylogenetic Tree: The trees are generated using Quicktree.        
- 3D Structures: links to related 3D structures from PDB with the possibility to view these structures.

5. Pfam Clan Output:
- A Pfam clan summary page contains the description, annotation and membership of the clan. From this page, several data are retrievable:
- Family relationship diagram: Relationships between families are represented by solid lines (significant profile-profile comparison score) or dashed lines (non-significant profile-profile comparison score). Beside each line, the profile-profile comparison E-value score is indicated. This score is directly linked to a visualization of the profile-profile comparison alignment.
- Clan multiple sequence alignment: contains the seed alignments of all clan members. The alignments are colored using Jalview.

6. Special Pfam Tools:
6.1. PfamAlyzer:
PfamAlyzer, which is only available at the Karolinska mirror of Pfam, is a graphical user-friendly interface to Pfam. It contains most of the Pfam functionalities and features an extended domain query based on an intuitive graphical query language. PfamAlyzer adds taxonomic analysis functionality to the domain query.

Selected Pfam mirror servers:
- St. Louis (USA)
- Karolinska Institutet (Sweden)

Typical Pfam accessions: refer to section Pfam IDs.
PRINTS
(Manchester University)
including: SPRINT
PRINTS is a compendium of protein fingerprints, provided by the University of Manchester. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT / TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space.
Note that PRINTS is somewhat similar to BLOCKS as it also creates blocks of conserved / related protein sequences.

Note that you can access PRINTS in various ways, like by accession number, sequence, text, author, and more.
If you want to screen your sequence for PRINTS motifs, you may use FingerPRINTScan.
Note that PRINTS is also integrated in InterPro, but if you want to also see "non-significant" matches, you have to search directly at PRINTS !

SPRINT (Search PRINTS-S) provides an interface to the PRINTS-S database. PRINTS-S is the relational cousin of the PRINTS data bank of protein family fingerprints. In order for the search options to work effectively JavaScript must be enabled.

Typical PRINTS accessions: refer to section PRINTS IDs.
SCOP
(MRC, Cambridge, UK)
SCOP (Structural Classification of Proteins), provided by the Medical Research Council, Cambridge, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.

The SCOP database is organised as a tree structure. Entering at the top of the hierarchy the user can navigate through the levels of Class, Fold, Superfamily, Family and Species to the leaves of the tree which are structural domains of individual PDB entries. SCOP contains the domains of all PDB entries available at the time of the current release's construction. For each of these entries a coordinate file is available and can be displayed via the various graphical interfaces, like Chime or RasMol.

NOTE: SCOP is more or less designed like a big encyclopedia. For protein family searches using a query sequence, please refer to other programs like FPS or Superfamily !
SMART
(EMBL)
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues.

1. SMART database organization:
The basic data of SMART are high-quality manually derived alignments of protein domain families. In the form of hidden Markov models, these alignments are used to identify protein domains in sequence databases. The results are stored in the SMART database. The source databases are Swiss-Prot, SP-TrEMBL, and Ensembl.

2. Query SMART:
2.1. "Normal Mode": You may use either the Swiss-Prot/SP-TrEMBL sequence identifier or accession number (SwissProt, SwissProtNew, SpTrembl, SpTremblNew and Ensembl) or the protein sequence itself (copy/paste) to query the SMART service.
Alternatively, you can search for proteins with combinations of specific domains in different species or taxonomic ranges (like "TyrKc AND SH3 AND NOT SH2").
2.2. "Genomic Mode": The main difference to "normal mode" is the underlying protein database. In genomic mode, only the proteins from 170 completely sequenced genomes are included. This database has minimal redundancy and is therefore particularly useful for whole genome studies of domain architectures or single domain distributions.
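Note: A domain-combination query such as "TyrKc AND SH3 AND NOT SH2" amounts to a set test on the domain list of each protein. The following Python sketch illustrates the idea only; it does not use SMART's own query engine, and the function name and the protein labels are hypothetical.

```python
def matches_domain_query(domains, required, forbidden=()):
    """Return True if every required domain is present and no forbidden one is."""
    present = set(domains)
    return set(required) <= present and present.isdisjoint(forbidden)


# Hypothetical domain annotations, for illustration only.
proteins = {
    "protein_A": ["SH3", "TyrKc"],
    "protein_B": ["SH3", "SH2", "TyrKc"],
    "protein_C": ["SH3"],
}

# Equivalent of the query "TyrKc AND SH3 AND NOT SH2"
hits = [
    name
    for name, doms in proteins.items()
    if matches_domain_query(doms, required=["TyrKc", "SH3"], forbidden=["SH2"])
]
print(hits)
```

Only protein_A passes: protein_B carries the excluded SH2 domain, and protein_C lacks TyrKc.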

3. SMART output:
3.1. Domain architecture: SMART also offers very nice tools to explore other proteins with similar domain composition. At the output page, a graphical image showing identified domains is displayed. In addition, two links are provided, namely "Display all proteins with similar domain organisation" and "Display all proteins with similar domain composition" (note the difference !). A taxonomic tree is shown, where you can easily select those species and those proteins that you want to further investigate. You may either display the domain architecture or get a fasta formatted sequence file of these entries.
3.2. Catalytic amino acids: Protein sequences can be scanned for the presence of important catalytic amino acids, as essential catalytic sites are annotated for all enzymatic domains in SMART.
3.3. Protein interaction data: these are imported from the STRING database, in which known and predicted protein-protein associations are integrated from a variety of sources. The interactions include physical binding interactions as well as functional associations.
3.4. Prediction of protein disorder: DisEMBL predictions of intrinsic protein disorder are included in SMART's analysis methods.
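Note: The difference between "similar domain organisation" and "similar domain composition" (section 3.1) can be sketched in a few lines of Python. The sketch assumes that "organisation" means the same domains in the same linear order, and that "composition" means that each domain of the query occurs at least once, in any order; these definitions and the function names are assumptions of this example, not SMART code.

```python
def same_organisation(query, other):
    """Same domains in the same linear order along the sequence."""
    return list(query) == list(other)


def same_composition(query, other):
    """Every domain of the query occurs at least once in the other protein."""
    return set(query) <= set(other)


query = ["SH3", "SH2", "TyrKc"]
shuffled = ["SH2", "SH3", "TyrKc"]   # same domains, different order

print(same_organisation(query, shuffled))  # order differs
print(same_composition(query, shuffled))   # domain set is the same
```

A protein with shuffled domains therefore shows up in the composition list but not in the organisation list.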

NOTE: The SMART database is also integrated in other search routines like the NCBI-CDD Search and the InterPro search.

Typical SMART accessions: refer to section SMART IDs.
Superfamily
(MRC, Cambridge, UK)
The purpose of Superfamily is to provide structural (and hence implied functional) assignments to protein sequences at the superfamily level. This server does not attempt (at present) to distinguish between families within superfamilies, but is able to detect the broader and more distant relationships at the superfamily level. A superfamily contains all proteins for which there is structural evidence of a common evolutionary ancestor.
The server can be entered in three ways:
 1. Begin with a sequence (search the library).
 2. Begin with a superfamily (select from SCOP).
 3. Begin with a genome (select from list).
The output for option 1 lists the hits that your sequences make to models belonging to superfamilies, their alignments to the model, and the assigned genome sequences (a very instructive list of genomes in which a certain superfamily has already been described !).
TIGRFAMs
(TIGR)
TIGRFAMs are a collection of protein families featuring curated multiple sequence alignments, Hidden Markov Models (HMMs) and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family, where achievable, complements classification by orthologs, superfamily, domain or motif.

Use this page to see the curated seed alignment for each TIGRFAM, the full alignment of all family members and the cutoff scores for inclusion in each of the TIGRFAMs.
You can query TIGRFAMs by Text Search or query by sequence using the option Sequence Search. Note that by default, both databases (TIGRFAMs and PFAM) are searched.

Note: TIGRFAMs are automatically searched when performing InterPro queries (see InterPro description).
Note: PFAM is a collection of HMM models of protein families complementary to TIGRFAMs. PFAM models are constrained to be non-overlapping with one another and thus are more likely to describe domains rather than full-length proteins.


Motifs 1 - Integrated Search
ELM - Eukaryotic Linear Motif Resource
(EMBL and others)
ELM is the largest collection of linear protein motifs, followed by PROSITE and Scansite. ELM is a resource for predicting functional sites in eukaryotic proteins. Putative functional sites are identified by patterns (regular expressions). Please note that "regular expressions" are similar to the PROSITE patterns, but have a slightly different syntax. Please refer to the ELM-Help for information ! ELM is a co-operation between EMBL and several other institutions.

ELM is easy to query: you enter either a valid SwissProt/TrEMBL ID or AC, or a protein sequence. You may also specify the species and the cellular compartment, if known.
In general, predictions of short linear motifs in protein sequences have to be taken with even more caution than those of large globular domains, as represented in databases like Pfam and SMART. NOTE the following statement at the ELM homepage: "...This means that most matches shown are more likely to be false positives than true matches. We hope that ELM server results will prove useful as guides to experimentation but they should not be treated as factual findings."
Therefore, it is and will be essential to improve the significance of the results via usage of filters. ELM currently has 3 filters:
- Taxonomic filter: each ELM is annotated with one or more NCBI taxonomy identifiers  to indicate its known phylogenetic distribution. If the user provides a query species, all ELMs are filtered out which do not belong to this lineage.
- Cell compartment filter: If the user specifies the compartment in which the query protein functions, all ELMs not found in this compartment are filtered out.
- Globular domain filter: This filter is based on the fact that many functional motifs are found in disordered (unstructured) regions of proteins, but not in globular domains, which are collected in Pfam or SMART. Therefore, all matches within predicted Pfam or SMART domains are filtered out. However, some patterns like phosphorylation sites are often found in exposed loops of globular domains, so users should always also look at the unfiltered ELM results.
Note: If you want to scan your query sequence for disordered (unstructured) regions, you may have a look at the programs DisEMBL and GlobPlot.
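Note: The three filters described above can be thought of as successive tests applied to each motif match. The Python sketch below illustrates this logic under simplifying assumptions; the class, the function, the motif labels and the coordinates are all hypothetical and do not reflect the ELM server's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class MotifMatch:
    name: str                # hypothetical motif label
    start: int               # 1-based start in the query sequence
    end: int                 # 1-based inclusive end
    taxa: frozenset          # lineages in which the motif is known
    compartments: frozenset  # compartments in which the motif is known


def apply_filters(matches, lineage=None, compartment=None, domains=()):
    """Keep matches passing the taxonomic, compartment and globular domain filters.

    lineage:  taxa of the query species and its ancestors (None = no filter)
    domains:  (start, end) ranges of predicted globular domains
    """
    kept = []
    for m in matches:
        if lineage is not None and not (m.taxa & set(lineage)):
            continue  # motif not known in the lineage of the query species
        if compartment is not None and compartment not in m.compartments:
            continue  # motif not known in this compartment
        if any(s <= m.start and m.end <= e for s, e in domains):
            continue  # match lies entirely within a globular domain
        kept.append(m)
    return kept


matches = [
    MotifMatch("motif_1", 10, 15, frozenset({"Eukaryota"}), frozenset({"nucleus"})),
    MotifMatch("motif_2", 50, 55, frozenset({"Eukaryota"}), frozenset({"cytosol"})),
    MotifMatch("motif_3", 120, 126, frozenset({"Eukaryota"}), frozenset({"nucleus"})),
]
kept = apply_filters(
    matches,
    lineage={"Eukaryota", "Metazoa"},
    compartment="nucleus",
    domains=[(100, 200)],
)
print([m.name for m in kept])
```

Calling apply_filters(matches) without any filter arguments returns all matches, which corresponds to the unfiltered view that should always be inspected as well.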
Motif Scan
(MyHits, SIB)
Motif Scan is part of MyHits, an extension of Hits, a free database devoted to protein domains. Motif Scan searches simultaneously for profiles and patterns in PROSITE profiles, PROSITE patterns, Pfam and the Gribskov collection. There is a nice multi-color output and a zoomable graphical display of the matches, and significant matches are extra-coded.
PIR
(Georgetown University Medical Center, Washington)

including:
Pattern Search

Peptide Match
The Protein Information Resource (PIR) produces the PIR-International Protein Sequence Database (PIR-PSD) - a comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain. It provides an integration of sequences, functional, and structural information to support genomics and proteomics research.

Pattern Search of the PIR database performs two different tasks. First, Pattern Search can search the iProClass database for proteins sharing a certain user-defined pattern. There is also a comprehensive help file on how to write a peptide pattern. In the output list, each entry can be selected individually, and there are options to generate a common FASTA sequence file as well as ClustalW multiple alignments and domain architecture images.
Second, Pattern Search can search your query sequence for known patterns against the PROSITE database.

Peptide Match is a program which retrieves protein sequences via exact peptide match (meaning that you know the exact peptide sequence without IUPAC codes etc.). Peptide Match searches either UniProt or UniRef databases for matching protein sequences. The output and download options are similar to Pattern Search.
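Note: An exact peptide match is, in essence, a plain substring search over a set of sequences. The following Python sketch shows the principle on an in-memory dictionary; the function name, entry names and sequences are made up for illustration, and the real Peptide Match service works on indexed UniProt/UniRef data rather than like this.

```python
def peptide_match(peptide, sequences):
    """Return {entry: [1-based start positions]} for exact occurrences of peptide."""
    hits = {}
    for entry, seq in sequences.items():
        positions = []
        pos = seq.find(peptide)
        while pos != -1:
            positions.append(pos + 1)
            pos = seq.find(peptide, pos + 1)  # step by one to allow overlapping hits
        if positions:
            hits[entry] = positions
    return hits


# Hypothetical entries, for illustration only.
sequences = {
    "entry_1": "MKTAYIAKQR",
    "entry_2": "AAAA",
}
print(peptide_match("AYIA", sequences))
```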
PPSearch
(EBI)
PPSearch, provided by the EBI, scans a sequence against the PROSITE protein profile database allowing also a graphical output.
PROSITE
(ExPASy)

including
ScanPROSITE
The PROSITE database consists of a large collection of biologically meaningful signatures that are described as patterns or profiles. Each signature is linked to documentation that provides useful biological information on the protein family, domain or functional site identified by the signature. This helps to reliably identify to which known protein family (if any) a new sequence belongs.

1. Signature types used in PROSITE:
Patterns or regular expressions are useful tools to identify short and well-conserved regions, such as catalytic sites, binding sites, post-translational modifications (PTMs) or zinc fingers. Note: Patterns need to be updated regularly to introduce new variabilities in the regular expression as new sequences are added.

2. ScanPROSITE scans protein sequence(s) (either from the UniProt Knowledgebase (Swiss-Prot/TrEMBL), from PDB, or provided by the user) for the occurrence of patterns, profiles and rules (motifs) stored in the PROSITE database, or searches protein database(s) for hits by specific motif(s). This means that ScanPROSITE does not only scan a user sequence but may also be used to retrieve ALL proteins sharing a certain domain or motif.
Note: The PROSITE pattern syntax rules are described in this help section.
Note: Further documentation on how to write a peptide pattern is available at the PIR database along with the documentation of the program Pattern Search.
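Note: Because PROSITE patterns are close relatives of regular expressions, a pattern can be translated into a regular expression and applied to a sequence locally. The Python sketch below covers only the core syntax elements (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repetition, and < or > for terminal anchors at the start or end of the pattern); the function name is hypothetical, and rarer constructs are not handled. For authoritative results, use ScanPROSITE itself.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # [ABC] and single letters are already valid regular expressions.
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print(bool(re.search(regex, "AANGSAA")))
```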

Typical PROSITE accessions: refer to section PROSITE IDs.
Scansite
(MIT)

Scansite searches for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains. 

1) The program MotifScan utilizes an entropy approach that assesses the probability of a site matching the motif using the selectivity values and sums the logs of the probability values for each amino acid in the candidate sequence. The program then indicates the percentile ranking of the candidate motif with respect to all potential motifs in proteins of a protein database. So, the smaller the percentage value, the better the identified hit. You may also scan by accession number or ID.
Note that you can choose between 3 stringency levels (high, medium, low) ! A high stringency setting limits the motifscanner to only show you the candidate motifs that have scores that fall in the top 0.2% of scores within the whole SWISS-PROT vertebrate database. Medium stringency has a threshold limit of the top 1%, while low stringency has a threshold limit of the top 5%.
Note that there is also a Tutorial available for MotifScan.
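Note: The relation between percentile ranking and stringency level can be sketched as follows. The cutoffs (0.2%, 1%, 5%) are those quoted above; everything else, including the function names and the assumption that a lower raw score is a better score, is hypothetical and not taken from the Scansite code.

```python
# Cutoffs in percent, as quoted in the description above.
STRINGENCY_CUTOFFS = {"high": 0.2, "medium": 1.0, "low": 5.0}


def percentile_rank(score, background_scores):
    """Percentage of background scores that are as good as or better than score.

    Assumption of this sketch: lower raw scores are better.
    """
    as_good_or_better = sum(1 for s in background_scores if s <= score)
    return 100.0 * as_good_or_better / len(background_scores)


def passes(percentile, stringency):
    """True if a hit with this percentile is shown at the given stringency."""
    return percentile <= STRINGENCY_CUTOFFS[stringency]


def strictest_level(percentile):
    """Return the strictest stringency level that still shows the hit, or None."""
    for level in ("high", "medium", "low"):
        if passes(percentile, level):
            return level
    return None


print(strictest_level(0.15), strictest_level(0.5), strictest_level(3.0), strictest_level(12.0))
```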

2) Database search using a Scansite motif: You may also search databases (Swiss-Prot, TrEMBL, Ensembl) for proteins bearing a certain motif. From the output list, you can directly perform a MotifScan of the proteins of interest.



Motifs 2 - Motif Discovery
NOTE: Many computational motif discovery tools work both on sets of DNA and PROTEIN sequences. Therefore, ALL of these programs are listed in one section "Motif Discovery" which is part of the "Gene Analysis" main page. The prediction of over-represented motifs is widely used in promoter analyses, hence there are also close relations between these two topics.


Motifs 3 - Modification 
Linkpage 1 - ExPASy
(SIB)
ExPASy (Expert Protein Analysis System) is a proteomics server provided by the SIB (Swiss Institute of Bioinformatics). Among many other links, it also provides a very good list of links concerning protein post-translational modification prediction.
Please also have a look at the ExPASy main section.
NetChop
(CBS, Denmark)
The NetChop WWW server produces neural network predictions for cleavage sites of the human proteasome.
NetNglyc
(CBS, Denmark)
The NetNglyc WWW server predicts N-Glycosylation sites in human proteins using artificial neural networks that examine the sequence context of Asn-Xaa-Ser/Thr sequons.
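Note: Candidate Asn-Xaa-Ser/Thr sequons can be located with a simple regular expression before submitting a sequence to the server. The Python sketch below takes Xaa to be any residue except proline, which is an assumption of this example; it only lists candidate sequons and does not reproduce the neural network prediction of NetNglyc. The function name and the test sequence are made up.

```python
import re

# Asn-Xaa-Ser/Thr with Xaa != Pro (assumption); the lookahead allows overlapping sequons.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(sequence):
    """Return the 1-based positions of the Asn residue of each candidate sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


print(find_sequons("MNGSNPTNNT"))
```

Whether a candidate sequon is actually glycosylated depends on its sequence context, which is what the neural networks of the server evaluate.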
NetOglyc
(CBS, Denmark)
NetOglyc performs prediction of potential sites for O-glycosylation in mammalian proteins.
NetPhos
(CBS, Denmark)
The NetPhos WWW server produces neural network predictions for serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins. 
PESTfind Analysis Webtool
(EMBnet Austria)
PESTfind Analysis Webtool is provided by the Austrian EMBnet node. The PESTfind algorithm allows rapid and objective identification of PEST motifs in protein target sequences. Briefly, the PEST hypothesis was based on a literature survey that combined both information on protein stability as well as protein primary sequence information. Initially, the study relied on 12 short-lived proteins with well-known properties, but was continually extended later. Although all these proteins exerted various different cellular functions it became apparent that they shared high local concentrations of amino acids proline (P), glutamic acid (E), serine (S), threonine (T) and to a lesser extent aspartic acid (D). From that it was concluded that PEST motifs reduce the half-lives of proteins dramatically and hence, that they target proteins for proteolytic degradation.
PESTFIND and EPESTFIND (EMBOSS, Pasteur)
PESTFIND finds PEST motifs as potential proteolytic cleavage sites in proteins.

EPESTFIND allows rapid and objective identification of PEST motifs in protein target sequences. Briefly, the PEST hypothesis was based on a literature survey that combined both information on protein stability as well as protein primary sequence information. The initial group of proteins studied included E1A, c-myc, p53, c-fos, v-myb, and others. Although all these proteins exerted various different cellular functions it became apparent that they shared high local concentrations of amino acids proline (P), glutamic acid (E), serine (S), threonine (T) and to a lesser extent aspartic acid (D). From that it was concluded that PEST motifs reduce the half-lives of proteins dramatically and hence, that they target proteins for proteolytic degradation.
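Note: The local enrichment in P, E, S, T and D residues described above can be made visible with a sliding-window composition count. The Python sketch below is NOT the PESTfind algorithm, only a rough illustration of the underlying idea; the window size, the threshold and the function names are arbitrary choices of this example.

```python
PEST_RESIDUES = set("PESTD")


def pest_fraction_windows(sequence, window=12):
    """Yield (1-based start, fraction of P/E/S/T/D residues) for each window."""
    seq = sequence.upper()
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        yield i + 1, sum(aa in PEST_RESIDUES for aa in chunk) / window


def enriched_windows(sequence, window=12, min_fraction=0.75):
    """Return the windows whose P/E/S/T/D fraction reaches min_fraction."""
    return [
        (start, fraction)
        for start, fraction in pest_fraction_windows(sequence, window)
        if fraction >= min_fraction
    ]


# Artificial sequence, for illustration only.
print(enriched_windows("PESTPESTPESTKKKK", window=12, min_fraction=1.0))
```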
RESID Database
(EBI and NCI-FCRF)
The RESID Database of Protein Modifications is a comprehensive collection of annotations and structures for protein modifications and cross-links including pre-, co-, and post-translational modifications. The database provides: systematic and alternate names, atomic formulas and masses, enzymatic activities that generate the modifications, keywords, literature citations, Gene Ontology (GO) cross-references, protein sequence database feature table annotations, structure diagrams, and molecular models.
Each RESID Database entry presents a chemically unique modification and shows how that modification is currently annotated in the protein sequence databases, Swiss-Prot and the Protein Information Resource (PIR). The RESID Database provides a table of corresponding equivalent feature annotations that is used in UniProt.
The RESID Database can be searched via keywords (like "palmitoylation" or "palmitate") using an "SRS-based" interface.

NOTE: The RESID Database is not a site where you can screen a query sequence for potential modifications, but it is the largest catalog available in this field, which also strongly aims at a "standardized vocabulary" for protein modifications.    
RESID is also available as SRS-database.

Typical RESID accessions: refer to section RESID IDs.
SignalP
(CBS, Denmark)
The SignalP World Wide Web server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.
Note: There is a good explanation of the SignalP output available.


Motifs 4 - Localization
NOTE: This section lists programs which predict protein localization based on sequence motifs, in contrast to databases which store localization data of proteins based on laboratory experiments. These are described in the section "Protein Localization Databases" !
Linkpage 1 - CUBIC: Columbia Univ. Bioinf. Center
(Columbia University, New York) 
The CUBIC (Columbia Univ. Bioinformatics Center) provides a whole list of services and databases with emphasis on prediction of subcellular localization and the structural prediction of proteins.
ER-GolgiDB
(Columbia University, New York) 
ER-GolgiDB is a database of predictions for Endoplasmic Reticulum and Golgi Apparatus localization based on sequence homology to experimentally annotated proteins. The assigned localization is inferred from the homologue that most accurately predicts localization for the protein and the accuracy based on HSSP distance threshold is provided. Subsets of membrane and lumenal predictions are also provided.

You may either search specific parts of the database (like ER membrane subset) or download the whole database. Unfortunately, only lists of proteins can be displayed (incl. links to individual entries); there is no option like in-batch FASTA sequence download.
LOCtarget
(Columbia University, New York) 
LOCtarget is a database of predicted subcellular localization for potential targets for structural genomics from TargetDb.

You may either search or browse the LOCtarget database, or you may submit your own FASTA protein sequence for localization prediction.
Subcellular localization is currently predicted using four different methods: predictNLS (nuclear localization signal), LOChom (using homology), LOCkey (using keywords) and LOCnet (neural network based prediction). The reported localization is based on the method which predicts localization of a given protein with the highest confidence. Please note that the output is emailed to you and is quite short, without giving a lot of details.
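Note: Reporting the prediction of the most confident method amounts to taking a maximum over the methods that returned a result. The Python sketch below illustrates this; the function name and the confidence values are hypothetical, and it assumes that the confidences of the four methods are directly comparable, which may not hold for the real service.

```python
def report_localization(predictions):
    """Pick the prediction with the highest confidence.

    predictions maps a method name to (localization, confidence),
    or to None if the method made no prediction.
    """
    scored = {method: p for method, p in predictions.items() if p is not None}
    if not scored:
        return None
    best = max(scored, key=lambda method: scored[method][1])
    localization, confidence = scored[best]
    return best, localization, confidence


# Hypothetical results, for illustration only.
predictions = {
    "predictNLS": None,
    "LOChom": ("cytoplasm", 0.62),
    "LOCkey": None,
    "LOCnet": ("nucleus", 0.81),
}
print(report_localization(predictions))
```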
NESbase
(CBS, Denmark)
NESbase is a database of proteins in which the presence of Leucine-rich nuclear export signal (NES) has been experimentally verified. It is curated from literature.
The Link "Database in HTML" lists all entries (proteins) that are contained in this database.
Note that there is NO option to predict a NES in a user's query sequence (in contrast to PredictNLS for nuclear localization signals). The widely accepted NES consensus is: L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI]. So, you may scan your sequence "by hand" for the presence of this motif.
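Note: Instead of scanning "by hand", the consensus L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI] can be written as a regular expression and applied with a few lines of Python. This is a minimal sketch; the function name and the test sequence are made up, and a consensus match is only a candidate NES, not an experimentally verified one.

```python
import re

# L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI] as a regular expression;
# the lookahead reports matches at every start position, including overlapping ones.
NES_CONSENSUS = re.compile(r"(?=(L.{2,3}[LIVFM].{2,3}L.[LI]))")


def find_nes_candidates(sequence):
    """Return (1-based start, matched segment) for each start position that matches."""
    return [
        (m.start() + 1, m.group(1))
        for m in NES_CONSENSUS.finditer(sequence.upper())
    ]


# Artificial sequence, for illustration only.
print(find_nes_candidates("GGLAAVAALALGG"))
```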
NMP-db
(Columbia University, New York) 
NMP-db is a database of nuclear matrix associated proteins. You may either search or download the whole or specific subsets of the database.
PredictNLS
(Columbia University, New York) 
PredictNLS is an automated tool for the analysis and determination of Nuclear Localization Signals (NLS). You submit a protein sequence or a potential NLS. PredictNLS then predicts whether your protein is nuclear, or checks whether your potential NLS is found in the PredictNLS database.
ProtComp
(Softberry)
ProtComp is a program for the prediction of subcellular localizations of proteins.
PSORT
(Tokyo University)
PSORT is one of the best known programs for analysis of protein sorting signals and prediction of subcellular localization. PSORT receives an amino acid sequence and its source origin as inputs. Then, it analyzes the input sequence by applying the stored rules for various sequence features of known protein sorting signals. Finally, it reports the possibility for the input protein to be localized at each candidate site, with additional information.

1. PSORT.org provides links to the PSORT family of programs for subcellular localization prediction.


2. PSORT II is the current version of the "standard" PSORT program. PSORT II only accepts single sequences as input. Note that PSORT II comes with a highly instructive user manual explaining the diverse predictions.

3. WoLF PSORT is a recently updated version of PSORT II for localization prediction of eukaryotic sequences.

Note: PSORT provides a quite detailed output as compared to e.g. LOCtarget.
Note that PSORT II is also available at the Pasteur Institute.
Note that PSORT II is also integrated in the "data super-integration tool" Bioinformatic Harvester of the EBI.

4. iPSORT is a program for classification of eukaryotic N-terminal sorting signals. Given a protein sequence, it will predict whether it contains a Signal Peptide (SP), Mitochondrial Targeting Peptide (mTP), or Chloroplast Transit Peptide (cTP).
TargetP
(CBS, Denmark)
TargetP predicts the subcellular location of eukaryotic protein sequences. The subcellular location assignment is based on the predicted presence of any of the N-terminal presequences chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP) or secretory pathway signal peptide (SP).
TOPO Viewer
(H-InvDB)
TOPO Viewer is a Java applet for viewing both sub-cellular targeting signals predicted by PSORT II and TargetP, as well as the presence of trans-membrane helices predicted by SOSUI and TMHMM.
Please note that test runs showed that the TOPO Viewer works well with MS Internet Explorer, but not with Netscape 7.0 !

TOPO Viewer is a tool which is integrated into the H-Invitational Database (H-InvDB), which provides an integrative annotation of full-length cDNA clones. You first have to get the specific database entry of your gene of interest, either via BLAST (sequence) search or via keyword search, and then look for the section "Prediction of subcellular localization" within the so-called "cDNA view". You will get a tabular summary of the results of the different prediction programs as well as a link to open the TOPO Viewer.
Please also refer to the H-InvDB section at the Data Integration page for a detailed description !
   
                         
Protein Localization Databases
NOTE: This section lists databases which store localization data of proteins based on laboratory experiments, in contrast to programs which predict protein localization based on sequence motifs. These are described in the section "Motifs 4 - Localization" !
NOTE: This section is related to the section "RNA Localization Databases" listing resources which store RNA localization images based on in situ hybridization experiments.
GFP-cDNA
(EMBL and DKFZ) 
GFP-cDNA is an ongoing project of localising novel GFP-tagged human cDNA products to subcellular compartments of the eukaryotic cell. This information provides an entry point for many other downstream functional assays that are designed and implemented for the subsets of new proteins localising to defined subcellular organelles.
Images of all localised proteins and their bioinformatic analysis can be viewed via the ‘Results Table’ or ‘Results Images’ buttons. In addition, a search window can be used to find proteins containing features or motifs of particular interest to you that have been localised in this project.

NOTE: Protein Localization images are also integrated in the data super-integration tool Bioinformatic Harvester; please refer to the main section of this tool for details !
NOTE: The names of GFP-cDNA entries are clone names, which mostly give no hint about the nature of the proteins. If you want to extract the complete list of all localized proteins via the Bioinformatic Harvester, you may use the following "trick": enter "pepperkok" as search term (derived from Rainer Pepperkok, one of the two project heads, together with Jeremy Simpson). You may also perform combined searches like "pepperkok golgi" or "pepperkok endoplasmic", and select the checkbox "AND search".
HPR - Human Protein Atlas
(HPR program, Sweden)
HPR - the Human Protein Atlas - contains hundreds of thousands of images of protein expression in normal human tissues and cancer cells. The Swedish Human Proteome Resource (HPR) program, funded by the Knut and Alice Wallenberg Foundation, has been set up to allow the systematic exploration of the human proteome with Affinity (Antibody) Proteomics, combining high-throughput generation of affinity-purified (mono-specific) antibodies with protein profiling using tissue arrays. The basic concept of this resource centre is to produce specific antibodies to human target proteins using a high-throughput method that involves the cloning and expression of protein epitope signature tags.

Query:
At the top of the page, you'll find information about HPR, descriptions and annotations, as well as useful information on image-usage policies. Available proteins (genes) can be reached through a specific search (by gene/protein name/id or classification, such as kinase or protease) or by browsing the individual chromosomes.
Output:
The data are presented as high-resolution images representing immunohistochemically stained tissue sections. The final goal is to produce datasets for all of the about 22,000 different proteins, one for each human gene. The vision, as indicated on the Human Protein Atlas site, is “...to enable the systematic generation of quality assured antibodies to all non-redundant human proteins and to use these reagents to functionally explore human proteins, protein variants and protein interactions.”

Typical HPR accessions: refer to section HPR IDs.