Bioinformatics World FAQ Center
  FAQ Index -> PROTEINS
                -> PROT1...know which domains and motifs can be found in my protein query sequence ? (last update May 15, 2006)
                -> PROT2...know all proteins which contain a certain domain or motif in their sequence ? -> see RET1 !
                -> PROT3...get a structural prediction for my protein of interest ? (last update Jul. 21, 2005)
                -> PROT4...know which protein family my protein belongs to ? -> see GENOM2 !
                -> PROT5...screen a batch of protein sequences for transmembrane regions ? (last update Mar. 30, 2006)
                -> PROT6...predict the subcellular localization or retrieve experimental localization data of my protein of interest ? (last update May 15, 2006)
                -> PROT7...know which protein domains are present / overrepresented in my gene set of interest ? (last update Jan. 24, 2006)      
                   
                        
               

PROT1...know which domains and motifs can be found in my protein query sequence ? (last update May 15, 2006)
 
1. Domains and Motifs - integrated search:

    Tip! A very good starting point is InterPro. InterPro is a valuable tool that searches simultaneously in Pfam, PRINTS, ProDom, PROSITE, SMART, SWISS-PROT, TIGRFAMs, PIRSF (PIR Superfamily), and Superfamily for domains, families, repeats and short sequence motifs. To perform a sequence search, click on the InterProScan option, and either enter (or cut and paste) your protein sequence into the text box, or, if you have the sequence in a file on your computer (e.g. in *.txt format), click the 'Browse' button to upload it directly. Make sure to enter your email address, even if you would like to receive your results via "interactive run". You will receive a table of hits in the different databases and a graphical representation of their positions within the query sequence. In Pfam, you will find alignments of all known members sharing a certain protein domain. In SMART, you can produce alignments in diverse formats, generate an alignment consensus sequence, group by species, make a subcellular localization prediction, or even produce a FASTA-formatted sequence file of the proteins of choice.

    Tip! If you are focused on human proteins, you may use HPRD - Human Protein Reference Database. You may "Query HPRD" using either a protein name, gene symbol, or accession number (like Entrez Gene or Swiss-Prot). The output will present, among a lot of other information, a graphical image of the protein, showing the positions of domains, motifs, transmembrane regions, signal peptides, and sites of post-translational modification. NOTE: Keep in mind that the output shows only those features which have been manually annotated, as HPRD is strongly focused on manual annotation from literature references! Please refer also to the HPRD section at the Pathways page for information. Please note that there should also be a "BLAST HPRD" option, which was not functional at the time of testing.

2. Domain search:
 
    Please note that by using one of the databases listed in InterPro individually, you may be able to retrieve additional or more distantly related matches by adjusting the sensitivity parameters yourself. If you are looking for protein domains only, you may use one of the following databases.

    Tip! Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. For each family in Pfam you can: Look at multiple alignments, view protein domain architectures, examine species distribution, follow links to other databases, and view known protein structures.
    To address this question, you may perform a protein name or sequence search. Note that only a UniProt name or accession number is accepted as input for the name search! If you do not know these IDs, first query UniProt or use the "Sequence Search" option, which accepts a protein sequence in FASTA format as query. The output offers a multitude of follow-up analysis tools; please refer to the Pfam main section for details.

    SMART (Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues.

    Tip! CDD (Conserved Domain Database) is another valuable resource, provided by the NCBI. The CD-Search service is a very user-friendly program to identify the conserved domains present in a protein sequence. CDD can be searched either by a query protein sequence or by keyword searches. CDD currently contains domains derived from two popular collections, SMART and PFAM, plus contributions from the NCBI. The source databases also provide descriptions and links to citations. Since conserved domains correspond to compact structural units, CDs contain links to 3D-structure via Cn3D whenever possible.

    CDART determines the domain architecture of a protein sequence by comparison to a database of conserved domain alignments, CDD, using RPS-BLAST. It then compares the protein's domain architecture to that of other proteins in NCBI's non-redundant sequence database, nr. Related sequences are identified as those proteins which share one or more similar domains. CDART displays these sequences using a graphical summary showing the types and locations of domains identified within each sequence, with links to the individual sequences and to further information on their domain architectures. CDART searches the domain databases SMART and PFAM. Note that the first output can be very large. However, you can restrict it to sequences containing only the domains you are interested in by ticking the checkboxes at the bottom of the results pages and pressing "Subset by selected domains".

3. Motif Search:

3.1. Motifs - integrated search:   

    In general, predictions of short linear motifs in protein sequences have to be treated with even more caution than those of large globular domains, as represented in databases like Pfam and SMART. Historically, there was one major database collecting short protein motifs, PROSITE. Since then, other databases addressing this topic have emerged, like ELM and Scansite.

    There are several ways to scan your query sequence for PROSITE patterns. The PROSITE homepage provides a simple query form, whereas the program ScanPROSITE allows you to scan a protein sequence using advanced options, or to search protein databases with a user-entered pattern. You can also use the tool Motif Scan, which searches simultaneously in PROSITE profiles, PROSITE patterns, and Pfam. It has a nice multi-color output and a zoomable graphical display of the matches, and significant matches are specially marked. PPSEARCH, provided by the EBI, is yet another tool which scans a sequence against the PROSITE protein profile database (and allows a graphical output).

    Tip! ELM (Eukaryotic Linear Motif Resource) has developed into the largest collection of linear protein motifs, followed by PROSITE and Scansite. ELM is a resource for predicting functional sites in eukaryotic proteins. Putative functional sites are identified by patterns (regular expressions), which have a slightly different syntax than PROSITE patterns. ELM is easy to query: you enter either a valid SwissProt/TrEMBL ID or AC, or a protein sequence. You may also specify the species and the cellular compartment, if known, and thereby "activate" filters which are designed to reduce the number of false positive hits. Please refer to the ELM chapter at the main page for additional details.
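
    To make the difference between PROSITE pattern syntax and ordinary regular expressions more tangible, here is a minimal Python sketch that translates a PROSITE-style pattern into a regular expression and scans a sequence with it. It only handles the basic elements (x, [..], {..}, repeats like x(2,4), and the < and > anchors at the ends of a pattern), it is not the code used by ScanPROSITE or ELM, and the function names and the test sequence are invented for illustration.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")   # '<' anchors at the N-terminus
        anchor_end = element.endswith(">")       # '>' anchors at the C-terminus
        element = element.strip("<>")
        # Split off an optional repetition suffix such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                           # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"       # {P} = any residue except P
        # [ST] and single letters are already valid regex syntax.
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)

def scan(sequence: str, pattern: str):
    """Return (1-based start position, matched residues) for every match."""
    regex = prosite_to_regex(pattern)
    # The lookahead also reports matches that overlap each other.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer("(?=(" + regex + "))", sequence.upper())]

# N-glycosylation site pattern, written in PROSITE syntax.
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(scan("MKNGSAANPTLNLTQ", "N-{P}-[ST]-{P}"))
```

    As with the web tools, a match to such a short pattern is only a hint: most hits of a four-residue pattern in an arbitrary sequence are not functional sites.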

    Scansite (at MIT) is a database of motifs within proteins that are likely to be phosphorylated by specific protein kinases or to bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains. The program MotifScan can be queried with either a protein accession/ID or a sequence. It then reports the percentile ranking of each candidate motif with respect to all potential motifs in the proteins of a reference database, so the smaller the percentage value, the better the hit. Note that you can choose between 3 stringency levels (high, medium, low). A high stringency setting limits MotifScan to candidate motifs whose scores fall in the top 0.2% of scores within the whole SWISS-PROT vertebrate database. Medium stringency has a threshold of the top 1%, and low stringency a threshold of the top 5%. Database search using a Scansite motif: You may also search databases (Swiss-Prot, TrEMBL, Ensembl) for proteins bearing a certain motif. From the output list, you can directly perform a MotifScan of the proteins of interest.
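
    The relation between the percentile value and the three stringency levels can be written down in a few lines of Python. This is only an illustration of the cut-offs quoted above (0.2%, 1%, 5%); the function name is invented, and treating a value exactly on a cut-off as passing is an assumption, not documented Scansite behaviour.

```python
# Percentile cut-offs quoted in the text (a hit must lie in the top x % of scores).
STRINGENCY_CUTOFFS = [("high", 0.2), ("medium", 1.0), ("low", 5.0)]

def strictest_passing_level(percentile: float):
    """Return the most stringent setting under which a hit would still be
    listed, or None if it fails even the low-stringency cut-off.
    Assumption: a value exactly equal to a cut-off counts as passing."""
    for level, cutoff in STRINGENCY_CUTOFFS:
        if percentile <= cutoff:
            return level
    return None

for value in (0.15, 0.8, 3.0, 7.5):
    print(value, strictest_passing_level(value))
```

    In other words, a hit that is listed at high stringency is also listed at medium and low stringency, but not the other way round.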

3.2. Motifs - specialized searches:

    In addition to the "global" motif databases, there are many sites that collect information and provide prediction tools for individual motifs, like in the fields of post-translational modifications, and protein targeting / localization. Please refer to the respective chapters at the main page to see detailed lists: Motifs 3 - Modification and Motifs 4 - Localization (also refer to FAQ PROT6 !).

Main Index  FAQ Index  



PROT2...know all proteins which contain a certain domain or motif in their sequence ? -> see RET1

    This is actually a matter of "Sequence retrieval", so please refer to RET1 .




PROT3...get a structural prediction for my protein of interest ? (last update Jul. 21, 2005)

    This question essentially covers programs for secondary structure prediction of proteins, although the borders between secondary and tertiary structure prediction are often not very sharp (as you will see below in e.g. PredictProtein). There are a few tools for this purpose, which are also summarized at the appropriate ExPASy linkpage for Secondary Structure Prediction. Note that additional information for 3D prediction will be given in the section "3D Structures". Also note that the prediction of transmembrane regions is described individually in FAQ PROT5.

    Tip! Jpred takes either a single protein sequence or a multiple alignment of protein sequences, and predicts secondary structure (helices, sheets, turns, coiled coils, transmembrane regions, solvent accessibility...). It works by combining a number of modern, high-quality prediction methods to form a consensus. With a single button click it runs a series of programs, such as PHD, PREDATOR, NNSSP, MULPRED, ZPRED, JNET, COILS, MULTICOIL, and PHDhtm (TM prediction). In general, predictions work better for multiple alignments than for single sequences. Therefore, single sequences are first used to create automatic multiple alignments with the best hits in the non-redundant database, and the prediction algorithms are then run on this alignment. The output is quite compact and is presented in several formats (HTML, PostScript, Java). NOTE: Input sequences should be in UPPER CASE letters, as some test runs using lower-case letters did not function properly!
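
    If your sequence file contains lower-case letters, it can be converted before submission. The following is a small Python sketch (the function name is invented for illustration) that upper-cases the sequence lines of a FASTA text and leaves the header lines, which start with ">", untouched.

```python
def uppercase_fasta(text: str) -> str:
    """Upper-case sequence lines and keep FASTA header lines as they are."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(line)                  # header line: do not modify
        else:
            out.append(line.strip().upper())  # sequence line
    return "\n".join(out) + "\n"

print(uppercase_fasta(">my_protein test entry\nmkvlAAgi\nacde\n"))
```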

    Tip! PredictProtein, provided by Columbia University, New York, is a program that predicts the secondary structure of proteins (helices, sheets, solvent accessibility, PROSITE motifs, low-complexity regions) AND performs similarity searches to identify related sequences from databases. Similar to Jpred, PredictProtein runs a series of programs with a single button click, such as PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, and EvalSec. Note that even the primary result is quite long, and it takes some time to extract the essential information. Note also that the program provides a series of "additional features" like ProDom domain search, transmembrane region prediction, and GLOBE prediction of globularity. In addition, there is an "intermediate result page" which allows the submission of your sequence via a single-page interface to a variety of other servers by using the so-called META PP submission page, or by manually choosing individual sites. As an example, you are able to send your sequence to SwissModel, in order to try a tertiary structure prediction by sequence homology to known protein structures (homology modelling). Alternatively, you may perform a 3D prediction by threading, using Loopp and Superfamily. Note that Jpred is also available here.

    Tip! META PP, also provided by Columbia University, New York, allows a "one-button" submission of your sequence via a single-page interface to a variety of servers, for the purpose of secondary and tertiary structure prediction. The linked servers include SWISS-MODEL, Superfamily, DAS, JPRED, PHD, PROF and more. You will receive individual emails containing the results of these predictions.




PROT4...know which protein family my protein belongs to ? -> see GENOM2 !



               
PROT5...screen a batch of protein sequences for transmembrane regions ? (last update Mar. 30, 2006)      

    There is a whole list of programs which deal with the prediction of transmembrane (TM) regions in proteins, and some of them can also perform such an analysis in batch mode, meaning that you can analyze many protein sequences at once. In general, if you have just a handful of sequences, I would recommend using several programs and comparing the results. If you want a quick test sequence, you may enter that of the erythrocyte anion exchanger, showing 12 TM helices: RefSeq NP_000333. A personal series of test runs yielded the following result: TMHMM produced the best predictions (being quite "conservative"), followed by SOSUI, whereas TMAP, TMPred, and TopPred were all less stringent and proposed too many TM regions.
   
1. Programs supporting batch queries:

    Tip! The SOSUI system, provided by the Tokyo Univ. of Agriculture and Technology, is a tool for secondary structure prediction of membrane proteins from a protein sequence. The basic idea of prediction is based on the physicochemical properties of amino acid sequences such as hydrophobicity and charges. The system deals with three types of prediction: discrimination of membrane proteins from soluble ones, prediction of existence of transmembrane helices and determination of transmembrane helical regions. SOSUI has a very nice graphical output which shows the hydropathy profile, the "helical wheel representation", and the possible membrane topology. Note that there is also an interface for in-batch sequence submission, allowing the input of a multi-FASTA file. In this case, there is no graphical output but a simple table listing the number and positions of potential TM regions. Note that SOSUI is also integrated in the "data super-integration tool" Bioinformatic Harvester of the EBI.  
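
    To give an idea of what a hydropathy profile is, here is a minimal Python sketch of a sliding-window average over the Kyte-Doolittle hydrophobicity scale. This is NOT the SOSUI algorithm, which combines several physicochemical properties; the window length of 19 residues and the cut-off of 1.6 are a commonly quoted rule of thumb for this scale and are used here as assumptions, and the function names are invented.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence: str, window: int = 19):
    """Average hydropathy of every window of the given length.
    Raises KeyError for non-standard residue letters."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def hydrophobic_windows(sequence: str, window: int = 19, threshold: float = 1.6):
    """1-based start positions of windows whose average reaches the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i + 1 for i, score in enumerate(profile) if score >= threshold]

# Artificial test sequence: charged flanks around a stretch of leucines.
print(hydrophobic_windows("K" * 10 + "L" * 21 + "K" * 10))
```

    Runs of consecutive window positions above the cut-off mark candidate TM segments; dedicated programs such as SOSUI or TMHMM go well beyond this simple profile.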

    Tip! TMHMM, provided by the CBS, Denmark, is a program for prediction of transmembrane helices in proteins, providing the option to submit many proteins at once in one FASTA file. Please limit each submission to at most 4000 proteins, and note that you should de-select the graphical output when submitting many sequences to speed up processing. TMHMM produces a very nice graphical and tabular output, and discriminates between "inside" and "outside" helices.
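
    If your dataset exceeds the limit of 4000 proteins per submission, the multi-FASTA file has to be split first. The following Python sketch shows one way to do this; the function names and the output file naming are invented for illustration, and only the limit of 4000 is taken from the text above.

```python
def read_fasta_records(text: str):
    """Split multi-FASTA text into a list of records (header plus sequence lines)."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                records.append("\n".join(current))
            current = [line]                 # start a new record
        elif line.strip() and current:
            current.append(line.strip())     # sequence line of the open record
    if current:
        records.append("\n".join(current))
    return records

def split_into_batches(records, max_per_batch: int = 4000):
    """Group records into consecutive batches of at most max_per_batch entries."""
    return [records[i:i + max_per_batch]
            for i in range(0, len(records), max_per_batch)]

# Usage sketch with an artificial input of 9 sequences and a batch size of 4.
demo = "".join(">seq%d\nMKVLAAGI\n" % i for i in range(9))
for number, batch in enumerate(split_into_batches(read_fasta_records(demo), 4), 1):
    # Each batch could be written to its own file, e.g. batch_1.fasta (hypothetical name).
    print("batch_%d.fasta" % number, len(batch), "sequences")
```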

    TopPred2, provided by the Pasteur Inst. as an EMBOSS tool, is a program for prediction of transmembrane helices in proteins. It provides a lot of control options, like different hydrophobicity scales, and many output formats, like membrane topology, hydropathy profile, lists of hydrophobicity values, and more. Note that, although not explicitly stated, TopPred2 can also be queried using a multiple-sequence FASTA file as input for batch analysis.

    Tip! A quite different approach would be to use the BioMart data retrieval system for this purpose. Please refer to the BioMart description at the main page for detailed information. In this case, you would need a list of accession numbers of your proteins (like Entrez Gene, RefSeq or many others), and then filter the output list to show only proteins with transmembrane regions (at the "Filter" page, section "Protein"). Note that this procedure does not make a de novo TM prediction but relies on information stored in the database for each protein. Ensembl uses TMHMM (see above) for the prediction of TM regions.

2. Programs working only on single sequences:
                              
    TMPred, provided by the Swiss EMBnet, is a program for prediction of transmembrane helices in proteins and their orientation. TMPred works only on single sequences, and provides different possible models of TM count and orientations.

    TMAP, provided by the Karolinska Inst., Sweden, predicts transmembrane helices from multiple sequence alignments or from single sequences. The alignment should be in GCG format (.MSF). Note the difference between a multiple alignment as input (meaning that the sequences have to be homologous and pre-aligned) and multiple single sequences as input (see e.g. TMHMM)!
               
                     

              
PROT6...predict the subcellular localization or retrieve experimental localization data of my protein of interest ? (last update May 15, 2006)      

    In principle, there are 3 different ways to approach this question. The first one uses programs which aim to predict the localization of a protein within the cell according to specific sequence motifs and patterns. The second uses data mining of the literature to retrieve localization information. The third approach employs databases which store the results of large-scale protein localization experiments as microscopy image files and / or tabular data.
   
1. Protein localization prediction:

    The programs which are relevant here are described at the Main Index within the section Motifs 4 - Localization. You may want to refer to this section to try additional resources. Note that the start page of PSORT.org (see below) is also a general linkpage to resources for protein localization prediction.

    Tip! PSORT is one of the best known programs for analysis of protein sorting signals and prediction of subcellular localization. PSORT takes an amino acid sequence and its source origin as inputs. Then, it analyzes the input sequence by applying the stored rules for various sequence features of known protein sorting signals. Finally, it reports the possibility for the input protein to be localized at each candidate site, with additional information. PSORT.org provides links to the PSORT family of programs for subcellular localization prediction. PSORT2 is the current version of the "standard" PSORT program. PSORT only accepts single sequences as input. Note that PSORT II comes with a highly instructive user manual explaining the diverse predictions. WoLF PSORT is a recently updated version of PSORT II for the prediction of eukaryotic sequences. Also note that PSORT provides a quite detailed output as compared to e.g. LOCtarget. Note that PSORT II is also available at the Pasteur Institute.
    Note that PSORT2 is also integrated in the "data super-integration tool" Bioinformatic Harvester of the EBI.

    Tip! LOCtarget is a database of predicted subcellular localization for potential targets for structural genomics from TargetDb. You may either search or browse the LOCtarget database, or you may submit your own FASTA protein sequence for localization prediction. Subcellular localization is currently predicted using four different methods: predictNLS (nuclear localization signal), LOChom (using homology), LOCkey (using keywords) and LOCtree (prediction based on hierarchical support vector machines). The reported localization is based on the method which predicts the localization of a given protein with the highest confidence. Note: Up to 100 protein sequences can be submitted at a time. If more than 10 sequences are submitted, the job is run in low priority mode.

2. Protein localization by literature data mining:

    Tip! HPRD - Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data.
    Each protein entry in HPRD is composed of several "tabs" which correspond to specific data. The "Summary" tab includes data like subcellular localization, including a link to the corresponding reference.

3. Protein localization and tissue expression databases:
                              
    The resources which are relevant here are described at the Main Index within the section Protein Localization Databases. Note that this part of the FAQ is somewhat related to FAQ EXP20, which lists resources that store RNA localization images based on in situ hybridization experiments.

    Tip! GFP-cDNA is an ongoing project of localising novel GFP-tagged human cDNA products to subcellular compartments of the eukaryotic cell. This information provides an entry point for many other downstream functional assays that are designed and implemented for the subsets of new proteins localising to defined subcellular organelles. Images of all localised proteins and their bioinformatic analysis can be viewed via the ‘Results Table’ or ‘Results Images’ buttons. In addition, a search window can be used to find proteins containing features or motifs of particular interest to you that have been localised in this project. Note that Protein Localization images are also integrated in the data super-integration tool Bioinformatic Harvester; please refer to the main section of this tool for details. The names of GFP-cDNA entries are clone names, which mostly give no hint about the nature of the proteins. If you want to extract the complete list of all localized proteins via the Bioinformatic Harvester, you may use the following "trick": enter "pepperkok" as search term (derived from Rainer Pepperkok, one of the two project heads, together with Jeremy Simpson). You may also perform combined searches like "pepperkok golgi" or "pepperkok endoplasmic", and select the checkbox "AND search".

    Tip! HPR - the Human Protein Atlas, contains hundreds of thousands of images of protein expression in normal human tissues and cancer cells. Note that only tissue expression is shown, NOT subcellular localization! The Swedish Human Proteome Resource (HPR) program, funded by the Knut and Alice Wallenberg Foundation, has been set up to allow the systematic exploration of the human proteome with Affinity (Antibody) Proteomics, combining high-throughput generation of affinity-purified (mono-specific) antibodies with protein profiling using tissue arrays. The basic concept of this resource centre is to produce specific antibodies to human target proteins using a high-throughput method that involves the cloning and expression of protein epitope signature tags.
    At the top of the page, you'll find information about HPR, descriptions and annotations, as well as useful information on image-usage policies. Available proteins (genes) can be reached through a specific search (by gene/protein name/id or classification, such as kinase or protease) or by browsing the individual chromosomes. The data are presented as high-resolution images representing immunohistochemically stained tissue sections. The final goal is to produce datasets for all of the about 22,000 different proteins, one for each human gene. The vision, as indicated on the Human Protein Atlas site, is “...to enable the systematic generation of quality assured antibodies to all non-redundant human proteins and to use these reagents to functionally explore human proteins, protein variants and protein interactions.” An example (human PTGS2) can be seen in section HPR IDs.
                       
                     

               
PROT7...know which protein domains are present / overrepresented in my gene set of interest ? (last update Jan. 24, 2006)   

    This question is, in a sense, the "batch version" of FAQ PROT1. It refers to larger datasets, like a cluster of genes from a microarray experiment, where the user wants to extract the protein domains which are involved. In addition, this FAQ lists programs which test for an overrepresentation of protein domains as compared to a reference gene/protein set. Therefore, this FAQ links to resources which are described in the FAQ section "Pathways, Interactions, Functions", especially in FAQ PATH1.
   
1. Annotation without statistical overrepresentation:

    BioMart, a powerful data retrieval tool, can also be used to extract the protein domains for large gene/protein datasets. Please refer also to the BioMart section at the main page for information. You may, for example, paste a list of Entrez Gene IDs, Affymetrix IDs etc. corresponding to the genes selected from a microarray experiment into the field "ID list limit" at the "Filter" page of BioMart. Then, at the "Features" page of the output, you may choose to selectively display the protein domain data in the final result table, like InterPro ID, InterPro short description, and PFAM ID. If you are interested in protein motifs, you may select PROSITE ID instead. You can choose between different output formats: Text, HTML, or MS Excel. Note: Each gene is listed separately in the result file, together with all associated protein domains, BUT there is NO examination of "overrepresented" domains.
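
    Once such a table has been saved as tab-delimited text, a simple summary of how many genes carry each domain can be produced with a few lines of Python. The sketch below is only an illustration: the column names "Gene ID" and "Pfam ID" and the example rows are made up, so the names have to be adapted to the header of your own export.

```python
import csv
import io
from collections import defaultdict

def domains_per_gene(tsv_text: str, gene_col: str = "Gene ID", domain_col: str = "Pfam ID"):
    """Collect the set of domain IDs for each gene from tab-delimited text.
    The column names are assumptions; adapt them to your export."""
    genes = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row[domain_col]:                    # skip genes without a domain entry
            genes[row[gene_col]].add(row[domain_col])
    return dict(genes)

def genes_per_domain(genes: dict):
    """Count in how many genes each domain occurs."""
    members = defaultdict(set)
    for gene, domains in genes.items():
        for domain in domains:
            members[domain].add(gene)
    return {domain: len(gene_set) for domain, gene_set in members.items()}

# Invented example table: three genes, one of them without a domain entry.
example = "Gene ID\tPfam ID\ngene1\tPF00069\ngene1\tPF00017\ngene2\tPF00069\ngene3\t\n"
print(genes_per_domain(domains_per_gene(example)))
```

    Such counts only describe the gene set itself; they say nothing about whether a domain is more frequent than expected.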

2. Annotation and statistical overrepresentation:
                              
    Tip! WebGestalt is a "WEB-based GEne SeT AnaLysis Toolkit". WebGestalt incorporates information from different public resources and provides an easy way for biologists to make sense of large sets of genes. It enables biologists to manipulate integrated information and find patterns that are not detectable otherwise. WebGestalt is designed for functional genomic, proteomic, and large-scale genetic studies from which high-throughput data are continuously produced. It currently works for human and mouse. WebGestalt is free for academic use after registration. NOTE: If you have already registered for GOTM, you can use this login! In general, save and download options are more versatile than in e.g. DAVID. WebGestalt incorporates practically ALL FIELDS of functional annotation: GO, Pathways, Co-citation Networks, and even protein domain data and expression data! Taken together, WebGestalt is an excellent resource for functional annotation of gene datasets. Please refer to the WebGestalt main section for details.
    Regarding this specific question, WebGestalt offers two different approaches. After uploading and analyzing the gene dataset, the Gene set information retrieval tool generates a user-defined annotation table, somewhat similar to BioMart. In order to include protein domain information, the attribute "Function Info -> Domain" has to be selected. In contrast, the Protein Domain Table, which is part of the Gene set organization tool, lists those protein domains which are overrepresented in the query gene set as compared to a premade reference file. If, for example, a gene set is derived from an Affymetrix HG-U133A array experiment, the reference set "WEBGESTALT_HG_U133A" should be used. In addition, the user may select a significance level, and may choose a minimum number of genes for the enriched PFAM protein domains. PFAM protein domains with fewer genes will not be reported as enriched. The default setting is 2.
    NOTE: Obviously, the "Protein Domain Table" shows fewer hits in the results file than the extraction of protein domains via the "Gene set information retrieval tool", even when selecting "1" as "minimum number of genes"! Thus, if you want to get ALL protein domains, you should use the latter tool.
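
    To illustrate what "overrepresented as compared to a reference set" means, here is a minimal Python sketch of a one-sided hypergeometric test, which is a common choice for this kind of enrichment analysis. Whether WebGestalt computes its Protein Domain Table in exactly this way is not stated here, so please read this as an assumption. The function names, the significance level of 0.01 and all counts in the example are invented; only the default of 2 for the minimum number of genes is taken from the text above.

```python
from math import comb

def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Probability of seeing at least k genes with the domain in a set of n genes
    drawn from a reference of N genes, K of which carry the domain."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total

def enriched_domains(set_counts, ref_counts, set_size, ref_size,
                     alpha: float = 0.01, min_genes: int = 2):
    """Return {domain: p-value} for domains passing both the minimum gene
    count and the significance level (no multiple-testing correction)."""
    hits = {}
    for domain, k in set_counts.items():
        if k < min_genes:
            continue                     # too few genes, not reported
        p = hypergeom_pvalue(k, set_size, ref_counts[domain], ref_size)
        if p <= alpha:
            hits[domain] = p
    return hits

# Invented counts: a gene set of 50 genes against a reference of 1000 genes.
set_counts = {"domA": 8, "domB": 1, "domC": 2}
ref_counts = {"domA": 20, "domB": 5, "domC": 500}
print(enriched_domains(set_counts, ref_counts, set_size=50, ref_size=1000))
```

    In this made-up example only "domA" is reported: "domB" falls below the minimum number of genes, and "domC" is present in two genes but is so common in the reference set that this is not surprising. This also shows why an enrichment table lists fewer domains than a plain annotation table.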
                 