Bioinformatics World FAQ Center
  FAQ Index -> SEQUENCE SIMILARITY
                -> SIM1...search databases with a batch of sequences using BLAST or other tools ? (last update May 5, 2006)
                -> SIM2...search databases using short sequences like oligos and peptides ? (last update May 5, 2006)
                -> SIM3...produce a multiple sequence alignment from a group of related sequences ? (last update Sep. 19, 2006)
                -> SIM4...produce sequence logos from a group of related sequences ? (last update May 9, 2006)
                -> SIM5...test the specificity of a PCR primer pair within a genome? -> see DNA4 !      
                -> SIM6...match my query sequence very quickly with a whole-genome assembly ? (last update May 5, 2006)
                -> SIM7...search for (also distant) orthologs/homologs of my gene/protein of interest ? -> see GENOM2 !      
           
                       
                       

SIM1...search databases with a batch of sequences using BLAST or other tools ? (last update May 5, 2006)
               
    This question is also covered on NCBI's BLAST FAQ page and in the Appendix of the BLAST Program Selection Guide, which includes a "quickview-table" listing the pros and cons of the different options. Batch searching means submitting, for example, 20 sequences with a single click instead of one after the other. Similar programs and interfaces are available from other web portals. For BLAST, three options can be distinguished (1-3); other search tools are listed below them.

1. Web-based BLAST-program and web-based databases:

    MegaBLAST: This program is optimized for aligning highly similar nucleotide sequences that differ slightly as a result of sequencing or other similar "errors". MegaBLAST is the fastest BLAST program because it defaults to a large word size (an exact match of 28 bases is required to initiate an extension). In addition, it uses a fast gapped extension algorithm that has no gap existence cost, only a gap extension cost. Especially with larger word sizes, it is up to 10 times faster than the standard BLAST programs. Word size is roughly the minimal length of an identical match that an alignment must contain in order to be found by the algorithm. MegaBLAST is most efficient with word sizes of 16 and larger, although word sizes as low as 8 can be used. MegaBLAST can also handle much longer DNA sequences efficiently than the blastn program of the traditional BLAST algorithm.
    MegaBLAST is good for scanning a large number of EST-type sequences (about 500 bp in length) against a large database in search of the closest matches. You can import a file of EST sequences in FASTA format, or a list of GenBank accessions or GIs, and have them compared to the BLAST databases. The default output is an easily reviewable Hit Table format, although you can download and save the results in standard pairwise HTML or any of the other output formats. Web MegaBLAST is available from the main BLAST web page, where a description is also available. MegaBLAST is also part of the Standalone BLAST executables, and an option in the Network BLAST client (see below).
    Note: MegaBLAST works only for blastn searches of nucleotide databases. There is NO batch option at the NCBI BLAST web site for protein sequences. For this purpose, other options have to be selected (as discussed in the Appendix of the BLAST Program Selection Guide).
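
    The role of the word size described above can be illustrated with a small sketch. This is not MegaBLAST's actual implementation, only a toy seed finder under the assumption that a "word hit" is an exact shared substring of the given length; the function name find_seed_hits and the two example sequences are made up for illustration.

```python
def find_seed_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs sharing an exact word.

    Toy illustration of word-based seeding, not the real BLAST code.
    """
    # Index every overlapping word of the subject by its start position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    # Look up every overlapping word of the query in that index.
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


# Hypothetical sequences that share an exact stretch of only 7 bases.
query = "ACGTACGTTAGC"
subject = "TTACGTACGATAGC"

for w in (4, 7, 8):
    print(w, len(find_seed_hits(query, subject, w)))
```

    In this example the shared 7-base stretch produces seeds at word sizes 4 and 7 but none at word size 8, which is why a larger word size is faster but only finds closely matching sequences.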

    The Ensembl BLASTView is a good tool if you have a series of sequences (max. 30) you want to search "in-batch", and if you want to have full and easy-to-understand control over all available options, like E-values, dust filtering, repeats and more. You can follow the status of every sequence individually. You can even define the columns which will be displayed in the summary output table. But note that the structure of databases is different from e.g. NCBI BLAST; there are databases like "Ensembl cDNAs" or "Ensembl genome", but not e.g. "dbEST" ! The link "Contig view" will reveal the region of interest in the Ensembl graphical genome browser, which is also highly customizable but has some drawbacks like the display of EST sequences. On the other hand, there are some very nice features like the positions of individual Affymetrix ProbeSets, a sub-classification of repeats, multiple export options, and a direct connection to the other major genome browsers, UCSC and NCBI via the "Jump to" button.

2. Local BLAST-program and local databases:

    Standalone BLAST executables: The Standalone BLAST executables are command-line programs which run BLAST searches against locally downloaded copies of the NCBI BLAST databases. The programs will handle a single large file with multiple FASTA query sequences, or you can write a script that submits multiple files one at a time. The executables are available for a wide variety of platforms, including many "flavors" of UNIX (Linux, Solaris, etc.), Windows PC and even Mac OS X. The Standalone executables are available at the anonymous FTP location: ftp://ftp.ncbi.nih.gov/blast/executables/. NOTE that with this option you need to provide sufficient disk space, and you have to update your local copies of the databases regularly!

3. Local BLAST-program but web-based databases:

    BLAST Network Client 'blastcl3': The BLAST Network client allows you to submit a file of FASTA sequences over an internet connection to the NCBI BLAST databases. The BLAST Network client executables are located at NCBI's BLAST executables FTP site, and the most recent file versions are located in the subdirectory "LATEST". Depending on the system you are using, you simply have to download the appropriate file (for Windows PC this is: netblast-2.2.6-ia32-win32.exe). Once it is downloaded, double-click the exe-file, which extracts several files. One of them is the actual program (blastcl3.exe).
    In order to execute the program, you first have to open the DOS-prompt window of your system, then change to the directory where you saved the blastcl3.exe file. Now you have to type a command line at the DOS prompt, which looks a bit weird at first sight but is not that complicated. In principle, you have to define the same parameters as in the BLAST web interface: BLAST program, database, input file name, and output file name. NOTE that the input sequence file must be in FASTA format (each sequence starting with a header line beginning with >seqname). You can therefore collect your sequences one after the other in a single file and save it in plain-text (*.txt) format, ideally in the same directory as the blastcl3.exe file.
    A simple command line would look like this: ">blastcl3 -p blastn -d nr -i test.txt -o out.test". The -p argument defines the program (in this case blastn, so you have a nucleotide query), the -d argument defines the database (in this case nr, the non-redundant database), -i is the input sequence file, and -o is the output file. The output file can be opened in a standard word processor. Note that the output file can become very large (all BLAST hits of all 20 input sequences are written into one output file) if you do not set arguments to restrict its size. To do so, you can add "-v", which defines the number of one-line descriptions, and "-b", which defines the number of pairwise alignments to show. In addition, you can produce an HTML output file by using the argument "-T" with the option "T". Note that many arguments allow two options: "T" (true) or "F" (false). The complete command line would now look like this: ">blastcl3 -p blastn -d nr -i test.txt -o out.test -v 50 -b 20 -T T". Note that when using the HTML option, the output file must still be renamed (from out.test to test.html) in order to be readable by a browser.
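
    If many such runs are needed, the command line can be assembled by a small script. The following is a minimal sketch that uses only the arguments named above (-p, -d, -i, -o, -v, -b, -T); the function name build_blastcl3_command is made up for illustration, and the sketch only builds and prints the command, it does not run the program.

```python
def build_blastcl3_command(program, database, infile, outfile,
                           descriptions=None, alignments=None, html=False):
    """Assemble the blastcl3 argument list described in the text.

    Only the arguments mentioned in this FAQ are used.
    """
    cmd = ["blastcl3", "-p", program, "-d", database,
           "-i", infile, "-o", outfile]
    if descriptions is not None:
        cmd += ["-v", str(descriptions)]   # number of one-line descriptions
    if alignments is not None:
        cmd += ["-b", str(alignments)]     # number of pairwise alignments
    if html:
        cmd += ["-T", "T"]                 # HTML output
    return cmd


cmd = build_blastcl3_command("blastn", "nr", "test.txt", "out.test",
                             descriptions=50, alignments=20, html=True)
print(" ".join(cmd))
```

    The printed line matches the complete command line given above. On a system where blastcl3 is installed, the same list could presumably be passed to Python's subprocess.run, but that step is left out here because it depends on the local setup.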

4. Web-based BLAT-program and web-based (genome) databases:

    BLAT was developed at the UCSC (University of California, Santa Cruz). BLAT uses query sequences to quickly find sequences of 95% or greater similarity over a length of 40 bases or more. BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome in memory. The index consists of all non-overlapping 12-mers except for those heavily involved in repeats. BLAT can be used to locate DNA or protein sequences extremely fast within the genomes of the UCSC genome browser. For single queries, only DNA sequences of 25,000 or fewer bases and protein or translated sequences of 10,000 or fewer letters will be processed.
    BLAT can also be used for batch sequence searches against the UCSC genome databases. An entire set of query sequences can be looked up simultaneously when provided in FASTA format. Up to 25 sequences can be submitted at the same time. The total limit for multiple sequence submissions is 50,000 bases or 25,000 letters. The main advantage is that BLAT works extremely fast, even in "batch mode".
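
    For larger query sets, the limits stated above (25 sequences and 50,000 bases per submission, 25,000 bases per single DNA sequence) mean that the input has to be split into several submissions. A minimal sketch of such a split is shown below; the function name split_into_batches and the record layout (name, sequence) are assumptions made for illustration, and the limits are simply the DNA numbers quoted in this FAQ.

```python
def split_into_batches(records, max_seqs=25, max_total=50000, max_single=25000):
    """Group (name, sequence) records into batches within the stated limits."""
    batches, current, total = [], [], 0
    for name, seq in records:
        if len(seq) > max_single:
            raise ValueError(f"{name} exceeds the single-sequence limit")
        # Start a new batch when adding this record would break a limit.
        if current and (len(current) == max_seqs or total + len(seq) > max_total):
            batches.append(current)
            current, total = [], 0
        current.append((name, seq))
        total += len(seq)
    if current:
        batches.append(current)
    return batches


# Hypothetical input: 60 short sequences of 1,000 bases each.
records = [(f"seq{i}", "A" * 1000) for i in range(60)]
print([len(batch) for batch in split_into_batches(records)])
```

    Each resulting batch can then be written out as one FASTA file and submitted separately.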
                       
Main Index  FAQ Index
                      
                       
SIM2...search databases using short sequences like oligos and peptides ? (last update May 5, 2006)

    If you want to search databases using short sequences (oligos, peptides), try programs whose parameters are preset for this purpose, because with standard settings such short matches are normally filtered out as "unspecific" hits.

    The NCBI BLAST page provides a series of general and special BLAST programs, suitable for different purposes and requirements. In order to run similarity searches using short oligos or peptides, 2 special programs are available within the sections "Nucleotide BLAST" and "Protein BLAST". You should note that both of the following links are very long and may be unstable; if so you can easily find the programs starting from the BLAST-"Home" page.
    BLAST at NCBI - Search for short, nearly exact sequences (Nucleotide) can be used with oligo sequences as input. A short query is more likely to occur by chance in the database. Therefore, increasing the Expect value threshold and lowering the word size are often necessary before any results are returned. Low-complexity filtering is also switched off here, since it would filter out a large percentage of a short sequence, leaving little or no query sequence to search with.
    BLAST at NCBI - Search for short, nearly exact sequences (Peptide) can be used with peptide sequences as input. For short protein sequence searches, the scoring matrix is additionally changed to PAM-30, which is better suited to finding short regions of high similarity.
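
    Why a short query "is more likely to occur by chance" can be made concrete with a rough estimate. The sketch below is not the statistic BLAST itself uses for its Expect value; it only assumes a random database with equal base frequencies and counts the expected number of exact matches on one strand. The function name and the database size of 3 billion bases are assumptions for illustration.

```python
def expected_chance_matches(query_length, database_size, alphabet_size=4):
    """Expected number of exact chance matches of a query in random sequence.

    Simplified model: every position is independent and all letters are
    equally frequent. Not the Expect value computed by BLAST.
    """
    positions = max(database_size - query_length + 1, 0)
    return positions * (1.0 / alphabet_size) ** query_length


database_size = 3_000_000_000  # hypothetical, roughly one mammalian genome
for n in (12, 16, 20, 25):
    print(n, expected_chance_matches(n, database_size))
```

    Under these assumptions a 12-mer is expected to occur well over a hundred times by chance, while a 20-mer is expected far less than once, which is why short queries need a more permissive Expect threshold before any hit is reported.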

   PHI-BLAST (Pattern-hit initiated BLAST) searches a database looking only for alignments that include a specified pattern. This means that in order to run PHI-BLAST, both a query protein sequence AND a peptide pattern ("PHI pattern") which is part of the query sequence must be provided. PHI-BLAST then searches the database for hits which contain both the pattern AND an "overall homology" to the query sequence, thereby eliminating hits which are based on the (short) pattern alone. Please refer to the PHI-BLAST main section for details.

   BlastView is an Ensembl-based program which provides an integrated platform for sequence similarity searches against Ensembl databases. The search sensitivity can be selected in a very user-friendly way, and one of the options is called "near exact matches (oligo)". With this option, the program can be used to match short oligo sequences onto the genome.

    The Protein Information Resource (PIR) provides, among many others, two tools to search protein databases using short peptide sequences (Peptide Match) or user-defined patterns (Pattern Search).
    Peptide Match is a program which retrieves protein sequences via exact peptide match (meaning that you know the exact peptide sequence without IUPAC codes etc.). Peptide Match searches either UniProt or UniRef databases for matching protein sequences. The output and download options are similar to Pattern Search.
    Pattern Search performs two different tasks. First, Pattern Search can search the iProClass database for proteins sharing a certain user-defined pattern. There is also a comprehensive help file on how to write a peptide pattern. In the output list, each entry can be selected individually, and there are options to generate a common FASTA sequence file as well as ClustalW multiple alignments and domain architecture images. Second (not relevant for this FAQ), Pattern Search can search your query sequence for known patterns against the PROSITE database.
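
    To get a feeling for what a user-defined peptide pattern does, it can help to try it locally on a few sequences first. The sketch below assumes that the pattern is written in PROSITE-style syntax (elements separated by "-", "x" for any residue, "[...]" for allowed and "{...}" for forbidden residues, "(n)" or "(n,m)" for repeats); whether a given tool accepts exactly this syntax should be checked in its help file. The function name prosite_to_regex and the example pattern and sequence are made up for illustration.

```python
import re

_ELEMENT = re.compile(
    r"(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]")
print(regex)
match = re.search(regex, "AACGGCKLMVAA")
print(match.start(), match.group())
```

    Only a subset of the syntax is handled here; anything the sketch does not recognize raises an error instead of being silently misread.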

NOTE: Programs which match PAIRS of primers onto genomes ("In silico PCR") are described in section "Primers and In silico PCR" !   
     
       
                 
SIM3...produce a multiple sequence alignment from a group of related sequences ? (last update Sep. 19, 2006)
   
      Tip! The best-known program for this purpose is ClustalW. There are several ways to access the program; one of them is at the EBI. You should first create a plain text file (for example in WORD, saved as text) into which the input sequences are copied one after the other in FASTA format (each sequence starting with a header line beginning with >seqname; delete numbers and blank lines!). Alternatively, you may download a sequence set using programs like NCBI-Entrez Protein. All sequences must be in the correct orientation; ClustalW does NOT perform reverse/complements!
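
    The preparation steps mentioned above (removing numbers and blank lines, bringing all sequences into the same orientation, writing FASTA) can also be scripted. The following is a minimal sketch; the helper names clean_sequence, reverse_complement and to_fasta, as well as the example input, are made up for illustration, and the reverse complement handles plain DNA/RNA letters plus N only.

```python
def clean_sequence(raw):
    """Keep letters only, dropping position numbers, spaces and line breaks."""
    return "".join(ch for ch in raw if ch.isalpha()).upper()


_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(seq):
    """Reverse/complement a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with a >name header line."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical input copied from a numbered sequence display.
raw = "  1 acgtac gtta\n 11 ggc"
seq = clean_sequence(raw)
print(to_fasta("seq1", seq), end="")
print(to_fasta("seq1_rc", reverse_complement(seq)), end="")
```

    Which sequences actually need to be reverse/complemented still has to be decided by the user beforehand, for example from a pairwise comparison.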
    At the ClustalW input page, you can adjust a number of options. If you choose "Input" under "output order", the sequences will be listed according to their position within the input file, whereas if you choose "aligned", the sequences will be ordered according to their similarities. You should choose "ALN without numbers" as output format if you want to color the alignment using the program Boxshade (please see FAQ GRAPH1!).
    ClustalW output includes the sequence alignment itself, which can be transferred via COPY/PASTE to a word processing program like MS-WORD and manually edited there before coloring in Boxshade. The alignment may also be inspected using a Java applet called JalView, which provides a number of options to sort and color the alignment. However, it is not possible to download the images which are produced. ClustalW also produces a phylogenetic tree of your input sequences, showing evolutionary distances. In addition, you may want to display stretches of high conservation using a tool that produces sequence logos. Please refer to question SIM4 for that purpose.

    NOTE: There are good tutorials (including references to ClustalW and other alignment programs) at the EBI educational portal 2can, for the following topics: ClustalW Nucleotide alignment, ClustalW Protein alignment, how to find the "optimal alignment", what are gaps and gap penalties, what are matrices (like PAM or BLOSUM).
    NOTE: Also, the ClustalW site itself contains detailed help and FAQ sections !

    NOTE: Please refer to the main section "Multiple Alignment" for the description of additional programs in this field, like MUSCLE and TCoffee.
                     
               

             
SIM4...produce sequence logos from a group of related sequences ? (last update May 9, 2006)
   
    A sequence logo is a graphic representation of an aligned set of sequences. A logo displays the frequencies of bases at each position as the relative heights of letters, along with the degree of sequence conservation as the total height of a stack of letters, measured in bits of information. The vertical scale is in bits, with a maximum of 2 bits possible at each position for nucleic acids (log2 of 4 bases); for proteins the maximum is about 4.3 bits (log2 of 20 amino acids). In general, a sequence logo provides a richer and more precise description of, for example, a binding site than a consensus sequence would. Please note that there is a whole chapter at the main page dealing with the topic of sequence logos. The "Gallery of Sequence Logos" is a kind of portal for this topic and is highly recommended.
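
    How the stack heights described above come about can be sketched in a few lines. The example computes the information content of each alignment column as log2(alphabet size) minus the Shannon entropy of the column, and splits it among the letters by frequency. It is a simplified illustration: logo programs typically also apply a small-sample correction, which is left out here, and gap characters are not treated specially. The function names and the example alignment are made up.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content of one alignment column in bits."""
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(column).values())
    return math.log2(alphabet_size) - entropy


def logo_heights(sequences, alphabet_size=4):
    """Per column, the height in bits of each letter in the logo stack."""
    heights = []
    for column in zip(*sequences):
        info = column_information(column, alphabet_size)
        counts = Counter(column)
        heights.append({letter: info * n / len(column)
                        for letter, n in counts.items()})
    return heights


# Hypothetical pre-aligned DNA sequences of equal length.
alignment = ["ACGT", "ACGA", "ACCC", "AGTG"]
for position, stack in enumerate(logo_heights(alignment), start=1):
    print(position, stack)
```

    A fully conserved column reaches the maximum of 2 bits for DNA, while a column in which all four bases are equally frequent has a height of zero.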
    In general, the tools which lead to the generation of sequence logos may be grouped into two categories.

1. Tools that generate sequence logos from pre-aligned sequences:

These are tools that do not perform a multiple alignment or extract over-represented motifs by themselves, but need a pre-aligned sequence set as input. By the way, there is only a quite loose definition of "pre-aligned", so this point has to be checked individually for every program. Anyway, this group of tools offers more options concerning the graphic representation of the logos, meaning that the final output is highly customizable.

      Tip! WebLogo is possibly the best program for this purpose if you have a set of pre-aligned nucleic acid or protein sequences, as it shows the greatest versatility and the best quality of logo images. The input sequences should be in "FASTA alignment format", or you may use a CLUSTALW-generated multiple alignment (see question SIM3). In any case, all sequences must be of the same length. Note that this also holds for CLUSTAL alignments, which have a "-" at gap positions and therefore yield sequences of the same length ("-" counts like a residue). You may choose between 4 different output formats (more than the other programs offer): GIF, PNG, EPS, and PDF. Note that vector formats (EPS and PDF) are better for printing, while bitmaps (GIF and PNG) are more suitable for displaying on the screen or embedding into a web page. You can also set the dimensions of the output logo in centimeters, inches, pixels, or points, making WebLogo the only application with this extremely useful feature! You may even change the bitmap resolution of the logo in dpi (300 or 600 dpi for publications/printing). Using "first position number" and "Logo range", you may restrict the logo to a sub-range of your sequence set and set the numbering accordingly. Thus, if the first position number is "2", start is "5" and end is "10", then the 4th through 9th positions (inclusive) will be displayed, and they will be numbered "5", "6", "7", "8", "9" and "10".
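
    The relation between "first position number", "Logo range" and the alignment columns that are actually drawn is simple arithmetic, sketched below. The function name logo_range is made up for illustration; the calculation just reproduces the example given above.

```python
def logo_range(first_position_number, start, end):
    """Map a numbered logo range to 1-based alignment columns and labels."""
    first_col = start - first_position_number + 1
    last_col = end - first_position_number + 1
    columns = list(range(first_col, last_col + 1))   # columns that are drawn
    labels = list(range(start, end + 1))             # numbers printed below
    return columns, labels


columns, labels = logo_range(first_position_number=2, start=5, end=10)
print(columns)
print(labels)
```

    With the values from the text, the columns are the 4th through 9th and the labels run from 5 to 10.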

    GENIO/logo is a similar program to generate sequence logos, although it does not seem to cover the full functionality of WebLogo. This program accepts only the "FASTA alignment format" as input, but not e.g. CLUSTAL format. Although the sequences do not necessarily have to be of the same length to be accepted by the program, GENIO/logo does not extract the best-scoring motifs but simply compares the residues one position after the other (like WebLogo). In general, you have a lot of options influencing the design of the output, like coloring and replacement rules, background lines, or numbering, but, as mentioned, fewer than in WebLogo. You will receive (very quickly!) the sequence logo in 3 file formats for download (GIF, EPS, and Postscript).

2. Tools that "all-in-one" align sequences, extract conserved motifs AND generate sequence logos:

In principle, these programs are described in question GEN5, part B3). I would like to focus on the Sequence Logos - related information here.

    Tip! The MEME (Multiple EM for Motif Elicitation) system allows you to discover motifs (highly conserved regions) in groups of related DNA or protein sequences and to display these motifs as logos. Simply provide a multiple FASTA file of your DNA or protein sequence set, and set the number of motifs to be extracted. The sequences do not need to be of the same length! Note that MEME can search both strands of DNA, so it is not necessary to reverse/complement individual sequences first! MEME seems to be the only program (tested so far) which reliably detects motifs on the reverse strand. In addition, you may set a "minimum/maximum motif width"; e.g. for TF binding sites you may choose 6 and 9, meaning the program will extract only motifs ranging from 6 to 9 bp. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs. As part of the output, MEME will display a link "Submit to BLOCKS multiple alignment processor". At the output page, hit one of the format buttons (GIF, PDF, or Postscript) to display the logos. The logo graphically represents the PSSM (Position-Specific Scoring Matrix), where each position in the motif is characterized by the amount of information it contains (measured in bits). Note that the graphic is not customizable at all, but has to be accepted "as is". A good alternative is the following: at the MEME output page, hit the button "View FASTA1", which displays an "aligned FASTA file" of the extracted motif. You may simply copy/paste this file into the WebLogo input form in order to "restore" the fully customizable range of graphical options!

     MotifSampler, which is also part of the TOUCAN package, is a tool for the detection of "over-represented" motifs in a DNA or protein sequence set. The "stand-alone" version of MotifSampler contains a feature to display the motif as a sequence logo. You simply provide a FASTA-formatted multiple sequence file. The sequences do not need to be pre-aligned or of the same length! However, although MotifSampler can search both strands of DNA, it is highly preferable to first reverse/complement those sequences which are known to bear the specific motif on the reverse strand! You have to choose a background model, and you may change e.g. the number of motifs the program shall return and the length of the motifs (hint: try lengths from 6 to 9 bp for TFs). Note that you cannot define a range of lengths (from-to) but only an exact length (e.g. "8"), but you may of course run the same data set using different lengths. The "motifs" will be displayed as tables and graphically as a logo, again showing the information content for each individual position. Note that, as in MEME, the graphic is not customizable at all, but has to be accepted "as is". In contrast to MEME, there is no FASTA version (only a list) of the extracted sequence block containing the motif; it is therefore much harder to transfer the output into programs like WebLogo.

    SEQLOGO is a module of the software package Expression Profiler, provided by the EBI for clustering, analysis and visualization of gene expression and other genomic data. Like similar programs, SEQLOGO is a logo drawing tool to visualize the information content of patterns. It supports both approaches described here. You may provide a set of pre-aligned sequences, using the option "Use sequences as provided, assuming they are prealigned". Otherwise, another module called SPEXS (also available as an individual application) is automatically used to find the pattern that matches. There are some important notes to consider. The input sequences must be in capital letters, and must NOT be in FASTA format but a "plain list", with each sequence starting on a new line (which is very unusual, by the way). SPEXS does not look for patterns on the reverse strand, so you first have to reverse/complement sequences which bear the motif on the reverse strand (which is only possible IF you already know that!). The logo graphics are similar to those produced by WebLogo, but are not customizable at all, and so have to be accepted as they are. In summary, the combination of MEME plus WebLogo seems to be the much better approach.
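
    Since most sequence sets are kept in FASTA format, the unusual input format described above (capital letters, one sequence per line, no header lines) usually requires a conversion step. A minimal sketch is given below; the function name fasta_to_plain_list and the example records are made up for illustration.

```python
def fasta_to_plain_list(fasta_text):
    """Convert FASTA text to one upper-case sequence per line, no headers."""
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if current:                   # close the previous record
                sequences.append("".join(current))
            current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return "\n".join(sequences)


fasta = ">site1\nacgt\nac\n\n>site2\nTTGA\n"
print(fasta_to_plain_list(fasta))
```

    Sequences that were wrapped over several lines in the FASTA file are joined into a single line per sequence.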

                   

       
SIM5...test the specificity of a PCR primer pair within a genome? -> see DNA4 !
                 
                   

                                       
SIM6...match my query sequence very quickly with a whole-genome assembly ? (last update May 5, 2006)
               
    If you want to search whole-genome sequence databases very quickly, there are special programs for this task. A "regular" BLAST search normally takes quite a while, and the respective servers are often jammed with user queries. It has to be stated, however, that the following programs are designed to find only very close matches between sequences, say with more than about 90% sequence identity. They are therefore especially suitable for aligning mRNA, promoter and similar sequences to the genome of the same species; across species they only work where evolutionary sequence conservation is very high.

    Tip! BLAT was developed at the UCSC, and uses query sequences to quickly find sequences of 95% or greater similarity over a length of 40 bases or more. BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome in memory. The index consists of all non-overlapping 12-mers except for those heavily involved in repeats. BLAT can be used to locate DNA or protein sequences extremely fast within the genomes of the UCSC genome browser. Please refer to the BLAT main section for details.
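
    The idea of an index of non-overlapping 12-mers can be illustrated with a toy example. The sketch below is not BLAT's code: it leaves out the exclusion of repeat-rich 12-mers and the whole alignment stage that follows the lookup, and it assumes that the query is scanned with overlapping words so that a sufficiently long exact match always covers at least one indexed genome word. The function names and the tiny "genome" are made up, and a word length of 4 is used to keep the example readable.

```python
def build_index(genome, k=12):
    """Index the non-overlapping k-mers of the genome by start position."""
    index = {}
    for pos in range(0, len(genome) - k + 1, k):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index


def find_hits(query, index, k=12):
    """Look up every overlapping k-mer of the query in the genome index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, gpos))
    return hits


# Hypothetical toy genome and query, using 4-mers instead of 12-mers.
genome = "AAAACCCCGGGGTTTTACGT"
index = build_index(genome, k=4)
print(sorted(index))
print(find_hits("CCGGGGTT", index, k=4))
```

    Because only every k-th position of the genome is stored, the index is much smaller than one holding all overlapping words, which is what makes it feasible to keep it in memory for an entire genome.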

    SSAHA was developed within the Ensembl project and is used, for example, to search the Ensembl databases via BLASTView. SSAHA (Sequence Search and Alignment by Hashing Algorithm) is a fast, nearly exact matcher, used for example to search DNA queries against the unmasked genome sequence assembly at Ensembl. SSAHA is very fast and useful for matching closely related sequences such as mRNAs to a genomic sequence. It will, however, fail if the sequence similarity between query and subject falls below 90%. For weaker similarities, suitable BLAST algorithms should be considered. Please refer to the SSAHA main section for details.

    MegaBLAST uses BLAST parameters to quickly search for highly similar sequences (e.g. from the same organism). MegaBLAST is the fastest BLAST program because it defaults to a large word size (an exact match of 28 bases is required to initiate an extension). In addition, it uses a fast gapped extension algorithm that has no gap existence cost, only a gap extension cost. Especially with larger word sizes, it is up to 10 times faster than the standard BLAST programs. Word size is roughly the minimal length of an identical match that an alignment must contain in order to be found by the algorithm. MegaBLAST is most efficient with word sizes of 16 and larger, although word sizes as low as 8 can be used. MegaBLAST can also handle much longer DNA sequences efficiently than the blastn program of the traditional BLAST algorithm.
         
       
      
SIM7...search for (also distant) orthologs/homologs of my gene/protein of interest ? -> see GENOM2 !
                 