-> SEQUENCE SIMILARITY
-> SIM1...search databases with a batch of sequences using BLAST or other tools ? (last update May 5, 2006)
-> SIM2...search databases using short sequences like oligos and peptides ? (last update May 5, 2006)
-> SIM3...produce a multiple sequence alignment from a group of related sequences ? (last update Sep. 19, 2006)
-> SIM4...produce sequence logos from a group of related sequences ? (last update May 9, 2006)
-> SIM5...test the specificity of a PCR primer pair within a genome? -> see DNA4 !
-> SIM6...match my query sequence very quickly with a whole-genome assembly ? (last update May 5, 2006)
-> SIM7...search for (also distant) orthologs/homologs of my gene/protein of interest ? -> see GENOM2 !
SIM1...search databases with a batch of sequences using BLAST or other tools ? (last update May 5, 2006)
This question is also answered on NCBI's BLAST FAQ page and discussed
in the Appendix of the BLAST Program Selection Guide, which includes a
"quickview-table" listing all the "pros" and "cons" of the different
options ! Batch searching means that you submit e.g. 20 sequences with
a single button click instead of one after the other. Similar programs
and interfaces are available from other web portals. For BLAST, we may
differentiate 3 options (1-3). Below that, other search tools are
listed.
1. Web-based BLAST-program and web-based databases:
MegaBLAST:
This program is optimized for aligning highly similar nucleotide
sequences that differ slightly as a result of sequencing or other
similar "errors". MegaBLAST is the fastest BLAST program because it
defaults to a large word size (an exact match of 28 bases is required
to initiate an extension). In addition, it has a fast gapped extension
algorithm which has no gap existence cost but only a gap extension
cost. Especially when a larger word size is used, it is up to 10 times
faster than the standard BLAST programs. Word size is roughly the
minimal length of an identical match an alignment must contain if it is
to be found by the algorithm. MegaBLAST is most efficient with word
sizes of 16 and larger, although word sizes as low as 8 can be used.
MegaBLAST is also able to handle much longer DNA sequences efficiently
than the blastn program of the traditional BLAST algorithm.
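The role of the word size can be illustrated with a small Python
sketch. This is not MegaBLAST's actual implementation; the function
name shared_words and the two test sequences are made up for
illustration. It only shows that an alignment can be seeded when both
sequences share at least one exact word of the chosen length.

```python
def shared_words(query, subject, word_size):
    """Return all exact words of length word_size found in both sequences.

    Illustrative only: a real BLAST-like program would go on to extend
    each shared word into a longer alignment.
    """
    query, subject = query.upper(), subject.upper()
    subject_words = {
        subject[i:i + word_size]
        for i in range(len(subject) - word_size + 1)
    }
    return {
        query[i:i + word_size]
        for i in range(len(query) - word_size + 1)
        if query[i:i + word_size] in subject_words
    }


# The two sequences share an identical stretch of only 7 bases (ACGTACG).
query = "ACGTACGTTT"
subject = "GGACGTACGA"
print(shared_words(query, subject, 6))  # seeds found with word size 6
print(shared_words(query, subject, 8))  # no seed with word size 8
```

A larger word size therefore means fewer seeds to extend (faster
search), at the price of missing matches whose longest identical
stretch is shorter than the word size.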
MegaBLAST is well suited to scanning a large number of EST-type
sequences (about 500 bp in length) against a large database in search
of the closest matches. You can import a file of EST sequences in FASTA
format, or a list of GenBank accessions or GIs, and have them compared
to the BLAST databases. The default output is an easily reviewable Hit
Table format, although you can download and save the results in
Standard pairwise HTML or any of the other result output options. Web
MegaBLAST is available from the main BLAST web page, where a
description is available as well. MegaBLAST is also part of the
Standalone BLAST executables, and an option in the Network BLAST client
(see below).
Note: MegaBLAST works only for blastn
searches of nucleotide databases. There
is NO batch option at the NCBI BLAST web site for protein
sequences. For this purpose, other options have to be selected (as
discussed in the Appendix
of the BLAST
Program Selection Guide).
The Ensembl BLASTView
is a good tool if you have a series of sequences (max. 30) you
want to search "in-batch", and if you want to have full and
easy-to-understand control over all available options, like E-values,
dust filtering, repeats and more. You can follow the status of every
sequence
individually. You can even define the columns which will be displayed
in the summary output table. But note that the structure
of databases is different from e.g. NCBI BLAST; there are databases
like "Ensembl cDNAs" or "Ensembl genome", but not e.g. "dbEST" ! The
link "Contig view" will reveal the region of interest in the
Ensembl graphical genome browser, which is also highly
customizable but has some drawbacks like the display of EST sequences.
On the
other hand, there are some very nice features like the positions of
individual Affymetrix ProbeSets, a sub-classification of repeats,
multiple export options, and a direct connection to the other major
genome browsers, UCSC and NCBI via the "Jump to" button.
2. Local BLAST-program and local databases:
Standalone BLAST executables: The Standalone BLAST executables are
command line programs which run BLAST searches against locally
downloaded copies of the NCBI BLAST databases. The programs will handle
a single large file with multiple FASTA query sequences, or you can
create a script to send multiple files one at a time. The executables
are available for a wide variety of platforms, including many "flavors"
of UNIX (Linux, Solaris, etc.), Windows PC and even Mac OS X. The
Standalone executables are available at the anonymous FTP location:
ftp://ftp.ncbi.nih.gov/blast/executables/.
NOTE that with this option you need to provide the appropriate disc
space, and you have to update your local copies of the databases
regularly !
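If you prefer to send the queries one at a time, the multi-FASTA file
first has to be split into individual records. The following Python
sketch shows one way to do that; the function names split_fasta and
as_single_fasta are hypothetical helpers, not part of any BLAST
package.

```python
def split_fasta(text):
    """Split multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence line of the current record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def as_single_fasta(header, sequence):
    """Format one record as a stand-alone FASTA entry."""
    return ">" + header + "\n" + sequence + "\n"


batch = ">seq1 test\nACGT\nAC\n\n>seq2\nTTGA\n"
for header, sequence in split_fasta(batch):
    print(as_single_fasta(header, sequence), end="")
```

Each formatted record could then be written to its own file and handed
to the search program by a script.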
3. Local BLAST-program but web-based databases:
BLAST Network Client 'blastcl3': The
BLAST Network client will allow you to submit a file of FASTA sequences
over an internet connection to the NCBI BLAST databases. The BLAST
Network client executables are located at NCBI's BLAST
executables FTP site, and the most recent file versions are located
in the subdirectory "LATEST".
Depending on the system you are using, you simply have to download the
appropriate file (for Windows PC this is:
netblast-2.2.6-ia32-win32.exe). Once downloaded, double-click the exe
file, which extracts several files. One of them is the actual program
(blastcl3.exe).
In order to execute the program, first open the DOS prompt window of
your system, then change to the directory where you saved the
blastcl3.exe file. Now you have to type a command line at the DOS
prompt, which looks a bit weird at first sight, but is not that
complicated. In principle, you have to define the same parameters as in
the BLAST web interface: BLAST program, database, input file name, and
output file name. NOTE that the input sequence file must be in FASTA
format (each sequence starting with a header line beginning with
>seqname). So you can collect your sequences one after the other in one
word processor file and save it in plain text (*.txt) format,
preferably in the directory where you have also saved the blastcl3.exe
file.
A simple command line would look like this:
">blastcl3 -p blastn -d nr -i test.txt -o out.test". The -p argument
defines the program (in this case blastn, so you have a nucleotide
query), the -d argument defines the database (in this case nr, the
non-redundant database), -i is the input sequence file, and -o the
output file. The output file can be opened in a standard word
processor. Note that the output file can become very large (all BLAST
hits of all input sequences, e.g. 20, are written into one output file)
if you do not set arguments to restrict its size. To do so, you can add
to your command line "-v", which defines the number of one-line
descriptions, and "-b", which defines the number of pairwise alignments
to show. In addition, you can produce an HTML output file by using the
argument "-T" with the option "T". Note that many arguments allow two
options: "T" (true) or "F" (false). The complete command line would now
look like this: ">blastcl3 -p blastn -d nr -i test.txt -o out.test
-v 50 -b 20 -T T". Note that when using the HTML option, the output
file must still be renamed (from out.test to test.html) in order to be
readable by a browser.
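If you run such searches repeatedly, the command line can also be
assembled by a small script. The following Python sketch only uses the
arguments described above (-p, -d, -i, -o, -v, -b, -T); the function
name build_blastcl3_command is hypothetical, and the sketch assumes
that blastcl3 can be found in the current directory or on the search
path.

```python
def build_blastcl3_command(program, database, infile, outfile,
                           descriptions=None, alignments=None, html=False):
    """Assemble the blastcl3 argument list from the options described above."""
    cmd = ["blastcl3", "-p", program, "-d", database,
           "-i", infile, "-o", outfile]
    if descriptions is not None:
        cmd += ["-v", str(descriptions)]  # number of one-line descriptions
    if alignments is not None:
        cmd += ["-b", str(alignments)]    # number of pairwise alignments
    if html:
        cmd += ["-T", "T"]                # HTML output
    return cmd


cmd = build_blastcl3_command("blastn", "nr", "test.txt", "out.test",
                             descriptions=50, alignments=20, html=True)
print(" ".join(cmd))
# To actually run the search (requires blastcl3 and a network connection):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with file names
that contain spaces.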
4. Web-based BLAT-program and web-based (genome)
databases:
BLAT was developed at the UCSC (University of California, Santa Cruz).
BLAT uses query sequences to quickly find sequences of 95% and greater
similarity of length 40 bases or more. BLAT is not BLAST. DNA BLAT
works by keeping an index of the entire genome in memory. The index
consists of all non-overlapping 12-mers except for those heavily
involved in repeats. BLAT can be used to locate DNA or protein
sequences extremely fast within the UCSC genome browser.
For single queries, only DNA sequences of 25,000 or fewer bases and
protein or translated sequences of 10,000 or fewer letters will be
processed.
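The idea of an index of non-overlapping 12-mers can be sketched in a
few lines of Python. This is a strongly simplified illustration, not
BLAT's actual code: the function names are hypothetical, repeats are
not excluded, and the sketch assumes that the query is scanned with
overlapping words, so that a match is found regardless of where the
indexed genome words happen to start.

```python
from collections import defaultdict


def build_index(genome, k=12):
    """Index all non-overlapping k-mers of the genome by start position."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):  # step k: no overlap
        index[genome[start:start + k]].append(start)
    return index


def find_seed_hits(query, index, k=12):
    """Look up every (overlapping) k-mer of the query in the index."""
    hits = []
    for q_pos in range(len(query) - k + 1):
        for g_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, g_pos))
    return hits


# Toy example with k=4 instead of 12 to keep the sequences short.
genome = "AAAACCCCGGGGTTTT"
index = build_index(genome, k=4)
print(find_seed_hits("TCCCCG", index, k=4))
```

Because only every k-th word of the genome is stored, the index stays
small enough to be kept in memory, which is what makes the lookups so
fast.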
BLAT can also be used for batch sequence searches against the UCSC
genome databases. An entire set of query sequences can be looked up
simultaneously when provided in FASTA format. Up to 25 sequences can be
submitted at the same time. The total limit for multiple sequence
submissions is 50,000 bases or 25,000 letters. The main advantage is
that BLAT works extremely fast, even in "batch mode".
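Before pasting a larger set into the web form, it may be convenient to
check it against these limits. The Python sketch below uses the figures
quoted above (25 sequences, 50,000 bases in total for DNA); the limits
on the live server may have changed since, and the function name
check_dna_batch is hypothetical.

```python
MAX_BATCH_SEQUENCES = 25    # as quoted above; may differ on the live site
MAX_BATCH_BASES = 50_000    # total number of bases for a DNA batch


def check_dna_batch(sequences):
    """Return a list of problems; an empty list means the batch fits."""
    problems = []
    if len(sequences) > MAX_BATCH_SEQUENCES:
        problems.append(
            "too many sequences: %d > %d"
            % (len(sequences), MAX_BATCH_SEQUENCES)
        )
    total = sum(len(seq) for seq in sequences)
    if total > MAX_BATCH_BASES:
        problems.append(
            "too many bases: %d > %d" % (total, MAX_BATCH_BASES)
        )
    return problems


print(check_dna_batch(["ACGT" * 100] * 20))    # fits: prints []
print(check_dna_batch(["ACGT" * 1000] * 26))   # violates both limits
```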
SIM2...search databases using short sequences like oligos and peptides ? (last update May 5, 2006)
If you want to search databases using short sequences (oligos,
peptides), you can try programs which are preset to the appropriate
parameters, because with default settings such short matches are
normally filtered out as "unspecific" hits.
The NCBI
BLAST page provides a series of general and special BLAST
programs, suitable for different purposes and requirements. In order to
run similarity searches using short oligos or peptides, 2 special
programs are available within the sections "Nucleotide BLAST" and
"Protein BLAST". You should note that both of the following
links are very long and may be unstable; if so you can easily find the
programs starting from the BLAST-"Home" page.
BLAST at NCBI - Search for short, nearly exact sequences (Nucleotide)
can be used with oligo sequences as input. A short query is more likely
to occur by chance in the database. Therefore, increasing the Expect
value threshold and lowering the word size are often necessary before
any results are returned. Low Complexity filtering is also switched
off, since it would filter out a large percentage of a short sequence,
leaving little or no query sequence.
BLAST
at NCBI - Search for short, nearly exact sequences (Peptide) can be
used with peptide sequences as
input. Also for short protein sequence searches the Matrix
is changed to PAM-30 which is better suited to finding short regions of
high similarity.
PHI-BLAST
(Pattern-hit initiated BLAST)
searches a database looking only
for alignments that include a specified pattern. This means that in
order to run PHI-BLAST, both a query protein sequence AND a
peptide
pattern ("PHI pattern") which is part of the query sequence must
be provided. PHI-BLAST then searches the database for hits which
contain both the pattern AND an "overall homology" to the query
sequence, thereby eliminating hits which are based on the (short)
pattern alone. Please refer to the PHI-BLAST main section for
details.
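Since the pattern must occur in the query sequence, it can be useful to
check this before submitting a search. The following Python sketch
assumes that the pattern is written in the PROSITE-style syntax
(elements separated by "-", "x" for any residue, "[...]" for allowed
and "{...}" for excluded residues, "(n)" or "(n,m)" for repeats). It
covers only these elements, and the function names are hypothetical.

```python
import re

ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residue, low, high = match.groups()
        if residue == "x":
            core = "."                          # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"   # excluded residues
        else:
            core = residue                      # single letter or [ABC]
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def pattern_in_query(pattern, query):
    """Return True if the pattern occurs somewhere in the query sequence."""
    return re.search(prosite_to_regex(pattern), query.upper()) is not None


print(prosite_to_regex("C-x(2,4)-C-{P}-[ST]"))
print(pattern_in_query("C-x(2,4)-C-{P}-[ST]", "MKCAACGSLL"))
```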
BlastView is an Ensembl-based program which provides an integrated
platform for sequence similarity searches against Ensembl databases.
The search sensitivity can be selected in a very user-friendly way, and
one of the options is called "near exact matches (oligo)". With this
option, the program can be used to match short oligo sequences onto the
genome.
The Protein
Information
Resource (PIR)
provides, among many others, two tools to search protein databases
using short peptide sequences (Peptide Match) or user-defined
patterns (Pattern Search).
Peptide
Match is a program which retrieves protein sequences via exact
peptide match (meaning that you know the exact peptide sequence
without IUPAC codes etc.). Peptide Match searches either UniProt or
UniRef databases for
matching protein sequences. The output and download options are similar
to Pattern Search.
Pattern
Search performs two different tasks. First, Pattern
Search
can search the iProClass database for proteins sharing a certain
user-defined pattern. There is also a comprehensive help
file on how to write a
peptide pattern. In the output list, each entry can be selected
individually, and there are options to generate a common FASTA sequence
file as well as ClustalW multiple alignments and domain architecture
images. Second (not relevant for this FAQ), Pattern Search can
search your query sequence for
known patterns against the PROSITE database.
NOTE: Programs
which match PAIRS of primers onto genomes ("In
silico PCR") are described in section "Primers and In silico PCR" !
SIM3...produce a multiple sequence alignment from a group of related sequences ? (last update Sep. 19, 2006)
Tip!
The best known program for this purpose is ClustalW. There are several
ways to access the program; one of them is at the EBI. You should first
create a plain text file (e.g. in WORD, saved as text) into which the
input sequences are copied one after the other in FASTA format (each
sequence starting with a header line beginning with >seqname; delete
numbers and blank lines !). Alternatively, you may download a sequence
set using programs like NCBI-Entrez Protein. All sequences must be in
the correct orientation, because ClustalW does NOT perform
reverse/complements !
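Preparing the input by hand is tedious for larger sets. The following
Python sketch shows how the clean-up described above could be done with
a script: it removes numbers, spaces and blank lines, reverse
complements a DNA sequence where needed, and writes a FASTA entry. The
function names are hypothetical, and the reverse complement handles
only the plain bases A, C, G and T.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def clean_sequence(raw):
    """Remove digits, spaces and line breaks from a pasted sequence."""
    return re.sub(r"[\d\s]", "", raw)


def reverse_complement(dna):
    """Reverse complement a DNA sequence (A, C, G, T only)."""
    return dna.translate(COMPLEMENT)[::-1]


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA entry with a >name header line."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


pasted = "  1 acgtac gtacg\n\n 11 ttt\n"
sequence = clean_sequence(pasted)
print(to_fasta("seq1", sequence), end="")
print(to_fasta("seq1_rc", reverse_complement(sequence)), end="")
```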
At the ClustalW input page, you can adjust a number of options. If you
choose "Input" under "output order", the sequences will be listed in
the alignment according to their position within the input file,
whereas if you choose "aligned", the sequences will be ordered
according to their similarities. You should choose "ALN without
numbers" as output format if you want to color the alignment using the
program Boxshade (please see FAQ GRAPH1 !).
ClustalW output includes the sequence alignment itself, which can be
transferred via COPY/PASTE to a word processing program like MS-WORD
and manually edited there before coloring in Boxshade. The alignment
may also be inspected using a Java applet called JalView, which
provides a number of options to sort and color the alignment. However,
it is not possible to download the images which are produced. ClustalW
also produces a phylogenetic tree of your input sequences, showing
evolutionary distances. In addition, you may want to display stretches
of high conservation using a tool that produces sequence logos. Please
refer to question SIM4 for that purpose.
NOTE: There are good tutorials
(including
references to ClustalW and other alignment programs) at
the EBI educational portal 2can, for the following
topics: ClustalW
Nucleotide alignment, ClustalW
Protein alignment, how to find the "optimal
alignment", what are gaps and gap
penalties, what are matrices
(like PAM or BLOSUM).
NOTE: Also, the ClustalW site itself
contains detailed help and FAQ sections !
NOTE: Please refer to the main section "Multiple Alignment"
for the description of additional programs in this field, like MUSCLE and TCoffee.
SIM4...produce sequence logos from a group of related sequences ? (last update May 9, 2006)
A sequence logo is a graphic representation of an aligned set of
sequences. A logo displays the frequencies of bases at each position as
the relative heights of letters, along with the degree of sequence
conservation as the total height of a stack of letters, measured in
bits of information. The vertical scale is in bits, with a maximum of
2 bits possible at each position (for nucleotide sequences, because
log2 of 4 is 2). In general, a sequence logo provides a richer and more
precise description of, for example, a binding site than a consensus
sequence would. Please note that there is a whole chapter at the main
page dealing with the topic of sequence logos. The "Gallery of Sequence
Logos" is a kind of portal for this topic and is highly recommended.
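How the stack and letter heights come about can be sketched in Python.
The sketch assumes gap-free, pre-aligned DNA sequences of equal length
and omits the small-sample correction that logo programs may apply; the
function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content of one alignment column in bits.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed letter frequencies (2 bits maximum for DNA).
    """
    total = len(column)
    entropy = -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(column).values()
    )
    return math.log2(alphabet_size) - entropy


def logo_heights(aligned_sequences):
    """Per column: height of each letter = frequency * information."""
    heights = []
    for column in zip(*aligned_sequences):
        info = column_information(column)
        counts = Counter(column)
        heights.append(
            {base: info * n / len(column) for base, n in counts.items()}
        )
    return heights


alignment = ["ACGT", "ACGA", "ACTC", "AGTG"]
for position, stack in enumerate(logo_heights(alignment), start=1):
    print(position, {base: round(h, 2) for base, h in stack.items()})
```

In this toy alignment, the first column is fully conserved and reaches
the full 2 bits, whereas the last column contains all four bases
equally often and therefore has a height of zero.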
In general, the tools which finally lead to the
generation of sequence logos,
may be grouped in two categories.
1. Tools that generate sequence logos from pre-aligned sequences:
These are tools that do not perform a multiple alignment or extract
over-represented motifs
by themselves, but need a pre-aligned sequence set as input. By
the way, there is only a quite loose definition of "pre-aligned", so
this point has to be checked individually for every program. Anyway,
this group of tools offers more options concerning the graphic
representation of the logos, meaning that the final output is highly
customizable.
Tip!
WebLogo is possibly
the best program for this purpose if you have a set of
pre-aligned nucleic or protein sequences, showing the greatest
versatility and the best quality of Logo images. The
input sequences should be in "FASTA alignment format", or you may also
use a CLUSTALW generated
multiple alignment (see question SIM3). In any
case, all sequences must be the same length. Note that this is also the
case in CLUSTAL alignments, which have a "-" at gap positions, finally
yielding sequences of the same length ("-" counts like an amino acid).
You may choose between 4 different output formats (more than in the
other programs !): GIF, PNG, EPS, and PDF. Note that vector formats
(EPS and PDF) are better for printing, while bitmaps (GIF and PNG) are
more suitable for displaying on the screen or embedding into a web
page. You can also determine the dimensions of the output logo in
centimeters, inches, pixels, or points (!), making WebLogo the only
application with this extremely useful feature ! You may even change
the bitmap resolution of the Logo in dpi (300 or 600 dpi for
publications/printing). Using "first position number" and "Logo range",
you may restrict the Logo to a sub-range of your sequence set, and set
the numbering accordingly. Thus, if the first position number is "2",
start is "5" and end is "10", then the 4th through 9th (inclusive)
positions will be displayed, and they will be numbered "5", "6", "7",
"8", "9" and "10".
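The relation between the numbering and the displayed columns in this
example can be written down as a short Python sketch. The function name
logo_window is hypothetical; it simply reproduces the arithmetic of the
example above and is not part of WebLogo.

```python
def logo_window(first_position_number, start, end):
    """Return (first_column, last_column, labels) for a logo sub-range.

    Columns are counted from 1 in the alignment; start and end are
    given in the numbering defined by first_position_number.
    """
    offset = first_position_number - 1
    first_column = start - offset
    last_column = end - offset
    labels = list(range(start, end + 1))
    return first_column, last_column, labels


# First position number 2, start 5, end 10:
# alignment columns 4 to 9 are shown, labelled 5 to 10.
print(logo_window(2, 5, 10))
```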
GENIO/logo
is a similar program to generate sequence logos, although it does not
seem to cover the full functionality of WebLogo. This program accepts
only the "FASTA alignment format" as input, but not e.g. CLUSTAL
format. Although the sequences do not necessarily have to be of the
same length to be accepted by the program, GENIO/logo does not
extract the best-scoring motifs but simply compares the residues one
position after the other (like WebLogo). In general, you have a lot of
options influencing the design of the output, like coloring and
replacement rules, background lines, or numbering, but, as mentioned,
fewer than in WebLogo. You will receive (very quickly !) the sequence
logo in 3 file formats for download (GIF, EPS, and Postscript).
2. Tools that "all-in-one" align sequences, extract conserved
motifs AND generate sequence logos:
In principle, these programs are described in question GEN5, part B3). I would like to focus
on the Sequence Logos - related information here.
Tip! The
MEME
(Multiple Em for Motif Elicitation) system allows you to discover
motifs (highly conserved regions) in groups of related DNA or
protein sequences using MEME and to display these motifs
as logos. Simply provide a multiple FASTA-file of your
DNA or protein sequence set, and set the number of motifs to be
extracted. It is not necessary that sequences show the same
length ! Note that MEME can search both strands of DNA,
so it is not necessary to reverse/complement individual sequences first
! MEME seems to be the only
program (tested so far) which reliably detects motifs on the reverse
strand. In addition, you may set a "minimum/maximum motif
width", e.g. for TF binding sites you may choose 6 and 9, meaning
the program will extract only motifs ranging from 6 to 9 bp. Individual
MEME motifs do not contain gaps. Patterns with variable-length gaps are
split by MEME into two or more separate motifs. As part of the output,
MEME will display a link "Submit to BLOCKS multiple
alignment processor". At the output page, hit one of the format
buttons (GIF, PDF, or Postscript) to display the Logos. The
Logo represents graphically the PSSM (Position-Specific Scoring
Matrix) where each position in the motif is characterized by the amount
of information it contains (measured in bits). Note that the
graphic
is not customizable at all, but has to be accepted "as is". Anyway,
a good alternative is the following: At the MEME output page, hit the
button "View FASTA1" which displays an "aligned FASTA file" of the
extracted
motif. This file you may simply copy/paste into the WebLogo input
form, in order to "restore" the fully customizable range of graphical
options !
MotifSampler,
which is also part of the TOUCAN
package, is a tool for the detection of "over-represented"
motifs in a DNA or protein sequence set. The "stand-alone"
version of MotifSampler contains a feature to display the motif as a sequence
logo. You simply provide a FASTA formatted multiple
sequence file. It is not necessary that sequences are
pre-aligned or show the same length ! However, although MotifSampler
can search both strands of DNA, it is highly preferable to first
reverse/complement those sequences which are known to bear the specific
motif on the reverse strand ! You have to choose a background model,
and you may change e.g. the
number of motifs the program shall return, and the length of the motifs
(hint: try lengths from 6 to 9 bp for TFs). Note that you can
not define a range of length (from-to) but only the exact length
(e.g. "8"), but you may of course run the same data set using different
lengths. The "motifs" will be displayed as tables and graphically as Logo,
again showing the information content for each individual position. Note
that, as in MEME, the graphic is not customizable at all, but has to be
accepted "as
is". In contrast to MEME, there is no FASTA version (only a
list) of the extracted sequence block containing the motif, therefore
it is much harder to transfer the output into programs like
WebLogo.
SEQLOGO
is a module of the software package Expression Profiler, for
clustering, analysis and visualization of gene expression and other
genomic data, provided by the
EBI. Like similar programs, SEQLOGO is a logo drawing tool to
visualize the information content of patterns. In principle, there are
both options here. You may provide a set of pre-aligned sequences,
using the option " Use sequences as provided, assuming they are
prealigned". Otherwise, another module called SPEXS (also
available as individual
application), is automatically used to find the pattern that
matches. There are some important notes to consider. The input
sequences must be in capital letters, and must NOT be in FASTA
format, but as a "plain list", each sequence starting at a new line
(which is very unusual, by the way). SPEXS does not look for
patterns on the reverse strand, so you first have to reverse
complement sequences, which bear the motif on the reverse strand (which
is only possible IF you already know that !). The logo graphics are
similar to those produced by WebLogo, but are not customizable at all,
and so have to be accepted as they are. In summary, the combination of
MEME plus WebLogo seems to be the much better approach.
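If your sequences are already stored in FASTA format, the "plain list"
required by SEQLOGO (capital letters, one sequence per line, no header
lines) can be produced with a few lines of Python. This is only a
sketch, and the function name fasta_to_plain_list is hypothetical.

```python
def fasta_to_plain_list(fasta_text):
    """Convert FASTA text into one upper-case sequence per line."""
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current:
                sequences.append("".join(current).upper())
            current = []  # the header line itself is dropped
        elif line:
            current.append(line)
    if current:
        sequences.append("".join(current).upper())
    return "\n".join(sequences)


print(fasta_to_plain_list(">a\nacgt\nac\n>b\nttga\n"))
```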
SIM5...test the specificity of a PCR primer pair within a
genome? -> see DNA4 !
SIM6...match my query sequence very quickly with a whole-genome assembly ? (last update May 5, 2006)
If you want to search whole-genome sequence databases very quickly,
there are special programs which will perform this task. A "regular"
BLAST search normally takes quite a while, and the respective servers
are often jammed with user queries. It has to be stated, however, that
the following programs are designed to find only very close matches
between sequences, let's say with more than about 90% sequence
identity. Therefore, they are especially suitable for aligning mRNA,
promoter and similar sequences to the genome of the same species;
searches across species only work if evolutionary sequence conservation
is very high.
Tip! BLAT was developed at the UCSC, and uses query sequences to
quickly find sequences of 95% and greater similarity of length 40 bases
or more. BLAT is not BLAST. DNA BLAT works by keeping an index of the
entire genome in memory. The index consists of all non-overlapping
12-mers except for those heavily involved in repeats. BLAT can be used
to locate DNA or protein sequences extremely fast within the UCSC
genome browser. Please refer to the BLAT main section for details.
SSAHA was developed within the Ensembl project and is used, for
example, to search the Ensembl databases via BLASTView. SSAHA (Sequence
Search and Alignment by Hashing Algorithm) is a fast, nearly exact
matcher and is used, for example, to search DNA queries against the
unmasked genome sequence assembly at Ensembl. SSAHA is very fast and
useful for matching closely related sequences such as mRNAs to a
genomic sequence. It will, however, fail if the sequence similarity
between query and subject falls below 90%. For weaker similarities,
suitable BLAST algorithms should be considered. Please refer to the
SSAHA main section for details.
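To judge whether a pair of sequences falls above or below such a
threshold, the percentage of identical positions in a pairwise
alignment can be calculated. The Python sketch below assumes two
aligned sequences of equal length with "-" as gap character and counts
identity over the gap-free columns only. This is one of several
possible definitions and not necessarily the one used by SSAHA or BLAT;
the function name percent_identity is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical positions over the gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
print(round(percent_identity("ACGT-CGT", "ACGTACGA"), 1))
```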
MegaBLAST uses BLAST parameters to quickly search for highly similar
sequences (e.g. from the same organism). MegaBLAST is the fastest BLAST
program because it defaults to a large word size (an exact match of 28
bases is required to initiate an extension). In addition, it has a fast
gapped extension algorithm which has no gap existence cost but only a
gap extension cost. Especially when a larger word size is used, it is
up to 10 times faster than the standard BLAST programs. Word size is
roughly the minimal length of an identical match an alignment must
contain if it is to be found by the algorithm. MegaBLAST is most
efficient with word sizes of 16 and larger, although word sizes as low
as 8 can be used. MegaBLAST is also able to handle much longer DNA
sequences efficiently than the blastn program of the traditional BLAST
algorithm.
SIM7...search for (also distant) orthologs/homologs of my gene/protein of interest ? -> see GENOM2 !