| Sequence Manipulation | |
| Linkpage 1 - Sequence manipulation tools (Pasteur) |
The Pasteur Institute
provides a linkpage
to various tools involved in sequence
manipulation, like sequence replacement, deletion, gap removal,
trimming, and splitting. |
| Linkpage 2 - SeWeR (Pasteur) |
SeWeR (Sequence
analysis using Web Resources) is an integrated
portal to common web-based services in bioinformatics, developed at
the Pasteur Institute. Within the "Tools" section there is a large list of
programs involved in sequence
manipulation. NOTE: SeWeR does NOT perform well with Netscape 6 and 7! Please use MS Internet Explorer instead! |
| Linkpage 3 - Sequence Manipulation Suite (University of Alberta) |
The
Sequence
Manipulation Suite is a collection of web-based programs for
analyzing and formatting DNA and protein sequences. The output of each
program is a set of HTML commands, which is rendered
by your web browser as a standard web page. You can print and save the
results, and you can edit them using an HTML editor or a text editor. |
| BIOSED (EMBOSS, Pasteur) - search and replace |
BIOSED
is a simple
sequence editing utility that searches for a target subsequence
in one or more input sequences and replaces it with a specified
second subsequence (or optionally just deletes the found target
subsequence). Biosed was inspired by the useful UNIX utility sed which searches for a pattern in text and can replace or delete the found pattern. If the target subsequence occurs more than once, then each instance of the target is replaced. The target subsequence is not any sort of an ambiguity pattern, it is just a short sequence. A simple string match is done and if it exactly matches then the replacement is done. The matching is independent of the case of the sequence or the target - both uppercase and lowercase will match. |
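The matching rule described above (literal target, case-insensitive, every occurrence replaced or deleted) can be sketched in a few lines of Python. This is only an illustration of the behaviour, not the EMBOSS implementation; the function name `biosed_like` is hypothetical.

```python
import re

def biosed_like(sequence: str, target: str, replacement: str = "") -> str:
    """Replace every case-insensitive literal match of target.

    An empty replacement deletes the target (hypothetical helper).
    """
    if not target:
        raise ValueError("target must be a non-empty subsequence")
    # re.escape keeps the target a plain string, not an ambiguity pattern.
    # The lambda prevents backslash interpretation in the replacement.
    return re.sub(re.escape(target), lambda _m: replacement,
                  sequence, flags=re.IGNORECASE)

print(biosed_like("ACGTacgtACGT", "cg", "NN"))  # ANNTaNNtANNT
```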
| CAP (EMBOSS, Pasteur) - contig assembly of multiple sequences |
CAP3
(Pasteur) is a widely
used
program that generates a contig sequence from a list of
input sequences. Fragments in random orientations are assembled
into contigs. CAP3 (PBIL, Lyon) is an alternative site to access this program. CAP3 (BCM) is also available from the BCM Search Launcher. |
| Chromas (Technelysium) |
Chromas
is a program
that opens and manipulates chromatogram files from a
variety of DNA sequencers. It can export sequences in plain
text, formatted with base numbering, FASTA, EMBL, GenBank or GCG
formats. It can also reverse-complement and translate
sequences. NOTE: The unregistered version is fully functional for a period of 60 days, after which you have to register and pay. Older versions are available for free. |
| CUTSEQ (EMBOSS, Pasteur) - cut out a region |
CUTSEQ
removes a
specified section from a sequence. This simple editing program allows you to cut out a region from your sequence. It removes the sequence from the specified start to the end positions (inclusive) and returns the rest of the sequence in the output file. |
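The inclusive start-to-end removal can be sketched as follows. The function name `cut_region` and the 1-based coordinate convention are assumptions made for this illustration.

```python
def cut_region(sequence: str, start: int, end: int) -> str:
    """Remove positions start..end (assumed 1-based, inclusive)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("start and end must lie within the sequence")
    # Keep everything before start and everything after end.
    return sequence[:start - 1] + sequence[end:]

print(cut_region("AAAGGGTTT", 4, 6))  # AAATTT
```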
| DEGAPSEQ (EMBOSS, Pasteur) - remove gaps |
DEGAPSEQ
removes gap
characters from
sequences. It reads in one or more sequences and writes them out again
minus
any gap characters. In effect it removes gaps from aligned sequences. In fact, it does more than just this, as it removes ANY non-alphabetic character from the input sequence, so as well as removing the gap characters, it will remove such things as the '*' in protein sequences that indicates the position of a 'translated' STOP codon. |
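A minimal sketch of the "drop every non-alphabetic character" behaviour described above. The helper name `degap` is hypothetical, and restricting the test to ASCII letters is an assumption.

```python
def degap(sequence: str) -> str:
    """Keep only ASCII letters; gaps, '*', digits and spaces are dropped."""
    return "".join(ch for ch in sequence if ch.isascii() and ch.isalpha())

print(degap("AC-GT..N"))   # ACGTN
print(degap("MK-LV*"))     # MKLV
```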
| EXTRACTSEQ (EMBOSS, Pasteur) - extract regions |
EXTRACTSEQ
extracts
regions from a sequence. The program allows you
to specify one or more regions of a sequence to extract sub-sequences
from to build up
a contiguous resulting sequence. This is modelled on the cell's process of splicing out exons from mRNA, but the program is generally applicable to any cutting and splicing or editing operation on a single sequence. Extractseq reads in a sequence and a set of regions of that sequence as specified by pairs of start and end positions (either on the command-line or contained in a file) and writes out the specified regions of the input sequence in the order in which they have been specified. Thus, if the sequence "AAAGGGTTT" has been input and the regions: "7-9, 3-4" have been specified, then the output sequence will be: "TTTAG". |
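The worked example above can be reproduced with a short sketch. The function name `extract_regions` is hypothetical, and positions are taken as 1-based and inclusive, as in the example.

```python
def extract_regions(sequence: str, regions: list[tuple[int, int]]) -> str:
    """Concatenate the given regions in the order they are specified."""
    parts = []
    for start, end in regions:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        parts.append(sequence[start - 1:end])  # 1-based inclusive slice
    return "".join(parts)

print(extract_regions("AAAGGGTTT", [(7, 9), (3, 4)]))  # TTTAG
```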
| MEGAMERGER (EMBOSS, Pasteur) - contig assembly of two large sequences |
MEGAMERGER
takes two
overlapping sequences and merges them into one sequence. It could
thus
be regarded as the opposite of what splitter does. It should be possible to merge sequences that are megabytes long. Compare this with the program merger, which does a more accurate alignment of more divergent sequences using the Needleman and Wunsch algorithm but uses much more memory. The sequences should ideally be identical in their region of overlap. If there are any mismatches between the two sequences, megamerger will still attempt to create a merged sequence, but you should check that the result is what you require. |
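For the ideal case in which the two sequences are identical in their overlap, merging reduces to finding the longest suffix of the first sequence that is also a prefix of the second. The sketch below shows only that simplified idea; it is not megamerger's actual algorithm, does not handle mismatches, and the name `merge_exact_overlap` is hypothetical.

```python
def merge_exact_overlap(left: str, right: str, min_overlap: int = 1) -> str:
    """Merge two sequences on their longest exact suffix/prefix overlap."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    raise ValueError("no exact overlap of the required length")

print(merge_exact_overlap("ACGTACGG", "ACGGTTTA"))  # ACGTACGGTTTA
```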
| MERGER (EMBOSS, Pasteur) - contig assembly of two sequences |
MERGER
joins two
overlapping
nucleic acid sequences into one merged sequence. It uses a global
alignment algorithm (Needleman & Wunsch) to optimally align the
sequences
and then it creates the merged sequence from the alignment. When there is a mismatch in the alignment between the two sequences, the correct base to include in the resulting sequence is chosen by using the base from the sequence which has the best local sequence quality score. This program was originally written to aid in the reconstruction of mRNA sequences which had been sequenced from both ends as a 5' and 3' EST (cDNA). Another example use is joining two reads produced by primer-walking sequencing. The gap open and gap extension penalties have been set at a higher level than is usual (50 and 5). This was experimentally determined to give the best results with a set of poor-quality EST test sequences. |
| PASTESEQ (EMBOSS, Pasteur) - paste 2 sequences |
PASTESEQ
is a simple editing program
which allows you to insert one sequence into another sequence
after a
specified position and to then write out the results to a sequence file. |
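Inserting one sequence into another after a given position is a single slice operation. In this sketch the name `paste_after` is hypothetical, and position 0 is assumed to mean "insert before the first residue".

```python
def paste_after(target: str, insert: str, position: int) -> str:
    """Insert `insert` into `target` after the given 1-based position."""
    if not 0 <= position <= len(target):
        raise ValueError("position must lie within the target sequence")
    return target[:position] + insert + target[position:]

print(paste_after("AAATTT", "GGG", 3))  # AAAGGGTTT
```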
| REVSEQ (EMBOSS, Pasteur) - reverse and complement |
REVSEQ
reverses and complements a
sequence (EMBOSS). NOTE: You can also submit or paste a multiple
FASTA sequence
file for "batch reversion" of sequences! REVSEQ (BCM) is also available from the BCM Search Launcher. |
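Reverse-complementing a DNA sequence can be sketched with a translation table. The helper name is hypothetical, and only A, C, G, T and N are handled here; other IUPAC codes are passed through unchanged, which is a simplification.

```python
# Complement table for unambiguous bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse the whole sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]

print(reverse_complement("AAACCG"))  # CGGTTT
```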
| TRIMEST (EMBOSS, Pasteur) - trim poly-A tails |
TRIMEST
trims poly-A tails off EST sequences. EST and mRNA sequences often have
poly-A tails at the end
of
them. This utility removes those poly-A tails. EST sequences are often the reverse complement of the corresponding mRNA's forward sense and have poly-T tails at their 5' end. By default, this program also detects and removes these and writes out the reverse complement of the sequence. NOTE: Trimest is not infallible. There are often repeats of 'A' (or 'T') in a sequence that just happen by chance to occur at the 3' (or 5') end of the EST sequence. Trimest has no way of determining if the A's it finds are part of a real poly-A tail or are a part of the transcribed genomic sequence. It removes any apparent poly-A tails that match its criteria for a poly-A tail. |
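The trimming logic and its limitation can be illustrated with a deliberately naive sketch. The name `trim_tail` and the `min_length` threshold are assumptions for this example; the real program applies its own criteria for what counts as a poly-A tail.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def trim_tail(sequence: str, min_length: int = 4) -> str:
    """Strip a 3' poly-A run, or a 5' poly-T run plus reverse complement."""
    trimmed = sequence.rstrip("Aa")
    if len(sequence) - len(trimmed) >= min_length:
        return trimmed
    trimmed = sequence.lstrip("Tt")
    if len(sequence) - len(trimmed) >= min_length:
        # A 5' poly-T suggests the read is the reverse complement of the mRNA.
        return trimmed.translate(_COMPLEMENT)[::-1]
    return sequence  # no run long enough to look like a tail

print(trim_tail("GATTACAAAAAA"))  # GATTAC
print(trim_tail("TTTTTTGTAATC"))  # GATTAC
```

Note that in the first call the final A of a transcribed "GATTACA" would be lost along with the tail, which is exactly the kind of ambiguity the NOTE above warns about.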
| TRIMSEQ (EMBOSS, Pasteur) - trim ambiguous sequence ends |
TRIMSEQ
trims ambiguous
bits off the ends of sequences. This program is used to tidy up the ends of sequences, removing all the bits that you would really rather were not published. Specifically, it: - removes all gap characters from the ends - removes X's and N's (in nucleic sequences) from the ends - optionally removes *'s from the ends - optionally removes IUPAC ambiguity codes from the ends (B and Z in proteins, M,R,W,S,Y,K,V,H,D and B in nucleic sequences). |
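End trimming of this kind only touches the two ends and leaves internal ambiguity codes alone. The sketch below covers gap characters, X/N and the optional '*'. The function name, the set of gap symbols, and the choice to strip only X (not N, which is asparagine) from proteins are assumptions of this example.

```python
GAP_CHARS = "-.~"  # assumed gap symbols; the real program may differ

def trim_ends(sequence: str, is_protein: bool = False,
              trim_star: bool = False) -> str:
    """Strip gaps and unknown-residue codes from both ends only."""
    unwanted = GAP_CHARS + ("Xx" if is_protein else "NnXx")
    if trim_star:
        unwanted += "*"
    return sequence.strip(unwanted)

print(trim_ends("NN--ACGTNACGT-NX"))  # ACGTNACGT
print(trim_ends("XXMKNX*", is_protein=True, trim_star=True))  # MKN
```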
| VecScreen (NCBI) |
VecScreen
is a system for
quickly identifying segments of a nucleic acid sequence that may be of
vector origin. NCBI developed VecScreen to minimize the incidence and
impact of vector contamination in public sequence databases. Alternative: NCBI-BLAST2 Vector Searches at the EBI |
| VECTORSTRIP (EMBOSS, Pasteur) - stripping vector sequences |
VECTORSTRIP
is intended
to be useful for stripping vector sequence from the ends of sequences
of interest. For example, if a fragment has been cloned into a vector
and then sequenced, the sequence may contain vector data, e.g. from the
cloning polylinker at the 5' and 3' ends of the sequence. Vectorstrip
will remove these contaminating regions and output trimmed sequence
ready for input into another application. |
| Sequence Format Conversion | |
| Linkpage 1 - Sequence formats (2can, EBI) |
This is a concise view of the most important sequence
formats, taken from the 2can educational
web portal at EBI. Sequence formats are simply the
way in which the amino acid or DNA sequence is recorded in a computer
file. Examples of described formats are ALN/ClustalW, EMBL,
GenBank, FASTA, Pfam, Phylip, and raw. |
| Linkpage 2 - Sequence format conversion (Pasteur) |
The Pasteur Institute
provides a linkpage to various tools involved in sequence
format
conversion. They are all connected to analysis tools available on
the Pasteur server. |
| ABIVIEW (EMBOSS, Pasteur) |
ABIVIEW reads in an ABI sequence trace file and graphically displays the results. The data for each nucleotide is plotted and the assigned nucleotide (G, A, T, C or N) in the ABI file is overlaid on the graphs. It also writes out the sequence to an output sequence file; you can choose between many different formats such as FASTA, PHYLIP, GCG, CLUSTAL, and MSF. |
| Protein Duster (UCSC) |
Protein
Duster was developed by Jim Kent at
UCSC. This program removes formatting characters and other non-sequence
related stuff from a protein sequence. It
outputs in a variety of formats. You may choose the number of amino acids per line, the size of the "blocks", and the formatting in upper or lower case letters. |
| READSEQ (BIMAS, NIH) |
READSEQ is
probably the most popular program for sequence
conversion; it converts data between formats such as GCG, EMBL,
DNAStrider, GenBank, FASTA, and MSF. READSEQ is also available at Pasteur and at EBI. |
| Three To One (University of Alberta) |
Three
To One is part of the Sequence Manipulation Suite provided by
the University of Alberta and converts
three-letter translations to single-letter translations.
Digits and
blank spaces are removed automatically. Non-standard triplets are
ignored. |
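The conversion can be sketched with a lookup table for the 20 standard amino acids. The function name is hypothetical, and reading the cleaned text in fixed steps of three letters is an assumption about how non-standard triplets are skipped.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def three_to_one(text: str) -> str:
    """Convert three-letter codes to one-letter codes."""
    letters = re.sub(r"[\d\s]", "", text)  # drop digits and blank spaces
    result = []
    for i in range(0, len(letters) - 2, 3):
        code = letters[i:i + 3].capitalize()
        result.append(THREE_TO_ONE.get(code, ""))  # ignore unknown triplets
    return "".join(result)

print(three_to_one("Met Ala 1 Gly Xyz Lys"))  # MAGK
```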
| Sequence Randomization | |
| Random Sequence and Random Genes (RSAT) |
Random Sequence
and Random
Genes are tools which are integrated in the RSAT portal of regulatory
sequence analysis. Random Sequence generates random DNA sequences according to various probabilistic models. This tool is very useful if you want to verify the significance of results obtained by motif discovery or motif matching programs. You can easily generate a random sequence set corresponding to your "query dataset" simply by selecting the same number of sequences and the same length. Random Genes performs a random selection among the genes of a selected organism. This program is useful for estimating the false-positive rate of pattern discovery programs. NOTE: Please refer to the main sections of Random Sequence and Random Genes for details! |
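A control set matching a query dataset in sequence count and length can be sketched with the simplest probabilistic model, independent bases with fixed frequencies. This is not the RSAT implementation; the function names and the uniform default frequencies are assumptions.

```python
import random

def random_dna(length, freqs=None, rng=None):
    """Draw independent bases according to the given frequencies."""
    rng = rng or random.Random()
    freqs = freqs or {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    bases = list(freqs)
    weights = [freqs[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))

def random_set(count, length, freqs=None, seed=None):
    """Generate `count` sequences of equal length; seed for repeatability."""
    rng = random.Random(seed)
    return [random_dna(length, freqs, rng) for _ in range(count)]

for seq in random_set(3, 20, seed=1):
    print(seq)
```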
| RandSeq (ExPASy) |
RandSeq is
a
tool which generates a random protein sequence. You can choose the
composition percentage of each amino acid or use the composition of an
existing Swissprot/TrEMBL accession number. |
| SHUFFLESEQ (EMBOSS, Pasteur) |
SHUFFLESEQ
takes a sequence as input
and outputs one or more sequences whose order has been randomly
shuffled. No bases or residues are changed, only their order. The
number of shuffled sequences output can be set by the '-shuffle'
qualifier. NOTE: This program may be useful for producing sets of sequences which can be used to check the statistics of sequence similarity finding software. |
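Shuffling preserves composition and changes only the order, which a short sketch makes concrete. The function name and the `seed` parameter are hypothetical; `count` plays the role that the '-shuffle' qualifier plays above.

```python
import random

def shuffle_sequence(sequence, count=1, seed=None):
    """Return `count` random permutations of the input sequence."""
    rng = random.Random(seed)
    shuffled = []
    for _ in range(count):
        letters = list(sequence)
        rng.shuffle(letters)  # in-place shuffle; composition is unchanged
        shuffled.append("".join(letters))
    return shuffled

for seq in shuffle_sequence("AAACCCGGGTTT", count=3, seed=42):
    print(seq)
```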
| Code Tables | |
| Essential Codes for Molecular Biology (Oxford University) |
This "Essential Codes
for Molecular Biology" page presents a very good, concise view
of the most important code tables in molecular biology: IUPAC
nucleotide codes, amino acid codes, the Genetic code, a list of stop
and start codons, and amino acid properties. |
| IUPAC Nucleotide and Amino Acid Codes (2can, EBI) |
This IUPAC Nucleotide and Amino Acid Codes table is taken from the 2can educational webportal at EBI. It contains tables for the one-letter and three-letter abbreviation codes for amino acids, amino acid properties, substitutions, and structures, and the nucleotide codes assigned by the IUB-IUPAC. |
| IUPAC Periodic Table of the Elements (IUPAC) |
This IUPAC Periodic
Table of the Elements is taken from the website of IUPAC, the International
Union of Pure and Applied Chemistry. |