Bioinformatics World    
         
 Main Index -> VARIOUS TOOLS
                -> Sequence Manipulation
                -> Sequence Format Conversion
                -> Sequence Randomization
                -> Code Tables
     
               
            
Sequence Manipulation
Linkpage 1 -
Sequence manipulation tools
(Pasteur)
The Pasteur Institute provides a linkpage to various tools involved in sequence manipulation, like sequence replacement, deletion, gap removal, trimming, and splitting.
Linkpage 2 -
SeWeR
(Pasteur)
SeWeR (Sequence analysis using Web Resources) is an integrated portal to common web-based services in bioinformatics, developed at the Pasteur Institute. The "Tools" section contains a large list of programs for sequence manipulation.

NOTE: SeWeR does not perform well with Netscape 6 and 7. Please use MS Internet Explorer instead.
Linkpage 3 -
Sequence Manipulation Suite
(University of Alberta)
The Sequence Manipulation Suite is a collection of web-based programs for analyzing and formatting DNA and protein sequences. The output of each program is a set of HTML commands, which is rendered by your web browser as a standard web page. You can print and save the results, and you can edit them using an HTML editor or a text editor.
BIOSED
(EMBOSS, Pasteur) - search and replace
BIOSED is a simple sequence editing utility that searches for a target subsequence in one or more input sequences and replaces it with a specified second subsequence (or, optionally, just deletes the target subsequence it finds).

Biosed was inspired by the useful UNIX utility sed which searches for a pattern in text and can replace or delete the found pattern. If the target subsequence occurs more than once, then each instance of the target is replaced.
The target subsequence is not an ambiguity pattern of any kind; it is just a short sequence. A simple string match is done, and the replacement is made wherever the target matches exactly. Matching is independent of the case of the sequence or the target: uppercase and lowercase letters match each other.
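The matching rule described above (literal, case-insensitive, every occurrence replaced) can be sketched in a few lines of Python. This is only an illustration of the behaviour, not the way biosed itself is implemented; the function name `replace_subsequence` is hypothetical.

```python
import re


def replace_subsequence(sequence: str, target: str, replacement: str = "") -> str:
    """Replace every literal, case-insensitive occurrence of target.

    An empty replacement simply deletes the target.
    """
    if not target:
        raise ValueError("target must not be empty")
    # re.escape ensures the target is treated as a plain string, not a pattern.
    pattern = re.compile(re.escape(target), re.IGNORECASE)
    # A function as replacement avoids any backslash interpretation.
    return pattern.sub(lambda _match: replacement, sequence)


print(replace_subsequence("acgtACGTacgt", "CG", "nn"))
print(replace_subsequence("AAGGAA", "gg"))
```

Calling the function without a replacement corresponds to the "just delete" option mentioned above.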
CAP
(EMBOSS, Pasteur) - contig assembly of multiple sequences
CAP3 (Pasteur) is a widely used program for generating contig sequences from a set of input sequences. Fragments in random orientations are assembled into contigs.

CAP3 (PBIL, Lyon) is an alternative site to access this program.

CAP3 (BCM) is also available from the BCM Search Launcher.
Chromas
(Technelysium)
Chromas is a program for opening and manipulating chromatogram files from a variety of DNA sequencers. It can export sequences in plain text, formatted with base numbering, or in FASTA, EMBL, GenBank or GCG format. It can also reverse-complement and translate the sequences.

NOTE: The unregistered version is fully functional for a period of 60 days, after which you have to register/pay. Older versions are available for free.
CUTSEQ
(EMBOSS, Pasteur) - cut out a region
CUTSEQ removes a specified section from a sequence.

This simple editing program allows you to cut out a region from your sequence. It removes the sequence from the specified start to the end positions (inclusive) and returns the rest of the sequence in the output file.
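As a minimal sketch of the operation described above, assuming 1-based positions as used elsewhere on this page, the cut can be written with Python slicing. The function name `cut_region` is hypothetical and is not part of EMBOSS.

```python
def cut_region(sequence: str, start: int, end: int) -> str:
    """Remove 1-based positions start..end (inclusive) and return the rest."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("need 1 <= start <= end <= len(sequence)")
    # Everything before 'start' plus everything after 'end'.
    return sequence[:start - 1] + sequence[end:]


print(cut_region("AAAGGGTTT", 4, 6))
```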
DEGAPSEQ
(EMBOSS, Pasteur) - remove gaps
DEGAPSEQ removes gap characters from sequences. It reads in one or more sequences and writes them out again minus any gap characters. In effect it removes gaps from aligned sequences.

In fact, it does more than just this: it removes ANY non-alphabetic character from the input sequence. As well as removing the gap characters, it will therefore remove such things as the '*' in protein sequences that indicates the position of a translated STOP codon.
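The effect of keeping only alphabetic characters can be illustrated with a short Python sketch. It assumes plain ASCII sequence text; the function name `remove_non_alphabetic` is hypothetical.

```python
def remove_non_alphabetic(sequence: str) -> str:
    """Keep only ASCII letters; gaps, '*', digits and spaces are dropped."""
    return "".join(ch for ch in sequence if ch.isascii() and ch.isalpha())


print(remove_non_alphabetic("AC-GT..A*"))
print(remove_non_alphabetic("MKV--LA*"))
```

Note that the stop-codon marker '*' disappears together with the gap characters, which is the side effect pointed out above.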
EXTRACTSEQ
(EMBOSS, Pasteur) - extract regions
EXTRACTSEQ extracts regions from a sequence. The program allows you to specify one or more regions of a sequence; the corresponding sub-sequences are extracted and joined into one contiguous resulting sequence.

This is modelled on the cell's process of splicing out exons from mRNA, but the program is generally applicable to any cutting and splicing or editing operation on a single sequence.
Extractseq reads in a sequence and a set of regions of that sequence as specified by pairs of start and end positions (either on the command-line or contained in a file) and writes out the specified regions of the input sequence in the order in which they have been specified. Thus, if the sequence "AAAGGGTTT" has been input and the regions: "7-9, 3-4" have been specified, then the output sequence will be: "TTTAG".   
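The worked example above can be reproduced with a small Python sketch. It assumes regions are given as 1-based, inclusive (start, end) pairs; the function name `extract_regions` is hypothetical.

```python
def extract_regions(sequence: str, regions: list[tuple[int, int]]) -> str:
    """Concatenate 1-based, inclusive regions in the order they are given."""
    parts = []
    for start, end in regions:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"invalid region {start}-{end}")
        parts.append(sequence[start - 1:end])
    return "".join(parts)


# Regions 7-9 and 3-4 of AAAGGGTTT give TTT followed by AG.
print(extract_regions("AAAGGGTTT", [(7, 9), (3, 4)]))
```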
MEGAMERGER (EMBOSS, Pasteur) -  contig assembly of two large sequences

MEGAMERGER takes two overlapping sequences and merges them into one sequence. It could thus be regarded as the opposite of what splitter does.

It should be possible to merge sequences that are megabytes long. Compare this with the program merger, which does a more accurate alignment of more divergent sequences using the Needleman and Wunsch algorithm but uses much more memory. The sequences should ideally be identical in their region of overlap. If there are any mismatches between the two sequences, megamerger will still attempt to create a merged sequence, but you should check that the result is what you require.
MERGER
(EMBOSS, Pasteur) -  contig assembly of two sequences
MERGER joins two overlapping nucleic acid sequences into one merged sequence. It uses a global alignment algorithm (Needleman & Wunsch) to optimally align the sequences and then it creates the merged sequence from the alignment.

When there is a mismatch in the alignment between the two sequences, the base included in the merged sequence is taken from the sequence that has the better local sequence quality score. This program was originally written to aid in the reconstruction of mRNA sequences which had been sequenced from both ends as a 5' and 3' EST (cDNA), e.g. joining two reads produced by primer walking sequencing. The gap open and gap extension penalties have been set at a higher level than is usual (50 and 5). This was experimentally determined to give the best results with a set of poor quality EST test sequences.
PASTESEQ
(EMBOSS, Pasteur) - paste 2 sequences
PASTESEQ is a simple editing program which allows you to insert one sequence into another sequence after a specified position and to then write out the results to a sequence file.
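Insertion after a given position can be sketched with Python slicing. The sketch assumes a 1-based position and, as an additional assumption, treats position 0 as "before the first residue"; the function name `paste_sequence` is hypothetical.

```python
def paste_sequence(main: str, insert: str, position: int) -> str:
    """Insert 'insert' into 'main' after the given 1-based position."""
    if not 0 <= position <= len(main):
        raise ValueError("position must be between 0 and len(main)")
    return main[:position] + insert + main[position:]


print(paste_sequence("AAAATTTT", "GG", 4))
```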
REVSEQ
(EMBOSS, Pasteur) - reverse and complement
REVSEQ (EMBOSS) reverses and complements a sequence. NOTE: You can also submit or paste a multiple FASTA sequence file to reverse-complement a batch of sequences.

REVSEQ (BCM) is also available from the BCM Search Launcher.
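For readers who want to check a result by hand, reverse-complementing a DNA sequence can be sketched in Python as follows. The sketch covers the IUPAC nucleotide codes and preserves case; the function name `reverse_complement` is hypothetical and this is not the revseq implementation.

```python
# Complement of each IUPAC nucleotide code (S, W and N are their own complement).
_CODES = "ACGTRYKMBVDHSWN"
_COMPS = "TGCAYRMKVBHDSWN"
_COMPLEMENT = str.maketrans(_CODES + _CODES.lower(), _COMPS + _COMPS.lower())


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the order."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
print(reverse_complement("aagtN"))
```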
TRIMEST
(EMBOSS, Pasteur) - trim poly-A tails
TRIMEST trims poly-A tails off EST sequences. EST and mRNA sequences often have poly-A tails at the end of them. This utility removes those poly-A tails.
EST sequences are often the reverse complement of the corresponding mRNA's forward sense and have poly-T tails at their 5' end. By default, this program also detects and removes these and writes out the reverse complement of the sequence.
NOTE: Trimest is not infallible. There are often repeats of 'A' (or 'T') in a sequence that just happen by chance to occur at the 3' (or 5') end of the EST sequence. Trimest has no way of determining if the A's it finds are part of a real poly-A tail or are a part of the transcribed genomic sequence. It removes any apparent poly-A tails that match its criteria for a poly-A tail.
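The caveat above is easy to see in a simplified sketch. The following Python function removes a 3' run of A's, or a 5' run of T's followed by reverse-complementing, whenever the run reaches a minimum length. The function name `trim_est`, the `min_length` parameter and its default value are assumptions made for illustration; the real program's criteria are not reproduced here.

```python
import re

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def trim_est(sequence: str, min_length: int = 4) -> str:
    """Trim a 3' poly-A tail, or a 5' poly-T tail plus reverse complement."""
    poly_a = re.search(r"A{%d,}$" % min_length, sequence, re.IGNORECASE)
    if poly_a:
        return sequence[:poly_a.start()]
    poly_t = re.match(r"T{%d,}" % min_length, sequence, re.IGNORECASE)
    if poly_t:
        trimmed = sequence[poly_t.end():]
        # Write out the forward sense of the trimmed EST.
        return trimmed.translate(_COMPLEMENT)[::-1]
    return sequence


print(trim_est("GATTACAAAAAA"))
print(trim_est("TTTTTGGAC"))
```

A purely length-based rule like this cannot tell a genuine tail from a genomic run of A's, which is exactly the limitation described in the note.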
TRIMSEQ
(EMBOSS, Pasteur) - trim ambiguous sequence ends

TRIMSEQ trims ambiguous bits off the ends of sequences.
This program is used to tidy up the ends of sequences, removing all the bits that you would really rather were not published. Specifically, it:
- removes all gap characters from the ends
- removes X's and N's (in nucleic sequences) from the ends
- optionally removes *'s from the ends
- optionally removes IUPAC ambiguity codes from the ends (B and Z in proteins, M,R,W,S,Y,K,V,H,D and B in nucleic sequences). 
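The list above can be approximated for nucleic acid sequences with a short Python sketch. It is a simplification that strips unwanted characters from both ends one by one; the set of gap characters ('-' and '.'), the function name `trim_sequence_ends` and its parameters are assumptions made for illustration, not the options of the real program.

```python
NUCLEIC_AMBIGUITY = "MRWSYKVHDB"


def trim_sequence_ends(sequence: str,
                       strip_star: bool = False,
                       strip_ambiguity: bool = False) -> str:
    """Strip gaps, X's and N's (and optionally more) from both ends."""
    unwanted = "-." + "XN"          # gap characters are an assumption
    if strip_star:
        unwanted += "*"
    if strip_ambiguity:
        unwanted += NUCLEIC_AMBIGUITY
    unwanted += unwanted.lower()    # treat lowercase the same way
    # str.strip removes any of these characters from both ends only.
    return sequence.strip(unwanted)


print(trim_sequence_ends("--NNACGTNACGTXX-"))
print(trim_sequence_ends("RYACGTACGTK", strip_ambiguity=True))
```

Characters inside the sequence, such as the internal N in the first example, are left untouched.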
VecScreen
(NCBI)
VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. NCBI developed VecScreen to minimize the incidence and impact of vector contamination in public sequence databases.
Alternative: NCBI-BLAST2 Vector Searches at the EBI
VECTORSTRIP (EMBOSS, Pasteur) - stripping vector sequences
VECTORSTRIP strips vector sequence from the ends of sequences of interest. For example, if a fragment has been cloned into a vector and then sequenced, the sequence may contain vector data, e.g. from the cloning polylinker, at its 5' and 3' ends. Vectorstrip will remove these contaminating regions and output a trimmed sequence ready for input into another application.
     

Sequence Format Conversion
Linkpage 1 - Sequence formats
(2can, EBI)
This is a concise view of the most important sequence formats, taken from the 2can educational web portal at EBI. Sequence formats are simply the way in which the amino acid or DNA sequence is recorded in a computer file. Examples of described formats are ALN/ClustalW, EMBL, GenBank, FASTA, Pfam, Phylip, and raw.
Linkpage 2 - Sequence format conversion
(Pasteur)
The Pasteur Institute provides a linkpage to various tools involved in sequence format conversion. They are all connected to analysis tools available on the Pasteur server.
ABIVIEW
(EMBOSS, Pasteur)
ABIVIEW reads in an ABI sequence trace file and graphically displays the results. The data for each nucleotide is plotted, and the assigned nucleotide (G, A, T, C or N) in the ABI file is overlaid on the graphs. It also writes out the sequence to an output sequence file; you can choose between many different formats, such as FASTA, PHYLIP, GCG, CLUSTAL and MSF.
Protein Duster
(UCSC)
Protein Duster was developed by Jim Kent at UCSC. This program removes formatting characters and other non-sequence content from a protein sequence. It outputs in a variety of formats.
You may choose the number of amino acids per line, the size of the "blocks", and the formatting in upper or lower case letters.
READSEQ
(BIMAS, NIH)
READSEQ is probably the most popular program for sequence conversion. It converts data between formats such as GCG, EMBL, DNAStrider, GenBank, FASTA and MSF.

READSEQ is also available at Pasteur.

READSEQ is also available at EBI.
Three To One
(University of Alberta)
Three To One is part of the Sequence Manipulation Suite provided by the University of Alberta. It converts three-letter translations to single-letter translations. Digits and blank spaces are removed automatically. Non-standard triplets are ignored.
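The conversion can be sketched in Python for the 20 standard amino acids. This is an illustration of the behaviour described above (digits and blanks removed, unknown triplets skipped), not the code used by the Sequence Manipulation Suite; the function name `three_to_one` is hypothetical.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(text: str) -> str:
    """Convert three-letter amino acid codes to one-letter codes."""
    # Digits and whitespace are removed before reading triplets.
    letters = re.sub(r"[\d\s]", "", text)
    result = []
    for i in range(0, len(letters) - 2, 3):
        triplet = letters[i:i + 3].capitalize()
        # Non-standard triplets are skipped rather than reported.
        if triplet in THREE_TO_ONE:
            result.append(THREE_TO_ONE[triplet])
    return "".join(result)


print(three_to_one("Met Ala 12 Gly Xyz Lys"))
```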
             
           
Sequence Randomization
Random Sequence and Random Genes
(RSAT)
Random Sequence and Random Genes are tools which are integrated in the RSAT portal of regulatory sequence analysis.
Random Sequence generates random DNA sequences according to various probabilistic models. This tool is very useful if you want to verify the significance of results obtained by programs of Motif Discovery or programs of Motif Matching. You can easily generate a random sequence set corresponding to your "query dataset", simply by selecting the same sequence number and the same length.
Random Genes performs a random selection among the genes of a selected organism. This program is useful for estimating the rate of false positives for pattern discovery programs.
NOTE: Please refer to the main sections of Random Sequence and Random Genes for details.
RandSeq
(ExPASy)
RandSeq is a tool which generates a random protein sequence. You can choose the composition percentage of each amino acid or use the composition of an existing Swissprot/TrEMBL accession number.
SHUFFLESEQ
(EMBOSS, Pasteur)
SHUFFLESEQ takes a sequence as input and outputs one or more sequences whose order has been randomly shuffled. No bases or residues are changed, only their order. The number of shuffled sequences output can be set by the '-shuffle' qualifier. 
NOTE: This program may be useful for producing sets of sequences which can be used to check the statistics of sequence similarity finding software.
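Shuffling of this kind can be sketched in Python with the standard `random` module. The function name `shuffle_sequence` and its `count` and `seed` parameters are hypothetical; they only mirror the idea of producing several shuffled copies.

```python
import random


def shuffle_sequence(sequence: str, count: int = 1, seed=None) -> list[str]:
    """Return 'count' shuffled copies; composition is unchanged."""
    rng = random.Random(seed)  # a seed makes the result reproducible
    shuffled = []
    for _ in range(count):
        residues = list(sequence)
        rng.shuffle(residues)
        shuffled.append("".join(residues))
    return shuffled


for copy in shuffle_sequence("ACGTACGTAA", count=3, seed=1):
    print(copy)
```

Because only the order changes, each copy has the same base or residue composition as the input, which is what makes such sets useful as a background for similarity statistics.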


Code Tables
Essential Codes for Molecular Biology
(Oxford University)
This "Essential Codes for Molecular Biology" page presents a very good, concise view of the most important code tables in molecular biology: IUPAC nucleotide codes, amino acid codes, the Genetic code, a list of stop and start codons, and amino acid properties.
IUPAC Nucleotide and Amino Acid Codes
(2can, EBI)
This IUPAC Nucleotide and Amino Acid Codes table is taken from the 2can educational webportal at EBI. It contains tables for the one-letter and three-letter abbreviation codes for amino acids, amino acid properties, substitutions, and structures, and the nucleotide codes assigned by the IUB-IUPAC. 
IUPAC Periodic Table of the Elements
(IUPAC)
This IUPAC Periodic Table of the Elements is taken from the website of IUPAC, the International Union of Pure and Applied Chemistry.