Bioinformatics World    
         
                                                       
           
Sequence Retrieval Resources
NOTE: Databases and tools which perform complex integrated data retrieval are described in the "High-Throughput Data Retrieval" section within the "Data Integration" page !
Dbfetch
(EBI)
Dbfetch is short for "database fetch". Dbfetch provides an easy way to retrieve entries from various databases at the EBI in a consistent manner. It can be used from any browser as well as from a web-aware scripting tool that uses wget, lynx or similar.
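Example: the following minimal sketch illustrates scripted use by building a Dbfetch request URL; it does not contact the server. NOTE that the base URL, the parameter names (db, id, format, style) and the database name "ena_sequence" are assumptions which are not given on this page and should be checked against the current EBI Dbfetch documentation. The accession U04636 is taken from the GenBank IDs section.

```python
from urllib.parse import urlencode

# Assumed endpoint (not stated on this page); verify before use.
DBFETCH_BASE = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, entry_id, fmt="fasta", style="raw"):
    """Build a Dbfetch-style request URL for one entry (no network access)."""
    query = urlencode({"db": db, "id": entry_id, "format": fmt, "style": style})
    return f"{DBFETCH_BASE}?{query}"


if __name__ == "__main__":
    # "ena_sequence" is an assumed database name used for illustration only.
    print(dbfetch_url("ena_sequence", "U04636"))
```

The resulting URL could then be passed to wget, lynx, or a similar tool, as mentioned above.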
Ensembl
(EBI+Sanger)

including:
Ensembl - basic search

BioMart


Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation on eukaryotic genomes.

1. Ensembl - basic search:
1.1. All Species at once: Using the "Search Ensembl" option at the Ensembl homepage, it is possible to look for genes in all species, using gene names, accession numbers, diseases, protein domain names, ...
NOTE that the output of a keyword sequence search here is considerably smaller than via SRS or ENTREZ !
1.2. Search the human genome: for genes, proteins, diseases, domains, SNPs and many more.
1.3. Search the mouse genome: for genes, proteins, diseases, domains, SNPs and many more.

2. BioMart:
BioMart is a powerful data retrieval tool that generates lists of biological objects (e.g. genes, SNPs) from data held in the Ensembl (and other) databases.
Please refer to the corresponding section for details !
Entrez
(NCBI) 
Entrez is a very powerful search and retrieval system that integrates information from databases at NCBI.  In fact, it is designed to retrieve not only sequences but data from many different sources.
Please refer to the Entrez main section for details. 
Fetch       
(EMBnet.ch)
Fetch is an alternative to NCBI Entrez. It is limited mainly to sequence databases such as GenBank, Swissprot, UniGene and RefSeq, but also includes Prosite and Interpro entries. NOTE that the search term is limited to an ID or accession number (no keyword search possible !)
SRS - Sequence Retrieval System
(LION)
SRS is a very comprehensive retrieval system for database entries. SRS was initially developed at the EBI, and is also available in commercial versions from LION Bioscience (Heidelberg, Germany).
Note: There is no recent publication on SRS, but there is excellent SRS online documentation.

1. Access to public SRS servers is available at (some selected sites):
- SRS6 at EBI  
- SRS6 at Sanger     
- SRS6 at DKFZ   
- SRS6 at EMBnet Austria 
NOTE that the various public SRS servers each contain a different set of searchable libraries ! NOTE: All SRS sites provide extensive help on how to use SRS !!

2. SRS session:
You can perform searches using accession numbers, keywords, authors, and many more. The set of databases ranges from sequence libraries (EMBL, PIR, SWISSPROT, ...), SeqRelated libraries (Prosite, Prints, Pfam, EPD, UTR, UNIGENE, UNIEST, ...), transcription factor libraries (TFSITE, TFFACTOR, TFMATRIX), literature databases (MEDLINE, ...), 3D structures (PDB, ...), genome databases (LocusLink, ...), mutation databases (OMIM, ...), metabolic pathways (ENZYME, ...), and many more ! Datasets can be saved in various formats, such as complete database entries or FASTA sequence files only.

UCSC Table Browser
(UCSC)
The UCSC Table Browser provides a powerful and flexible graphical interface for querying and manipulating the UCSC Genome Browser annotation tables. The Table Browser  lets you retrieve the DNA sequence data or annotation data underlying Genome Browser tracks for the entire genome, a specified coordinate range, or a set of accessions.
Please refer to the main section of UCSC Table Browser for detailed information !
            
                           
General Sequence Databases
NOTE: This section is intended to give an overview of some of the most important sequence databases (DNA and protein). Many of these databases are used by other tools and programs described elsewhere on this web portal.
CCDS
(EBI, NCBI, UCSC, Sanger)
The Consensus CDS (CCDS) database, started in March 2005, is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. Annotation of genes on the human genome is provided by multiple public resources, using different methods, and resulting in information that is similar but not always identical. The human genome sequence is now sufficiently stable to start identifying those gene placements that are identical, and to make those data public and supported as a core set by the three major public human genome browsers. The long term goal is to support convergence towards a standard set of gene annotations on the human genome.

Ways to access CCDS (selection):
1. CCDS membership is indicated on Ensembl, NCBI Map Viewer, UCSC, and Vega genome browser (of WTSI - Wellcome Trust Sanger Institute) and/or browser-associated gene, transcript, or protein reports. 
2. Links to the CCDS Database are provided on NCBI Entrez Gene reports and RefSeq transcript and protein sequence records.
3. The CCDS Database can be directly queried by accession identifier, Entrez Gene ID, CCDS ID, or Gene Symbol. The CCDS query interface does not support full text searching, so you may want to use Entrez Gene to retrieve the gene of interest and follow the link provided in the RefSeq section to the CCDS browser.

CCDS database structure: The CCDS set includes coding regions that are annotated as full-length (with an initiating ATG and valid stop-codon), can be translated from the genome without frameshifts, and use consensus splice-sites. Therefore, CCDS entries always span only the coding region of a transcript, and not the UTRs !
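Example: the sequence-level criteria named above (initiating ATG, valid stop codon, no frameshift) can be sketched as a simple check on a spliced coding sequence. NOTE that this is only an illustration under stated assumptions and not the actual CCDS procedure: it assumes the standard genetic code stop codons (TAA, TAG, TGA), it additionally rejects in-frame internal stop codons, and it does not examine splice sites or the genome alignment. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def looks_full_length_cds(seq):
    """Return True if seq starts with ATG, ends with a stop codon,
    has a length divisible by three, and has no internal in-frame stop."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Any stop codon before the last position would truncate the protein.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


if __name__ == "__main__":
    print(looks_full_length_cds("ATGAAATAA"))
```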

Typical CCDS accessions: refer to section CCDS IDs.
Ensembl Databases
(EBI+Sanger)



Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. The resulting databases contain genomic sequences with features attached. Of particular importance are gene features which represent protein coding genes in the genome. Ensembl copes with rapid updates in data and provides stable identifiers for genes so researchers can track genes of interest across different versions of the data. The Ensembl system is being applied to genomes of different organisms and is being kept up to date with the current state of genome sequencing.

Ways to access Ensembl (selection):
1. The diverse Ensembl Databases are accessible at the Ensembl website. Please refer also to the chapter "Ensembl - basic search".

Ensembl Databases: There is a series of individual Ensembl databases, according to different organisms and according to different types of molecules.
Examples:
ENSEMBL is a virtual database containing genomic sequences and annotations for all organisms currently available.
ENSEMBL_CDNA is a virtual database containing all cDNA sequences from all organisms.
ENSEMBL_RNA is a virtual database containing all RNA sequences from all organisms.
ENSEMBL_PROT is a virtual database containing all protein sequences from all organisms.

Typical Ensembl accessions: refer to section Ensembl IDs.
GenBank
(NCBI) 

EMBL-Bank
(EBI)

DDBJ
(NIG)

including:

BankIt

Sequin

TPA - Third Party Annotation dataset
(EBI)

NRNUC
(NCBI)

GENPEPT
(NCBI)

NRPEP
(NCBI)
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Since 1992 it has been based at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, located on the NIH campus. GenBank records are annotated using a standard set of biological terms and show these annotations in a Feature Table. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ) at the National Institute of Genetics (NIG), the European Molecular Biology Laboratory (EMBL-Bank), and GenBank at NCBI. Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis. These three databases are designed to provide and encourage access within the scientific community to the most up to date and comprehensive DNA sequence information.

GenBank database organization:
GenBank has traditionally been partitioned into "divisions" of taxonomic groups (like BCT for bacteria or PRI for primates). In recent years, divisions have been added corresponding to specific sequencing strategies, like EST (expressed sequence tag), STS (sequence-tagged site), GSS (genome survey sequence),  HTG (high throughput genomic), HTC (high throughput cDNA) and environmental sample (ENV) sequence sets.

Ways to access GenBank (selection):
1. Text search using the NCBI Entrez system.
2. Sequence search using BLAST.
3. Text search via public SRS servers. Note: The list of databases in SRS shows several entries directing to GenBank, like GENBANKRELEASE (latest full release of GENBANK),  GENBANKNEW (recent updates), or GENBANKTPA (the Third Party Annotation dataset). Likewise, the EMBL databases are organized accordingly.
4. Text search via Dbfetch ("simple version of SRS").

Data Submission to GenBank: The data in GenBank are submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS, or HTG sequences.
1. BankIt: is the tool of choice for simple submissions, when only one or a small number of records is to be submitted. Authors enter sequence information directly into a form and add biological annotation.
2. Sequin: is a standalone multi-platform submission program, available for download for Windows, UNIX, and Mac systems. Sequin handles both simple sequences and complex sequence sets, like those generated by population or mutation studies.

Typical GenBank accessions: refer to section GenBank IDs.

Related Databases:
1. The Third Party Annotation (TPA) dataset is a complement to the existing DDBJ/EMBL/GenBank comprehensive database of primary nucleotide sequences, which typically result from direct sequencing of cDNA's, EST's, genomic DNA's etc. Primary data are defined to be data for which the submitting group has done the sequencing and annotation, and as 'owner' of these data has privileges to submit updates/corrections etc. In contrast, non-primary sequences are defined as sequences which a) consist exclusively of sequences from one or several already existing entries 'owned' by other groups or b) consist of a mixture of new & already existing sequences. These sequences are submitted to DDBJ/EMBL/GenBank as part of the process of publishing biological experiments that include the annotation of existing nucleotide sequences in the primary sequence database. Thus, a publicly accessible TPA record will be linked to a publication that documents that the data are supported by biological experimentation.  

2. NRNUC is the non-redundant nucleotide sequence database from NCBI. It contains sequences from GENBANK, EMBL, DDBJ and PDB without EST, STS, GSS or HTG sequences. This non-redundant database is often selected at BLAST searches, when the user wants to limit the output to "better annotated" transcripts.

3. GENPEPT is a protein database translated from the current release of GENBANK.

4. NRPEP is the non-redundant protein database from NCBI. It contains GENBANK translations as well as sequences from protein databases like PDB, SWISSPROT and PIR.
RefSeq
(NCBI)
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. RefSeq standards serve as the basis for medical, functional, and diversity studies; they provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.
Example: Most genes / mRNAs are characterized by a series of cDNA sequences which have been deposited by different research groups in databases like GenBank. These cDNA sequences often differ in completeness; for example, they often miss the extreme 5'-end of the mRNA. RefSeq tries to pick the most "reliable" of those sequences as "RefSeqs".

Ways to access RefSeq (selection):
1. Text search using the NCBI Entrez system. Note that RefSeq entries are included in the "Entrez Nucleotide" section which covers also the total GenBank database. The output list can be filtered via the "Limits" tab and the selection "only from RefSeq". Note that property restrictions may also be directly entered as part of the query term, see this RefSeq query help section for details.
2. Sequence search using BLAST (nr database). Note: RefSeq records are also included in pre-computed BLAST analyses available at NCBI (BLink).
3. Text search via public SRS servers. Note: The list of databases in SRS shows several entries related to RefSeq, reflecting a partition of RefSeq into distinct subsets (mRNA, genomic, protein, non-coding RNA).
4. RefSeqs for each gene are also listed in gene reports like those available via the Entrez Gene pages, and are also graphically mapped onto genomes like in the UCSC genome browser.

RefSeq database structure: There are 2 RefSeq databases, containing either DNA or protein sequences. Similar to GenBank, there are "Release", "Updates" and "ALL" (virtual combined) databases.

Typical RefSeq accessions: refer to section RefSeq IDs.

Note: Starting in March 2005, the CCDS database is a new collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. Thereby, the CCDS database also makes use of the RefSeq databases.
Swiss-Prot / UniProt
(SIB, EBI)

TrEMBL / UniProt
(EBI)
The Swiss-Prot Protein Sequence Database is a database of protein sequences produced collaboratively by Amos Bairoch (University of Geneva) and the EMBL Data Library. The data in Swiss-Prot are derived from translations of DNA sequences from the EMBL Nucleotide Sequence Database, adapted from the Protein Identification Resource (PIR) collection, extracted from the literature and directly submitted by researchers. It contains high-quality annotation, is non-redundant, and is cross-referenced to several other databases, notably the EMBL nucleotide sequence database, PROSITE pattern database and PDB. SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc), a minimal level of redundancy and a high level of integration with other databases.

TrEMBL (also known as SPTREMBL) is a computer-annotated protein sequence database supplementing the SWISSPROT Protein Sequence Data Bank. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary section of SWISS-PROT. For all TrEMBL entries which should finally be upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers have been assigned.

Swiss-Prot and TrEMBL database structure: At public SRS sites, there is a list of Swiss-Prot / TrEMBL databases, like: SWISSPROT (Release); SWISSPROT (Updates); SPTREMBL; SWISSPROT_SPLICEVAR (a database containing splice isoforms of proteins from SWISSPROT); TREMBL_SPLICEVAR (a database containing splice isoforms of proteins from TrEMBL); SWISSPROTALL (a virtual database containing the latest full release of SWISSPROT plus the updates SWISSNEW); SWISSPROTPLUS (a virtual database containing SWISSPROTALL and TrEMBL).
UniProt database structure: UniProt databases are listed in SRS database lists as UNIREF100, UNIREF90 and UNIREF50. Please also refer to the UniProt section for details !

Typical Swiss-Prot / TrEMBL / UniProt accessions: see section Swiss-Prot / TrEMBL / UniProt IDs.

Note: In 2002, EBI, SIB (Swiss Institute of Bioinformatics), and PIR (Protein Information Resource) joined forces as the UniProt consortium. The primary mission is to support biological research by maintaining a high quality database that serves as a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase. UniProt is essentially built of Swiss-Prot and TrEMBL. Please also refer to the UniProt section for details ! 
    

ID Guide
NOTE: Often, when reading publications or database entries, accession numbers appear where the user initially has no idea which database these IDs are derived from. The purpose of this section is to list typical IDs and accession numbers of important databases, which may help to identify others of the same type. Please note that the terms "ID" and "accession number" are used quite loosely here. Most entries are derived from one example gene/protein, PTGS2 (COX2), in order to demonstrate the wide range of data available for one gene of interest.
NOTE:
At the FAQ page, FAQ DAT6 shows tools to convert lists of specific IDs into corresponding other IDs. 
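Example: the idea of recognizing a database from the shape of an ID can be sketched with a few regular expressions. NOTE that the patterns below are heuristics inferred only from the example IDs listed in this guide; they are not official format definitions, and the function name is hypothetical. The sketch returns all candidate databases because shapes can overlap: the HGMD example CM024574, for instance, has the same shape as a two-letter GenBank accession and would be reported as such.

```python
import re

# Heuristic patterns inferred from the example IDs in this guide (assumption).
ID_PATTERNS = [
    ("Ensembl", re.compile(r"ENS([A-Z]{3})?[GTPEF]\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+\.\d+")),
    ("ChEBI", re.compile(r"ChEBI:\d+", re.IGNORECASE)),
    ("GO", re.compile(r"GO:\d{7}")),
    ("dbSNP reference SNP", re.compile(r"rs\d+")),
    ("dbSNP submission", re.compile(r"ss\d+")),
    ("GEO", re.compile(r"(GPL|GSE|GSM|GDS)\d+")),
    ("Enzyme (EC)", re.compile(r"EC \d+\.\d+\.\d+\.\d+")),
    ("GenBank/EMBL/DDBJ", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("plain number (e.g. BIND or Entrez Gene)", re.compile(r"\d+")),
]


def guess_databases(identifier):
    """Return the names of all patterns that match the whole identifier."""
    text = identifier.strip()
    return [name for name, pattern in ID_PATTERNS if pattern.fullmatch(text)]


if __name__ == "__main__":
    for example in ["ENSG00000073756", "rs20426", "U04636", "5743"]:
        print(example, guess_databases(example))
```

An empty list means that none of the patterns matched; a plain number remains ambiguous, since several databases use numbers without a prefix.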
BIND IDs
(Blueprint, Toronto, Canada)

217962
BIND IDs: BIND (Biomolecular Interaction Network Database) is hosted by the Blueprint Initiative in Toronto, Canada. BIND is a database designed to store full descriptions of interactions, molecular complexes and pathways. Interactions between any two molecules composed of proteins, nucleic acids and small molecules are described.

Typical BIND accessions:
BIND accessions are simple numbers without prefix.
- 217962: This BIND entry describes the interaction between the protein TTP (ZFP36) with the example Cox2 mRNA.

Please refer also to the BIND main section.
BioCyc IDs
(SRI International)

including:
MetaCyc, HumanCyc



BioCyc IDs: BioCyc is a collection of Pathway/Genome Databases. Each Pathway/Genome Database in the BioCyc collection describes the genome and metabolic pathways (NO signaling pathways !) of a single organism, with the exception of the MetaCyc database, which is a reference source on metabolic pathways from many organisms.

Typical BioCyc accessions:
Please note that there are actually NO BioCyc-specific IDs, but for the sake of completeness the example gene PTGS2 (COX2) is listed here.
- ENSG00000073756: BioCyc uses Ensembl Gene IDs for gene-specific reports, like this one for human PTGS2.
- There is also a respective Enzyme (protein)-specific entry for PTGS2.
- 1.14.99.1: BioCyc uses the standard EC nomenclature for enzyme reactions, like the synthesis of Prostaglandin-H2 in this example.
- There are also specific compound-entries, like for Prostaglandin-H2 in this example.

Please refer also to the BioCyc main section.
CCDS IDs
(EBI, NCBI, UCSC, Sanger)

CCDS1371.1
CCDS IDs: The CCDS (Consensus CDS) database, is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality.

Typical CCDS accessions:
- CCDS1371.1: This is the CCDS entry of an example gene PTGS2 (COX2). Annotated genes that are included in the CCDS set are associated with a unique identifier number and version number (e.g., CCDS1.1, CCDS234.1). The version number will update if the CDS structure changes, or if the underlying genome sequence changes at that location. A CCDS entry lists accessions which are "included" like RefSeq and Ensembl IDs. Also, links to the major genome browsers (UCSC, NCBI, Ensembl, WTSI) are contained.

Please refer also to the CCDS main section.
ChEBI IDs
(EBI)

ChEBI:15365


ChEBI IDs: ChEBI, Chemical Entities of Biological Interest, is a freely available dictionary of "small molecular entities". The molecular entities in question are either products of nature or synthetic products (like drugs).

Typical ChEBI accessions:
- ChEBI:15365: This is the ChEBI identifier for the substance Aspirin. A ChEBI entry lists the ChEBI name, synonyms, IUPAC name, and database links, for example to the KEGG Compound database.
Note: the substance Diclofenac which is used as example for interference with PTGS2 in sections KEGG IDs and MMDB IDs, is not found in ChEBI, therefore another anti-inflammatory substance was chosen.
Note: There is no option in ChEBI to search using gene names like PTGS2, as there is no gene-drug correlation.

Please refer also to the ChEBI main section.
CleanEx IDs
(SIB)

HS_PTGS2
AFFY_HG-U133A_204748_at

CleanEx IDs: CleanEx is a database which provides access to public gene expression data via unique approved gene symbols and which represents heterogeneous expression data produced by different technologies in a way that facilitates joint analysis and cross-dataset comparisons.

Typical CleanEx accessions:
- HS_PTGS2: This is the CleanEx database entry for the human gene PTGS2. It summarizes expression data from different databases corresponding to this gene of interest.
- AFFY_HG-U133A_204748_at: This is a CleanEx Target database entry derived from an Affymetrix ProbeSet corresponding to the gene PTGS2, as contained on the human U133A chips.

Please refer also to the CleanEx main section.
dbSNP IDs
(NCBI)

rs20426
ss23080
dbSNP IDs: The SNP database (dbSNP) of NCBI is a central repository for information related to Single Nucleotide Polymorphisms in the genomes of various species.

Typical dbSNP accessions:

- rs20426: This is a typical dbSNP database record, a Reference SNP record which shows a polymorphism affecting the first amino acid of the human PTGS2 protein (Met/Ile). Such a record stores all SNP detail information, like method, submitter, a variation summary, and a validation summary.
These Reference SNP records provide a summary list of individual submitter records (useful in cases where several records address the same variation) in dbSNP and a list of external resource and database links. Reference SNP identifiers will also be exported as standardized features for annotation in other NCBI resources.
Reference SNP cluster 'rs' ID's are created by NCBI during periodic 'builds' of the database. Reference SNP clusters define a non-redundant set of markers that are used for annotation of reference genome sequence and integration with other NCBI resources. New submissions that match existing data will be merged into an existing refSNP cluster. A reference SNP cluster record has the format NCBI | rs[NCBI SNP ID] where 'rs' is always lower case.
Note that from the GeneView it is easy to retrieve the complete list of SNPs in the respective transcript region.

- ss23080: This is a dbSNP Assay ID ('ss'), representing an individual submission record.

Please refer also to the dbSNP main section. Please refer also to this dbSNP identifiers information page.
Ensembl IDs
(EBI+Sanger)

ENSG00000073756
ENSMUSG00000032487
ENST00000186982 
ENSMUST00000035065
ENSP00000186982
ENSMUSP00000035065 
ENSF00000001307
ENSE00001002288
Ensembl IDs: Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. The resulting databases contain genomic sequences with features attached.

Typical Ensembl accessions:
Note: These links are related to the example gene / protein PTGS2 (COX2).
- ENSG00000073756: This is a human Gene Report, with links to the genome browser display, to related Ensembl transcript, family, etc. reports, to orthologous sequences, protein domains, motifs, GO terms, and links to all major databases.
- ENSMUSG00000032487: A mouse Gene Report, similar to human.
- ENST00000186982: A human Transcript report. Similar to Ensembl Gene reports, but showing a very useful feature which displays the transcript sequence as colored codons, exons, and also SNPs.
- ENSMUST00000035065: A mouse Transcript report.
- ENSP00000186982: A human Protein Report. This report focuses on protein domains and motifs, but also displays very nicely the positions of alternating exons and SNPs.
- ENSMUSP00000035065: A mouse Protein Report.
NOTE: A search via the Ensembl search field using a Transcript ID or a protein ID will initially retrieve the Gene report which then links to the other entries.
- ENSF00000001307: A Protein Family Report. This report "unifies" in one display all protein sequences from all species available in Ensembl which belong to the same protein family. It is possible to produce multiple alignments and to download the sequences.
NOTE: There is an individual Ensembl Family report for each species (like human in the example). The upper part of the entry shows a chromosomal map with the species-specific positions of the genes belonging to the family, whereas the lower part lists all the other species.
- ENSE00001002288: This is an Ensembl Exon Report, corresponding to the first of the 10 exons of the human PTGS2 gene.
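Example: the Ensembl accessions listed above share a common shape, which the following sketch takes apart. NOTE that the rules used here are inferred from the examples in this section only and are assumptions: the prefix "ENS", an optional three-letter species code (none for human, "MUS" for mouse), one letter for the report type (G, T, P, F, E) and an 11-digit number. The function name and the returned dictionary keys are hypothetical.

```python
import re

# Report types as they appear in the examples of this section.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "protein",
    "F": "protein family",
    "E": "exon",
}

ENSEMBL_ID = re.compile(r"ENS([A-Z]{3})?([GTPFE])(\d{11})")


def parse_ensembl_id(identifier):
    """Split an Ensembl-style ID into species code, report type and number."""
    match = ENSEMBL_ID.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"not an Ensembl-style ID: {identifier!r}")
    species, letter, number = match.groups()
    return {
        "species_code": species or "",  # empty for the human examples
        "feature": FEATURE_TYPES[letter],
        "number": int(number),
    }


if __name__ == "__main__":
    print(parse_ensembl_id("ENSMUSG00000032487"))
```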

Please refer also to the Ensembl Databases main section.
Entrez Gene IDs
(NCBI)

5743
Entrez Gene IDs: Entrez Gene is one of the Entrez databases, which was constructed to replace the widely known and used LocusLink database in the year 2004. Entrez Gene integrates information from LocusLink and from genes annotated on Reference Sequences from completely sequenced genomes.

Typical Entrez Gene accessions:
- 5743: The Entrez Gene report of the human PTGS2 (COX2) gene. Entrez Gene IDs consist of a simple number without any letters. Entrez Gene entries include information from many resources, like a concise summary what is known about a certain gene, PubMed links, transcripts, genomic context, described interactions, GO terms, and much more.

Please refer also to the Entrez Gene main section.
Enzymes IDs

including:
IntEnz IDs
(EBI)

ENZYME IDs
(ExPASy)

EC 1.14.99.1
GO:0004666
Enzymes IDs: In general, there is a standardized enzyme nomenclature: all enzyme IDs start with the prefix "EC", followed by four numbers separated by periods. The following examples are derived from enzyme entries in different enzyme databases.

Typical Enzyme accessions:
- EC 1.14.99.1: This is the IntEnz entry for the enzyme prostaglandin synthetase.
- EC 1.14.99.1: This is the ENZYME (ExPASy) entry for the enzyme prostaglandin synthetase.
- EC 1.14.99.1: This is the KEGG LIGAND database entry for the enzyme prostaglandin synthetase (see also section KEGG IDs).

- GO:0004666: This is the QuickGO database entry for the enzyme prostaglandin synthetase (see also section GO IDs).
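Example: since the same EC number is used across IntEnz, ENZYME and KEGG LIGAND, it can be handy to split it into its four levels before comparing entries. The following is a minimal sketch; the function names are hypothetical, and incomplete EC numbers (for example with "-" in place of a level) are deliberately not handled. The second EC number in the usage lines is an arbitrary input chosen for illustration.

```python
def parse_ec_number(text):
    """Split an EC number such as 'EC 1.14.99.1' into four integers."""
    label = text.strip()
    if label.upper().startswith("EC"):
        label = label[2:].lstrip(": ")
    parts = label.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a complete EC number: {text!r}")
    return tuple(int(part) for part in parts)


def same_ec_group(first, second, depth=3):
    """Return True if two EC numbers agree on their first `depth` levels."""
    return parse_ec_number(first)[:depth] == parse_ec_number(second)[:depth]


if __name__ == "__main__":
    print(parse_ec_number("EC 1.14.99.1"))
    print(same_ec_group("EC 1.14.99.1", "1.14.99.3"))
```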

Please refer also to the Enzymes main section.
GenBank IDs
(NCBI) 

AY462100
U04636
L15326
U20548
GenBank IDs: GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Since 1992 it has been based at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, located on the NIH campus. GenBank records are annotated using a standard set of biological terms and show these annotations in a Feature Table. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ) at the National Institute of Genetics (NIG), the European Molecular Biology Laboratory (EMBL-Bank), and GenBank at NCBI.

Typical GenBank accessions:
A typical GenBank entry includes a concise description of a sequence, the taxonomy of the source organism, bibliographic references, a table of sequence features (like CDS, polyadenylation signals, etc.), contact information (possibly including information on clone distribution), and of course the sequence itself.
The GenBank accession number is a stable and unique identifier which is shared across GenBank-EMBL-DDBJ and which remains constant even when there is a change to the sequence or to the annotation. The DNA sequence is also assigned a unique NCBI identifier, called "gi". Note: When a change is made to the sequence, a new gi number is issued to the sequence and the version extension of the "accession.version" identifier is incremented. The accession number of the record as a whole remains unchanged and the older sequence remains available under the old "accession.version" identifier and gi.
Note: A similar system is applied to protein translations corresponding to nucleotide entries.
- AY462100: Human PTGS2 mRNA (CDS only).
- U04636: Human PTGS2 mRNA (full mRNA including UTRs, but generated from genomic DNA ! ("join..."); reference for RefSeq NM_000963).
- L15326: Human PTGS2 mRNA (full mRNA including UTRs, generated from mRNA molecule).
- U20548: Human PTGS2 promoter sequence.
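Example: the "accession.version" scheme described above can be handled by splitting the identifier at the period. The following minimal sketch uses hypothetical function names, and the version numbers in the usage lines are made up for illustration; they are not the actual versions of U04636.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier carries no version extension.
    """
    accession, separator, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError(f"missing accession in {identifier!r}")
    if not separator:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"version is not a number in {identifier!r}")
    return accession, int(version)


def same_record(first, second):
    """Return True if both identifiers point to the same accession,
    regardless of the sequence version."""
    return split_accession_version(first)[0] == split_accession_version(second)[0]


if __name__ == "__main__":
    # Version numbers below are illustrative only.
    print(split_accession_version("U04636.1"))
    print(same_record("U04636.1", "U04636.2"))
```

Comparing only the accession part reflects the rule that the accession number stays constant while the version extension is incremented with each sequence change.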

Please refer also to the GenBank main section.
GeneCards IDs
(Weizmann Institute)

GC01M183372

GeneCards IDs: GeneCards at Weizmann Inst. (mirror: GeneCards at Stanford) is a database of human genes, their products and their involvement in diseases.

Typical GeneCards accessions:
- GC01M183372: GeneCard for the human PTGS2 gene. A typical "GeneCard" includes related sequences, SNPs, medical news, disorders and mutations, MIPS PEDANT Viewer report (!), links to UCSC and Ensembl, PubMed search and many more. In addition, links to gene expression data (Arrays, ESTs, links to SOURCE) are available.

Please refer also to the GeneCards main section.
GEO IDs
(NCBI)

GPL96
GSE973
GSM15390
GDS649


GEO IDs: GEO is a gene expression and hybridization array data repository, as well as an online resource for the retrieval of gene expression data from any organism or artificial source. Many types of gene expression data including microarray-based experiments measuring mRNA, genomic DNA and protein abundance, as well as non-array techniques such as serial analysis of gene expression (SAGE), and mass spectrometry proteomic data, are accepted, accessioned, and archived as public data sets.

Typical GEO accessions:
- GPL96: This is a GEO Platform record (GPL) describing the Affymetrix GeneChip Human Genome U133 Array Set HG-U133A. GPLs describe the technology of the platforms (e.g. arrays) used, meaning the lists of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs, antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE tags, peptides). Each platform record is assigned a unique and stable GEO accession number (GPLxxx). A platform may reference many samples that have been submitted by multiple submitters.
- GSE973: This is a GEO Series Record (GSE) describing a microarray experiment where HUVEC were treated with the inflammatory cytokine Interleukin-1. GSEs contain detailed descriptions of an experiment, including submitter details and literature references. A GSE record also provides a list of all single sample records (GSMs) which are part of this experiment. GSEs therefore define a set of related samples considered to be part of a group, how the samples are related and if and how they are ordered. Each series record is assigned a unique and stable GEO accession number (GSExxx).
- GSM15390: This is one of the GEO Sample records (GSM) which belong to the GSE973 series. GSMs describe the conditions under which an individual sample was handled, the manipulations it underwent and the abundance measurement of each element derived from it. Example: In the case of an Affymetrix array experiment, the GSM provides the complete list of ProbeSets, and their measured values. Each sample record is assigned a unique and stable GEO accession number (GSMxxx). A sample entity must reference only one platform, but may be included in multiple series.
- GDS649: This is the GEO DataSet (GDS) record which corresponds to the experiment described in GSE973. GDS records are curated sets of GEO sample data. A GDS record represents a collection of biologically and statistically comparable GEO samples (GSMs) and forms the basis of GEO's suite of data display and analysis tools. Samples within a GDS refer to the same platform, that is, they share a common set of probe elements. Value measurements for each sample within a GDS are assumed to be calculated in an equivalent manner, that is, considerations such as background processing and normalization are consistent across the dataset. Information reflecting experimental design is provided through GDS subsets.
NOTE: If you want to know the expression profile of the example gene PTGS2 within this experiment, simply enter "PTGS2 GDS649" into the query field of the Entrez GEO Profiles page.
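
The four GEO record types above can be told apart by their accession prefix. The following is a minimal Python sketch, assuming that accessions always consist of one of the four prefixes followed by digits, as in the examples listed here. The function name classify_geo_accession is hypothetical and not part of any GEO tool.

```python
import re

# Prefix-to-record-type mapping, taken from the accession types described above.
GEO_RECORD_TYPES = {
    "GPL": "Platform",
    "GSE": "Series",
    "GSM": "Sample",
    "GDS": "DataSet",
}

# Assumption: a GEO accession is one of the four prefixes followed by digits only.
_GEO_PATTERN = re.compile(r"^(GPL|GSE|GSM|GDS)(\d+)$")


def classify_geo_accession(accession):
    """Return the GEO record type for an accession, or None if it does not match."""
    match = _GEO_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_RECORD_TYPES[match.group(1)]


if __name__ == "__main__":
    for acc in ("GPL96", "GSE973", "GSM15390", "GDS649"):
        print(acc, classify_geo_accession(acc))
```

Such a check is only a format test; it does not confirm that the record actually exists in GEO.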

Please refer also to the GEO main section.
GO IDs
(AmiGO)

GO:0004666

GO IDs: The goal of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing. The three organizing principles of GO are molecular function, biological process and cellular component.

Typical GO accessions:
- GO:0004666: A GO entry of the branch "molecular function", namely "prostaglandin-endoperoxide synthase activity", which was attributed to the example gene PTGS2 (COX2). A Gene Ontology accession always starts with the prefix "GO:", followed by a seven-digit number.

Please refer also to the GO main section.
HGMD IDs
(University of Wales)

CM024574
HGMD IDs: HGMD - Human Gene Mutation Database is maintained by the Institute of Medical Genetics, Cardiff, University of Wales and is the leading database storing not only mutations in human genes but also curated polymorphisms showing clear phenotypes. HGMD is also directly linked via NCBI LocusLink entries.

Typical HGMD accessions:
- CM024574: This is the HGMD accession for one of the mutations described in the human PTGS2 (COX2) gene (note that the link refers to the overview page of all described PTGS2 mutations). Mutations are grouped either by mutation type (like missense/nonsense, splicing, deletions, insertions,...) or by phenotype (like diabetes mellitus, colorectal cancer risk,...). The entry itself lists codon and amino acid exchanges, as well as the proposed phenotype and the primary literature reference.

Please refer also to the HGMD main section.
H-InvDB IDs
(JBIRC + DDBJ)

HIX0001429
HIT000036341


H-InvDB IDs: H-Invitational Database (H-InvDB) is a human gene database opened to the public in April 2004. It is hosted by the Japan Biological Information Research Center (JBIRC) and by the DNA Databank of Japan (DDBJ), with contributions from more than 40 institutes worldwide, such as the German DKFZ. The scope of H-InvDB is to provide an integrative annotation of full-length cDNA clones available from high-throughput cDNA sequencing projects.

Typical H-InvDB accessions:
- HIX0001429: This is the Locus view (HIX# ID) of the example gene PTGS2, which collects all gene-centered information and links, like exon-intron organization, cDNA multiple alignments, ORF alignments, gene expression profiles, and disease information.
- HIT000036341: This is the cDNA view (HIT# ID) of the example gene PTGS2, which provides information concerning function (HUGO, PubMed, and to the KEGG database for metabolic pathway information), mapping information (including external links like Ensembl, GeneLynx, Refseq, OMIM), predicted ORF (incl. Interpro and GO IDs), evolutionary features (orthologs in other species), secondary and tertiary structures (GTOP database), subcellular localization (incl. the output of several programs like PSORT, TargetP and TMHMM; and links to the TOPO viewer), and cDNA information (incl. SNPs, PolyA sites, repeats, and more).

Please refer also to the H-InvDB main section.
HPR IDs
(HPR program, Sweden)

CAB000113
HPR IDs: HPR - the Human Protein Atlas, contains hundreds of thousands of images of protein expression in normal human tissues and cancer cells.

Typical HPR accessions:
- CAB000113: This is the HPR entry for the gene / protein PTGS2 (COX2). In fact, the ID refers to the PTGS2 specific antibody. Links to Ensembl gene and transcript IDs as well as general protein specific data are indicated on top. Lists of normal and cancer tissues are presented together with symbols referring to the intensity of the staining in these tissues. Each tissue can be selected to show the corresponding high-quality immuno-histochemical staining. Note that the link "antibody info" will present  additional gene data, including the important links to the major databases, a gene summary, assigned GO terms and more.

Please refer also to the HPR main section.
HPRD IDs
(Johns Hopkins School of Medicine, Baltimore, and Institute of Bioinformatics, Bangalore, India)

02599
HPRD IDs: HPRD - Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data.

Typical HPRD accessions:
HPRD IDs consist of a simple number without any letters. Note that the HPRD ID is only displayed in the URL of a protein record, it is not shown within the record itself (although there is the option to query using the HPRD ID).
- 02599: This is the HPRD entry for the protein PTGS2 (COX2). In particular, have a look at the tab "Interactions". Interestingly, there is no entry at the tab "PTMs and Substrates".

Please refer also to the HPRD main section.
IntAct IDs
(EBI)

EBI-298933

IntAct IDs: IntAct provides a freely available, open source database system and analysis tools for protein interaction data.

Typical IntAct accessions:
- EBI-298933: This is the IntAct entry for the mouse protein Ptgs2. IMPORTANT NOTE: The tabular form is somewhat misleading, as it suggests that Ptgs2 itself interacts with all the molecules listed. This is not the case, as can be seen by following the link "Number of interactions", which displays the experimental details, the technical protocol used to verify the interaction, and the PubMed link. There, Ptgs2 turns out to be only one of the molecules identified as interaction partners of Protein kinase C, epsilon type (Kpce), which was used as the bait in this experiment. The table is therefore a list of partners of Kpce, not of Ptgs2.

There are several display options for interaction networks: "Graph" displays all interaction partners in a "hierarchical view". "Path" computes the "minimal connection network" of the components which have been selected via the checkboxes.


Please refer also to the IntAct main section.
InterPro IDs
(EBI)

IPR006209
InterPro IDs: InterPro is a database of protein families, domains and functional sites. InterPro searches simultaneously in a whole list of databases like Pfam, PRINTS, ProDom, PROSITE, SMART, SWISS-PROT, TIGRFAMs, PIRSF (PIR Superfamily), and Superfamily for domains, families, repeats and short sequence motifs.

Typical InterPro accessions:
- IPR006209: This is a domain-type InterPro entry, namely EGF-like domain, which is found in the example protein PTGS2. Besides a description of the domain itself, it lists many links to related InterPro entries, it shows the taxonomic coverage of the entry, example proteins, and more.         

Please refer also to the InterPro main section.
IPI IDs
(EBI)

IPI00018109

IPI IDs: The International Protein Index (IPI) provides a top level guide to the main databases that describe the human, mouse and rat  proteomes. IPI is assembled from protein sequence information taken from the following data sources: UniProt, Swiss-Prot, TrEMBL, RefSeq NPs, RefSeq XPs, Ensembl, TAIR, and H-InvDB.

Typical IPI accessions:
- IPI00018109: IPI entry of the example protein PTGS2. Similar to Swiss-Prot, there is a "raw version" and a "pretty-view" version of this IPI entry available, both retrievable via SRS-based systems. Note that, besides the "usual" database links, IPI entries also provide links to "more exotic" databases like trome, UTRdb, or CleanEx.

Please refer also to the IPI main section.
JASPAR IDs
(CGB, Karolinska)

MA0061
JASPAR IDs: JASPAR is an open-access database of transcription factors and their binding site profiles. It has gained considerable importance since TRANSFAC, the "mother" database for TF information, was largely commercialized.

Typical JASPAR accessions:
- MA0061: This is the MAtrix of the binding profile for the transcription factor NF-kappaB (REL). Note: There is no individual URL for each TF, but all TFs can be browsed by using the "Browse by..." buttons at the JASPAR start site.

Please refer also to the JASPAR main section.
KEGG IDs
(Kyoto University)

hsa:5743
K00509
hsa00590
ot00590
map00590
EC 1.14.99.1
R00073
cpd:C01690
KEGG IDs: KEGG is part of the GenomeNet project of the Kyoto University. KEGG turns sequence information from a number of organisms into metabolic or regulatory pathways. This site makes it easy to place genes into a functional context, and to look for as yet unknown genes that might exist in an organism.

Typical KEGG accessions:
- hsa:5743: This is the "gene-centered" KEGG report for the example gene human PTGS2, which can be retrieved by searching the KEGG GENES database. Besides the aa and nt sequences, the report lists internal links to other KEGG databases as well as external links.
- K00509: This is a KEGG ORTHOLOGY (KO) database report for the enzyme function "prostaglandin-endoperoxide synthase" which is related to the example protein PTGS2. The KO database contains a manually curated set of orthologous gene groups in the complete genomes, which are linked to the nodes (boxes) indicating gene products (mostly proteins) in the KEGG pathway maps. Thus, the KO database accession number, or the K number, represents the common identifier of the pathway node of the KEGG PATHWAY database and the ortholog group of the KEGG GENES database.
- hsa00590, ot00590, map00590: These accessions are KEGG PATHWAY reports for prostaglandin and leukotriene metabolism. The first is the human pathway, the second is the same pathway related to ALL organisms in KEGG, and the third is the so-called reference pathway. Pathway reports in general provide a graphical representation of metabolic pathways (like glycolysis or ATP synthesis) and regulatory pathways (like apoptosis or cell cycle).
- EC 1.14.99.1: This is a KEGG report specific for enzymes, like prostaglandin synthetases in this example, which lists the different names of an enzyme, reactions it catalyzes, substrates, inhibitors, pathways, genes involved, and more.
- R00073: This is a KEGG report specific for enzymatic Reactions, like prostaglandin reductase reaction in this example, which includes the equation, the chemical structures, as well as a link to the respective pathway.
- cpd:C01690: This is a KEGG compound ID, which refers to chemical compounds like Diclofenac in this example, which is a described inhibitor of PTGS2 action.
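
The KEGG identifier types listed above differ enough in shape that they can be distinguished with a few regular expressions. The following is a minimal Python sketch. The patterns are assumptions inferred only from the example IDs in this section, not from an official KEGG specification, and the function name classify_kegg_id is hypothetical.

```python
import re

# Patterns inferred from the example identifiers in this section only.
# Order matters: "cpd:C01690" must be tested before the generic gene pattern.
KEGG_PATTERNS = [
    ("compound", re.compile(r"^cpd:C\d{5}$")),           # cpd:C01690
    ("orthology", re.compile(r"^K\d{5}$")),              # K00509
    ("reaction", re.compile(r"^R\d{5}$")),               # R00073
    ("enzyme", re.compile(r"^EC \d+\.\d+\.\d+\.\d+$")),  # EC 1.14.99.1
    ("pathway", re.compile(r"^[a-z]{2,4}\d{5}$")),       # hsa00590, ot00590, map00590
    ("gene", re.compile(r"^[a-z]{3,4}:\S+$")),           # hsa:5743
]


def classify_kegg_id(identifier):
    """Return the KEGG record type suggested by the ID format, or None."""
    identifier = identifier.strip()
    for kind, pattern in KEGG_PATTERNS:
        if pattern.match(identifier):
            return kind
    return None


if __name__ == "__main__":
    for kegg_id in ("hsa:5743", "K00509", "hsa00590", "EC 1.14.99.1", "R00073", "cpd:C01690"):
        print(kegg_id, "->", classify_kegg_id(kegg_id))
```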

Please refer also to the KEGG main section.
MGI IDs
(Jackson Laboratory, Maine, USA)

MGI:97798
MGI IDs: MGI (Mouse Genome Informatics) is maintained at Jackson Laboratory, Maine, USA and collects all data about mouse genes, including nomenclature, map positions, expression data and more.

Typical MGI accessions:
- MGI:97798: This is the MGI gene entry for the mouse Ptgs2 (Cox-2) gene. It contains the majority of the major database links for this gene, including links to the specific MGI datasets, like expression information including RT-PCR, Northern Blot and Western Blot data (stored in the GXD database).

Please refer also to the MGI main section.
MMDB IDs
(NCBI)

23795

MMDB IDs: The NCBI Structure group maintains MMDB, a database of macromolecular 3D structures, as well as tools for their visualization and comparative analysis. MMDB, the Molecular Modeling Database, contains experimentally determined biopolymer structures obtained from the Protein Data Bank (PDB).

Typical MMDB accessions:
- 23795: This is the MMDB accession for the crystal structure of Diclofenac bound to the cyclooxygenase active site of the example protein Ptgs2 (Cox-2) (mouse). MMDB entries can be displayed in 3D using the NCBI software Cn3D which is very easy to install and to handle. Please refer to the Cn3D section for information.

NOTE: There is usually a corresponding PDB entry for each MMDB entry (see PDB IDs).

Please refer also to the MMDB main section.
OMIM IDs
(NCBI)

600262
601373.0001

OMIM IDs: OMIM is a catalog of human genes and genetic disorders. The database contains textual information (like "mini-reviews" !), pictures, and reference information. It also contains  links to NCBI's Entrez database of MEDLINE articles and sequence information.

Typical OMIM accessions:

- 600262: OMIM IDs consist of a simple number without any letters. This is the OMIM entry of the example gene PTGS2.

Note:
Each OMIM entry is assigned a unique six-digit number whose first digit indicates whether its inheritance is autosomal (1, 2, and 6 depending on date of creation), X-linked (3), Y-linked (4), or mitochondrial (5).

In addition, OMIM entries are categorized by content, which is displayed by symbols preceding the MIM number:
"*" (asterisk) indicates an entry of a gene of known sequence.
"#" (number symbol) indicates a descriptive entry, usually a phenotype, and does not represent a unique locus.
"+" (plus sign) indicates an entry of gene AND phenotype.
"%" (percentage sign) indicates confirmed phenotype for which the molecular basis is unknown.
"^" (caret symbol) indicates that the entry was removed from the database.

- 601373.0001: This is one of the allelic variants of the gene CCR5 (32bp deletion), conferring resistance to HIV infection.
Allelic variants are given a 10 digit number: the 6-digit number of the parent locus followed by a decimal point and a unique 4-digit variant number.
Note that for most genes, only selected mutations are included as specific subentries. Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance (e.g., dominant with some mutations, recessive with other mutations in the same gene).
Most of the allelic variants represent disease-producing mutations. A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders.
NOTE: OMIM does not accept an allelic variant number as search term (only the parent number) !
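
The numbering rules above (six-digit parent number, first digit indicating the mode of inheritance, optional four-digit allelic variant suffix) can be sketched in Python as follows. This is an illustration of the rules as described in this section, not an OMIM tool. The function name parse_mim_number is hypothetical.

```python
import re

# First-digit rule as described above: 1, 2 and 6 are autosomal entries.
INHERITANCE_BY_FIRST_DIGIT = {
    "1": "autosomal",
    "2": "autosomal",
    "6": "autosomal",
    "3": "X-linked",
    "4": "Y-linked",
    "5": "mitochondrial",
}

# Six-digit parent number, optionally followed by "." and a four-digit variant number.
_MIM_PATTERN = re.compile(r"^(\d{6})(?:\.(\d{4}))?$")


def parse_mim_number(mim):
    """Split a MIM number into parent and variant and look up the inheritance class."""
    match = _MIM_PATTERN.match(mim.strip())
    if match is None:
        return None
    parent, variant = match.groups()
    return {
        "parent": parent,    # this is the part OMIM accepts as a search term
        "variant": variant,  # None for a plain six-digit entry
        "inheritance": INHERITANCE_BY_FIRST_DIGIT.get(parent[0]),
    }


if __name__ == "__main__":
    print(parse_mim_number("600262"))
    print(parse_mim_number("601373.0001"))
```

Since an allelic variant number cannot be used as a search term, the "parent" field is the value to pass on to an OMIM query.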

Please refer also to the OMIM main section.
PDB IDs
(RCSB)

1PXX

PDB IDs: The PDB, Protein Data Bank, is maintained by the consortium RCSB (Research Collaboratory for Structural Bioinformatics). PDB makes all published macromolecular structures available, so it is the most comprehensive place to look for structures of proteins, DNA, RNA, and polysaccharides. PDB files contain atomic coordinate data for each structure.

Typical PDB accessions:
- 1PXX: This is the PDB accession for the crystal structure of Diclofenac bound to the cyclooxygenase active site of the example protein Ptgs2 (Cox-2) (mouse). Note that PDB accessions in general have a characteristic 4-character code.
Note: You may want to refer to the PDB entry in cases where a structure is composed of several molecules and you want to know which chain in the 3D structure (as displayed by e.g. Cn3D) corresponds to which molecule.

NOTE: There is usually a corresponding MMDB entry for each PDB entry (see MMDB IDs).

Please refer also to the PDB main section.
Pfam IDs
(Sanger)

PF00008
CL0001
Pfam IDs: Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. For each family in Pfam you can retrieve multiple alignments, view protein domain architectures, examine species distribution, follow links to other databases, and view known protein structures.

Typical Pfam accessions:
- PF00008: This is the same entry as described for InterPro IDs (EGF-like domain, which is found in the example protein PTGS2). Besides a description of the domain itself, Pfam offers interesting tools to generate multiple alignments, and to investigate the domain organization, the species distribution, or to generate a phylogenetic tree.
- CL0001: This is a typical Pfam clan accession number. This entry refers to the EGF superfamily clan. A clan contains two or more Pfam families that have arisen from a single evolutionary origin, and thus represents a protein superfamily.

Please refer also to the Pfam main section.
PharmGKB IDs
(Stanford)

PA293
PA448871
PA444552



PharmGKB IDs: PharmGKB - the "Pharmacogenetics and Pharmacogenomics Knowledgebase", is a research tool which aims to aid researchers in understanding how genetic variation among individuals contributes to differences in reactions to drugs. The PharmGKB database is a central repository for genetic and clinical information. In addition, genomic data, molecular and cellular phenotype data, and clinical phenotype data are accepted from the scientific community at large.

Typical PharmGKB accessions:
- PA293: This is the PharmGKB Gene entry for the gene PTGS2. Besides cross-references to all major "outside-databases", links to related drugs and related diseases are listed.
- PA448871: This is a PharmGKB Drug entry for the substance celecoxib (Celebrex), which interferes with the action of PTGS2. Interestingly, PharmGKB lists no association between PTGS2 and the drug Diclofenac, although Diclofenac is present in the database (PA449293).
- PA444552: This is a PharmGKB Disease entry for the disease hypertension, which is related to several genes, like PTGS2. Interestingly, PharmGKB lists no association between PTGS2 and the disease inflammation, although inflammation is present in the database (PA444620).

Please refer also to the PharmGKB main section.
PolyA_DB IDs
(University of Medicine and Dentistry of New Jersey)

p.5743.5
PolyA_DB IDs: PolyA_DB, maintained at the University of Medicine and Dentistry of New Jersey (UMDNJ), is a database which provides several types of information regarding polyadenylation in mammalian species.

Typical PolyA_DB accessions:
- p.5743.5: This is the Poly(A) site ID of the human gene PTGS2. PolyA_DB IDs have the format p.###.*, where ### is the Entrez Gene ID (LocusLink ID) of the corresponding gene and * is an order number (all poly(A) sites of the gene are numbered from 5' to 3' along the transcript).
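
The p.###.* format described above can be split into its two components with a short Python sketch. It assumes that both parts are plain integers, as in the example. The function name parse_polya_site_id is hypothetical.

```python
import re

# "p." + Entrez Gene ID + "." + order number of the poly(A) site (5' to 3').
_POLYA_PATTERN = re.compile(r"^p\.(\d+)\.(\d+)$")


def parse_polya_site_id(site_id):
    """Return (entrez_gene_id, site_order) for a PolyA_DB site ID, or None."""
    match = _POLYA_PATTERN.match(site_id.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(parse_polya_site_id("p.5743.5"))
```

The first component can then be used to look up the gene in any database that is keyed by Entrez Gene ID.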

Please refer also to the PolyA_DB main section.
PRINTS IDs
(Manchester University)

PR00457

PRINTS IDs: PRINTS  is a compendium of protein fingerprints, provided by the Manchester University. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT / TrEMBL composite.

Typical PRINTS accessions:
- PR00457: This is a PRINTS entry for Anperoxidase, a domain found in the example gene PTGS2, displayed by SPRINT ("pretty-view"). A detailed documentation about the domain is presented along with a list of proteins from diverse species which bear this domain. Also, a list of "initial motifs" and "final motifs" is presented.
- PR00457: This is the output which is displayed when searching the PRINTS database.

Please refer also to the PRINTS main section.
PROSITE IDs
(ExPASy)

PS50026

PROSITE IDs: PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

Typical PROSITE accessions:
- PS50026: This is the PROSITE entry for the same domain described for InterPro or Pfam IDs (EGF-like domain, which is found in the example protein PTGS2).

Please refer also to the PROSITE main section.
PubChem IDs
(NCBI)

CID:3033
SID:364811
PubChem IDs: PubChem is a collection of databases provided by the NCBI, which contains the chemical structures of small organic molecules and information on their biological activities. PubChem is organized as three linked databases within the Entrez/PubMed information retrieval system, meaning that all 3 databases can be searched "in one run". These are PubChem Substance, PubChem Compound, and PubChem BioAssay.

Typical PubChem accessions:
- CID:3033: This is the PubChem Compound ID for the substance Diclofenac, a known inhibitor of PTGS2 action. The entry contains structural data of the substance, references, synonyms, a description of the pharmacological action and more.
- SID:364811: This is the PubChem Substance ID for the substance Diclofenac.

Please refer also to the PubChem main section.
Reactome IDs
(CSHL, EBI, GO Consortium)

61606
140180

Reactome IDs: Reactome (formerly known as "Genome KnowledgeBase") is a collaborative effort to develop a curated resource of core pathways and reactions in human biology.

Typical Reactome accessions:
Note: Actually there are no "typical" Reactome IDs, instead reactions, pathways, and single molecules are attributed a certain number to be retrieved in the Reactome Eventbrowser.
- 61606: This is a typical Reactome-entry for a specific gene/protein displaying concise information on COX2 (PTGS2), linking to major external databases as well as to the reactions stored in Reactome where this molecule is involved.
- 140180: This is a typical Reactome-entry for a specific type of reaction, displaying information on COX reactions, meaning reactions performed by the enzymes COX1 and COX2 (PTGS2). Note that both enzymes perform the same reactions but under different stimulatory conditions. In this case there are links to 2 sub-reactions, displayed in individual pages.

Please refer also to the Reactome main section.
RefSeq IDs
(NCBI)

NM_000963
NP_000954
RefSeq IDs: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

Typical RefSeq accessions:
- NM_000963: This is a RefSeq mRNA record for human PTGS2. RefSeq entries look similar to "normal" GenBank records, in fact RefSeqs are actually derived from GenBank records which were adapted by NCBI staff.
- NP_000954: This is a RefSeq protein record for human PTGS2.
Note: Further accession numbers include "NC_", "NG_", "NT_", "NW_", and "NZ_" for genomic DNA and "NR_" for non-coding RNAs. "XM_", "XP_" and "XR_" designate model mRNAs, proteins, and non-coding transcripts generated by automated genome annotation pipelines. Please refer to this RefSeq accessions page for a full list of IDs.
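
The RefSeq prefixes mentioned above can be turned into a simple lookup table. The following Python sketch covers only the prefixes named in this section and is therefore incomplete by design. The function name classify_refseq is hypothetical.

```python
# Molecule type per RefSeq prefix, restricted to the prefixes named in this section.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "NR_": "non-coding RNA",
    "NC_": "genomic DNA",
    "NG_": "genomic DNA",
    "NT_": "genomic DNA",
    "NW_": "genomic DNA",
    "NZ_": "genomic DNA",
    "XM_": "model mRNA",
    "XP_": "model protein",
    "XR_": "model non-coding transcript",
}


def classify_refseq(accession):
    """Return the molecule type implied by the three-character prefix, or None."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix)


if __name__ == "__main__":
    for acc in ("NM_000963", "NP_000954"):
        print(acc, "->", classify_refseq(acc))
```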

Please refer also to the RefSeq main section.
Note: RefSeq was started by the NCBI, while other institutes like EBI-Ensembl also began to generate such "non-redundant sets of sequences". As a consequence, a new database was created to integrate these datasets into one common identifier for each gene; please refer to the CCDS section for details.
RESID IDs
(EBI and NCIFCRF)

AA0151

RESID IDs: The RESID Database of Protein Modifications is a comprehensive collection of annotations and structures for protein modifications and cross-links including pre-, co-, and post-translational modifications.

Typical RESID accessions:
- AA0151: This is the RESID database entry for N4-glycosyl-L-asparagine, which is found in N-glycosylated proteins. This modification typically occurs in extracellular peptides with an N-X-[ST] motif.
Note: All RESID entries start with the prefix "AA" followed by a four-digit number.

Please refer also to the RESID main section.
Rfam IDs
(Sanger and Washington University)

RF00254



Rfam IDs: Rfam is a large collection of multiple sequence alignments and covariance models covering many common non-coding RNA families.

Typical Rfam accessions:
- RF00254: This is the Rfam database entry for the mir-16 microRNA precursor family. miR-16 microRNA was shown to be involved in the mRNA-destabilization of inflammatory mediators like TNFalpha and PTGS2 (COX2), as shown in Jing et al., Cell 2005.

Please refer also to the Rfam main section.
miRBase IDs
(Sanger)

MI0000070
MIMAT0000069
miRBase IDs: miRBase contains all published miRNA sequences, genomic locations and associated annotation. Each entry in the miRBase Sequence database represents a predicted hairpin portion of a miRNA transcript (termed mir in the database), with information on the location and sequence of the mature miRNA sequence (termed miR). Both hairpin and mature sequences are available.

Typical miRBase accessions:
- MI0000070: This is the stem-loop sequence entry of the human miR-16-1 microRNA. Note that this file also contains the data of the mature sequence of this microRNA, with accession MIMAT0000069.
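
The two miRBase accession types (stem-loop and mature sequence) differ only in their prefix. The following Python sketch distinguishes them. The seven-digit length is an assumption based on the two examples above, and the function name classify_mirbase_accession is hypothetical.

```python
import re

# Assumption: both accession types carry a seven-digit number, as in the examples.
_STEM_LOOP = re.compile(r"^MI\d{7}$")   # hairpin precursor, e.g. MI0000070
_MATURE = re.compile(r"^MIMAT\d{7}$")   # mature miRNA, e.g. MIMAT0000069


def classify_mirbase_accession(accession):
    """Return 'stem-loop', 'mature', or None for a miRBase accession."""
    accession = accession.strip().upper()
    if _STEM_LOOP.match(accession):
        return "stem-loop"
    if _MATURE.match(accession):
        return "mature"
    return None


if __name__ == "__main__":
    print(classify_mirbase_accession("MI0000070"))
    print(classify_mirbase_accession("MIMAT0000069"))
```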

Please refer also to the miRBase main section.
SMART IDs
(EMBL)

SM00181


SMART IDs: SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable.

Typical SMART accessions:
- SM00181: This is the same entry as described for InterPro IDs and Pfam IDs (EGF-like domain). There are obviously strong overlaps concerning content and functionality of these entries.

Please refer also to the SMART main section.
Swiss-Prot IDs
TrEMBL IDs
UniProt IDs
(EBI, SIB, PIR)

P35354
PGH2_HUMAN
Q6ZYK7
Q6ZYK7_HUMAN
Swiss-Prot IDs / TrEMBL IDs / UniProt IDs: The Swiss-Prot Protein Sequence Database is the "mother database" of protein sequences. It contains high-quality annotation, is non-redundant, and cross-referenced to several other databases. TrEMBL (also known as SPTREMBL) is a computer-annotated protein sequence database supplementing the SWISSPROT Protein Sequence Data Bank. In 2002, EBI, SIB (Swiss Institute of Bioinformatics), and PIR (Protein Information Resource) joined forces as the UniProt consortium. UniProt is essentially built of Swiss-Prot and TrEMBL.

Typical Swiss-Prot accessions:
Note: These links are related to the example gene / protein PTGS2 (COX2).
- P35354: A typical Swiss-Prot entry lists the definition of a protein sequence, the source organism, literature references, keywords concerning function and cellular localization, GO terms, protein domains, motifs, interaction partners, (potential) sites of post-translational modification, and of course the sequence itself.
Typical Swiss-Prot/UniProt accessions:
- PGH2_HUMAN: shows (almost) the same information content as the Swiss-Prot entry, but in a more structured layout.
Typical TrEMBL accessions:
- Q6ZYK7: Somewhat similar to a Swiss-Prot entry, but usually shorter.
Typical TrEMBL/UniProt accessions:
- Q6ZYK7_HUMAN: shows (almost) the same information content as the TrEMBL entry, but in a more structured layout.
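
The four examples above come in two shapes: a six-character accession (P35354, Q6ZYK7) and a name with a species suffix after an underscore (PGH2_HUMAN, Q6ZYK7_HUMAN). The following Python sketch tells the two apart. Both patterns are assumptions inferred from these four examples only and do not cover every identifier format in use. The function name describe_protein_id is hypothetical.

```python
import re

# Assumption: accession = letter, digit, three alphanumerics, digit (six characters).
_ACCESSION = re.compile(r"^[A-Z][0-9][A-Z0-9]{3}[0-9]$")
# Assumption: entry name = name part, underscore, species part.
_ENTRY_NAME = re.compile(r"^([A-Z0-9]+)_([A-Z0-9]+)$")


def describe_protein_id(identifier):
    """Classify an identifier as accession-style or name_SPECIES-style."""
    identifier = identifier.strip()
    if _ACCESSION.match(identifier):
        return {"kind": "accession", "value": identifier}
    match = _ENTRY_NAME.match(identifier)
    if match:
        return {"kind": "entry name", "name": match.group(1), "species": match.group(2)}
    return None


if __name__ == "__main__":
    for pid in ("P35354", "PGH2_HUMAN", "Q6ZYK7", "Q6ZYK7_HUMAN"):
        print(pid, "->", describe_protein_id(pid))
```

Note that the second pattern is deliberately loose: any string of the form X_Y will pass, so it should only be applied to identifiers already known to come from Swiss-Prot, TrEMBL or UniProt.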

Please refer also to the Swiss-Prot and TrEMBL main section as well as to the UniProt main section.
TRANSFAC IDs
(BIOBASE, Germany)

G000316
M00054
R01634
T00590
TRANSFAC IDs: TRANSFAC is "historically" the most important transcription factor database. TRANSFAC is maintained by BIOBASE, Germany. TRANSFAC contains data of transcription factors, transcription factor matrices, and their binding sites. Thus, TRANSFAC is divided into several databases, all containing entries with characteristic accession numbers.

Typical TRANSFAC accessions:
NOTE: The first link points to the BIOBASE entry (which demands prior free subscription), the second link points to the free SRS-based entry. SRS databases can be selected via the list of public SRS servers.
- G001237; G001237: TRANSFAC TFGene Table entry for the example gene PTGS2. This entry shows a certain gene whose promoter was experimentally analyzed and for which the binding TFs ("T...") and TF binding sites ("R...") were characterized.
- M00109; M00109: TRANSFAC TFMatrix Table entry for the TF CEBPbeta. This entry shows a matrix, which is built by a series of individual binding sites related to a specific TF. Often, binding sites of several species are "merged". Such TF matrices are used by several programs to predict TFBS in sequences.
Note: This matrix entry is derived from CEBPbeta instead of CEBPdelta (which was described in the context of the regulation of the PTGS2 promoter) as there is no matrix for this TF available in TRANSFAC 6.0 public.
- R08091; R08091: TRANSFAC TFSite Table entry. This entry shows an individual short binding sequence within the PTGS2 promoter, which is recognized by the TF CEBPdelta (NF-IL6). The relative position to the transcription start site is indicated. Links to the TFGene entry and to the binding factor(s) are contained.
- T01114; T01114: TRANSFAC TFFactor Table entry for the TF CEBPdelta. This entry shows the protein sequence of a TF, literature references, functional annotation, cell specificity, and links to the orthologs in other species, as well as to the binding site entries.
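
The first letter of a TRANSFAC accession indicates the table it belongs to, as the four entries above show. The following Python sketch maps the letter to the table name. The allowed number length (five or six digits) is an assumption based on the examples in this section, and the function name classify_transfac_accession is hypothetical.

```python
import re

# Table per leading letter, as used in the entries described above.
TRANSFAC_TABLES = {
    "G": "TFGene",
    "M": "TFMatrix",
    "R": "TFSite",
    "T": "TFFactor",
}

# Assumption: one letter followed by five or six digits (e.g. M00109, G001237).
_TRANSFAC_PATTERN = re.compile(r"^([GMRT])(\d{5,6})$")


def classify_transfac_accession(accession):
    """Return the TRANSFAC table name for an accession, or None."""
    match = _TRANSFAC_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return TRANSFAC_TABLES[match.group(1)]


if __name__ == "__main__":
    for acc in ("G001237", "M00109", "R08091", "T01114"):
        print(acc, "->", classify_transfac_accession(acc))
```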

Please refer also to the TRANSFAC main section.
UniGene IDs
(NCBI)

Hs.196384
Mm.292547
UniGene IDs: UniGene clusters all EST sequences that belong to a single gene. UniGene provides links to databases like Entrez Gene, OMIM, and HomoloGene. Note: In contrast to databases like TIGR, there is NO consensus sequence assembly, and NO graphical output.

Typical UniGene accessions:
- Hs.196384: A human UniGene entry (PTGS2 gene) always shows the prefix "Hs", followed by a number.
- Mm.292547: A mouse UniGene entry (Ptgs2 gene) always shows the prefix "Mm", followed by a number.
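
A UniGene ID thus consists of a species prefix, a dot, and a cluster number. The following Python sketch splits an ID accordingly. Only the two prefixes shown above are translated into species names; other prefixes are returned unchanged. The function name parse_unigene_id is hypothetical.

```python
import re

# Only the two species prefixes that appear in this section.
UNIGENE_SPECIES = {"Hs": "human", "Mm": "mouse"}

# Assumption: capital letter plus one or two lowercase letters, a dot, then digits.
_UNIGENE_PATTERN = re.compile(r"^([A-Z][a-z]{1,2})\.(\d+)$")


def parse_unigene_id(cluster_id):
    """Return (species, cluster_number) for a UniGene ID, or None."""
    match = _UNIGENE_PATTERN.match(cluster_id.strip())
    if match is None:
        return None
    prefix, number = match.groups()
    return UNIGENE_SPECIES.get(prefix, prefix), int(number)


if __name__ == "__main__":
    print(parse_unigene_id("Hs.196384"))
    print(parse_unigene_id("Mm.292547"))
```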

Please refer also to the UniGene main section.
VEGA IDs
(Sanger)

OTTHUMG00000035473
OTTMUSG00000010099
OTTHUMT00000086157
OTTMUST00000023494
OTTHUMP00000033524
VEGA IDs: The VEGA (Vertebrate Genome Annotation) database is a central repository for high quality, frequently updated, manual annotation of vertebrate finished genome sequence. The website is built upon code from the Ensembl project.

Typical VEGA accessions:
Note: These links are related to the example gene / protein PTGS2 (COX2).
- OTTHUMG00000035473: This is a human "curated locus report", very similar to the Ensembl Gene report, with links to the genome browser display, to related transcript, family, etc. reports, to orthologous sequences, protein domains, motifs, GO terms, and links to all major databases.
Note: This ID is different from the Ensembl ID for the same gene (see section Ensembl IDs), but both reports (Ensembl AND Vega) can also be retrieved by searching the Ensembl website.
Note: "OTT" indicates that the reports are manually annotated in the so-called Otter database. Otter is an extended Ensembl database with an associated client/server system that is able to support interactive updating of annotation. Please refer also to the VEGA publication in NAR (Jan. 2005) for details.
- OTTMUSG00000010099: A mouse "curated locus report", similar to Ensembl Gene Report.
- OTTHUMT00000086157: A human VEGA Transcript report, similar to Ensembl Transcript reports. Note: VEGA contains 3 different transcripts for this gene (the other acc. are OTTHUMT00000086158, and OTTHUMT00000086159), in contrast to Ensembl (only 1 transcript, see section Ensembl IDs).
- OTTMUST00000023494: A mouse VEGA Transcript report, similar to Ensembl Transcript reports.
- OTTHUMP00000033524: A VEGA human protein report.
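
The VEGA accessions above share a common layout: the prefix "OTT", a three-letter species code (HUM, MUS), a type letter (G, T or P) and a number. The following Python sketch decodes this layout. The 11-digit number length and the restriction to the species and type codes seen above are assumptions based on the examples. The function name parse_vega_id is hypothetical.

```python
import re

# Species and report types as they appear in the examples of this section.
VEGA_SPECIES = {"HUM": "human", "MUS": "mouse"}
VEGA_REPORT_TYPES = {"G": "gene", "T": "transcript", "P": "protein"}

# Assumption: "OTT" + three-letter species code + type letter + 11 digits.
_VEGA_PATTERN = re.compile(r"^OTT([A-Z]{3})([GTP])(\d{11})$")


def parse_vega_id(vega_id):
    """Decode species, report type and number from a VEGA (Otter) accession."""
    match = _VEGA_PATTERN.match(vega_id.strip())
    if match is None:
        return None
    species_code, type_code, number = match.groups()
    return {
        "species": VEGA_SPECIES.get(species_code, species_code),
        "type": VEGA_REPORT_TYPES[type_code],
        "number": int(number),
    }


if __name__ == "__main__":
    for vid in ("OTTHUMG00000035473", "OTTMUST00000023494", "OTTHUMP00000033524"):
        print(vid, "->", parse_vega_id(vid))
```

Because the pattern requires the "OTT" prefix, identifiers from the Ensembl website that belong to Ensembl itself are rejected rather than misread as VEGA IDs.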

Please refer also to the VEGA main section.