
-> EXPRESSION
-> EXP1...know the best access to expression data for a gene of interest? (last update Mar. 3, 2006)
-> EXP2...know which available microarrays contain a gene or a whole gene set of interest? (last update Aug. 30, 2005)
-> EXP3...know which published microarray experiments contain data of my gene of interest? (last update Aug. 30, 2005)
-> EXP4...identify genes with expression patterns similar to my gene of interest? (last update Jun. 1, 2004)
-> EXP5...query microarray data not by gene name but by the "nature of the experiment"? (last update Nov. 11, 2004)
-> EXP6...perform in silico expression profiling in the field of endothelial cell biology? (last update Jun. 14, 2005)
-> EXP7...perform clustering analyses of microarray data? (last update Jun. 18, 2006)
-> EXP8...get all proteins from endothelial cells involved in inflammation (SRS and GO approaches)? -> see RET2!
-> EXP9...submit my own microarray data to a public database? (last update Jan. 3, 2006)
-> EXP10...compare the expression of my gene of interest in B-cells, T-cells, monocytes, and dendritic cells? (last update Aug. 20, 2004)
-> EXP11...get a "virtual multiple tissue Northern blot" of my gene of interest? (last update Mar. 3, 2006)
-> EXP12...know which genes are expressed in fetal brain 6 times higher than in adult brain? (last update Aug. 20, 2004)
-> EXP13...compare two different microarray platforms and see which genes are represented on both of them? (last update Aug. 31, 2005)
-> EXP14...compare a list of upregulated genes from one microarray experiment with one or more other experiments? (last update Aug. 31, 2005)
-> EXP15...know if alternative transcripts of a gene of interest are expressed in different tissues? (last update Sep. 16, 2005)
-> EXP16...generate contig sequences from a set of ESTs while considering alternative splicing? (last update Sep. 16, 2005)
-> EXP17...analyze the expression of a gene set of interest in cancer tissues? -> see GENOM9!
-> EXP18...determine the expression profiles of normal vs. cancer tissues? -> see GENOM10!
-> EXP19...determine the reliability of individual SAGE tags for expression analysis of a specific gene? (last update Feb. 10, 2006)
-> EXP20...get in situ microscopy images of RNA tissue localization and expression intensity? (last update Mar. 2, 2006)
-> EXP21...get RT-PCR, Northern Blot, and Western Blot expression data? (last update Mar. 3, 2006)
EXP1...know the best access to expression data for a gene of interest? (last update Mar. 3, 2006)
This question is intended as a "quickview" of some of the topics
treated in the following sections. Gene expression data fall into a
wide range of categories, each with its own databases for storage and
access. The main categories are microarray data, SAGE data, and EST
data. Note that many of these resources are described in more detail in
FAQ EXP11 ("virtual multiple tissue Northern blot").
GENERAL REMARK: Individual microarray probesets and EST sequences often
correspond to only ONE specific splicing variant of a gene. Where
different splicing variants also differ in tissue-specific expression,
this must be taken into account. Please refer to FAQ EXP15 for this
purpose!
1. Microarray data:
Access to public microarray data is described in detail in question
EXP3. A general concern is that there is no "unified" microarray
database (yet). Therefore, all repositories have to be searched
individually. In brief, the following databases are worth mentioning.
SOURCE of the SMD (Stanford Microarray Database) is a very
user-friendly database for storage of raw and normalized microarray
experimental data. SOURCE has an easy query method which accepts GB
IDs, LocusLink IDs, UniGene IDs, or gene names. When available, there
is a link to published microarray expression data, including data on
the gene of interest. The visualized data are very easy to interpret:
RED means UPregulated and GREEN means DOWNregulated. Please note that by
clicking on the "red and green expression bar" of a single gene you can
retrieve all other genes with similar regulation; in this list you may
again click on every gene and retrieve the respective genes with
similar expression! Note that the link "Authors' webpage" offers direct
access to the primary databases holding the expression data. Please
note that not all microarray data stored in the SMD are retrievable via
SOURCE; therefore you may also directly search SMD (basic or advanced)
for datasets via lists of specific experimental setups. Please refer to
the SMD description at the main page for details.
GEO (Gene Expression Omnibus) was launched by the NCBI in order to
support the public use of gene expression data. GEO is a gene
expression and hybridization array data repository, as well as an
online resource for the retrieval of gene expression data from any
organism or artificial source. For a full description of the GEO
functionality, please refer to the corresponding chapter at the main
page. In brief, there are several options to query and browse the GEO
database. If you want to search for expression data of a gene of
interest, you can use one of the following options: (1) Start with a
"total ENTREZ search", which also includes GEO expression profiles in
the output. (2) Start directly at ENTREZ-Geo Expressions, where you can
simply enter your search term(s) like gene name, organism, or tissue.
(3) Start at the GEO home-page, and enter all or part of the gene name,
or gene symbol, into the "Query", "Gene Profiles" field. Note that a
gene-specific entry in Entrez Gene also contains direct links to the
GEO expression data of a gene of interest!
CleanEx, provided by the Swiss Institute of Bioinformatics (SIB), is a
database which provides access to public gene expression data via
unique approved gene symbols and which represents heterogeneous
expression data produced by different technologies in a way that
facilitates joint analysis and cross-dataset comparisons. So far,
CleanEx contains only human genes for which the symbol is approved by
the HUGO nomenclature committee. There is one entry per gene name.
Thus, CleanEx is NOT a repository of expression data in the strict
sense; rather, it collects expression data from several resources (GEO,
ArrayExpress, SMD, etc.) in a gene-centered way in order to make them
available via one common interface.
2. SAGE data:
SOURCE provides links not only to array data concerning a gene of
interest, but also to SAGE (Serial Analysis of Gene Expression) data
stored at the NCBI SAGE database. SOURCE has an easy query method which
accepts GB IDs, LocusLink IDs, UniGene IDs, or gene names. When
available, there is a link to SAGE data via "Go to Gene-to-tag mapping
at NCBI".
GEO (Gene Expression Omnibus) is designed to retrieve not only
microarray data, but also SAGE data. If you want to search for SAGE
expression data of a gene of interest, you can follow the same
instructions as for array data retrieval.
SAGEmap is the "primary tool" at the NCBI to query SAGE expression
data. You can query by tag, sequence, gene, library, and more. Please
also refer to the "SAGE and ESTs" chapter at the main page for detailed
information.
NCBI Entrez Gene is another good starting point to retrieve various
kinds of information, including gene expression data. Following the
UniGene link of an entry, you will again find a link to SAGE data via
"SAGE: Gene-to-tag mapping".
CleanEx
(see above) can also be used to retrieve SAGE data corresponding to a
specific gene of interest.
ECgene
(gene prediction by EST clustering) predicts genes by combining
genome-based
EST clustering and a transcript assembly procedure. There is a
specific section of ECgene called ECexpression,
which is the expression data viewer of ECgene. ECexpression utilizes
the extensive expression data from EST and SAGE
sources.
SAGE Genie, which is part of the CGAP portal, provides highly
intuitive, visual displays of human and mouse gene expression.
Moreover, SAGE Genie provides tools to examine the reliability of
individual SAGE tags, meaning the probability that a tag is "unique" or
that it matches more than one gene (see FAQ EXP19 for details).
3. EST data:
EST sequences are often derived from individual tissues of an organism;
therefore it is possible to use these data to get a "rough" impression
of the expression level of a certain gene in a given tissue. This is
done by a normalization procedure, i.e. by comparing the number of ESTs
of the gene to the total number of EST sequences from this tissue.
Again, SOURCE entries provide access to these data, in the box "UniGene
and EST expression information". A normalized expression distribution
for tissue types is calculated, and the ratio of cluster clones versus
total tissue clones is indicated.
NCBI Entrez Gene is another good starting point to retrieve EST
expression data by following the UniGene link of an entry. UniGene
lists the tissues from which the ESTs were derived. Sometimes, a remark
"Highly represented in library xy" is indicated. In contrast to SOURCE,
no normalized value is given.
GeneCards
is another good resource for expression data of genes of interest. In
particular, graphical images are displayed showing the expression in
individual tissues derived from array experiments, as well as an
"electronic Northern" of UniGene (EST) data.
ECgene
(gene prediction by EST clustering) predicts genes by combining
genome-based
EST clustering and a transcript assembly procedure. There is a
specific section of ECgene called ECexpression,
which is the expression data viewer of ECgene. ECexpression utilizes
the extensive expression data from EST and SAGE
sources.
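The EST normalization described at the beginning of this section can be
illustrated with a small calculation. The following Python sketch is
only an illustration under stated assumptions: the tissue names and
clone counts are invented, the function name normalized_est_expression
is hypothetical, and the rescaling step is my own choice. It is NOT the
exact formula used by SOURCE, which is not spelled out here. It simply
divides the number of cluster clones from a tissue by the total number
of clones from that tissue, and then rescales the ratios so that they
sum to 1.

```python
def normalized_est_expression(cluster_counts, tissue_totals):
    """Return per-tissue EST ratios and their normalized distribution.

    cluster_counts: tissue -> number of ESTs (clones) of the gene's cluster
    tissue_totals:  tissue -> total number of ESTs sequenced from that tissue
    """
    ratios = {}
    for tissue, count in cluster_counts.items():
        total = tissue_totals[tissue]
        if total <= 0:
            raise ValueError(f"no ESTs recorded for tissue {tissue!r}")
        # Ratio of cluster clones versus total tissue clones.
        ratios[tissue] = count / total
    ratio_sum = sum(ratios.values())
    # Assumed rescaling: make the values over all tissues sum to 1.
    distribution = {
        tissue: (ratio / ratio_sum if ratio_sum > 0 else 0.0)
        for tissue, ratio in ratios.items()
    }
    return ratios, distribution


# Invented example numbers, for illustration only.
cluster_counts = {"brain": 30, "liver": 5, "heart": 0}
tissue_totals = {"brain": 100000, "liver": 50000, "heart": 20000}

ratios, distribution = normalized_est_expression(cluster_counts, tissue_totals)
for tissue in cluster_counts:
    print(f"{tissue}: ratio={ratios[tissue]:.6f} share={distribution[tissue]:.2f}")
```

Keep in mind that with small clone counts such values remain a "rough"
impression only, as stated above.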
4. RNA in situ expression images:
Please refer to FAQ EXP20 for
this purpose.
5. RT-PCR, Northern Blot, Western Blot data:
Please refer to FAQ EXP21 for
this purpose.
6. Protein microscopy data based on immunostains:
Please refer to FAQ PROT6
for this purpose.
EXP2...know which available microarrays contain a gene or a whole gene set of interest? (last update Aug. 30, 2005)
GENERAL REMARK: Individual microarray probesets and EST sequences often
correspond to only ONE specific splicing variant of a gene. Where
different splicing variants also differ in tissue-specific expression,
this must be taken into account. Please refer to FAQ EXP15 for this
purpose!
Tip! For this purpose, the program Resourcerer at the TIGR webpage is a
very useful tool. RESOURCERER provides annotation based on the TIGR
Gene Indices (TGI) for commonly available microarray resources,
including widely used clone sets and Affymetrix GeneChip Arrays.
RESOURCERER also allows comparisons between resources from the same
species using either the TGI or UniGene, and between species using the
EGO database. Please note that Resourcerer is NOT a repository for data
of chip experiments, BUT it is a very good tool for large-scale
annotation of e.g. EST accession numbers. Resourcerer currently works
with human, mouse, or rat accessions. You can see a list of all the
array types currently stored in this database by opening the drop-down
menu "Data Set:" at the Resourcerer start page!
If you have a list of accession numbers, click on the link "Batch
Search" and either upload a *.txt file containing your accession
numbers (UniGene, RefSeq, GenBank incl. EST Acc., or LocusLink,
separated by spaces) or simply type the numbers into the text box.
You will retrieve a table containing links to UniGene, LocusLink, the
human, mouse, and rat TIGR indices, and to the GO database. You can
save the output page in *.html format using your browser's "Save
page..." function and open it in applications like WORD or EXCEL, with
all the hyperlinks fully intact. If the list is very long, it will be
split into multiple files! ALTERNATIVELY: If you use the "Download
Virtual" function of Resourcerer, you will get a tab-delimited txt file
of the table (a single file irrespective of its length) which you can
import into EXCEL, but which has NO hyperlinks. Please note that the
output file lists the names of the arrays which contain a gene of
interest but NOT the individual identifiers (like ProbeSet IDs in case
of Affymetrix arrays); please refer to the BioMart description below
for this purpose.
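If your accession numbers come from several sources, it may help to
clean the list before pasting it into the Batch Search box. The
following Python sketch shows one possible way to do this. It only
assumes what is stated above, namely that the accessions are separated
by spaces. The function name format_accession_list and the identifiers
in the example are placeholders chosen for illustration; they are not
part of Resourcerer.

```python
def format_accession_list(accessions):
    """Return a space-separated string of unique, non-empty accessions.

    The input order is preserved; surrounding whitespace is removed.
    """
    seen = set()
    cleaned = []
    for accession in accessions:
        accession = accession.strip()
        if accession and accession not in seen:
            seen.add(accession)
            cleaned.append(accession)
    return " ".join(cleaned)


# Placeholder identifiers for illustration only (not real query results).
my_accessions = ["NM_000001", " Hs.1 ", "AA000001", "NM_000001", ""]
batch_text = format_accession_list(my_accessions)
print(batch_text)
```

The resulting string can be pasted into the text box or saved as a
*.txt file for upload.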
Example: If you want to know which available chip contains ESTs of
your gene of interest, open the UniGene cluster of your gene, activate
"switch to text mode" at the bottom of the page (or leave it in html
format), and save the file in *.html format. Then open the file in
WORD and mark all accession numbers by holding down the "Alt" key (or
in html format: mark the table row), copy the accession numbers into a
new file, convert the table into text, and copy all accession numbers
into the Resourcerer query field. Those ESTs which are contained in an
available array will show the respective information! For additional
information, you may also refer to the corresponding section at the
main page!
Tip! Another database for this purpose is CleanEx, in particular the
CleanEx Target database. This database can easily be searched using
the name of your gene of interest. This will produce a list of
identifiers in the CleanEx Target database, which include Affymetrix
ProbeSets, IMAGE cDNA clones, INCYTE cDNA clones, RefSeq cDNAs, and
SAGE tags. Note that CleanEx also provides a very convenient Batch
Search which lets you retrieve all the CleanEx Target entries
corresponding to a given list of input identifiers and a given
organism. It will also retrieve all array experiments stored in CleanEx
which contain the respective identifier. Possible queries include
target-to-target retrieval as well as gene-to-target, RefSeq-to-target,
or UniGene-to-target retrieval. Accepted identifiers include gene
symbols, RefSeq, UniGene, and more. Please refer also to the CleanEx
main section for details.
If you want to scan only the Affymetrix microarrays, you can use the
tool ArrayFinder to search the list of available arrays and find out
whether and where a certain gene of interest is represented. You may
enter any keyword, gene symbol, or accession number to find relevant
probe sets on all GeneChip arrays. BUT: Although ArrayFinder offers
queries with multiple accessions at once (space-separated), tests
showed that it seems to be impossible to use more than about 5
accession numbers "in-batch"! Therefore, even for screening the
Affymetrix chips, you may be better off using the tool Resourcerer!
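If you nevertheless want to use ArrayFinder with a longer list, one
workaround is to split the list into small portions and submit them one
after the other. The following Python sketch shows the splitting step
only. The batch size of 5 is taken from the observation above and is
not an official limit; the function name chunk_accessions and the
accession strings are hypothetical.

```python
def chunk_accessions(accessions, batch_size=5):
    """Split a list of accessions into consecutive batches of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        accessions[start:start + batch_size]
        for start in range(0, len(accessions), batch_size)
    ]


# Twelve placeholder accessions (invented for illustration).
accession_list = [f"ACC{number:03d}" for number in range(1, 13)]
batches = chunk_accessions(accession_list)
for batch in batches:
    # Each printed line can be pasted as one space-separated query.
    print(" ".join(batch))
```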
If you mainly want to scan the most widely used Affymetrix arrays (and
a few others), you can also feed your list of genes into BioMart (at
the "Filter" page, "Limit to Genes with these IDs") in order to obtain
an annotation table which keeps all hyperlinks active in an EXCEL
sheet. At the "Output" page, within the "Features" tab, you can select
the specific array type which you want to scan; this reveals all the
(ProbeSet) IDs corresponding to your genes of interest. Note that, in
comparison to Resourcerer, you get the gene-specific IDs but you can
only screen ONE array at a time (meaning that this option is mainly
suitable when you already know the one specific array you are
interested in).
Tip! If you are interested in a single gene (not a batch submission),
you may also perform a very simple query at GEO, the microarray data
repository at NCBI. You can simply enter the gene name at the ENTREZ
GEO site, which extracts a list of microarray experiments ("GEO
Datasets" = "GDS") using array types ("GEO platforms" = "GPL")
containing your gene of interest. You can simply scroll through the
lists and write down the cited GPLs (which of course may be listed
several times). NOTE that this option probably picks the datasets from
the most recent microarray platform developments!
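Instead of writing down the GPLs by hand, you may copy the result list
into a text file and count the platform accessions automatically. The
following Python sketch assumes only that platform accessions have the
form "GPL" followed by digits. The example text is invented and does
not reproduce the real layout of an ENTREZ result page; the function
name count_platforms is hypothetical.

```python
import re
from collections import Counter

# A GPL accession is assumed to be "GPL" followed by one or more digits.
GPL_PATTERN = re.compile(r"\bGPL\d+\b")


def count_platforms(result_text):
    """Count how often each GPL accession occurs in the pasted text."""
    return Counter(GPL_PATTERN.findall(result_text))


# Invented stand-in for a copied result list.
results = """
1: example dataset one ... Platform: GPL1
2: example dataset two ... Platform: GPL2
3: example dataset three ... Platform: GPL1
"""

platform_counts = count_platforms(results)
for platform, count in platform_counts.most_common():
    print(platform, count)
```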
Again, if you are interested in a single gene (not a batch submission),
you may also have a "quick look" at the UCSC Genome Browser, either via
a BLAT search using the cDNA sequence or via a keyword search at the
Genome Browser Gateway (note that "position" may also mean gene name!).
The genomic organization of your gene of interest is displayed
graphically. Now you may select all "tracks" related to microarray IDs
in order to show them along the sequence (don't forget to hit the
"Refresh" button). Although only a few (the most widely used)
microarray platforms are included, this option is extremely useful to
quickly identify array probesets belonging to different splice variants
of a gene, because the alignment to individual exons is displayed.
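The idea behind this visual check can also be expressed as a simple
interval comparison. The following Python sketch is a simplification
under stated assumptions: all coordinates, probeset names, and variant
names are invented, intervals are treated as half-open (start, end)
pairs on the same strand, and a probeset is counted for a variant as
soon as it overlaps one of its exons. It does not use any UCSC
interface.

```python
def overlaps(first, second):
    """True if two half-open intervals (start, end) share at least one base."""
    return first[0] < second[1] and second[0] < first[1]


def variants_hit_by_probeset(probe_blocks, transcripts):
    """Return the names of all variants with an exon overlapping the probeset."""
    hits = []
    for name, exons in transcripts.items():
        if any(overlaps(block, exon) for block in probe_blocks for exon in exons):
            hits.append(name)
    return sorted(hits)


# Invented exon coordinates of two splice variants of one gene.
transcripts = {
    "variant_1": [(100, 200), (300, 400), (500, 600)],
    "variant_2": [(100, 200), (500, 600)],  # skips the middle exon
}
# Invented alignment blocks of two probesets.
probesets = {
    "probeset_A": [(520, 580)],  # lies in an exon shared by both variants
    "probeset_B": [(320, 380)],  # lies in the exon unique to variant_1
}

for probeset, blocks in probesets.items():
    print(probeset, variants_hit_by_probeset(blocks, transcripts))
```

In this invented example, only probeset_B would distinguish between
the two variants.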
Tip! Similarly, the UCSC Gene Sorter
is a resource which very quickly shows if your gene of interest is
contained on the most widely used microarray types (in particular
Affymetrix arrays). You simply select the organism and type in the gene
name or other identifiers. You may have to "activate" those columns of
interest via the "configure" button by selecting entries like "U133 ID"
or "U74 ID". If no identifier is displayed in a specific column, your
gene of interest is not part of this specific array type.
EXP3...know which published microarray experiments contain data of my gene of interest? (last update Aug. 30, 2005)
There is still no "universal database format" for storing and
accessing complete datasets of microarray experiments. Nevertheless,
public repositories are emerging that address this question. Please
note that many microarray datasets are only accessible via the
authors' homepages and have not (yet) been submitted to public
repositories. Therefore, it is still recommended to additionally
search PubMed or even Google using appropriate keywords.
Tip! CleanEx, provided by the Swiss Institute of Bioinformatics (SIB),
is a database which provides access to public gene expression data via
unique approved gene symbols and which represents heterogeneous
expression data produced by different technologies in a way that
facilitates joint analysis and cross-dataset comparisons. So far,
CleanEx contains only human genes for which the symbol is approved by
the HUGO nomenclature committee. There is one entry per gene name.
NOTE: CleanEx is thus NOT a repository of expression data in the
strict sense; rather, it collects expression data from several
resources (GEO, ArrayExpress, SMD, etc.) in a gene-centered way in
order to make them available via one common interface. Please note
that GEO datasets in particular can be analyzed at the original GEO
site in a much more sophisticated way. Also note that the content of
CleanEx in general "lags behind" that of the "mother" databases.
Please refer also to the CleanEx main section for details.
Tip! SOURCE of the SMD (Stanford Microarray Database) is a powerful
and very user-friendly database for storage of raw and normalized
microarray experimental data. SMD stores data from microarray
experiments, as well as their corresponding image files. In addition,
SMD provides interfaces for data retrieval, analysis, and
visualization. SOURCE has an easy query method which accepts GB IDs,
LocusLink IDs, UniGene IDs, or gene names. Currently 3 species (human,
mouse, rat) are available. When available, there is a link to
published microarray expression data, including data on the gene of
interest. The visualized data are very easy to interpret: RED means
UPregulated and GREEN means DOWNregulated. Please note that by clicking
on the "red and green expression bar" of a single gene you can
retrieve all other genes with similar regulation; in this list you may
again click on every gene and retrieve the respective genes with
similar expression! Note that the link "Authors' webpage" offers
direct access to the primary databases holding the expression data.
Please note that not all microarray data stored in the SMD are
retrievable via SOURCE; therefore you may also directly search SMD
(basic or advanced) for datasets via lists of specific experimental
setups. Please refer to the SMD description at the main page for
details.
Tip! GEO (Gene Expression Omnibus) was launched by the NCBI in order
to support the public use of gene expression data. GEO is a gene
expression and hybridization array data repository, as well as an
online resource for the retrieval of gene expression data from any
organism or artificial source. For a full description of the GEO
functionality, please refer to the corresponding chapter at the main
page. In brief, there are several options to query and browse the GEO
database.
If you want to search for expression data of a gene of interest, you
can use one of the following options: (1) Start with a "total ENTREZ
search", which also includes GEO expression profiles in the output.
(2) Start directly at ENTREZ-Geo Expressions, where you can simply
enter your search term(s) like gene name, organism, or tissue.
(3) Start at the GEO home-page, and enter all or part of the gene
name, or gene symbol, into the "Query", "Gene Profiles" field.
You may also search for expression data of a
sequence of interest. For this purpose, start at the "Query",
"BLAST" field of the GEO
homepage, or go directly to the GEO BLAST site,
and enter your query sequence, a GenBank accession, or a GI. The GEO
BLAST tool queries Entrez GEO
for expression profiles of interest based on nucleotide sequence
similarity.
After performing one of these searches, you first get a typical ENTREZ
output list of GEO DataSets matching your query. Here you have several
options.
First, you may enlarge the thumbnail images that are provided with
each entry showing the corresponding expression profile. Simply click
on this image to enlarge it. The image represents the abundance
profile for an individual gene across each Sample in a DataSet. Please
refer to the GEO section at the main page for details about these
graphs!
Second, you can view the GDS record. At the bottom of the record, you
find the list of Sample records (GSMs) which belong to the GDS file.
You have several options to select / de-select individual GSMs for
further analyses. GDS records can often be divided into several
subsets (grouped by e.g. disease or age), which can also be checked /
unchecked. By using the option "Query set A versus B", you may select
2 sets, compare them, and display a list of e.g. all the genes which
show at least 4-fold higher expression in one set compared to the
other! The "Analysis" button provides features that help describe and
visualize an entire dataset. For example, the option "Clustering"
provides a visualization tool for displaying precomputed hierarchical
cluster maps. Cluster portions of interest may be selected, enlarged,
charted as line plots, and the original data downloaded. The
"Download" button provides several options to download the full set or
parts of the data as tab-delimited data tables.
Third, the link "Profile Neighbors" retrieves
other genes/molecules that show a similar profile shape over that
dataset, possibly inferring some common function or regulatory
elements.
Fourth, the link "Sequence Neighbors"
searches all GEO
datasets for related genes based on nucleotide sequence similarity, and
thus may be useful in identifying sequence homologs such as related
gene family members, or for cross-species comparisons.
Fifth, the link "Links" offers additional links: "GEO DataSets"
displays the GDS report within the browser window (not as JavaScript),
without the "analysis and download functionality"; "UniGene" displays
the UniGene cluster corresponding to the current sequence;
"Nucleotide" displays the database entry of the current sequence; and
"MapViewer" displays the position of the UniGene cluster within the
graphical genome browser.
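To make clear what a comparison like "Query set A versus B" with a
4-fold threshold means, here is a small Python sketch working on
downloaded values. It is only an illustration under stated
assumptions: the gene names, sample names, and expression values are
invented, the function name genes_with_min_fold is hypothetical, and
the comparison uses the ratio of the mean values of the two sets. How
GEO itself computes this comparison is not described here, so the
sketch should not be taken as a copy of the GEO procedure.

```python
from statistics import mean


def genes_with_min_fold(expression, set_a, set_b, min_fold=4.0):
    """Return (gene, fold) pairs where mean(set A) / mean(set B) >= min_fold.

    expression: gene -> {sample name -> expression value}
    set_a, set_b: lists of sample names forming the two groups
    """
    selected = []
    for gene, values in expression.items():
        mean_a = mean(values[sample] for sample in set_a)
        mean_b = mean(values[sample] for sample in set_b)
        # Genes with a zero mean in set B are skipped to avoid division by zero.
        if mean_b > 0 and mean_a / mean_b >= min_fold:
            selected.append((gene, mean_a / mean_b))
    # Highest fold change first.
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Invented values for three genes in four samples.
expression = {
    "gene_1": {"a1": 800, "a2": 1200, "b1": 100, "b2": 100},
    "gene_2": {"a1": 300, "a2": 300, "b1": 100, "b2": 100},
    "gene_3": {"a1": 400, "a2": 400, "b1": 100, "b2": 100},
}

upregulated = genes_with_min_fold(expression, ["a1", "a2"], ["b1", "b2"])
for gene, fold in upregulated:
    print(gene, fold)
```

Note that this sketch only tests one direction (A higher than B); for
the opposite direction, simply swap the two sets.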
ArrayExpress is a public repository for microarray-based gene
expression data, maintained by the microarray informatics group at the
EBI (for details, refer to this chapter at the main page). Note that,
at the time of writing, you can query ArrayExpress by experiments,
arrays, and protocol types, BUT there is no "free-text" search option,
AND there is no option to search for a GENE of interest (it is not
possible to query for e.g. UniGene accessions or EST GB acc. numbers),
which is a MAJOR DRAWBACK compared to e.g. the NCBI-GEO database!
EXP4...identify genes with expression patterns similar to my gene of interest? (last update Jun. 1, 2004)
SOURCE of the SMD (Stanford Microarray Database) is a powerful and
very user-friendly database for storage of raw and normalized
microarray experimental data. SMD stores data from microarray
experiments, as well as their corresponding image files. In addition,
SMD provides interfaces for data retrieval, analysis, and
visualization. SOURCE has an easy query method which accepts GB IDs,
LocusLink IDs, UniGene IDs, or gene names. Currently 3 species (human,
mouse, rat) are available. When available, there is a link to
published microarray expression data, including data on the gene of
interest. The visualized data are very easy to interpret: RED means
UPregulated and GREEN means DOWNregulated. Please note that by clicking
on the "red and green expression bar" of a single gene you can
retrieve all other genes with similar regulation; in this list you may
again click on every gene and retrieve the respective genes with
similar expression! Note that the link "Authors' webpage" offers
direct access to the primary databases holding the expression data.
Please note that not all microarray data stored in the SMD are
retrievable via SOURCE; therefore you may also directly search SMD
(basic or advanced) for datasets via lists of specific experimental
setups. Please refer to the SMD description at the main page for
details.
Tip! GEO (Gene Expression Omnibus) was launched by the NCBI in order
to support the public use of gene expression data. GEO is a gene
expression and hybridization array data repository, as well as an
online resource for the retrieval of gene expression data from any
organism or artificial source. For a full description of the GEO
functionality, please refer to the corresponding chapter at the main
page. Please also refer to question EXP3 for details.
If you want to search for expression data of a gene of interest, you
can use one of the following options: (1) Start with a "total ENTREZ
search", which also includes GEO expression profiles in the output.
(2) Start directly at ENTREZ-Geo Expressions, where you can simply
enter your search term(s) like gene name, organism, or tissue.
(3) Start at the GEO home-page, and enter all or part of the gene
name, or gene symbol, into the "Query", "Gene Profiles" field.
After performing one of these searches, you first get a typical ENTREZ
output list of GEO DataSets matching your query. Here you have several
options. One of these is the link "Profile Neighbors", which retrieves
other genes/molecules that show a similar profile shape over that
dataset, possibly suggesting some common function or regulatory
elements.
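What "a similar profile shape" means can be illustrated with a simple
correlation measure. The following Python sketch ranks genes by their
Pearson correlation to a query gene across the samples of one dataset.
Please note the assumptions: Pearson correlation is my choice for this
illustration and is not claimed to be the measure used by GEO for
"Profile Neighbors"; the gene names and values are invented; the
function names pearson and profile_neighbors are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile has no shape to compare
    return covariance / sqrt(var_x * var_y)


def profile_neighbors(profiles, query, top_n=2):
    """Return the top_n genes whose profiles correlate best with the query."""
    scores = [
        (gene, pearson(profiles[query], profile))
        for gene, profile in profiles.items()
        if gene != query
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]


# Invented expression values across four samples of one dataset.
profiles = {
    "query_gene": [1, 2, 3, 4],
    "gene_up": [2, 4, 6, 8],
    "gene_down": [4, 3, 2, 1],
    "gene_flat": [5, 5, 5, 5],
}

neighbors = profile_neighbors(profiles, "query_gene")
for gene, score in neighbors:
    print(gene, round(score, 3))
```

Because correlation ignores absolute levels, gene_up ranks first here
although its values are twice as high as those of the query gene.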
Tip! The UCSC Gene Sorter
is an excellent resource for exploring gene families and the
relationships among genes. This tool displays a table of genes
within a selected genome that are related to one another. Several
different relationships may be explored: protein-level homology,
similarity of gene expression profiles, or genomic proximity.
The Browser supports searches on a variety of terms and phrases,
including the gene name, the SwissProt protein name, a GenBank
accession, or a word or phrase present in a gene's description. At the "Sort
by" field, you can
e.g. choose "Expression (GNF)", which looks for all datasets in
this database which show a similar expression pattern to your gene
of interest. Note that although very user-friendly, the amount
of expression data referenced by this browser is
limited to certain "whole-genome normal expression datasets" (at least
in the current status).
The gene family display is highly configurable, allowing the user to
control the order and number of columns, the number of rows, and the
genes displayed. The tool provides several output formats, including a
simple tab-delimited format that may be imported into a spreadsheet or
a relational database. In addition, the sequences of the displayed
genes can be downloaded: cDNA, protein, genomic, and promoter (!)
sequences, with user-defined upstream and downstream regions. Please
refer also to the UCSC Gene Sorter main section for details!
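As an illustration of the import into a relational database, the
following Python sketch loads a small tab-delimited table into an
in-memory SQLite database and lists the genes without an entry in an
array column. Everything specific in it is an assumption: the column
names ("name", "U133 ID", "description"), the gene names, and the
table layout are invented stand-ins, since the real export depends on
how you configured the columns.

```python
import csv
import io
import sqlite3

# Invented stand-in for a tab-delimited export (columns are assumptions).
EXPORT = (
    "name\tU133 ID\tdescription\n"
    "GENE_A\tprobe_1\tfirst example gene\n"
    "GENE_B\t\tsecond example gene\n"
)


def load_into_sqlite(tab_text):
    """Load the tab-delimited text into an in-memory SQLite table 'genes'."""
    rows = list(csv.DictReader(io.StringIO(tab_text), delimiter="\t"))
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE genes (name TEXT, u133_id TEXT, description TEXT)"
    )
    connection.executemany(
        "INSERT INTO genes VALUES (?, ?, ?)",
        [(row["name"], row["U133 ID"], row["description"]) for row in rows],
    )
    return connection


connection = load_into_sqlite(EXPORT)
# Genes with an empty array column are not represented on that array type.
missing = [
    row[0]
    for row in connection.execute(
        "SELECT name FROM genes WHERE u133_id = '' ORDER BY name"
    )
]
print(missing)
```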
EXP5...query microarray data not by gene name but by the "nature of the experiment"? (last update Nov. 11, 2004)
Please note that many microarray datasets are only accessible via the
authors' homepages and have not (yet) been submitted to public
repositories. Therefore, it is still recommended to additionally
search PubMed or even Google using appropriate keywords.
Tip! GEO (Gene Expression Omnibus)
was launched by the NCBI, in order to support the public use of
gene expression data. GEO is a gene expression and hybridization array
data repository, as well as an online resource for the retrieval of
gene expression data from any organism or
artificial source. For a full description of the
GEO functionality, please refer to the corresponding
chapter at the main page. There are several options to browse and
query GEO.
If you already know a valid GEO
accession number you can query GEO using this ID at the Accession Display
Tool. You can also simply use the bar found at the foot of the GEO home page and the top
of each GEO record.
If you want to browse through lists of
GEO data and experiments, e.g. to simply scan alphabetical lists of
experiments stored in the database (e.g. "endothelial cell profiles"),
use
the GDS Browser.
You can sort by different criteria, like title of experiment, organism,
platform type or accession, and GDS accession.
If you want to search for an experiment of interest, you can use one
of the following options. All 3 options initiate a search in the
database Entrez GEO Datasets, which stores all GDS annotation
including the GDS description, reference series and sample
descriptions, titles, keywords, source material, contributor, authors,
and organisms. (1) Start with a "total ENTREZ search", which also
includes experimental descriptions in so-called GEO DataSets in the
output. (2) Start directly at ENTREZ-Geo DataSets, where you can
simply enter your search term(s) like organism, tissue, author name,
GEO terms, or accessions. (3) Start at the "Query", "DataSets" field
of the GEO homepage, and enter your search terms. For detailed
information concerning the construction of complex queries, please
refer also to the Quick Query Builder tutorial.
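When working with GEO accession numbers as described above, it is
useful to recognize the record type from the prefix. The following
Python sketch does this for the four prefixes mentioned in this FAQ
(GDS, GPL, GSM, GSE). It assumes that an accession consists of one of
these prefixes followed by digits only; the function name
classify_geo_accession is hypothetical, and the sketch does not check
whether a record with this number actually exists.

```python
import re

# Prefix -> record type, as used in this FAQ.
GEO_TYPES = {
    "GDS": "DataSet",
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
}
ACCESSION_PATTERN = re.compile(r"^(GDS|GPL|GSM|GSE)(\d+)$")


def classify_geo_accession(accession):
    """Return the GEO record type for an accession, or None if not recognized."""
    match = ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_TYPES[match.group(1)]


for accession in ["GDS181", "gse96", "XYZ1"]:
    print(accession, classify_geo_accession(accession))
```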
ArrayExpress is a public repository for microarray-based gene
expression data, maintained by the microarray informatics group at the
EBI (for details, refer to this chapter at the main page). You can
query ArrayExpress by experiments, arrays, and protocol types. NOTE
that, until recently, there was no option to search the stored array
experiments for data of a gene of interest, which was a MAJOR DRAWBACK
compared to e.g. the NCBI-GEO database. Now, there is at least a
prototype of such a program, which scans only a small set of array
accession numbers (experiments) using a gene name as query.
SMD (Stanford Microarray Database) is a large array data repository of
Stanford University, providing public access to published array
experiments. You may search SMD (basic or advanced) for datasets via
lists of specific experimental setups. You may select organisms,
experimenters, categories (like "T-cell" or "stress"), and
sub-categories (like "aging", "development", or "infection"). Within
the retrieved datasets, there is a multitude of options for data
analysis and download. Please refer to the SMD description at the main
page for details. Please note that not all microarray data stored in
the SMD are retrievable via SOURCE!
EXP6...perform in silico expression profiling in the field of endothelial cell biology? (last update Jun. 14, 2005)
0. How to search for expression profiles of endothelial
cells (ECs) ?
As also discussed in FAQs EXP5
and EXP10, there are generally 2 ways of
obtaining these data, either via keyword searches in databases
or via browsing their lists of stored microarray datasets. In
cases when microarray data are not available via public repositories,
a third option is to simply perform PubMed
or even Google searches to
retrieve literature on such data, which often contains at least links
to the authors' homepages, allowing certain analyses. This, in
fact, is an important issue, as a huge amount of data is buried in such
"authors' homepages" !
1. Global profiles of "normal" (uninduced) ECs:
1.1. GNF Gene Expression Atlas v1 (Novartis, Dec. 2002):
Tip! The
GNF Gene
Expression Atlas (version 1)
contains a survey of gene expression profiles in a diverse group of
primary human and mouse tissues and organs, as well as
transformed cell lines, performed with Affymetrix U95A (human) and U74A
(mouse) arrays. GNF is the Genomics
Institute of the Novartis Research Foundation. A total of 101 unique
specimens representing 47 tissues / cell lines from the normal
physiological state are represented, providing a survey of the human
and mouse "transcriptomes". As one of these cell lines, HUVEC (Human Umbilical
Vein Endothelial Cells) are included. You can search these data directly at the GNF
site, either by Keyword or Accession
(GB, UniGene, LocusLink,...), Sequence
(BLAT Search), or by Expression
Pattern. Note that the last option can be extremely useful,
meaning that you are able to choose a single tissue/cell type
and see ONLY the genes showing a pronounced expression. Note
that besides a very instructive graphical display, there is a link
"Find similar expression", which lists genes with a similar profile.
This series of array data is also available as NCBI-GEO dataset GDS181
("Large-scale analysis of the human transcriptome") and as GEO
series record GSE96.
Please also refer to FAQs EXP3 and EXP11 for additional information. You may
selectively search this GDS record for your gene of interest, using ENTREZ GEO
Expressions, by entering e.g. "ELAM AND GDS181". Note that
sometimes, gene names (like "TNF") yield heterogeneous lists of
"similar" hits; in this case you either have to scan for the specific
entries or you first look for the array-specific IDs (like
ProbeSets for Affymetrix
chips) of your gene (described in FAQ EXP2), and then
use these IDs for your GEO search instead of the
gene name.
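A query such as "ELAM AND GDS181" can also be composed in a small script. The following is only a minimal sketch: it assumes the NCBI E-utilities "esearch" service and the database name "geoprofiles", neither of which is mentioned in this FAQ, so both should be checked against the current NCBI documentation. The function names build_geo_term and build_esearch_url are hypothetical helpers, and the probe set IDs in the usage note are placeholders. The script only builds the query string and the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI E-utilities search service (not named in this FAQ).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_term(dataset, gene=None, probe_ids=()):
    """Combine a gene name OR a list of array-specific IDs with a GDS accession."""
    if probe_ids:
        # Several probe sets may represent one gene, so join them with OR.
        gene_part = "(" + " OR ".join(probe_ids) + ")"
    elif gene:
        gene_part = gene
    else:
        raise ValueError("give either a gene name or at least one probe ID")
    return f"{gene_part} AND {dataset}"


def build_esearch_url(term, db="geoprofiles", retmax=20):
    """Return the search URL; 'geoprofiles' is the assumed database name."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = build_geo_term("GDS181", gene="ELAM")
print(term)
print(build_esearch_url(term))
```

If the gene name yields a heterogeneous hit list, the same helper can be called with array-specific IDs instead, e.g. build_geo_term("GDS181", probe_ids=["PROBE_ID_1", "PROBE_ID_2"]), where the two IDs stand for the real ProbeSet IDs of your gene.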
Please note that also the SOURCE
database can be used for this purpose, although it is not possible to
define complex queries like in GEO. Nevertheless, you may start with
your gene name of interest (like "ELAM"), and then follow the link
"Show gene expression data", which yields a list of microarray data
stored in SOURCE, referring to your gene of interest. If available, the
GNF data are listed as "NormalTissueAtlas". The respective link
leads to the "SOURCE specific" display, showing high expression as
shades
of red and low expression as shades of green. Note that you can
immediately display genes with similar expression by clicking onto the
color bar !
Tip!
Please note that also the UCSC Gene Sorter
may be used to display this GNF expression data set. This tool displays
a table of genes within a selected genome that are related to one
another. Several different relationships may be explored, one of them
is the similarity of gene expression profiles, as available from the
GNF data. This means that, similar
to the option at SOURCE, you may start with your gene of interest and
display genes with similar tissue specific expression by
choosing
the appropriate option. Note that the UCSC Gene Sorter does not
display all the tissues / cell lines in the "initial view", but only a
"selection". Anyway, the user may display all tissues via the
"Configure"
button ! Please also refer to FAQ GENOM2
for additional information.
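The "similar expression" options mentioned above (GNF, SOURCE, UCSC Gene Sorter) all rank genes by how closely their profiles resemble that of your gene of interest. Which similarity measure each site actually uses is not stated here. The sketch below therefore assumes Pearson correlation as one common choice and uses invented gene names and values; it is meant to illustrate the idea on downloaded data, not to reproduce any of these tools.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile has no defined correlation; treat as unrelated
    return cov / sqrt(var_x * var_y)


def most_similar(profiles, query, top_n=20):
    """Rank all other genes by correlation with the query gene's profile."""
    ref = profiles[query]
    scored = [(pearson(ref, values), gene)
              for gene, values in profiles.items() if gene != query]
    scored.sort(reverse=True)
    return [(gene, round(r, 3)) for r, gene in scored[:top_n]]


# Invented toy values: one number per tissue / cell line, same order for all genes.
profiles = {
    "GENE_A": [10.0, 200.0, 15.0, 12.0],
    "GENE_B": [12.0, 180.0, 20.0, 11.0],
    "GENE_C": [300.0, 20.0, 250.0, 280.0],
}
print(most_similar(profiles, "GENE_A", top_n=2))
```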
1.2. GNF Gene Expression Atlas v2 (Novartis, Mar. 2004):
Tip!
Version 2 of the GNF Atlas
is
based on even more tissues and cell lines than version 1. In the
context of
endothelial cell biology, bone marrow derived
CD105-positive ECs are included. These data may be searched
using the GNF site
itself. Note that this
version 2 dataset was not only created using the Affymetrix chips U133A
(human),
but also using data derived from Novartis - designed arrays (human = GNF1H;
mouse = GNF1M), using a panel of RNAs derived from 79 human and
61 mouse tissues or cell types. Using the link "Search Expression",
you are able to choose a single tissue/cell type and see ONLY
the genes showing a pronounced expression. When choosing the "Search
Type" "Correlation Search", you may list genes showing high expression
in " BM-CD105+Endothelial cells" and e.g. low expression in dendritic
cells or T-cells. Please also refer to EXP11 for
details.
Note that also version 2 is
integrated in the UCSC Gene Sorter,
allowing a very convenient "quick-view" of the expression profile of
your gene of interest. Thereby, you may especially compare the
expression of your gene in ECs with e.g. T-cells, B-cells, NK-cells,
monocytes, or dendritic cells (see also FAQ EXP10).
In addition, all genes showing similar expression are also
displayed.
This series of array data is also available as
NCBI-GEO series
record GSE1133
("Tissue-specific pattern of mRNA expression"); there is no
GDS record yet (therefore, it is not possible to perform these "compare
2 sets (A vs B)" analyses, yet). The GNF1H array is
described as platform record GPL1074;
the GNF1M array is described as GPL1073.
1.3. Endothelial cell profiles (Stanford, Jul. 2003):
This dataset reflects a specific investigation of endothelial
cell profiles, but using chips of only 1,200 genes, spotted
as cDNAs. This means that these chips (GPL217)
are covering only a fraction of the whole genome, and interestingly,
queries did not find well-known genes like ELAM or VCAM. Anyway,
different types of ECs are included (HUVEC, human umbilical vein endothelial
cells; HMVEC, human lung microvascular endothelial cells; HAEC,
human aortic endothelial cells; HCAEC, human coronary artery
endothelial cells) and may be compared either between each other
or to other cell types like smooth muscle cells, astrocytes, or HepG2
cells. These data are available as GEO series record GSE515,
and as dataset GDS204,
thereby providing the option to compare 2 sets (A vs B) of cell
types or tissues, in order to produce lists of genes which are x-times
higher expressed in one compared to the other. You simply select the
appropriate check-boxes and choose a value for x. When hitting the
"QueryAvsB" button, the output is generated. Note that in the case of GDS204,
you may also compare the whole group of endothelial cell types
to the group of "other" cell types.
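The logic behind such an "A vs B" comparison can be sketched for downloaded data as follows. This is a minimal illustration that assumes the comparison is made on the group means of linear (not log-transformed) values; the rule GEO itself applies may differ. The function name a_vs_b, the gene names and all numbers are invented for the example.

```python
def mean(values):
    return sum(values) / len(values)


def a_vs_b(table, group_a, group_b, fold=2.0):
    """Return genes whose mean value in group A is at least `fold` times that in group B.

    `table` maps gene -> {sample name: expression value}; values are assumed
    to be linear (not log-transformed) and positive.
    """
    hits = []
    for gene, row in table.items():
        mean_a = mean([row[s] for s in group_a])
        mean_b = mean([row[s] for s in group_b])
        if mean_b > 0 and mean_a / mean_b >= fold:
            hits.append((gene, round(mean_a / mean_b, 2)))
    # Strongest difference first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented toy values; the sample names only mimic the cell types in the text.
table = {
    "GENE_1": {"HUVEC": 800.0, "HAEC": 1000.0, "SMC": 100.0, "HepG2": 80.0},
    "GENE_2": {"HUVEC": 120.0, "HAEC": 100.0, "SMC": 110.0, "HepG2": 90.0},
}
print(a_vs_b(table, ["HUVEC", "HAEC"], ["SMC", "HepG2"], fold=4.0))
```

Here the group of endothelial cell types (A) is compared to the group of "other" cell types (B), and x is set to 4.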
1.4. Endothelial cell diversity (Stanford, Sep. 2003):
In this work, a large-scale comparative analysis of
a series of different endothelial cell subtypes was performed (Chi
et al., PNAS 2003). In particular, ECs from artery
(coronary, pulmonary, umbilical, iliac, and aorta) were compared
to ECs from vein (umbilical, saphenous),
as well as ECs from various tissues (lung, skin,
intestine, uterus, bladder, myocardium, nasal). In addition,
microvascular ECs were compared to large vessel ECs.
These data are available on the web site
companion to the publication. Here, you may not only download
the primary data and view the web supplements, but you
can also interactively explore enhanced versions of the figures
from the paper, using the GeneExplorer software. GeneExplorer
is a web-based program which allows users to navigate clusters and
search for specific genes by name or symbol. NOTE that GeneExplorer
works much better with MS Internet Explorer than with Netscape! Click
on a region within the left panel ("Radar": customizable via the
"percentage" drop-down menu) to view it magnified
within the right panel ("Zoom"). Click on the "SC" links
preceding the gene names for additional information on the clones and
genes
listed. These links point to SOURCE
GeneReports. Click on the expression colorbar of any gene of interest
within the zoom image area to see the 20 most similar genes.
Alternatively, these
data are available via the "Search"
tool at the Stanford Microarray Database (SMD). Here,
you
have a multitude of options for data analysis and download. You
may
view, sort, download array data, or you may even see the clickable
image
of all arrays and click onto individual spots to reveal the spot
details.
1.5. Cardiac and aortic endothelial cells (University of
Antwerp, June 2004):
In this work, the gene expression
profiles of endocardial (EE) and aortic (AE)
endothelial cells of rat were analyzed. These
data are available as NCBI GEO dataset GDS695,
series record GSE1478.
As platform, rat Affymetrix U34 arrays were used.
1.6. Lymphatic endothelial cells (LECs) and blood vascular
endothelial cells (BECs) and Kaposi sarcoma reprogramming (Cancer
Research, UK, July 2004):
In this experiment, published in Nat.
Genet. by Wang
et al., a series of human Affymetrix U133A arrays was used to
compare the global expression profiles of endothelial cells (LEC,
BEC, HUVEC, MVEC) to other cell types like fibroblasts, smooth
muscle cells, or mesenchymal cells. In addition, a comparison between
Kaposi sarcoma lesions and "normal" skin was performed, revealing
similarities between KS profiles and EC profiles, especially LEC ones.
These data are available at the ArrayExpress database
under accession number E-MEXP-66. Simply enter this accession
number at the page "Query database",
in the field "Query for Experiments". You may then hit the "Retrieve
Data" button, which leads to the download of the whole dataset. If you
now want to focus on the expression of your gene
of interest, you have to locate the respective row using a specific
identifier, and then concentrate on those columns containing the
"Signals". You may delete all
other rows and columns and finally produce a diagram from the "Signal"
values.
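Instead of deleting rows and columns by hand, the same reduction can be scripted. The sketch below assumes that the downloaded file is a tab-delimited table with one header row, that the identifier sits in the first column, and that the relevant column headers contain the word "Signal". The real layout of the E-MEXP-66 download may differ, and the identifiers and values shown are invented.

```python
import csv
import io


def signal_profile(table_text, identifier, id_column=0, marker="Signal"):
    """Return {column header: value} for one identifier, keeping only Signal columns."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    # Indices of all columns whose header mentions the marker word.
    keep = [i for i, name in enumerate(header) if marker in name]
    for row in reader:
        if row and row[id_column] == identifier:
            return {header[i]: float(row[i]) for i in keep}
    raise KeyError(f"{identifier} not found")


# Invented example table; headers and values only mimic a typical export.
rows = [
    ["Probe Set", "LEC Signal", "LEC Detection", "BEC Signal", "BEC Detection"],
    ["probe_001", "512.3", "P", "48.1", "A"],
    ["probe_002", "33.0", "A", "40.5", "A"],
]
table_text = "\n".join("\t".join(r) for r in rows)
print(signal_profile(table_text, "probe_001"))
```

The returned dictionary contains one "Signal" value per sample and can be passed directly to any plotting tool to produce the diagram.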
2. Endothelial cells and inflammation:
2.1. IL-1 stimulation of HUVEC (BMT, May 2004):
In this experiment, HUVEC
(Human Umbilical Vein Endothelial Cells)
were treated with Interleukin-1 for various periods of time up
to 6 hours, and the global changes of expression patterns were analysed
using human Affymetrix U133A arrays (Mayer
et al., ATVB 2004). This experiment was performed within the BMT
(Bio-Molecular Therapeutics, Vienna), in collaboration with the
Clinical
Institute of Medical and Chemical Laboratory diagnostics, Vienna. These
data are available as NCBI-GEO series record GSE973,
providing links to the individual sample records GSM15389 to 15393,
corresponding to the individual time points. This series of data is
also
available as GEO Dataset record GDS649,
providing many options for interactive data analysis and download.
2.2. Inflammatory cytokine effect on five primary endothelial cell
types (DNAX Research Inc., Nov. 2003):
These datasets represent an examination of
gene expression induced by interferon gamma (IFNg), tumor
necrosis factor alpha (TNFa) and interleukin 4 (IL4)
inflammatory cytokines on 5 different primary
endothelial cells (lung: GDS498;
aortic: GDS499;
iliac: GDS500;
dermal: GDS501;
and colon: GDS502).
These samples are collectively described in GEO series record GSE569.
These experiments were performed on Incyte Gene Album Arrays 1-6, GEO
record GPL371,
containing approx. 37,000 human cDNAs. Please note that within the GDS
records, you may e.g. generate lists of genes selectively induced or
repressed by one of the 3 stimuli as compared to the others. An
important remark has to be added here: As also confirmed by email
from the company, the labeling was designed in a way that all
values seen in the GEO graphs are actually "the other way round",
meaning that high bars correspond to DOWN-regulation, whereas small
bars correspond to UP-regulation (as
tested with genes like ELAM,
known to be strongly up-regulated by TNF) !!! So,
always stay extremely cautious when interpreting microarray data, and
always perform "positive controls" for yourself !
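Such a "positive control" check can be made explicit before interpreting any other gene. The following sketch assumes that the displayed values have been expressed as log ratios (treated vs. untreated), so that an inverted display simply means a flipped sign; this is an assumption for illustration and not a description of how GEO stores these datasets. The function names and all numbers are invented.

```python
def display_orientation(control_log_ratio, expected_direction="up"):
    """+1 if a well-known control gene is displayed as expected, -1 if inverted.

    `control_log_ratio` is the displayed log ratio (treated vs. untreated) of a
    gene whose true response is known, e.g. ELAM after TNF treatment.
    """
    if control_log_ratio == 0:
        raise ValueError("control gene shows no change; choose another control")
    observed = "up" if control_log_ratio > 0 else "down"
    return 1 if observed == expected_direction else -1


def corrected(log_ratios, orientation):
    """Flip the sign of every displayed log ratio if the display is inverted."""
    return {gene: orientation * value for gene, value in log_ratios.items()}


# Invented numbers: the control gene appears strongly "down" although it is
# known to be induced, so the whole display is treated as inverted.
displayed = {"ELAM": -3.2, "GENE_X": 1.1}
orientation = display_orientation(displayed["ELAM"], expected_direction="up")
print(orientation, corrected(displayed, orientation))
```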
2.3. Steroids Effect on HUVEC Response to LPS or
Cytokines (Institute of Surgical Research, San
Antonio, Jun. 2004):
HUVEC
were treated with Lipopolysaccharide (LPS) or cytokine mix (CM),
a
mixture of proinflammatory cytokines (TNF-α, and IFN-γ)
for 4 hours.
All
treatments were done in quadruplicate. At the end of treatment, total
cellular RNA was isolated. These data are not (yet)
available as NCBI GEO dataset,
but as series record GSE1486.
As platform, the HUMAN_21K_OligoArray_2 (GPL1225)
was used, a non-commercial array of spotted 70mer oligonucleotides.
2.4. Leukotriene LTD4 effect on macrophage and endothelial
cells (Institute Vascular Medicine, Jena, Germany, Aug. 2004):
Human umbilical vein endothelial cells (HUVEC) or the human macrophage
cell line, Mono-Mac-6 were treated with the pro-inflammatory
mediator Leukotriene D4 for 1 hour. These data are available as
NCBI GEO
dataset GDS731,
series record GSE1644.
As platform, human Affymetrix U133A arrays were used.
2.5. HUVEC gene profile after TNF-stimulation (University of
Muenster, Germany, May 2005):
HUVEC were left
untreated or
stimulated for 5h with 2 ng/ml TNF.
Comparison of the gene profiles revealed TNF-mediated gene expression
changes in HUVEC. These data are not (yet)
available as NCBI GEO dataset,
but as series record GSE2639.
As platform, human Affymetrix U133A arrays were used.
3.
Endothelial cells and stress:
3.1. Human endothelium exposed to shear stress and pressure
(Internal Medicine Dep., Goeteborg, Sweden, Jun. 2004):
Intact
living conduit vessels (umbilical veins)
were exposed to normal
or high intraluminal pressure, or low or high shear stress in
combination with a physiological level of the other force. These data
are not (yet)
available as NCBI GEO dataset,
but as series record GSE1518.
As platform, human Affymetrix U133A arrays were used.
4. Endothelial cells and hypoxia:
4.1. Responses of HUVEC to Hypoxia and Reoxygenation (Institute
of Surgical Research, San Antonio, Feb. 2004):
This is a timecourse experiment. The hypoxic
treatment consisted of
1-hour hypoxia followed by various periods
of reoxygenation (0, 3, 5, 12 and 24 hrs) for gene expression analysis.
Total RNA was extracted from cultured HUVECs.
These data are not (yet)
available as NCBI GEO dataset,
but as series record GSE1041.
As platform, the HUMAN_21K_OligoArray_1 (GPL981)
was used, a non-commercial array of spotted 70mer oligonucleotides.
5. Endothelial cells and angiogenesis:
5.1. HUVEC treated with VEGF-A versus PlGF (Cambridge
University, Nov. 2003):
In this
experiment, HUVEC were treated with
the angiogenic factors vascular
endothelial growth factor-A (VEGF-A) versus Placental Growth
Factor (PlGF), in
low or high serum media, in a time course up to
42 hours. These data are available as NCBI GEO dataset GDS495,
series record GSE837.
As platform, human Affymetrix U95 arrays were used.
6. Endothelial cell development:
6.1. Endothelial progenitor cell expression profile (Molecular
Cardiology, Frankfurt, Dec. 2004):
Expression profiling of endothelial progenitor cells (EPCs) derived from
peripheral blood. EPCs were compared to human umbilical venous endothelial
cells (HUVEC) and CD14+ monocytes. Results
provide insight into the mechanism
underlying the positive contribution of EPCs to neovascularization.
These data are accessible as GEO dataset GDS1075,
series record GSE2040,
platform: Affymetrix human U95A (GPL91).
EXP7...perform clustering analyses of microarray data ? (last update Jun. 18, 2006)
As often, there are commercial and free software packages
available which
address this question.
Tip! Expression Profiler, provided freely
by the EBI, is a set of tools for clustering, analysis
and visualization of gene expression and other genomic data.
Tools in the Expression Profiler allow you to perform cluster analysis,
pattern discovery, pattern visualization, study and search Gene
Ontology categories, generate sequence logos, extract regulatory
sequences, study protein interactions, as well as to link analysis
results to external tools and databases. The main module of Expression
Profiler is EPCLUST
(Expression Profile data CLUSTering and
analysis),
which is a generic data clustering, visualization, and analysis tool
for numeric (e.g. gene expression data) as well as sequence data.
Please refer to the corresponding
Expression Profiler chapter at the main page for a detailed
description of how to
efficiently use
Expression Profiler !
Expression Profiler:
Next Generation (EP:NG) is the new EP
version (2004), providing the full functional range from the old
version AND additional features, together with a "unified"
user interface. Expression Profiler is a web-based platform for microarray
gene expression and other functional genomics-related data analysis.
The new architecture, EP:NG, modularizes the original design
and allows individual analysis-task-related components to be developed
by different groups and yet still seamlessly to work together and share
the same user interface look and feel. Please refer to
the corresponding
Expression Profiler: NG chapter at the main page for a detailed
description, including also a list of problems
encountered when using this new version!
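To give an impression of what a cluster analysis of expression profiles does, here is a minimal sketch of hierarchical (agglomerative) clustering with average linkage and a correlation-based distance. It is a generic textbook technique written for illustration only; it is not the implementation used by Expression Profiler or EPCLUST, and the gene names and values are invented.

```python
from math import sqrt


def correlation_distance(x, y):
    """1 - Pearson correlation; 0 means identical profile shape."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # flat profile: treat as uncorrelated
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain."""
    clusters = [[gene] for gene in profiles]

    def cluster_distance(c1, c2):
        # Average of all pairwise gene-to-gene distances between two clusters.
        pairs = [(g1, g2) for g1 in c1 for g2 in c2]
        total = sum(correlation_distance(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Invented time-course profiles: two "early" and two "late" responders.
profiles = {
    "EARLY_1": [1.0, 8.0, 4.0, 2.0],
    "EARLY_2": [2.0, 16.0, 9.0, 3.0],
    "LATE_1": [1.0, 2.0, 6.0, 9.0],
    "LATE_2": [2.0, 3.0, 11.0, 20.0],
}
print(average_linkage(profiles, n_clusters=2))
```

The correlation-based distance groups genes by the shape of their profile rather than by absolute signal level, which is why EARLY_1 and EARLY_2 end up together although their values differ by a factor of about two.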
EXP8...get all proteins from endothelial cells involved in
inflammation (SRS and GO approaches) ? -> see RET2
!
Note that RET2
describes approaches based on sequence retrieval tools, as well
as approaches based on the Gene Ontology (GO) system of
functional assignments. For comparison, approaches to address this
question based on expression analysis are discussed in FAQ EXP6.
EXP9...submit my own
microarray data to a public
database ? (last update
Jan. 3, 2006)
As already discussed especially in questions EXP3 and EXP5, there are only a
few public repositories of microarray data, which accept data
from institutes all over the world, comparable to sequence deposit at
GENBANK. The two major databases are ArrayExpress of the
EBI and GEO of
the NCBI. As I investigated the "overall-handling" and
user-friendliness of these databases, I came to the conclusion that at
the current status, GEO is the better solution. Therefore I am focusing
on the data deposit at GEO in the following text.
Tip! GEO (Gene Expression Omnibus)
was launched by the NCBI, in order to support the public use of
gene expression data. GEO is a gene expression and hybridization array
data repository, as well as an online resource for the retrieval of
gene expression data from any organism or
artificial source. For a full description of the
GEO functionality, please refer to the corresponding
chapter at the main page. In order to submit data to GEO,
you should follow a flow-chart of different steps.
First, you should check and prepare your
data to be "as MIAME compliant as possible". The Minimum
Information About a Microarray Experiment
(MIAME)
is a standard developed by the Microarray Gene Expression
Data (MGED) group for
the content of data which should be supplied when publishing microarray
experiments. Please note that there is a comprehensive, yet concise MIAME
checklist available, which is a guide to authors, editors, and
reviewers of microarray gene expression papers. This checklist should
be considered as a basis for the preparation of submissions to public
repositories. Especially, you should consider the requirements
concerning the description of the experiment design, the nature and
preparation of samples, the hybridization procedures, and the data
measurements. Note that the specifications concerning the array design
are only needed in cases when the array type used is not already
present in the chosen database.
Second, it is generally favourable to browse
the
GEO database and to view some entry examples in order to get an
impression of the overall database architecture. In general, GEO
consists of four entities, which are submitter, platform,
sample, and series. These entities have to be deposited separately, in
a process which is described concisely in the GEO
web deposit guide.
Fourth, you have
to create your own GEO
account ("submitter" entity) by entering your
contact information. This is publicly accessible information
that is necessary so that proper credit can be given for data.
This "user" and "password" information can from now on be used
to log in
and
submit new entries or manipulate / update existing entries.
Fifth, check to see if your platform
(microarray type) already exists in GEO. If your experiments were
performed using commercial arrays (e.g., Affymetrix), it may not be
necessary to submit a platform record. If you find the relevant
platform already deposited in GEO (view all commercial
nucleotide platforms), take a note of its GEO accession number ("GPLxxx"),
which you have to submit with each
sample record (individual array measurement).
Sixth, submit your hybridization data as sample
records ("GSMxxx"). A
sample record references one platform, and describes the abundance
measurements of a single hybridization/experimental condition
(meaning if you e.g. submit an experimental time course represented by
several chips, you have to submit each chip data as individual
sample record). You will first be asked to specify the experiment type
(e.g., single channel for Affymetrix chips) and the platform accession
number. Next you must provide the sample data table in text,
tab-delimited
format, meaning you have to save your EXCEL sheets in this format. Please
note that GEO only accepts a period as decimal separator; values using a comma
(or any other punctuation) are not accepted
(!).
The
easiest way to choose this format for your values is to make the
appropriate
selection in the Windows Control Panel (Regional Options -> Numbers
-> Decimal symbol). The first row of the data table must
contain
the column headers. Sample data tables require a column named "ID_REF",
matching the "ID" column of the reference platform, and a "VALUE"
column. For dual channel experiments, the VALUE will reflect the
normalized
log ratio measurements. For single channel experiments, the VALUES will
be normalized (scaled) signal count data (not log transformed), which
in the case of Affymetrix MAS4.0 software is the value "Average
difference"
and in version MAS5.0 is the "Signal". Note that this "value"
is
consistent
across all samples and can therefore be used for comparisons between
different chips (hybridizations). In case of Affymetrix data,
GEO strongly suggests providing 2 additional columns, namely "ABS_CALL"
displaying the "Present/Marginal/Absent"
detection calls, and "DETECTION P-VALUE" showing the
corresponding p-values. Note that you may optionally provide even
further
columns. After the data table has passed validation, you will be asked
to supply the sample title, organism, description, authors and
keywords. The 'Description' field may hold very large volumes
of data, and it is encouraged that submitters provide a thorough report
of the sample, which may include a detailed description of the
biological source, experimental conditions and treatments, labeling and
hybridization protocols, spot quantification and normalization schemes.
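The sample data table described in this step can also be written by a short script rather than via EXCEL and the Windows Control Panel. The sketch below uses the column names given above ("ID_REF", "VALUE", "ABS_CALL", "DETECTION P-VALUE"), writes a tab-delimited table with the header in the first row, and converts comma decimals to periods. The helper names, probe IDs and values are invented, and the current GEO submission instructions should be consulted for the exact requirements.

```python
import csv
import io

COLUMNS = ["ID_REF", "VALUE", "ABS_CALL", "DETECTION P-VALUE"]


def to_period_decimal(text):
    """Convert a value such as '512,3' to '512.3' and check that it is numeric."""
    cleaned = text.strip().replace(",", ".")
    # Raises ValueError if the result is not a number, e.g. for '1.234,5'
    # where a period was used as thousands separator.
    float(cleaned)
    return cleaned


def write_sample_table(rows, handle):
    """Write rows as a tab-delimited table with the column headers in the first row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for probe_id, value, call, p_value in rows:
        writer.writerow([probe_id, to_period_decimal(value), call,
                         to_period_decimal(p_value)])


# Invented example rows as they might come from a sheet with comma decimals.
rows = [
    ("probe_001", "512,3", "P", "0,0002"),
    ("probe_002", "33,0", "A", "0,4"),
]
buffer = io.StringIO()
write_sample_table(rows, buffer)
print(buffer.getvalue())
```

To create a file instead of an in-memory text, pass a handle opened with open("sample.txt", "w", newline="") in place of the StringIO buffer; the file name is only an example.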
Seventh, submit a
series record ("GSExxx"). A series brings together
a related group of samples, and provides a focal point and
description of the experiment as a whole. Information reflecting
experimental sample subsets
may also be specified. Submitters are encouraged to supply information
regarding the overall experimental design, aim, summary results and
conclusions. If you e.g. publish a series of samples representing a
time course experiment, then the series record is the suitable place to
define the correct ordering
of the individual samples, and also, if applicable, the definition of subsets
according to fields like tissue, disease or treatment.
Note that these series records are used by GEO
staff to produce a so-called GEO
Dataset record ("GDSxx"), as soon as the
data become public, allowing the interactive analysis of the sample
records and their subsets and providing multiple download options.
These GDS records are displayed in the output lists when performing an ENTREZ search against
the GEO expression profiles, using e.g. your favourite gene of interest
as query.
Eighth,
each record you submit will receive a unique and stable GEO
accession number which you may quote in manuscripts. Records may
remain private for several months until the data
is published. During this period, you may request a "read-only"
password (email geo@ncbi.nlm.nih.gov)
which allows collaborators or reviewers confidential access to your
private
data prior to publication.
Ninth, one final point should be considered
in order to fulfill the MIAME guidelines, which state that not
only the processed but
also the raw
data have to be supplied in order that other users have the ability to re-evaluate
your dataset using alternative statistical
algorithms.
Therefore, in the case of
Affymetrix experiments it is recommended that original .cel files
are supplied via FTP (FTP server details via email). Please
name these files after the GEO accession number they correspond to,
e.g., GSM12345.cel. Links will be supplied on GEO sample records
allowing users access to the original data files.
EXP10...compare the expression
of my gene of interest in B-cells, T-cells, monocytes, and dendritic
cells ? (last update Aug. 20, 2004)
This question is somewhat related
to FAQ EXP3, which explains the principal
options for searching public expression databases for data concerning
your gene of interest, and to FAQ EXP5
which describes ways to search for experimental conditions
(including cell
types used). In this FAQ now, we specifically address the point how to
compare the expression profiles of your gene of interest in a series of
different cell types. As example, we are looking for the
expression of IL8 (Interleukin 8) in B-cells, T-cells,
monocytes, and dendritic cells.
1. "One-step" keyword search:
GEO
provides a lot of
options to create simple or complex queries, both searching for
expression
profiles as well as experimental conditions like cell types, organisms
and treatments used. If you directly query ENTREZ GEO
Expressions using only the gene name ("IL8"), you will get a
very long list of all microarray experiments having this gene on the
chip. Therefore, you
have to combine the gene name with a search for the experimental setup
you are interested in. Please note that there is a link "Preview/Index"
just below the query box, which is very useful as it immediately
displays the number of hits produced by a specific query. In general,
tests showed that there is no "uniform system" for sample nomenclature,
meaning that often it is not easy to choose the field which best fits
to your query. Therefore, these "combined" keyword searches often
produce either long lists of "false positives" or very short lists when
an inappropriate field was chosen.
2. "Two-step" keyword search:
If you prefer a "Two-step keyword strategy" over the
combined approach, you may first query ENTREZ GEO
Datasets for stored microarray experiments using your
cell type of interest (like "dendritic"), and then specifically search
in a second step for the expression of your gene(s) of
interest within this data set, using ENTREZ GEO
Expressions. There, you may simply combine (using "AND") the
GEO dataset accession with the gene name of interest. Note that
sometimes, gene names (like "TNF") yield heterogeneous lists of
"similar" hits; in this case you either have to scan for the specific
entries or you first
look for the array-specific IDs (like ProbeSets for Affymetrix chips)
of your gene (described in FAQ EXP2), and then
use these IDs for your GEO search instead of the gene name. In our
case,
this approach, although suitable for individual cell types, is
still unsatisfactory, as it is very difficult to catch datasets
comprising all the cell types listed above via a keyword
search.
3. Browse microarray datasets:
Tip!
Note that, instead of playing around with search terms, it may
be advantageous to simply browse
the list of GEO datasets for experiments matching your
interests. In addition, you may simply "browse PubMed or the whole
internet" for appropriate array experiments. In our case, we would
be screening for a panel of cell types, which are often contained in "global
transcriptome datasets". In particular, the cell types listed above
are contained in the Version
2 of GNF
Atlas, which is split into 3 curated GDS (GEO Dataset)
records:
GDS592
covers the GNF1M data (mouse atlas); GDS594
covers the human GNF1H data; and GDS596
covers the human U133A data, and may as well be searched using the GNF site itself. Please
also refer to EXP11 for details. Note that
the expression data of both GNF Atlas versions (1 and 2) are also
integrated in the UCSC Gene Sorter,
allowing a very convenient "quick-view" of the expression profile of
your gene of interest. Thereby, you simply query for "IL8", and
immediately produce a "red (up-) vs. green (down-regulated)" image
of tissue- and cell type-specific expression. Also note, that all genes
showing similar expression are also displayed.
EXP11...get a "virtual
multiple tissue
Northern blot" of my gene of interest ? (last update Mar. 3, 2006)
Often, researchers in the lab are
interested in getting a "quickview" of the tissue- and cell
type-specific
expression of a novel gene of interest. In the "good old days",
this was done (but still is done today) by so-called Multiple Tissue
Northern blots, which are quite expensive. Today, public microarray
databases store a vast amount of expression information for the
majority
of known genes, so it is quite likely that you will retrieve "in
silico" the desired information.
GENERAL REMARK: Often,
individual microarray probesets and also EST sequences correspond to
only ONE specific splicing variant
of a gene. Thus, in cases where different splicing variants may also
differ in tissue-specific expression, it is necessary to consider this
point. Please refer to FAQ EXP15 for
this purpose !
1. Microarray data:
Tip! The
GNF Gene
Expression Atlas contains a survey of gene expression profiles
in a diverse group of primary human and mouse tissues and
organs, as well as transformed cell lines. GNF is the Genomics Institute of the
Novartis Research Foundation. A total of 101 unique specimens
representing 47 tissues / cell lines from the normal physiological
state are represented, providing a survey of the human and mouse
"transcriptomes". (Su et al. 2002 PNAS
99: 4465-70). These data were produced
using Affymetrix human U95A and mouse U74A chips. There are several
ways to define a query.
You can search directly at the GNF
site, either by Keyword or Accession
(GB, UniGene, LocusLink,...), Sequence
(BLAT Search), or by Expression
Pattern. Note that the last option can be extremely useful,
meaning that you are able to choose a single tissue/cell type
and see ONLY the genes showing a pronounced expression. Note
that besides a very instructive graphical display, there is a link
"Find similar expression", which lists genes with a similar profile.
This series of array data is also
available as NCBI-GEO
dataset GDS181
and as GEO series record GSE96.
Please also refer to FAQs EXP3 and EXP6
for additional information. You may selectively search this GDS record
for your gene of interest, using ENTREZ GEO
Expressions, by entering e.g. "Interleukin 8 AND GDS181". Note
that sometimes, gene names (like "TNF") yield heterogeneous lists of
"similar" hits; in this case you either have to scan for the specific
entries or
you first look for the array-specific IDs (like ProbeSets
for Affymetrix chips) of your gene (described in FAQ EXP2),
and then use these IDs for your GEO
search instead of the gene name.
Please note that the SOURCE
database can also be used for this purpose, although it is not possible to
define complex queries as in GEO. Nevertheless, you may start with
your gene name of interest (like "IL8"), and then follow the link "Show
gene expression data", which yields a list of microarray data stored in
SOURCE that refer to your gene of interest. If available, the GNF data
are listed as "NormalTissueAtlas". The respective link leads to
the "SOURCE specific" display, showing high expression as shades of red
and low expression as shades of green. Note that you can
immediately display genes with similar expression by clicking on the
color bar! The link to the "author's webpage" on the left side leads
to the GNF
homepage (described above), where you may repeat your search to
produce a different graphical image of the expression profile.
Please note that the UCSC Gene Sorter
may also be used to display the GNF expression data set. This tool displays a
table of genes within a selected genome that are related to one
another. Several different relationships may be explored, one of them
being the similarity of gene expression profiles, as available from the
GNF data. This means that, similar
to the option at SOURCE, you may start with your gene of interest and
display genes with similar tissue-specific expression by choosing the
appropriate option. Note that the UCSC Gene Sorter does not
display all the tissues / cell lines in the "initial view", but only a
"selection". However, the user may display all tissues via the "Configure"
button! Please also refer to FAQ GENOM2 for
additional information.
The data retrieval tool BioMart, which is
described in detail in e.g. FAQ RET3, also
provides an option to filter (in the section "Expression" at the
"Filter" page) any retrieved data set, in order to keep only those
entries (genes, SNPs, etc.) providing a link to expression data of
the GNF database. Note that you also have to choose these expression
data at the output tab "Features", if you want to display the
respective information.
Tip! Version 2 of the GNF Atlas
was released in March 2004. It was created not only using the
Affymetrix U133A chip (human), but also using data derived from
Novartis-designed arrays (human = GNF1H; mouse = GNF1M),
using a panel
of RNAs derived from 79 human and 61 mouse tissues or cell types. Note
that GNF1H essentially contains genes which are NOT present
on the Affy U133 arrays, meaning the two chips are complementary. GNF1H
is also termed GNF1B at the GNF site. You may query
using gene symbols, accessions, keywords, or even sequence. Note that
using the link "Search Expression", you are able to choose a single
tissue/cell type and see ONLY the genes showing pronounced
expression.
Note that these new data are especially
useful if you are looking for profile comparisons between a list
of cell types, like bone-marrow (BM) derived endothelial cells,
BM-early erythroid cells, or peripheral blood (PB) dendritic cells,
PB-B cells, PB-T cells, PB-monocytes, or PB-NK cells.
This series of array data is also available as NCBI-GEO
series record GSE1133, which
is split into 3 curated GDS (GEO Dataset) records: GDS592
covers the GNF1M data (mouse atlas); GDS594
covers the human GNF1H data; and GDS596
covers the human U133A data. The GNF1H array is described as
platform record GPL1074;
the GNF1M array is described as GPL1073.
Note that both GNF Atlas 1+2
expression data are also integrated in the UCSC Gene Sorter,
allowing a very convenient "quick-view" of the expression profile
of your gene of interest (see above) !
Tip! GeneNote
is a database of human genes and their expression
profiles in healthy tissues. It is based on Weizmann
Institute of Science DNA array experiments, which were performed on the
Affymetrix HG-U95 set A-E (the same array family as used in the GNF database).
GeneNote is tightly connected to the Weizmann database GeneCards, an
"integrated" database of human genes, their products and their
involvement in diseases. It offers concise information about the
functions of all human genes that have an approved symbol, as
well as selected others. Actually, GeneNote not only stores microarray
data of "normal tissues", but also SAGE data, as well as data
based on ESTs (named "Electronic Northern"), as
described on the "Methods"
page. You may search
GeneNote by diverse identifiers like gene symbol, Ensembl ID,
UniGene ID, SAGE tag, or LocusLink ID. Interestingly, you may also
choose between MAS5.0 (Affymetrix software) normalized or raw data. You
will retrieve data for your gene of interest from all three types of
methods mentioned.
Please note that the GeneNote
data are also available as NCBI-GEO
datasets GDS422
to GDS426
(representing the different U95 subtypes), and as GEO series record GSE803.
Please also refer to FAQs EXP3 and EXP6
for additional information. You may selectively search these GDS records
for your gene of interest, using ENTREZ GEO
Expressions, by entering e.g. "Interleukin 8 AND GDS422". Note
that sometimes gene names (like "TNF") yield heterogeneous lists of
"similar" hits. In this case, you either have to scan for the specific
entries, or
you first look up the array-specific IDs (like ProbeSets for
Affymetrix chips) of your gene (described in FAQ EXP2)
and then use these IDs for your GEO search instead of the gene
name.
Tip! H-Invitational
Database (H-InvDB) is a human gene database opened to the
public in April 2004, which is hosted by the Japan Biological
Information Research Center (JBIRC) and by the
DNA Databank of Japan (DDBJ),
with contributions from more than 40 institutes worldwide, like the
German DKFZ. The scope of H-InvDB is to provide an integrative
annotation of full-length cDNA clones available from high
throughput cDNA sequencing projects. The database generates cDNA
clusters describing their gene structures, and, among many other
features, showing data on gene expression profiling. You first
have to get the specific database entry of your gene of interest,
either via BLAST
(sequence) search or via keyword search,
and then look for the section "Gene expression profile" within the
so-called "Locus view". The colored symbol links
to the database H-ANGEL. H-ANGEL (Human Anatomic Gene Expression
Library) is a viewer of gene expression
data, in which you can see the expression data from several
experimental platforms (described in the H-InvDB
manual) and descriptions of expression from public data
resources. Gene expression data in H-ANGEL were generated from three
types of methods and in seven different platforms, including iAFLP,
a PCR-based quantitative expression profiling method, DNA arrays
and cDNA sequence tags (SAGE, EST and MPSS). Note that
H-ANGEL contains many tissues but not individual cell lines.
You may also query H-ANGEL using different identifiers like
H-Inv Locus ID, RefSeq, UniGene ID, or LocusLink ID. Please note
that a very nice
feature of H-ANGEL is the colored representation of expression data
from
all the different methods within one single graphical display, allowing
a very easy comparison between the different methods! Please also
refer to the H-InvDB section at the
Data Integration page for additional
information!
Normal
tissues of diverse types (Stanford University, Jan. 2005) is a
dataset which contains the expression profiles of a series of normal
human tissues. Samples were obtained by surgery or autopsy and
evaluated by
pathologists. Results provide insight into the molecular organization
of diverse cell types, and provide a baseline for comparison to
diseased tissues. These data are split into 4 different GEO datasets,
which are collectively described in GEO series record GSE2193:
SHBW: GDS1085;
SHCN: GDS1086;
SHBA: GDS1087;
SHDP: GDS1088.
NOTE:
These datasets not only cover different tissues but are based on
different platforms, which are non-commercial arrays of spotted cDNAs. NOTE:
These expression data are also integrated in the UCSC Gene Sorter,
listed as "Expression (Stanford)" in the dropdown menu "sort by",
allowing a very convenient "quick-view" of the expression profile
of your gene of interest !!!
2. SAGE data:
SAGE
offers expression profile comparison across a multitude of
human and mouse cancer and non-cancer cell lines via nucleotide tags.
Essentially, the SAGE technique does not measure the expression level of a
gene directly, but quantifies a "tag" which represents the transcription
product of a gene. A tag, for the purposes of SAGE, is a nucleotide
sequence of a defined length, directly 3'-adjacent to the 3'-most
restriction site for a particular restriction enzyme. As originally
described, the length of the tag was nine bases, and the restriction
enzyme NlaIII. Current SAGE protocols produce a ten to eleven base tag,
and, although NlaIII remains the most widely used restriction enzyme,
enzyme substitutions are possible. The data product of the SAGE
technique is a list of tags, with their corresponding count
values, and thus is a digital representation of cellular gene
expression.
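The tag definition given above can be illustrated with a short Python sketch. It assumes the NlaIII recognition site CATG and a 10-base tag; the function names and the toy transcript sequences are invented for illustration and are not taken from any SAGE library.

```python
from collections import Counter


def extract_sage_tag(transcript, site="CATG", tag_length=10):
    """Return the tag directly 3' of the 3'-most restriction site.

    Returns None if the site is absent or too close to the 3' end.
    """
    seq = transcript.upper()
    pos = seq.rfind(site)  # 3'-most occurrence of the site
    if pos == -1:
        return None
    start = pos + len(site)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def count_tags(transcripts, tag_length=10):
    """Count tags over a list of transcripts (the 'digital' readout)."""
    tags = (extract_sage_tag(t, tag_length=tag_length) for t in transcripts)
    return Counter(t for t in tags if t is not None)


# Invented toy sequences, for illustration only.
toy_a = "GGCATGAAACCCGGGTTTCATGACGTACGTAAGGCCTTAAAA"
toy_b = "TTTTCATGGGGGCCCCAAAATTTT"
counts = count_tags([toy_a, toy_a, toy_b, "AAAA"])
print(counts)
```

In the toy input, the transcript without a CATG site contributes no tag, which mirrors the fact that such transcripts are invisible to this enzyme.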
The SAGEmap
Virtual
Northern extracts SAGE tags and orientation signals from an
input-sequence and displays output links to expression values in
different cell lines. Alternatively, you may also query via SAGE tag or
via gene name. When you click on a specific tag in the output, a
"virtual Northern" is displayed showing the expression levels in a
multitude of cell types just like "bands" on a Northern blot. Please
note that often tags are NOT SPECIFIC for one gene / UniGene
cluster, so you should always check how many mRNA-source sequences are
supporting one tag, and which UniGene clusters these mRNAs belong to !
Tip! H-Invitational
Database (H-InvDB) (see description above) provides
expression data generated from all three types of methods
(Microarrays, SAGE, and ESTs).
Tip! ECgene
(gene prediction by EST clustering) predicts genes by combining
genome-based
EST clustering and a transcript assembly procedure. There is a
specific section of ECgene called ECexpression,
which is the expression data viewer of ECgene, but you may also start
at the ECgene homepage and retrieve the Gene Summary Viewer which also
displays EST and SAGE data across multiple tissues and cell types.
ECexpression utilizes
the extensive expression data from EST and SAGE
sources. Note that a very useful
feature is that normal and cancer libraries are divided
and also displayed separately in the graphs. Therefore, this layout
makes it easy to find any tissue-specific or cancer-specific isoforms !
Please refer to the main sections of
ECgene and ECexpression
for details !
Tip! SAGE Genie is a website of
the CGAP portal which provides
highly intuitive, visual displays of human and mouse gene expression,
based on a unique analytical process
that reliably matches SAGE tags, 10 or 17 nucleotides in
length, to known genes.
SAV (SAGE
Anatomic Viewer) is one of the tools of SAGE Genie. SAV displays gene
expression in human normal and malignant tissues by shading
each organ in one of ten colors, each representing a different level of
gene expression. Gene expression levels are based on the analysis of
counts of SAGE tags, which are either "short" (10 bp),
including "extracted short" (10 bp extracted from 17bp tag), or "long"
(17 bp).
SAV can be used to find the best tag for a
gene / accession number. NOTE: SAV is an excellent resource to examine
the reliability
of individual SAGE tags, meaning the probability that a tag is "unique"
or that it matches more than one gene (and thereby renders expression
data analysis highly difficult). The best tags are color coded. In
addition, the LTV (Ludwig Transcript
Viewer) display,
showing shorter alternative polyadenylated and internally primed
transcripts, supports the prediction of reliable SAGE tags. The tag
link enables the user to see which other gene(s) may be represented by
the particular tag
and the reliability of each mapping. The Digital Northern (DN) display
shows the expression of a particular gene (SAGE
tag count) per individual SAGE library as color coded tag count. The SAGE
Anatomic Viewer itself displays the SAGE tag expression count as colored
organ images which are hyperlinked to a Digital Northern displaying the
tag expression in each individual library.
3. EST data:
DigiNorthern, provided by the Bioinformatics Group of the
Roswell Park Cancer Institute, is a tool for virtually
displaying the expression profile of query genes (currently it only
accepts DNA sequence as
input) based on the EST sequences currently available at NCBI
GenBank. Note that this is a completely different approach from using
microarray data. In this case, expression is "measured" as the tissue
distribution of EST sequences corresponding to a certain gene, making
it a "rougher" method. Nevertheless, it can be quite interesting,
especially to see the differences in expression between normal and cancer
tissue.
There are currently two versions of this
program. DN1
takes one sequence as
query gene, lists all the cell lines/tissues/organs that express the
gene, and displays the relative expression levels of the gene based on
the number of matched ESTs vs the total number of ESTs for related
libraries. Wherever available, a comparison is also made between
the same tissue/organ in the normal and neoplastic state. DN2 takes two
sequences as query genes and compares their expression profiles
side by side. DigiNorthern is currently available for human and
mouse.
Please note that DigiNorthern is somewhat
similar to the SOURCE
database display of genes, which provides "UniGene and EST expression
information" in the lower part of the page.
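The EST-based measure described for DigiNorthern (matched ESTs relative to the total number of ESTs in a library) can be sketched as follows. This is not the DigiNorthern implementation; the scaling to "ESTs per million", the library names, and all counts are assumptions made for illustration.

```python
def est_expression(matched, totals, per=1_000_000):
    """Relative expression per library: matched ESTs per `per` total ESTs."""
    return {
        lib: per * matched.get(lib, 0) / total
        for lib, total in totals.items()
        if total > 0
    }


# Invented counts, for illustration only.
matched = {"brain_normal": 4, "brain_tumor": 10}
totals = {"brain_normal": 20000, "brain_tumor": 25000, "liver_normal": 10000}

levels = est_expression(matched, totals)
tumor_vs_normal = levels["brain_tumor"] / levels["brain_normal"]
print(levels)
print(tumor_vs_normal)
```

Because the counts per library are usually small, such ratios are only a rough indication and are best read as a qualitative trend.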
Tip! H-Invitational
Database (H-InvDB) (see description above) provides
expression data generated from all three types of methods
(Microarrays, SAGE, and ESTs).
Tip! ECgene
(gene prediction by EST clustering) predicts genes by combining
genome-based
EST clustering and a transcript assembly procedure. There is a
specific section of ECgene called ECexpression,
which is the expression data viewer of ECgene, but you may also start
at the ECgene homepage and retrieve the Gene Summary Viewer
which also displays EST and SAGE data across multiple tissues and cell
types. ECexpression utilizes
the extensive expression data from EST and SAGE
sources. Note that a very useful
feature is that normal and cancer libraries are divided
and also displayed separately in the graphs. Therefore, this layout
makes it easy to find any tissue-specific or cancer-specific isoforms !
Please refer to the main sections of
ECgene and ECexpression
for details !
4. RNA in situ expression images:
Please refer to FAQ EXP20 for
this purpose.
5. RT-PCR, Northern Blot, Western Blot data:
Please refer to FAQ EXP21 for
this purpose.
6. Protein microscopy data based on immunostains:
Please refer to FAQ PROT6
for this purpose.
EXP12...know which genes are expressed in fetal brain 6 times
higher than in adult brain ? (last
update Aug. 20, 2004)
This question is an example of all related
questions, like "know which genes are expressed in heart
(dendritic cells, monocytes...) 10 times higher than in kidney
(T-cells,
B-cells...)". In principle, all 3 kinds of methods (microarrays,
SAGE, ESTs) may be used for this purpose, but in practice the first
one yields the best results.
If we are interested in a specific tissue or cell
type, many of them are contained in large "whole transcriptome"
datasets, which are described in EXP11. For
example, a comparison between expression profiles of fetal and
adult brain is possible in datasets like the GNF Atlas version 1
(GEO dataset GDS181),
a large-scale analysis of the gene expression profiles from a diverse
array of human tissues, organs, and cell lines, from the normal
physiological state, using Affymetrix U95 chips. When you open the GDS
record, you have the option to compare 2 sets (A vs B) of cell
types or tissues, in order to produce lists of genes which are expressed x times
higher in one compared to the other. You simply select the
appropriate check-boxes and choose a value for x. When you hit the
"QueryAvsB"
button, the output is generated.
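The logic of such an "A vs B" comparison can be sketched offline in Python. How GEO computes the comparison internally is not described here; the sketch simply assumes that the mean signal of set A is divided by the mean signal of set B. Gene names and expression values are invented.

```python
from statistics import mean


def genes_higher_in_a(expression, set_a, set_b, fold):
    """List genes whose mean signal in set A is >= fold times that in set B."""
    hits = []
    for gene, values in expression.items():
        a = mean(values[sample] for sample in set_a)
        b = mean(values[sample] for sample in set_b)
        if b > 0 and a / b >= fold:
            hits.append(gene)
    return sorted(hits)


# Invented signal values, for illustration only.
expression = {
    "GENE_1": {"fetal_brain": 600, "adult_brain": 80},
    "GENE_2": {"fetal_brain": 300, "adult_brain": 100},
    "GENE_3": {"fetal_brain": 50, "adult_brain": 400},
}

six_fold = genes_higher_in_a(expression, ["fetal_brain"], ["adult_brain"], 6)
three_fold = genes_higher_in_a(expression, ["fetal_brain"], ["adult_brain"], 3)
print(six_fold)
print(three_fold)
```

Swapping the two sample lists answers the reverse question (higher in adult than in fetal brain).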
Note that in March 2004, Version 2 of the GNF Atlas
was released (see also EXP11). It was created not only
using the Affymetrix U133A chip (human), but also using data
derived from Novartis-designed arrays (human = GNF1H; mouse =
GNF1M), using a panel of RNAs derived from 79 human and 61 mouse
tissues or
cell types. Using the link "Search Expression", you are able to
choose a single tissue/cell type and see ONLY the genes
showing pronounced expression. When choosing the "Search Type"
"Correlation Search", you
may list genes showing high expression in e.g. B-cells and low
expression
in dendritic cells or T-cells. This series of array data is also
available as NCBI-GEO series record GSE1133, which
is split into 3 curated GDS (GEO Dataset) records: GDS592
covers the GNF1M data (mouse atlas); GDS594
covers the human GNF1H data; and GDS596
covers the human U133A data. Using these GDS records, it is possible to
perform these "compare 2 sets (A vs B)" analyses, as described
for the
version 1 above. The GNF1H array is described as platform
record
GPL1074;
the GNF1M array is described as GPL1073.
Note that both GNF Atlas 1+2
expression data are also integrated in the UCSC Gene Sorter,
allowing a very convenient "quick-view" of the expression profile
of your gene of interest, but not (yet) providing the option
to compare whole expression profiles between tissues.
EXP13...compare two different
microarray platforms and see which genes are represented on both of
them ? (last
update Aug. 31, 2005)
Tip! CleanEx,
provided by the
Swiss Institute of
Bioinformatics (SIB), is a database which provides access to public
gene
expression data via unique approved gene symbols and which
represents
heterogeneous expression data produced by different technologies in a
way that facilitates joint analysis and cross-dataset comparisons. So
far, CleanEx contains only human genes for which the symbol is
approved by the HUGO nomenclature committee. There is one entry per
gene name. Thus, CleanEx is NOT a repository of expression data
in the strict sense but it collects expression data from
several resources (GEO, ArrayExpress, SMD, etc.) in a gene-centered way
in order to make it available via one common interface.
In order to address this specific question, there is
a query option called Find
common genes in different datasets. This option lets you easily
compare the content of 2 different microarray (or other) platforms. The
comparison is only available between experiments from
one single organism. A table is produced listing ALL genes and their
platform-specific identifiers which overlap between 2 different
platforms. Note that only the short forms of gene names are listed but
not the full descriptions.
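The kind of table produced by this option can be sketched as a simple intersection of gene symbols. This is not the CleanEx implementation. The three IKIP ProbeSet IDs are the U133B IDs discussed in FAQ EXP15; all other gene names, identifiers, and the second platform are hypothetical placeholders.

```python
def common_genes(platform_a, platform_b):
    """Return (symbol, IDs on A, IDs on B) for genes present on both platforms."""
    shared = sorted(platform_a.keys() & platform_b.keys())
    return [(gene, platform_a[gene], platform_b[gene]) for gene in shared]


# Gene symbol -> platform-specific identifiers.
u133b = {
    "IKIP": ["227295_at", "235202_x_at", "236249_at"],
    "GENE_X": ["hypothetical_1_at"],  # placeholder entry
}
cdna_array = {  # hypothetical spotted cDNA platform
    "IKIP": ["CLONE_0001"],
    "GENE_Y": ["CLONE_0002"],
}

overlap = common_genes(u133b, cdna_array)
for gene, ids_a, ids_b in overlap:
    print(gene, ids_a, ids_b)
```

Keying both platforms by approved gene symbol is what makes the comparison possible, which is why only genes with an approved symbol can appear in the result.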
EXP14...compare a list of upregulated
genes from one microarray experiment with one or more other experiments
? (last
update Aug. 31, 2005)
Tip! CleanEx,
provided by the
Swiss Institute of
Bioinformatics (SIB), is a database which provides access to public
gene
expression data via unique approved gene symbols and which
represents
heterogeneous expression data produced by different technologies in a
way that facilitates joint analysis and cross-dataset comparisons. So
far, CleanEx contains only human genes for which the symbol is
approved by the HUGO nomenclature committee. There is one entry per
gene name. Thus, CleanEx is NOT a repository of expression data
in the strict sense but it collects expression data from
several resources (GEO, ArrayExpress, SMD, etc.) in a gene-centered way
in order to make it available via one common interface.
In order to address this specific question, there is
a query option called Step-by-step
analysis. This is an extremely interesting feature of
CleanEx as it allows the successive comparison of the results of
different expression experiments. Example: You may first
retrieve all genes which are immediately upregulated by the
inflammatory cytokine IL-1 in HUVEC (GDS649), and then use only this
subset of genes to analyze which of them are also upregulated in human
fibroblasts by UV radiation-induced DNA damage (GDS400). First, select the first dataset
(GDS649). Once in the selected dataset's form, select the two
experiment pools that you want to compare, like overexpression at time
point 0.5h compared to 0h, and submit your job. On the result page, you will have two
choices. You can either select some genes and extract the corresponding genomic
sequences, like potential promoter sequences ("upstream of TSS"). Or
you can select another microarray dataset (GDS400) which you want to
analyze using the gene subset derived from the first step. In the list,
you will see the number of overlapping genes from the first step with
each of the other datasets. Again, you have to choose a certain
comparison, like overexpression at high doses of UV as compared to the
control. If you want to, you may now even add another dataset for
comparison. Naturally, the gene set in the output list will get smaller
after each round of analysis. NOTE: CleanEx does not
seem to consider the "Absent" and "Present" detection calls of
Affymetrix, meaning that "absent" genes are not filtered out.
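The successive narrowing of the gene set can be sketched in a few lines of Python. This only illustrates the idea of the step-by-step analysis, not how CleanEx implements it; the gene names and the content of each set are invented.

```python
def step_by_step(first, *others):
    """Intersect upregulated gene sets in order; report the size after each round."""
    remaining = set(first)
    sizes = [len(remaining)]
    for gene_set in others:
        remaining &= set(gene_set)
        sizes.append(len(remaining))
    return sorted(remaining), sizes


# Invented gene sets, for illustration only.
up_il1_huvec = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
up_uv_fibroblasts = {"GENE_B", "GENE_D", "GENE_E"}
up_third_dataset = {"GENE_D", "GENE_F"}

genes, sizes = step_by_step(up_il1_huvec, up_uv_fibroblasts, up_third_dataset)
print(genes)
print(sizes)
```

The list of sizes can never increase from one round to the next, in line with the observation that the output list gets smaller after each round.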
EXP15...know if alternative
transcripts of a gene of interest are expressed in different
tissues or in different diseases ?
(last
update Sep. 16, 2005)
This question demands the availability of resources which partition
the expression data of a specific gene according to its different
splice variants. There is a whole list of databases which all
store
information on alternative splicing, but only few of them provide
direct links to associated expression data. In general, we may again
separate resources in the field of microarray technology and
resources storing EST and SAGE-related expression data.
1. Microarray data:
1.1. Identification of the splice variant represented on the chip:
NOTE that the general
question of how to identify microarray platforms containing a specific
gene is treated in FAQ EXP2!
In general, 2 main types of nucleotide-based
microarrays exist, those which have cDNAs and those which have
a set of oligonucleotides spotted on the chips. Normally, it is
not obvious which of the different splice variants of a gene is
represented by a certain molecule on a chip. Let us take the human gene
IKIP
as an example, which has 4 exons occurring in 3 different splice forms.
We may query the CleanEx
Target database using this gene name. A list of microarray
targets is presented, containing several Affymetrix ProbeSets and cDNA
clones (e.g. IMAGE IDs). In addition, we may simply query the NCBI GEO
Gene profiles using this gene name which will also retrieve cDNAs and
oligonucleotide sets spotted on arrays.
In the case of ESTs, we have to consider a
specific problem. EST sequences in databases usually do not cover the
whole sequence of a certain clone but only a part of it. Thus, if a
certain EST is spotted on a microarray, it may not be clear which
transcript it corresponds to. In general, the EST (or other mRNA)
sequence can be fetched from NCBI
Entrez Nucleotide using the accession number. Then, you may perform
a BLAT
search at UCSC against the human genome in order to locate the region
(and the splice variant) to which this clone corresponds.
In the case of oligonucleotide sets (e.g.
Affymetrix U133 ProbeSets), the situation is a little more complex. In
the CleanEx list, we can see 3 different Affymetrix ProbeSets
for the HG-U133B chip type. Affymetrix oligos usually are designed from
a so-called target sequence within the 3' regions of mRNAs
(3'UTRs and/or 3'coding regions). Thus, in order to know which
transcript a ProbeSet corresponds to, we have to extract this target
sequence. Clicking on the CleanEx link for U133B:227295_at
reveals the list of oligos and the positions within the targeted
sequences. Unfortunately, there is no "contig" of the complete target
sequence and there is no graph which immediately shows the exon where
the target sequence is taken from. For this purpose, a "Quick
Query" at the Affymetrix NetAffx
analysis center is needed (free registration required), which retrieves
the record for the ProbeSet
227295_at. Here, not only the individual oligonucleotides but also
the complete target sequence is shown. You may now simply
COPY/PASTE this sequence and perform a BLAT
search at UCSC against the human genome in order to locate this region
to the last of the 4 exons of the IKIP gene. NOTE: This
region is NOT exactly the same as the one which is shown for
this ProbeSet by the UCSC browser when setting the option "Affy U133"
to "full" (indeed, this entry would cover both exon 1 and exon 4)!!!
So, if you want to precisely locate the target region, it seems
to be necessary to use the NetAffx portal.
The other 2 ProbeSets (CleanEx: 235202_x_at
and 236249_at;
NetAffx: 235202_x_at
and 236249_at)
can be treated in the same way, demonstrating that these ProbeSets
correspond to exon 3 of the 4 exons of the IKIP gene.
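Once a BLAT hit and the exon coordinates of the gene are known, assigning the target sequence to an exon is a simple interval overlap test. The sketch below assumes closed, 1-based genomic intervals on the same chromosome and strand; the coordinates are invented and are NOT the real IKIP exon positions.

```python
def overlapping_exons(hit_start, hit_end, exons):
    """Return the numbers of all exons overlapped by a genomic hit."""
    return [
        number
        for number, (start, end) in sorted(exons.items())
        if hit_start <= end and start <= hit_end
    ]


# Invented exon coordinates (exon number -> (start, end)).
exons = {
    1: (1000, 1200),
    2: (3000, 3150),
    3: (5000, 5400),
    4: (8000, 9500),
}

print(overlapping_exons(9000, 9450, exons))  # hit inside the last exon
print(overlapping_exons(5350, 5390, exons))  # hit inside exon 3
print(overlapping_exons(1100, 8100, exons))  # span touching all exons
```

A hit that is reported as spanning several exons, as in the last call, corresponds to the situation noted above where a browser track covers both exon 1 and exon 4.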
1.2. Analysis of splice variant-specific expression data:
Now that we know the particular Affymetrix
ProbeSets which correspond to the different splice variants of the
IKIP gene, we may specifically search for associated expression data
and determine which of the different resources are suitable for this
purpose.
The UCSC Gene Sorter,
although an excellent resource for a quick-view of "virtual Multiple
Tissue Northern", is not suitable to decipher transcript
specific expression patterns. A test using the 3 different ProbeSets of
our example showed that all of them directed the query to the pattern
of ProbeSet 235202_x_at.
In contrast, the UCSC Genome Browser,
which also integrates expression data, only shows the red/green images
for the ProbeSet 227295_at (you have to set the option "GNF Atlas 2" to
"full" for this purpose). There is no expression data for the other 2
ProbeSets.
Tip! GEO
(Gene Expression Omnibus) is probably the best resource to
search for ProbeSet-specific and therefore in this example
transcript-specific expression data. A simple query for GEO Profiles
using "235202_x_at" as query reveals a list of experiments containing
data for this set. Note that all IKIP-related ProbeSets are
spotted on the U133B chip which is used less frequently than the
U133A chip. As 2 ProbeSets (235202_x_at and 236249_at) match to
the same exon, we would expect similar expression profiles with these 2
IDs, which actually is the case. Nevertheless, the profiles of ProbeSet
227295_at are similar too, not (yet) suggesting transcript-specific
expression differences. For a full description of GEO, please refer to
the GEO section at the main page.
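Whether two ProbeSets show "similar" profiles across the samples of one GEO dataset can be judged by eye, or quantified, for example with a Pearson correlation. The sketch below is one possible way to do this and is not a GEO feature; the signal values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sd_x * sd_y)


# Invented signals of two ProbeSets over the same four samples.
probeset_1 = [120.0, 340.0, 90.0, 410.0]
probeset_2 = [100.0, 300.0, 110.0, 380.0]
r = pearson(probeset_1, probeset_2)
print(round(r, 3))
```

A value close to 1 indicates that both ProbeSets rise and fall together; markedly lower values for ProbeSets on different exons would be a first hint of transcript-specific expression.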
NOTE: The CleanEx list of target sets
for IKIP also contains several ProbeSets of the U95 series of
chips, which all are true IKIP sequences (checked by BLAT).
Nevertheless, these ProbeSets are NOT listed in the "U95"-section of
the UCSC
Genome Browser, when displaying the IKIP genomic region. This
is probably due to an inconsistency of annotation freezes.
2. EST and SAGE data:
ASD - Alternative
Splicing Database, maintained at the EBI, aims to understand the
mechanism of alternative splicing on a genome-wide scale by creating a
database of alternative splice events and the resultant isoform splice
patterns of genes from human, and other model species. Simple
queries can be placed at the ASD start site. The Advanced
Query page offers a kind of "BioMart-style" interface which also
allows filtering
the complete dataset to retrieve subsets which are defined by certain
features like type of splice event, human-mouse conservation, SNP
types, and more. At the gene-specific output pages, there is a section
called Splice Pattern Viewer which displays transcripts as
interactive
graphics with expression state information and links to the splice
pattern table (with
the appropriate splice pattern highlighted), along with expression
state information and pattern sequence. The Splice Pattern Table
lists the number of confirming ESTs
for each transcript as well as corresponding source libraries which in
turn contain data on tissue, development, and
pathology state. Note that ASD is a very good source to
link splicing variants
with tissue and cell type-specific expression, although these
data are available in tabular form only, not as graphs. Please refer
also to the ASD main section for
details !
Tip! ECgene
(gene prediction by EST clustering) predicts genes by combining
genome-based
EST clustering and a transcript assembly procedure. There is a
specific section of ECgene called ECexpression,
which is the expression data viewer of ECgene, but you may also start
at the ECgene homepage and retrieve the Gene Summary Viewer
which also
displays EST and SAGE data across multiple tissues and cell types.
ECexpression utilizes
the extensive expression data from EST and SAGE
sources. Note that a very useful
feature is that normal and cancer libraries are divided
and also displayed separately in the graphs. Therefore, this layout
makes it easy to find any tissue-specific or cancer-specific isoforms !
Gene Summary
includes a graphical image of the transcripts along with a
"reliability rating" (from A to C), and the links to the different ECgene
transcript IDs. This table is especially useful as it shows the
total lengths of the transcripts, the sizes of UTRs, the lengths of the
CDS and predicted peptide sequences. Transcript Summary is
shown by clicking on one of
the transcript IDs in Gene Summary. This page shows the transcript
image in detail, including the protein
domains and motifs (from Motif/Domain Viewer). Also, more detailed
views of the functional annotation and the expression data are shown,
now corresponding only to the specific transcript. In
particular, SAGE data are presented as "virtual microarray"
(red/green) image. Also, the source libraries of all individual ESTs
are listed. Please refer to the main
sections of ECgene and ECexpression
for details !
EXP16...generate contig sequences
from a set of ESTs while considering alternative splicing ? (last
update Sep. 16, 2005)
This question addresses a "historical" matter of debate: whether it is
feasible to automatically separate EST sequences into subsets belonging
to different splice products and to generate the respective contig
sequences. Although databases like UniGene
collect all ESTs belonging to a specific gene, UniGene does not attempt
to generate individual contig sequences of splice products.
Nevertheless, there are resources which take a user set of sequences as
input and try to generate the corresponding transcripts.
Tip!
ASmodeler is a web-based utility that finds
gene models including alternative splicing events from genomic alignment of
mRNA, EST and protein sequences. ASmodeler is part of the portal ECgene. Please refer also
to the ECgene section at the main page.
The user may supply a UniGene cluster ID and /or a set of mRNA, EST, or
protein sequences. User-supplied sequences are aligned
against the genome map using BLAT and SIM4 programs. Resulting exon
connectivity is analyzed by applying graph-theoretic methods to build
all possible gene models including splice variants. In addition to the
user-supplied sequences, UniGene clusters and many well-known gene
predictions, such as Genscan, Ensembl, and Acembly, may be included in gene
modeling. The current implementation supports the human, mouse
and rat genomes. The output consists of a list of
predicted transcripts together with the deduced amino acid sequences. NOTE:
Test runs showed that it may be problematic to use FASTA files with
long ID lines, whereas the sample sequence set (which only shows the EST
GenBank acc. in the ID line) works well!
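If your own FASTA file has long ID lines, you can shorten them to the bare accession before uploading. The following Python sketch is one way to do this; it assumes NCBI-style headers of the form "gi|...|gb|ACCESSION|" or headers that start with the accession, and the example accessions and sequences are hypothetical placeholders.

```python
def pick_accession(header):
    """Reduce a FASTA header (without '>') to a short accession-like ID."""
    tokens = header.split()
    first = tokens[0] if tokens else ""
    fields = first.split("|")
    # NCBI-style headers: the accession follows a database tag such as 'gb'.
    for tag in ("gb", "emb", "dbj", "ref"):
        if tag in fields:
            i = fields.index(tag)
            if i + 1 < len(fields) and fields[i + 1]:
                return fields[i + 1]
    return first


def shorten_fasta_headers(fasta_text):
    """Return the FASTA text with every ID line cut down to the accession."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(">" + pick_accession(line[1:]))
        else:
            out.append(line)
    return "\n".join(out)


# Hypothetical input, for illustration only.
raw = (
    ">gi|0000000|gb|XX000000.1| hypothetical EST clone, 5' end\n"
    "ACGTACGTAC\n"
    ">XX000001.1 another hypothetical EST\n"
    "GGGTTTAAAC"
)
cleaned = shorten_fasta_headers(raw)
print(cleaned)
```

Sequence lines are passed through unchanged, so the cleaned text can be pasted directly into the input form.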
EXP17...analyze the expression of a
gene set of interest in cancer tissues ? -> see GENOM9 !
As the resources in the field of cancer
research are described in section "Genomics", this FAQ is also
located at the Genomics FAQ page.
EXP18...determine the expression
profiles of normal vs. cancer tissues ? ->
see GENOM10 !
As the resources in the field of cancer
research are described in section "Genomics", this FAQ is also
located at the Genomics FAQ page.
EXP19...determine the reliability of
individual SAGE tags for expression analysis of a specific gene ? (last
update Feb. 10, 2006)
This question addresses a general problem in analyzing gene
expression via SAGE tag counts: many SAGE tags are not
unique, meaning that they match more than one gene. This
makes gene expression analysis difficult, as the total SAGE
tag count is (or may be) a mixture of the expression of several source
genes. One strategy to circumvent this problem was the development of longer
SAGE tags, which are more sequence-specific.
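The effect of tag length on uniqueness can be illustrated with a minimal Python sketch. The gene names and tag sequences below are invented toy data, not real SAGE tags, and deriving a short tag by taking the first 10 bases of a 17-base tag is an assumption made purely for illustration.

```python
from collections import defaultdict


def tags_to_genes(gene_tags, tag_length):
    """Map each tag (truncated to tag_length bases) to the set of genes it matches."""
    mapping = defaultdict(set)
    for gene, tag in gene_tags.items():
        mapping[tag[:tag_length]].add(gene)
    return dict(mapping)


def ambiguous_tags(mapping):
    """Return only the tags that match more than one gene."""
    return {tag: sorted(genes) for tag, genes in mapping.items() if len(genes) > 1}


# Hypothetical 17-base tags: geneA and geneB share their first 10 bases.
TOY_TAGS = {
    "geneA": "ACGTACGTAC" + "AAAAAAA",
    "geneB": "ACGTACGTAC" + "CCCCCCC",
    "geneC": "TTTTGGGGCC" + "AAAACCC",
}

if __name__ == "__main__":
    for length in (10, 17):
        clashes = ambiguous_tags(tags_to_genes(TOY_TAGS, length))
        print(length, "bases:", clashes or "all tags unique")
```

In this toy set, a count for the shared 10-base tag cannot be assigned to geneA or geneB alone, whereas the 17-base tags separate the two genes.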
Tip!
SAGE Genie is a
website of the CGAP portal
which provides highly intuitive, visual displays of human and mouse gene expression,
based on a unique analytical process
that reliably matches SAGE tags, 10 or 17 nucleotides in
length, to known genes.
SAV (SAGE
Anatomic Viewer) is one of the tools of SAGE Genie. SAV displays gene
expression in human normal and malignant tissues by shading
each organ in one of ten colors, each representing a different level of
gene expression. Gene expression levels are based on the analysis of
counts of SAGE tags, which are either "short" (10 bp),
including "extracted short" (10 bp extracted from 17bp tag), or "long"
(17 bp).
SAV can be used to find the best tag for a
gene or accession number. NOTE: SAV is an excellent resource for examining
the reliability of individual SAGE tags, meaning the probability that a
tag is "unique" rather than matching more than one gene (which would make
expression data analysis difficult). The best tags are color coded. In
addition, the LTV (Ludwig Transcript Viewer) display,
which shows shorter alternatively polyadenylated and internally primed
transcripts, supports the prediction of reliable SAGE tags. The tag
link enables the user to see which other gene(s) may be represented by
a particular tag and the reliability of each mapping.
EXP20...get in situ microscopy images of RNA
tissue localization and expression intensity ? (last
update Mar. 2, 2006)
This question is somewhat related to FAQ PROT6,
which describes databases storing images of
protein localizations in diverse tissues, based on antibody staining.
This FAQ describes databases which store in
situ hybridization data representing both the localization
and the expression intensity of
RNA. These databases store microscopic images of such in situ
hybridization experiments. Resources which fall into this category are
described in the main
section "RNA Localization Databases".
Tip!
UCSC VisiGene
is a browser for viewing in situ images, provided by
the UCSC Genome
Bioinformatics portal. It enables
the user to examine cell-by-cell as well as tissue-by-tissue expression
patterns. The browser serves as a virtual microscope, allowing
users to retrieve images that meet specific search criteria, then
interactively zoom and scroll across the collection. Please refer to
the VisiGene
section of the User Guide for a list of currently available image
collections. NOTE: As such, VisiGene is also a good link
page to resources of in situ hybridization data in general!
Searching VisiGene: The image database may be
searched by gene symbols, authors, years of
publication, body parts, GenBank or UniProt accessions, organisms, Theiler
stages (mice), and Nieuwkoop/Faber
stages (frogs).
Image Navigation and Download: Following a
successful search, VisiGene displays a list of thumbnails
of images matching the search criteria in the left-hand pane of the
browser. By default, the image corresponding to the first thumbnail in
the list is displayed in the main image pane. The image may be zoomed
in or out, sized to match the resolution of the original image or best
fit the image display window, and moved or scrolled in any direction to
focus on areas of interest. The original full-sized image may also be downloaded.
EXP21...get RT-PCR, Northern Blot,
and Western Blot expression data ? (last
update Mar. 3, 2006)
Until recently, expression data based on laboratory techniques like RT-PCR,
Northern Blot, or Western Blot were accessible only
at the level of individual publication figures. Now, resources are
starting to emerge which try to build centralized repositories of such
data which can be queried by "simple" gene name searches. As these sites
often also store microscopy images of tissue sections, there is
a connection to FAQ EXP20. This question is
also somewhat related to FAQ PROT6,
which describes databases storing images of
protein localizations in diverse tissues, based on antibody
staining.
Tip!
MGI (Mouse Genome
Informatics) is maintained at the Jackson Laboratory,
Maine, USA, and collects all data about mouse
genes, nomenclature,
map positions, individual ESTs, and mouse expression data. The GXD
(Gene Expression Database) integrates different types of gene
expression information (RNA in situ data, RT-PCR data, Northern Blot,
Western Blot) from the
mouse and provides a searchable index of published experiments on
endogenous gene expression during development. The GXD-Gene
Expression page allows the user to query mouse expression data via different
options.