Regulatory Element Discovery from Raw Expression Data
RED2 provides a simple and efficient way of discovering regulatory elements from whole-genome expression data (e.g. microarray, RNA-seq or mRNA decay).
RED2 does not require lists of up- or down-regulated genes, nor any pre-computed gene clustering.
Instead, RED2 estimates motif densities around each point (gene) in the expression space, and searches for motifs whose presence in a sequence is informative about the expression of the corresponding gene.
Please send any questions, suggestions or bug reports to
red2@lirmm.fr.
RED2 online execution
User's Guide
Input
RED2's input consists of
nucleic acid sequences and
expression data. Be careful not to provide a selection of your genes, as RED2 is designed to run on whole-genome data (a minimum of 500 genes is required).
- Sequences: You can select one of the available species from
the drop-down list or upload your own sequence file in FASTA-like
format. It should mainly contain CAPITALIZED nucleotides (A C G T only)
as anything else is considered as a masked position. Commented lines are
allowed if they begin with '#'. We recommend that you compress your
file (.gz or .zip) prior to upload (SIZE LIMIT IS 16 MB).
# Example sequence file
>seq1_name
ACACGCGCAGGACCNNNNNNNNNNNCACAAC
CGCGATATGGCTGACCACTAAACGAGA
>seq2_name
ACACGCGCAGGACCacgtgtagtagCACAAC
CGCGATATGGCTGACCACTAAACGAGA
- Expression file : Each line corresponds to an expression
profile, i.e. a sequence name followed by one or more expression
measurements, separated by space or tab characters. The first
non-commented line must contain the column labels (do not use spaces in your column labels). Missing values are
not allowed and all lines must have the same number of columns
(conditions).
# Example expression file
sequences exp1 exp2 exp3
seq1_name 0.3 4.2 -6.1
seq2_name -1.2 1.3 4.5
Warning : You should remove recent duplicates and members of
multigene families from your datasets, as they are prone to hybridize to
the same probes in microarray experiments, potentially leading to spurious
motifs.
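Before uploading, it can help to check both files locally. The sketch below is only an illustration of the format rules described above, not RED2's own parser: the function names are hypothetical, and rewriting masked positions as 'N' is our convention.

```python
import re

MIN_GENES = 500  # minimum number of genes stated in this guide


def read_sequences(lines):
    """Parse FASTA-like lines into {name: sequence}; '#' lines are comments."""
    seqs, name = {}, None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            # Anything other than upper-case A, C, G, T counts as masked;
            # masked positions are rewritten as 'N' here (assumed convention).
            seqs[name] += re.sub(r"[^ACGT]", "N", line)
    return seqs


def read_expression(lines):
    """Parse the expression table into (column labels, {name: [values]})."""
    rows = [l.split() for l in lines
            if l.strip() and not l.lstrip().startswith("#")]
    if not rows:
        raise ValueError("empty expression file")
    labels, profiles = rows[0], {}
    for fields in rows[1:]:
        if len(fields) != len(labels):
            raise ValueError(f"{fields[0]}: expected {len(labels)} columns")
        # float() raises ValueError on non-numeric fields such as 'NA'
        profiles[fields[0]] = [float(v) for v in fields[1:]]
    return labels, profiles


def check_gene_count(seqs, profiles, minimum=MIN_GENES):
    """Return the names present in both files; fail if there are too few."""
    shared = sorted(set(seqs) & set(profiles))
    if len(shared) < minimum:
        raise ValueError(f"only {len(shared)} genes matched, need {minimum}")
    return shared
```

For real files, pass open file handles, e.g. `read_sequences(open("sequences.fa"))` (the file name is an example only).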
Parameters
- Analysis Type
- Double strand : forward and reverse strands are considered.
Recommended for DNA sequences.
- Single strand : only the forward strand is considered.
Recommended for RNA, 5' or 3' UTR sequences.
- Sequence Location
- Upstream : nucleotide positions are defined with respect to
the right extremity of the provided sequences
(...,-3,-2,-1).
- Downstream : nucleotide positions are defined with respect
to the left extremity of the provided sequences (1,2,3,...).
- Length of considered region. The maximum number of bp
considered in the input sequences. If your sequences are longer, the
leftmost (in an upstream analysis) or rightmost (in a downstream
analysis) nucleotides will be discarded.
- Scoring function used to evaluate the motifs.
- Mutual information measures the mutual dependence between
the presence/absence of a motif in the regulatory sequences and the
location of the corresponding genes in the expression space.
- The hypergeometric function measures motif
over-representation in the regulatory sequences of neighboring genes
in the expression space. The score of a motif corresponds to the
negative logarithm of the p-value associated with the neighborhood
showing the highest over-representation.
- Distance function used to compare the expression profiles and
define the neighborhood of each gene in the expression space.
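To make the hypergeometric score more concrete, here is a minimal sketch of the idea described above. It is not RED2's implementation: the Euclidean distance, the base-10 logarithm and all function names are assumptions made for illustration.

```python
import math


def hypergeom_sf(k, N, M, n):
    """P(X >= k) when n genes are drawn from N genes, M of which carry the motif."""
    top = min(n, M)
    tail = sum(math.comb(M, i) * math.comb(N - M, n - i) for i in range(k, top + 1))
    return tail / math.comb(N, n)


def euclidean(p, q):
    """Assumed distance function between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def neighborhood(profiles, center, size, dist=euclidean):
    """Indices of the gene itself plus its size-1 nearest genes."""
    # The (j != center) flag sorts the gene itself first, even with ties.
    order = sorted(range(len(profiles)),
                   key=lambda j: (j != center, dist(profiles[center], profiles[j])))
    return order[:size]


def hypergeometric_score(profiles, has_motif, size):
    """Negative log10 p-value of the most over-represented neighborhood."""
    N, M = len(profiles), sum(has_motif)
    best_p = 1.0
    for g in range(N):
        k = sum(has_motif[j] for j in neighborhood(profiles, g, size))
        best_p = min(best_p, hypergeom_sf(k, N, M, size))
    return -math.log10(best_p)
```

Here `profiles` is a list of expression profiles (one list of values per gene) and `has_motif` is a parallel list of booleans. This brute-force version compares every pair of genes, so it is only meant for small examples.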
Output
For each discovered motif, RED2 outputs :
- The IUPAC representation of the motif,
its seed (in parentheses) and links to additional analysis files (detailed
below).
- The motif score, expressed in bits or negative log p-val
(logP), depending on the selected scoring function.
- The number of positive genes (i.e. the number of sequences
that contain at least one occurrence of the motif).
- The total number of motif occurrences.
- The strand bias. The P-value is computed using the binomial
distribution (for double strand analysis only).
- The best GO term enrichment. The P-value is computed using the
hypergeometric distribution with the following parameters. b :
number of positive genes annotated with that term, n : number
of positive genes with at least one annotation, B : number of
considered genes annotated with that term, N : number of
considered genes. The P-value is Bonferroni corrected for the number of
tested terms. Note that this feature is only available for the species
selected from the drop-down list.
- The logo of the motif (left figure).
- The expression heatmap (middle figure). The vertical axis
corresponds to the expression level. Each column corresponds to a
condition (column) of the expression file, and is divided into 40
equally-spaced bins. The color indicates the motif
over/under-representation in each of these bins. The units of the color
scale (on the right) are expressed in standard deviations according to
the hypergeometric law, under the null hypothesis that motifs are
uniformly distributed in the expression space (and consequently among
the bins).
- The distribution of the motif occurrence positions (right
figure). Occurrence positions are relative to the left or right
extremity of the sequences, depending on the Sequence location
parameter.
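The GO term enrichment P-value described above can be reproduced with the parameters b, n, B and N as follows. This is a sketch under assumptions: the function names and the input layout (a dictionary of counts per term) are hypothetical, and only the upper tail of the hypergeometric distribution is considered.

```python
import math


def go_enrichment_pvalue(b, n, B, N):
    """P(X >= b): n positive genes among N considered genes, B annotated with the term."""
    tail = sum(math.comb(B, i) * math.comb(N - B, n - i)
               for i in range(b, min(n, B) + 1))
    return tail / math.comb(N, n)


def best_go_term(term_counts, n, N):
    """term_counts maps term -> (b, B). Returns (term, Bonferroni-corrected P-value)."""
    n_tests = len(term_counts)  # Bonferroni: multiply by the number of tested terms
    corrected = {
        term: min(1.0, go_enrichment_pvalue(b, n, B, N) * n_tests)
        for term, (b, B) in term_counts.items()
    }
    term = min(corrected, key=corrected.get)
    return term, corrected[term]
```

For example, `best_go_term({"term_a": (2, 2), "term_b": (1, 2)}, n=2, N=4)` selects `term_a` (the term names are placeholders).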
Additional analysis files
For each motif :
- The Positive genes file contains the list of genes containing
the motif, sorted by motif density, along with the number of motif
occurrence(s) in each sequence.
- The Occurrences file contains every occurrence of the motif.
The column labeled "left" (respectively "right") contains the number of
nucleotides comprised between the motif occurrence and the sequence left
(respectively right) extremity. The column labeled "context" contains
the actual occurrence sequence, plus 10 positions on each side
(delimited by a space character). In cases where a position falls outside
of the available sequence, it is filled with a dot.
- The Expression file contains information about the motif
distribution for each column/condition of the expression file. The
column labeled "Col. ID" contains the number of the column (starting at
1). The column labeled "Mean value motif" contains the mean expression
value of the genes containing the motif ("Mean value all" is for all
genes). The column labeled "P-val (KS)" contains the P-value of
observing the motif distribution for that particular column, according
to a two-sample Kolmogorov-Smirnov test (positive vs negative genes).
Columns (lines) are ordered with respect to this P-value.
- The Comparison to other motifs file. The column labeled
"overlap" contains the number of positions that overlap with the compared
motif. The column labeled "correlation" contains the Pearson coefficient
between their density profiles (see the Motif density section).
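As an illustration of the "left", "right" and "context" columns of the Occurrences file, the following sketch scans the forward strand of one sequence for an IUPAC motif. It is an assumption-laden example, not the code that produces the file: the function name is hypothetical and "left" is taken to be the 0-based start of the occurrence.

```python
# IUPAC codes and the nucleotides they stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "R": "AG", "M": "AC", "S": "CG", "Y": "CT", "K": "GT",
    "H": "ACT", "V": "ACG", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def find_occurrences(motif, seq, flank=10):
    """Return (left, right, context) for each forward-strand match of motif in seq."""
    m = len(motif)
    rows = []
    for start in range(len(seq) - m + 1):
        window = seq[start:start + m]
        # A masked position (e.g. 'N' in the sequence) never matches any code.
        if all(base in IUPAC[code] for code, base in zip(motif, window)):
            left = start                   # nucleotides before the occurrence
            right = len(seq) - start - m   # nucleotides after the occurrence
            # Flanks that fall outside the sequence are filled with dots.
            before = seq[max(0, start - flank):start].rjust(flank, ".")
            after = seq[start + m:start + m + flank].ljust(flank, ".")
            rows.append((left, right, f"{before} {window} {after}"))
    return rows
```

For instance, `find_occurrences("ACW", "GGACATT")` reports a single occurrence, `ACA`, with two nucleotides on each side and dot-filled flanks.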
Top :
- The List of considered genes file contains the gene
identifiers from your expression file that were matched to a sequence
and included in the analysis.
- The List of excluded genes file contains the gene identifiers
for which no matching sequence was found.
- The All qmers file contains the score and FDR estimated for
every possible q-mer (these include the seeds of the motifs).
- The Selected motifs file contains the same motifs as in the
HTML output, but in plain text format.
Advanced Parameters
- Expression Data Normalization. Each row/column in the
expression file will be normalized to have mean value 0 and standard
deviation 1.
- Neighborhood size [50,1000]; default=200. The neighborhood of a
sequence is defined as the sequence itself plus the K-1
nearest sequences in the expression space (K being the neighborhood
size), according to the selected distance function. To achieve good
sensitivity, the neighborhood must be significantly smaller than the
total number of sequences, but large enough to allow a good estimation
of the motif densities.
- Seed length [6,8]; default=7.
- Max. motif length [6,15]; default=9.
- FDR threshold [0.001,0.1]; default=0.001. RED2 starts
by computing the score of every possible q-mer (7-mer with the default
seed length) and estimates a False Discovery Rate (FDR) by repeating this
procedure on 10 randomized datasets. Only the q-mers with an FDR lower
than or equal to this threshold (the seeds) will be considered for further
optimization. The value of this parameter affects the number of motifs in
the final output.
- Min. Pearson correlation between seeds and motifs [-1,1];
default=0.75. The min. Pearson correlation required between the density profile of a motif and its seed during the
optimization step. This parameter is used to ensure that optimized
motifs are distributed similarly to their seed in the expression space.
It can also prevent the merging of motifs with similar sequence but
different effects on the expression. High parameter values lead to more
motifs with fewer degenerate positions, while a value of -1 (no
restriction) will lead to a smaller set of motifs, with more degenerate
positions, and potentially some merged motifs.
- Max. correlation between final motifs [-1,1]; default=0.75.
After the optimization step, a motif is considered redundant and
discarded from the output if it is similar to a better-scoring
motif. The first criterion for similarity is the Pearson correlation
between the motif density profiles. The second criterion is defined
below.
- Max. overlap between final motifs [0,15]; default=4. The second
criterion for similarity between two motifs (see above) is the overlap,
which is the maximum number of compatible positions (e.g. A and W, see
IUPAC code below) that can be aligned without gaps or mismatches. For
example AAA and AWC have an overlap of 2.
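The overlap between two motifs can be computed by sliding one motif along the other, as in the sketch below. It assumes that two IUPAC codes are "compatible" when they share at least one nucleotide, which matches the A and W example but is not confirmed by this guide; the function names are hypothetical.

```python
# IUPAC codes and the nucleotides they stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "R": "AG", "M": "AC", "S": "CG", "Y": "CT", "K": "GT",
    "H": "ACT", "V": "ACG", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def compatible(x, y):
    """Assumed rule: two codes are compatible if they share a nucleotide."""
    return bool(set(IUPAC[x]) & set(IUPAC[y]))


def overlap(a, b):
    """Longest gap-free alignment of a and b in which every aligned pair is compatible."""
    best = 0
    # shift is the position of b[0] relative to a[0]
    for shift in range(-(len(b) - 1), len(a)):
        pairs = [(a[i], b[i - shift])
                 for i in range(max(0, shift), min(len(a), shift + len(b)))]
        if pairs and all(compatible(x, y) for x, y in pairs):
            best = max(best, len(pairs))
    return best
```

With this definition, `overlap("AAA", "AWC")` is 2: the last two positions of AAA align with the first two positions of AWC.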
Tuning
- To increase the specificity of the returned motifs, increase
the Min. Pearson correlation between seeds and motifs.
- To remove redundancy from the output, decrease the Max.
correlation and Max. overlap between final motifs.
- If the heatmaps seem uninformative or contain only horizontal
lines, try to normalize each profile by using the ROW
option in the Expression Data Normalization.
- If RED2 returns no or very few motifs, increase the
FDR threshold or try a different seed length.
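For reference, row normalization as described in the Advanced Parameters (mean value 0 and standard deviation 1 for each profile) can be sketched as below. Using the population standard deviation and mapping constant profiles to zeros are assumptions; the function name is hypothetical.

```python
import statistics


def normalize_rows(profiles):
    """Scale each expression profile to mean 0 and standard deviation 1."""
    out = {}
    for gene, values in profiles.items():
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)  # population standard deviation (assumed)
        # A constant profile has sd == 0; it is mapped to all zeros here.
        out[gene] = [(v - mean) / sd if sd > 0 else 0.0 for v in values]
    return out
```

After this step, profiles are compared by their shape across conditions rather than by their overall expression level.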
Motif density
- The density of a motif around a particular gene is the
proportion of genes that possess the motif in the neighborhood of that
gene (in the expression space). In the RED2 output, density
is expressed in standard deviations according to the hypergeometric law,
under the null hypothesis that motifs are uniformly distributed in the
expression space. Motif densities are useful for determining the
genes that are most likely regulated by a particular motif, since
occurrences in high density regions are more likely to be functional
than occurrences in low density regions.
- The density profile of a motif is a vector that contains the
motif density around each gene in the data set (even those that do not
contain the motif). To determine if two motifs are associated with
similar expression profiles, RED2 considers the Pearson
correlation coefficient between their respective density profiles. A
positive coefficient implies that the motifs are distributed similarly
(i.e. present in genes with similar expression profiles), while a
negative coefficient implies that the motifs are distributed in genes
with opposite profiles.
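The two quantities above can be illustrated as follows. The sketch assumes that "standard deviations according to the hypergeometric law" means a z-score built from the usual hypergeometric mean and variance; whether RED2 uses exactly this formula is not stated here, and the function names are hypothetical.

```python
import math


def density_z(k, K, M, N):
    """Motif density of one neighborhood, in hypergeometric standard deviations.

    k: motif-positive genes among the K genes of the neighborhood
    M: motif-positive genes among all N genes
    """
    p = M / N
    mean = K * p
    var = K * p * (1 - p) * (N - K) / (N - 1)
    return (k - mean) / math.sqrt(var)


def pearson(x, y):
    """Pearson correlation coefficient between two density profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A density profile is then the list of `density_z` values, one per gene, and two motifs are compared with `pearson(profile_a, profile_b)`.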
IUPAC code table
code | nucleotides
W | A, T
R | A, G
M | A, C
S | C, G
Y | C, T
K | G, T
H | A, C, T (not G)
V | A, C, G (not T)
D | A, G, T (not C)
B | C, G, T (not A)
N | A, C, G, T
Contact information
Contact us if you have any questions, suggestions (such as species to add)
or bug reports :
red2@lirmm.fr