ATGC: RED2

Regulatory Element Discovery from Raw Expression Data

RED² provides a simple and efficient way of discovering regulatory elements from whole-genome expression data (e.g. microarray, RNA-seq or mRNA decay). RED² does not require lists of up- or down-regulated genes, nor any pre-computed gene clustering. Instead, RED² estimates motif densities around each point (gene) in the expression space, and searches for motifs whose presence in a sequence is informative about the expression of the corresponding gene.

Please send any questions, suggestions or bug reports to red2@lirmm.fr.

RED2 online execution

Input Files
Sequence file (FASTA format)	Use available	Upload
Expression file (table format)

Sequence Parameters
Analysis type	Double strand (forward and reverse)	Single strand (forward only)
Location	Upstream	Downstream
Length of considered region [25,2000 bp]

Expression Parameters
Scoring function	Mutual information	Hypergeometric
Distance measure	Euclidean	Pearson correlation

Expression Data Normalization	None	Rows	Columns
Neighbourhood size [50,1000]
Seed length [6,8]
Max. motif length [6,15]
FDR threshold [0.001,0.1]
Min. Pearson's correlation between seeds and motifs [-1,1]
Max. correlation between final motifs [-1,1]
Max. overlap (bp) between final motifs [0,15]



User's Guide

Input

RED2's input consists of nucleic acid sequences and expression data. Do not provide only a selection of your genes: RED2 is designed to run on whole-genome data (a minimum of 500 genes is required).

Sequences: You can select one of the available species from the drop-down list or upload your own sequence file in FASTA-like format. It should contain mainly uppercase nucleotides (A, C, G, T only), as any other character is treated as a masked position. Comment lines are allowed if they begin with '#'. We recommend that you compress your file (.gz or .zip) prior to upload (SIZE LIMIT IS 16 MB).
```
   # Example sequence file

   >seq1_name
   ACACGCGCAGGACCNNNNNNNNNNNCACAAC
   CGCGATATGGCTGACCACTAAACGAGA

   >seq2_name
   ACACGCGCAGGACCacgtgtagtagCACAAC
   CGCGATATGGCTGACCACTAAACGAGA
```
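To check a sequence file before upload, the format rules above can be sketched in Python. This is only an illustration, not RED2's own parser: the function name `read_sequences`, the choice of `N` to represent masked positions, and the rule that the sequence name is the first word after `>` are assumptions made for this example.

```python
def read_sequences(text):
    """Parse FASTA-like text into {name: sequence}.

    Assumptions (not confirmed by the RED2 documentation):
    - the sequence name is the first whitespace-delimited word after '>'
    - masked positions are represented here by the letter 'N'
    """
    chunks = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("empty sequence name")
            name = words[0]
            chunks[name] = []
            continue
        if name is None:
            raise ValueError("sequence data found before any '>' header")
        # Anything other than uppercase A, C, G, T counts as masked.
        chunks[name].append("".join(c if c in "ACGT" else "N" for c in line))
    return {key: "".join(parts) for key, parts in chunks.items()}


def masked_fraction(sequence):
    """Fraction of positions that are masked (0.0 for an empty sequence)."""
    return sequence.count("N") / len(sequence) if sequence else 0.0
```

A file in which `masked_fraction` is high for most sequences (for example because the nucleotides are in lowercase) would leave few positions available for motif discovery.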
Expression file: Each line corresponds to an expression profile, i.e. a sequence name followed by one or more expression measurements, separated by space or tab characters. The first non-comment line must contain the column labels (do not use spaces in your column labels). Missing values are not allowed, and all lines must have the same number of columns (conditions).
```
   # Example expression file
   
   sequences  exp1 exp2 exp3
   seq1_name  0.3  4.2 -6.1
   seq2_name -1.2  1.3  4.5 
```
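The expression file rules above (header on the first non-comment line, no missing values, same number of columns on every line) can be checked with a short Python sketch. The function name `read_expression` and the error messages are hypothetical; this is not the validation code used by the server.

```python
from math import isnan


def read_expression(text):
    """Parse an expression table into (column_labels, {name: [values]})."""
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if not rows:
        raise ValueError("no header line found")
    header, records = rows[0], rows[1:]
    labels = header[1:]  # the first header field labels the name column
    profiles = {}
    for fields in records:
        if len(fields) != len(header):
            raise ValueError(f"wrong number of columns for {fields[0]!r}")
        values = [float(x) for x in fields[1:]]  # ValueError if not numeric
        if any(isnan(v) for v in values):
            raise ValueError(f"missing value for {fields[0]!r}")
        profiles[fields[0]] = values
    return labels, profiles
```

Since RED2 expects whole-genome data, a check such as `len(profiles) >= 500` could be added by the caller.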

Warning: You should remove recent duplicates and members of multigene families from your datasets, as they are prone to hybridize to the same probes in microarray experiments, potentially leading to spurious motifs.

Parameters

Analysis Type
- Double strand : forward and reverse strands are considered. Recommended for DNA sequences.
- Single strand : only the forward strand is considered. Recommended for RNA, 5' or 3' UTR sequences.
Sequence Location
- Upstream : nucleotide positions are defined with respect to the right extremity of the provided sequences (...,-3,-2,-1).
- Downstream : nucleotide positions are defined with respect to the left extremity of the provided sequences (1,2,3,...).
Length of considered region. The maximum number of bp considered in the input sequences. If your sequences are longer, the leftmost (in an upstream analysis) or rightmost (in a downstream analysis) nucleotides will be discarded.
Scoring function used to evaluate the motifs.
- Mutual information measures the mutual dependence between the presence/absence of a motif in the regulatory sequences and the location of the corresponding genes in the expression space.
- The hypergeometric function measures motif over-representation in the regulatory sequences of neighboring genes in the expression space. The score of a motif corresponds to the negative logarithm of the p-value associated with the neighborhood showing the highest over-representation.
Distance function used to compare the expression profiles and define the neighborhood of each gene in the expression space.
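As an illustration of the hypergeometric scoring function described above, the following Python sketch computes, for one motif, the over-representation P-value in each neighborhood and returns the negative logarithm of the smallest one. The function names, the representation of neighborhoods as lists of gene indices, and the base-10 logarithm are assumptions made for this example; they are not details of the RED² implementation.

```python
from math import comb, log10


def hypergeom_tail(k, total, positives, draws):
    """P(X >= k) when drawing `draws` genes without replacement from
    `total` genes, of which `positives` contain the motif."""
    upper = min(positives, draws)
    favorable = sum(
        comb(positives, i) * comb(total - positives, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(total, draws)


def hypergeometric_score(has_motif, neighborhoods):
    """Negative log10 P-value of the most over-represented neighborhood.

    has_motif: list of 0/1 flags, one per gene.
    neighborhoods: list of lists of gene indices (hypothetical input).
    """
    total = len(has_motif)
    positives = sum(has_motif)
    best_p = min(
        hypergeom_tail(sum(has_motif[j] for j in nb), total, positives, len(nb))
        for nb in neighborhoods
    )
    return -log10(best_p)
```

With four genes, two of which carry the motif, a neighborhood of size 2 containing both has a P-value of 1/6, so the score is log10(6), about 0.78.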

Output

For each discovered motif, RED² outputs :

The IUPAC representation of the motif, its seed (in parentheses) and links to additional analysis files (detailed below).
The motif score, expressed in bits or negative log p-val (logP), depending on the selected scoring function.
The number of positive genes (i.e. the number of sequences that contain at least one occurrence of the motif).
The total number of motif occurrences.
The strand bias. The P-value is computed using the binomial distribution (for double strand analysis only).
The best GO term enrichment. The P-value is computed using the hypergeometric distribution with the following parameters. b : number of positive genes annotated with that term, n : number of positive genes with at least one annotation, B : number of considered genes annotated with that term, N : number of considered genes. The P-value is Bonferroni corrected for the number of tested terms. Note that this feature is only available for the species selected from the drop-down list.
The logo of the motif (left figure).
The expression heatmap (middle figure). The vertical axis corresponds to the expression level. Each column corresponds to a condition (column) of the expression file, and is divided into 40 equally-spaced bins. The color indicates the motif over/under-representation in each of these bins. The units of the color scale (on the right) are expressed in standard deviations according to the hypergeometric law, under the null hypothesis that motifs are uniformly distributed in the expression space (and consequently among the bins).
The distribution of the motif occurrence positions (right figure). Occurrence positions are relative to the left or right extremity of the sequences, depending on the Sequence location parameter.
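The positions shown in the occurrence distribution follow the Sequence Location and Length of considered region parameters. A minimal Python sketch of that convention, with hypothetical function names, could look like this:

```python
def considered_region(seq, length, location):
    """Keep at most `length` nucleotides of `seq`.

    Upstream analysis keeps the rightmost nucleotides (the leftmost ones
    are discarded); downstream analysis keeps the leftmost nucleotides.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if location == "upstream":
        return seq[-length:]
    if location == "downstream":
        return seq[:length]
    raise ValueError("location must be 'upstream' or 'downstream'")


def position_labels(region, location):
    """Position of each nucleotide of the retained region.

    Upstream: ..., -3, -2, -1 (relative to the right extremity).
    Downstream: 1, 2, 3, ... (relative to the left extremity).
    """
    n = len(region)
    if location == "upstream":
        return list(range(-n, 0))
    if location == "downstream":
        return list(range(1, n + 1))
    raise ValueError("location must be 'upstream' or 'downstream'")
```

For example, with a 9 bp sequence and a considered length of 3, an upstream analysis keeps the last three nucleotides and labels them -3, -2, -1.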

Additional analysis files

For each motif :

The Positive genes file contains the list of genes containing the motif, sorted by motif density, along with the number of motif occurrence(s) in each sequence.
The Occurrences file contains every occurrence of the motif. The column labeled "left" (respectively "right") contains the number of nucleotides between the motif occurrence and the left (respectively right) extremity of the sequence. The column labeled "context" contains the actual occurrence sequence, plus 10 positions on each side (delimited by a space character). Where a position falls outside the available sequence, it is filled with a dot.
The Expression file contains information about the motif distribution for each column/condition of the expression file. The column labeled "Col. ID" contains the number of the column (starting at 1). The column labeled "Mean value motif" contains the mean expression value of the genes containing the motif ("Mean value all" is for all genes). The column labeled "P-val (KS)" contains the P-value of observing the motif distribution for that particular column, according to a two-sample Kolmogorov-Smirnov test (positive vs negative genes). Columns (lines) are ordered with respect to this P-value.
The Comparison to other motifs file. The column labeled "overlap" contains the number of positions that overlap with the compared motif. The column labeled "correlation" contains the Pearson coefficient between their density profiles (see Motif density below).
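To make the "left", "right" and "context" columns of the Occurrences file concrete, here is a small Python sketch that rebuilds them for one occurrence. The function name and the 0-based `start` index are assumptions of this example, not part of the RED² output format.

```python
def occurrence_columns(seq, start, motif_len, flank=10):
    """Return (left, right, context) for an occurrence at 0-based `start`.

    left / right: number of nucleotides between the occurrence and the
    left / right extremity of the sequence.
    context: flank + occurrence + flank, separated by spaces, with a dot
    for every position that falls outside the sequence.
    """
    def window(lo, hi):
        return "".join(seq[i] if 0 <= i < len(seq) else "." for i in range(lo, hi))

    end = start + motif_len
    left = start
    right = len(seq) - end
    context = f"{window(start - flank, start)} {seq[start:end]} {window(end, end + flank)}"
    return left, right, context
```

With the sequence `ACGTACGTAC`, an occurrence of length 3 at index 2 and a flank of 4, this returns `(2, 5, "..AC GTA CGTA")`.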

Top :

The List of considered genes file contains the gene identifiers from your expression file that were matched to a sequence and included in the analysis.
The List of excluded genes file contains the gene identifiers for which no matching sequence was found.
The All qmers file contains the score and FDR estimated for every possible q-mer (which comprise the seeds of the motifs).
The Selected motifs file contains the same motifs as in the HTML output, but in plain text format.

Advanced Parameters

Expression Data Normalization. Each row/column in the expression file will be normalized to have mean value 0 and standard deviation 1.
Neighborhood size [50,1000]; default=200. The neighborhood of a sequence is defined as the sequence itself plus the K-1 nearest sequences in the expression space, according to the selected distance function. To achieve a good sensitivity, the neighborhood must be significantly smaller than the total number of sequences, but large enough to allow a good estimation of the motif densities.
Seed length [6,8]; default=7.
Max. motif length [6,15]; default=9.
FDR threshold [0.001,0.1]; default=0.001. RED² starts by computing the score of every possible q-mer (7-mers with the default seed length) and estimates a False Discovery Rate (FDR) by repeating this procedure on 10 randomized datasets. Only the q-mers with an FDR lower than or equal to this threshold (the seeds) are considered for further optimization. The value of this parameter affects the number of motifs in the final output.
Min. Pearson correlation between seeds and motifs [-1,1]; default=0.75. The min. Pearson correlation required between the density profile of a motif and its seed during the optimization step. This parameter is used to ensure that optimized motifs are distributed similarly to their seed in the expression space. It can also prevent the merging of motifs with similar sequence but different effects on the expression. High parameter values lead to more motifs with fewer degenerate positions, while a value of -1 (no restriction) leads to a smaller set of motifs, with more degenerate positions, and potentially some merged motifs.
Max. correlation between final motifs [-1,1]; default=0.75. After the optimization step, a motif is considered redundant and discarded from the output if it is similar to a better scoring motif. The first criterion for similarity is the Pearson correlation between the motif density profiles. The second criterion is defined below.
Max. overlap between final motifs [0,15]; default=4. The second criterion for similarity between two motifs (see above) is the overlap, which is the maximum number of compatible positions (e.g. A and W, see the IUPAC code table below) that can be aligned without gaps or mismatches. For example AAA and AWC have an overlap of 2.
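The overlap between two motifs can be sketched in Python as follows. This example assumes that two IUPAC codes are compatible when they share at least one nucleotide, and it only compares motifs on the same strand; both points are assumptions, and the function names are hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "R": "AG", "M": "AC", "S": "CG", "Y": "CT", "K": "GT",
    "H": "ACT", "V": "ACG", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def compatible(x, y):
    """True if the two IUPAC codes share at least one nucleotide."""
    return bool(set(IUPAC[x]) & set(IUPAC[y]))


def max_overlap(m1, m2):
    """Largest number of aligned positions with no gap and no mismatch."""
    best = 0
    # shift is the index in m1 that is aligned with the first position of m2
    for shift in range(-(len(m2) - 1), len(m1)):
        start = max(shift, 0)
        end = min(len(m1), shift + len(m2))
        pairs = [(m1[i], m2[i - shift]) for i in range(start, end)]
        if pairs and all(compatible(a, b) for a, b in pairs):
            best = max(best, len(pairs))
    return best
```

`max_overlap("AAA", "AWC")` returns 2, matching the example above: the last two positions of AAA align with the first two positions of AWC.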

Tuning

To increase the specificity of the returned motifs, increase the Min. Pearson correlation between seeds and motifs.
To remove redundancy from the output, decrease the Max. correlation and Max. overlap between final motifs.
If the heatmaps seem uninformative or contain only horizontal lines, try normalizing each profile with the Rows option of Expression Data Normalization.
If RED² returns no or very few motifs, increase the FDR threshold or try a different seed length.
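The Rows and Columns normalization options mentioned above rescale each row or column to mean 0 and standard deviation 1. A minimal Python sketch, assuming the population standard deviation is used and that constant rows or columns are mapped to zeros (neither detail is stated in this guide):

```python
from statistics import mean, pstdev


def normalize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0] * len(values)  # assumption: constant input maps to zeros
    return [(v - m) / s for v in values]


def normalize_rows(matrix):
    """Normalize each expression profile (one row per gene)."""
    return [normalize(row) for row in matrix]


def normalize_columns(matrix):
    """Normalize each condition (one column per condition)."""
    columns = [normalize(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]
```

Row normalization removes differences in overall expression level between genes, so that neighborhoods reflect the shape of the profiles across conditions instead of their magnitude.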

Motif density

The density of a motif around a particular gene is the proportion of genes that possess the motif in the neighborhood of that gene (in the expression space). In the RED² output, density is expressed in standard deviations according to the hypergeometric law, under the null hypothesis that motifs are uniformly distributed in the expression space. Motif densities are useful for determining the genes that are most likely regulated by a particular motif, since occurrences in high density regions are more likely to be functional than occurrences in low density regions.
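The definition above can be illustrated with a short Python sketch. It builds each neighborhood as the gene itself plus its K-1 nearest genes, then expresses the number of motif-containing genes as a z-score using the mean and standard deviation of the hypergeometric distribution. The Euclidean distance, the function names and the handling of a zero standard deviation are assumptions of this example, not a description of RED²'s code.

```python
from math import dist, sqrt


def neighborhood(profiles, i, k):
    """Indices of gene i and its k-1 nearest genes (Euclidean distance)."""
    order = sorted(
        range(len(profiles)),
        key=lambda j: (dist(profiles[i], profiles[j]), j != i),  # gene i first on ties
    )
    return order[:k]


def density_profile(profiles, has_motif, k):
    """Motif density around each gene, in hypergeometric standard deviations."""
    n_genes = len(profiles)
    p = sum(has_motif) / n_genes
    expected = k * p
    sd = sqrt(k * p * (1 - p) * (n_genes - k) / (n_genes - 1))
    densities = []
    for i in range(n_genes):
        hits = sum(has_motif[j] for j in neighborhood(profiles, i, k))
        densities.append((hits - expected) / sd if sd > 0 else 0.0)
    return densities
```

Genes with a high positive value sit in a region of the expression space where the motif is over-represented, whether or not they contain the motif themselves.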
The density profile of a motif is a vector that contains the motif density around each gene in the data set (even those that do not contain the motif). To determine if two motifs are associated with similar expression profiles, RED² considers the Pearson correlation coefficient between their respective density profiles. A positive coefficient implies that the motifs are distributed similarly (i.e. present in genes with similar expression profiles), while a negative coefficient implies that the motifs are distributed in genes with opposite profiles.
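As a sketch of how two density profiles can be compared, the Python code below computes the Pearson correlation coefficient and applies it to the seed check described under Advanced Parameters (default threshold 0.75). The function names are hypothetical, and the inclusive comparison is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(vx * vy)


def similar_to_seed(motif_profile, seed_profile, min_correlation=0.75):
    """True if the optimized motif is distributed like its seed."""
    return pearson(motif_profile, seed_profile) >= min_correlation
```

A coefficient close to 1 means the two motifs are dense around the same genes; a coefficient close to -1 means one motif is dense where the other is depleted.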

IUPAC code table

code	nucleotides
W	A, T
R	A, G
M	A, C
S	C, G
Y	C, T
K	G, T
H	A, C, T (not G)
V	A, C, G (not T)
D	A, G, T (not C)
B	C, G, T (not A)
N	A, C, G, T
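The table above can be used to search for a motif in your own sequences, for instance to reproduce the counts reported in the output. The Python sketch below converts an IUPAC motif into a regular expression and lists the start positions of its occurrences on the forward strand. Counting overlapping occurrences is an assumption of this example, and the function names are hypothetical.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "R": "AG", "M": "AC", "S": "CG", "Y": "CT", "K": "GT",
    "H": "ACT", "V": "ACG", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif into a regex; the lookahead allows overlaps."""
    body = "".join(f"[{IUPAC[code]}]" for code in motif)
    return re.compile(f"(?=({body}))")


def find_occurrences(motif, sequence):
    """0-based start positions of the motif on the forward strand.

    Masked positions (any character other than A, C, G, T in the
    sequence) never match.
    """
    return [m.start() for m in motif_to_regex(motif).finditer(sequence)]
```

For example, `find_occurrences("AWC", "AACATCAGC")` returns `[0, 3]`: AAC and ATC match, while AGC does not because W stands for A or T only.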