Motif search on whole genome

Motif is software that exhaustively searches whole genome sequences for DNA-binding patterns, also called motifs. These motifs are given as Position Weight Matrices (PWMs), and this server includes all motifs extracted from the latest version of the widely used JASPAR database. Using the server is easy: select the patterns you want to search for, select the genome, select a score threshold as a percentage, and run the search. Motif will then send you all the DNA sequences matching each pattern, their scores, and all their genomic locations.

Instead of selecting known patterns from JASPAR, you may also want to search for your own pattern confidentially on our server. You may then enter your own matrix in the interactive box below, column by column. When launched, Motif searches for your pattern only and returns the corresponding result. Your pattern is neither stored in our pattern database, nor accessible in any way to other users.

Why a new tool and service?

Our goal is to provide a user-friendly service for efficiently searching for such patterns on complete genomes. Motif has several advantages over competing methods:

  • it can seek several patterns in a single search
  • it outputs the precise words that match the pattern
  • it is extremely fast
  • it is exhaustive (all matching words and locations are output)

More information

Scoring and threshold

Representing a pattern as a PWM is useful because it lets you score the similarity between the pattern and any DNA sequence that has the same length as the matrix. Given a matrix of, say, 10 columns, one can score the similarity of any DNA sequence of length 10, also termed a 10-mer, to the pattern. The higher the score, the greater the similarity. These scores are meant for comparing words with one another; a score value is difficult to interpret on its own.
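As an illustration of the scoring idea described above, here is a minimal sketch in Python. It assumes the common scheme in which a word's score is the sum, over columns, of the weight of the base found at that position; the exact formula used by Motif is not stated here, and the matrix values and the names `PWM`, `score_word` and `best_word` are invented for the example.

```python
# Hypothetical 4-column weight matrix: one dict (base -> weight) per column.
# The values are made up for illustration only.
PWM = [
    {"A": 1.2, "C": -0.5, "G": -0.8, "T": -1.0},
    {"A": -0.7, "C": 1.0, "G": -0.3, "T": -0.9},
    {"A": -1.1, "C": -0.6, "G": 1.4, "T": -0.2},
    {"A": -0.4, "C": -0.9, "G": -1.0, "T": 1.1},
]


def score_word(pwm, word):
    """Sum, column by column, the weight of the base observed in the word."""
    if len(word) != len(pwm):
        raise ValueError("word and matrix must have the same length")
    return sum(column[base] for column, base in zip(pwm, word))


def best_word(pwm):
    """Return the word made of the highest-weight base in every column."""
    return "".join(max(column, key=column.get) for column in pwm)


# Compare two 4-mers: only the relative order of the scores is meaningful.
for word in ("ACGT", "TCGT"):
    print(word, round(score_word(PWM, word), 3))
```

The sketch only ranks words against each other, which matches the remark that an isolated score value carries little meaning.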

When using Motif, you choose how similar a word must be to the pattern by giving a percentage rather than a raw score. For instance, the default threshold of 90% means: search for the words whose score is greater than or equal to 90% of the best possible score for the pattern. Choosing a threshold this way is more intuitive than selecting a minimum score value.
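The following Python sketch shows one literal reading of this threshold: the cutoff is the given percentage of the best possible score, and every word of the right length is kept if its score reaches the cutoff. This is an assumption for illustration; the server may normalise scores differently (for example relative to the lowest possible score). The matrix values and function names are hypothetical.

```python
from itertools import product

# Hypothetical 3-column weight matrix with made-up values.
PWM = [
    {"A": 2.0, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 2.0, "G": 1.5, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
]


def best_score(pwm):
    """Best possible score: the highest weight of each column, summed."""
    return sum(max(column.values()) for column in pwm)


def words_above_threshold(pwm, percent):
    """List (word, score) pairs scoring at least percent% of the best score."""
    cutoff = best_score(pwm) * percent / 100.0
    hits = []
    # Enumerate every possible word of the matrix length (4**k words).
    for letters in product("ACGT", repeat=len(pwm)):
        score = sum(column[base] for column, base in zip(pwm, letters))
        if score >= cutoff:
            hits.append(("".join(letters), score))
    # Best-scoring words first, ties broken alphabetically.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(words_above_threshold(PWM, 90))  # → [('ACT', 5.0), ('AGT', 4.5)]
```

Lowering the percentage in this toy example to 80 admits three more words, which illustrates how a lower threshold lets weaker matches through.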

When a matrix is searched on a genome, a similarity percentage is given: the higher this percentage, the more the matching sequences contain the bases drawn tallest in the sequence logo of the pattern.

Remark: in general, JASPAR recommends a threshold of 85%. The default here is 90%. Values below 70% are generally not meaningful and are disabled on this web server. If you need a special search with a lower threshold, please contact us.

How does a matrix represent a variable pattern?

A matrix represents a set of sequences sharing similarity. Starting from a gap-free multiple alignment, a matrix records, for each column of the alignment, the number of occurrences of each base in that column. In each column, the proportionally highest counts indicate the preferred, or most conserved, nucleotides at that position.
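The construction just described can be sketched in a few lines of Python. The aligned sequences below are invented for the example, and `count_matrix` is a hypothetical helper name, not part of Motif.

```python
def count_matrix(aligned_sequences):
    """Count, for each alignment column, how often each base occurs."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("a gap-free alignment needs sequences of equal length")
    # One dict (base -> count) per column of the alignment.
    counts = [{base: 0 for base in "ACGT"} for _ in range(length)]
    for seq in aligned_sequences:
        for position, base in enumerate(seq.upper()):
            counts[position][base] += 1
    return counts


# Made-up binding sites, already aligned without gaps.
SITES = ["ACGT", "ACGA", "TCGT", "ACCT"]

for position, column in enumerate(count_matrix(SITES), start=1):
    print(position, column)
```

In this toy alignment the second column contains only C, so it is the most conserved position, while the other columns each show one preferred base and one rarer alternative.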

Representing a count matrix as a sequence logo

A sequence logo is a graphical representation of the conservation of nucleotides in a pattern/matrix. In each column, each nucleotide is drawn with a size proportional to its relative frequency, so the largest nucleotides are the most frequent. For example, if one nucleotide fills an entire column, conservation is maximal and only that nucleotide is "allowed" at that position.
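To make the link between counts and letter sizes concrete, the Python sketch below converts a count matrix into relative frequencies, which is the quantity the description above uses for letter size. It is a simplified view: many logo tools additionally scale each column by its information content, which is not modelled here. The counts and the function names are hypothetical.

```python
def relative_frequencies(counts):
    """Turn per-column base counts into per-column relative frequencies."""
    frequencies = []
    for column in counts:
        total = sum(column.values())
        if total == 0:
            raise ValueError("a column must contain at least one observation")
        frequencies.append({base: n / total for base, n in column.items()})
    return frequencies


def stacked_letters(column_frequencies):
    """Bases present in a column, largest (most frequent) first."""
    present = [(base, f) for base, f in column_frequencies.items() if f > 0]
    return sorted(present, key=lambda item: (-item[1], item[0]))


# Made-up count matrix with two columns.
COUNTS = [
    {"A": 3, "C": 0, "G": 0, "T": 1},
    {"A": 0, "C": 4, "G": 0, "T": 0},
]

for position, column in enumerate(relative_frequencies(COUNTS), start=1):
    print(position, stacked_letters(column))
```

In the second column a single base has frequency 1.0, which corresponds to the case where one letter fills the whole column of the logo.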

Set of sequences

A video tutorial

Want to learn how to use the Motif web server? Take a look at this tutorial (90 seconds).

Output format of the search results

Motif can search for multiple patterns on a given genome. Thus the output is organised first by pattern, then by strand, and for each pattern and strand by matching words.

The output format contains:

> followed by the matrix name
The following lines give, for every sequence that matches the pattern and was found on the genome, its score and genomic locations. Lines are grouped by matching word:
  1. the word and its similarity score, followed by "(reverse)" if it was found on the reverse strand rather than the normal strand
  2. then one line per genomic location, written as chromosome_name:position
Note that chromosome positions are numbered starting from 1. For example:
>Tcf12
ACAGCTGCTG	4.89141
chromosom-3:303
chromosom-3:1269
chromosom-4:123
CAGCAGCTGT	3.448715	(reverse)
chromosom-2:7584
ACAGCTGTTG	5.66843
chromosom-1:528
chromosom-1:39871
chromosom-2:5814
chromosom-4:1552
>ZNF711
AGGCCTAG	4.82415
chromosom-2:1788
chromosom-2:25451
chromosom-3:44584
CTAGGCCT	3.448715	(reverse)
chromosom-2:7584
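If you want to process such results programmatically, a small parser along these lines may help. It is a sketch written against the example above only: it assumes that word lines hold whitespace-separated fields (word, score, optional "(reverse)") and that location lines are a single field of the form chromosome_name:position. The function name `parse_motif_output` is hypothetical and not part of the server.

```python
def parse_motif_output(text):
    """Parse result text into {matrix_name: [hit, ...]}.

    Each hit is a dict with the keys word, score, strand and locations.
    """
    results = {}
    hits = None      # list of hits for the current matrix
    current = None   # hit currently receiving locations
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New matrix section: ">" followed by the matrix name.
            hits = results.setdefault(line[1:].strip(), [])
            current = None
        elif ":" in line and len(line.split()) == 1:
            # Location line: chromosome_name:position (positions start at 1).
            if current is None:
                raise ValueError("location found before any word: " + line)
            chromosome, _, position = line.rpartition(":")
            current["locations"].append((chromosome, int(position)))
        else:
            # Word line: word, score and an optional "(reverse)" marker.
            if hits is None:
                raise ValueError("word found before any matrix name: " + line)
            fields = line.split()
            current = {
                "word": fields[0],
                "score": float(fields[1]),
                "strand": "reverse" if "(reverse)" in fields[2:] else "normal",
                "locations": [],
            }
            hits.append(current)
    return results


SAMPLE = "\n".join([
    ">Tcf12",
    "ACAGCTGCTG\t4.89141",
    "chromosom-3:303",
    "chromosom-3:1269",
    "CAGCAGCTGT\t3.448715\t(reverse)",
    "chromosom-2:7584",
])

for name, matrix_hits in parse_motif_output(SAMPLE).items():
    for hit in matrix_hits:
        print(name, hit["word"], hit["strand"], len(hit["locations"]))
```

Keeping locations as (chromosome, position) pairs preserves the 1-based numbering of the output, so convert explicitly if a downstream tool expects 0-based coordinates.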

FAQs

  1. I need to search on a genome that is not on your list.
    Depending on the genome status and availability, we can incorporate new genomes in the server. Please contact us.
  2. I need to search on a private genome.
    We can set up a specific search service. Please contact us about this.
  3. Why does my search not return any results?
    Several explanations are possible:
    - the genome does not contain any matching location
    - the threshold is too high, and thus filters out weaker matches.