MS_Align v 2.0 : Comparison of minisatellites
Bérard S., Rivals E. Journal of Computational Biology. 2003 10(3-4):357-72.
Please cite this paper if you use MS_Align.
Downloads
Click here to download the program binaries for Linux and MacOS.
Parameters and Options
- Sequences
The input file contains sequences in FASTA format (example of FASTA formatted file). The file extension does not matter. A file in this format is a collection of sequences, each described by one "identification" line followed by as many "sequence" lines as needed. The "identification" line starts with the symbol '>' immediately followed by the identifier of the sequence and then a blank; the rest of this line may contain a comment on the sequence. The "sequence" lines contain only the symbols that form the sequence. Lines are separated by carriage returns. The sequences are usually Minisatellite Variant Repeat codes, which means their alphabet may be arbitrary. In the example file, the alphabet is { A, B, C, F, o }. Case matters: for instance, the symbol 'a' is different from 'A'.
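The format described above can be read with a few lines of code. The following is a minimal sketch, not the reader used by MS_Align: the function name `parse_fasta` and the sample records are hypothetical, and blank lines are assumed to be ignorable.

```python
def parse_fasta(text):
    """Return a list of (identifier, comment, sequence) tuples.

    Hypothetical helper: it follows the description above, not MS_Align's code.
    """
    records = []
    ident, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new identification line closes the previous record, if any.
            if ident is not None:
                records.append((ident, comment, "".join(chunks)))
            header = line[1:].split(None, 1)
            ident = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif ident is not None:
            # Sequence lines are concatenated; case is preserved on purpose.
            chunks.append(line)
    if ident is not None:
        records.append((ident, comment, "".join(chunks)))
    return records


# Made-up sample data, for illustration only.
example = ">seq1 first allele\nABBC\nFo\n>seq2\nabc\n"
records = parse_fasta(example)
```

Because symbols are case-sensitive, the sketch never changes the case of the sequence lines.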
- Alignment costs
There are five alignment costs, one for each of the five mutational events taken into account by the alignment procedure: Amplification, Contraction, Insertion, Deletion, and Mutation. These reduce to three because the alignment cost must be symmetric for it to be a metric distance, so dual mutational events should have identical costs: the Amplification cost should be identical to the Contraction cost, and the Insertion cost should be identical to the Deletion cost. There are two options for the Mutation cost: either it is fixed, or it depends on the variants that are substituted (see below).
- Amplification cost / Contraction cost
Cost of the tandem duplication of a character, for instance ABC -> ABBC. Cost of the tandem contraction of a character (dual modification of the amplification), for instance ABBC -> ABC. For the alignment cost to be a metric distance, the amplification and contraction costs should be identical.
- Insertion cost / Deletion cost
Cost of the insertion of a character, for instance ABD -> ABCD. Unlike an amplification, an insertion does not require any neighboring character to be identical to the inserted one. Cost of the deletion of a character (dual modification of the insertion), for instance ABCD -> ABC. For the alignment cost to be a metric distance, the insertion and deletion costs should be identical.
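To make the difference between the two pairs of events concrete, here is a small sketch of the four length-changing operations applied to plain strings. The function names are hypothetical and this is only an illustration of the events, not the alignment procedure of MS_Align.

```python
def amplify(seq, i):
    """Tandem-duplicate the character at index i (ABC -> ABBC for i = 1)."""
    return seq[:i + 1] + seq[i] + seq[i + 1:]


def contract(seq, i):
    """Remove the character at index i, which must repeat its left neighbor."""
    if i == 0 or seq[i] != seq[i - 1]:
        raise ValueError("contraction needs an identical adjacent character")
    return seq[:i] + seq[i + 1:]


def insert(seq, i, ch):
    """Insert ch before index i; no neighbor has to be identical to ch."""
    return seq[:i] + ch + seq[i:]


def delete(seq, i):
    """Delete the character at index i, whatever its neighbors are."""
    return seq[:i] + seq[i + 1:]


# The examples given in the text above.
amplified = amplify("ABC", 1)      # ABC  -> ABBC
contracted = contract("ABBC", 2)   # ABBC -> ABC
inserted = insert("ABD", 2, "C")   # ABD  -> ABCD
deleted = delete("ABCD", 3)        # ABCD -> ABC
```

Each operation is undone by its dual (`contract` undoes `amplify`, `delete` undoes `insert`), which is why giving both members of a pair the same cost keeps the alignment cost symmetric.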
- Mutation cost
There are two options for the Mutation cost: either it is fixed, or it depends on the variants that are substituted. You choose one or the other by toggling a button. If you choose a fixed cost, just input an integer value for the mutation. If you input the value 10, say, substituting an 'A' with a 'B' will cost the same, i.e. 10, as substituting a 'C' with a 'D'. If you choose a variable cost, you have to input a matrix that gives the cost of each possible mutation. Here again the costs are symmetric, so only a triangular matrix is needed. With this option you must upload a text file that contains the following information, in order and one item per line:
- a line with the number of symbols in the alphabet;
- for each symbol:
  - the symbol;
  - the mutation costs for substituting this symbol with each of the symbols placed before it in this file, including itself (because the matrix is given as a lower triangular matrix).
Remarks: The order in which the mutation costs are given on each line should be the same as the order in which the symbols are given. The alphabet given in the file is a set: it should not contain the same symbol more than once. This example corresponds to the alphabet given in our sequence example file.
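The file layout described above can be turned into a full symmetric cost table as follows. This is a hedged sketch rather than MS_Align's own parser: the function name `parse_mutation_costs` and the sample costs are hypothetical, costs are assumed to be integers, and the file is split on whitespace so that it does not matter whether a symbol and its costs share a line.

```python
def parse_mutation_costs(text):
    """Parse a lower triangular mutation cost file.

    Returns (symbols, cost), where cost[(x, y)] is the cost of substituting
    x with y. Hypothetical helper; the whitespace-based tokenization and the
    integer costs are assumptions, not documented behavior of MS_Align.
    """
    tokens = text.split()
    n = int(tokens[0])          # first item: number of symbols
    pos = 1
    symbols = []
    cost = {}
    for row in range(n):
        sym = tokens[pos]
        pos += 1
        if sym in symbols:
            raise ValueError("symbol given twice: " + sym)
        symbols.append(sym)
        # Row number `row` holds row + 1 costs: one per symbol given so far,
        # in the same order as the symbols, the last one being the diagonal.
        for col in range(row + 1):
            value = int(tokens[pos])
            pos += 1
            other = symbols[col]
            cost[(sym, other)] = value
            cost[(other, sym)] = value   # fill the upper triangle by symmetry
    return symbols, cost


# Made-up three-symbol example, for illustration only.
sample = "3\nA\n0\nB\n10 0\nC\n5 10 0\n"
symbols, cost = parse_mutation_costs(sample)
```

Storing both `(x, y)` and `(y, x)` makes lookups independent of the order of the two symbols, which mirrors the symmetry requirement on the mutation costs.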