Shortest common supersequence

Data: scsp data

Problem description

Given an alphabet and a set of sequences composed of characters of that alphabet, the goal is to find a supersequence S, i.e. a sequence containing all of the original sequences, of a minimum length. Sequence r is contained in S if and only if all characters of r are present in S in the same order as they appear in r.

Input:

Alphabet A, of which the sequences are composed.
N sequences (not necessarily of the same length).

Output: Supersequence S of the minimum length.

Possible representations

Linear string of characters of the given Alphabet.

                   s1:  ca ag cca  cc ta    cat  c a
                   s2:  c gag ccat ccgtaaa g  tt  g
                   s3:   aga acc tgc  taaatgc t a ga
                        ----------------------------
      Supersequence S:  cagagaccatgccgtaaatgcattacga

…

Uniform evaluation function

Supersequence quality S is calculated as

where <latex>$C(S)$</latex> is the total number of characters the supersequence S covers, and <latex>$L(S)$</latex> is a contribution for the length of S calculated as

<latex> $L(S) = \frac{SumL - Length(S)}{SumL}$, </latex>

where <latex>$SumL$</latex> is the total number of all characters in the original sequences.

The function is to be maximized.

If solved as a pure black-box optimization problem, one could use information about partial sequences only in the evaluation function. Here, it is allowed to use or analyze partial sequences elsewhere as well, e.g., in recombination operators.

Table of Contents

Shortest common supersequence

Problem description

Possible representations

Uniform evaluation function