Warning

This page is located in archive.

- Data: scsp data

Given an alphabet and a set of sequences composed of characters of that alphabet,
the goal is to find a supersequence *S*, i.e. a sequence containing all of the original sequences, of a minimum length.
Sequence *r* is contained in *S* if and only if all characters of *r* are present in *S* in the same order as they appear in *r*.

**Input**:

- Alphabet A, of which the sequences are composed.
- N sequences (not necessarily of the same length).

**Output**: Supersequence S of the minimum length.

- Linear string of characters of the given Alphabet.

s1: ca ag cca cc ta cat c a s2: c gag ccat ccgtaaa g tt g s3: aga acc tgc taaatgc t a ga ---------------------------- Supersequence S: cagagaccatgccgtaaatgcattacga

- …

Supersequence quality *S* is calculated as

<latex> $f(S) = C(S) + L(S)$, </latex>

where <latex>$C(S)$</latex> is the total number of characters the supersequence *S* covers, and <latex>$L(S)$</latex> is a contribution for the length of *S* calculated as

<latex> $L(S) = \frac{SumL - Length(S)}{SumL}$, </latex>

where <latex>$SumL$</latex> is the total number of all characters in the original sequences.

The function is to be maximized.

If solved as a pure black-box optimization problem, one could use information about partial sequences only in the evaluation function. Here, it is allowed to use or analyze partial sequences elsewhere as well, e.g., in recombination operators.

courses/a0m33eoa/semestral_tasks/supersequence.txt · Last modified: 2022/11/14 09:55 by xposik