[genocon2021_c]Practice 3

ID: 3437

传统题

10000ms

1024MiB

尝试: 0

已通过: 0

难度: (无)

上传者:

admin

Problem statement

Given $m$ strings consisting of A, C, G, T. Let us insert blank symbols (denoted each by -) to the strings to make another set of $m$ strings such that all the lengths become equal and generate $m$ by $n$ matrix $S$ in which each row is one of the strings and $n$ is the length of the strings. We denote an element (i.e. character) at $i$ -th row and $j$ -th column of $S$ by $S_{i, j} (1 \\leq i \\leq m, 1 \\leq j \\leq n)$ . We also denote row and column vectors by $S_{i,\*}$ and $S_{\*,j}$ respectively. Let us consider inserting - at appropriate positions such that we can generate aligned $S$ in which the same characters are contained in each column vector. For this purpose, we define the following score $C$ that measures how much the given column vector $t$ is well aligned.

$C(t) = min_{x \\in \\{A, T, G, C, -\\}} \\left| \\{ t\[i\] \\; | \\; t\[i\] \\neq x \\; (i=1, \\ldots, m) \\} \\right|$

$C(t)$ counts the number of characters in $t$ that are not equal to the one that appears in $t$ the most frequently. For example, when $t$ =CCCC-T, C appears the most frequently, so the number of characters that are not equal to C (i.e., - and T) becomes $C(t)=2$ . When $t$ =CCAA-T, one can choose either C or A as $x$ because counts of C and A are equal. If C is chosen as $x$ , $C(t)$ becomes $4$ by adding number of counts for A, - and T.

If we can generate $S$ such that sum of $C(S_{\*,j})$ for all columns becomes small, it is expected that the given $m$ strings are well aligned. We call such a problem multiple alignment and define the following score.

$C_{all}(S) = \\sum_{j=1}^n C(S_{\*,j})$

For example, if we align the three strings GGATC, GGAT, GACC by inserting - as follows,

  GGATC
  GGAT-
  G-ACC

The score $C_{all}(S)$ becomes $C(S_{\*,1}) + C(S_{\*,2})+ C(S_{\*,3})+ C(S_{\*,4})+ C(S_{\*,5}) = 0 + 1 + 0 + 1 + 1 = 3$．

Given $m$ strings $s_{1}, \\ldots, s_{m}$ , output a multiple alignment such that $C_{all}(S)$ becomes small as much as possible.

Constraints

$s_1$ ,..., $s_m$ are strings consisting of A, C, G, T.
Test case (small)
- $8 \\leq m \\leq 10$ ．
- $80 \\leq |s_1|, \\ldots, |s_m| \\leq 100$ (the length of $x$ is denoted by $|$ x $|$ .)
Test case (large)
- $35 \leq m \leq
- $700 \\leq |s_1|, \\ldots, |s_m| \\leq 750$

Input

Input is given from Standard Input in the following format:

$m$ $s_{1}$ $s_{2}$ : $s_{m}$

Output

Print a multiple alignment of input strings such that $C_{all}$ becomes small as much as possible. After inserting - in each input strings, they should be printed in the same order as the order of input strings.

The judging system can only accept user's output when: - Each line is equal to the corresponding input line if - is removed. - The lengths of all the lines are equal.

Sample Input 1

10
GGAGGTTATTGCTGTGGAGGTACTGGAGAAGGAGGATGCTAGCGTTGGGTAAACCACGAGCATTTTGACTTGTACTTCGCCTC
GGGTTATTGCTGTTGTGAGTACTGGAGACAGGAGGGAGTGTTAGAGTTGGGGTAAACCACAGTAGCTCATGTCACTTGGATAACTCGTCAGCCTC
GGTCACTCGCTGTGGAGAGTACTTTGAGACAGGAGGGAGTGCTAGAGTTTGGGGTAAAACCACAGCAGCTCATGCACTTGGATATCTGTGAGCC
GAGGTTAGTGCTGTGGAGAGTACTGGAGACAGGAGGGAGTGCTAGAGTTGGGGTAAAGCACAGCACCATTCACTGATAAATGTCAGGCCTAGGGG
GAGGTATTGCTGTGGAGGGTACTGGAGACAGGAGGAGTGCTAGAGGTTGGGGTAAAACCACAGCAGCTCATTTACTTGATACTGTCAGGCTCAGG
GAGGTTATTTGCTGTGGAGAGTTACTGAGACATGGGGTGCCAAGTTGGGTAGCTACAGCAGCTCATTTCACTTGATACTGCAGGCTCTCAG
GAGTTAATTTCGTGGAGAGTACTAGAGCACAGGAGGGAGGCCAGATTGGGGTATACCACAGCAGCTCGTTCACTTTAACTGTCAGGCCCTCA
ACAGTTTAATTGATGGGCGGAGAGTACTGGAGACAGGAGGGAGTGCTAGAGTGGGGTAAACCACAGCAGCATCTTTCATTATAACTGTCAG
CAAGGTTTTTTCGCTGTGGAGAGTACTGGAGACCGGGGAGTGTAGACTTTGGGGTATCACGTAGCAGCTTATTTCGACTTTGTACTGTAA
GAGTTATTTCTGTGGAGAGAACTGGAGACGGAGGGAGTGCTAGAGTTGGGGTAAACACCAGGCAGCCATTTCACTTGATAACTGTCAGGCCTT

Sample Output 1

GGAGGTTA-TTGCT--GTGGAG-GTAC-TGGAGA-AGGA-GGA-TGCTAGCG-TT-GGGT-AAACCAC-G-AGC-ATTTTGACTT-G-T-ACT--TC-GCCTC----
--GGGTTA-TTGCT--GTTGTGAGTAC-TGGAGACAGGAGGGAGTGTTAGAG-TTGGGGT-AAACCACAGTAGCTCATGTCACTTGGATAACTCGTCAGCCTC----
---GGTCACTCGCT--GTGGAGAGTACTTTGAGACAGGAGGGAGTGCTAGAGTTTGGGGTAAAACCACAGCAGCTCATG-CACTTGGATATCT-GTGAG-C-C----
-GAGGTTA-GTGCT--GTGGAGAGTAC-TGGAGACAGGAGGGAGTGCTAGAG-TTGGGGT-AAAGCACAGCA-CCATTCACTGATAAATGTCAGGCCTAGGGG----
--GAGGTA-TTGCT--GTGGAGGGTAC-TGGAGACAGGA-GGAGTGCTAGAGGTTGGGGTAAAACCACAGCAGCTCAT-TTACTT-GAT-ACT-GTCAGGCTC-AGG
-GAGGTTATTTGCT--GTGGAGAGTTACT-GAGACA--TGGG-GTGCCA-AG-TT-GGGT--AGCTACAGCAGCTCATTTCACTT-GAT-ACT-G-CAGGCTCTCAG
--GAGTTAATTTC---GTGGAGAGTACTAGAGCACAGGAGGGAG-GCCAGA--TTGGGGT-ATACCACAGCAGCTCGT-TCACTT---TAACT-GTCAGGC-CCTCA
ACAGTTTAATTGATGGGCGGAGAGTAC-TGGAGACAGGAGGGAGTGCTAGAG--TGGGGT-AAACCACAGCAGCATCTTTCA-TT--ATAACT-GTCAG--------
CAAGGTT-TTTTCGCTGTGGAGAGTAC-TGGAGAC-CG-GGGAGTG-TAGACTTTGGGGT---ATCAC-GTAG--CAGCTTATTTCG--ACTTTGT-A--CT-GTAA
--GAGTTA-TTTCT--GTGGAGAGAAC-TGGAGAC-GGAGGGAGTGCTAGAG-TTGGGGT-AAACACCAGGCAGCCATTTCACTT-GATAACT-GTCAGGC-C--TT

Points

The submitted program is evaluated by two test case types: Small and Large, as denoted in the Constraints section. Small contains two cases, and Large contains eight cases. The points are given to outputs that satisfy the format within the time limit. For each output of a test case in Small, a contestant earns $\\{200 - \\lfloor C_{all}(S) \* 0.2 \\rfloor \\}$ or 0, whichever is greater. For a test case in Large, the contestant earns $\\{700 - \\lfloor C_{all}(S) \* 0.1 \\rfloor \\}$ or 0, whichever is greater. For example, if one can generate two outputs $S_1$ and $S_2$ for Small-type cases with a correct format within the time limit, one can earn $max\\{200 - \\lfloor C_{all}(S_1) \* 0.2 \\rfloor, 0\\} + max\\{200 - \\lfloor C_{all}(S_2) \* 0.2 \\rfloor, 0\\}$.

We do not guarantee that the optimal solution can achieve the maximum points (6000 points). The final standings are determined by the other dataset generated by the same method used to generate the current test cases.

The method for generating test cases

We extract substring $s$ from chromosome 22 of the human reference genome sequence in the first step. $s$ is extracted such that it does not include N and only consists of A, T, G and C. In the second step, we generate insertions/deletions/substitutions for $s$ to make a new string $s'$ based on a sequencing error model that simulates a certain sequencer. In the third step, we repeat the second step to generate all the input strings. Note that $s$ is not included in a test case. We do not describe the sequencing error model we used but provide the sample test cases available from the following URL.

Sample test cases

#genocon2021c. [genocon2021_c]Practice 3