#genocon2021c. [genocon2021_c]Practice 3

[genocon2021_c]Practice 3

Problem statement

Given mm strings consisting of A, C, G, T. Let us insert blank symbols (denoted each by -) to the strings to make another set of mm strings such that all the lengths become equal and generate mm by nn matrix SS in which each row is one of the strings and nn is the length of the strings. We denote an element (i.e. character) at ii-th row and jj-th column of SS by Si,j(1leqileqm,1leqjleqn)S_{i, j} (1 \\leq i \\leq m, 1 \\leq j \\leq n). We also denote row and column vectors by Si,\*S_{i,\*} and S\*,jS_{\*,j} respectively. Let us consider inserting - at appropriate positions such that we can generate aligned SS in which the same characters are contained in each column vector. For this purpose, we define the following score CC that measures how much the given column vector tt is well aligned.

$C(t) = min_{x \\in \\{A, T, G, C, -\\}} \\left| \\{ t\[i\] \\; | \\; t\[i\] \\neq x \\; (i=1, \\ldots, m) \\} \\right|$

C(t)C(t) counts the number of characters in tt that are not equal to the one that appears in tt the most frequently. For example, when tt=CCCC-T, C appears the most frequently, so the number of characters that are not equal to C (i.e., - and T) becomes C(t)=2C(t)=2. When tt=CCAA-T, one can choose either C or A as xx because counts of C and A are equal. If C is chosen as xx, C(t)C(t) becomes 44 by adding number of counts for A, - and T.

If we can generate SS such that sum of C(S\*,j)C(S_{\*,j}) for all columns becomes small, it is expected that the given mm strings are well aligned. We call such a problem multiple alignment and define the following score.

Call(S)=sumj=1nC(S\*,j)C_{all}(S) = \\sum_{j=1}^n C(S_{\*,j})

For example, if we align the three strings GGATC, GGAT, GACC by inserting - as follows,

  GGATC
  GGAT-
  G-ACC

The score Call(S)C_{all}(S) becomes $C(S_{\*,1}) + C(S_{\*,2})+ C(S_{\*,3})+ C(S_{\*,4})+ C(S_{\*,5}) = 0 + 1 + 0 + 1 + 1 = 3$.

Given mm strings s1,ldots,sms_{1}, \\ldots, s_{m}, output a multiple alignment such that Call(S)C_{all}(S) becomes small as much as possible.

Constraints

  • s1s_1,...,sms_m are strings consisting of A, C, G, T.
  • Test case (small)
    • 8leqmleq108 \\leq m \\leq 10
    • 80leqs1,ldots,smleq10080 \\leq |s_1|, \\ldots, |s_m| \\leq 100 (the length of xx is denoted by|x|.)
  • Test case (large)
    • $35 \leq m \leq
    • 700leqs1,ldots,smleq750700 \\leq |s_1|, \\ldots, |s_m| \\leq 750

Input

Input is given from Standard Input in the following format:

mm s1s_{1} s2s_{2} : sms_{m}

Output

Print a multiple alignment of input strings such that CallC_{all} becomes small as much as possible. After inserting - in each input strings, they should be printed in the same order as the order of input strings.

The judging system can only accept user's output when: - Each line is equal to the corresponding input line if - is removed. - The lengths of all the lines are equal.


Sample Input 1

10
GGAGGTTATTGCTGTGGAGGTACTGGAGAAGGAGGATGCTAGCGTTGGGTAAACCACGAGCATTTTGACTTGTACTTCGCCTC
GGGTTATTGCTGTTGTGAGTACTGGAGACAGGAGGGAGTGTTAGAGTTGGGGTAAACCACAGTAGCTCATGTCACTTGGATAACTCGTCAGCCTC
GGTCACTCGCTGTGGAGAGTACTTTGAGACAGGAGGGAGTGCTAGAGTTTGGGGTAAAACCACAGCAGCTCATGCACTTGGATATCTGTGAGCC
GAGGTTAGTGCTGTGGAGAGTACTGGAGACAGGAGGGAGTGCTAGAGTTGGGGTAAAGCACAGCACCATTCACTGATAAATGTCAGGCCTAGGGG
GAGGTATTGCTGTGGAGGGTACTGGAGACAGGAGGAGTGCTAGAGGTTGGGGTAAAACCACAGCAGCTCATTTACTTGATACTGTCAGGCTCAGG
GAGGTTATTTGCTGTGGAGAGTTACTGAGACATGGGGTGCCAAGTTGGGTAGCTACAGCAGCTCATTTCACTTGATACTGCAGGCTCTCAG
GAGTTAATTTCGTGGAGAGTACTAGAGCACAGGAGGGAGGCCAGATTGGGGTATACCACAGCAGCTCGTTCACTTTAACTGTCAGGCCCTCA
ACAGTTTAATTGATGGGCGGAGAGTACTGGAGACAGGAGGGAGTGCTAGAGTGGGGTAAACCACAGCAGCATCTTTCATTATAACTGTCAG
CAAGGTTTTTTCGCTGTGGAGAGTACTGGAGACCGGGGAGTGTAGACTTTGGGGTATCACGTAGCAGCTTATTTCGACTTTGTACTGTAA
GAGTTATTTCTGTGGAGAGAACTGGAGACGGAGGGAGTGCTAGAGTTGGGGTAAACACCAGGCAGCCATTTCACTTGATAACTGTCAGGCCTT

Sample Output 1

GGAGGTTA-TTGCT--GTGGAG-GTAC-TGGAGA-AGGA-GGA-TGCTAGCG-TT-GGGT-AAACCAC-G-AGC-ATTTTGACTT-G-T-ACT--TC-GCCTC----
--GGGTTA-TTGCT--GTTGTGAGTAC-TGGAGACAGGAGGGAGTGTTAGAG-TTGGGGT-AAACCACAGTAGCTCATGTCACTTGGATAACTCGTCAGCCTC----
---GGTCACTCGCT--GTGGAGAGTACTTTGAGACAGGAGGGAGTGCTAGAGTTTGGGGTAAAACCACAGCAGCTCATG-CACTTGGATATCT-GTGAG-C-C----
-GAGGTTA-GTGCT--GTGGAGAGTAC-TGGAGACAGGAGGGAGTGCTAGAG-TTGGGGT-AAAGCACAGCA-CCATTCACTGATAAATGTCAGGCCTAGGGG----
--GAGGTA-TTGCT--GTGGAGGGTAC-TGGAGACAGGA-GGAGTGCTAGAGGTTGGGGTAAAACCACAGCAGCTCAT-TTACTT-GAT-ACT-GTCAGGCTC-AGG
-GAGGTTATTTGCT--GTGGAGAGTTACT-GAGACA--TGGG-GTGCCA-AG-TT-GGGT--AGCTACAGCAGCTCATTTCACTT-GAT-ACT-G-CAGGCTCTCAG
--GAGTTAATTTC---GTGGAGAGTACTAGAGCACAGGAGGGAG-GCCAGA--TTGGGGT-ATACCACAGCAGCTCGT-TCACTT---TAACT-GTCAGGC-CCTCA
ACAGTTTAATTGATGGGCGGAGAGTAC-TGGAGACAGGAGGGAGTGCTAGAG--TGGGGT-AAACCACAGCAGCATCTTTCA-TT--ATAACT-GTCAG--------
CAAGGTT-TTTTCGCTGTGGAGAGTAC-TGGAGAC-CG-GGGAGTG-TAGACTTTGGGGT---ATCAC-GTAG--CAGCTTATTTCG--ACTTTGT-A--CT-GTAA
--GAGTTA-TTTCT--GTGGAGAGAAC-TGGAGAC-GGAGGGAGTGCTAGAG-TTGGGGT-AAACACCAGGCAGCCATTTCACTT-GATAACT-GTCAGGC-C--TT

Points

The submitted program is evaluated by two test case types: Small and Large, as denoted in the Constraints section. Small contains two cases, and Large contains eight cases. The points are given to outputs that satisfy the format within the time limit. For each output of a test case in Small, a contestant earns 200lfloorCall(S)\*0.2rfloor\\{200 - \\lfloor C_{all}(S) \* 0.2 \\rfloor \\} or 0, whichever is greater. For a test case in Large, the contestant earns 700lfloorCall(S)\*0.1rfloor\\{700 - \\lfloor C_{all}(S) \* 0.1 \\rfloor \\} or 0, whichever is greater. For example, if one can generate two outputs S1S_1 and S2S_2 for Small-type cases with a correct format within the time limit, one can earn $max\\{200 - \\lfloor C_{all}(S_1) \* 0.2 \\rfloor, 0\\} + max\\{200 - \\lfloor C_{all}(S_2) \* 0.2 \\rfloor, 0\\}$.

We do not guarantee that the optimal solution can achieve the maximum points (6000 points). The final standings are determined by the other dataset generated by the same method used to generate the current test cases.

The method for generating test cases

We extract substring ss from chromosome 22 of the human reference genome sequence in the first step. ss is extracted such that it does not include N and only consists of A, T, G and C. In the second step, we generate insertions/deletions/substitutions for ss to make a new string ss' based on a sequencing error model that simulates a certain sequencer. In the third step, we repeat the second step to generate all the input strings. Note that ss is not included in a test case. We do not describe the sequencing error model we used but provide the sample test cases available from the following URL.