#genocon2021c. [genocon2021_c]Practice 3
[genocon2021_c]Practice 3
Problem statement
Given strings consisting of A, C, G, T. Let us insert blank symbols (denoted each by -) to the strings to make another set of strings such that all the lengths become equal and generate by matrix in which each row is one of the strings and is the length of the strings. We denote an element (i.e. character) at -th row and -th column of by . We also denote row and column vectors by and respectively. Let us consider inserting - at appropriate positions such that we can generate aligned in which the same characters are contained in each column vector. For this purpose, we define the following score that measures how much the given column vector is well aligned.
$C(t) = min_{x \\in \\{A, T, G, C, -\\}} \\left| \\{ t\[i\] \\; | \\; t\[i\] \\neq x \\; (i=1, \\ldots, m) \\} \\right|$
counts the number of characters in that are not equal to the one that appears in the most frequently. For example, when =CCCC-T, C appears the most frequently, so the number of characters that are not equal to C (i.e., - and T) becomes . When =CCAA-T, one can choose either C or A as because counts of C and A are equal. If C is chosen as , becomes by adding number of counts for A, - and T.
If we can generate such that sum of for all columns becomes small, it is expected that the given strings are well aligned. We call such a problem multiple alignment and define the following score.
For example, if we align the three strings GGATC, GGAT, GACC by inserting - as follows,
GGATC
GGAT-
G-ACC
The score becomes $C(S_{\*,1}) + C(S_{\*,2})+ C(S_{\*,3})+ C(S_{\*,4})+ C(S_{\*,5}) = 0 + 1 + 0 + 1 + 1 = 3$.
Given strings , output a multiple alignment such that becomes small as much as possible.
Constraints
- ,..., are strings consisting of A, C, G, T.
- Test case (small)
- .
- (the length of is denoted byx.)
- Test case (large)
- $35 \leq m \leq
Input
Input is given from Standard Input in the following format:
:
Output
Print a multiple alignment of input strings such that becomes small as much as possible. After inserting - in each input strings, they should be printed in the same order as the order of input strings.
The judging system can only accept user's output when: - Each line is equal to the corresponding input line if - is removed. - The lengths of all the lines are equal.
Sample Input 1
10
GGAGGTTATTGCTGTGGAGGTACTGGAGAAGGAGGATGCTAGCGTTGGGTAAACCACGAGCATTTTGACTTGTACTTCGCCTC
GGGTTATTGCTGTTGTGAGTACTGGAGACAGGAGGGAGTGTTAGAGTTGGGGTAAACCACAGTAGCTCATGTCACTTGGATAACTCGTCAGCCTC
GGTCACTCGCTGTGGAGAGTACTTTGAGACAGGAGGGAGTGCTAGAGTTTGGGGTAAAACCACAGCAGCTCATGCACTTGGATATCTGTGAGCC
GAGGTTAGTGCTGTGGAGAGTACTGGAGACAGGAGGGAGTGCTAGAGTTGGGGTAAAGCACAGCACCATTCACTGATAAATGTCAGGCCTAGGGG
GAGGTATTGCTGTGGAGGGTACTGGAGACAGGAGGAGTGCTAGAGGTTGGGGTAAAACCACAGCAGCTCATTTACTTGATACTGTCAGGCTCAGG
GAGGTTATTTGCTGTGGAGAGTTACTGAGACATGGGGTGCCAAGTTGGGTAGCTACAGCAGCTCATTTCACTTGATACTGCAGGCTCTCAG
GAGTTAATTTCGTGGAGAGTACTAGAGCACAGGAGGGAGGCCAGATTGGGGTATACCACAGCAGCTCGTTCACTTTAACTGTCAGGCCCTCA
ACAGTTTAATTGATGGGCGGAGAGTACTGGAGACAGGAGGGAGTGCTAGAGTGGGGTAAACCACAGCAGCATCTTTCATTATAACTGTCAG
CAAGGTTTTTTCGCTGTGGAGAGTACTGGAGACCGGGGAGTGTAGACTTTGGGGTATCACGTAGCAGCTTATTTCGACTTTGTACTGTAA
GAGTTATTTCTGTGGAGAGAACTGGAGACGGAGGGAGTGCTAGAGTTGGGGTAAACACCAGGCAGCCATTTCACTTGATAACTGTCAGGCCTT
Sample Output 1
GGAGGTTA-TTGCT--GTGGAG-GTAC-TGGAGA-AGGA-GGA-TGCTAGCG-TT-GGGT-AAACCAC-G-AGC-ATTTTGACTT-G-T-ACT--TC-GCCTC----
--GGGTTA-TTGCT--GTTGTGAGTAC-TGGAGACAGGAGGGAGTGTTAGAG-TTGGGGT-AAACCACAGTAGCTCATGTCACTTGGATAACTCGTCAGCCTC----
---GGTCACTCGCT--GTGGAGAGTACTTTGAGACAGGAGGGAGTGCTAGAGTTTGGGGTAAAACCACAGCAGCTCATG-CACTTGGATATCT-GTGAG-C-C----
-GAGGTTA-GTGCT--GTGGAGAGTAC-TGGAGACAGGAGGGAGTGCTAGAG-TTGGGGT-AAAGCACAGCA-CCATTCACTGATAAATGTCAGGCCTAGGGG----
--GAGGTA-TTGCT--GTGGAGGGTAC-TGGAGACAGGA-GGAGTGCTAGAGGTTGGGGTAAAACCACAGCAGCTCAT-TTACTT-GAT-ACT-GTCAGGCTC-AGG
-GAGGTTATTTGCT--GTGGAGAGTTACT-GAGACA--TGGG-GTGCCA-AG-TT-GGGT--AGCTACAGCAGCTCATTTCACTT-GAT-ACT-G-CAGGCTCTCAG
--GAGTTAATTTC---GTGGAGAGTACTAGAGCACAGGAGGGAG-GCCAGA--TTGGGGT-ATACCACAGCAGCTCGT-TCACTT---TAACT-GTCAGGC-CCTCA
ACAGTTTAATTGATGGGCGGAGAGTAC-TGGAGACAGGAGGGAGTGCTAGAG--TGGGGT-AAACCACAGCAGCATCTTTCA-TT--ATAACT-GTCAG--------
CAAGGTT-TTTTCGCTGTGGAGAGTAC-TGGAGAC-CG-GGGAGTG-TAGACTTTGGGGT---ATCAC-GTAG--CAGCTTATTTCG--ACTTTGT-A--CT-GTAA
--GAGTTA-TTTCT--GTGGAGAGAAC-TGGAGAC-GGAGGGAGTGCTAGAG-TTGGGGT-AAACACCAGGCAGCCATTTCACTT-GATAACT-GTCAGGC-C--TT
Points
The submitted program is evaluated by two test case types: Small and Large, as denoted in the Constraints section. Small contains two cases, and Large contains eight cases. The points are given to outputs that satisfy the format within the time limit. For each output of a test case in Small, a contestant earns or 0, whichever is greater. For a test case in Large, the contestant earns or 0, whichever is greater. For example, if one can generate two outputs and for Small-type cases with a correct format within the time limit, one can earn $max\\{200 - \\lfloor C_{all}(S_1) \* 0.2 \\rfloor, 0\\} + max\\{200 - \\lfloor C_{all}(S_2) \* 0.2 \\rfloor, 0\\}$.
We do not guarantee that the optimal solution can achieve the maximum points (6000 points). The final standings are determined by the other dataset generated by the same method used to generate the current test cases.
The method for generating test cases
We extract substring from chromosome 22 of the human reference genome sequence in the first step. is extracted such that it does not include N and only consists of A, T, G and C. In the second step, we generate insertions/deletions/substitutions for to make a new string based on a sequencing error model that simulates a certain sequencer. In the third step, we repeat the second step to generate all the input strings. Note that is not included in a test case. We do not describe the sequencing error model we used but provide the sample test cases available from the following URL.