The DNSRA ChallengeFoundations in Algorithms |
Added FAQ to Section 10 and how we will be grading M3 to Section 9.
Thanks to a comment from Wang Wei in the IVLE forum, we found an off-by-one error in the values of d reported in the input files. The reported value is one less than the actual value, this has been corrected in all input files by altering the value of d.
In Section 8.3, added link to the DNSRA Challenge site and instructions to submit your output as part of the requirements for M3.
Section 4 includes a link to the program we use for computing the score of a set of contigs.
Added formula used in evaluation to Section 4.
Added medium, large, and hidden datasets to Section 5.3, 5.4, and 5.5.
Please DO NOT run your programs on the medium and large datasets using sunfire, you should use the Tembusu cluster (see https://www.comp.nus.edu.sg/cf/tembusu/index.html and https://www.comp.nus.edu.sg/cf/tembusu/sge.html for details, as usual you should post your question on the IVLE forum).
Added team pairings and assignment of algorithm for M1 to Section 7.
The aim of the course project is to apply advanced algorithms and data structures to solve real problems. The problem chosen for this purpose is the De novo Short Read Assembly problem (DNSRA).
DNSRA is a problem that arises from DNA sequencing. Sequencing is the process of determining the order of chemical bases (abbreviated A, G, C, and T) that make up the target DNA sequence. The strategy used in the new generation sequencing projects is as follows:
New sequencing machines are able to produce large amounts of short paired reads. The challenge of short read assembly is to take in a set of reads and assemble them to form the original DNA sequence. As it is almost impossible to recover the entire DNA sequence, the output of the assembly process is a number of contigs. A contig is contiguous sequence of the target DNA sequence formed by combining multiple reads together.
In this project, we will focus on a simplified version of the DNSRA problem by assuming that the target DNA sequence consists of a single linear chromosome taken from bacteria. In addition, we assume there are no errors due to erroneous litigations and indels.
This project is to implement efficient and effective algorithms for DNSRA and verify the performance of your algorithm on various benchmark datasets. This project is modelled after various DIMACS Challenges, where researchers are invited to implement algorithms to solve specific problems. (Past DIMACS challenge problems include matching, network flows, graph colouring, maximum clique, satisfiability, TSP, shortest path, etc.).
In this DNSRA Challenge, the “challenge” is to assemble contiguous portions of the target DNA sequence given a large number of reads.
The input for DNSRA consists of the following:
Recall that the reads are generated by sequencing the start and end of each fragment. In addition, reads are subjected to read errors that causes a base to be misread. Suppose the base on a fragment is supposed to be A but due to read errors, it could be read as G, C, T or N, where N means the sequencer is unable to decide the correct base.
The solution to the DNSRA problem is a set of contigs. Contigs are formed by combining multiple reads together. Contigs should be as long as possible.
For a more detailed write-up of the DNSRA problem and related works, please refer to this report [1] and [6]. We would like to thank Pramila for allowing us to distribute his report.
The input will be in the form of a multi-FASTA file with a custom header block. The following shows a sample input file.
## CS5206 DNSRA project input file ## Reads generated from genome/AM260523_852.fna via simulation of a paired-end sequencing machine ## Simulation parameters are: 100 10 50 30 0 1 ## File generated on: Tue Sep 15 10:35:57 SGT 2009 #\d=100 #\w=10 #\l=30 #\n=1400 >read_1_1 TTGAGTTAAGGAGATAAGATGTTAAAAAAT >read_1_2 AAAAACAATTAGCTAAATATCAAAGTAAAG >read_2_1 AATTAGCTAAATATCAAAGTAAAGTTTTAG >read_2_2 AAGCTAAGAAAGATGAGAATTTAGCTAGTA >read_3_1 ACTAGATAAGGGGGTTACAGCGTTCCTTAA >read_3_2 GATATAATGGGGTTGTTATGATTGATTGAG ...
Lines beginning with #
represent the header of the file.
Lines starting with ##
are human readable comments, your program should ignore these lines.
Lines starting with #\
are the parameters for this dataset, your program
should parse these lines. The parameters d, w, and l are as described in
the Section 3. The parameter n indicates the number of reads in
the file.
The reads are organised as follows the (2i − 1)th read is the start of the ith fragment and the (2i)th read is the end of the ith fragment.
Each read is given using two lines, the first line begins with >
followed
by a string that is the id of the read. This is followed by a line containing
l characters from the set {A, G, C, T, N}.
Your program should try to combine as many reads as possible into contigs and output the contig in a multi-FASTA file as follows. The following shows a sample output file.
## CS5206 DNSRA project output file ## Contigus generated from reads001.fna using GCE algorithm ## Algorithm parameters are: 100 10 50 30 0 1 ## File generated on: Tue Sep 15 10:35:57 SGT 2009 #\f=reads001.fna #\t=12 #\m=100 #\r=10 >contig_1 TTGAGTTAAAAGTAAAAAACAAGCTAAATATCAAAGTAAAG >contig_2 AATTAGCTAAATATCAAAGTAAGTTGACTAAGAAAGATGAGAATTTAGCTAGTA ...
Your output file should start with a human readable text comment indicating the name of the algorithm used and the parameters of the algorithm.
This is followed by a section that indicates the values for the following parameters.
Finally, output the contigs in multi-FASTA format. Each contig should be printed
on two lines. The first line should start with >
and contain an unique id
of the contig and the next line should contain the sequence of the contig which
is a string contain characters from the set {A, G, C, T}
We will be using blat [5] to map your contigs back to the reference genome. The quality of your output will be evaluated using a combination of the coverage and the N50 contig length.
The coverage of a set of contigs is the percentage of the reference genome that is covered by contigs.
The N50 contig length is defined as follows: sort the contigs that can be mapped back to the reference genome in decreasing length and map them back to the reference genome one at a time. The contig that causes the coverage to go over 50% is the N50 contig. The N50 contig length is the length of the N50 contig.
For the purpose of this project, we came up with a score function that combines different measures of contig quality. Take note that this is done solely for the sake of grading your solution in an objective way. In actual research, algorithms are typically compared using each metric separately.
The score of a set of contigs is computed using the following formula
| + |
| + max(log2 |
| ,0) |
where
We have developed a grading program that computes the score of a set of contigs given your contigs and the output of blat. You can download the program from here. This program is also used on the DNSRA Challenge site.
An important aspect of the DIMACS Challenges is a publicly available set of problem instances which made it possible to directly compare results. Likewise, the problem instances for the DNSRA Challenge are made available in the course web-site.
Target DNA sequence has 840 bases and there are no read errors.
Target DNA sequence has 3661 bases and there are no read errors.
Target DNA sequence has 79,745 bases.
Coverage | Percentage error per base | (d, w) | Link (gzipped) |
50x | 0% | (5001, 50) | sim/reads201.fna.gz |
50x | 1% | (5001, 50) | sim/reads202.fna.gz |
50x | 1% | (1001, 10) | sim/reads203.fna.gz |
100x | 1% | (1001, 10) | sim/reads204.fna.gz |
20x | 2% | (1001, 10) | sim/reads205.fna.gz |
20x | 2% | (5001, 50) | sim/reads206.fna.gz |
Target DNA sequence has 117,080 bases.
Coverage | Percentage error per base | (d, w) | Link (gzipped) |
50x | 0% | (5001, 50) | sim/reads301.fna.gz |
50x | 1% | (5001, 50) | sim/reads302.fna.gz |
50x | 1% | (1001, 10) | sim/reads303.fna.gz |
100x | 1% | (1001, 10) | sim/reads304.fna.gz |
20x | 2% | (1001, 10) | sim/reads305.fna.gz |
20x | 2% | (5001, 50) | sim/reads306.fna.gz |
Target DNA sequence has 615,980 bases.
Coverage | Percentage error per base | (d, w) | Link (gzipped) |
50x | 0% | (10001, 100) | sim/reads401.fna.gz |
50x | 1% | (10001, 100) | sim/reads402.fna.gz |
50x | 1% | (5001, 50) | sim/reads403.fna.gz |
100x | 1% | (5001, 50) | sim/reads404.fna.gz |
20x | 2% | (5001, 50) | sim/reads405.fna.gz |
20x | 2% | (10001, 100) | sim/reads406.fna.gz |
Target DNA sequence has 1,553,927 bases.
Coverage | Percentage error per base | (d, w) | Link (gzipped) |
50x | 0% | (10001, 100) | sim/reads501.fna.gz |
50x | 1% | (10001, 100) | sim/reads502.fna.gz |
50x | 1% | (5001, 50) | sim/reads503.fna.gz |
100x | 1% | (5001, 50) | sim/reads504.fna.gz |
20x | 2% | (5001, 50) | sim/reads505.fna.gz |
20x | 2% | (10001, 100) | sim/reads506.fna.gz |
In addition to the publicly available datasets posted here, we may also make use of an additional hidden set of test data to evaluate your programs.
You will be assigned to teams of two students according to the following table. The pairing was done by Hon Wai with the assistance of a random number generator in Excel.
Team id | Members | Heuristics for M1 (see Section 8.1) | |
T01 | VO HOANG TAM, SHEN ZHONG | Q1, B1 | |
T02 | CHEN LIANG, RIKKY WENANG PURBOJATI | Q1, B2 | |
T03 | SUJIT MATHEW, ZHAO FENG | Q2, B1 | |
T04 | ZHOU YE, ATTILA PERESZLENYI | Q2, B2 | |
T05 | PENGHUI YAO, LANVIN PIERRE CYRIL | Q1, B1 | |
T06 | PILLARISETTI JAIDEV, LU XUESONG, WANG SUYUN | Q1, B2 | |
T07 | ZHAO ZHENWEI, TIAN ZHENGMIAO | Q2, B1 | |
T08 | MEHMET ERDOGAN, GUO WENYUAN | Q2, B2 | |
T09 | CAO THANH TUNG, YING SHANSHAN | Q1, B1 | |
T10 | VU THUY HUONG, ZHANG HAOJUN | Q1, B2 | |
T11 | VENUGOPAL NAVANEETHAN, WANG YUYI | Q2, B1 | |
T12 | LIM JING QUAN, WANG ZIRUI | Q2, B2 | |
T13 | NGUYEN HOANG MINH DUNG, WANG XIAOLI | Q1, B1 | |
T15 | BAI HAOYU, KRISHNAMOORTHY SHYAAMKUMAAR | Q2, B1 | |
T16 | KOH CHUEN HOA, ZHOU ZHENGLONG | Q2, B2 | |
T17 | FOO CEXIN LEWIS, DENG XIAOXIA | Q1, B1 | |
T18 | LI BOWEN, LI XIAOHUI | Q1, B2 | |
T19 | KO WEILIANG WILLIAM, TRAN QUOC TUAN | Q2, B1 | |
T20 | NIE LIQIANG, LIM YONG SAN GILBERT | Q2, B2 | |
T21 | NGUYEN PHUC KHANH LUAN, CHEN YINGCHAO | Q1, B1 | |
T22 | LIU CHEN, ZHANG HAIMO | Q1, B2 | |
T23 | GHO ZHENGHENG, PRANG ANTOINE GABRIEL GUY | Q2, B1 | |
T24 | SHUBHABRATA SEN, WANG WEI | Q2, B2 | |
T25 | KAZI RUBAIAT HABIB (TARIN), LU YING | Q1, B1 | |
T26 | SRIGANESH MANIGANAHALLI SRIHARI, WANG XUANCONG | Q1, B2 | |
T27 | SUCHEENDRA KUMAR PALANIAPPAN, XIONG FEI | Q2, B1 | |
T28 | WEI XUELIANG, TAN TIAN SHENG ALEX | Q2, B2 | |
T29 | XIAO QIAN, MAI DANG QUANG HUNG | Q1, B1 |
To help you achieve your project goal of implementing your best algorithm for DNSRA , we have set three milestones for the project:
The following is a generic greedy algorithm [6] for the DNSRA problem.
sort the reads based on ``read quality'' consider each unused read, r, in decreasing order of quality mark r as used let the current contig, c, be r while (current contig can be extended) find another an unused read, r', that ``best'' extends the current contig mark r' as used extend c with r'
We can get different algorithms by using different definitions of read quality and the best read that extends the current contig.
Two possible definitions of read quality are:
Two possible definitions of the best read which extends the current contig are:
You should implement your variation of the generic algorithm that is assigned to your team (see Section 7). For parts of the above algorithm which are not fully specified, you are free to come up with your own definitions.
Write a brief report on your greedy algorithm that describes some of your design choices and includes a table of the results obtained on the testing and small datasets. Your table of results should contain at least, columns for the coverage, N50 contig length, and running time of your algorithm for each data set. An additional thing that you may want to do is to try to draw some conclusions (good, bad, or ugly) on your assigned greedy heuristic algorithm.
Name your report DNSRA-M1-Rep-[Tnn].pdf where nn is your assigned team number.
Put your report under a new directory called "report".
To submit your DNSRA M1 deliverables, please do the following:
Write up your solution proposal as a 2-page document that outlines your proposed algorithm (with some details to make it understandable for me) and call it DNSRA-M2-Prop-[Tnn].pdf and submit this zip file to CS5206 IVLE workbin called "CS5206-DNSRA-M2".
Your DNSRA Report – For Milestone M3, you should write a proper project report that should have (at least) the following sections:
We will use this grading sheet to grade your M3 submission.
We will also be running your programs on secret inputs during the Study Week.
If we have difficulties getting your program to run (compilation, run error, etc), you will be called back during Study Week to help us run OUR COPY OF YOUR SUBMITTED PROGRAM.
Recall that a contig is contiguous sequence of the target DNA sequence formed by combining multiple reads together.
After running blat, the valid contigs are those that can be mapped back to the target DNA sequence. They must satisfy the following conditions:
The requirements for valid contigs are necessarily strict as we are solving the de novo (target sequence is unknown) assembly problem. Invalid contigs will lead to incorrect information about the genome we are trying to assemble.
Yes. See this post in the IVLE forum.
Yes.
This document was translated from LATEX by HEVEA.