The DNSRA Challenge

Foundations in Algorithms
School of Computing, National University of Singapore
(CS5206, Fall 2009)

Changelog

5th November 2009

Added FAQ to Section 10 and how we will be grading M3 to Section 9.

4th November 2009

Thanks to a comment from Wang Wei in the IVLE forum, we found an off-by-one error in the values of d reported in the input files. The reported value is one less than the actual value, this has been corrected in all input files by altering the value of d.

2nd November 2009

In Section 8.3, added link to the DNSRA Challenge site and instructions to submit your output as part of the requirements for M3.

Section 4 includes a link to the program we use for computing the score of a set of contigs.

28th October 2009

Added formula used in evaluation to Section 4.

23rd October 2009

Added medium, large, and hidden datasets to Section 5.3, 5.4, and 5.5.

Please DO NOT run your programs on the medium and large datasets using sunfire, you should use the Tembusu cluster (see https://www.comp.nus.edu.sg/cf/tembusu/index.html and https://www.comp.nus.edu.sg/cf/tembusu/sge.html for details, as usual you should post your question on the IVLE forum).

1st October 2009

25th September 2009

24th September 2009

Added team pairings and assignment of algorithm for M1 to Section 7.

1  Aims of the Course Project

The aim of the course project is to apply advanced algorithms and data structures to solve real problems. The problem chosen for this purpose is the De novo Short Read Assembly problem (DNSRA).

DNSRA is a problem that arises from DNA sequencing. Sequencing is the process of determining the order of chemical bases (abbreviated A, G, C, and T) that make up the target DNA sequence. The strategy used in the new generation sequencing projects is as follows:

New sequencing machines are able to produce large amounts of short paired reads. The challenge of short read assembly is to take in a set of reads and assemble them to form the original DNA sequence. As it is almost impossible to recover the entire DNA sequence, the output of the assembly process is a number of contigs. A contig is contiguous sequence of the target DNA sequence formed by combining multiple reads together.

In this project, we will focus on a simplified version of the DNSRA problem by assuming that the target DNA sequence consists of a single linear chromosome taken from bacteria. In addition, we assume there are no errors due to erroneous litigations and indels.

2  DNSRA Challenge

This project is to implement efficient and effective algorithms for DNSRA and verify the performance of your algorithm on various benchmark datasets. This project is modelled after various DIMACS Challenges, where researchers are invited to implement algorithms to solve specific problems. (Past DIMACS challenge problems include matching, network flows, graph colouring, maximum clique, satisfiability, TSP, shortest path, etc.).

In this DNSRA Challenge, the “challenge” is to assemble contiguous portions of the target DNA sequence given a large number of reads.

3  The DNSRA Problem

The input for DNSRA consists of the following:

Recall that the reads are generated by sequencing the start and end of each fragment. In addition, reads are subjected to read errors that causes a base to be misread. Suppose the base on a fragment is supposed to be A but due to read errors, it could be read as G, C, T or N, where N means the sequencer is unable to decide the correct base.

The solution to the DNSRA problem is a set of contigs. Contigs are formed by combining multiple reads together. Contigs should be as long as possible.

For a more detailed write-up of the DNSRA problem and related works, please refer to this report [1] and [6]. We would like to thank Pramila for allowing us to distribute his report.

3.1  Input format

The input will be in the form of a multi-FASTA file with a custom header block. The following shows a sample input file.

## CS5206 DNSRA project input file
## Reads generated from genome/AM260523_852.fna via simulation of a paired-end sequencing machine
## Simulation parameters are: 100 10 50 30 0 1
## File generated on: Tue Sep 15 10:35:57 SGT 2009
#\d=100
#\w=10
#\l=30
#\n=1400
>read_1_1
TTGAGTTAAGGAGATAAGATGTTAAAAAAT
>read_1_2
AAAAACAATTAGCTAAATATCAAAGTAAAG
>read_2_1
AATTAGCTAAATATCAAAGTAAAGTTTTAG
>read_2_2
AAGCTAAGAAAGATGAGAATTTAGCTAGTA
>read_3_1
ACTAGATAAGGGGGTTACAGCGTTCCTTAA
>read_3_2
GATATAATGGGGTTGTTATGATTGATTGAG
...

Lines beginning with # represent the header of the file.

Lines starting with ## are human readable comments, your program should ignore these lines.

Lines starting with #\ are the parameters for this dataset, your program should parse these lines. The parameters d, w, and l are as described in the Section 3. The parameter n indicates the number of reads in the file.

The reads are organised as follows the (2i − 1)th read is the start of the ith fragment and the (2i)th read is the end of the ith fragment.

Each read is given using two lines, the first line begins with > followed by a string that is the id of the read. This is followed by a line containing l characters from the set {A, G, C, T, N}.

3.2  Output format

Your program should try to combine as many reads as possible into contigs and output the contig in a multi-FASTA file as follows. The following shows a sample output file.

## CS5206 DNSRA project output file
## Contigus generated from reads001.fna using GCE algorithm
## Algorithm parameters are: 100 10 50 30 0 1
## File generated on: Tue Sep 15 10:35:57 SGT 2009
#\f=reads001.fna
#\t=12
#\m=100
#\r=10
>contig_1
TTGAGTTAAAAGTAAAAAACAAGCTAAATATCAAAGTAAAG
>contig_2
AATTAGCTAAATATCAAAGTAAGTTGACTAAGAAAGATGAGAATTTAGCTAGTA
...

Your output file should start with a human readable text comment indicating the name of the algorithm used and the parameters of the algorithm.

This is followed by a section that indicates the values for the following parameters.

Finally, output the contigs in multi-FASTA format. Each contig should be printed on two lines. The first line should start with > and contain an unique id of the contig and the next line should contain the sequence of the contig which is a string contain characters from the set {A, G, C, T}

4  Evaluation Criteria

We will be using blat [5] to map your contigs back to the reference genome. The quality of your output will be evaluated using a combination of the coverage and the N50 contig length.

The coverage of a set of contigs is the percentage of the reference genome that is covered by contigs.

The N50 contig length is defined as follows: sort the contigs that can be mapped back to the reference genome in decreasing length and map them back to the reference genome one at a time. The contig that causes the coverage to go over 50% is the N50 contig. The N50 contig length is the length of the N50 contig.

For the purpose of this project, we came up with a score function that combines different measures of contig quality. Take note that this is done solely for the sake of grading your solution in an objective way. In actual research, algorithms are typically compared using each metric separately.

The score of a set of contigs is computed using the following formula

m
n
 + 
m
b
 + max(log2 
l × 100
n
,0)

where

We have developed a grading program that computes the score of a set of contigs given your contigs and the output of blat. You can download the program from here. This program is also used on the DNSRA Challenge site.

5  The Benchmark DNSRA Datasets

An important aspect of the DIMACS Challenges is a publicly available set of problem instances which made it possible to directly compare results. Likewise, the problem instances for the DNSRA Challenge are made available in the course web-site.

5.1  Testing

Target DNA sequence has 840 bases and there are no read errors.

5.2  Small

Target DNA sequence has 3661 bases and there are no read errors.

5.3  Medium

Target DNA sequence has 79,745 bases.

CoveragePercentage error per base(d, w)Link (gzipped)
50x0%(5001, 50)sim/reads201.fna.gz
50x1%(5001, 50)sim/reads202.fna.gz
50x1%(1001, 10)sim/reads203.fna.gz
100x1%(1001, 10)sim/reads204.fna.gz
20x2%(1001, 10)sim/reads205.fna.gz
20x2%(5001, 50)sim/reads206.fna.gz

Target DNA sequence has 117,080 bases.

CoveragePercentage error per base(d, w)Link (gzipped)
50x0%(5001, 50)sim/reads301.fna.gz
50x1%(5001, 50)sim/reads302.fna.gz
50x1%(1001, 10)sim/reads303.fna.gz
100x1%(1001, 10)sim/reads304.fna.gz
20x2%(1001, 10)sim/reads305.fna.gz
20x2%(5001, 50)sim/reads306.fna.gz

5.4  Large

Target DNA sequence has 615,980 bases.

CoveragePercentage error per base(d, w)Link (gzipped)
50x0%(10001, 100)sim/reads401.fna.gz
50x1%(10001, 100)sim/reads402.fna.gz
50x1%(5001, 50)sim/reads403.fna.gz
100x1%(5001, 50)sim/reads404.fna.gz
20x2%(5001, 50)sim/reads405.fna.gz
20x2%(10001, 100)sim/reads406.fna.gz

Target DNA sequence has 1,553,927 bases.

CoveragePercentage error per base(d, w)Link (gzipped)
50x0%(10001, 100)sim/reads501.fna.gz
50x1%(10001, 100)sim/reads502.fna.gz
50x1%(5001, 50)sim/reads503.fna.gz
100x1%(5001, 50)sim/reads504.fna.gz
20x2%(5001, 50)sim/reads505.fna.gz
20x2%(10001, 100)sim/reads506.fna.gz

5.5  Hidden

In addition to the publicly available datasets posted here, we may also make use of an additional hidden set of test data to evaluate your programs.

6  Source Codes and Resources

7  What You Have to Do

You will be assigned to teams of two students according to the following table. The pairing was done by Hon Wai with the assistance of a random number generator in Excel.

Team idMembersHeuristics for M1 (see Section 8.1)
T01VO HOANG TAM, SHEN ZHONGQ1, B1
T02CHEN LIANG, RIKKY WENANG PURBOJATIQ1, B2
T03SUJIT MATHEW, ZHAO FENGQ2, B1
T04ZHOU YE, ATTILA PERESZLENYIQ2, B2
T05PENGHUI YAO, LANVIN PIERRE CYRILQ1, B1
T06PILLARISETTI JAIDEV, LU XUESONG, WANG SUYUNQ1, B2
T07ZHAO ZHENWEI, TIAN ZHENGMIAOQ2, B1
T08MEHMET ERDOGAN, GUO WENYUANQ2, B2
T09CAO THANH TUNG, YING SHANSHANQ1, B1
T10VU THUY HUONG, ZHANG HAOJUNQ1, B2
T11VENUGOPAL NAVANEETHAN, WANG YUYIQ2, B1
T12LIM JING QUAN, WANG ZIRUIQ2, B2
T13NGUYEN HOANG MINH DUNG, WANG XIAOLIQ1, B1
T15BAI HAOYU, KRISHNAMOORTHY SHYAAMKUMAARQ2, B1
T16KOH CHUEN HOA, ZHOU ZHENGLONGQ2, B2
T17FOO CEXIN LEWIS, DENG XIAOXIAQ1, B1
T18LI BOWEN, LI XIAOHUIQ1, B2
T19KO WEILIANG WILLIAM, TRAN QUOC TUANQ2, B1
T20NIE LIQIANG, LIM YONG SAN GILBERTQ2, B2
T21NGUYEN PHUC KHANH LUAN, CHEN YINGCHAOQ1, B1
T22LIU CHEN, ZHANG HAIMOQ1, B2
T23GHO ZHENGHENG, PRANG ANTOINE GABRIEL GUYQ2, B1
T24SHUBHABRATA SEN, WANG WEIQ2, B2
T25KAZI RUBAIAT HABIB (TARIN), LU YINGQ1, B1
T26SRIGANESH MANIGANAHALLI SRIHARI, WANG XUANCONGQ1, B2
T27SUCHEENDRA KUMAR PALANIAPPAN, XIONG FEIQ2, B1
T28WEI XUELIANG, TAN TIAN SHENG ALEXQ2, B2
T29XIAO QIAN, MAI DANG QUANG HUNGQ1, B1

To help you achieve your project goal of implementing your best algorithm for DNSRA , we have set three milestones for the project:

M1: Basic algorithm
You will implement a simple greedy assembly algorithm to gain some familiarity with the DNSRA problem and the input/output format.
M2: Your proposed DNSRA algorithm
For this milestone, you will first do suitable literature research. You should select a solution approach and work out some of the technical details for your proposed DNSRA algorithm. Then write out a solution proposal for your DNSRA algorithm.
M3: Your very own DNSRA algorithm
Finally, you will implement your proposed DNSRA algorithm. Remember that you will need to make a lot of refinements and improvements to your algorithm after you see the experimental results on the benchmark datasets. So start early.

8  Your DNSRA Project Deliverables

8.1  For Milestone M1 (due 06 Oct)

The following is a generic greedy algorithm [6] for the DNSRA problem.

sort the reads based on ``read quality''
consider each unused read, r, in decreasing order of quality
    mark r as used
    let the current contig, c, be r
    while (current contig can be extended)
        find another an unused read, r', that ``best'' extends the current contig
        mark r' as used
        extend c with r'

We can get different algorithms by using different definitions of read quality and the best read that extends the current contig.

Two possible definitions of read quality are:

Two possible definitions of the best read which extends the current contig are:

You should implement your variation of the generic algorithm that is assigned to your team (see Section 7). For parts of the above algorithm which are not fully specified, you are free to come up with your own definitions.

Write a brief report on your greedy algorithm that describes some of your design choices and includes a table of the results obtained on the testing and small datasets. Your table of results should contain at least, columns for the coverage, N50 contig length, and running time of your algorithm for each data set. An additional thing that you may want to do is to try to draw some conclusions (good, bad, or ugly) on your assigned greedy heuristic algorithm.

Name your report DNSRA-M1-Rep-[Tnn].pdf where nn is your assigned team number.

Put your report under a new directory called "report".

To submit your DNSRA M1 deliverables, please do the following:

8.2  For Milestone M2 (due 09 oct)

Write up your solution proposal as a 2-page document that outlines your proposed algorithm (with some details to make it understandable for me) and call it DNSRA-M2-Prop-[Tnn].pdf and submit this zip file to CS5206 IVLE workbin called "CS5206-DNSRA-M2".

8.3  For Milestone M3 (due 10 Nov)

Your DNSRA Report – For Milestone M3, you should write a proper project report that should have (at least) the following sections:

Problem Statement
Explain the DNSRA problem.
Overview of Your Algorithm for M3
Explain the general idea(s) behind your DNSRA algorithm for M3. Give possible reasons why you think it should give good results for DNSRA.
Details of Your Algorithm for M3
Give the technical details of your algorithm here. Together with information on how you implemented it. Clearly state any parameters you have in your algorithm and describe briefly how you choose their values.
Complexity Analysis
Analyse the time complexity of your algorithm.
Result and Observations
Results obtained by your M3 algorithm, observations, if any, and recommendations for better approaches.

9  Grading of M3

We will use this grading sheet to grade your M3 submission.

We will also be running your programs on secret inputs during the Study Week.

If we have difficulties getting your program to run (compilation, run error, etc), you will be called back during Study Week to help us run OUR COPY OF YOUR SUBMITTED PROGRAM.

References

[1]
Ariyaratne, P. De novo genome assembly using paired-end short reads. Tech. rep., National University of Singapore, 2009.
[2]
Bryant, D., Wong, W., and Mockler, T. QSRA – a quality-value guided de novo short read assembler. BMC bioinformatics 10, 1 (2009), 69.
[3]
Hernandez, D., François, P., Farinelli, L., Østerås, M., and Schrenzel, J. De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Research 18, 5 (2008), 802.
[4]
Jeck, W., Reinhardt, J., Baltrus, D., Hickenbotham, M., Magrini, V., Mardis, E., Dangl, J., and Jones, C. Extending assembly of short DNA sequences to handle error. Bioinformatics 23, 21 (2007), 2942.
[5]
Kent, W. BLAT-the BLAST-like alignment tool. Genome research 12, 4 (2002), 656–664.
[6]
Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 4 (2009), 354.
[7]
Warren, R., Sutton, G., Jones, S., and Holt, R. Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23, 4 (2007), 500.
[8]
Zerbino, D., and Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18, 5 (2008), 821.

10  FAQ

10.1  How to interpret blat output?

Recall that a contig is contiguous sequence of the target DNA sequence formed by combining multiple reads together.

After running blat, the valid contigs are those that can be mapped back to the target DNA sequence. They must satisfy the following conditions:

The requirements for valid contigs are necessarily strict as we are solving the de novo (target sequence is unknown) assembly problem. Invalid contigs will lead to incorrect information about the genome we are trying to assemble.

10.2  Can LEDA be used on tembusu?

Yes. See this post in the IVLE forum.

10.3  Is the length of each read fixed at 30?

Yes.


This document was translated from LATEX by HEVEA.