The DNSRA Challenge

Foundations in Algorithms
School of Computing, National University of Singapore
(CS5206, Fall 2009)

Changelog

5th November 2009

Added FAQ to Section 10 and how we will be grading M3 to Section 9.

4th November 2009

Thanks to a comment from Wang Wei in the IVLE forum, we found an off-by-one error in the values of d reported in the input files. The reported value was one less than the actual value. This has been corrected in all input files by altering the value of d.

2nd November 2009

In Section 8.3, added link to the DNSRA Challenge site and instructions to submit your output as part of the requirements for M3.

Section 4 includes a link to the program we use for computing the score of a set of contigs.

28th October 2009

Added formula used in evaluation to Section 4.

23rd October 2009

Added medium, large, and hidden datasets to Section 5.3, 5.4, and 5.5.

Please DO NOT run your programs on the medium and large datasets using sunfire. Use the Tembusu cluster instead (see https://www.comp.nus.edu.sg/cf/tembusu/index.html and https://www.comp.nus.edu.sg/cf/tembusu/sge.html for details). As usual, you should post your questions on the IVLE forum.

1st October 2009

The following files had reads which were one character shorter than the actual read length due to an off-by-one error in our simulation program. They have been corrected. Please use the updated input files. Thanks to Antoine and Liu Chen for alerting us to this problem.
- sim/reads002.fna
- sim/reads003.fna
- sim/reads004.fna
- sim/reads005.fna
- sim/reads101.fna
- sim/reads105.fna
Clarified that for parts of M1 that are not fully specified, you should come up with your own definitions (see Section 8.1).

25th September 2009

Clarified that N only occurs when there are read errors.
Corrected error in description for B2 in Section 8.1.

24th September 2009

Added team pairings and assignment of algorithm for M1 to Section 7.

1 Aims of the Course Project

The aim of the course project is to apply advanced algorithms and data structures to solve real problems. The problem chosen for this purpose is the De novo Short Read Assembly problem (DNSRA).

DNSRA is a problem that arises from DNA sequencing. Sequencing is the process of determining the order of chemical bases (abbreviated A, G, C, and T) that make up the target DNA sequence. The strategy used in the new generation sequencing projects is as follows:

make many copies of the original DNA sequence
break them at random positions to obtain DNA fragments
select fragments of a particular length
sequence the start and end of each selected fragment to obtain pairs of reads (a read is a DNA sequence produced by the sequencing machine)

New sequencing machines are able to produce large amounts of short paired reads. The challenge of short read assembly is to take in a set of reads and assemble them to form the original DNA sequence. As it is almost impossible to recover the entire DNA sequence, the output of the assembly process is a number of contigs. A contig is a contiguous sequence of the target DNA sequence formed by combining multiple reads together.

In this project, we will focus on a simplified version of the DNSRA problem by assuming that the target DNA sequence consists of a single linear chromosome taken from bacteria. In addition, we assume there are no errors due to erroneous ligations and indels.

2 DNSRA Challenge

The goal of this project is to implement efficient and effective algorithms for DNSRA and verify the performance of your algorithm on various benchmark datasets. The project is modelled after the various DIMACS Challenges, where researchers are invited to implement algorithms to solve specific problems. (Past DIMACS challenge problems include matching, network flows, graph colouring, maximum clique, satisfiability, TSP, shortest path, etc.)

In this DNSRA Challenge, the “challenge” is to assemble contiguous portions of the target DNA sequence given a large number of reads.

3 The DNSRA Problem

The input for DNSRA consists of the following:

length of fragments, which ranges from d to d+w
length of each read, l
a sequence of paired reads, where each paired read consists of two strings of length l over the alphabet {A, G, C, T, N}. The symbol N denotes a read error where the particular base could not be read by the sequencer.

Recall that the reads are generated by sequencing the start and end of each fragment. In addition, reads are subject to read errors that cause a base to be misread. For example, if the base on a fragment is supposed to be A, read errors could cause it to be read as G, C, T or N, where N means the sequencer is unable to decide the correct base.

The solution to the DNSRA problem is a set of contigs. Contigs are formed by combining multiple reads together. Contigs should be as long as possible.

For a more detailed write-up of the DNSRA problem and related works, please refer to this report [1] and [6]. We would like to thank Pramila for allowing us to distribute his report.

3.1 Input format

The input will be in the form of a multi-FASTA file with a custom header block. The following shows a sample input file.

## CS5206 DNSRA project input file
## Reads generated from genome/AM260523_852.fna via simulation of a paired-end sequencing machine
## Simulation parameters are: 100 10 50 30 0 1
## File generated on: Tue Sep 15 10:35:57 SGT 2009
#\d=100
#\w=10
#\l=30
#\n=1400
>read_1_1
TTGAGTTAAGGAGATAAGATGTTAAAAAAT
>read_1_2
AAAAACAATTAGCTAAATATCAAAGTAAAG
>read_2_1
AATTAGCTAAATATCAAAGTAAAGTTTTAG
>read_2_2
AAGCTAAGAAAGATGAGAATTTAGCTAGTA
>read_3_1
ACTAGATAAGGGGGTTACAGCGTTCCTTAA
>read_3_2
GATATAATGGGGTTGTTATGATTGATTGAG
...

Lines beginning with # represent the header of the file.

Lines starting with ## are human readable comments. Your program should ignore these lines.

Lines starting with #\ are the parameters for this dataset. Your program should parse these lines. The parameters d, w, and l are as described in Section 3. The parameter n indicates the number of reads in the file.

The reads are organised as follows: the (2i − 1)th read is the start of the ith fragment and the (2i)th read is the end of the ith fragment.

Each read is given on two lines. The first line begins with > followed by a string that is the id of the read. The second line contains l characters from the set {A, G, C, T, N}.
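The parsing rules above can be sketched in a few lines of Python. This is only a minimal illustration, not the official reader: the function name parse_dnsra_input is our own, it assumes every parameter value is an integer, and it relies purely on the order of the reads to pair them.

```python
from typing import Dict, List, Tuple


def parse_dnsra_input(text: str) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Return (params, pairs); pairs[i] holds the start and end read of fragment i+1."""
    params: Dict[str, int] = {}
    reads: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("##"):
            continue  # blank line or human readable comment
        if line.startswith("#\\"):
            # parameter line such as #\d=100 (assumed to hold an integer)
            key, _, value = line[2:].partition("=")
            params[key] = int(value)
        elif line.startswith("#") or line.startswith(">"):
            continue  # other header line, or a read id (not needed for pairing)
        else:
            reads.append(line)
    # read 2i-1 is the start and read 2i is the end of fragment i
    pairs = list(zip(reads[0::2], reads[1::2]))
    return params, pairs
```

A caller would typically check that 2 × len(pairs) equals the parameter n and that every read has length l before starting the assembly.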

3.2 Output format

Your program should try to combine as many reads as possible into contigs and output the contigs in a multi-FASTA file. The following shows a sample output file.

## CS5206 DNSRA project output file
## Contigs generated from reads001.fna using GCE algorithm
## Algorithm parameters are: 100 10 50 30 0 1
## File generated on: Tue Sep 15 10:35:57 SGT 2009
#\f=reads001.fna
#\t=12
#\m=100
#\r=10
>contig_1
TTGAGTTAAAAGTAAAAAACAAGCTAAATATCAAAGTAAAG
>contig_2
AATTAGCTAAATATCAAAGTAAGTTGACTAAGAAAGATGAGAATTTAGCTAGTA
...

Your output file should start with a human readable text comment indicating the name of the algorithm used and the parameters of the algorithm.

This is followed by a section that indicates the values for the following parameters.

f : name of the input file
t : your team id
m : number of contigs in the file
r : time taken by your program to generate the output in seconds

Finally, output the contigs in multi-FASTA format. Each contig should be printed on two lines. The first line should start with > and contain a unique id for the contig. The next line should contain the sequence of the contig, which is a string containing characters from the set {A, G, C, T}.
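As a rough sketch of the output format, the following Python function builds the text of an output file. The function name, its arguments, and the wording of the comment lines are our own choices; only the #\f, #\t, #\m, #\r parameters and the two-line contig layout come from the description above.

```python
from typing import List


def format_dnsra_output(contigs: List[str], input_name: str, team_id: int,
                        runtime_seconds: int, algorithm: str = "greedy") -> str:
    """Build the text of an output file in the layout described in Section 3.2."""
    lines = [
        "## CS5206 DNSRA project output file",
        f"## Contigs generated from {input_name} using {algorithm} algorithm",
        f"#\\f={input_name}",        # name of the input file
        f"#\\t={team_id}",           # team id
        f"#\\m={len(contigs)}",      # number of contigs in the file
        f"#\\r={runtime_seconds}",   # running time in seconds
    ]
    for i, seq in enumerate(contigs, start=1):
        # contigs may only contain A, G, C, T (no N)
        if set(seq) - set("AGCT"):
            raise ValueError(f"contig_{i} contains characters outside A, G, C, T")
        lines.append(f">contig_{i}")
        lines.append(seq)
    return "\n".join(lines) + "\n"
```

Numbering the contigs from 1 gives each one a unique id without any extra bookkeeping.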

4 Evaluation Criteria

We will be using blat [5] to map your contigs back to the reference genome. The quality of your output will be evaluated using a combination of the coverage and the N50 contig length.

The coverage of a set of contigs is the percentage of the reference genome that is covered by contigs.

The N50 contig length is defined as follows: sort the contigs that can be mapped back to the reference genome in decreasing length and map them back to the reference genome one at a time. The contig that causes the coverage to go over 50% is the N50 contig. The N50 contig length is the length of the N50 contig.
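The two definitions above can be illustrated with a short Python sketch. It is not the grading program: it assumes each valid contig has already been mapped to a single gapless interval (start, end) on the reference genome, with end exclusive, and it takes the contig length to be end − start. The function name coverage_and_n50 is hypothetical.

```python
from typing import List, Optional, Tuple


def coverage_and_n50(placements: List[Tuple[int, int]],
                     genome_length: int) -> Tuple[float, Optional[int]]:
    """Return (coverage in percent, N50 contig length or None if 50% is never exceeded)."""
    covered = bytearray(genome_length)  # covered[i] becomes 1 once base i is covered
    total = 0                           # number of reference bases covered so far
    n50 = None
    # map contigs back one at a time, longest first
    for start, end in sorted(placements, key=lambda p: p[1] - p[0], reverse=True):
        for i in range(start, end):
            if not covered[i]:
                covered[i] = 1
                total += 1
        # the first contig that pushes coverage over 50% is the N50 contig
        if n50 is None and 2 * total > genome_length:
            n50 = end - start
    return 100.0 * total / genome_length, n50
```

For example, with a genome of 100 bases and contigs placed at (0, 40), (30, 60) and (90, 100), the second contig raises the covered bases from 40 to 60, so the N50 contig length is 30 and the final coverage is 70%.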

For the purpose of this project, we came up with a score function that combines different measures of contig quality. Take note that this is done solely for the sake of grading your solution in an objective way. In actual research, algorithms are typically compared using each metric separately.

The score of a set of contigs is computed using the following formula

… + max(log₂(… l × 100 …), 0)

where

m is the number of bases in the reference genome covered by some contig
n is the number of bases in the reference genome
b is the total number of bases in all contigs
l is the number of bases in the N50 contig (N50 contig length)

We have developed a grading program that computes the score of a set of contigs given your contigs and the output of blat. You can download the program from here. This program is also used on the DNSRA Challenge site.

5 The Benchmark DNSRA Datasets

An important aspect of the DIMACS Challenges is a publicly available set of problem instances, which makes it possible to directly compare results. Likewise, the problem instances for the DNSRA Challenge are made available on the course website.

5.1 Testing

Target DNA sequence has 840 bases and there are no read errors.

5.2 Small

Target DNA sequence has 3661 bases and there are no read errors.

5.3 Medium

Target DNA sequence has 79,745 bases.

Coverage	Percentage error per base	(d, w)	Link (gzipped)

50x	0%	(5001, 50)	`sim/reads201.fna.gz`
50x	1%	(5001, 50)	`sim/reads202.fna.gz`
50x	1%	(1001, 10)	`sim/reads203.fna.gz`
100x	1%	(1001, 10)	`sim/reads204.fna.gz`
20x	2%	(1001, 10)	`sim/reads205.fna.gz`
20x	2%	(5001, 50)	`sim/reads206.fna.gz`

Target DNA sequence has 117,080 bases.

Coverage	Percentage error per base	(d, w)	Link (gzipped)

50x	0%	(5001, 50)	`sim/reads301.fna.gz`
50x	1%	(5001, 50)	`sim/reads302.fna.gz`
50x	1%	(1001, 10)	`sim/reads303.fna.gz`
100x	1%	(1001, 10)	`sim/reads304.fna.gz`
20x	2%	(1001, 10)	`sim/reads305.fna.gz`
20x	2%	(5001, 50)	`sim/reads306.fna.gz`

5.4 Large

Target DNA sequence has 615,980 bases.

Coverage	Percentage error per base	(d, w)	Link (gzipped)

50x	0%	(10001, 100)	`sim/reads401.fna.gz`
50x	1%	(10001, 100)	`sim/reads402.fna.gz`
50x	1%	(5001, 50)	`sim/reads403.fna.gz`
100x	1%	(5001, 50)	`sim/reads404.fna.gz`
20x	2%	(5001, 50)	`sim/reads405.fna.gz`
20x	2%	(10001, 100)	`sim/reads406.fna.gz`

Target DNA sequence has 1,553,927 bases.

Coverage	Percentage error per base	(d, w)	Link (gzipped)

50x	0%	(10001, 100)	`sim/reads501.fna.gz`
50x	1%	(10001, 100)	`sim/reads502.fna.gz`
50x	1%	(5001, 50)	`sim/reads503.fna.gz`
100x	1%	(5001, 50)	`sim/reads504.fna.gz`
20x	2%	(5001, 50)	`sim/reads505.fna.gz`
20x	2%	(10001, 100)	`sim/reads506.fna.gz`

5.5 Hidden

In addition to the publicly available datasets posted here, we may also make use of an additional hidden set of test data to evaluate your programs.

6 Source Codes and Resources

SSAKE [7] (Perl)
SHARCGS [3] (Perl)
VCAKE [4] (C)
Velvet [8] (C)
QSRA [2] (C++)

7 What You Have to Do

You will be assigned to teams of two students according to the following table. The pairing was done by Hon Wai with the assistance of a random number generator in Excel.

Team id	Members	Heuristics for M1 (see Section 8.1)

T01	VO HOANG TAM, SHEN ZHONG	Q1, B1
T02	CHEN LIANG, RIKKY WENANG PURBOJATI	Q1, B2
T03	SUJIT MATHEW, ZHAO FENG	Q2, B1
T04	ZHOU YE, ATTILA PERESZLENYI	Q2, B2
T05	PENGHUI YAO, LANVIN PIERRE CYRIL	Q1, B1
T06	PILLARISETTI JAIDEV, LU XUESONG, WANG SUYUN	Q1, B2
T07	ZHAO ZHENWEI, TIAN ZHENGMIAO	Q2, B1
T08	MEHMET ERDOGAN, GUO WENYUAN	Q2, B2
T09	CAO THANH TUNG, YING SHANSHAN	Q1, B1
T10	VU THUY HUONG, ZHANG HAOJUN	Q1, B2
T11	VENUGOPAL NAVANEETHAN, WANG YUYI	Q2, B1
T12	LIM JING QUAN, WANG ZIRUI	Q2, B2
T13	NGUYEN HOANG MINH DUNG, WANG XIAOLI	Q1, B1
T15	BAI HAOYU, KRISHNAMOORTHY SHYAAMKUMAAR	Q2, B1
T16	KOH CHUEN HOA, ZHOU ZHENGLONG	Q2, B2
T17	FOO CEXIN LEWIS, DENG XIAOXIA	Q1, B1
T18	LI BOWEN, LI XIAOHUI	Q1, B2
T19	KO WEILIANG WILLIAM, TRAN QUOC TUAN	Q2, B1
T20	NIE LIQIANG, LIM YONG SAN GILBERT	Q2, B2
T21	NGUYEN PHUC KHANH LUAN, CHEN YINGCHAO	Q1, B1
T22	LIU CHEN, ZHANG HAIMO	Q1, B2
T23	GHO ZHENGHENG, PRANG ANTOINE GABRIEL GUY	Q2, B1
T24	SHUBHABRATA SEN, WANG WEI	Q2, B2
T25	KAZI RUBAIAT HABIB (TARIN), LU YING	Q1, B1
T26	SRIGANESH MANIGANAHALLI SRIHARI, WANG XUANCONG	Q1, B2
T27	SUCHEENDRA KUMAR PALANIAPPAN, XIONG FEI	Q2, B1
T28	WEI XUELIANG, TAN TIAN SHENG ALEX	Q2, B2
T29	XIAO QIAN, MAI DANG QUANG HUNG	Q1, B1

To help you achieve your project goal of implementing your best algorithm for DNSRA, we have set three milestones for the project:

M1: Basic algorithm: You will implement a simple greedy assembly algorithm to gain some familiarity with the DNSRA problem and the input/output format.
M2: Your proposed DNSRA algorithm: For this milestone, you will first do suitable literature research. You should select a solution approach and work out some of the technical details for your proposed DNSRA algorithm. Then write out a solution proposal for your DNSRA algorithm.
M3: Your very own DNSRA algorithm: Finally, you will implement your proposed DNSRA algorithm. Remember that you will need to make a lot of refinements and improvements to your algorithm after you see the experimental results on the benchmark datasets. So start early.

8 Your DNSRA Project Deliverables

8.1 For Milestone M1 (due 06 Oct)

The following is a generic greedy algorithm [6] for the DNSRA problem.

sort the reads based on "read quality"
consider each unused read, r, in decreasing order of quality
    mark r as used
    let the current contig, c, be r
    while (current contig can be extended)
        find another unused read, r', that "best" extends the current contig
        mark r' as used
        extend c with r'

We can get different algorithms by using different definitions of read quality and the best read that extends the current contig.

Two possible definitions of read quality are:

number of occurrences of the read in the input (Q1)
number of other reads it overlaps with, where we define two reads to overlap if they share a substring of l−1 bases (Q2)
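One possible reading of Q1 and Q2 is sketched below in Python. The function names are our own, and the sketch makes two assumptions that the definitions leave open: all reads have the same length l, and an identical copy of a read counts as an "other read" for Q2.

```python
from collections import Counter, defaultdict
from typing import List


def quality_q1(reads: List[str]) -> List[int]:
    """Q1: number of occurrences of each read in the input."""
    counts = Counter(reads)
    return [counts[r] for r in reads]


def quality_q2(reads: List[str]) -> List[int]:
    """Q2: number of other reads sharing a substring of l-1 bases with each read."""
    # a read of length l has exactly two substrings of length l-1
    index = defaultdict(set)
    for i, r in enumerate(reads):
        for piece in (r[:-1], r[1:]):
            index[piece].add(i)
    scores = []
    for i, r in enumerate(reads):
        neighbours = index[r[:-1]] | index[r[1:]]
        scores.append(len(neighbours - {i}))  # do not count the read itself
    return scores
```

Indexing reads by their two substrings of length l−1 avoids comparing every pair of reads directly.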

Two possible definitions of the best read which extends the current contig are:

maximum length of overlap between the current contig and the read (B1)
minimum edit distance between some prefix/suffix of the current contig and the read, where all operations have cost 1 (B2)
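The following Python sketch shows one way B1 and B2 could be computed when extending a contig to the right. Both functions are illustrations under our own assumptions: B1 only considers overlaps that leave at least one new base to append, and B2 is read literally as the smallest unit-cost edit distance between the whole read and any suffix of the contig.

```python
def overlap_b1(contig: str, read: str, min_overlap: int = 1) -> int:
    """B1: longest suffix of contig equal to a proper prefix of read (0 if none)."""
    for k in range(min(len(contig), len(read) - 1), min_overlap - 1, -1):
        if contig.endswith(read[:k]):
            return k
    return 0


def edit_distance_b2(contig: str, read: str) -> int:
    """B2: minimum unit-cost edit distance between read and some suffix of contig."""
    # Reversing both strings turns "some suffix of contig" into "some prefix",
    # which is the minimum over the last row of the standard DP table.
    a = read[::-1]
    b = contig[-2 * len(read):][::-1]  # longer suffixes cannot beat the empty one
    prev = list(range(len(b) + 1))     # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return min(prev)
```

For instance, overlap_b1("AACGT", "CGTTA") is 3 because the contig ends with CGT, and edit_distance_b2("TTTACGT", "ACCT") is 1 because the suffix ACGT differs from the read by one substitution.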

You should implement your variation of the generic algorithm that is assigned to your team (see Section 7). For parts of the above algorithm which are not fully specified, you are free to come up with your own definitions.
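To make the generic algorithm concrete, here is a minimal Python sketch of one variation (Q1 with B1). It is not a reference solution. It fills the unspecified parts with our own assumptions: identical reads are merged into one, contigs are extended to the right only, an extension needs at least min_overlap matching bases, and pairing information and N symbols are ignored.

```python
from collections import Counter
from typing import List


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Greedy assembly using Q1 (occurrence count) and B1 (longest overlap)."""
    counts = Counter(reads)
    # distinct reads in decreasing order of quality (Q1)
    pool = sorted(counts, key=lambda r: -counts[r])
    used = set()
    contigs = []
    for seed in pool:
        if seed in used:
            continue
        used.add(seed)
        contig = seed
        while True:
            best_k, best_read = 0, None
            for r in pool:
                if r in used:
                    continue
                # longest proper prefix of r that matches the end of the contig
                for k in range(min(len(contig), len(r) - 1), min_overlap - 1, -1):
                    if contig.endswith(r[:k]):
                        if k > best_k:
                            best_k, best_read = k, r
                        break
            if best_read is None:
                break  # the contig cannot be extended any further
            used.add(best_read)
            contig += best_read[best_k:]
        contigs.append(contig)
    return contigs
```

Each extension step scans all unused reads, so this sketch is only practical for the testing and small datasets. An index from read prefixes to reads would be one way to speed up the search.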

Write a brief report on your greedy algorithm that describes some of your design choices and includes a table of the results obtained on the testing and small datasets. Your table of results should contain, at least, columns for the coverage, N50 contig length, and running time of your algorithm for each dataset. You may also want to try to draw some conclusions (good, bad, or ugly) about your assigned greedy heuristic algorithm.

Name your report DNSRA-M1-Rep-[Tnn].pdf where nn is your assigned team number.

Put your report under a new directory called "report".

To submit your DNSRA M1 deliverables, please do the following:

Create a README-GDYnn.txt file (where nn is your assigned team number) that contains instructions on how to compile and execute your programs, even if it contains just a simple "go to directory XXX; make;".
If you have included little scripts that help you run your algorithm multiple times, please include them, together with instructions on how to use them, in the README file.
Also make sure you include the solution files produced by your algorithm (these should be in a separate directory).
Include your report in the directory called "report" as described above.
Zip everything into a zip file called CS5206-DNSRA-M1-[Tnn].zip.
Submit this zip file to CS5206 IVLE workbin called "CS5206-DNSRA-M1". (Go to CS5206 IVLE Page here. Click "Workbin", View it, then Click "CS5206-DNSRA-M1" Folder. Submit there.)

8.2 For Milestone M2 (due 09 Oct)

Write up your solution proposal as a 2-page document that outlines your proposed algorithm (with enough detail to make it understandable for me). Name it DNSRA-M2-Prop-[Tnn].pdf and submit it to the CS5206 IVLE workbin called "CS5206-DNSRA-M2".

8.3 For Milestone M3 (due 10 Nov)

Your DNSRA Report – For Milestone M3, you should write a proper project report that should have (at least) the following sections:

Problem Statement: Explain the DNSRA problem.
Overview of Your Algorithm for M3: Explain the general idea(s) behind your DNSRA algorithm for M3. Give possible reasons why you think it should give good results for DNSRA.
Details of Your Algorithm for M3: Give the technical details of your algorithm here, together with information on how you implemented it. Clearly state any parameters you have in your algorithm and describe briefly how you chose their values.
Complexity Analysis: Analyse the time complexity of your algorithm.
Result and Observations: Results obtained by your M3 algorithm, observations, if any, and recommendations for better approaches.

Please put your report in your DNSRA code, under the directory "Report".
Again, include a README file to help us compile and run your code.
Then, zip everything into a zip file called CS5206-DNSRA-M3-Tnn.zip, for example CS5206-DNSRA-M3-T14.zip
Submit this zip file to CS5206 IVLE workbin called "CS5206-DNSRA-M3".
Submit your output to the challenge page – after you have run your M3 algorithm on the datasets, please submit to the DNSRA Challenge site. You are free to re-submit improved solutions as you get them (as many times as you like).

9 Grading of M3

We will use this grading sheet to grade your M3 submission.

We will also be running your programs on secret inputs during the Study Week.

If we have difficulties getting your program to run (compilation, run error, etc), you will be called back during Study Week to help us run OUR COPY OF YOUR SUBMITTED PROGRAM.

References

[1]: Ariyaratne, P. De novo genome assembly using paired-end short reads. Tech. rep., National University of Singapore, 2009.
[2]: Bryant, D., Wong, W., and Mockler, T. QSRA – a quality-value guided de novo short read assembler. BMC Bioinformatics 10, 1 (2009), 69.
[3]: Hernandez, D., François, P., Farinelli, L., Østerås, M., and Schrenzel, J. De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Research 18, 5 (2008), 802.
[4]: Jeck, W., Reinhardt, J., Baltrus, D., Hickenbotham, M., Magrini, V., Mardis, E., Dangl, J., and Jones, C. Extending assembly of short DNA sequences to handle error. Bioinformatics 23, 21 (2007), 2942.
[5]: Kent, W. BLAT - the BLAST-like alignment tool. Genome Research 12, 4 (2002), 656–664.
[6]: Pop, M. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 4 (2009), 354.
[7]: Warren, R., Sutton, G., Jones, S., and Holt, R. Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23, 4 (2007), 500.
[8]: Zerbino, D., and Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18, 5 (2008), 821.

10 FAQ

10.1 How to interpret blat output?

Recall that a contig is a contiguous sequence of the target DNA sequence formed by combining multiple reads together.

After running blat, the valid contigs are those that can be mapped back to the target DNA sequence. They must satisfy the following conditions:

have not been previously mapped to the target DNA sequence
length of the largest block is at least 90% of the entire contig
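The second condition can be checked with a one-line test once the aligned block sizes of a contig are known. The sketch below is our own illustration and not the grading program: it assumes the block sizes have already been extracted from blat's output for one alignment, and it does not handle the first condition.

```python
from typing import List


def largest_block_ok(block_sizes: List[int], contig_length: int) -> bool:
    """True if the largest aligned block covers at least 90% of the contig."""
    # integer arithmetic avoids rounding issues at exactly 90%
    return 10 * max(block_sizes, default=0) >= 9 * contig_length
```

For example, a contig of 100 bases aligned as blocks of 95 and 5 bases passes the check, while one aligned as two blocks of 50 bases does not.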

The requirements for valid contigs are necessarily strict as we are solving the de novo (target sequence is unknown) assembly problem. Invalid contigs will lead to incorrect information about the genome we are trying to assemble.

10.2 Can LEDA be used on tembusu?

Yes. See this post in the IVLE forum.

10.3 Is the length of each read fixed at 30?

Yes.

This document was translated from LaTeX by HeVeA.