 Technical Note
 Open Access
ArapanS: a fast and highly accurate whole-genome assembly software for viruses and small genomes
BMC Research Notes volume 5, Article number: 243 (2012)
Abstract
Background
Genome assembly is considered to be a challenging problem in computational biology, and has been studied extensively by many researchers. It is extremely difficult to build a general assembler that is able to reconstruct the original sequence instead of many contigs. However, we believe that creating specific assemblers, for solving specific cases, will be much more fruitful than creating general assemblers.
Findings
In this paper, we present ArapanS, a whole-genome assembly program dedicated to handling small genomes. In many cases it produces a single contig (along with the reverse complement of this contig). When a genome consists of several segments, the implemented algorithm can detect all of them, as we demonstrate for Influenza A virus. ArapanS is based on the de Bruijn graph. We have implemented a fast method that reconstructs the original sequence and discards erroneous k-mers. The method explores the graph by following neither the shortest nor the longest path, but rather a specific and reliable path selected by coverage level or by k-mer length. ArapanS uses short reads, and it was tested on raw data downloaded from the NCBI Trace Archive.
Conclusions
Our findings show that the accuracy of the assembly was very high; the result was checked against the European Bioinformatics Institute (EBI) database using the NCBI BLAST Sequence Similarity Search. The identity and the genome coverage were both more than 99%. We also compared the efficiency of ArapanS with other well-known assemblers. In dealing with small genomes, the accuracy of ArapanS is significantly higher than that of the other assemblers. The assembly process is very fast and requires only a few seconds.
ArapanS is available for free to the public. The binary files for ArapanS are available through http://sourceforge.net/projects/dnascissor/files/.
Background
Sequencing technologies have been providing us with thousands of sets of genomic reads (sometimes called fragments or segments), with each set being taken from a specific genome. Bringing these reads together in order to reconstruct the original sequence (the genome) is commonly known as the (whole) genome assembly problem. This problem has been studied extensively and many assemblers, along with some assembly models, have been proposed. Most models are based either on the overlap graph approach or on the de Bruijn graph-based approach. The overlap graph is a graph whose nodes represent the genomic reads, while its edges correspond to the overlaps between these reads. It was the basis of the first assemblers, such as TIGR [1], CAP3 [2], PCAP [3], the string graph of Myers [4] and MIRA [5]. The second category of assemblers is based on the de Bruijn graph, in which the nodes represent the substrings (k-mers) of the genomic reads (all of the same length), while the edges correspond to the overlaps between these substrings. The de Bruijn graph has become the standard basis of the so-called “de novo” assemblers. Assemblers based on this approach include the Euler assembler [6], SSAKE [7], EULER-SR [8], Velvet [9, 10], ALLPATHS [11, 12], ABySS [13], and SOAPdenovo [14]. Although these assemblers share the same graph structure, they use different (but sometimes similar) algorithms to walk through the graph. To our knowledge, there is no proof that the shortest or the longest path, or the Hamiltonian or Eulerian paths, will represent the genome in its natural form; therefore, we developed an algorithm that selects only the reliable nodes in the de Bruijn graph in order to reconstruct the original sequence of small genomes, or long contigs when the graph is sparse.
Because of the diversity of genomes, creating a general assembler that is able to solve all cases will not be as effective and fast as a specific assembler that focuses on solving particular cases. For instance, ploidy can be a serious problem when dealing with plant genomes in which tetraploidy is common. Concerning very small genomes, we believe that we can improve the accuracy of assembly of such genomes by creating an assembler that is devoted to solving small genomes. That is the reason we aimed to create an assembler (named ArapanS) dedicated to solving small genomes. As a result, the ArapanS assembler was able to reconstruct one very highly accurate supercontig in most cases. To check the accuracy of ArapanS, we performed a BLAST sequence similarity search against the EBI (European Bioinformatics Institute) database, which includes the complete genomes of our dataset. This analysis showed that the ArapanS assemblies were more than 99% accurate. We also compared ArapanS with other well known assemblers in the assembly of viral genomes.
Findings
ArapanS parameters
ArapanS was written in C/C++ using the Qt programming framework on a 64-bit Linux machine and was also compiled on Windows. Each line of the input data must contain one k-mer (i.e. de Bruijn sequence) followed by its frequency, separated by a whitespace character. Note that all frequency values of the generated k-mers depend on the coverage level of the dataset; in other words, we have used these frequency values instead of the coverage value. A tool called kmerBuilder, which is one of several assembly pipelines included in the Arapan software package, can generate k-mer files for ArapanS (i.e. the dataset must be prepared independently from our assembler). The project acronym (Arapan) represents our primary goal to produce a software system that includes a set of open-source tools dedicated to solving and analyzing the whole-genome assembly problem.
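The input format described above can be illustrated with a small reader. This is only a sketch under stated assumptions: the text specifies one k-mer and its frequency per line separated by whitespace, while the function name `parse_kmer_lines`, the handling of blank lines and the summing of repeated k-mers are hypothetical choices, not the behaviour of ArapanS or kmerBuilder.

```python
from typing import Dict, Iterable


def parse_kmer_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Parse lines of the form '<k-mer> <frequency>' into a hash table."""
    kmers: Dict[str, int] = {}
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 2:
            raise ValueError(f"expected '<k-mer> <frequency>', got: {raw!r}")
        kmer, freq = fields[0].upper(), int(fields[1])
        if "N" in kmer:
            continue  # k-mers with unknown 'N' bases are discarded (see Methods)
        # assumption: a k-mer listed twice has its frequencies summed
        kmers[kmer] = kmers.get(kmer, 0) + freq
    return kmers


example = ["ACGTA 12", "CGTAC 11", "GTNCA 1", ""]
table = parse_kmer_lines(example)
```

Reading a real file would amount to `parse_kmer_lines(open(path))`, where `path` points to a k-mer file prepared beforehand.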
The ArapanS assembler is very sensitive to the k-mer length k, and because of its architecture our tool always tries to find one supercontig along with its reverse complement. If k is very short, ArapanS will encounter difficulties in reconstructing the original sequence; if k is very long, the result of the assembly will not be significant. There is always a trade-off between specificity and sensitivity when choosing k. In our experiments, the most appropriate values were 20 ≤ k ≤ 35.
ArapanS has only one parameter, which is the merging function: the frequency function or the k-mer length function. The graphical user interface of ArapanS represents this parameter by a checkbox. During the experiments, it was preferable to choose the frequency function, since it usually leads to a more accurate result. We have considered the frequency function to be the only objective function in our experiments.
BLAST similarity search
We downloaded some real datasets from the NCBI Trace Archive (ftp://ftp.ncbi.nih.gov/pub/TraceDB/). The data were cleaned and prepared by a trimming tool (http://sourceforge.net/projects/dnascissor/files/DNA%20Scissor/). A minimum quality value cutoff of 20 (i.e. a base-call accuracy of 99%) was set for most of the genomes, and the low-quality end regions were trimmed at the 5′ end and 3′ end of every read. The short reads (k-mers) were generated by the same trimming tool for each set of reads. The ArapanS assembler was very fast, used little memory and provided us with one supercontig along with its reverse complement in many cases. To check the accuracy of our assembler, we searched for the obtained supercontigs (the complete genome) in the EBI database using the NCBI BLAST Similarity Search. The input data are given in Table 1, while Table 2, Table 3, Table 4 and Table 5 show the results.
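The link between a quality cutoff of 20 and 99% accuracy follows from the standard Phred definition Q = −10 log10(P), where P is the probability that the base call is wrong. The sketch below shows that conversion together with a very simple end-trimming rule. The trimming function is an illustrative assumption only; the text does not describe the actual algorithm of the trimming tool, and the names used here are hypothetical.

```python
from typing import Sequence


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality value into a base-call error probability."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(seq: str, quals: Sequence[int], cutoff: int = 20) -> str:
    """Drop bases below `cutoff` from the 5' and 3' ends of a read.

    Assumption: only the ends are trimmed; low-quality bases in the
    interior of the read are left untouched.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < cutoff:
        start += 1
    while end > start and quals[end - 1] < cutoff:
        end -= 1
    return seq[start:end]


error_at_q20 = phred_to_error_probability(20)  # 0.01, i.e. 99% accuracy
trimmed = trim_low_quality_ends("ACGTAC", [5, 30, 30, 30, 10, 3])
```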
The total length of each genome was very close to the genome length obtained from the EBI database, and yielded very high identities (Table 2). Moreover, to show the robustness of ArapanS, we compared its results to other well-known assemblers: ABySS 1.2.7 [13], SSAKE 3.7 [7], Velvet 1.1.3 [9, 10] and QSRA [15]. The Overlap-Layout-Consensus-based assemblers that were included for comparison were Minimus [16] and Mira [5, 17]. The selected version of each assembler was the latest release, except for the SSAKE assembler, for which we chose the release SSAKE 3.7 instead of SSAKE 3.8 because of installation problems. All assemblers were run with default parameters.
Comparison
Because of its architecture (de Bruijn graph), ArapanS is classified as a de novo assembler. However, since our datasets are Sanger reads, we compared our assembler with de novo assemblers and also with Overlap-Layout-Consensus assemblers. Note that the current version of the QSRA assembler is not able to deal with different read lengths. To solve this problem we used our tool, kmerBuilder, which is also in the Arapan package, to generate reads of the same length (200 bp for QSRA) from shotgun data.
De novo assembler competitors
Concerning the de novo assemblers, the most competitive assembler to ArapanS was ABySS (Table 3). As with ArapanS, ABySS was also able to produce only one supercontig for the Bovine Respiratory Coronavirus AH187 genome and the Waterbuck Coronavirus US/OH-WD358-TC/1994 genome. However, in contrast to ABySS, ArapanS achieved the greatest genome coverage and only one supercontig in all cases. Since ArapanS generated only one contig in all cases, it produced the largest contigs compared to other assemblers. In contrast, the other assemblers generated more contigs, and SSAKE had the lowest genome coverage every time and more contigs most of the time. QSRA also did not work well with small genomes.
The Influenza A Virus genome consists of eight segments (http://bioafrica.mrc.ac.za/rnavirusdb/virus.php?id=335341). Table 4 shows that ArapanS was able to detect the eight contigs of different genomes of type Influenza A Virus. According to our empirical results, SSAKE failed to deal with small viral genomes. N50 values of SSAKE were not computed because its results did not cover half of the entire genome. ABySS was again the second best assembler after ArapanS. However, our assembler succeeded in determining the eight segments of each genome, such that its N50 values, as well as the largest contig, were always the highest compared to other assemblers.
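For readers unfamiliar with the N50 statistic used above, the following sketch computes it from a list of contig lengths. The optional `genome_length` argument is an assumption made to mirror the remark that N50 could not be computed when the contigs covered less than half of the genome; measuring against the genome length rather than the assembly length is sometimes called NG50, and the text does not state which convention was used.

```python
from typing import Optional, Sequence


def n50(contig_lengths: Sequence[int], genome_length: Optional[int] = None) -> Optional[int]:
    """Return the length of the contig at which the running total of
    contig lengths (largest first) reaches half of the reference total.

    The reference total is `genome_length` when given, otherwise the sum
    of all contig lengths. Returns None if half is never reached.
    """
    total = genome_length if genome_length is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None


assembly_n50 = n50([8, 4, 3, 2, 1])
uncovered = n50([100, 50], genome_length=1000)
```

In the second call the contigs sum to 150 bases against a 1000-base genome, so no value is returned, which corresponds to the SSAKE situation described in the text.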
Overlap-layout-consensus competitors
Among the Overlap-Layout-Consensus-based assemblers, ArapanS was comparable to Minimus. Minimus failed in one case, Influenza A Virus A/Memphis/1/71(H3N2), in which it produced nine contigs instead of eight (Table 4). Our assembler showed a good approximation compared to Minimus for the Antelope coronavirus US/OH1/2003 genome (Table 5). The two assemblers achieved almost the same result for the Waterbuck Coronavirus US/OH-WD358-TC/1994 and the White-tailed Deer Coronavirus US/OH-WD470/1994 genomes. On the other hand, Mira did not work well with small genomes, as shown in Tables 3, 4 and 5.
Discussion
We have relied on only one objective function, “the frequency function”, for the sequence assembly algorithm. One may also consider another function, “the k-mer length function”, $g(L)=\sum_{i=1}^{N} a_i l_i$, where $L=\{l_1, l_2, \ldots, l_N\}$ is the set of k-mer lengths. This function is based on the assumption that nodes whose k-mers are longer are more likely to have been generated from trustworthy consecutive nodes, that is to say, from a chain that has few or no sequencing errors. However, we have considered only the frequency function in the analysis presented here.
In the case of non-uniform coverage of some areas in the genome [18], the frequency function may be less accurate. On the other hand, we believe that the k-mer length function can be a good choice when coverage is non-uniform. Building an algorithm that combines the two objective functions and switches from one to the other may lead to more accurate results. Creating such an algorithm is an important issue for future research.
Another point about the objective function is that the assembly algorithm does not look for the optimal solution. The algorithm starts at the node whose associated k-mer is the longest, then moves forward and backward in the graph, locally selecting the nodes that have the highest scores (greatest frequency values), in order to construct a contiguous path in a given connected component.
We have noticed that most genome assemblers, which were built for tackling medium or large genomes, could not successfully deal with tiny and small genomes. ArapanS, ABySS and Minimus were able to deal with such cases. In future work a comparison would be worthwhile for all genome assemblers to determine the efficiency field of each set of assemblers.
Since our aim was creating a genome assembler for tackling only tiny genomes, dealing with repeats was not an essential task, since they do not regularly appear in very small genomes and the confrontation with tandem repeats does not generally mislead the assembly process (according to our experience). However, in the future, we aim to build another version of the ArapanS assembler that can handle longer genomes.
Conclusions
According to our experiments, we have found that general assemblers are not always as effective as the ArapanS assembler in dealing with tiny genomes. We have used only long reads in our experiments, because the raw data of small genomes can be easily found in the NCBI Trace Archive. However, our assembler can work with any other sequencing technology, such as Illumina/Solexa, SOLiD and 454 sequencing technologies. The raw data are converted into a set of k-mers by kmerBuilder (http://sourceforge.net/projects/dnascissor/files/kmerBuilder/). The user can run the ArapanS assembler by providing it with the k-mer file. This feature represents another advantage of our assembler compared to other assemblers. ArapanS is fast and uses little memory. However, because we are dealing with small genomes, the time and space complexities of all assemblers were negligible. Our assembler is not designed to be applied to medium or large genomes.
Methods
The assembly process consists of four major phases. In the first phase, the de Bruijn graph is straightforwardly constructed. The second phase (called the cleaning process) is a very important step in which the graph is simplified as much as possible by collapsing paths, removing tips and solving bubbles, as well as handling a few other different structures in the graph. In the third phase the graph components are detected before starting the assembly algorithm in the fourth step.
Our algorithm differs from previous works in the following ways:
1. The cleaning process simplifies the graph in a few iterations without incorporating time-consuming algorithms, such as the Dijkstra-like breadth-first search in Velvet [9, 10] and the Dijkstra algorithm in SOAPdenovo [14].
2. An algorithm was created to solve only simple bubbles (Figure 1), but by involving the other procedures (i.e. path collapsing, tip removal, etc.) all complex bubbles are solved after a few iterations of the cleaning algorithm.
3. The assembly algorithm uses the frequency values and lengths of k-mers in order to construct contigs, as will be described below.
Most de novo assemblers focus on solving large genomes; this involves implementing time-consuming and very complicated algorithms. As a result, the construction of contigs becomes stricter, though this is not the case for small genomes, as shown in the results section.
Input data and graph construction
The entire dataset of k-mers is recorded using hash tables in order to speed up further operations. The reverse complements are also recorded without binding them to their original k-mers. All we need is a linear algorithm for constructing the de Bruijn graph. Since the alphabet is composed of four nucleotide letters, each k-mer will be connected to four k-mers at most in each direction. All k-mers that include unknown ‘N’ nucleotides are discarded. The pseudocode of the algorithm is shown below:

deBruijnGraphBuilder( HashTable kmerList, Integer K )
    Integer N := |kmerList|;   // the size of kmerList
    String prefix, suffix;
    for i := 1 to N do
    begin
        prefix := kmerList[i][1..K−1];
        suffix := kmerList[i][2..K];
        // forward connection
        if suffix+“A” ∈ kmerList then createArc( i, kmerList.IndexOf(suffix+“A”) );
        if suffix+“T” ∈ kmerList then createArc( i, kmerList.IndexOf(suffix+“T”) );
        if suffix+“C” ∈ kmerList then createArc( i, kmerList.IndexOf(suffix+“C”) );
        if suffix+“G” ∈ kmerList then createArc( i, kmerList.IndexOf(suffix+“G”) );
        // backward connection
        if “A”+prefix ∈ kmerList then createArc( kmerList.IndexOf(“A”+prefix), i );
        if “T”+prefix ∈ kmerList then createArc( kmerList.IndexOf(“T”+prefix), i );
        if “C”+prefix ∈ kmerList then createArc( kmerList.IndexOf(“C”+prefix), i );
        if “G”+prefix ∈ kmerList then createArc( kmerList.IndexOf(“G”+prefix), i );
    end
Let K be the length of the short reads (k-mers). The variable prefix contains the first K − 1 characters of a given k-mer, and the variable suffix contains its last K − 1 characters. The algorithm computes the out-neighbours in the forward orientation by appending a nucleotide to the suffix, and the in-neighbours in the opposite direction by prepending a nucleotide to the prefix.
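As a concrete illustration of the construction, the following sketch builds the successor sets of a small k-mer collection and also shows the reverse-complement operation mentioned earlier. It is a minimal sketch, not the ArapanS implementation: the names `build_de_bruijn_graph` and `reverse_complement` are hypothetical, the graph is stored as a dictionary of successor sets, and only forward arcs are stored because each backward arc is the mirror image of a forward arc.

```python
from typing import Dict, Iterable, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_de_bruijn_graph(kmers: Iterable[str]) -> Dict[str, Set[str]]:
    """Map every k-mer to the set of k-mers that overlap it by K - 1 bases."""
    nodes = {k for k in kmers if "N" not in k}  # discard k-mers with unknown bases
    graph: Dict[str, Set[str]] = {k: set() for k in nodes}
    for kmer in nodes:
        suffix = kmer[1:]  # last K - 1 characters
        for base in "ACGT":
            candidate = suffix + base
            if candidate in nodes:  # constant-time hash lookup
                graph[kmer].add(candidate)
    return graph


# 3-mers of the toy sequence ATGGC
graph = build_de_bruijn_graph(["ATG", "TGG", "GGC"])
rc = reverse_complement("ATGGC")
```

Each k-mer triggers at most four lookups, so the construction is linear in the number of k-mers, as stated in the text.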
Cleaning process (simplifying the graph and solving errors)
The raw DNA data always suffer from errors, and since the de Bruijn graph is based on the exact matching of k-mers, error correction (or removal) becomes very important to the use of such graphs in representing and analyzing sequencing data. The coverage plays a vital role in guiding the cleaning and assembly algorithms to a more accurate result. After constructing the graph, some erroneous k-mers appear in the graph in different forms. The most common forms are the so-called “Tips, Bubbles and Chimeric connections”. However, while analyzing the graph, we found other forms as well. We have implemented an iterative algorithm that reduces the graph to its maximum simplification. The pseudocode of the algorithm is shown below and its flowchart is given in Figure 2.

cleaningAlgorithm()
    Boolean col, bub, intip, outip, less, great;
begin
    do
        col := collapsePaths();
        bub := solveBubbles();
        if col == false and bub == false then
        begin
            intip := removeInTips();
            outip := removeOutTips();
            less := removeLessMarkTips();
            great := removeGreatMarkTips();
            if intip == false and outip == false and less == false and great == false then stop;
        end
    while (true)
    removeSingletons();
end
The collapsePaths() procedure will return false if it does not collapse any path, otherwise, it returns true. The other procedures behave exactly as collapsePaths() does. We will hereafter explain each procedure invoked by the cleaning algorithm.
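The control flow of the cleaning algorithm can be sketched as a loop over procedures that each report whether they changed the graph. This is a sketch of the loop structure only: the procedures are passed in as callables, and the scripted stand-ins in the example are hypothetical placeholders that merely record the order of calls.

```python
from typing import Callable, List, Sequence


def cleaning_loop(
    collapse_paths: Callable[[], bool],
    solve_bubbles: Callable[[], bool],
    tip_removers: Sequence[Callable[[], bool]],
    remove_singletons: Callable[[], None],
) -> None:
    """Repeat collapsing and bubble solving; remove tips only when both
    report no change; stop when the tip removers change nothing either."""
    while True:
        col = collapse_paths()
        bub = solve_bubbles()
        if not col and not bub:
            # run every tip remover (a list, so no short-circuiting)
            removed = [remover() for remover in tip_removers]
            if not any(removed):
                break
    remove_singletons()


# Scripted stand-ins that log the call order.
log: List[str] = []
col_results = iter([True, False, False])
tip_results = iter([True, False])


def fake_collapse() -> bool:
    log.append("col")
    return next(col_results)


def fake_bubbles() -> bool:
    log.append("bub")
    return False


def fake_tips() -> bool:
    log.append("tip")
    return next(tip_results)


cleaning_loop(fake_collapse, fake_bubbles, [fake_tips], lambda: log.append("single"))
```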
Path collapsing
To simplify and shrink the graph before applying any cleaning procedure, a path collapsing algorithm should be run immediately after constructing the graph.
A path is a chain of nodes. Two nodes X and Y are merged if node X has only one outgoing arc, which leads to node Y, and node Y has only one incoming arc. Their corresponding k-mers are concatenated accordingly, so that the overlapping part appears only once. Most of the resulting nodes (we call them switch nodes) are shown in Figure 3.
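The merging rule can be sketched as follows, under some assumptions that the text does not spell out: nodes are integer identifiers, `seq` maps a node to its sequence, `succ` maps a node to its set of successors, adjacent sequences overlap by k − 1 bases, and the function name `collapse_paths` is hypothetical. How frequencies are combined when nodes merge is not described in the text, so frequencies are left out.

```python
from typing import Dict, Set


def collapse_paths(seq: Dict[int, str], succ: Dict[int, Set[int]], k: int) -> bool:
    """Merge X -> Y in place whenever X has exactly one outgoing arc and
    Y has exactly one incoming arc. Returns True if any merge happened."""
    pred: Dict[int, Set[int]] = {n: set() for n in seq}
    for x, outs in succ.items():
        for y in outs:
            pred[y].add(x)

    changed = False
    merged = True
    while merged:  # repeat until a full pass merges nothing
        merged = False
        for x in list(seq):
            if x not in seq or len(succ[x]) != 1:
                continue
            (y,) = succ[x]
            if y == x or len(pred[y]) != 1:
                continue
            # Absorb y into x: append the part of y beyond the overlap.
            seq[x] += seq[y][k - 1:]
            succ[x] = succ.pop(y)
            for z in succ[x]:
                pred[z].discard(y)
                pred[z].add(x)
            del seq[y], pred[y]
            merged = changed = True
    return changed


# Chain ATG -> TGG -> GGC (k = 3) collapses into the single node ATGGC.
seq = {0: "ATG", 1: "TGG", 2: "GGC"}
succ = {0: {1}, 1: {2}, 2: set()}
did_merge = collapse_paths(seq, succ, k=3)
```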
Bubble solving
In genome assembly, a bubble appears where two sequences initially align, then diverge in the middle, and align again at the end. Bubbles are caused by repeats or heterozygotes of diploid chromosomes [14], or created by errors or biological variants, such as SNPs, diploids or cloning artefacts prior to sequencing.
A path is a chain of nodes in a graph. We call a path a simple path if each internal node (i.e., each node between the start node and the end node of the path) has one outgoing edge and one incoming edge. A bubble is a subgraph that consists of multiple simple paths all of which share the same start node and the same end node. In the original graph, the start node must not have any outgoing edges other than those in the bubble, and the end node must not have any incoming edges other than those in the bubble.
In Velvet [9, 10], detection of bubbles was done by an algorithm based on a Dijkstra-like breadth-first search called “The Tour Bus Algorithm”. Similarly, Dijkstra’s algorithm is also used to detect bubbles in SOAPdenovo [14], in which the detected bubbles are merged into a single path if the sequences of the parallel paths are very similar; that is, if they differ by fewer than four base pairs and have more than 90% identity.
In ArapanS, all bubbles are resolved by combining all the cleaning procedures, without incorporating a time-consuming algorithm. After collapsing all paths, bubbles appear in the graph as shown in Figure 1. The node with the highest coverage is kept and the other nodes of the bubble are removed (the algorithm can also be parameterized to keep the node that has the maximum k-mer length instead of the highest coverage).
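A simple bubble, in the sense used here, can be resolved as in the sketch below. The representation is an assumption (integer node identifiers, `succ` and `pred` as dictionaries of sets, `coverage` as a dictionary), the function names are hypothetical, and a simple bubble is taken to be a start node whose branches are single nodes that all lead to the same end node, as they would after path collapsing.

```python
from typing import Dict, Set


def remove_node(node: int, succ: Dict[int, Set[int]], pred: Dict[int, Set[int]]) -> None:
    """Detach a node from its neighbours and delete it from the graph."""
    for p in pred[node]:
        succ[p].discard(node)
    for s in succ[node]:
        pred[s].discard(node)
    del succ[node], pred[node]


def solve_simple_bubbles(
    succ: Dict[int, Set[int]], pred: Dict[int, Set[int]], coverage: Dict[int, int]
) -> bool:
    """Keep only the highest-coverage branch of every simple bubble.
    Returns True if at least one bubble was resolved."""
    changed = False
    for start in list(succ):
        if start not in succ:
            continue  # already removed as a branch of an earlier bubble
        branches = set(succ[start])
        if len(branches) < 2:
            continue
        # Each branch is entered only from `start` and has one outgoing arc.
        if any(pred[b] != {start} or len(succ[b]) != 1 for b in branches):
            continue
        ends = set().union(*(succ[b] for b in branches))
        if len(ends) != 1:
            continue
        (end,) = ends
        # The end node must have no incoming arcs from outside the bubble.
        if end == start or pred[end] != branches:
            continue
        keep = max(branches, key=lambda n: coverage[n])
        for b in branches - {keep}:
            remove_node(b, succ, pred)
        changed = True
    return changed


# Bubble 0 -> {1, 2} -> 3, where branch 1 has the higher coverage.
succ = {0: {1, 2}, 1: {3}, 2: {3}, 3: set()}
pred = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}}
coverage = {0: 9, 1: 10, 2: 2, 3: 9}
resolved = solve_simple_bubbles(succ, pred, coverage)
```

Replacing `coverage` by a dictionary of sequence lengths would give the k-mer length variant mentioned in the text.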
Tips removal
Tips generally result from errors at the end of reads. In the graph, a tip is a node connected only on one end (Figure 4). In Velvet, a tip is removed if it is shorter than 2k (where k is the chosen k-mer length). After removing tips, new paths will appear again in the graph. Almost all the remaining nodes have degree ≥ 2; we will hereafter call such nodes switch nodes. The result of the cleaning process will be similar to what is shown in Figure 5.
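The following sketch removes tips using the Velvet length rule quoted above (shorter than 2k). This is an assumption made for illustration: the criteria used by the ArapanS tip procedures are not given in the text. The extra requirement that the neighbouring node has another arc in the same direction is also an assumption, added so that the dead end of a genuine linear path is not deleted. The name `remove_tips` is hypothetical.

```python
from typing import Dict, Set


def remove_tips(
    seq: Dict[int, str], succ: Dict[int, Set[int]], pred: Dict[int, Set[int]], k: int
) -> bool:
    """Remove short dead-end nodes that hang off a branching node.
    Returns True if at least one tip was removed."""

    def is_tip(n: int) -> bool:
        if len(seq[n]) >= 2 * k:
            return False  # too long to be treated as an error
        if not succ[n] and len(pred[n]) == 1:
            (p,) = pred[n]
            return len(succ[p]) > 1  # predecessor has an alternative arc
        if not pred[n] and len(succ[n]) == 1:
            (s,) = succ[n]
            return len(pred[s]) > 1  # successor has an alternative arc
        return False

    changed = False
    for n in list(seq):
        if is_tip(n):  # checked at removal time, after earlier removals
            for p in pred[n]:
                succ[p].discard(n)
            for s in succ[n]:
                pred[s].discard(n)
            del seq[n], succ[n], pred[n]
            changed = True
    return changed


# Node 1 is a short dead end next to the long continuation node 2 (k = 3).
seq = {0: "ATGGC", 1: "GCT", 2: "GCATTACG"}
succ = {0: {1, 2}, 1: set(), 2: set()}
pred = {0: set(), 1: {0}, 2: {0}}
first_pass = remove_tips(seq, succ, pred, k=3)
second_pass = remove_tips(seq, succ, pred, k=3)
```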
Connected components detection
Once the graph is reduced and contains only switch nodes, we start determining the connected components of the graph. There are two reasons why we need to determine the connected components. The first is the nature of the k-mers and their reverse complements: since each k-mer was recorded along with its reverse complement, we obtain a graph composed of two subgraphs, one being the reverse of the other. The second is the sparseness of the graph, especially when the initial k-mer length is relatively long. Our assembly algorithm can run on every connected component of the graph, and detecting these components allows the assembly algorithm to be run in parallel. Breadth-first search or depth-first search can be applied to find the connected components in linear time. The search begins at an arbitrary node v, from which the entire connected component including v is detected. A loop through all nodes of the graph must be implemented in order to find all the connected components. The loop runs until no unvisited node can be found. The pseudocode of the modified algorithm is shown as follows:

connectedComponent( VertexSet V, EdgeSet E, Node a )
    Set X;
    Boolean visited[V];
    // Step 1
    $X := \{a\}$;
    $visited[x] := false, \ \forall x \in V$;
    // Step 2
    while $\exists x \in X : visited[x] = false$ do
    begin
        select $x \in X$ such that $visited[x] = false$; $visited[x] := true$;
        $X := X \cup \{y\}, \ \forall y \notin X : (x,y) \in E \text{ or } (y,x) \in E$;
    end
    return X;
The idea of this algorithm is to traverse the graph from an arbitrary node a, mark it as a visited node and record its neighbors in the set X. The same is done for the recorded nodes until there are no unvisited nodes in the set X. The algorithm returns the connected component generated from the node a. To find all connected components we apply the following algorithm:

allComponents( VertexSet V, EdgeSet E )
    SetList C;
    Set X′;
    Integer i;
    // Step 1
    $X' := V$;
    $i := 1$;
    // Step 2
    while $X' \ne \emptyset$ do
    begin
        select an arbitrary $x \in X'$;
        $C_i := connectedComponent(V, E, x)$;
        $X' := X' \setminus C_i$;
        $i := i + 1$;
    end
    return C;
We only need to select an arbitrary node x and determine, using the connectedComponent() procedure, the connected component $C_i$ containing x. The nodes of the determined component are then removed from X′ (the set-difference step $X' := X' \setminus C_i$). The same operation is performed until X′ is empty, that is, until no further connected components can be detected.
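The two procedures above can be sketched with a breadth-first search. The sketch treats arcs as undirected when growing a component, matching the “(x,y) ∈ E or (y,x) ∈ E” condition. The graph representation (dictionaries `succ` and `pred` of neighbour sets keyed by integer node identifiers) and the function names are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List, Set


def connected_component(
    succ: Dict[int, Set[int]], pred: Dict[int, Set[int]], start: int
) -> Set[int]:
    """Return every node reachable from `start` when arc direction is ignored."""
    seen = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for y in succ[x] | pred[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return seen


def all_components(succ: Dict[int, Set[int]], pred: Dict[int, Set[int]]) -> List[Set[int]]:
    """Repeatedly pick an unassigned node and extract its component."""
    remaining = set(succ)
    components: List[Set[int]] = []
    while remaining:
        x = next(iter(remaining))  # an arbitrary remaining node
        component = connected_component(succ, pred, x)
        components.append(component)
        remaining -= component
    return components


succ = {0: {1}, 1: set(), 2: {3}, 3: set(), 4: set()}
pred = {0: set(), 1: {0}, 2: set(), 3: {2}, 4: set()}
components = all_components(succ, pred)
```

Every node and arc is examined a bounded number of times, which gives the linear running time mentioned in the text.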
Assembly algorithm
Once the connected components are detected, we run the assembly algorithm for each component. The assembly algorithm can be run using one of two parameters: the coverage (k-mer frequency) or the k-mer lengths. The latter parameter is obtained from the cleaning process, which provides us with switch nodes whose corresponding k-mers are longer due to the merging process.
Most of the previous work on genome assembly has the following assumption: given a set of reads, the objective of the assembly program is to minimize the length of the assembled genome [18]. However, according to our knowledge, there is no proof that the shortest path can always faithfully represent the genome. The same can be concluded concerning the longest path, the Hamiltonian path and the Eulerian path.
The assembly algorithm is greedy. It traverses the graph by selecting, at each step, the neighbouring node with the highest frequency value. We have chosen this strategy by assuming that k-mers with high frequency values are more likely to be free of sequencing errors (we call it the “frequency function”). All procedures of the assembly algorithm are given as follows:

stringPath( Set C )
    Ordered Set path;
    Set P, Visited;
    Node u, v;
    // Step 1: preprocessing
    u := the index of the node that has the longest k-mer length;
    $v := u$;
    $path := path \cup \{u\}$;
    $Visited := \emptyset$;
    // Step 2: forward direction
    do forever
    begin
        P := out_neighbors(u) − Visited;
        $Visited := Visited \cup P$;
        if $P = \emptyset$ then stop;
        u := bestNeighbor(u, P);
        $path := path \cup \{u\}$;
    end
    // Step 3: backward direction
    do forever
    begin
        P := in_neighbors(v) − Visited;
        $Visited := Visited \cup P$;
        if $P = \emptyset$ then stop;
        v := bestNeighbor(v, P);
        $path := \{v\} \cup path$;
    end
    return path;
The set C represents a connected component of the graph. The resulting path is kept in the ordered set path. After the variables are initialized, the algorithm moves in the forward direction, selecting the best out-neighbors. In the last step, it moves backwards, selecting the best in-neighbors. The bestNeighbor() function takes the current node and the set of its in- or out-neighbors. Since each node can be connected to several neighbouring nodes, the best neighbor is the one with the highest frequency value. The two loops stop when no more exploration can be done. To find all possible paths, we apply the following algorithm, allPaths(), which calls the stringPath() algorithm on every connected component.

allPaths()
    SetList C;   // components list
    SetList P;   // paths list
    Integer i;
    // Step 1
    $C := allComponents(G)$;
    // Step 2
    for i := 1 to |C| do
    begin
        $P_i := stringPath(C_i)$;
    end
    return P;
By going through all connected components (determined by the allComponents() procedure) and applying the previous algorithm, a path $P_i$ is constructed for each connected component $C_i$.
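The greedy traversal can be sketched as follows. The data layout (dictionaries keyed by integer node identifiers) and the function names are assumptions made for this example. There is one deliberate deviation from the pseudocode, flagged in the code: the start node is placed in the visited set from the beginning, so that a cycle cannot re-enter it.

```python
from typing import Dict, List, Set


def best_neighbor(candidates: Set[int], freq: Dict[int, int]) -> int:
    """Pick the candidate with the highest frequency value."""
    return max(candidates, key=lambda n: freq[n])


def string_path(
    component: Set[int],
    succ: Dict[int, Set[int]],
    pred: Dict[int, Set[int]],
    seq: Dict[int, str],
    freq: Dict[int, int],
) -> List[int]:
    """Greedy walk forward and then backward from the longest node."""
    start = max(component, key=lambda n: len(seq[n]))
    path = [start]
    visited = {start}  # deviation: the pseudocode starts with an empty set

    u = start
    while True:  # forward direction
        candidates = succ[u] - visited
        visited |= candidates  # rejected siblings are never revisited
        if not candidates:
            break
        u = best_neighbor(candidates, freq)
        path.append(u)

    v = start
    while True:  # backward direction
        candidates = pred[v] - visited
        visited |= candidates
        if not candidates:
            break
        v = best_neighbor(candidates, freq)
        path.insert(0, v)
    return path


def all_paths(components, succ, pred, seq, freq) -> List[List[int]]:
    """One greedy path per connected component."""
    return [string_path(c, succ, pred, seq, freq) for c in components]


# Node 1 is the longest; its successor 2 has a higher frequency than 3.
seq = {0: "ATG", 1: "TGGCA", 2: "CAT", 3: "CAG"}
freq = {0: 9, 1: 9, 2: 10, 3: 2}
succ = {0: {1}, 1: {2, 3}, 2: set(), 3: set()}
pred = {0: set(), 1: {0}, 2: {1}, 3: {1}}
paths = all_paths([{0, 1, 2, 3}], succ, pred, seq, freq)
```

Because the choice is made locally at each node, the walk is fast but, as noted in the Discussion, it does not search for a globally optimal path.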
Availability and requirements
ArapanS is open access and freely available. All questions, comments and requests should be sent by email to nihon.sahli@gmail.com.
Project name: Arapan project
Project home page: http://shibuyalab.hgc.jp/Arapan/
Operating system(s): Windows, Linux (Ubuntu)
Programming language: C/C++
Other requirements: None
License: None required
Any restrictions to use by nonacademics: None required
References
1. Sutton GG, White O, Adams MD, Kerlavage AR: TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci. Technol. 1995, 1: 9-19. 10.1089/gst.1995.1.9.
2. Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Research. 1999, 9: 868-877. 10.1101/gr.9.9.868.
3. Huang X, Wang J, Aluru S, Yang SP, Hillier L: PCAP: A whole-genome assembly program. Genome Research. 2003, 13: 2164-2170. 10.1101/gr.1390403.
4. Myers EW: The fragment assembly string graph. Bioinformatics. 2005, 21: ii79-ii85. 10.1093/bioinformatics/bti1114.
5. Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Müller WEG, Wetter T, Suhai S: Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Research. 2004, 14: 1147-1159. 10.1101/gr.1917404.
6. Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001, 98: 9748-9753. 10.1073/pnas.171285098.
7. Warren RL, Sutton GG, Jones SJ, Holt RA: Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 2007, 23: 500-501. 10.1093/bioinformatics/btl629.
8. Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Research. 2008, 18: 324-330. 10.1101/gr.7088808.
9. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research. 2008, 18: 821-829. 10.1101/gr.074492.107.
10. Zerbino DR, McEwen GK, Margulies EH, Birney E: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short read de novo assembler. PLoS One. 2009, 4: e8407. 10.1371/journal.pone.0008407.
11. Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: De novo assembly of whole genome shotgun microreads. Genome Research. 2008, 18: 810-820. 10.1101/gr.7337908.
12. MacCallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter I, Gnirke A, Malek J, McKernan K, Ranade S, Shea TP, et al: ALLPATHS 2: Small genomes assembled accurately and with high continuity from short paired reads. Genome Biology. 2009, 10: R103. 10.1186/gb-2009-10-10-r103.
13. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Research. 2009, 19: 1117-1123. 10.1101/gr.089532.108.
14. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome Research. 2010, 20: 265-272. 10.1101/gr.097261.109.
15. Bryant DW, Wong WK, Mockler TC: QSRA – a quality-value guided de novo short read assembler. BMC Bioinformatics. 2009, 10: 69. 10.1186/1471-2105-10-69.
16. Sommer DD, Delcher AL, Salzberg SL, Pop M: Minimus: a fast, lightweight genome assembler. BMC Bioinformatics. 2007, 8: 64. 10.1186/1471-2105-8-64.
17. Genome sequence assembly using trace signals and additional sequence information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB). 1999, 99: 45-56.
18. Medvedev P, Brudno M: Maximum likelihood genome assembly. Journal of Computational Biology. 2009, 16: 1-16. 10.1089/cmb.2008.0137.
Acknowledgements
This work was partially supported by the Grant-in-Aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan. We thank Mr. Yassine Bouhmadi and Fouad Kharroubi for their corrections.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MS and TS conceived the research and wrote the article. MS conducted the research and implemented ArapanS in C++ programming language. All authors have read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Sahli, M., Shibuya, T. ArapanS: a fast and highly accurate whole-genome assembly software for viruses and small genomes. BMC Res Notes 5, 243 (2012). https://doi.org/10.1186/1756-0500-5-243
Keywords
 Small Genome
 Frequency Function
 General Assembler
 Assembly Algorithm
 Start Node