Sequence assembly

319 bytes added, 00:13, 19 December 2010

In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes in one go, but only small pieces of between 20 and 1000 bases. Typically the short fragments, called reads, result from shotgun sequencing of genomic DNA or gene transcripts ([[EST]]s).

<h2>Sequence assembly as reconstructing a book</h2>

The problem of sequence assembly can be compared to taking many copies of a book, passing them all through a shredder, and piecing a copy of the book back together from only shredded pieces. The book may have many repeated paragraphs, and some shreds may be modified to have typos. Excerpts from another book may be added in, and some shreds may be completely unrecognizable.
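The shredded-book picture can be made concrete with a toy example. The sketch below is a minimal greedy overlap-merge assembler: it repeatedly joins the two fragments whose suffix and prefix overlap the most. The function names and the `min_len` threshold are illustrative assumptions, not taken from any assembler named in this article, and the approach ignores sequencing errors and is easily misled by repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def drop_contained(seqs):
    """Remove exact duplicates and sequences fully contained in another."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != o and s in o for o in unique)]


def greedy_assemble(reads, min_len=3):
    """Toy assembler (hypothetical): merge the best-overlapping pair until stuck."""
    contigs = drop_contained(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining pair overlaps enough; leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        rest = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])
    return contigs
```

Calling `greedy_assemble(["ACGTTAGC", "ATGGCGTA", "CGTACGTT"])` pieces the three overlapping fragments back into a single sequence. With a repeated "paragraph" in the input, the greedy choice can join the wrong pieces, which is one reason real assemblers need more elaborate strategies.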

<h2>Genome assemblers</h2>

The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs, built to piece together the vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity, the assembly programs needed to employ increasingly sophisticated strategies to handle:

<ul>

<li>terabytes of data which need processing on computing clusters; </li>

<li>identical and nearly identical sequences (known as repeats) which can, in the worst case, increase the time and space complexity of algorithms exponentially; </li>

<li>and errors in the fragments from the sequencing instruments, which can confound assembly. </li>

</ul>
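One common way to get a first look at the repeat and error problems listed above is to count short fixed-length substrings (k-mers) across all reads. The sketch below is only an illustration under stated assumptions: the function names and the `low` and `high` thresholds are hypothetical, and in practice sensible thresholds depend on sequencing depth. K-mers seen far more often than expected hint at repeats, while k-mers seen only once or twice are often the product of read errors.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def flag_kmers(counts, low, high):
    """Split k-mers into rare (count <= low) and frequent (count >= high) sets.

    The thresholds are assumptions chosen by the caller, not fixed constants.
    """
    rare = {kmer for kmer, n in counts.items() if n <= low}
    frequent = {kmer for kmer, n in counts.items() if n >= high}
    return rare, frequent
```

For example, `flag_kmers(kmer_counts(reads, 4), low=1, high=3)` returns the 4-mers that occur at most once and those that occur at least three times in `reads`.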

Faced with the challenge of assembling the first larger eukaryotic genomes, that of the fruit fly [[Drosophila melanogaster]] in 2000 and the human genome in 2001, scientists developed assemblers such as Celera Assembler[1] and [[Arachne]][2], able to handle genomes of 100,000,000 to 300,000,000 base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as [[AMOS]][3] was launched to bring together the innovations in genome assembly technology under an open source framework.

<h2>EST assemblers</h2>

EST assembly differs from genome assembly in several ways. The sequences for EST assembly are the transcribed mRNA of a cell and represent only a subset of the whole genome. At first glance, the underlying algorithmic problems differ between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequence, mainly in the intergenic regions. Since ESTs represent gene transcripts, they will not contain these repeats. On the other hand, cells tend to have a certain number of genes that are constantly expressed at very high levels (housekeeping genes), which again leads to the problem of similar sequences being present in large numbers in the data set to be assembled.

WikiSysop
