"Transcriptomics & Functional Genomics"

Welcome to Transcriptomics & Functional Genomics Newsletter No. 11. This edition's focus is on the underlying technology and trends in Next Generation Sequencing (NGS/Next-Gen).

Currently only the focus topic from each newsletter is being made available on the internet (please note that this material is covered by copyright and permission should be sought to reproduce any content). The full newsletter is available internally via the intranet as a PDF. If you are interested in advertising a seminar or promotion via the newsletter, or in sponsorship, please contact Dr K Laing.

From the publication of the structure of DNA in 1953 to the present day there has been a revolution in science and technology that has led to the emergence of true molecular biology and has changed the way biological science is carried out. In particular, our understanding of the genome, its molecular organisation and structure, its regulation and control mechanisms and, of course, its genetic content and function has changed beyond recognition. During this revolution there have been watershed events such as the introduction of Sanger sequencing around 1977, the first gene to be sequenced, and the first genomes, such as H. influenzae in 1995, closely followed by the first multicellular eukaryote, C. elegans, in 1998. Included in this list of events should be the Human Genome Project: from its inception to “completion” it took around 13 years, at a multi-billion dollar cost, and produced a mere 23 gigabases of sequence data. Yet this part of the sequencing revolution pales in comparison to the astounding capability of Next Generation sequencing and the speed of its advance. The first Next-Gen sequencing platform, the Roche 454TM, appeared around 2004, and the years between 2006 and 2009 saw at least 3 new technologies come on stream. Advances in these technologies are allowing ever-greater read depth. In 2011, with the most powerful instruments, it is possible to produce in excess of 600 gigabases of sequence in just over a week, the equivalent of 5 human genomes. And yet recent changes in the technology herald a change in emphasis and capability that has implications for the future application and use of NGS in science and medicine.

As always, I hope you find this edition useful and informative.

If you’ve forgotten the content of the previous newsletters or want to access back copies see:

http://www.ipc.nxgenomics.org/newsletter.htm

Ken Laing IPC, Centre for Infection and Immunity Division of Clinical Sciences, St George's University of London

Next Generation Sequencing

Figure 1 The sequencing revolution: a timeline.

Of course, as with all new technologies, there is a degree of hyperbole, and that is certainly true of “Next-Gen” sequencing. The technology raises many moral and ethical questions about how we use and interpret sequence data and led to the introduction of legislation in the US in 2008 (Genetic Information Nondiscrimination Act1) to protect the individual’s rights where their genome is concerned. However, the technology and the capability of the current platforms are evolving at a phenomenal rate, which means that both the science and the informatics around sequencing are in a constant state of catch-up. Having said that, “Next-Gen” sequencing is also supplying some innovative solutions to how we derive sequences, and the range of real-world applications to which sequencing provides an answer is rapidly expanding.

The introduction of “Next-Gen” sequencing in 2004 has altered the volume of sequence data available. To put this understatement into some kind of perspective, by mid-2011 instruments such as Illumina’s HiSeqTM 2000 were capable of producing more than 600 gigabases of sequence, the equivalent of 5 human genomes at 40x coverage, in a single run lasting 8 days. Compare this to the 23 gigabases reportedly produced by the human genome project2,3 over its lifetime of 13 years using conventional Sanger sequencing. With the changing technological landscape there have been similar changes in the media and marketing of “Next-Gen”: originally referred to as “Next Generation” or “Next-Gen” sequencing, the terminology has evolved into 2nd and now 3rd Generation sequencing. Some of the most recent developments in the technology involve so-called single molecule sequencing and non-light-based sequencing. In the former, the individual sequencing reactions are performed on “single” molecules, avoiding the need for clonal expansion of individual sequences within the library (see Helicos HeliscopeTM and Pacific BioSciencesTM). The latter refers to the detection chemistry, where alternatives such as semi-conductor technology have evolved to measure alterations in the surface state of metal oxides, for example those induced by pH change during elongation of a template (Ion TorrentTM).
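As a rough check on the throughput comparison above, the arithmetic can be sketched in a few lines of Python. The run output and Human Genome Project totals are the figures quoted in the text. The genome size of roughly 3 gigabases is an assumption, as it is not stated in the newsletter, and the function name is purely illustrative.

```python
# Back-of-envelope check of the throughput figures quoted in the text.
# ASSUMPTION: a haploid human genome of roughly 3 gigabases (not stated in the text).
HUMAN_GENOME_BASES = 3_000_000_000


def genomes_per_run(run_output_bases, coverage, genome_bases=HUMAN_GENOME_BASES):
    """Number of genomes one run can cover at the requested depth."""
    return run_output_bases / (genome_bases * coverage)


hiseq_run_bases = 600_000_000_000   # >600 gigabases per 8-day run (from the text)
hgp_total_bases = 23_000_000_000    # 23 gigabases over 13 years (from the text)

# Genomes per run at 40x coverage
print(genomes_per_run(hiseq_run_bases, coverage=40))
# Fold difference in raw output between one run and the whole project
print(round(hiseq_run_bases / hgp_total_bases, 1))
```

Under the 3 gigabase assumption, the quoted 600 gigabases works out at 5 genomes at 40x coverage, and a single run yields roughly 26 times the raw output attributed to the Human Genome Project.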

However, three commercial technologies currently dominate the high throughput sequencing market, although two or three others compete, albeit on the periphery (take a look at the Google map of locations of Next-Gen Sequencers4). Then there are the new entrants on the scene, like Ion Torrent, that have introduced machines into the low-cost, low-throughput market space. This instrument competes for market space with other machines like Roche’s 454TM Junior and Illumina’s new MiSeqTM instrument. This emerging market of the “low-cost” instrument promises to radically alter the landscape in sequencing and move the technology out of the large sequencing centres like the Sanger, JCVI etc. The three dominant technologies, Roche’s 454, Illumina’s GA/HiSeqTM and Life Tech’s SOLiDTM 4/5500, are also the most mature and thus the most evolved chemistries and systems. The earliest of these to appear was the Roche 454, the latest incarnation of which (2011) is the GS FLX+.

Figure 2 Summary of 454TM chemistry based upon emulsion PCR and adaptation of traditional pyrosequencing in a picotitre plate.

All three manufacturers have platforms with a number of aspects in common: they all rely upon single base extension, light emission and capture, and sequential image processing. Next generation sequencing differs fundamentally from traditional sequencing in that it performs highly parallel reactions on spatially separated molecules that are amplified to produce clonal populations on a bead or in a flow cell. These reactions use as their starting point a “fragment library”. Such a library might consist of randomly fragmented and size-selected DNA representing thousands or millions of unique but overlapping segments from a human or other genome. The sequencing itself uses sequential single base extension, which moves the read position on by one base at a time if the corresponding base is supplied to the reaction. Incorporation of the corresponding nucleotide yields fluorescence or luminescence, which is captured as an image, and sequential image analysis then yields the sequence.
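The idea of a fragment library can be illustrated with a toy simulation. This is a minimal sketch only: the function name, the fragment size limits and the random toy genome are all hypothetical, and real library preparation involves steps (adapter ligation, amplification bias and so on) that are not modelled here.

```python
import random


def make_fragment_library(genome, n_fragments, min_len, max_len, seed=0):
    """Toy model of random fragmentation followed by size selection.

    Returns a list of (start, fragment) tuples. All parameters are illustrative.
    """
    rng = random.Random(seed)
    library = []
    while len(library) < n_fragments:
        start = rng.randrange(len(genome))
        length = rng.randint(1, 2 * max_len)      # random shearing
        fragment = genome[start:start + length]
        if min_len <= len(fragment) <= max_len:   # size selection
            library.append((start, fragment))
    return library


# Hypothetical 200-base "genome" made of random bases
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(200))

library = make_fragment_library(toy_genome, n_fragments=50, min_len=20, max_len=40)
for start, fragment in library[:3]:
    print(start, fragment)
```

Because the start positions are random, many of the fragments overlap one another, which is what later allows the short reads to be assembled or aligned back to a reference.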

In general Next-Gen produces short reads compared to traditional Sanger sequencing, of around 35-400 bases, but very large numbers of these reads are produced (ranging from 100 thousand on the GS Junior to 1.4 billion single end reads on the SOLiDTM 4). With recent improvements in the chemistry and additional hardware upgrades, the GS FLX+ is one of the few exceptions to this short read length, with a modal read length of 700 bases. However, in contrast to other platforms it has a relatively low but still impressive output of 700 megabases of data over a 23 hour run time. The only other technologies to compete with this on read length are the PacBio RS system and the Helicos HeliscopeTM.

The chemistry of the individual platforms has practical implications and has led to suggestions that different platforms are better suited to different applications. Roche’s 454TM chemistry is summarized in figure 2. Like other chemistries, it uses adapter ligation, followed by size selection of the fragment library and the attachment of single DNA molecules via a specific linker to the surface of beads. Oil-water emulsions are then formed, with individual beads contained within a micelle, and emulsion PCR is performed. Conceptually this process of isolation and clonal amplification of the sequences within a fragment library is integral not only to sequencing with the 454TM but also to several other platforms, although some systems, such as Illumina’s HiSeqTM/GA and MiSeqTM, do this by different means. Following amplification the emulsion is broken and the beads are selected and located within “wells” of a picotitre plate that also acts as a flow cell in which the sequencing takes place.

454TM sequencing is essentially highly parallel pyrosequencing, whereby extension of the template is performed by flooding the flow cell with one nucleotide (an A, T, C or G) at a time. The consequence of incorporation is the production of pyrophosphate, which in turn is locally converted via sulfurylase to ATP. The production of ATP in turn enables the luciferase reaction that converts luciferin to oxyluciferin, releasing light. The luminescence produced by this means is captured after each extension, identifying the base incorporated at a given location within the flow cell. The extension is then iterated through the bases until completion. Thus a series of extension reactions, and the light captured from these reactions, yields 1.25 million reads in the case of the 454TM FLX+.
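The one-nucleotide-at-a-time logic described above can be sketched as an idealised "flowgram". This is a simplified illustration rather than the instrument's actual signal processing. The flow order TACG and the function names are assumptions, and the light signal is taken to be exactly proportional to the number of bases incorporated in each flow, which real data only approximates.

```python
# ASSUMPTION: nucleotides are flowed in the repeating order T, A, C, G.
FLOW_ORDER = "TACG"


def flowgram(read, n_flows, flow_order=FLOW_ORDER):
    """Idealised light signal per flow for the strand being synthesised.

    `read` is the newly synthesised strand (complementary to the template).
    The signal for a flow is the number of consecutive matching bases
    incorporated, so a run of identical bases gives a proportionally
    stronger signal.
    """
    signals = []
    pos = 0
    for flow in range(n_flows):
        base = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
    return signals


def call_bases(signals, flow_order=FLOW_ORDER):
    """Rebuild the sequence from the per-flow signals."""
    return "".join(flow_order[i % len(flow_order)] * s
                   for i, s in enumerate(signals))


signals = flowgram("TTACGGGA", n_flows=8)
print(signals)               # [2, 1, 1, 3, 0, 1, 0, 0]
print(call_bases(signals))   # TTACGGGA
```

A flow that finds no matching base gives no signal, and the read only advances when the supplied nucleotide matches the next position.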

Figure 3 Illumina clonal expansion by bridge amplification.

Figure 4 The basis of Illumina sequencing in a flow cell.

Both the Illumina and SOLiDTM chemistries differ from 454TM in that they incorporate fluorescent dyes as markers of nucleotide extension. Illumina employ a novel approach to clonal expansion termed bridge amplification. This is performed on the surface of the flow cell (see figure 3). As before, a fragment library is produced in which adaptors are ligated to opposite ends of the fragments. Linkers that are covalently attached to the surface of the flow cell capture the library and a bridge is formed between these. Isothermal amplification is then initiated from the double stranded ends of the bridge. Displacement, re-annealing to the surface and amplification of the locally bound sequence thus form a cluster of clonally expanded sequences. Finally, opening of the bridge allows the release of one end in preparation for sequencing (figure 4).

The initial step in preparation for sequencing is blocking of the free 3’ end of the molecules attached to the flow cell. An iterated process of single base extension then occurs, in which the flow cell is flooded with fluorescently labeled nucleotides that are chemically modified so that, once incorporated, they cannot be extended further. After washing of the flow cell and excitation of the incorporated label, the emitted light is captured as an image, identifying which of the four nucleotides has been incorporated in a given cluster by the spectral characteristics of the label. In order to repeat this, the incorporated nucleotide is chemically unblocked and the process repeated. Thus by sequentially analyzing each cluster, it is possible to derive the sequence of the molecules within the cluster. The read depth and read length of the system adopted by Illumina are determined largely by a combination of the limitations imposed by bridge formation and cluster density. Although the approach currently has read lengths of around 100 bases, a top of the range HiSeqTM 2000 produces 187 million single end reads per lane, or approximately 3 billion per run. Currently Illumina hold the record in terms of the amount of sequence data that can be produced per run.
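The cycle-by-cycle read-out described above can be sketched as follows. This is a deliberately naive illustration: each cycle is reduced to four made-up channel intensities for one cluster and the brightest channel is taken as the base, whereas real base callers also correct for effects such as cross-talk and phasing. The channel order, the intensity values and the figure of 16 lanes per run are assumptions, not taken from the text.

```python
# ASSUMPTION: one intensity per base channel, in the order A, C, G, T.
CHANNELS = "ACGT"


def call_cluster(intensities):
    """Call one base per cycle for a single cluster.

    `intensities` is a list of cycles, each a 4-tuple of channel intensities.
    The brightest channel in each cycle is taken as the incorporated base.
    """
    read = []
    for cycle in intensities:
        brightest = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Hypothetical intensities for four sequencing cycles of one cluster
cluster = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.8, 0.1, 0.0),
    (0.0, 0.2, 0.1, 0.7),
    (0.1, 0.0, 0.9, 0.1),
]
print(call_cluster(cluster))   # ACTG

# Reads per run from the per-lane figure quoted in the text.
# ASSUMPTION: 16 lanes per run (two flow cells of 8 lanes each).
READS_PER_LANE = 187_000_000
LANES_PER_RUN = 16
reads_per_run = READS_PER_LANE * LANES_PER_RUN
print(reads_per_run)           # 2992000000, i.e. approximately 3 billion
```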

Figure 5 Outline of the sequencing workflow for the SOLiDTM 4

The third of the three original technologies is Life Tech’s SOLiDTM system. Although this is similar to 454TM in that it uses emulsion PCR to clonally expand a fragment library bound to beads via adaptors (compare figures 2 & 5), the similarity begins and ends with this aspect. The molecular approach used by the SOLiDTM is perhaps the most complicated of the three and is based not on extension of an initiation complex by a DNA polymerase but on hybridization and ligation with a mixture of four partially degenerate octamer oligonucleotides. The first two nucleotides of each octamer are known, with the first position the same for each (e.g. AT N6, AA N6, AG N6, AC N6), and each carries a 5’ fluorescent label (figures 5, 6a & b). After expansion of the library by emulsion PCR, each bead will carry a clonal population of a single nucleotide sequence, different from that on other beads. The coated beads are next captured on the surface of a flow cell via a 3’ terminal modification of the sequences. Sequencing is initiated by the hybridization of an oligonucleotide spacer that is complementary to the adapter. Following on from this, the first set of four octamers is passed through the flow cell. Each of these is labeled with a fluorescent molecule and allowed to locate 5’ to the spacer, with competitive ligation determined by the specificity of the first two nucleotides of the octamer. The spectral characteristic of the octamer thus determines which, if any, of the di-nucleotides was successfully ligated adjacent to the spacer. In order to extend the sequence, the three terminal nucleotides and fluorescent label are then removed. The process is repeated with a different set of oligonucleotides with known di-nucleotide sequences (figure 6a). In this way a moving window of sequence is produced, interspersed with unknown sequence.

In order to fill the “gaps”, so-called primer reset is performed, in which the complementary sequence is melted off the bead-bound molecule and the process repeated with a spacer that differs in length by a single nucleotide from the previous round. In all, five rounds of primer reset are performed (figure 6b). The consequence of this approach is that each nucleotide is in effect interrogated twice, but the complexity has an impact on the speed and length of the sequence produced. The SOLiDTM 4 and 5500 series produce the shortest reads of all the Next-Gen instruments, with a single end read length of 50-75 bases and around 100 gigabases of sequence in a 7 day run for the SOLiDTM 4.
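The statement that each nucleotide is interrogated twice can be made concrete with a small sketch of di-nucleotide ("two-base") encoding. The mapping of the sixteen di-nucleotides onto four colours used here is the commonly described scheme and is an assumption, as the text does not give it. The function names are hypothetical, and the sketch ignores the order in which ligation rounds and primer resets actually produce the colours.

```python
# ASSUMPTION: bases are coded as 2-bit values and the colour of a di-nucleotide
# is the XOR of its two bases, giving four colours (0-3) for sixteen di-nucleotides.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colours(seq):
    """One colour per overlapping di-nucleotide, so each internal base
    contributes to two colours (it is 'interrogated twice')."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colours(first_base, colours):
    """Recover the bases from the colours, given a known first base
    (in practice the last base of the adapter)."""
    bases = [first_base]
    for colour in colours:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ colour])
    return "".join(bases)


colours = encode_colours("ATGGC")
print(colours)                        # [3, 1, 0, 3]
print(decode_colours("A", colours))   # ATGGC
```

In this scheme a colour on its own does not identify a base. The sequence is only recovered by reading the colours in order from a known starting base, and a true single-base difference changes two adjacent colours, which is what makes the double interrogation useful for error checking.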

Figure 6 a & b. a. SOLiDTM 4 hybridisation and ligation based sequencing with partially degenerate octamers with a known di-nucleotide at positions 1 and 2. b. Primer reset creates an overlapping series of windows interrogating the underlying sequence.

What’s new in Next-Gen sequencing?

One of the most striking recent changes is the introduction of bench top instruments such as the 454TM Junior, Ion Torrent and MiSeqTM, with a “low end” specification and output of 10 megabases upwards to around 500 megabases. The developments in this direction are primarily made with one eye on the human and animal health diagnostic markets. Eukaryotic whole genome sequencing is unlikely to be cost effective in the interim and presents huge problems in terms of how we handle, store and interpret this information. Consequently, targeted sequencing of amplicons spanning specific regions of the genome is much more attractive and is already being routinely performed using Next-Gen technology. Whilst human genetics and cancer are obvious applications, sequencing can meet many other needs in diagnostics, not least in relation to infectious diseases, and all three of the main companies are putting effort into sequencing solutions in this field (as an entertaining example of this frontier see here6). Up until now, epidemiological studies of infectious agents have been conducted on the larger instruments, such as the Pacific BioSciences RS used in the 2010 paper by Chin et al7 looking into the origin of the Haitian cholera outbreak. More recently, German E. coli outbreak strains were sequenced by Rasko et al8 using the same instrument, and Brzuszkiewicz et al9 sequenced two isolates on the 454TM Titanium FLX. However, this was overshadowed by news of both Ion Torrent and Illumina’s MiSeqTM sequencers also being used to produce “real-time” sequences from the German outbreak that were made publicly available10.

In 2011 the BUG@S group at St George’s, with some funding from the EUC, acquired an Ion Torrent NGS system and is currently sequencing clinical isolates of pathogens with the aim of developing answers to clinical problems in infection control. The group is keen to develop collaborations across the school and externally, wherever application of this technology is appropriate and not simply in the area of infectious diseases.

The Ion Torrent is an exciting platform that differs from the other similarly scaled systems in that it uses semi-conductor technology11 to detect the stepwise polymerization of the complementary strand during elongation by a DNA polymerase. This is achieved as a result of localized changes in the surface properties of a metal oxide due to a shift in pH caused by H+ ion release during elongation of the complementary strand. These changes in turn induce a voltaic potential in an adjacent sensor plate. Arranging sensors in a two dimensional array and combining such semi-conductor technology with microfluidics enables the detection of changes within individual wells measuring around 1.3 microns. The arrangement of more than a million such wells within a “plate” allows the label free monitoring of parallel reactions as the wells are supplied with each base in turn. In order to perform many thousands of reactions simultaneously, a library must be created and clonal populations representing the individual sequences within the library must be placed within individual wells. This process is similar to the methodology used by 454TM and SOLiDTM and uses emulsion PCR to clonally expand individual sequences within the library that are attached to beads (see figures 2 & 5). As of September 2011 the system can produce sequences with a modal length of around 120-150 bases and, depending on the chip size, up to 500 megabases of sequence. The smallest of the Ion Torrent chips (314) has a specification of 10 megabases of output but produces well in excess of this, between 40 and 55 megabases depending on the quality of the library. This is enough sequence to perform de novo sequencing of most bacterial genomes and/or to align and identify the genetic content of an organism with existing genomes. The cost of a single run is currently around £500 on a 314 chip and £800 on a 316 chip, and these costs are set to fall with the introduction of tagged sequencing.
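To put the 314 chip figures above into context, the depth of coverage and cost per megabase can be estimated with some simple arithmetic. The output and cost figures are those quoted in the text. The 5 megabase bacterial genome is an assumed, illustrative size, and the function names are hypothetical.

```python
def fold_coverage(output_megabases, genome_megabases):
    """Average depth of coverage: total sequence divided by genome size."""
    return output_megabases / genome_megabases


def cost_per_megabase(run_cost_gbp, output_megabases):
    """Run cost in pounds divided by the sequence produced."""
    return run_cost_gbp / output_megabases


# ASSUMPTION: a bacterial genome of about 5 megabases (not stated in the text).
GENOME_MB = 5.0
RUN_COST_314_GBP = 500   # approximate cost of a 314 chip run (from the text)

# 10 Mb is the chip specification; 40-55 Mb is the output reported in the text.
for output_mb in (10, 40, 55):
    depth = fold_coverage(output_mb, GENOME_MB)
    cost = cost_per_megabase(RUN_COST_314_GBP, output_mb)
    print(f"{output_mb} Mb: {depth:.0f}x coverage, £{cost:.2f} per megabase")
```

On these assumptions the reported 40-55 megabases corresponds to roughly 8-11x coverage of a 5 megabase genome, compared with about 2x at the 10 megabase specification.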

One of the attractions of this system is that it uses existing semi-conductor fabrication technology and materials. Expansion of the capacity is, in large part but not entirely, down to increasing the number of usable wells. Ion Torrent is due to launch its latest chip (318), with more than 1 gigabase capacity, in the autumn of 2011, and this will further extend its capability. In the first six months to a year since its launch, the Ion Torrent has established itself as a ground breaking technology that the traditional market leaders Illumina and Roche must now take note of if they are to compete for market share in “low throughput” NGS. The emergence of NGS in routine diagnostics has the potential for huge benefit in the fight against emerging pathogens, and will undoubtedly impact upon health care provision and infection control. With the competition hotting up between Ion Torrent, Illumina’s MiSeqTM and Roche’s 454TM Junior, the future will undoubtedly herald some remarkable changes in many areas, not least clinical microbiology.

References

  1. Genetic Information Nondiscrimination Act (GINA) of 2008. National Human Genome Research Institute Web site http://www.genome.gov/24519851
  2. LG Dressler and SF Terry How Will GINA Influence Participation in Pharmacogenomics Research and Clinical Testing? Clinical pharmacology & Therapeutics 86 (5) 472-475
  3. International Human Genome Sequencing Consortium Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004)
  4. Lander ES Initial impact of the sequencing of the human genome Nature (2011) 470 187-197
  5. http://pathogenomics.bham.ac.uk/hts/
  6. http://www.my454.com/applications/pathogendetection/index.asp
  7. Chen-Shan Chin et al The Origin of the Haitian Cholera Outbreak Strain N Engl J Med. 2011 Jan 6;364(1):33-42
  8. Rasko D.A. et al Origins of the E. coli Strain Causing an Outbreak of Hemolytic–Uremic Syndrome in Germany N Engl J Med. 2011 Aug 25;365(8):709-17. Epub 2011 Jul 27.
  9. Brzuszkiewicz E, Thürmer A, Schuldes J, Leimbach A, Liesegang H, Meyer FD, Boelter J, Petersen H, Gottschalk G, Daniel R. Genome sequence analyses of two isolates from the recent Escherichia coli outbreak in Germany reveal the emergence of a new pathotype: Entero-Aggregative-Haemorrhagic Escherichia coli (EAHEC). Arch Microbiol. 2011 Jun 29.
  10. News & Analysis Scientists Rush to Study Genome of Lethal E. coli Science (2011) 332 1249-1250
  11. Jonathan M. Rothberg et al An integrated semiconductor device enabling non-optical genome sequencing. Nature (475) 348-352 (2011)