Computer practical: Principles of DNA sequencing, the use of sequence databases and sequence comparison tools

In this "dry practical" you will be introduced to the principles of DNA sequencing by dideoxy-terminated elongation of an oligonucleotide-primed template. You will learn some principles of database searching and the use of algorithms and scoring matrices to carry out sequence comparisons. The practical is principally aimed at extending your knowledge of the use of databases and database tools in modern molecular and medical science, demonstrating a number of simple but important tools in everyday use. The BST2 laboratory practical will already have introduced you to some aspects of sequence traces and the use of database searching to find matching sequences; here we aim to further your understanding of this and illustrate other applications of these tools.

* Exercise 1: DNA sequencing in a clinical setting.

The principles of DNA sequencing methodology, specifically the "dideoxy method", will be explored in more detail. Two traces from real patients are shown; each was generated by a dideoxy-terminated sequencing reaction whose products were separated by electrophoresis. The traces are from two isolates of Mycobacterium tuberculosis and give only a small portion of the read for the gyrA gene. The aim of the exercise is to familiarise you with the appearance of such traces, to compare the sequence of these two isolates and to understand how this information informs clinical management of such patients.

* Exercise 2: Identification of a sequence in the database and sequence comparison as an indicator of diagnostic specificity.

You will use the sequence provided to assess the specificity of a real-time PCR hydrolysis probe that is used in a diagnostic assay for H1N1 influenza A virus. In order to design such probes it is necessary to have an understanding of the homology of the probe sequence with other non-target nucleic acid sequences, and thus the potential for hybridisation of the probe sequence to other sequences that may be in a biological sample. To do this, it is necessary to compare the probe sequence against all other known sequences in the public databases, the largest and most comprehensive being EMBL and GenBank. This is achieved by performing a blastn search.

* Exercise 3: Multiple sequence alignment.

Additional gene sequences will be provided for you. You will compare these using a web-based tool, the MUSCLE program, to identify differences in the sequences. In doing so you will perform a multiple sequence alignment. This will enable you to explore the 'biological' importance of differences found in the nucleotide sequences from patients with beta-thalassemia by comparison with the "normal" sequence of the beta haemoglobin (HBB) gene. By also comparing the gene sequences to the message you will then be able to identify the location of alterations relative to features within the gene such as introns/exons, specific codons, reading frames and polyadenylation sequences.

The background and revision notes on this page are also available as a PDF: click here.


You will be expected to submit a Moodle quiz evaluating your understanding of the practical, the background to it and the results obtained from your analysis before you leave.

You are prompted to download/open files as you work, and during the session you will generate results that you will NEED to use subsequently. It is possible to complete all the tasks DURING the session and you are required to do so. It is essential that you read the background and follow the instructions carefully; you are not expected to be a computer wizard!

You are expected to use your common sense and read the instructions, so use the opportunity to ask if something is unclear.

You should submit your answers before leaving; failing to do so may result in no marks being awarded.

DNA Sequencing


Several different techniques for nucleotide sequencing have been developed over the years. However, they are nearly all similar in that they rely upon the chemistry of nucleotide chain elongation by a family of enzymes known as DNA-dependent DNA polymerases. You will recall that you have already met these enzymes in CMB2. This family includes the thermostable polymerases used for PCR, and you will remember that these enzymes have so-called end-filling properties that initiate elongation of a partially double-stranded DNA template from the 3' end of the strand that is complementary to the template strand. Thus elongation results as the enzyme moves along the newly synthesised strand away from the 5' end in a 5' to 3' direction.

The addition of a deoxy-nucleotide tri-phosphate (dNTP) to the 3' end of a nucleic acid entails the formation of an ester bond between the alpha phosphate of the dNTP and the OH group at position 3 of the deoxyribose sugar moiety at the end of the growing strand. However, if the enzyme is given a modified nucleotide tri-phosphate missing this OH group, i.e. a dideoxy-nucleotide triphosphate (ddNTP), elongation is unable to proceed further once the ddNTP has been added, because the dideoxy-nucleotide occupies the terminal position of the elongating strand and lacks the requisite OH group.

deoxyribose with a tri-phosphate group at position 5 of the ribose ring; note the OH group at position 3
dideoxyribose with only a tri-phosphate group at the position 5 carbon of the ribose ring

Thus, if a reaction is allowed to proceed in which each of the four nucleotide tri-phosphates (G, C, T and A) is present as a mixture of the deoxy- (e.g. dGTP) and dideoxy- (e.g. ddGTP) forms, elongation will be terminated at some point during the elongation phase of the reaction. The point of termination depends upon when a dideoxy-nucleotide is captured by the polymerase and incorporated into the elongating strand, which in turn depends upon the relative concentrations of the competing deoxy- and dideoxy-nucleotides. The reaction will produce a mixed population of molecules; the length of each individual molecule in that mixture is determined by the position in the sequence of the nucleotide corresponding to the dideoxy-nucleotide that terminates the chain (see the PowerPoint show; to pause it, right click and select "pause"; to exit, select "end show").

In order to identify which of the four bases occurs at the terminating position, each of the four dideoxy-nucleotides (ddGTP, ddCTP, ddTTP and ddATP) is labelled with a different fluorescent dye. These dyes emit light at different wavelengths. The sequence can therefore be determined simply by sorting the molecules by length and detecting the light emitted from the incorporated label of each of the four dideoxy-nucleotides. This length-dependent sorting is achieved by capillary electrophoresis. Gel electrophoresis was also covered in lectures in CMB2 and in your BST2 tutorials, and you should by now understand the principles of this technique.

In brief, the nucleic acid is passed through a gel matrix by applying a voltage across two electrodes, and the negatively charged (anionic) nucleic acid migrates toward the positive electrode (or anode). The molecules are retarded by the matrix according to their size; larger molecules are retarded to a greater extent and as a consequence move through the matrix more slowly. Molecules of the same size migrate together, and this population of nucleic acid species is detected using the fluorescent label on the terminating dideoxy-nucleotide.

Comparison of the length of all four populations of molecules terminated by each of the four dideoxy-nucleotides

Sequencing has now been carried out by semi-automated sequencers for some years. In these instruments the captured light is transformed into a trace (a graph showing the normalised emission of the four dyes). This is converted to a sequence according to the peaks produced for each dye. A typical readout is seen below in exercise 1.
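The read-out logic described above can be illustrated with a minimal Python sketch. It is not how a sequencer's software works: it assumes, for simplicity, that termination occurs at least once at every position, and the strand `GATTACA` and the function names are made up for illustration.

```python
def terminated_fragments(new_strand):
    """Return every chain-terminated product as (length, terminal_base).

    Assumes, for illustration, that a dideoxy-nucleotide is incorporated
    at least once at every position, so one fragment exists per length.
    """
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_sequence(fragments):
    """Sort fragments by length (the shortest reach the detector first)
    and read off the labelled terminal base of each one in turn."""
    return "".join(base for _, base in sorted(fragments))


# Hypothetical newly synthesised strand, written 5' to 3'.
strand = "GATTACA"
fragments = terminated_fragments(strand)
fragments.reverse()  # the reaction mix is unordered; order must not matter
print(read_sequence(fragments))
```

Sorting by length stands in for electrophoresis, and the terminal base of each fragment stands in for the colour of its fluorescent label.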


The gyrA gene of Mycobacterium tuberculosis encodes subunit A of DNA gyrase, a type II topoisomerase. DNA gyrase is a protein that catalyses the energy-dependent breakage, passage and rejoining of double-stranded DNA to allow the uncoiling and separation of interlocking strands of DNA, and thus is essential for the replication of a circular bacterial chromosome, which otherwise cannot segregate. Inhibition of the catalytic activity of gyrase by fluoroquinolones such as ofloxacin and ciprofloxacin leads to the killing of many organisms reliant upon DNA gyrase. Mutations in the gyrA gene can confer resistance, seen as an alteration in the effective dosage reflected in the minimum inhibitory concentration (MIC) of a drug.

The region of gyrA given in the traces corresponds to nucleotide positions 229-313 and includes some well documented mutations that lead to increased resistance to the fluoroquinolones used in the treatment of tuberculosis. This region also contains some other well known non-synonymous single nucleotide polymorphisms (SNPs); some of these are coincidentally seen in the same populations but themselves have no effect on sensitivity to the drug.

A single nucleotide polymorphism is simply an established variant of a single nucleotide at a specific location within a sequence. Such variation can occur anywhere within the genome, but if it occurs within a coding region it can be one of two types: synonymous or non-synonymous. A non-synonymous SNP alters the amino acid coded for by the affected codon; a synonymous SNP does not.
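As a rough illustration of the distinction, the Python sketch below builds the standard genetic code and classifies a codon change as synonymous or non-synonymous. The function name `classify_snp` and the example codons are chosen here for illustration only and are not taken from the practical's data.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def classify_snp(reference_codon, variant_codon):
    """Return (kind, reference amino acid, variant amino acid)."""
    ref_aa = CODON_TABLE[reference_codon]
    var_aa = CODON_TABLE[variant_codon]
    kind = "synonymous" if ref_aa == var_aa else "non-synonymous"
    return kind, ref_aa, var_aa


print(classify_snp("GCC", "GCT"))  # change in the third base of the codon
print(classify_snp("GCC", "TCC"))  # change in the first base of the codon
```

Amino acids are given in one-letter code, so "A" is alanine and "S" is serine.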

  1. Take a look at the two partial traces below labelled Isolate A & B. To make them more manageable for the purpose of the practical, the sequence traces have been shortened and aligned with one another such that they start at the same nucleotide position within gyrA. You have only been given the trace of the forward strand in each case. The normal procedure when sequencing a piece of DNA such as this in a clinical setting would be to sequence it from beginning to end and from opposite directions, with an overlap in the centre covering any mutation hot spots twice. Thus a forward and a reverse read would both be expected to identify any mutations, giving higher confidence in their existence.
Two traces obtained from dideoxy-terminated sequencing of the gyrA gene of M. tuberculosis. These traces were derived from organisms obtained from the sputum of patients suspected of having a multi-drug resistant strain of the organism. Each line is coloured according to the fluorescent label identifying the dideoxy-nucleotide incorporated.
Suspected Drug resistant Isolate A
Suspected Drug resistant Isolate B
  1. Note that the first molecules detected by the sensor (those toward the beginning of the trace), correspond to the population of molecules produced by the polymerase with a shorter length and have the least number of preceding nucleotides prior to the terminating dideoxy-nucleotide.
  2. Open the following link gyrA.doc. This file contains the drug-susceptible sequence within the gyrA gene for the corresponding nucleotides; it also contains the sequence read from the traces for both Isolate A and B.
  3. These sequences have been derived using a trace reader to give the sequence corresponding to the respective peaks for each nucleotide from the two traces above. Note whether, and how, the sequences for isolate A and/or B differ from the fluoroquinolone-susceptible gyrA sequence. YOU WILL NEED THIS INFORMATION TO ANSWER THE QUESTIONS IN THE MOODLE QUIZ.
  4. In order to determine if the sequence obtained for isolate A and B confers resistance to a fluoroquinolone we need some means of comparing the sequence to previously identified mutations with known effects on the MIC.
  5. For Mycobacterium tuberculosis a database of the known drug resistance mutations relating to the commonly used classes of anti-tubercular drugs has been compiled. Open this link.
  1. Below the title banner is a list of abbreviations of the drug classes used for treatment of tuberculosis (AMI: aminoglycosides, EMB: ethambutol, ETH: ethionamide etc.). Click on "FLQ" for the fluoroquinolones. These drugs target DNA gyrase and topoisomerase IV, both of which are type II topoisomerases, and they inhibit the action of gyrase by binding to the gyrA subunit. The systematic name for the gyrA gene is Rv0006. Now click on "Rv0006 gyrA".
  2. The graphic displays the circular genome of the organism on the left (the strain shown in the graphic is H37Rv, the archetypical strain of M. tuberculosis). The block coloured in red at "12 o'clock" on the genome is the locus corresponding to the gyrA gene. Toward the bottom of the page there is a hyperlink "Display high confidence mutations". Click on this link and a table should appear listing all the mutations affecting the minimum inhibitory concentration (MIC) of the drug that have been reported by more than 5 independent sources. A snapshot of this is shown below.
  1. In the fifth column of the table is the listing of the known polymorphisms (mutations) determined to confer a change in drug sensitivity. Each is described by giving the sequence of the codon triplet that is altered. For example, a triplet for alanine, GCC, might be altered to a triplet for serine, TCC, and this is given as "GCC/TCC", i.e. an alteration of the first guanine to a thymine in the codon. The sixth column gives the nucleotide position affected, the seventh gives the position of the codon affected and so on until the fourteenth column (headed "MIC"), which gives the reported effect on the MIC and the drug this has been documented for (Ofl being ofloxacin). Sometimes polymorphisms conferring drug resistance are observed associated with other reported polymorphisms; this is shown in the "Additional mutations" column. These additional mutations don't necessarily confer resistance to the same drug, and may not confer resistance to any drug unless they can also be found in the polymorphism column.
  2. From this table you can identify which, if any, of the polymorphism(s) observed in the sequence have a high confidence of conferring resistance, which codon is affected and what amino acid change leads to resistance. Finally, note the effect on the MIC and which fluoroquinolone this effect has been documented for. Again you will need this information to be able to answer the questions.
  3. Now answer Q.1 to 8 for Exercise 1.
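The comparison of an isolate with the susceptible sequence can also be sketched in Python. This is a minimal illustration, not part of the practical: the two short sequences are made up, the function name is hypothetical, and the codon number is calculated on the assumption that nucleotide 1 is the first base of the start codon.

```python
def find_differences(reference, isolate, first_position):
    """List (nucleotide position, codon number, reference base, isolate base)
    for each mismatch between two equal-length, pre-aligned sequences.

    first_position is the 1-based gene coordinate of the first base given.
    """
    if len(reference) != len(isolate):
        raise ValueError("sequences must be aligned to the same length")
    differences = []
    for offset, (ref_base, iso_base) in enumerate(zip(reference, isolate)):
        if ref_base != iso_base:
            position = first_position + offset
            # Assumes position 1 is the first base of codon 1.
            codon_number = (position - 1) // 3 + 1
            differences.append((position, codon_number, ref_base, iso_base))
    return differences


# Made-up sequences for illustration only (not the practical's data).
reference = "ACCGGCGAC"
isolate = "ACCGGCGGC"
print(find_differences(reference, isolate, first_position=229))
```

The nucleotide position and codon number reported this way are the same kind of coordinates as those listed in the sixth and seventh columns of the mutation table.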

Sequence Alignment and Database searching


In the following exercise you will use the sequence provided to assess the specificity of a TaqMan™ hydrolysis probe that is used in a diagnostic assay for H1N1 influenza A. In order to design such probes it is necessary to have an understanding of the likelihood of homology of the probe sequence with other non-target nucleic acid sequences, and thus the potential for hybridisation of the probe sequence to other sequences that may be present in the biological sample to be tested. This reflects the likelihood of interference in the assay that could reduce the sensitivity, or cause false positivity as a consequence of detection of non-target sequences such as human genomic DNA instead of viral sequences. To do this it is necessary to compare the probe sequence against all other known sequences in the public databases such as EMBL or GenBank. This should also enable you to identify the gene target that is used in this case and any other non-target sequences that are similar and thus likely to affect the specificity of the hybridisation of the probe to its target.

In order to do so you will perform a simple database search operation, which allows the interrogation of the sequence databases with the probe sequence, comparing millions of sequences in a matter of seconds. The ability to do this lies at the heart of modern molecular diagnostics and molecular medicine. It is a technique that is used every day to inform us about the origin and epidemiological relationships of sequences, their likely identity and function.

In essence both sequence alignments and database searching are the same process i.e. a comparison of one sequence against another and the computation of a measure of their relatedness. In practice database searching is a compromise between various parameters.

The result obtained from a database search is a measure of sequence similarity; it is this which allows us to infer that two or more sequences are homologous. Genes that are homologous are deemed to have arisen from the same ancestral gene and are therefore related. This is the basis of phylogenetics.

So how do we measure similarity? There are many different measures; the simplest is that of identity. In the most basic system, an identical base or amino acid pair is scored 1 and a non-identical pair is scored 0 (such a scoring system is termed the identity matrix). In the example below gaps are not scored, and the measure of similarity is simply the sum of the diagonal resulting from each alignment. The alignment giving the highest score is therefore the alignment with the greatest similarity.

The identity matrix for nucleic acids would look something like the table below:

    A  T  G  C
A   1  0  0  0
T   0  1  0  0
G   0  0  1  0
C   0  0  0  1

Alignments of two sequences using the identity matrix with no gap penalty or gap extension penalty

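Sequence 2 runs across the top and Sequence 1 down the left-hand side. Each value in the final column and in the "score" row is the sum of the diagonal that ends immediately above and to its left.

Sequence 2:     A  T  G  G  A  C  C  G
Sequence 1: A   1  0  0  0  1  0  0  0
            A   1  0  0  0  1  0  0  0  0
            G   0  0  1  1  0  0  0  1  0
            G   0  0  1  1  0  0  0  1  1
            A   1  0  0  0  1  0  0  0  2
            C   0  0  0  0  0  1  1  0  1
            G   0  0  1  1  0  0  0  1  0
            score  0  0  2  1  0  2  5  3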

Best alignment with a score of 5:

Sequence 1  AAGGACG
Sequence 2  ATGGACCG
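The diagonal sums in the matrix above can be reproduced with a short Python sketch. It scores only ungapped alignments with the identity matrix. The two sequences are read off from the matrix itself (the row labels, and the pattern of matches in each column), and the function name is made up for illustration.

```python
def diagonal_scores(seq1, seq2):
    """Score every ungapped alignment of seq1 against seq2.

    Keys are offsets (column index minus row index in the matrix);
    values are the number of identical pairs on that diagonal.
    """
    scores = {}
    for i, base1 in enumerate(seq1):
        for j, base2 in enumerate(seq2):
            offset = j - i
            # Identity matrix: 1 for an identical pair, 0 otherwise.
            scores[offset] = scores.get(offset, 0) + (base1 == base2)
    return scores


seq1 = "AAGGACG"   # row labels of the matrix
seq2 = "ATGGACCG"  # read off from the match pattern in the columns
scores = diagonal_scores(seq1, seq2)
best_offset = max(scores, key=scores.get)
print(best_offset, scores[best_offset])
```

An offset of 0 means the two sequences are aligned from their first bases, which is the main diagonal of the matrix.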

In practice, however, identity matrices are not used for amino acid sequences. For nucleotide sequences a variety of scoring matrices are used, but they are essentially variations on the identity matrix shown above.

The most commonly used scoring matrices for proteins are those derived by Dayhoff (PAM) and Henikoff & Henikoff (BLOSUM). The two kinds of matrix are derived by different means, and the PAM matrices are more commonly used for phylogenetics. The BLOSUM matrix is derived from the frequency of substitution of one amino acid for another in a large group of sequences with a known similarity. We would therefore score very different values in our matrix of all possible alignments if such matrices were used to align amino acid sequences, but the process is the same.

A further complication is the possible occurrence of insertions or deletions in sequences. Alignment algorithms get around this problem by allowing gaps in one or other of the sequences during the alignment. As you might expect, INTRODUCING a gap and EXTENDING a gap each have a score which contributes to the overall score. These scores are usually negative values and are therefore referred to as PENALTIES, and in these programs gap scoring is important in finding the optimal alignment.
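The effect of gap penalties on a score can be sketched as follows. The penalty values, the function name and the example alignment are assumptions chosen for illustration; they are not the defaults of BLAST or any other program. The convention assumed here is that the first position of a gap costs the opening penalty and each further position costs the extension penalty.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two aligned rows of equal length in which '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    previous_gap = None  # which row held the gap in the previous column
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            # Opening a gap costs more than extending an existing one.
            total += gap_extend if previous_gap == gap_row else gap_open
            previous_gap = gap_row
        else:
            total += match if a == b else mismatch
            previous_gap = None
    return total


# Six matches plus one gap of length two: 6 + (-2) + (-1).
print(score_alignment("ACGT--AC", "ACGTTTAC"))
```

Making the opening penalty larger than the extension penalty favours one long gap over several short ones, which suits a single insertion or deletion event.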

The software calculates a probability value giving the likelihood that a given pairing (match) is a coincidence (a chance event). Since there are 4 nucleotides, and if for simplicity each position in a sequence is treated as statistically independent of its neighbours, a sequence of 100 bases is one of about 1.6 x 10^60 (4^100) possible combinations; this is because the number of possible sequences increases 4-fold with every additional base. In addition, a shorter sequence would likely be found in this population of all possible combinations multiple times, depending upon its length. Thus a short sequence may be represented many times in the database because it may be found in many different gene sequences submitted to a database like EMBL or GenBank, and so may not be unique to any one entry. The longer the sequence provided in the search, the more likely it is that a unique match will be found, with a low probability that the match occurred by chance.
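The arithmetic behind this argument can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes equal base frequencies and independent positions, and the database size of 3,000,000,000 bases is an assumed round figure, roughly one human genome. It is not how BLAST computes its statistics.

```python
def possible_sequences(length):
    """Number of distinct nucleotide sequences of a given length."""
    return 4 ** length


def expected_chance_matches(query_length, database_length):
    """Expected number of exact matches to a query in random sequence,
    assuming equal base frequencies and independent positions."""
    start_positions = max(database_length - query_length + 1, 0)
    return start_positions / possible_sequences(query_length)


print(f"{possible_sequences(100):.1e}")
for query_length in (10, 15, 23):
    # 3_000_000_000 is an assumed, rounded database size.
    print(query_length, expected_chance_matches(query_length, 3_000_000_000))
```

Under these assumptions a 10-base query is expected to occur by chance thousands of times in a database of that size, whereas a 23-base query is expected far less than once.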

Exercise 2: Database search and alignment with Blastn

You will now carry out a nucleic acid search using the GenBank database in order to assess the specificity of a sequence intended for use as a probe in this diagnostic assay for influenza A.

A sequence is likely to lack specificity in an assay if it shows similarity to other sequences that may occur in a biological sample used. It is worth noting that all biological tissue and body fluids will contain large amounts of the DNA from the organism or person from which it came, as well as that from commensal or other organisms infecting that tissue. Thus the likelihood of a false positive test as a consequence of detection of non-target sequences such as contaminating human genomic DNA is an important consideration.

The probe used in the influenza A viral assay consists of a 23-base single-stranded nucleic acid, the sequence of which is given below via the hyperlink. This sequence has been derived from alignments of previously sequenced strains of H1N1 influenza A virus using a methodology similar to the one you will use in exercise 3.

Sequence comparisons are carried out using algorithms designed to allow the calculation of a statistic by which we can estimate the likelihood that a given alignment is due to chance. A number of different algorithms are available and are used for different purposes. The algorithm you will use is implemented in a program called BLAST and the specific subset of the BLAST program you will use for nucleic acid comparisons is blastn. The software will use a look-up table or matrix as described above to calculate the most likely alignments and provide the highest ranking of these.

  1. It is advisable to read ahead before continuing, however, you will be able to toggle between these instructions and the new window you are about to open.
  2. Click on NCBI-blastn. This will open a new browser window, so you may switch between pages as you need to.
  3. Having arrived at the blast page at the NCBI, there are a number of parameters which are required for a search to be carried out!

  1. Whilst you are waiting, answer the first part of Q.9-11.
  2. The results page will eventually open, this page is divided into three parts:


  1. Now answer questions Q.12-15.
  2. If you returned to the blast submission page and changed the database option to "Human genomic + transcript" in the second section, "Choose Search Set", the program would search for matches only in the human subsection of the database. You don't need to do this now for the practical.
  3. The graphic would look different, showing that the sequences identified contain mismatches, i.e. the black bars do not fill the entire width of the graphic and there is only a partial match.
  4. The list that would be produced (see the image below) would contain only human sequences. Note that the e-value is around 10000 times higher than with the previous alignments.
  5. Because none of the matching stretches corresponds to more than 15 contiguous bases (with no mismatch) out of the 23 bases of the probe, the probe will not hybridise to these sequences if they were present in the sample because of these mismatches, and thus it has a high degree of specificity in the assay. This doesn't mean that the probe does not need to be empirically tested, but it does allow the design to be based upon a rational choice with a high likelihood of success.
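The "contiguous bases with no mismatch" criterion can be illustrated with a minimal Python sketch. The two 23-base sequences are made up for illustration; they are not the real probe or a real database hit, and the function name is hypothetical.

```python
def longest_contiguous_match(probe, target):
    """Length of the longest run of identical bases between two
    equal-length sequences compared position by position."""
    longest = current = 0
    for probe_base, target_base in zip(probe, target):
        current = current + 1 if probe_base == target_base else 0
        longest = max(longest, current)
    return longest


# Made-up 23-base sequences, not the real probe or a real database hit.
probe = "ACGTTGCAAGCTTAGCCGATACG"
hit = "ACGTTGCAAGCTTAGTCGATTCG"
print(longest_contiguous_match(probe, hit))
```

A run length like this is only a rough guide to hybridisation; as noted above, a probe still has to be tested empirically.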

MUSCLE: MUltiple Sequence Comparison by Log-Expectation


Having obtained nucleotide sequences for a gene it is sometimes useful to make multiple alignments of many sequences. Such a comparison may enable the identification of functionally important domains (regions) within a gene or link specific mutations with a hereditary condition. The comparison of sequences at the nucleotide level can be misleading since several different codons can code for the same amino acid. You will need to look up and use the codon table to obtain a better understanding of the implications of this. It is therefore useful to be able to translate sequences and align these at the protein level. Since each codon comprises three nucleotides there are three possible forward and three reverse frames for any segment of DNA, and only one of these is used. The correct reading frame depends upon where the sequence we have lies within the gene with respect to the start codon, which we may not know and cannot tell simply by looking at the sequence itself. Mutations that shift the reading frame lead to incorrectly coded proteins, and those that introduce "in-frame" stop codons cause premature termination of translation; both have very deleterious effects on the function of a gene product. Other mutations, such as those affecting the normal processing of the message (for example splicing or polyadenylation), can also influence normal gene expression, RNA stability and function.

You have already been introduced to the haemoglobin beta gene (HBB), which is located on chromosome 11 as part of the cluster of haemoglobin genes including epsilon, gamma-G and A, delta and beta. Think back to the lectures in CMB2 on the control of gene expression. The HBB gene covers about 1.73kb. It includes several exons along with the intervening introns. Mutations in the beta-globin gene lead to a normally autosomal-recessive condition known as beta-thalassemia. This is a spectrum of disease linked to the severity of the reduced expression of the beta-globin chain and the consequent imbalance between the alpha and beta globin chains. See GeneReviews for more about HBB and thalassemia. The pathophysiology of beta-thalassemia is complex, but it leads to ineffective erythropoiesis (production of erythroid cells) as a consequence of excess alpha-globin chains. This in turn stimulates erythropoiesis, causing splenomegaly and expansion of the erythroid marrow with many characteristic adverse effects.

A huge number of mutations in the beta-globin gene have been documented, including mutations resulting in the down-regulation of transcription, as well as alterations in the splicing or maturation of the mRNA, alongside alterations leading to truncated or incorrectly coded protein.
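The six possible reading frames of a piece of DNA can be listed with a short Python sketch. The fragment `ATGGCCTAAGTC` is hypothetical and the function names are made up for illustration; the three stop codons are those of the standard genetic code.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return the codons of the three forward and three reverse frames."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for shift in range(3):
            # Only complete triplets are kept in each frame.
            codons = [s[i:i + 3] for i in range(shift, len(s) - 2, 3)]
            frames[f"{strand}{shift + 1}"] = codons
    return frames


def has_stop(codons):
    """True if any codon in the frame is a stop codon."""
    return any(codon in STOP_CODONS for codon in codons)


# Hypothetical fragment for illustration.
for name, codons in six_frames("ATGGCCTAAGTC").items():
    print(name, codons, "stop" if has_stop(codons) else "open")
```

Inserting or deleting a single base moves every downstream codon into a different frame, which is why such changes are so disruptive.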

Some of the variations in the HBB gene

Single base differences in the gene sequence within the coding region (the part that codes for protein) can lead to alterations in the amino acid content of the translated product and, as we've already learnt, these alterations are termed non-synonymous single nucleotide polymorphisms (SNPs). However, alterations can also occur in this region that do not affect the protein encoded, and these are referred to as synonymous single nucleotide polymorphisms. Synonymous SNPs are a consequence of redundancy in the codon usage. Think about the codon table and how many times each amino acid appears in the table.
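The redundancy of the codon table can be counted directly. The Python sketch below builds the standard genetic code (amino acids in one-letter code, "*" for stop) and counts how many codons specify each amino acid; the variable names are chosen here for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

# How many of the 64 codons specify each amino acid (or a stop)?
degeneracy = Counter(CODON_TABLE.values())
for aa, count in sorted(degeneracy.items(), key=lambda item: (-item[1], item[0])):
    print(aa, count)
```

Amino acids with several codons, such as leucine, serine and arginine, leave more room for synonymous changes than methionine or tryptophan, which have only one codon each.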

It is possible to carry out nucleic acid sequence alignments to compare sequences using a program such as MUSCLE or Clustal Omega. Unlike the alignment carried out by BLAST, which aligns pairs of sequences, MUSCLE aligns multiple nucleic acid sequences together and searches for the best global alignment of them all.

Exercise 3:

  1. Click on the following link HBBsequences.txt; this will open in a new window. The file contains 11 sequences. The first sequence is much shorter and is provided to enable you to identify the exonic sequence contained in the gene, as it only contains the fully functional "Beta-globin-message". The second sequence is from a normal individual and consists of the "normal HBB" gene sequence. The nine other sequences are the HBB gene sequences from patients with either Beta0 or Beta+ thalassemia (Patients A-I). The gene sequences run from the start of the first exon to the end of the last exon, including the intervening introns. Each sequence is given in the FASTA format, i.e. ">Name" followed by the sequence starting on a new line.

Example of FASTA formatted sequence
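For readers who want to handle such a file programmatically, a minimal FASTA reader can be sketched in Python as below. The two records are made up for illustration and are not taken from HBBsequences.txt; the function name is hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Sequence lines are joined until the next header.
            records[name] += line.upper()
    return records


# Made-up records for illustration; not taken from HBBsequences.txt.
example = """>Example-message
ATGGTGCAT
CTGACT
>Example-gene
ATGGTGCACCTGACT
"""
records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```

The ">Name" line identifies each record, and the sequence may be split over any number of following lines.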

  1. Click on the link to the MUSCLE tool
  2. Copy the eleven nucleic acid sequences from the hyperlinked text file and paste them, including the ">name" lines, into the MUSCLE query box.

  3. Leave everything else unchanged
  4. We will run the analysis interactively i.e. we want the results to be returned to the browser and will do a full alignment using the fast alignment option in "step 2"

  1. Now click the "submit" button
  2. When the results are returned (this may take a little time depending on how busy the server is), scroll down and look at the alignment you have performed.

  1. Below the alignment is a series of stars and within the alignment itself are dashes; these help you to compare sequences, for example to identify possible insertions or deletions (indels). Now answer the first question of exercise 3.
  2. Now go to the top of the page and press the tab "Results Summary". The page will change to provide two tables. You may be given a prompt about Java; select to continue. Java will load a button near the top, marked "Start JalView".
  3. Click this button. A warning will appear; accept this and run the program. It may take a few seconds to start, so be patient. This will start a program for visualising the alignment by colouring it in different ways, calculating and displaying dendrograms and editing the output.

  1. From the menu bar of the new window now click colour, select percentage identity. JalView will colour the alignment leaving the mismatches highlighted as below.
  2. If you now stretch or maximise the window and click "Format" and select "Wrap", this will wrap the sequence from the 5' to 3' end within the window. You will still have to scroll down to see the rest of the sequence as the longest is 1748 bases. Placing your mouse over a mismatched nucleotide (highlighted in white) will show the residue position and sequence ID in the lower left hand corner. This is useful to identify the location of any mutation. However, it is important to realise that the numbering is the position with respect to that particular sequence, not the normal gene; there is at least one sequence with a deletion and one with an insertion.

  1. Note that the Beta-globin message will be aligned to all the gene sequences. Comparing the message to the normal gene sequence will identify the matching exons, with the intervening unmatched sequences corresponding to the introns. These two sequences (the message and the normal HBB gene) are provided to help you to identify the location of the mutation in the remaining beta-thalassemia patient samples as being either intronic or exonic. The first nucleotide of the ATG start codon is located at position 177, and the last nucleotide of the TAA stop codon at position 620, of the normal message. Knowing this allows you to identify further whether the mutation is found in the 5' UTR, the 3' UTR or the coding sequence.
  2. Now answer the next part about the nucleic acid alignment in Q.17-20.
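The use of positions 177 and 620 can be sketched in Python as follows. The function name is made up for illustration, and the codon number assumes that the ATG start codon is counted as codon 1; other numbering conventions exist.

```python
START_FIRST_BASE = 177  # first nucleotide of the ATG start codon
STOP_LAST_BASE = 620    # last nucleotide of the TAA stop codon


def locate_in_message(position):
    """Classify a 1-based position in the normal message.

    Returns (region, codon number); the codon number is None outside
    the coding sequence and counts the ATG start codon as codon 1.
    """
    if position < START_FIRST_BASE:
        return "5' UTR", None
    if position > STOP_LAST_BASE:
        return "3' UTR", None
    codon_number = (position - START_FIRST_BASE) // 3 + 1
    return "coding sequence", codon_number


for position in (100, 177, 196, 620, 700):
    print(position, locate_in_message(position))
```

Positions in a patient's gene sequence first have to be mapped onto the message through the alignment, since the gene contains introns and may carry insertions or deletions.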

  1.  A useful way of displaying information about how related sequences are is as a dendrogram. This is basically a branched system connecting related data; the length of the branch indicates relatedness, i.e. the shorter the branch the more closely related the data. Select the 10 gene sequences, excluding the beta globin message, by clicking on the sequence names and dragging across these. Click "Calculate", select "Calculate Tree" and then "Average Distance using BLOSUM62"; a dendrogram will appear.

  1. Now answer the last questions of the Moodle quiz.
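The idea of a distance between aligned sequences, on which a dendrogram is built, can be sketched in Python. This is not the method JalView uses for "Average Distance using BLOSUM62"; it is a simpler stand-in that counts the proportion of differing sites, and the aligned rows and names below are made up for illustration.

```python
from itertools import combinations


def p_distance(row1, row2):
    """Proportion of differing sites between two aligned rows,
    ignoring any column in which either row has a gap ('-')."""
    compared = differing = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue
        compared += 1
        differing += a != b
    return differing / compared if compared else 0.0


# Made-up aligned rows for illustration only.
aligned = {
    "normal": "ATGGTGCACCTG",
    "patientX": "ATGGTGCATCTG",
    "patientY": "ATGG-GCATCTG",
}
for (name1, row1), (name2, row2) in combinations(aligned.items(), 2):
    print(name1, name2, round(p_distance(row1, row2), 3))
```

Sequences separated by smaller distances are joined by shorter branches when such a table is turned into a tree.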