Computer practical: Principles of DNA sequencing, the use of sequence databases and sequence comparison tools
In this "dry practical" you will be introduced to the principles of DNA sequencing by dideoxy-terminated elongation of an oligonucleotide-primed template. You will learn some principles of database searching and the use of algorithms and scoring matrices to carry out sequence comparisons. The practical is principally aimed at extending your knowledge of the use of databases and database tools in modern molecular and medical science, and demonstrates a number of simple but important tools in everyday use. The BST2 laboratory practical will already have introduced you to some aspects of sequence traces and the use of database searching to find matching sequences. Here we aim to further your understanding of this and to illustrate other applications of these tools.
* Exercise 1: DNA sequencing in a clinical setting.
The principles of DNA sequencing methodology, specifically the "dideoxy method", will be explored in more detail. Two traces from real patients are shown; they were generated by dideoxy-terminated sequencing reactions, the products of which were separated by gel electrophoresis. The traces are from two isolates of Mycobacterium tuberculosis and show only a small portion of the read for the gyrA gene. The aim of the exercise is to familiarise you with the appearance of such traces, to compare the sequence of these two isolates and to understand how this information informs the clinical management of such patients.
* Exercise 2: Identification of a sequence in the database and sequence comparison as an indicator of diagnostic specificity.
You will use the sequence provided to assess the specificity of a real-time PCR hydrolysis probe that is used in a diagnostic assay for H1N1 influenza A virus. To design such probes it is necessary to understand the similarity of the probe sequence to other, non-target nucleic acid sequences, and thus the potential for the probe to hybridise to other sequences that may be present in the biological sample used. To do this, the probe sequence is compared against all other known sequences in the public databases, the largest and most comprehensive being EMBL and GenBank, by performing a blastn search.
* Exercise 3: Multiple Sequence alignment.
Additional gene sequences will be provided for you. You will compare these using the web-based MUSCLE program to identify differences between the sequences; in doing so you will perform a multiple sequence alignment. This will enable you to explore the biological importance of differences found in nucleotide sequences associated with beta-thalassemia by comparison with the "normal" sequence of the beta haemoglobin (HBB) gene. By also comparing the gene sequences with the message (mRNA) you will then be able to identify the location of alterations relative to features within the gene such as introns and exons, specific codons, reading frames and polyadenylation sequences.
The background and revision notes on this page are also available as a PDF: click here.
You will be expected to submit a Moodle quiz evaluating your understanding of the practical, the background to it and the results obtained from your analysis before you leave.
You are prompted to download or open files as you work, and during the session you will generate results that you will NEED to use subsequently. It is possible to complete all the tasks DURING the session and you are required to do so. It is essential that you read the background and follow the instructions carefully; you are not expected to be a computer wizard!
You are expected to use your common sense and read the instructions, so use the opportunity to ask if something is unclear.
You should submit your answers before leaving; failing to do so may result in no marks being awarded.
Several different techniques for nucleotide sequencing have been developed over the years. However, nearly all of them rely upon the chemistry of nucleotide chain elongation by a family of enzymes known as DNA-dependent DNA polymerases. You will recall meeting these enzymes in CMB2; the family includes the thermostable polymerases used for PCR. You will remember that these enzymes have so-called end-filling properties: they initiate elongation of a partially double-stranded DNA template from the 3' end of the strand that is complementary to the template strand. Elongation therefore proceeds as the enzyme extends the newly synthesised strand away from its 5' end, in a 5' to 3' direction.
The addition of a deoxy-nucleotide tri-phosphate (dNTP) to the 3' end of a nucleic acid entails the formation of a phosphodiester bond between the alpha phosphate of the incoming dNTP and the OH group at the 3' position of the deoxyribose sugar of the terminal nucleotide. However, if the enzyme is given a modified nucleotide tri-phosphate that is missing this OH group, i.e. a dideoxy-nucleotide tri-phosphate (ddNTP), elongation cannot proceed once the ddNTP has been added, because the dideoxy-nucleotide occupies the terminal position of the elongating strand and lacks the requisite 3' OH group.
Thus, if a reaction is allowed to proceed in which each of the four nucleotide tri-phosphates (i.e. GTP, CTP, TTP and ATP) is present as a mixture of the deoxy- form (e.g. dGTP) and the dideoxy- form (e.g. ddGTP), elongation will be terminated at some point during the elongation phase of the reaction. The point of termination depends upon when a dideoxy-nucleotide is captured by the polymerase and incorporated into the elongating strand, which in turn depends upon the relative concentrations of the competing deoxy- and dideoxy-nucleotides. The reaction will produce a mixed population of molecules, and the length of each individual molecule in that mixture is determined by the position in the sequence of the nucleotide corresponding to the dideoxy-nucleotide that terminates the chain (see the PowerPoint show; to pause it, right-click and select "pause", and to exit select "end show").
In order to identify which of the four bases occurs at the terminating position, each of the four dideoxy-nucleotides (ddGTP, ddCTP, ddTTP and ddATP) is labelled with a different fluorescent dye. These dyes emit light at different wavelengths. The sequence can therefore be determined simply by sorting the molecules by length and detecting the light emitted from the incorporated label of each of the four dideoxy-nucleotides. This length-dependent sorting is achieved by capillary electrophoresis. Gel electrophoresis was also covered in lectures in CMB2 and in your BST2 tutorials, and you should by now understand the principles of this technique.
In brief, the nucleic acid is passed through a gel matrix by applying a voltage across two electrodes, and the negatively charged (anionic) nucleic acid migrates toward the positive electrode (the anode). The molecules are retarded by the matrix according to their size: larger molecules are retarded to a greater extent and as a consequence move through the matrix more slowly. Molecules of the same size migrate together, and each such population of nucleic acid species is detected using the fluorescent label on the terminating dideoxy-nucleotide.
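The logic of chain termination followed by sorting by length can be sketched in a few lines of Python. This is a toy model only: the template is made up, the primer is ignored, and it assumes that a terminated fragment is produced at every position, which in a real reaction depends on the ratio of deoxy- to dideoxy-nucleotides.

```python
import random

# Toy model of dideoxy (chain-termination) sequencing.
# The template and all function names are hypothetical illustrations.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def terminated_fragments(template_3_to_5):
    """Return (length, terminal_base) for every possible terminated product.

    The template is written 3'->5', so the new strand grows 5'->3'.
    A fragment of length n ends in the ddNTP that pairs with template base n.
    """
    new_strand = [COMPLEMENT[base] for base in template_3_to_5]
    return [(n + 1, base) for n, base in enumerate(new_strand)]


def read_sequence(fragments):
    """Sort fragments by length (as electrophoresis does) and read the labels."""
    return "".join(base for _, base in sorted(fragments))


if __name__ == "__main__":
    template = "TACGGTCA"            # hypothetical template, written 3'->5'
    fragments = terminated_fragments(template)
    random.shuffle(fragments)        # the reaction tube is an unordered mixture
    print(read_sequence(fragments))  # prints ATGCCAGT
```

The shortest fragment reports the first base added after the primer and the longest reports the last, which is why ordering by size is enough to recover the sequence.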
Sequencing has now been carried out by semi-automated sequencers for some years. In these instruments the captured light is transformed into a trace (a graph showing the normalised emission of the four dyes). This is converted to a sequence according to the peaks produced for each dye. A typical readout is seen below in exercise 1.
The gyrA gene of Mycobacterium tuberculosis encodes subunit A of DNA gyrase, a type II topoisomerase. This enzyme catalyses the energy-dependent breakage, passage and rejoining of double-stranded DNA to allow the uncoiling and separation of interlocking strands of DNA, and is thus essential for the replication of a circular bacterial chromosome, which otherwise cannot segregate. Inhibition of the catalytic activity of DNA gyrase by fluoroquinolones such as ofloxacin and ciprofloxacin leads to the killing of many organisms that rely upon the enzyme. Mutations in the gyrA gene can confer resistance and alter the effective dosage, as reflected in the minimum inhibitory concentration (MIC) of a drug.
The region of gyrA given in the traces corresponds to nucleotide positions 229-313 and includes the sites of some well-documented mutations that lead to increased resistance to the fluoroquinolones used in the treatment of tuberculosis. This region also contains some other well-known non-synonymous single nucleotide polymorphisms (SNPs). Some of these are coincidentally seen in the same populations but themselves have no effect on sensitivity to the drug.
A single nucleotide polymorphism is simply an established variant of a single nucleotide at a specific location within a sequence. Such variation can occur anywhere within the genome, but if it occurs within a coding region it can be one of two types: synonymous or non-synonymous. A non-synonymous SNP alters the amino acid coded for by the affected codon; a synonymous SNP does not.
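The distinction between the two types of SNP can be checked mechanically with a codon table. The sketch below builds the standard genetic code and classifies a change between two codons; the function name and the example codons are illustrative assumptions, not data from the traces used in this practical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon, alt_codon):
    """Return (type, reference amino acid, variant amino acid) for a codon change."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    # Hypothetical codon changes chosen only to show both outcomes.
    print(classify_snp("GAC", "GAT"))  # ('synonymous', 'D', 'D')
    print(classify_snp("GAC", "GGC"))  # ('non-synonymous', 'D', 'G')
```

Note that the codon must be read in the correct reading frame for the classification to be meaningful.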
Sequence Alignment and Database searching
In the following exercise you will use the sequence provided to assess the specificity of a TaqMan™ hydrolysis probe that is used in a diagnostic assay for H1N1 influenza A. To design such probes it is necessary to understand how similar the probe sequence is to other, non-target nucleic acid sequences, and thus the potential for the probe to hybridise to other sequences that may be present in the biological sample to be tested. This reflects the likelihood of interference in the assay, which could reduce its sensitivity or cause a positive result through the detection of non-target sequences such as human genomic DNA instead of viral sequences. To do this it is necessary to compare the probe sequence against all other known sequences in public databases such as EMBL or GenBank. This should also enable you to identify the gene target that is used in this case, and any non-target sequences that are similar and thus likely to affect the specificity of the hybridisation of the probe to its target.
To do so you will perform a simple database search, which interrogates the sequence databases with the probe sequence, comparing it against millions of sequences in a matter of seconds. The ability to do this lies at the heart of modern molecular diagnostics and molecular medicine. It is a technique that is used every day to inform us about the origin and epidemiological relationships of sequences, and their likely identity and function.
In essence, sequence alignment and database searching are the same process, i.e. a comparison of one sequence against another and the computation of a measure of their relatedness. In practice, database searching is a compromise between various parameters.
The result obtained from a database search is a measure of sequence similarity, and it is this that allows us to infer that two or more sequences are homologous. Homologous genes are those deemed to have arisen from the same ancestral gene and are therefore related; this is the basis of phylogenetics.
So how do we measure similarity? There are many different measures, the simplest of which is identity. In the most basic system, an identical base or amino acid pair is scored 1 and a non-identical pair is scored 0 (such a scoring system is termed the identity matrix). In the example, gaps are not scored, and the measure of similarity is simply the sum of the scores along the diagonal resulting from each alignment. The alignment giving the highest score is therefore the alignment with the greatest similarity.
The identity matrix for nucleic acids would look something like the table below:
    A  C  G  T
A   1  0  0  0
C   0  1  0  0
G   0  0  1  0
T   0  0  0  1
Alignments of two sequences using the identity matrix with no gap penalty or gap extension penalty
Best alignment with a score of 5:
In practice, however, identity matrices are not used for amino acid sequences. For nucleotide sequences a variety of matrices are used, but they are essentially variations on the identity matrix described above.
The most commonly used scoring matrices for proteins are those derived by Dayhoff (PAM) and by Henikoff & Henikoff (BLOSUM). The two families of matrices are derived by different means, and the PAM matrices are more commonly used for phylogenetics. The BLOSUM matrices are derived from the frequency of substitution of one amino acid for another in a large group of sequences with a known similarity. We would therefore score very different values in our matrix of all possible alignments if such matrices were used to align amino acid sequences, but the process is the same.
A further complication is the possible occurrence of insertions or deletions in sequences. Alignment algorithms get around this problem by allowing gaps in one or other of the sequences during the alignment. As you might expect, introducing GAPS and EXTENDING GAPS also carry scores that contribute to the overall score. These are usually negative values and are therefore referred to as PENALTIES; in these programs GAP scoring is important in finding the optimal alignment.
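As a minimal sketch of identity scoring without gaps, the Python below slides one sequence along another, sums the identity scores for the overlapping positions at each offset, and reports the best-scoring offset. The sequences and function names are hypothetical; the sketch does not reproduce the worked example in the figures, and it is not how BLAST itself searches.

```python
# Identity matrix for nucleotides: 1 for an identical pair, 0 otherwise.
IDENTITY = {(a, b): int(a == b) for a in "ACGT" for b in "ACGT"}


def ungapped_score(seq1, seq2, offset):
    """Score seq2 placed `offset` positions along seq1 (offset may be negative).

    Only positions where the two sequences overlap are scored.
    """
    score = 0
    for j, base2 in enumerate(seq2):
        i = j + offset
        if 0 <= i < len(seq1):
            score += IDENTITY[(seq1[i], base2)]
    return score


def best_ungapped_alignment(seq1, seq2):
    """Return (score, offset) for the highest-scoring ungapped placement."""
    offsets = range(-(len(seq2) - 1), len(seq1))
    scored = ((ungapped_score(seq1, seq2, k), k) for k in offsets)
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    # Hypothetical sequences: TTAC matches GATTACA starting at index 2.
    print(best_ungapped_alignment("GATTACA", "TTAC"))  # (4, 2)
```

Each offset corresponds to one diagonal of the comparison matrix, so choosing the best offset is the same as choosing the diagonal with the highest sum.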
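To illustrate how a gap penalty enters the score, here is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch approach). It is a simplification: a single penalty is charged for every gap position, rather than separate gap opening and gap extension penalties, and the default values for match, mismatch and gap are assumptions chosen for illustration, not the settings of any particular program.

```python
def global_alignment_score(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Best global alignment score with a linear gap penalty.

    score[i][j] holds the best score for aligning the first i bases of seq1
    with the first j bases of seq2.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two bases
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
    return score[-1][-1]


if __name__ == "__main__":
    # Hypothetical sequences differing by one deleted base.
    print(global_alignment_score("GATTACA", "GATACA"))  # 5: six matches, one gap
```

Making the gap penalty more negative makes the algorithm less willing to introduce gaps, which is why gap scoring influences which alignment is judged optimal.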
The software calculates a probability value giving the likelihood that a given pairing (match) is a coincidence (a chance event). Since there are 4 nucleotides, the number of possible sequences increases 4-fold with every additional base, so a sequence of 100 bases is one of about 1.6 x 10^60 (that is, 4^100) possible combinations. Note that consecutive nucleotides in a real sequence are not, statistically speaking, independent of one another, so not all of these combinations are equally likely. In addition, a shorter sequence would likely be found multiple times in this population of all possible combinations, depending upon its length. Thus a short sequence may be represented many times in the database, because it may be found in many different gene sequences submitted to a database like EMBL or GenBank, and so may not be unique to any one entry. The longer the sequence provided in the search, the more likely it is that a unique match will be found with a low probability that the match occurred by chance.
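The arithmetic behind this argument can be sketched as follows. The estimate of chance matches assumes random sequence with all four bases equally likely and independent, which real sequences are not, and the database size used is a made-up round number rather than the size of GenBank. This is not the statistic that BLAST reports; it only shows why longer queries are less likely to match by chance.

```python
def possible_sequences(length):
    """Number of distinct nucleotide sequences of a given length (4 per position)."""
    return 4 ** length


def expected_chance_matches(query_length, database_bases):
    """Expected number of exact matches to a query in random sequence.

    Assumes equally likely, independent bases and a single strand of
    `database_bases` nucleotides (a simplifying assumption).
    """
    start_positions = max(database_bases - query_length + 1, 0)
    return start_positions / possible_sequences(query_length)


if __name__ == "__main__":
    print(f"{possible_sequences(100):.1e}")  # 1.6e+60

    hypothetical_database = 10 ** 9          # made-up database size in bases
    for length in (8, 12, 23):
        matches = expected_chance_matches(length, hypothetical_database)
        print(length, matches)
```

Under these assumptions the expected number of chance matches falls 4-fold for every base added to the query.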
Exercise 2: Database search and alignment with Blastn
You will now carry out a nucleic acid search using the GenBank database in order to assess the specificity of a sequence intended for use as a probe in this diagnostic assay for influenza A.
A sequence is likely to lack specificity in an assay if it shows similarity to other sequences that may occur in a biological sample used. It is worth noting that all biological tissue and body fluids will contain large amounts of the DNA from the organism or person from which it came, as well as that from commensal or other organisms infecting that tissue. Thus the likelihood of a false positive test as a consequence of detection of non-target sequences such as contaminating human genomic DNA is an important consideration.
The probe used in the influenza A viral assay consists of a 23-base single-stranded nucleic acid, the sequence of which is given below via the hyperlink. This sequence has been derived from alignments of previously sequenced strains of H1N1 influenza A virus, using a methodology similar to the one you will use in exercise 3.
Sequence comparisons are carried out using algorithms designed to allow the calculation of a statistic by which we can estimate the likelihood that a given alignment is due to chance. A number of different algorithms are available and are used for different purposes. The algorithm you will use is implemented in a program called BLAST and the specific subset of the BLAST program you will use for nucleic acid comparisons is blastn. The software will use a look-up table or matrix as described above to calculate the most likely alignments and provide the highest ranking of these.
- Of the circular option buttons in the 2nd section "Choose Search Set", select "Others (nr etc.):". This selects the entire GenBank database of 30 million plus sequences.
- Further down the page, under the section "Program Selection", click on the check box marked "Somewhat similar sequences (blastn)". This tells the server to use blastn.
- Open this text file, then copy and paste the query sequence in this file (using ctrl+c and ctrl+v) into the box provided in the first section "Enter Query Sequence". Note that the sequence you are given is in the 5' to 3' direction. Sequences are always handled in this orientation.
- Now click on the square check box next to the large BLAST button, forcing the browser to "show the results in new window". You will come back to this query submission page later.
- The page should look something like that shown below. Now click the BLAST button to perform the search.
MUSCLE: MUltiple Sequence Comparison by Log-Expectation
Having obtained nucleotide sequences for a gene, it is sometimes useful to make multiple alignments of many sequences. Such a comparison may enable the identification of functionally important domains (regions) within a gene, or link specific mutations with a hereditary condition. The comparison of sequences at the nucleotide level can be misleading, since several different codons can code for the same amino acid. You will need to look up and use the codon table to obtain a better understanding of the implications of this. It is therefore useful to be able to translate sequences and align them at the protein level. Since each codon comprises three nucleotides, there are three possible forward and three possible reverse reading frames for any segment of DNA, and only one of these is used. The correct reading frame depends upon where the sequence we have lies within the gene with respect to the start codon, which we may not know and cannot tell simply by looking at the sequence itself. Mutations that shift this reading frame lead to incorrectly coded (mis-sense) proteins, while those that introduce "in-frame" stop codons cause premature termination of translation; both have very deleterious effects on the function of a gene product. Other mutations, such as those affecting the normal processing of the message (for example splicing or polyadenylation), can also influence normal gene expression, RNA stability and function.
You have already been introduced to the haemoglobin beta gene (HBB), which is located on chromosome 11 as part of the cluster of haemoglobin genes including epsilon, gamma-G and gamma-A, delta and beta. Think back to the lectures in CMB2 on the control of gene expression. The HBB gene covers about 1.73 kb. It includes several exons along with the intervening introns. Mutations in the beta-globin gene lead to a condition, normally autosomal recessive, known as beta-thalassemia. This is a spectrum of disease linked to the severity of the reduced expression of the beta-globin chain and the consequent imbalance between the alpha- and beta-globin chains. See GeneReviews for more about HBB and thalassemia. The pathophysiology of beta-thalassemia is complex, but it leads to ineffective erythropoiesis (production of erythroid cells) as a consequence of excess alpha-globin chains. This in turn stimulates erythropoiesis, causing splenomegaly and expansion of the erythroid marrow with many characteristic adverse effects.
A huge number of mutations in the beta-globin gene have been documented, including mutations resulting in the down-regulation of transcription, alterations in the splicing or maturation of the mRNA, and alterations leading to truncated or incorrectly coded protein.
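The six possible reading frames of a segment of DNA can be enumerated with a short Python sketch that translates the sequence and its reverse complement from each of the three starting positions. The input sequence and function names are illustrative assumptions and are not the sequences supplied for the exercise; stop codons are shown as "*".

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate complete codons starting at position `frame` (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse)."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {f"+{f + 1}": translate(dna, f) for f in range(3)}
    frames.update({f"-{f + 1}": translate(reverse, f) for f in range(3)})
    return frames


if __name__ == "__main__":
    # Short made-up sequence, written 5'->3'.
    for name, protein in six_frames("ATGGTGCACCTGACT").items():
        print(name, protein)
```

For this input, frame +1 reads MVHLT while frame +2 reads WCT*, showing how a shift of a single base changes every codon and can expose a stop codon.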
Some of the variations in the HBB gene
Single base differences in the gene sequence within the coding region (the part that codes for protein) can lead to alterations in the amino acid content of the translated product and, as we have already learnt, these alterations are termed non-synonymous single nucleotide polymorphisms (SNPs). However, alterations can also occur in this region that do not affect the protein encoded, and these are referred to as synonymous single nucleotide polymorphisms. Synonymous SNPs are a consequence of redundancy in the codon usage. Think about the codon table and how many times each amino acid appears in the table.
It is possible to carry out nucleic acid sequence alignments to compare sequences using a program such as MUSCLE or Clustal Omega. Unlike BLAST, which aligns pairs of sequences, MUSCLE aligns multiple nucleic acid sequences together and seeks the best global alignment across all of them.
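Once a multiple alignment has been produced, finding the differences amounts to scanning it column by column. The sketch below assumes the aligned sequences are already available as equal-length strings with "-" marking gaps; the sequence names and contents are made up for illustration, and this is not part of MUSCLE itself.

```python
def variable_columns(aligned):
    """List the alignment columns in which the sequences are not all identical.

    `aligned` maps a sequence name to its aligned sequence. Returns a list of
    (position, column) pairs, with positions counted from 1 and the column
    given as one character per sequence in input order.
    """
    lengths = {len(sequence) for sequence in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")

    differences = []
    for position, column in enumerate(zip(*aligned.values()), start=1):
        if len(set(column)) > 1:
            differences.append((position, "".join(column)))
    return differences


if __name__ == "__main__":
    # Hypothetical alignment of a reference and two variants.
    alignment = {
        "reference": "ATGGTGCAC",
        "variant_1": "ATGGTGCAC",
        "variant_2": "ATGATG-AC",
    }
    print(variable_columns(alignment))  # [(4, 'GGA'), (7, 'CC-')]
```

Positions reported this way are alignment columns; to relate them to codons or exons they must be mapped back to coordinates in the reference sequence, allowing for any gaps.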
Exercise 3:
Example of FASTA formatted sequence
- MUSCLE, like BLAST, can use different matrices; however, we will not alter any of these settings and will use the defaults in steps 2 and 3, so there is no need to change anything.
- You do not need to be notified by email of the result.
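In FASTA format each record begins with a header line starting with ">", followed by one or more lines of sequence. The minimal Python sketch below reads such text into a dictionary. The records shown are made up for illustration and are not the sequences supplied for the exercise, and taking the first word of the header as the record name is an assumption of this sketch.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}.

    The name is the first word after '>' on the header line; sequence lines
    belonging to one record are joined together.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Hypothetical records, for illustration only.
EXAMPLE = """>seq1 made-up example record
ATGGTGCACCTG
ACTCCTGAG
>seq2 another made-up record
ATGGTGCATCTG
"""

if __name__ == "__main__":
    for name, sequence in parse_fasta(EXAMPLE).items():
        print(name, len(sequence), sequence)
```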
NOW SUBMIT YOUR MOODLE QUIZ