Bioinformatics Half Module:

Computer practical:

Principles of DNA sequence alignment and database searches using Blastn & ClustalW.

* Exercise 1: Identification of the sequence in the database.

You will use the sequence provided in your project in a comparison with sequences held in the Human Embl database at the nucleotide level using a Blastn search. This will allow you to identify similar or related sequences in the database.

* Exercise 2 & 3: Sequence alignment.

Additional sequences will be provided for you. Carry out a comparison, to identify differences in the sequences using web-based tools at the EBI mirror site, and you will align them using the ClustalW program. Eventually, you will explore the ‘biological’ importance of differences found in the nucleotide sequences by a comparison with the results derived from a protein sequence alignment. You will explore the relationship between a series of orthologues and visualise these as cladograms and phylograms


Sequence Alignment and Database searching

Background

In the following exercise use the sequence provided for your project, which you should have as both nucleic acid and amino acid sequence, and compare it against the gene sequences in the EMBL database and the human subset of the EMBL database. This should enable you to identify the gene from which it is derived with a high degree of likelihood . Optionally, you can perform simple operations, such as translating a nucleic acid sequence into an amino acids, multiple sequence alignments using alternative web based tools form those you have already used finally there are links to other sources of information about the your gene product .

In essence both sequence alignments and database searching are the same process i.e. a comparison of one sequence against another and the computation of a measure of their relatedness. In practice database searching is a compromise between parameters which :

  1. give the best alignment
  2. best distinguish related from unrelated sequences
  3. are fast

The result obtained from a database search is a measure of sequence similarity, it is this which allows us to infer that two or more sequences are homologous.

The story so far as you'll remember:

There are many different measures, the simplest is that of identity. In the most basic system, an identical base or amino acid pair is scored 1, a non-identical pair is scored 0 (such a scoring system is termed the identity matrix), in the example gaps would not be scored and the measure of similarity would simply be the sum of the diagonal of resulting from each alignment. The alignment giving the highest score is therefore the alignment with the greatest similarity.

 

A

T

G

C

A

1

0

0

0

T

0

1

0

0

G

0

0

1

0

C

0

0

0

1

The most basic identity matrix for nucleic acids:

 
 

A

T

G

C

A

5

-4

-4

-4

T

-4

5

-4

-4

G

-4

-4

5

-4

C

-4

-4

-4

5

The identity matrix used for nucleic acids by blastn:

Alignments of two sequences using the identity matrix with no gap penalty or gap extension penalty

Sequence 2

Sequence 1

score

 

A

T

G

G

A

C

C

G

 

A

1

0

0

0

1

0

0

0

 

A

1

0

0

0

1

0

0

0

0

G

0

0

1

1

0

0

0

1

0

G

0

0

1

1

0

0

0

1

1

A

1

0

0

0

1

0

0

0

2

C

0

0

0

0

0

1

1

0

1

G

0

0

1

1

0

0

0

1

0
score     0 0 2 1 0 2

5

3

Best alignment with a score of 5:

Sequence 1

A

T

G

G

A

C

C

G

 

*

-

*

*

*

*

-

 

Sequence 2

A

A

G

G

A

C

G

 

In practice however, just a simple identity matrices for amino acid sequences are not used since such scoring systems allows no weighting for the frequency with which amino acids or bases occur (a/t/g or c's do not occur with equally frequency nor do individual amino acids), the base composition of exons and introns, the redundancy observed in the amino acid codon table etc. In order to obtain better alignments many programs such as blastx first translate the sequence into protein in all six reading frames and only then do they carryout alignments using much more complex matrices containing values other than 0 or 1.

The most commonly used scoring matrices are those derived by Dayhoff (PAM) and Henikoff & Henikoff (BLOSUM). The PAM matrices we have not really discussed, however, are arrived at following the assumption that all homologous sequences are derived from a single common ancestral gene. It is the directness of this path from the ancestral to the current sequence that gives rise to the scoring matrix . The Blosum matrix is derived by a slightly different route as we discussed, i.e. sequences with a given level of similarity are aligned and the frequency of transition from one base to another is compared to the frequency of the individual bases within the alignment. This gives the log likelihood ratio used in the matrix.

As you'll remember a  further complication is the possible occurrence of insertion or deletions in sequences. Alignment algorithms get around this problem by allowing gaps in one or other of the sequences during the alignment. As we discussed introducing GAPS and EXTENDING GAPS also have an contribution to the overall score, although these are usually negative values and are therefore referred to as PENALTIES and in these programs GAP scoring is important in finding the optimal alignment.

Database search and alignment with Blastn

Exercise1 :

Carry out a nucleic acid search on the EMBL database in order to identify the sequence you have derived by some other means

Sequence comparisons are carried out using algorithms designed to allow the calculation of a statistic by which we can estimate the likelihood that a given alignment is due to chance.  The family of programs that you will use is called Blast and the specific subset of the Blast program used for nucleic acid comparisons is blastn, blastx, and for proteins blastp.

There are a number of different ways in which we can make a comparison of two sequences ( gapped or ungapped as well as global and local alignments). The details of how or why all these types of alignments we have already covered.

  1. It is now advisable to read ahead before continuing, however, you will be able to toggle between these instructions and the new window you are about to open.
  2. Click on the EBI tool box link here to open a new navigator window, so you may switch between windows if you need to.
  3. Now click on Homology & Similarity : and then on the link to Blast2-WU nucleotide

  1. Having arrived at the Washington University blast page at the EBI, there are a number of parameters which are required for a successful search to be carried out!
  • you must supply your E-mail address
  • in the pull down list under Programs you can select the blast program you require eg Blastn however this should be the default, changing the blast  program to for example tblastx would  allow a search of a translated nucleotide sequence against a  nucleotide database translated into all six frames. The default database to search is  EMBL Release. Change this option to EMBL Expressed Sequence Tag. (an EST is Expressed Sequence Tags. A small sequence from an expressed gene that can be amplified by PCR. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs). Changing the option under "DATABASE" to "Protein" will forward you to the protein similarity search page using blastp or blastx which you don't at the moment.
  • you will notice the matrix selected is set to default, for blastn this is a type of identity matrix with weighted scoring however, for blastp (protein) searches this is the blosum62 matrix of Henikoff and Henikoff, you should leave this as the default value. You can try different choices later.
  • You can now copy and paste the sequence of your choice ; into the box provided, remember it is important to enter it in the 5' to 3' direction. Alternatively you may use the sequence provided here NM_001250.txt
  • the page should look something like that shown below, now click RUN Blast

  1. When you receive the results, you will be presented with a summary table and various buttons providing different ways to view the information

  1. scroll down, you will see below this is a second table containing a summary of the High-scoring Segment Pairs, sometimes refered to as HSPs. The list is ordered in lowest to highest Probability which represents the likelihood of obtaining the given alignment by chance. So that a P(n) of 0.0001 means that you would expect a similar alignment to be found for every 10,000 sequences searched i.e. the probability is 1 in 10,000 (the probability value is given under the last column with the heading "E()"). Sometimes instead of listing a probability an statistic an  E- value is supplied, this represents the number of similar alignments you would expect to retrieve from the database with the same alignment purely by chance. E-values are calculated from the p-value but since it takes into account the number of sequences searched it provides a better idea of whether the HSPs provided are real or chance matches. The Score is the sum of the path calculated through the alignment matrix and is thus a summary statistic for the alignment and is used to calculate the P-value.

  1. Go back to the summary table at the top of the page now click on the "Blast Result" button. The lower table will be replaced by the actual output from the program. It is important to be able to recognize and understand the output from the blast suite of programs, this information is often presented in different formats depending on the web interface you are accessing. Some information is provided in the header of the raw blast out put regarding the program used and the database searched and the number of sequences it contains
  1. Below this will be the list of Sequences producing High-scoring Segment Pairs, that you already saw. The list contains a line of information (often incomplete ) about each sequence found and a hyperlink to the entry in the database. This is the same information listed in the source column above, click on several of these near the top of the list.
  2. Take a look at the one of the entries returned (clicking on the hyperlink opens a new window or a new tab, so the blast results window will remain open behind the new one). The  format of these files is a standard type known as EMBL file format 
  3. Return to the blast results window  in your browser.
  4. If the identity of your query sequence was unknown then this method could be used to identfy it with a given probability .
  5. Now scroll down below the list to see the first of the actual alignments. Notice each new HSP starts with ">" followed by the identifier and summary information
  1. Go back to summary table by clicking on the appropriate button at the top of the page now click on the button "MVeiw". This is a visualisation tool which stacks the pairwise alignments that blast has produced and colours them it is NOT a multiple sequence alignment!
  2. Select some matches you find interesting, for example all those listed as "full length cDNA". This can be done by clicking on the check box. Next download those entries/sequences from the data base to the browser by clicking the "download" button (make sure the adjacent drop down box is on "fasta". Copy and paste the seqences from the browser to note pad or word and then Save the sequnces as a text document ! (Using the save as txt document in the browser under file may not work)
  3. You can now try running the same search using different database eg "EMBL Release" or "EMBL coding Sequence", why might you choose a particular database such as these over any other.

ClustalW: multiple sequence alignments

Background

Having obtained related nucleotide sequences it is sometimes useful to make multiple alignments of many sequences in order to simultaneously compare them. Such a comparison may enable the identification of functionally important domains within a gene or to link specific mutations with a hereditary condition/phenotype. The comparison of sequences at the nucleotide level can be misleading since several different codons can code for the same amino acid. The codon table is presented in the lecture material. It is therefore useful to be able identify the coding region and to translate these sequences and align them at the protein level.

Since the available time is limited we will compare the sequences you have you downloaded from the previous exercise you can compare these sequences at the nucleic acid and amino acid level (the later only after translating them) using a program called ClustalW2. Unlike the alignment carried out by blast which is an alignment of pairs of sequences , ClustalW carries out alignments on multiple sequences and finds the optimal global alignment for all.

Exercise 2:

  1. Either Download the following text file downloadedCodingSeqs.txt (a selection of high scoring HSP derived from a blast search of the EMBL coding sequence database with CD40) or use your own selection generated in Exercise 1. Each sequence is given in the FASTA format i.e. ">Name" followed by the sequence starting on a new line.
  2. Click the back button in to access the Tool Box page at EBI. Look under  Sequence Analysis and click on ClustalW.
  3. Copy the nucleic acid sequences from the text file  and paste them including the name in to the query box as below.

  1. Now click Run ClustalW
  2. Look at aligned sequences. What you have in the browser window should be each sequence aligned with all others, stacked one upon the other and the whole alignment folows on with each consequtive 60 nucleotides. There should be a button which says "show colours", try viewing the alignment with the colour "clicked on" even this aid is not very useful with a large number of sequnces. Visually assessing the output is very difficult
  3. A useful way of displaying information about how related sequences are is as a cladogram or phylogram. This is basically a branch system connecting related data defined by some measure of "distance". In the phylogram, the length of the branch indicates relatedness within a phylum ie reflecting the evolutionary change between all members of the pyhylum. A cladogram shows the relationship within the phylum refecting the common ancestry but is not as a measure of evolutionary time since some of the sequences may be relatively ancient and have remained unchaged for some time whilst others may be modern sequences with recent origin. In either case the shorter the branch the more closely related the sequences. Click on "View Guide Tree"  view the alignment as both cladogram and phylogram.
  4. Compare the difference between cladogram and phylogram, ? make sure you understand why they are different.

Exercise 3:

Over an evolutionary timescale organisms diverge and as the evolutionary distance increases then the sequence of any given sequence drifts subject to evolutionary pressure for either change or conservation of given residues. The drift observed in protein sequence will always be less than for the equivalent nucleic acid sequence because of the redunancy in the codon usage. The conservation of protein sequence is critical only in certain regions such as functional domains, domains required for the structural integrity of the protein or for binding of ligands. Such regions of a protein may contribute significantly towards enzyme activity, the scaffold on which the protein depends or the environment in which folding occurs and therefore may be conserved in order to maintain function.

Sometimes as divergence occurs some genes may be duplicated and whilst the maintenance of one copy is functionally necessary the second is not and this duplicated gene may also diverge where it is evolutionarily advantageous and eventually acquire a different but related function. The similarity between such homologous genes is often sufficient to be able to detect such a relationship by sequence comparison.

Even functional domains may contain variable residues, this is particularly true in distantly related organisms, however even under such circumstances functional domains of homologous genes remain highly conserved at specific positions which have very narrow requirements in the physiochemical properties.

Proteins that have a common ancestral gene but may not necessarily have the same function are termed homologues, homologues found in different organisms which have a common ancestral gene are termed orthologues, however, where homologues arose by gene duplication they are termed paralogues.

  1. Open the following text file DNAK.txt. containing orthologues from a number of different bacteria of the protein DNAK (HSP40). Each sequence is given in the FASTA format i.e. ">Name" followed by the sequence starting on a new line. 
  2. Run ClustalW just as above using these protein sequences .

Another tool provided enabling you veiw the data in different ways is a Java applet called JalView

  1. In order to use JalView and see the alignment displayed in a number of different ways go to the top of the page and press the JalView button, this may take a few seconds so be patient.

  1. The alignment should appear after a short period. The main window contains the alignment from left to right with a scroll bar at the bottom to allow you to move along. The alignment can be wrapped within the window and coloured in different ways, some are only applicable to protein alignments (such as BLOSUM62 score etc). You can also modify the colouring according to consevation or identity threshold (try clicking on Colour->"Above Identity Treshold" and use the slide to adjust the percentage identity threshold). Jalview also generates a consensus sequence and a histogram indicating the similarity at any given position.
  2. Now click Calculate Tree. select Average distance tree using % Identity (PID). JalView will calculate the distances from the alignment and display the information about the four sequences as a dendogram/cladogram (again select view then show distances to see these values).
  3. Go back to the main JalView window and now click Calculate Tree. select neighbour joining distance using % Identity (PID). JalView will calculate the distances from the alignment and display the information about the four sequences as a dendogram/phylogram (again select view then show distances to see these values).
  4. Note that your results will differ from those you obtained before there are many different ways of calculating distance measures and constructing a cladogram or phylogram, which presents some problems in the interpretation of such information. One useful feature of the tree is that by clicking a any given position down the tree you can colour the branches according to the "cut" across the tree at that point. The tree is "linked" to the main JalView window and colours the seqence names accorded by the "cut" in the hierachy you made. You can also resort the order of the sequences in the window accorded by that same "cut" in the hierachy by clicking on sort -> By Tree Order and selecting an option.