ClustalW
ClustalW is a general-purpose program for aligning multiple DNA or protein sequences.
Introduction:
ClustalW is one of the most widely used sequence alignment programs. There are many reasons why you might want to align sequences. For example, differences between aligned sequences can be used to infer phylogenetic history and evolution. DNA sequences are often aligned to identify identical residues as targets for mutation, and alignments of protein sequences can identify conserved domains or moieties.
This section of the course will briefly describe the program CLUSTALW. We will begin with a brief description of the input and output formats, and then move on to the alignments per se. The session will conclude with a couple of examples of alignments performed by CLUSTALW.
Sequence input
All the sequences must be in one file and in the same format. CLUSTALW can read many formats, but the simplest and most widely used is the FASTA format, which we will use for all our analyses. CLUSTALW inspects each sequence to decide what type it is. If 85% or more of the characters in the sequence are A, C, G, T, U or N, CLUSTALW assumes that it is a nucleotide sequence. Otherwise it assumes that it is a protein sequence.
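To make the 85% rule concrete, here is a minimal Python sketch. It is not CLUSTALW's own code: the function names `read_fasta` and `guess_sequence_type` are made up for this example, and we assume that only letters are counted (gap characters and digits are ignored).

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first word after '>' is used as the name.
            parts = line[1:].split()
            name = parts[0] if parts else "seq%d" % (len(records) + 1)
            records[name] = ""
        elif name is not None:
            # Sequence line: append to the current record.
            records[name] += line
    return records


def guess_sequence_type(sequence, threshold=0.85):
    """Return 'DNA' if at least `threshold` of the letters are A, C, G, T, U or N."""
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_count = sum(ch in "ACGTUN" for ch in letters)
    return "DNA" if nucleotide_count / len(letters) >= threshold else "PROTEIN"


example = """>seq1 example nucleotide record
ACGTACGTNNACGT
>seq2 example protein record
MKTAYIAKQRQISFVKSHFSRQ
"""

for record_name, seq in read_fasta(example).items():
    print(record_name, guess_sequence_type(seq))
```

Here seq1 consists entirely of nucleotide letters and is reported as DNA, while seq2 has only 3 such letters out of 22 and is reported as PROTEIN.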
Sequence Alignments
Multiple alignments are carried out in 3 stages:
- all sequences are compared to each other (pairwise alignments)
- a dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity.
- the final multiple alignment is carried out, using the dendrogram as a guide.
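The first two stages above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not CLUSTALW's implementation: `fraction_identity` is a crude position-by-position comparison standing in for a real pairwise alignment, and the tree is built with simple average-linkage clustering, swapped in here because it is short to write. The third stage is not implemented; the nested tuple that is returned only shows the order in which a progressive alignment would join the sequences.

```python
from itertools import combinations


def fraction_identity(a, b):
    """Crude stand-in for a pairwise alignment: compare position by position."""
    length = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / length


def average_distance(members_a, members_b, dist):
    """Mean distance between the sequences of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def build_guide_tree(seqs):
    """Return a nested tuple describing the order in which sequences are joined."""
    names = list(seqs)
    # Stage 1: compare every sequence with every other one.
    dist = {frozenset((a, b)): 1.0 - fraction_identity(seqs[a], seqs[b])
            for a, b in combinations(names, 2)}
    # Stage 2: repeatedly join the two closest clusters.
    clusters = [(name, [name]) for name in names]  # (subtree, member names)
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]][1], clusters[ij[1]][1], dist),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCA", "D": "TTGTACCT"}
print(build_guide_tree(seqs))  # (('A', 'B'), ('C', 'D'))
```

In this example A and B are joined first, then C and D, and finally the two pairs are joined to each other. A progressive aligner would align the sequences in that same order.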
There are several parameters that you can set in CLUSTALW, and we will examine most of them by trying them on examples and seeing what happens.
The parameters described here control the final multiple alignment.
Each step in the final multiple alignment consists of aligning two alignments or sequences. This is done progressively, following the branching order in the guide tree. The basic parameters that control this are the two gap penalties and the scores assigned to identical and non-identical residue pairs.
The gap penalties control the cost of opening each new gap and the cost of every position in a gap. Increasing the gap opening penalty will make gaps less frequent. Increasing the gap extension penalty will make gaps shorter. Gaps near the ends of sequences are not penalised, so that the last few residues or bases don't get forced to the end of the sequence.
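A small Python sketch shows how the two penalties interact. The penalty values (10.0 and 0.5) and the function names are illustrative assumptions, not CLUSTALW defaults, and we model "gaps near the ends are not penalised" simply as gaps that touch either end of the aligned sequence costing nothing.

```python
import re


def gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap: one opening charge plus a charge for every gap position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def total_gap_cost(aligned, gap_open=10.0, gap_extend=0.5, free_end_gaps=True):
    """Sum the gap costs of one aligned sequence that uses '-' for gap positions."""
    total = 0.0
    for run in re.finditer(r"-+", aligned):
        touches_end = run.start() == 0 or run.end() == len(aligned)
        if free_end_gaps and touches_end:
            continue  # terminal gaps are not charged in this sketch
        total += gap_cost(run.end() - run.start(), gap_open, gap_extend)
    return total


print(total_gap_cost("AC--GT-ACGT"))   # two separate gaps: 11.0 + 10.5 = 21.5
print(total_gap_cost("AC---GTACGT"))   # one gap of three positions: 11.5
print(total_gap_cost("--ACGT--"))      # terminal gaps only: 0.0
```

Both of the first two examples contain three gap positions, but splitting them into two gaps costs almost twice as much, because the opening penalty is paid twice. This is why raising the gap opening penalty makes gaps less frequent, while raising the extension penalty mainly discourages long gaps.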
The delay divergent sequences switch delays the alignment of the most distantly related sequences until after the most closely related sequences have been aligned. The setting shows the percent identity level required to delay the addition of a sequence; sequences that are less identical than this level to any other sequences will be aligned later.
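As a rough sketch of this rule in Python: the function below takes a table of pairwise percent identities and separates the sequences that would be delayed. We read the rule as "delay a sequence whose identity to every other sequence is below the cut-off"; that reading, the function name `split_by_divergence`, the 30% cut-off used in the example, and the identity values are all assumptions made for illustration.

```python
def split_by_divergence(identity, threshold=30.0):
    """Split sequence names into (aligned_first, delayed).

    `identity` maps a pair of names, e.g. ("A", "B"), to their percent identity.
    A sequence is delayed if its best identity to any other sequence
    is still below `threshold`.
    """
    names = sorted({name for pair in identity for name in pair})
    aligned_first, delayed = [], []
    for name in names:
        best = max(value for pair, value in identity.items() if name in pair)
        if best < threshold:
            delayed.append(name)
        else:
            aligned_first.append(name)
    return aligned_first, delayed


# Hypothetical pairwise percent identities for three sequences.
identity = {("A", "B"): 85.0, ("A", "C"): 25.0, ("B", "C"): 22.0}
print(split_by_divergence(identity, threshold=30.0))  # (['A', 'B'], ['C'])
```

Sequence C is at most 25% identical to anything else, so it falls below the 30% cut-off and would be added to the alignment only after A and B have been aligned.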
The transition weight gives transitions (A <--> G or C <--> T i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero means that the transitions are scored as mismatches, while a weight of 1 gives the transitions the match score. For distantly related DNA sequences, the weight should be near to zero; for closely related sequences it can be useful to assign a higher score.
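The effect of the transition weight can be sketched as a simple scoring function. The match and mismatch scores (1.0 and 0.0) and the linear scaling between them are assumptions chosen so that a weight of 0 gives the mismatch score and a weight of 1 gives the match score, as described above; the function name is made up, and only the four DNA bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_pair_score(x, y, transition_weight=0.5, match_score=1.0, mismatch_score=0.0):
    """Score one aligned pair of bases, giving transitions partial credit."""
    x, y = x.upper(), y.upper()
    if x == y:
        return match_score
    pair = {x, y}
    is_transition = pair <= PURINES or pair <= PYRIMIDINES
    if is_transition:
        # Weight 0 -> mismatch score, weight 1 -> match score.
        return mismatch_score + transition_weight * (match_score - mismatch_score)
    return mismatch_score  # transversion (purine <--> pyrimidine)


print(dna_pair_score("A", "G", transition_weight=0.0))  # 0.0, scored as a mismatch
print(dna_pair_score("A", "G", transition_weight=1.0))  # 1.0, scored as a match
print(dna_pair_score("A", "C", transition_weight=1.0))  # 0.0, transversions are unaffected
```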
Matrices
The protein weight matrix option offers a choice of weight matrices. The default for proteins is the PAM series derived by Gonnet and colleagues. Each matrix in a series is suited to a particular evolutionary distance, so the best choice depends on how divergent your sequences are.
The DNA weight matrix option offers a choice of two matrices, IUB and CLUSTAL. IUB is the matrix used by BESTFIT for comparison of nucleic acid sequences.
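If you run the program from the command line rather than through a web page, the parameters discussed in this section are passed as options. The Python sketch below only builds the argument list and does not run anything. It assumes the program is installed under the name `clustalw`, and the option spellings (`-INFILE`, `-ALIGN`, `-TYPE`, `-GAPOPEN`, `-GAPEXT`, `-MAXDIV`, `-TRANSWEIGHT`, `-MATRIX`, `-DNAMATRIX`) should be checked against the help output of your installed version. The file name and parameter values are hypothetical.

```python
def clustalw_command(infile, seq_type="DNA", gap_open=None, gap_ext=None,
                     max_div=None, trans_weight=None, matrix=None, dna_matrix=None):
    """Build an argument list for a multiple alignment run (nothing is executed)."""
    args = ["clustalw", "-INFILE=%s" % infile, "-ALIGN", "-TYPE=%s" % seq_type]
    optional = [
        ("GAPOPEN", gap_open),          # gap opening penalty
        ("GAPEXT", gap_ext),            # gap extension penalty
        ("MAXDIV", max_div),            # % identity cut-off for delaying sequences
        ("TRANSWEIGHT", trans_weight),  # transition weight, between 0 and 1
        ("MATRIX", matrix),             # protein weight matrix
        ("DNAMATRIX", dna_matrix),      # DNA weight matrix
    ]
    for key, value in optional:
        if value is not None:
            args.append("-%s=%s" % (key, value))
    return args


cmd = clustalw_command("seqs.fasta", gap_open=10, gap_ext=0.5,
                       trans_weight=0.5, dna_matrix="IUB")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` once you have confirmed the option names on your system.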
Package for clustalW:
http://abies.nmsu.edu/pkgsrc/clustalw/
References about clustal
The following three references describe using clustal in more detail.
- Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992) CLUSTAL V: improved software for multiple sequence alignment. Computer Applications in the Biosciences (CABIOS), 8(2):189-191.
- Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673-4680.
- Higgins, D.G., Thompson, J.D. and Gibson, T.J. (1996) Using CLUSTAL for multiple sequence alignments. Methods in Enzymology, 266:383-402.