Sequence preprocessing

Sequence preprocessing help

This implementation will automatically search the input sequence (amino acid or nucleotide) for the V3 loop.

Search options

Force V3 at position 1: This option forces the program to score the sequence from position 1. Gaps are left as-is. If the sequence is given in nucleotides, the translation is performed in frame 1.
Fast-find V3: A quick attempt to find the V3 loop (after nucleotide translation, if necessary) is made using a regular expression. All forward and reverse frames are searched. Gaps are left as-is. This method is very fast, but not perfect.
Align to matrix: The most rigorous method, and the slowest. The program will find and score the portion of the sequence that aligns best to a consensus V3 loop. The program will make insertions and deletions in order to put input V3 sites in register with homologous sites represented in the matrix. The output will indicate where insertions were removed and where deletions were identified to give the resulting score.
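
The Fast-find V3 idea can be illustrated with a short Python sketch. This is not the program's own code: the regular expression V3_PATTERN below (a cysteine, 30 to 36 residues, then a closing cysteine) is a hypothetical stand-in for whatever pattern the program actually uses, and the function names are invented for illustration. The sketch translates all three forward and three reverse frames with the standard genetic code and returns the first frame in which the pattern matches. Gap handling is simplified: any codon that is not a plain A/C/G/T triplet is translated as X.

```python
import re

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical pattern (an assumption, not the program's real expression):
# a cysteine-bounded stretch of roughly V3-loop length with no stop codon.
V3_PATTERN = re.compile(r"C[A-Z]{30,36}C")


def translate(nt, frame=0):
    """Translate a nucleotide string starting at a 0-based frame offset."""
    nt = nt.upper()
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    # Codons with gaps or ambiguity symbols become 'X' in this sketch.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(nt):
    """Yield (strand, frame number 1-3, protein) for all six reading frames."""
    forward = nt.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", forward), ("-", reverse)):
        for frame in range(3):
            yield strand, frame + 1, translate(seq, frame)


def fast_find_v3(nt):
    """Return (strand, frame, matched peptide) for the first hit, else None."""
    for strand, frame, protein in six_frames(nt):
        match = V3_PATTERN.search(protein)
        if match:
            return strand, frame, match.group(0)
    return None
```

Under the same assumptions, Force V3 at position 1 corresponds to calling translate(nt, 0) and scoring the result from its first residue, with no searching at all.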

If the resulting alignment or PSSM score is out of the usual range (the middle 95% of a general sample of subtype B or subtype C sequences), this will be noted in the returned results. An unusual score indicates that the alignment (whether yours or the program's) is probably unreliable.
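
As a rough illustration of the "usual range" check, the following Python sketch flags a score that falls outside the middle 95% of a reference sample. The reference scores, the function names, and the use of simple sample percentiles are all assumptions made for this example; the program's actual reference data and cut-offs are not given here.

```python
from statistics import quantiles


def usual_range(reference_scores):
    """Return the (2.5th, 97.5th) percentile cut points of a reference sample."""
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%; keep the outermost two.
    cuts = quantiles(reference_scores, n=40)
    return cuts[0], cuts[-1]


def is_unusual(score, reference_scores):
    """True if the score lies outside the middle 95% of the reference sample."""
    low, high = usual_range(reference_scores)
    return not (low <= score <= high)
```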

Scoring degenerate sequences

Checking Expand degenerate sequences will instruct the program to score every amino acid sequence that an input nucleotide sequence containing IUPAC ambiguity symbols could encode.

There are two options: Average score will deliver only the simple average of scores over all combinations; Full expansion will enumerate and score each sequence combination separately, as well as report the average. Note that it does not take many ambiguities in the sequence for the number of possible sequences to become very large. For example, a sequence with 9 codons containing amino-acid-changing ambiguities, each with two possible amino acids, would yield 2^9 = 512 different sequences upon expansion. The upper limit for the number of combinations the program is willing to analyze is 16,384 when computing the average score only, and 512 when requesting enumeration of all combinations. Some efficiencies are built in. Ambiguities that do not change the encoded amino acid (for example, many in third codon positions) are not expanded. Only ambiguities within the sequence spanned by the matrix are expanded.
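
The counting described above can be sketched in Python. This is an illustrative sketch, not the program's implementation: the function names are invented, the input is assumed to be gap-free and in frame 1, and the whole input is expanded rather than only the part spanned by the matrix. Each codon is reduced to the set of distinct amino acids it can encode, so ambiguities that do not change the amino acid contribute a factor of 1, and the two limits quoted above are applied before enumeration.

```python
from itertools import product

# IUPAC nucleotide ambiguity symbols and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

MAX_AVERAGE = 16384  # limit when only the average score is requested
MAX_FULL = 512       # limit when every combination is enumerated


def codon_amino_acids(codon):
    """Sorted distinct amino acids that a (possibly ambiguous) codon can encode."""
    options = product(*(IUPAC[base] for base in codon.upper()))
    return sorted({CODON_TABLE["".join(option)] for option in options})


def per_codon_choices(nt):
    """Amino acid choices for each complete codon, reading in frame 1."""
    return [codon_amino_acids(nt[i:i + 3]) for i in range(0, len(nt) - 2, 3)]


def count_combinations(nt):
    """Number of distinct amino acid sequences the input can encode."""
    total = 1
    for choices in per_codon_choices(nt):
        total *= len(choices)
    return total


def expand(nt, full=False):
    """Enumerate all encoded amino acid sequences, enforcing the size limit."""
    limit = MAX_FULL if full else MAX_AVERAGE
    n = count_combinations(nt)
    if n > limit:
        raise ValueError(f"{n} combinations exceeds the limit of {limit}")
    return ["".join(combo) for combo in product(*per_codon_choices(nt))]
```

For instance, the codon GGN encodes only glycine and so is not expanded, whereas ARA can encode lysine (AAA) or arginine (AGA); nine such two-way codons give the 512 combinations mentioned above. Averaging a score over the list returned by expand would correspond to the Average score option.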

The user is advised that the average score over combinations is an extremely rough guide to the "X4-ness" of the population.

26 Feb 2009