Takes sequences, a tree, and a column statistic as input and generates a short sequence quality string, which is stored in the database under a user-defined key.
First, the sequences are split into slices in three different ways:
- 2 pieces (front and back half)
- 5 equally sized pieces
- user-defined pieces
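The slicing step can be pictured with a minimal Python sketch. The function name `split_into_slices` is hypothetical, and the way a remainder is distributed over the slices is an assumption; the program itself may choose its slice boundaries differently.

```python
def split_into_slices(sequence: str, pieces: int) -> list[str]:
    """Split a sequence into `pieces` contiguous slices of near-equal length.

    Illustrative only: spreading the remainder via integer boundaries
    is an assumption, not the documented behaviour of the program.
    """
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    length = len(sequence)
    # Integer boundaries 0 = b0 <= b1 <= ... <= b_pieces = length
    bounds = [i * length // pieces for i in range(pieces + 1)]
    return [sequence[bounds[i]:bounds[i + 1]] for i in range(pieces)]


halves = split_into_slices("ACGUACGUAC", 2)  # ['ACGUA', 'CGUAC']
fifths = split_into_slices("ACGUACGUAC", 5)  # ['AC', 'GU', 'AC', 'GU', 'AC']
```

The same function would cover the user-defined case by passing the chosen number of pieces.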
The program sums up the weighted mutations for each sequence slice using a maximum likelihood technique.
For each slice, a Student's t-test (see http://en.wikipedia.org/wiki/T-test) is performed, and its result is written into the X positions of the entries described below. The t-test checks whether the likelihood of a specific sequence slice (of one species) follows a t-distribution of the likelihoods for that sequence slice in all examined species.
Each X encodes the result of the t-test (the "t-value") as follows:
- if the t-test succeeds, X is a value from '1' to '8' (where '5' is shown as '-')
- if the t-test fails, X is '0' or '9'
- if there is not enough data to perform the t-test, X is '.'
Rule of thumb: values near '0' or '9' indicate regions with an abnormal number of weighted mutations; values near '5' ('-') indicate regions with a normal (i.e. expected) number.
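The character rules above can be summarised in a small Python sketch. The names `render_symbol` and `is_failed` are hypothetical, and the sketch assumes the t-test result has already been reduced to an integer from 0 to 9 (or to `None` when there is too little data); how the program derives that integer is not shown here.

```python
from typing import Optional


def render_symbol(value: Optional[int]) -> str:
    """Return the character written for one slice."""
    if value is None:            # not enough data for the t-test
        return "."
    if not 0 <= value <= 9:
        raise ValueError("value must be in the range 0..9")
    return "-" if value == 5 else str(value)   # '5' is shown as '-'


def is_failed(symbol: str) -> bool:
    """True if the character marks a failed t-test ('0' or '9')."""
    return symbol in ("0", "9")


line = "".join(render_symbol(v) for v in [4, 5, 9, None, 0])
# line == "4-9.0"
```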
The sequence quality string written into a user-definable species field has the following format:
MED SUM aXX bXXXXX cXXXXX...XXXX
where:
- MED is the median of all t-values (0.0 = normal; <5.0 = succeeds t-test (mean); >5.0 = abnormal)
- SUM is the sum of all t-values
- aXX shows the quality for 2 pieces
- bXXXXX shows the quality for 5 pieces
- cXXXXX...XXXX shows the quality for user-defined slices
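To read such a field back in a script, a parser along the following lines might be used. This is a sketch under assumptions: the five parts are taken to be separated by whitespace, MED and SUM are taken to be plain numbers, and the names `QualityEntry` and `parse_quality` as well as the sample string are invented for illustration.

```python
from typing import NamedTuple


class QualityEntry(NamedTuple):
    median: float       # MED
    total: float        # SUM
    halves: str         # the XX after 'a'
    fifths: str         # the XXXXX after 'b'
    user_slices: str    # the X... after 'c'


def parse_quality(entry: str) -> QualityEntry:
    """Parse a 'MED SUM aXX bXXXXX cXXX...' string (assumed whitespace-separated)."""
    med, total, a, b, c = entry.split()
    if not (a.startswith("a") and b.startswith("b") and c.startswith("c")):
        raise ValueError("expected parts starting with 'a', 'b' and 'c'")
    if len(a) != 3 or len(b) != 6:
        raise ValueError("expected 2 characters after 'a' and 5 after 'b'")
    return QualityEntry(float(med), float(total), a[1:], b[1:], c[1:])


# Invented sample value, not real program output
q = parse_quality("4.5 38 a4- b4-63- c4-6.09-3")
# q.halves == "4-", q.fifths == "4-63-", q.user_slices == "4-6.09-3"
```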
Optionally, a 'quality' entry may be written to the alignment so that it can be displayed in EDIT4 below the sequence. That quality entry is simply a "blown up" version of the "cXXXXX...XXXX" part of the sequence quality field.
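One way to picture the "blown up" entry is that each character of the c part is repeated until it covers the columns of its slice. The sketch below assumes equally wide slices and a hypothetical function name `blow_up`; the exact column mapping used by the program may differ.

```python
def blow_up(symbols: str, alignment_length: int) -> str:
    """Stretch one character per slice to one character per alignment column.

    Assumes equally wide slices; illustrative only.
    """
    pieces = len(symbols)
    if pieces == 0:
        raise ValueError("symbols must not be empty")
    # Same integer boundaries as used for equal-width slicing
    bounds = [i * alignment_length // pieces for i in range(pieces + 1)]
    return "".join(
        sym * (bounds[i + 1] - bounds[i]) for i, sym in enumerate(symbols)
    )


track = blow_up("4-9", 10)
# track == "444---9999"
```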