Definition of specificity residues based on entropy values. Combinatorial entropy difference as a function of residue position (in rank order) for the actual (solid line) and randomized (dashed line) multiple alignment of 390 protein kinase sequences . Deviations from the linear fit to the entropy curve define the specificity region (yellow, about 20 residues, conserved in subfamilies but varying between subfamilies) and conserved region (blue, about 50 residues, conserved across all subfamilies). The randomized alignment, obtained by independently shuffling residues in each column of the original alignment, serves as a point of reference. The shuffling does not affect the residue content in the columns, but it washes out the subfamily distinctions. The greater the differences between the native and the randomized entropy curves, the more reliable the corresponding prediction of specificity residues. To automate visual parsing of the extreme ends of the entropy plots, we perform a simple linear fit to the central region, covering a fraction P = 0.5 to 0.7 (depending on the length of the alignment) of the sequence length (horizontal range). The line segment is centered at a point corresponding to the best linear fit. To identify the turning points at the extremes, we compute the root mean square deviation from a simple line in the central region and record the points outside of the central region where the curve deviates by more than δ
from the extrapolated line segment. In most cases, this simple procedure is in agreement with visual identification of downturn and upturn at the extremes. A reasonable subset of specificity residues (low end of entropy difference) and conserved residues (high end) can then be read off from the horizontal axis of the entropy plot.