From: Structural variant calling: the long and the short of it
Word | Definition |
---|---|
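Accuracy | Proportion of correctly identified events (TP + TN) among all events: (TP + TN)/(TP + TN + FP + FN), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives. |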
Breakpoints | Positions on the genome denoting the start and end of SVs relative to the reference genome. |
Contigs | Contiguous sequence stretches assembled from reads. |
De Bruijn graph | Directed graph consisting of nodes with exactly n incoming and n outgoing edges. In genome assembly, a de Bruijn graph is built in which the nodes are k-mers (sequences of length k) and the edges correspond to overlaps of k − 1 bases between nodes. |
String graph-based assembly | A method similar to de Bruijn graph-based assembly, except that the overlaps between all pairs of reads (instead of k-mers) are computed, and the string graph is constructed from those overlaps. |
Insert size | The distance between the two reads of a pair in paired-end sequencing. |
Overhang | Portion of a mapped read that cannot be aligned and thus could indicate a structural variation. |
Phasing | Determining whether two or more heterozygous variations co-occur on the same DNA molecule or on different DNA molecules. |
Precision (or positive predictive value) | Proportion of predictions (FP + TP) that are correct (TP). |
Recall (or sensitivity or true-positive rate) | Proportion of the total positives (FN + TP) that were correctly identified (TP). |
Scaffold | Connected contiguous sequence stretches, with unresolved sequence stretches in between. |
Split reads | Reads containing parts that map to different loci on the reference genome. They are found by splitting the read into sub-segments, aligning each sub-segment individually, and then grouping the sub-segments that come from the same read. |
Tandem sequence | A specific type of repetitive region in which the copies of the repeated unit lie directly adjacent to each other. |
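
The accuracy, precision, and recall entries above reduce to simple ratios of the four confusion-matrix counts. The following is a minimal sketch of those formulas; the function name `classification_metrics` and its argument names are illustrative assumptions, and returning NaN for an empty denominator is a choice made here, not something the glossary specifies.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, and recall from confusion-matrix counts.

    tp, tn, fp, fn: counts of true positives, true negatives,
    false positives, and false negatives (hypothetical argument names).
    A ratio with a zero denominator is reported as NaN.
    """

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else math.nan

    return {
        # (TP + TN) / (TP + TN + FP + FN)
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        # TP / (FP + TP): share of predictions that are correct
        "precision": ratio(tp, fp + tp),
        # TP / (FN + TP): share of real positives that were found
        "recall": ratio(tp, fn + tp),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(classification_metrics(tp=8, tn=2, fp=2, fn=8))
```

In SV benchmarking, true negatives are often not well defined, which is one reason precision and recall are usually reported instead of accuracy.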
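
The de Bruijn graph entry above can be made concrete with a toy construction: every distinct k-mer in the reads becomes a node, and an edge runs from one k-mer to another when the last k − 1 bases of the first equal the first k − 1 bases of the second. This is a simplified sketch for illustration only. The function name `build_de_bruijn` is an assumption, and real assemblers additionally handle reverse complements, sequencing errors, and k-mer coverage.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph with k-mers as nodes.

    Returns a dict mapping each k-mer to the sorted list of k-mers
    whose first k - 1 bases match its last k - 1 bases.
    """
    if k < 2:
        raise ValueError("k must be at least 2")

    # Collect every distinct k-mer observed in the reads.
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # Index k-mers by their (k - 1)-base prefix for fast edge lookup.
    by_prefix = defaultdict(set)
    for kmer in kmers:
        by_prefix[kmer[:-1]].add(kmer)

    # An edge a -> b exists when suffix(a) == prefix(b).
    return {
        kmer: sorted(by_prefix.get(kmer[1:], ()))
        for kmer in sorted(kmers)
    }


if __name__ == "__main__":
    # Two overlapping made-up reads, for illustration only.
    for node, successors in build_de_bruijn(["ACGTC", "GTCA"], k=3).items():
        print(node, "->", successors)
```

A string graph differs in that the nodes are whole reads and the edges come from read-to-read overlaps, which preserves long-range information at a higher computational cost.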