The reference sequence wheel lets you focus on a specific reference sequence, or chromosome. When a chromosome is selected, the app samples variants from only the selected chromosome and regenerates all the charts. If "All" is selected, the metrics are regenerated by sampling variants from across the entire genome.
The variant density chart shows the distribution of variants across the genome. To zoom in on a specific reference sequence (chromosome), select the coloured sequence boxes under the distribution. The appearance of this chart depends on several factors, most notably the type of sequencing undertaken, e.g. whole genome (WGS) or whole exome (WES).
Whole genome sequencing
For a WGS VCF file, variants should be spread fairly evenly across the whole genome, as shown for a human sample below.
You will notice that there are some areas of zero variant density in the above chart. Chromosomes 1, 3, 16, 19, and 20 are metacentric chromosomes, and have a centromeric region close to the centre of the chromosome where no variants will be present. This gap is particularly large in chromosomes 1 and 16. Chromosome 9 is submetacentric and has a large centromere, but the centromere is offset, leading to p and q arms of different lengths. Finally, chromosomes 13, 14, 15, 21, and 22 are acrocentric. Here the p arm is so short that it is difficult to observe, so the centromere appears to sit at the beginning of the chromosome.
Observing gaps in coverage for these chromosomes is expected, but any other gaps could be indicative of problems with the variant calls. Male samples in the VCF file have reduced coverage on the X chromosome (since they only have one copy), so it is also expected that the average variant density is lower on this chromosome. Often variant calls are not generated for the Y chromosome, or pseudoautosomal and heterochromatic regions are removed, leading to few calls.
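The gap check described above can be sketched in a few lines of Python. This is only an illustration, not how the app itself computes the chart: the function names `variant_density` and `empty_runs`, the fixed bin size, and the input format (a list of 1-based positions for one chromosome) are all assumptions made for the example.

```python
def variant_density(positions, chrom_length, bin_size):
    """Count variants per fixed-width bin along one chromosome.

    positions are assumed to be 1-based coordinates, as in the VCF POS column.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in positions:
        counts[(pos - 1) // bin_size] += 1
    return counts


def empty_runs(counts):
    """Return (first_bin, last_bin) pairs for each run of consecutive empty bins."""
    runs = []
    start = None
    for i, count in enumerate(counts):
        if count == 0 and start is None:
            start = i
        elif count != 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(counts) - 1))
    return runs
```

A run of empty bins that coincides with a known centromere or short p arm is expected; a run anywhere else is worth a closer look.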
Whole exome sequencing
For a WES VCF file containing many samples, the variant distribution will appear much more uneven, since the VCF file will only contain variants in exonic regions. Some of the gaps described for the WGS can still be observed, but there is still generally coverage across the whole chart. Any large gaps would be suggestive of data problems.
Note that the number of sampled variants is often low for exome sequencing. This is because variants are sampled from across the whole genome, but variants are only found in exonic regions. This can be resolved by restricting sampling to exonic regions using the controls in the top right corner of this chart.
Transition/transversion (Ts/Tv) ratio
The transition/transversion (Ts/Tv) ratio is calculated from the SNPs in the selected file, and is expected to have a value of the order of 2. The nucleotides Adenine (A) and Guanine (G) are purines, and Cytosine (C) and Thymine (T) are pyrimidines.
A transition occurs when a purine mutates into another purine, or a pyrimidine mutates into another pyrimidine. There are four possible transition mutations, as shown in the figure to the left.
A transversion occurs when a purine mutates into a pyrimidine or vice versa, and there are eight possible ways for this to occur.
If single base-pair mutations were purely random events, we would expect to see twice as many transversions as transitions, as there are twice as many ways for a transversion to occur. We would then expect to see a Ts/Tv ratio of 0.5. However, transitions are chemically and biologically more favourable (transition mutations are about ten times more common than transversions). Consequently for human DNA, we actually expect to see far more transitions than we would expect by chance: typically we see a Ts/Tv ratio closer to 2. If the observed Ts/Tv is significantly lower than 2, then it is a potential signal of data problems.
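As a rough illustration of the counting argument above, the following Python sketch classifies SNPs as transitions or transversions and computes the ratio. The names `is_transition` and `ts_tv_ratio`, and the representation of a SNP as a `(ref, alt)` pair of single bases, are assumptions for this example rather than anything taken from the app.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True when both bases are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Compute Ts/Tv from an iterable of (ref, alt) single-base pairs.

    Returns None when there are no transversions to divide by.
    """
    ts = tv = 0
    for ref, alt in snps:
        if ref == alt:
            continue  # not a substitution
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None


# All twelve possible single-base substitutions, each counted once.
ALL_SUBSTITUTIONS = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
```

Passing `ALL_SUBSTITUTIONS` to `ts_tv_ratio` gives 0.5 (four transitions over eight transversions), which is the value expected if every substitution were equally likely.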
Allele Frequency Spectrum
The allele frequency spectrum shows the percentage of alleles (only using SNPs) in the called population that exhibit the alternate allele. This plot is highly dependent on the number of samples that were analyzed together when generating the variant calls. For example, if this VCF file was generated from a single sample, the allele frequency (AF) can only take one of three values: 0% (the sample is homozygous for the reference allele; a variant caller would typically determine that there is no variant at this location, so this would not be reported), 50% (the sample is heterozygous), or 100% (the sample is homozygous for the alternate allele).
If the calling was performed on a large number of samples, we would expect to see the population allele frequency spectrum, which is reasonably well approximated by the ExAC dataset. This is often the case: even when only a small number of samples is being studied, variant calling is often performed against a background of samples, which ensures that common variants in the human population are represented.
If only a small number of samples was used in the calling step, this plot will be quite discrete and not hugely informative.
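To make the single-sample case concrete, here is a minimal Python sketch that derives the alternate allele frequency at one site from genotype strings. The function name `alt_allele_frequency` is hypothetical, and the sketch assumes biallelic sites with VCF-style GT values such as `0/1`, `1|1`, or `./.`; it is not the app's own calculation.

```python
def alt_allele_frequency(genotypes):
    """Fraction of called alleles at one site that are non-reference.

    genotypes is a list of VCF-style GT strings, one per sample.
    Missing alleles (".") are ignored; returns None if nothing was called.
    """
    alt = total = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            total += 1
            if allele != "0":
                alt += 1
    return alt / total if total else None
```

With a single diploid sample the only possible results are 0.0, 0.5, and 1.0, which is why the plot looks so discrete. Adding more samples fills in the values between them.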
As discussed in the Ts/Tv help information, transitions are much more common than transversion mutations. This means that it is far more likely for an Adenine (A) to mutate into a Guanine (G), and vice versa, than to mutate to either a Cytosine (C) or Thymine (T). This means the mutation spectrum is expected to look similar to the image below.
If transition mutations do not dominate the mutation spectrum, this might indicate a problem with the variant calling. If this is the case, you should also observe a Ts/Tv ratio significantly less than 2.
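The check described above, whether transitions dominate the mutation spectrum, can be sketched as follows. The helper names `mutation_spectrum` and `transition_fraction` and the `(ref, alt)` input format are assumptions for this example.

```python
from collections import Counter

TRANSITIONS = {"A>G", "G>A", "C>T", "T>C"}


def mutation_spectrum(snps):
    """Count each substitution type, keyed as 'REF>ALT'."""
    return Counter(f"{ref}>{alt}" for ref, alt in snps if ref != alt)


def transition_fraction(spectrum):
    """Fraction of all counted substitutions that are transitions."""
    total = sum(spectrum.values())
    if total == 0:
        return None
    ts = sum(n for key, n in spectrum.items() if key in TRANSITIONS)
    return ts / total
```

A Ts/Tv ratio of about 2 corresponds to a transition fraction of roughly two thirds, so a fraction well below that points to the same problem as a low Ts/Tv ratio.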
This chart shows a distribution of the variant types present in the selected VCF file. SNPs are by far the most prevalent type of genetic mutation, so it is expected that these dominate the distribution.
For whole genome sequencing data, SNPs dominate, but there is typically an appreciable number of insertion (Ins) and deletion (Del) mutations, as well as Other variant types. These other variants include anything that doesn't fall into the previous categories. For example, consider the case where the reference sequence is ACTG, and sometimes the A is mutated into a C, but whenever it is, the T is always deleted. Rather than these two mutations appearing separately in the VCF file, since they are only ever observed together, the reference will be ACTG, and the alternate will be CCG (the A mutated to C, and the T deleted). This mutation is a combination of a SNP and a deletion, and is listed as a complex variant type and will appear in the Other category.
Structural variants are much less likely to be found in whole exome data. Large events are often not detected because their breakpoints fall outside the exome sequencing regions. Also, insertions in, or deletions of, large coding regions are likely to be more deleterious than those in the non-coding regions captured by whole genome sequencing, so fewer of them are present in these regions. For these reasons, SNPs are expected to dominate the variant types distribution even more in exome sequencing.
This chart is primarily used to check that all expected variants are present. If you are expecting to see a complete VCF including insertions and deletions, but the file has been filtered to leave only SNPs, this chart will quickly help identify this.
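The categories in this chart can be illustrated with a simplified classifier based only on the REF and ALT strings. This is a sketch under stated assumptions, not the app's actual logic: the function name `classify_variant` is hypothetical, and it treats an allele as a simple insertion or deletion only when the shorter allele is a prefix of the longer one, as is usual for left-anchored indels in a VCF file.

```python
def classify_variant(ref, alt):
    """Assign a variant to SNP, Ins, Del, or Other from its REF and ALT alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "Ins"  # extra bases added after the anchoring base(s)
    if len(ref) > len(alt) and ref.startswith(alt):
        return "Del"  # bases removed after the anchoring base(s)
    return "Other"  # complex, multi-base, or symbolic alleles
```

Under these rules the ACTG to CCG example lands in Other, because the alternate allele is shorter than the reference but is not a prefix of it.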
Insertion & Deletion Lengths
Distribution of indels
This plot shows how many insertions and deletions (indels) of specific lengths are present in the VCF file. It is typically easier for variant callers to detect deletion alleles, so we expect to see more deletions than insertions, and we expect to see more short indels than long ones. A typical distribution will look something like:
You can click the "Outliers" button to expand the chart to include indels of all lengths. The number of larger indels is usually dwarfed by the number of short ones, so the expanded chart mostly shows a distribution close to zero. However, you can zoom in on regions of the chart using the lower chart.
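An indel length distribution of this kind can be tallied from the difference in length between the ALT and REF alleles. The sketch below is illustrative only; the function name `indel_length_distribution` and the convention of negative lengths for deletions and positive lengths for insertions are assumptions of this example.

```python
from collections import Counter


def indel_length_distribution(variants):
    """Tally indel lengths from (ref, alt) pairs.

    Lengths are len(alt) - len(ref): negative for deletions, positive for
    insertions. Variants with equal-length alleles (e.g. SNPs) are skipped.
    """
    dist = Counter()
    for ref, alt in variants:
        diff = len(alt) - len(ref)
        if diff != 0:
            dist[diff] += 1
    return dist
```

Summing the counts for negative keys and for positive keys gives the deletion and insertion totals, which can then be compared against the expectation of more deletions than insertions.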
Variant quality scores
Each variant is assigned a quality score to represent the confidence in the variant. The distribution of these scores is highly variable, and no specific distribution is expected. Here we mainly want to check that the quality scores are not crowded into the low (< 100) end. Below is an example of a quality score distribution that is weighted to higher scores and so would be considered acceptable.
Below is a more average quality score distribution that would also be considered acceptable. There are many variants that have a low quality score (e.g. between 0 and 100), but there is still a large number of variants with high scores. If variant prioritization techniques are used, it may be necessary to check individual variants to ensure that their scores are acceptable.
Below is a quality score distribution that looks more questionable. There are still many variants that appear to be of high quality, but the majority are clustered close to zero. Seeing such a distribution should raise warning flags, but the file should not just be thrown out. Different variant calling tools use different methods to generate quality scores, so it would be prudent to check with whoever generated the variant calls to try to understand the cause of such a distribution.
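One way to put a number on "crowded into the low end" is to compute the fraction of variants whose quality score falls below 100. The sketch below is only a rough aid: the function name `low_quality_fraction` is hypothetical, the threshold of 100 is simply the figure used in the text above, and no particular cut-off for the resulting fraction is implied.

```python
def low_quality_fraction(quals, threshold=100.0):
    """Fraction of scored variants whose QUAL is below the threshold.

    quals may contain None for variants with a missing QUAL value;
    those are ignored. Returns None if no variant has a score.
    """
    scored = [q for q in quals if q is not None]
    if not scored:
        return None
    return sum(q < threshold for q in scored) / len(scored)
```

Because callers scale their quality scores differently, the result is best compared between files produced by the same calling pipeline rather than against a fixed limit.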