1.  Overview

Quality control is a key step in association analysis. We may use a variety of criteria to clean our data (criteria directly from original vcf file or statistic summary we are interested in). Variants can also be subsetted based on these properties such as variant information, annotations, summary statistics, etc, which are displayed by vtools show fields.

2.  QC Pipeline in Brief

  • Variant & Genotype Level QC (vtools select and vtools exclude)
    • remove low quality variants which do not pass the variants filter.
    • remove low quality genotypes with low genotype quality score (GQ)
    • remove low quality genotypes with low genotype depth of coverage (GD).
  • Hardy–Weinberg equilibrium Filter (vtools update)
    • remove variants which does not pass Hardy–Weinberg equilibrium criterion (e.g. HWE pvalue < {$10^{-8}$} ).
  • Basic Summary Statistics on Variant Level (vtools update)
    • based on variant level, we can get basic counts of different genotypes (calls, wild types, mutation types, homozygote genotypes, heterozygote genotypes, alternative alleles, etc).
    • based on variant level, we can calculate alternative allele frequency, and subset variants by different allele frequency cutoff defined as common variants or rare variants.
    • based on variant level, we cat get counts of different functional types of variants (synonymous, missense, nonsense, etc).
    • based on variant level, we cat get counts of singletons and doubletons.
  • Basic Summary Statistics on Sample Level (vtools phenotype)
    • based on individual level, we can get basic counts of different genotypes (calls, wild types, mutation types, homozygote genotypes, heterozygote genotypes, alternative alleles, etc) and calculate average counts.
    • based on individual level, we cat get counts of different types of variants (synonymous, missense, nonsense, etc), and calculate average counts.
    • based on individual level, we cat get counts of transitions and transversions, and calculate transitions V.S. transversions ratio (Ti/Tv ratio).
    • based on individual level, we cat get counts of singletons and doubletons, and calculate average counts.

3.  QC Pipeline in Detail

  • Introductory tutorial: a 30min tutorial using ~100 samples from 1000 Genomes Project
  • Basic tutorial: a basic demonstration of quality control using exome sequencing data from European and Asian samples of the 1000 genomes project
  • Advanced tutorial: an advanced demonstration of quality control using same dataset as the basic tutorial