A recent presentation about variant tools (Oct 3rd, 2013)

variant tools is a software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis. Unlike some other tools used for next-gen sequencing analysis, variant tools is project based and provides a whole set of tools to manipulate and analyze genetic variants. Please refer to "Things you can do with variant tools" for a list of the features it provides.

News

  • Feb 27th, 2014: Release of variant tools 2.3.0.
  • Jan 16th, 2014: Release of variant tools 2.2.0.
  • Nov 6th, 2013: Release of variant tools 2.1.0, which adds a few useful features such as the SQL functions genotype() and samples(), and the --as option to command vtools use.
  • Oct 9, 2013: Release of variant tools 2.0.1, which is a maintenance release of version 2.0.0.
  • Aug 27, 2013: Release of variant tools 2.0. This is a major release of variant tools with many new features. Please check ChangeLog for details.
  • May 16, 2013: Release of variant tools 1.0.6, which contains a lot of small features and bug fixes.
  • Mar 20, 2013: Release of variant tools 1.0.5. This release adds commands vtools admin --update_resource and vtools_report sequences, and allows the use of arbitrary characters for names of variant tables.
  • Feb 20, 2013: Release of variant tools 1.0.4. This release comes with numerous bug fixes and new minor features. Please check the ChangeLog for details.
  • Oct 21, Nov 10, Nov 26, and Nov 29, 2012: Release of variant tools 1.0.3a, b, c and d to address various small issues.
  • Sep 25, 2012: Release of variant tools version 1.0.3, with new features and improvements in vtools associate, vtools update, vtools phenotype and vtools_report commands.
  • Jul 9th, 2012: Release of variant tools version 1.0.3rc1. Other than a few bug fixes and major performance improvements, this release introduces new commands vtools associate and vtools admin, with more than 20 association tests implemented under a unified association test framework.
  • Jan 24th, 2012: Release of variant tools version 1.0.2. This release fixes a major bug that causes duplicate output in commands vtools output and vtools export when range-based annotation databases are used. All users are recommended to upgrade.
  • Jan 2nd, 2012: Release of variant tools version 1.0.1. This version contains a few new features and bug fixes, and more importantly, dramatic performance improvement for many commands. Please refer to ChangeLog for details about this release.
  • Dec 30th, 2011: the gwasCatalog annotation source is available for download. See examples of how to use gwasCatalog to find published GWA hits that are near your variants.
  • Dec 15th, 2011: Two new annotation sources are available: Cancer Gene Census from the Cancer Genome Project, and the 5400 exomes EVS annotation database from the NHLBI Exome Sequencing Project.
  • Dec 4th, 2011: An application note that describes variant tools has been published online in Bioinformatics.
  • Nov 13, 2011: Release of variant tools version 1.0.
  • Nov 7, 2011: A new annotation source called EVS (Exome Variant Server) is available consisting of exome sequencing variants from the NHLBI Exome Sequencing Project (ESP). This data was retrieved from the project's EVS server and contains population-specific allele frequencies (currently for European Americans and African Americans) and various functional annotations for predicted variants in approximately 2500 exomes.
  • Nov 2, 2011: Release of release candidate version 1.0rc3. This version adds option --jobs to a number of vtools commands and allows them to execute in multiple threads or processes. The user interface is further cleaned up for the final 1.0 release. As a result, support for the MySQL backend is temporarily disabled.
  • Oct 16, 2011: Release of release candidate version 1.0rc2. This version has a new option --children for command vtools init, which allows the creation of a project by merging multiple subprojects.
  • Oct 7, 2011: Release of release candidate version 1.0rc1. This version has a new vtools export command that can export in ANNOVAR and VCF formats.
  • Sep 27, 2011: Release of the second beta. This version contains full Python 3 support and a much more powerful vtools import command.
  • Sep 10, 2011: Release of 1.0 beta.
  • July 15, 2011: Initial public release.

The integrative design of variant tools

If you have used other sequencing or association analysis tools such as bedtools and pseq, you may be surprised that variant tools usually does not give you a report with a list of variants or genes after performing an analysis. Instead, all information, including the results of analyses, is saved in the project in a consistent manner, and an extra step is needed to output the information you need. In other words, the addition and presentation of information about variants are two different processes, and you typically add more and more information to your project during the analysis of your data. The end result is that you have immediate access to a large amount of information for the variants you are interested in, which can in turn help you perform more in-depth analysis. Consider the following fabricated and unusually long command:

% vtools output                                             # 2
             myvariants                                     # 1
             chr pos ref alt                                # 3
             hom_case hom_ctrl                              # 4
             dbNSFP.SIFT_score dbSNP.name refGene.name2     # 5
             asso1.p_value asso2.p_value                    # 6
             "ref_sequence(chr, pos - 5, pos + 5)"          # 7             
             "track('LP056A.BAM')"                          # 8
             "genotype('WGS1')"                             # 9
             "samples()"

  1. myvariants contains a list of variants, which is a subset of the master variant table (all the variants of your project) and is typically created using command vtools select.
  2. Command vtools output outputs the following information for all variants in myvariants:
  3. chr, pos, ref, alt constitute a variant, namely the location and type of a mutation.
  4. hom_case and hom_ctrl are the numbers of homozygous genotypes of this variant in cases and controls. These are called variant info fields and are added to the project using command vtools update --from_stat.
  5. dbNSFP.SIFT_score, dbSNP.name and refGene.name2 are annotation information from different annotation databases. Annotation databases are not part of the project. They are connected to the project using command vtools use.
  6. asso1.p_value and asso2.p_value are results of two different association analyses. These are annotation databases created by command vtools associate.
  7. ref_sequence(chr, pos-5, pos+5) is a function provided by variant tools to retrieve the reference sequence around the variant. Here, the 5 base pairs upstream and downstream of each variant are returned.
  8. track() is a function to extract information from external files. In this example, the depth of coverage at the location of the variant in the specified BAM file is returned.
  9. genotype('WGS1') is another function to get the genotype of this variant in sample WGS1, for example, 1 for heterozygote and 2 for homozygote. Function samples() lists the samples that contain the variant.
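
The separation between adding information and presenting it can be illustrated with a toy model. The sketch below is not how variant tools is implemented; it only mimics the idea with Python's standard sqlite3 module. The toy reference sequence, the table and column layout, the rs name, and the helpers build_project and output are all hypothetical, and ref_sequence here is a stand-in written for this example rather than the function shipped with variant tools.

```python
import sqlite3

# Toy reference genome with a single short "chromosome" (hypothetical data).
REFERENCE = {"1": "ACGTACGTACGTACGTACGT"}


def ref_sequence(chrom, start, end):
    """Return reference bases from start to end (1-based, inclusive)."""
    start = max(start, 1)  # clamp positions that fall before the sequence
    return REFERENCE[chrom][start - 1:end]


def build_project():
    """Create a toy 'project' and add information to it step by step."""
    # isolation_level=None puts the connection in autocommit mode.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE variant (chr TEXT, pos INTEGER, ref TEXT, alt TEXT)")
    conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?)",
                     [("1", 8, "T", "C"), ("1", 12, "T", "G")])
    # Adding a variant info field produces no output (made-up counts).
    conn.execute("ALTER TABLE variant ADD COLUMN hom_case INTEGER")
    conn.executemany("UPDATE variant SET hom_case = ? WHERE pos = ?",
                     [(3, 8), (0, 12)])
    # Linking an annotation database that lives outside the project.
    conn.execute("ATTACH DATABASE ':memory:' AS dbSNP")
    conn.execute(
        "CREATE TABLE dbSNP.annotation (chr TEXT, pos INTEGER, name TEXT)")
    conn.execute("INSERT INTO dbSNP.annotation VALUES ('1', 8, 'rs0000001')")
    # Exposing a Python function so it can be used inside SQL expressions.
    conn.create_function("ref_sequence", 3, ref_sequence)
    return conn


def output(conn):
    """Presentation is a separate step that pulls everything together."""
    query = """
        SELECT v.chr, v.pos, v.ref, v.alt, v.hom_case, a.name,
               ref_sequence(v.chr, v.pos - 5, v.pos + 5)
        FROM variant AS v
        LEFT JOIN dbSNP.annotation AS a
               ON a.chr = v.chr AND a.pos = v.pos
        ORDER BY v.pos
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in output(build_project()):
        print(row)
```

In this toy model, variants without a matching annotation record simply get an empty value for that field, and fields computed by a function appear next to stored fields in the same output.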

As you can see, individual commands such as vtools use and vtools update do not produce any output, but add information to the project that can be displayed along with other information. It is important to remember that all such information can also be used to select, prioritize, and analyze your variants. Another fabricated command would look like

% vtools select                                             # 1
           myvariants                                       # 2
           "refGene.name2='BRCA1'"                          # 3
           'dbNSFP.SIFT_score > 0.95'                   
           'hom_case > 15' 'hom_ctrl = 0'                   # 4
           'asso1.p_value < 0.05 OR asso2.p_value < 0.05'   # 5
           --to_table significant                           # 6
  1. Command vtools select selects variants according to their properties.
  2. It starts from table myvariants, which was itself selected using some other criteria.
  3. The variants must be in gene BRCA1 and must have SIFT scores > 0.95 (probably damaging).
  4. They must appear as homozygotes in more than 15 of the cases, and must not appear as homozygotes in any of the controls.
  5. They must be significant (p-value < 0.05) in at least one of the two association tests.
  6. The selected variants are written to another table named significant.
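
The selection logic described in the steps above (several conditions that must all hold, with the result saved to a new table) can be sketched in a few lines. This is a toy model built on Python's sqlite3 module, not the implementation of vtools select. The flat table layout, the column names gene, sift, p1 and p2, the data, and the helper names are hypothetical, and the conditions are pasted into SQL unescaped, which is acceptable only for an illustration.

```python
import sqlite3


def select_variants(conn, from_table, conditions, to_table):
    """Copy rows of from_table that satisfy ALL conditions into to_table."""
    # Each condition is parenthesized so that an OR inside one condition
    # cannot leak into its neighbours.
    where = " AND ".join(f"({c})" for c in conditions)
    conn.execute(
        f"CREATE TABLE {to_table} AS SELECT * FROM {from_table} WHERE {where}")
    return conn.execute(f"SELECT COUNT(*) FROM {to_table}").fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE myvariants (pos INTEGER, gene TEXT, sift REAL, "
        "hom_case INTEGER, hom_ctrl INTEGER, p1 REAL, p2 REAL)")
    conn.executemany("INSERT INTO myvariants VALUES (?, ?, ?, ?, ?, ?, ?)", [
        (101, "BRCA1", 0.99, 20, 0, 0.01, 0.30),  # passes every condition
        (102, "BRCA1", 0.99, 20, 2, 0.01, 0.30),  # homozygous in controls
        (103, "BRCA1", 0.50, 20, 0, 0.01, 0.30),  # SIFT score too low
        (104, "TP53", 0.99, 20, 0, 0.01, 0.30),   # wrong gene
        (105, "BRCA1", 0.99, 20, 0, 0.20, 0.30),  # neither test significant
        (106, "BRCA1", 0.97, 16, 0, 0.40, 0.04),  # passes via second test
    ])
    count = select_variants(conn, "myvariants", [
        "gene = 'BRCA1'",
        "sift > 0.95",
        "hom_case > 15",
        "hom_ctrl = 0",
        "p1 < 0.05 OR p2 < 0.05",
    ], "significant")
    kept = [r[0] for r in
            conn.execute("SELECT pos FROM significant ORDER BY pos")]
    return count, kept


if __name__ == "__main__":
    print(demo())
```

Because the result is itself a table, it can serve as the starting point for a further round of selection, which mirrors how myvariants was used in the command above.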

In summary, variant tools is NOT designed to be a black-box tool that analyzes your data and generates a nice-looking report with a list of candidate variants or genes. It is a platform under which you can analyze your data using several methods, compare and analyze the results, and then re-compare and re-analyze using different methods or annotation sources, based on the information obtained from your previous analyses. The unique advantage of variant tools is that you generally do not need to write a bunch of scripts to connect the input and output of different tools and to parse and compare results in different formats, and that you have easy access to a huge amount of information that helps you select, prioritize and analyze your variants, all from your command line. However, because of the uniqueness of this design, please read through the Concepts section of this website before using variant tools.

Things you can do with variant tools

Tasks by category:

Variant calling
  • Call variants from raw reads in FASTQ or BAM format (BAM files are converted to FASTQ first) using the GATK best practice pipeline.

Import variants
  • Import variants and genotypes in VCF format, with options to import specified variant and genotype info fields.
  • Import all info and genotype fields, including customized fields, from VCF files.
  • Import SNP and Indel variants from the Illumina CASAVA pipeline before version 1.8 (text files), and variants called from the Complete Genomics pipeline.
  • Pipeline to import variants from the recent versions of the Illumina CASAVA pipeline (in VCF format) that provides variant calls from two probabilistic models.
  • Import variants in text or CSV files.
  • Import variants from files in Plink format.
  • Import variants from a list of rsnames (dbSNP IDs), or just chromosomes and positions; variant information is retrieved from the dbSNP database.
  • Import data in arbitrary formats by defining a customized format-description file.

Reference genome
  • Native support for builds hg18 and hg19 of the human genome, although other reference genomes can be supported. Reference genomes are downloaded automatically when they are used.
  • Variants in different reference genomes can be imported and analyzed together, through automatic mapping between primary and alternative reference genomes.
  • Supports the use of annotations in a different reference genome by mapping genomic coordinates across reference genomes.
  • Easily retrieve reference sequences around variant sites through function ref_sequence. This allows you to check if variants are in, for example, mononucleotide or short-tandem repeat sequences.
  • Validate the build of the reference genome if you are uncertain about the reference genome used in the data.

Variant annotation
  • Standardize annotations from different sources so that you do not have to worry about inconsistencies in chromosome names (with/without leading chr), genomic positions (0- or 1-based) and other nomenclatures.
  • Annotations are automatically downloaded from an online repository, or built from source if needed. Annotation databases are automatically updated, although you can use a prior version, or use different versions of the same annotation database at the same time.
  • Detailed descriptions of available annotation databases are readily available from command vtools show annotation.
  • Supports CCDS, Entrez, Known Gene, and RefSeq definitions of genes, which allow you to identify variants in genes, exon regions, or upstream/downstream of these genes.
  • Standardize gene names through the use of HUGO Gene Nomenclature Committee approved gene names.
  • Identify variants in the Catalogue of Somatic Mutations in Cancer or within the Database of Genomic Variants.
  • Identify variants in all versions of dbSNP databases, the Exome Sequencing Project, the thousand genomes project, and the HapMap project.
  • Annotate variants with SIFT, PolyPhen, MutationTaster and many other prediction scores from dbNSFP.
  • Check for variants that are in the GWAS Catalog database, or variants that are within a certain range of GWAS hits.
  • Identify variants in highly conserved regions through the phastCons database, or variants in genomic duplication regions.
  • Pipelines to automatically annotate variants using ANNOVAR and snpEff.
  • Allow the creation of annotation databases from your own data in VCF format.
  • Convert variants in a variant tools project to an annotation database to be used by another project, or convert an annotation database to a project for detailed analysis.
  • Users can define and create their own annotation databases through customized annotation description files.

External annotation
  • Retrieve calls, reads, quality, and coverage information from BAM files, filtered by quality score, strand, type, or flags, and use such information to select variants. This provides a command line alternative to IGV to check raw reads for called variants.
  • Retrieve variant info and genotype information from local or online tabix-indexed vcf.gz files. This allows you, for example, to obtain variant info from VCF files on the 1000 genomes website.
  • Retrieve annotations from bigWig or bigBed files, for example from the ENCODE project.

Samples and phenotypes
  • Import and keep track of samples using filename and sample names.
  • Rename samples and merge genotypes from multiple input files.
  • Arbitrary sample information such as sex, BMI, and ethnicity can be saved as phenotypes and used for sample selection or association analysis.
  • Calculate the number of genotypes, alternative alleles, homozygotes, heterozygotes and other types of genotypes in all or a subset of samples.
  • Calculate minimum, maximum, and average values of genotype info (e.g. quality score) across all or selected samples for each variant.

Variant selection
  • Use sample statistics to select, for example, homozygous variants with acceptable quality that appear only in cases.
  • Select variants based on their membership in annotation databases such as dbSNP and the thousand genomes project.
  • Select variants using multiple conditions that involve multiple variant and annotation info fields (e.g. SIFT score).
  • Variants selected by different criteria are kept in multiple variant tables, with meta information.
  • Compare variant tables and examine differences between two or more variant tables.
  • Identify de novo mutations from family-based samples, and identify variants that share the same sites with an existing set of variants.
  • Pipelines to identify de novo or recessive mutations that might cause the phenotype of an affected offspring in a family of unaffected parents.

Output variants
  • Output a large number of variant info and annotation fields across different annotation databases altogether.
  • Output expressions of variant info and annotation fields, including vtools-specific SQL functions.
  • Output the reference sequence around a variant site, genotypes of one or more samples, and samples that harbor the variants.
  • Output summary statistics (e.g. count, average) of variants and variant info fields, grouped by specified fields.

Export variants
  • Export variants in VCF format, with variant and annotation info, and genotypes.
  • Export variants in other formats such as ANNOVAR and Plink to be analyzed by these programs.
  • Export variants with variant info and annotation fields in CSV format.

Association analysis
  • Use more than 20 association analysis methods to associate variants and genes with qualitative or quantitative traits.
  • Execute multiple association tests across the genome using multiple processes.
  • Results of association analyses are saved as annotation databases and are used to annotate individual variants, regardless of the groups used to analyze the data.
  • Draw Manhattan plots and other figures from association test results.
  • Perform meta-analysis from association test results.

Reports
  • Print reference sequences for particular regions, genes, exomes, etc.
  • Calculate the discordance rate between samples.
  • Calculate average depth of coverage and number of SNPs and Indels for all or selected samples.
  • Calculate the transition/transversion ratio for all or selected variants.
  • Scatter plots, box plots, and histograms for variant info fields, genotype info fields, and phenotypes.

Data management
  • A project can be saved, transferred and loaded easily as snapshots. A number of online snapshots are provided for learning purposes.
  • Remove genotypes based on different criteria (e.g. quality score), or remove variants in a variant table.
  • Merge data from several sub projects (e.g. adding data from different batches).
  • Split a project into sub projects to focus on particular sets of variants or samples.
  • A resource management system to download and update resources on demand, or in batch.
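
As one concrete illustration of the report tasks listed above, the transition/transversion ratio can be computed directly from ref and alt alleles. The sketch below uses only the standard definition (A<->G and C<->T substitutions are transitions, all other single-base substitutions are transversions). It is not the code used by vtools_report, and the function name titv_ratio and the input layout are assumptions made for this example.

```python
# The two transition classes: purine<->purine and pyrimidine<->pyrimidine.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def titv_ratio(variants):
    """Count transitions and transversions among (ref, alt) pairs.

    Indels, multi-base substitutions, and non-ACGT alleles are skipped.
    Returns (transitions, transversions, ratio); ratio is None when there
    are no transversions.
    """
    ti = tv = 0
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # not a single-nucleotide substitution
        if ref not in "ACGT" or alt not in "ACGT":
            continue  # placeholder or ambiguous allele
        if frozenset((ref, alt)) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    ratio = ti / tv if tv else None
    return ti, tv, ratio


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("A", "C"),
               ("G", "T"), ("T", "C"), ("AT", "A")]
    print(titv_ratio(example))
```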

Please refer to a list of tutorials to get started.

Citation for variant tools

Please cite

F. Anthony San Lucas, Gao Wang, Paul Scheet, and Bo Peng (2012) Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools, Bioinformatics 28 (3): 421-422.

if you find variant tools helpful and use it in your publication. Thank you.