Department of Translational Genomics Lab Pages, Student Resources, and Web Applications

Understanding Clinical Bioinformatics via Case Study Of Somatic Variant Calling


In this exercise, we will review several concepts, including the regulation of genomic testing and the requirements for creating highly reproducible, repeatable, analytically validated tests.  We will work through an example using variant calls from two somatic variant callers; somatic variant calling typically refers to identifying genetic variants in one tissue or set of cells that are not part of the individual's constitutional (germline, inherited) genome.  The purpose is to gain a better understanding of the unique challenges of assessing the analytical validity of tests.

You will be expected to know the following: (1) the differences between CLIA-regulated LDTs and FDA-cleared tests from a regulatory perspective, including which agencies regulate each; and (2) the core metrics of analytical validity and how to properly compare two variant callers.

Regulatory Oversight

Regulatory considerations. Deployment of a clinical assay in the United States involves two paths, each with accompanying regulations administered under the US Department of Health and Human Services (HHS). The first path comprises tests regulated under the Clinical Laboratory Improvement Amendments (CLIA) of 1988 as 'laboratory-developed tests' (LDTs). LDTs are in vitro diagnostic tests that are developed and used within a single approved laboratory and are not marketed towards any other laboratory. CLIA regulations monitor the laboratory process to ensure the accuracy, reliability and appropriateness of laboratory testing, from sample acquisition, handling and storage to the interpretation and reporting of test results. The guidelines for approving CLIA laboratories are established by accredited professional organizations, such as the College of American Pathologists, or by other agencies approved by the Centers for Medicare and Medicaid Services (CMS), such as in the state of New York. CLIA regulations of LDTs do not address the clinical validity or clinical utility of an assay, but instead provide a framework whereby clinical laboratories validate analytical performance measures of the LDTs within their own laboratory facility. The second set of regulations for clinical assay deployment are the Medical Device Amendments of 1976, which expanded FDA oversight for the marketing of in vitro diagnostic devices (IVDs). FDA premarket review of IVDs assures the assay has established analytical and clinical validity; with the exception of companion diagnostics, the FDA does not typically require demonstration of clinical utility for clearance or approval of IVDs. Demonstration of clinical validity (for LDTs) and clinical utility (for LDTs and IVDs) can follow the initial clearance or approval of a diagnostic test; clinical utility, in particular, requires broader clinical evaluation across multiple sites and/or within clinical trials.
For example, the Afirma (Veracyte) microarray-based gene expression classifier for thyroid nodule assessment was launched in 2011 as a CLIA-regulated LDT. Subsequent studies have reported on its clinical validity and clinical utility, the latter involving a multi-site study that demonstrated the effects of the Afirma test on clinical care recommendations, which resulted in a reduction of unnecessary surgeries.

In the US, the distinction over when NGS assays are under regulatory oversight by the FDA or the CMS is emerging as an area of regulatory and legislative debate. In late 2014, the FDA proposed a regulatory framework for LDTs that will, in all likelihood, alter the regulatory landscape for RNA-seq assays moving to the clinic. The FDA also provided a perspective on the mammoth shifts created by technological advances associated with NGS, and the requirement for the agency to change from the current 'general enforcement discretion' — in which the FDA has generally not enforced regulations with respect to LDTs — to having a more active role, with proposed premarket review and quality system regulation requirements. Under this proposed LDT framework, the CMS (under CLIA) would oversee the laboratory operations and testing processes and the FDA would monitor compliance with quality system regulations.

The effects of expanding FDA regulatory oversight on RNA-seq are predicated on the FDA approval process for the first FDA-cleared NGS instrument and NGS in vitro diagnostic tests: the Illumina MiSeqDx and the associated in vitro diagnostic assays for cystic fibrosis, the Illumina MiSeqDx Cystic Fibrosis 139-Variant and Cystic Fibrosis Clinical Sequencing Assays. Accuracy was evaluated using a representative subset of variants, rather than evaluating all possible variants, and relied on publicly available data to support the clinical relevance of the variants. Although evaluation of analytical performance may continue to involve this subset-based approach, the proposed new standards, as outlined by the FDA, could include defined technical metrics for data quality, additional standards for computational approaches and standard best practices for quality assurance. The debate over FDA oversight is largely focused on detecting the presence or absence of DNA variants, such as germline cystic fibrosis transmembrane conductance regulator (CFTR) or BRCA2 testing. While the FDA guidance and surrounding debate draw on a limited set of examples, the broad scope of additional regulation on all NGS-developed tests, including RNA-seq, may create regulatory uncertainty for RNA-seq and impede its adoption in the clinic. The proposed FDA regulations around NGS have not gone without debate; critics emphasize that limited enforcement capabilities and regulatory guidance could unnecessarily stifle adoption and innovation. International regulatory frameworks vary across jurisdictions, with evolving practice guidelines and regulations for the clinical use of NGS. For example, in the European Union (EU), IVD tests require a Conformité Européenne (CE) mark to indicate compliance with the EU IVD Directive (98/79/EC).
Similar to the US, the EU is reviewing policy changes related to IVDs, with proposed changes to harmonize the IVD market and increase oversight, including the use of a risk-based classification scheme to define clinical evidence requirements, such as analytical and clinical performance, for IVD approval. The pending regulatory changes, both in the US and internationally, may substantially impact the clinical translation of RNA-seq, particularly until greater consensus is reached on reference standards.

Clinical Utility, Analytical Validity

Translation and broader adoption of a laboratory test into the clinic involves evaluation and demonstration of analytical validity, clinical validity and, eventually, clinical utility. Analytical validity generally refers to the ability of the test to measure the intended biomolecules within clinically relevant conditions. Establishment of an analytically valid test can have different meanings depending on the regulatory framework that the test falls under, as discussed above. However, analytical validity generally implies that the test has undergone thorough technical performance characterization. Clinical validity refers to the ability of a test to predict a clinical outcome given a set of events, irrespective of whether the test results can enable an effective therapy. Clinical utility indicates whether a test provides useful information, positive or negative, for the patient being tested. Tests that can either indicate a more effective therapy, such as a companion diagnostic, or provide information on avoiding some therapies may both have clinical utility.

Performance metrics and reference standards. To be analytically valid, a laboratory test must deliver accurate information with reproducible and robust performance. Accuracy is determined by evaluating a measured or calculated value compared to a reference 'gold standard', with evaluation of sensitivity (ability to detect true positives) and specificity (ability to detect true negatives). The test must also provide the same or similar results with repeat testing (reproducibility) and withstand small, deliberate changes in pre-analytic or analytic variables associated with testing (robustness).
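The relationships above can be made concrete with a small sketch. The function below (in Python) computes the core metrics from confusion-matrix counts; the function name and the counts are illustrative only and are not drawn from the exercise data.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Analytical-validity metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                 # true positive rate
        "specificity": tn / (tn + fp),                 # true negative rate
        "precision":   tp / (tp + fp),                 # positive predictive value
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
    }

# Toy counts, not from any real assay:
print(confusion_metrics(tp=90, fp=10, tn=80, fn=20))
```

Note that sensitivity depends only on how the positives are classified, and specificity only on the negatives, which is why both must be reported together.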

Sensitivity, Specificity, Accuracy, Precision

Reportable Range. Reportable range refers to the range of test result values over which the laboratory can establish or verify the accuracy of the assay or test. For quantitative assays such as RNA-seq, the reportable range establishes the linearity of the assay, and requires that replicate samples be analyzed at multiple points across the reportable range.

Analytical sensitivity.  Analytical sensitivity is also referred to as a limit-of-detection study.  It requires studying multiple samples at the anticipated limit of detection, across multiple days, to determine sensitivity at that level.

Analytical specificity.  Analytical specificity is the ability of an assay to report only the specified biomarker in the context of cross-reactivity or interference from related or potentially interfering nucleic acids or specimen-related conditions.  It is also referred to as an interference study: measurements are evaluated in the presence of substances that either could have or did compromise the measurement.  Analytical specificity is particularly relevant for translating RNA-seq because RNA molecules are highly sensitive to degradation.  Moreover, amplification methods that prime from poly(A) tails, such as those applied to FFPE material, can significantly confound measurements.

Accuracy. Accuracy (or trueness) studies usually require an external measure of truth to establish true positives, false positives, true negatives, and false negatives.  Typically, measures of truth come from the same sample being run at different laboratories, or from reference standards.  For RNA-seq, efforts to establish these standards are described in further detail below.

Precision.  Precision is the “closeness of agreement”, or reproducibility, between independent measurements of a single homogeneous sample, and generally includes studies of both reproducibility and repeatability. Reproducibility studies evaluate consistency in measurements when the assay is repeated from the beginning, such as from RNA isolation.  Repeatability studies evaluate consistency in measurements under the same conditions, such as the same operator and sample.
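Precision is often summarized as a coefficient of variation (CV) across replicates. The sketch below uses made-up replicate values (not real assay data) to show how repeatability and reproducibility runs might be compared; intuitively, repeating the full workflow from RNA isolation introduces more variability than re-measuring under identical conditions.

```python
import statistics

# Toy replicate measurements of one analyte from a single homogeneous sample
# (illustrative values only, not real assay data):
runs = [
    ("repeatability",   [10.1, 9.8, 10.3, 10.0]),  # same conditions, operator, sample
    ("reproducibility", [10.4, 9.5, 10.6, 9.9]),   # assay repeated from RNA isolation
]

for name, vals in runs:
    mean = statistics.mean(vals)
    cv = statistics.stdev(vals) / mean * 100   # coefficient of variation, %
    print(f"{name}: mean={mean:.2f}, CV={cv:.1f}%")
```

In this toy example the reproducibility CV comes out larger than the repeatability CV, as expected when more of the workflow is repeated.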

Another way to understand these metrics in the context of medical tests is that sensitivity is the extent to which true positives are not missed/overlooked (so false negatives are few) and specificity is the extent to which positives really represent the condition of interest and not some other condition being mistaken for it (so false positives are few). Thus a highly sensitive test rarely overlooks a positive (for example, showing "nothing bad" despite something bad existing); a highly specific test rarely registers a positive for anything that is not the target of testing (for example, finding one bacterial species when another closely related one is the true target); and a test that is highly sensitive and highly specific does both, so it "rarely overlooks a thing that it is looking for" and it "rarely mistakes anything else for that thing." Because most medical tests do not have sensitivity and specificity values above 99%, "rarely" does not equate to certainty. But for practical reasons, tests with sensitivity and specificity values above 90% have high credibility, albeit usually no certainty, in differential diagnosis.

Sensitivity therefore quantifies the avoiding of false negatives, and specificity does the same for false positives. For any test, there is usually a trade-off between the measures. For instance, in airport security, since screening of passengers is for potential threats to safety, scanners may be set to trigger alarms on low-risk items like belt buckles and keys (low specificity) in order to increase the probability of identifying dangerous objects and minimize the risk of missing objects that do pose a threat (high sensitivity). This trade-off can be represented graphically using a receiver operating characteristic curve. A perfect predictor would be described as 100% sensitive, meaning all sick individuals are correctly identified as sick, and 100% specific, meaning no healthy individuals are incorrectly identified as sick. In reality, however, any non-deterministic predictor will possess a minimum error bound known as the Bayes error rate.

  • True positive: carriers correctly identified as carriers
  • False positive: healthy people incorrectly identified as carriers
  • True negative: healthy people correctly identified as healthy
  • False negative: carriers incorrectly identified as healthy
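The trade-off between sensitivity and specificity can be demonstrated with a toy threshold on classifier scores. Everything below is illustrative: the scores are made up, and the function is not part of the graded exercise. Raising the threshold trades sensitivity for specificity, which is exactly what a ROC curve traces out.

```python
# Toy scores for carriers (positives) and healthy samples (negatives):
positives = [0.9, 0.8, 0.7, 0.4, 0.35]
negatives = [0.6, 0.3, 0.2, 0.15, 0.1]

def sens_spec(threshold):
    """Sensitivity and specificity when calling 'positive' at score >= threshold."""
    tp = sum(s >= threshold for s in positives)
    tn = sum(s < threshold for s in negatives)
    return tp / len(positives), tn / len(negatives)

for t in (0.2, 0.5):
    sens, spec = sens_spec(t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With the low threshold everything suspicious alarms (high sensitivity, low specificity), much like the airport scanner example above.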


It's important to understand the rapidly evolving field of CLIA testing, so we review several examples for discussion.


PCDx cobranded confidential sample report


TN Sample Report Follicular Lymphoma 12.20.10



Additional Reading

Read Section B: FDA Guidance

GA4GH: GA4GH Guidance

Comparison Paper: 023754.full


References are needed to enable interpretation of results generated using analytical pipelines that may differ significantly across institutions and to account for bias or variability in sample preparation and sequencing.

In order to define references to support the implementation of sequencing in the clinic, the National Institute of Standards and Technology (NIST) has established the Genome in a Bottle (GIAB) Consortium. By integrating fourteen sequencing data sets generated from the NA12878 cell line using five different technologies, analyzed with multiple aligners and variant detection tools, the consortium defined a benchmark set of genotypes. Additionally, Illumina’s Platinum Genomes project has publicly released sequencing data and analysis of a three-generation, seventeen-member CEPH (Centre d’Etude du Polymorphisme Humain; Utah residents with northern and western European ancestry) pedigree (1463) in order to evaluate the accuracy of variant calling.

To assess a somatic caller, we will use tumor/normal variant calls from a publicly released study of somatic alterations identified in paired tumor/constitutional cell lines available from ATCC. In that study, Pleasance et al. performed whole genome sequencing (WGS) on COLO829, an immortal cell line derived from a metastasis of a cutaneous melanoma patient, and on COLO829BL, a lymphoblastoid line from the same subject.

Calculate sensitivity, specificity, precision, and accuracy for two somatic callers, Mutect and Strelka, using VCFs and a reference standard.

You are provided with reference somatic variant calls for COLO829 and variant calls from two callers.  The data are below, and also on the server in the /classes/trgn515/colo829/vcfs directory.

Please create two text files, named mutect.analyticalvalidity.txt and strelka.analyticalvalidity.txt, containing the following values for each variant caller against the reference standard COLO829, computed over the reportable range reportable_range.bed. The two callers have only produced calls within the reportable range, though the reference VCF spans the whole genome.  Note that the reportable range is an Agilent exome capture in BED format.  You can complete the exercise using R, Excel, Matlab, Bash, Python, Perl, or any tool of your choice; the approach itself will not be evaluated.

  1. True Positives
  2. True Negatives
  3. False Positives
  4. False Negatives
  5. Sensitivity
  6. Specificity
  7. Accuracy
  8. Precision
  9. True Negative Rate
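One possible starting point (not a required or graded approach) is to treat each variant as a (CHROM, POS, REF, ALT) key and compare sets. The function names are ours, and defining true negatives is itself an assumption: here we take the total number of assessable positions in the reportable range (e.g., the summed BED interval lengths) and subtract the variant sites.

```python
def vcf_keys(lines):
    """Collect (CHROM, POS, REF, ALT) keys from VCF text lines, skipping headers."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        keys.add((chrom, pos, ref, alt))
    return keys

def compare(caller, reference, total_positions):
    """Confusion counts; total_positions = assessable sites in the reportable range."""
    tp = len(caller & reference)          # called and in the reference
    fp = len(caller - reference)          # called but not in the reference
    fn = len(reference - caller)          # in the reference but missed
    tn = total_positions - tp - fp - fn   # everything else: correctly not called
    return tp, fp, tn, fn

# Tiny demonstration with made-up VCF lines (not the class data):
ref = vcf_keys(["#CHROM\tPOS\tID\tREF\tALT",
                "chr1\t100\t.\tA\tT",
                "chr1\t200\t.\tG\tC"])
calls = vcf_keys(["chr1\t100\t.\tA\tT",
                  "chr1\t300\t.\tC\tG"])
print(compare(calls, ref, total_positions=1000))
```

Against the class files you would read the VCFs from /classes/trgn515/colo829/vcfs instead of the toy lines, and derive total_positions from reportable_range.bed.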

Reference: colo829.reference.vcf

  • Reference with calls only in the reportable range: colo829.reference.rr.vcf

Mutect Calls: colo829.mutect.rr.vcf

Strelka2 Calls: colo829.strelka2.rr.vcf

Reportable Range: reportable_range.bed
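Since the genome-wide reference VCF extends beyond the reportable range, one illustrative step (again, not a required approach; a restricted reference VCF is also provided above) is checking whether a variant position falls inside a BED interval. The subtlety worth noting is coordinates: BED intervals are 0-based and half-open, while VCF POS is 1-based. The BED lines below are made up.

```python
def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]}, 0-based half-open."""
    intervals = {}
    for line in lines:
        if not line.strip() or line.startswith(("track", "#")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals.setdefault(chrom, []).append((int(start), int(end)))
    return intervals

def in_range(intervals, chrom, pos):
    """True if a 1-based VCF POS lies inside a 0-based half-open BED interval."""
    return any(start < pos <= end for start, end in intervals.get(chrom, []))

# Toy intervals, not the Agilent exome capture:
bed = load_bed(["chr1\t90\t150", "chr2\t0\t50"])
print(in_range(bed, "chr1", 100))   # inside chr1:90-150
print(in_range(bed, "chr1", 200))   # outside
```

Filtering the genome-wide colo829.reference.vcf through a check like this should reproduce the provided colo829.reference.rr.vcf, which is a useful sanity check on coordinate handling.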