RESEARCH AND REPORTING METHODS

Evaluating Diagnostic Accuracy in the Face of Multiple Reference Standards

Christiana A. Naaktgeboren, MPH; Joris A.H. de Groot, PhD; Maarten van Smeden, MSc; Karel G.M. Moons, PhD; and Johannes B. Reitsma, MD, PhD

A universal challenge in studies that quantify the accuracy of diagnostic tests is establishing whether each participant has the disease of interest. Ideally, the same preferred reference standard would be used for all participants; however, for practical or ethical reasons, alternative reference standards that are often less accurate are frequently used instead. The use of different reference standards across participants in a single study is known as differential verification. Differential verification can cause severely biased accuracy estimates of the test or model being studied. Many variations of differential verification exist, but not all introduce the same risk of bias. A risk-of-bias assessment requires detailed information about which participants receive which reference standards and an estimate of the accuracy of the alternative reference standard. This article classifies types of differential verification and explores how they can lead to bias. It also provides guidance on how to report results and assess the risk of bias when differential verification occurs and highlights potential ways to correct for the bias.

Ann Intern Med. 2013;159:195-202. www.annals.org
For author affiliations, see end of text.

A universal challenge in quantifying the accuracy of diagnostic tests or models is establishing whether each patient has the disease of interest (1). This classification is necessary to calculate various measures of diagnostic accuracy for the test being studied, such as sensitivity and specificity, predictive values, likelihood ratios, or receiver-operating characteristic curves (2). It is also a prerequisite for the derivation and validation of multivariate diagnostic prediction models. When making the classifications, researchers aim to use the best available method (the preferred reference standard) to verify the diagnosis in all participants.

Because of practical or ethical constraints, it is not always possible to ascertain disease status in all participants by using the preferred reference standard. Often an alternative, less accurate method (inferior reference standard) is used in patients who do not receive the preferred reference standard.

The use of different reference standards in different groups of participants in a diagnostic study is known as differential verification (3). Differential verification is common; it may occur in up to one quarter of all diagnostic accuracy studies (4, 5).

As shown in Table 1, differential verification occurs for various reasons and in a wide range of clinical fields. A distinctive example comes from studies on the accuracy of mammography in detecting breast cancer (6, 11). The preferred reference standard, biopsy, is performed only when a lesion is found during mammography, indicating where to perform biopsy. In the ideal scenario, patients without lesions would also undergo this preferred reference standard, which would mean that they should have random biopsies. However, this option is not ethical and is not equivalent to a targeted biopsy. Instead, patients without lesions are followed to see whether breast cancer develops before the next screening.
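For orientation, the accuracy measures named in the introduction are the standard quantities computed from the index test's 2 x 2 table against the final disease classification. The following textbook definitions are added here as a reference and are not reproduced from the article itself:

```latex
% TP, FP, FN, TN: index test results cross-classified against the
% disease classification made by the reference standard.
\[
\mathrm{Se} = \frac{TP}{TP+FN}, \qquad
\mathrm{Sp} = \frac{TN}{TN+FP}, \qquad
\mathrm{PPV} = \frac{TP}{TP+FP}, \qquad
\mathrm{NPV} = \frac{TN}{TN+FN},
\]
\[
\mathrm{LR}^{+} = \frac{\mathrm{Se}}{1-\mathrm{Sp}}, \qquad
\mathrm{LR}^{-} = \frac{1-\mathrm{Se}}{\mathrm{Sp}}, \qquad
\mathrm{DOR} = \frac{\mathrm{LR}^{+}}{\mathrm{LR}^{-}} = \frac{TP \cdot TN}{FP \cdot FN}.
\]
```

Every one of these quantities is computed against the reference standard's classification, which is exactly why an inaccurate reference standard propagates into every estimate.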
Relying on disease classification that is based on an alternative reference standard may seem logical, but problems arise if one mistakenly treats the alternative and preferred reference standards as interchangeable when analyzing, reporting, or making inferences. When reference standards do not correspond well with the underlying “true” disease status (as is often the case with an inferior reference standard), the final disease classification will be wrong in some patients. The misclassification can cause biased estimates of diagnostic accuracy and regression parameters (12). When a mix of reference standards is used, index test or model accuracy estimates will systematically differ from the ideal study in which all patients have the preferred reference standard. This systematic deviation is called differential verification bias (4, 13, 14). Figure 1 depicts a clinically relevant example of it.

This article elaborates on problems that can be confused with differential verification, proposes a classification system for types of differential verification, and explores the mechanisms by which each type leads to bias. It provides guidance on how to clearly report results when differential verification is present and how to assess and correct for the risk of bias. We believe that such guidance extends and improves STARD (Standards for the Reporting of Diagnostic Accuracy Studies) and QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies 2) (13, 16).

CLASSIFICATION OF VERIFICATION PATTERNS

Figure 2 gives an overview of the main verification patterns. Complete verification is the ideal situation in which all participants are verified by the same reference standard. A reference standard is often a single test, but when no single highly accurate test is available, multiple component tests may be used as a single reference method instead. Component tests can be used to make a final diagnosis in several ways. These include composite reference standards (a fixed rule for combining individual test results into a final diagnosis), panel diagnosis (consensus diagnosis by a group of experts), and latent class models (a statistical
method that uses associations among tests to define the unobserved disease status) (1). Although these methods use several pieces of information, the same information is available for all study participants and the same method is used to reach a final diagnosis.

The situation in which disease status is not ascertained at all in some patients, and is thus missing, is termed partial verification. When the missingness of the reference standard is somehow related to the index test results, index test accuracy estimates based on only the complete cases will be biased (17). Depending on the pattern of missingness, it may be possible to mathematically correct for partial verification bias (18, 19).

The situation in which different reference standards are used in different groups of patients is termed differential verification. The word differential means “varying according to circumstances or relevant factors” (20). A salient question in the case of differential verification is, “What factors determine which reference standard is used?” Figure 3 shows 2 basic differential verification patterns. The key distinction between the patterns is the reason that patients received one reference standard over another.

Differential verification can be thought of as a missing data problem in which the preferred reference standard is missing in some patients and replaced by an alternative, inferior reference standard (21). Data can be missing completely at random, or the missingness can be related to measured or unmeasured patient characteristics (17, 22). If missingness is completely at random, estimates will be unbiased but inefficient, leading to CIs that will be wider than those obtained with complete verification. If missingness is dependent on patient characteristics, estimates are probably biased but the bias can be corrected for if these patient characteristics are measured.

In pattern A depicted in Figure 3, reference standard assignment depends entirely on the index test results. The earliest definition of differential verification was limited to pattern A and was described as the situation in which “negative [index] test results are verified by a different, often less thorough, [reference] standard, for example follow-up” (4). This particular situation, in which all positive index test results are verified by the preferred reference standard whereas all negative index test results are verified by an inferior standard, was later termed complete differential verification (11), but we propose referring to it as complete index test–dependent differential verification for clarity. Pattern A either is a product of the clinical situation or is determined a priori by the researcher.

Frequently in diagnostic studies, the decision about which reference standard each patient receives is less straightforward and is determined by various factors besides, or in addition to, index test results. In pattern B depicted in Figure 3, the choice of reference standard is influenced by other factors, such as signs and symptoms, other test results, or patient or physician preference. Pattern B may be broken down into 3 subtypes based on the pattern of “missingness” of the preferred reference standard (17). First, the choice of reference standard sometimes depends solely on known factors that are measured and recorded in the study. This pattern may occur, for example, when different study centers use slightly different imaging techniques as the reference standard because of availability of the technology. Second, unknown factors (those not recorded) could also influence this choice. An example of such a factor is patient preference: When the preferred reference standard is burdensome, some participants, particularly those with less severe symptoms, may opt out and be followed instead. Third, studies can be designed in which the reference standard is assigned completely at random.
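To make the mechanism of pattern A concrete, the following minimal simulation sketch pools the two naive 2 x 2 tables exactly as described above. All parameter values (cohort size, prevalence, and the accuracies of the index test and the follow-up-style inferior standard) are illustrative assumptions, not figures from the article:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, prevalence = 100_000, 0.20            # assumed cohort size and prevalence
se_index, sp_index = 0.80, 0.90          # assumed true accuracy of the index test
se_inf, sp_inf = 0.70, 1.00              # assumed accuracy of the inferior standard

diseased = rng.random(n) < prevalence
index_pos = np.where(diseased,
                     rng.random(n) < se_index,    # positives at rate Se
                     rng.random(n) >= sp_index)   # false positives at rate 1 - Sp

# Pattern A: index-positives are verified by a perfect preferred standard,
# index-negatives by the imperfect inferior standard (e.g., follow-up).
classified_pos = np.where(index_pos,
                          diseased,
                          np.where(diseased,
                                   rng.random(n) < se_inf,
                                   rng.random(n) >= sp_inf))

tp = np.sum(index_pos & classified_pos)
fn = np.sum(~index_pos & classified_pos)
fp = np.sum(index_pos & ~classified_pos)
tn = np.sum(~index_pos & ~classified_pos)

print(f"naive sensitivity: {tp / (tp + fn):.3f} (true value {se_index})")
print(f"naive specificity: {tn / (tn + fp):.3f} (true value {sp_index})")
```

Because the inferior standard misses some diseased index-negatives, those patients are counted as true negatives rather than false negatives, so the naive sensitivity (about 0.85 under these assumptions) overstates the true 0.80.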
Key Summary Points

Differential verification occurs in diagnostic accuracy studies when some patients receive a different, often less accurate, reference standard.

Differential verification may compromise the validity of study results if it is not properly analyzed and reported.

Clear reporting of differential verification requires addressing the accuracy of the alternative reference standard, providing information about why patients received the reference standards they did (the verification pattern), and analyzing and presenting results for each reference standard.

Judging the risk of bias due to differential verification requires considering the accuracy of the inferior reference standard and investigating the verification pattern.

Table 1. Examples of Differential Verification

Breast cancer (reference 6). Index test: mammography. Preferred reference standard: biopsy. Inferior reference standard: follow-up. Reason for not using the preferred standard in all patients: impossible to perform a biopsy when no lesion is detected during mammography.

Deep venous thrombosis (reference 7). Index test: diagnostic rule for ruling out deep venous thrombosis. Preferred reference standard: ultrasonography together with follow-up. Inferior reference standard: follow-up. Reason: to test the safety of the rule in clinical practice, patients with a normal score on the diagnostic rule were sent home without extensive testing.

Congenital heart defect (reference 8). Index test: pulse oximetry (blood oxygen levels). Preferred reference standard: clinical work-up. Inferior reference standard: congenital anomalies registry. Reason: too expensive to perform a clinical work-up of all infants to detect a rare disease.

Depression (reference 9). Index test: diagnostic prediction rule for depression. Preferred reference standard: in-person interview. Inferior reference standard: telephone interview. Reason: not practically feasible to interview all participants in person.

Tuberculosis (reference 10). Index test: screening test. Preferred reference standard: sputum culture. Inferior reference standard: chest radiography and/or follow-up. Reason: clinical records were used instead of setting up a diagnostic study.
Figure 2. Main verification patterns and terminology. [Flowchart distinguishing the main patterns by 3 questions: Is a reference standard applied in all patients? If not, the result is partial verification. If so, is a single test used as the reference standard, and if not, is the same component test applied in all patients? Applying the same reference standard to all patients yields complete verification, whereas the use of different reference standards in different groups of patients is differential verification. Possible methods for assessing accuracy with component tests: composite reference standards, panel diagnosis, latent class analysis.]

Figure 1. An example of bias due to differential verification.

Pap smear has imperfect accuracy (sensitivity, 0.7; specificity, 1): Among VIA-positive participants, verified by colposcopy plus biopsy: 150 with disease, 150 without. Among VIA-negative participants, verified by Pap smear: 70 classified as having disease, 630 as disease-free. Estimated accuracy of VIA: sensitivity = 150/(150 + 70) = 0.68; specificity = 630/(630 + 150) = 0.81.

Pap smear has perfect accuracy (sensitivity, 1; specificity, 1): Among VIA-positive participants, verified by colposcopy plus biopsy: 150 with disease, 150 without. Among VIA-negative participants, verified by Pap smear: 100 classified as having disease, 600 as disease-free. Estimated accuracy of VIA: sensitivity = 150/(150 + 100) = 0.60; specificity = 600/(600 + 150) = 0.80.

The example is loosely inspired by a study on the accuracy of VIA in screening for cervical cancer (15). The preferred reference standard is colposcopy plus biopsy when a lesion is detected. Because the preferred standard is invasive, one might use an alternative, less invasive reference standard, the Pap smear, for participants with a normal VIA result. If one assumed that the Pap smear had perfect accuracy, the naive (biased) estimates of sensitivity and specificity for the VIA would be 0.68 and 0.81, respectively. If one recognized the sensitivity of the Pap smear as only 0.70, the true estimate of the sensitivity for the VIA would be 0.60. Pap = Papanicolaou; VIA = visual inspection using acetic acid.
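The corrected panel of Figure 1 follows from simple arithmetic on the figure's own counts. If the Pap smear detects only 70% of cases among the 700 VIA-negative participants, the worked back-calculation is:

```latex
\[
\text{true cases among VIA-negatives} = \frac{70}{0.70} = 100,
\qquad
\text{true non-cases} = 700 - 100 = 600,
\]
\[
\mathrm{Se}_{\mathrm{VIA}} = \frac{150}{150+100} = 0.60,
\qquad
\mathrm{Sp}_{\mathrm{VIA}} = \frac{600}{600+150} = 0.80 .
\]
```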
Figure 3. Classification of differential verification patterns. [Diagram: In pattern A, the choice of reference standard depends on the index test results alone; index-positive participants are verified by the preferred reference standard (cells a and b) and index-negative participants by the inferior reference standard (cells g and h), with no data missing by design, and the naive analysis pools these into a single 2 x 2 table (a, b, g, h). In pattern B, variables other than the index test alone influence the choice, so both reference standards can contribute results for index-positive and index-negative participants (cells a through d from the preferred standard, e through h from the inferior standard), missing data are possible, and the truth corresponds to the table with cells a+e, b+f, c+g, and d+h.] In pattern A, the test a participant receives depends solely on the index test result, whereas in pattern B, other variables may influence reference standard assignment.

FACTORS LEADING TO DIFFERENTIAL VERIFICATION BIAS

The differential verification pattern—and, notably, the reasons that patients receive the reference standards they do—plays a major role in whether bias is introduced as well as whether this bias can be (partially) adjusted for. Current guidelines for assessing the risk of bias in a diagnostic study simply warn that there is a risk of bias when multiple reference standards are used (13). This guidance is supported by the results of 2 meta-analyses on factors influencing diagnostic accuracy estimates, which found that studies in which differential verification was present had a diagnostic odds ratio that was, on average, 1.6 (95% CI, 0.9 to 2.9) to 2.2 (CI, 1.5 to 3.3) times higher than that in studies of the same test that used a single reference standard (4, 5). The example in Figure 1 illustrates how differential verification can lead to overestimates of accuracy. Although differential verification seems to generally result in a substantial overestimate of index test accuracy, it does not lead to clinically relevant bias in some situations.

Given that differential verification often causes bias, the pertinent question is whether the bias is large enough to be clinically relevant. The magnitude and direction of differential verification bias are influenced by various factors: the verification pattern, the accuracy of the reference standards, the proportion verified by each reference standard, and disease prevalence (11). Formal correction methods may address these factors, but it is difficult for the reader to consider multiple factors simultaneously when performing an explicit assessment of the risk of clinically relevant bias.

We recommend that readers break down the problem into a few questions that can independently be used to rule out the risk of differential verification bias (Table 2). Information about the disease prevalence and the proportion verified by each reference standard may be the most readily available information, but knowledge about either of these factors alone does not allow one to rule out the risk of clinically relevant bias. In many situations, the preferred reference standard is assumed—correctly or incorrectly—to have near-perfect accuracy. We recommend, therefore, that readers focus on the accuracy of the inferior reference standard and the verification pattern when ruling out the risk of clinically relevant differential verification bias.
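The formal correction methods mentioned above are cited but not spelled out in this excerpt. As one hedged illustration only, a method-of-moments style correction for pattern A (assuming a perfect preferred standard and a known, externally estimated sensitivity and specificity of the inferior standard; the function name and interface are hypothetical) can back-calculate the true counts among index-negatives:

```python
def corrected_accuracy(tp, fp, pos_inf, neg_inf, se_inf, sp_inf):
    """Sketch of a pattern A correction; not the article's formal method.

    tp, fp           -- counts among index-positives, all verified by a
                        preferred standard assumed here to be perfect
    pos_inf, neg_inf -- positive/negative classifications by the inferior
                        standard among index-negatives
    se_inf, sp_inf   -- assumed accuracy of the inferior standard
    """
    n_neg = pos_inf + neg_inf
    # Expected positives among index-negatives:
    #   pos_inf = se_inf * d + (1 - sp_inf) * (n_neg - d),
    # solved for d, the true number of diseased index-negatives.
    d = (pos_inf - (1 - sp_inf) * n_neg) / (se_inf - (1 - sp_inf))
    sensitivity = tp / (tp + d)
    specificity = (n_neg - d) / ((n_neg - d) + fp)
    return sensitivity, specificity

# Figure 1 counts: the naive 0.68/0.81 become 0.60/0.80 after correction.
print(corrected_accuracy(tp=150, fp=150, pos_inf=70, neg_inf=630,
                         se_inf=0.70, sp_inf=1.00))
```

In practice, the uncertainty in se_inf and sp_inf should be propagated (for example, through a sensitivity analysis over a plausible range) rather than treated as known constants.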
Accuracy of the Inferior Reference Standard

As a general rule, the risk of clinically relevant differential verification bias decreases as the accuracy of the inferior reference standard increases. To estimate the accuracy of the inferior reference standard, its performance needs to be compared with that of the preferred reference standard. Although this may not be possible or ethical to do within the study, accuracy estimates may be available in the literature.

A commonly used alternative reference standard for making a final disease classification is follow-up (following patients over time to see whether symptoms worsen or improve). The follow-up information is used to decide retrospectively whether patients did indeed have the disease at the time the index test was done. Assessing the accuracy of follow-up is difficult, even when the preferred reference standard and follow-up are both done in a random subset of patients, because the condition of the patients can improve or worsen between when the preferred reference standard is performed and the end of follow-up.

Because estimating the accuracy of follow-up is difficult, it is particularly important to consider the quality and length of follow-up from a biological perspective, taking into account the natural course of existing cases as well as the incidence of new cases. We provide a cursory example of how this can be done using a study that investigated the accuracy of blood oxygen concentration measured at birth in detecting congenital heart defects (see Table 1) (23). In this study, newborns with low oxygen levels were treated by a cardiologist, whereas the rest were followed up after 1 year through a congenital anomalies register.

Follow-up should be long enough to allow as many hidden cases of disease to progress to a detectable stage as possible. However, if follow-up is too long, new cases developing after the index test was performed will also be detected. In the example, follow-up might be considered too short because some types of congenital heart defects are detected later in life. On the other hand, it was not too long because congenital heart disease is, by definition, already present at birth. The second point to consider is whether follow-up allows detection of the same type and severity of disease as the preferred reference standard. Researchers should focus on whether the test being studied detects cases that will benefit from clinical intervention rather than simply the presence of any disease (24). In the example, more serious types of defects are probably detected at birth, whereas less serious ones are detected during follow-up. Although this may not be a problem, if follow-up instead detects the more pronounced cases, the estimated sensitivity of the index test will be an overestimate of its sensitivity in detecting serious cases.

When the inferior reference standard is believed to have high accuracy, clinically relevant differential verification bias is unlikely and there is no need to look into the other factors influencing bias. This was believed to be the case in an example of differential verification from a study involving a clinical prediction rule for screening for depression in primary care (see Table 1) (9).
In this study, in-person or telephonic questionnaires were used as the reference standard, but because the authors had reason to assume that these methods for assessing depression had similar accuracy, they argued that clinically relevant differential verification bias was unlikely. When the inferior reference standard’s accuracy is questionable, however, the next step is to consider the verification pattern.

Verification Pattern

The pattern of verification plays an important role in whether bias is introduced. The most straightforward verification pattern is when the choice of reference standard is fully dependent on the index test results (Figure 3, pattern A). Studies with this pattern are likely to have biased estimates of sensitivity, specificity, diagnostic odds ratios, and likelihood ratios because these estimates rely partly on disease status classification by the inferior reference standard. An exception is positive predictive value estimates (the probability that a person has the disease given that the index test results are abnormal). The positive predictive value estimate is not affected by differential verification bias in pattern A because all patients with positive index test results receive the preferred reference standard (3). Negative predictive value estimates can also still be interpreted in a meaningful way in the sense that they provide information on the proportion of missed cases, as defined by the inferior reference standard (see Figure 4 for an example).

When the choice of reference standard is not fully dependent on the index test results (Figure 3, pattern B), differential verification is likely to bias all accuracy estimates because they rely to some degree on disease classification by the inferior reference standard. In the rare case

Table 2. Questions to Ask When Assessing Risk of Differential Verification Bias

Was the choice of reference standard completely dependent on the results of the index test? (If so, the predictive values are clinically interpretable.)

If the answer to the first question is no, how accurate is the inferior reference standard? (The higher the accuracy of the inferior reference standard, the lower the risk of bias.)

What percentage of the participants were diagnosed by use of the inferior reference standard? (If a negligible percentage of participants received an inferior standard, the risk of bias is low. Several factors must be taken into account when determining whether the percentage is negligible.)

If follow-up is used as the inferior reference standard, does it identify almost all hidden cases present at the time of the index test but very few new cases that develop afterward? Does follow-up detect the same type of cases as the preferred reference standard? (If the answer to both questions is yes, the risk of bias is low.)
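To illustrate the first question in Table 2 with the numbers from Figure 1: under pattern A, the positive predictive value is computed entirely from patients verified by the preferred standard, so it is untouched by the Pap smear's accuracy, whereas the negative predictive value is interpretable only relative to the inferior standard:

```latex
\[
\mathrm{PPV} = \frac{150}{150+150} = 0.50
\quad \text{(all VIA-positives verified by colposcopy plus biopsy),}
\]
\[
\mathrm{NPV} = \frac{630}{630+70} = 0.90
\quad \text{(proportion of VIA-negatives classified disease-free by the Pap smear).}
\]
```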