ROC curve ◼ A discrete classifier produces an (FPR, TPR) pair corresponding to a single point in ROC space. ◼ Some classifiers, such as Naïve Bayes or a neural network, naturally yield an instance probability or score, a numeric value that represents the degree to which an instance is a member of a class. ◼ Such a ranking or scoring classifier can be used with a threshold to produce a discrete classifier. ◼ Plotting the ROC point for each possible threshold value results in a curve.
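To make the threshold idea concrete, here is a minimal Python sketch (with made-up scores and labels, not data from the slides) that sweeps a threshold over a scoring classifier's outputs and records one (FPR, TPR) point per threshold; plotting those points traces the ROC curve.

```python
# Minimal sketch: turning a scoring classifier into ROC points
# by sweeping a decision threshold over the predicted scores.

def roc_points(scores, labels):
    """Return a list of (FPR, TPR) pairs, one per candidate threshold."""
    P = sum(labels)              # number of actual positives
    N = len(labels) - P         # number of actual negatives
    points = []
    # Each distinct score is tried as a threshold: predict positive if score >= t.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / N, tp / P))
    return points

# Hypothetical scores from e.g. a Naive Bayes classifier, with true labels.
scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,    0,   1,   0,   0]
print(roc_points(scores, labels))
```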
ROC curve ◼ [Figure: ROC curves plotted in ROC space, true positive rate against false positive rate]
ROC curve ◼ ROC curves show the tradeoff between sensitivity and specificity. ◼ The closer the curve follows the upper-left border of the ROC space, the more accurate the test. ◼ The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. ◼ A common method is to calculate the area under the ROC curve (AUC).
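The sketch below illustrates the area-under-the-curve idea with a simple trapezoidal-rule computation over a list of (FPR, TPR) points; the example points are hypothetical (they happen to match the threshold sweep sketched earlier).

```python
# Minimal sketch: area under the ROC curve via the trapezoidal rule.

def auc(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    # Anchor the curve at (0, 0) and (1, 1) and sort by FPR.
    pts = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Hypothetical ROC points, e.g. produced by a threshold sweep.
example_points = [(0.0, 0.25), (0.0, 0.5), (0.25, 0.5), (0.25, 0.75),
                  (0.5, 0.75), (0.5, 1.0), (0.75, 1.0)]
print(auc(example_points))   # closer to 1.0 = more accurate; 0.5 = the diagonal
```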
Evaluating a classifier ◼ How well will the classifier we learned perform on novel data? ◼ We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using a test data set. ◼ Performance on the training data is not a good indicator of performance on future data. ◼ Test set: independent instances that have not been used in any way to create the classifier. ◼ Assumption: both the training data and the test data are representative samples of the underlying problem.
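As a small illustration, the sketch below (with hypothetical predictions and labels, not tied to any particular classifier) estimates accuracy and sensitivity from a classifier's predictions on an independent test set.

```python
# Minimal sketch: estimating accuracy and sensitivity on an independent
# test set that was never used for training.

def evaluate(predictions, labels):
    """Accuracy and sensitivity (TPR) from predicted vs. true class labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, sensitivity

# Made-up predictions of some trained classifier on held-out test instances.
test_predictions = [1, 0, 1, 1, 0, 0, 1, 0]
test_labels      = [1, 0, 0, 1, 0, 1, 1, 0]
print(evaluate(test_predictions, test_labels))   # (0.75, 0.75)
```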
Holdout & Cross-validation method ◼ Holdout method ◆ The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation ◆ Random sampling: a variation of holdout; repeat holdout k times, accuracy = avg. of the accuracies obtained ◼ Cross-validation (k-fold, where k = 10 is most popular) ◆ Randomly partition the data into k mutually exclusive subsets, each of approximately equal size ◆ At the i-th iteration, use D_i as the test set and the others as the training set ◆ Leave-one-out: k folds where k = # of tuples; for small-sized data
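A minimal sketch of both procedures, under the assumption of a hypothetical `train_and_score` callback that trains any classifier on the training set and returns its accuracy on the test set:

```python
# Minimal sketch: holdout split and k-fold cross-validation over a list of
# instances. `train_and_score(train, test)` is a hypothetical stand-in for
# building a model on `train` and returning its accuracy on `test`.
import random

def holdout_split(data, train_fraction=2/3, seed=0):
    """Randomly partition data into independent training and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_accuracy(data, k, train_and_score, seed=0):
    """k-fold cross-validation: each fold D_i serves as the test set once."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k mutually exclusive subsets
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k                       # average accuracy over the folds

# Example call with a dummy scorer, just to show the interface.
data = list(range(20))
print(k_fold_accuracy(data, k=10, train_and_score=lambda train, test: 1.0))
```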
Bootstrap ◼ Bootstrap ◆ Works well with small data sets ◆ Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set ◼ There are several bootstrap methods; a common one is the .632 bootstrap ◆ A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^(-1) ≈ 0.368) ◆ Repeat the sampling procedure k times; the overall accuracy of the model is Acc(M) = (1/k) Σ_{i=1}^{k} (0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set)
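Below is a minimal sketch of the .632 bootstrap estimate, again assuming a hypothetical `train_and_score` function rather than any specific classifier or library.

```python
# Minimal sketch of the .632 bootstrap accuracy estimate.
# `train_and_score(train, eval_set)` is a hypothetical stand-in for training
# a model on `train` and returning its accuracy on `eval_set`.
import random

def bootstrap_632(data, k, train_and_score, seed=0):
    rng = random.Random(seed)
    d = len(data)
    total = 0.0
    for _ in range(k):
        # Sample d indices with replacement -> training set of d samples
        # (about 63.2% of the distinct tuples, some repeated).
        idx = [rng.randrange(d) for _ in range(d)]
        train = [data[i] for i in idx]
        # Tuples that were never drawn (about 36.8%) form the test set.
        chosen = set(idx)
        test = [data[i] for i in range(d) if i not in chosen]
        acc_test = train_and_score(train, test)    # Acc(M_i) on the test set
        acc_train = train_and_score(train, train)  # Acc(M_i) on the training set
        total += 0.632 * acc_test + 0.368 * acc_train
    return total / k
```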