instance Selection), based on a novel instance-based feature mapping and the 1-norm SVM model for simultaneous feature selection and classification. However, most of the existing methods cannot answer the essential question giving rise to the gap between multiple-instance learning and single-instance learning: which instances are the true positive ones, and what properties do the true positive instances have? This motivates the work in this paper.

3 MILD: MULTIPLE-INSTANCE LEARNING VIA DISAMBIGUATION

In this section, our disambiguation method is described in detail, and then two feature representation schemes are presented for instance-level classification and bag-level classification, respectively, based on which the MIL problem is converted into a standard single-instance learning problem that can be directly solved by SIL algorithms, such as SVM. In addition, multi-view learning [12] can easily be adapted for MIL by combining the two views of the bags, the instance-level view and the bag-level view.

3.1 Notations

$B_i^+$ denotes a positive bag and $B_i^-$ denotes a negative bag. When the label of a bag does not matter, we simply denote the bag as $B_i$. $B_{ij}^+$ denotes an instance in a positive bag $B_i^+$ and $B_{ij}^-$ is an instance in a negative bag $B_i^-$. Let $B = \{B_1^+, B_2^+, \ldots, B_{n^+}^+, B_1^-, B_2^-, \ldots, B_{n^-}^-\}$ denote the set of $n^+$ positive and $n^-$ negative training bags. $l(B_i) \in \{+1, -1\}$ is the bag label of $B_i$, and $l(B_{ij}) \in \{+1, -1\}$ is the instance label of $B_{ij}$. In general, we always represent all instances as feature vectors of the same dimensionality. Hence, in this paper, an instance also refers to a feature vector.

3.2 Disambiguation

According to the MIL formulation, all instances in the negative bags are negative, and hence there exists no ambiguity in the negative bags if there is no labeling noise. As for the positive bags, the only thing we know is that each positive bag must contain at least one true positive instance, but it may also contain many negative instances. Thus, ambiguity arises in the positive bags since we do not know the labels of the instances there. The goal of disambiguation is to identify the true positive instances in the positive bags.
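To keep the notation concrete, the short sketches added in the rest of this section assume the following toy data layout: bags as plain Python lists of NumPy feature vectors together with their bag labels. The layout and the numeric values are purely illustrative assumptions, not part of the method itself.

```python
import numpy as np

# Each bag B_i is a list of instances B_ij (feature vectors of equal dimensionality).
# Hypothetical 2-D toy data: every positive bag contains an instance near (5, 5),
# an assumed "true positive" region, plus some negative instances.
pos_bags = [
    [np.array([1.0, 1.2]), np.array([5.0, 4.8])],
    [np.array([4.9, 5.1]), np.array([0.8, 0.9])],
]
neg_bags = [
    [np.array([1.1, 0.9]), np.array([0.7, 1.3])],
    [np.array([1.4, 1.0])],
]

bags = pos_bags + neg_bags                            # the training set B
labels = [+1] * len(pos_bags) + [-1] * len(neg_bags)  # bag labels l(B_i)
```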
3.2.1 Some Properties of True Positive Instances

Assumption 1: We assume that given a true positive instance $t$, the probability that an instance $B_{ij}$ is positive is calculated as follows:

$$\Pr(l(B_{ij}) = +1 \mid t) = \exp\!\left(-\frac{\|t - B_{ij}\|^2}{\sigma_t^2}\right), \qquad (1)$$

where $\|x\| \triangleq \sqrt{\sum_k x_k^2}$ denotes the 2-norm of the vector $x$, and $\sigma_t$ is a parameter larger than 0, which will be learned from the training data by the algorithm introduced later.

Remark 1: The rationale of Assumption 1 is similar to that of kernel density estimation with a Gaussian kernel. Kernel density estimation can be thought of as placing a small "bump" at each observation, which gives higher probability density near the observations. Similarly, if we have already observed an instance $t$ to be truly positive, then the chance that an instance near $t$ is truly positive will be relatively high. One key point should be noted, however: unlike kernel density estimation, which tries to estimate a probability density function, Assumption 1 is used to compute a conditional probability. From (1), we can easily see that $0 \le \Pr(l(B_{ij}) = +1 \mid t) \le 1$, that $\Pr(l(B_{ij}) = +1 \mid t) = 0$ when $\|t - B_{ij}\| = +\infty$, and that $\Pr(l(B_{ij}) = +1 \mid t) = 1$ when $\|t - B_{ij}\| = 0$. Obviously, this is a reasonable probability function, but not a reasonable probability density function. Furthermore, the assumption in (1) also coincides well with our intuition. If we know that $t$ is a true positive instance and $\|t - B_{ij}\| = 0$, then $B_{ij}$ is definitely a true positive instance because $B_{ij}$ is equal to $t$, so $\Pr(l(B_{ij}) = +1 \mid t) = 1$ is reasonable. Similarly, if $\|t - B_{ij}\| = +\infty$, then $B_{ij}$ is infinitely far away from the true positive instance $t$, so $\Pr(l(B_{ij}) = +1 \mid t) = 0$ is also reasonable. The farther $B_{ij}$ is from $t$, the lower the probability that $B_{ij}$ is positive given $t$, which matches intuition.

Remark 2: We deal with the general case in which the true positive instances are not necessarily identical to a single point in the instance space, but are generated from some probability distribution over the whole instance space. Hence, each point in the instance space is associated with a probability value indicating how confidently that point is truly positive.
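As a minimal sketch of Assumption 1, the conditional probability in (1) is a Gaussian-shaped function of the Euclidean distance between an instance and the given true positive instance $t$. The scale $\sigma_t$ is passed in explicitly here; the paper estimates it from the training data later, so treating it as a free argument (and the function name itself) is an assumption of this sketch.

```python
import numpy as np

def prob_instance_positive(b_ij: np.ndarray, t: np.ndarray, sigma_t: float) -> float:
    """Pr(l(B_ij) = +1 | t) = exp(-||t - B_ij||^2 / sigma_t^2), as in Eq. (1)."""
    sq_dist = float(np.sum((t - b_ij) ** 2))
    return float(np.exp(-sq_dist / sigma_t ** 2))
```

An instance lying exactly on $t$ gets probability 1, and the value decays toward 0 as the instance moves away from $t$, matching the boundary cases discussed in Remark 1.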
Definition 1: The most-likely-cause estimator for estimating the probability that a bag $B_i$ is a positive bag given a true positive instance $t$ is defined as follows:

$$\Pr(l(B_i) = +1 \mid t) = \max_{B_{ij} \in B_i} \Pr(l(B_{ij}) = +1 \mid t) = \max_{B_{ij} \in B_i} \exp\!\left(-\frac{\|t - B_{ij}\|^2}{\sigma_t^2}\right) = \exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right), \qquad (2)$$

where

$$d(t, B_i) = \min_{B_{ij} \in B_i} \|t - B_{ij}\|. \qquad (3)$$

In other words, the distance $d(t, B_i)$ between an instance $t$ and a bag $B_i$ is simply the distance between $t$ and the nearest instance in $B_i$.

The definition of the most-likely-cause estimator completely follows the underlying principle of the MIL formulation. Let $B_{ik}$ be the instance in bag $B_i$ which is the most likely one to be positive, i.e., $\Pr(l(B_{ik}) = +1 \mid t)$ is the largest among all instances in $B_i$. If $\Pr(l(B_{ik}) = +1 \mid t)$ is small enough, we should certainly assign $B_i$ to be negative. Otherwise, $B_i$ should be positive, even if all instances in $B_i$ other than $B_{ik}$ have low probabilities of being positive. Moreover, the definition is also consistent with intuition. If $d(t, B_i) = 0$, which implies that $t$ is an instance of $B_i$, then $\Pr(l(B_i) = +1 \mid t)$ will be 1. This is reasonable, because $B_i$ contains a true positive instance $t$. In general, the larger $d(t, B_i)$ is, the lower the probability that $B_i$ is positive given the true positive instance $t$.
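A minimal sketch of the instance-to-bag distance in (3) and the most-likely-cause estimator in (2), assuming bags are lists of NumPy vectors as in the earlier toy layout; `sigma_t` is again passed in explicitly, and the function names are chosen here only for illustration.

```python
import numpy as np

def bag_distance(t: np.ndarray, bag: list) -> float:
    """d(t, B_i) = min over B_ij in B_i of ||t - B_ij||, Eq. (3)."""
    return min(float(np.linalg.norm(t - b_ij)) for b_ij in bag)

def prob_bag_positive(t: np.ndarray, bag: list, sigma_t: float) -> float:
    """Most-likely-cause estimator, Eq. (2): Pr(l(B_i) = +1 | t) = exp(-d^2(t, B_i) / sigma_t^2)."""
    d = bag_distance(t, bag)
    return float(np.exp(-(d ** 2) / sigma_t ** 2))
```

Since (1) is monotonically decreasing in the distance, taking the maximum of (1) over the instances in a bag is the same as taking the minimum distance, which is why only `bag_distance` is needed here.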
Theorem 1: Given a true positive instance $t$, there exists a threshold $\theta_t$ which makes the decision function defined in (4) label the bags according to the Bayes decision rule:

$$h_{\theta_t}^t(B_i) = \begin{cases} +1 & d(t, B_i) \le \theta_t \\ -1 & \text{otherwise} \end{cases} \qquad (4)$$

Proof: According to the Bayes decision rule [29, page 23], the label of $B_i$ given a true positive instance $t$ should be decided as follows:

$$l(B_i \mid t) = \begin{cases} +1 & \Pr(l(B_i) = +1 \mid t) \ge \Pr(l(B_i) = -1 \mid t) \\ -1 & \text{otherwise} \end{cases}$$

From Definition 1, we have

$$\Pr(l(B_i) = +1 \mid t) - \Pr(l(B_i) = -1 \mid t) = \exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right) - \left[1 - \exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right)\right] = 2\exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right) - 1.$$

Furthermore, we have

$$\Pr(l(B_i) = +1 \mid t) \ge \Pr(l(B_i) = -1 \mid t) \;\Leftrightarrow\; \Pr(l(B_i) = +1 \mid t) - \Pr(l(B_i) = -1 \mid t) \ge 0 \;\Leftrightarrow\; 2\exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right) - 1 \ge 0 \;\Leftrightarrow\; d(t, B_i) \le \sigma_t \sqrt{\ln 2}. \qquad (5)$$

Hence, if $\theta_t = \sigma_t \sqrt{\ln 2}$, the decision function defined in (4) labels the bags in accordance with the Bayes decision rule. □

Therefore, if $t$ is a true positive instance, there must exist a decision function as defined in (4) that labels the bags well, meaning that the distances from $t$ to the positive bags are expected to be smaller than those to the negative bags.

For a negative instance (i.e., a false positive instance), however, its distances to the positive and negative bags do not exhibit the same distribution as those from $t$. Since some positive bags may also contain negative instances, just like the negative bags, the distances from a negative instance to the positive bags may be as random as those to the negative bags. This distributional difference provides an informative hint for identifying the true positive instances.

3.2.2 Disambiguation Method

Unlike in the previous subsection, $t$ does not necessarily refer to a true positive instance in this subsection. However, we still define a decision function as in (4) even when $t$ is a negative instance. The difference is that if $t$ is a negative instance, the corresponding decision function cannot label the bags well. It is this very phenomenon that forms the basis of our disambiguation method.

Definition 2: The empirical precision of the decision function in (4) is defined as follows:

$$P_t(\theta_t) = \frac{1}{n^+ + n^-} \sum_{i=1}^{n^+ + n^-} \frac{1 + h_{\theta_t}^t(B_i)\, l(B_i)}{2}, \qquad (6)$$

where $B_i \in B$ is a training bag. The empirical precision essentially measures how well the decision function $h_{\theta_t}^t(\cdot)$ with threshold $\theta_t$ mimics $l(\cdot)$ in predicting the bag labels.
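As a sketch of the decision function in (4) and the empirical precision in (6), reusing the `bag_distance` helper from the previous sketch; the function names here are illustrative assumptions, not notation from the paper.

```python
def decision(t, bag, theta_t: float) -> int:
    """h^t_{theta_t}(B_i) in Eq. (4): +1 if d(t, B_i) <= theta_t, else -1."""
    return +1 if bag_distance(t, bag) <= theta_t else -1

def empirical_precision(t, bags, labels, theta_t: float) -> float:
    """P_t(theta_t) in Eq. (6): the fraction of training bags whose label l(B_i)
    is reproduced by the threshold decision function; (1 + h*l)/2 is 1 on
    agreement and 0 on disagreement."""
    agreements = [(1 + decision(t, bag, theta_t) * y) / 2 for bag, y in zip(bags, labels)]
    return sum(agreements) / len(bags)
```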
Now, the difficulty is that we do not know the scaling factor $\sigma_t^2$ of the function in (2) for a specific instance $t$, so we cannot compute the threshold $\theta_t$. It is desirable to learn $\sigma_t^2$ from the training data automatically. To do so, we find the best threshold $\theta_t^*$ that maximizes the empirical precision and use it to estimate $\sigma_t^2$: $\theta_t^* = \arg\max_{\theta_t} P_t(\theta_t)$. Thus the best (maximum) empirical precision $P^*(t)$ for $t$ is given by:

$$P^*(t) = \max_{\theta_t} P_t(\theta_t). \qquad (7)$$

Remark 3: The basic idea of finding $\theta_t^*$ is analogous to maximum likelihood estimation (MLE), which finds the parameters that best explain the observed data. However, the criterion used here is the empirical precision defined in (6) rather than the likelihood. Furthermore, our method is totally different from the DD method. The objective function of DD is multiplicative, making DD very sensitive to noise. However, the objective function of our method is the empirical precision in (6), which has an additive form and thereby guarantees the robustness of our method. The robustness of our method will be further discussed later.

In essence, $P^*(t)$ reflects the ability of instance $t$ to discriminate the training bags. The larger $P^*(t)$ is, the more likely $t$ is a true positive instance.

It is worth noting that although $\theta_t$ is a continuous-valued variable, we can find $\theta_t^*$ by checking only a finite set of candidate values. This is guaranteed by Theorem 2.

Theorem 2: The best empirical precision $P^*(t)$ for $t$ is achieved when $\theta_t$ is an element of the set $\{d(t, B_i^+) \mid i = 1, \ldots, n^+\}$.

Proof: Let $\theta_t^*$ be a value satisfying $P_t(\theta_t^*) = P^*(t)$ and let $\theta_t'$ be the maximum value in $\{d(t, B_i^+) \mid i = 1, \ldots, n^+\}$ that is no larger than $\theta_t^*$.

If $B_i^+$ is correctly labeled when $\theta_t = \theta_t^*$, then $d(t, B_i^+) \le \theta_t^*$, and hence $d(t, B_i^+) \le \theta_t'$. Therefore, any positive bag correctly labeled when $\theta_t = \theta_t^*$ is also correctly labeled when $\theta_t = \theta_t'$. Similarly, if $B_i^-$ is correctly labeled when $\theta_t = \theta_t^*$, then $d(t, B_i^-) > \theta_t^*$, and hence $d(t, B_i^-) > \theta_t'$. Therefore, any negative bag correctly labeled when $\theta_t = \theta_t^*$ is also correctly labeled when $\theta_t = \theta_t'$.

Thus we have $P_t(\theta_t') = P_t(\theta_t^*) = P^*(t)$.
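Following Theorem 2, a minimal sketch of computing $P^*(t)$ only needs to scan the distances from $t$ to the positive training bags as candidate thresholds. It reuses `bag_distance` and `empirical_precision` from the earlier sketches; the function name and the final comment about recovering $\sigma_t$ from $\theta_t^*$ via the relation $\theta_t = \sigma_t\sqrt{\ln 2}$ in the proof of Theorem 1 are assumptions of this sketch rather than notation from the paper.

```python
def best_empirical_precision(t, bags, labels):
    """Return (theta_star, P_star) for instance t, scanning only the candidate
    thresholds {d(t, B_i^+)} justified by Theorem 2."""
    candidates = [bag_distance(t, bag) for bag, y in zip(bags, labels) if y == +1]
    best_theta, best_prec = None, -1.0
    for theta in candidates:
        prec = empirical_precision(t, bags, labels, theta)
        if prec > best_prec:
            best_theta, best_prec = theta, prec
    return best_theta, best_prec

# P*(t) can then be used to rank candidate instances t from the positive bags:
# the larger P*(t), the more likely t is a true positive instance. If a scale is
# needed, sigma_t could be recovered as theta_star / sqrt(ln 2).
```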