instance Selection), based on a novel instance-based feature mapping and the 1-norm SVM model for simultaneous feature selection and classification. However, most of the existing methods cannot answer the essential question giving rise to the gap between multiple-instance learning and single-instance learning: which instances are the true positive ones, and what properties do the true positive instances have? This motivates the work in this paper.

3 MILD: MULTIPLE-INSTANCE LEARNING VIA DISAMBIGUATION

In this section, our disambiguation method is described in detail, and then two feature representation schemes are presented for instance-level classification and bag-level classification, respectively, based on which the MIL problem is converted into a standard single-instance learning problem that can be directly solved by SIL algorithms, such as SVM. In addition, multi-view learning [12] can easily be adapted for MIL by combining the two views of the bags, the instance-level view and the bag-level view.

3.1 Notations

$B_i^+$ denotes a positive bag and $B_i^-$ denotes a negative bag. When the label of a bag does not matter, we simply denote the bag as $B_i$. $B_{ij}^+$ denotes an instance in a positive bag $B_i^+$ and $B_{ij}^-$ is an instance in a negative bag $B_i^-$. Let $B = \{B_1^+, B_2^+, \ldots, B_{n^+}^+, B_1^-, B_2^-, \ldots, B_{n^-}^-\}$ denote the set of $n^+$ positive and $n^-$ negative training bags. $l(B_i) \in \{+1, -1\}$ is the bag label of $B_i$, and $l(B_{ij}) \in \{+1, -1\}$ is the instance label of $B_{ij}$. In general, we always represent all instances as feature vectors of the same dimensionality. Hence, in this paper, an instance also refers to a feature vector.

3.2 Disambiguation

According to the MIL formulation, all instances in the negative bags are negative, and hence there exists no ambiguity in the negative bags if there is no labeling noise. As for the positive bags, the only thing we know is that each positive bag must contain at least one true positive instance, but it may also contain many negative instances. Thus, ambiguity arises in the positive bags since we do not know the labels of the instances there. The goal of disambiguation is to identify the true positive instances in the positive bags.
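To keep the notation concrete, the short sketches added in the rest of this section assume the following toy data layout: bags as plain Python lists of NumPy feature vectors together with their bag labels. The layout and the numeric values are purely illustrative assumptions, not part of the method itself.

```python
import numpy as np

# Each bag B_i is a list of instances B_ij (feature vectors of equal dimensionality).
# Hypothetical 2-D toy data: every positive bag contains an instance near (5, 5),
# an assumed "true positive" region, plus some negative instances.
pos_bags = [
    [np.array([1.0, 1.2]), np.array([5.0, 4.8])],
    [np.array([4.9, 5.1]), np.array([0.8, 0.9])],
]
neg_bags = [
    [np.array([1.1, 0.9]), np.array([0.7, 1.3])],
    [np.array([1.4, 1.0])],
]

bags = pos_bags + neg_bags                            # the training set B
labels = [+1] * len(pos_bags) + [-1] * len(neg_bags)  # bag labels l(B_i)
```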
3.2.1 Some Properties of True Positive Instances

Assumption 1: We assume that given a true positive instance $t$, the probability that an instance $B_{ij}$ is positive is calculated as follows:

$$\Pr(l(B_{ij}) = +1 \mid t) = \exp\!\left(-\frac{\|t - B_{ij}\|^2}{\sigma_t^2}\right), \qquad (1)$$

where $\|x\| \triangleq \sqrt{\sum_k x_k^2}$ denotes the 2-norm of the vector $x$, and $\sigma_t$ is a parameter larger than 0, which will be learned from the training data by the algorithm introduced later.

Remark 1: The rationale of Assumption 1 is similar to that of kernel density estimation with a Gaussian kernel. Kernel density estimation can be thought of as placing a small "bump" at each observation, which gives higher probability density near the observations. Similarly, if we have already observed an instance $t$ to be truly positive, then the chance that an instance near $t$ is truly positive will be relatively high. One key point should be noted, however: unlike kernel density estimation, which tries to estimate a probability density function, Assumption 1 is used to compute a conditional probability. From (1), we can easily see that $0 \le \Pr(l(B_{ij}) = +1 \mid t) \le 1$, that $\Pr(l(B_{ij}) = +1 \mid t) = 0$ when $\|t - B_{ij}\| = +\infty$, and that $\Pr(l(B_{ij}) = +1 \mid t) = 1$ when $\|t - B_{ij}\| = 0$. Obviously, this is a reasonable probability function, but not a reasonable probability density function. Furthermore, the assumption in (1) also coincides well with our intuition. If we know that $t$ is a true positive instance and $\|t - B_{ij}\| = 0$, then $B_{ij}$ is definitely a true positive instance because $B_{ij}$ is equal to $t$, so $\Pr(l(B_{ij}) = +1 \mid t) = 1$ is reasonable. Similarly, if $\|t - B_{ij}\| = +\infty$, then $B_{ij}$ is infinitely far away from the true positive instance $t$, so $\Pr(l(B_{ij}) = +1 \mid t) = 0$ is also reasonable. The farther $B_{ij}$ is from $t$, the lower the probability that $B_{ij}$ is positive given $t$, which matches intuition.

Remark 2: We deal with the general case in which the true positive instances are not necessarily identical to a single point in the instance space, but are generated from some probability distribution over the whole instance space. Hence, each point in the instance space is associated with a probability value indicating how confidently that point is truly positive.
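As a minimal sketch of Assumption 1, the conditional probability in (1) is a Gaussian-shaped function of the Euclidean distance between an instance and the given true positive instance $t$. The scale $\sigma_t$ is passed in explicitly here; the paper estimates it from the training data later, so treating it as a free argument (and the function name itself) is an assumption of this sketch.

```python
import numpy as np

def prob_instance_positive(b_ij: np.ndarray, t: np.ndarray, sigma_t: float) -> float:
    """Pr(l(B_ij) = +1 | t) = exp(-||t - B_ij||^2 / sigma_t^2), as in Eq. (1)."""
    sq_dist = float(np.sum((t - b_ij) ** 2))
    return float(np.exp(-sq_dist / sigma_t ** 2))
```

An instance lying exactly on $t$ gets probability 1, and the value decays toward 0 as the instance moves away from $t$, matching the boundary cases discussed in Remark 1.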
Definition 1: The most-likely-cause estimator for estimating the probability that a bag $B_i$ is a positive bag given a true positive instance $t$ is defined as follows:

$$\Pr(l(B_i) = +1 \mid t) = \max_{B_{ij} \in B_i} \Pr(l(B_{ij}) = +1 \mid t) = \max_{B_{ij} \in B_i} \exp\!\left(-\frac{\|t - B_{ij}\|^2}{\sigma_t^2}\right) = \exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right), \qquad (2)$$

where

$$d(t, B_i) = \min_{B_{ij} \in B_i} \|t - B_{ij}\|. \qquad (3)$$

In other words, the distance $d(t, B_i)$ between an instance $t$ and a bag $B_i$ is simply the distance between $t$ and the nearest instance in $B_i$.

The definition of the most-likely-cause estimator completely follows the underlying principle of the MIL formulation. Let $B_{ik}$ be the instance in bag $B_i$ which is the most likely one to be positive, i.e., $\Pr(l(B_{ik}) = +1 \mid t)$ is the largest among all instances in $B_i$. If $\Pr(l(B_{ik}) = +1 \mid t)$ is small enough, we should certainly assign $B_i$ to be negative. Otherwise, $B_i$ should be positive, even if all instances in $B_i$ other than $B_{ik}$ have low probabilities of being positive. Moreover, the definition is also consistent with intuition. If $d(t, B_i) = 0$, which implies that $t$ is an instance of $B_i$, then $\Pr(l(B_i) = +1 \mid t)$ will be 1. This is reasonable, because $B_i$ contains a true positive instance $t$. In general, the larger $d(t, B_i)$ is, the lower the probability that $B_i$ is positive given the true positive instance $t$.
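A minimal sketch of the instance-to-bag distance in (3) and the most-likely-cause estimator in (2), assuming bags are lists of NumPy vectors as in the earlier toy layout; `sigma_t` is again passed in explicitly, and the function names are chosen here only for illustration.

```python
import numpy as np

def bag_distance(t: np.ndarray, bag: list) -> float:
    """d(t, B_i) = min over B_ij in B_i of ||t - B_ij||, Eq. (3)."""
    return min(float(np.linalg.norm(t - b_ij)) for b_ij in bag)

def prob_bag_positive(t: np.ndarray, bag: list, sigma_t: float) -> float:
    """Most-likely-cause estimator, Eq. (2): Pr(l(B_i) = +1 | t) = exp(-d^2(t, B_i) / sigma_t^2)."""
    d = bag_distance(t, bag)
    return float(np.exp(-(d ** 2) / sigma_t ** 2))
```

Since (1) is monotonically decreasing in the distance, taking the maximum of (1) over the instances in a bag is the same as taking the minimum distance, which is why only `bag_distance` is needed here.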
Theorem 1: Given a true positive instance $t$, there exists a threshold $\theta_t$ which makes the decision function defined in (4) label the bags according to the Bayes decision rule:

$$h_{\theta_t}^t(B_i) = \begin{cases} +1 & d(t, B_i) \le \theta_t \\ -1 & \text{otherwise} \end{cases} \qquad (4)$$

Proof: According to the Bayes decision rule [29, page 23], the label of $B_i$ given a true positive instance $t$ should be decided as follows:

$$l(B_i \mid t) = \begin{cases} +1 & \Pr(l(B_i) = +1 \mid t) \ge \Pr(l(B_i) = -1 \mid t) \\ -1 & \text{otherwise} \end{cases}$$

From Definition 1, we have

$$\Pr(l(B_i) = +1 \mid t) - \Pr(l(B_i) = -1 \mid t) = \exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right) - \left[1 - \exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right)\right] = 2\exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right) - 1.$$

Furthermore, we have

$$\Pr(l(B_i) = +1 \mid t) \ge \Pr(l(B_i) = -1 \mid t) \;\Leftrightarrow\; \Pr(l(B_i) = +1 \mid t) - \Pr(l(B_i) = -1 \mid t) \ge 0 \;\Leftrightarrow\; 2\exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right) - 1 \ge 0 \;\Leftrightarrow\; d(t, B_i) \le \sigma_t \sqrt{\ln 2}. \qquad (5)$$

Hence, if $\theta_t = \sigma_t \sqrt{\ln 2}$, the decision function defined in (4) labels the bags in accordance with the Bayes decision rule. □

Therefore, if $t$ is a true positive instance, there must exist a decision function as defined in (4) that labels the bags well, meaning that the distances from $t$ to the positive bags are expected to be smaller than those to the negative bags.

For a negative instance (i.e., a false positive instance), however, its distances to the positive and negative bags do not exhibit the same distribution as those from $t$. Since some positive bags may also contain negative instances, just like the negative bags, the distances from a negative instance to the positive bags may be as random as those to the negative bags. This distributional difference provides an informative hint for identifying the true positive instances.

3.2.2 Disambiguation Method

Unlike in the previous subsection, $t$ does not necessarily refer to a true positive instance in this subsection. However, we still define a decision function as in (4) even when $t$ is a negative instance. The difference is that if $t$ is a negative instance, the corresponding decision function cannot label the bags well. It is this very phenomenon that forms the basis of our disambiguation method.

Definition 2: The empirical precision of the decision function in (4) is defined as follows:

$$P_t(\theta_t) = \frac{1}{n^+ + n^-} \sum_{i=1}^{n^+ + n^-} \frac{1 + h_{\theta_t}^t(B_i)\, l(B_i)}{2}, \qquad (6)$$

where $B_i \in B$ is a training bag. The empirical precision essentially measures how well the decision function $h_{\theta_t}^t(\cdot)$ with threshold $\theta_t$ mimics $l(\cdot)$ in predicting the bag labels.
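As a sketch of the decision function in (4) and the empirical precision in (6), reusing the `bag_distance` helper from the previous sketch; the function names here are illustrative assumptions, not notation from the paper.

```python
def decision(t, bag, theta_t: float) -> int:
    """h^t_{theta_t}(B_i) in Eq. (4): +1 if d(t, B_i) <= theta_t, else -1."""
    return +1 if bag_distance(t, bag) <= theta_t else -1

def empirical_precision(t, bags, labels, theta_t: float) -> float:
    """P_t(theta_t) in Eq. (6): the fraction of training bags whose label l(B_i)
    is reproduced by the threshold decision function; (1 + h*l)/2 is 1 on
    agreement and 0 on disagreement."""
    agreements = [(1 + decision(t, bag, theta_t) * y) / 2 for bag, y in zip(bags, labels)]
    return sum(agreements) / len(bags)
```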
Now, the difficulty is that we do not know the scaling factor $\sigma_t^2$ of the function in (2) for a specific instance $t$, so we cannot compute the threshold $\theta_t$. It is desirable to learn $\sigma_t^2$ from the training data automatically. To do so, we find the best threshold $\theta_t^*$ that maximizes the empirical precision and use it to estimate $\sigma_t^2$: $\theta_t^* = \arg\max_{\theta_t} P_t(\theta_t)$. Thus the best (maximum) empirical precision $P^*(t)$ for $t$ is given by:

$$P^*(t) = \max_{\theta_t} P_t(\theta_t). \qquad (7)$$

Remark 3: The basic idea of finding $\theta_t^*$ is analogous to maximum likelihood estimation (MLE), which finds the parameters that best explain the observed data. However, the criterion used here is the empirical precision defined in (6) rather than the likelihood. Furthermore, our method is totally different from the DD method. The objective function of DD is multiplicative, making DD very sensitive to noise. However, the objective function of our method is the empirical precision in (6), which has an additive form and thereby guarantees the robustness of our method. The robustness of our method will be further discussed later.

In essence, $P^*(t)$ reflects the ability of instance $t$ to discriminate the training bags. The larger $P^*(t)$ is, the more likely $t$ is a true positive instance.

It is worth noting that although $\theta_t$ is a continuous-valued variable, we can find $\theta_t^*$ by checking only a finite set of candidate values. This is guaranteed by Theorem 2.

Theorem 2: The best empirical precision $P^*(t)$ for $t$ is achieved when $\theta_t$ is an element of the set $\{d(t, B_i^+) \mid i = 1, \ldots, n^+\}$.

Proof: Let $\theta_t^*$ be a value satisfying $P_t(\theta_t^*) = P^*(t)$ and let $\theta_t'$ be the maximum value in $\{d(t, B_i^+) \mid i = 1, \ldots, n^+\}$ that is no larger than $\theta_t^*$.

If $B_i^+$ is correctly labeled when $\theta_t = \theta_t^*$, then $d(t, B_i^+) \le \theta_t^*$, and hence $d(t, B_i^+) \le \theta_t'$. Therefore, any positive bag correctly labeled when $\theta_t = \theta_t^*$ is also correctly labeled when $\theta_t = \theta_t'$. Similarly, if $B_i^-$ is correctly labeled when $\theta_t = \theta_t^*$, then $d(t, B_i^-) > \theta_t^*$, and hence $d(t, B_i^-) > \theta_t'$. Therefore, any negative bag correctly labeled when $\theta_t = \theta_t^*$ is also correctly labeled when $\theta_t = \theta_t'$.

Thus we have $P_t(\theta_t') = P_t(\theta_t^*) = P^*(t)$.
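Following Theorem 2, a minimal sketch of computing $P^*(t)$ only needs to scan the distances from $t$ to the positive training bags as candidate thresholds. It reuses `bag_distance` and `empirical_precision` from the earlier sketches; the function name and the final comment about recovering $\sigma_t$ from $\theta_t^*$ via the relation $\theta_t = \sigma_t\sqrt{\ln 2}$ in the proof of Theorem 1 are assumptions of this sketch rather than notation from the paper.

```python
def best_empirical_precision(t, bags, labels):
    """Return (theta_star, P_star) for instance t, scanning only the candidate
    thresholds {d(t, B_i^+)} justified by Theorem 2."""
    candidates = [bag_distance(t, bag) for bag, y in zip(bags, labels) if y == +1]
    best_theta, best_prec = None, -1.0
    for theta in candidates:
        prec = empirical_precision(t, bags, labels, theta)
        if prec > best_prec:
            best_theta, best_prec = theta, prec
    return best_theta, best_prec

# P*(t) can then be used to rank candidate instances t from the positive bags:
# the larger P*(t), the more likely t is a true positive instance. If a scale is
# needed, sigma_t could be recovered as theta_star / sqrt(ln 2).
```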