IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, XXX 200X

MILD: Multiple-Instance Learning via Disambiguation

Wu-Jun Li, Dit-Yan Yeung

Abstract

In multiple-instance learning (MIL), an individual example is called an instance and a bag contains one or more instances. The class labels available in the training set are associated with bags rather than instances. A bag is labeled positive if at least one of its instances is positive; otherwise, the bag is labeled negative. Since a positive bag may contain some negative instances in addition to one or more positive instances, the true labels of the instances in a positive bag may or may not be the same as the corresponding bag label and, consequently, the instance labels are inherently ambiguous. In this paper, we propose a very efficient and robust MIL method, called MILD (Multiple-Instance Learning via Disambiguation), for general MIL problems. First, we propose a novel disambiguation method to identify the true positive instances in the positive bags. Second, we propose two feature representation schemes, one for instance-level classification and the other for bag-level classification, to convert the MIL problem into a standard single-instance learning (SIL) problem that can be solved by well-known SIL algorithms, such as the support vector machine. Third, an inductive semi-supervised learning method is proposed for MIL. We evaluate our methods extensively on several challenging MIL applications to demonstrate their promising efficiency, robustness and accuracy.

Index Terms

Multiple-instance learning, learning from ambiguity, CBIR, object recognition, co-training, drug activity prediction

Wu-Jun Li and Dit-Yan Yeung are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China. E-mail: liwujun@cse.ust.hk; dyyeung@cse.ust.hk

Manuscript received xxxx xx, 2008; revised xxxx xx, xxxx.

March 1, 2009 DRAFT
1 INTRODUCTION

1.1 Multiple-Instance Learning

The task of drug activity prediction [1] is to classify aromatic molecules according to whether or not they are “musky”. Here, the same molecule can manifest several different steric configurations (i.e., molecular shapes), each with very different energy properties. A molecule is said to be musky if it binds itself to a particular receptor in one or some of its configurations, even though it cannot bind to the receptor in its other configurations. When a molecule cannot bind to the receptor in any of its configurations, it is said to be non-musky.

A new learning paradigm called multiple-instance learning (MIL) was proposed in [1] to model learning problems like drug activity prediction. In an MIL problem [1], an individual example is called an instance and a bag contains one or more instances. A bag is labeled positive if at least one of its instances is positive; otherwise, the bag is labeled negative. In the example of drug activity prediction, a bag corresponds to a molecule, and an instance corresponds to a configuration. A configuration is said to be positive if it can make the molecule bind to the receptor; otherwise, it is called negative. In MIL, the class labels available in the training set are associated with bags rather than instances. For example, we only know whether or not a molecule is musky, but not in which configuration the molecule binds to the receptor.

As more and more applications have been formulated as MIL problems, some of them deviate slightly from the original formulation of MIL in [1]. Recent works [2], [3] have pointed out that there are actually two different settings for MIL. One setting is based on the existential assumption, which assumes that a bag is positive as long as a single positive instance exists in it. This setting is the same as the original formulation in [1]. The other setting is based on the collective assumption [3], which assumes that a bag’s label is collectively determined by a set of instances or even all the instances in the bag. One representative application conforming to the collective assumption is the region-based image classification task [4], [5], where the label of an image always refers to some object in that image. Because current automatic image segmentation methods may cut the object responsible for the label of the image into multiple parts, no single part can represent the object satisfactorily. It is the collective property of multiple parts that determines the label of the image.
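The two bag-labeling settings just described can be stated in a few lines of code. The sketch below is only illustrative: the instance-level rule `is_pos` is a hypothetical stand-in, and the fraction-based form of the collective assumption is just one simple instantiation of the more general formulations in the cited works.

```python
def label_bag(instances, is_positive_instance):
    """Existential MIL assumption: a bag is positive iff at least one
    of its instances is positive; otherwise the bag is negative."""
    return any(is_positive_instance(x) for x in instances)

def label_bag_collective(instances, is_positive_instance, threshold=0.5):
    """One simple form of the collective assumption: the bag label is
    determined jointly by all instances, here via the fraction that
    are positive (an illustrative choice, not the papers' definition)."""
    votes = [is_positive_instance(x) for x in instances]
    return sum(votes) / len(votes) >= threshold

# Hypothetical instance-level rule: an instance (a single feature
# value here) is deemed positive when it exceeds a threshold.
is_pos = lambda x: x > 0.5

print(label_bag([0.1, 0.9, 0.2], is_pos))  # True: one positive instance suffices
print(label_bag([0.1, 0.2, 0.3], is_pos))  # False: no positive instance
```

Note that under the existential assumption a single positive instance flips the bag label, whereas under the collective form the same bag may remain negative if most of its instances are negative.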
From the formulation of MIL, we can easily see that a positive bag may contain some negative instances in addition to one or more positive instances. Hence, the true labels of the instances in a positive bag may or may not be the same as the corresponding bag label and, consequently, the instance labels are inherently ambiguous. In the MIL literature [2], true positive instances and false positive instances refer to the positive and negative instances, respectively, in the positive bags.

1.2 Motivation

Since labels are not available for the training instances, some methods [2], [6] simply assume that all instances in a bag have the same label as the bag. However, this assumption can be very unreasonable for positive bags, because a positive bag may contain as few as one true positive instance. If the majority of the negative instances in a positive bag are mislabeled this way, the features learned for distinguishing positive instances from negative instances may end up being very misleading. Other methods [7], [8], [9] try to extend standard single-instance learning (SIL) methods to multi-instance data by adding constraints. Unfortunately, the resulting methods typically require solving non-convex optimization problems, which suffer from local minima and incur high computational cost.

Considering the limitations of these previous methods, we advocate instance label disambiguation as a way to eventually improve the prediction accuracy of the bag labels. In the context of MIL, disambiguation essentially refers to identifying the true positive instances in the positive bags. However, existing disambiguation methods fail to achieve satisfactory performance.

The APR (axis-parallel rectangle) method [1] tries to find an axis-parallel rectangle (or, more generally, hyper-rectangle) in the feature space to represent the area of the true positive instances. This rectangle should include at least one instance from each positive bag but exclude all instances from the negative bags. Although APR works quite well for the drug activity prediction problem, for some other applications, such as image or text categorization, it is quite possible that no APR exists that includes at least one instance from each positive bag while excluding all instances from the negative bags. Moreover, APR is very sensitive to labeling noise. Suppose only one negative bag is mislabeled as a positive one. In order to include at least one instance from this mislabeled negative bag, the computed APR may
contain so many negative instances that it cannot represent the area of true positive instances at all (cf. Section 3.2.3). The DD (diverse density) value [10] measures how many different positive bags have instances near a point in the feature space and how far the negative instances are from that point. The DD method tries to find the point with the highest DD value as the target concept. The DD method is also very sensitive to labeling noise, since the DD value at a point is exponentially reduced if an instance from some negative bag is close to that point (cf. Section 3.2.3). This phenomenon has been validated empirically by [11]. Moreover, since the DD landscape contains local maxima, searching for the point with the highest DD value generally requires multiple restarts and hence incurs high computational cost. Other methods, such as mi-SVM [7], adopt heuristics to solve the optimization problem, which may lead to local minima and incur high computational cost.

1.3 Main Contributions

In this paper, we propose a very efficient and robust MIL method, called MILD (Multiple-Instance Learning via Disambiguation), for general MIL problems. The main contributions of this paper can be summarized as follows:

• By investigating the properties of true positive instances in depth, we propose a novel disambiguation method for identifying the true positive instances in the positive bags. This method is not only very efficient but also very robust.

• Two feature representation schemes, one for instance-level classification and the other for bag-level classification, are proposed to convert the MIL problem into a standard single-instance learning problem that can be solved directly by well-known SIL algorithms, such as the support vector machine (SVM).

• By combining the two feature representation schemes, a multi-view semi-supervised learning method based on co-training [12] is proposed for MIL. To the best of our knowledge, this is the first inductive semi-supervised learning method for MIL.

• To demonstrate the promising performance of our method, we extensively compare it with many state-of-the-art MIL methods in diverse applications, including drug activity prediction, protein sequence classification and image classification.

One of the most attractive advantages of our method is that, after instance label disambiguation, the MIL problem is converted into a standard single-instance learning problem. As a result,
many existing supervised and semi-supervised learning methods can be easily adapted for MIL applications.

2 RELATED WORK

Over the past few years, many applications have been formulated as MIL problems. These include drug activity prediction [1], [13], [14], [15], image classification [4], [11], [16], [17], [18], text classification [2], [7], stock selection [7], [10], protein family modeling [19], computer aided diagnosis [20], [21], and security applications [22]. To solve these problems, many MIL methods have been proposed. The first MIL method is APR [1], which represents the target concept by an axis-parallel rectangle (or hyper-rectangle) in the feature space. The rectangle includes at least one instance from each positive bag but excludes all instances from the negative bags. [10] proposed a measure called DD (diverse density), which essentially measures how many different positive bags have instances near a point in the feature space and how far the negative instances are from that point. A DD-based method tries to find the point with the highest DD value as the target concept. EM-DD [23], [24] combines expectation-maximization (EM) [25] with the DD formulation by using EM to search for the most likely concept.

Several other methods try to modify standard single-instance learning methods for MIL by introducing constraints derived from the MIL formulation. [7] proposed two methods based on SVM, one (mi-SVM) for instance-level classification and the other (MI-SVM) for bag-level classification. Both methods are formulated as mixed integer quadratic programming problems. [26] applied deterministic annealing to the SVM formulation to find better local minima than the heuristic methods. [6] proposed a kernel function defined directly on bags. With this multi-instance (MI) kernel, in principle any kernel-based classifier can be trained to classify the bags. [27] extended this work by proposing a marginalized MI kernel to convert the MIL problem from an incomplete-data problem to a complete-data problem. [28] proposed an SVM-based method for sparse positive bags by directly enforcing the desired constraint that at least one of the instances in a positive bag is positive. [9] proposed to apply the transductive SVM to solve MIL problems.

Recently, some methods convert MIL into a standard single-instance problem by representing the bags as feature vectors and then solve it with conventional methods. [4] proposed the DD-SVM method through a feature mapping defined on some instance prototypes learned by DD. [11] proposed another method, called MILES (Multiple Instance Learning via Embedded
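The DD measure of [10], whose noise sensitivity Section 1.2 discusses, is commonly written in a noisy-or form: the value at a candidate point is a product of per-bag factors, where each positive bag contributes the probability that at least one of its instances lies near the point and each negative bag contributes the probability that none of its instances do. The sketch below is one common instantiation of that idea; the Gaussian-style instance model and the `scale` parameter are illustrative assumptions, not details taken from this paper.

```python
import math

def dd_value(t, positive_bags, negative_bags, scale=1.0):
    """Noisy-or diverse density at candidate point t.
    Each bag is a list of instances; each instance is a tuple of floats."""
    def sqdist(x):
        # squared Euclidean distance between instance x and point t
        return sum((xi - ti) ** 2 for xi, ti in zip(x, t))

    dd = 1.0
    for bag in positive_bags:
        # probability that at least one instance of this positive bag
        # lies near t (noisy-or over the bag's instances)
        dd *= 1.0 - math.prod(1.0 - math.exp(-scale * sqdist(x)) for x in bag)
    for bag in negative_bags:
        # a single negative instance close to t drives this factor toward
        # zero, which is the labeling-noise sensitivity noted in Section 1.2
        dd *= math.prod(1.0 - math.exp(-scale * sqdist(x)) for x in bag)
    return dd
```

For example, a point sitting on instances from every positive bag and far from all negative instances receives a DD value near 1, while a point next to any negative instance receives a value near 0, regardless of the positive bags.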