Therefore, to estimate the best threshold $\theta_t^*$ (and hence $\sigma_t^2$), we only need to compute the distance from $t$ to each positive training bag. This procedure can be done very efficiently.

Ideally, disambiguation should identify all the true positive instances in the positive bags. However, the only thing we know is that each positive bag contains at least one true positive instance; we do not know the exact number of true positive instances in any specific positive bag. Different positive bags may have different numbers of positive instances, and furthermore, the discriminative ability of the true positive instances from different positive bags might also differ. Hence, although we know that in general the true positive instances have a higher ability to discriminate the training bags than the negative instances do, it is not easy to find a simple rule that identifies all the true positive instances in the positive bags. Actually, from our experiments (cf. Section 4.3), we find that identifying all the true positive instances might exceed what is needed for common MIL tasks, and consequently incur unnecessary cost. Here, we design a simple disambiguation method that identifies just one positive instance from each positive bag, which is enough for most MIL tasks based on our experiments.²

2. The extension of our simple version of the disambiguation method to some complex tasks will be discussed in Section 3.5.1.

Since the MIL formulation requires that each positive bag contains at least one true positive instance, it is always possible to find a true positive instance in a positive bag. In our algorithm, we select from each positive training bag the instance with the largest $P^*$ value as a candidate true positive instance. After true positive instance selection, the disambiguation process is completed. Algorithm 1 summarizes the disambiguation method presented above.

Algorithm 1 Disambiguation Method
Input: all training bags $B_1^+, \ldots, B_{n^+}^+, B_1^-, \ldots, B_{n^-}^-$
Initialize: $X = \{x_1, x_2, \ldots, x_p\}$ is the set of re-indexed instances from all positive training bags; $T^* = \emptyset$ (the empty set), where $T^*$ keeps the set of selected true positive instances
for $k = 1$ to $p$ do
    Compute $P^*(x_k)$ according to (7)
end for
for $i = 1$ to $n^+$ do
    $t_i^* = \arg\max_{B_{ij}^+ \in B_i^+} P^*(B_{ij}^+)$
    Add $t_i^*$ to $T^*$
end for
Output: $T^*$
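To make Algorithm 1 concrete, the following Python sketch mirrors its two loops. Since Eqs. (6) and (7) are not reproduced in this excerpt, the helper `discrimination_score` is only an illustrative stand-in for $P^*$: it scores an instance by how well a single threshold on its bag distances separates the positive from the negative training bags. The function names, the minimum-distance definition of $d(\cdot,\cdot)$, and the data layout (each bag as a list of feature vectors) are assumptions made for this sketch, not the paper's exact definitions.

```python
import numpy as np

def bag_distance(x, bag):
    """Distance d(x, B) from an instance x to a bag B, taken here as the
    minimum Euclidean distance from x to any instance of the bag (assumed
    definition; the paper's own distance is defined elsewhere)."""
    return min(float(np.linalg.norm(np.asarray(x) - np.asarray(inst))) for inst in bag)

def discrimination_score(x, pos_bags, neg_bags):
    """Illustrative stand-in for P*(x) of Eq. (7): the best fraction of
    training bags that a single threshold on d(x, .) labels correctly
    (positive bags close to x, negative bags far away). A true positive
    instance tends to score high; a negative instance inside a positive
    bag tends to score near 0.5."""
    scored = sorted(
        [(bag_distance(x, b), +1) for b in pos_bags] +
        [(bag_distance(x, b), -1) for b in neg_bags]
    )
    n, n_pos = len(scored), len(pos_bags)
    best, pos_seen = 0.0, 0
    for k in range(n + 1):                 # predict the k nearest bags positive
        if k > 0:
            pos_seen += (scored[k - 1][1] == +1)
        correct = pos_seen + (n - k) - (n_pos - pos_seen)
        best = max(best, correct / n)
    return best

def disambiguate(pos_bags, neg_bags):
    """Algorithm 1: from every positive bag, keep the instance with the
    largest score as the candidate true positive instance (the set T*)."""
    return [max(bag, key=lambda x: discrimination_score(x, pos_bags, neg_bags))
            for bag in pos_bags]
```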
3.2.3 Comparison with Other Disambiguation Methods

APR (axis-parallel rectangle) [1] and DD (diverse density) [10] are two well-known disambiguation methods. Compared with them, our disambiguation method is more efficient and more robust. A quantitative empirical comparison will be performed in the next section; here we first give a qualitative comparison with respect to the robustness property.

APR represents the target concept by an axis-parallel rectangle which includes at least one instance from each positive bag but excludes all instances from the negative bags. Figure 1 gives a toy example to show how APR works. There are altogether nine bags ($B_1^+, B_2^+, B_3^+, B_4^+, B_5^-, B_6^-, B_7^-, B_8^-, B_9^-$) in the two-dimensional feature space. If there is no noise, APR will choose the area in the red (smaller) rectangle as the true positive area, which is a very reasonable choice for this noise-free case. However, if we mislabel even just one negative bag, say $B_5^-$, to be positive, APR can no longer find a rectangle that includes at least one instance from each positive bag but excludes all instances from the negative bags. If $B_6^-$ is not in the training set, then the algorithm will choose the blue (bigger) rectangle, which is obviously not the real true positive area. Hence, APR is very sensitive to labeling noise. Furthermore, in many applications such as image classification, it is very difficult or even impossible for APR to find a rectangle that satisfies all the constraints.

As for DD [10], it finds a single point in the feature space, as well as the best feature weights corresponding to that point, to represent the concept, and then decides the label of each bag based on the distance from the bag to the concept point. If the weighted distance from the concept point to any instance of a bag is below a threshold, the bag will be labeled as positive. Hence, the true positive area of DD is an ellipse (or, more generally, a hyperellipsoid). In the toy example in Figure 1, we can assume that both features have equal weights, and hence the computed true positive area is a circle. When there is no noise, DD finds the red point as the computed concept point and the red (smaller) circle as the true positive area. This result is quite reasonable. However, DD is also very sensitive to labeling noise, as has been pointed out in [11]. We can also observe this phenomenon easily from the toy example: if we mislabel only one positive bag, say $B_1^+$, to be negative, then the concept point will move to the blue point and the true positive area will be the blue (bigger) circle, which is also not the real true positive area.
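For comparison, here is a minimal sketch of the DD decision rule just described: a bag is labeled positive when the weighted distance from the learned concept point to at least one of its instances falls below a threshold. The weighted-Euclidean form, the parameter names, and the toy values are assumptions for illustration; learning the concept point and weights (the diverse-density maximization of [10]) is not shown.

```python
import numpy as np

def dd_label_bag(bag, concept, weights, threshold):
    """DD-style bag labeling: positive iff some instance of the bag lies
    within `threshold` (in weighted Euclidean distance) of the concept
    point. Concept point, feature weights, and threshold are assumed to
    have been learned already; only the decision rule is sketched here."""
    dists = [np.sqrt(np.sum(weights * (np.asarray(x) - concept) ** 2)) for x in bag]
    return min(dists) < threshold

# Toy usage: with equal feature weights the positive region is a circle
# around the concept point, as in the Figure 1 example.
concept = np.array([0.0, 0.0])
bag = [np.array([0.3, 0.1]), np.array([2.0, 2.0])]
print(dd_label_bag(bag, concept, np.ones(2), threshold=0.5))  # True
```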
Fig. 1. A toy example to show that APR and DD are sensitive to labeling noise. $B_{ij}^+$ and $B_{ik}^+$ are two instances from the same positive bag $B_i^+$, while $B_{ij}^-$ and $B_{ik}^-$ are two instances from the same negative bag $B_i^-$.

On the contrary, our disambiguation method is much more robust. In the empirical precision measure defined in (6), negating a small number of labels will not change the overall precision value significantly. This makes our method more robust towards labeling noise. To illustrate this, let us assume that $x^+$ is a true positive instance from a positive training bag and $x^-$ is a negative instance (false positive instance) from the same bag. The precisions $P^*(x^+)$ and $P^*(x^-)$ are computed from (7). Since negative instances also exist in positive bags, the distances from $x^-$ to the positive bags are as random as those to the negative bags. Hence, $P^*(x^-)$ may approach 50% while $P^*(x^+)$ is relatively high, because $d(x^+, B_i^+)$ is expected to be smaller than $d(x^+, B_j^-)$. If we add noise by changing the labels of $d\%$ of the positive bags and $d\%$ of the negative bags, then $P^*(x^-)$ may still be about 50% but $P^*(x^+)$ may decrease by up to $d\%$. However, as long as $P^*(x^+) - d\% > P^*(x^-)$, the selected true positive instance is still $x^+$. In real applications, $P^*(x^+)$ can be very high, even 100%, if all the true positive instances form a