DMNet: Difference Minimization Network for Semi-supervised Segmentation in Medical Images

Kang Fang and Wu-Jun Li(B)

National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, National Institute of Healthcare Data Science at Nanjing University, Nanjing, China
fangk@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract. Semantic segmentation is an important task in medical image analysis. In general, training high-performance models requires a large amount of labeled data. However, collecting labeled data is typically difficult, especially for medical images. Several semi-supervised methods have been proposed to use unlabeled data to facilitate learning. Most of these methods use a self-training framework, in which the model cannot be well trained if the pseudo masks predicted by the model itself are of low quality. Co-training is another widely used semi-supervised method in medical image segmentation. It uses two models and makes them learn from each other. None of these methods is end-to-end. In this paper, we propose a novel end-to-end approach, called difference minimization network (DMNet), for semi-supervised semantic segmentation. To use unlabeled data, DMNet adopts two decoder branches and minimizes the difference between the soft masks generated by the two decoders. In this manner, each decoder can learn under the supervision of the other decoder, so both can be improved at the same time. Also, to make the model generalize better, we force the model to generate low-entropy masks on unlabeled data so that the decision boundary of the model lies in low-density regions. Meanwhile, an adversarial training strategy is adopted to learn a discriminator which can encourage the model to generate more accurate masks. Experiments on a kidney tumor dataset and a brain tumor dataset show that our method can outperform the baselines, including both supervised and semi-supervised ones, and achieve the best performance.
Keywords: Semantic segmentation · Semi-supervised learning

1 Introduction

Semantic segmentation is of great importance in medical image analysis, because it can help detect the location and size of anatomical structures and aid in making therapeutic schedules. With the development of deep learning, deep neural

This work is supported by the NSFC-NRF Joint Research Project (No. 61861146001).
© Springer Nature Switzerland AG 2020
A. L. Martel et al. (Eds.): MICCAI 2020, LNCS 12261, pp. 532–541, 2020. https://doi.org/10.1007/978-3-030-59710-8_52
networks, especially fully convolutional networks (FCNs) [12], have shown promising performance in segmenting both natural images and medical images. The models in these methods have millions of parameters to be optimized, so a large amount of labeled data with pixel-level annotations is typically needed to train such models to achieve promising performance. However, it is generally difficult to collect a large amount of labeled data in medical image analysis. One main reason is that annotating medical images needs expert knowledge, but few experts have time for annotation. Another reason is that annotating medical images is time-consuming.

Semi-supervised learning can utilize a large amount of unlabeled data to improve model performance. semiFCN [2] proposes a semi-supervised network-based approach for medical image segmentation. In semiFCN, a network is trained to predict pseudo masks. The predicted pseudo masks are then used to update the network in turn. ASDNet [14] trains a confidence network to select regions with high confidence in soft masks for updating the segmentation network. Zhou et al. [18] propose to jointly improve the performance of disease grading and lesion segmentation by semi-supervised learning with an attention mechanism. Souly et al. [17] use weakly labeled data and unlabeled data to train a generative adversarial network (GAN) [8], which can force real data to be close in feature space and thus cluster together. These methods all use a self-training framework, in which the model is updated using pseudo masks predicted by the model itself. If the pseudo masks predicted by the model itself have low quality, the model will be updated using noisy data. On the other hand, co-training [4] uses two models, and each model is updated using unlabeled data with pseudo masks predicted by the other model, together with labeled data with ground truth.
In this manner, each model in co-training is supervised by the other model, so the two models can be improved in turn. Several methods [9,15] explore co-training in deep learning, but they are not end-to-end methods.

In this paper, we propose a novel end-to-end approach, called difference minimization network (DMNet), for semi-supervised semantic segmentation in medical images. The contributions of our method can be listed as follows:

– DMNet is a semi-supervised segmentation model, which can be trained with a limited amount of labeled data and a large amount of unlabeled data.
– DMNet adopts the widely used encoder-decoder structure [1,7,16], but it has two decoder branches with a shared encoder. DMNet minimizes the difference between the soft masks predicted by the two decoders to utilize unlabeled data. Unlike co-training, which is often not end-to-end, the two decoders in DMNet can be updated at the same time in an end-to-end way.
– DMNet uses the sharpen [3] operation to force the model to generate predictions with low entropy on unlabeled data, which can improve the model performance.
– DMNet adopts adversarial learning derived from GAN for further improvement.
– Experiments on a kidney tumor dataset and a brain tumor dataset show that our method can outperform other baselines to achieve the best performance.
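The shared-encoder, two-decoder design summarized above can be sketched as a minimal forward pass. This is a hypothetical NumPy sketch: the per-pixel feature map and linear "decoders" are toy stand-ins for the real convolutional encoder and decoder networks, used only to show that the encoder runs once and each decoder produces its own per-class soft mask.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, K = 4, 4, 3  # image size and number of classes

def softmax(z, axis=-1):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shared_encoder(x):
    # Stand-in for the shared encoder: a tiny per-pixel feature map.
    return np.stack([x, x ** 2], axis=-1)          # (H, W, 2)

def decoder(feats, weights):
    # Stand-in decoder: a per-pixel linear map followed by softmax.
    return softmax(feats @ weights)                # (H, W, K)

x = rng.random((H, W))
feats = shared_encoder(x)                          # encoder runs once
w1 = rng.standard_normal((2, K))                   # decoder branch 1
w2 = rng.standard_normal((2, K))                   # decoder branch 2
y1, y2 = decoder(feats, w1), decoder(feats, w2)    # two soft masks

assert y1.shape == y2.shape == (H, W, K)
assert np.allclose(y1.sum(-1), 1.0) and np.allclose(y2.sum(-1), 1.0)
```

Because both decoders read the same `feats`, gradients from either branch would update the shared encoder, which is the memory and information-sharing advantage argued for in Sect. 3.1.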
2 Notation

We use $X \in \mathbb{R}^{H \times W}$ to denote an image in the labeled training set, and $Y \in \{0, 1\}^{H \times W \times K}$ to denote the corresponding ground-truth label, which is encoded in one-hot format. Here, $K$ is the number of classes, and $H$ and $W$ are the height and width of the image, respectively. DMNet has two segmentation branches, and we denote the class probability maps generated by the two segmentation branches as $\hat{Y}^{(1)}, \hat{Y}^{(2)} \in \mathbb{R}^{H \times W \times K}$. Furthermore, we denote an unlabeled image as $U \in \mathbb{R}^{H \times W}$. We use $[1:N]$ to denote $[1, 2, \cdots, N]$.

3 Method

The framework of DMNet is shown in Fig. 1. It is composed of a segmentation network with two decoder branches, a sharpen operation for unlabeled data, and a discriminator for both labeled and unlabeled data. Each component is described in detail in the following subsections.

Fig. 1. The framework of DMNet

3.1 Segmentation Network

As shown in Fig. 1, the segmentation network in DMNet adopts the widely used encoder-decoder architecture, which is composed of a shared encoder and two different decoders. Sharing an encoder gives our segmentation network some advantages. First, it saves GPU memory compared to an architecture in which the two decoders use separate encoders. Second, since the encoder is shared by the two decoders, it can be updated by the information from both decoders. Therefore,
it can learn better features from the difference between the soft masks generated by the two decoders, which can lead to better performance. This will be verified by our experimental results in Sect. 4. The two decoders in DMNet use different architectures to introduce diversity. With different architectures, the two decoders will not typically output exactly the same segmentation masks, so they can learn from each other. By using labeled and unlabeled data in turn, DMNet can adequately utilize unlabeled data to improve segmentation performance. DMNet is a general framework, and any segmentation network with an encoder-decoder architecture, such as UNet [16], VNet [13], SegNet [1] and DeepLab v3+ [7], can be used in DMNet. In this paper, we adopt UNet [16] and DeepLab v3+ [7] for illustration. The shared encoder extracts a latent representation with high-level semantic information from the input image. Then, for labeled data, we use the ground truth to supervise the learning of the segmentation network, while for unlabeled data we minimize the difference between the masks generated by the two decoders to let them learn from each other.

We use the Dice loss [13] to train our segmentation network on labeled data, which is defined as follows:

$$\mathcal{L}_{dice}(\hat{Y}^{(1)}, \hat{Y}^{(2)}, Y; \theta_s) = \sum_{i=1}^{2} \left[ 1 - \frac{1}{K} \sum_{k=1}^{K} \frac{2 \sum_{h=1}^{H} \sum_{w=1}^{W} Y_{h,w,k} \hat{Y}^{(i)}_{h,w,k}}{\sum_{h=1}^{H} \sum_{w=1}^{W} \left( Y_{h,w,k} + \hat{Y}^{(i)}_{h,w,k} \right)} \right],$$

where $Y_{h,w,k} = 1$ when the pixel at position $(h, w)$ belongs to class $k$, and $Y_{h,w,k} = 0$ otherwise. $\hat{Y}^{(i)}_{h,w,k}$ is the probability, predicted by segmentation branch $i$, that the pixel at position $(h, w)$ belongs to class $k$. $\theta_s$ denotes the parameters of the segmentation network.

The loss function used for unlabeled data is described in Sect. 3.3.

3.2 Sharpen Operation

Given an unlabeled image $U$, our segmentation network can generate soft masks $\hat{Y}^{(1)}$ and $\hat{Y}^{(2)}$.
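The Dice loss $\mathcal{L}_{dice}$ of Sect. 3.1 can be sketched directly in NumPy. This is a minimal sketch: the function name and the small $\epsilon$ added to the denominator (to guard against classes absent from both masks) are our additions, not part of the paper.

```python
import numpy as np

def dice_loss(y1, y2, y):
    """Dice loss summed over the two decoder branches.

    y1, y2: soft masks of shape (H, W, K) from the two decoders.
    y:      one-hot ground truth of shape (H, W, K).
    """
    eps = 1e-8  # illustrative smoothing term, not from the paper
    loss = 0.0
    for y_hat in (y1, y2):
        inter = (y * y_hat).sum(axis=(0, 1))   # per-class overlap
        denom = (y + y_hat).sum(axis=(0, 1))   # per-class mass
        loss += 1.0 - np.mean(2.0 * inter / (denom + eps))
    return loss
```

A perfect prediction drives each branch's term to zero, while a maximally uncertain uniform prediction over two classes gives a per-branch loss of 0.5, matching the formula term by term.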
To make the predictions of the segmentation network have low entropy, i.e., high confidence, we adopt the sharpen operation [3] to reduce the entropy of predictions on unlabeled data, which is defined as follows:

$$\mathrm{Sharpen}(\hat{Y}^{(i)}_{h,w,c}, T) = \frac{(\hat{Y}^{(i)}_{h,w,c})^{1/T}}{\sum_{k=1}^{K} (\hat{Y}^{(i)}_{h,w,k})^{1/T}}, \quad \forall h \in [1:H],\ w \in [1:W],\ T \in (0, 1),$$

where $\hat{Y}^{(i)}$ is the soft mask predicted by decoder branch $i$ and the temperature $T$ is a hyperparameter.

3.3 Difference Minimization for Semi-supervised Segmentation

As described in Sect. 3.1, the two decoders generate two masks on unlabeled data. If the two masks differ from each other, the model is unsure about its predictions and thus cannot generalize well. Therefore,
we minimize the difference between the two masks to make the two decoders generate consistent masks on the same unlabeled data. In other words, the two decoders can learn under the supervision of each other.

More specifically, given an unlabeled image $U$, the two decoder branches generate two probability masks $\hat{Y}^{(1)}$ and $\hat{Y}^{(2)}$, which are processed by the sharpen operation. Since the Dice loss can measure the similarity of two segmentation masks and the loss can be backpropagated through both terms, we extend the Dice loss to the unlabeled setting and get the corresponding loss $\mathcal{L}_{semi}$ as follows:

$$\mathcal{L}_{semi}(U; \theta_s) = 1 - \frac{1}{K} \sum_{k=1}^{K} \frac{2 \sum_{h=1}^{H} \sum_{w=1}^{W} \hat{Y}^{(1)}_{h,w,k} \hat{Y}^{(2)}_{h,w,k}}{\sum_{h=1}^{H} \sum_{w=1}^{W} \left( \hat{Y}^{(1)}_{h,w,k} + \hat{Y}^{(2)}_{h,w,k} \right)}.$$

From the definition of $\mathcal{L}_{semi}$, we can see that the two decoders are updated by minimizing the difference between the masks they generate.

3.4 Discriminator

In DMNet, we also adopt adversarial learning to learn a discriminator. Unlike the original discriminator in GAN, which discriminates whether an image is generated or real, our discriminator adopts a fully convolutional network (FCN). The FCN discriminator is composed of three convolutional layers with stride 2 for downsampling and three corresponding upsampling layers. Each convolutional layer is followed by a ReLU layer. It can discriminate whether a region or some pixels are predicted or come from the ground truth.

Adversarial Loss for Discriminator. The objective function of the discriminator can be written as follows:

$$\mathcal{L}_{dis}(\hat{Y}^{(1)}, \hat{Y}^{(2)}, Y; \theta_d) = \mathcal{L}_{bce}(D(\hat{Y}^{(1)}), \mathbf{0}; \theta_d) + \mathcal{L}_{bce}(D(\hat{Y}^{(2)}), \mathbf{0}; \theta_d) + \mathcal{L}_{bce}(D(Y), \mathbf{1}; \theta_d),$$

where $\theta_d$ denotes the parameters of the discriminator $D(\cdot)$. $\mathbf{1}$ and $\mathbf{0}$ are tensors filled with ones or zeros, respectively, with the same size as the outputs of $D(\cdot)$. The term $\mathcal{L}_{bce}(D(Y), \mathbf{1})$ in $\mathcal{L}_{dis}$ is used only when the input data is labeled and is ignored when the input data is unlabeled.
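The sharpen operation of Sect. 3.2 and the difference-minimization loss $\mathcal{L}_{semi}$ of Sect. 3.3 can be sketched together in NumPy. This is a minimal sketch: the default temperature $T = 0.5$ and the $\epsilon$ smoothing term are illustrative choices, not values reported in the paper.

```python
import numpy as np

def sharpen(y_hat, T=0.5):
    """Raise class probabilities to the power 1/T and renormalize over
    the class axis (last axis), which lowers the prediction entropy."""
    p = y_hat ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

def l_semi(y1, y2, T=0.5, eps=1e-8):
    """Dice-style difference between the two sharpened soft masks."""
    p1, p2 = sharpen(y1, T), sharpen(y2, T)
    inter = (p1 * p2).sum(axis=(0, 1))   # per-class agreement
    denom = (p1 + p2).sum(axis=(0, 1))
    return 1.0 - np.mean(2.0 * inter / (denom + eps))
```

When the two decoders agree, `l_semi` approaches zero, so minimizing it pushes each decoder toward the other's (sharpened, low-entropy) prediction, exactly the mutual-supervision effect described above.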
$\mathcal{L}_{bce}$ is defined as follows:

$$\mathcal{L}_{bce}(A, B; \theta) = -\sum_{h=1}^{H} \sum_{w=1}^{W} \left[ B_{h,w} \log A_{h,w} + (1 - B_{h,w}) \log (1 - A_{h,w}) \right],$$

where $\theta$ denotes the parameters of the network that produces $A$.

Adversarial Loss for Segmentation Network. In the adversarial learning scheme, the segmentation network tries to fool the discriminator. Hence, there is an adversarial loss $\mathcal{L}_{adv}$ for the segmentation network to learn consistent features:

$$\mathcal{L}_{adv}(\hat{Y}^{(1)}, \hat{Y}^{(2)}; \theta_s) = \mathcal{L}_{bce}(D(\hat{Y}^{(1)}), \mathbf{1}; \theta_s) + \mathcal{L}_{bce}(D(\hat{Y}^{(2)}), \mathbf{1}; \theta_s),$$
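The adversarial objectives of Sect. 3.4 can be sketched in NumPy. This is a minimal sketch: the discriminator outputs `d_y1`, `d_y2`, `d_y` are taken as given confidence maps (the FCN discriminator itself is not implemented here), and the clipping constant is our addition for numerical safety.

```python
import numpy as np

def l_bce(a, b, eps=1e-8):
    """Pixel-wise binary cross-entropy between confidence map a and target b."""
    a = np.clip(a, eps, 1.0 - eps)  # avoid log(0); illustrative safeguard
    return -np.sum(b * np.log(a) + (1.0 - b) * np.log(1.0 - a))

def l_dis(d_y1, d_y2, d_y=None):
    """Discriminator loss: push predicted masks toward 0, ground truth toward 1.
    The ground-truth term is included only for labeled data."""
    loss = l_bce(d_y1, np.zeros_like(d_y1)) + l_bce(d_y2, np.zeros_like(d_y2))
    if d_y is not None:  # labeled data only
        loss += l_bce(d_y, np.ones_like(d_y))
    return loss

def l_adv(d_y1, d_y2):
    """Adversarial loss for the segmentation network: fool D toward 1."""
    return l_bce(d_y1, np.ones_like(d_y1)) + l_bce(d_y2, np.ones_like(d_y2))
```

Note the opposing targets: `l_dis` trains $\theta_d$ to separate predictions from ground truth, while `l_adv` trains $\theta_s$ so that the discriminator scores both decoder outputs as real.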