医学图像分割:U-Net 论文阅读

“U-Net: Convolutional Networks for Biomedical Image Segmentation” 是由 Olaf Ronneberger、Philipp Fischer 和 Thomas Brox 撰写的论文,于2015年发表在医学图像计算与计算机辅助干预会议(MICCAI)上。这篇论文提出了一种新型的卷积神经网络架构——U-Net,专为医学图像分割问题而设计。

背景和挑战
在医学图像分析领域,图像分割是一个基本且重要的任务,它涉及将图像分割成不同的区域或对象,例如,区分正常组织与肿瘤组织。传统的分割方法依赖于手工特征提取和复杂的模型,而深度学习方法,特别是卷积神经网络(CNN),提供了一种端到端的自动特征学习方法。
U-Net 架构
U-Net的设计灵感来源于全卷积网络(FCN),但做了显著的改进以更好地适应医学图像分割。U-Net的架构形状像字母"U",由两部分组成:

  1. 收缩路径(Contracting Path)
  • 也称为编码器部分,包括多个卷积层和池化层,用于提取图像特征。
  • 随着网络深度的增加,空间分辨率逐渐降低,但特征通道数增加,以学习更复杂的图像表示。
  2. 扩展路径(Expansive Path)
  • 也称为解码器部分,由多个上采样操作和卷积层组成。
  • 扩展路径的目的是将低分辨率的特征映射恢复到高分辨率,以便于精确的定位。

网络特点:

  • 跳跃连接(Skip Connections):
    • 跳跃连接将编码器部分的特征图与解码器部分的对应特征图连接起来,这有助于网络在上采样过程中恢复精确的定位信息。
    • 通过跳跃连接,网络能够利用上下文信息进行更准确的分割(列表后给出一个简短的代码示意)。
  • 数据增强(Data Augmentation):
    • 论文中特别强调了数据增强在训练过程中的重要性,因为医学图像数据通常是有限的。
    • 使用了随机旋转、缩放和弹性变形等方法来扩展训练数据集,从而提高模型的泛化能力。
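
上面提到的跳跃连接,其核心操作可以概括为“把编码器特征图居中裁剪到解码器特征图的尺寸,再沿通道维拼接”。下面是一个极简的示意片段(假设使用 PyTorch,并非论文的官方实现,张量尺寸均为举例用的假设值):

```python
import torch

def center_crop(enc_feat: torch.Tensor, target_hw: tuple) -> torch.Tensor:
    """把编码器特征图居中裁剪到目标空间尺寸 (H, W)。"""
    _, _, h, w = enc_feat.shape
    th, tw = target_hw
    top, left = (h - th) // 2, (w - tw) // 2
    return enc_feat[:, :, top:top + th, left:left + tw]

# 假设的特征图:编码器特征 64x64,解码器上采样后 56x56(数值仅作演示)
enc = torch.randn(1, 64, 64, 64)   # 来自收缩路径
dec = torch.randn(1, 64, 56, 56)   # 上采样后的解码器特征
skip = torch.cat([center_crop(enc, (56, 56)), dec], dim=1)  # 沿通道维拼接
print(skip.shape)  # torch.Size([1, 128, 56, 56])
```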

成果和影响
U-Net在2015年的ISBI细胞追踪挑战赛中以较大优势获胜,并在始于ISBI 2012的EM分割挑战上刷新了当时的最好成绩。由于其出色的性能和灵活性,U-Net迅速成为医学图像分割领域的一个里程碑,其架构和思想被广泛应用于各种医学图像分割任务,并激发了许多后续的研究和改进。

结论
U-Net提供了一种有效的医学图像分割方案。凭借其独特的结构设计,它在仅有少量标注数据时仍能实现很高的精度,解决了传统分割方法难以捕捉复杂特征和形状的问题,并为医学图像分割领域的发展开辟了新的方向。

------------------------------------------------------------以下是原文阅读----------------------------------------------------------------------

Abstract.

There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.

广泛认为,成功训练深度网络需要数千个带标注的训练样本。在本文中,我们提出了一种网络和训练策略,通过大量使用数据增强来更高效地利用可用的标注样本。**该架构包括一个用于捕捉上下文信息的收缩路径和一个用于实现精确定位的对称扩展路径。**这样的网络可以在非常少量的图像上进行端到端训练,并且在ISBI挑战赛上,对电子显微镜堆栈(EM stacks)中神经结构的分割超越了此前的最佳方法(滑动窗口卷积网络)。使用同一网络在透射光显微镜图像(相差和微分干涉差 DIC)上训练,我们以较大优势赢得了2015年ISBI细胞追踪挑战赛的相应类别。此外,该网络速度很快:在最新的GPU上,分割一张512x512的图像只需不到一秒。完整的实现(基于Caffe)和训练好的网络可在 http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net 获取。

Introduction

In the last two years, deep convolutional networks have outperformed the state of the art in many visual recognition tasks, e.g. [ 7 , 3]. While convolutional networks have already existed for a long time [ 8], their success was limited due to the size of the available training sets and the size of the considered networks. The breakthrough by Krizhevsky et al. [ 7] was due to supervised training of a large network with 8 layers and millions of parameters on the ImageNet dataset with 1 million training images. Since then, even larger and deeper networks have been trained [12].
在过去的两年中,深度卷积网络在许多视觉识别任务中超越了此前的最先进水平,例如[7, 3]。虽然卷积网络早已存在很长时间[8],但由于可用训练集的规模和所能考虑的网络规模有限,它们的成功一直受到限制。Krizhevsky等人[7]的突破在于:在拥有100万张训练图像的ImageNet数据集上,对一个包含8层、数百万参数的大型网络进行了监督训练。此后,更大、更深的网络也被训练出来[12]。

The typical use of convolutional networks is on classification tasks, where the output to an image is a single class label. However, in many visual tasks, especially in biomedical image processing, the desired output should include localization, i.e., a class label is supposed to be assigned to each pixel. Moreover, thousands of training images are usually beyond reach in biomedical tasks. Hence, Ciresan et al. [ 1] trained a network in a sliding-window setup to predict the class label of each pixel by providing a local region (patch) around that pixel as input. First, this network can localize. Secondly, the training data in terms of patches is much larger than the number of training images. The resulting network won the EM segmentation challenge at ISBI 2012 by a large margin.
卷积网络的典型用途是分类任务,即对整张图像输出一个单一的类别标签。然而,在许多视觉任务中,特别是生物医学图像处理中,期望的输出应当包含定位信息,即为每个像素分配一个类别标签。此外,在生物医学任务中,通常难以获得成千上万的训练图像。因此,Ciresan等人[1]以滑动窗口的方式训练了一个网络:以每个像素周围的局部区域(patch,每个patch包含许多像素)作为输入,预测该像素的类别标签。首先,该网络能够进行定位;其次,以patch计的训练数据量远大于训练图像的数量。最终得到的网络在ISBI 2012的EM分割挑战中以较大优势获胜。
Fig. 1. U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
图1. U-net架构(最低分辨率为32x32像素的示例)。每个蓝色框代表一个多通道的特征图。通道数在框的顶部标示。x-y大小在框的左下角提供。白色框代表复制的特征图。箭头表示不同的操作,如右下角所示。

Obviously, the strategy in Ciresan et al. [1] has two drawbacks.

  • First, it is quite slow because the network must be run separately for each patch, and there is a lot of redundancy due to overlapping patches.
  • Secondly, there is a trade-off between localization accuracy and the use of context. Larger patches require more max-pooling layers that reduce the localization accuracy, while small patches allow the network to see only little context.

More recent approaches [11,4] proposed a classifier output that takes into account the features from multiple layers. Good localization and the use of context are possible at the same time.
显然,Ciresan等人[1]的策略有两个缺点:

  • 首先,它相当慢,因为网络必须为每个patch单独运行,且重叠的patch之间存在大量冗余计算。
  • 其次,定位精度和上下文使用之间存在权衡。较大的patch需要更多的最大池化层,这会降低定位精度,而小patch让网络只能看到很少的上下文。

更近期的方法[11,4]提出了一个考虑了多层特征的分类器输出。好的定位和上下文的使用可以同时实现。

In this paper, we build upon a more elegant architecture, the so-called “fully convolutional network” [9]. We modify and extend this architecture such that it works with very few training images and yields more precise segmentations; see Figure 1. The main idea in [9] is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. Hence, these layers increase the resolution of the output. In order to localize, high resolution features from the contracting path are combined with the upsampled output. A successive convolution layer can then learn to assemble a more precise output based on this information.
在本文中,我们以一种更为优雅的架构,即所谓的“全卷积网络”(fully convolutional network)[9]为基础进行构建。我们修改并扩展了这一架构,使其只需非常少量的训练图像即可工作,并产生更精确的分割;见图1。[9]中的主要思想是在常规的收缩网络之后补充连续的层(successive layers),在这些层中,池化操作(pooling operators)被上采样操作(upsampling operators)替代。因此,这些层提高了输出的分辨率。为了实现定位,来自收缩路径的高分辨率特征与上采样的输出相结合;随后的卷积层便可以基于这些信息学习组装出更精确的输出。
Fig. 2. Overlap-tile strategy for seamless segmentation of arbitrary large images (here segmentation of neuronal structures in EM stacks). Prediction of the segmentation in the yellow area, requires image data within the blue area as input. Missing input data is extrapolated by mirroring
图 2. 无缝分割任意大图像的重叠平铺策略(这里是对电子显微镜堆叠中神经结构的分割)。预测黄色区域内的分割需要蓝色区域内的图像数据作为输入。缺失的输入数据通过镜像法进行外推。

One important modification in our architecture is that in the upsampling part we have also a large number of feature channels, which allow the network to propagate context information to higher resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path, and yields a u-shaped architecture. The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels, for which the full context is available in the input image. This strategy allows the seamless segmentation of arbitrarily large images by an overlap-tile strategy (see Figure 2). To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This tiling strategy is important to apply the network to large images, since otherwise the resolution would be limited by the GPU memory.

我们架构中的一个重要改进是,在上采样部分同样使用了大量的特征通道,这使得网络能够将上下文信息传播到更高分辨率的层。因此,扩展路径(the expansive path)与收缩路径(the contracting path)大致对称,从而形成一个U形的架构。该网络没有任何全连接层,并且只使用每个卷积的有效(valid)部分,即分割图中只包含那些在输入图像中拥有完整上下文的像素。这一策略通过重叠-平铺(overlap-tile)的方式实现对任意大小图像的无缝分割(见图2)。为了预测图像边缘区域的像素,通过镜像输入图像来外推缺失的上下文。这种平铺策略对于将网络应用于大图像非常重要,否则分辨率将受限于GPU显存。
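
overlap-tile 的思路可以用 NumPy 做一个简化示意:先对整幅图像做镜像外推,再按输出图块大小切出互相重叠的输入图块(图块与上下文的具体数值按论文图 1 假设,并非唯一选择):

```python
import numpy as np

def mirror_pad_and_tile(image: np.ndarray, out_tile: int, margin: int):
    """镜像外推边界后,按输出图块大小切出互相重叠的输入图块。
    out_tile 为网络每次预测的输出图块边长,margin 为每侧额外需要的上下文宽度。"""
    h, w = image.shape
    ph, pw = (-h) % out_tile, (-w) % out_tile      # 先补齐到 out_tile 的整数倍
    padded = np.pad(image, ((margin, margin + ph), (margin, margin + pw)), mode="reflect")
    tiles = []
    for y in range(0, h + ph, out_tile):
        for x in range(0, w + pw, out_tile):
            # 输入图块比输出图块每侧大 margin 个像素,为边缘像素提供完整上下文
            tiles.append(((y, x), padded[y:y + out_tile + 2 * margin,
                                         x:x + out_tile + 2 * margin]))
    return tiles

# 用法示意:512x512 图像、输出图块 388、每侧上下文 92(对应 572x572 的输入图块)
tiles = mirror_pad_and_tile(np.zeros((512, 512)), out_tile=388, margin=92)
```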

As for our tasks there is very little training data available, we use excessive data augmentation by applying elastic deformations to the available training images. This allows the network to learn invariance to such deformations, without the need to see these transformations in the annotated image corpus. This is particularly important in biomedical segmentation, since deformation used to be the most common variation in tissue and realistic deformations can be simulated efficiently. The value of data augmentation for learning invariance has been shown in Dosovitskiy et al. [2] in the scope of unsupervised feature learning.

鉴于我们的任务可用的训练数据非常有限,我们通过对现有训练图像施加弹性变形(elastic deformations)来进行大量的数据增强。这使得网络能够学习对这类变形的不变性,而无需在标注图像集中见过这些变换。这在生物医学分割中尤其重要,因为变形是组织中最常见的变化,并且可以高效地模拟出逼真的变形。Dosovitskiy等人[2]已在无监督特征学习的范畴内展示了数据增强对学习不变性的价值。

Another challenge in many cell segmentation tasks is the separation of touching objects of the same class; see Figure 3. To this end, we propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function.

在许多细胞分割任务中,另一个挑战是分离同一类别中相互接触的对象;参见图3。为此,我们提出使用加权损失,让相互接触的细胞(touching cells)之间起分隔作用的背景标签在损失函数中获得较大的权重。

The resulting network is applicable to various biomedical segmentation problems. In this paper, we show results on the segmentation of neuronal structures in EM stacks (an ongoing competition started at ISBI 2012), where we outperformed the network of Ciresan et al. [1]. Furthermore, we show results for cell segmentation in light microscopy images from the ISBI cell tracking challenge 2015. Here we won with a large margin on the two most challenging 2D transmitted light datasets.

生成的网络适用于各种生物医学分割问题。在本文中,我们展示了在EM stacks(EM堆栈)中神经结构分割的结果(这是一个始于2012年国际生物成像学会(ISBI)的持续竞赛),我们的性能超越了Ciresan等人[1]的网络。此外,我们还展示了来自ISBI细胞跟踪挑战赛2015的光镜图像中的细胞分割结果。在这两个最具挑战性的2D透射光数据集上,我们以很大的优势获胜。

Network Architecture 网络架构

The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a 1x1 convolution is used to map each 64- component feature vector to the desired number of classes. In total the network has 23 convolutional layers.

网络架构如图1所示,由一个收缩路径(左侧)和一个扩展路径(右侧)组成。

  • 收缩路径遵循典型的卷积网络架构。
    它由两个3x3卷积(不填充卷积)的重复应用组成,每个卷积后跟一个修正线性单元(ReLU)和一个2x2最大池化操作,步幅为2,用于下采样。在每个下采样步骤中,我们将特征通道数量加倍。

  • 扩展路径中的每个步骤由特征图的上采样后跟一个2x2卷积(“上卷积”)组成,
    该卷积将特征通道数量减半,然后将其与从收缩路径中对应裁剪的特征图进行串联,并进行两个3x3卷积,每个卷积后跟一个ReLU。
    由于每次卷积都会导致边界像素的丢失,因此裁剪是必要的。

在最后一层,使用1x1卷积将每个64个分量的特征向量映射到所需的类别数量。
总体上,该网络具有23个卷积层。
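
把上面的文字描述拼起来,可以得到一个简化的 PyTorch 示意实现(论文的官方实现基于 Caffe,这里仅按图 1 的通道数与 valid 卷积假设搭出骨架,供对照阅读):

```python
import torch
import torch.nn as nn

def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """两个 3x3 无填充卷积,每个后接 ReLU。"""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3),   # padding=0,即 valid 卷积
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3),
        nn.ReLU(inplace=True),
    )

def center_crop(feat: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """把收缩路径的特征图居中裁剪到与解码器特征图相同的空间尺寸。"""
    dh, dw = feat.size(2) - target.size(2), feat.size(3) - target.size(3)
    return feat[:, :, dh // 2: dh // 2 + target.size(2),
                dw // 2: dw // 2 + target.size(3)]

class UNet(nn.Module):
    def __init__(self, in_ch: int = 1, num_classes: int = 2):
        super().__init__()
        chs = [64, 128, 256, 512, 1024]            # 每次下采样后通道数翻倍
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.enc = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.enc.append(double_conv(prev, c))
            prev = c
        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        for c in reversed(chs[:-1]):               # 1024->512, 512->256, ...
            self.up.append(nn.ConvTranspose2d(prev, c, kernel_size=2, stride=2))
            self.dec.append(double_conv(prev, c))  # 拼接后通道数为 prev,卷积后减半
            prev = c
        self.head = nn.Conv2d(prev, num_classes, kernel_size=1)  # 1x1 卷积映射到类别数

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:
                skips.append(x)                    # 保存用于跳跃连接
                x = self.pool(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = up(x)                              # 2x2 上卷积:通道减半、分辨率加倍
            x = torch.cat([center_crop(skip, x), x], dim=1)
            x = dec(x)
        return self.head(x)

# 用法示意:图 1 中 572x572 的输入会得到 388x388 的输出,共 23 个卷积层
net = UNet(in_ch=1, num_classes=2)
print(net(torch.randn(1, 1, 572, 572)).shape)      # torch.Size([1, 2, 388, 388])
```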

To allow a seamless tiling of the output segmentation map (see Figure 2), it is important to select the input tile size such that all 2x2 max-pooling operations are applied to a layer with an even x- and y-size.

为了实现输出分割图的无缝平铺(seamless tiling,见图2),重要的是选择合适的输入图块(tile)大小,使得所有2x2最大池化操作都作用在x、y尺寸均为偶数的层上。
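
可以用一个小函数来检查某个输入图块边长是否满足“每次池化前尺寸均为偶数”的条件(这里假设网络有 4 次池化、每级两个 3x3 valid 卷积,与图 1 一致):

```python
def valid_tile_size(size: int, depth: int = 4) -> bool:
    """检查输入边长 size 在每次 2x2 池化前是否为偶数。"""
    for _ in range(depth):
        size -= 4              # 两个 3x3 无填充卷积,每个使边长减 2
        if size <= 0 or size % 2 != 0:
            return False
        size //= 2             # 2x2、步幅 2 的最大池化
    return True

# 用法示意:论文图 1 使用的 572 满足条件,而 570 不满足
print(valid_tile_size(572), valid_tile_size(570))  # True False
```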

Training 训练

The input images and their corresponding segmentation maps are used to train the network with the stochastic gradient descent implementation of Caffe [6]. Due to the unpadded convolutions, the output image is smaller than the input by a constant border width. To minimize the overhead and make maximum use of the GPU memory, we favor large input tiles over a large batch size and hence reduce the batch to a single image. Accordingly we use a high momentum (0.99) such that a large number of the previously seen training samples determine the update in the current optimization step

输入图像及其对应的分割图被用来训练网络,采用Caffe[6]的随机梯度下降实现。由于采用无填充卷积,输出图像比输入图像小一个固定的边框宽度。为了最小化开销并充分利用GPU显存,我们倾向于使用较大的输入图块(tile)而不是较大的批量大小,因此将批量缩减为单张图像。相应地,我们使用高动量(0.99),使得大量先前见过的训练样本共同决定当前优化步骤中的更新。
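
作为对照,这段训练配置大致相当于下面的设置(论文使用 Caffe,这里用 PyTorch 的 SGD 作示意;学习率数值为假设,原文未给出):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 2, kernel_size=3)     # 占位模型,仅演示优化器配置
# 批量大小为 1(单张大图块),高动量 0.99 让此前大量样本共同决定当前更新
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.99)

x = torch.randn(1, 1, 64, 64)              # batch size = 1
loss = model(x).mean()                     # 占位损失,真实训练中应为加权交叉熵
loss.backward()
optimizer.step()
optimizer.zero_grad()
```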

The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function. The soft-max is defined as
能量函数通过对最终特征图进行pixel-wise soft-max计算,并结合交叉熵损失函数来计算。soft-max方程定义如下:

$$p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K}\exp(a_{k'}(x))}$$

其中 $a_k(x)$ 表示像素位置 $x \in \Omega$($\Omega \subset \mathbb{Z}^2$)处特征通道 $k$ 的激活值;$K$ 是类别数量;$p_k(x)$ 是近似的最大值函数,即对于激活值 $a_k(x)$ 最大的那个 $k$,有 $p_k(x) \approx 1$,而对其余所有 $k$,有 $p_k(x) \approx 0$。

交叉熵损失函数随后逐像素地惩罚 $p_{l(x)}(x)$ 与 1 之间的偏差。

能量函数:

$$E = \sum_{x \in \Omega} w(x)\,\log\bigl(p_{l(x)}(x)\bigr)$$

where l : Ω → {1, . . . , K} is the true label of each pixel and w : Ω → R is a weight map that we introduced to give some pixels more importance in the training.

其中,$l : \Omega \to \{1,\dots,K\}$ 是每个像素的真实标签,$w : \Omega \to \mathbb{R}$ 是我们引入的权重图,用于在训练中赋予某些像素更大的重要性。
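
按上面的定义,逐像素 soft-max 与加权交叉熵可以用 NumPy 写成如下最小示意(符号与上文一致;减去最大值的数值稳定化以及最终取负号,是常见实现中的附加处理):

```python
import numpy as np

def weighted_cross_entropy(a: np.ndarray, label: np.ndarray, w: np.ndarray) -> float:
    """a: (K, H, W) 各类别通道的激活 a_k(x);label: (H, W) 真实标签 l(x);
    w: (H, W) 权重图 w(x)。返回加权的负对数似然(即实际最小化的损失)。"""
    a = a - a.max(axis=0, keepdims=True)                      # 数值稳定化
    p = np.exp(a) / np.exp(a).sum(axis=0, keepdims=True)      # 逐像素 soft-max p_k(x)
    h, wd = label.shape
    rows, cols = np.arange(h)[:, None], np.arange(wd)[None, :]
    p_true = p[label, rows, cols]                             # 取出 p_{l(x)}(x)
    return float(-(w * np.log(p_true + 1e-12)).sum())

# 用法示意:3 个类别、4x4 的小例子
a = np.random.randn(3, 4, 4)
label = np.random.randint(0, 3, size=(4, 4))
w = np.ones((4, 4))
print(weighted_cross_entropy(a, label, w))
```
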
Fig. 3. HeLa cells on glass recorded with DIC (differential interference contrast) microscopy. (a) raw image. (b) overlay with ground truth segmentation. Different colors indicate different instances of the HeLa cells. (c) generated segmentation mask (white: foreground, black: background). (d) map with a pixel-wise loss weight to force the network to learn the border pixels.
图3. 用微分干涉差(DIC)显微镜记录的玻璃上的HeLa细胞。(a) 原始图像。(b) 与真实标注(ground truth)分割的叠加,不同颜色表示HeLa细胞的不同实例。(c) 生成的分割掩模(白色:前景,黑色:背景)。(d) 逐像素的损失权重图,用于迫使网络学习边界像素。

We pre-compute the weight map for each ground truth segmentation to compensate the different frequency of pixels from a certain class in the training data set, and to force the network to learn the small separation borders that we introduce between touching cells (See Figure 3c and d).
我们为每个真实分割标注预先计算权重图,以补偿训练数据集中不同类别像素出现频率的差异,并迫使网络学习我们在相互接触的细胞之间引入的细小分隔边界(参见图3c和d)。

The separation border is computed using morphological operations. The weight map is then computed as
分隔边界通过形态学操作计算得出。然后,权重图按如下方式计算:
$$w(x) = w_c(x) + w_0 \cdot \exp\!\left(-\frac{\bigl(d_1(x)+d_2(x)\bigr)^2}{2\sigma^2}\right)$$

where wc : Ω → R is the weight map to balance the class frequencies, d1 : Ω → R denotes the distance to the border of the nearest cell and d2 : Ω → R the distance to the border of the second nearest cell. In our experiments we set w0 = 10 and σ ≈ 5 pixels
其中,

  • wc:Ω→R是用于平衡类别频率的权重映射,
  • d1:Ω→R表示到最近细胞边界的距离,
  • d2:Ω→R表示到第二近细胞边界的距离。

在我们的实验中,我们设置w0 = 10和σ≈5像素。
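
按该公式,可以用 SciPy 的距离变换写一个权重图的示意实现(d1、d2 用欧氏距离变换近似;w_c 的具体算法论文未给出,这里按类别频率的倒数做粗略平衡,仅作演示):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(instances: np.ndarray, w0: float = 10.0, sigma: float = 5.0) -> np.ndarray:
    """instances: (H, W) 整数标签图,0 为背景,1..N 为不同的细胞实例。
    返回 w(x) = w_c(x) + w0 * exp(-(d1(x)+d2(x))^2 / (2*sigma^2)) 的近似。"""
    ids = [i for i in np.unique(instances) if i != 0]
    h, w = instances.shape
    if len(ids) >= 2:
        # 对每个实例:各像素到该实例的欧氏距离(实例内部为 0)
        dists = np.stack([distance_transform_edt(instances != i) for i in ids], axis=0)
        dists.sort(axis=0)
        d1, d2 = dists[0], dists[1]              # 到最近与次近细胞的距离
        border = w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
        border[instances > 0] = 0.0              # 该项只作用于细胞之间的背景分隔带
    else:
        border = np.zeros((h, w))
    # w_c:按类别频率的倒数粗略平衡前景/背景(示意做法)
    freq_fg = max((instances > 0).mean(), 1e-6)
    freq_bg = max((instances == 0).mean(), 1e-6)
    wc = np.where(instances > 0, 0.5 / freq_fg, 0.5 / freq_bg)
    return wc + border

# 用法示意:两个相邻的方形“细胞”,它们之间的背景带会获得较大的权重
lab = np.zeros((64, 64), dtype=int)
lab[10:30, 10:30] = 1
lab[10:30, 34:54] = 2
w_map = unet_weight_map(lab)
```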

In deep networks with many convolutional layers and different paths through the network, a good initialization of the weights is extremely important. Otherwise, parts of the network might give excessive activations, while other parts never contribute. Ideally the initial weights should be adapted such that each feature map in the network has approximately unit variance. For a network with our architecture (alternating convolution and ReLU layers) this can be achieved by drawing the initial weights from a Gaussian distribution with a standard deviation of √(2/N), where N denotes the number of incoming nodes of one neuron [5]. E.g. for a 3x3 convolution and 64 feature channels in the previous layer N = 9 · 64 = 576.

在具有许多卷积层和多条路径的深度网络中,良好的权重初始化极为重要。否则,网络的某些部分可能给出过高的激活,而其他部分则始终不起作用。理想情况下,初始权重的设置应使网络中的每个特征图都具有近似的单位方差。对于我们这种架构(交替的卷积和ReLU层),这可以通过从标准差为 √(2/N) 的高斯分布中抽取初始权重来实现,其中N表示一个神经元的输入节点数[5]。例如,对于3x3卷积且上一层有64个特征通道的情况,N = 9 · 64 = 576。
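
这里描述的初始化即后来常说的 He 初始化。用 PyTorch 手工实现大致如下(示意写法,也可直接用框架内置的 kaiming 初始化):

```python
import math
import torch
import torch.nn as nn

def init_unet_weights(module: nn.Module) -> None:
    """用标准差为 sqrt(2/N) 的高斯分布初始化卷积权重,N 为一个神经元的输入连接数。"""
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        n = module.in_channels * module.kernel_size[0] * module.kernel_size[1]
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(2.0 / n))
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# 用法示意:3x3 卷积、上一层 64 个特征通道时,N = 9 * 64 = 576
conv = nn.Conv2d(64, 128, kernel_size=3)
init_unet_weights(conv)
# 也可以对整个网络递归应用:model.apply(init_unet_weights)
```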

Data Augmentation 数据增强

Data augmentation is essential to teach the network the desired invariance and robustness properties, when only few training samples are available. In case of microscopical images we primarily need shift and rotation invariance as well as robustness to deformations and gray value variations. Especially random elastic deformations of the training samples seem to be the key concept to train a segmentation network with very few annotated images. We generate smooth deformations using random displacement vectors on a coarse 3 by 3 grid. The displacements are sampled from a Gaussian distribution with 10 pixels standard deviation. Per-pixel displacements are then computed using bicubic interpolation. Drop-out layers at the end of the contracting path perform further implicit data augmentation.

当只有少量训练样本可用时,数据增强对于让网络学到所需的不变性和鲁棒性至关重要。对于显微镜图像,我们主要需要平移和旋转不变性,以及对形变和灰度变化的鲁棒性。特别地,对训练样本施加随机弹性变形,似乎是用极少量标注图像训练分割网络的关键。我们在粗粒度的3x3网格上使用随机位移向量来生成平滑的形变,位移从标准差为10个像素的高斯分布中采样,然后使用双三次插值计算逐像素的位移。收缩路径末端的Drop-out层则进一步执行了隐式的数据增强。
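
文中描述的弹性变形可以用 SciPy 按“粗网格随机位移 + 插值成逐像素位移场”的思路做一个示意实现(重采样时的插值阶数等细节为假设):

```python
from typing import Optional

import numpy as np
from scipy.ndimage import map_coordinates, zoom

def elastic_deform(image: np.ndarray, grid: int = 3, sigma: float = 10.0,
                   rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """在 grid x grid 的粗网格上采样高斯随机位移(标准差 sigma 个像素),
    再插值为逐像素位移场,对图像做平滑的弹性变形。"""
    rng = rng or np.random.default_rng()
    h, w = image.shape
    coarse = rng.normal(0.0, sigma, size=(2, grid, grid))        # y、x 方向的粗网格位移
    dy = zoom(coarse[0], (h / grid, w / grid), order=3)[:h, :w]  # 三次样条上采样
    dx = zoom(coarse[1], (h / grid, w / grid), order=3)[:h, :w]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys + dy, xs + dx])                        # 形变后的采样坐标
    return map_coordinates(image, coords, order=1, mode="reflect")

# 用法示意:对一张 512x512 的图像做一次随机弹性变形(同一形变应同步作用于标注图)
deformed = elastic_deform(np.random.rand(512, 512))
```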

Experiments

We demonstrate the application of the u-net to three different segmentation tasks. The first task is the segmentation of neuronal structures in electron microscopic recordings. An example of the data set and our obtained segmentation is displayed in Figure 2. We provide the full result as Supplementary Material. The data set is provided by the EM segmentation challenge [14] that was started at ISBI 2012 and is still open for new contributions. The training data is a set of 30 images (512x512 pixels) from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes with a corresponding fully annotated ground truth segmentation map for cells (white) and membranes (black). The test set is publicly available, but its segmentation maps are kept secret. An evaluation can be obtained by sending the predicted membrane probability map to the organizers. The evaluation is done by thresholding the map at 10 different levels and computation of the “warping error”, the “Rand error” and the “pixel error” [14].

我们展示了U-Net在三个不同分割任务中的应用。第一个任务是电子显微镜记录中神经结构的分割,数据集示例及我们得到的分割结果如图2所示,完整结果作为补充材料提供。该数据集来自始于ISBI 2012、至今仍接受新提交的EM分割挑战[14]。训练数据是来自果蝇一龄幼虫腹神经索(VNC)连续切片透射电子显微镜的30张图像(512x512像素),每张图像都附带完全标注的真实分割图,其中细胞为白色、细胞膜为黑色。测试集是公开的,但其分割图保密;将预测的膜概率图发送给组织者即可获得评估。评估方式是在10个不同的阈值水平对概率图进行阈值化,并计算“弯曲误差(warping error)”“Rand误差”和“像素误差”[14]。

The u-net (averaged over 7 rotated versions of the input data) achieves without any further pre- or postprocessing a warping error of 0.0003529 (the new best score, see Table 1) and a rand-error of 0.0382.
U-net(对输入数据的7个旋转版本进行平均)在没有进一步的预处理或后处理的情况下,达到了0.0003529的弯曲误差(新的最佳得分,见表1)和0.0382的Rand误差。

This is significantly better than the sliding-window convolutional network result by Ciresan et al. [1], whose best submission had a warping error of 0.000420 and a rand error of 0.0504. In terms of rand error the only better performing algorithms on this data set use highly data set specific post-processing methods1 applied to the probability map of Ciresan et al. [1].

这比Ciresan等人[1]的滑动窗口卷积网络结果要好得多,其最佳提交的弯曲误差为0.000420,Rand误差为0.0504。就Rand误差而言,在该数据集上唯一表现更好的算法,使用了高度针对该数据集的后处理方法,作用于Ciresan等人[1]的概率图之上。
Fig. 4. Result on the ISBI cell tracking challenge. (a) part of an input image of the “PhC-U373” data set. (b) Segmentation result (cyan mask) with manual ground truth (yellow border). (c) input image of the “DIC-HeLa” data set. (d) Segmentation result (random colored masks) with manual ground truth (yellow border).
图4. ISBI细胞跟踪挑战赛的结果。(a) “PhC-U373”数据集的部分输入图像。(b) 分割结果(青色掩膜)与人工标注的真实值(黄色边界)。(c) “DIC-HeLa”数据集的输入图像。(d) 分割结果(随机着色掩膜)与人工标注的真实值(黄色边界)。

Table 2. Segmentation results (IOU) on the ISBI cell tracking challenge 2015.
表2. 2015年ISBI细胞跟踪挑战赛的分割结果(IOU)。
We also applied the u-net to a cell segmentation task in light microscopic images. This segmentation task is part of the ISBI cell tracking challenge 2014 and 2015 [10,13]. The first data set “PhC-U373” contains Glioblastoma-astrocytoma U373 cells on a polyacrylimide substrate recorded by phase contrast microscopy (see Figure 4a,b and Supp. Material). It contains 35 partially annotated training images. Here we achieve an average IOU (“intersection over union”) of 92%, which is significantly better than the second best algorithm with 83% (see Table 2). The second data set “DIC-HeLa” are HeLa cells on a flat glass recorded by differential interference contrast (DIC) microscopy (see Figure 3, Figure 4c,d and Supp. Material). It contains 20 partially annotated training images. Here we achieve an average IOU of 77.5% which is significantly better than the second best algorithm with 46%.

我们还将U-Net应用于光学显微图像中的细胞分割任务,该任务是2014和2015年ISBI细胞跟踪挑战赛的一部分[10,13]。第一个数据集“PhC-U373”包含通过相差显微镜记录的、生长在聚丙烯酰胺基质上的胶质母细胞瘤-星形细胞瘤U373细胞(见图4a、b和补充材料),共有35张部分标注的训练图像。在该数据集上,我们取得了92%的平均IOU(交并比),显著优于第二名算法的83%(见表2)。第二个数据集“DIC-HeLa”是通过微分干涉差(DIC)显微镜记录的、平坦玻璃上的HeLa细胞(见图3、图4c、d和补充材料),共有20张部分标注的训练图像。在该数据集上,我们取得了77.5%的平均IOU,显著优于第二名算法的46%。

Conclusion

The u-net architecture achieves very good performance on very different biomedical segmentation applications. Thanks to data augmentation with elastic deformations, it only needs very few annotated images and has a very reasonable training time of only 10 hours on a NVidia Titan GPU (6 GB). We provide the full Caffe[6]-based implementation and the trained networks. We are sure that the u-net architecture can be applied easily to many more tasks.

U-Net架构在不同的生物医学分割应用中取得了非常好的性能。通过使用弹性变形进行数据增强,它只需要很少的标注图像,并且在NVidia Titan GPU(6 GB)上的训练时间非常合理,只需10小时。我们提供了基于Caffe的完整实现和训练好的网络。我们相信U-Net架构可以很容易地应用于更多的任务中。
