This section presents the dataset description, data preprocessing, data augmentation, the network architecture of the proposed model, the experimental results, and the discussion.
Dataset
Cholec80 [31] is a dataset of 80 cholecystectomy surgery videos performed by 13 surgeons. The video resolution is 1920 × 1080 pixels at a frame rate of 25 frames per second (fps). Video length varies between 12 min 19 s (minimum) and 1 h 39 min 55 s (maximum), with an average of 38 min 26 s and more than 51 hours of surgery in total. Cholec80 is fully annotated with image-level surgical tool labels for binary detection.
Seven tools are used and annotated in Cholec80; Fig. 4 shows an example of each, namely: specimen bag, bipolar, scissors, clipper, hook, grasper, and irrigator. As the images are collected with different laparoscopes and from different surgeons, they come with different angles and resolutions, and some suffer from poor resolution, poor focus, or blur. A tool is labeled as present if at least half of it appears in the image. One binary label is provided per image and per tool as annotation (multilabel classification).
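To make the annotation format concrete, the snippet below shows what a per-frame tool presence target looks like as a seven-element binary vector; the tool ordering and variable names are illustrative, not the dataset's actual file format.

```python
# Hypothetical per-frame annotation: one binary flag per tool.
# The ordering below is illustrative, not Cholec80's actual file format.
TOOLS = ["grasper", "bipolar", "hook", "scissors",
         "clipper", "irrigator", "specimen bag"]

# A frame in which a grasper and a hook are visible: a multilabel target.
label = [1, 0, 1, 0, 0, 0, 0]
```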
Data preprocessing
Videos are processed using FFmpeg 3.0, and all video streams are encoded with libx264 at 25 frames per second (fps).
First, the video width is scaled to 480 pixels, and the height is determined so as to maintain the aspect ratio of the original input video. Next, the audio is stripped from all videos.
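A minimal sketch of this preprocessing step, assuming FFmpeg is available on the PATH; the input and output paths are illustrative.

```python
import subprocess
from pathlib import Path

def preprocess(src: Path, dst: Path) -> None:
    """Scale the width to 480 px (height follows the aspect ratio),
    re-encode with libx264 at 25 fps, and drop the audio stream."""
    subprocess.run(
        ["ffmpeg", "-i", str(src),
         "-vf", "scale=480:-2",  # -2 = even height preserving aspect ratio
         "-c:v", "libx264",      # encode the video stream with libx264
         "-r", "25",             # 25 frames per second
         "-an",                  # strip the audio track
         str(dst)],
        check=True)

for video in Path("cholec80/videos").glob("*.mp4"):
    preprocess(video, Path("preprocessed") / video.name)
```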
Since the videos are raw and unedited, they contain many empty and irrelevant frames, mostly at the beginning and end of the videos. These frames are noisy and computationally expensive to process. We therefore remove them using a background detection model, trained to identify unimportant segments captured outside the body. The detected frames are then used to recognize the real start and end of the surgery in the original video and to trim it accordingly.
This step is fully automated, and the final verified video files are stored on a local computer.
Since the Cholec80 dataset is labelled with tool presence annotations at 1 fps, we split the preprocessed videos into images at 1 frame per second; splitting consists of fractioning each video into a sequence of still images.
Finally, given that neural networks receive inputs of the same size, all images must be resized to a fixed size before being fed to the CNN. Moreover, the image dimension is often reduced so that a reasonably sized batch fits in GPU memory. Each image is therefore resized to 250 × 250 pixels.
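The splitting and resizing steps could be implemented, for example, with OpenCV; the paper does not name its tooling, so the following is only a sketch.

```python
import cv2

def split_and_resize(video_path: str, out_dir: str, size=(250, 250)):
    """Sample one frame per second (matching the 1 fps annotations)
    and resize each frame to 250 x 250 pixels."""
    cap = cv2.VideoCapture(video_path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS)))  # 25 after preprocessing
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % fps == 0:                 # keep every 25th frame -> 1 fps
            frame = cv2.resize(frame, size)
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.png", frame)
            saved += 1
        index += 1
    cap.release()
```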
Data augmentation
In cholecystectomy surgery, some tools are used more frequently than others. Consequently, the Cholec80 video frames containing those tools outnumber the frames containing the other tools, leading to unbalanced data. This issue affects the generalization of the model and reduces the CNN's ability to classify the different tools.
To overcome this problem, image augmentation techniques are used to increase the size of the minority classes. Images are augmented by affine transformations and blurring (Fig. 5). We consider only transformations that preserve tool presence, listed below (a code sketch of these transformations follows the list):
- Rotation: minority class images are rotated by 0, 40, 85, 125, 250, and 300 degrees.
- Mirroring: the image is mirrored along the x-axis and the y-axis.
- Shearing: the images are sheared by 40 degrees in the counter-clockwise direction.
- Padding: 5 px of padding is added on each border using the reflect mode, which pads with the reflection of the image without repeating the last value on the edge.
As shown in Fig. 5, the total number of images before this phase was 245,086 frames; after augmentation it becomes 310,054 frames.
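The transformations above could be expressed, for instance, with the imgaug library; the paper does not name its augmentation tooling, so this is only a sketch, and the blur strength is an assumption.

```python
import imgaug.augmenters as iaa

# One augmenter per transformation listed above; each preserves tool presence.
augmenters = [iaa.Affine(rotate=a) for a in (0, 40, 85, 125, 250, 300)] + [
    iaa.Flipud(1.0),                    # vertical flip (mirror across the x-axis)
    iaa.Fliplr(1.0),                    # horizontal flip (mirror across the y-axis)
    iaa.Affine(shear=40),               # shear by 40 degrees
    iaa.Pad(px=5, pad_mode="reflect",   # reflect-pad 5 px on each border
            keep_size=False),
    iaa.GaussianBlur(sigma=1.0),        # blurring (sigma is an assumption)
]

def augment(image):
    """Return one augmented copy per transformation (labels unchanged)."""
    return [aug(image=image) for aug in augmenters]
```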
Network architecture
InceptionResnetV2
The residual neural network (ResNet) is arguably the most influential computer vision/deep learning architecture of recent years. ResNet achieved state-of-the-art results on many computer vision applications, such as object detection and face recognition. Since AlexNet [32] won the ILSVRC 2012 classification contest, researchers have focused on building ever deeper networks; the VGG network (19 layers) and GoogLeNet (22 layers) are state-of-the-art CNN architectures of this kind. However, simply piling up layers does not guarantee better performance: the vanishing gradient problem makes very deep neural networks hard to train. That is why ResNet introduced the "identity shortcut connection", which allows the signal to skip layers. The authors of [33] argue that stacking layers should not degrade network performance, because one could simply stack identity mappings (layers that do nothing) on top of the current network, and the resulting architecture would perform the same.
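As a minimal illustration of the identity shortcut connection, here is a simplified residual block in Keras (batch normalization and projection shortcuts are omitted for brevity):

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two 3x3 convolutions whose output is added to the unchanged input:
    the identity shortcut that lets the signal skip layers.
    Assumes x already has `filters` channels so the addition is valid."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])  # identity shortcut connection
    return layers.ReLU()(y)
```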
Another milestone in CNN development is the Inception network. Conventional CNNs tend to simply stack convolution layers; Inception instead applies convolutions of several kernel sizes in parallel within each module, achieving better performance in terms of both speed and accuracy.
InceptionResnetV2 (Fig. 6) is a hybrid network that combines the Inception and residual architectures (both state of the art) to boost performance. InceptionResnetV2 is trained on more than one million images from the ImageNet dataset [34]. The network has a default input size of 299 × 299.
Transfer learning
Transfer learning consists of taking a network pretrained on one dataset and applying it to recognize new image/object categories. Essentially, we exploit the robust, discriminative filters learned by state-of-the-art networks on challenging datasets (such as ImageNet or COCO) and use these networks to recognize objects the model was never trained on.
In deep learning, feature extraction and fine-tuning are two types of transfer learning:
- Transfer learning via feature extraction is done by freezing all the convolutional layers and changing only the classification layer, i.e., the final layer. The pretrained network is used to extract features from the input images. This technique is used when the new data are similar to the original training dataset (ImageNet).
- Fine-tuning, on the other hand, requires more modifications than feature extraction (Fig. 7). The layers are initialized with the pretrained model weights. The model architecture is updated by removing the fully connected layer head, replacing it with a new one, and training it to predict the input classes. Furthermore, some of the last layers can be unfrozen in order to perform a second pass of training (freezing means that a layer's weights are not updated during training). This technique is used when the similarity between the original training dataset (ImageNet) and our images is low, which is why we fine-tuned InceptionResnetV2 in our case.
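As a concrete illustration, and assuming a Keras-style implementation (the framework is not stated in the paper, though the "trainable=True" phrasing in the next section suggests Keras), loading the pretrained base and attaching a new classification head might look like this:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionResNetV2

# Pretrained base without the 1000-class ImageNet head.
base = InceptionResNetV2(weights="imagenet", include_top=False,
                         input_shape=(250, 250, 3), pooling="avg")
base.trainable = False  # freeze the convolutional layers for now

# New classification head: seven sigmoid outputs, one per surgical tool.
outputs = layers.Dense(7, activation="sigmoid")(base.output)
model = models.Model(inputs=base.input, outputs=outputs)
```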
Fine-tuning InceptionResnetV2
In the fine-tuning approach, the representations learnt by the pretrained network are used to extract meaningful features from the new dataset images, and the activation maps generated by the last convolutional layer are fed to a newly constructed fully connected network that acts as the classifier.
Therefore, the first step is to truncate the fully connected head at the end of the pretrained network (the softmax layer) and replace it with a freshly initialized sigmoid layer compatible with our multilabel classification task. This layer predicts a probability of class membership for each of the seven labels, assigning each a value between 0 and 1. The sigmoid function is calculated as:
$$S(x)=\frac{1}{1+e^{-x}}= \frac{e^{x}}{e^{x}+1} = 1 - S(-x)$$
(1)
After removing the fully connected (FC) head of InceptionResnetV2, which contains 1000 outputs (the 1000 classes of the ImageNet dataset), we construct a new FC layer (the classifier layer) with seven outputs (the number of surgical tools) and append it to our model.
Next, we freeze the early blocks, since they capture generic, low-level features such as edges and curves that transfer directly to our new classification task. This step also ensures that the robust features previously learned by the CNN are not destroyed.
Then, we train the new FC head connected to the model to take the lower-level features from the front of the network and map them to the desired output classes. Once this is done, we unfreeze some of the top layers of the frozen model by setting "trainable=True" and continue training, so that their weights can also be fine-tuned for the new task in further SGD epochs. Moreover, we use a smaller learning rate at this stage, because the pretrained weights are expected to be considerably better than randomly initialized weights.
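Continuing the sketch above, the two-pass schedule could look roughly as follows; the number of unfrozen layers, the epoch counts, and the data variables are illustrative assumptions.

```python
from tensorflow.keras.optimizers import SGD

# Pass 1: train only the new FC head (the base is still frozen).
model.compile(optimizer=SGD(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_images, train_labels, batch_size=32, epochs=5)

# Pass 2: unfreeze the top layers of the base and continue training with
# a smaller learning rate, so the pretrained weights are only gently
# adjusted for the new task.
for layer in base.layers[-20:]:  # the number of layers is illustrative
    layer.trainable = True
model.compile(optimizer=SGD(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_images, train_labels, batch_size=32, epochs=5)
```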
Finally, the input size of the model was changed to (250, 250, 3), where 250 × 250 is our frame dimension and 3 is the number of color channels (RGB).
Experimental results and discussion
Implementation parameters
As mentioned in section A-1, we used Cholec80 for performance evaluation: 60 videos (241,842 images) were assigned to the training set, while 20 videos (68,212 images) were assigned to the test set. The grasper and hook appear more often than the other tools, leading to unbalanced data; image augmentation techniques are applied to overcome this problem, as described in section A-2.
We fine-tuned InceptionResnetV2, pretrained on the ImageNet dataset; the fine-tuning process is described in the previous section.
Tool detection is a multilabel, multiclass task, as several tools can be present at the same time. Our model is trained using stochastic gradient descent, with binary cross-entropy as the loss function and sigmoid as the final activation function.
The training process was run for 70K iterations with a batch size of 32 images and an initial learning rate of 0.001, which we decay by a factor of 10 after 12K iterations.
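In Keras terms, this step-decay schedule could be expressed as a sketch like the following:

```python
import tensorflow as tf

# 1e-3 for the first 12K iterations, then a factor-of-10 drop to 1e-4.
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[12_000], values=[1e-3, 1e-4])
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
```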
The network is trained on an Intel® Core™ i7-9700K processor with 16 GB of memory and an NVIDIA GeForce GTX 2080 GPU.