Towards more efficient CNN-based surgical tools classification using transfer learning

Context-aware system (CAS) is a system that can understand the context of a given situation and either share this context with other systems for their response or respond by itself. In surgery, these systems are intended to assist surgeons enhance the scheduling productivity of operating rooms (OR) and surgical teams, and promote a comprehensive perception and consciousness of the OR. Furthermore, the automated surgical tool classification in medical images is a real-time computerized assistance to the surgeons in conducting different operations. Moreover, deep learning has embroiled in every facet of life due to the availability of large datasets and the emergence of convolutional neural networks (CNN) that have paved the way for the development of different image related processes. The aim of this paper is to resolve the problem of unbalanced data in the publicly available Cholec80 laparoscopy video dataset, using multiple data augmentation techniques. Furthermore, we implement a fine-tuned CNN to tackle the automatic tool detection during a surgery, with prospective use in the teaching field, evaluating surgeons, and surgical quality assessment (SQA). The proposed method is evaluated on a dataset of 80 cholecystectomy videos (Cholec80 dataset). A mean average precision of 93.75% demonstrates the effectiveness of the proposed method, outperforming the other models significantly.

Minimally invasive surgery has myriad advantages than conventional open surgery [1,2]: On the other hand, learning MIS procedures have multiple limitations, such as: -The operative area is displayed on the monitor as a 2-D image, instead of the regular 3-D eyesight. This absence of depth perception is a challenge for surgeons. -Limited training chances because of patient safety and resource concerns.
To overcome these issues, the surgical video stream is recorded and exploited. These videos are utilized in retrospective analysis and postoperative Surgical Quality Assessment (SQA), which are investigating the recorded videos narrowly, to detect possible mistakes and evaluate the surgeon expertise. They are also used as academic material for debutante surgeons [4]. Furthermore, recording the surgical intervention is compulsory in many countries [5], and provided as proof in divers circumstances.
However, MIS videos tend to last several hours. Thus, navigation and searching through these videos are cumbersome and time-effort consuming. To overcome this problem, we propose a deep learning-based solution to classify surgical tools in MIS videos and stock the results in a database, to execute specific queries, and allow  In this study, we designed a system for the classification of surgical tools in MIS videos utilizing advanced deep neural networks. The system consists of the following steps: Firstly, preprocessing the digital images which includes splitting the Cholec80 videos in one frame per second images, then resizing these images to a fixed size.
The next essential step includes an augmentation process to overcome the unbalanced data problem by increasing the size of minority classes.
Finally, we use a well-known deep learning model to train it on the augmented data images. Then, we compare the proposed approach with other methods. This paper is organized as follows: Sect. "Related works" presents the related works, Sect. "Deep learning and computer vision" defines the basic concepts of the deep learning and computer vision, Sect. "Our approach" describes the detailed methodology of the proposed approach, including the implementation and experimental results, whereas sect. "Conclusions and future works" covers the conclusion of the paper.

Related works
Classification, segmentation, and tracking using convolutional neural networks (CNN) have made state-of-the-art results in the field of medicine, for example, pulmonary tuberculosis segmentation [6], brain tumor segmentation [7], stroke lesion segmentation [8], breast cancer classification [9], and artery and vein classification [10].
The purpose of cholecystectomy is to remove the gallbladder: this operation can be performed laparoscopically and monitored through an endoscope. The recorded videos were used for multiple purposes. We summarize some uses of computer vision in minimally invasive surgery videos in: -Laparoscopy: Kletz [11] applied a region-based convolutional neural network (R-CNN), to recognize the surgical instruments in laparoscopy. A custom dataset generated from laparoscopic gynecological videos was used. Amy [12] combined a region-based convolutional neural network (faster R-CNN) and VGG16 to detect laparoscopic surgical tools and perform an operative skill assessment. The dataset used is M2CAI. Christian [13] performed the segmentation of surgical instruments in laparoscopic surgery using a self-supervised method based on the kinematic model of the robot as a source of information. A fully convolutional neural network (FCN) was used on VIVO dataset. This dataset was obtained from a robotized endoscopy system. Choi [14] [18]. Islam [19] proposed a real-time instrument segmentation tool in robot-assisted minimally invasive surgery (RMIS) using Multiresolution Feature Fusion (MFF) block and a light-weight CNN to identify the surgical tool. This approach was tested on MICCAI 2017 dataset. Shvets [20] presented an instrument segmentation framework, in robotic surgery, using adversarial learning, Multi-resolution Feature Fusion (MFF), and a fully convolutional network (FCN). Colleoni [21] applied spatiotemporal layers using a fully convolutional neural network (FCNN), to perform robotic surgical tool detection and articulation estimation. EndoVis challenge 2015 dataset was used. Automatic instrument segmentation in robotassisted surgery was implemented by Shvets [20] using U-Net, TernausNet, LinkNet, VGG11, and VGG16 encoders, on a custom dataset obtained from DA Vinci Xi surgical system. Sarikaya [22] performed the detection and localization of robotic tools in robot-assisted surgery videos using region proposal network (RPN), on ATLAS Dione dataset. -Other uses of MIS videos: As we mentioned earlier, minimally invasive surgery videos often have a duration of several hours. That's why Chittajallu [23] presented a content-based video retrieval system, similar to a query image. The author used a CNN model called ResNet50, trained with a siamese triplet, ranked and refined with IQR (iterative query refinement), based on the user feedback. The Model can detect the surgical instrument in the query image and search for similar frames in the stocked MIS videos.
The preoperative surgery duration prediction is done manually, thus, the surgeons underestimate surgery duration by 31 min [24] on average, causing a longer waiting time for patients and a non-optimized exploitation of the operation room. Twinanda [25] proposed an automatic method, using the visual information from laparoscopic video, to detect instruments corresponding to a specific phase, and it continuously predicts remaining surgery duration, without any human intervention. A CNN is used to extract visual features from the video frames, and LSTM (long-short term memory) was used to predict the remaining surgery time. The results were very positive, with 15.2 ± 4.7 minutes for short videos, 12.5 ± 4.6 minutes in medium videos, and 23.1 ± 9.4 minutes in long videos.

Deep learning and computer vision
Machine learning (ML), unlike other types of computer programming, provides the ability to automatically learn and improve from experience without being explicitly programmed. ML focuses on the development of computer programs that can gather data and use it to learn. Deep Learning (DL) is a member of the ML family and has seen a qualitative leap in the past years, driven by the availability of large datasets and the evolution of computer resources. This field has witnessed imminent progress in the ability of machines to understand and manipulate data, including videos and images. DL is based on artificial neural networks, that imitate the human brain in the decision-making process by generating patterns.
Some of the biggest hits of deep learning have been in the area of computer vision (CV). CV is a field of artificial intelligence that trains computers to interpret and understand the visual world from digital images or videos. Computer vision is also defined as an area of study that seeks to develop techniques to help computers "see" and understand the content of digital images. When the computer receives an image as an input, it's an array of pixel values (Fig. 2), each of these values is a number between zero and 255, which tells the intensity of the pixel, meaningless to humans, but it is the only input to machines.
Deep Learning achieved revolutionary results on challenging computer vision problems such as image classification [26], object detection [27] and face recognition [28]. Medical imaging is one of the most benefiting fields from the evolution of computer vision, it helps doctors by alerting them when an area in an image is doubtable (e.g. detecting a tumor in an image).
Convolutional neural networks (CNNs) [29] is a specific class of neural network (NN) that was created to learn visual features from images. Nowadays, it is the most successful deep learning approach to handle the image classification task [30].
The standard CNN is basically composed of three layers: convolutional, pooling, and fully connected layers.
-Convolutional layer: is the core building block of a CNN, it does massive math operations. The picture is convoluted with a specific filter to extract the desired features. It is the first action is a CNN, the result is an activation map or a feature map. -Pooling Layer: aims to reduce the dimensions of the feature maps. For each feature map received from the previous layer, a filter is applied to summarize the features lying within the region covered by the filter. -Fully connected layer: is the last layer, and the classification result is specified based on the category (or categories) with highest value. At this stage, the output of the previous layer is flattened and turned into a single vector, then assigning weights to predict the correct label and giving the final probabilities for each category.
In the medical field, CNN is employed quite successfully in detecting and tracking surgical tools (Fig. 3). CNN obtain an image as input and transform it to a vector and apply some simple operations like convolution, pooling, then, output a probability of potential image classes.

Our approach
This section includes the dataset description, data preprocessing, data augmentation, network architecture of the proposed model, the experimental results, and the discussion.

Dataset
Cholec80 [31] is a cholecystectomy surgery videos dataset containing 80 videos, performed by 13 surgeons. Video resolution is 1920 × 1080 pixels with 25 Frames Per Second(fps) as frame rate. Video length is varied between 12 min 19 s (minimum) and 1 hour 39 min 55 s (maximum), with 38 min and 26 s on average and more than 51 hours of surgery in total. Cholec80 is fully annotated with image-level surgical tool labels for binary detection.
In Cholec80, seven tools were used and annotated (Fig. 4) show an example of the seven surgical tools present in the dataset, namely: specimen bag, bipolar, scissors, clipper, hook, grasper, and irrigator. As the images are collected using different laparoscopes and from different surgeons, they come with different angles and resolutions, and sometimes they have a poor resolution, focus, or blurred. In addition, the tool is labeled as present if half of it appears in the image. One binary label is provided per image and per tool as an annotation (Multilabel classification).

Data preprocessing
Videos are processed using FFmpeg 3.0 and all video streams are encoded with libx264, using 25 frame per second (FPS). Firstly, the video width is scaled to 480, and the height is determined to maintain the aspect ratio of the original input video. Next, the audio is isolated from all videos.
Since videos are brut and not edited, they have several empty and irrelevant frames at multiple scenes (beginning and the end of the videos). Furthermore, these frames are noisy and computationally expensive. Therefore, we cut these nonrelevant frames using a background detection model. The latter was trained to identify unimportant segments that were captured outside the body. Next, these frames are used to recognize the real start and end of the surgery in the original video and cut it down.
This step and the final verified video files are automatically processed and stored in local computer.
Since the Cholec80 dataset is labelled with 1 fps tool presence annotation, we split the preprocessed videos in 1 frame per second images. Splitting videos is a display technique that reposes on fractioning the video into images.
Finally, and given that neural networks receive inputs of the same size, all images need to be resized to a fixed size before inputting them to the CNN. Moreover, the image dimension is often reduced in order to fit a reasonably sized batch in GPU memory. Thus, each image is resized to 250 × 250pixels.

Data augmentation
In Cholecystectomy surgery, some tools are used more frequently than others. Consequently, Cholec80 video frames belonging to those tools outnumber the video frames belonging to the other tools, leading to unbalanced data. This issue affects the generalization of the model and reduces the CNN efficiency to classify the different tools.
To overcome this problem, image augmentation techniques are used to increase the size of the minority classes. Images are augmented by affine transformations and blurring (Fig. 5). We consider those transformations that preserve tool presence, like:  As shown in Fig. 5, the total number of images before the this phase was 245086 frames, and after the augmentation it becomes 310054 frames.

InceptionResnetV2
The residual neural network (ResNet) is an artificial neural network that was probably the most revolutionary work in computer vision/deep learning in recent years. Resnet made a state-of-art results on many computer vision applications, such as object detection and face recognition. From Alexnet [32] in 2012, that win the SVRC2012 classification contest, Researchers focused on developing Deep Residual Networks. VGG Network (19 layers) and GoogleNet (22 layers) are state-of-arts CNN architectures. They tried to go deeper and deeper, however, piling layers does not mean increasing network depth. The vanishing gradient problem makes the neural network very hard to train, that's why Resnet introduce "Identity Shortcut Connection" , to skip layers. The authors of [33] argue that stacking layers shouldn't degrade the network performance, because we could simply stack identity mappings (layer that doesn't do anything) upon the current network, and the resulting architecture would perform the same.
Another neural network that was a milestone in the CNN development is the Inception network. CNNs tends to stack convolution layers. Unlike CNNs, inception uses other solutions to get a better performance in terms of speed and accuracy.
InceptionResnet-V2 (Fig. 6) is a hybrid inception network that combines inception and residual networks (both are SOTA architectures), to boost the performance. InceptionResnetV2 is trained on more than one million images from the ImageNet dataset [34]. The network has a default input size of 299-by-299.

Transfer learning
Transfer learning is taking a network pretrained on a dataset and apply it to recognize new image/object categories. Essentially, we can exploit the robust, discriminative filters learned by state-of-the-art networks on challenging datasets (such as ImageNet or COCO), and use these networks to recognize objects the model was never trained on.
In deep learning, feature extraction and fine-tuning are two types of transfer learning: -Transfer learning via feature extraction is done by freezing all the convolutional neural network layers, and changing only the classification layer, which is the final layer. The pretrained network is used to extract the input image features. This technique is used when the new data are similar to the original training dataset (Imagenet). -On the other hand, fine-tuning requires more modifications than feature extraction (Fig. 7). The layers are initialized with the pre-trained neural network model weights.
The model architecture is updated by removing the fully connected layer heads and replacing it with a new one, then training it to predict the input classes. Furthermore, some of the last layers could be unfrozen, in order to perform a second pass of training. Freezing means that these layer weights will not be updated in the training process. This technique is used when data similarity is low between the original training dataset (Imagenet) and our images. That is why, in our case we fine-tuned Inception-ResnetV2.

Fine-tuning inceptionResnetV2
In the fine-tuning approach, the representations learnt by the previous network are used to extract the meaningful features of the new dataset images and the activation maps generated from the last convolutional layer are fed to the newly constructed fully connected network which acts as the classifier. Therefore, the first step is to truncate the fully connected node at the end of the pretrained network (Softmax layer), and change it with a new freshly initialized sigmoid layer that is compatible with our multilabel classification task. This will predict a probability of class membership for the seven labels and assign a value between 0 and 1. The sigmoid function is calculated as : After leaving off the fully connected (FC) head of InceptionResnetV2 that contains 1000 classes (the 1000 output classes of the ImageNet dataset), we construct a new FC layer (classifier layer), with seven classes (surgical tools number), and append it to our model.
Next, in order to learn very generic features, we freeze the first early blocks. Moreover, it lets the network capture common features like edges and curves that are applicable to our new classification task. Additionally, this step ensures that any previous robust features learned by the CNN are not destroyed.
Then, we train the new FC head that is connected to the model to take the lower level features from the front of the network and map them to the desired output classes. Once this has been done, we unfreeze some layers of the top layer of the frozen model, by setting these layers as "trainable=True", and continue training, so that in further SGD epochs their weights can be fine-tuned for the new task too. Furthermore, we used a smaller learning rate to train the network because we expect that the pretrained weights are quite good already as compared to randomly initialized weights.
Finally, the input size of the model was changed to (250, 250, 3), with 250 * 250 as our frame dimension and 3 is our color channels (RGB).

Implementation parameters
As we mention in section A-1, we used Cholec80 for performance evaluation. 60 videos were assigned to the training set (producing 241 842 image), while 20 videos (producing 68 212 images) were assigned to the test set. The grasper and hook appear more often than other tools, leading to unbalanced data. Image augmentation techniques are applied to overcome this problem, as described in the A-2 section.
We fine-tuned InceptionResnet-v2. The latter is pretrained on ImageNet dataset. The fine-tuning process is defined in an earlier section.
The tool detection is multilabel and multiclass task, as different tools can be present at the same time. Our model is trained using stochastic gradient descent, while binary Cross-Entropy is used as a loss function and Sigmoid is used as a final activation function.
The training process was run for 70K iterations on a batch size of 32 images, with an initial learning rate of 0.001, that we decay it by a factor of 10 after 12K iterations. (1) The network is trained on Intel ® Core TM i7-9700K processor, 16GB memory and NVIDIA Geforce GTX 2080.

Results and discussion
We implemented Convolutional Neural Network for surgical tool classification. This network is trained on Cholec80, which is a cholecystectomy surgery videos dataset containing 80 videos. Figure 5 shows that the instances of surgical tool presence in laparoscopy is massively unbalanced. Among the seven tools, the grasper and the hook are present majority of the time, although scissors has very fewer samples. Giving any network such unbalanced data polarizes the network to the surgical tools which have high number of samples. Therefore, we used some augmentation techniques in order to overcome this problem.
The performance of our model is measured by the average precision (AP) metric, presented in Eq. (2). It is a popular metric in measuring the accuracy of object detectors. Each frame is annotated with a single or multiple surgical tools. Accuracy is calculated by comparing the ground truth annotations with the predicted labels. Additionally, we used Precision-Recall curves. Precision-Recall curves are used instead of the ROC(Receiver Operating Characteristic) curves when there is a class unbalance (Fig.8).
The average precision (AP) is calculated as: The performance of each model is calculated by the mean Average Precision (mAP), which is the mean of the average precision score for each surgical tool. It is calculated as : where 7 is the number of surgical instruments.
The classification performance for all instruments using the augmented data is shown in Table 1. It specifies the average precision of the trained model for classifying the seven surgical tools.
In this work, we overcome the unbalanced data problem of the Cholec80 dataset. However, training the model to recognize minor classes (e.g., scissors, irrigator) was a real challenge. It can be seen that the hook obtains the greatest average precision. Two possible explanations are that it has the highest samples number, and it has a unique shape, making it easily recognizable from other tools. Moreover, the specimen bag, bipolar, clipper, and the grasper performed well. Irrigator is often misclassified, maybe due to its universal shape along with infrequent and irregular presence. The latter is an instrument used for flushing and suctioning a space, it is typically used essentially when blood concentrates in the surgical area. When taking a closer look at the average precision between these instruments and scissors, remarkable differences can be regarded, reflecting the fact that the latter has the lowest sample number and its common two-pronged shape.
To evaluate the performance of the proposed model, we compared the results with those of previous studies. Table 2 outlines and compares our model with the best  performing related models. To differentiate our model from the models, we used the average precision computed for all classes, and the mean average precision for each model. Sahu [35] and Twinanda [31] fine-tuned AlexNet on Cholec80 dataset, without any data augmentation techniques. While Amy [12] used VGG16, which is is an extension of Alexnet, containing eight weight layers, with a new architecture including 16 weight layers, and the authors performed data augmentation by randomly flipping frames horizontally. On the other hand, Jo [36] used YOLO9000 with motion vector prediction, and Kanakatte [37] approach was based on Resnet.
It can be seen that our model yields significantly better results than Sahu [35], EndoNet [31], Amy [12], and Jo [36] architectures, especially in the scissors, which is the less represented class in Cholec80. This may be due to the multiple augmentation techniques used. We can also observe that our model outperforms all models in classifying almost all the surgical tools. Two potential justification for this: in the pre-processing phase, we used more augmentation techniques, that's justify what the average precision is higher. In addition, InceptionResnet-v2 is network of 164 layers deep, having much more convolutional layers than VGG16 and AlexNet. Each convolutional layer's parameters consist of a set of learnable filters, allowing the network to learn much more information.
The results demonstrated that leaving off the classic pre-processing techniques improved the classification outcomes. This is principally crucial in the case of classification projects with highly unbalanced data. Data augmentation overcomes this issue, leading to a higher average precision than the models trained without data augmentation. However, one drawback of adding more data is the demand on resources and computational intensiveness, which can increase the time of total data preparation and training time.

Conclusions and future works
In this research, we overcome the unbalanced data problem in the publicly available laparoscopy video dataset Cholec80. We proposed multiple data augmentation techniques. Moreover, we exploited the techniques of transfer learning, especially the fine-tuning approach. Then, we choose to train the augmented data with a fine-tuned Inception-Resnet-v2 network. The latter relies on convolutional neural networks (CNNs) and it is pretrained on ImageNet dataset. Our Model has shown that the classification task works reliably. It was observed that the proposed model outperform the other methods in terms of classifying the surgical tools with an accuracy of 96.58, 95.04, 99.68, 81.24, 95.11, 93.71, and 94.92% for the surgical instruments Grasper, Bipolar, Hook, Scissors, Clipper, Irrigator and SpecimenBag, and an average mean precision of 93.75%. Due to this high average precision of our model, it can be used for Computer-assisted instruction (CAI) as a basis of automatic surgical video indexing.
In future works, we will test some other neural network architectures, such as NAS-Net, DenseNet, and EfficientNet, and try it on other surgical datasets, with other augmentation data techniques. Furthermore, we plan to investigate the impact of the augmentation data phase on the processing time.