The impact of ensemble learning on surgical tools classification during laparoscopic cholecystectomy

Laparoscopic surgery also know as minimally invasive surgery (MIS), is a type of surgical procedure that allows a surgeon to examine the organs inside of the abdomen without having to make large incisions in the skin. It unifies the competence and skills of highly trained surgeons with the power and precision of machines. Furthermore, surgical instruments are inserted through the abdomen with the help of a laparoscope, which is a tube with a high-intensity light and a high-resolution camera at the end. In addition, recorded videos from this type of surgery have become a steadily more important information source. However, MIS videos are often very long, thereby, navigating through these videos is time and effort consuming. The automatic identification of tool presence in laparoscopic videos leads to detecting what tools are used at each time in surgery and helps in the automatic recognition of surgical workflow. The aim of this paper is to predict surgical tools from laparoscopic videos using three states of the arts CNNs, namely: VGG19, Inception v-4, and NASNet-A. In addition, an ensemble learning method is proposed, combining the three CNNs, to solve the tool presence detection problem as a multi-label classification problem. The proposed methods are evaluated on a dataset of 80 cholecystectomy videos (Cholec80 dataset). The results present an improvement of approximately 6.19% and a mean average precision of 97.84% when the ensemble learning method is applied.

refers to any surgical procedure that is performed through tiny incisions instead of a large opening.
Furthermore, there are three major types of MIS: • Endoscopy: The surgeon uses the endoscope itself to do the procedure. The endoscope goes in through the body's natural openings, without the surgeon making any cuts. • Robot-assisted surgery (robotic surgery): The surgeon makes several small cuts to guide the endoscope and robotic tools into the body. From there, the surgeon controls the surgery while sitting nearby a computer console. • Laparoscopy: The surgeon uses a laparoscope, which is a thin tube fixed with a light and camera and several other small surgical instruments.
During laparoscopic surgery, the surgeon makes several small incisions in the skin, typically a few millimeters long. Furthermore, a laparoscope (a long thin tube fixed with a camera and light) is inserted into one of the incisions. Moreover, images from the endoscope are shown on monitors in the operating room so surgeons can get a clear and magnified view of the surgical area. Finally, special surgical tools or instruments are inserted through other small incisions. The surgeon uses these to explore, remove, or repair a problem inside the body. On the other hand, it is well known that junior surgeons are not allowed to operate on real patients. Therefore, surgery videos are a precious tool to learn procedures [2]. Additionally, those videos are widely used in the teaching field, also, there are plenty of online platforms that offers learning surgery procedures, including YouTube. In addition, the video recorded during the surgery offers limitless possibilities. In some countries, it is mandatory to store the surgery as evidence [3]. Moreover, human error and negative events can go unnoticed during the surgery, that's why Surgical Quality Assessment (SQA) using video recorded surgery is essential. SQA is an internal verification approach, to evaluate the surgeons via the surgery videos, and give a review based on some standardized rating checklist. However, MIS videos tend to last several hours. Thus, navigation and searching through these videos are cumbersome and time-effort consuming.
In this work, we tackle these problems by investigating the detection and the classification of surgical tools in laparoscopic surgery. To achieve this goal, we propose numerous models (Fig. 1) based on convolutional neural networks, namely: VGGNET [4], Inception v-4 [5], and NASNet-A [6]. In addition, we combine these three stateof-the-art neural networks to study and investigate the impact of ensemble learning on deep learning models applied to laparoscopic surgery. Therefore, after training each of the three networks, we use average ensembling technique on the models to avoid over fitting. Furthermore, the multilabel classification methods for surgical tool presence detection were tested on the benchmark dataset Cholec80.
The rest of this paper is organized as follows: in "Related works" section, we present the related work; in "Background and methods" section we present the material and methods; in "Experiments and results" section, we provide the experiments and results; and in "Conclusion and future works" section we draw our conclusions and future works.

Related works
Recently, deep learning models have been performing very well and reached exceptional results in different medical image analysis tasks. In particular, convolutional neural networks (CNN) have become the standard approach to handle this kind of problem [7][8][9][10][11]. Furthermore, in the literature, numerous deep learning approaches have been proposed for automated laparoscopic surgical tools classification and segmentation. Pan Shi [12] proposed an attention-guided convolutional neural network (CNN) for frame-by-frame detection of surgical tools in MIS videos, which comprises a coarse (CDM) and a refined (RDM) detection module. The proposed method was tested on two public datasets (EndoVis Challenge and ATLAS Dione). Sheng Wang [13] implemented a deep learning based multilabel classification method for surgical tool presence detection in laparoscopic videos. Moreover, it combines two state-of-the-art deep neural networks (VGGNet and Google Net). This method was tested on M2CAI surgical tool presence detection challenge. Kletz [14] evaluated the achievable performance of segmenting surgical instruments from their background by using a region-based fully convolutional network, for instance-aware instrument segmentation as well as instrument recognition. Jalal [15] designed a deep learning pipeline, namely, a convolutional neural network (CNN) and a nonlinear autoregressive network with exogenous input (NARX), designed to predict surgical phases from laparoscopic videos. A convolutional neural network (CNN) is used to perform the tool classification task by automatically learning visual features from laparoscopic videos. The output of the CNN, which represents binary usage signals of surgical tools, is provided to a NARX neural network that performs multistep-ahead predictions of surgical phases. Wang [16] created a framework for one-stage object detection based on a sample adaptive process controlled by reinforcement learning. Zhang [17] assembled a Modulated Anchoring Network for the detection of laparoscopic surgery tools based on Faster R-CNN. This approach was tested on the m2cai16tool-locations dataset. Namazi [18] made a multilabel classifier, called LapTool-Net to detect the presence of surgical tools in each frame of a laparoscopic video, based on a Recurrent Convolutional Neural Network (RCNN) architecture to simultaneously extract the spatiotemporal features. To overcome the high imbalance and avoid overfitting caused by the lack of variety in the training data, a high down-sampling rate is chosen based on the more frequent combinations. Chittajallu [19] presented a self-supervised method for instrument segmentation using the kinematic model of the robot as a source of information. The authors implemented a Fully Convolutional Neural network (FCN) on VIVO dataset obtained from a robotized endoscopy system, resulting an average precision of 91%. Kletz [20] developed a Real-Time Instrument Segmentation tool in robot-assisted minimally invasive surgery (RMIS) using multiresolution Feature Fusion (MFF) block and a light-weight CNN to identify the surgical tool. The dataset used by the author is MICCAI 2017. Shvets [21] presented deep learning-based solution for the robotic instrument semantic segmentation using U-NET, TernausNet, and LinkNet. This approach was tested on MICCAI 2017 Endoscopic Vision SubChallenge: Robotic Instrument Segmentation dataset. Kanakatte [22] proposed a deep neural network that combines the advantage of spatial representation using CNN and temporal information using LSTM. It is an instance segmentation algorithm, which segments and localizes the surgical tool using a spatiotemporal deep network, using a pre-trained ResNet and VGGNet. Lshirbaji [23] proposed a deep learning-based approach to detect surgical tools in laparoscopic images using a convolutional neural network (VGG16) in combination with two long short-term memory (LSTM) models. Furthermore, a pre-trained CNN model was trained to learn visual features from images. Then, LSTM was employed to include temporal information through a video clip of neighbour frames. Finally, the second LSTM was utilized to model temporal dependencies across the whole surgical video. Experimental evaluation has been conducted with the Cholec80 dataset to validate our approach, resulting an average precision of 91.04%.

Machine learning and deep learning
Machine learning (ML) is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Moreover, ML focuses on the development of a self-adaptive algorithm that gets increasingly better analysis and patterns with experience or with newly added data.
On the other hand, deep learning (DL) is a subfield of machine learning that attempt to simulate the behavior of the human brain in processing data and creating patterns using artificial neural networks. Furthermore, it utilizes a hierarchical level of artificial neural networks to carry out the process of machine learning. The artificial neural networks are built like the human brain, with neuron nodes connected together like the web. While traditional programs build analysis of data in a linear way, the hierarchical function of deep learning systems enables machines to process data with a nonlinear approach. Deep learning drives many artificial intelligence (AI) applications and services that improve automation, performing analytical and physical tasks without human intervention.
The main differences between ML and DL are [24]: • Feature extraction (Fig. 2): A machine learning workflow starts with relevant features being manually extracted from images. The features are then used to create a model that classifies the objects in the image. However, with a deep learning workflow, relevant features are automatically extracted from images. • Data size matter: Deep learning algorithms scale with data which means that they often continue to improve as the size of your data rises. On the other hand, most shallow learning (machine learning) methods stop improving the accuracy at a certain level of performance when you add more examples and training data to the network. • Input data: Machine learning algorithms almost always require structured data, while deep learning networks rely on layers of ANN (artificial neural networks).
In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using a large set of labeled data and neural network architectures that contain many layers. While deep learning was first theorized in the 1980s, there are two reasons it has only recently become useful: • Deep learning algorithm requires a large amount of labeled data [25]. • Deep learning algorithm requires substantial computing power [26].
One of the most popular types of deep neural networks is known as convolutional neural networks (CNN or ConvNet).

Computer vision and convolutional neural networks
The internet is comprised of text and images. It is relatively straightforward to index and search text, but to index and search images, algorithms need to know what the images contain. To get the most out of image data, we need computers to "see" an image and understand the content.
Computer vision is a field of study focused on the problem of helping computers to see. Furthermore, it is one of the most powerful and compelling types of AI. It attempts to replicate parts of the complexity of the human vision system and enables computers to identify and process objects in images and video in the same way that humans do. Thanks to advances in deep learning and neural networks and the availability of datasets, the field has been able to take great leaps in recent years and has been able to surpass humans in some tasks related to detecting and labeling objects. Along with a tremendous amount of visual data (more than 3 billion images are shared online every day), the computing power required to analyze the data is now accessible. Computer vision has succeeded in myriad high-level problems, like: medical imaging, surveillance, motion capture, fingerprint recognition and biometrics, automotive safety, optical character recognition (OCR), and much more.
The convolutional neural network (CNN) [27] is most commonly used algorithm to analyze visual imagery. CNN is one of the most popular deep neural networks. They are made up of neurons that have learnable weights and biases. Furthermore, they consist of an input layer, hidden layers and an output layer. In all feed-forward neural networks, any middle layer is called hidden, on the other hand, in CNN, at least one hidden layer must perform convolution. In addition, CNNs have two major components: • The Hidden layer/Feature extraction part: In this part, the network will perform a series of convolutions and pooling operations during which the features are detected. • The Classification part: Here, the fully connected layers will serve as a classifier on top of these extracted features by assigning a probability for the predicted object on the image.
Moreover, we use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer ( Fig. 3): • Input layer: Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons also called an activation map (a highlight of the relevant features of the image) using a feature detector. A feature detector, also known as a kernel or a filter, moves across the receptive field of the image, checking if the feature is present. Furthermore, the first layers detect basic features such as horizontal and vertical edges, while the next layer extracts more complex features. This process is known as convolution. The feature detector is a two-dimensional (2-D) array of weights, which represents part of the image. While they can vary in size, the filter size is typically a 3 × 3 matrix. The most common activation function used in this layer is ReLU. • Pooling layer: conducts dimensionality reduction, by reducing the number of parameters in the input. The kernel applies an aggregation function to the values within the receptive field, populating the output array. There are two main types of pooling: Max pooling and average pooling. Max pooling selects the pixel with the maximum value to send to the output array, while average pooling calculates the average value within the receptive field to send to the output array. • Fully connected layer: in the fully connected layer, each node in the output layer connects directly to a node in the previous layer. This layer performs the task of classification based on the features extracted through the previous layers and their different filters. The classification layer outputs a set of confidence scores using generally a softmax and sigmoid activation functions that specify how likely the images belong to a "class".
Thus, CNNs eliminate the need for manual feature extraction to identify features used to classify images. The CNN works by extracting features directly from images. This automated feature extraction makes deep learning models highly accurate for computer vision tasks such as object classification.

Training a convolutional neural network
In order to train a convolutional neural network to perform object classification, three common ways are used (Fig. 4): • Training from scratch: refers to building a deep network such as CNN in order to learn the features and model. In addition, to train it from scratch, a very large labeled data set is needed. This is a less common approach because with the large amount of data and rate of learning, these networks typically take days or weeks to train. • Transfer learning [28]: is a concept where you transfer the weights of an already trained model (pre-trained) to another problem set on a different dataset. Furthermore, it refers to exploiting a pre-trained model as feature extractor, and taking advantage of features learned by a model trained on a larger dataset in the same domain. This is done by removing the last fully-connected layer, then instantiating a fresh fully-connected layer with an output corresponding to the number of our problem classes. The pre-trained model is "frozen" and only the weights of the classifier get updated during training. In this case, the convolutional base extracted all the fea-tures associated with each image, and you just trained a classifier that determines the image class given that set of extracted features. • Fine-tuning: In this approach, we not only replace and retrain the classifier on top of the ConvNet on the new dataset, but also we fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it's possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet becomes progressively more specific to the details of the classes contained in the original dataset.
In addition, using transfer learning and fine-tuning, we can exploit the robust of discriminative filters learned by state-of-the-art networks on challenging datasets (such as ImageNet [29] or COCO [30]), and use these networks to recognize objects the model was never trained on.
In this research, we fine-tuned three state-of-the-arts algorithms, namely: VGG19, Inception v-4, and NASNet-A.

VGG19
VGG was a breakthrough in the world of Convolutional Neural Networks after the AlexNet [31]. Furthermore, it was created by Visual Geometry Group at Oxford. The main contribution of VGG is to show that classification/localisation accuracy can be improved by increasing the depth of CNN instead of using small receptive fields in the layers. Moreover, VGG19 (Fig. 5) is a convolutional neural network that is 19 layers deep, consisting of: • A fixed input size of 224 * 224. For a RGB images, the input network is changed to 224 * 224 * 3, as 224 is the height and the weight, and the 3 is the RGB channels.

Inception v-4
Inception-v4 (Fig. 6) is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than Inception-v3.
Inception module was firstly introduced in Inception-v1 and GoogLeNet. The input goes through 1 × 1, 3 × 3, and 5 × 5 conv, as well as max pooling simultaneously and concatenated together as output. Thus, we don't need to think of which filter size should be used at each layer.
The inception-v4 introduced also a Batch normalization (BN), where ReLU is used as activation function to address the saturation problem and the resulting vanishing gradients. In addition, 5 × 5 conv was replaced by two 3 × 3 convs for parameter reduction while maintaining the receptive field size. Furthermore, a factorization concept is also introduced in convolution layer to reduce the dimensionality, so as to reduce the overfitting problem.

NASNet-A
Equipped with abundance of computing power and engineering genius, Google introduced NASNet (Neural Search Architecture Network), which framed the problem of finding the best CNN architecture as a Reinforcement Learning problem.
Reinforcement Learning is a type of machine learning technique that enables an agent to learn the best actions possible in a virtual environment in order to attain their goals using feedback from its own actions and experiences. Furthermore, the idea was to search the best combination of parameters of the given search space of filter sizes, output channels, strides, number of layers, etc. In this Reinforcement Learning setting, the reward after each search action was the accuracy for the searched architecture on the given dataset.
In NASNet, though the overall architecture is predefined in Fig. 7, the blocks or cells are not predefined by authors. Instead, they are searched by reinforcement learning search method. Moreover, the number of motif repetitions N and the number of initial convolutional filters are as free parameters, and used for scaling.
Specifically, these cells are called Normal Cell and Reduction Cell, where normal cell is a convolutional cell that return a feature map of the same dimension, and the reduction cell is a convolutional cell that return a feature map where the feature map height and width is reduced by a factor of two.
NASNet achieved state-of-the-art result in the ImageNet competition. However, the computation power required for NASNet was so big that only a handful of companies were able to make use of the same methodology.
In order to form the NASNet-A [6] (Fig. 8), the processing time took over 4 days using 500 GPUs resulting in 2000 GPU-hours.

Dataset
The healthcare industry has benefited greatly from deep learning capabilities ever since the digitization of hospital records and images. Furthermore, recent advances in artificial intelligence and availability of surgical datasets hold potential to enhance surgical training, improve surgeon performance and ultimately surgical outcomes. This is particularly true for minimally invasive surgery (MIS).
Gallbladder removal, or cholecystectomy, is performed more than 750,000 times annually in the US alone, mainly for benign gallstone disease which affects 10-15% of adults. The vast majority of gallbladder resections are done in relatively universal and predefined steps. In this work, we use a cholecystectomy surgery video dataset, called Cholec80. Cholec80 [32] is a cholecystectomy surgeries videos dataset containing 80 videos, performed by 13 surgeons These videos are encoded with a frame rate of 25 fps (frames per second) and a video resolution of 1920 × 1080 pixels. Video length is varied between 12 min 19 s (minimum) and 1 h 39 min 55 s (maximum), with 38 min and 26 s on average and more than 51 h of surgery in total. Additionally, cholec80 is fully annotated with frame-level surgical tool labels for binary surgical tools detection.
In cholec80, seven tools were used and annotated ( Fig. 9): specimen bag, bipolar, scissors, clipper, hook, grasper, and irrigator. Each tool is labeled as present if half of it appears in the image. One binary label is provided per image and per tool as an annotation (Multilabel classification).

Experiments and results
In this section, we present the experiments carried out using the presented dataset and the models described in the previous section. First, we describe how we pre-processed the dataset. Next, we present the common setup for the experiments, and a comparison between the models using the three proposed neural network and the ensemble learning approach. Finally, we discuss and compare the outcomes with the state-of-the-art methods.

Data pre-processing
All videos in this study are processed in the same way. In the beginning, videos are processed using FFmpeg 3.0 and all video streams are encoded using 25 frames per second (FPS).

Removing irrelevant frames
Since videos are not edited by any professional, they have several empties and irrelevant scenes, principally at the beginning and the end of the videos. Furthermore, these frames are noisy and computationally expensive. Therefore, we cut these non-relevant frames using a background detection model. The latter was trained to identify unimportant segments that were captured outside the body. Next, these frames are used to recognize the real start and end of the surgery in the original video and cut it down. This step and the final verified video files are automatically processed and stored in local computer.

Splitting videos and resizing frames
The new dataset is now clean and does not contain irrelevant frames. Therefore, we split the preprocessed videos into 1 frame per second images, to match the tool presences annotations.
Finally, and given that neural networks receive inputs of the same size, all images need to be resized to a fixed size before inputting them to the CNNs. Moreover, the image dimension is often reduced in order to fit a reasonably sized batch in GPU memory. Thus, for each proposed neural network, the input images are resized to fit their input shape. Therefore, all images are resized to 224 × 224 pixels for VGG19, 299 × 299 pixels for the inception v4, and 331 × 331 pixels for the NASNet-A.

Data augmentation
In Cholecystectomy surgery, some tools are used more frequently than others. Consequently, Cholec80 video frames belonging to those tools outnumber the video frames belonging to the other tools, leading to unbalanced data (Fig. 10). This issue affects the generalization of the model and reduces the CNN efficiency to classify the different tools.
To overcome this problem, image augmentation techniques are used to increase the size of the minority classes. Images are augmented by affine transformations and blurring. We consider those transformations that preserve tool presence, namely: • Rotation: Minority class images are rotated at an angle of zero, 40, 85, 125, 250, and 300. • Mirroring: Mirror the image along the x-axis and y-axis. • Shearing: The images were shifted at 40° in the counter-clockwise direction.
• Padding: Padding 5px on each border, using the reflect mode, which pad with the reflection of image without repeating the last value on the edge.

Experimental setup
To verify the impact of the ensemble learning approach, we carried out the experiments considering two scenarios: • Scenario 1: The models VGG19, Inception v4, and NASNet-A are fine-tuned and used as a classifier. • Scenario 2: The model combines the three neural networks using ensemble learning.
As noted earlier, fine-tuning a neural network means that the representations learnt by the previous network are used to extract the meaningful features of the new dataset images and the activation maps generated from the last convolutional layer are fed to the newly constructed fully connected network which acts as the classifier.
Thus, for each model, we use its original first blocks as feature extractors. For both scenarios, we freeze all the layer except the last ones, in order to backpropagate the gradient and update the weights of the last layers by setting. In order to unfreeze the last blocks, we set these layers as 'trainable = True' .
Next, we remove the fully connected layer at the end of the pretrained networks (Softmax layer), that contains 1000 classes which is the number of ImageNet dataset labels, and create a new freshly initialized sigmoid layer that is compatible with our multilabel classification task, with seven classes (surgical tools number), and append it to our model. This will predict a probability of class membership for the seven labels and assign a value between 0 and 1.
Finally, we start training our model, so that the SGD epochs their weights can be fine-tuned for the new task. Furthermore, we used a smaller learning rate to train the network because we expect that the pretrained weights are quite good already as compared to randomly initialized weights.
The first frozen early blocks learn very generic features and lets the network capture common features like edges and curves. In addition, this step ensures that any As we mention in the section earlier, we use Cholec80 for performance evaluation. 60 videos are assigned to the training set, while 20 videos are assigned to the test set. For validation purposes, 10 videos from the training set are used.
For all the models we use Adam optimizer with an initial learning rate of 0.001, that we decay by a factor of 10 after 10K iterations.
Since different tools can be present at the same time, the tool detection is Multilabel and Multiclass task, therefore, the final activation function is a Sigmoid function, calculated as : where e = Euler's number And binary Cross-Entropy is used as Loss function, it is calculated as: where y is the label (1 for the surgical tool is present and 0 for the surgical tool is not present) and p(y) is the predicted probability of the surgical bein present for all the surgical tools. All procedures were implemented on Intel ® Core ™ i7-9700K processor, 16GB memory, and NVIDIA Geforce GTX 2080.
The performance of the proposed approaches on the classification task is measured by the average precision (AP). It is a measure that combines recall and precision for a particular surgical tool. The AP is calculated as: The performance of the each model is calculated by the mean Average Precision (mAP), which is the mean of the average precision scores for each surgical tool.
It is calculated as : where 7 is the number of surgical instruments.
Another curve used in this research is the area under the ROC. The area under the ROC curve represents the degree or measure of separability which tells how much the model is capable of distinguishing between classes. Furthermore, an Area under the ROC curve shows the relationship between clinical sensitivity and specificity for every possible cut-off. Moreover, the ROC curve is a graph with: The x-axis showing 1 -specificity, while the y-axis showing sensitivity.

Results and discussion
In this section, our main goal is to compare the performance of the deep learning models for both scenarios described earlier.
As we mention earlier, we implemented multiple Convolutional Neural Networks for surgical tool classification. These networks are trained on Cholec80, which is a cholecystectomy surgery videos dataset containing 80 videos. Each frame is annotated with a single or multiple surgical tools. Thus, accuracy is calculated by comparing the ground truth annotations with the predicted labels.
The classification performance for all the models using the augmented data is reported in Tables 1 and 2. It specifies the average precision and the Areas under the ROC curves    Unexpectedly, the VGG19 architecture performed quite good and outperformed the Inception v-4, even tough it's older and less sophisticated. Furthermore, VGG19 achieved a mean average precision of 95.85% and outperformed the NASNet-A in classifying multiple tools.
Despite the fact that Inception v-4 is more recent and have more layers than VGG19, it obtained the lowest average precisions, therefore, the lowest mean average precision with 93.01%, less than VGG19 and NASNet-A.
It can be seen that the ensemble learning network can effectively improve the detection accuracy.
Therefore, the ensemble learning approach obtained the best results among all the proposed models in classifying all the surgical tools. Moreover, the latter achieved an accuracy of 97.70%, 98.14%, 99.91%, 94.54%, 99.51%, 97.79%, and 97.29% in classifying the Grasper, Bipolar, Hook, Scissors, Clipper, Irrigator, and the SpecimenBag respectively, and a mean average precision of 97.84%.
The average precision scores for all the surgical instruments surpassed 94%. Furthermore, the hook obtained the best detection precision among all the surgical tools, with an AP of 99.91%. Moreover, one potential explication for this high value is that the hooker had a good visibility and high discrimination, making it easily recognizable from other instruments. Additionally, the clipper obtained the second highest average precision. One possible explanation is that it has a unique shape, making it easily distinguishable from other tools.
On the other hand, the grasper, bipolar, irrigator, and the specimen bag attained an average precision above 97%, this means that they occasionally misclassified.
The grasper and irrigator usually have a similar appearance to some other surgical tools and their shape is irregular, this may explain the reason of their misclassification.
Whereas the specimen bag, which is the third most represented class in cholec80, have a universal shape along with infrequent and irregular presence.
However, the scissors obtained the lowest AP with 94.54%, one possible reason is that this tool is less represented class in cholec80, even after augmentation. This problem requires an exploration and explanation in future work.
As a result, the ensemble learning approach yield significantly all the other models, leading to an improved average prediction performance and the reduction in the variance component of prediction errors made by the contributing models. In addition, the ensemble learning minimize the main causes of error in learning models which are noise, bias, and variance.
Meanwhile, according to Table 2, the NASNet-A and the VGG19 gets the best scores. Moreover, VGG19 outperforms NASNet-A in classifying the majority of the surgical tools. However, it falls gravely in classifying scissors, which is the less represented class in the Cholec80 dataset. Thus, NASNET-A obtained the best average score with 99.11%, against 99.09% for the VGG19. These scores are surprising owing to the unsophisticated architecture of the VGG19.
On the other hand, Inception v-4 gets the worst results with and an average score of 98.58%.
Again, the ensemble learning approach outperforms all the models in classifying all the surgical tools. Furthermore, it obtains the scores: 96.88%, 99.81%, 99.93%, 99.64%, 99.95%, 99.85%, and 99.81% in classifying the Grasper, Bipolar, Hook, Scissors, Clipper, Irrigator, and the SpecimenBag respectively, and a mean average precision of 99.41%. Thus, this value means that the proposed classifier is an excellent classifier and it is able to detect more numbers of True positives and True negatives than False negatives and False positives.
To evaluate the performance of the best proposed model, we compared the results with those of previous studies. Table 3 outlines and compares our model with the best performing related models. The average precision computed for all classes and the mean average precision for each model are used to differentiate our model from the others.
It can be seen that our approach outperformed considerably the compared methods, in classifying all the surgical tools.
Compared to other Network, our method's accuracy was improved by 6.19%, which was largely attributed to the introduction of the ensemble learning.
In addition, the augmentation techniques used in the preprocessing phase increased the AP of the less represented class, especially the scissors, which obtained an accuracy of 4% more than best model classifying scissor.
In summary, all of these results demonstrate that our approach had an impressive capability in classifying surgical tools, and surpassed all the state-of-the-art detection algorithms mentioned above by a large margin.

Conclusion and future works
In this paper, we presented a study to analyze the impact of ensemble learning on the frame-by-frame detection of surgical tools in MIS videos. Furthermore, we implemented three convolutional neural networks, namely: VGG19, Inception v-4, and NASNet-A. First, we preprocessed the Cholec80 dataset videos and overcame the unbalanced data problem by using multiple data augmentation techniques. Next, we combined these three state-of-the-art CNNs in an ensemble learning model, then we compare it to a fine-tuned version of each one of them.
The experimental results show that the ensemble learning method yields significantly the other models in classifying surgical tools. In addition, the proposed method outperformed all the compared state-of-the-art algorithms on surgical tools detection in minimally invasive surgery videos, with an improvement of 6.19% in terms of mAP score in frame-level tool detection. For future work, we expect that we improve the accuracy and tackle the real-time nature of the object detection model to implement it in the on-site online learning, assisted direction of surgery, and the surgical quality assessment.