Multi Region-Based Feature Connected Layer (RB-FCL) of deep learning models for bone age assessment

Predicting bone age from a hand x-ray is one of the methods used in medicine to support the diagnosis of endocrine gland disease, growth abnormalities, and genetic disorders. Decision support systems that predict bone age from x-ray images have been implemented with both traditional machine learning and deep learning. We propose the Region-Based Feature Connected Layer (RB-FCL), built from the essential segmented regions of the hand x-ray. We treat deep learning models as feature extractors for each region of the hand x-ray. The Feature Connected Layers are the outputs of models trained on the important regions: 1-radius-ulna, 2-carpal, 3-metacarpal, 4-phalanges, and 5-epiphysis. DenseNet121, InceptionV3, and InceptionResNetV2 are the deep learning models that we trained on these critical regions. In our evaluation, the proposed method achieves a Mean Absolute Error (MAE) of 6.97 months, which is better than the 9.41 months obtained with standard deep learning models.

In the last decade, automated evaluation of bone age has become essential to reduce the problems of manual bone age estimation [7]. The main challenge is choosing the most appropriate method for building a bone age prediction system. In general, two approaches are available. The first uses image processing to retrieve features that reflect bone development; these features are then fed to a machine learning algorithm that makes the prediction. This process is commonly referred to as traditional machine learning or the handcrafted method [8]. The second approach uses deep convolutional neural networks. Feature extraction happens automatically in the convolution layers, so bone age can be predicted directly from the image. Among the handcrafted approaches, Davies et al. implemented the TW method by extracting edges and critical points as local image features [9]. Zhang et al. also worked with local image features and applied fuzzy classification to predict bone age [10]. Somkantha et al. extracted carpal bone edges and used a Support Vector Regressor to estimate bone age. Histogram of Oriented Gradients (HOG) and Bag of Visual Words features have also been classified with the Random Forest algorithm [11].
Two established techniques that radiologists use for bone age assessment are the Greulich-Pyle (GP) [12] and Tanner-Whitehouse (TW) [13] methods. The GP method relies on an existing hand atlas that contains reference x-ray images for ages 0 to 18 years. The radiologist matches the acquired x-ray image against the atlas references. This method is straightforward and can be used by many radiologists. However, it has a weakness: the outcome may vary from one radiologist to another.
The TW method evaluates the significant regions of the bone x-ray. Regions of Interest (ROI) are used to examine the parts of the bone that determine bone development. Those parts are the Ulna, Epiphysis, Metaphysis, Radius, Phalanx, and Metacarpal, which are shown in Fig. 1.
Spampinato et al. used a deep learning approach to predict the bone age of children and teenagers [14]. They experimented with several deep learning models, for example BoNet, GoogLeNet, and OxfordNet. Their experiments achieved an MAE of around 9.6 months. The data came from the public Digital Hand Atlas, from which 1391 x-ray images were used [15].
This paper consists of five sections. The first section gives the introduction and background. The second section explains our research position and reviews the literature. The third section explains our proposed method. The fourth section presents the experimental results, and the last section contains our discussion.

Related works
Castillo et al. estimated bone age using the VGG-16 model [16]. They used the RSNA dataset, which consists of 12,611 x-ray images. The MAE of their experiment was 9.82 months for male patients and 10.75 months for female patients. Lee et al. contributed a pipeline that standardizes and pre-processes radiographs, segments the Region of Interest, and estimates bone age. Their evaluation reported 57.32% and 61.40% accuracy for predicting the age of female and male patients, respectively [6]. Their dataset consists of 4047 male and 4278 female x-ray images. Wang et al. took a different approach to bone age assessment [17,18]. Following medical references, they categorized bone parts based on the development of the bone components that appear in x-ray images. They used a Faster Region-based Convolutional Neural Network as the deep learning model [19], with 600 samples for the radius and 600 samples for the ulna, and obtained 92% accuracy for the radius and 90% for the ulna.
Son et al. contributed to automating the Tanner-Whitehouse 3 (TW3) method, which is a reference in bone age evaluation [20]. They localized the bone epiphysis and metaphysis to estimate bone age. The dataset consists of 3300 x-ray images from medical clinics in South Korea. The classification results for the bone regions show top-1 and top-2 accuracy of 79.6% and 97.2%, respectively. The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are 5.62 and 7.44. Liu et al. used a different pre-processing method to estimate bone age: a Non-Subsampled Contourlet Transform (NSCT) is applied before training a deep learning model [21]. The dataset used is the public Digital Hand Atlas. Overall, the RMSE produced by this method is 8.28.
Bone data is not used only in the medical field. Bone images are also needed in paleontology and taphonomy, where they are used to learn about archeological and paleontological sites [22]. Bone age prediction in this setting helps to establish and investigate historical timelines, such as when humans began to eat meat, use stone tools, explore new continents, and interact with predatory animals. Bone surface modifications are recognized using a deep learning model, and automatic identification is performed using scratch-mark data from fleshed and defleshed bones.
Several researchers have used traditional machine learning and deep learning to estimate bone age from x-ray images. Regression has been used to estimate bone age by several authors [23][24][25]. Other work has used random forest [26], K-NN [27], SVM [28][29][30], ANN [24,31,32], and fuzzy neural networks [33]. Deep learning models have also been used to estimate bone age [6,34,35,36,54].
Another study uses a landmark-based multi-region ensemble CNN for bone age assessment [37]. That work differs from ours in the concatenation of layers, the evaluation, and the proportion of data. We combine the connected feature layers of several regions, whereas their work takes the input image directly and segments it into a few regions. Their evaluation uses each region only as a separate comparison, while we evaluate all segmented regions together through the resulting Feature Connected Layers. In terms of the dataset, they evaluate on the Digital Hand Atlas with 90% training and 10% testing. In our work, we evaluate on two public datasets, the Digital Hand Atlas dataset (1392 x-ray images) [38] and the RSNA dataset (12,814 x-ray images) [39], with 80% training and 20% testing.
In the previous reference, the proportion of training and testing data is 90% and 10% [37]. Using more data gives us the opportunity to test model performance with a smaller training share. We used two datasets: the Digital Hand Atlas dataset with 1392 samples and the RSNA dataset with 12,814 samples. With this much data, the models can be evaluated with a smaller training proportion than in [37]. With 80% training and 20% testing, the models are trained on less data but still achieve good performance.
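The 80/20 partition described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the helper name `split_indices`, the fixed seed, and the use of a plain shuffled split (rather than, say, a stratified one) are assumptions. Only the dataset sizes and the 80/20 proportion come from the text.

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into train and test lists.

    Hypothetical helper: the paper states only the 80%/20% proportion,
    not how the partition was drawn.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)  # floor of the training share
    return indices[:n_train], indices[n_train:]


# Dataset sizes quoted in the text.
splits = {}
for name, size in [("Digital Hand Atlas", 1392), ("RSNA", 12814)]:
    train, test = split_indices(size)
    splits[name] = (train, test)
    print(name, "train:", len(train), "test:", len(test))
```

Fixing the seed keeps the partition reproducible, which matters when several models and regressors are compared on the same held-out images.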
Dallora et al. review the performance of bone age assessment methods [40]. Their review reports the performance of machine learning algorithms on each dataset and gives holistic information about how well current machine learning methods estimate bone age. Bui et al. propose region detection and maturity classification to estimate bone age [41]. In their experiments on the Digital Hand Atlas dataset, the MAE is 7.1 months. Larson et al. present the performance of deep learning methods for bone age estimation [42], and Pan et al. propose bone age estimation on a large-scale hand x-ray dataset [43]. Another study uses a two-step method in which deep learning serves as the feature extractor and the features are then classified into bone age groups [44].
In this research, we segment the most important parts of the hand bones, which are the critical regions for estimating bone growth. The baseline manual methods for bone age assessment are TW and GP, which were described in the introduction. We propose segmenting the important areas suggested by the TW method: the Ulna, Epiphysis, Metaphysis, Radius, Phalanx, and Metacarpal. We follow the TW method because it evaluates significant regions of the bone x-ray rather than depending on hand atlas images as a reference. We use deep learning to extract Feature Connected Layers (FCL). The FCLs are then concatenated and passed to several regressors to predict bone age.

Proposed method
In this research, we build an expert system that predicts age from hand x-rays. We segment the essential parts of the hand x-ray. According to radiologists' references, the radius-ulna, carpal, metacarpal, phalanges, and epiphysis regions are the parts that reflect the age of the bone. The segmented regions are used to train deep learning models, which produce the Feature Connected Layer (FCL) features. Several scenarios are carried out to find the smallest prediction error. In our trials, merging the connected layer features from the segmented regions produces the smallest error, with an MAE of 6.97 months.
There are two flows for FCL fusion. In the first flow, the bone dataset is segmented into the regions that are critical for determining bone age. Each segmented region is used to train deep learning models, which produce an FCL with 1024 dense features. Flow 1 has process identifiers 1.1, 1.2, 1.3, and 1.4 in Fig. 2. Figure 2 also shows the segmentation results of the hand x-ray. Part 1, in yellow, is the radius-ulna. Part 2, in green, is the carpal region. Part 3, in purple, is the space between the metacarpals and phalanges. Part 4, in blue, is the phalanges. Part 5, in red, is the space between the phalanges, named epiphysis. In the second flow, the whole hand image is used to train several deep learning models, and we again extract an FCL with 1024 dense features. The resulting dense layer features are combined with the results of the first flow. The second flow is identified by process numbers 2.1 and 2.2.
Automatic segmentation uses a standard Faster R-CNN to separate the essential regions from the original image [45,46]. Faster R-CNN uses a Region Proposal Network (RPN) to produce region proposals, and detection takes around 0.2 s per image. Region-based training is then conducted for each region with several deep learning models, namely InceptionV3, DenseNet121, and InceptionResNetV2. These models were selected based on the evaluation of the FCL results shown in Table 2. In both the first and second flows, we use transfer learning with weights derived from x-ray training [47,48].
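Once a detector has returned one bounding box per region, separating the regions amounts to cropping the image. The sketch below illustrates only that cropping step on a toy image. The box coordinates are made up for illustration, `crop_region` is a hypothetical helper, and the Faster R-CNN detection itself is not reproduced here.

```python
def crop_region(image, box):
    """Return the sub-image inside box = (top, left, bottom, right).

    `image` is a list of pixel rows; bottom and right are exclusive.
    """
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 8x6 "x-ray" whose pixel value encodes its position (row * 10 + col).
image = [[r * 10 + c for c in range(6)] for r in range(8)]

# Hypothetical detector output: one box per region named in the text.
# Real boxes would come from the trained Faster R-CNN, not be hard-coded.
detections = {
    "radius-ulna": (6, 1, 8, 5),
    "carpal": (5, 1, 6, 5),
    "metacarpal-phalanges": (3, 0, 5, 6),
    "phalanges": (1, 0, 3, 6),
    "epiphysis": (0, 0, 1, 6),
}

regions = {name: crop_region(image, box) for name, box in detections.items()}
for name, patch in regions.items():
    print(name, len(patch), "x", len(patch[0]))
```

Each cropped patch would then be resized and passed to the region-specific model for training or feature extraction.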
To predict bone age accurately, several layers of the deep learning model are used. A deep learning model has several components, including the input layer, the convolution layers, the pooling layers, and the Feature Connected Layer (FCL). In this research, we treat the FCL as the result of feature extraction from bone images. The fused FCLs of several deep learning models are treated as input features for the regressors. Several variations of FCL integration are tested to obtain the best accuracy. The FCLs are taken from the deep learning models DenseNet121 [49], InceptionV3 [50], and InceptionResNetV2 [51,52].
The first process flow is described by Eq. 1 through Eq. 9. K is a hand x-ray image in Eq. 1, and L is the result of region segmentation using RCNN in Eq. 2. The RCNN(K) process yields five segmentation matrices, indexed i = 0, 1, 2, 3, 4. For each region generated by RCNN, the FCL is extracted using several deep learning models, giving the FCL results M, N, and O. There are five matrices for each deep learning model's FCL result, and each FCL result has 1024 dense features. AM in Eq. 7 is the concatenation of the FCL matrices of all regions produced by InceptionV3. AN is the concatenation of the FCL matrices of all regions produced by DenseNet121. AO is the concatenation of the FCL matrices of all regions produced by InceptionResNetV2.
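The per-model concatenation in the first flow can be illustrated as follows. This is a sketch under stated assumptions: the FCL vectors are random stand-ins produced by a hypothetical `fake_fcl` helper instead of real network outputs, and `concat_regions` is an illustrative name. Only the five regions and the 1024 features per region come from the text.

```python
import random

N_REGIONS = 5    # radius-ulna, carpal, metacarpal-phalanges, phalanges, epiphysis
FCL_SIZE = 1024  # dense features per region, as stated in the text


def fake_fcl(seed, size=FCL_SIZE):
    """Stand-in for one model's FCL output on one region (random numbers)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(size)]


def concat_regions(region_features):
    """Concatenate the per-region FCL vectors of one model into one vector."""
    merged = []
    for features in region_features:
        merged.extend(features)
    return merged


# M[i]: FCL vector for region i = 0..4 from a single model (e.g. InceptionV3).
M = [fake_fcl(seed=i) for i in range(N_REGIONS)]
AM = concat_regions(M)
print(len(AM))  # 5 regions x 1024 features
```

Under these assumptions each of AM, AN, and AO holds 5 x 1024 = 5120 values per image before any dimensionality reduction.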
The second process flow is described by Eq. (10), where K is a hand x-ray image and W, X, and Y are the results of FCL extraction from the whole image.
Each FCL output has 1024 features. The combination of the FCLs from the three deep learning models is given in Eq. 13. The concatenated result is processed by PCA feature decomposition with 50 components and labeled with the variable P, as shown in Eq. 6. In this scenario, the matrices AM, AN, AO, and AZ are combined to form P, which is the quantity passed through the PCA feature decomposition.
The variable G encodes gender: G is 1 for male and 0 for female patients. The concatenation of P and G is labeled F in Eq. 16. The predicted bone age is denoted BA and is produced by the regressor from the concatenated features F. We consider FCL outputs from models based on multi-path connectivity and depth, represented by DenseNet and ResNet. In addition, we test outputs from models based on Spatial Exploitation, Parallelization, and the Inception Block, represented by InceptionV3 and InceptionResNetV2. We include gender as a feature for determining the age of bone images. Feature decomposition is done using Principal Component Analysis (PCA) with 50 components, after which the gender feature is combined with the FCL results from the deep learning models. A complete diagram of the process is shown in Fig. 2. We evaluate the predictions with the MAE, RMSE, and MAPE metrics, where n is the total number of samples, r_i is the predicted value, and t_i is the ground truth value.
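For reference, the three error metrics reported in the results can be written out directly. The sketch below uses the standard textbook definitions of MAE, RMSE, and MAPE with the same symbols (n samples, predictions r, ground truth t). The sample values are made up for illustration and are not taken from the paper's data.

```python
import math


def mae(r, t):
    """Mean Absolute Error: (1/n) * sum(|r_i - t_i|)."""
    return sum(abs(ri - ti) for ri, ti in zip(r, t)) / len(t)


def rmse(r, t):
    """Root Mean Square Error: sqrt((1/n) * sum((r_i - t_i)^2))."""
    return math.sqrt(sum((ri - ti) ** 2 for ri, ti in zip(r, t)) / len(t))


def mape(r, t):
    """Mean Absolute Percentage Error in percent; every t_i must be non-zero."""
    return 100.0 * sum(abs((ti - ri) / ti) for ri, ti in zip(r, t)) / len(t)


# Made-up bone ages in months, for illustration only.
t = [120.0, 60.0, 200.0, 100.0]  # ground truth
r = [114.0, 66.0, 190.0, 100.0]  # predictions

print("MAE :", mae(r, t))
print("RMSE:", rmse(r, t))
print("MAPE:", mape(r, t))
```

MAE and RMSE are in the same unit as the bone age (months here), while MAPE is a percentage and therefore weights errors on younger patients more heavily.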

Result
The hardware used in this research is an Intel(R) Core(TM) i7-6800K CPU @ 3.40 GHz, 32 GB of physical RAM, and six NVIDIA GTX 1080 Ti GPUs with 11 GB of VRAM each. We used Ubuntu 16.04 as the operating system and the TensorFlow and Keras frameworks on top of the Python programming language to evaluate our proposed method. Keras model applications have also been used by Ren. The first scenario evaluates the standard deep learning models directly on the hand x-ray images, with results shown in Table 1. The second scenario tests a single Feature Connected Layer. The FCL output in scenario 2 is produced by each deep learning model individually, and this scenario is labeled FCL. The 1024 dense features extracted from the FCL are used as input to several regressor algorithms. The results of scenario 2 are shown in Tables 2 and 3.
The third scenario merges the five FCLs using the Region-Based Feature Connected Layer (RB-FCL). An FCL is produced for each region by training several deep learning models on that region. The regions are 1-radius-ulna, 2-carpal, 3-metacarpal-phalanges, 4-phalanges, and 5-epiphysis. The results of scenario 3 are shown in Tables 4 and 5, and the scenario is labeled RB-FCL. The fourth scenario concatenates the feature layers of scenario 2 (FCL) and scenario 3 (RB-FCL). In the fourth scenario, we combine the RB-FCL outputs produced by InceptionV3, DenseNet121, and InceptionResNetV2, and we label the merged features IRD. Overall, the smallest MAE, 6.97 months, is obtained from the concatenated features in the fourth scenario.
Table 1 shows the results of evaluating the deep learning models trained directly on the hand x-ray datasets. Across the deep learning models, the best MAE is 9.41 for the Digital Hand Atlas dataset and 10.89 for the RSNA dataset. Experiments with our proposed method produce MAE values as low as 6.97.
Tables 2 and 3 show the results of the single Feature Connected Layer (FCL) scenario for the Digital Hand Atlas dataset and the RSNA dataset. The FCL scenario trains each deep learning model (InceptionV3, DenseNet121, and InceptionResNetV2) on the whole hand x-ray image. The trained models are then used to produce output layer features that serve as input for the regressor algorithms. In Table 2, the smallest errors are obtained with the FCL output of InceptionResNetV2, with an MAE of 9.77, RMSE of 14.02, and MAPE of 9.76, using the Random Forest Regressor. In Table 3, the smallest errors are obtained with the FCL output of DenseNet121, with an MAE of 9.78, RMSE of 12.91, and MAPE of 10.18.
From the results in Tables 1 and 2, we can see that taking a single feature layer output reduces the error only slightly. For this reason, we perform FCL concatenation experiments using the RB-FCL scenario, shown in Tables 4, 5, 6, and 7. Tables 4 and 5 show the error metrics for the Region-Based Feature Connected Layer output (RB-FCL) scenario. RB-FCL combines the five FCL outputs from the hand x-ray regions. In Table 4, for the Digital Hand Atlas dataset, the smallest MAE, RMSE, and MAPE values are obtained with Linear Regression (LR). Whether the RB-FCL features come from InceptionV3, DenseNet121, or InceptionResNetV2, the resulting MAE is small, between 7.09 and 7.11. RB-FCL thus reduces the error of the standard deep learning model evaluation in Table 1 from 9.41 to 7.10, as shown in Table 4. RB-FCL also has a smaller error than the FCL scenario: on the RSNA dataset, RB-FCL produces MAE values as low as 7.14, while FCL reaches only 9.78. Testing with RB-FCL on the RSNA dataset is shown in Table 5. There, the Random Forest Regressor has the smallest MAE, around 7.14, and the error is nearly the same for the three deep learning models. As with the Digital Hand Atlas dataset, the MAE on the RSNA dataset is smaller than that of the standard deep learning models, which is 10.09.
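To put the Digital Hand Atlas improvement in relative terms, the reduction from the standard-model MAE to the RB-FCL MAE can be computed as a percentage of the baseline. This is a small worked calculation on the two values quoted above. The helper name `relative_reduction` is ours, and the percentage itself is not a figure reported in the paper.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# MAE values in months quoted in the text for the Digital Hand Atlas dataset.
std_mae = 9.41     # standard deep learning models (Table 1)
rb_fcl_mae = 7.10  # RB-FCL (Table 4)

reduction = relative_reduction(std_mae, rb_fcl_mae)
print(f"Relative MAE reduction: {reduction:.1f}%")
```

With these inputs the reduction is roughly a quarter of the baseline error, about 24.5%.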
Tables 4 and 5 show that merging the region-based FCLs from each hand x-ray region (RB-FCL) produces MAE, RMSE, and MAPE values that are smaller than those of the standard deep learning models. On average, MAPE is reduced by about 3-4% compared with standard deep learning. RB-FCL also has a smaller MAPE than the FCL scenario, with a decrease of around 2-3%. These results show that the RB-FCL obtained from hand x-ray region segmentation lets the regressor models achieve smaller errors than standard deep learning models that use the whole hand x-ray image. Dividing the hand x-ray into five regions, namely 1-radius-ulna, 2-carpal, 3-metacarpal-phalanges, 4-phalanges, and 5-epiphysis, lets each deep learning model learn only a specific region. Hence, there is less general information that each model must learn: the model studies the training data for its own region, not the whole image. To recover global feature information, we combine the FCLs from all regions. The specific information from each highlighted region is combined to produce a more representative feature for hand x-ray images.
We can also compare the overall performance of our proposed RB-FCL method in Tables 4 and 5 with the single FCL in Tables 2 and 3. RB-FCL gives lower errors across the metrics. In Tables 4 and 5 the MAE ranges between 7.10 and 11.34, while in Tables 2 and 3 it ranges between 9.77 and 17.9, so RB-FCL also gives a smaller variation of error than the single FCL. Tables 6 and 7 show the results of combining RB-FCL with the combined FCL layers of InceptionV3, InceptionResNetV2, and DenseNet121 on the Digital Hand Atlas dataset and the RSNA dataset. The combined FCL layers are labeled IRD. The FCLOMB label is the combined representation of all feature outputs from InceptionV3-RB-FCL, DenseNet121-RB-FCL, and InceptionResNetV2-RB-FCL. The best result is produced by the FCLOMB + IRD scenario, with an MAE of 6.97, RMSE of 9.346, and MAPE of 8.128.

Discussions
The results of all tests are summarized in Fig. 3. Across all tests, the best error metrics are obtained with the FCLOMB + IRD feature layer scenario, with an MAE of 6.97. This result is obtained by combining the output layer features from every region. Region segmentation lets each deep learning model learn the specific characteristics of the region it is trained on, which are the characteristics used to measure the age of bones. The division into 1-radius-ulna, 2-carpal, 3-metacarpal-phalanges, 4-phalanges, and 5-epiphysis is derived from references used by radiologists to determine bone age. Training a specific model for each region produces representative connected layer output features for that region, and these more representative features allow the regressor to predict bone age better. This is indicated by the lower error in scenario 3 (RB-FCL) and scenario 4 (FCLOMB + IRD) compared with scenario 1 (STD) and scenario 2 (FCL). In scenarios 1 and 2, no region segmentation was performed during deep learning model training, whereas in scenarios 3 and 4, the regions that determine bone age were segmented.
In the literature, Liu et al. use the same Digital Hand Atlas dataset [21]. However, they use only the female data from 2 to 15 years old and the male data from 2.5 to 17 years old, although the dataset also contains x-ray data from 0 to 2.5 years old and above 17 years old. In our research, we use all of the data provided by the public dataset. Son et al. used their own private dataset, and reproducible code is not available. Another study uses a landmark-based multi-region ensemble CNN for bone age assessment [37]. That work differs from ours in the concatenation of layers, the evaluation, and the proportion of data. We combine the FCLs of several regions, whereas their work takes the input image directly and segments it into a few regions. Their evaluation uses each region only as a separate comparison, while we evaluate all segmented regions together through the resulting Feature Connected Layers. In terms of the dataset, they evaluate on the Digital Hand Atlas with 90% training and 10% testing. In our work, we evaluate on two public datasets, the Digital Hand Atlas dataset (1392 x-ray images) [38] and the RSNA dataset (12,814 x-ray images) [39], with 80% training and 20% testing.
State-of-the-art methods for estimating bone age on the Digital Hand Atlas dataset were proposed by Giordano et al. and Spampinato et al. [14,53]. Giordano et al. obtained an MAE of 21.84 months and Spampinato et al. an MAE of 9.48 months. Our proposed RB-FCL method produces an MAE of 7.1 months. On the RSNA dataset, the state-of-the-art result was produced by Castillo et al. [16], with an MAE of 9.73 months, while our RB-FCL method produces an MAE of 6.71 months. Based on the RB-FCL evaluation, our method produces an MAE comparable to the state-of-the-art methods on the Digital Hand Atlas data and a smaller MAE than the state-of-the-art method on the RSNA dataset.
For the Digital Hand Atlas dataset, we compare our predictions with the previously reported MAE of 9.6 months [8]; RB-FCL achieves a smaller value of 7.10 months. For the RSNA dataset, we compare our results with those of Castillo et al., whose best MAE is 9.82 months [16], while our result is a smaller MAE of 6.97 months. Based on these comparisons, the region-based feature layer method RB-FCL obtains better bone age predictions than standard deep learning procedures. In future research, we will modify the convolution method in order to produce more representative output layer features for each region.

Conclusion
Bone age assessment is one way to estimate the age of human bones. Image processing and deep learning techniques have been widely used to carry out bone age assessment. In this study, we propose the Region-Based Feature Connected Layer (RB-FCL) output of several deep learning models, built on region segmentation, to predict bone age. The regions follow those recommended by radiologists when they perform manual assessment on hand x-rays: 1-radius-ulna, 2-carpal, 3-metacarpal-phalanges, 4-phalanges, and 5-epiphysis. With the proposed method (RB-FCL), the best errors obtained were 6.97 months for MAE, 9.346 for RMSE, and 8.128 for MAPE. These results are obtained by merging the output layer features of all regions. They are better than the results of a standard deep learning procedure, which has an MAE of 9.41 months.