Retinal photograph-based deep learning system for detection of hyperthyroidism: a multicenter, diagnostic study

Screening for hyperthyroidism using gold-standard diagnostic criteria in the general population is not cost-effective, leading to a relatively high rate of undiagnosed and untreated patients. This study aimed to establish a deep learning (DL)-based system to detect hyperthyroidism from retinal photographs. This multicenter, observational study included retinal photographs taken from participants in two hospitals and 24 health care centers throughout China. We first trained two models to identify hyperthyroidism: in model #1, the non-hyperthyroidism individuals were randomly selected, while in model #2, the non-hyperthyroidism group was matched for age and gender with the hyperthyroidism group. After internal validation, we selected the better model for further evaluation using external validation datasets. The study included 22,940 retinal photographs of 11,409 participants in total: 19,078 photographs (9,539 participants) for model development, and 3,862 photographs (1,870 participants) obtained from two hospitals and four medical centers as the external validation datasets. Model #1 achieved a higher area under the receiver operating characteristic curve (AUC) than model #2 (0.907, 95% CI: 0.894–0.918 versus 0.850, 95% CI: 0.832–0.866) in the internal validation, so model #1 was used for further evaluation. In the external datasets, model #1 reached AUCs ranging from 0.816 (95% CI 0.789–0.846) to 0.849 (95% CI 0.824–0.874) and accuracies between 0.735 (95% CI 0.700–0.773) and 0.796 (95% CI 0.765–0.824). Heatmaps showed that the DL algorithm focused on the large fundus vessels and the optic nerve head. DL systems based on retinal fundus photographs may serve as a cost-effective and non-invasive method to detect hyperthyroidism.


Introduction
Thyroid hormones are essential for body growth, neuronal development, and regulation of metabolism [1]. Hyperthyroidism is a globally common thyroid dysfunction with potentially devastating health consequences. It is a form of thyrotoxicosis due to inappropriately high synthesis and secretion of thyroid hormones [2]. In the United States, the prevalence of hyperthyroidism is about 1.2%, with approximately 40% of the patients having clinically apparent symptoms and 60% having subclinical signs [3]. Screening for hyperthyroidism in the general population is not a cost-effective and feasible procedure, because of the relatively low prevalence of the disease and since the examination. Only those who are at high risk due to comorbid conditions, family history, or medication use are recommended to receive physical examinations and laboratory tests [4]. Therefore, hyperthyroidism is frequently unrecognized and untreated. It has been estimated that the prevalence of undiagnosed hyperthyroidism is about 1.72% of the general population in Europe [5]. Undiagnosed hyperthyroidism may lead to adverse outcomes and increased costs [6]. Improved systems for the detection of hyperthyroidism are thus needed.
Deep learning (DL) is a state-of-the-art technique that allows computational models to learn representations of data with multiple levels of abstraction [7]. In medicine and healthcare, DL has been primarily applied to analyze medical images for automatic measurement, augmentation, classification, diagnosis, and even prediction [8-11]. There is increasing interest in establishing DL-based, low-cost and non-invasive methods trained on color fundus photographs to predict demographic parameters and systemic disorders, including age, gender, cardiovascular risk factors [8], anemia [12] and hepatobiliary diseases [13]. Rim and colleagues broadened the applicability of retinal photograph-based DL to predict 47 systemic biomarkers; however, thyroid function was not well predicted [14]. The present study was therefore conducted to develop a DL-based model for the detection of hyperthyroidism based on retinal photographs.

Study design and participants
This multicenter observational study was conducted in two general hospitals and 24 medical examination centers in 19 cities in China, and was registered at ClinicalTrials.gov (NCT04678375). All medical examination centers were independently operating medical units that belonged to the same healthcare group under the supervision of the Tongren Hospital. The geographical distribution of all hospitals and medical examination centers included in this study is presented in Fig. 1. The data were retrospectively collected from June 1st, 2018 to July 31st, 2020.
The data obtained from 20 medical examination centers were used for the development of the DL system (Fig. 2). We first randomly extracted data of normal individuals as the control group and data of patients with hyperthyroidism in a ratio of 2:1. To eliminate potentially confounding variables, we established an additional control group, matched for age and gender with the study group. We thus had two different control groups and one study group. Within each control group and within the study group, the data were then randomly divided into a development dataset and an internal validation dataset in a ratio of 8:2. Images from the same patient were included in either the development dataset or the validation dataset, never in both. Each development dataset was again randomly divided into a training dataset and a tuning dataset (7:1 ratio). Two DL algorithms were therefore trained on the two development datasets and then tested internally on the corresponding internal validation datasets. Only the model with the better performance in the internal validation was chosen to be tested in the external validation datasets.
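The participant-level splitting described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the record layout and the toy identifiers are hypothetical, and the sketch assumes that the 8:2 and 7:1 ratios are applied to participants rather than to individual images. In the study design, such a split would be run separately within each control group and within the study group.

```python
import random
from collections import defaultdict


def split_by_participant(image_records, fractions, seed=0):
    """Split (participant_id, image_id) records into len(fractions) groups.

    All images of one participant land in the same group, so no participant
    is shared between groups. `fractions` are relative sizes, e.g. (8, 2).
    """
    by_participant = defaultdict(list)
    for participant_id, image_id in image_records:
        by_participant[participant_id].append(image_id)

    participant_ids = sorted(by_participant)
    random.Random(seed).shuffle(participant_ids)

    total = sum(fractions)
    groups, start, cumulative = [], 0, 0
    for fraction in fractions:
        cumulative += fraction
        end = round(len(participant_ids) * cumulative / total)
        chosen = participant_ids[start:end]
        groups.append([(pid, img) for pid in chosen for img in by_participant[pid]])
        start = end
    return groups


# Hypothetical toy data: 100 participants with two images each (one per eye).
records = [(f"P{i:03d}", f"P{i:03d}_{eye}") for i in range(100) for eye in ("OD", "OS")]

# 8:2 split into development and internal validation, then 7:1 into training and tuning.
development, internal_validation = split_by_participant(records, (8, 2), seed=1)
training, tuning = split_by_participant(development, (7, 1), seed=2)
```

Splitting on participant identifiers rather than on images is what keeps the two eyes of one person from ending up on both sides of a split.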
Data from the Beijing Tongren Hospital (BTH, Beijing, China), Beijing Friendship Hospital (BFH, Beijing, China), Chongqing Zhuoyue Medical Centre of iKang Healthcare Group (CZMC, Chongqing, China), Shanghai Yuanhua Medical Centre of iKang Guobin Healthcare Group (SYMC, Shanghai, China), Shenzhen Zhuoyue Medical Centre of iKang Healthcare Group (SZMC, Shenzhen, China), and Wuhan Jindun Road Medical Centre of iKang Healthcare Group (WJMC, Wuhan, China) were used as external validation datasets to further evaluate the performance of the system.
Inclusion criteria were the availability of a complete set of clinical data, basic demographic characteristics, medical history, and thyroid function test reports. Hyperthyroidism was defined by a serum thyroid-stimulating hormone (TSH) level below the reference range together with a serum thyroxine (T4) level, triiodothyronine (T3) level, or both above the reference range, or by a medical history of hyperthyroidism diagnosed by endocrinologists [2]. Participants in the control group had normal serum TSH, T4, and T3 levels. All subjects underwent fundus examination within 10 min after the blood test. The Medical Ethics Committee of the Beijing Tongren Hospital and the Ethics Committee of the iKang Corporation approved the study protocol, which fulfilled the requirements of the Declaration of Helsinki. For patients whose fundus images were stored in the retrospective databases at each participating hospital, informed consent was waived by the institutional review boards.
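The labelling rule above can be written as a short predicate. The sketch below is illustrative only: the reference ranges are hypothetical placeholders (real ranges depend on the assay and laboratory and are not reported in the text), and the function and class names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRange:
    low: float
    high: float


# Hypothetical reference ranges, for illustration only; the ranges used in
# the study are not given in the text.
TSH_RANGE = ReferenceRange(0.4, 4.0)     # mIU/L
T4_RANGE = ReferenceRange(60.0, 160.0)   # nmol/L
T3_RANGE = ReferenceRange(1.2, 2.8)      # nmol/L


def is_hyperthyroid(tsh, t4, t3, diagnosed_history=False):
    """TSH below range AND (T4 or T3 above range), OR a documented diagnosis."""
    low_tsh = tsh < TSH_RANGE.low
    high_hormone = t4 > T4_RANGE.high or t3 > T3_RANGE.high
    return diagnosed_history or (low_tsh and high_hormone)


def is_control(tsh, t4, t3):
    """Control participants have TSH, T4 and T3 all within the reference range."""
    pairs = ((tsh, TSH_RANGE), (t4, T4_RANGE), (t3, T3_RANGE))
    return all(r.low <= value <= r.high for value, r in pairs)
```

Under this rule a suppressed TSH with normal T4 and T3 is labelled as neither hyperthyroid nor control, so such a participant would fall outside both groups unless there is a documented diagnosis.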

Data acquisition and quality control
Retinal photographs were obtained by trained operators from one or both eyes of the study participants using various types of non-mydriatic 45-degree fundus cameras (CR-2AF, Canon, Tokyo, Japan; Nonmyd α-DIII, Kowa, Tokyo, Japan; TRC-NW300, Topcon, Tokyo, Japan; NT-2000, Nidek, Aichi, Japan) (Additional file 1: Table S1). The photographs were centered on the mid-point between the optic disc and the macula. All images were stored in JPEG format. Two trained ophthalmologists performed quality control by consensus to remove non-retinal images and poor-quality images affected by halation, blur, or defocus. All images were cropped to remove the black background so that only the fundus region remained, which made the images from different fundus cameras uniform in style. Both eyes of the same participant were included in the model. To investigate the ability of the DL system to detect hyperthyroidism, images with obvious ophthalmic diseases were also removed to reduce the training and validation bias of the model.
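The background-cropping step can be illustrated with a bounding-box crop. This is a sketch under stated assumptions, not the procedure used in the study: the text does not say how the fundus region was located, so the brightness threshold, the function name and the toy image are all hypothetical, and a grayscale image is represented as a plain list of rows.

```python
def crop_black_border(image, threshold=10):
    """Crop a grayscale image (list of rows of 0-255 ints) to the bounding
    box of all pixels brighter than `threshold` (an assumed cut-off)."""
    rows = [r for r, row in enumerate(image) if any(p > threshold for p in row)]
    cols = [c for c in range(len(image[0])) if any(row[c] > threshold for row in image)]
    if not rows or not cols:
        return []  # no pixel is brighter than the threshold
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Hypothetical 5 x 6 toy image: a bright fundus-like region on a black frame.
toy_image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 120, 130, 0, 0],
    [0, 90, 140, 150, 80, 0],
    [0, 0, 110, 100, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cropped = crop_black_border(toy_image)
```

Pixels inside the bounding box that are still black (the corners around a circular fundus) are kept, since the crop is rectangular.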

Development of the DL system
For the development of the DL system and the analysis of the relationship between hyperthyroidism and retinal photographs, we used convolutional neural networks (CNNs). Several state-of-the-art candidate architectures (VGG-19 [15], ResNet-50 [16], InceptionV3 [17], DenseNet-121 [18], etc.) were tested with the same hyper-parameter settings. We selected the model with the best performance for the detection of hyperthyroidism (Additional file 1: Table S2). Among all candidate architectures, ResNet-50 showed the best performance. To improve the generalization of the model, we initialized it with parameters pre-trained on ImageNet. We kept the same configuration for the further comparison analyses.
Before training the networks, we resized all digital images to 512 × 512 pixels, and some low-quality images were manually removed. The pixel values of each image were normalized from [0, 255] to [0, 1] with a linear mapping. Data augmentation strategies such as random flip, rotation and crop were also applied. The batch size was set to 16. We used a binary cross-entropy loss (BCE loss) and the Adam optimizer for stochastic optimization [19]. The learning rate started at 3e-4 and was reduced tenfold every 10 epochs. We trained for 50 epochs in total (Additional file 1: Fig. S1).
All experiments were implemented with the PyTorch 1.8.1 DL toolkit [20]; all backbones are publicly available in the torchvision 0.8.2 site-packages. The algorithm was trained on an NVIDIA RTX 3090 GPU with CUDA version 9.0 and cuDNN 7.0.
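The numerical settings in this paragraph can be made concrete with a small framework-free sketch. It is not the training code of the study: the function names are hypothetical, and it assumes zero-based epoch numbering with the first tenfold drop taking effect at epoch 10.

```python
import math

INITIAL_LR = 3e-4    # starting learning rate, from the text
DECAY_EVERY = 10     # epochs between drops, from the text
DECAY_FACTOR = 0.1   # tenfold reduction, from the text
TOTAL_EPOCHS = 50    # from the text


def learning_rate(epoch):
    """Step schedule; assumes zero-based epochs, first drop at epoch 10."""
    return INITIAL_LR * DECAY_FACTOR ** (epoch // DECAY_EVERY)


def normalize_pixel(value):
    """Linear mapping of an 8-bit pixel value from [0, 255] to [0, 1]."""
    return value / 255.0


def bce_loss(probabilities, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


schedule = [learning_rate(epoch) for epoch in range(TOTAL_EPOCHS)]
```

Under these assumptions the learning rate takes five values over the 50 epochs, from 3e-4 in the first ten epochs down to 3e-8 in the last ten.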

Validation of the DL system and visualization heatmaps
Two controlled experimental settings were evaluated for the DL system, with the control group being either unmatched or matched for age and gender with the study group. By comparing the two models, the influence of age and gender on the detection of hyperthyroidism was assessed. Finally, the model with the better performance was used for further evaluation in the external datasets. To visualize how the selected model reached its decisions, we applied Grad-CAM to generate heatmaps [21].
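For readers unfamiliar with Grad-CAM, its core combination step can be sketched without any DL framework. The sketch assumes that the feature maps of one convolutional layer and the gradients of the class score with respect to them have already been extracted; the function name and toy values are hypothetical, and the final rescaling to [0, 1] is a common display convention rather than part of the method's definition.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    activations, gradients: nested lists indexed [channel][row][col], holding
    the feature maps of one layer and the gradients of the class score with
    respect to those feature maps.
    """
    height, width = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for act, grad in zip(activations, gradients):
        # Channel weight: global average of the gradient over all positions.
        weight = sum(sum(row) for row in grad) / (height * width)
        for r in range(height):
            for c in range(width):
                heatmap[r][c] += weight * act[r][c]
    # ReLU keeps only regions that push the class score up.
    heatmap = [[max(v, 0.0) for v in row] for row in heatmap]
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]  # scale for display
    return heatmap


# Hypothetical toy input: two 2 x 2 feature maps with opposite gradient signs.
toy_activations = [[[1, 2], [3, 4]], [[4, 3], [2, 1]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
```

In practice the low-resolution heatmap is upsampled to the input size and overlaid on the fundus photograph.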

Statistical analysis
Statistical analyses were performed using the R software (version 4.0.3). The predictive accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score of the system were evaluated. Bootstrapping with 2,000 replications was used to estimate the 95% confidence intervals (CIs) of the performance metrics [22]. We used the receiver operating characteristic (ROC) curve to show the predictive ability of the DL system.
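The metrics and bootstrap intervals named above can be illustrated as follows. The study's analysis was done in R; this Python sketch is only an illustration, and it assumes a percentile bootstrap, which the text does not specify. The function names and the toy labels and scores are hypothetical.

```python
import random


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV, NPV and F1 from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method) for metric(y_true, scores)."""
    rng = random.Random(seed)
    n = len(y_true)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        try:
            estimates.append(metric([y_true[i] for i in idx], [scores[i] for i in idx]))
        except ValueError:
            continue  # resample contained a single class; skip it
    if not estimates:
        raise ValueError("no valid bootstrap resample")
    estimates.sort()
    m = len(estimates)
    lower = estimates[int(alpha / 2 * m)]
    upper = estimates[min(m - 1, int((1 - alpha / 2) * m))]
    return lower, upper


# Hypothetical toy data: 4 hyperthyroid (1) and 6 control (0) images.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
toy_predictions = [1 if s >= 0.5 else 0 for s in toy_scores]

metrics = confusion_metrics(toy_labels, toy_predictions)
point_auc = auc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_ci(toy_labels, toy_scores, auc, n_boot=2000, seed=0)
```

Because both eyes of a participant can appear in a dataset, resampling by participant rather than by image would be a natural refinement; the text does not state which unit was resampled.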

Results
A total of 22,940 retinal photographs from 11,409 subjects were included in the study (Table 1). Among them, 5,601 images from 2,767 subjects were labeled as hyperthyroidism. We used 19,078 retinal photographs from 9,539 participants for the development of the models, and 3,862 retinal photographs from 1,870 participants as the external validation datasets. The basic characteristics of the study populations are presented in Table 1. In the study population for the assessment of model 1, the median age was 46 years (range 20 to 88) in the development dataset and 45 years (range 15 to 85) in the internal validation dataset. In the study population for the assessment of model 2, the median age was 40 years (range 15 to 85) in the development dataset and 41 years (range 16 to 88) in the internal validation dataset.
In the heatmap visualization, the saliency highlighted the large fundus vessels and the optic nerve head area on the fundus images of patients with hyperthyroidism (Fig. 4). This suggests that the system made the diagnosis based on a fixed pattern.

Discussion
In this study, more than 20,000 retinal photographs taken using various camera types at 26 centers were used. The DL algorithms developed in this study were able to detect hyperthyroidism on retinal photographs with AUCs ranging between 0.82 and 0.85 and accuracies ranging from 0.74 to 0.80. The key finding is that a DL-based system can detect hyperthyroidism well from retinal photographs. These results suggest that automated screening for thyroid diseases may be possible based on routinely taken ocular fundus photographs. The heatmaps revealed that the DL algorithm relied on the large retinal blood vessels and on the optic nerve head region.
Previous studies have demonstrated a good performance of artificial intelligence-related, fundus image-based algorithms in differentiating the sexes, with an AUC higher than 0.90, and in estimating age, with a coefficient of determination (R²) higher than 0.83 [8,14]. Considering that age and gender might act as confounding factors for the performance of a DL system for the detection of hyperthyroidism [23], we trained and validated two models using randomly selected controls (model 1) and age- and gender-matched controls (model 2), respectively. Both models showed high accuracy for the detection of hyperthyroidism, with AUCs of 0.85 or higher. This suggests that the DL systems could detect hyperthyroidism independently of demographic parameters such as age and sex.
Various efforts have been made to screen and diagnose hyperthyroidism in populations with convenient and low-cost approaches. Sato et al. proposed a novel screening method to assist the diagnosis of Graves' hyperthyroidism by applying Bayesian-type and self-organizing map-type neural networks to routine test data, including serum concentrations of alkaline phosphatase, creatinine, and total cholesterol [24]. The model was trained in 120 women (35 patients with Graves' hyperthyroidism and 85 healthy ...). Similar models were built based on the examination of 78 male individuals and reached AUCs higher than 0.95 for screening hyperthyroidism in 133 subjects [25]. Our study cannot be directly compared with these studies, since it is the first investigation to diagnose hyperthyroidism from non-invasive data rather than traditional methods such as biochemical blood examinations and systemic biomarkers.
The findings of the present study illustrate that a fundus photo-based DL system can detect hyperthyroidism. They agree with clinical studies showing involvement of the ocular system in patients with hyperthyroidism. For example, thyroid-associated ophthalmopathy (TAO) occurs mainly in patients with Graves' disease. Patients with TAO had an increased choroidal thickness [26]. Increased retinal microvascular density was detected in patients with active TAO, while the retinal vessel density in the peripapillary area was decreased in eyes with dysthyroid optic neuropathy [27]. In addition, inactive TAO also showed an altered retinal perfusion as assessed by optical coherence tomography angiography [28,29]. To our knowledge, understanding of retinal changes in hyperthyroidism without TAO is still limited. A previous study indicated that patients with thyroid dysfunction had wider retinal arterioles [30].
To better understand the mechanism of the DL algorithm and to minimize the "black-box effect", we collected ocular fundus images taken in two general hospitals and four medical examination centers located in North, East, South and West China. We used four different types of fundus cameras (TRC-NW200, CR-2 AF, Nonmyd α-DIII, and NT-2000). The DL algorithm showed reliable and reproducible results in all six external validation datasets. Although human experts cannot recognize hyperthyroidism-related changes in retinal images, the heatmap visualization revealed that the large retinal vessels and the optic nerve head area were the regions preferentially assessed by the DL system. This may be explained by the fact that hyperthyroidism is a hypermetabolic condition, which leads to an increase in blood flow rate as well as dilation of peripheral blood vessels [31].
Fig. 4 Receiver operating characteristic curves illustrate this algorithm's ability to detect hyperthyroidism in external validation datasets
The DL system developed in the present study may have clinical implications. Computational evaluation of fundus photographs may be promising for screening individuals for hyperthyroidism and other thyroid diseases, so that unnecessary costs may be avoided and the social burden reduced. Future studies may evaluate whether the DL system can be integrated into mobile terminals, such as smartphone apps, to identify individuals with hyperthyroidism in populations. Identification of hyperthyroidism at an early stage can help clinicians to better organize future management strategies and provide more treatment options. Patients undergoing ophthalmic examinations and receiving eye surgery may benefit from a better understanding of their general condition, such as thyroid function.
When the results of our study are discussed, its limitations should be taken into account. First, the DL system in this study was trained in participants with overt hyperthyroidism, so the system was not tested for identifying subclinical cases. Second, we failed to develop models to predict the precise levels of serum thyroid hormones, including TSH, T3, and T4. Third, we did not include participants with other thyroid diseases, such as hypothyroidism, thyroiditis, and thyroid cancer, due to the relatively low prevalence and few cases. Fourth, the study population consisted mostly of Chinese participants, so the results may not be directly transferable to individuals of other ethnicities. Fifth, we collected data mainly from physical examinations, which included a large number of healthy subjects and relatively few hyperthyroidism patients. This caused an imbalance between hyperthyroidism cases and control groups. Sixth, although we established two models, only one model was applied in the external validation, and the performance of the model in the external datasets was not perfect (all AUCs < 0.90). Seventh, we only considered thyroid diseases, while other systemic diseases (e.g. hypertension) were not excluded. Although this might add some confounding factors, the results might be closer to real-world application. Eighth, although some confounding factors such as age and sex were matched, other parameters with potential correlations with fundus images, such as diastolic and systolic blood pressure, were not fully evaluated in this study. The strengths of our study include that it is the first artificial intelligence-based investigation to assess the association between retinal features and hyperthyroidism; that the DL system was developed from data obtained from 26 different data sources in China, with different settings such as hospitals and healthcare centers; and that a total of four types of fundus cameras were used in the training and validation of the models, suggesting a wide applicability of the DL system.

Fig. 1 Distribution of included participants. Pie charts show the composition of hyperthyroidism and non-hyperthyroidism images in the development and validation datasets

Fig. 3 Overview of the study process

Table 1 Characteristics of the study populations
Model 1: the control group was randomly selected and unmatched to the hyperthyroidism subjects; Model 2: the age and gender of the control group were matched to the hyperthyroidism subjects. Hyperthyroidism participants overlapped between Model 1 and Model 2, whereas non-hyperthyroidism participants differed between Model 1 and Model 2. Data are presented as n, or median values (range)
BTH Beijing Tongren Hospital, BFH Beijing Friendship Hospital, CZMC Chongqing Zhuoyue Medical Centre of iKang Healthcare Group, SYMC Shanghai Yuanhua Medical Centre of iKang Guobin Healthcare Group, SZMC Shenzhen Zhuoyue Medical Centre of iKang Healthcare Group, WJMC Wuhan Jindun Road Medical Centre of iKang Healthcare Group

Table 2 Performance of the DL-based system for the prediction of hyperthyroidism from retinal photographs
AUC area under the curve, CI confidence interval, BTH Beijing Tongren Hospital, BFH Beijing Friendship Hospital, CZMC Chongqing Zhuoyue Medical Centre of iKang Healthcare Group, SYMC Shanghai Yuanhua Medical Centre of iKang Guobin Healthcare Group, SZMC Shenzhen Zhuoyue Medical Centre of iKang Healthcare Group, WJMC Wuhan Jindun Road Medical Centre of iKang Healthcare Group
a Only model 1 was applied in external validation