Establishment of an automatic diagnosis system for corneal endothelium diseases using artificial intelligence



Introduction
The corneal endothelium is responsible for maintaining corneal transparency, and dysfunction of hexagonal corneal endothelial cells leads to corneal opacity and blindness. According to the 2020 statistical report of the Eye Bank Association of America (EBAA), 53.5% of all keratoplasty procedures were performed for corneal endothelium disease (CED), and 76.7% of the indications for endothelial keratoplasty (EK) were for CED [1]; in addition, the number of EK procedures increased 15.3% to 30,098 in 2021 [2]. Similar patterns have been observed in the United States and Europe. In 2016, 58.4% of keratoplasty patients in Germany underwent the procedure for CED [3]. These findings suggest that CEDs are the most common indication for keratoplasty and that their incidence has increased in recent years. Ophthalmologists should be more aware of the incidence of CEDs worldwide.
The corneal endothelium is the innermost layer of the cornea, and CEDs cannot be accurately diagnosed without specific examination equipment, leading to very high rates of missed diagnosis and misdiagnosis in clinical practice. CEDs include Fuchs' endothelial corneal dystrophy (FECD), posterior polymorphous corneal dystrophy (PPCD), bullous keratopathy, iridocorneal endothelial (ICE) syndrome and viral endotheliitis. In the past, diagnosing CEDs was difficult due to the lack of appropriate equipment. In recent years, with the advent of in vivo confocal microscopy (IVCM), the morphology and structure of corneal endothelial cells can be clearly observed and analysed in vivo, and even mild oedema of the cornea can be detected at the corneal endothelial level. This technological progress has great importance for the understanding and diagnosis of CED [4][5][6][7][8]. As the application of IVCM (HRT III) devices in ophthalmology continues to advance and experience accumulates, imaging features of diagnostic significance are constantly being refined and summarized, and the characteristics and diagnostic criteria of these diseases have been clarified. Unfortunately, although the microscope provides clear images, it cannot supply reports or directly provide a diagnosis, as is also the case for computed tomography (CT) and magnetic resonance imaging (MRI). Therefore, images are sent to doctors for analysis, but diagnosis is often difficult because CEDs are not common in clinical practice, doctors have a very limited understanding of these diseases, and no analysis software is provided with the machine. Moreover, China lacks systematic training on IVCM images and a detailed atlas of CEDs. As a result, the ability to read IVCM images remains limited, and IVCM is often not used effectively. Therefore, the ability to diagnose CEDs in China remains inadequate.
Artificial intelligence (AI) has demonstrated rapid advancements in disease diagnosis. In ophthalmology, substantial progress has been made in the diagnosis of fundus diseases using deep neural networks [9][10][11][12] and the detection of glaucomatous optic neuropathy with multimodal machine learning [13]. However, AI-based diagnosis of corneal diseases is in its infancy and has focused mainly on corneal endothelial cell (CEC) morphology [14], keratitis [15,16] and keratoconus [17,18]. According to a literature review, there is no research on the diagnosis of CEDs using AI technology.
The aim of this study is to develop an automatic diagnostic system to identify FECD, PPCD, owl eye cells in cytomegalovirus (CMV) infection, viral endotheliitis (other CEDs) and normal corneas using AI technology. Through observation and investigation of the imaging characteristics of different endothelial diseases, we identify three aspects that should be considered when designing the proposed model. First, local features and long-range context information are both useful for improving the discriminability of representations, and not all regions in IVCM images contribute equally to identifying diseases. For example, guttata are markers of FECD, and owl eye cells are markers of CMV infection, both of which occupy limited areas. However, in corneas with PPCD, there is a wide area of abnormal regions, such as craters or ridges, on the corneal endothelium. Second, certain spatial interactions and implicit relationships among abnormal regions or cells may be key to avoiding misdiagnosis. For example, guttata can sometimes be found in corneas with PPCD, which may be confused with FECD. However, guttata appear in the middle of the cornea and spread to other parts of the cornea in FECD, while they are distributed along the ridges in PPCD. Third, our dataset for endothelial diseases is relatively small compared with public diabetic retinopathy (DR) grading datasets. Therefore, models suitable for small-scale datasets are more effective for this task.
After considering the above aspects, in this paper, we incorporate a cross-head relation-aware self-attention mechanism and a token-attention feed-forward network (TaFFN) into a compact convolutional transformer (CCT) [19] to enhance its discriminability. The resulting model, termed an enhanced CCT (ECCT), is used for diagnosing CEDs from IVCM images. The CCT is designed by adding convolutional blocks to generate tokens from input images with the goal of maintaining local information and reducing the computational burden for subsequent transformer blocks. Therefore, the CCT not only combines local features and global representations but is also suitable for small datasets. Based on the CCT, we introduce a cross-head relative position encoding (CHRPE) scheme to the standard multihead self-attention module with the goal of capturing spatial relationships and semantic information among different tokens. Inspired by LocalViT [20], we adopt a TaFFN to adaptively learn the importance of each token for different inputs.
Overall, our contributions are as follows:
1) To our knowledge, this is the first study to utilize deep learning methods to automatically diagnose CEDs from IVCM images; these methods can assist ophthalmologists in clinical diagnosis and promote the application of IVCM.
2) We propose a CHRPE scheme to aggregate the spatial interactions and contextual information among different regions. To give more weight to valuable abnormal regions, we propose a TaFFN.
3) The experimental results show that our ECCT is superior in identifying endothelium diseases compared to certain popular convolutional neural network (CNN)-based methods and transformer-based methods.

Image capture
In this prospective study, images of corneal endothelial cells are acquired using an IVCM system (HRT III Rostock Cornea Module [RCM]; Heidelberg Engineering GmbH, Heidelberg, Germany). The specific IVCM image acquisition steps have been described previously [21]. The images are taken from the focal zone of the cornea using section mode and saved in JPG format with 8-bit grey levels and a size of 384 × 384 pixels (400 × 400 μm). The study was performed according to the tenets of the Declaration of Helsinki and was approved by the institutional review board of Peking University Third Hospital (PUTH) (IRB00006761-M2022834). All participants provided written informed consent to participate in the study.

Procedure
First, we select IVCM endothelial images from CED patients. CEDs are diagnosed by corneal specialists (J H, GG X and RM P) in the ophthalmology department of PUTH. CMV endotheliitis is confirmed by reverse transcription-polymerase chain reaction (RT-PCR). Next, the images are used to train our automatic diagnosis system. Seven Chinese hospitals (Beijing Tongren Hospital, Shenyang the Fourth Hospital of People, The First Hospital of China Medical University, The Affiliated Hospital of Qingdao University, Baotou Chaoju Eye Hospital, Liaoning Aier Eye Hospital and The First Affiliated Hospital of Northwest University) supply data to construct the multicentre test set, which is used to test the automatic diagnosis system (Fig. 1). The diagnosis of CED is reviewed by a corneal specialist (J H) from the PUTH Ophthalmology Department. An example IVCM image of a CED is shown in Fig. 2.

Datasets
The

Development of the automated algorithm
CNNs have achieved great success in various medical image analysis tasks. The convolution operations in CNNs utilize convolution kernels with shared weights to interact with input images, and their limited receptive field cannot establish long-range feature dependencies. Recently, transformers with a self-attention mechanism were employed to capture long-range information and global representations. The transformer was first introduced to solve problems in natural language processing [22], in which it has demonstrated excellent performance. Subsequently, the vision transformer (ViT) first applied a standard transformer for image recognition and achieved great performance [23]. The authors of the ViT argued that transformers, unlike CNNs, lack inductive biases and must therefore be trained on large-scale datasets to compensate for this lack. Therefore, some studies have attempted to add locality to transformers, producing systems such as the CCT [19]. The CCT combines local features and global representations while reducing the computational burden for the standard transformer, thus making it suitable for small-scale learning in medical research. We briefly revisit the CCT as follows. It consists of two parts: a convolutional tokenization and a transformer encoder followed by sequence pooling (SeqPool). Given an input image, several convolutional blocks, each of which contains a convolutional layer, a rectified linear unit (ReLU) activation function, and a max pooling layer, are used to generate tokens (a sequence of vectors) [24]. Then, the transformer encoder takes the tokens as input and utilizes a series of stacked transformer blocks to extract global features. Each transformer block comprises two sublayers: a multihead self-attention (MHSA) module and a feed-forward network (FFN). The normalization layer and the residual connection are applied to the two sublayers. Finally, to predict the final class index, the SeqPool module is used to pool the output sequential embeddings of the transformer encoder in a learnable attention scheme and generate probability estimates for different class labels.
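The pooling step described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the implementation used in the study: it assumes that SeqPool scores each token with a single linear layer, turns the scores into weights with a softmax and returns the weighted sum of the tokens. The names `seq_pool`, `score_weights` and `score_bias` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def seq_pool(tokens, score_weights, score_bias=0.0):
    """Pool n token embeddings (each of length d) into one d-vector.

    score_weights and score_bias are hypothetical parameters of the
    linear layer that assigns one importance score to each token.
    """
    scores = [sum(x * w for x, w in zip(tok, score_weights)) + score_bias
              for tok in tokens]
    weights = softmax(scores)  # one attention weight per token
    dim = len(tokens[0])
    pooled = [sum(weights[i] * tokens[i][j] for i in range(len(tokens)))
              for j in range(dim)]
    return pooled, weights

# Toy input: three tokens with two channels each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = seq_pool(tokens, score_weights=[0.0, 0.0])
```

With all-zero scoring weights every token receives the same weight, so the result reduces to average pooling; a trained scoring layer would instead emphasize the tokens that matter for the class decision. In the full model, the pooled vector would then be passed to a linear classifier.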
We propose a transformer-based model based on the CCT [19] for automatically diagnosing CEDs from IVCM images. The diagnosis task is regarded as a five-class classification problem (normal, FECD, PPCD, CMV and others). To identify the correct CED, two factors are considered: (1) certain spatial interactions and implicit relationships between abnormal regions or cells may be key to avoiding misdiagnosis, and (2) not all regions in the IVCM images contribute equally to disease diagnosis. Based on the above observations, we incorporate a cross-head relation-aware self-attention mechanism and a TaFFN into the CCT to enhance its discriminability, producing an ECCT. Specifically, a novel CHRPE scheme, which utilizes cross-head features to capture spatial relationships and semantic information among different regions, is introduced to the standard MHSA module. The TaFFN employs a token-attention scheme to adaptively learn the importance of each token and substitutes for the conventional FFN.

Cross-head relative position encoding
Transformers cannot explore the order of sequential tokens. Therefore, position encoding methods, including absolute and relative position encoding, have been studied recently to add token location information. For absolute position encoding [22], the encodings are learnable or generated from sinusoidal functions with different frequencies and then added to the input tokens. The ViT [23] utilizes this approach. Relative position encoding [25] focuses on the pairwise distances between sequential tokens and was further improved by Transformer-XL [26] and image RPE (iRPE) [27]. In this paper, we use relative position encoding to extract implicit relationships among different regions in IVCM images. The authors of iRPE [27] introduced two relative position modes, bias and contextual, where the bias mode represents encodings as learnable scalars that are independent of input tokens, and the contextual mode represents encodings as trainable vectors that need to interact with the query embedding. The encodings are applied to each attention head of the MHSA module independently, as shown in Fig. 3a. Specifically, for an input sequence $X \in \mathbb{R}^{n \times d}$, an MHSA module configured with iRPE runs self-attention $k$ times (i.e., $k$ attention heads) in parallel, which can be formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V \tag{1}$$

where the query $Q$, the key $K$ and the value $V$ are generated by applying projection matrices to $X$ and reshaping; generally, we have $Q, K, V \in \mathbb{R}^{k \times n \times d_k}$ with $d_k = d/k$, and $B \in \mathbb{R}^{k \times n \times n}$ is the relative position encoding for $k$ heads. In contextual mode, each $B^{k_0}_{ij} \in \mathbb{R}$ in the $B$ matrix is calculated as follows:

$$B^{k_0}_{ij} = Q^{k_0}_{i}\left(r^{k_0}_{ij}\right)^{\top}$$

where $k_0 \in [0, k)$ is the $k_0$-th head, $i, j \in [0, n)$ are position indices, and $r^{k_0}_{ij} \in \mathbb{R}^{d_k}$ is a trainable vector that interacts with the query embedding $Q^{k_0}_{i}$. $r^{k_0}_{ij}$ can also interact with both the query and key embeddings. To represent the relative position on 2D feature maps, $r^{k_0}_{ij}$ can be defined by multiple mapping methods following the original iRPE [27]. However, employing iRPE at each attention head independently ignores information from other heads, which may cause performance degradation. Especially in contextual mode, representations from multiple heads can help to learn richer semantic information. On the other hand, the pairwise positional relationships between tokens are the same for all attention heads; thus, it is reasonable to maintain consistent relative position encoding among various attention heads. Therefore, we design our CHRPE based on iRPE to utilize cross-head embeddings to obtain richer encodings. Specifically, the query $Q$ is reshaped to $Q' \in \mathbb{R}^{n \times kd_k}$ to aggregate cross-head features, and multiplication of the trainable vectors $R_{ij} \in \mathbb{R}^{kd_k}$ and $Q'_{i}$ is performed to generate the positional encoding. Finally, the relative position encoding matrix $B$ can be broadcast-added to the attention maps in each head. As illustrated in Fig. 3b, our cross-head relation-aware MHSA can be formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} \oplus B\right)V, \qquad B_{ij} = Q'_{i}R_{ij}^{\top} \tag{2}$$

where $\oplus$ is the broadcast addition. Additionally, the proposed cross-head relation-aware MHSA configured with CHRPE also provides certain interactions of information among different heads.
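The cross-head encoding described above can be illustrated with a small pure-Python sketch. It is not the authors' code and rests on three assumptions: the relative position of tokens i and j is reduced to the 1D offset j - i, whereas the paper follows the 2D mapping methods of iRPE; the bias is added after the logits are scaled by the square root of d_k; and the names `cross_head_bias`, `attention_logits` and `r_table` are hypothetical.

```python
import math

def cross_head_bias(q, r_table):
    """Compute one n x n bias shared by all heads.

    q: nested list of shape k x n x d_k (per-head queries).
    r_table: 2n - 1 trainable vectors of length k * d_k, indexed by the
             1D offset j - i (a simplifying assumption of this sketch).
    """
    k, n, d_k = len(q), len(q[0]), len(q[0][0])
    # Q'[i] concatenates the query of token i across all k heads.
    q_cat = [[q[h][i][c] for h in range(k) for c in range(d_k)]
             for i in range(n)]
    return [[sum(a * b for a, b in zip(q_cat[i], r_table[j - i + n - 1]))
             for j in range(n)] for i in range(n)]

def attention_logits(q, key, bias):
    """Scaled dot-product logits per head, plus the broadcast shared bias."""
    k, n, d_k = len(q), len(q[0]), len(q[0][0])
    scale = math.sqrt(d_k)
    return [[[sum(a * b for a, b in zip(q[h][i], key[h][j])) / scale
              + bias[i][j]
              for j in range(n)] for i in range(n)] for h in range(k)]

# Toy example: k = 2 heads, n = 2 tokens, d_k = 1.
q = [[[1.0], [2.0]], [[3.0], [4.0]]]
key = [[[0.0], [0.0]], [[0.0], [0.0]]]      # zero keys isolate the bias term
r_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # offsets -1, 0, +1
bias = cross_head_bias(q, r_table)
logits = attention_logits(q, key, bias)
```

Because the keys are zero in this toy example, the logits of both heads equal the shared bias, which makes the broadcast addition visible: every head receives the same position-dependent term, and that term is computed from the queries of all heads.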

Token-attention feed-forward network
In a standard transformer, the FFN is composed of two fully connected layers that establish global information along the embedding dimension. A nonlinear activation function is applied in the hidden layer. To add more locality, LocalViT [20] incorporates a novel FFN that first converts the input sequence to a 2D feature map, then performs two 1 × 1 convolutions along with a depthwise convolution, and finally converts the feature map back to a sequence, as shown in Fig. 4a. In this study, not all regions in IVCM images contribute equally to identifying CEDs, and each disease has its own characteristic area that should be given additional attention. To determine the importance of each token, we propose a TaFFN inspired by LocalViT. Specifically, we reshape the 2D feature map so that a squeeze-and-excitation (SE) module [28] can be applied on the tokenwise dimension. As shown in Fig. 4b, for an input sequence $Z \in \mathbb{R}^{n \times d}$, our TaFFN can be formulated as follows:

$$U_r = \sigma\left(W_d \ast \sigma\left(W_1 \ast Z_r\right)\right), \quad U = \mathcal{R}(U_r), \quad V = \mathrm{SE}(U), \quad V_r = \mathcal{R}^{-1}(V), \quad Z_{\mathrm{out}} = W_2 \ast V_r \tag{4}$$

where $Z_r$ is the 2D feature map converted from $Z$, and $Z_{\mathrm{out}}$ is converted back to a sequence; $h$ and $w$ are the height and width of the 2D feature map, respectively; $W_1$ and $W_2$ are the two 1 × 1 convolutions; $W_d$ is the depthwise convolution; and $\sigma$ is the nonlinear activation function. $\mathcal{R}$ represents a reshaping operation that converts $U_r \in \mathbb{R}^{d \times h \times w}$ to $U \in \mathbb{R}^{n \times d \times 1}$. The SE module learns the importance in the token dimension and weights $U$ to generate $V$, and then, $\mathcal{R}^{-1}$ converts $V \in \mathbb{R}^{n \times d \times 1}$ to $V_r \in \mathbb{R}^{d \times h \times w}$. In this way, the model is configured to focus on regions that contain more information.

Fig. 4 Comparison between a the local feed-forward network and b the proposed token-attention feed-forward network (TaFFN). The orange areas are newly added parts
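The token-wise gating at the core of the TaFFN can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: the convolutions are omitted, the SE module is reduced to a bias-free two-layer bottleneck over the token dimension, and the names `token_se`, `w_down` and `w_up` are hypothetical.

```python
import math

def token_se(tokens, w_down, w_up):
    """Squeeze-and-excitation applied over tokens instead of channels.

    tokens: n token embeddings, each a list of d floats.
    w_down: m x n weights of the (hypothetical) reduction layer.
    w_up:   n x m weights of the (hypothetical) expansion layer.
    """
    # Squeeze: summarize every token by the mean of its channels.
    squeezed = [sum(tok) / len(tok) for tok in tokens]
    # Excitation: bottleneck with ReLU, then one sigmoid gate per token.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_down]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_up]
    # Scale: reweight each token by its gate.
    scaled = [[g * x for x in tok] for g, tok in zip(gates, tokens)]
    return scaled, gates

# Toy example: n = 2 tokens, d = 2 channels, bottleneck width m = 1.
tokens = [[1.0, 3.0], [2.0, 2.0]]
scaled, gates = token_se(tokens, w_down=[[0.0, 0.0]], w_up=[[1.0], [1.0]])
```

With zero reduction weights every gate equals 0.5, so all tokens are scaled uniformly; after training, the gates would differ per token and suppress regions that carry little diagnostic information. Note that gating over the token dimension presupposes a fixed number of tokens, which holds when all inputs share one resolution.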

Implementation details
To train the proposed model, the development set is randomly divided into a training set (2723 images) and a validation set (387 images). All the input images have a resolution of 384 × 384 pixels, and the pixel values are normalized to the range 0-1. To reduce the risk of overfitting, data augmentation strategies, including random cropping, random horizontal flipping, random erasing [29], CutMix [30] and RandAugment [31], are adopted on the training set. For the structure of our proposed network, the number of convolutional blocks in the convolutional tokenization is set to 4 for 16× downsampling. For the transformer encoder, the number of transformer blocks and the number of attention heads are set to 10 and 8, respectively; the dimension of sequential embeddings is set to 512. During training, we first pretrain the proposed model on the ImageNet [32] dataset and then fine-tune the pretrained parameters on our own CED training set. For fine-tuning, the AdamW optimizer [33] is used with a weight decay of 5e-2 and a batch size of 40. We train the network for 100 epochs. The learning rate starts at 5e-5 and then gradually decreases to 1e-8 with the cosine annealing schedule [34].
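The hyperparameters above imply a few derived quantities that are easy to check. The sketch below assumes that the learning rate is updated once per epoch with the standard cosine annealing formula and that the tokenizer reduces each spatial dimension by exactly 16; the function name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs=100, lr_max=5e-5, lr_min=1e-8):
    """Cosine annealing from lr_max (epoch 0) to lr_min (last epoch)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

# Sequence length after 16x downsampling of a 384 x 384 input.
tokens_per_side = 384 // 16          # 24 positions along each axis
num_tokens = tokens_per_side ** 2    # 576 tokens enter the encoder

# Per-head dimension for 512-dimensional embeddings and 8 heads.
head_dim = 512 // 8                  # 64

# Learning rate at the start, midpoint and end of fine-tuning.
schedule = [cosine_lr(epoch) for epoch in (0, 50, 100)]
```

Under these assumptions the encoder processes 576 tokens per image with 64 dimensions per attention head, and the learning rate at the midpoint of training is roughly half of its initial value.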

Heatmaps
To analyse the regions with the greatest contributions to the diagnosis of CEDs using our system, we generate a heatmap that visualizes the attention maps in the transformer blocks of the ECCT by using the attention rollout method [39]. For the CED findings, the heatmaps effectively highlight the regions containing lesions on the corneal endothelium. We also compare the heatmaps from other methods, such as the class activation maps of ResNet-34 and the attention maps of DeiT-S, to our heatmap in Fig. 9. Compared to the class activation maps of ResNet-34, the attention mechanism in DeiT-S and ECCT tends to activate abnormal regions precisely for all cases due to its ability to handle long-range dependencies; compared to DeiT-S, ECCT captures more complete features, which illustrates its greater discriminative capacity for learned features.
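The attention rollout computation can be sketched as follows. This is a generic illustration of the technique under stated assumptions, not the exact visualization pipeline of the study: each layer is represented by one attention matrix already averaged over heads, the residual connection is modelled by mixing in the identity with equal weight, and how the rolled-out matrix is reduced to a single heatmap for a model without a class token is left open. The function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def attention_rollout(layer_attentions):
    """Propagate attention through stacked layers.

    layer_attentions: one n x n head-averaged attention matrix per
    transformer block, ordered from the first block to the last.
    """
    n = len(layer_attentions[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layer_attentions:
        # Account for the residual connection by mixing in the identity.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        # Renormalize each row so that it remains a distribution.
        mixed = [[v / sum(row) for v in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout

# Toy example: two layers with uniform attention over two tokens.
uniform = [[0.5, 0.5], [0.5, 0.5]]
rollout = attention_rollout([uniform, uniform])
```

Each row of the result describes how strongly an output token depends on every input token after all layers; reshaping such a row, or a pooled combination of rows, back to the token grid and upsampling it to the image size yields a heatmap.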

Misclassified images
In the PUTH testing set, one "normal" image is misclassified as an "owl eye" image, and another "normal" image is misclassified as "others". One "PPCD" image is misclassified as a "normal" image, five "PPCD" images are misclassified as "FECD", and two "PPCD" images are misclassified as "owl eye". Six "others" images are misclassified as "FECD", and two "others" images are misclassified as "owl eye". In the multicentre testing set, one "normal" image is misclassified as "PPCD"; three "FECD" images are misclassified as "PPCD", and one "FECD" image is misclassified as "others"; one "owl eye" image is misclassified as "PPCD", and one "owl eye" image is misclassified as "others"; five "others" images are misclassified as "normal", ten "others" images are misclassified as "FECD", nine "others" images are misclassified as "PPCD", and sixteen "others" images are misclassified as "owl eye". The details of the classification errors from the ECCT are described in Fig. 10.
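The error counts listed above can be tallied to check them against the reported overall accuracy. The sketch below uses only numbers stated in the text: the individual misclassification counts, the 449 images of the multicentre testing set and the reported multicentre accuracy of 89.53%. The variable names are hypothetical, and the size of the PUTH testing set is not assumed.

```python
# (true label, predicted label) -> number of misclassified images
puth_errors = {
    ("normal", "owl eye"): 1, ("normal", "others"): 1,
    ("PPCD", "normal"): 1, ("PPCD", "FECD"): 5, ("PPCD", "owl eye"): 2,
    ("others", "FECD"): 6, ("others", "owl eye"): 2,
}
multicentre_errors = {
    ("normal", "PPCD"): 1,
    ("FECD", "PPCD"): 3, ("FECD", "others"): 1,
    ("owl eye", "PPCD"): 1, ("owl eye", "others"): 1,
    ("others", "normal"): 5, ("others", "FECD"): 10,
    ("others", "PPCD"): 9, ("others", "owl eye"): 16,
}

puth_total_errors = sum(puth_errors.values())
multicentre_total_errors = sum(multicentre_errors.values())

# Errors whose true label is "others" in the multicentre testing set.
others_errors = sum(count for (true_label, _), count
                    in multicentre_errors.items() if true_label == "others")

# Overall multicentre accuracy implied by the error tally.
multicentre_images = 449
multicentre_accuracy = (multicentre_images
                        - multicentre_total_errors) / multicentre_images
```

The tally gives 18 errors on the PUTH testing set and 47 on the multicentre testing set, which reproduces the reported multicentre accuracy of 89.53%. Forty of the 47 multicentre errors come from the "others" class.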

Ablation studies on the internal PUTH testing set
To verify the effectiveness of each component of our method, we conduct experiments without ImageNet pretraining on the PUTH testing set. Specifically, we separately analyse the effects of CHRPE and TaFFN. To verify the effectiveness of CHRPE, we compare the results of three RPE options: without RPE, iRPE and CHRPE. The findings show that taking relative positional relationships into account is effective in extracting the characteristics of CEDs, and the proposed CHRPE performs better than the iRPE, as shown in Table 4. For the feed-forward network (FFN), we also compare the results of three options: linear FFN, local FFN and TaFFN. A linear FFN is used in the standard transformer and is composed of two fully connected layers. A local FFN is the type of FFN adopted in LocalViT. As shown in Table 5, the performance of TaFFN is better than that of the other FFNs.

Discussion
The total accuracy of the ECCT on the PUTH testing set is 97.06%, and the AUC is 0.991. Moreover, the total accuracy of the ECCT on the multicentre testing set is 89.53%, and the AUC is 0.958. The t-SNE technique shows that the features of each category learned by the ECCT algorithm are more separable than those of the other four AI algorithms on both the PUTH and multicentre testing datasets, as shown in Fig. 6b and Fig. 7b. As shown in Fig. 5 and Tables 2 and 3, ECCT not only performs well on the PUTH dataset but also achieves the best accuracy and sensitivity and significantly surpasses the other four AI algorithms on the multicentre dataset, which demonstrates the superiority of our system in generalizing to unseen images. According to the heatmaps, the ECCT effectively highlights the regions containing lesions on the corneal endothelium. This finding suggests that the ECCT can accurately focus on the regions with lesions in CEDs, especially in PPCD and owl eye cell images, which are often overlooked by or unfamiliar to junior ophthalmologists.
The sensitivity for "others" is relatively low on the multicentre testing set because the "others" images in the PUTH dataset mainly focus on areas of disturbance that are similar to FECD, PPCD and owl eye cell images; consequently, the "others" images in the PUTH dataset are unable to depict all the alterations seen in endotheliitis. The "others" images from the multicentre dataset contain different diseases that are not found in the PUTH dataset, which is why these images are the most commonly misclassified.
While the automatic diagnosis of several diseases, such as diabetic retinopathy, diabetic macular oedema and keratitis, has been studied in ophthalmology, most of the systems were developed with large datasets (tens of thousands of images). However, for the diagnosis of CEDs based on IVCM images, there are no public datasets, and PPCD and CMV cases are relatively rare; thus, our datasets are relatively small. In the PUTH dataset, we collect multiple images from different corneal positions for each patient to increase the number of PPCD and CMV images. To learn discriminative feature representations on such a small-scale dataset, the proposed ECCT is based on the main architecture of the CCT, utilizing convolutional blocks to avoid overfitting and transformer blocks to capture long-range information. As shown in Table 2, we also conduct experiments without ImageNet pretraining (i.e., training from scratch). The table shows the superiority of the ECCT in both configurations. First, this shows that the ECCT can achieve reasonable performance by training from scratch on our relatively small dataset, while other transformer-based methods (DeiT-S and Swin-T) do not perform well. Second, pretraining on ImageNet significantly boosts the performance of all the methods, which shows that features learned from natural images are also helpful for medical image tasks. Furthermore, the proposed architecture captures both local and global features for various patterns of endothelial diseases, as implied by the heatmap comparisons (Fig. 9) to other CNN- and transformer-based methods. Moreover, to establish contextual relationships among different lesion regions when designing the model, we integrate a CHRPE scheme into a standard multihead self-attention module by utilizing cross-head features to obtain richer encoding. In addition, a TaFFN is introduced to learn the importance of tokens for all transformer blocks. Ablation studies on the PUTH testing set demonstrate the advantages of adopting both CHRPE and TaFFN (Tables 4 and 5).
The prevalence of FECD is approximately 7.33%, and the total number of people aged > 30 years with FECD is currently estimated to be nearly 300 million. An increase of 41.7% in the number of FECD-affected patients is expected by 2050 [40]. The prevalence of FECD varies by race and geographic location. A study from Iceland (a white population) revealed that the prevalence of cornea guttata was 9.2% [41]. In Asia, the prevalence of FECD is lower than that in Europe, with rates of 6.7% among Singaporean individuals [42] and 4.1% among Japanese individuals [43]. There are no data on the prevalence of FECD in China, which reflects the lack of diagnostic ability for this disease in the country. Therefore, developing an automatic diagnostic system for this disease is logical.
PPCD is a relatively rare, autosomal dominant disease. Ophthalmologists have a poor understanding of this disease, which can easily lead to missed diagnoses. Many asymptomatic PPCD cases are found and diagnosed during air force/civil aviation physical examinations in China [44]. IVCM reveals hyporeflective, crater-like, vesicular lesions of different sizes on the corneal endothelium [45].
CMV endotheliitis is defined as corneal endothelium-specific inflammation triggered by CMV infection [46] and has been reported mainly in Asian countries [47]. The Japan Corneal Endotheliitis Study, which included the largest case series of 106 patients, reported that CMV endotheliitis is most common in middle-aged and older men [48]. The features of the owl eye morphology include large cells with nuclei presenting a highly reflective area surrounded by a halo of low reflection [7]. These cells, which are considered pathognomonic for CMV, can be detected with IVCM, which may be helpful as an adjunct examination method. IVCM can assist in the evaluation of FECD guttae and owl eye cells. Our system can effectively distinguish between these two diseases. Images of other corneal endotheliitis patients were used as disturbance terms in our study, which is important for improving diagnostic accuracy. The characteristics of corneal endotheliitis on IVCM are diverse and might be confused with those of FECD, PPCD and owl eye cells. IVCM is a very effective methodology for studying corneas and improving the diagnostic ability for CEDs. For the reasons mentioned above, the ability to diagnose CEDs in China remains low. With the help of an HRT III machine agent, we established five WeChat groups of 500 people each, in which ophthalmologists consult one another on IVCM images every day. The development of this system can greatly improve the level of CED diagnosis. Due to the COVID-19 pandemic, free personnel flow between cities is sometimes restricted; with this system, ophthalmologists can upload images to a website and automatically obtain a diagnosis.
Although this study includes a large sample for CEDs, the sample is still relatively small compared to those used to develop other AI systems. More images of endotheliitis patients from multiple centres will be used to train and improve our system. The diagnosis of CEDs using the proposed system should be further confirmed through large-scale clinical trials.

Conclusion
This is the first report of an AI diagnostic system for CEDs, and our results show that this system can achieve excellent diagnostic performance. IVCM is a reliable and effective diagnostic method for CEDs.
Once an ophthalmologist suspects CEDs after IVCM examination, the obtained image is input into our system, and the system automatically recognizes the image and assists in diagnosis to improve the ophthalmologist's understanding of CED.
However, images of endotheliitis patients are still rarer than those of other CEDs. In the future, additional images of endotheliitis patients from multiple centres will be used to train and improve our system. Moreover, the proposed system was tested on an ordinary computer, after which the system was tested online and run on a web page.
With the increased incidence of CEDs, this AI system will play a key role in the prevention of corneal blindness.

Fig. 1
Fig. 1 Summary flow chart of our research. The brown arrow shows the training procedure of the automatic diagnosis system. The blue arrow shows the validation procedure using CED images from multiple centres

Fig. 2
Fig. 2 Example IVCM images of CEDs included in our study. The first row shows FECD of different severities; the second row shows different kinds of PPCD; the third row shows different kinds of owl eye cells; and the fourth row shows positive examples for the "others" group

Fig. 3
Fig. 3 Comparison between a the multihead self-attention (MHSA) module configured with image RPE (iRPE) and b the proposed cross-head relation-aware MHSA. The green areas are newly added parts

Fig. 5
Fig. 5 The AUCs of the different automatic diagnostic systems on the PUTH and multicentre datasets. a The AUCs of the different automatic diagnostic systems on the PUTH dataset. b The AUCs of the different automatic diagnostic systems on the multicentre dataset

Fig. 6
Fig. 6 Performance of the deep learning algorithms on the PUTH test dataset. a Confusion matrices describing the accuracies of the five deep learning algorithms. b Visualization by t-distributed stochastic neighbour embedding (t-SNE) of the separability of the features learned by the deep learning algorithms. The differently coloured point clouds represent the different feature categories

Fig. 7
Fig. 7 Performance of the deep learning algorithms on the multicentre test dataset. a Confusion matrices describing the accuracies of the five deep learning algorithms. b Visualization by t-distributed stochastic neighbour embedding (t-SNE) of the separability of the features learned by the deep learning algorithms. The differently coloured point clouds represent the different feature categories

Fig. 8
Fig. 8 Colour heatmaps demonstrating typical findings for different corneas, shown in pairs with the original images (left) and the corresponding heatmaps (right) for each category. a Normal. b FECD. c PPCD. d Owl eye cells in CMV infection. e Other CEDs

Fig. 9
Fig. 9 Comparison among class activation maps of ResNet-34, attention maps of DeiT-S and attention maps of ECCT. The first row shows the original images

Fig. 10
Fig. 10 Typical examples of misclassified images

The images are divided so that data from the same patient are not included in both the development set and the testing set. The total number of images for each of the diseases in the PUTH dataset is shown in Table 1. A total of 449 IVCM images from multiple centres are included in the testing set.

Table 1
Characteristics of the datasets

Table 2
Performance of the five AI algorithms in the PUTH and multicentre test datasets

Table 3
Overall performance of the five AI algorithms in the PUTH and multicentre test datasets. Underlined numbers indicate the best results when training from scratch; bold numbers indicate the best results when fine-tuning with ImageNet pretraining

Table 4
Ablation studies of CHRPE

architecture of the CCT, utilizing convolutional blocks to avoid overfitting and transformer blocks to capture long-range information. As shown in Table 2, we also conduct experiments without ImageNet pretraining (i.e., training from scratch). The table shows the superiority of the ECCT in both configurations. First, this shows that the ECCT can achieve reasonable performance by training from scratch on our relatively small dataset, while other transformer-based methods (DeiT-S and Swin-T) do not perform well.

Table 5
Ablation studies of TaFFN