Enhancing medical explainability in deep learning for age-related macular degeneration diagnosis


ABSTRACT Deep learning models hold significant promise for disease diagnosis but often lack transparency in their decision-making processes, limiting trust and hindering clinical adoption.


This study introduces a novel multi-task learning framework to enhance the medical explainability of deep learning models for diagnosing age-related macular degeneration (AMD) using fundus


images. The framework simultaneously performs AMD classification and lesion segmentation, allowing the model to support its diagnoses with AMD-associated lesions identified through


segmentation. In addition, we perform an in-depth interpretability analysis of the model, proposing the Medical Explainability Index (MXI), a novel metric that quantifies the medical


relevance of the generated heatmaps by comparing them with the model’s lesion segmentation output. This metric provides a measurable basis to evaluate whether the model’s decisions are


grounded in clinically meaningful information. The proposed method was trained and evaluated on the Automatic Detection Challenge on Age-Related Macular Degeneration (ADAM) dataset.


Experimental results demonstrate robust performance, achieving an area under the curve (AUC) of 0.96 for classification and a Dice similarity coefficient (DSC) of 0.59 for segmentation,


outperforming single-task models. By offering interpretable and clinically relevant insights, our approach aims to foster greater trust in AI-driven disease diagnosis and facilitate its


adoption in clinical practice.


INTRODUCTION State-of-the-art deep learning algorithms have achieved impressive performance in analyzing fundus images to detect eye diseases such as glaucoma, age-related macular


degeneration (AMD), and pathological myopia1,2,3,4,5,6. Despite these advancements, these models often lack transparency in their decision-making processes. This issue, commonly referred to


as the AI “black box” problem, presents a significant challenge in the medical field, where understanding the reasoning behind a diagnosis is crucial for both clinicians and patients7,8. The


AI “black box” problem is widely recognized as a key barrier to the broader adoption of AI in clinical practice9,10,11. The field of explainable AI (XAI) seeks to improve understanding of


how neural networks make decisions. A common XAI approach in biomedical imaging is to identify regions of an image most relevant to a model’s decisions, using techniques such as Class


Activation Mapping (CAM) and Gradient-weighted CAM (Grad-CAM)12,13,14. However, these methods lack the ability to provide meaningful medical insights needed to explain the model’s reasoning.


Critical questions remain unanswered: Do the regions highlighted in the heatmaps correspond to clinically relevant features? Can the model support its diagnosis with medical knowledge and


reasoning? Another challenge in XAI is performance evaluation. Unlike traditional deep learning tasks with standardized metrics, there is currently no widely accepted method for assessing


explainability. Our paper aims to address these challenges. We use the term _medical explainability_ to refer to the model’s ability to justify its diagnostic decisions based on medical


knowledge and reasoning, as opposed to _algorithmic explainability_, which relies on general interpretability techniques such as CAM and Grad-CAM. Distinguishing between these two concepts


allows for a more comprehensive framework for addressing the “black box” issue in AI. Medical explainability is essential for building trust in AI-based diagnoses among clinicians and


patients. However, most existing research focuses on algorithmic explainability, while medical explainability remains underexplored. This paper’s contribution is to develop a methodology


that enhances the medical explainability of a deep learning model for diagnosing AMD using fundus images. AMD, a degenerative disorder affecting the macula, is the leading cause of vision


loss in individuals over 50, affecting approximately 200 million people worldwide15,16. Early detection is crucial, as the vision loss caused by AMD is irreversible and the effectiveness of


treatments declines with disease progression. However, access to eye healthcare is often limited, particularly in low-income and rural areas. Therefore, it is important to develop effective


and low-cost methods for AMD detection, and deep learning has shown considerable promise as a solution. Models such as convolutional neural networks (CNNs) have achieved high accuracy in


detecting AMD using retinal fundus images, sometimes outperforming traditional manual approaches17,18,19,20,21. However, the lack of explainability in these models presents a major obstacle


to their clinical adoption, hindering the potential for large-scale AMD screening and early diagnosis. Our methodology enhances medical explainability through two innovative approaches.


First, we propose a multi-task learning framework that simultaneously performs disease classification and lesion segmentation, leveraging the extraction and segmentation of AMD-related


biomarkers to validate the model’s binary classification results. The AMD-related lesions include drusen, exudates, hemorrhages, and scars, among which drusen is a key indicator and defining


feature of the disease, particularly in its early stages. The lesions identified by the segmentation task can provide evidence for the model’s positive AMD diagnosis. This approach of


supporting an AMD diagnosis with associated lesions mirrors the diagnostic process used by clinicians, where the morphological characteristics of lesions play a crucial role in accurate


disease identification. Second, we introduce a novel metric, the Medical Explainability Index (MXI), to enable an in-depth interpretability analysis of the model. The model incorporates a


Grad-CAM module to generate heatmaps from the AMD classification task, and the MXI assesses their medical relevance by measuring the degree of overlap between the highlighted regions in the


heatmaps and AMD-related lesions identified in the segmentation masks. The MXI provides a quantifiable basis for evaluating the medical explainability of the model. It offers valuable


insights into whether and how the model’s predictions are grounded in clinically meaningful information and helps identify the lesions or biomarkers that influence its decisions. By


enhancing understanding of the model’s decision-making process, this new metric can help build greater confidence and trust in AI-assisted diagnoses. The proposed model, Deep Learning with


Medical eXplainability (DLMX), not only enhances medical explainability but also improves model performance by exploiting the inherent correlation between lesion segmentation and disease


classification. The segmentation task provides detailed spatial information about morphological features, while the classification task assesses the overall features and patterns of the


image. By sharing the learned representations between tasks within a shared learning framework, the model effectively utilizes both local and global features, leading to more accurate


predictions for both tasks. Several studies have used the multi-task learning approach to enhance performance and reduce computational costs in medical imaging. Pascal et al. employ a


multi-task model with glaucoma classification, optic disc and optic cup segmentation, and fovea localization for glaucoma detection22. Ju et al. train a model with two classification tasks


for diabetic retinopathy (DR) and AMD, diseases that share some pathological similarities, and thereby improve the performance of both DR and AMD diagnosis23. To the best of our knowledge, our paper


is the first to utilize the multi-task framework to enhance medical explainability of a deep learning model. In summary, the main contributions of our work are as follows:

1. We propose a multi-task learning framework that integrates AMD classification and lesion segmentation, enabling the model to support its diagnoses with AMD-associated lesions identified from segmentation. Moreover, by leveraging the correlation between AMD classification and lesion segmentation, this model achieves improved performance for both tasks.
2. We introduce a new interpretability metric (MXI) to enhance understanding of the model’s decision-making process, ensuring that its predictions are medically explainable.
3. We evaluate our proposed approach and validate its effectiveness through extensive experiments on the Automatic Detection Challenge on Age-Related Macular Degeneration (ADAM) fundus image dataset21.

METHODS NETWORK


ARCHITECTURE Our proposed model, Deep Learning with Medical eXplainability (DLMX), utilizes a U-Net encoder-decoder architecture and integrates four modules as illustrated in Fig. 1: (1) AMD


classification using a state-of-the-art CNN, (2) Grad-CAM for generating a heatmap, (3) Segmentation of AMD-related lesions, and (4) Generation of MXI, the medical explainability metric, by


evaluating the overlap between the heatmap and the segmented lesion. The DLMX model is based on a U-Net architecture, which is known for its strong performance in biomedical image


segmentation24. The encoder, implemented using a CNN, extracts high-level features while progressively downsampling the input image. We evaluate several state-of-the-art CNN architectures


for the encoder, including EfficientNet-B725, EfficientNet-B3, EfficientNet-B0, and ResNet26. Fig. 2 provides an example of the U-Net architecture with EfficientNet-B7 as the encoder. For


the AMD classification task, the output of the final EfficientNet block in the encoder is passed through a fully connected layer to predict the probability of AMD at the image level. The


decoder in the U-Net follows a standard structure, progressively upsampling feature maps using transposed convolutions. Skip connections, a critical component of the U-Net design,


concatenate feature maps from corresponding encoder layers to those in the decoder. This preserves detailed spatial information at each resolution level and ensures that high-level semantic


features are merged with precise spatial details, improving segmentation accuracy and localization. MULTI-TASK TRAINING The encoder in the U-Net extracts deep feature representations from


the input image, enabling the classification branch to predict AMD. Meanwhile, the decoder feeds into a segmentation block that generates pixel-level maps of AMD-related lesions. The tasks


of classification and segmentation are trained simultaneously within a multi-task learning architecture, leveraging shared information for mutual gain27,28. In this framework, the loss


functions for the classification and segmentation tasks are combined into a single aggregate loss function, and model parameters are shared across tasks, allowing the model to draw on the


strengths of both tasks and enhance its overall performance. For the classification task, a binary cross-entropy loss is used as the objective function to optimize the model parameters:


$${L}_{cls}=-{y}_{i}\text{log}{p}_{i}-\left(1-{y}_{i}\right)\text{log}(1-{p}_{i})$$ (1) where \({p}_{i}\) is the predicted probability and \({y}_{i}\) is the corresponding ground truth


label. For the segmentation task, a combination of cross-entropy loss and Dice loss is employed. Cross-entropy loss, commonly used for pixel-wise classification, penalizes incorrect


predictions at the pixel level and works well when class distributions are balanced. Dice loss measures the overlap between predicted and ground-truth regions, making it effective for


handling class imbalances—common in medical imaging where segmented regions often cover a small fraction of the image. By combining these two loss functions, the model better handles class


imbalance and yields more stable convergence. The cross-entropy and Dice loss functions are denoted as \({L}_{ce}\) and \({L}_{dice}\) and are shown in Eq. (2) and (3), respectively:


$${L}_{ce}=-\frac{1}{{N}_{pix}}{\sum }_{i=1}^{{N}_{pix}}\left({y}_{i}\text{log}{p}_{i}+\left(1-{y}_{i}\right)\text{log}(1-{p}_{i})\right)$$ (2)

$${L}_{dice}=1-\frac{2{\sum }_{i=1}^{{N}_{pix}}{p}_{i}{y}_{i}}{{\sum }_{i=1}^{{N}_{pix}}{p}_{i}+{\sum }_{i=1}^{{N}_{pix}}{y}_{i}}$$ (3)

where \({p}_{i}\) is the predicted probability for pixel _i_ and \({y}_{i}\) is the corresponding ground truth label, for all \({N}_{pix}\) pixels in the image. The combined loss function for the segmentation task, \({L}_{seg}\), is as follows:


$${L}_{seg}={L}_{ce}+{L}_{dice}$$ (4) For the overall loss function, we combine the classification loss and segmentation loss with equal weight to optimize the shared model parameters:


$${L}_{total}={L}_{cls}+{L}_{seg}$$ (5) SINGLE-LESION AND MULTI-LESION SEGMENTATION To demonstrate the proposed methodology, we first focus on drusen as the primary AMD-related lesion before


incorporating additional lesion types. Drusen is the most common and defining feature of AMD, while other lesions may be associated with multiple diseases. For instance, exudates are often


linked to DR but can also indicate AMD, while hemorrhages may be present in AMD, glaucoma, and DR. In the ADAM dataset, drusen has the highest occurrence rate, while other lesions appear


less frequently, which may potentially introduce data imbalance issues. Therefore, drusen serves as the most reliable lesion to demonstrate the effectiveness of the proposed methodology.


After validating our method with drusen for the segmentation task, we expand the model to include additional lesion types, specifically exudates, hemorrhages, and scars. DATASET The proposed


model is evaluated using the ADAM dataset21, which consists of 1,200 retinal fundus images stored in JPEG format, with 8 bits per color channel. These fundus images were captured using a


Zeiss Visucam 500 fundus camera with a resolution of 2124 × 2056 pixels and a Canon CR-2 device with a resolution of 1444 × 1444 pixels. The dataset includes binary labels for AMD and


non-AMD cases and pixel-wise annotations for segmentation masks of the optic disc and various lesions, including drusen, exudates, hemorrhages, and scars. Of the original 1,200 images, 800


are publicly available, with 400 of these containing lesion annotations. Consequently, this study focuses on the 400 annotated images. The dataset exhibits a class imbalance, with 89 images


labeled as AMD and 311 as non-AMD. For this study, the dataset is split into training and testing sets, consisting of 320 and 80 images, respectively. MODEL TRAINING DETAILS We use the


stochastic gradient descent (SGD) optimizer for model training. All models are optimized for 100 epochs. The initial learning rate is set to 0.001, and the learning rate is modulated using a


cosine annealing strategy. The batch size is set to 32. All images are resized to 256 × 256. To address the class imbalance problem, resampling techniques are applied. We use pre-trained


weights on ImageNet to initialize the model parameters, which enables the model to effectively fine-tune for the target tasks and achieve better performance with relatively limited data. In


addition, data augmentation techniques such as random flipping and random cropping are employed to enhance the model’s generalization capability. All models are implemented using the PyTorch


deep learning framework, and experiments are conducted on eight 3090 GPUs. Five-fold cross-validation with re-splitting of the training and testing data is performed to evaluate the


variability of the results. RESULTS MODEL PERFORMANCE RESULTS The DLMX model consists of four modules as illustrated in Fig. 1: (1) AMD classification, (2) Grad-CAM for generating a heatmap,


(3) Lesion segmentation, and (4) Generation of the MXI metric by evaluating the overlap between the heatmap and the segmented lesion. Fig. 3 illustrates an example set of model input,


ground truth lesion annotations, and output images, including (a) a fundus image serving as the input to the model, (b) ground truth lesion annotations; and three output images generated by


the DLMX model: (c) segmentation mask of drusen, (d) a heatmap generated by Grad-CAM, and (e) a heatmap mask converted from the heatmap. Note that Image (d), the heatmap, is produced in the


form of a two-dimensional numerical representation at the pixel level, with values ranging from 0 to 255. It is converted into a binary heatmap mask with pixel values of 0 or 1, seen in (e),


to facilitate comparison with the segmentation mask for the computation of MXI. To convert the heatmap into a binary mask, we apply Otsu’s method29, an automatic thresholding technique that


determines the optimal threshold by maximizing the variance between foreground and background pixel intensities. This approach allows each image to be thresholded based on its own intensity


distribution. Unlike traditional “black box” models that provide only a classification outcome without explanation, the DLMX model’s lesion segmentation supports and substantiates its


classification results. In Fig. 3, the identified drusen in the segmentation mask (c) reinforces a positive AMD diagnosis, as drusen is the most common feature associated with the disease.
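As an illustration of the binarization step described above, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the heatmap is a small 2D grid of integers in the range 0 to 255, implements Otsu's threshold from its textbook definition, and then computes the inclusion ratio of Eq. (8) (see the MXI Metric section) between the resulting heatmap mask and a segmentation mask. The function names `otsu_threshold`, `binarize`, and `inclusion_ratio`, the toy arrays, and the convention of returning 0.0 for an empty segmentation mask are all assumptions made for this example.

```python
from typing import List, Sequence

Mask = List[List[int]]


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form the background class, the rest the foreground.
    """
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue  # no background pixels yet
        if w_fg == 0:
            break  # every pixel is background from here on
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(heatmap: List[List[int]]) -> Mask:
    """Threshold a heatmap with its own Otsu level, giving a 0/1 mask."""
    t = otsu_threshold([v for row in heatmap for v in row])
    return [[1 if v > t else 0 for v in row] for row in heatmap]


def inclusion_ratio(mask_a: Mask, mask_b: Mask) -> float:
    """IR(A, B) = |A ∩ B| / |B|, as in Eq. (8).

    Returns 0.0 when B is empty (a convention assumed for this sketch).
    """
    inter = sum(a & b for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb))
    size_b = sum(b for row in mask_b for b in row)
    return inter / size_b if size_b else 0.0


# Toy data (hypothetical values, for illustration only).
heatmap = [[10, 10, 200], [12, 210, 220]]
segmentation_mask = [[0, 1, 1], [0, 1, 0]]

heatmap_mask = binarize(heatmap)
mxi = inclusion_ratio(heatmap_mask, segmentation_mask)
print(heatmap_mask, round(mxi, 3))
```

Because the threshold is recomputed per image, two heatmaps with different intensity ranges are each split by their own histogram, which matches the per-image behavior described in the text. In this toy case two of the three segmented pixels fall inside the heatmap mask.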


Additionally, the MXI measured based on images (c) and (e) reveals whether the regions the model relies on for its diagnosis correspond to medically relevant features, offering insights into


how the model makes decisions and whether the decisions are medically explainable. We first discuss the performance of the classification and segmentation tasks below and follow with a


discussion of the MXI results in the next section. The DLMX model is implemented and evaluated using four state-of-the-art CNNs as the encoder backbone, specifically, EfficientNet-B725,


EfficientNet-B3, EfficientNet-B0, and ResNet26. To evaluate the classification task performance, baseline models trained on a single task of AMD classification are compared with the DLMX


model. The baselines employ the same CNNs as those used in the encoder of DLMX, i.e., EfficientNet-B7, EfficientNet-B3, EfficientNet-B0, or ResNet. Similarly, for segmentation tasks,


baseline models trained on a single segmentation task using the same U-Net architecture as employed in DLMX are compared to the DLMX model. For the classification task, the evaluation


metrics used are accuracy, sensitivity, specificity, F1 score, and area under the curve (AUC), with their definitions summarized in Table 1. AUC is a particularly useful metric for


imbalanced datasets and thus is a key metric for our experiments. The classification results are summarized in Table 2. DLMX based on EfficientNet-B7 achieves the strongest overall


performance. Between DLMX and the baseline models, with EfficientNet-B7 as the backbone encoder, DLMX outperforms the baseline in all performance metrics. For example, DLMX based on


EfficientNet-B7 achieves an AUC of 0.96 ± 0.03 and accuracy of 0.94 ± 0.05, surpassing the baseline AUC of 0.94 ± 0.04 and accuracy of 0.91 ± 0.06. When ResNet is used as the backbone, DLMX performs better in specificity and worse in other metrics relative to the baseline model, with similar AUC (0.94 and 0.95, respectively). For lesion segmentation tasks, the evaluation metrics include the Dice similarity coefficient (DSC) and intersection over union (IoU), defined as follows:

$$DSC= \frac{2\times |A\cap B|}{\left|A\right|+|B|}$$ (6)

$$IoU= \frac{|A\cap B|}{|A\cup B|}$$ (7)

where _A_ and _B_ denote the predicted and ground-truth lesion regions. DSC is particularly useful for measuring the model’s ability to detect small objects, such as small lesions in medical imaging. Table 3 presents the results for the segmentation of


drusen. The DLMX model performs better than the corresponding baseline model across all four CNN backbones. DLMX based on EfficientNet-B7 achieves the highest performance among all models,


with a DSC of 0.59 ± 0.18 and an IoU of 0.44 ± 0.17. The superior performance in segmentation accuracy reflects the DLMX’s enhanced ability to capture relevant features associated with


lesions. The overall strong performance of DLMX can be attributed to the underlying multi-task architecture. Multi-task deep learning has demonstrated the ability to generate superior


results compared to single-task models when the tasks are related22,23,27,28. Given that drusen is a core biomarker of AMD, integrating its extraction into a multi-task model along with


classification of AMD enhances overall model performance. The drusen segmentation can capture fine morphological features that aid in AMD classification, while the classification task can


provide non-morphological clues pertinent to the diagnosis, thus improving the joint learning of both tasks. Compared with AMD classification, the drusen segmentation is a more challenging


task as reflected in the performance measures. The highest AUC for classification is 0.96, while the highest DSC for segmentation is 0.59. This is likely due to two reasons: the irregular


shape of the lesion, and the class imbalance with relatively few positive drusen cases21. Notably, the improvement of DLMX over the single-task baselines across all backbone CNNs is more


consistent for segmentation tasks relative to classification tasks, suggesting that the benefits of multiple-task training for performance improvement are greater for more challenging tasks.
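The two segmentation metrics in Eqs. (6) and (7) can be sketched for a single pair of binary masks as follows. This is an illustrative pure-Python example under stated assumptions, not the evaluation code used in the study: the function name `dice_and_iou`, the toy masks, and the choice to score two empty masks as a perfect match (1.0) are all assumptions introduced here.

```python
from typing import List, Tuple

Mask = List[List[int]]


def dice_and_iou(pred: Mask, truth: Mask) -> Tuple[float, float]:
    """Return (DSC, IoU) for two binary masks of equal shape.

    DSC = 2|A ∩ B| / (|A| + |B|)   (Eq. 6)
    IoU = |A ∩ B| / |A ∪ B|        (Eq. 7)
    """
    inter = sum(a & b for ra, rb in zip(pred, truth) for a, b in zip(ra, rb))
    size_a = sum(v for row in pred for v in row)
    size_b = sum(v for row in truth for v in row)
    union = size_a + size_b - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    return 2 * inter / (size_a + size_b), inter / union


# Toy masks (hypothetical): 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[0, 1, 1], [0, 1, 0]]
dsc, iou = dice_and_iou(pred, truth)
print(round(dsc, 3), round(iou, 3))
```

For any single pair of masks the two scores are linked by DSC = 2·IoU / (1 + IoU), so DSC is never lower than IoU. Averages over many images, such as those reported in a results table, need not satisfy this identity exactly.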


MXI METRIC The DLMX model integrates a module to generate heatmaps and interpret their medical significance. CAM and Grad-CAM are the most widely used techniques to generate heatmaps in the


XAI literature, particularly for deep learning-based medical image analysis12,13,14. We use Grad-CAM in our model due to its ability to adapt to a broader range of CNNs and generate more


detailed heatmaps14. Our proposed metric, the Medical Explainability Index (MXI), evaluates the overlap between the segmented lesion mask and the heatmap mask, both generated by the model


(i.e., images c and e in Fig. 3). Specifically, MXI is calculated as the inclusion ratio (IR) between the heatmap mask and the segmentation mask:

$$IR(A,B)= \frac{|A\cap B|}{|B|}$$ (8)

$$MXI = IR \, (Heatmap \, mask, \, Segmentation)$$ (9)

Here, \(|A\cap B|\) represents the number of overlapping pixels between the heatmap mask and segmentation mask, and |B| represents the number of


pixels in the segmentation mask. MXI quantifies the extent to which the segmented lesion mask is represented in the heatmap, a value ranging from zero to one. A value of zero indicates no


overlap, suggesting that none of the lesion features are captured within the heatmap or used by the model to make decisions; in contrast, a value of one indicates complete overlap,


suggesting that the model utilizes all lesion features in its decision-making process. This measure identifies the medical features that influence its decisions, thus offering a mechanism


for the model to reveal the reasoning behind its diagnostic outputs. Another metric that measures the overlap between two regions is the DSC, as defined in Eq. (6). For the MXI metric, we


choose IR over DSC, as IR is useful for capturing the extent of lesion inclusion without penalizing additional areas of interest identified in the heatmap. DSC, on the other hand, calculates


the degree of overlap scaled by the combined area of the regions, which can result in low scores if the heatmap includes broader areas beyond a particular lesion. Since MXI depends on


lesion segmentation, it is essential for the segmentation task to achieve high accuracy for MXI to be a reliable measure. An alternative approach is to utilize the expert-labeled ground


truth lesion annotations to calculate the lesion-heatmap overlap, a parallel metric which we call MXI_GT, computed as the inclusion ratio between the ground truth annotation and the heatmap


mask (i.e., images b and e in Fig. 3):

$$MXI\_GT = IR \, (Heatmap \, mask, \, Ground \, Truth)$$ (10)

While MXI_GT does not depend on the segmentation accuracy of the model, it requires


ground truth annotations, limiting its applicability in clinical deployment. The advantage of MXI is its sole reliance on model-generated outputs, thus eliminating the need for


clinician-generated ground truth annotations. To our knowledge, this is the first study to propose a method for automatic assessment of whether the deep learning decision-making process


includes disease-related biomarkers generated by the model. The MXI results are presented based on EfficientNet-B7 as the backbone CNN for the DLMX model. Fig. 4 (a) and (b) show the


distribution of MXI and MXI_GT, respectively. Both MXI and MXI_GT display similar distributions, characterized by a distinct bimodal pattern. The mean and median MXI are 0.432 and 0.378, and


the mean and median MXI_GT are 0.462 and 0.410, respectively, suggesting that close to half of the drusen features are represented in the heatmaps. Fig. 4 (c) presents the scatter plot of


MXI vs. MXI_GT and results from two nonparametric rank correlation tests: Spearman’s rank correlation yields ρ = 0.921 (p = 1.28 × 10⁻¹⁹), and Kendall’s Tau yields τ = 0.775 (p = 2.59 × 10⁻¹³), both indicating a strong positive monotonic association between the two metrics. These statistical results support the validity of MXI by demonstrating its close alignment with


MXI_GT, the ground truth-based metric. We note that when expert-level ground truth annotations are available, MXI_GT is a valuable metric. However, in scenarios where such annotations are


unavailable, as is often the case in real-world clinical deployment, MXI offers a practical, model-generated alternative for assessing medical explainability. Fig. 5 presents examples of two


image sets categorized by their MXI values. In Fig. 5 (a), the MXI is 0.94, indicating that 94% of the segmented drusen is contained within the heatmap mask. This large overlap can be


visualized by comparing the segmentation image with the heatmap mask image in Fig. 5 (a). This result suggests that the model’s decision process relies on the majority of the drusen features


contained in the fundus image, providing reassurance that the model utilizes disease-related, medically relevant features for its diagnosis. Conversely, in Fig. 5 (b), the value of MXI is


0.13, indicating that only a small portion of the drusen region is relevant to the model’s decision-making. The heatmap in Fig. 5 (b) highlights the optic disc/cup area rather than the


drusen-rich regions, suggesting the model based its decision on other features not directly related to the segmented lesion. The DLMX framework is scalable to incorporate more biomarkers. In


addition to focusing on drusen in our model, we add three new segmentation tasks for additional lesion types from the ADAM dataset, specifically exudate, hemorrhage, and scar. This expanded


model performs a more comprehensive analysis by creating separate segmentation masks for each lesion type, which are then aggregated into a single composite lesion mask. This aggregate mask


represents the combined presence of multiple lesions in the image and serves as the benchmark for evaluating the model’s decision-making. The MXI is computed as the IR between this


aggregate multi-lesion segmentation mask and the heatmap mask. Fig. 6 (a) and (b) show the histogram of the MXI and MXI_GT, respectively, for the multi-lesion model. The mean and median MXI


for the multi-lesion model are 0.417 and 0.331, respectively, slightly lower than the corresponding values in the drusen-only model (0.432 and 0.378). This modest decrease likely reflects


the increased complexity introduced by aggregating multiple lesion types into a single benchmark. As the model must attend to a broader range of features, its attention may be more


distributed across the image, resulting in slightly lower overlap with the composite segmentation mask. Nevertheless, the strong positive correlation between MXI and MXI_GT, as shown by


Spearman’s rank correlation of 0.943 (p = 1.88 × 10⁻⁴⁴) and Kendall’s Tau of 0.822 (p = 8.04 × 10⁻²⁸), supports the robustness of MXI in capturing medically relevant features even in the


more complex multi-lesion setting. DISCUSSION To make deep learning models’ diagnostic outcomes medically explainable, the models must integrate clinical knowledge. However, deep learning


excels at uncovering complex patterns without predefined rules, and imposing domain-specific knowledge risks limiting their ability to learn and generalize effectively. Our approach tackles


this trade-off by incorporating medical explainability in a way that enhances, rather than compromises, the predictive power of deep learning. The proposed multi-task learning architecture


preserves the strengths of conventional deep learning while generating biomarkers through the segmentation path to support and validate classification results. By integrating segmentation


loss into the overall loss function, the model prioritizes disease-related biomarkers in training while retaining the flexibility to learn nuanced patterns through the classification path.
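To make the role of the segmentation loss concrete, the combined objective of Eqs. (1) to (5) can be sketched as below. This is a minimal standard-library illustration rather than the authors' PyTorch code. It uses the standard binary cross-entropy form for the pixel-wise term, and the function names, the clamping constant `EPS`, and the zero-loss convention for an empty Dice denominator are assumptions made for this example.

```python
import math
from typing import Sequence

EPS = 1e-7  # numerical guard against log(0); an assumption, not from the paper


def bce(p: float, y: int) -> float:
    """Binary cross-entropy for one prediction, Eq. (1)."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def pixel_ce(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over all pixels, Eq. (2)."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)


def dice_loss(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Soft Dice loss, Eq. (3); 0.0 if both inputs sum to zero (assumed)."""
    inter = sum(p * y for p, y in zip(probs, labels))
    denom = sum(probs) + sum(labels)
    return 1.0 - 2.0 * inter / denom if denom else 0.0


def total_loss(p_cls: float, y_cls: int,
               probs: Sequence[float], labels: Sequence[int]) -> float:
    """L_total = L_cls + (L_ce + L_dice), Eqs. (4) and (5), equal weights."""
    l_seg = pixel_ce(probs, labels) + dice_loss(probs, labels)
    return bce(p_cls, y_cls) + l_seg


# Toy example (hypothetical values): one image-level prediction, four pixels.
loss = total_loss(0.9, 1, [0.8, 0.1, 0.7, 0.2], [1, 0, 1, 0])
print(round(loss, 4))
```

Because the three terms are summed with equal weight, a prediction that classifies the image correctly but misses the lesion pixels still incurs a substantial loss, which is one way to read the statement that the segmentation path steers the shared encoder toward lesion-related features.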


This shared learning mechanism enhances the model’s focus on AMD-associated lesions, leading to improved performance and robustness. Our proposed MXI metric further ensures that the model’s


decisions are medically explainable. The DLMX model demonstrates competitive results when compared to those from the ADAM Challenge Competition21. For AMD classification, DLMX with an


EfficientNet-B7 backbone achieves an AUC of 0.96, compared to the best and median results from the ADAM competition of 0.9714 and 0.9287, respectively, across 10 participating teams. For


drusen segmentation, DLMX achieves a DSC of 0.59, compared to the competition’s best and median results of 0.5549 and 0.4483, respectively. Our approach relies on established medical


knowledge and known biomarkers to provide clinical explanations for the model output. This approach aligns with current diagnostic practices, where decisions are based on recognized


biomarkers. As new biomarkers are identified, they can be incorporated into the DLMX framework, enhancing its diagnostic capabilities and adaptability. Analysis of heatmaps indicates that


the model typically focuses on two regions: the drusen area and the optic disc/cup area. The emphasis on the optic disc/cup region may be attributed to its rich vascular network, potentially


revealing pathological features associated with AMD. This observation suggests the potential for future work to incorporate vascular network features into the segmentation path, as deep


learning models may uncover new, clinically relevant biomarkers that have yet to be recognized. In summary, our proposed methodology offers a practical solution for advancing explainable AI


in medical imaging. By integrating clinical reasoning into model outputs, this method enables clinicians and patients to better understand the model’s decisions, fostering trust in AI-based


diagnoses. Furthermore, the DLMX model provides a scalable framework for enhancing medical explainability, as future work can incorporate a broader range of biomarkers and validate the model


across diverse datasets and diseases. DATA AVAILABILITY The Automatic Detection Challenge on Age-Related Macular Degeneration (ADAM) dataset is available at


https://ieee-dataport.org/documents/adam-automatic-detection-challenge-age-related-macular-degeneration.

REFERENCES

* Akkara, J. D. & Kuriakose, A. Role of artificial intelligence and machine learning in ophthalmology. _Kerala J. Ophthalmol._ 31(2), 150–160. https://doi.org/10.4103/kjo.kjo_54_19 (2019).
* Ting, D. S. W. et al. Deep learning in ophthalmology: The technical and clinical considerations. _Prog. Retin. Eye Res._ https://doi.org/10.1016/j.preteyeres.2019.04.003 (2019).
* Li, T. et al. Applications of deep learning in fundus images: A review. _Med. Image Anal._ https://doi.org/10.1016/j.media.2021.101971 (2021).
* Pead, E. et al. Automated detection of age-related macular degeneration in color fundus photography: A systematic review. _Surv. Ophthalmol._ 64(4), 498–511. https://doi.org/10.1016/j.survophthal.2019.02.003 (2019).
* Thompson, A. C., Jammal, A. A. & Medeiros, F. A. A review of deep learning for screening, diagnosis, and detection of glaucoma progression. _Transl. Vis. Sci. Technol._ https://doi.org/10.1167/tvst.9.2.42 (2020).
* Hagiwara, Y. et al. Computer-aided diagnosis of glaucoma using fundus images: A review. _Comput. Method. Progr. Biomed._ 165, 1–12. https://doi.org/10.1016/j.cmpb.2018.07.012 (2018).
* Wadden, J. J. Defining the undefinable: The black box problem in healthcare artificial intelligence. _J. Med. Ethic._ 48, 764–768. https://doi.org/10.1136/medethics-2021-107529 (2022).
* van der Velden, B. H. M., Kuijf, H. J., Gilhuijs, K. G. A. & Viergever, M. A. Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. _Med. Image Anal._ https://doi.org/10.1016/j.media.2022.102470 (2022).
* Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. _Nat. Med._ 28, 31–38. https://doi.org/10.1038/s41591-021-01614-0 (2022).
* Singh, R. P. et al. Current challenges and barriers to real-world artificial intelligence adoption for the healthcare system, provider, and the patient. _Transl. Vis. Sci. Technol._ https://doi.org/10.1167/tvst.9.2.45 (2020).
* Kundu, S. AI in medicine must be explainable. _Nat. Med._ 27, 1328. https://doi.org/10.1038/s41591-021-01461-z (2021).
* de Vries, B. M. et al. Explainable artificial intelligence (XAI) in radiology and nuclear medicine: A literature review. _Front. Med._ https://doi.org/10.3389/fmed.2023.1180773 (2023).
* Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In _Proc. of the IEEE conference on computer vision and pattern recognition_. 2921–2929 https://doi.org/10.1109/CVPR.2016.319 (2016).
* Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In _Proc. of the IEEE international conference on computer vision (ICCV)_. 618–626 https://doi.org/10.1109/ICCV.2017.74 (2017).
* Wong, W. L. et al. Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: A systematic review and meta-analysis. _Lancet Glob Health._ https://doi.org/10.1016/S2214-109X(13)70145-1 (2014).
* Mitchell, P., Liew, G., Gopinath, B. & Wong, T. Y. Age-related macular degeneration. _Lancet_ 392(10153), 1147–1159. https://doi.org/10.1016/S0140-6736(18)31550-2 (2018).
* Burlina, P. M. et al. Use of deep learning for detailed severity characterization and estimation of 5-year risk among patients with age-related macular degeneration. _JAMA Ophthalmol._ 136(12), 1359–1366. https://doi.org/10.1001/jamaophthalmol.2018.4118 (2018).
* Grassmann, F. et al. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. _Ophthalmol._ 125(9), 1410–1420. https://doi.org/10.1016/j.ophtha.2018.02.037 (2018).
* Burlina, P. M. et al. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. _JAMA Ophthalmol._ 135(11), 1170–1176. https://doi.org/10.1001/jamaophthalmol.2017.3782 (2017).
* González-Gonzalo, C. et al. Evaluation of a deep learning system for the joint automated detection of diabetic retinopathy and age-related macular degeneration. _Acta Ophthalmol._ 98(4), 368–377. https://doi.org/10.1111/aos.14306 (2020).
* Fang, H. et al. ADAM Challenge: Detecting age-related macular degeneration from fundus images. _IEEE Trans. Med. Imaging_ 41(10), 2828–2847. https://doi.org/10.1109/TMI.2022.3172773 (2022).
* Pascal, L. et al. Multi-task deep learning for glaucoma detection from color fundus images. _Sci. Rep._ https://doi.org/10.1038/s41598-022-16262-8 (2022).
* Ju, L. et al. Synergic adversarial label learning for grading retinal diseases via knowledge distillation and multi-task learning. _IEEE J. Biomed. Health Inform._ 25(10), 3709–3720. https://doi.org/10.1109/JBHI.2021.3052916 (2021).
* Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In _Medical Image Computing and Computer-Assisted Intervention (MICCAI)_. https://doi.org/10.1007/978-3-319-24574-4_28 (2015).
* Tan, M. & Le, Q. V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In _Proc. of the 36th International Conference on Machine Learning (ICML)_. 6105–6114 http://proceedings.mlr.press/v97/tan19a.html (2019).
* He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. In _Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)_. 770–778 https://doi.org/10.1109/CVPR.2016.90 (2016).
* Caruana, R. Multitask learning. _Mach. Learn._ 28, 41–75. https://doi.org/10.1023/A:1007379606734 (1997).
* Zhao, Y., Wang, X., Che, T., Bao, G. & Li, S. Multi-task deep learning for medical image computing and analysis: A review. _Comput. Biol. Med._ 153, 106496. https://doi.org/10.1016/j.compbiomed.2022.106496 (2023).
* Otsu, N. A threshold selection method from gray-level histograms. _IEEE Trans. Syst. Man Cybern._ 9(1), 62–66. https://doi.org/10.1109/TSMC.1979.4310076 (1979).


AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

* Lily Shi, The Harker School, San Jose, CA, 95129, USA

CONTRIBUTIONS

The author was responsible for the conceptualization, study design, methodology, presented results, and manuscript preparation.

CORRESPONDING AUTHOR

Correspondence to Lily Shi.

ETHICS DECLARATIONS

COMPETING INTERESTS

The author declares no competing interests.

ADDITIONAL INFORMATION

PUBLISHER’S NOTE

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

RIGHTS AND PERMISSIONS

OPEN ACCESS This article is licensed under a Creative Commons


Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give


appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission


under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons


licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by


statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


http://creativecommons.org/licenses/by-nc-nd/4.0/.

ABOUT THIS ARTICLE

CITE THIS ARTICLE

Shi, L. Enhancing medical explainability in deep learning for age-related macular degeneration diagnosis. _Sci Rep_ 15, 16975 (2025). https://doi.org/10.1038/s41598-025-01496-z

* Received: 09 January 2025
* Accepted: 05 May 2025
* Published: 15 May 2025
* DOI: https://doi.org/10.1038/s41598-025-01496-z

KEYWORDS

* Medical explainability
* Explainable AI
* Deep learning
* Medical imaging
* Lesion segmentation
* Age-related macular degeneration