If you don't remember your password, you can reset it by entering your email address and clicking the Reset Password button. You will then receive an email that contains a secure link for resetting your password
If the address matches a valid account an email will be sent to __email__ with instructions for resetting your password
Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
Departments of Biomedical Engineering and Physics & Radiology and Nuclear Medicine, University Medical Center Amsterdam, University of Amsterdam, Amsterdam, the NetherlandsQuantitative Healthcare Analysis group, Informatics Institute, Faculty of Science, University of Amsterdam, Amsterdam, the Netherlands
Department of Public Health, Healthcare Innovation & Evaluation and Medical Humanities, Julius Center Research Program Methodology, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
Image Sciences Institute, University Medical Center Utrecht, Utrecht University, Utrecht, the NetherlandsMedical Image Analysis, Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands
Biomarkers for checkpoint inhibitor response prediction in melanoma are lacking.
•
We show that radiomics derived from baseline CT imaging are moderately predictive.
•
However, a radiomics model does not improve over simpler clinical predictors.
•
Combining radiomics with clinical predictors does not yield better predictions.
•
This is most likely due to the overlap in predictive information in both models.
Abstract
Introduction
Predicting checkpoint inhibitors treatment outcomes in melanoma is a relevant task, due to the unpredictable and potentially fatal toxicity and high costs for society. However, accurate biomarkers for treatment outcomes are lacking. Radiomics are a technique to quantitatively capture tumour characteristics on readily available computed tomography (CT) imaging. The purpose of this study was to investigate the added value of radiomics for predicting clinical benefit from checkpoint inhibitors in melanoma in a large, multicenter cohort.
Methods
Patients who received first-line anti-PD1±anti-CTLA4 treatment for advanced cutaneous melanoma were retrospectively identified from nine participating hospitals. For every patient, up to five representative lesions were segmented on baseline CT, and radiomics features were extracted. A machine learning pipeline was trained on the radiomics features to predict clinical benefit, defined as stable disease for more than 6 months or response per RECIST 1.1 criteria. This approach was evaluated using a leave-one-centre-out cross validation and compared to a model based on previously discovered clinical predictors. Lastly, a combination model was built on the radiomics and clinical model.
Results
A total of 620 patients were included, of which 59.2% experienced clinical benefit. The radiomics model achieved an area under the receiver operator characteristic curve (AUROC) of 0.607 [95% CI, 0.562–0.652], lower than that of the clinical model (AUROC=0.646 [95% CI, 0.600–0.692]). The combination model yielded no improvement over the clinical model in terms of discrimination (AUROC=0.636 [95% CI, 0.592–0.680]) or calibration. The output of the radiomics model was significantly correlated with three out of five input variables of the clinical model (p < 0.001).
Discussion
The radiomics model achieved a moderate predictive value of clinical benefit, which was statistically significant. However, a radiomics approach was unable to add value to a simpler clinical model, most likely due to the overlap in predictive information learned by both models. Future research should focus on the application of deep learning, spectral CT-derived radiomics, and a multimodal approach for accurately predicting benefit to checkpoint inhibitor treatment in advanced melanoma.
Survival of patients with advanced melanoma has improved dramatically after the introduction of immunotherapy. The survival of patients with unresectable stage III and stage IV melanoma has historically been very poor with a 1-year overall survival of 25% in phase II trials up to 2007 [
Meta-analysis of phase II cooperative group trials in metastatic stage IV melanoma to determine progression-free and overall survival benchmarks for future phase II trials.
]. In patients treated with anti-PD1 antibodies, real-world 1-year overall survival is now 67%, with 40% of patients achieving remissions of several years [
However, not all patients benefit from checkpoint inhibitors. At 6 months after start of anti-PD1 treatment, 43% of patients experience progression or death. Furthermore, overall survival of patients with progression at 6 months was shown to be only 16% at 30 months. This is in contrast to a 30-month overall survival of 60%, 79%, and 96% for patients with stable disease, partial response, and complete response at 6 months of follow-up, respectively, in real-world data [
Accurate prediction of treatment benefit is an important topic for several reasons. First, treatment with checkpoint inhibitors is associated with severe and potentially fatal or irreversible toxicity. Severe toxicity occurs in 10–15% of patients treated with anti-PD1 monotherapy [
]. Second, checkpoint inhibition therapy is very costly. Depending on country and setting, estimates of additional costs per gained quality-adjusted life year range from 25,000 to 81,000 United States Dollars [
]. Lastly, if patients who will not benefit are identified before start of treatment, alternative or experimental therapies can be started without delay.
Previously identified predictors for treatment outcomes are not yet sufficient to guide clinical decisions. Known clinical predictors of poor outcome include high tumour load, presence of liver metastases and symptomatic brain metastases, increased lactate dehydrogenase (LDH), and worse Eastern Cooperative Oncology Group (ECOG) performance status [
]. In addition, other biomarkers have been explored, such as PD-L1 expression, tumour mutational burden, and histopathology features. Thus far, however, these predictors are not strong enough to predict treatment outcomes with high certainty [
Radiomics are by now an established modality for diagnosis, prognosis, and prediction. Radiomics capture information about shape, intensity, and texture of lesions in imaging and thereby form a reflection of tumour characteristics, such as necrosis or vascularisation. These extracted features can subsequently be correlated to a clinical outcome [
The added value of computed tomography (CT) radiomics for predicting clinical benefit of checkpoint inhibitors in melanoma remains to be determined in large multicenter studies. Three previous smaller studies have investigated radiomics for this purpose, with conflicting findings. The studies by Trebeschi et al. [
Combination of whole-body baseline CT radiomics and clinical parameters to predict response and survival in a stage-IV melanoma cohort undergoing immunotherapy.
] report a significant discriminative value of radiomics for treatment outcomes (area under the receiver operator characteristic curve [AUROC]=0.78 on a dataset of 80 patients, and AUROC=0.64 on a dataset of 262 patients, respectively). In contrast, Brendlin et al. [
A Machine learning model trained on dual-energy CT radiomics significantly improves immunotherapy response prediction for patients with stage IV melanoma.
] reported a non-discriminative performance, despite using a similar methodology (AUROC=0.50 in 140 patients). These differences in results highlight the importance of a large dataset to determine the value of radiomics. Furthermore, only the study by Peisen et al. investigated the added value over a simpler clinical model, with varying results across different outcomes. Lastly, none of the previous studies evaluated their model on data from other centres, although variability in scanner protocol may add significant noise [
]. In this study, we aimed to address these limitations and determine the added value of radiomics for predicting checkpoint inhibitor outcomes in a multicenter study in advanced melanoma.
2. Materials and methods
2.1 Patient selection
Eligible patients were retrospectively identified from high-quality registry data [
] from nine participating centres in The Netherlands (Amphia Ziekenhuis, Isala Zwolle, Leids Universitair Medisch Centrum, Máxima MC, Medisch Spectrum Twente, Radboudumc, UMC Utrecht, Amsterdam UMC, Zuyderland MC). Patients over the age of 18 were included if they received first-line treatment with anti-PD1±anti-CTLA4 checkpoint inhibition for irresectable stage IIIC or stage IV cutaneous melanoma after 01–01–2016. Exclusion criteria were (i) unavailability of baseline contrast-enhanced CT imaging (CE-CT), (ii) lack of eligible target lesions, and (iii) less than 6 months of follow-up. Clinical characteristics were collected for the included patients and compared to those of the excluded patients. CT acquisition characteristics were extracted for included patients.
2.2 Lesion selection and segmentation
For every patient, one to five lesions were selected on baseline CT imaging and segmented. We aimed to make this selection of lesions as informative and representative as possible by using the following protocol: first, the five largest lesions were selected with a maximum of two per organ. If more lesions remained after segmenting a maximum of two per organ, the largest remaining lesions were segmented up to a total of five. Lesion selections were made without knowledge of the outcome. Lesions were excluded if they were not well-demarcated, affected by imaging artifacts or if the maximum diameter was less than 5 mm. Segmentations were performed in 3D Slicer [
] on the series with the lowest slice thickness by authors LSM and IAJD, under supervision of board-certified radiologists with 17 and 18 years of experience (PJ and TL, respectively).
2.3 Feature extraction
Features were extracted from the segmented volumes using PyRadiomics [
]. For every volume, 1874 features were extracted at five different levels of detail, resulting in a total of 9370 features. An overview of the extracted features is given in the Supplementary Methods. Interobserver agreement of segmentations and features was calculated using Dice scores and intraclass correlation coefficient (ICC), respectively, based on 16 scans segmented by both observers (LSM, IAJD).
2.4 Outcome definition
The primary outcome was clinical benefit, defined as the best overall response of partial or complete response, or stable disease for a minimum of 6 months after start of treatment; response was determined by the treating physician in line with RECIST 1.1 criteria [
]. The secondary outcome was objective response, defined as the best overall response of partial or complete response. Clinical benefit was used as the primary outcome, as the intended use of the model was to identify patients who would quickly progress despite treatment and therefore not derive any benefit from treatment. Individual lesion response was assessed using maximum diameter recordings at baseline and at 3, 6 and 9 months, or until treatment was changed. Given the possibility of pseudo-progression, the last available follow-up was used to determine lesion outcomes. If the maximum diameter at the last follow-up was less than 120% of the baseline diameter, the lesion was labelled as ‘does benefit’, and ‘does not benefit’ otherwise. In parallel, the lesion was labelled as ‘responsive’ if the maximum diameter was less than 70% of the baseline diameter at the last follow-up, and ‘non-responsive’ otherwise. These lesion-level cut-offs were chosen in correspondence with the patient-level cut-offs used in RECIST 1.1 [
Three predictive models were compared: a model based on radiomics, a model based on baseline clinical characteristics and an ensemble model that combined the predictions of these models. The radiomics model consisted of a machine learning pipeline that automatically selected optimal components and hyperparameters for feature selection, dimensionality reduction, and classification (Fig. 1). This pipeline was trained to predict outcomes per lesion; these outputs per lesion were then aggregated to a patient-level prediction. The clinical model used the same machine learning pipeline, which was fitted on five clinical variables that were consistently shown to be predictive of checkpoint inhibitor treatment outcomes in previous literature [
]. These predictors were (i) ECOG performances status, (ii) LDH level, presence of (iii) brain and (iv) liver metastases, and (v) number of affected organs. All variables were one-hot encoded; missing values were encoded as a separate label. The ensemble model consisted of a logistic regression fitted on the output of the radiomics and clinical model. All three models were evaluated using a nested cross validation. The inner loop was used for optimal model selection and hyperparameter tuning; the outer loop was used to evaluate predictive performance on unseen data and was conducted in a leave-one-centre-out manner. Further details are supplied in the Supplementary Methods.
The discriminative performance of the models was evaluated using the AUROC and corresponding 95% confidence interval. The cross-validated AUROC and confidence interval were calculated using the cvAUC R package [
]. Methods for comparing cross-validated AUROCs between models are detailed in the Supplementary Methods. Subgroup analyses were conducted for patients treated with anti-PD1 therapy and anti-PD1 plus anti-CTLA4 therapy by evaluating the fitted model only on patients from the respective groups. The output of the radiomics model for predicting clinical benefit was correlated to the input variables of the clinical model to determine if the radiomics model learned features that were already represented in the baseline clinical model.
] was completed and is available in Supplementary Table 1. The study design was reviewed by the Medical Ethics Committee and not considered subject to the Medical Research Involving Human Subjects Act in compliance with Dutch regulations; informed consent was waived.
3. Results
3.1 Patient characteristics
Out of 1191 eligible patients, 620 patients with a total of 2352 lesions were included. A flowchart of the selection process is shown in Fig. 2. The rate of clinical benefit was 59.2% (367 patients); the objective response rate was 51.3% (318 patients); Lesion level outcomes were available for 75.2% of lesions. Lesion-level outcomes could not be recorded for patients from the Radboudumc (327 lesions, 13.9%) due to local regulations. In addition, follow-up imaging was unavailable due to patient death or clinical progression before the first follow-up moment for 185 lesions (7.9%), due to the lesion falling outside the field of view in 27 lesions (1.1%) and due to technical issues in 44 lesions (1.9%). Rate of benefit was 79.4% among lesions with available labels, whereas response rate was 54.8% (Supplementary Table 5). Of all eligible patients, 490 patients were excluded because of the unavailability of contrast-enhanced pretreatment CT. In most of these cases, an 18-fluorodeoxyglucose positron emission tomography (FDG-PET) with low-dose CT was made. Characteristics for the included patients are shown in Table 1 and compared to those of excluded patients in Supplementary Table 2. The subgroups of patient treated with anti-PD1 and combination therapy consisted of 370 and 250 patients, respectively. Supplementary Tables 3 and 4 show patient characteristics per centre, and for the subgroups treated with monotherapy and combination therapy, respectively. CT acquisition characteristics per centre are displayed in Supplementary Table 6.
52 lesions in 16 scans were segmented by two observers. Segmentations corresponded with a median Dice score of 0.88 (IQI 0.82–0.92). For the extracted features, the median ICC was 0.97 (IQI 0.92–0.99).
3.3 Treatment outcome prediction
For predicting clinical benefit, the radiomics model achieved an AUROC of 0.607 [95% CI, 0.562–0.652], the clinical model an AUROC of 0.646 [95% CI, 0.600–0.692], and the ensemble model an AUROC of 0.636 [95% CI, 0.592–0.680]. The difference in AUROC between the ensemble and clinical model was not statistically significant (Supplementary Figure 1). Calibration curves showed adequate calibration of the three models with no evidence of poor fit (Hosmer-Lemeshow p > 0.07). The range of predicted probabilities was comparable between models (IQI 0.56–0.65, 0.53–0.67, and 0.52–0.69 for the radiomics, clinical and ensemble model, respectively). Results were similar for predicting objective response (Supplementary Figure 2–3). Predictive performance for both outcomes was comparable in subgroups of patients treated with monotherapy and combination therapy, with a trend of better discrimination in the subgroup of patients treated with combination therapy (Supplementary Figures 4–7). Details of the selected models and hyperparameters per fold are shown in Supplementary Table 7 (Fig. 3).
Fig. 3Receiver operator characteristic (ROC) curves and calibration curves for predicting clinical benefit. (A–C) ROC curves for predicting durable clinical benefit in patients with melanoma treated with anti-PD1±anti-CTLA4 checkpoint inhibition for the radiomics model (A), clinical model (B) and combination model (C). Grey curves correspond to results per fold; blue curves are the weighted average of the results per fold. The area under the curve (AUC) with corresponding 95% confidence intervals is displayed. (D–F) LOESS fitted calibration curves for predicting durable clinical benefit in the radiomics model (D), clinical model (E), and combination model (F); the shaded area corresponds to±1 standard deviation. Histograms of the predictions for positive (blue) and negative (orange) samples are provided below the curves, the x-axis displays the predicted values for these histograms. P-values of the Hosmer-Lemeshow goodness-of-fit test are shown in the plot titles.
The predicted probability of clinical benefit by the radiomics model was significantly lower in patients in whom liver metastases were absent (Mann-Whitney U, p < 0.001, Fig. 4A), in patients with higher LDH (Kruskal-Wallis p < 0.001, Fig. 4D) and who had more affected organs (Mann-Whitney U p < 0.001, Fig. 4E). The output of radiomics model was not significantly different in patients with and without brain metastases), and for different categories of ECOG performance status (Fig. 4B–C). The output of radiomics and clinical models were significantly and positively correlated (Spearman’s correlation coefficient = 0.369, p < 0.001, Fig. 4F).
Fig. 4Correspondence between the output of the clinical and radiomics models for predicting clinical benefit. Graphical overview of correspondence between the output of the radiomics and clinical models. (A–E) Boxenplots of the output of the radiomics model, compared across different values for clinical predictors. (A) The output of the radiomics model is significantly lower in patients with liver metastases than in patients without (Mann-Whitney U p < 0.001). (B) No statistical difference was found in the output of the radiomics model between patients without or with asymptomatic or symptomatic brain metastases (Kruskal-Wallis p = 0.074) and Eastern Cooperative Oncology Group performance status (Kruskal-Wallis p = 0.201). D) The output of the radiomics model is significantly lower in patients with higher levels of lactate dehydrogenase (Kruskal-Wallis p < 0.001) and with more affected organs (Mann-Whitney U p < 0.001). (F) The outputs of the clinical and radiomics models (predicted probability of response) are positively correlated (Spearman’s rank correlation coefficient = 0.369, p < 0.001).
The present work shows that radiomics are moderately predictive of checkpoint inhibitor treatment outcomes in patients with advanced melanoma. The results were consistent for both clinical benefit and objective response rate, and are most in line with the findings of the earlier study by Peisen et al. A recent work by Dercle et al. allows for comparison to a model that also incorporates radiomics from on-treatment CT scans [
]. This model reached an AUROC of 0.92 for predicting overall survival at 6 months, indicating that on-treatment radiomics are strongly predictive. However, most toxicity occurs in the first 3 months, and long-term outcomes can already be accurately predicted using on-treatment information without the use of radiomics [
]. Predicting response using the 3-month on-treatment scan therefore appears to be of limited clinical relevance.
Addition of radiomics to known clinical predictors, however, did not yield improvement in predictive value. The combination model was not superior to the clinical model in either discrimination or calibration. This lack of improvement can be explained due to an overlap in the information learned by the radiomics model, and the information that is already represented in clinical variables. As demonstrated, the radiomics model indirectly learns to detect the presence of liver metastases and the amount of tumour burden, as reflected in LDH and number of affected organs. The fact that this information is indeed learned as expected is a strong argument for the validity of the present work. Furthermore, this indicates that such overlap is likely to be present in any radiomics model which is investigated for clinical purposes.
Studies on radiomics should assess the added value over simpler predictors. Many smaller exploratory studies have been conducted into the predictive value of radiomics for checkpoint inhibitor outcomes across different malignancies [
]. Their findings are almost exclusively positive, but the added value over clinical predictors was seldomly assessed. The present work demonstrates that clinical predictors can be captured by radiomics, and that the added value of radiomics should therefore always be investigated, even in exploratory studies.
Future works should aim to improve on radiomics through deep learning or spectral CT-derived radiomics. Deep learning has a significant advantage over handcrafted radiomics, as this method is not limited by predefined features in what information can be captured. Instead, a deep learning approach is given the raw data as input and learns informative features on the fly [
]. Furthermore, spectral CT-derived radiomics were shown to be superior over single energy radiomics for predicting response to checkpoint inhibition in patients with melanoma by Brendlin et al. [
A Machine learning model trained on dual-energy CT radiomics significantly improves immunotherapy response prediction for patients with stage IV melanoma.
]. As spectral CT scanners become increasingly available, this approach may be tested more thoroughly in future research.
Lastly, the multimodal approach should be extended with other data sources. Accurately predicting checkpoint inhibitor treatment outcomes in melanoma remains challenging. It is possible that individual biomarkers are insufficient to guide clinical decisions. An approach that combines different data sources may therefore prove to be superior. A possible modalities that may be explored for this purpose is histopathology imaging [
], which will be investigated in this cohort in a future work.
The strengths of this work are the large sample size, the multicenter design and extensive hyperparameter optimisation. This is the largest work published on radiomics for prediction of checkpoint inhibitor treatment outcomes in any malignancy [
]. This large size adds to the weight of the presented conclusion. Furthermore, the dataset in this work includes patients from nine different centres. As stability of radiomics features across scanner types and protocols is far from certain, external validation is essential for determining the practical value of a radiomics approach. Lastly, the proposed pipeline systematically explores design choices, from the extraction of radiomics to the final prediction. This approach should maximise potential performance by avoiding arbitrary and therefore possibly suboptimal design choices.
A potential limitation is the exclusion of a large fraction of patients due to unavailability of CE-CT imaging. Comparison of patient characteristics between the included and excluded groups showed minor differences overall, with a trend towards more progressed disease in the included patients. Our hypothesis for this is that patients with more progressed disease are more likely to directly present to medical oncology, instead of being referred after having undergone imaging by a different specialty where FDG-PET is the preferred modality. Although this selection may theoretically have influenced the presented results, this risk is arguably limited as the characteristics of in- and excluded patients are overall very comparable.
Furthermore, patients with stable disease for a minimum of 6 months were labelled as having clinical benefit. This group could therefore theoretically include patients with indolent tumour progression (less than 120% of original diameters in 6 months), without effect from checkpoint inhibition therapy. However, given the consistent results across outcomes and small proportion of patients for which this may be the case, the impact on eventual results is likely limited.
In conclusion, radiomics are predictive of checkpoint inhibition treatment outcomes in patients with advanced melanoma, but did not improve predictive value over a simpler clinical model. A radiomics model can predict both clinical benefit and response from checkpoint inhibitor therapy with moderate discriminative performance. However, the predictive value of this radiomics model overlaps with that of a clinical model, which is evident from the lack of improvement of a combined model. The added value of a radiomics approach therefore appears to be limited. Future research should focus on related techniques, such as deep learning or radiomics on dual energy CT images. In addition, an approach that combines radiomics and clinical data with other modalities may provide a next step towards accurate prediction of checkpoint inhibitor treatment outcomes in melanoma.
Funding
This research was funded by The Netherlands Organization for Health Research and Development (ZonMW, project number 848101007) and Philips.
Due to confidentiality agreements, clinical and imaging data cannot be made available.
Conflict of interest statement
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: AvdE has advisory relationships with Amgen, Bristol Myers Squibb, Roche, Novartis, MSD, Pierre Fabre, Sanofi, Pfizer, Ipsen, Merck and has received research study grants not related to this paper from Sanofi, Roche, Bristol Myers Squibb, Idera, and TEVA and has received travel expenses from MSD Oncology, Roche, Pfizer, and Sanofi and has received speaker honoraria from BMS and Novartis. JdG has consultancy/advisory relationships with Bristol Myers Squibb, Pierre Fabre, Servier, MSD, Novartis. PJ has a research collaboration with Philips Healthcare and Vifor Pharma. MBS has consultancy/advisory relationships with Pierre Fabre, MSD, and Novartis. EK has consultancy/advisory relationships with Bristol Myers Squibb, Novartis, Merck, Pierre Fabre, Lilly, Bayer, EISAI, and Ipsen, and received research grants not related to this paper from Bristol Myers Squibb and Pierre Fabre. PD has consultancy/advisory relationships with Paige, Pantarei, and Samantree paid to the institution, and research grants from Pfizer, none related to current work, and paid to institute. KS has advisory relationships with Bristol Myers Squibb, Novartis, MSD, Pierre Fabre, AbbVie and received honoraria from Novartis, MSD, and Roche and research funding from Bristol Myers Squibb, TigaTx, and Philips. TL has received research funding from Philips. All remaining authors have declared no conflicts of interest.
Meta-analysis of phase II cooperative group trials in metastatic stage IV melanoma to determine progression-free and overall survival benchmarks for future phase II trials.
Combination of whole-body baseline CT radiomics and clinical parameters to predict response and survival in a stage-IV melanoma cohort undergoing immunotherapy.
A Machine learning model trained on dual-energy CT radiomics significantly improves immunotherapy response prediction for patients with stage IV melanoma.