Original Research | Volume 111, P30-37, April 2019


Comparing artificial intelligence algorithms to 157 German dermatologists: the melanoma classification benchmark

Open Access. Published: February 22, 2019. DOI: https://doi.org/10.1016/j.ejca.2018.12.016

      Highlights

      • This paper provides the first open access melanoma classification benchmark for both non-dermoscopic and dermoscopic images.
      • Algorithms can now be easily compared to the performance of dermatologists in terms of sensitivity, specificity and ROC.
      • The melanoma benchmark allows comparability between algorithms of different publications and provides a new reference standard.

      Abstract

      Background

Several recent publications have demonstrated that convolutional neural networks can classify images of melanoma on par with board-certified dermatologists. However, the lack of a public human benchmark restricts the comparability of these algorithms' performance and thereby technical progress in this field.

      Methods

An electronic questionnaire was sent to dermatologists at 12 German university hospitals. The two questionnaires comprised 100 dermoscopic and 100 clinical images, respectively (80 nevus images and 20 biopsy-verified melanoma images each), all open-source. The questionnaires also recorded years of experience in dermatology, number of skin checks performed, age, sex and rank within the university hospital or status as a resident physician. For each image, the dermatologists were asked to provide a management decision (treat/biopsy the lesion or reassure the patient). Main outcome measures were sensitivity, specificity and the receiver operating characteristic (ROC) area.

      Results

In total, 157 dermatologists assessed all 100 dermoscopic images with an overall sensitivity of 74.1%, a specificity of 60.0% and an ROC area of 0.67 (range = 0.538–0.769); 145 dermatologists assessed all 100 clinical images with an overall sensitivity of 89.4%, a specificity of 64.4% and an ROC area of 0.769 (range = 0.613–0.9). Results between the test-sets differed significantly (P < 0.05), confirming the need for a standardised benchmark.

      Conclusions

We present the first public melanoma classification benchmark for both non-dermoscopic and dermoscopic images, enabling artificial intelligence algorithms to be compared with the diagnostic performance of 145 and 157 dermatologists, respectively. The Melanoma Classification Benchmark (MClass) should be considered a reference standard for white-skinned Western populations in the field of binary algorithmic melanoma classification.


      1. Introduction

Melanoma accounts for the majority of skin cancer-related deaths worldwide [1]. Owing to the rapid increase in prevalence over recent decades, several institutions have funded programs to improve measures for prevention and early detection/screening [2,3]. Despite special training and the use of dermoscopes, dermatologists rarely exceed a sensitivity of 80% [4].
In 2017, Esteva et al. were the first to report a deep-learning convolutional neural network (CNN) image classifier whose performance in determining the management of malignant lesions from image analysis was comparable to that of 21 board-certified dermatologists [5]. During training, the CNN deconstructed digital images of skin lesions and generated its own diagnostic criteria for melanoma detection.
Several landmark publications have claimed dermatologist-level skin cancer classification via CNNs [5,6,7,8]. However, these publications did not reveal the exact training procedure or the images used for training. Moreover, the final test images used to measure the performance of these algorithms were not made publicly available. Thus, the performance of these algorithms may only be evaluated by using the International Symposium on Biomedical Imaging (ISBI) 2016 challenge test-set as a benchmark; however, this benchmark has never been fully compared with the performance of dermatologists on its 379 test images and thus provides limited information about the clinical value of an algorithm [9]. This status quo restricts the comparison between algorithms and thereby technical progress in the field [10].
In this work, we created the first publicly available Melanoma Classification Benchmark (MClass) for both dermoscopic and clinical images of melanocytic skin lesions, accompanied by an open-source test-set. MClass enables researchers to compare their artificial intelligence algorithms for the classification of melanocytic images with the performance of dermatologists. The benchmark was validated with 302 data sets (one data set = the responses of one dermatologist to one electronic questionnaire of 100 skin lesion images) created by dermatologists from 12 German university hospitals, and eight data sets created by resident physicians in private practice in Germany (157 dermatologists completed the dermoscopic survey and 145 completed the clinical survey). In addition, our work provides insights into the diagnostic performance of dermatologists for melanoma by illustrating the impact of major variables of interest (i.e. hierarchical position, residents versus university hospital physicians, sex and age).

      2. Material and methods

      2.1 Recruitment and data collection

The collaborative MClass benchmark project was introduced at the National German Skin Cancer Conference held in September 2018 in Stuttgart, Germany. Twelve leading dermatologists from 12 university hospitals in Germany (Berlin, Bonn, Erlangen, Essen, Hamburg, Heidelberg, Kiel, Magdeburg, Mannheim, Munich, Regensburg and Würzburg) agreed to participate. They encouraged their colleagues via their university email accounts to take part in the anonymous validation of the benchmark and to 'test their skills' in melanoma diagnosis via two online links to two separate questionnaires comprising 100 dermoscopic and 100 clinical test images, respectively. The ratio of nevocytic nevi (NZN) to melanoma images in the test-sets was not disclosed. At the end of the survey, participants learned their diagnostic accuracy. Data were collected between 17th September 2018 and 1st October 2018.

      2.2 Electronic questionnaire

Prior to data collection, both electronic questionnaires were developed by consensus between the authors. The first part of both questionnaires was identical and recorded age, sex, years of dermatologic practice/experience, estimated number of skin checks performed and position within the medical hierarchy. This was followed by 100 dermoscopic (link 1) or clinical (link 2) images of 80 benign nevi and 20 biopsy-verified melanomas each. For each image, the participant was asked to make a management decision: (a) biopsy/further treatment or (b) reassure the patient. The same question was asked in the study by Esteva et al. [5]. A response was mandatory for all images, and participants were not allowed to skip any question. Dermatologists were able to use digital zoom and had to use desktop screens to answer the questionnaires. All originally used image files are available at www.skinclass.de/mclass.

      2.3 Eligibility criteria

Only physicians with clinical training in dermatology were eligible, and every dermatologist was allowed to participate only once.

      2.4 Used images

All images used were open-source and anonymous. We programmed a randomiser in Python to select 100 images at random with an allocation of 80% NZN and 20% melanoma. The 80:20 ratio is based on the ISBI 2016 challenge test and training sets hosted by the International Skin Imaging Collaboration (ISIC) [9]. Accordingly, all dermoscopic images were sourced from the ISIC archive [9]. All melanomas were verified by histopathology, and the nevi were either biopsy-verified (n = 29) or verified by single-image expert consensus (n = 51) (test-set available for download [Multimedia Appendix 1]) [11]. The clinical images were obtained from the MED-NODE database; only the melanomas were biopsy-verified, and the nevi were declared benign via expert consensus [12] (test-set available for download [Multimedia Appendix 2]). All images are publicly available at www.skinclass.de/mclass, together with an Excel sheet listing each dermatologist's decision for every dermoscopic and non-dermoscopic image as well as the underlying ground truth of each image and how it was determined.
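
The randomiser itself is not published with the paper. The following is a minimal sketch of such a selection script, assuming plain Python lists of image file paths per class; all function and variable names are illustrative, not the authors' code:

```python
import random

def sample_test_set(nevi_paths, melanoma_paths, n_total=100,
                    melanoma_share=0.2, seed=None):
    """Draw a random test-set with a fixed 80:20 nevus/melanoma allocation.

    nevi_paths / melanoma_paths: lists of image file paths, e.g. from the
    ISIC archive (dermoscopic) or the MED-NODE database (clinical).
    """
    rng = random.Random(seed)
    n_melanoma = int(n_total * melanoma_share)  # 20 melanoma images
    n_nevi = n_total - n_melanoma               # 80 nevus images
    selection = (rng.sample(melanoma_paths, n_melanoma)
                 + rng.sample(nevi_paths, n_nevi))
    rng.shuffle(selection)  # interleave the two classes in random order
    return selection
```

Fixing the seed would make the draw reproducible, although the paper does not state whether this was done.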

      2.5 Analysis

      2.5.1 Data validation

Data quality is an important issue when using anonymous questionnaires, especially under conditions of obligatory participation. Careless and meaningless responses have to be identified and removed from the data set. In this work, we performed a two-step data cleaning process. To prevent bias in the selection of data entries, statistical methods were applied first. In the second validation step, we looked for contradictions in the respondent metadata; for example, no established physician could have zero years of professional experience. As the statistical outlier detection method, we applied the Local Outlier Factor (LOF) method [15]. The space of all possible management decisions consists of 100 dimensions, one for each test image, and each dimension is a binary variable. The LOF algorithm is an unsupervised method that determines the local density deviation of a given point with respect to its neighbours. The factor is close to 1.0 if a point lies in a subspace containing many other points; in our case, this means that a dermatologist's answers are very similar to those of other dermatologists. For respondents whose answers deviate strongly from those of the other dermatologists, the value is significantly larger, indicating an outlier. We considered the 30 nearest neighbours of each response, although the detected outliers are not very sensitive to the exact parameter choice. As a result, 18 dermatologists were excluded from the dermoscopic survey group and 17 from the clinical survey group; the predefined metadata exclusion criteria accounted for the following cases: age inconsistent with academic position (N = 5), identical response for all 100 images (N = 2) and double entry (participation by the same physician twice; N = 2).
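
The paper specifies the LOF method with 30 nearest neighbours but not a concrete implementation. A minimal sketch, assuming scikit-learn's LocalOutlierFactor applied to the 100-dimensional binary response vectors (placeholder data; not the authors' code):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# One row per respondent, one binary column per test image
# (1 = biopsy/treat, 0 = reassure); placeholder data for illustration.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(175, 100))

# 30 nearest neighbours, as in the paper; the LOF score is close to 1.0
# for respondents whose answer pattern resembles that of many others.
lof = LocalOutlierFactor(n_neighbors=30)
labels = lof.fit_predict(responses)          # -1 = outlier, 1 = inlier
lof_scores = -lof.negative_outlier_factor_   # larger values indicate outliers

print("flagged respondents:", np.where(labels == -1)[0])
```

Respondents flagged this way would then be cross-checked against the metadata criteria described above.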

      2.5.2 Data analysis

The survey data were extracted in .csv format and imported into a Jupyter Notebook. Python was used to calculate sensitivity, specificity and the receiver operating characteristic (ROC) area. In sub-group analyses, between-group differences were assessed using a two-sided Chi-squared test, also programmed in the Jupyter Notebook. For dichotomous predictions, the ROC area is equivalent to the average of sensitivity and specificity.
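
As a sketch of these outcome measures (not the authors' notebook code), sensitivity, specificity and the ROC area of a single reader can be computed directly from the binary decisions:

```python
import numpy as np

def benchmark_metrics(y_true, y_pred):
    """Sensitivity, specificity and ROC area for one reader's decisions.

    y_true: 1 = biopsy-verified melanoma, 0 = nevus (ground truth).
    y_pred: 1 = 'biopsy/further treatment', 0 = 'reassure the patient'.
    For a single dichotomous operating point, the area under the ROC
    curve reduces to the average of sensitivity and specificity.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2

# Illustrative call: 20 melanomas followed by 80 nevi, random decisions.
rng = np.random.default_rng(0)
truth = np.array([1] * 20 + [0] * 80)
decisions = rng.integers(0, 2, size=100)
print(benchmark_metrics(truth, decisions))
```

For the subgroup comparisons, a two-sided Chi-squared test such as scipy.stats.chi2_contingency applied to the pooled decision counts of two subgroups would match the description in the text.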

      3. Results

      3.1 Total sample

Of the 337 dermatologist-created data sets, 35 were excluded during data validation. Thus, 302 data sets (89.6%), comprising 145 clinical and 157 dermoscopic data sets, were included; 210 (64%) participants were female and 118 (36%) were male. The median age range was 30–34 years, and 60% of participants were junior physicians in their dermatologic residency; 320 dermatologists were from the 12 participating university hospitals, and eight were resident physicians in private practice who had formerly worked at one of these hospitals. Because a single invitation was sent for this survey, at least 157 German dermatologists were involved in creating this benchmark.

      3.1.1 Dermoscopic melanoma classification benchmark

      3.1.1.1 Sample characteristics

Out of 175 dermatologists, 157 (56 [35.7%] males; 101 [64.3%] females) provided valid answers. The median age range was 30–34 years (Fig. 1, left); 56.1% were junior physicians (in dermatologic residency) and 43.9% were board-certified (Fig. 1, right). The 12 participating dermatologic university hospital departments in Germany provided 163 (95.9%) of these dermatologists, and seven (4.1%) were dermatologic residents affiliated with these departments.
Fig. 1. Sample characteristics for the dermoscopic data set: age distribution (left); distribution of positions in the medical hierarchy (right).
The benchmark parameters for the dermoscopic melanoma classification benchmark (MClass-D) across various subgroups are summarised in Table 1. An overview of the results for the dermoscopic test-set is presented in Fig. 2, and Fig. 3 shows the easiest and hardest to diagnose lesions. None of the differences between the subgroups was statistically significant (P > 0.05). However, there was considerable variability in performance across the sample (ROC area range = 0.54–0.77; best 25% > 0.732; best 50% > 0.709; best 75% > 0.691).
Table 1. Benchmark parameters for MClass-D.

Subset of dermatologists | Sensitivity (%) | Specificity (%) | ROC area
All participants (N = 157) | 74.11 | 60.02 | 0.671
University hospital (N = 151) | 74.01 | 59.79 | 0.669
Private practice (resident) (N = 6) | 76.67 | 65.83 | 0.713
Practical experience (pe):
pe ≤ 2 years (N = 46) | 75.98 | 56.47 | 0.662
2 years < pe ≤ 4 years (N = 37) | 73.78 | 59.09 | 0.664
4 years < pe ≤ 12 years (N = 32) | 73.28 | 62.54 | 0.679
pe > 12 years (N = 42) | 72.98 | 62.83 | 0.679
Position in university hospital:
Junior physician (N = 88) | 74.77 | 58.15 | 0.665
Attending (N = 15) | 72.67 | 60.00 | 0.663
Senior physician (N = 45) | 73.00 | 62.31 | 0.677
Chief physician (N = 3) | 73.33 | 69.17 | 0.713
Sex:
Female (N = 101) | 77.33 | 57.34 | 0.673
Male (N = 56) | 68.30 | 64.87 | 0.666
Age (years):
20–29 (N = 45) | 74.89 | 57.00 | 0.659
30–34 (N = 48) | 74.17 | 60.44 | 0.673
35–44 (N = 44) | 75.45 | 61.70 | 0.686
>44 (N = 20) | 69.25 | 62.13 | 0.657
Number of skin screenings (noS) (N = 156; one missing value) | 73.94 | 60.13 | 0.670
noS ≤ 150 (N = 36) | 78.47 | 53.13 | 0.658
150 < noS ≤ 500 (N = 40) | 71.62 | 63.81 | 0.677
500 < noS ≤ 1000 (N = 39) | 72.69 | 61.99 | 0.673
noS > 1000 (N = 41) | 73.41 | 60.91 | 0.672

ROC, receiver operating characteristic; MClass-D, dermoscopic melanoma classification benchmark.
Fig. 2. Overview of results for the dermoscopic test-set: each dot represents the performance of an individual dermatologist.
Fig. 3. Best/worst classification results. Upper row (melanoma): images 1a and 1b were associated with the highest sensitivity (all 157 dermatologists opted for biopsy); for image 2, biopsy was recommended by only 30 dermatologists (127 opted to 'reassure the patient'). Lower row (benign nevi, biopsy-verified): images 3a (156 opted to 'reassure the patient'; one recommended biopsy) and 3b (all 157 opted to 'reassure the patient') were associated with the highest specificity; for image 4, biopsy was recommended by 156 of the 157 dermatologists.

      3.1.2 Non-dermoscopic melanoma classification benchmark

      3.1.2.1 Sample characteristics

Out of the 162 dermatologists who participated in the clinical image survey, 145 (50 males [34.5%]; 95 females [65.5%]) were included after data validation (89.5%). The median age range (30–34 years) and the occupational profile of participants are summarised in Fig. 4. The benchmark parameters for the non-dermoscopic melanoma classification benchmark (MClass-ND) across various subgroups are summarised in Table 2. An overview of the results for the non-dermoscopic test-set is presented in Fig. 5, and Fig. 6 shows the easiest and hardest to diagnose lesions.
Fig. 4. Sample characteristics for the non-dermoscopic data set: age distribution (left); distribution of positions in the medical hierarchy (right).
Table 2. Benchmark parameters for MClass-ND.

Subset of dermatologists | Sensitivity (%) | Specificity (%) | ROC area
All participants (N = 145) | 89.40 | 64.37 | 0.769
University hospital (N = 142) | 89.44 | 64.18 | 0.768
Private practice (resident) (N = 3) | 86.67 | 73.33 | 0.800
Practical experience (pe):
pe ≤ 2 years (N = 42) | 89.40 | 63.57 | 0.765
2 years < pe ≤ 4 years (N = 36) | 87.92 | 64.86 | 0.764
4 years < pe ≤ 12 years (N = 31) | 91.13 | 64.03 | 0.776
pe > 12 years (N = 36) | 89.31 | 65.10 | 0.772
Position in university hospital:
Junior physician (N = 97) | 87.68 | 64.45 | 0.761
Attending (N = 16) | 92.81 | 57.66 | 0.752
Senior physician (N = 39) | 88.71 | 65.80 | 0.773
Chief physician (N = 3) | 91.67 | 58.75 | 0.752
Sex:
Female (N = 101) | 88.71 | 63.44 | 0.761
Male (N = 57) | 87.63 | 65.68 | 0.767
Age (years):
20–29 (N = 48) | 87.50 | 64.43 | 0.760
30–34 (N = 50) | 87.80 | 65.475 | 0.766
35–44 (N = 42) | 90.12 | 62.83 | 0.765
>44 (N = 18) | 86.15 | 63.85 | 0.750
Number of skin screenings (noS) (N = 157; one missing value) | 88.38 | 64.20 | 0.760
noS ≤ 150 (N = 35) | 87.71 | 63.25 | 0.755
150 < noS ≤ 500 (N = 47) | 85.86 | 67.96 | 0.769
500 < noS ≤ 2000 (N = 45) | 88.97 | 65.22 | 0.771
noS > 2000 (N = 30) | 89.67 | 62.83 | 0.763

ROC, receiver operating characteristic; MClass-ND, non-dermoscopic melanoma classification benchmark.
Fig. 5. Overview of the results for the non-dermoscopic test-set: each dot represents the performance of an individual dermatologist.
Fig. 6. Best/worst classification results. Upper row (melanoma): images 1a and 1b were associated with the highest sensitivity (all dermatologists opted for biopsy); for image 2, biopsy was recommended by 45 dermatologists (100 opted to 'reassure the patient'). Lower row (benign nevi): images 3a and 3b were associated with the highest specificity (100% opted to 'reassure the patient'); image 4 had the lowest specificity (three dermatologists opted to reassure the patient and 142 recommended biopsy).
None of the differences between the subgroups was statistically significant (P > 0.05). However, there was substantial variability in performance (ROC area range = 0.615–0.9; best 25% > 0.766; best 50% > 0.771; best 75% > 0.764).

      3.1.2.2 Comparison of MClass-D and MClass-ND

MClass-ND, whose benign lesions were verified by expert consensus, showed significantly better sensitivity and specificity than MClass-D (P < 0.05).

      4. Discussion

In this work, we present the first public melanoma classification benchmark (MClass) for both dermoscopic (MClass-D) and non-dermoscopic (MClass-ND) images (based on 157 and 145 dermatologists, respectively) for evaluating artificial intelligence algorithms. Our results have high external validity owing to the largest number of dermatologists surveyed to date. Moreover, our results and the test-sets are available in the public domain. Previous landmark publications by Esteva et al., Marchetti et al. and Haenssle et al. involved 21, 8 and 58 dermatologists, respectively, for evaluating their algorithms; moreover, the latter two studies only compared dermoscopic images [5,7,8]. More importantly, many groups could not compare their algorithms with clinical performance owing to the lack of available image sets for measuring performance. Our work is of seminal importance as MClass-D and MClass-ND represent an open-access, standardised clinical benchmark for assessing the performance of artificial intelligence (AI) algorithms against that of dermatologists of varying sex, age and levels of training.

      4.1 Interpretation of results

In clinical practice, dermoscopy improves the sensitivity of naked-eye examination [4]. However, in our study, dermatologists performed significantly worse on dermoscopic images than on clinical images of different skin lesions (P < 0.05); this indicates that performance depends largely on the images of nevi and melanoma selected for the test-set. A similar effect was observed by Esteva et al.; the ROC for dermoscopic images in their study was worse than that for non-dermoscopic images [5]. Another similarity with the work of Esteva et al. is the use of biopsy-verified nevi for part of the dermoscopic set (obtained from the ISIC archive); such nevi are difficult to distinguish from melanoma, which is why they were sent for biopsy in the first place.
However, the outcomes (both sensitivity and specificity) achieved with both test-sets are comparable to those of previous studies [13,14].

      4.2 Generalisability

The performance of dermatologists may differ in other countries because of different education programs and different habits regarding the use of dermoscopy.

      4.2.1 Limitations

      4.2.1.1 Image only as input

A clinical encounter with the actual patient provides more information than an image alone. Haenssle et al. demonstrated that additional clinical information slightly improves the sensitivity (from 86.6% to 88.9%) and specificity (from 71.3% to 75.7%) of dermatologists [8]. However, currently tested algorithms accept only an image as input; thus, for a current benchmark, the input data must be restricted to an image to allow a direct comparison of an image classification task between algorithms and dermatologists.

      4.2.1.2 Anonymity

The anonymity of the electronic questionnaire was mandatory to protect privacy. However, anonymity carries the risk of abuse. By involving physicians exclusively via their institutional email addresses and by predefining data validation strategies, this risk was minimised, and a high rate of plausible responses was achieved (157 of 175 participants for MClass-D and 145 of 162 participants for MClass-ND).

      4.2.1.3 Allocation of images

The 1:5 melanoma/nevus ratio of the image sets equals that of the ISBI test-set [9]. In clinical practice, a 1:50 ratio would be more realistic. However, using this ratio would have required 10 times more test images per data set to achieve the same number of classified melanomas (2000 test images in total), which would have drastically reduced the number of dermatologists willing to participate.

      4.2.1.4 Generalisability

MClass may be used as a benchmark for binary decisions on images of melanocytic lesions, i.e. distinguishing nevi from melanoma, by algorithms trained on and applied to images from white-skinned Western populations. Age, sun exposure and other characteristics of the original lesions could not be controlled in our benchmark but might cause slight differences in the performance of algorithms. However, most past publications were also tested on white-skinned Western populations [5,7,8], and factors such as age and sun exposure were likewise not controlled for in past publications in this field [5,6,7,8].

      5. Conclusions

We present the first public melanoma classification benchmark for both non-dermoscopic (MClass-ND) and dermoscopic (MClass-D) images, enabling artificial intelligence algorithms to be compared with the diagnostic performance of 145 and 157 dermatologists, respectively. Future publications should consider MClass as a reference standard for binary classification of melanocytic images from white-skinned Western populations.

      Acknowledgements

The authors would like to thank and acknowledge the dermatologists who actively and voluntarily spent much time participating in the reader study; some participants asked to remain anonymous, and the authors thank these colleagues equally for their commitment. Berlin: Wiebke Ludwig-Peitsch; Bonn: Judith Sirokay; Erlangen: Lucie Heinzerling; Essen: Magarete Albrecht, Katharina Baratella, Lena Bischof, Eleftheria Chorti, Anna Dith, Christina Drusio, Nina Giese, Emmanouil Gratsias, Klaus Griewank, Sandra Hallasch, Zdenka Hanhart, Saskia Herz, Katja Hohaus, Philipp Jansen, Finja Jockenhöfer, Theodora Kanaki, Sarah Knispel, Katja Leonhard, Anna Martaki, Liliana Matei, Johanna Matull, Alexandra Olischewski, Maximilian Petri, Jan-Malte Placke, Simon Raub, Katrin Salva, Swantje Schlott, Elsa Sody, Nadine Steingrube, Ingo Stoffels, Selma Ugurel, Anne Zaremba; Hamburg: Christoffer Gebhardt, Nina Booken, Maria Christolouka; Heidelberg: Kristina Buder-Bakhaya, Therezia Bokor-Billmann, Alexander Enk, Patrick Gholam, Holger Hänßle, Martin Salzmann, Sarah Schäfer, Knut Schäkel, Timo Schank; Kiel: Ann-Sophie Bohne, Sophia Deffaa, Katharina Drerup, Friederike Egberts, Anna-Sophie Erkens, Benjamin Ewald, Sandra Falkvoll, Sascha Gerdes, Viola Harde, Axel Hauschild, Marion Jost, Katja Kosova, Laetitia Messinger, Malte Metzner, Kirsten Morrison, Rogina Motamedi, Anja Pinczker, Anne Rosenthal, Natalie Scheller, Thomas Schwarz, Dora Stölzl, Federieke Thielking, Elena Tomaschewski, Ulrike Wehkamp, Michael Weichenthal, Oliver Wiedow; Magdeburg: Claudia Maria Bär, Sophia Bender-Säbelkampf, Marc Horbrügger, Ante Karoglan, Luise Kraas; Mannheim: Jörg Faulhaber, Cyrill Geraud, Ze Guo, Philipp Koch, Miriam Linke, Nolwenn Maurier, Verena Müller, Benjamin Thomas, Jochen Sven Utikal; Munich: Ali Saeed M. Alamri, Andrea Baczako, Carola Berking, Matthias Betke, Carolin Haas, Daniela Hartmann, Markus V. Heppt, Katharina Kilian, Sebastian Krammer, Natalie Lidia Lapczynski, Sebastian Mastnik, Suzan Nasifoglu, Cristel Ruini, Elke Sattler, Max Schlaak, Hans Wolff; Regensburg: Birgit Achatz, Astrid Bergbreiter, Konstantin Drexler, Monika Ettinger, Sebastian Haferkamp, Anna Halupczok, Marie Hegemann, Verena Dinauer, Maria Maagk, Marion Mickler, Biance Philipp, Anna Wilm, Constanze Wittmann; Würzburg: Anja Gesierich, Valerie Glutsch, Katrin Kahlert, Andreas Kerstan, Bastian Schilling and Philipp Schrüfer.
      This research did not receive any specific grant from funding agencies in the public, commercial or not-for-profit sectors.

      Appendix A. Supplementary data

Supplementary data to this article (Multimedia Appendices 1 and 2) are available online.

      Conflict of interest statement

      None declared.

      References

1. Schadendorf D, van Akkooi ACJ, Berking C, et al. Melanoma. Lancet 2018;392:971-984.
2. Gordon LG, Rowell D. Health system costs of skin cancer and cost-effectiveness of skin cancer prevention and screening: a systematic review. Eur J Cancer Prev 2015;24:141-149.
3. Brinker TJ, Klode J, Esser S, Schadendorf D. Facial-aging app availability in waiting rooms as a potential opportunity for skin cancer prevention. JAMA Dermatol 2018;154:1085-1086.
4. Vestergaard M, Macaskill P, Holt PE, Menzies SW. Dermoscopy compared with naked eye examination for the diagnosis of primary melanoma: a meta-analysis of studies performed in a clinical setting. Br J Dermatol 2008;159:669-676.
5. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115.
6. Han SS, Kim MS, Lim W, Park GH, Park I, Chang SE. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J Invest Dermatol 2018;138:1529-1538.
7. Marchetti MA, Codella NCF, Dusza SW, et al. Results of the 2016 International Skin Imaging Collaboration International Symposium on Biomedical Imaging challenge: comparison of the accuracy of computer algorithms to dermatologists for the diagnosis of melanoma from dermoscopic images. J Am Acad Dermatol 2018;78:270-277.
8. Haenssle H, Fink C, Schneiderbauer R, et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann Oncol 2018;29:1836-1842.
9. Gutman D, Codella NCF, Celebi E, et al. Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1605.01397; 2016.
10. Brinker TJ, Hekler A, Utikal JS, et al. Skin cancer classification using convolutional neural networks: systematic review. J Med Internet Res 2018;20.
11. Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset: a large collection of multi-source dermatoscopic images of common pigmented skin lesions. arXiv preprint arXiv:1803.10417; 2018.
12. Giotis I, Molders N, Land S, Biehl M, Jonkman MF, Petkov N. MED-NODE: a computer-assisted melanoma diagnosis system using non-dermoscopic images. Expert Syst Appl 2015;42:6578-6585.
13. Carli P, Quercioli E, Sestini S, et al. Pattern analysis, not simplified algorithms, is the most reliable method for teaching dermoscopy for melanoma diagnosis to residents in dermatology. Br J Dermatol 2003;148:981-984.
14. Dolianitis C, Kelly J, Wolfe R, Simpson P. Comparative performance of 4 dermoscopic algorithms by nonexperts for the diagnosis of melanocytic lesions. Arch Dermatol 2005;141:1008-1014.
15. Breunig MM, Kriegel HP, Ng RT, Sander J. LOF: identifying density-based local outliers. In: ACM SIGMOD Record; May 2000.