OVERVIEW
Nodule detection on lung computed tomography (CT) has long been engineered with the goal of improving radiologists’ sensitivity and specificity in identifying nodules. To this end, advanced deep learning techniques have been used to train models that make full use of 3D low-dose computed tomography (LDCT) volumes, pathology-confirmed case results and prior volumes (Ardila et al., 2019). This article covers the limitations of current approaches, the modelling framework and clinical evaluation of a proposed deep learning model, and the new opportunities this opens up in the medical field.
WORKING PRINCIPLE OF CT SCANS
Computed tomography uses ionizing radiation (x-rays) coupled with an electronic detector array to record a pattern of densities and create a cross-sectional image of tissue. The x-ray beam rotates around the object within the scanner so that multiple x-ray projections pass through the object. Stacking the cross-sections yields an image of the object’s internal structure (Caldemeyer and Buckwalter, 1999).
According to the Lung-RADS (a lung imaging reporting and data system to aid with findings in LDCT screening exams for lung cancer) guidelines for LDCT lung cancer screening by the American College of Radiology, evaluation of the scan by radiologists is based primarily on nodule size, density and growth. At screening sites, Lung-RADS and other models are used to determine malignancy risk ratings that drive recommendations for clinical management.
LIMITATIONS OF CURRENT APPROACHES
Improving the sensitivity and specificity of lung cancer screening is key to reducing the high clinical costs of missed or late diagnoses and of unnecessary biopsy procedures, which result from false negatives and false positives respectively (Black, W. C. et al., 2014). Sensitivity and specificity measure a test’s ability to correctly classify a person as having a disease or not. Sensitivity refers to a test’s ability to designate an individual with the disease as positive. The higher the sensitivity, the fewer false negative results there are, and the fewer cases are missed. Specificity refers to a test’s ability to designate an individual without the disease as negative. A highly specific test implies that there are few false positive results (New York Department of Health, 1999). Sensitivity and specificity are expressed mathematically as follows:
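Sensitivity = a/(a+c)
Specificity = d/(b+d)
where a is the number of true positives, b the number of false positives, c the number of false negatives and d the number of true negatives.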
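As a small worked example, the two formulae can be computed directly from the four counts of a screening table. The sketch below assumes the usual convention in which a is the number of true positives, b false positives, c false negatives and d true negatives; the function names and the counts are hypothetical and chosen only for illustration.

```python
def sensitivity(a: int, c: int) -> float:
    """True positives (a) divided by everyone who has the disease (a + c)."""
    return a / (a + c)


def specificity(b: int, d: int) -> float:
    """True negatives (d) divided by everyone without the disease (b + d)."""
    return d / (b + d)


# Hypothetical counts, not taken from any study cited in this article.
a, b, c, d = 90, 50, 10, 850

print(f"Sensitivity: {sensitivity(a, c):.3f}")
print(f"Specificity: {specificity(b, d):.3f}")
```

With these made-up counts, 10 of the 100 diseased individuals are missed (false negatives) and 50 of the 900 healthy individuals are flagged unnecessarily (false positives).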
Yet limitations in lesion localization and malignancy risk evaluation still exist in current approaches. Some examples are given below:
Lung-RADS
Disagreement over Lung-RADS categories was observed in a study on observer variability in the Lung-RADS categorization of lung cancer screening CTs. 6% of observers assigned the wrong Lung-RADS category based on their personal annotations of nodule type, size and growth (van Riel, Sarah J., et al., 2019). Such misclassification illustrates how reader variability can lower the sensitivity and specificity of lung cancer screening.
Computer-aided detection (CADe)
In a performance evaluation of the CADe approach, data from CT scans were transferred to lung analysis software with a prototype CADe algorithm. Images were read by a radiologist and a first-year resident, with and without the software, at low-dose and full-dose settings. Results show that both readers displayed a statistically significant increase in sensitivity with the use of CADe (Das, M., et al., 2008). However, this approach only highlights small nodules, leaving malignancy risk evaluation and clinical decision making to the clinician, hence lowering the specificity of lung cancer screening (Ardila et al., 2019).
Computer-aided diagnosis (CADx)
The CADx approach provides diagnostic support for pre-identified lesions and is primarily aimed at improving specificity. One example is a study evaluating the performance of a CADx system for breast ultrasound, intended to improve radiologists’ characterization of breast lesions. Results show that the use of CADx increased radiologists’ sensitivity scores, with seven additional cancers diagnosed. However, the low specificity of the CADx system decreased the specificity of the radiologists, especially the more experienced among them (Chabi et al., 2012). In addition, CADx has gained greater interest in other areas of radiology, but not yet in lung cancer (Ardila et al., 2019).
THE PROPOSED THREE-DIMENSIONAL DEEP LEARNING MODEL (Ardila et al., 2019)
To move beyond the limitations of prior CADe and CADx approaches, an end-to-end approach was proposed that performs both localization and lung cancer risk categorization using input CT data alone. It aimed to replicate a more complete part of a radiologist’s workflow, including full assessment of the LDCT volume, focus on regions of concern, comparison to prior imaging when available and calibration against biopsy-confirmed outcomes. An important high-level decision in this approach was to learn features using deep convolutional neural networks (CNNs) (a class of neural networks that specialises in processing data with a grid-like topology, such as an image), rather than using hand-engineered features such as specific Hounsfield unit values (a relative quantitative measurement of radiodensity used by radiologists in the interpretation of CT images). Hand-engineered features have repeatedly been shown to be inferior to the CNN approach in many open computer vision competitions in recent years.
Overall modeling framework
There were three key components in the new approach.
The ‘full-volume model’: A 3D CNN model that performs end-to-end analysis of whole-CT volumes, using LDCT volumes with pathology-confirmed cancer as training data.
The ‘cancer ROI detection model’: A CNN region-of-interest (ROI) detection model trained to detect 3D cancer candidate regions in the CT volume.
The ‘cancer risk prediction model’: It operates on outputs from both the cancer ROI detection model and the full-volume model, and assigns a case-level malignancy score that predicts whether the patient received a cancer diagnosis.
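The way the three components fit together can be sketched as a simple data flow. This is only an illustrative outline under stated assumptions, not the implementation from the study: the class name, the callable signatures and the stand-in lambdas are all hypothetical, and real components would be trained neural networks operating on 3D arrays.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical types: a volume is any 3D array-like object, and a region of
# interest is a box given as (z0, y0, x0, z1, y1, x1).
Volume = Any
Roi = Tuple[int, int, int, int, int, int]


@dataclass
class ScreeningPipeline:
    full_volume_model: Callable[[Volume], List[float]]     # whole-CT features
    roi_detector: Callable[[Volume], List[Roi]]            # candidate regions
    risk_model: Callable[[List[float], List[Roi]], float]  # case-level score

    def predict(self, volume: Volume) -> float:
        """Run both upstream models, then combine their outputs into one score."""
        features = self.full_volume_model(volume)
        rois = self.roi_detector(volume)
        return self.risk_model(features, rois)


# Stand-in functions so the sketch runs; they do not model anything real.
pipeline = ScreeningPipeline(
    full_volume_model=lambda volume: [0.3],
    roi_detector=lambda volume: [(0, 0, 0, 1, 1, 1)],
    risk_model=lambda features, rois: min(1.0, features[0] + 0.1 * len(rois)),
)

print(pipeline.predict(volume=None))
```

The point of the sketch is only the ordering: the risk prediction step consumes the outputs of both the full-volume model and the ROI detector, as described in the list above.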
Clinical evaluation of the model
A deep learning model for the analysis of malignancy risk in lung cancer screening CTs was developed from a National Lung Screening Trial (NLST) dataset consisting of 42290 CT cases from 14851 patients, 578 of whom developed biopsy-confirmed cancer within the 1-year follow-up period. Patients were randomly assigned to one of three sets: a training set (70%), a tuning set (15%) and a test set (15%).
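A random 70/15/15 assignment at the patient level can be sketched as follows. Splitting by patient rather than by scan keeps all CT cases from one patient in the same set. The function name, the fixed seed and the rounding rule are assumptions made for illustration; the study does not describe its splitting procedure at this level of detail.

```python
import random


def split_patients(patient_ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Randomly assign each unique patient to the train, tune or test set."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)

    n_train = round(fractions[0] * len(ids))
    n_tune = round(fractions[1] * len(ids))
    return {
        "train": ids[:n_train],
        "tune": ids[n_train:n_train + n_tune],
        "test": ids[n_train + n_tune:],  # whatever remains, roughly 15%
    }


# Hypothetical patient identifiers 0..14850, matching the cohort size above.
splits = split_patients(range(14851))
print({name: len(ids) for name, ids in splits.items()})
```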
The model predictions were thresholded at three different cutoffs to produce four lung malignancy score (LUMAS) buckets: 1/2, 3+, 4A+ and 4B/X.
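Thresholding a continuous score at three cutoffs to obtain four buckets can be sketched as below. The cutoff values are hypothetical placeholders, not the ones used in the study, and the bucket labels are copied from the text above; the function name is likewise an assumption.

```python
from bisect import bisect_right

# Hypothetical cutoffs on a 0-1 malignancy score. NOT the study's values.
CUTOFFS = (0.2, 0.5, 0.8)
BUCKETS = ("1/2", "3+", "4A+", "4B/X")


def lumas_bucket(score: float) -> str:
    """Map a score to a bucket; a score equal to a cutoff goes to the higher bucket."""
    return BUCKETS[bisect_right(CUTOFFS, score)]


for score in (0.05, 0.35, 0.65, 0.95):
    print(score, lumas_bucket(score))
```

Three sorted cutoffs split the score range into four intervals, so `bisect_right` returns an index from 0 to 3 that selects the bucket directly.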
A two-part retrospective reader study was conducted with six US board-certified radiologists who had an average of 8 years of clinical experience. In the first part, the radiologists reviewed a subset of the test dataset consisting of 507 patients (83 cancer-positive). They were given access to patient demographics and clinical history, whereas the deep learning model was not. Neither the radiologists nor the model had access to previous screening CT volumes from the patient. On the same subset, the model assessed the lung cancer malignancy risk of each patient. Results show that the model’s area under the curve (AUC) was 95.9, as seen in the area bounded by the blue trend line in the graph below (a and b). Looking at the data for individual readers on the different buckets, indicated by the red, green and yellow crosses and circles in the specificity and sensitivity graphs respectively, the performance of all six radiologists trended at or below the model’s receiver operating characteristic (ROC) curve. This suggests that the model achieved sensitivity and specificity equal to or better than those of the average radiologist, despite having no information on patient demographics and clinical history.
In the second part, CT volumes from both the current and the previous year were available to the model and the radiologists. On this subset, the AUC of the model was 92.6, as indicated again by the blue trend line on the graph below. The model’s specificity was very similar to that of the average radiologist, especially for the 3+ and 4A+ buckets (the green and yellow crosses). Its sensitivity was less satisfactory than its specificity, as some readers were better at identifying positive cases. Overall, the model showed a significant improvement for the 4A+ bucket, while for the other buckets it matched the sensitivity and specificity of an average reader.
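For readers unfamiliar with the AUC, one way to compute it is as the probability that a randomly chosen cancer-positive case receives a higher score than a randomly chosen cancer-negative case, counting ties as half. The sketch below uses this pairwise formulation on made-up labels and scores; it is not the evaluation code from the study, and the function name is an assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = cancer-positive, 0 = cancer-negative.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
print(f"AUC: {roc_auc(labels, scores):.3f}")
```

A single reader's sensitivity and specificity at a given bucket corresponds to one point in the same plot, which is why readers can be compared against the model's curve.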
A localization analysis was also performed to measure how often a correct cancer diagnosis was linked with a correct localization. The model produced a bounding box for each of the top two lesions by malignancy risk. When these were compared with the bounding boxes labelled by two radiologists, the highest-ranked bounding box overlapped with a malignancy labelled by the radiologists in all but one case. This suggests that the model localizes lesions about as well as an average reader.
To further improve the reliability of the study of the model’s performance in evaluating malignancy risk and diagnosing lung cancer, several additional data subsets were taken into account. The overall results show that the model was not inferior to the average reader on any subset.
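Checking whether a predicted box overlaps a radiologist-labelled box can be sketched with axis-aligned 3D boxes. The box format, the coordinates and the rule that any non-zero overlap counts as a match are assumptions made for this example; the text above does not state the exact overlap criterion used.

```python
from typing import Tuple

# Hypothetical box format: (z0, y0, x0, z1, y1, x1) in voxels, end-exclusive.
Box = Tuple[int, int, int, int, int, int]


def overlap_volume(a: Box, b: Box) -> int:
    """Number of voxels shared by two axis-aligned 3D boxes (0 if disjoint)."""
    volume = 1
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0  # no overlap along this axis, so none at all
        volume *= hi - lo
    return volume


def boxes_overlap(a: Box, b: Box) -> bool:
    return overlap_volume(a, b) > 0


# Made-up coordinates for illustration.
model_box = (10, 20, 20, 20, 40, 40)
reader_box = (15, 30, 30, 25, 50, 50)
print(overlap_volume(model_box, reader_box), boxes_overlap(model_box, reader_box))
```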
Limitations of the model
While the study has shown the deep learning model to be similar, if not superior, to an average radiologist, it still has some limitations.
Firstly, more could be done to improve the reliability of the results. Although, as mentioned above, multiple data subsets were used to evaluate the model’s performance on lung cancer diagnosis, the radiologist-comparison studies were limited to data from the NLST dataset. Further study will therefore require testing and tuning against a broader variety of screening data to ensure reliability.
Secondly, the performance of the model was evaluated primarily by comparing reader and model performance on the diagnosis of positive and negative cases. In reality, however, the selection of models for use in clinical practice remains an area of ongoing research, and other factors, such as costs and outcomes, may have to be traded off against sensitivity and specificity. This study therefore does not reflect all the considerations necessary for deciding which models to use in clinical practice.
Conclusion of the study
All in all, advanced deep learning techniques were used to train the proposed model to generate case-level malignancy risk predictions as well as localization information for LDCT lung screening volumes. The observed increase in specificity could translate to fewer unnecessary follow-up procedures, while increased sensitivity could translate to fewer missed cancers in clinical practice. Looking ahead, such models could be used to aid clinicians in evaluating lung cancer screening exams.
FUTURE PROSPECTS
Having looked into the study by Diego Ardila’s research team on the performance of a deep learning model in evaluating malignancy risk and diagnosing lung cancer, one might ask: apart from lung cancer screening, how can the deep learning techniques used in this model be applied to other medical fields? Here are some of the many examples:
Neuroimaging and Machine Learning for Dementia Diagnosis
Dementia is the progressive loss of cognitive functioning, which interferes severely with a patient’s daily life and activities, and it is becoming more prevalent as the population ages. A major challenge in dementia is achieving accurate and timely diagnosis. A comprehensive survey of automated diagnostic approaches for dementia, which use medical image analysis and machine learning algorithms similar to the deep learning model for lung cancer screening, reports that deep learning approaches based on multimodal imaging analysis have shown promising results in the diagnosis of dementia (Ahmed et al., 2018).
Precision Medicine in Clinical Care
Similar deep learning techniques are applied in precision medicine in the form of big data analysis, which augments the cognitive decisions of surgeons. For example, instead of assigning a common treatment for a certain disease, oncologists can now analyze a patient’s biopsy tissue for a panel of genetic variants, which enables a more reliable prediction of the patient’s response to a particular treatment (Mirnezami et al., 2012).
These deep learning techniques also have considerable relevance to other types of 3D imaging data, such as magnetic resonance imaging (a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body), positron emission tomography (an imaging test that helps reveal how tissues and organs are functioning), and other types of volumetric data.
CONCLUSION
To sum up, we have looked at the working principles of CT scans and the limitations of current computer-aided detection and diagnosis approaches in terms of sensitivity and specificity. To see how such limitations might be overcome, we have delved deeper into a specific study on the performance of a deep learning model in evaluating malignancy risk and diagnosing lung cancer. These techniques can not only be extended to the diagnosis of other diseases such as dementia, but can also be used to develop precision medicine in clinical care. As the name suggests, deep learning has the potential to be explored more deeply, and it may only be a matter of time before these techniques are widely used in the medical field.
BIBLIOGRAPHY
Ardila, D., Kiraly, A.P., Bharadwaj, S. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 25, 954–961 (2019).
Caldemeyer, Karen S., Buckwalter, Kenneth A. (1999). https://doi.org/10.1016/S0190-9622(99)70015-0
Black, W. C. et al. Cost-effectiveness of CT screening in the National Lung Screening Trial. N. Engl. J. Med. 371, 1793–1802 (2014).
New York Department of Health. Disease Screening - Statistics Teaching Tools. (April 1999) https://www.health.ny.gov/diseases/chronic/discreen.htm
van Riel, Sarah J., et al. "Observer variability for Lung-RADS categorisation of lung cancer screening CTs: impact on patient management." European radiology 29.2 (2019): 924-931. https://link.springer.com/content/pdf/10.1007/s00330-018-5599-4.pdf
Das, M., et al. "Performance evaluation of a computer-aided detection algorithm for solid pulmonary nodules in low-dose and standard-dose MDCT chest examinations and its influence on radiologists." The British journal of radiology 81.971 (2008): 841-847.
Chabi ML, Borget I, Ardiles R, Aboud G, Boussouar S, Vilar V, Dromain C, Balleyguier C. Evaluation of the accuracy of a computer-aided diagnosis (CAD) system in breast ultrasound according to the radiologist's experience. Acad Radiol. 2012 Mar;19(3):311-9. doi: 10.1016/j.acra.2011.10.023. PMID: 22310523.
Ahmed, Md Rishad, et al. "Neuroimaging and machine learning for dementia diagnosis: Recent advancements and future prospects." IEEE reviews in biomedical engineering 12 (2018): 19-33.
Mirnezami, Reza, Jeremy Nicholson, and Ara Darzi. "Preparing for precision medicine." N Engl J Med 366.6 (2012): 489-491.