Development of a Deep Learning Model for Classification of Hepatic Steatosis from Clinical Standard Ultrasound





ABSTRACT


Objective


Early detection and monitoring of hepatic steatosis can help establish appropriate preventative measures against progression to more advanced disease. We aimed to develop a deep learning (DL) program for classification of hepatic steatosis from standard-of-care grayscale ultrasound (US) images.


Methods


In this single-center retrospective study, we used grayscale US images acquired from January 1, 2010, to October 23, 2022, and labeled with magnetic resonance imaging proton density fat fraction (MRI-PDFF) to develop a DL multi-instance program for differentiating normal (S0) from steatotic liver (S1/2/3) and normal/mild steatosis (S0/1) from moderate/severe steatosis (S2/3). Diagnostic performance was assessed with the area under the receiver operating characteristic curve (AUC), sensitivity, specificity and balanced accuracy, each with a 95% confidence interval (CI).


Results


A total of 403 patients with 403 US exams were included: 171 (42%) were normal (S0: MRI-PDFF <5%), 154 (38%) had mild steatosis (S1: MRI-PDFF 5–17.4%), 29 (7%) had moderate steatosis (S2: MRI-PDFF >17.4%–22.1%) and 49 (12%) had severe steatosis (S3: MRI-PDFF >22.1%). The dataset was split to include 322 patients in train/validation and 81 patients in a holdout test set (kept blind). The S0 versus S1/2/3 model achieved 81.3% (95% CI 72.1–90.5) AUC, 81.1% (70.6–91.6) sensitivity, 71.4% (54.7–88.2) specificity and 76.3% (66.4–86.2) balanced accuracy. The S0/1 versus S2/3 model achieved 95.9% (89–100) AUC, 87.5% (71.3–100) sensitivity, 96.9% (92.7–100) specificity and 92.2% (83.8–100) balanced accuracy. A multi-class model achieved a sensitivity of 71.4% (54.7–88.2) for S0, 67.6% (52.5–82.7) for S1 and 87.5% (71.3–100) for S2/3; specificity for the same model was 81.1% (70.6–91.6) for S0, 77.3% (64.9–89.7) for S1 and 96.9% (92.7–100) for S2/3.


Conclusion


Our DL program offered high sensitivity and accuracy in detecting and categorizing hepatic steatosis from standard-of-care ultrasound.




Key results




  1. Our deep learning model can distinguish steatotic from normal liver with 81.1% sensitivity and 76.3% balanced accuracy.

  2. The model can distinguish normal/mild steatosis from moderate/severe steatosis with 92.2% balanced accuracy.

  3. The multi-class model achieved its highest sensitivity (87.5%) and specificity (96.9%) for detecting moderate/severe steatosis.




Introduction


Hepatic steatosis is a common disease worldwide with various etiologies including alcohol use, metabolic syndrome, viral infection and medication use [ ]. Patients with hepatic steatosis can develop inflammation and fibrosis, which are risk factors for cirrhosis and hepatocellular carcinoma [ ]. Early detection and monitoring of hepatic steatosis could help establish appropriate preventative measures against progression. Magnetic resonance imaging proton density fat fraction (MRI-PDFF) has become the best noninvasive method to quantify fat in the liver [ ]. However, MRI also has disadvantages such as long acquisition time, high cost and limited availability, especially in low-resource settings. Ultrasound (US) can serve as a first-line screening modality for liver disease because it is fast, portable and low-cost.


Currently, the standard practice to evaluate hepatic steatosis on US is a qualitative assessment by radiologists on grayscale images [ ]. This is prone to inter-/intrareader variability [ , ], and calibration and discriminative accuracy can vary greatly [ , ]. Emerging quantitative US technologies include hepatorenal index, backscatter coefficient, sound-speed estimation, and attenuation imaging. Some are already available but are vendor-specific, limited to specific circumstances, or in need of further external validation, especially in diverse patient populations with different health and demographic/ethnic backgrounds [ , ].


Machine learning-based methods can enable novel approaches for detection, classification and quantification of hepatic steatosis and have been demonstrated on several US-based data types including radiofrequency (RF) and standard B-mode images [ ]. While access to RF and raw US data is limited, B-mode grayscale data is readily available on a Picture Archiving and Communication System (PACS) and is part of standard routine US exams. Enabling machine learning methods to automatically detect and classify hepatic steatosis during a routine US exam with a model deployed on a PACS could significantly enhance radiologists’ productivity by automatically highlighting cases that require special attention (especially for radiologists in training or those who do not specialize in reading US), reduce interpretation variability and increase accuracy [ ]. While some such models have been introduced, their limitations include variable reference standards, training set label accuracy, training set size and difficulty of implementation in clinical practice [ ]. Here, we introduce a deep learning (DL) multi-instance model for automated detection and grading of hepatic steatosis from B-mode US images that is trained on data labeled using MRI-PDFF to standardize measurement methods and to remove human reader variability. This model could be integrated seamlessly into a PACS, providing automated, precise and accurate evaluation of hepatic steatosis.


Materials and methods


Data to train, validate and test our model was obtained through a single-center, retrospective protocol approved by the Institutional Review Board and compliant with the Health Insurance Portability and Accountability Act. Informed consent was waived due to the study’s retrospective nature. This study followed the Standards for Reporting of Diagnostic Accuracy Studies 2015 guidelines [ ].


MRI-PDFF reference standard


MRI-PDFF is routinely acquired on all MRI abdomen and liver protocols. Images were acquired on a 3T magnet from GE (GE HealthCare, Chicago, IL, USA), Philips (Philips HealthCare, Best, The Netherlands) or Siemens (Siemens Healthineers, Erlangen, Germany). A fat fraction from MRI-PDFF was extracted from MRI reports as the reference standard. For steatosis grades S0–S3 (normal to severe), we used the following MRI-PDFF cut-off values: S0 (normal) fat fraction <5%, S1 (mild) fat fraction 5%–17.4%, S2 (moderate) fat fraction >17.4%–22.1% and S3 (severe) fat fraction >22.1% [ , ].
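The cut-offs above can be written as a small lookup. The following is a minimal sketch, not the authors' code; the function names `steatosis_grade` and `binary_labels` are hypothetical, and the second function simply derives the targets for the two binary tasks described in this study (S0 versus S1/2/3, and S0/1 versus S2/3).

```python
def steatosis_grade(pdff_percent: float) -> str:
    """Map an MRI-PDFF value (in percent) to a steatosis grade S0-S3."""
    if pdff_percent < 0:
        raise ValueError("fat fraction cannot be negative")
    if pdff_percent < 5.0:
        return "S0"  # normal: <5%
    if pdff_percent <= 17.4:
        return "S1"  # mild: 5%-17.4%
    if pdff_percent <= 22.1:
        return "S2"  # moderate: >17.4%-22.1%
    return "S3"      # severe: >22.1%


def binary_labels(grade: str) -> tuple[int, int]:
    """Return targets for (S0 vs S1/2/3, S0/1 vs S2/3)."""
    return (int(grade != "S0"), int(grade in ("S2", "S3")))
```

Applied to the fat fractions of the representative images in Figure 2 (0.5%, 11%, 20% and 31%), this mapping yields S0, S1, S2 and S3, respectively.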


Clinical-standard US dataset


The electronic medical record was searched from January 1, 2010, to January 1, 2022, for all adult patients who underwent abdominal or liver MRI with MRI-PDFF and abdominal US within 12 weeks and with available body mass index (BMI) within 12 weeks of the MRI and US exams. An additional search was performed from January 1, 2022, to October 23, 2022, for patients with the above criteria to include more patients with MRI-PDFF >17.4% (moderate and severe steatosis). Patients were excluded due to significant BMI change between the MRI and US exams (>5% absolute value change) [ , ], insufficient normal liver parenchyma (right lobe especially), status post liver transplant, or pregnancy (Fig. 1). All patients were included only once (the most recent US exam). Due to the low prevalence of moderate and severe steatosis, all patients with moderate and severe steatosis (MRI-PDFF >17.4%) were grouped into a single label. Patients with normal liver and mild steatosis were randomized.
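The timing and BMI criteria can be illustrated with a short eligibility check. This is a sketch under stated assumptions: the text does not say whether the 5% BMI threshold is a relative change or a change in BMI units, so the code assumes a relative change with the MRI-time BMI as the denominator. The function name and constants are hypothetical, and the clinical exclusions (transplant, pregnancy, insufficient parenchyma) are not modeled.

```python
from datetime import date

MAX_GAP_DAYS = 12 * 7   # "within 12 weeks" between US and MRI
MAX_BMI_CHANGE = 0.05   # ASSUMPTION: 5% relative change in BMI


def is_eligible(us_date: date, mri_date: date,
                bmi_us: float, bmi_mri: float) -> bool:
    """Check the exam-interval and BMI-stability criteria only."""
    gap_days = abs((us_date - mri_date).days)
    if gap_days > MAX_GAP_DAYS:
        return False
    # ASSUMPTION: change is measured relative to the BMI at the MRI exam.
    relative_change = abs(bmi_us - bmi_mri) / bmi_mri
    return relative_change <= MAX_BMI_CHANGE
```

For example, `is_eligible(date(2021, 3, 1), date(2021, 3, 5), 29.0, 29.5)` passes both checks, whereas a 100-day gap between exams does not.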




Figure 1


Flowchart to derive training patient cohort.


B-mode grayscale liver images were acquired as a general abdominal or dedicated liver protocol using either GE Logiq E9 or E10 (GE Healthcare, Waukesha, WI, USA) or Canon Aplio i800 (Canon Medical Systems USA, Inc, Tustin, CA, USA). Patient age, sex, BMI, ethnicity, presence of cirrhosis (clinical diagnosis), and risk factors for diffuse liver disease were collected.


Image preparation


Data extracted from PACS included single-image JPGs/TIFF and DICOM (in transverse and longitudinal views), and multi-image cinematic clips in the form of AVIs or DICOMs. Multi-image files were separated into individual frames to facilitate input into developed models. Images were cropped to remove extraneous text, retaining the majority of the liver parenchyma. Images were then resized to 224 pixels by 224 pixels. Figure 2 shows representative images.
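The crop-and-resize step can be sketched without any imaging library. The example below is illustrative only: it treats an image as a list of pixel rows, uses nearest-neighbour sampling, and takes a crop box chosen by the caller. The function name `crop_and_resize`, the box convention and the interpolation method are all assumptions; the text does not state which tools or interpolation the authors used.

```python
def crop_and_resize(image, box, size=224):
    """Crop `image` (a list of pixel rows) to `box`, then resize to size x size.

    box = (left, top, right, bottom), with right and bottom exclusive.
    Nearest-neighbour sampling is used purely for simplicity.
    """
    left, top, right, bottom = box
    crop_w, crop_h = right - left, bottom - top
    if crop_w <= 0 or crop_h <= 0:
        raise ValueError("empty crop box")
    resized = []
    for y in range(size):
        # Map each output row back to a source row inside the crop box.
        source_row = image[top + (y * crop_h) // size]
        resized.append(
            [source_row[left + (x * crop_w) // size] for x in range(size)]
        )
    return resized


def clip_to_frames(clip):
    """Treat a multi-frame clip (a sequence of frames) as separate images."""
    return list(clip)
```

Each frame returned by `clip_to_frames` would then pass through `crop_and_resize` so that every model input has the same 224 by 224 shape.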




Figure 2


Ultrasound images of the right lobe of the liver showing liver without steatosis (a, MRI-PDFF 0.5%), with mild steatosis (b, MRI-PDFF 11%), with moderate steatosis (c, MRI-PDFF 20%) and with severe steatosis (d, MRI-PDFF 31%). K, right kidney; L, liver (right lobe).


Data splits for training, validation and testing


Figure 3a shows the partitioning and flow of image data. The holdout (i.e., test) set was prepared by stratifying 20% of the total patient dataset (15,874 images, 81 patients) by steatosis category, and consisted of 28 S0 patients, 37 S1 patients and 16 S2/3 patients. This ensured that the holdout contained approximately the same steatosis category distribution as the remaining 80% of the data (81,147 images, 322 patients, which were used for training and cross-validation), which consisted of 143 S0 patients, 117 S1 patients and 62 S2/3 patients. Patients in the holdout set (i.e., test set) were not included in training or validation sets; all data was stratified by patient.
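A patient-level stratified holdout can be sketched as follows. This is a generic illustration, not the authors' procedure: the function names, the random seed and the per-category rounding are assumptions, so it will not reproduce the exact per-category counts reported above. The key property it demonstrates is that splitting is done on patient IDs, so all frames from one patient land on the same side of the split.

```python
import random
from collections import defaultdict


def stratified_holdout(patient_labels, holdout_fraction=0.2, seed=0):
    """Split patient IDs into (development, holdout), stratified by label.

    patient_labels: dict mapping patient ID -> category, e.g. "S0".
    """
    by_label = defaultdict(list)
    for patient_id, label in patient_labels.items():
        by_label[label].append(patient_id)

    rng = random.Random(seed)
    development, holdout = set(), set()
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        rng.shuffle(ids)
        n_holdout = round(len(ids) * holdout_fraction)
        holdout.update(ids[:n_holdout])
        development.update(ids[n_holdout:])
    return development, holdout


def frames_for(patients, frame_to_patient):
    """Select frames whose patient is in `patients` (no patient-level leakage)."""
    return [f for f, pid in frame_to_patient.items() if pid in patients]
```

Because frames are selected through their patient ID, the variable number of frames per patient does not affect which patients are held out.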




Figure 3


(a) Image data partitioning and flow. A 20% holdout dataset was initially extracted from the cleaned patient dataset, which was used to evaluate the final multi-class structure. The remaining 80%, through 4-fold cross-validation, was used to select the best parameters for the binary models that were integrated into the final multi-class structure. (b) The architecture of each binary model used in the multi-class structure. The entire training dataset left over from holdout extraction was used to train a frame-level predictor (far left). Computing various statistics on the patient-grouped predictions of this model yielded a dataset of feature vectors that was used to cross-validate a Histogram-Based Gradient Booster. The best-performing booster was then trained on the entire feature vector dataset to yield patient-level predictions.


Multi-instance binary model architecture


Our model followed a multi-instance-like methodology; such methods are used when labels are only available for a group of “instances” that together form what is known as a “bag” [ , ]. In multi-instance learning, the model aims to classify the bag by considering each instance belonging to that bag [ ]. Here, patients were treated as bags, and individual frames/images were treated as instances within a bag. Two multi-instance binary models were developed, one distinguishing between S0 and S1/S2/S3 patients and another distinguishing between S0/S1 and S2/S3 patients. Both models were combined to yield a final multi-class prediction based on a rule system.
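The text does not spell out the rule system, so the following is only one plausible sketch of how two binary outputs could be merged into three classes. The function name and, in particular, the precedence given to the S0/1 versus S2/3 model when the two models disagree are assumptions.

```python
def combine_predictions(steatosis_positive: bool,
                        moderate_severe_positive: bool) -> str:
    """Merge two binary patient-level predictions into S0, S1 or S2/3.

    ASSUMPTION: a positive S0/1-vs-S2/3 prediction takes precedence,
    even if the S0-vs-S1/2/3 model predicted a normal liver.
    """
    if moderate_severe_positive:
        return "S2/3"
    if steatosis_positive:
        return "S1"
    return "S0"
```

The only ambiguous input is a patient called normal by the first model but moderate/severe by the second; a different tie-breaking rule would change the output only for that combination.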


Figure 3b illustrates how each binary model was composed of a frame-level model yielding predictions on individual images, followed by a pooling-by-patient and feature computation layer. Similar to Mazaki et al. and Huertas-Tato et al., these grouped statistics served as inputs into a Histogram-based Gradient Boosting Classification Tree, yielding a binary patient-level prediction for each group [ , ]. This particular patient-level predictor was selected due to superior cross-validation performance when compared to alternatives such as simple averaging of frame-level probabilities, SVMs, Decision Trees and Random Forests. The frame-level model was based on a deep convolutional neural network (CNN) with weights pretrained on ImageNet [ ]. Various architectures were evaluated in this phase, including ConvNeXt, MobileNetV3, ResNet, VGG and Vision Transformers, with MobileNetV3 being selected due to superior frame-level performance. The model was fitted with balanced class weights. Pooling between frame-level and patient-level was necessary to account for variations in the number of frames (instances) per patient. The patient-level features used as input to the patient-level model were computed from the frame-level output probabilities and comprised statistical summaries of those probabilities (described below). A 4-fold cross-validation was performed via a grid search to tune the parameters of the patient-level model to maximize the sensitivity at a threshold of 0.5 [19]. Afterward, the model was trained on the entire patient dataset after holdout extraction and evaluated on the holdout dataset.
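The pooling step can be illustrated by grouping frame-level probabilities into one bag per patient. The sketch below also includes the simple-averaging baseline named above as one of the alternatives that was compared against the gradient booster. The function names and the use of "greater than or equal to" at the 0.5 threshold are assumptions.

```python
from collections import defaultdict
from statistics import fmean


def pool_by_patient(frame_predictions):
    """Group frame-level probabilities into one bag per patient.

    frame_predictions: iterable of (patient_id, probability) pairs.
    """
    bags = defaultdict(list)
    for patient_id, probability in frame_predictions:
        bags[patient_id].append(probability)
    return dict(bags)


def mean_pool_predict(bags, threshold=0.5):
    """Baseline patient-level rule: average the frame probabilities."""
    return {pid: int(fmean(probs) >= threshold) for pid, probs in bags.items()}
```

Bags may hold any number of frames, which is why a fixed-length summary of each bag is needed before a tabular classifier can be applied.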


The features computed for each patient-level grouping included the following 19 statistical summaries obtained from the prediction probabilities (i.e., output of frame-level model) across all frames for a patient: minimum, maximum, range, mean, variance, sample standard deviation, geometric mean, geometric standard deviation, harmonic mean, standard error of the mean, coefficient of variation, median, interquartile range, mode, sample skewness, kurtosis, Shannon entropy, differential entropy and median absolute deviation.
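Several of these summaries can be computed with the Python standard library alone. The sketch below covers a subset of the 19 features (it omits mode, skewness, kurtosis, geometric standard deviation and differential entropy) and makes two assumptions that are not stated in the text: probabilities are clipped slightly away from 0 and 1 so that the geometric and harmonic means stay finite, and Shannon entropy is taken over a 10-bin histogram of the probabilities. The function name `patient_features` is hypothetical.

```python
import math
import statistics as st


def patient_features(probs, bins=10):
    """Summarise one patient's frame-level probabilities as a feature dict."""
    if len(probs) < 2:
        raise ValueError("need at least two frames")
    eps = 1e-6  # ASSUMPTION: clip so log-based statistics stay finite
    p = [min(max(x, eps), 1 - eps) for x in probs]
    mean, sd, median = st.fmean(p), st.stdev(p), st.median(p)
    q1, _, q3 = st.quantiles(p, n=4)
    # ASSUMPTION: Shannon entropy of a histogram with `bins` equal-width bins.
    counts = [0] * bins
    for x in p:
        counts[min(int(x * bins), bins - 1)] += 1
    entropy = -sum(c / len(p) * math.log(c / len(p)) for c in counts if c)
    return {
        "min": min(p), "max": max(p), "range": max(p) - min(p),
        "mean": mean, "variance": st.variance(p), "std": sd,
        "geometric_mean": st.geometric_mean(p),
        "harmonic_mean": st.harmonic_mean(p),
        "sem": sd / math.sqrt(len(p)),   # standard error of the mean
        "cv": sd / mean,                 # coefficient of variation
        "median": median, "iqr": q3 - q1,
        "mad": st.median(abs(x - median) for x in p),
        "shannon_entropy": entropy,
    }
```

Whatever the number of frames in a bag, the output has the same keys, so each patient becomes one fixed-length row for the patient-level classifier.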


Statistical analysis


For each patient-level binary model, we computed receiver operating characteristic (ROC) curves to assess the discrimination of hepatic steatosis categories. The best model, at a threshold of 0.5, was used to compute the binary balanced accuracy, sensitivity (recall) and specificity. Multi-class confusion matrices (depicting accuracy percentages for each stage), balanced accuracy, sensitivity and specificity with 95% confidence intervals (CI) were computed by combining the predictions from both binary models. All statistical analyses were performed in Python using the scikit-learn module. For all experiments, the optimal operating output threshold was assumed to be 0.5.
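The text does not name the interval method. As a worked illustration, the sketch below uses a normal-approximation (Wald) interval clipped to the range 0 to 1, which is an assumption, although the intervals it produces agree with those reported in the abstract to within rounding. The confusion-matrix counts are also not given in the text: they are inferred here from the reported percentages and the holdout composition (53 steatotic and 28 normal patients) and should be read as assumptions.

```python
import math


def wald_ci(successes: int, n: int, z: float = 1.96):
    """Return (proportion, lower, upper) using a normal approximation."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def binary_metrics(tp: int, fn: int, tn: int, fp: int):
    """Sensitivity and specificity with CIs, plus balanced accuracy."""
    sensitivity = wald_ci(tp, tp + fn)
    specificity = wald_ci(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity[0] + specificity[0]) / 2,
    }


# INFERRED counts (assumption) for the S0 vs S1/2/3 model on the holdout set:
# 43 of 53 steatotic and 20 of 28 normal patients classified correctly.
m = binary_metrics(tp=43, fn=10, tn=20, fp=8)
```

With these inferred counts, sensitivity is about 81.1%, specificity about 71.4% (interval roughly 54.7% to 88.2%) and balanced accuracy about 76.3%, which is the mean of the two.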


Results


Training cohort baseline characteristics


A total of 403 patients with 403 US exams were included in the training set. Patient demographic and clinical characteristics are listed in Table 1. A total of 59 (15%) patients were diagnosed with liver cirrhosis. Out of 403 patients, 171 (42%) patients had a normal liver (S0: MRI-PDFF <5%), 154 (38%) patients had mild steatosis (S1: MRI-PDFF 5–17.4%), 29 (7%) patients had moderate steatosis (S2: MRI-PDFF >17.4%–22.1%) and 49 (12%) patients had severe steatosis (S3: MRI-PDFF >22.1%). A total of 8438 static grayscale images and 1067 cinematic clips (each of variable size) were included for training. Examples of different categories of hepatic steatosis are shown in Figure 2.



Table 1

Baseline Characteristics of 403 US Exams in 403 Patients

Characteristics                                 Value, n (%) unless otherwise stated
Age (y), median (IQR)                           53 (40–66)
Sex
  Female                                        252 (63)
  Male                                          151 (37)
BMI (kg/m²), median (IQR)                       29 (24.4–33.4)
  Range                                         16.8–51.8
Ethnicity
  White                                         151 (37)
  Asian                                         96 (24)
  Black                                         7 (2)
  Hispanic/Latino (White)                       16 (4)
  Hispanic/Latino (Other)                       93 (23)
  American Indian or Alaska Native              2 (<1)
  Native Hawaiian or Other Pacific Islander     4 (1)
  Unknown/Others                                34 (8)
Presence of cirrhosis                           59 (15)
Risk factor of cirrhosis
  Hepatitis B                                   30 (7)
  Hepatitis C                                   9 (2)
  Alcohol                                       43 (11)
  MAFLD                                         14 (3)
  Cardiomyopathy                                1 (<1)
  Autoimmune                                    4 (1)
  Biliary                                       5 (1)
  Congenital/genetic                            4 (1)
  None/unknown                                  293 (73)
US machine
  GE                                            341 (85)
  Canon                                         62 (15)
Days between US and MRI, median (IQR)           4 (1–27)
  Range                                         0–83
MRI-PDFF (%), median (IQR)                      6.4 (2.3–15)
  Range                                         0.5–48
  Normal (<5%)                                  171 (42)
  Mild (≥5%–17.4%)                              154 (38)
  Moderate (>17.4%–22.1%)                       29 (7)
  Severe (>22.1%)                               49 (12)

Posted May 10, 2025, in Ultrasonography
