SSAT-Swin: Deep Learning-Based Spinal Ultrasound Feature Segmentation for Scoliosis Using Self-Supervised Swin Transformer





Abstract


Objective


Scoliosis, a 3-D spinal deformity, requires early detection and intervention. Ultrasound curve angle (UCA) measurement using ultrasound images has emerged as a promising diagnostic tool. However, calculating the UCA directly from ultrasound images remains challenging due to low contrast, high noise, and irregular target shapes. Accurate segmentation results are therefore crucial to enhance image clarity and precision prior to UCA calculation.


Methods


We propose the SSAT-Swin model, a transformer-based multi-class segmentation framework designed for ultrasound image analysis in scoliosis diagnosis. The model integrates a boundary-enhancement module in the decoder and a channel attention module in the skip connections. Additionally, self-supervised proxy tasks are used during pre-training on 1,170 images, followed by fine-tuning on 109 image-label pairs.


Results


The SSAT-Swin achieved a Dice score of 85.6% and a Jaccard score of 74.5%, with a 92.8% scoliosis bone feature detection rate, outperforming state-of-the-art models.


Conclusion


Self-supervised learning enhances the model’s ability to capture global context information, making it well-suited for addressing the unique challenges of ultrasound images, ultimately advancing scoliosis assessment through more accurate segmentation.


Introduction


Scoliosis, a deformity of the spine, typically presents as an “S” or “C” shape with a curvature exceeding 10 degrees [ ]. It has become a significant health issue globally [ ], making early detection and intervention critical to mitigating the associated risks [ , ].


Radiographic assessment using Cobb’s angle remains the gold standard for scoliosis diagnosis [ ], supported by research demonstrating high detection accuracy [ , ]. While measurements based on spinous processes and transverse processes show good reliability for estimating curve angles, they tend to underestimate the curve magnitude compared with the radiological Cobb method [ , ]. The center of lamina method offers high reliability, with variations in estimation accuracy across different curve severities [ ]. Beyond curve measurement, ultrasound provides significant advantages in assessing traumatic spinal cord injuries, infant spinal abnormalities, scoliosis treatment, and guiding invasive neuraxial interventions [ ]. However, the reliance of most existing methods on multiple X-ray scans introduces significant radiation exposure, urgently necessitating the development of non-invasive, radiation-free diagnostic alternatives [ ]. Accordingly, recent research has focused on the ultrasound curve angle (UCA), a non-invasive, radiation-free method for spinal curvature assessment [ ], offering real-time imaging and ease of use despite challenges of low contrast and irregular target shapes [ , ].


However, ultrasound image segmentation of spinal structures presents significant challenges due to inherent image complexities and ambiguous anatomical boundaries. Convolutional neural networks (CNNs) have demonstrated limitations in accurately detecting thoracic and lumbar bony features (TBFs/LBFs; as shown in Fig. 1 ), with previous studies plagued by high feature missing rates and poor detection accuracy [ ], which negatively impact the accuracy of algorithms like the UCA calculation [ ]. When a spinal feature is missed, the algorithm attempts to compensate by assuming or using previously detected features, which may not be accurate. Therefore, there is a need for a more advanced model that not only enhances segmentation performance but also improves feature detection to minimize such inaccuracies.




Figure 1


Comparison of raw and label images: (1) raw TBF image; (2) corresponding label; (3) raw LBF image; (4) corresponding label.


To address the challenges of ultrasound images in scoliosis diagnosis, we propose a self-supervised learning framework for pre-training the SSAT-Swin model. The pre-training phase leverages tailored proxy tasks, including image inpainting, rotation prediction and contrastive learning, to help the model learn robust feature representations by capturing anatomical patterns from different spinal regions. This process, conducted on 1170 ultrasound images, allows the model to handle the complex characteristics of ultrasound data, such as low contrast and high noise. After pre-training, the model is fine-tuned on 109 ultrasound image-label pairs for scoliosis diagnosis.


Our main contributions include:




  • A novel self-supervised learning framework: The SSAT-Swin model is pre-trained on 1170 ultrasound images using tailored proxy tasks to enhance feature extraction, followed by fine-tuning on a dataset of 109 image-label pairs;



  • Introduction of a boundary-enhancement module: This module, integrated into the decoder side of the Swin transformer block and the final projection layer, enhances boundary representation and highlights the region of interest (ROI), improving segmentation performance;



  • Integration of a channel attention module: This module, applied to the skip connections at each layer, ensures that critical information is emphasized and not overlooked, particularly in the lower-level skip connections.



Related works


Scoliosis causes gradual sagging of the spinal cord over time [ ], highlighting the importance of a radiation-free automated assessment system. Achieving this necessitates accurate segmentation results.


Landmark identification for automation


During measurement of the UCA, lateral features of the spine are prioritized over the spinous process as the anatomical reference for angle calculation. Before calculating the UCA, it is essential to first identify the exact locations of TBFs and LBFs, particularly those above and below the T12 level (as Fig. 2 shows), and determine the most tilted TBF and LBF pairs [ ]. Clear identification of TBFs and LBFs plays a critical role in accurate UCA measurement.




Figure 2


Various regions of interest in an ultrasound spinal image, with the (a) original image and (b) its corresponding marked bone features.


Currently, this identification process is manual, heavily reliant on the expertise of doctors, and time-consuming. To automate this step, it is necessary to develop an architecture capable of accurately segmenting bony features from ultrasound images amidst noise and speckle [ ]. Once TBFs and LBFs are successfully segmented, they can be used to identify the most skewed regions of the spine for UCA measurement. Therefore, segmentation is a crucial step in automating scoliosis analysis using the UCA method.


Current segmentation method


Numerous models have excelled in medical image segmentation, particularly CNN-based architectures like UNet and its variants, which have demonstrated commendable results [ , ]. Specific models, such as Dense-UNet for general segmentation [ ] and H-DenseUNet for liver and tumor segmentation [ ], have shown promising outcomes. Attention mechanisms have further enhanced segmentation, with ASCU-Net employing attention gates for skin lesion detection [ ].


However, despite these advancements, traditional CNNs face challenges in effectively extracting long-range features, which are essential for accurately segmenting complex medical images [ ].


The advent of transformer-based networks has introduced new possibilities for medical image segmentation. Initially developed for natural language processing [ ], transformers have shown significant success in computer vision, particularly with the Vision transformer [ ]. However, challenges remained regarding local information extraction and fixed receptive fields. The Swin transformer [ ] addressed these by implementing a window-based self-attention mechanism that effectively integrated local and global features.


Swin-Unet has emerged as a leading transformer-based model in medical imaging, achieving remarkable results across various competitions and datasets [ ]. Additionally, SwinUNETR [ ] is a 3-D model designed for self-supervised pre-training, achieving state-of-the-art performance on public leaderboards such as Medical Segmentation Decathlon and Beyond the Cranial Vault.


In conclusion, while CNN-based networks have historically excelled in computer vision, particularly in medical image segmentation [ ], the emergence of transformer-based algorithms has opened up new opportunities for advancement in this field.


Methods


Architecture overview


The proposed architecture is illustrated in Figure 3 . The input image is divided into non-overlapping 4 × 4 patches, and attention scores are computed within local windows of size 7. This process uses window-based multi-head self-attention and shifted window-based self-attention from the Swin transformer [ ] to capture both local and global features. In the encoder, patch merging reduces dimensionality, while patch expanding restores it in the decoder. Each skip connection incorporates a channel attention module [ ] to emphasize important channels. Additionally, a boundary-enhancement module, inspired by the attentive feedback network [ ], is integrated into each decoder layer’s Swin transformer block and the final projection, enhancing edge detection.
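For intuition, the window partitioning used by the Swin transformer can be sketched as follows. This is a generic re-implementation, not the authors' code; the token-grid size follows from the stated 448 × 448 input and patch size 4, while the 96-channel embedding dimension is an assumption:

```python
import torch

def window_partition(x, window_size=7):
    """Split a (B, H, W, C) token map into non-overlapping
    (window_size x window_size) windows, Swin-transformer style."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # Group the window-grid axes together, then flatten them into the batch axis.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

# A 448x448 input with patch size 4 yields a 112x112 token grid,
# which tiles exactly into (112/7)^2 = 256 windows of 7x7 tokens.
tokens = torch.rand(1, 112, 112, 96)   # 96 channels is an assumed embedding dim
windows = window_partition(tokens)
```

Self-attention is then computed independently inside each of these 256 windows, which is what keeps the attention cost linear in image size.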




Figure 3


The SSAT-Swin model features two key modules: The boundary-enhancement module, which improves boundary detection, and the channel attention module, which emphasizes important channels to ensure crucial information is highlighted during processing.


Boundary-enhancement module


To preserve feature integrity, we use a boundary-enhancement module, shown in Figure 4 , which effectively separates the background from the ROI while accentuating the boundaries of bone features.




Figure 4


The prediction result P(z) is subtracted from the max-pooling result P(m), followed by a Rectified Linear Unit (ReLU) function to enhance the boundary features. Finally, an addition operation integrates the refined features, contributing to the model’s prediction refinement.


While predicting the shape and structure of the targeted bone features, the segmentation result is refined by the boundary-enhancement component. To achieve this, we take the predicted result P(z), apply a max-pooling operation to obtain the feature map P(m), and subtract P(z) from P(m) to get the boundary map P(b) [ ]. Finally, P(b) is added back to P(z) to highlight the boundary. Boundary enhancement is applied at different levels within the network as well as after the final projection layer.
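A minimal PyTorch sketch of this step, assuming a 3 × 3 max-pooling kernel (the kernel size is not specified in the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryEnhancement(nn.Module):
    """Boundary-enhancement sketch: P(b) = ReLU(maxpool(P(z)) - P(z)),
    then P(z) + P(b) highlights the boundary of the prediction."""
    def __init__(self, kernel_size=3):
        super().__init__()
        # Stride 1 with "same" padding keeps the spatial size unchanged.
        self.pool = nn.MaxPool2d(kernel_size, stride=1, padding=kernel_size // 2)

    def forward(self, p_z):
        p_m = self.pool(p_z)        # dilated prediction P(m)
        p_b = F.relu(p_m - p_z)     # boundary map P(b): positive only near edges
        return p_z + p_b            # boundary-highlighted prediction

x = torch.rand(1, 3, 448, 448)      # e.g. 3-class prediction maps
out = BoundaryEnhancement()(x)
```

Because max pooling dilates bright regions, the difference P(m) − P(z) is largest along region borders, so adding it back sharpens the boundary response without altering the interior.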


Channel attention module


This module enhances the input feature representation. First, adaptive average pooling collapses the spatial dimensions, producing a global description of each channel [ ]. This channel attention strengthens key information, improving the model’s understanding of complex tasks.


As shown in Figure 5 , by incorporating the channel attention module in the skip connection, the model optimizes channel-wise feature adjustment, enhancing its focus on key information. This process improves the model’s ability to express input data, particularly in complex tasks, by emphasizing important features and suppressing irrelevant ones.




Figure 5


An adaptive average pooling step lowers the spatial dimensions of the input features to produce a global per-channel description. A Multi-Layer Perceptron (MLP), consisting of two linear layers with a ReLU between them, then learns the channel weights. These weights are normalized by a sigmoid function and applied to the input features, scaling each channel through element-wise multiplication to produce the final output features.
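As a sketch, the mechanism in Figure 5 can be implemented as follows; the reduction ratio of the MLP bottleneck is an assumption, as the paper does not state it:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Skip-connection channel attention sketch: global average pooling ->
    two-layer MLP with ReLU -> sigmoid channel weights -> rescale channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # (B, C, H, W) -> (B, C, 1, 1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                      # weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.mlp(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                           # re-weight each channel

x = torch.rand(2, 96, 56, 56)                  # an assumed skip-connection shape
out = ChannelAttention(96)(x)
```

Each channel is thus scaled by a learned weight between 0 and 1, emphasizing informative channels and suppressing irrelevant ones before the skip features reach the decoder.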


Experiments


Dataset


As Figure 6 shows, each case contains two important parts:



  • 1.

    The LBF, at the bottom of the image, shown in blue;


  • 2.

    The TBF, in the upper part, shown in red and green.




Figure 6


A pair of data. The image on the left side is the original ultrasound image, while on the right side is the ground truth label overlapping the raw data.


This research utilizes input images acquired from the Scolioscan system, which employs 3-D ultrasound imaging techniques to generate 3-D volume projection images. To ensure spatial accuracy, a cross-wire calibration procedure was conducted using the non-linear Levenberg-Marquardt algorithm to align 2-D images with positional sensor data. Ultrasound gel was applied during scanning to eliminate gaps between the probe and the skin, ensuring consistent image quality. In total, 1170 images were used for self-supervised pre-training, followed by fine-tuning on 109 images selected from 109 patients. The patient cohort included 82 females and 27 males, with an average age of 15.6 ± 2.7 y, all presenting varying degrees of spinal deformity. From the 3-D ultrasound voxel data, medical experts selected nine 2-D images per patient, captured at different image depths, and a RankNet was used to select the best-quality slice for each case [ ]. These images, despite variations in size, typically have a resolution of approximately 2600 × 640 pixels [ ].


All experimental procedures involving human participants were approved by the Institutional Review Board, and informed consent was obtained as required. The study strictly adhered to the principles outlined in the Declaration of Helsinki. Furthermore, human subject consent for the project was granted by the Hong Kong Polytechnic University Ethics Committee (approval no. HSEARS20180906005).


Data pre- and post-process


During the pre-processing stage, the grayscale-converted original images and their corresponding label arrays were stored as NumPy arrays in .npz format. The image pixel values ranged from 0 to 255, while the labels were constrained to values of 0, 1 or 2 within a 2-D array. To further reduce noise and enhance image quality, a Gaussian operator was applied as part of the pre-processing workflow [ ]. Each case was then consolidated into a single NumPy array, which was split into training and testing sets at a 0.9:0.1 ratio. The images were subsequently resized to 448 × 448 dimensions to improve model performance.
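The pipeline above can be sketched as follows. The Gaussian sigma and the nearest-neighbor resampling are assumptions for illustration; the paper names the operations but not their parameters or interpolation mode:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_case(image, label, out_size=448):
    """Sketch of the described pre-processing: smooth the grayscale image
    with a Gaussian operator, keep labels in {0, 1, 2}, and resize both to
    out_size x out_size (nearest-neighbor index mapping, so labels stay integral)."""
    image = gaussian_filter(image.astype(np.float32), sigma=1.0)  # sigma assumed
    h, w = image.shape
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return image[np.ix_(rows, cols)], label[np.ix_(rows, cols)]

# Synthetic stand-ins at the paper's typical ~2600 x 640 resolution.
image = np.random.randint(0, 256, (2600, 640)).astype(np.uint8)
label = np.random.randint(0, 3, (2600, 640))
img_r, lab_r = preprocess_case(image, label)
np.savez("case.npz", image=img_r, label=lab_r)  # one .npz file per case
```

A real pipeline would then shuffle the consolidated cases and split them 0.9:0.1 into training and testing sets, as described above.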


During post-processing, the prediction was restored to its original size of 640 × 2600. For visualization, the results display the original class map alongside a combined map of classes 1 and 2, multiplied by 255.
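A matching sketch of this post-processing step, again assuming nearest-neighbor restoration:

```python
import numpy as np

def postprocess(pred, out_h=2600, out_w=640):
    """Restore a class map to the original resolution and build the
    255-scaled visualization that merges classes 1 and 2."""
    rows = np.arange(out_h) * pred.shape[0] // out_h
    cols = np.arange(out_w) * pred.shape[1] // out_w
    restored = pred[np.ix_(rows, cols)]
    # Merge both bone-feature classes into one foreground map for display.
    vis = ((restored == 1) | (restored == 2)).astype(np.uint8) * 255
    return restored, vis

pred = np.random.randint(0, 3, (448, 448))   # stand-in 448x448 prediction
restored, vis = postprocess(pred)
```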


Implementation


The network was implemented in Python 3.7 and PyTorch 1.13.1. Data augmentations such as flips and rotations were applied to all training images. As mentioned above, the input image size was 448 × 448 and the patch size was 4, with a window size of 7. We used an NVIDIA Quadro RTX 8000 GPU with a total memory of 32 GB.


Pre-training. We pre-trained the SSAT-Swin model encoder using multiple proxy tasks and adopted a multi-objective loss function for self-supervised representation learning. We utilized three proxy tasks: masked image inpainting, rotation prediction and contrastive learning. During pre-training, three projection heads were attached to the encoder, which were later removed for the downstream segmentation task.
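The three proxy tasks can be sketched as a single multi-objective loss. Head dimensions, the masked-patch size, the embedding size and the equal loss weighting are all assumptions for illustration; only the three tasks themselves come from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PretrainHeads(nn.Module):
    """Three proxy-task projection heads attached to the encoder during
    pre-training and removed before the downstream segmentation task."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.inpaint = nn.Linear(feat_dim, 16 * 16)  # reconstruct a masked 16x16 patch
        self.rotation = nn.Linear(feat_dim, 4)       # classify 0/90/180/270 rotation
        self.contrast = nn.Linear(feat_dim, 128)     # embedding for contrastive learning

    def forward(self, feats):
        return self.inpaint(feats), self.rotation(feats), self.contrast(feats)

def pretrain_loss(recon, patch, rot_logits, rot_label, z1, z2, tau=0.1):
    # Equal-weight sum of the three proxy-task losses (weighting assumed).
    l_inp = F.mse_loss(recon, patch)                 # masked image inpainting
    l_rot = F.cross_entropy(rot_logits, rot_label)   # rotation prediction
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                          # paired views on the diagonal
    l_con = F.cross_entropy(sim, torch.arange(z1.size(0)))
    return l_inp + l_rot + l_con

heads = PretrainHeads()
feats = torch.rand(8, 768)                           # encoder features for 8 crops
recon, rot_logits, z = heads(feats)
loss = pretrain_loss(recon, torch.rand(8, 256), rot_logits,
                     torch.randint(0, 4, (8,)), z, z + 0.01 * torch.rand(8, 128))
```

After pre-training converges, the encoder weights are kept and these heads are discarded before fine-tuning on the labeled segmentation pairs.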


May 10, 2025 | Posted in ULTRASONOGRAPHY