Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video
#vision-language models #ergonomic assessment #manual lifting #musculoskeletal disorders #NIOSH Lifting Equation #computer vision #workplace safety #RGB video analysis
📌 Key Takeaways
- Researchers developed vision-language models to estimate hand distances for ergonomic assessment
- Two VLM-based pipelines were created: detection-only and detection-plus-segmentation
- The segmentation-based approach achieved mean absolute errors of 6-8 cm for horizontal and 5-8 cm for vertical distances
- Pixel-level segmentation reduced estimation errors by 20-40% compared to detection-only method
- This research offers a non-invasive alternative to traditional ergonomic measurement systems
📖 Full Retelling
Researchers Mohammad Sadra Rajabi, Aanuoluwapo Ojelade, Sunwook Kim, and Maury A. Nussbaum published a study on arXiv on February 24, 2026, introducing vision-language models (VLMs) for ergonomic assessment of manual lifting tasks, specifically to estimate horizontal (H) and vertical (V) hand distances from RGB video streams.
The work addresses work-related musculoskeletal disorders caused by manual lifting, for which the Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic assessment tool. The hand-distance variables the RNLE requires are typically obtained through manual measurement or specialized sensing systems that are difficult to deploy in real-world environments. The team's approach instead leverages computer vision and VLMs to enable non-invasive assessment from ordinary video, which could make routine workplace safety monitoring far more practical.
The researchers developed two multi-stage VLM-based pipelines, a text-guided detection-only approach and a detection-plus-segmentation approach, both feeding transformer-based temporal regression to estimate hand distances at the start and end of a lift. Evaluated across seven camera view conditions, the segmentation-based, multi-view pipeline consistently yielded the smallest errors: mean absolute errors of approximately 6-8 cm for horizontal distances and 5-8 cm for vertical distances. Notably, pixel-level segmentation reduced estimation error by 20-30% for horizontal and 35-40% for vertical measurements compared to the detection-only approach, strengthening the feasibility of VLM-based ergonomic risk assessment in diverse workplace settings.
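The RNLE combines H and V (among other task variables) into multipliers that discount a recommended weight limit. The standard horizontal and vertical multipliers are public formulas (HM = 25/H, VM = 1 − 0.003·|V − 75|, with distances in cm); the sensitivity check below is our own illustration of why centimeter-level estimation errors matter, not an analysis from the paper:

```python
def horizontal_multiplier(h_cm: float) -> float:
    """RNLE horizontal multiplier: HM = 25 / H (H in cm).
    H is floored at 25 cm, so HM never exceeds 1.0."""
    return 25.0 / max(h_cm, 25.0)

def vertical_multiplier(v_cm: float) -> float:
    """RNLE vertical multiplier: VM = 1 - 0.003 * |V - 75| (V in cm).
    Clamped at 0 for extreme heights."""
    return max(1.0 - 0.003 * abs(v_cm - 75.0), 0.0)

# Sensitivity illustration: an 8 cm horizontal error (the upper end of
# the reported MAE range) shifts HM noticeably at typical reach distances.
hm_true = horizontal_multiplier(40.0)   # 0.625
hm_off = horizontal_multiplier(48.0)    # ~0.521
print(f"HM at 40 cm: {hm_true:.3f}, at 48 cm: {hm_off:.3f}")
```

Because HM scales as 1/H, the same absolute error costs more at short reach distances, which is one reason sub-decimeter accuracy from plain RGB video is a meaningful result.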
🏷️ Themes
Ergonomics, Computer Vision, Workplace Safety
Original Source
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.20658 [Submitted on 24 Feb 2026]
Title: Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video
Authors: Mohammad Sadra Rajabi, Aanuoluwapo Ojelade, Sunwook Kim, Maury A. Nussbaum
Abstract: Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal and vertical hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions.
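The leave-one-subject-out protocol in the abstract can be sketched in outline. The helper below is hypothetical (not the authors' code): it aggregates per-subject mean absolute error, assuming each subject's predictions come from a model trained on the remaining subjects:

```python
from statistics import mean

def loso_mae(records):
    """Per-subject MAE under a leave-one-subject-out split.

    records: list of (subject_id, true_cm, predicted_cm) tuples, where
    each subject's predictions are assumed to come from a model that
    never saw that subject during training.
    Returns: dict mapping held-out subject -> MAE on that subject's lifts.
    """
    subjects = sorted({s for s, _, _ in records})
    return {
        held: mean(abs(t - p) for s, t, p in records if s == held)
        for held in subjects
    }

# Toy example with fabricated numbers, purely to show the shape of the output.
records = [("s1", 40.0, 46.0), ("s1", 50.0, 44.0), ("s2", 60.0, 55.0)]
print(loso_mae(records))  # {'s1': 6.0, 's2': 5.0}
```

Reporting errors per held-out subject (rather than pooling all lifts) is what makes the 6-8 cm figures a generalization claim to unseen workers, not just unseen lifts.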
Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors, achieving mean absolute errors of approximately 6-8 cm when...