Improving Multi-Label Business Text Classification with Imbalanced Data: Adjusted BCE Weighting and Threshold Optimization for Rare Labels in BERT Models

Document Type : Original Article

Authors

1 Apadana Institute, Shiraz, Iran

2 Assistant Professor, Apadana Institute, Shiraz, Iran

Abstract

Multi-label classification of business texts under imbalanced label distributions remains a significant challenge in Natural Language Processing. Tail labels, which have very few training samples, typically exhibit weak predictive performance even with advanced transformer-based models such as BERT. This limitation hinders the reliable identification of rare but potentially valuable business opportunities in large-scale textual data. The present study aims to enhance tail-label performance by introducing an adjusted weighting strategy into the Binary Cross-Entropy (BCE) loss function. The proposed approach has two components. First, a label-specific weight is computed as the ratio of negative to positive samples for each label and constrained to a predefined range so that neither frequent nor rare labels dominate the loss. Second, an optimal decision threshold is determined by grid search over the interval [0.1, 0.9], improving the balance between precision and recall across labels. Experiments are conducted on an English multi-label dataset of 1,000 samples and 20 imbalanced labels, with per-label frequencies ranging from 5 to 180 instances. The data are split into 80% training and 20% testing sets. Results show that the weighted BERT model achieves a Hamming accuracy of 0.623, a macro-F1 score of 0.091, and a tail-label F1 score of 0.025. Notably, using only one twenty-eighth of the baseline dataset size, the model retains approximately 70% of the baseline accuracy while improving tail-label performance over the unweighted setting. The method offers a practical, computationally efficient solution for data-scarce and resource-constrained environments.
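The two components described above admit a compact implementation. The following is a minimal sketch, assuming a PyTorch/scikit-learn stack; the clipping bounds `w_min` and `w_max`, the 0.05 grid step, and the choice of a separate threshold per label (rather than one global threshold) are illustrative assumptions, since the abstract specifies only the negative-to-positive ratio, a predefined clipping range, and the [0.1, 0.9] search interval.

```python
import numpy as np
import torch
from sklearn.metrics import f1_score

def label_weights(y_train: np.ndarray, w_min: float = 1.0, w_max: float = 10.0) -> torch.Tensor:
    """Per-label weight = (#negatives / #positives), clipped to [w_min, w_max]."""
    pos = y_train.sum(axis=0)             # positive count per label
    neg = y_train.shape[0] - pos          # negative count per label
    w = neg / np.clip(pos, 1, None)       # guard against labels with zero positives
    return torch.as_tensor(np.clip(w, w_min, w_max), dtype=torch.float32)

def best_thresholds(probs: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """Grid search over [0.1, 0.9] for the threshold maximizing each label's F1."""
    grid = np.arange(0.1, 0.91, 0.05)
    return np.array([
        max(grid, key=lambda t: f1_score(y_true[:, j], probs[:, j] >= t, zero_division=0))
        for j in range(y_true.shape[1])
    ])

# Illustrative usage with synthetic label shapes (20 labels, as in the study):
Y_train = (np.random.rand(800, 20) < 0.05).astype(np.float32)
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=label_weights(Y_train))
```

In training, `criterion(logits, targets)` would replace the unweighted BCE over the model's sigmoid logits; clipping the weights keeps very rare labels from dominating the loss, matching the rationale stated in the abstract, and `best_thresholds` would be fitted on held-out probabilities before test-time evaluation.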


