Authors
Anja Radomirović, University Union, Serbia
Abstract
Mutations in the HBB gene cause severe hemoglobinopathies such as sickle cell disease and beta-thalassemia. Accurate classification of HBB variants is crucial for diagnosis but remains challenging. I present a bioinformatics pipeline that integrates HGVS parsing, Ensembl annotation, SpliceAI, and BioPython to analyze 1,809 ClinVar variants. Seven models were trained with SMOTE oversampling. XGBoost achieved an F1-score of 0.9495 and perfect recall, but its ROC-AUC of 0.4489, below the 0.5 expected from random ranking, indicates limited ability to discriminate between classes. The results highlight the challenges of applying machine learning to single-gene variant classification and the importance of data quality in genomic medicine.
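The combination of a high F1-score, perfect recall, and a ROC-AUC below 0.5 can look contradictory. The following minimal sketch shows how the three can coexist on an imbalanced test set. It uses invented toy labels and scores (19 positive and 2 negative variants, all hypothetical), not the study's data or its XGBoost model, and it does not reproduce the reported values. The helper names `recall_and_f1` and `roc_auc` are illustrative only.

```python
# Toy illustration only: labels and scores are invented, not taken from the study.

def recall_and_f1(y_true, y_pred):
    """Recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return recall, f1


def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced test set: 19 positives (1), 2 negatives (0).
y_true = [1] * 19 + [0] * 2

# Hypothetical model scores: every score is above the 0.5 threshold,
# and the two negatives are ranked above most of the positives.
scores = [(60 + i) / 100 for i in range(19)] + [0.705, 0.755]

# Thresholding at 0.5 labels every variant as positive.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

recall, f1 = recall_and_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)

print(f"recall={recall:.4f}  f1={f1:.4f}  roc_auc={auc:.4f}")
```

In this toy case every variant is predicted positive, so recall is 1.0 and F1 is 0.95 purely because positives dominate the test set, while the ranking of scores is worse than random (ROC-AUC of about 0.29). F1 and recall depend on the chosen threshold and the class balance, whereas ROC-AUC depends only on how the scores order the two classes.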
Keywords
HBB gene, variant pathogenicity, machine learning, protein encoding, XGBoost, hemoglobinopathies