A Transformer-Based NLP Pipeline for Enhanced Extraction of Botanical Information Using CamemBERT on French Literature


Ayoub Nainia1, Regine Vignes-Lebbe1, Eric Chenin2, Maya Sahraoui1, Hajar Mousannif3 and Jihad Zahir2,3, 1Sorbonne Universite, France, 2UMMISCO, France, 3Cadi Ayyad University, Morocco


This research investigates the untapped wealth of centuries-old French botanical literature, focusing on floras: comprehensive guides that describe the plant species of a specific region. Despite its significance, this literature remains largely unexplored by AI methods. Our objective is to bridge this gap by constructing a specialized French botanical dataset sourced from the flora of New Caledonia. We propose a transformer-based Named Entity Recognition (NER) pipeline that leverages distant supervision and CamemBERT to automatically extract and structure botanical information. The results demonstrate strong performance: for species name extraction, the NER model achieves a precision of 0.94, a recall of 0.98, and an F1-score of 0.96; for fine-grained extraction of botanical morphological terms, the CamemBERT-based NER model attains a precision of 0.93, a recall of 0.96, and an F1-score of 0.94. This work contributes to the exploration of valuable botanical literature by demonstrating that AI models can automate information extraction from complex and diverse texts.
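The abstract does not detail how distant supervision is applied, so the following is only a minimal sketch of one common form of it: tokens are tagged automatically by looking them up in a list of known names, and the resulting noisy labels can then be used to fine-tune an NER model such as CamemBERT. The function name `distant_label`, the `SPECIES` label, the small gazetteer, and the example sentence are all illustrative assumptions, not taken from the paper.

```python
def distant_label(tokens, gazetteer, label="SPECIES"):
    """Assign BIO tags by longest-match lookup of token n-grams in a gazetteer.

    `gazetteer` is a set of lowercase token tuples (hypothetical format).
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(entry) for entry in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in gazetteer:
                matched = n
                break
        if matched:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + matched):
                tags[j] = f"I-{label}"
            i += matched
        else:
            i += 1
    return tags


# Illustrative gazetteer and sentence (assumptions, not data from the paper).
GAZETTEER = {("araucaria", "columnaris"), ("agathis", "lanceolata")}
tokens = "L' Araucaria columnaris est un arbre de grande taille .".split()
tags = distant_label(tokens, GAZETTEER)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Labels produced this way are noisy, since names missing from the gazetteer stay tagged `O`; the appeal of the approach is that it avoids manual annotation of the whole corpus.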


Information Extraction, Natural Language Processing, Named Entity Recognition, Biodiversity Literature.

Volume 14, Number 6