Authors
Alejandro Álvarez Castro and Joaquín Ordieres-Meré, Universidad Politécnica de Madrid, Madrid
Abstract
Earnings calls represent a uniquely rich and semi-structured source of financial communication, blending scripted managerial commentary with unscripted analyst dialogue. Although recent advances in financial sentiment analysis have integrated multimodal signals such as textual content and vocal tone, most systems rely on flat document-level or sentence-level models and fail to capture the layered discourse structure of these interactions. This paper introduces a novel multimodal framework that generates semantically rich and structurally aware embeddings of earnings calls by encoding them as hierarchical discourse trees. Each node, comprising either a monologue or a question-answer pair, is enriched with emotional signals derived from text, audio, and video, as well as structured metadata including coherence scores, topic labels, and answer-coverage assessments. A two-stage transformer architecture is proposed: the first stage encodes multimodal content and discourse metadata at the node level using contrastive learning, while the second synthesizes a global embedding for the entire call. Experimental results show that the resulting embeddings form stable, semantically meaningful representations that reflect affective tone, structural logic, and thematic alignment. The embeddings offer practical utility for downstream tasks such as financial forecasting and discourse evaluation, and the framework generalizes beyond financial reporting to other high-stakes unscripted communicative domains, such as telemedicine, education, and political discourse, offering a robust and explainable approach to multimodal discourse representation.
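For concreteness, the following is a minimal Python sketch of the hierarchical discourse-tree input described in the abstract. All class and field names are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Sketch of the discourse-tree data structure: each node is a monologue or a
# question-answer pair carrying per-modality emotional signals and structured
# metadata (coherence, topic, answer coverage). Names are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DiscourseNode:
    """One monologue or question-answer pair in the call."""
    node_type: str                                             # "monologue" or "qa_pair"
    text: str                                                  # transcript segment
    text_emotion: List[float] = field(default_factory=list)   # text-derived affect scores
    audio_emotion: List[float] = field(default_factory=list)  # vocal-tone affect scores
    video_emotion: List[float] = field(default_factory=list)  # facial-expression affect scores
    coherence_score: Optional[float] = None                    # structured metadata
    topic_label: Optional[str] = None
    answer_coverage: Optional[float] = None                    # meaningful for qa_pair nodes
    children: List["DiscourseNode"] = field(default_factory=list)

@dataclass
class EarningsCallTree:
    """Root of the discourse tree: scripted remarks followed by Q&A nodes."""
    ticker: str
    quarter: str
    nodes: List[DiscourseNode]
```

Under this structure, the first-stage transformer would consume individual `DiscourseNode` instances and the second stage would aggregate the resulting node embeddings over an `EarningsCallTree` into a single call-level representation.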
Keywords
Multimodal Learning, Earnings Calls, Hierarchical Discourse Trees, Cross-modal Embeddings, Transformer Models, Contrastive Learning, Representation Learning, Self-supervised Learning.