Tailoring Furhat robotic head lip-syncing to Galician language: an adaptation and evaluation study

Text-speech alignment and lip-syncing are crucial for a pleasant interaction with an embodied conversational agent, especially for anthropomorphic social robots like Furhat. While pre-trained alignment models are integrated into these agents, they may not align speech in the target language accurately, as they are primarily trained on English. This paper addresses this limitation for Galician, leveraging a previously developed text-to-speech system, the Furhat robot, and the Montreal Forced Aligner (MFA). We create acoustic models and a pronunciation dictionary for Galician from scratch, a key contribution given the lack of resources for the language. We propose an alternative method that uses MFA to generate accurate phone-level alignments for Galician synthetic speech, and we evaluate its quality through objective and subjective experiments. In this preliminary study, our trained model’s accuracy in the misalignment assessment matches results reported in the literature for other languages, despite the limited data available for Galician. In the subjective evaluation, a perceptual test with native speakers reveals a strong preference (88%) for our lip synchronization over Furhat’s default (2%), supporting the validity of our method for improving lip-syncing in under-resourced languages.

keywords: lip-syncing, forced alignment, social robot, Furhat, Galician language