Le Petit Prince Corpus

The Le Petit Prince Corpus (LPPC) is a multilingual resource for research in (computational) psycho- and neurolinguistics. The primary data consists of two versions of the children’s story The Little Prince in 26 languages: the original translated text and synthesized speech, obtained using state-of-the-art text-to-speech synthesis (TTS). The LPPC dataset comprises three parts: 1) raw and syntactically annotated text (dependency parses), 2) near-natural-sounding synthetic speech, and 3) electroencephalography (EEG) recordings. The LPPC combines linguistic and EEG data for many languages using fully automatic methods, and thus constitutes a readily extendable resource that supports cross-linguistic and cross-disciplinary research.

The LPPC dataset plays a central role in the Language Cycles project: we will use this corpus to conduct neurolinguistic studies that generalize across a wide range of languages. This allows us to overcome the typological constraints of traditional neuroscientific approaches, which usually limit the amount of language stimuli.

To find out more about how the corpus was built and the methods applied, have a look at our recent publication at the LiNCR workshop.