Le Petit Prince Corpus

The Le Petit Prince Corpus (LPPC) is a multilingual resource for research in (computational) psycho- and neurolinguistics. The primary data consists of two versions of the children’s story The Little Prince in 26 languages: the original translated text and synthesized speech, obtained using state-of-the-art text-to-speech synthesis (TTS). The planned release of the LPPC dataset comprises three parts: 1) raw and syntactically annotated text (dependency parses), 2) near-natural-sounding synthetic speech, and 3) electroencephalography (EEG) recordings.

The LPPC dataset plays a central role in the Language Cycles project: We will use this corpus to conduct neurolinguistic studies that generalize across a wide range of languages. This allows us to overcome the typological constraints of traditional neuroscientific approaches, which usually limit the amount of language stimuli. Apart from using this corpus for our own studies, we will make the LPPC available to the scientific community: The planned release of the LPPC combines linguistic and EEG data for many languages using fully automatic methods, and thus constitutes a readily extendable resource that supports cross-linguistic and cross-disciplinary research.

If you want to find out more about the building process and applied methods, have a look at our recent publication at the LiNCR workshop.