Rhythms of Speech and Language showcases the current state of the art in research on rhythm in speech and language. Its ~50 chapters go full circle in covering physiology, cognition, and culture—presenting the accumulated knowledge gained by neuroscience, cognitive science, and psychology, as well as phonetics and communication research. The past decades of research have fostered the elegant idea that bodily rhythms (e.g., of breath, jaws, and brains) are key to understanding the production and comprehension of human speech and language—not only in healthy adults, but also in developmental and clinical populations. It is also intuitive that such rhythms have shaped the languages of the world in the course of language evolution. Yet rhythm also shows perplexing dimensionality and variability within and across the languages of the world, speakers’ registers, and communicative settings. Rhythms of Speech and Language provides the scientific basis for reconciling physiological uniformity with cultural variability, allowing for synergistic progress across fields of research.

The book will be published by Cambridge University Press in 2025. Please find the Table of Contents with abstracts and preprints below: 

Section A: The Physiology of Rhythm

In spoken communication, one can observe a near-constant presence of both communicative gestures and non-communicative movements involving the limbs, for actions or locomotion. This suggests that the physical underpinning of spoken communication extends beyond the articulatory system. It may find its roots in breathing, which plays a crucial role in the control and rhythms of both speech and limb movement. This hypothesis has recently garnered attention in interdisciplinary research. Within this framework, this chapter examines evidence of the impact of breathing and limb movements on speech rhythms. First, it highlights breathing as a fundamental rhythm unique to speakers, acting as a conductor for the temporal organization of speech at various linguistic levels. The chapter then further explores the influence of co-speech gestures and non-communicative motions on the temporal organization of speech. The intricate interplay between speech and breathing, and between speech and motion, suggests that breathing may act as a bridge connecting speech and limb motions at different levels. This perspective encourages research to investigate the interaction of speech, breathing, and limb motion in communication. The insights acquired in this context inspire the exploration of new research directions, aiming for a more comprehensive understanding of the physical determinants of spoken communication.

A string of speech is a string of syllables, that is, a series of jaw openings and closings of varying amplitude. When vowel-intrinsic jaw opening is neutralized, the pattern of jaw opening matches the utterance’s syllable prominence pattern. The hypothesis is that the jaw opening patterns ensue from the metrical hierarchy of the language: for languages like English, we see exponentially increasing jaw displacement on the metrically strong syllable within each foot, phrase, and utterance; for languages like French, Chinese, and Japanese, we see increased jaw displacement at the end of each phrase, with the largest jaw displacement at the end of the utterance. These language-specific jaw displacement patterns tend to be carried over when learning a second language. Also explored in this chapter are interactions between segmental articulation and jaw displacement patterns, as well as the relationship between metrically motivated jaw displacement patterns and listeners’ perceptions of utterance prominence patterns.

Brain rhythms at different timescales are observed ubiquitously across cortex. Despite this ubiquity, individual brain areas can be characterized by ‘spectral profiles’, which reflect distinct patterns of endogenous brain rhythms. Crucially, endogenous brain rhythms have often been explicitly or implicitly related to perceptual and cognitive functions. Regarding language, a vast amount of research investigates the role of brain rhythms in speech processing. In particular, lower-level processes, such as speech segmentation and consecutive syllable encoding, and the hemispheric lateralization of such processes, have been related to auditory cortex brain rhythms in the theta and gamma range and explained by neural oscillatory models. Other brain rhythms—particularly delta and beta—have been related to prosodic processing (delta) but also to higher-level language processing, including phrasal and sentential processing. Delta and beta brain rhythms have also been related to predictions from the motor cortex, emphasizing the tight link between production and perception. More recently, neural oscillatory models were extended to include different levels of language processing. Attempts to directly relate the brain rhythms observed during task-related processing to endogenous brain rhythms are sparse. In summary, many questions remain: the functional relevance of brain rhythms with respect to speech and language continues to be a subject of heated discussion, and research that systematically links endogenous brain rhythms to specific computations and possible algorithmic implementations is rare.

Click here for the preprint

The temporal signatures that characterize speech—especially its prosodic qualities—are observable in the movements of the hands and bodies of its speakers. A neurobiological account of these prosodic rhythms is thus likely to benefit from insights into the neural coding principles underlying co-speech gestures. Here we consider whether the vestibular system, a sensory system that encodes movements of the body, contributes to prosodic processing. A careful review of the vestibular system’s anatomy and physiology, its role in dynamic attention and active inference, its relevance for the perception and production of rhythmic sound sequences, and its involvement in vocalization points to a potential role for vestibular codes in the neural tracking of speech. Noting that the kinematics and time course of co-speech movements closely mirror prosodic fluctuations in spoken language, we propose that the vestibular system cooperates with other afferent networks to encode and decode prosodic features in multimodal discourse and possibly in the processing of speech presented unimodally.

The temporal structure of speech provides crucial information to listeners for comprehension: in particular, the slow modulations in the amplitude envelope constitute important landmarks for discretizing the continuous signal into linguistic units. Contemporary models of speech perception attribute a major functional role to rhythmic brain activity in this process: by aligning their phase to the quasi-periodic patterns in speech, neural oscillations would facilitate speech decoding. Here we review evidence from EEG/MEG studies showing neural theta-range (~4–8 Hz) tracking of syllabic rhythm, with a special interest in speech rate variations. We also discuss to what extent neural oscillatory coupling contributes to, and is in turn modulated by, speech intelligibility, namely whether it is only acoustically or also linguistically guided. Finally, we review findings showing that, in addition to auditory cortex, motor regions play an active role in the oscillatory dynamics underlying speech processing.

Click here for the preprint
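
To illustrate the kind of measure reviewed in this chapter, the sketch below estimates theta-band coupling between a single neural channel and the speech amplitude envelope using magnitude-squared coherence. The function name, sampling rates, and the choice of coherence (rather than, e.g., phase-locking or TRF analyses) are illustrative assumptions, not the specific pipeline of the chapter.

import numpy as np
from scipy.signal import hilbert, coherence

def theta_band_coherence(eeg, audio, fs_eeg=250, fs_audio=16000):
    """Estimate 4-8 Hz coherence between one EEG channel and the speech envelope.

    Illustrative sketch only: real EEG/MEG pipelines add filtering, epoching,
    artifact rejection, and surrogate statistics.
    """
    # Wide-band amplitude envelope of the speech signal
    env = np.abs(hilbert(audio))
    # Downsample the envelope to the EEG sampling rate (assumes an integer ratio)
    step = fs_audio // fs_eeg
    env = env[::step][: len(eeg)]
    eeg = eeg[: len(env)]
    # Magnitude-squared coherence spectrum, then average over the theta band
    freqs, coh = coherence(eeg, env, fs=fs_eeg, nperseg=fs_eeg * 2)
    theta = (freqs >= 4) & (freqs <= 8)
    return coh[theta].mean()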

Spoken language is a complex signal that evolves over time and conveys rhythm across multiple timescales. Beyond the signal level, there is rhythm in social aspects of speech communication such as joint attention, gestures, or turn-taking. Neural oscillations have in many cases been shown to directly reflect the rhythmic features of speech. However, knowledge about the origins, specific functions, and potential interactions of different rhythms and their neural signatures is far from complete. An integrative perspective that builds on phylogenetic and ontogenetic developments can provide some of the missing components. Here we propose that speech production and perception engage evolutionarily ancient temporal processing mechanisms that guide sensorimotor sequencing and the allocation of cognitive resources in time. Slow-wave (delta-to-theta band) oscillations are the designated common denominator of these mechanisms, which interact in a speech-specific variant of the perception–action cycle with the goal of achieving optimal temporal coordination and predictive adaptation in speech communication.

Click here for the preprint

A better understanding of where speech and language rhythms come from may require not only their investigation in humans but also an understanding of their roots in the animal kingdom. In this opinion paper, we will summarize what is known about the role of locomotion and respiration as generators of rhythm across species. Furthermore, we will discuss selected prosodic phenomena, such as f0 declination over the course of an utterance and final lengthening at the end of an utterance, as markers of rhythm. We will summarize the evidence on the extent to which they may also appear in the communicative calls of animals, propose a new research program along those lines, and discuss their relation to language representations.

Section B: Acoustic and Sublexical Rhythms

The “speech envelope” is often used as an acoustic proxy for neural rhythm. The problem is the assumption that the unfiltered, broadband signal can satisfactorily model neural modulation in the auditory pathway (and beyond). However, the auditory system does not function as a passive transducer, but rather decomposes and segregates the signal into an array of tonotopically organized frequency channels. This modulation filtering results in a partitioning of slow (3 Hz–20 Hz) neural modulation patterns across the tonotopic axis that bear only a passing resemblance to the broadband speech envelope. Such polychromatic diversity (in frequency, magnitude, and phase) of auditory modulation patterns is critical for decoding the speech signal, as it highlights linguistic properties, such as articulatory-acoustic and prosodic features, that are important for understanding spoken language. The low-frequency modulation patterns associated with high-frequency (>2 kHz) auditory channels are especially important for prosodic processing and consonant discrimination, both key for speech intelligibility, especially in adverse listening conditions and among the hard of hearing. The speech signal’s resilience and linguistic depth are a consequence of its “polychromatic” and multimodal qualities, including those associated with audio-visual speech features and semantic context.
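
As a minimal sketch of the contrast drawn in this abstract, the following code compares the broadband envelope with envelopes taken from individual band-pass channels. The three band edges and filter settings are arbitrary illustrative choices, not the chapter’s filterbank; a realistic model would use a tonotopically spaced (e.g., gammatone) filterbank with many channels.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_envelopes(x, fs, bands=((100, 500), (500, 2000), (2000, 8000))):
    """Return the broadband envelope and per-channel envelopes of mono signal x.

    Assumes fs is well above twice the highest band edge (e.g., >= 16 kHz here).
    """
    broadband = np.abs(hilbert(x))
    channels = {}
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        channels[(lo, hi)] = np.abs(hilbert(sosfiltfilt(sos, x)))
    return broadband, channels
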
In speech, linguistic information is encoded in hierarchically organized units such as phones, syllables, and words. In auditory neuroscience, it is widely accepted that syllables in connected speech are quasi-rhythmic, and that this rhythmicity makes them suitable to be encoded by theta-band neural oscillations. The rhythmicity of phones or words, however, is more controversial. Here, we analyze the statistical regularity in the duration of phones, syllables, and words, based on large corpora of English and Mandarin Chinese. The coefficient of variation (CV) of unit duration is slightly lower for syllables than for phones and words, consistent with the idea that syllables are more rhythmic than phones and words, although the difference is weak. The mean durations of phones, syllables, and words match the timescales of alpha-, theta-, and delta-band neural oscillations, respectively.
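
For concreteness, the duration statistics used in this abstract can be computed directly; the sketch below derives the coefficient of variation and maps a mean unit duration onto a modulation rate (the example durations are hypothetical).

import numpy as np

def duration_stats(durations_s):
    """Coefficient of variation and corresponding rate for unit durations (in seconds)."""
    d = np.asarray(durations_s, dtype=float)
    cv = d.std() / d.mean()      # lower CV = less variable, i.e. more rhythmic, durations
    rate_hz = 1.0 / d.mean()     # mean duration mapped onto a modulation frequency
    return cv, rate_hz

# Hypothetical syllable durations: a ~200 ms mean maps onto ~5 Hz, i.e. the theta range
cv, rate = duration_stats([0.18, 0.22, 0.25, 0.17, 0.21])
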
The analysis of low-frequency amplitude envelopes has become a widespread method in the speech sciences, language acquisition, and neurolinguistics. Amplitude envelopes track an utterance’s amplitude distribution and hence the part of the signal that conveys speech rhythm. Given different methodological decisions, studies are sometimes difficult to compare. This chapter summarizes acoustic and statistical procedures used in the field and focuses on which factors influence the amplitude envelopes, and in which way, comparing data on aspects that relate to speech rhythm (a language’s rhythm class, speech styles, phonemic segment length). It furthermore tests the specificity of amplitude envelopes for tracking speech rhythm by analyzing control data with different pitch accent types (which are not expected to influence rhythm). Comparing the various factors with the same procedures makes it possible to rank them with respect to the magnitude of differences in amplitude modulation spectra and the frequency bands in which those differences occur.
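
One common way to obtain the amplitude modulation spectra compared in this chapter is sketched below; the cutoff, window, and downsampling choices are placeholders, and published procedures differ in exactly the ways the chapter discusses.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_spectrum(x, fs, env_cutoff=10.0, env_fs=100):
    """Low-frequency amplitude modulation spectrum of mono signal x.

    Steps: Hilbert envelope -> low-pass at env_cutoff Hz -> downsample -> windowed FFT.
    Assumes fs is an integer multiple of env_fs.
    """
    env = np.abs(hilbert(x))
    sos = butter(4, env_cutoff, btype="low", fs=fs, output="sos")
    env = sosfiltfilt(sos, env)
    env = env[:: int(fs // env_fs)]
    env = env - env.mean()                      # remove DC before the FFT
    spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / env_fs)
    return freqs, spec
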
Much linguistic research into the perception of rhythmic structure in speech has been concerned with temporal domains that may show isochronous, or at least somewhat regular, timing. Early studies discovered that there is a substantial discrepancy between the physical and the subjectively perceived onsets of speech events such as words or syllables. Sequences of alternating speech units tend to be perceived as irregularly timed if the intervening pause duration is kept constant. This peculiarity of speech perception is commonly referred to as the perceptual center effect (or the P-center). Since its discovery, the effect has defied all quantification attempts, as the P-center does not seem to coincide consistently with any specific acoustic markers of speech signals, though it is generally agreed that the P-center represents the rhythmic beat in speech. This chapter reviews existing evidence, outlines future directions, and discusses the domain of beat perception in spoken language.

In speech perception, timing and content are interdependent. For example, in distal rate effects, context speech rate determines the number of words, syllables, and phonemes heard in an unchanging target speech segment. Such results confront psycholinguistic theory with the chicken-and-egg problem of concurrently inferring speech timing and content, and the interrelated issues of narrowing the search space of speech interpretations without bias and optimizing the speed/accuracy tradeoff in online processing. We propose listeners address these issues by managing the timing of speech-related computations. Specifically, we claim: (1) listeners model speech timing as part of a speaker model; (2) variable-length sequences of morphosyntactic units are the basic increments of speech inference; and (3) listeners adaptively schedule inferential updates and computationally intensive operations according to (4) fluctuations in uncertainty predicted by the speaker model. We illustrate these claims in a mechanistic model—Vowel-onset Paced Syllable Inference—explaining multiple psycholinguistic results, including distal rate effects.

Click here for the preprint

Speech is a multiplexed signal displaying levels of complexity, organizational principles, and perceptual units of analysis at distinct timescales. This acoustic signal, critical for human communication, is thus characterized at distinct representational and temporal scales, related to distinct linguistic features, from acoustic to supra-lexical. This chapter presents an overview of experimental work devoted to the characterization of the speech signal at different timescales, beyond its acoustic properties. The functional relevance of these different levels of analysis for speech processing is discussed. We advocate that studying speech perception through the prism of multi-timescale representations effectively integrates work from various research areas into a coherent picture and contributes significantly to increasing our knowledge of the topic. Finally, we discuss how these experimental results fit with neural data and current dynamical models of speech perception.

Many studies in the linguistic literature have tried to explain the rhythmic component of speech by resorting to the notion of isochrony. The problems with such approaches have been demonstrated in various recent works, owing to the fact that natural speech is highly irregular and quasi-periodic at best. Rhythm also plays a role in the link between brain oscillations and linguistic constituents, where entrainment is often assumed to be the underlying mechanism. Here too, the non-isochronous nature of the speech signal led recent works to call for a more nuanced understanding of entrainment in the context of language. We suggest that rhythm is the timescale within which temporal relationships between isolated events are perceived (about 0.5–12 Hz). We claim that while music tends to use this timescale to promote phase-locking to an external clock, language exploits it to achieve an effective distinction between fast and slow rates in prosody.

Click here for the preprint

Section C: Rhythm in Prosody and at the Prosody–Syntax Interface

On phrasal timescales, spontaneous conversational speech is not very rhythmic. Instead, periods of speech activity are intermittent: words tend to come in short bursts and are often interrupted with hesitations. Nonetheless, it has been suggested that there is a production mechanism that generates phrasal rhythmicity in speech. This chapter examines the empirical evidence for such a mechanism and concludes that speakers do not directly control the timing of phrases. Instead, it is argued that temporal patterns associated with phrases are epiphenomena of processes involved in conceptual-syntactic organization. A model is presented in which coherency-monitoring systems govern the initiation and interruption of speech activity. Hesitations arise when conceptual or syntactic systems fail to achieve sufficiently ordered states. The model provides a mechanism to account for intermittency on phrasal timescales.

Click here for the preprint

Recent studies have shown that neural activity tracks the syntactic structure of phrases and sentences in connected speech. This work has sparked intense debate, with some researchers aiming to account for the effect in terms of the overt or imposed prosodic properties of the speech signal. In this chapter, we present four types of arguments against attempts to explain putatively syntactic tracking effects in prosodic terms. The most important limitation of such prosodic accounts is that they are architecturally incomplete, as prosodic information does not arise in speech autonomously. Prosodic and syntactic structure are interrelated, so prosodic cues are informative about the intended syntactic analysis, and syntactic information can be used to aid speech perception. Rather than trying to attribute neural tracking effects exclusively to one linguistic component, we consider it more fruitful to think about ways in which the interaction between the components drives the neural signal. 

Click here for the preprint

In this chapter, we discuss research from behavior, event-related brain potentials, and neural oscillations which suggests that cognitive and neural constraints affect the timing of speech processing and language comprehension. Some of these constraints may even manifest as rhythmic patterns in linguistic behavior. We discuss two types of constraints: First, we review how the unfolding acoustic and abstract context affect the timing of incremental processing on different linguistic levels (e.g., prosody, syntax). Second, we consider context-invariant constraints (e.g., working memory trace decay, period of electrophysiological activity) and how these limit the duration of our processing time windows, thus restricting our segmentation and composition abilities.

Click here for the preprint

Durational information provides a reliable cue to the unfolding syntactic structure of a sentence. At the same time, durational properties of speech are largely dependent on predictability: less predictable elements of an utterance are more carefully articulated, and thus produced more slowly. While these two determinants of duration (structure and predictability) often align, there exists a well-defined exception where the two factors make opposite predictions. We discuss converging evidence for tempo modulation playing a crucial role in the disambiguation of clausal attachment (modifier vs argument), leading to a shorter duration for the less predictable nested structure and a longer duration for the more predictable sisterhood structure. We then present an account of these temporal patterns, based on the interaction of independently motivated prosodic principles.
The term “prosody” encompasses properties of speech that span several timescales and levels of linguistic units, from the intensity and pitch of phonemes and syllables to the overall timing and intonation of utterances and conversations. Hierarchical temporal structure was introduced as a measure of clustering in sound energy that quantifies the relationship among timescales of prosody and related aspects of speech and music. The present chapter reviews several studies showing that the degree of hierarchical temporal structure in speech signals, as measured by the rate of increase in clustering with timescale, reflects the degree of prosodic composition. Prosodic composition can serve different purposes in communication, including linguistic emphasis and chunking in infant-directed speech, scaffolding of spoken interactions with children whose speech abilities are relatively less developed, and stricter timing in formal interactions. Prosodic composition as expressed by hierarchical temporal structure may serve as a control parameter in speech production and communication.
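
The clustering-with-timescale measure described here can be illustrated with an Allan Factor analysis of acoustic event times; the event-extraction step, the timescale grid, and the use of a log-log slope are assumptions made for this sketch rather than the chapter’s exact procedure.

import numpy as np

def allan_factor_slope(event_times, timescales):
    """Hierarchical temporal structure as the log-log slope of the Allan Factor
    AF(T) = <(N_{i+1} - N_i)^2> / (2 <N_i>) across window sizes T.

    event_times: times (s) of acoustic events, e.g. peaks in the energy envelope.
    timescales:  window sizes (s), assumed much shorter than the recording length.
    """
    t = np.sort(np.asarray(event_times, dtype=float))
    af = []
    for T in timescales:
        edges = np.arange(t[0], t[-1], T)       # non-overlapping windows of size T
        counts, _ = np.histogram(t, bins=edges)
        diffs = np.diff(counts)
        af.append((diffs ** 2).mean() / (2 * counts.mean()))
    # Steeper slope of log AF against log T indicates more clustering across timescales
    slope, _ = np.polyfit(np.log(timescales), np.log(af), 1)
    return slope
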
Temporal properties, such as duration, rate, and rhythm, are crucial aspects influencing the perception and production of speech. To study how these properties affect speech processing, researchers can create retimed experimental stimuli with varying temporal patterns. However, retiming speech also poses significant challenges, such as preserving naturalness, intelligibility, and prosody. In this chapter, we present three methods of altering the acoustic speech signal to achieve a desired rhythmic structure. Each method differs in how it adjusts the timing of the utterance and its segments. The methods are used to create stimuli with regular isochronous stress. We evaluate the methods in terms of how much they disrupt the speech signal and how effective they are in achieving isochrony. Finally, we demonstrate how retiming can be used to produce stimuli with more naturalistic rhythmic characteristics. We show that retiming can be a powerful tool for exploring perceptual effects of timing in speech.
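
One of several possible retiming strategies is sketched below: each interval between successive stress onsets is time-stretched to a common target duration so that stressed syllables become isochronous. The use of librosa’s phase-vocoder time stretching and the helper’s name and arguments are illustrative assumptions, not the three methods evaluated in the chapter.

import numpy as np
import librosa

def retime_to_isochrony(y, sr, stress_onsets_s):
    """Stretch each inter-stress interval to the mean interval duration.

    y: mono float audio, sr: sample rate, stress_onsets_s: times (s) of at least
    two stressed-syllable onsets. Returns audio with roughly isochronous stresses.
    """
    onsets = np.round(np.asarray(stress_onsets_s) * sr).astype(int)
    bounds = np.concatenate(([0], onsets, [len(y)]))
    segments = [y[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    target = np.mean(np.diff(onsets)) / sr            # common target duration (s)
    out = [segments[0]]                               # keep the stretch before the first onset
    for seg in segments[1:-1]:
        rate = (len(seg) / sr) / target               # rate > 1 shortens, < 1 lengthens
        out.append(librosa.effects.time_stretch(seg, rate=rate))
    out.append(segments[-1])                          # keep the tail after the last onset
    return np.concatenate(out)
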
A phenomenon that has received considerable attention is the propensity for an alternating rhythm in speech. However, algorithms for the calculation of linguistic rhythm are scarce and limited to binary alternation and very short, isolated structures. In the context of a production study, I introduce an algorithm for the calculation of rhythmic well-formedness that goes beyond such binary alternation and works for sequences longer than short phrases. The algorithm is based on the idea that rhythmicity is defined by a balanced distance between similarly prominent syllables. The study shows that the produced sentences, as well as the perceived prominence of the German object pronoun ihn (‘him’), vary systematically with the predicted degree of rhythmicity. The algorithm can be applied to any linguistic structure once the accented syllables are identified.
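
The chapter’s algorithm is not reproduced here; purely to illustrate the underlying idea (rhythmicity as balanced spacing of similarly prominent syllables), a hypothetical toy score might penalise uneven distances between accented syllables, as in the sketch below.

import numpy as np

def rhythmicity_score(prominence):
    """Toy well-formedness score for syllable prominences (0 = unaccented, 1 = accented).

    Hypothetical illustration only: the score is high when accented syllables are
    evenly spaced, and low when the spacing between them is unbalanced.
    """
    accents = np.flatnonzero(np.asarray(prominence) >= 1)
    if len(accents) < 2:
        return 1.0                        # nothing to balance
    gaps = np.diff(accents)               # distances (in syllables) between accents
    return 1.0 / (1.0 + gaps.std() / gaps.mean())

# A perfectly alternating sequence scores higher than a clash-and-lapse sequence
even = rhythmicity_score([1, 0, 1, 0, 1, 0, 1])     # 1.0
uneven = rhythmicity_score([1, 1, 0, 0, 0, 0, 1])   # 0.6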

Section D: Diversity of Rhythm from Oral Speech to Music

After a discussion of factors of time, cohesion, style, and rhythm formants in the context of speech registers, a brief appraisal of relevant approaches to the rhythms of natural speech is provided, and exploratory case studies of oral narrative registers are conducted using a novel speech modulation theoretic framework, Rhythm Formant Theory (RFT), and its associated methodology of Rhythm Formant Analysis (RFA). The versatility of this framework is shown in application to narrations of different types: toddler dialogue at an early stage of first language acquisition, the narrative genre of African village communities, fluency of reading aloud in English as a second language (L2), a comparison between newsreading and poetry reading in English, and a comparison of recitations of different Chinese poetry genres. Unlike earlier phonetic and phonological analyses of the ‘linguistic rhythm’ of words and sentences, the novel analyses deal with natural real-time rhythms in recordings of authentic data, which may be several minutes long, using utterance-long spectral analysis time windows. Cluster analysis of properties of the spectra and spectrograms that emerge from this analysis shows that speech registers can be phonetically distinguishable in the low-frequency infrasound spectral domain, and that the differences are interpretable in linguistic terms.

Click here for the preprint
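
To make the methodology concrete, the hedged sketch below computes a long-window, low-frequency spectrum of the amplitude envelope and picks its strongest peaks as candidate ‘rhythm formants’; the parameter choices (envelope rate, frequency ceiling, number of peaks) are illustrative and not the published RFA settings.

import numpy as np
from scipy.signal import hilbert, find_peaks

def rhythm_formants(x, fs, env_fs=100, f_max=10.0, n_peaks=3):
    """Candidate rhythm-formant frequencies (Hz) from the low-frequency envelope spectrum of x.

    Sketch only: a fuller implementation would low-pass the envelope before decimating
    and analyse spectrograms as well as single long-window spectra.
    """
    env = np.abs(hilbert(x))
    env = env[:: int(fs // env_fs)]                  # coarse envelope at env_fs Hz
    env = env - env.mean()
    spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / env_fs)
    lf = freqs <= f_max                              # keep the low-frequency rhythm range
    peaks, props = find_peaks(spec[lf], height=0)
    strongest = peaks[np.argsort(props["peak_heights"])[::-1][:n_peaks]]
    return np.sort(freqs[lf][strongest])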

Hardly any seminar, book, or YouTube tutorial on successful public speaking is without the established and traditional “cork exercise”. It is supposed to enhance speakers’ rhythm and intelligibility, but there has so far been no scientific evidence for this effect. Our experiment addressed this gap. Twenty speakers performed a presentation task three times: (1) before a cork-exercise intervention, (2) immediately after it, and (3) some minutes later, after having completed a distractor questionnaire. The intervention was a video recorded by a professional media trainer. Results show significant rhythmic (and related melodic and articulatory) differences between presentations (1) and (2), suggesting a positive effect for speakers in (2). However, in presentation (3), all measurements revert to the level of the baseline presentation (1). Thus, the “cork exercise” basically works and yields positive effects; however, they are short-lived. The chapter ends with suggestions for further research and practical ideas for a more sustainable design of the cork exercise.
An increasing number of studies report that different forms of rhythmic stimulation influence linguistic task performance. First, this chapter aims at describing to what extent the construction of a tree-like structure, in which lower-level units are combined into higher-level constituents, could be subserved by similar mechanisms in linguistic syntax and in rhythm. Second, we will review and categorise rhythmic stimulation findings based on the temporal delay between the rhythmic stimulation and the linguistic task that it influences, the precise relationship between the rhythmic and linguistic stimuli used, and the nature of the linguistic task. Lastly, this chapter will discuss which categories of rhythmic stimulation effects can be interpreted in a framework based on a shared cognitive system responsible for hierarchical structure building.

Music, like language, relies on listeners’ ability to extract information as it unfolds in time. One key difference between music and language is the stronger rhythmic regularity of music relative to language. Despite a wealth of literature describing the rhythms of song as regular and the rhythms of speech as irregular, the acoustic features and neural processing of rhythmic regularity in song, and its relative absence in speech, are poorly understood. This chapter examines acoustic, behavioural, and neural indices of rhythmic regularity in speech and song. Our goal is to review which features induce rhythmic regularity and examine how regularity impacts attention, memory, and comprehension. This work has the potential to inform a wide range of areas, including clinical interventions for speech and reading, best practices for teaching and learning in the classroom, and how attention is captured in real-world scenes.

Click here for the preprint

The timing of acoustic events in relation to different levels of structure building is a fundamental task in both language and music. While in music the timing of sounds and their relation to an abstract metrical grid is often used to create aesthetic effects, timing relations in language are commonly grammaticalized for the conventional construction of different levels of meaning, leaving only a narrow margin for rhythmic preferences of other sorts. Our article reviews functions of timing and, specifically, metrical structure in both music and language, suggests a unified form of representation inspired by Autosegmental-Metrical Phonology, and thereby directs attention to principles of time-related structure building that are relevant for both communicative sound systems.

Click here for the preprint

Music rhythm and speech rhythm share acoustic, temporal and syntactic similarities, and neuroscience research has shown that similar areas and networks in the brain are recruited to process both types of signals. Rhythm is a core predictive element for both music and speech, allowing for facilitated processing of upcoming, predicted elements. The combined study of music and speech rhythm processing can be particularly insightful, considering the stronger regularity and predictability of musical rhythm. Although speech rhythm is less regular, it still contains regularities, notably at syllabic and prosodic levels. In this chapter, we will outline different research lines investigating connections between music and speech rhythm processing, including the recently proposed Processing Rhythm in Speech and Music framework, as well as music rhythm interventions and stimulations that aim to improve speech signal processing both in the short-term and long-term. Implications for developmental language disorders and future research perspectives will be outlined.

Click here for the preprint

One of the riddles of human communication is interlocutors’ ability to adapt to “noisy” inputs. It is argued that interpersonal co-ordination of rhythmic structure, which can be selectively activated, underlies this ability. This process is described as a set of mechanisms operating on linguistic and phonetic structures: Interaction Phonology. Interaction Phonology provides the necessary scaffold for enabling an alignment of phonetic-phonological, and potentially also higher-order, linguistic representations. This co-ordination process relies on the rhythmic structure of the individual language or register pertaining to the ongoing communication. That way, interlocutors can attend to relevant phonetic detail that unveils higher-order symbolic information, and adapt their own rhythmic pattern to enhance mutual comprehension. The testable predictions of Interaction Phonology are discussed in the light of recent empirical evidence, and the initial version of Interaction Phonology is modified: perception-production coupling is marked as optional, and the automaticity of the link between rhythmic entrainment and higher-order symbolic alignment is questioned.

Click here for the preprint

Section E: Rhythm across Languages

Research on speech rhythm over the last decades has led to the widespread application of so-called rhythm metrics in order to empirically quantify variation in timing across languages and dialects. Many of these rhythm metrics are duration-based, such as the standard deviations of vocalic and consonantal interval duration (ΔV and ΔC, respectively), the coefficient of variation of vocalic interval duration (VarcoV), and the normalized pairwise variability index for vocalic intervals (nPVI-V). While these and other duration-based rhythm metrics have been widely used in research, and also tested for their reliability, there are also a number of lesser-used acoustic rhythm metrics. These indices rely solely on measures of variability in pitch or loudness, or combine such measures with measures of duration. This chapter will discuss which rhythm metrics are available and will conclude with practical recommendations for their application (an accompanying Praat script is available here).
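
For reference, the duration-based metrics named above can be computed directly from interval durations; a minimal sketch in Python (the chapter’s accompanying script is in Praat):

import numpy as np

def rhythm_metrics(vocalic_s, consonantal_s):
    """Duration-based rhythm metrics from vocalic and consonantal interval durations (s)."""
    v = np.asarray(vocalic_s, dtype=float)
    c = np.asarray(consonantal_s, dtype=float)
    delta_v = v.std()                                  # ΔV
    delta_c = c.std()                                  # ΔC
    varco_v = 100 * v.std() / v.mean()                 # VarcoV: rate-normalised ΔV
    npvi_v = 100 * np.mean(                            # nPVI-V: mean normalised pairwise difference
        np.abs(np.diff(v)) / ((v[:-1] + v[1:]) / 2)
    )
    return {"deltaV": delta_v, "deltaC": delta_c, "VarcoV": varco_v, "nPVI-V": npvi_v}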

Some hierarchical models of speech timing represent prosodic constituents as oscillators that are coupled, thereby influencing each other’s duration. Alternative approaches focus on the systematic distribution of localised speech timing effects, such as phrase-final lengthening and stress-based lengthening. In this review, we explore how oscillator-based speech timing models may be informed by, and possibly reconciled with, approaches that emphasise local timing effects. We consider data from temporally-constrained speech production tasks, such as speech cycling, and explore the nature of the hierarchical coordination of prosodic constituents observed therein. In particular, we examine how variation – between dialects and between languages – in the magnitude of the durational contrast between stressed and unstressed syllables may help to account for observed patterns of temporal coordination. Finally, we explore how speech behaviour in temporally-constrained tasks may be informative about speakers’ coordination of turn-taking in natural dialogues.

Click here for the preprint

We conducted two experiments testing the iambic-trochaic law (ITL) with speakers of English, Greek, and Korean. They heard sequences of tones varying in duration, intensity, or both; stimuli differed in the magnitude of the acoustic differences between alternating tones and involved both short and long inter-stimulus intervals. While the results were not always compatible with ITL predictions and did not show strong grouping preferences, language-related differences did emerge, with Korean participants showing a preference for trochees, and Greek participants being more sensitive to duration differences than the other two groups. Importantly, grouping preferences showed substantial individual variation, evinced by responses to both test sequences and controls (sequences of identical tones). These findings indicate that results from ITL experiments are influenced by linguistic background but are also difficult to replicate, as individual preferences and specific experimental conditions influence how participants impose rhythmic structure on sound sequences.

Click here for the preprint

A long-standing debate continues among scholars concerning the validity of the rhythmic classification of the world’s languages. In order to address the remaining questions, it is key to further explore speech production by bilinguals with two first languages (2L1) and by second language speakers. According to the majority of previous studies, results from bilinguals are intermediate between those of the two kinds of monolinguals, and results from second language speakers are influenced by the rhythms of their first languages, which appears to support the rhythmic classification. However, several questions remain. The first is how to classify languages that exhibit characteristics of multiple rhythmic types. The second is that previous studies generally demonstrate that languages are more or less stress-timed, syllable-timed, or mora-timed, rather than strictly belonging to a single rhythm category. The third is that the proposed rhythmic measures are not comprehensive, and new measures are needed to account for the morphological and syntactic components of languages.

Despite their genetic relatedness, Romance languages and dialects exhibit considerable differences in their phonological systems. In rhythm typology, Spanish was long considered a textbook example of the so-called syllable-timing type, while the classifications for French and Portuguese were often disputed. Rhythmic differences were also found between the more accent-based European varieties of Portuguese and the more syllable-based Brazilian dialects. Our contribution will first endeavour to carry out a phonological assessment of the degree of syllable prominence and accent prominence in European French, Spanish and Portuguese, as well as in varieties of Spanish and Portuguese spoken in the Americas. In a second step, we will conduct a phonetic case study using comparable spoken language data of the varieties under investigation.

Click here for the preprint

Section F: Rhythm in Language Acquisition

Data from auditory neuroscience provide a novel ‘oscillatory hierarchy’ perspective on how the brain encodes speech. Temporal Sampling theory, originally proposed to provide a conceptual framework to explain why acoustic rhythmic impairments in children with developmental dyslexia and developmental language disorder lead to phonological and syntactic impairments, can also explain why sensitivity to linguistic rhythm is a key factor in language acquisition. An overview of the theory is provided, and then data from two longitudinal infant projects applying Temporal Sampling theory to language acquisition are discussed. One project followed infants at family risk (or not at risk) for developmental dyslexia from age 5 months, and one followed typically-developing infants from age 2 months. The infant data suggest that neural oscillatory mechanisms, along with acoustic rhythm sensitivity, play key roles in early language acquisition.

The developmental community is beginning to embrace the idea that exaggerated rhythm in infant- and child-directed speech (I/CDS) provides critical information during early language acquisition. Here, we consider I/CDS as a special case of language, with enhanced multimodal temporal and prosodic cues, attuned to the needs of the listener. The evidence supporting this idea is largely based on language disorders (e.g., dyslexia, DLD), with relatively sparse extant literature on typical language development. However, the field is rapidly growing, with methodological advances in cortical and behavioural rhythmic tracking allowing us to better understand the organising principles of speech and language processing. We address the multiple approaches adopted across research communities, providing a commentary on both the reach and the suitability of these methods. From a nascent literature, the chapter aims to paint a coherent picture of the field’s current state, providing recommendations for future research.

Click here for the preprint

Children are active learners: they selectively attend to important information. Rhythmic neural tracking of speech is central to active language learning. This chapter evaluates recent research showing that neural oscillations in the infant brain synchronise with the rhythm of speech, tracking it at different frequencies. This process predicts word segmentation and later language abilities. We argue that rhythmic neural speech tracking reflects infants’ attention to specific parts of the speech signal (e.g., stressed syllables), and simultaneously acts as a core mechanism for maximising temporal attention onto those parts. Rhythmic neural tracking of speech puts a constraint on neural processing, which maximises the uptake of relevant information from the noisy multimodal environment. We hypothesise this to be influenced by neural maturation. We end by evaluating the implications of this proposal for language acquisition research, and discuss how differences in neural maturation relate to variance in language development in autism.

Click here for the preprint

Infant-directed communication has been proposed to facilitate early language development, not only by providing infants with ample native language input but also by tailoring this input to infants’ individual developmental needs. In particular, extensive research has investigated prosodic and phonetic adaptations in caregivers’ infant-directed speech proposed to support early language acquisition, but more recently, research focus has shifted to the rhythmical properties of this register. This chapter will review this recent evidence, and it will argue that rhythmic optimization is not limited to infants’ early speech input. Instead, it is present across the auditory, visual, and tactile domains of caregiver-infant communication. We will argue that infants enjoy access to optimized intersensory rhythmic input, which scaffolds their ability to segment the continuous speech signal into meaningful linguistic units, even when these units occur with weak regularity in naturally produced adult-directed speech.
The prosody of spoken language is characterized by quasi-rhythmic features, which are perceivable by the fetus from the third trimester of gestation onwards. Recent research on infant cognition is increasingly focusing on oscillations as a reliable measure of brain responses to quasi-rhythmic auditory stimuli, such as speech at different levels of granularity. There is indeed increasing evidence for a match between the frequency of neural oscillations and the rates of different linguistic units, such as phonemes, syllables, and phrases, both in adults and in children. Here we review recent advances in how neural activity aligns with language input at different levels of language structure and organization, and at different developmental stages in the first year of life. Importantly, we discuss how this neural architecture may support the development of grammar.

Mastering rhythm is essential in learning a second language (L2). This study explores whether a shared rhythm class between the first language (L1) and the L2, as for English and German as opposed to French, facilitates the learning of L2 speech rhythm. We analyzed rhythmic patterns in a corpus of accented utterances using a novel rhythm metric based on amplitude envelope modulation frequency. The analysis showed that German-accented English and English-accented German are more likely to be classified as native than their French-accented equivalents. Furthermore, German-accented English was classified as English significantly more frequently than German-accented French was classified as French. Importantly, word-based pronunciation proficiency was found to be higher for German and English speakers in their respective L2s, with German speakers exhibiting greater proficiency in English than in French. These findings indicate that a shared L1 rhythm significantly aids L2 speech learning and that rhythm planning may be influenced by the words and their segmental compositions.

Click here for the preprint

A considerable amount of the linguistic input that young infants receive consists of multi-word utterances in which word boundaries are not marked by pauses. Therefore, a crucial step in language acquisition is learning to parse the continuous speech stream into possible word candidates. Here we argue that the ability to anticipate how the speech signal will unfold plays an important part in speech segmentation throughout the lifespan, and that spoken language that is rhythmic and temporally predictable will have the biggest effect on speech segmentation. We introduce spontaneous pupillary synchrony with auditory stimuli as a novel way of investigating speech perception and segmentation as the speech signal unfolds. We discuss two studies with adults and young infants showing that synchronized changes in pupil size can reveal how temporal and structural rhythmic regularities in spoken language are perceived.

Click here for the preprint

Section G: Rhythm in Speech and Language Disabilities

A substantial portion of the global human population live with some level of hearing loss, and the World Health Organisation estimates this disability may affect up to one in four people by 2050. For people who use speech to communicate, hearing impairment can cause serious disruption to daily life, yet we do not fully understand how speech rhythm perception is impacted by hearing loss. Moreover, in the case of people who use cochlear implants to listen, it is unclear how well aspects of speech rhythm are captured by hearing devices. This chapter surveys an interdisciplinary literature, bringing together insights from perception, speech therapy, and hearing and audiological sciences across the lifespan to construct an emergent picture of speech rhythm processing in the context of hearing disability.

Click here for the preprint

Melodic Intonation Therapy (MIT) is a prominent music-based treatment for people with nonfluent aphasia that has numerous potentially active treatment ingredients. These include a simplified, predictable rhythm, slow rate, and unison production of spoken language. Evidence supports the effectiveness of MIT for improving repetition ability but is more modest regarding improvements in functional communication. This chapter reviews MIT’s treatment ingredients, including how they are used and how they are thought to work. With these numerous ingredients, MIT is flexible and can be customized for a particular individual’s needs, but group-level studies using standardized treatment protocols may not allow for this. A treatment taxonomy specifying treatment targets, ingredients, and mechanisms of action is a promising tool to organize the existing evidence, further investigate MIT, and implement it in clinical practice. This approach will allow for a balance between customization and standardization of the treatment protocol.

Click here for the preprint

One of the remarkable characteristics of spoken language is that it is constantly undergoing change. The plasticity of sound patterns, i.e., their susceptibility to short- and long-term changes, is driven by processes of mutual adaptation during conversational interactions and thereby reflects a constant interplay of perceptual and motor processes of spoken language. Existing models of speech motor control largely neglect the environment-driven phonetic plasticity by focusing on single-person accounts of spoken language production. This chapter addresses the roles of cortical and subcortical structures in the accommodation of speakers-listeners in interactive language use. It reviews investigations of the propensity of patients with different neurologic conditions to align with or adapt to others’ speech, with a particular focus on the role of speech rhythm. 

Click here for the preprint

Stuttering and Parkinson’s disease (PD) manifest in altered motor control, apparent in speech and walking. Both disorders display untimely initiation or termination of motor commands. Stuttering symptoms include blocks, sound and syllable repetitions, and prolongations that can severely interrupt the rhythmic flow of speech. PD is associated with dysfunctional gait and balance, and freezing episodes, hindering the regular rhythm of walking. These rhythmic alterations span motor effectors and extend to rhythm perception. In this chapter we examine the hypothesis that in both populations motor deficits are underpinned by alterations within a general-purpose timing system that sustains rhythmic behavior via temporal predictions. We will focus on similarities between stuttering and PD in terms of impaired rhythm mechanisms and on the associated neuronal circuitries. We will provide new insights into how rhythm in speech relates to non-verbal functions and how this knowledge can inform rhythm-based interventions.

What is the relation between rhythm and stuttering in speech production and perception? Stuttering is a neurodevelopmental speech disorder that affects the timing and rhythmic flow of speech production. It is marked by repetitions, blocks, or prolongations of sounds and syllables that unsettle the rhythm of speech. There is a great deal of behavioral and imaging research on these speech disruptions; however, the mechanism behind stuttering is still unclear. Speech timing is rhythmically structured. Children who stutter do not easily generate an internal rhythm; they have poorer rhythm discrimination ability than typically developing children. In this chapter we investigate how adults who stutter pace their speech. We present evidence of rhythm perception and production dysfunctions, assessing the hypothesis that neurodevelopmental stuttering is associated with a deficit in temporal processing and rhythmic patterning. Speech rhythm has been quantified using rhythmic measures, especially the Pairwise Variability Index (PVI).

Click here for the preprint

This chapter reviews speech rhythm in the context of prosodic entrainment in speakers with autism, and then presents data on speaking rate entrainment obtained from conversations of children and adolescents with and without autism. The study focuses in particular on speaking rate entrainment at the level of the conversational turn and compares patterns of speaking rate entrainment to patterns of entrainment in fundamental frequency. Local entrainment at the conversational turn level is furthermore compared to global conversational entrainment that occurs over the course of the entire conversation. Results show no differences between speakers with and without autism in speaking rate entrainment at the turn level. Furthermore, speaking rate and fundamental frequency entrainment behavior are correlated at the level of the conversational turn for both groups. Lastly, results suggest that turn-level entrainment is not correlated with global entrainment in fundamental frequency, possibly indicating that local and global entrainment serve different conversational functions.