Affective states can be understood as dynamic interpersonal processes developing over time and space. When we observe emotional interactions performed by other individuals, our visual system anticipates how the action will unfold. Thus, it has been proposed that emotion perception is not only a simulative but also a predictive process, a phenomenon described as interpersonal predictive coding. The present study investigated whether the recognition of emotions from dyadic interactions depends on a fixed spatiotemporal coupling of the agents. In an emotion recognition task, we manipulated the actions of two interacting point-light figures by introducing temporal offsets that delayed the onset of one agent's actions (+0 ms, +500 ms, +1000 ms, or +2000 ms). Participants had to determine both the subjective valence and the emotion category (happiness, anger, sadness, affection) of the interaction. Results showed that temporal decoupling had a critical effect on both emotion recognition and the subjective impression of valence intensity: Both measures decreased with increasing temporal offset. However, these effects depended on which emotion was displayed. Whereas affection and anger sequences were impacted by the temporal manipulation, happiness and sadness sequences were not. To further investigate these effects, we conducted exploratory analyses of interpersonal movement parameters. Our findings complement and extend previous evidence by showing that the complex, noncoincidental coordination of actions within dyadic interactions results in a meaningful movement pattern and might serve as a fundamental factor in both detecting and understanding complex actions during human interaction.