UniMelb x Orygen

Voicing Empathy: Modulating AI Speech for Emotionally Sensitive Communication

This research project investigates how modulating specific paralinguistic cues in AI-generated speech, such as intonation and pitch, can enhance perceived empathy and trust in therapy. In partnership with Orygen, a youth mental health organisation in Melbourne, it aims to identify key acoustic features that influence emotional tone and to address ethical considerations in AI-human interaction.

*(Full report may currently be private, pending approval for public release)

Problem Statement

The current landscape of AI-generated voices highlights a significant gap in their ability to convey the emotional nuance necessary for effective communication in sensitive contexts, such as therapy and counselling. Despite advancements in AI voice synthesis, many systems still struggle to incorporate paralinguistic cues, such as intonation, pitch, and speech rate, which are crucial for expressing empathy and building trust. This shortcoming can hinder user comfort and engagement during emotionally charged interactions.

To address this gap, this research project aims to answer the question: How does integrating paralinguistic cues into the responses of voice-based AI agents affect users' perceptions of empathy and trust?

By exploring this question, the study seeks to identify specific acoustic and prosodic features that can enhance the emotional quality of AI speech, ultimately contributing to the development of more empathetic and effective AI systems in mental health care.

Prototype Design

The prototype was developed to facilitate participant interaction with AI-generated voices. It operated locally to ensure privacy and featured a user-friendly interface for selecting and adjusting voice samples. Participants could manipulate amplitude and intonation through pre-generated options and fine-tune pitch and speech rate using real-time sliders. This setup allowed for a flexible exploration of vocal characteristics, enabling participants to create detailed empathetic vocal profiles tailored to different conversational contexts.
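As a rough sketch of how real-time pitch and speech-rate sliders could drive playback in a browser-based prototype, the Web Speech API exposes pitch and rate parameters on each utterance. This is an illustrative assumption rather than the prototype's actual code; the element IDs and sample text below are hypothetical.

```typescript
// Minimal sketch: wiring pitch and rate sliders to browser speech synthesis.
// Assumes the Web Speech API (speechSynthesis); element IDs are hypothetical.
const pitchSlider = document.querySelector<HTMLInputElement>("#pitch")!;
const rateSlider = document.querySelector<HTMLInputElement>("#rate")!;
const playButton = document.querySelector<HTMLButtonElement>("#play")!;

function speak(text: string): void {
  const utterance = new SpeechSynthesisUtterance(text);
  // Slider values map onto the API's ranges: pitch 0–2, rate 0.1–10.
  utterance.pitch = Number(pitchSlider.value);
  utterance.rate = Number(rateSlider.value);
  speechSynthesis.cancel();          // stop any sample still playing
  speechSynthesis.speak(utterance);  // play with the current slider settings
}

playButton.addEventListener("click", () =>
  speak("I hear how difficult this has been for you.")
);
```

A setup along these lines keeps all audio generation in the browser, which is consistent with the prototype's goal of operating locally to protect participant privacy.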


*(The demo video may currently be set to private.)

Methodology

This research employed a mixed-methods design to explore perceptions of empathetic voices in AI-generated speech. Twelve participants were recruited to provide both quantitative and qualitative insights, ensuring a diverse sample of individuals proficient in English. Each participant engaged with a custom-built HTML prototype that allowed them to manipulate various paralinguistic features, including pitch, intonation, speech rate, and amplitude, to create what they believed to be the "most empathetic voice".

Specifically, the study utilised several user research methods: participatory design, a validation survey, and semi-structured interviews, with thematic analysis and affinity diagramming used for data analysis.

Participatory Design

Validation Survey

Semi-Structured Interviews

Thematic Analysis

Participant Engagement

The study involved one-on-one sessions lasting approximately 45 to 60 minutes, conducted in a quiet environment. Participants began with a demographic questionnaire, followed by a participatory design activity where they interacted with the prototype. They completed three trials using different texts that represented various emotional contexts: emotional support, neutral explanation, and session closure. This design aimed to assess whether perceptions of empathy varied based on the scenario presented.

Data Collection

Participants rated their voice designs on a 7-point Likert scale to validate their perceptions of empathy. They also ranked the importance of different paralinguistic features in influencing their designs. Following the interactive tasks, semi-structured interviews were conducted to gather in-depth qualitative feedback on participants' decision-making processes and emotional responses.

Data Analysis

A mixed-methods approach was used to analyse the data. Quantitative responses from Likert-scale ratings and feature rankings were summarised using descriptive statistics, offering a broad view of participant preferences. Qualitative data from the interviews were transcribed and analysed thematically to uncover deeper insights.
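As an illustration only, not the project's actual analysis code, the descriptive summary of the Likert ratings could be computed along these lines; the data shape and function names are hypothetical, assuming one rating per participant per scenario.

```typescript
// Hypothetical shape of the quantitative data: one 7-point Likert rating
// per participant per scenario trial.
type Trial = { scenario: string; empathyRating: number };

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Mean empathy rating per scenario across all participants.
function summariseByScenario(trials: Trial[]): Record<string, number> {
  const grouped: Record<string, number[]> = {};
  for (const t of trials) {
    (grouped[t.scenario] ??= []).push(t.empathyRating);
  }
  return Object.fromEntries(
    Object.entries(grouped).map(([scenario, ratings]) => [scenario, mean(ratings)])
  );
}

// Example call with the study's three scenario types:
// summariseByScenario([{ scenario: "emotional support", empathyRating: 6 }, ...])
```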

During the sessions, audio was recorded and later reviewed to produce detailed notes. This resulted in 241 individual notes, which were then processed using affinity diagramming in Miro. Participants' thoughts were anonymised and, where relevant, quoted directly. Notes were grouped into categories based on common themes, with the number of participants supporting each idea recorded. Categories that shared thematic similarities and were supported by a notable number of participants were then synthesised into higher-level insights. This process led to eight key insights, representing the most consistent and meaningful patterns across all interviews.

Findings

This project explored how users perceive empathy in AI-generated voices by adjusting vocal features such as intonation, pitch, amplitude, and speech rate across a range of scenarios. Rather than relying on fixed emotional cues, participants tailored these features to suit different conversational goals, showing that empathy is shaped by the interaction between voice, listener, and context. The findings offer valuable guidance for designing adaptive, emotionally intelligent voice systems that sound more human and feel more authentic.

For a full breakdown of the findings and detailed insights, view the report below.

*(Full report may currently be private, pending approval for public release)