Jan 2022     Issue 18
Research
Saving the Voice of an Oral Cancer Patient with AI-based Speech Technology
Prof. Lee Tan, Department of Electronic Engineering

Artificial intelligence (AI) has become the most influential technology revolutionising modern life in many ways. To date, the power of AI has been demonstrated by notable stories of success, including AlphaGo, DeepFake, GPT-3, and Waymo. Despite this, the real benefits of AI remain relatively inaccessible to the general public, even though they should not be. As researchers continue to work tirelessly on advanced AI systems to address grand challenges, there are many ordinary situations in which AI can be applied to immediately and directly benefit people in need.

Speech and language technologies constitute a key area for AI today. The core technologies, namely automatic speech recognition (ASR), text-to-speech (TTS), and spoken language understanding (SLU), are widely known through their prominent use cases: Apple’s Siri, Microsoft’s Cortana, and Amazon’s Alexa. In staying ahead of the curve, our Digital Signal Processing and Speech Technology Laboratory (DSPLAB) at CUHK Electronic Engineering has been carrying out pioneering research on multilingual speech technology over the past 30 years with particular emphasis on Cantonese. The DSPLAB is currently recognised as the knowledge and resource hub of Cantonese spoken language technology.

One of the research directions being explored at the DSPLAB is expressive TTS. The research itself aims to devise neural models that can automatically synthesise natural speech to carry custom voice characteristics and speaking styles for specific communication situations, e.g., storytelling, counselling, and game playing. In addition to reporting their latest findings and methods internationally via conferences and journals, the DSPLAB team were recently given a meaningful opportunity to apply TTS technology in a way that they had never thought they would.

In June 2021, a DSPLAB PhD student was browsing through LIHKG, a popular online forum in Hong Kong, when a new post caught his eye. The writer of the post was seeking urgent assistance in cloning the voice of an oral-laryngeal cancer patient, Jody, who was set to permanently lose her speech after surgery. In less than 4 hours, the DSPLAB team got in touch with Jody’s family and committed to the mission of “saving Jody’s voice” by building a personalised neural TTS system. Given that there were only 10 days before her operation, the team wasted no time in starting their most pressing task: collecting voice recordings from Jody. Drawing on their research experience as well as the data resources on Cantonese speech accumulated by the DSPLAB in the last 20 years, the team managed to complete the database design and set up the recording system within 24 hours. As a result, they obtained 6 hours of high-quality speech data.

As the contents of the speech data needed to be verified and transcribed into Cantonese phonemic symbols before they could be used to “train” the deep neural network model, the team adopted a robust system architecture by combining the designs of Google’s Tacotron2 and Tencent AI Lab’s DurIAN models. After numerous attempts at finetuning the system and rounds of performance evaluations, they successfully built a personalised TTS system in about 2 months after Jody’s operation – one that can generate fluent Cantonese speech that is nearly indistinguishable from her original voice from arbitrary text content.

In enabling Jody to use her “new voice” with ease for day-to-day communication, the DSPLAB team collaborated with a local technology company to develop a tailor-made mobile app for the iPad or iPhone, featuring a simple and user-friendly graphical interface. The app allows her to either input new text content (in traditional Chinese characters) or select preloaded phrases that are frequently used in her daily life. It can also generate playlists, adjust the speed of her speech, and share text on social media. Today, through the app, Jody can communicate with her family and friends, send voice messages on WhatsApp, make PowerPoint presentations, and read the Bible – all in her own voice. The DSPLAB team have never felt more joy and satisfaction in seeing their research work genuinely making an impact on a patient’s life.


Past Issue      
Contact Us
Subscribe    Email to friend    Unsubscribe
Copyright © 2024.
All Rights Reserved. The Chinese University of Hong Kong.