Can You Tell the Difference Between AI-Generated Audio and Real Speech?

August 2, 2023

Are you able to discern if you’re listening to an AI-generated voice? Even when individuals are aware that they may be listening to AI-generated speech, it remains challenging for both English and Mandarin speakers to consistently detect a deepfake voice. This poses a potential risk to billions of people who speak the world’s most common languages, as they can be exposed to deepfake scams and misinformation.

In a recent study conducted by Kimberly Mai and her colleagues at University College London, more than 500 participants were challenged to identify speech deepfakes among various audio clips. Some clips featured a real female speaker reading generic sentences in English or Mandarin; the others were deepfakes created by generative AIs trained on female voices.

Participants were randomly assigned to one of two experimental setups. In the first, a group listened to 20 voice samples in their native language and had to decide whether each clip was genuine or fake. Listeners classified the deepfakes and authentic voices correctly about 70% of the time for both English and Mandarin samples. Because these participants knew that some of the clips were fake, real-life detection would likely be even more challenging: most people would not know in advance whether they were listening to AI-generated speech.

In the second setup, another group was given 20 pairs of audio clips, each pair featuring the same sentence spoken once by a human and once by a deepfake voice. Participants had to identify which of the two was the fake. This increased detection accuracy to over 85%, although the researchers acknowledged that this scenario gave listeners an unrealistic advantage.

However, the study did not address whether listeners could tell if a deepfake sounded like the specific person being impersonated. This aspect is crucial in real-life scenarios, as scammers have cloned the voices of business leaders to deceive employees into making fraudulent transfers, and misinformation campaigns have circulated deepfakes of well-known politicians on social media platforms.

Hany Farid at the University of California, Berkeley, commented that this research helps evaluate how well AI-generated deepfakes now mimic the natural sound of human voices without retaining the subtle speech differences that can feel unsettling to listeners. Farid considers the study a valuable foundation for developing automated deepfake detection systems.

Attempts to train participants to improve their deepfake detection generally proved unsuccessful, which underscores the importance of developing AI-powered deepfake detectors. Mai and her colleagues are now exploring whether large language models capable of processing speech data can tackle this task.