
How to Remove a Device’s Voice from a Mixed Audio Signal

February 21, 2022



Summary

  • Method to create a dataset of mixed-voice audio comprising the device's and the user's voices at chosen Signal-to-Noise Ratio (SNR) levels
  • Algorithm for separating the user's voice from the mixed-voice audio using the known device's voice
  • Demonstration of the Word Error Rate (WER) of the original user's voice and of the user's voice recovered by the algorithm
  • A video explanation of these concepts can be found at the end of the article

Introduction

This article is the result of a collaboration between Omdena and Consenz to improve their driver-assistant device (see project), with the goal of increasing road safety and reducing vehicle accidents. One task in this project was to improve voice communication between the driver and the device. The collaboration produced a computationally efficient algorithm for processing the driver's and the device's voices. This algorithm may be applicable to many other assistant devices that receive verbal instructions.

The problem

Many Google Assistant devices accept verbal commands from users and communicate with users by speech. A problematic situation occurs when the user is speaking to the device while the device is speaking to the user. In this situation, the device is converting to text an audio signal that mixes the user's voice with its own voice.

To deal with that scenario, the device can stop speaking and listen to the user, or it can turn off its microphone while speaking. Neither option is satisfactory.

The solution  

Subtract the device’s audio signal from the mixed audio signal so that the device can process the user’s instructions.

Steps discussed in the article:

  1. Make preparations for creating an audio database using the Microsoft Scalable Noisy Speech Dataset (MS-SNSD)
  2. Create a database mixing a clean speech audio file with a noisy (mixed) speech file at the desired length and SNR levels
  3. Use the algorithm to separate the unknown input audio from the mixed audio signal
  4. Convert speech into text using a speech recognizer API
  5. Evaluate the algorithm using Word Error Rate (WER) scores

Step 1: Make preparations for creating a database of mixed audio using the MS-SNSD program

The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) is a noisy speech dataset that can scale to arbitrary sizes depending on the number of speakers, noise types, and Signal-to-Noise Ratio (SNR) levels desired.

The following modifications were made to the MS-SNSD file noisyspeech_synthesizer.py:

# Modified file from the original version of noisyspeech_synthesizer.py
# Source: https://github.com/microsoft/MS-SNSD/blob/master/noisyspeech_synthesizer.py
# @author: chkarada

import glob
import numpy as np
import soundfile as sf
import os
import argparse
import configparser as CP
from audiolib import audioread, audiowrite, snr_mixer

def main(cfg):

    snr_lower = float(cfg["snr_lower"])
    snr_upper = float(cfg["snr_upper"])
    total_snrlevels = float(cfg["total_snrlevels"])

    clean_dir = os.path.join(os.path.dirname(__file__), 'clean_train')
    if cfg["speech_dir"] != 'None':
        clean_dir = cfg["speech_dir"]
    if not os.path.exists(clean_dir):
        assert False, ("Clean speech data is required")

    noise_dir = os.path.join(os.path.dirname(__file__), 'noise_train')
    if cfg["noise_dir"] != 'None':
        noise_dir = cfg["noise_dir"]
    if not os.path.exists(noise_dir):
        assert False, ("Noise data is required")

    fs = float(cfg["sampling_rate"])
    fs = 16000  # change
    audioformat = cfg["audioformat"]
    total_hours = float(cfg["total_hours"])
    audio_length = float(cfg["audio_length"])
    silence_length = float(cfg["silence_length"])
    noisyspeech_dir = os.path.join(os.path.dirname(__file__), 'NoisySpeech_training')
    if not os.path.exists(noisyspeech_dir):
        os.makedirs(noisyspeech_dir)
    clean_proc_dir = os.path.join(os.path.dirname(__file__), 'CleanSpeech_training')
    if not os.path.exists(clean_proc_dir):
        os.makedirs(clean_proc_dir)
    noise_proc_dir = os.path.join(os.path.dirname(__file__), 'Noise_training')
    if not os.path.exists(noise_proc_dir):
        os.makedirs(noise_proc_dir)

    total_secs = total_hours * 60 * 60
    total_samples = int(total_secs * fs)
    audio_length = int(audio_length * fs)

    # Modification
    # Original: SNR = np.linspace(snr_lower, snr_upper, total_snrlevels)
    # Change: replace 'total_snrlevels' with 'int(total_snrlevels)'
    # Reason: to avoid an error, since np.linspace expects an integer number of samples
    SNR = np.linspace(snr_lower, snr_upper, int(total_snrlevels))  # changed with int

    # Modification
    # New line after line 47 of the original file: 'SNR = np.round(SNR, 5)'
    # Reason: to keep the SNR levels at consistent intervals
    SNR = np.round(SNR, 5)  # added

    cleanfilenames = glob.glob(os.path.join(clean_dir, audioformat))
    if cfg["noise_types_excluded"] == 'None':
        noisefilenames = glob.glob(os.path.join(noise_dir, audioformat))
    else:
        filestoexclude = cfg["noise_types_excluded"].split(',')
        noisefilenames = glob.glob(os.path.join(noise_dir, audioformat))
        for i in range(len(filestoexclude)):
            noisefilenames = [fn for fn in noisefilenames
                              if not os.path.basename(fn).startswith(filestoexclude[i])]

    filecounter = 0
    num_samples = 0

    # Modification
    # Original (line 85 of the original file): while num_samples < total_samples:
    # Change: seed the random number generator on every pass through the loop
    # Reason: the program randomly picks from the files provided to build the database,
    # so each run would otherwise create a different set of clean files. The clean files
    # are used to create the transcript for Word Error Rate (WER) scoring; seeding makes
    # the runs repeatable, so the transcript file never has to change.
    ran_num = 0  # add
    while num_samples < total_samples:
        np.random.seed(ran_num)  # add
        ran_num = ran_num + 1  # add
        idx_s = np.random.randint(0, np.size(cleanfilenames))
        clean, fs = audioread(cleanfilenames[idx_s])

        if len(clean) > audio_length:
            clean = clean
        else:
            while len(clean) <= audio_length:
                idx_s = idx_s + 1
                if idx_s >= np.size(cleanfilenames) - 1:
                    idx_s = np.random.randint(0, np.size(cleanfilenames))
                newclean, fs = audioread(cleanfilenames[idx_s])
                cleanconcat = np.append(clean, np.zeros(int(fs * silence_length)))
                clean = np.append(cleanconcat, newclean)

        idx_n = np.random.randint(0, np.size(noisefilenames))
        noise, fs = audioread(noisefilenames[idx_n])

        if len(noise) >= len(clean):
            noise = noise[0:len(clean)]
        else:
            while len(noise) <= len(clean):
                idx_n = idx_n + 1
                if idx_n >= np.size(noisefilenames) - 1:
                    idx_n = np.random.randint(0, np.size(noisefilenames))
                newnoise, fs = audioread(noisefilenames[idx_n])
                noiseconcat = np.append(noise, np.zeros(int(fs * silence_length)))
                noise = np.append(noiseconcat, newnoise)
            noise = noise[0:len(clean)]

        filecounter = filecounter + 1

        for i in range(np.size(SNR)):
            text1 = str(cleanfilenames[idx_s])
            clean_snr, noise_snr, noisy_snr = snr_mixer(clean=clean, noise=noise, snr=SNR[i])
            noisyfilename = 'noisy' + str(filecounter) + '_SNRdb_' + str(SNR[i]) + '_clnsp' + str(filecounter) + '.wav'
            cleanfilename = 'clnsp' + str(filecounter) + '.wav'
            noisefilename = 'noisy' + str(filecounter) + '_SNRdb_' + str(SNR[i]) + '.wav'
            noisypath = os.path.join(noisyspeech_dir, noisyfilename)
            cleanpath = os.path.join(clean_proc_dir, cleanfilename)
            noisepath = os.path.join(noise_proc_dir, noisefilename)
            audiowrite(noisy_snr, fs, noisypath, norm=False)
            audiowrite(clean_snr, fs, cleanpath, norm=False)
            audiowrite(noise_snr, fs, noisepath, norm=False)
            num_samples = num_samples + len(noisy_snr)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Configurations: read noisyspeech_synthesizer.cfg
    parser.add_argument("--cfg", default="noisyspeech_synthesizer.cfg",
                        help="Read noisyspeech_synthesizer.cfg for all the details")
    parser.add_argument("--cfg_str", type=str, default="noisy_speech")
    args = parser.parse_args()
    cfgpath = os.path.join(os.path.dirname(__file__), args.cfg)
    assert os.path.exists(cfgpath), f"No configuration file as [{cfgpath}]"
    cfg = CP.ConfigParser()
    cfg._interpolation = CP.ExtendedInterpolation()
    cfg.read(cfgpath)

    main(cfg._sections[args.cfg_str])

 

Summary of the changes to the original noisyspeech_synthesizer.py (line numbers refer to the original file):

  • Line 47: replace total_snrlevels with int(total_snrlevels) to avoid an error, since np.linspace expects an integer number of samples.
  • New line after line 47: SNR = np.round(SNR, 5), to keep the SNR levels at consistent intervals.
  • Line 85: seed NumPy's random number generator on every pass of the main loop:

ran_num = 0  # add
while num_samples < total_samples:
    np.random.seed(ran_num)  # add
    ran_num = ran_num + 1  # add

The program randomly picks from the files provided to create the database, so each run would otherwise produce a different set of clean files. Because the clean files are used to create the transcript for calculating WER scores, seeding the generator makes runs repeatable and the transcript file never needs to change. The snippet below shows the effect of the two SNR changes with the configuration used in this article.
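For reference, the following snippet (an illustration added here, not part of the MS-SNSD code) shows what the two SNR changes produce with the configuration values used in this article (snr_lower 0.1, snr_upper 10.1, total_snrlevels 10):

import numpy as np

# SNR levels for snr_lower = 0.1, snr_upper = 10.1, total_snrlevels = 10
SNR = np.linspace(0.1, 10.1, int(10))
print(SNR[:3])        # e.g. 0.1, 1.21111111, 2.32222222 (long decimals would end up in the file names)
SNR = np.round(SNR, 5)
print(SNR[:3])        # e.g. 0.1, 1.21111, 2.32222 (rounded, consistent values)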

Changes to the noisyspeech_synthesizer.cfg                                 

# Modified file noisyspeech_synthesizer.cfg
# Source: https://github.com/microsoft/MS-SNSD/blob/master/noisyspeech_synthesizer.cfg
# Configuration for generating Noisy Speech Dataset

# - sampling_rate: Specify the sampling rate. Default is 16 kHz
# - audioformat: default is .wav
# - audio_length: Minimum Length of each audio clip (noisy and clean speech) in seconds that will be generated by augmenting utterances. 
# - silence_length: Duration of silence introduced between clean speech utterances.
# - total_hours: Total number of hours of data required. Units are in hours. 
# - snr_lower: Lower bound for SNR required (default: 0 dB)
# - snr_upper: Upper bound for SNR required (default: 40 dB)
# - total_snrlevels: Number of SNR levels required (default: 5, which means there are 5 levels between snr_lower and snr_upper)
# - noise_dir: Default is None. But specify the noise directory path if noise files are not in the source directory
# - Speech_dir: Default is None. But specify the speech directory path if speech files are not in the source directory
# - noise_types_excluded: Noise files starting with the following tags to be excluded in the noise list. Example: noise_types_excluded: Babble, AirConditioner
# Specify 'None' if no noise files to be excluded.

[noisy_speech]

sampling_rate: 16000
audioformat: *.wav
audio_length: 10
silence_length: 0.2
total_hours: 3.0
snr_lower: 0.1
snr_upper: 10.1
total_snrlevels: 10

# Modification
# Original: 
# noise_dir: None
# Speech_dir: None
# Change:
# noise_dir: your directory for noisy speech wav files (sample rate = 16000 samples/s)
# speech_dir: your directory for clean speech wav files (sample rate = 16000 samples/s)
noise_dir: your directory for noisy speech wav files (sample rate = 16000 samples/s)
speech_dir: your directory for clean speech wav files (sample rate = 16000 samples/s)
noise_types_excluded: None

Step 2: Create a database mixing a clean speech audio file with a noisy speech file at the desired length and SNR levels

  • Enter the parameters for the settings. The parameters are defined in the file noisyspeech_synthesizer.cfg. The parameters used in this example were: sampling_rate: 16000, audioformat: *.wav, audio_length: 10, silence_length: 0.2, total_hours: 3.0, snr_lower: 0.1, snr_upper: 10.1, total_snrlevels: 10
  • Run the file noisyspeech_synthesizer.py. From a Google Colab notebook, the code used was:
import os
from pathlib import Path
from google.colab import drive
drive.mount('/content/drive/')
!python /content/drive/MyDrive/YourPath/noisyspeech_synthesizer.py
  • This should create three new directories containing wav files.
  • The new directories are: Noise_training, containing noise speech files; CleanSpeech_training, containing clean speech files; and NoisySpeech_training, containing the noise and clean files mixed at the set signal-to-noise levels. A quick check of the output is sketched below.
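A minimal sketch to confirm the run's output (the base path below is a placeholder; adjust it to wherever MS-SNSD wrote its directories):

import glob
import os

base = '/content/drive/MyDrive/YourPath'  # placeholder path
for d in ['CleanSpeech_training', 'Noise_training', 'NoisySpeech_training']:
    files = sorted(glob.glob(os.path.join(base, d, '*.wav')))
    example = os.path.basename(files[0]) if files else 'none'
    print(d, '-', len(files), 'wav files, example:', example)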

Step 3: Use the algorithm to separate the unknown input audio from the mixed audio signal

The audio files containing the mix of speech are located in the NoisySpeech_training directory; file name example: noisy28_SNRdb_0.1_clnsp28.wav

Recall that the noisy speech is the speech from the device; it is located in the Noise_training directory; file name example: noisy28_SNRdb_0.1.wav

Note that the "28" in the examples links the corresponding files.

You can now use this dataset to test your own algorithm for separating the known audio from the device (e.g. noisy28_SNRdb_0.1.wav) from the mixed file (noisy28_SNRdb_0.1_clnsp28.wav).

The pseudo code used for recovering the user’s voice

Algorithm (using the above example files)

# Define lists
mlt = list of floats from 0.001 to 2.000   # e.g. (0.001, 0.002, ..., 2.000)
corr_abs = empty list                      # stores absolute values of the correlations
m_val = empty list                         # stores m values, where m is a float

# y1 = Device's voice, y2 = Mixed voices
# normalize scales a file to between -1 and 1, i.e. both files share the same scale
y1 = normalize(noisy28_SNRdb_0.1.wav)
y2 = normalize(noisy28_SNRdb_0.1_clnsp28.wav)

for m in mlt do:
    X = normalize(y2 - m * y1)
    y = y1
    corr = correlation between X and y
    append absolute value of corr to corr_abs

ind = index of min value in corr_abs
m1 = value of mlt at index ind
user_voice = normalize(y2 - m1 * y1)
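One way to realize this pseudocode in Python is sketched below. It is a sketch of the technique rather than the project's exact implementation: it assumes both files are mono wav recordings at the same sample rate, uses soundfile and NumPy, and the helper names (normalize, separate_user_voice) are illustrative.

import numpy as np
import soundfile as sf

def normalize(x):
    # Scale the signal into the range [-1, 1] so both signals share the same scale
    return x / np.max(np.abs(x))

def separate_user_voice(device_wav, mixed_wav):
    y1, fs = sf.read(device_wav)          # device's voice, e.g. noisy28_SNRdb_0.1.wav
    y2, _ = sf.read(mixed_wav)            # mixed voices, e.g. noisy28_SNRdb_0.1_clnsp28.wav
    n = min(len(y1), len(y2))             # align lengths in case they differ by a few samples
    y1, y2 = normalize(y1[:n]), normalize(y2[:n])

    mlt = np.arange(0.001, 2.001, 0.001)  # candidate multipliers m
    corr_abs = []
    for m in mlt:
        X = normalize(y2 - m * y1)                 # candidate user's voice
        corr = np.corrcoef(X, y1)[0, 1]            # correlation with the device's voice
        corr_abs.append(abs(corr))

    m_min = mlt[int(np.argmin(corr_abs))]          # m1 in the pseudocode above
    user_voice = normalize(y2 - m_min * y1)
    return user_voice, fs, m_min

user_voice, fs, m_min = separate_user_voice("noisy28_SNRdb_0.1.wav",
                                            "noisy28_SNRdb_0.1_clnsp28.wav")
sf.write("user_voice_recovered.wav", user_voice, fs)
print("m_min =", m_min)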

Explanation of algorithm:

An analogy

This algorithm is based upon two signals, say z1 = sin(x) and z2 = sin(x - ɣ), where ɣ is a phase difference. Suppose we want to find the values of ɣ at which z1 and z2 are not correlated.

In this case, ɣ would be varied and the correlation calculated for a range of values of ɣ between 0 and 2π.

When ɣ = 0, π or 2π, the |correlation| between z1 and z2 would be high, since z2 would still be a (possibly sign-flipped) multiple of sin(x). When ɣ = π/2, the |correlation| between z1 and z2 would be low because z2 = sin(x - π/2) = -cos(x), which is uncorrelated with sin(x).

This technique can be used in a lab setting to find the phase difference between two sinusoidal voltage signals. Using an oscilloscope, one signal is displayed on the x axis and the second signal is displayed on the y axis. Next the phase of the first signal is adjusted until a straight line appears on the oscilloscope. That is when the two sinusoidal voltage signals are highly correlated.

Figure 1 shows an example with sin(t) and cos(t).

Fig 1 Plot of out of phase waveforms: sin(t) and cos(t)


To visualize the correlation between sin(t) and cos(t), the two functions can be plotted against each other on the x and y axes, as shown in Figure 2 below.

Fig 2 Plot of Out of Phase Functions sin(t) versus cos(t)


The circular shape of the scatter plot in Figure 2 indicates that the out-of-phase functions sin(t) and cos(t) have no correlation. This analogy can be extended to other waveforms.
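A quick numerical check of this analogy (an illustration added here, not taken from the original project code):

import numpy as np

t = np.linspace(0, 2 * np.pi, 1000)
# In-phase (or anti-phase) signals have a high absolute correlation ...
print(abs(np.corrcoef(np.sin(t), np.sin(t - np.pi))[0, 1]))   # close to 1
# ... while signals out of phase by pi/2, such as sin(t) and cos(t), do not
print(abs(np.corrcoef(np.sin(t), np.cos(t))[0, 1]))           # close to 0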

Explanation

Let M(t) designate the Mixed voice, U(t) designate the User's voice, and D(t) designate the Device's voice.

M(t) can be mathematically modeled as 

M(t) = U(t) + c * D(t)

where c is a constant.

For a range of m values between 0 and 2, the algorithm calculates the correlation between M(t) - m * D(t) and D(t).

At the value of m where m ≈ c, M(t) - m * D(t) ≈ U(t). Here the |correlation| between M(t) - m * D(t) and D(t) is at a minimum, because the residual signal is no longer a function of the Device's voice; the correlation between the User's voice and the Device's voice is minimal.

In contrast, the |correlation| is at a peak when m ≈ 0, because in that case M(t) - m * D(t) still contains both the Device's voice and the User's voice. This is shown in Figure 3.

Fig 3 Determination of multiplication factor m_min

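As a sanity check of this behaviour, one can mix two synthetic, uncorrelated signals with a known constant c and confirm that the multiplier with the smallest |correlation| lands near c (an illustrative sketch, not part of the original project code):

import numpy as np

t = np.arange(16000) / 16000.0              # one second at 16 kHz
U = np.sin(2 * np.pi * 220 * t)             # stand-in for the User's voice U(t)
D = np.sin(2 * np.pi * 330 * t + 0.7)       # stand-in for the Device's voice D(t)
c = 0.6
M = U + c * D                               # M(t) = U(t) + c * D(t)

mlt = np.arange(0.001, 2.001, 0.001)
corr_abs = [abs(np.corrcoef(M - m * D, D)[0, 1]) for m in mlt]
print(mlt[int(np.argmin(corr_abs))])        # prints a value close to c = 0.6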

Using M(t) - m_min * D(t), the waveform of the User's voice can be derived from M(t), m_min, and D(t).

Fig 4 Determination of User’s Voice


Audio sample: Mixed Voice

Audio sample: User's Voice

The subtraction of the Device's voice from the Mixed voice can be seen in Figure 4 and heard in the Mixed and User wav files.

Step 4: Convert speech into text using a speech recognizer API

The accuracy of the speech-to-text output improves if the DC component is removed and the waveform's amplitude is sufficient.

Algorithm for pre-processing audio files:

Let S(t) be a discrete audio signal of a person speaking

S_new = S - mean(S), scaled so that its amplitude is sufficient
Save S_new in wav format as S_new.wav
text = Speech Recognition API(S_new.wav)
print(text)
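A minimal sketch of this pre-processing and transcription step, assuming the SpeechRecognition package with its Google Web Speech backend (the article does not state which recognizer API was used), soundfile for file I/O, and user_voice_recovered.wav as a placeholder input file:

import numpy as np
import soundfile as sf
import speech_recognition as sr

def preprocess(in_wav, out_wav):
    s, fs = sf.read(in_wav)
    s_new = s - np.mean(s)                    # remove the DC component
    s_new = s_new / np.max(np.abs(s_new))     # scale so the amplitude is sufficient
    sf.write(out_wav, s_new, fs)
    return out_wav

recognizer = sr.Recognizer()
with sr.AudioFile(preprocess("user_voice_recovered.wav", "S_new.wav")) as source:
    audio = recognizer.record(source)
text = recognizer.recognize_google(audio)     # any speech-to-text API could be used here
print(text)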

Speech to Text Output 

Device Text:

“yes ma’am it sure was I always say you’ve got a wonderful husband Miss Margaret may I get some more cookies Nora well wait until you finish what you’ve got Davey”

Mixed Text: 

“in manager to the heroine of iOS Elliott got a wonderful husband Ben some Mysterio”

Hypothesized User Text: 

“he was a head shorter than His companion of almost delicate physique since then some mysterious Force has been fighting us at every step I had faith in them there was a change now”

True User Text: 

“he was a head shorter than his companion of most delicate physique since them some mysterious Force has been fighting us outside every step I had faith in them there was a change now”

Tip: Since the clean speech wav files are located in the CleanSpeech_training directory, you can use the speech recognizer API to convert those files to True Text and then verify the conversion.

Step 5: Evaluate the algorithm using Word Error Rate (WER) scores

  • WER scores range from 0 to 1, where 0 is a perfect score
  • The Python package jiwer (version 2.3.0) was used to calculate the WER scores (see the snippet below)
  • The WER score between the text recovered from the User's voice and the True User Text was 0.0882
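The WER computation itself is a one-liner with jiwer; the strings below are shortened stand-ins for the full transcripts shown in Step 4:

from jiwer import wer

true_text = "he was a head shorter than his companion of most delicate physique"
hypothesis = "he was a head shorter than his companion of almost delicate physique"

print(wer(true_text, hypothesis))   # 0 is a perfect score; higher means more word errors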

Summary

This article has outlined a solution to the situation in which a device's microphone receives audio input from both its own voice and a user's voice. An algorithm was provided for recovering the user's voice from the mixed-voice input, and it was demonstrated on an example with a signal-to-noise ratio of 0.1 dB. In addition, a method was demonstrated for creating a database of mixed voices with a user-defined signal-to-noise ratio and length. In summary, the algorithm used signal coherence, phase difference, and correlation to determine the user's voice.

Here is a video which covers the key concepts presented in this article.

Sources

  • MS-SNSD Database

https://github.com/microsoft/MS-SNSD

@article{reddy2019scalable,
  title={A Scalable Noisy Speech Dataset and Online Subjective Test Framework},
  author={Reddy, Chandan KA and Beyrami, Ebrahim and Pool, Jamie and Cutler, Ross and Srinivasan, Sriram and Gehrke, Johannes},
  journal={Proc. Interspeech 2019},
  pages={1816--1820},
  year={2019}
}

  • http://festvox.org/index.html: "The documentation, tools and dependent software are all free without restriction (commercial or otherwise)."

  • The Arctic database (used for this article) has its acknowledgement in http://festvox.org/cmu_arctic/cmu_arctic_report.pdf

Source: "CMU ARCTIC databases for speech synthesis" by John Kominek and Alan W Black

Licensing 

‘The Arctic databases are distributed as “free software” under the following terms. Carnegie Mellon University Copyright (c) 2003 All Rights Reserved. Permission to use, copy, modify, and license this software and its documentation for any purpose, is hereby granted without fee, subject to the following conditions:

  1. The code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Any modifications must be clearly marked as such.
  3. Original authors’ names are not deleted.

This article is written by George Noble.

