How to Remove a Device’s Voice from a Mixed Audio Signal
February 21, 2022
Summary
- Method to create a dataset of mixed-voice audio, comprising the device's and the user's voices combined at specified Signal-to-Noise Ratio (SNR) levels
- Algorithm for separating the user's voice from the mixed audio using the known device's voice
- Demonstration of the Word Error Rate (WER) between the original user's voice and the user's voice recovered by the algorithm
- You can find a video explanation of the following concepts at the end of the article
Introduction
This article is the result of a collaboration between Omdena and Consenz to improve Consenz's driver-assistant device (see project), with the goal of increasing road safety and reducing vehicle accidents. One task in this project was to improve voice communication between the driver and the device. The collaboration developed a computationally efficient algorithm for processing the driver's and the device's voices. The algorithm may be applicable to many other assistant devices that receive verbal instructions.
The problem
Many Google Assistant devices accept verbal commands from users and communicate with them by speech. A problematic situation occurs when the user is speaking to the device at the same time the device is speaking to the user. In this situation, the device converts an audio signal that mixes the user's voice with its own voice into text.
To deal with that scenario, the device can stop speaking and listen to the user, or it can turn off its microphone while speaking. Neither option is satisfactory.
The solution
Subtract the device’s audio signal from the mixed audio signal so that the device can process the user’s instructions.
Steps discussed in the article:
- Make preparations for creating an audio database using The Microsoft Scalable Noisy Speech Dataset (MS-SNSD)
- Create a database that mixes clean speech audio files with noisy speech at the desired length and SNR levels
- Use an algorithm to separate the unknown user audio from the mixed audio signal
- Convert speech into text using a speech recognition API
- Evaluate the algorithm using Word Error Rate (WER) scores
Step 1: Make preparations for creating a database of mixed audio using the MS-SNSD program
The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) is a noisy speech dataset that can scale to arbitrary sizes depending on the number of speakers, noise types, and Signal-to-Noise Ratio (SNR) levels desired.
The following modifications were made to MS-SNSD:
# Modified file from the original version of noisyspeech_synthesizer.py
# Source: https://github.com/microsoft/MS-SNSD/blob/master/noisyspeech_synthesizer.py
# @author: chkarada
import glob
import numpy as np
import soundfile as sf
import os
import argparse
import configparser as CP
from audiolib import audioread, audiowrite, snr_mixer

def main(cfg):
    snr_lower = float(cfg["snr_lower"])
    snr_upper = float(cfg["snr_upper"])
    total_snrlevels = float(cfg["total_snrlevels"])

    clean_dir = os.path.join(os.path.dirname(__file__), 'clean_train')
    if cfg["speech_dir"] != 'None':
        clean_dir = cfg["speech_dir"]
    if not os.path.exists(clean_dir):
        assert False, ("Clean speech data is required")

    noise_dir = os.path.join(os.path.dirname(__file__), 'noise_train')
    if cfg["noise_dir"] != 'None':
        noise_dir = cfg["noise_dir"]
    if not os.path.exists(noise_dir):
        assert False, ("Noise data is required")

    fs = float(cfg["sampling_rate"])
    fs = 16000  # change
    audioformat = cfg["audioformat"]
    total_hours = float(cfg["total_hours"])
    audio_length = float(cfg["audio_length"])
    silence_length = float(cfg["silence_length"])

    noisyspeech_dir = os.path.join(os.path.dirname(__file__), 'NoisySpeech_training')
    if not os.path.exists(noisyspeech_dir):
        os.makedirs(noisyspeech_dir)
    clean_proc_dir = os.path.join(os.path.dirname(__file__), 'CleanSpeech_training')
    if not os.path.exists(clean_proc_dir):
        os.makedirs(clean_proc_dir)
    noise_proc_dir = os.path.join(os.path.dirname(__file__), 'Noise_training')
    if not os.path.exists(noise_proc_dir):
        os.makedirs(noise_proc_dir)

    total_secs = total_hours * 60 * 60
    total_samples = int(total_secs * fs)
    audio_length = int(audio_length * fs)

    # Modification
    # Original: SNR = np.linspace(snr_lower, snr_upper, total_snrlevels)
    # Change: replace 'total_snrlevels' with 'int(total_snrlevels)'
    # Reason: to avoid an error, as np.linspace expects an integer number of levels
    SNR = np.linspace(snr_lower, snr_upper, int(total_snrlevels))  # change with int

    # Modification
    # New line after line 47: 'SNR = np.round(SNR, 5)'
    # Reason: to set the SNR levels at consistent intervals
    SNR = np.round(SNR, 5)  # added

    cleanfilenames = glob.glob(os.path.join(clean_dir, audioformat))
    if cfg["noise_types_excluded"] == 'None':
        noisefilenames = glob.glob(os.path.join(noise_dir, audioformat))
    else:
        filestoexclude = cfg["noise_types_excluded"].split(',')
        noisefilenames = glob.glob(os.path.join(noise_dir, audioformat))
        for i in range(len(filestoexclude)):
            noisefilenames = [fn for fn in noisefilenames
                              if not os.path.basename(fn).startswith(filestoexclude[i])]

    filecounter = 0
    num_samples = 0

    # Modification
    # Original: while num_samples < total_samples:  (line 85)
    # Change: add 'ran_num = 0' before the loop, then 'np.random.seed(ran_num)' and
    # 'ran_num = ran_num + 1' at the start of each pass.
    # Reason: the program randomly selects files to create the database, so each run
    # produces a new set of clean files. The clean files are used to create a transcript
    # for calculating Word Error Rate (WER) scores. By seeding the random generator,
    # the program can be run multiple times with no changes to the transcript file.
    ran_num = 0  # add
    while num_samples < total_samples:
        np.random.seed(ran_num)  # add
        ran_num = ran_num + 1  # add
        idx_s = np.random.randint(0, np.size(cleanfilenames))
        clean, fs = audioread(cleanfilenames[idx_s])

        if len(clean) > audio_length:
            clean = clean
        else:
            while len(clean) <= audio_length:
                idx_s = idx_s + 1
                if idx_s >= np.size(cleanfilenames) - 1:
                    idx_s = np.random.randint(0, np.size(cleanfilenames))
                newclean, fs = audioread(cleanfilenames[idx_s])
                cleanconcat = np.append(clean, np.zeros(int(fs * silence_length)))
                clean = np.append(cleanconcat, newclean)

        idx_n = np.random.randint(0, np.size(noisefilenames))
        noise, fs = audioread(noisefilenames[idx_n])

        if len(noise) >= len(clean):
            noise = noise[0:len(clean)]
        else:
            while len(noise) <= len(clean):
                idx_n = idx_n + 1
                if idx_n >= np.size(noisefilenames) - 1:
                    idx_n = np.random.randint(0, np.size(noisefilenames))
                newnoise, fs = audioread(noisefilenames[idx_n])
                noiseconcat = np.append(noise, np.zeros(int(fs * silence_length)))
                noise = np.append(noiseconcat, newnoise)
            noise = noise[0:len(clean)]

        filecounter = filecounter + 1

        for i in range(np.size(SNR)):
            text1 = str(cleanfilenames[idx_s])
            clean_snr, noise_snr, noisy_snr = snr_mixer(clean=clean, noise=noise, snr=SNR[i])
            noisyfilename = 'noisy' + str(filecounter) + '_SNRdb_' + str(SNR[i]) + '_clnsp' + str(filecounter) + '.wav'
            cleanfilename = 'clnsp' + str(filecounter) + '.wav'
            noisefilename = 'noisy' + str(filecounter) + '_SNRdb_' + str(SNR[i]) + '.wav'
            noisypath = os.path.join(noisyspeech_dir, noisyfilename)
            cleanpath = os.path.join(clean_proc_dir, cleanfilename)
            noisepath = os.path.join(noise_proc_dir, noisefilename)
            audiowrite(noisy_snr, fs, noisypath, norm=False)
            audiowrite(clean_snr, fs, cleanpath, norm=False)
            audiowrite(noise_snr, fs, noisepath, norm=False)
            num_samples = num_samples + len(noisy_snr)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Configurations: read noisyspeech_synthesizer.cfg
    parser.add_argument("--cfg", default="noisyspeech_synthesizer.cfg",
                        help="Read noisyspeech_synthesizer.cfg for all the details")
    parser.add_argument("--cfg_str", type=str, default="noisy_speech")
    args = parser.parse_args()

    cfgpath = os.path.join(os.path.dirname(__file__), args.cfg)
    assert os.path.exists(cfgpath), f"No configuration file as [{cfgpath}]"
    cfg = CP.ConfigParser()
    cfg._interpolation = CP.ExtendedInterpolation()
    cfg.read(cfgpath)

    main(cfg._sections[args.cfg_str])
The modifications, referenced to the line numbers of the original noisyspeech_synthesizer.py, are:
- Line 47: replace total_snrlevels with int(total_snrlevels), to avoid an error since np.linspace expects an integer number of levels.
- New line after line 47: SNR = np.round(SNR, 5), to set the SNR levels at consistent intervals.
- Line 85: add ran_num = 0 before the loop, then np.random.seed(ran_num) and ran_num = ran_num + 1 at the start of each pass through while num_samples < total_samples:
The program randomly selects from the files provided to create the database, so each run produces a new set of clean files. However, the clean files are used to create the transcript for calculating WER scores. By seeding the random generator, the program can be run multiple times with no changes to the transcript file.
Changes to the noisyspeech_synthesizer.cfg
# Modified file noisyspeech_synthesizer.cfg
# Source: https://github.com/microsoft/MS-SNSD/blob/master/noisyspeech_synthesizer.cfg
# Configuration for generating the Noisy Speech Dataset
# - sampling_rate: Specify the sampling rate. Default is 16 kHz
# - audioformat: default is .wav
# - audio_length: Minimum length of each audio clip (noisy and clean speech) in seconds that will be generated by augmenting utterances.
# - silence_length: Duration of silence introduced between clean speech utterances.
# - total_hours: Total number of hours of data required. Units are in hours.
# - snr_lower: Lower bound for SNR required (default: 0 dB)
# - snr_upper: Upper bound for SNR required (default: 40 dB)
# - total_snrlevels: Number of SNR levels required (default: 5, which means there are 5 levels between snr_lower and snr_upper)
# - noise_dir: Default is None. Specify the noise directory path if the noise files are not in the source directory
# - speech_dir: Default is None. Specify the speech directory path if the speech files are not in the source directory
# - noise_types_excluded: Noise files starting with the following tags are excluded from the noise list.
#   Example: noise_types_excluded: Babble, AirConditioner
#   Specify 'None' if no noise files are to be excluded.

[noisy_speech]
sampling_rate: 16000
audioformat: *.wav
audio_length: 10
silence_length: 0.2
total_hours: 3.0
snr_lower: 0.1
snr_upper: 10.1
total_snrlevels: 10

# Modification
# Original:
#   noise_dir: None
#   speech_dir: None
# Change: point both directories at your own wav files (sample rate = 16000 samples/s)
noise_dir: your directory for noisy speech wav files (sample rate = 16000 samples/s)
speech_dir: your directory for clean speech wav files (sample rate = 16000 samples/s)
noise_types_excluded: None
- Original:
noise_dir: None
speech_dir: None
- Change:
noise_dir: your directory for noisy speech wav files (sample rate = 16000 samples/s)
speech_dir: your directory for clean speech wav files (sample rate = 16000 samples/s)
Step 2: Create a database that mixes clean speech audio files with noisy speech at the desired length and SNR levels
- Enter the parameter settings. The parameters are defined in the file noisyspeech_synthesizer.cfg. The parameters used in this example were: sampling_rate: 16000, audioformat: *.wav, audio_length: 10, silence_length: 0.2, total_hours: 3.0, snr_lower: 0.1, snr_upper: 10.1, total_snrlevels: 10
- File: noisyspeech_synthesizer.py
- Run this file. Using a Google Colab notebook, the code was:

import os
from pathlib import Path
from google.colab import drive

drive.mount('/content/drive/')
!python /content/drive/MyDrive/YourPath/noisyspeech_synthesizer.py
- This should create three new directories containing wav files.
- The new directories are: Noise_training, containing the noise speech files; CleanSpeech_training, containing the clean speech files; and NoisySpeech_training, containing the noise and clean files mixed at the specified signal-to-noise levels. A quick check is sketched below.
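As an optional sanity check (not part of MS-SNSD), you can count the generated wav files in each output directory; the base path below is the same hypothetical Drive path used in the run command above.

# Optional sanity check: count the generated wav files in each output directory.
import glob
import os

base = "/content/drive/MyDrive/YourPath"   # hypothetical path from the run command
for d in ("CleanSpeech_training", "Noise_training", "NoisySpeech_training"):
    n = len(glob.glob(os.path.join(base, d, "*.wav")))
    print(f"{d}: {n} wav files")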
Step 3: Use an algorithm to separate the unknown user audio from the mixed audio signal
The audio files containing the mixed speech are located in the NoisySpeech_training directory; example file name: noisy28_SNRdb_0.1_clnsp28.wav
Recall that the noisy speech represents the speech from the device, and it is located in the Noise_training directory; example file name: noisy28_SNRdb_0.1.wav
Note that the "28" in these examples links the two files.
You can now use this dataset to test your own algorithm for separating the known device audio (e.g. noisy28_SNRdb_0.1.wav) from the mixed file (noisy28_SNRdb_0.1_clnsp28.wav).
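For illustration, a small (hypothetical, not from the article) helper can pair each mixed file with its device-only counterpart by stripping the "_clnspNN" suffix from the file name:

# Pair each mixed file in NoisySpeech_training with its device-only counterpart
# in Noise_training, using the shared index embedded in the MS-SNSD file names.
import glob
import os
import re

mixed_dir = "NoisySpeech_training"
noise_dir = "Noise_training"

for mixed_path in sorted(glob.glob(os.path.join(mixed_dir, "*.wav"))):
    name = os.path.basename(mixed_path)                       # e.g. noisy28_SNRdb_0.1_clnsp28.wav
    match = re.match(r"(noisy\d+_SNRdb_[\d.]+)_clnsp\d+\.wav$", name)
    if match:
        device_path = os.path.join(noise_dir, match.group(1) + ".wav")  # e.g. noisy28_SNRdb_0.1.wav
        print(mixed_path, "<->", device_path)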
The pseudocode used for recovering the user's voice
Algorithm (using the above example files)
# Define lists
mlt = list of floats from 0.001 to 2.000      # e.g. (0.001, 0.002, ..., 2.000)
corr_abs = empty list                         # stores absolute correlation values
m_val = empty list                            # stores m values, where m is a float

# y1 = Device's voice, y2 = Mixed voices
y1 = normalize(noisy28_SNRdb_0.1.wav)
y2 = normalize(noisy28_SNRdb_0.1_clnsp28.wav)
# normalize both files between -1 and 1, i.e. they share the same scale

for m in mlt do:
    X = normalize(y2 - m * y1)
    y = y1
    corr = correlation between X and y
    append absolute value of corr to corr_abs

ind = index of min value in corr_abs
m1 = value of mlt at index ind
user_voice = normalize(y2 - m1 * y1)
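For readers who want to run this directly, here is a minimal NumPy/soundfile translation of the pseudocode above (a sketch, not the project's exact implementation). It assumes the two example wav files are available locally and have the same length and sample rate.

# Minimal translation of the pseudocode above (sketch only).
import numpy as np
import soundfile as sf

def normalize(x):
    # Scale a signal so its values lie between -1 and 1
    return x / np.max(np.abs(x))

y1, fs = sf.read("noisy28_SNRdb_0.1.wav")          # Device's voice
y2, _ = sf.read("noisy28_SNRdb_0.1_clnsp28.wav")   # Mixed voices

y1 = normalize(y1)
y2 = normalize(y2)

mlt = np.arange(0.001, 2.001, 0.001)               # candidate scaling factors m
corr_abs = []
for m in mlt:
    X = normalize(y2 - m * y1)
    corr = np.corrcoef(X, y1)[0, 1]                # correlation between residual and device voice
    corr_abs.append(abs(corr))

m1 = mlt[int(np.argmin(corr_abs))]                 # m giving the least-correlated residual
user_voice = normalize(y2 - m1 * y1)
sf.write("user_voice.wav", user_voice, fs)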
Explanation of algorithm:
An analogy
This algorithm is based upon two signals, say z1 = sin(x) and z2 = sin(x - ɣ), where ɣ is a phase difference. Suppose we want to find the values of ɣ for which z1 and z2 are not correlated.
In this case, ɣ would be varied and we would calculate the correlation for a range of values of ɣ between 0 and 2π.
When ɣ = 0, π, or 2π, the |correlation| between z1 and z2 would be high, since z2 would be ±sin(x). When ɣ = π/2, the |correlation| between z1 and z2 would be low because z2 = sin(x - π/2) = -cos(x).
This technique can be used in a lab setting to find the phase difference between two sinusoidal voltage signals. Using an oscilloscope, one signal is displayed on the x axis and the second signal is displayed on the y axis. Next the phase of the first signal is adjusted until a straight line appears on the oscilloscope. That is when the two sinusoidal voltage signals are highly correlated.
Figure 1 shows an example with sin(t) and cos(t).
To visualize the correlation between sin(t) and cos(t), the two functions can be plotted against each other on the x and y axes, as shown in Figure 2 below.
The circular shape of the scatter plot in Figure 2 indicates that the functions sin(t) and cos(t) have no correlation. This analogy can be extended to other waveforms.
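As a quick numeric check of this analogy (added here for illustration, not from the original write-up), the correlation of sin(x) with sin(x - ɣ) can be computed for a few values of ɣ:

# Correlation of sin(x) with sin(x - gamma) for a few phase differences gamma.
import numpy as np

x = np.linspace(0, 20 * np.pi, 10000)
for gamma in (0.0, np.pi / 2, np.pi, 2 * np.pi):
    corr = np.corrcoef(np.sin(x), np.sin(x - gamma))[0, 1]
    print(f"gamma = {gamma:.3f}  |corr| = {abs(corr):.3f}")
# |corr| is close to 1 at gamma = 0, pi, 2*pi and close to 0 at gamma = pi/2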
Explanation
Let M(t) designate the Mixed voice, U(t) designate the User's voice, and D(t) designate the Device's voice.
M(t) can be mathematically modeled as
M(t) = U(t) + c * D(t)
where c is a constant.
For a range of m values between 0 and 2, the algorithm calculates the correlation between M(t) - m * D(t) and D(t).
At a certain value of m, where m ≈ c, we have M(t) - m * D(t) ≈ U(t). Here the |correlation| between M(t) - m * D(t) and D(t) is at a minimum, because the residual is no longer a function of the Device's voice, and the correlation between the User's voice and the Device's voice is minimal.
In contrast, the |correlation| is at a peak when m ≈ 0, because in this case the residual is still the Mixed voice, which contains the Device's voice as well as the User's voice. This is shown in Figure 3.
Using M(t) - m_min * D(t), the waveform of the User's voice can be derived from M(t), m_min, and D(t).
(Figure 4: waveforms of the Mixed Voice and the recovered User's Voice)
The subtraction of the Device's Voice from the Mixed Voice can be seen in Figure 4 and heard by playing the Mixed and User wav files.
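As an illustrative, self-contained check of this model (with synthetic tones standing in for the voices, since the article's audio files are not reproduced here), one can mix the signals with a known c and confirm that the correlation search recovers it:

# Check of the model M(t) = U(t) + c*D(t): mix with a known c, then confirm
# that the minimum of |correlation| occurs at m_min close to c.
import numpy as np

t = np.linspace(0, 1, 16000, endpoint=False)
U = np.sin(2 * np.pi * 220 * t)        # stand-in for the User's voice
D = np.sin(2 * np.pi * 330 * t)        # stand-in for the Device's voice
c = 0.7
M = U + c * D                          # the Mixed voice

mlt = np.arange(0.001, 2.001, 0.001)
corr_abs = [abs(np.corrcoef(M - m * D, D)[0, 1]) for m in mlt]
m_min = mlt[int(np.argmin(corr_abs))]
print(m_min)                           # expected to be close to c = 0.7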
Step 4: Convert speech into text using a speech recognition API
The accuracy of the speech-to-text output improves if the DC component is removed from the signal and the waveform's amplitude is sufficiently large.
Algorithm for pre-processing audio files
Let S(t) be a discrete audio signal of a person speaking
S_new(t) = gain * (S(t) - mean(S(t)))   (remove the DC component and scale so the amplitude is sufficient, per the note above)
Save S_new in wav format as S_new.wav
text = Speech Recognition API(S_new.wav)
print(text)
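The article does not name the specific recognizer; a minimal sketch of this step using the SpeechRecognition package (Google Web Speech backend) together with numpy and soundfile, and a hypothetical recovered file user_voice.wav as input, might look like this:

# Pre-process the audio and convert it to text (sketch under the assumptions above).
import numpy as np
import soundfile as sf
import speech_recognition as sr

s, fs = sf.read("user_voice.wav")                  # hypothetical recovered user audio
s = s - np.mean(s)                                 # remove the DC component
s = 0.9 * s / np.max(np.abs(s))                    # scale to a sufficient amplitude
sf.write("S_new.wav", s, fs, subtype="PCM_16")     # save as 16-bit PCM wav

recognizer = sr.Recognizer()
with sr.AudioFile("S_new.wav") as source:
    audio = recognizer.record(source)
text = recognizer.recognize_google(audio)          # speech-to-text
print(text)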
Speech to Text Output
Device Text:
“yes ma’am it sure was I always say you’ve got a wonderful husband Miss Margaret may I get some more cookies Nora well wait until you finish what you’ve got Davey”
Mixed Text:
“in manager to the heroine of iOS Elliott got a wonderful husband Ben some Mysterio”
Hypothesized User Text:
“he was a head shorter than His companion of almost delicate physique since then some mysterious Force has been fighting us at every step I had faith in them there was a change now”
True User Text:
“he was a head shorter than his companion of most delicate physique since them some mysterious Force has been fighting us outside every step I had faith in them there was a change now”
Tip: Since the clean speech wav files are located in the CleanSpeech_training directory, you can use the speech recognition API to convert those files to the true text and then verify the conversion.
Step 5: Evaluate Algorithm using Word Error Rate (WER) Metric scores
- WER scores range from 0 upward, where 0 is a perfect score; in practice scores usually fall between 0 and 1
- The Python package jiwer (2.3.0) was used to calculate the WER scores
- The WER score between the recovered User's Voice transcript and the True User Text was 0.0882
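As an illustration, a minimal jiwer call on the Step 4 transcripts (lower-cased here so that case differences are not counted as errors) looks like this:

# WER calculation with jiwer, using the Step 4 transcripts.
from jiwer import wer

true_text = ("he was a head shorter than his companion of most delicate physique "
             "since them some mysterious Force has been fighting us outside every step "
             "I had faith in them there was a change now")
hypothesis = ("he was a head shorter than His companion of almost delicate physique "
              "since then some mysterious Force has been fighting us at every step "
              "I had faith in them there was a change now")

score = wer(true_text.lower(), hypothesis.lower())
print(round(score, 4))   # 3 substitutions over 34 words: about 0.0882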
Summary
This article has outlined a solution to the situation in which a device's microphone receives audio input from both its own voice and a user's voice. An algorithm was provided for recovering the user's voice from the mixed-voice input, and it was demonstrated on an example with a signal-to-noise ratio of 0.1. In addition, a method of creating a database of mixed voices at user-defined signal-to-noise ratios and lengths was demonstrated. In summary, the algorithm used signal coherence, phase difference, and correlation to determine the user's voice.
Here is a video which covers the key concepts presented in this article.
Sources
- MS-SNSD Database
https://github.com/microsoft/MS-SNSD
@article{reddy2019scalable,
title={A Scalable Noisy Speech Dataset and Online Subjective Test Framework},
author={Reddy, Chandan KA and Beyrami, Ebrahim and Pool, Jamie and Cutler, Ross and Srinivasan, Sriram and Gehrke, Johannes},
journal={Proc. Interspeech 2019},
pages={1816–1820},
year={2019}
}
- Clean Speech Source (http://festvox.org/cmu_arctic/)
http://festvox.org/index.html and “The documentation, tools and dependent software are all free without restriction (commercial or otherwise).”
The Arctic Database (used for this article) has its acknowledgement found in http://festvox.org/cmu_arctic/cmu_arctic_report.pdf
Source: “CMU ARCTIC databases for speech synthesis” by John Kominek and Alan W Black
Licensing
‘The Arctic databases are distributed as “free software” under the following terms. Carnegie Mellon University Copyright (c) 2003 All Rights Reserved. Permission to use, copy, modify, and license this software and its documentation for any purpose, is hereby granted without fee, subject to the following conditions:
- The code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Any modifications must be clearly marked as such.
- Original authors’ names are not deleted.