Choi: I'm Keunwoo Choi from Spotify, where I work as a research scientist. This talk, Deep Learning with Audio Signal, will be a gentle, quick introduction to what you would do, and how you would do it, when you suddenly have to work on audio signals, or maybe music, speech, sound, whatever, with machine learning.
A very quick introduction of myself: I studied some acoustics, a bit of machine learning, and MIR, music information retrieval. Now I'm in a team called MIQ, which is mostly in New York and some in London, and everyone's working on a slightly different topic under an umbrella named MIR, which is basically music and machine learning, mostly music signals and machine learning.
After this talk you will probably have some idea about how to start on your task with machine learning and audio signals, and after working on it for a while you will quickly get to know more than what I cover in this talk, so think of it as a good starting point or baseline. I'm assuming you're a beginner in audio, music, speech, or sound, and maybe also in MIR, or deep learning and machine learning.
We'll be covering pretty much a bit of everything, very shallowly and quickly. It doesn't include how you would do music recommendation at Spotify, that's a bit different topic from my talk. The content of the talk is in four sections, in chronological order over the whole task. How would you prepare the dataset and how would you deal with the audio signal? How do you process the signals so that they are nicer for the model you use? Then how do you design your network, and then, what can you expect from the task, audio signals, and machine learning?
First, prepare the dataset, know your data, or, maybe you can say something like, how to start another task. Every task these days, with these crazy machine-learning approaches, and techniques, and methods, systems, papers, what we want to do is a data-driven approach. We want some datasets, so there's a Y-A-H-O-O [inaudible 00:02:56], there are lots of other datasets. You've got Google or Yahoo, and whatever, there are so many different datasets, and maybe at the beginning you could be super optimistic about acquiring or collecting a dataset, because things are offered, things are public, and you can just download everything.
These are just some different, arbitrary classes of situations you might be in. If you're really lucky, there could be exactly the dataset that you need. For example, take some random audio task, something like dog barking classification. In that case, you would need lots of different wav files that have different dogs, different breeds, different sizes, barking in different ways with different moods, so that everything is covered in the dataset.
If that's happening, that's really awesome, in many cases, it's not really such a way, but still, there's hope, because this is 2019. There should be so many dog audio files online, so maybe you can think that with a slight modification of the existing dataset you can do the work. The worst case is, if you want to build something like dark-haired Golden Retriever raised in West Coast of America dog classification, in this case, there are not that many data points that you can just directly use. In that case you will be thinking of collecting the dataset by yourself, but before you try to use it, maybe we can think of, what is the nature of the data itself?
What is audio, or even, what is sound? All the work is on cloud, computer, GPU, anyway, that's all digital signals, and that's what we call audio signals here, so basically just audio files, digital data. In reality, in case your work is of sound which exists, which is living in the real world, not cyberspace, in that case, we still want to do some audio tasks. We need to be aware of, what is the procedure of converting the sound into audio?
In this case, what are the procedures? We can break it down into three or four different entities. First, the source exists there, a dog barking. In reality there's always noise: if you want to build your dog barking classifier, or dog versus cat, dog versus not dog, there's always some noise, maybe there's a street cat, it could be windy. Also in reality, when there's a sound source and a microphone, there are interactions between the sound and the space, which is reverberation, and that is where acoustics kicks into the procedure. Then there's the microphone. All of these procedures and objects, or systems, will change the sound. They are involved in the modification from sound to audio, so you've got to be aware of what they are and how to deal with them.
In reality, you probably will have to do a lot more than just downloading the dataset. Even a very simple task, say dog barking or baby crying detection, which sounds like a toy task, could be quite tricky if you really want to build a product out of it, because of the difference between sound and audio. Here is a summary of what I've just told you: if you train with some very clean dog barking signal, and you put it on your Raspberry Pi with a microphone, a TensorFlow model, everything, then when the dog barks way behind your house and there is noise and reverberation, the model might work, but it might not, so our goal is to make it robust.
To make it robust, there could be some way to put the robustness into the model, but obviously, in this data-driven approach, the most straightforward way is to prepare your data so that the training scenario is similar to the real-world scenario. To break all of that down into a very simplified structure, it would be something like this: there's a clean signal that only has a dog barking sound, and then you want to take some noise into account. There's this dry signal, which is acoustics jargon: a dry signal is not like the signal in this room, where all the room acoustics is happening and reverberation is involved. A dry signal is more like what you get in a specially acoustically treated room, where you say something and there's no reverberation, or almost zero reverberation.
When this original signal goes through the microphone something is happening, which can be just simplified or approximated as a bandpass filter. We got the recorded signal, and all of this, at this moment, broke into three different procedures, they are happening all together, simultaneously. It's a little tricky to present all the details of the acoustics, but there are a few keywords that you want to use when you want to mimic the real world so that you can prepare the right dataset. For noise, it depends on the place, the situation where you want to use your model, for example. Really common case people think of is mostly home or office, in those indoors there are still lots of different noises, there could be kitchen noise, you might be vacuuming, there's outside noise from the street.
For that kind of thing, you can think of something like babble noise, home noise, cafe noise, there are a lot of it, so that means there's a lot of different noises out there. Also, there's lots of dog barking sound, but you want to mix them together so that it sounds like more realistic scenario. The implementation is really simple, although we can classify the type of noise to be additive or convolutive, just adding the original signal with the noise is already good enough to simulate the real world.
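As a rough sketch of the additive mixing described above, the snippet below scales a noise clip so that the mixture reaches a chosen signal-to-noise ratio. It is an illustration rather than anything shown in the talk: the function name `mix_at_snr`, the synthetic sine "bark", and the Gaussian "street noise" are all assumptions standing in for real recordings.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so that the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; it assumes the noise is not silent.
    """
    # Loop the noise if it is too short, then trim it to the clean length.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that clean_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
bark = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for a clean dog bark
street = rng.normal(0.0, 0.1, sr // 2)     # stand-in for recorded street noise
noisy = mix_at_snr(bark, street, snr_db=10)
```

Drawing `snr_db` at random from a range during training is one simple way to cover both quiet and noisy conditions.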
As for the second one, reverberation, I would recommend skipping it the first time. There are many reasons I don't really recommend trying to solve the reverberation problem. First of all, it's tricky to simulate, because to make good room impulse responses you need quite good acoustic facilities, and there are not that many open-sourced ones out there. In case you want to do it, what you want to look for is a room impulse response, or RIR, and the operation itself is not addition but convolution.
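For readers who do want to try it, here is a minimal sketch of applying an RIR by convolution. The impulse response below is synthetic (exponentially decaying noise), which is only a crude assumption; a measured RIR from a real room would replace it. The name `add_reverb` is hypothetical.

```python
import numpy as np

def add_reverb(dry, rir):
    """Convolve a dry signal with a room impulse response (hypothetical helper)."""
    rir = rir / np.max(np.abs(rir))          # peak-normalize the impulse response
    return np.convolve(dry, rir, mode="full")  # output includes the reverb tail

sr = 16000
rng = np.random.default_rng(0)

# Synthetic stand-in for a measured RIR: 0.3 s of exponentially decaying noise.
n = int(0.3 * sr)
rir = rng.normal(size=n) * np.exp(-np.arange(n) / (0.05 * sr))
rir[0] = 1.0                                 # direct path

dry = np.zeros(sr)
dry[100] = 1.0                               # a single click as the dry source
wet = add_reverb(dry, rir)
```

The output is longer than the input by `len(rir) - 1` samples; whether to keep or trim that tail depends on how the training clips are framed.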
In many cases the microphone we are using in our mobile phone, laptop, or the microphone that you want to install in your Raspberry Pi, it's not that good one. That means if the signal is really good, the original signal is really good in your downloaded dataset you want to make it in a way that it sounds not acoustically very great, which means at the same time, acoustically more similar to the audio signal you will acquire with your microphone.
There are lots of things happening in the microphone as well, but the most important thing will be about the bandwidth of the microphone that you are using. In that case, you might want to do some signal processing, which will be simply just the convolution, or even just trimming off some high frequency and low-frequency region from your spectrogram. That was a quick summary of what to do in the first section.
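The band-limiting idea can be sketched as below by zeroing FFT bins outside a pass band. This is a crude brick-wall filter chosen for brevity, not a model of any particular microphone, and the 100 Hz and 4 kHz cutoffs are made-up values for illustration.

```python
import numpy as np

def band_limit(signal, sr, low_hz, high_hz):
    """Zero out frequency content outside [low_hz, high_hz] (brick-wall sketch)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0
    return np.fft.irfft(spectrum, n=len(signal))

sr = 16000
t = np.arange(sr) / sr
# Test signal: 50 Hz rumble + 1 kHz tone + 7 kHz tone.
x = (np.sin(2 * np.pi * 50 * t)
     + np.sin(2 * np.pi * 1000 * t)
     + np.sin(2 * np.pi * 7000 * t))
y = band_limit(x, sr, low_hz=100, high_hz=4000)   # only the 1 kHz tone survives
```

A smoother filter (for example a Butterworth band-pass) would avoid the ringing a brick-wall mask can introduce on real recordings.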
Here's the second section, we assume that we have a good amount of audio dataset, how would you train it so that the model can do the work most effectively and efficiently? This is already answered, what you do is the spectrogram, which we will go in a bit more detail, so what to do after loading the signal into your memory during training and maybe also during test.
Before getting into the different spectrogram parameters and representations, let's go through digital audio very quickly. We have, say, a 1-second digital audio signal, mono, and we assume the sampling rate is 44.1 kilohertz. As a NumPy array, or as a data point, what we see is a one-dimensional array of that size, 44,100 samples, typically of integer type. There's a little difference between audio signals and the image data that, I think, many of you are familiar with. If you look at the number of pixels, or the number of data points in each training item in those popular datasets, MNIST, CIFAR, ImageNet, you will realize that the number of data points in 1 second of digital audio is actually much bigger than in most of those datasets.
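The comparison can be made concrete with a little arithmetic. The image sizes below are the usual ones for MNIST and CIFAR-10; the 224x224 ImageNet crop is an assumption about a commonly used input size, not something stated in the talk.

```python
sample_rate = 44100      # samples per second, mono
duration_s = 1
audio_points = sample_rate * duration_s

# Values per training item for some popular image datasets.
image_points = {
    "MNIST (28x28x1)": 28 * 28 * 1,
    "CIFAR-10 (32x32x3)": 32 * 32 * 3,
    "ImageNet crop (224x224x3)": 224 * 224 * 3,   # assumed typical crop size
}

for name, n in image_points.items():
    print(f"{name}: {n} values, audio/image ratio = {audio_points / n:.2f}")
```

Under that assumption, one second of audio is far larger than an MNIST or CIFAR-10 item, and a clip of roughly three and a half seconds already exceeds a 224x224 RGB crop.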
You will probably have to deal with signals that are actually longer than 1 second, so there are a couple of issues, such as memory, which I don't really cover. All you need to know here is that audio comes in such a shape, and the amount of data points that you will get when you're dealing with it.
The image I showed here is an audio waveform, which is pretty much the original form of the data we use. Because it's the deep learning era and everything seems to be end-to-end, here I'm going to give you a recommendation on what kind of representation you want to use. Out of these three types of representations, waveforms would be the most end-to-end type of input. You can think of it as going from the rawest end, to some sort of representation with some compression, and then to even more compression, something called features. From a waveform you can compute spectrograms, which is a general term for a two-dimensional representation where one axis is time and the other axis is frequency. There is actually more than one type of spectrogram, like the short-time Fourier transform, Melspectrogram, Constant-Q, you name it, there are many.
On top of those spectrograms there are even more condensed representations, and the most popular example is MFCC. There are really a lot of audio features out there besides MFCC. Here you see the number of data points. The STFT looks like a smaller number of data points, but actually it's not really about compression: the STFT is more about arranging the same data along different axes, in a different space. After that, if you go for a Melspectrogram, 99% of the time you reduce the size of the data into something smaller, which matters a lot when it comes to training a model.
When you're training a model, you want your input data representation to be the most efficient and effective, which means you want to reduce the size as much as possible, as long as it doesn't hurt the performance of the model. These are three different categories of audio representation. The spoiler is that at the end of this section, after 30 minutes, I'll just tell all of you to start with log- or decibel-scaled Melspectrograms, but maybe it's nice to know why.
This is a good example of a spectrogram. On the top you see the waveform. When this waveform is decomposed over different center frequencies, you get this two-dimensional representation, which is already a bit more intuitive and informative than just the waveform. It shows the pitches over time, which makes it much nicer to understand what's going on than the waveform itself.
When something is easier for us to understand straightforwardly, intuitively, in most cases it's easier for a machine to understand what's going on as well. The logic behind using it is to make it easier for the model to capture the characteristics, so we do a bit of the work for the model and then let the model do the rest, and we want to help as much as possible. Here, the golden rule of thumb is to discard all the redundancies. Out of these different types of time-frequency, two-dimensional spectrogram representations, what we think about is, how can we reduce the size of the data while keeping all the important information? This was just another example, a cat meow sound versus a dog barking sound: they already look so different here, unlike when we visually inspect the waveforms.
Out of many choices, the choice is the Melspectrogram in decibel scale, at least as a starting point, because they're smaller and they are perceptual in two different ways. They're perceptual in that they reshape the frequency axis into something different, the mel frequency, which was originally designed for speech understanding.
It turns out that they also work pretty well for deep learning and machine learning models for music and audio in general, for audio signals that people listen to, because we can only listen to what we can listen to. We are not bats or dogs, so even if there might be some information out there that is discarded or given less weight in a Melspectrogram, it's mostly the right choice, because the task we want the model to learn is about what we are interested in, which means the perceptual design always helps. Compared to the Constant-Q transform, which I only briefly mentioned, it's faster to compute. In terms of memory, computation, and training, the Melspectrogram is a really good choice.
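To make the decibel-scaled Melspectrogram concrete, here is a minimal NumPy-only sketch. The mel formula used is one common variant, and the parameter values (`n_fft=1024`, `hop=512`, `n_mels=64`) are illustrative assumptions; in practice a library such as librosa or kapre would normally do this step, and their exact outputs will differ from this sketch.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters equally spaced on the mel axis, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    hz_edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(y, sr, n_fft=1024, hop=512, n_mels=64):
    """Return a (n_mels, n_frames) array in dB. Assumes len(y) >= n_fft."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T     # (n_frames, n_mels)
    return (10.0 * np.log10(np.maximum(mel, 1e-10))).T    # floor avoids log(0)

sr = 44100
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 second test tone
S = log_mel_spectrogram(y, sr)
print(y.shape, "->", S.shape, "values:", y.size, "->", S.size)
```

With these settings the one-second waveform shrinks to a much smaller two-dimensional array, which is the size reduction the talk refers to.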
The question would be, there are many libraries, how do you use them? I'm assuming a Python environment. There's librosa and madmom, and there are quite a few nicely maintained libraries that do everything on CPU, so these are the best if you want to get the whole dataset ready after the Melspectrogram or pre-processing stage. Here are some other choices. I think Kapre is the most structured one, which is a set of Keras audio preprocessing layers, so it's for when you want to do some deep learning with audio.
If you want to do all the pre-processing computation, computing the Melspectrogram, on-the-fly and also on GPU, kapre could be a good choice. It helps a lot, especially because when you're computing a Melspectrogram there are a bunch of different hyperparameters that change the shape or resolution of the Melspectrogram. If you want to optimize those audio-related parameters, you have to search over them, which means you don't want to prepare all the different versions of the dataset on your hard disk and then train one by one. What is much better is to compute them on-the-fly, and for that kapre could be really good.
If you prefer PyTorch, there's pytorch-audio, which is under active development at the moment. Disclaimer: I'm involved in the development of both kapre and pytorch-audio. But there are not that many choices, so you have to figure it out.
Now we've downloaded the dataset, we did some interesting audio treatment and some pre-processing, and we have a bunch of files in folders, a Pandas file, everything. What you want to do is train the network, but before training we need to design the model. What would be the shape? What kind of network should we be using? There are many structures out there, and most of them were originally developed for computer vision. A while ago, in the 1990s when neural networks were originally developed, people cared about what's happening within our body and brain, the nerves; that's why we have these synapse- and neuron-motivated things, like 2D convnet classification.
From some point, we started to diverge into something more computational and less about trying to mimic a human body process. People still want to explain something like, "Hey, this 3-by-3 convolution is working because it is capturing the shape of the thing, which is what happens in our human vision system." There are also so many counterexamples where what's actually happening is not similar to what's happening within our perception, and that is why those structures that were not originally developed for audio are working really nicely.
What do we do? Let's say your dog might be barking for an hour or something, but you still want to chop the signal and then do some classification, or regression, whatever the task is. That is a strong baseline, on which you might even be able to write some papers. Still, you might need to go simpler because you don't have enough data, or you don't have that much budget for GPUs. Then you want some existing feature extractor, and you want to do some transfer learning. In transfer learning, in computer vision, people are using networks pre-trained on ImageNet and then doing some different but relevant task.
Based on object detection, people do some structure segmentation of the images. What is interesting in many papers is that saying, "Oops, it's just random initialized network” and you don’t know if it works very well, which is also happening in audio as well, it's totally fine. If you don't really have any choice at all then just pick up some pre-trained network that looks similar, slightly on very high level, similar to your task, it is definitely worth trying that approach.
There's even a paper from the older days using AlexNet, which is one of the oldest image classification networks, from 2012 or so, for music genre classification, and it actually worked, so go and check it out. At the same time, if you want to go a little bit further or a little bit deeper, you can make a better structure. What you have to think about, in audio tasks especially, is to take the size of the receptive field into account. If you dig a little deeper, trying to understand what to design and what kind of operation is happening in a ConvNet or RNN, you will run into this receptive field. The size of the receptive field matters because the model can't capture a pattern that is bigger than its receptive field, so you want to optimize that size.
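As a small worked example of the receptive field idea, the sketch below computes the receptive field along the time axis for a stack of convolution and pooling layers, using the standard recurrence (each layer adds `(kernel - 1)` times the accumulated stride). The layer stack, hop size, and sample rate are hypothetical values for illustration.

```python
def receptive_field(layers):
    """Receptive field along one axis for a list of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) input steps
        jump *= stride              # strides multiply the step between output units
    return rf

# Hypothetical stack: three blocks of a 3-wide convolution and 2-wide max pooling.
layers = [(3, 1), (2, 2)] * 3
frames = receptive_field(layers)

# Rough conversion to seconds for spectrogram frames (ignores the STFT window length).
hop, sr = 512, 44100
print(frames, "frames, roughly", round(frames * hop / sr, 3), "seconds")
```

If the pattern of interest, say a full bark or a sung phrase, lasts longer than this span, the stack needs more depth, larger kernels, more pooling, or dilation.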
In that sense, especially when it comes to audio, unlike images, you also want to think about the sparsity of the data. For example, there's a pretty challenging task, bird singing detection or classification, where you are given 1 or 2 minutes of audio signal. Out of that 1 minute, if you're lucky the birds are singing all the time, but at other times it's mostly noise, the singing is there and then it's gone, and it sounds so quiet. That is the sparsest problem that I know of in the audio domain.
There are cases not as sparse as that. Say you want to do something like vocal detection: in a lot of music there's a vocal, but it's still kind of sparse, whereas something like distortion guitar detection in Metallica's music is not sparse at all. So it depends on what kind of pattern you look for, but the sparsity in the data, in the input audio, is one of the things that you want to take into account.
The worst advice would be to copy what they do even in a different domain, and to proceed a little bit nicer way. People are actively discussing why we should copy something from some vision area, and this is also related to people's pride in the audio research field, you guys need to understand what's going on in your ear and everything.
They are different, and although some people argue, "Come on, they're the same thing in terms of data." I think still the discussion is going on, and of course, it's much better when you have your dedicated thing on everything, but when you don't have a choice, still this might save you.
This section is almost embarrassingly short: expect a result, or another problem. At the end you have the model, you have the data. You'll be curious about what kind of performance you can expect after training your system. The thing is, it could be a novel problem, unlike other popular problems, so there are not that many references to copy from, to tell you whether it would work or not. It's a very basic data scientist question: when you have a new problem, the first thing you have to do is not import keras, but spend at least a minute or two on a feasibility check, asking whether it's really something you can do or not.
In many cases, it could be a novel problem: there's no blog post, there's no paper you can copy, so that is why this check comes into your procedure. The most basic question is, can you do it yourself? Or can at least some professional do it after training? Then maybe a model can do it. There are some easy tasks, like baby crying detection, it's so easy anyone can do it. But when you want to do something more, like baby crying translation, well, babies cry for so many different reasons, they're hungry, sleepy, they pooped, they're annoyed, cold, so that would definitely be more difficult than just crying detection, and it's the same for dog barking.
In the music field, there is a task called [inaudible 00:31:56] sound detection, or more like [inaudible 00:31:57] sound prediction. When it comes to detection, it's easy because detection sounds like something retrospective, but when it comes to [inaudible 00:32:04] sound prediction, can you actually predict, after listening to some new songs that it's going to make a big hit or not? That doesn't sound easy at least.
Two years ago at Interspeech, which is the international conference on speech, there was a research challenge about medical diagnosis based on snoring sound. There was a certain symptom that some people are struggling with, and the task was to classify patient versus non-patient based on the snoring sound.
I almost wanted to take part in that challenge, but after actually downloading the dataset and the [inaudible 00:33:06] preferences, I realized maybe I should ask my friend who is a doctor, "Hey, do you know this?" He's like, "Oh, you need to use some special device? You can't really do it." He's a doctor and he said, "I don't think I could do it after listening to the snoring sound," and that's when I gave up, because that sounds like it is at least not a very easy task. You've got to think about the feasibility of the problem; that is my final word for this section.
Let's recap. Sound is analog, audio is digital. During the conversion from sound to audio you might want to think about what's happening and how your dataset should mimic it. Then pre-processing: the Melspectrogram, although it comes from speech recognition, could pretty much be your final choice, and is at least a very good starting point. The perception of speech is not very different from the perception of other sounds. Designing is about copying, and then you can start modifying; you might be the first one in the world solving the problem. Finally, expectation: do a good feasibility check, because the problem could be very novel.
Moderator: Thank you very much from my side as well for the talk. I really liked it, seeing how you actually think about transferring ImageNet ideas, or transfer learning ideas, onto audio data.
Participant 1: Could you talk a bit about the augmentation that you could be using to train audio versus images?
Choi: Yes, that's a good topic. First of all, when you want to do some audio augmentation there are, fortunately, a couple of different tools, and most of them are similar to the procedure I introduced in the first section, because the idea is the same. They want to boost the number of data points, so when it comes to, say, noise, people do something similar, where they control the SNR, or just the amount or gain of the noise, so that the model can be robust up to a certain point. Then people do things like using [inaudible 00:36:00] to stretch the audio along time or frequency, like pitch transposing. All those things seem to be working and mostly improving the performance.
There were a couple of papers that were done for [inaudible 00:36:20] detection. Yes, there are a couple of augmentations. It depends a lot, though. In some cases, changing the signal along the frequency and time axes totally makes sense; a dog could bark at a slightly lower frequency. The bottom line is you've got to think about what kind of variability will really exist in the real-world situation.
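As a very rough illustration of this kind of augmentation, the sketch below applies a random speed change and a random gain. Note the substitution: a speed change by resampling alters pitch and duration together, which is cruder than the separate time stretching and pitch transposing mentioned above (dedicated audio libraries provide those). The function names and the parameter ranges are assumptions.

```python
import numpy as np

def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 gives a shorter, higher-pitched clip."""
    n_out = int(round(len(signal) / rate))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

def augment(signal, rng, max_rate_change=0.1, max_gain_db=6.0):
    """Apply a random speed change and a random gain (hypothetical ranges)."""
    rate = 1.0 + rng.uniform(-max_rate_change, max_rate_change)
    gain_db = rng.uniform(-max_gain_db, max_gain_db)
    return change_speed(signal, rate) * 10 ** (gain_db / 20)

rng = np.random.default_rng(0)
sr = 16000
bark = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # stand-in for a dog bark
augmented = [augment(bark, rng) for _ in range(4)]     # four random variants
```

The ranges should reflect variability that can actually occur for the task, for example how much lower or higher a real dog might plausibly bark.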
Participant 2: Thank you for the amazing talk. Is the following problem solvable using machine learning on audio, as in, if I want to detect if the voice that I'm hearing is being replayed from a recorded device versus a real human actually speaking it?
Choi: I'm sorry, what is the problem exactly?
Participant 2: If I wanted to detect if the voice I'm hearing is being played from a pre-recorded device versus actually live recording.
Choi: I think there's a whole area on that topic, like spoofing or something, where people want to tell the difference between a recorded "Alexa" and a real "Alexa" sound. I'm not really aware of what the problem itself entails, but it's a known problem and there are many papers, because it's a real thing that people need to solve. It's definitely worth checking out the recent progress there.