I Built a Convolutional Neural Network that understands Audio


TL;DR

I trained a convolutional neural network (CNN) based on a ResNet-34 style residual architecture to classify audio clips from the ESC-50 dataset (50 environmental sound classes). The input is a log-mel spectrogram of each clip. The model reached about 88% validation accuracy (macro F1 about 0.83), and its head uses dropout and adaptive average pooling to limit overfitting. This post is both a learning journal and a mini-tutorial on CNNs for audio.

Check it out tho —> Link


Why I Built This

I’ve always been fascinated by how we can “see sound.” Turning waveforms into spectrograms and then letting a CNN discover structure in those images felt like the perfect way to combine my curiosity about audio with my love for deep learning architectures.

This project wasn’t just about hitting a benchmark – it was about really understanding CNNs and how they map onto audio data. So here’s my honest build log, decisions, failures, and what I’d improve.


Dataset: ESC‑50

I worked with the ESC‑50 dataset:

  • 2000 audio clips (5 seconds each)

  • 50 balanced classes covering animals, natural soundscapes, human sounds, domestic, and urban noise

Each class has 40 examples, which makes it a small but clean dataset – ideal for a controlled CNN experiment but definitely requiring augmentation.


Turning Audio into Images (Spectrograms)

My preprocessing pipeline:

  • Resampled audio to 44100Hz

  • STFT with n_fft=1024, hop_length=512 (my inference code in main.py, shown further down, uses n_fft=2048)

  • Converted to mel scale with n_mels=128

  • Took log amplitude (dB)

  • Normalized each spectrogram

This gave me consistent (1, 128, 256) tensors for each clip (1 channel, 128 mel bins, 256 time steps).
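
As a quick sanity check on that shape, here is a minimal sketch of the frame arithmetic. It assumes the STFT pads the signal so that frames are centered (the torchaudio default); the helper name `num_frames` is illustrative and not part of the project. Under that assumption the frame count does not depend on n_fft, and a 5 second clip at 44.1 kHz produces 431 frames, so a fixed width of 256 would imply an extra crop, pad, or resize along the time axis.

```python
def num_frames(num_samples: int, hop_length: int, n_fft: int, center: bool = True) -> int:
    """Number of STFT frames for a signal with num_samples samples."""
    if center:
        # The signal is padded by n_fft // 2 on each side, so n_fft cancels out.
        return 1 + num_samples // hop_length
    return 1 + (num_samples - n_fft) // hop_length


sample_rate = 44100
clip_samples = 5 * sample_rate  # 220500 samples in a 5 second clip
frames = num_frames(clip_samples, hop_length=512, n_fft=1024)
print(frames)  # 431
```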

Why Convert Audio to Mel Spectrograms?

A waveform is a raw 1D signal: each sample is the combined amplitude of every frequency present at that instant, so pitch and timbre are not directly visible. CNNs thrive on spatial patterns. Enter the mel spectrogram:

  • Transforms 1D audio → 2D image of frequency vs. time

  • Mel scale gives finer resolution to lower frequencies (where human pitch perception is sharper) and groups higher ones into wider bands

  • Log scaling compresses huge dynamic ranges
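
To make the last two points concrete, here is a small sketch using the HTK-style mel formula, one common variant (I am assuming it here; other mel definitions exist). The function names are illustrative. It shows that bands evenly spaced in mel are narrow at low frequencies and wide at high ones, and that decibel scaling turns a millionfold power ratio into a 60 dB difference.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> list:
    """Edges in Hz of n_mels bands that are evenly spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]


def power_to_db(power: float, floor: float = 1e-10) -> float:
    """Log compression; the floor avoids log(0) on silent bins."""
    return 10.0 * math.log10(max(power, floor))


edges = mel_band_edges(n_mels=128, f_min=0.0, f_max=22050.0)
low_spacing = edges[1] - edges[0]
high_spacing = edges[-1] - edges[-2]
print(f"lowest band spacing:  {low_spacing:.1f} Hz")
print(f"highest band spacing: {high_spacing:.1f} Hz")
print(power_to_db(1.0), power_to_db(1e6))
```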

Benefits for CNNs:

  • Shifting in time just slides the spectrogram horizontally → CNNs detect the same pattern at different positions

  • Gives sounds a structured 2D layout, and once pooling is added the prediction becomes largely insensitive to when an event occurs in the clip
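
The shift property can be checked with a toy example. This is a pure-Python sketch along the time axis only, with a hand-picked kernel standing in for a learned filter; none of the names come from the project. Moving the event by four frames moves the filter response by four frames, and a global max over time gives the same value in both cases.

```python
def correlate_valid(signal, kernel):
    """1D 'valid' cross-correlation, the operation a conv layer computes."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1.0, 2.0, 1.0]            # stand-in for a learned filter
event = [0.0, 3.0, 5.0, 3.0, 0.0]   # a short "sound event" along time

shift = 4
early = event + [0.0] * 6                    # event at the start of the clip
late = [0.0] * shift + event + [0.0] * 2     # same event, 4 frames later

resp_early = correlate_valid(early, kernel)
resp_late = correlate_valid(late, kernel)

# The response pattern is the same, just moved by `shift` frames.
print(resp_late[shift:] == resp_early[: len(resp_early) - shift])  # True
# Global max pooling over time removes the position entirely.
print(max(resp_early), max(resp_late))  # 16.0 16.0
```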

In my inference pipeline (main.py), I used:

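import torch.nn as nn
import torchaudio.transforms as T

# Assumes this runs inside a class method, where self is available.
self.transform = nn.Sequential(
    T.MelSpectrogram(
        sample_rate=44100,
        n_fft=2048,
        hop_length=512,
        n_mels=128,
        f_min=0,
        f_max=22050,  # Nyquist frequency for 44.1 kHz audio
    ),
    T.AmplitudeToDB(),  # power spectrogram -> decibels
)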

The CNN Architecture

I didn’t just build any CNN – I implemented a ResNet‑34 inspired model adapted for spectrograms. Here’s the full architecture:

[Figure: 34-layer ResNet model]

Code (from model.py)

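import torch.nn as nn

# ResidualBlock is defined separately (not shown in this post).
class AudioClassifier(nn.Module):
    def __init__(self, num_classes=50):
        super().__init__()
        # Stem: 7x7 conv (stride 2) + BN + ReLU + 3x3 max pool (stride 2)
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 64, 7, 2, 3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, 2, 1),
        )
        # Four residual stages with 3, 4, 6, 3 blocks (the ResNet-34 layout).
        # The first block of layers 3 to 5 widens the channels and downsamples.
        self.layer2 = nn.ModuleList([ResidualBlock(64, 64) for _ in range(3)])
        self.layer3 = nn.ModuleList(
            [ResidualBlock(64 if i == 0 else 128, 128, stride=2 if i == 0 else 1) for i in range(4)]
        )
        self.layer4 = nn.ModuleList(
            [ResidualBlock(128 if i == 0 else 256, 256, stride=2 if i == 0 else 1) for i in range(6)]
        )
        self.layer5 = nn.ModuleList(
            [ResidualBlock(256 if i == 0 else 512, 512, stride=2 if i == 0 else 1) for i in range(3)]
        )
        # Head: global average pool -> dropout -> linear classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(512, num_classes)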
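
The `ResidualBlock` used above is not shown in this post. As a rough sketch of what such a block typically looks like, here is the standard two-convolution basic block from the ResNet-18/34 family. The layer names and the 1×1 projection shortcut are my assumptions, not necessarily what model.py contains.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convs plus a shortcut (assumed layout)."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Identity shortcut by default; a 1x1 projection only when the shape
        # changes, so that F(x) and the shortcut output can be added.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # output = F(x) + x


block = ResidualBlock(64, 128, stride=2)
x = torch.randn(2, 64, 32, 64)  # (batch, channels, mel bins, frames)
y = block(x)
print(y.shape)  # torch.Size([2, 128, 16, 32])
```

Because layer2 to layer5 are `nn.ModuleList` objects rather than `nn.Sequential`, a forward pass has to loop over the blocks in each list explicitly.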

Visual Structure

  • Initial Conv Layer: 7×7, stride=2, 64 channels, followed by BN, ReLU, MaxPool.

  • Residual Blocks:

    • Layer2: 3× blocks (64→64)

    • Layer3: 4× blocks (64→128)

    • Layer4: 6× blocks (128→256)

    • Layer5: 3× blocks (256→512)

    • Residual connections are the secret sauce.

      • Without shortcuts: Each block must learn the full mapping.

      • With shortcuts: The block only learns the difference (residual). If nothing needs to change, it can output 0 and pass input forward untouched.

      • Backprop advantage: Gradients flow through both the main path and the shortcut, which mitigates vanishing gradients.

Formally:

  • With shortcut: output = F(x) + x

  • Without shortcut: output = F(x)

This makes deep networks stable and easier to train.
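
To see where the gradient benefit comes from, differentiate the shortcut form with respect to x. This is the standard derivation, using the same F and x as above, with 𝓛 for the loss and I for the identity; it ignores the ReLU that usually follows the addition.

```latex
\text{output} = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial\,\text{output}}
    \left( \frac{\partial F}{\partial x} + I \right)
```

Even when ∂F/∂x is small, the identity term passes the upstream gradient through unchanged. Without the shortcut only the ∂F/∂x factor remains, and many small factors multiplied across layers shrink the gradient.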

  • Residual connections were critical – they allowed me to go deep without vanishing gradients, making this model much more stable.

[Figure: residual block]

  • Head: AdaptiveAvgPool → Flatten → Dropout(0.5) → Linear(512→50)
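
To follow how the feature map shrinks on its way to the head, here is a small shape tracer based on the usual conv/pool output-size formula. It assumes that the first block of layers 3 to 5 downsamples with a 3×3 convolution at stride 2 and padding 1 (the standard ResNet choice; the post does not show this), and the function names are illustrative.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a conv or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(n_mels: int = 128, n_frames: int = 256) -> list:
    h, w = n_mels, n_frames
    shapes = [("input", 1, h, w)]
    h, w = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)  # 7x7 conv, stride 2
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)  # 3x3 max pool, stride 2
    shapes.append(("layer1", 64, h, w))
    shapes.append(("layer2", 64, h, w))                # stride 1 throughout
    for name, channels in [("layer3", 128), ("layer4", 256), ("layer5", 512)]:
        # Assumed: first block of the stage uses a 3x3 conv, stride 2, padding 1.
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
        shapes.append((name, channels, h, w))
    shapes.append(("avgpool", 512, 1, 1))              # adaptive pool to 1x1
    return shapes


for name, c, h, w in trace_shapes():
    print(f"{name:8s} {c:4d} x {h:3d} x {w:3d}")
```

With a (1, 128, 256) input, layer5 ends at 512 × 4 × 8. The adaptive average pool then collapses any spatial size to 1 × 1, so the linear layer always sees 512 features regardless of clip length.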

Why Residual CNNs for Audio?

  • 7×7 kernel at start → captures broad time–frequency context.

  • Residual blocks with 3×3 convs → capture local harmonics and transient events.

  • Skip connections → let gradients flow through deep stacks.

  • Global Average Pooling → compact representation before classification, fewer params than big dense layers.

  • Dropout 0.5 → essential for such a small dataset.

[Figure: layers 1 to 5]

[Figure: pooling + dropout + linear layers]


Pooling Explained

Pooling reduces spatial size, keeping salient features:

  • MaxPool: Keeps the strongest activation in a region → captures sharp onsets like clicks or knocks.

  • AveragePool (at the end): Summarizes entire channels, creating compact embeddings.

Pooling and strided convolutions shrink the feature maps, so each deeper unit covers a wider receptive field and the model sees larger time spans of sound as we go deeper.
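
A rough way to quantify this is the theoretical receptive field along the time axis. The sketch below uses the standard recurrence (each layer adds (kernel − 1) × current step size, and striding multiplies the step size). It assumes two 3×3 convolutions per residual block with the stride on the first convolution of a stage, which the post does not spell out. The effective receptive field in a trained network is usually smaller than this upper bound.

```python
def receptive_field(layers):
    """layers: iterable of (kernel, stride). Returns (rf, jump) along one axis."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # widen the window by (k - 1) current steps
        jump *= stride             # striding spaces later outputs further apart
    return rf, jump


def stage(num_blocks: int, first_stride: int) -> list:
    """Two 3x3 convs per block; only the first conv of the stage is strided."""
    convs = [(3, first_stride), (3, 1)]
    convs += [(3, 1), (3, 1)] * (num_blocks - 1)
    return convs


stem = [(7, 2), (3, 2)]  # 7x7 conv, then 3x3 max pool
main_path = stem + stage(3, 1) + stage(4, 2) + stage(6, 2) + stage(3, 2)

rf_stem, _ = receptive_field(stem)
rf_full, jump = receptive_field(main_path)
print(rf_stem, rf_full, jump)  # 11 899 32

hop_length, sample_rate = 512, 44100
seconds = rf_full * hop_length / sample_rate
print(f"about {seconds:.1f} s of audio per final feature")
```

Under these assumptions a single unit in the last stage can, in principle, see more spectrogram frames than a 5 second clip contains.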


Dropout Explained

Dropout randomly zeroes activations during training (it is switched off at inference). Why?

  • Forces redundancy → model can’t “cheat” by relying on a single feature.

  • Helps prevent overfitting, especially critical for small datasets like ESC-50.

  • In my model, p=0.5 before the final linear layer was key.
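
For intuition, here is a pure-Python sketch of inverted dropout, the variant PyTorch's `nn.Dropout` uses: survivors are scaled by 1 / (1 − p) during training so the expected activation stays the same, and nothing happens at inference. The function below is illustrative, not code from the project.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations. Assumes 0 <= p < 1."""
    if not training or p == 0.0:
        return list(values)  # identity at inference time
    keep = 1.0 - p
    # Keep each value with probability `keep` and rescale it; zero the rest.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
features = [1.0] * 8
train_out = dropout(features, p=0.5, training=True, rng=rng)
eval_out = dropout(features, p=0.5, training=False)
print(train_out)  # each entry is either 0.0 or 2.0
print(eval_out)   # unchanged
```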


Training Setup

  • Loss: CrossEntropyLoss

  • Optimizer: Adam (lr=0.0005, weight_decay=0.01)

  • Scheduler: StepLR (decay every 10 epochs)

  • Batch size: 32

  • Epochs: 100

  • GPU: Serverless NVIDIA H100
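
The StepLR schedule is easy to reason about by hand. The sketch below reproduces its rule in plain Python. The decay factor is not stated above, so `gamma=0.5` is purely a placeholder assumption. The choice matters: with PyTorch's default of 0.1, ten decays over 100 epochs would shrink the learning rate by a factor of about a billion.

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 10, gamma: float = 0.5) -> float:
    """StepLR rule: multiply the learning rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


base_lr = 0.0005
for epoch in (0, 9, 10, 20, 50, 99):
    print(epoch, step_lr(base_lr, epoch))
```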


Results

  • Validation Accuracy: ~88%

  • Macro F1: ~0.83

  • Per‑class: strong on distinct sounds (dog, siren), weaker on overlapping (rain vs pouring water).

  • Confusions: Urban sounds (car horn vs train) often overlapped.
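
Macro F1 sitting below accuracy is what you would expect when a few classes absorb most of the errors. Here is a minimal sketch of the metric on toy labels. The labels are invented for illustration and are not the model's predictions.

```python
def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: every "rain" clip is mistaken for "pouring_water".
y_true = ["dog", "dog", "siren", "siren", "rain", "rain", "pouring_water", "pouring_water"]
y_pred = ["dog", "dog", "siren", "siren"] + ["pouring_water"] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")  # accuracy=0.750 macro_f1=0.667
```

One fully confused class scores an F1 of 0 and drags the macro average down, even though three quarters of the clips are still classified correctly.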


CNNs Explained in Brief (for Audio Beginners)

I used this project as an excuse to master CNNs in audio:

  • Stride = downsampling: reduces resolution while increasing abstraction.

  • Receptive field grows with depth, eventually covering long syllables or ambient textures.

  • Residual learning means “learn the difference” instead of the whole mapping → makes deep nets trainable.

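  • Waveforms vs Spectrograms: On raw waveforms, the same sound can look very different sample by sample (phase changes, thousands of samples per event) → CNN struggles. On spectrograms, an event is a compact time–frequency shape, and a shift in time = horizontal move → CNN detects the same pattern.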

  • Convolutions: Filters slide over time–frequency patches, learning harmonic or transient detectors.

  • Pooling: Adds invariance by focusing on salient energy patterns.

  • Think of it this way: a spectrogram is like sheet music, and CNN filters are the musicians who learn to spot melodies, rhythms, and harmonics no matter where they appear.


Lessons Learned

  1. SpecAugment is non‑negotiable with small audio datasets.

  2. Dropout saved me from massive overfitting.

  3. Residual blocks made training so much smoother than plain CNNs.

  4. Confusions reveal dataset limitations as much as model ones.
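
For readers who have not met SpecAugment, here is a minimal pure-Python sketch of its two masking operations on a spectrogram stored as a list of rows (mel bins) by columns (frames). The mask sizes are placeholder assumptions rather than the settings I trained with, and the time-warping part of SpecAugment is left out.

```python
import random


def spec_augment(spec, max_freq_mask=8, max_time_mask=16, fill=0.0, rng=random):
    """Return a copy of spec with one frequency band and one time band masked."""
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    f = rng.randint(0, max_freq_mask)   # band height in mel bins
    f0 = rng.randint(0, n_mels - f)     # first masked bin
    for m in range(f0, f0 + f):
        out[m] = [fill] * n_frames

    t = rng.randint(0, max_time_mask)   # band width in frames
    t0 = rng.randint(0, n_frames - t)   # first masked frame
    for row in out:
        row[t0:t0 + t] = [fill] * t
    return out


rng = random.Random(0)
spec = [[1.0] * 64 for _ in range(32)]  # dummy 32 x 64 "spectrogram"
augmented = spec_augment(spec, rng=rng)
masked = sum(v == 0.0 for row in augmented for v in row)
print(f"{masked} of {32 * 64} bins masked")
```

In a PyTorch pipeline, torchaudio's `FrequencyMasking` and `TimeMasking` transforms cover the same two operations on tensors.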


Closing

This build was personal: I didn’t just want to “use” CNNs, I wanted to understand them deeply in the context of sound. Now, I can hear a spectrogram and picture the convolutional filters lighting up. If you’re starting your audio ML journey, I hope this post gives you both a blueprint and intuition.

