CNN for Audio Recognition: My Journey

TL;DR

I trained a convolutional neural network (CNN) with a ResNet‑34-style residual architecture to classify audio clips from the ESC‑50 dataset (50 environmental sound classes). The input is log-mel spectrograms; residual blocks, dropout, and adaptive average pooling keep the deep model trainable and help it generalize, and it reached about 88% validation accuracy. This post is both a learning journal and a mini‑tutorial on CNNs for audio.

Check it out tho —> Link

Why I Built This

I’ve always been fascinated by how we can “see sound.” Turning waveforms into spectrograms and then letting a CNN discover structure in those images felt like the perfect way to combine my curiosity about audio with my love for deep learning architectures.

This project wasn’t just about hitting a benchmark – it was about really understanding CNNs and how they map onto audio data. So here’s my honest build log, decisions, failures, and what I’d improve.

Dataset: ESC‑50

I worked with the ESC‑50 dataset:

2000 audio clips (5 seconds each)
50 balanced classes covering animals, natural soundscapes, human sounds, domestic, and urban noise

Each class has 40 examples, which makes it a small but clean dataset – ideal for a controlled CNN experiment but definitely requiring augmentation.

Turning Audio into Images (Spectrograms)

My preprocessing pipeline:

Resampled audio to 44.1 kHz (44100 Hz)
STFT with n_fft=1024, hop_length=512
Converted to mel scale with n_mels=128
Took log amplitude (dB)
Normalized each spectrogram

This gave me consistent (1, 128, 256) tensors for each clip (1 channel, 128 mel bins, 256 time steps).
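
The last two steps of the list above (log amplitude, then normalization) can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names are hypothetical, and per-clip zero-mean, unit-variance scaling is an assumption, since the post does not say which normalization was used.

```python
import math

def power_to_db(power_spec, amin=1e-10):
    """Convert a power spectrogram (list of rows) to decibels: 10 * log10(power)."""
    # amin guards against log10(0) on silent bins.
    return [[10.0 * math.log10(max(p, amin)) for p in row] for row in power_spec]

def normalize(spec, eps=1e-8):
    """Scale one spectrogram to zero mean and unit variance (assumed scheme)."""
    values = [v for row in spec for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [[(v - mean) / (std + eps) for v in row] for row in spec]

# Toy 2x3 "mel power spectrogram" (made-up numbers, not ESC-50 data).
toy = [[1.0, 10.0, 100.0],
       [0.1, 1.0, 1000.0]]
db = power_to_db(toy)      # a 10x change in power becomes a 10 dB step
norm = normalize(db)       # same shape, centered around zero
```

The log step is what keeps a loud siren and a quiet rustle on a comparable numeric scale before the network sees them.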

Why Convert Audio to Mel Spectrograms?

A waveform is a raw 1D signal: each sample is the instantaneous amplitude, with every frequency component mixed together. CNNs, on the other hand, thrive on spatial patterns. Enter the mel spectrogram:

Transforms 1D audio → 2D image of frequency vs. time
Mel scale emphasizes lower frequencies (where human perception is sharper) and groups higher ones
Log scaling compresses huge dynamic ranges
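
To make the "emphasizes lower frequencies" point concrete, here is a small sketch of the mel mapping. It uses the HTK-style formula, which is one common convention; the helper names are hypothetical, and the band-edge function only illustrates the spacing rather than building a full filterbank.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: close to linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_mels):
    """Edges of n_mels overlapping bands, equally spaced in mel (not in Hz)."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

edges = mel_band_edges(0.0, 22050.0, 128)
first_band_width = edges[2] - edges[0]     # lowest band, in Hz
last_band_width = edges[-1] - edges[-3]    # highest band, in Hz
print(f"lowest band: {first_band_width:.1f} Hz wide")
print(f"highest band: {last_band_width:.1f} Hz wide")
```

The lowest bands come out far narrower than the highest ones, which is the "fine resolution down low, coarse grouping up high" behavior described above.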

Benefits for CNNs:

Shifting in time just slides the spectrogram horizontally → CNNs detect the same pattern at different positions
Time and frequency become separate axes, so local patterns such as harmonics and onsets show up as 2D shapes that a convolution filter can match
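
The "same pattern at different positions" property can be checked with a toy example. This sketch works on a single 1D row (think of one mel bin over time) to keep it short; the kernel and signal values are made up.

```python
def correlate_valid(signal, kernel):
    """Slide kernel over signal (no padding); return the dot product at each offset."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1, -1, 2]                            # a toy "pattern detector"
pattern = [3, 1, 4]
early = [0, 0] + pattern + [0, 0, 0, 0, 0]     # pattern starts at index 2
late = [0, 0, 0, 0, 0] + pattern + [0, 0]      # same pattern, 3 steps later

out_early = correlate_valid(early, kernel)
out_late = correlate_valid(late, kernel)

# The response has the same shape in both cases; only its position moves.
shift = out_late.index(max(out_late)) - out_early.index(max(out_early))
print(shift)
```

Shifting the input by three steps shifts the filter response by three steps and changes nothing else, which is why one learned filter can serve every time position in the clip.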

In my inference pipeline (main.py), I used:

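# Assumes the usual aliases:
#   import torch.nn as nn
#   import torchaudio.transforms as T
self.transform = nn.Sequential(
    # Waveform -> mel power spectrogram
    T.MelSpectrogram(
        sample_rate=44100,
        n_fft=2048,
        hop_length=512,
        n_mels=128,
        f_min=0,
        f_max=22050  # Nyquist frequency for 44.1 kHz audio
    ),
    # Power -> decibels (log scale)
    T.AmplitudeToDB()
)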

The CNN Architecture

I didn’t just build any CNN – I implemented a ResNet‑34 inspired model adapted for spectrograms. Here’s the full architecture:

[Figure: 34-layer ResNet model]

Code (from `model.py`)

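import torch.nn as nn

# ResidualBlock is defined elsewhere in model.py (not shown in this excerpt).

class AudioClassifier(nn.Module):
    def __init__(self, num_classes=50):
        super().__init__()
        # Stem: 7x7 conv (stride 2) -> BN -> ReLU -> 3x3 max pool (stride 2)
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 64, 7, 2, 3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, 2, 1),
        )
        # Four residual stages with 3, 4, 6, 3 blocks (the ResNet-34 layout).
        # The first block of layer3-5 halves the resolution and widens the channels.
        self.layer2 = nn.ModuleList([ResidualBlock(64, 64) for _ in range(3)])
        self.layer3 = nn.ModuleList(
            [ResidualBlock(64 if i == 0 else 128, 128, stride=2 if i == 0 else 1) for i in range(4)])
        self.layer4 = nn.ModuleList(
            [ResidualBlock(128 if i == 0 else 256, 256, stride=2 if i == 0 else 1) for i in range(6)])
        self.layer5 = nn.ModuleList(
            [ResidualBlock(256 if i == 0 else 512, 512, stride=2 if i == 0 else 1) for i in range(3)])
        # Head: global average pool -> dropout -> linear classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(512, num_classes)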

Visual Structure

Initial Conv Layer: 7×7, stride=2, 64 channels, followed by BN, ReLU, MaxPool.
Residual Blocks:
- Layer2: 3× blocks (64→64)
- Layer3: 4× blocks (64→128)
- Layer4: 6× blocks (128→256)
- Layer5: 3× blocks (256→512)
- Residual connections are the secret sauce.
  - Without shortcuts: Each block must learn the full mapping.
  - With shortcuts: The block only learns the difference (residual). If nothing needs to change, it can output 0 and pass input forward untouched.
  - Backprop advantage: Gradients flow through both the main path and the shortcut, which mitigates vanishing gradients.

Formally:

With shortcut: output = F(x) + x
Without shortcut: output = F(x)

This makes deep networks stable and easier to train.
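
The post shows AudioClassifier but not ResidualBlock itself. Below is a minimal sketch of a block that would fit the ResidualBlock(in, out, stride) calls in model.py. It follows the standard ResNet basic-block layout (two 3×3 convolutions plus a shortcut); the 1×1 projection shortcut, the padding, and all attribute names are assumptions, not the author's code.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + shortcut(x))."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # F(x): two 3x3 convolutions; only the first may downsample.
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # If the shape changes, project x with a 1x1 conv so it can be added to F(x).
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))   # F(x) + x

# First block of layer3: 64 -> 128 channels, resolution halved.
block = ResidualBlock(64, 128, stride=2)
x = torch.randn(2, 64, 32, 64)
y = block(x)
print(tuple(y.shape))
```

When input and output shapes match, the shortcut is a plain identity, so a block whose F(x) is near zero passes its input through almost unchanged.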

Residual connections were critical – they allowed me to go deep without vanishing gradients, making this model much more stable.

Head: AdaptiveAvgPool → Flatten → Dropout(0.5) → Linear(512→50)
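
To see how the (1, 128, 256) input shrinks on its way to that head, the feature-map sizes can be traced with the standard convolution output formula. This is a sketch under one assumption: the residual blocks use 3×3 convolutions with padding 1, and only the first block of layer3 to layer5 has stride 2, as in the constructor calls above.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_shapes(height=128, width=256):
    """Return (stage, channels, height, width) after each stage of the network."""
    shapes = []
    # Stem: 7x7 conv (stride 2, padding 3), then 3x3 max pool (stride 2, padding 1).
    h, w = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)
    shapes.append(("conv7x7", 64, h, w))
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    shapes.append(("maxpool", 64, h, w))
    # Residual stages: the stride of the first block sets the stage's resolution.
    for name, channels, stride in [("layer2", 64, 1), ("layer3", 128, 2),
                                   ("layer4", 256, 2), ("layer5", 512, 2)]:
        h, w = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
        shapes.append((name, channels, h, w))
    # Adaptive average pooling collapses whatever is left to 1x1.
    shapes.append(("avgpool", 512, 1, 1))
    return shapes

for row in trace_shapes():
    print(row)

# Weighted layers: stem conv + 2 convs per residual block + final linear layer.
n_layers = 1 + 2 * (3 + 4 + 6 + 3) + 1
print(n_layers)
```

The same count of weighted layers is where the "34" in ResNet‑34 comes from, and the 512 pooled values are exactly what Linear(512→50) expects.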

Why Residual CNNs for Audio?

7×7 kernel at start → captures broad time–frequency context.
Residual blocks with 3×3 convs → capture local harmonics and transient events.
Skip connections → let gradients flow through deep stacks.
Global Average Pooling → compact representation before classification, fewer params than big dense layers.
Dropout 0.5 → essential for such a small dataset.

Pooling Explained

Pooling reduces spatial size, keeping salient features:
- MaxPool: Keeps the strongest activation in a region → captures sharp onsets like clicks or knocks.
- AveragePool (at the end): Summarizes entire channels, creating compact embeddings.

Pooling and strided convolutions are why the receptive field grows so quickly: each downsampling step lets later filters cover a longer time span of sound as we go deeper.
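
The growth of the receptive field can be worked out with a short recurrence. This is a sketch along one axis only (time frames); it assumes each residual block in layer2 contains two 3×3 convolutions with stride 1, which the post does not show explicitly.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride). Returns (receptive field, jump) along one axis."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) current steps
        jump *= stride              # striding moves later outputs further apart
    return rf, jump

stem = [(7, 2), (3, 2)]             # 7x7 conv (stride 2), then 3x3 max pool (stride 2)
layer2 = [(3, 1)] * 6               # 3 residual blocks x 2 convs of 3x3 (assumed)

rf_stem, jump_stem = receptive_field(stem)
rf_layer2, _ = receptive_field(stem + layer2)
print(rf_stem, jump_stem)   # frames seen by one unit after the stem, and its step size
print(rf_layer2)            # frames seen by one unit after layer2
```

Because the stem multiplies the step size by four, every later 3×3 convolution adds eight input frames of context instead of two.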

Dropout Explained

Dropout randomly zeroes neurons during training. Why?

Forces redundancy → model can’t “cheat” by relying on a single feature.
Helps prevent overfitting, especially critical for small datasets like ESC-50.
In my model, p=0.5 before the final linear layer was key.
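
What dropout does to a feature vector can be shown in a few lines. This is a plain-Python sketch of inverted dropout, the variant where surviving values are scaled by 1/(1-p) during training so that nothing needs rescaling at inference; the function name and toy vector are made up.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each value with probability p and rescale the survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)              # inference: pass through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                         # fixed seed so the run is repeatable
features = [1.0] * 8                           # stand-in for pooled features
train_out = dropout(features, p=0.5, rng=rng)
eval_out = dropout(features, p=0.5, training=False)
print(train_out)
print(eval_out)
```

With p=0.5, each training step sees a different random half of the 512 pooled features, so the final linear layer cannot lean on any single one.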

Training Setup

Loss: CrossEntropyLoss
Optimizer: Adam (lr=0.0005, weight_decay=0.01)
Scheduler: StepLR (decay every 10 epochs)
Batch size: 32
Epochs: 100
GPU: Serverless NVIDIA H100
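
The StepLR schedule above is simple enough to write out by hand. In this sketch the base rate and the 10-epoch step come from the list, but the decay factor gamma is an assumption: the post does not state it, and 0.5 is used here purely for illustration.

```python
def step_lr(epoch, base_lr=0.0005, step_size=10, gamma=0.5):
    """Learning rate at a given epoch: multiplied by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

for epoch in (0, 9, 10, 50, 99):
    print(epoch, step_lr(epoch))
```

In PyTorch the same schedule is torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=...). Over 100 epochs the rate is cut nine times, so the choice of gamma largely decides how much learning still happens in the later epochs.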

Results

Validation Accuracy: ~88%
Macro F1: ~0.83
Per‑class: strong on distinct sounds (dog, siren), weaker on overlapping (rain vs pouring water).
Confusions: Urban sounds (car horn vs train) often overlapped.
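
Macro F1 is the unweighted mean of the per-class F1 scores, so one confused class pulls it down even when overall accuracy looks fine. The sketch below computes it for a made-up handful of predictions; the labels echo the confusions above but the numbers are toy data, not the model's actual outputs.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, where F1 = 2*TP / (2*TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: one "rain" clip is mistaken for "pouring_water".
y_true = ["dog", "dog", "rain", "rain", "pouring_water", "pouring_water"]
y_pred = ["dog", "dog", "rain", "pouring_water", "pouring_water", "pouring_water"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

A single rain/pouring-water mix-up lowers the F1 of both classes involved, which is how overlapping classes can drag macro F1 below accuracy.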

CNNs Explained in Brief (For Audio Beginners)

I used this project as an excuse to master CNNs in audio:

Stride = downsampling: reduces resolution while increasing abstraction.
Receptive field grows with depth, eventually covering long syllables or ambient textures.
Residual learning means “learn the difference” instead of the whole mapping → makes deep nets trainable.
Waveforms vs Spectrograms: On raw waveforms, the same sound can look very different sample by sample (phase, tiny timing offsets), so patterns are hard for a CNN to match. On spectrograms, a time shift is just a horizontal move → the CNN detects the same pattern.
Convolutions: Filters slide over time–frequency patches, learning harmonic or transient detectors.
Pooling: Adds invariance by focusing on salient energy patterns.
Think of it this way: a spectrogram is like sheet music, and CNN filters are the musicians who learn to spot melodies, rhythms, and harmonics no matter where they appear.

Lessons Learned

SpecAugment is non‑negotiable with small audio datasets.
Dropout saved me from massive overfitting.
Residual blocks made training so much smoother than plain CNNs.
Confusions reveal dataset limitations as much as model ones.
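
For readers who have not met SpecAugment, its core idea is to blank out a random band of mel bins and a random span of frames in each training spectrogram. The sketch below is a simplified plain-Python version (one mask of each kind, no time warping); the parameter names, mask sizes, and the zero fill value are assumptions rather than the settings used in this project.

```python
import random

def spec_augment(spec, max_freq_mask=2, max_time_mask=3, rng=random, fill=0.0):
    """Return a copy of spec (list of mel rows) with one frequency band and one time span masked."""
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]              # never modify the input in place

    f = rng.randint(0, max_freq_mask)           # mask height in mel bins (may be 0)
    f0 = rng.randint(0, n_mels - f)             # first masked bin
    for m in range(f0, f0 + f):
        out[m] = [fill] * n_frames

    t = rng.randint(0, max_time_mask)           # mask width in frames (may be 0)
    t0 = rng.randint(0, n_frames - t)           # first masked frame
    for row in out:
        row[t0:t0 + t] = [fill] * t
    return out

spec = [[1.0] * 10 for _ in range(6)]           # toy 6-bin x 10-frame spectrogram
aug = spec_augment(spec, rng=random.Random(0))
for row in aug:
    print(row)
```

In a torchaudio pipeline, T.FrequencyMasking and T.TimeMasking provide the same two masks. With only 40 clips per class, showing the network a slightly different version of each clip every epoch is a cheap way to stretch the data.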

Closing

This build was personal: I didn’t just want to “use” CNNs, I wanted to understand them deeply in the context of sound. Now, I can hear a spectrogram and picture the convolutional filters lighting up. If you’re starting your audio ML journey, I hope this post gives you both a blueprint and intuition.
