I Built a Convolutional Neural Network that understands Audio

TL;DR
I trained a convolutional neural network (CNN) based on a ResNet‑34 style residual architecture to classify audio clips from the ESC‑50 dataset (50 environmental sound classes). Using log-mel spectrograms as input, the model reached about 88% validation accuracy (macro F1 around 0.83), with dropout and adaptive average pooling in the head for robustness. This post is both a learning journal and a mini‑tutorial on CNNs for audio.
Check it out tho —> Link
Why I Built This
I’ve always been fascinated by how we can “see sound.” Turning waveforms into spectrograms and then letting a CNN discover structure in those images felt like the perfect way to combine my curiosity about audio with my love for deep learning architectures.
This project wasn’t just about hitting a benchmark – it was about really understanding CNNs and how they map onto audio data. So here’s my honest build log, decisions, failures, and what I’d improve.
Dataset: ESC‑50
I worked with the ESC‑50 dataset:
2000 audio clips (5 seconds each)
50 balanced classes covering animals, natural soundscapes, human sounds, domestic, and urban noise

Each class has 40 examples, which makes it a small but clean dataset – ideal for a controlled CNN experiment but definitely requiring augmentation.
Turning Audio into Images (Spectrograms)
My preprocessing pipeline:
Resampled audio to 44100 Hz
STFT with n_fft=1024, hop_length=512
Converted to mel scale with n_mels=128
Took log amplitude (dB)
Normalized each spectrogram
This gave me consistent (1, 128, 256) tensors for each clip (1 channel, 128 mel bins, 256 time steps).
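A quick sanity check on that shape, as a minimal sketch. With a centered STFT (the torchaudio default), a clip of n samples gives 1 + n // hop_length frames, so a 5-second clip at 44.1 kHz with hop_length=512 comes out at 431 frames, not 256. The post doesn't say how the 256 was reached, so the `fit_to_length` helper below (a hypothetical name) shows one common option, cropping or padding along time. The `normalize` helper is also an assumption: per-spectrogram standardization is only one of several ways to "normalize each spectrogram".

```python
import numpy as np

def num_frames(n_samples: int, hop_length: int) -> int:
    """Frame count of a centered STFT (signal padded by n_fft // 2 on each side)."""
    return 1 + n_samples // hop_length

def fit_to_length(spec: np.ndarray, target_frames: int, pad_value: float = 0.0) -> np.ndarray:
    """Crop or right-pad a (n_mels, time) spectrogram to exactly target_frames columns."""
    n_mels, t = spec.shape
    if t >= target_frames:
        return spec[:, :target_frames]
    out = np.full((n_mels, target_frames), pad_value, dtype=spec.dtype)
    out[:, :t] = spec
    return out

def normalize(spec: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Assumed normalization: zero mean, unit variance per spectrogram."""
    return (spec - spec.mean()) / (spec.std() + eps)

frames = num_frames(5 * 44100, 512)  # 431 frames for a 5 s clip
# Random stand-in for a real log-mel spectrogram of that length.
spec = np.random.default_rng(0).normal(size=(128, frames)).astype(np.float32)
fixed = normalize(fit_to_length(spec, 256))[None, ...]  # add channel axis
print(frames, fixed.shape)
```

Whatever the actual route to 256 frames was, the point is that every clip has to end up with the same width before batching.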
Why Convert Audio to Mel Spectrograms?
Waveforms are just raw 1D signals — each point is the sum of all frequencies at that instant. But CNNs thrive on spatial patterns. Enter the mel spectrogram:
Transforms 1D audio → 2D image of frequency vs. time
Mel scale emphasizes lower frequencies (where human perception is sharper) and groups higher ones
Log scaling compresses huge dynamic ranges
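To make the mel and log steps concrete, here is a small sketch in plain Python. It uses the HTK-style mel formula (which, as far as I know, is what torchaudio's `MelSpectrogram` uses by default); the function names and the 1e-10 floor are illustrative choices, not taken from the project code.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> list:
    """Edges of n_mels triangular filters: n_mels + 2 points equally spaced in mel."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_mels + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_mels + 2)]

def power_to_db(power: float, floor: float = 1e-10) -> float:
    """Log compression: a 10x change in power becomes a 10 dB step."""
    return 10.0 * math.log10(max(power, floor))

edges = mel_band_edges(128, 0.0, 22050.0)
low_width = edges[1] - edges[0]      # spacing of the lowest band edges, in Hz
high_width = edges[-1] - edges[-2]   # spacing of the highest band edges, in Hz
print(f"lowest band step: {low_width:.1f} Hz, highest band step: {high_width:.1f} Hz")
```

The band edges are tightly packed at low frequencies and spread out at high ones, which is exactly the "emphasize lows, group highs" behavior described above.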
Benefits for CNNs:
Shifting in time just slides the spectrogram horizontally → CNNs detect the same pattern at different positions
Makes the representation more robust to time shifts and more structured for CNNs
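The "same pattern, different position" point can be checked numerically. Below is a toy sketch in NumPy: one random 3×3 filter, a made-up "event" placed at two different times, and a plain cross-correlation loop standing in for a conv layer. Everything here (sizes, names, the random event) is illustrative.

```python
import numpy as np

def correlate2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Plain 'valid' 2D cross-correlation, the operation a conv layer computes."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))
event = rng.normal(size=(16, 6))   # a short sound: 16 mel bins x 6 frames

# Toy spectrogram: silence with the event starting at frame 5.
spec = np.zeros((16, 40))
spec[:, 5:11] = event

# Same event, 12 frames later.
shifted = np.zeros((16, 40))
shifted[:, 17:23] = event

resp = correlate2d_valid(spec, kernel)
resp_shifted = correlate2d_valid(shifted, kernel)

# The filter response is the same, just moved 12 columns to the right.
print(np.allclose(resp_shifted[:, 12:], resp[:, :-12]))
```

The filter never has to relearn the event for each position; its response simply moves with the sound.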
In my inference pipeline (main.py), I used:
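import torch.nn as nn
import torchaudio.transforms as T

# Set up inside a class (note the self.): waveform -> mel spectrogram -> decibels
self.transform = nn.Sequential(
    T.MelSpectrogram(
        sample_rate=44100,
        n_fft=2048,
        hop_length=512,
        n_mels=128,
        f_min=0,
        f_max=22050
    ),
    T.AmplitudeToDB()
)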
The CNN Architecture
I didn’t just build any CNN – I implemented a ResNet‑34 inspired model adapted for spectrograms. Here’s the full architecture:

Code (from model.py)
import torch.nn as nn

# ResidualBlock is not shown in this excerpt.
class AudioClassifier(nn.Module):
    def __init__(self, num_classes=50):
        super().__init__()
        # Stem: 7x7 conv (stride 2) + BN + ReLU + 3x3 max pool (stride 2)
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 64, 7, 2, 3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, 2, 1),
        )
        # Residual stages with 3, 4, 6, 3 blocks (the ResNet-34 layout)
        self.layer2 = nn.ModuleList([ResidualBlock(64, 64) for _ in range(3)])
        self.layer3 = nn.ModuleList([ResidualBlock(64 if i == 0 else 128, 128, stride=2 if i == 0 else 1) for i in range(4)])
        self.layer4 = nn.ModuleList([ResidualBlock(128 if i == 0 else 256, 256, stride=2 if i == 0 else 1) for i in range(6)])
        self.layer5 = nn.ModuleList([ResidualBlock(256 if i == 0 else 512, 512, stride=2 if i == 0 else 1) for i in range(3)])
        # Head: global average pool -> dropout -> linear classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(512, num_classes)
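To see what this stack does to a (1, 128, 256) input, here is a small shape trace in plain Python using the standard conv/pool output-size formula. One assumption to flag: `ResidualBlock` isn't shown in the post, so the trace assumes the usual ResNet basic block (3×3 convs with padding 1, stride applied in the first block of each stage).

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a conv/pool layer along one axis (dilation 1, floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_shapes(n_mels: int = 128, n_frames: int = 256) -> list:
    """Return (name, channels, height, width) after each stage of the model."""
    h, w = n_mels, n_frames
    shapes = []
    # layer1: Conv2d(1, 64, 7, 2, 3) followed by MaxPool2d(3, 2, 1)
    h, w = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)
    shapes.append(("conv7x7", 64, h, w))
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    shapes.append(("maxpool", 64, h, w))
    # layer2..layer5: (name, channels, stride of the first block).
    # Assumed: 3x3 convs with padding 1, so only the stride changes the size.
    for name, channels, stride in [("layer2", 64, 1), ("layer3", 128, 2),
                                   ("layer4", 256, 2), ("layer5", 512, 2)]:
        h, w = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
        shapes.append((name, channels, h, w))
    # AdaptiveAvgPool2d((1, 1)) collapses whatever is left to 1x1.
    shapes.append(("avgpool", 512, 1, 1))
    return shapes

for name, c, h, w in trace_shapes():
    print(f"{name:8s} -> ({c}, {h}, {w})")
```

Under those assumptions, each stage halves both axes while doubling the channels, ending at a 512×4×8 feature map before global average pooling.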
Visual Structure
Initial Conv Layer: 7×7, stride=2, 64 channels, followed by BN, ReLU, MaxPool.
Residual Blocks:
Layer2: 3× blocks (64→64)
Layer3: 4× blocks (64→128)
Layer4: 6× blocks (128→256)
Layer5: 3× blocks (256→512)
Residual connections are the secret sauce.
Without shortcuts: Each block must learn the full mapping.
With shortcuts: The block only learns the difference (residual). If nothing needs to change, it can output 0 and pass input forward untouched.
Backprop advantage: Gradients flow through both the main path and the shortcut, preventing vanishing gradients.
Formally:
With shortcut: output = F(x) + x
Without shortcut: output = F(x)
Residual connections were critical: they let me go deep without vanishing gradients, which made training much more stable.

Head: AdaptiveAvgPool → Flatten → Dropout(0.5) → Linear(512→50)
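The post doesn't include the `ResidualBlock` class itself, so here is a hedged reconstruction: a standard ResNet basic block with the same constructor signature the model code calls (`in_channels, out_channels, stride`). The layer names, the 1×1 projection shortcut, and the final ReLU are assumptions based on the usual ResNet design, not the original model.py.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic two-conv residual block (assumed design, not the original code)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Identity shortcut when shapes match, 1x1 projection otherwise.
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))            # F(x)
        return self.relu(out + self.shortcut(x))   # F(x) + x, then ReLU

block = ResidualBlock(64, 64).eval()
x = torch.rand(1, 64, 32, 64)   # non-negative, like a post-ReLU feature map

# Force the residual branch F(x) to zero: the block then passes x through untouched.
nn.init.zeros_(block.bn2.weight)   # bn2.bias is already zero by default
with torch.no_grad():
    y = block(x)
print(torch.allclose(y, x))

# First block of a stage: halves the spatial size and doubles the channels.
down = ResidualBlock(64, 128, stride=2).eval()
with torch.no_grad():
    z = down(x)
print(tuple(z.shape))
```

Since the stages in `AudioClassifier` are `nn.ModuleList`s rather than `nn.Sequential`s, a forward pass has to loop over the blocks in each stage explicitly.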
Why Residual CNNs for Audio?
7×7 kernel at start → captures broad time–frequency context.
Residual blocks with 3×3 convs → capture local harmonics and transient events.
Skip connections → let gradients flow through deep stacks.
Global Average Pooling → compact representation before classification, fewer params than big dense layers.
Dropout 0.5 → essential for such a small dataset.
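The "fewer params" claim for global average pooling is easy to put numbers on. The sketch below compares the classifier head with and without pooling; the 4×8 final feature-map size is an assumption (it's what a (1, 128, 256) input would give with standard 3×3, padding-1 residual blocks).

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

num_classes = 50
channels = 512
h, w = 4, 8   # assumed final feature-map size before pooling

# Global average pooling: each channel collapses to one number -> Linear(512, 50).
gap_head = linear_params(channels, num_classes)

# Flattening the whole feature map instead -> Linear(512 * 4 * 8, 50).
flatten_head = linear_params(channels * h * w, num_classes)

print(gap_head)       # 25650
print(flatten_head)   # 819250
```

Roughly a 32× difference in head parameters, which matters a lot when there are only 2000 clips to learn from.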


Pooling Explained
Pooling reduces spatial size, keeping salient features:
MaxPool: Keeps the strongest activation in a region → captures sharp onsets like clicks or knocks.
AveragePool (at the end): Summarizes entire channels, creating compact embeddings.
Pooling, together with strided convolutions, is a big part of why the receptive field grows and why the model sees larger time spans of sound as we go deeper.
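Here is a tiny NumPy illustration of the two pooling flavors on a made-up feature map with a single sharp "click". The `pool2d` helper is a simplified stand-in (no padding, one channel), not how PyTorch implements it.

```python
import numpy as np

def pool2d(x: np.ndarray, size: int, stride: int, mode: str = "max") -> np.ndarray:
    """Non-padded 2D pooling over a single-channel (H, W) array."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

# Toy feature map: low background activity with one sharp onset (a "click").
fmap = np.full((4, 8), 0.1)
fmap[1, 5] = 3.0

max_pooled = pool2d(fmap, size=2, stride=2, mode="max")
avg_pooled = pool2d(fmap, size=2, stride=2, mode="mean")
global_avg = fmap.mean()   # what AdaptiveAvgPool2d((1, 1)) computes per channel

print(max_pooled[0, 2], avg_pooled[0, 2], global_avg)
```

Max pooling keeps the click at full strength (3.0), average pooling over the same window dilutes it, and the global average squeezes the whole map into a single summary value.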
Dropout Explained
Dropout randomly zeroes neurons during training. Why?
Forces redundancy → model can’t “cheat” by relying on a single feature.
Helps prevent overfitting, especially critical for small datasets like ESC-50.
In my model, p=0.5 before the final linear layer was key.
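For intuition, here is what dropout does to a 512-dimensional feature vector, sketched in NumPy. This is the "inverted dropout" scheme that `nn.Dropout` uses (survivors are scaled by 1 / (1 - p) during training, and the layer is a no-op at inference); the function itself is an illustrative stand-in.

```python
import numpy as np

def dropout(x: np.ndarray, p: float = 0.5, training: bool = True, rng=None) -> np.ndarray:
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    if not training or p == 0.0:
        return x   # inference: pass through unchanged
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

features = np.ones(512)   # stand-in for the pooled 512-dim embedding
train_out = dropout(features, p=0.5, training=True, rng=np.random.default_rng(0))
eval_out = dropout(features, p=0.5, training=False)

print(np.unique(train_out))          # only 0.0 and 2.0 appear
print((train_out == 0).mean())       # roughly half the units are dropped
print(np.array_equal(eval_out, features))
```

Because a different random half of the embedding vanishes on every step, the final linear layer can't lean on any single feature.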
Training Setup
Loss: CrossEntropyLoss
Optimizer: Adam (lr=0.0005, weight_decay=0.01)
Scheduler: StepLR (decay every 10 epochs)
Batch size: 32
Epochs: 100
GPU: Serverless NVIDIA H100
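Wired together in PyTorch, that setup looks roughly like the sketch below. Two things are assumptions: the StepLR `gamma` (the post only says "decay every 10 epochs", so 0.5 is a placeholder), and the tiny stand-in model and random batch, which are there only so the snippet runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and data so the snippet is self-contained; swap in the real ones.
model = nn.Sequential(nn.Flatten(), nn.Linear(128 * 16, 50))
inputs = torch.randn(32, 1, 128, 16)        # batch size 32
targets = torch.randint(0, 50, (32,))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, weight_decay=0.01)
# step_size=10 matches "decay every 10 epochs"; gamma=0.5 is an assumed value.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

lrs = []
for epoch in range(20):                      # the post trains for 100 epochs
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()                         # once per epoch, after optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])

print(lrs[8], lrs[9], lrs[19])
```

With these placeholder values the learning rate stays at 0.0005 for the first ten epochs, then halves every ten epochs after that.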
Results
Validation Accuracy: ~88%
Macro F1: ~0.83
Per‑class: strong on distinct sounds (dog, siren), weaker on overlapping (rain vs pouring water).
Confusions: Urban sounds (car horn vs train) often overlapped.
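Macro F1 can sit below accuracy when a few classes are confused with each other, because it averages per-class F1 with equal weight. A minimal pure-Python sketch on a toy 3-class example (the labels are made up for illustration, not taken from the project's results):

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: class 2 is always mistaken for class 1.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1]

print(round(accuracy(y_true, y_pred), 3))      # 0.667
print(round(macro_f1(y_true, y_pred, 3), 3))   # 0.556
```

One fully confused class drags macro F1 down harder than accuracy, since the class it's confused with also loses precision.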
CNNs Explained in Brief (For Audio Beginners)
I used this project as an excuse to master CNNs in audio:
Stride = downsampling: reduces resolution while increasing abstraction.
Receptive field grows with depth, eventually covering long syllables or ambient textures.
Residual learning means “learn the difference” instead of the whole mapping → makes deep nets trainable.
Waveforms vs Spectrograms: On raw waveforms, shifting time changes the whole input → CNN struggles. On spectrograms, shift = horizontal move → CNN detects the same pattern.
Convolutions: Filters slide over time–frequency patches, learning harmonic or transient detectors.
Pooling: Adds invariance by focusing on salient energy patterns.
Think of it this way: a spectrogram is like sheet music, and CNN filters are the musicians who learn to spot melodies, rhythms, and harmonics no matter where they appear.
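To put numbers on "receptive field grows with depth", here is a small calculator using the standard recurrence: each layer widens the field by (kernel - 1) times the current spacing between outputs, and each stride multiplies that spacing. The layer lists assume two 3×3 convs per residual block, which is the usual ResNet basic block and not confirmed by the post.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) along one axis. Returns (field, jump)."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # widen by (k - 1) steps at the current spacing
        jump *= stride              # stride multiplies the spacing between outputs
    return rf, jump

stem = [(7, 2), (3, 2)]                             # 7x7 conv, then 3x3 max pool
layer2 = [(3, 1), (3, 1)] * 3                       # three blocks, two 3x3 convs each
layer3 = [(3, 2), (3, 1)] + [(3, 1), (3, 1)] * 3    # first conv of the stage has stride 2

print(receptive_field(stem))                        # (11, 4)
print(receptive_field(stem + layer2))               # (59, 4)
print(receptive_field(stem + layer2 + layer3))      # (179, 8)
```

Under those assumptions, a single unit after the third stage already looks at 179 spectrogram frames along the time axis, which is most of a 256-frame clip.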
Lessons Learned
SpecAugment is non‑negotiable with small audio datasets.
Dropout saved me from massive overfitting.
Residual blocks made training so much smoother than plain CNNs.
Confusions reveal dataset limitations as much as model ones.
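Since SpecAugment comes up in the lessons but not in the code above, here is a minimal NumPy sketch of its two masking steps (one frequency band, one time span). The mask sizes and the zero fill value are illustrative assumptions, not the settings used in training; in a torchaudio pipeline, the `FrequencyMasking` and `TimeMasking` transforms cover the same idea.

```python
import numpy as np

def spec_augment(spec: np.ndarray, max_freq_mask: int = 16, max_time_mask: int = 32, rng=None) -> np.ndarray:
    """Zero one random band of mel bins and one random span of time frames."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()                                  # leave the input untouched
    n_mels, n_frames = out.shape

    f = int(rng.integers(0, max_freq_mask + 1))        # mask height in mel bins
    f0 = int(rng.integers(0, n_mels - f + 1))          # where the band starts
    out[f0:f0 + f, :] = 0.0

    t = int(rng.integers(0, max_time_mask + 1))        # mask width in frames
    t0 = int(rng.integers(0, n_frames - t + 1))        # where the span starts
    out[:, t0:t0 + t] = 0.0
    return out

# Random stand-in for a normalized (128, 256) log-mel spectrogram.
spec = np.random.default_rng(0).normal(size=(128, 256))
augmented = spec_augment(spec, rng=np.random.default_rng(1))

masked_rows = int((augmented == 0).all(axis=1).sum())
masked_cols = int((augmented == 0).all(axis=0).sum())
print(augmented.shape, masked_rows, masked_cols)
```

Applied fresh on every training step, this gives the model a slightly different view of each of the 40 clips per class every epoch.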
Closing
This build was personal: I didn’t just want to “use” CNNs, I wanted to understand them deeply in the context of sound. Now, I can hear a spectrogram and picture the convolutional filters lighting up. If you’re starting your audio ML journey, I hope this post gives you both a blueprint and intuition.

