Engineering · April 30, 2026 · 8 min read

Three audio tools, four rounds, zero servers

A piano roll → scale detector, a BPM + key analyzer, and a guitar tuner. All built in four rounds. None of them uploads anything; all run entirely inside the browser tab. Here's the algorithm behind each, and why no-server was the cheapest correct choice.

What we shipped

3 tools · 4 rounds · 0 function slots used · ~2k lines of JS

Each tool fits in one HTML file. No build step. No backend. No WASM. Just the Web Audio API and ~500 lines of JavaScript per tool, including all the DSP.

Why all-browser was the right call

The instinct on these features is to upload files to a server, analyze them with a serious DSP library (librosa, essentia), and return JSON. That's how Tunebat works. It's also how every paid alternative works.

We're on Vercel Hobby with 12 functions and a 4.5MB request body limit. Uploading audio would burn function-execution time on every analysis, hit the body cap on anything over a few minutes long, and force us to think about rate limits and abuse on day one. None of which we want to think about.

The Web Audio API is built into every browser. It exposes decoded float samples and an FFT-capable AnalyserNode. Modern JS engines run hand-rolled DSP at speeds that surprise people who haven't tried recently. So: do the DSP in the browser, no upload, no function slot, audio never leaves the device. That becomes the marketing line too.

The fastest way to handle a request is to never receive it.

Tool 1 — Piano roll → scale detector

The smallest one. No DSP. Just a 2-octave piano keyboard rendered as flex children (white keys) and absolutely-positioned slivers (black keys) on top.

Pressing keys builds a set of pitch classes (0–11). The matcher tries all 144 combinations (12 roots × 12 scales) and asks: what fraction of the pressed notes does this scale contain? Perfect fits (100%) are flagged in green; partial fits follow. Ties break toward smaller scales, so pentatonic beats major when both contain every pressed note.

// SCALES: [{ name, intervals, rank }, ...] — intervals are semitone offsets from the root
function matchScales(pcs) {
  const results = [];
  for (let root = 0; root < 12; root++) {
    for (const scale of SCALES) {
      const scaleSet = new Set(scale.intervals.map(iv => (iv + root) % 12));
      const matched = pcs.filter(p => scaleSet.has(p)).length;
      const fit = matched / pcs.length;
      if (fit >= 0.5) results.push({ root, scale, fit });
    }
  }
  // Best fit first; ties go to the smaller scale, then to a fixed popularity rank
  return results.sort((a, b) =>
    b.fit - a.fit ||
    a.scale.intervals.length - b.scale.intervals.length ||
    a.scale.rank - b.scale.rank);
}

One side benefit of doing this in JS: each pressed key plays a tone immediately. Web Audio's OscillatorNode with a triangle wave and a quick attack-decay envelope sounds clean enough that nobody complains. ~12 lines of audio code.
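
The playback path can be sketched in a few lines. This is a hedged sketch, not the shipped code: `playNote` and its envelope timings are illustrative, and it assumes an `AudioContext` (`ctx`) created elsewhere; only `midiToFreq` is the standard MIDI-to-frequency conversion.

```javascript
// Standard MIDI-number-to-frequency conversion (A4 = MIDI 69 = 440 Hz)
function midiToFreq(midi) {
  return 440 * Math.pow(2, (midi - 69) / 12);
}

// Illustrative key-press tone: triangle oscillator with a quick
// attack-decay envelope. `ctx` is an AudioContext created elsewhere.
function playNote(ctx, midi, duration = 0.4) {
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();
  osc.type = 'triangle';
  osc.frequency.value = midiToFreq(midi);
  const now = ctx.currentTime;
  gain.gain.setValueAtTime(0, now);
  gain.gain.linearRampToValueAtTime(0.3, now + 0.01);            // ~10 ms attack
  gain.gain.exponentialRampToValueAtTime(0.001, now + duration); // decay to silence
  osc.connect(gain).connect(ctx.destination);
  osc.start(now);
  osc.stop(now + duration);
}
```

The envelope matters more than the waveform: starting the gain at zero and ramping up over ~10ms is what removes the click on key press.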

Tool 2 — BPM + Key Detector

The biggest one. Two algorithms in one tool.

BPM via energy-envelope autocorrelation

Beats are periodic energy peaks. The trick is finding the period without getting confused by the audio itself. Three steps:

  1. Build an energy envelope. Square the audio samples (turns negative-going air pressure into positive energy), box-filter into 5ms buckets (smooths over individual snare hits), subtract the running mean. You get a 200Hz signal that pulses with the beat.
  2. Autocorrelate at lags corresponding to 60–200 BPM. For each candidate lag, slide the envelope against itself and compute the dot product. The lag that produces the strongest correlation is the period.
  3. Pick the peak, convert to BPM. BPM = 60 / (lag / envSr).
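
The three steps above can be sketched in about 25 lines. This is a minimal version for illustration: the function and variable names are mine, not the shipped code, and it returns only the single best lag (no half/double-time alternates).

```javascript
// Sketch of energy-envelope autocorrelation BPM estimation.
// samples: Float32Array of mono audio, sampleRate in Hz.
function estimateBpm(samples, sampleRate) {
  // 1. Energy envelope: square, box-filter into 5 ms buckets, remove the mean.
  const bucket = Math.round(sampleRate * 0.005); // 5 ms of samples per bucket
  const envSr = sampleRate / bucket;             // ~200 Hz envelope rate
  const env = [];
  for (let i = 0; i + bucket <= samples.length; i += bucket) {
    let e = 0;
    for (let j = i; j < i + bucket; j++) e += samples[j] * samples[j];
    env.push(e);
  }
  const mean = env.reduce((a, b) => a + b, 0) / env.length;
  for (let i = 0; i < env.length; i++) env[i] -= mean;
  // 2. Autocorrelate at lags spanning 60–200 BPM.
  const minLag = Math.round(envSr * 60 / 200); // 200 BPM → shortest period
  const maxLag = Math.round(envSr * 60 / 60);  // 60 BPM → longest period
  let bestLag = minLag, bestCorr = -Infinity;
  for (let lag = minLag; lag <= maxLag; lag++) {
    let corr = 0;
    for (let i = 0; i + lag < env.length; i++) corr += env[i] * env[i + lag];
    if (corr > bestCorr) { bestCorr = corr; bestLag = lag; }
  }
  // 3. Convert the winning lag back to BPM.
  return 60 * envSr / bestLag;
}
```

Feeding it a synthetic click train at 120 BPM recovers the tempo; real recordings need the alternate-tempo handling described next.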

Half/double-time ambiguity is the classic failure mode: a track whose true tempo is 75 BPM also produces a strong autocorrelation peak at 150 BPM. We display both alternates in the UI and let the producer pick.

Key via chromagram + Krumhansl-Schmuckler

Three steps again:

  1. FFT the audio. Slide a 4096-sample window with 50% overlap across the first 30 seconds. Apply a Hann taper to each window before transforming.
  2. Map FFT bins to pitch classes. Each FFT bin has a frequency. Convert to MIDI via 12·log₂(f/440)+69, then mod 12 to get the pitch class (C=0, C#=1, …). Sum power per pitch class across all windows. L1-normalize. You get a 12-element chroma vector.
  3. Match against Krumhansl-Schmuckler key profiles. Each of 24 keys (12 major + 12 minor) has a characteristic 12-element profile derived from cognitive-psych studies. Compute Pearson correlation between chroma (rotated to each candidate tonic) and each profile. The max wins.
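
Step 3 can be sketched as follows. The profile weights are the standard Krumhansl-Kessler values from the literature; `bestKey`, the rotation convention, and the note names are illustrative, not the shipped code.

```javascript
// Krumhansl-Kessler key profiles (tonic at index 0)
const MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88];
const MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17];
const NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B'];

function pearson(a, b) {
  const n = a.length;
  const ma = a.reduce((s, v) => s + v, 0) / n;
  const mb = b.reduce((s, v) => s + v, 0) / n;
  let num = 0, da = 0, db = 0;
  for (let i = 0; i < n; i++) {
    num += (a[i] - ma) * (b[i] - mb);
    da += (a[i] - ma) ** 2;
    db += (b[i] - mb) ** 2;
  }
  return num / Math.sqrt(da * db);
}

// chroma: 12-element vector from step 2. Tries all 24 keys, returns the best.
function bestKey(chroma) {
  let best = { name: '', r: -Infinity };
  for (let root = 0; root < 12; root++) {
    // Rotate chroma so the candidate tonic sits at index 0
    const rotated = chroma.map((_, i) => chroma[(i + root) % 12]);
    for (const [profile, mode] of [[MAJOR, 'major'], [MINOR, 'minor']]) {
      const r = pearson(rotated, profile);
      if (r > best.r) best = { name: `${NAMES[root]} ${mode}`, r };
    }
  }
  return best.name;
}
```

A chroma vector shaped like the C major profile correlates perfectly with C major and less with everything else, which is exactly why strongly tonal tracks work well and ambiguous ones don't.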

The classic Krumhansl-Schmuckler weakness is relative major/minor ambiguity — C major and A minor share all the same notes, so chroma alone can't always tell them apart. Modal tracks (A dorian, D mixolydian) confuse it too. The right improvement here is HMM-smoothed temporal models, but for browser-side pop/hip-hop the simple version is right ~80% of the time, which is fine.

The hand-rolled Cooley-Tukey FFT runs in maybe 3ms for a 4096-point transform. The whole 30-second analysis completes in well under a second on any modern device. There is no reason to send this audio to a server.

Tool 3 — Guitar tuner

Real-time pitch detection from the mic. The classic algorithm here is YIN (de Cheveigné & Kawahara, 2002). It's a 3-step improvement on autocorrelation:

  1. Difference function instead of autocorrelation. d(τ) = Σ (x[i] - x[i+τ])². Where autocorrelation looks for similarity, the difference function looks for zero (the period of the wave). Cleaner peaks.
  2. Cumulative mean normalization. d'(τ) = d(τ) / ((1/τ) · Σⱼ d(j)). Suppresses false positives from the trivial τ=0 minimum.
  3. Absolute threshold + parabolic interpolation. Find the first τ where d'(τ) drops below 0.10. Fit a parabola through the surrounding three points to get sub-bin frequency resolution.

function yinPitch(buf, sampleRate, threshold = 0.10) {
  const halfN = Math.floor(buf.length / 2);
  // Step 1: difference function d(τ) = Σ (x[i] − x[i+τ])²
  const d = new Float32Array(halfN);
  for (let t = 1; t < halfN; t++)
    for (let i = 0; i < halfN; i++) d[t] += (buf[i] - buf[i + t]) ** 2;
  // Step 2: cumulative mean normalization d'(τ) = d(τ)·τ / Σⱼ≤τ d(j)
  const dPrime = new Float32Array(halfN);
  dPrime[0] = 1;
  for (let t = 1, sum = 0; t < halfN; t++) {
    sum += d[t];
    dPrime[t] = d[t] * t / sum;
  }
  // Step 3a: first dip under the threshold, walked down to its local minimum
  let tau = -1;
  for (let t = 2; t < halfN; t++) {
    if (dPrime[t] < threshold) {
      while (t + 1 < halfN && dPrime[t + 1] < dPrime[t]) t++;
      tau = t; break;
    }
  }
  if (tau === -1) return -1; // no periodicity found
  // Step 3b: parabolic interpolation through the three points around the minimum
  const a = dPrime[tau - 1], b = dPrime[tau], c = dPrime[Math.min(tau + 1, halfN - 1)];
  const denom = a - 2 * b + c;
  const refinedTau = denom ? tau + (a - c) / (2 * denom) : tau;
  return sampleRate / refinedTau;
}

The freq → note conversion is one line: midi = 12·log₂(f/440)+69. Cents off perfect is the fractional part times 100. Standard guitar tuning (EADGBE) is six target frequencies; we pick the nearest in cents space (not Hz space, otherwise E4 would false-match E2).
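
Both conversions are short enough to show whole. This is a sketch under the assumptions in the text: the string frequencies are standard-tuning values, and `freqToNote` / `nearestString` are illustrative names.

```javascript
// freq → nearest MIDI note plus signed cents offset from it
function freqToNote(freq) {
  const midi = 12 * Math.log2(freq / 440) + 69;
  const nearest = Math.round(midi);
  return { midi: nearest, cents: (midi - nearest) * 100 };
}

// Standard-tuning target frequencies for the six strings
const STRINGS = [
  { name: 'E2', freq: 82.41 }, { name: 'A2', freq: 110.00 },
  { name: 'D3', freq: 146.83 }, { name: 'G3', freq: 196.00 },
  { name: 'B3', freq: 246.94 }, { name: 'E4', freq: 329.63 },
];

// Signed distance in cents from a detected frequency to a target
function centsOff(freq, target) {
  return 1200 * Math.log2(freq / target);
}

// Pick the nearest string in cents space, not Hz space
function nearestString(freq) {
  let best = STRINGS[0];
  for (const s of STRINGS)
    if (Math.abs(centsOff(freq, s.freq)) < Math.abs(centsOff(freq, best.freq)))
      best = s;
  return { string: best.name, cents: centsOff(freq, best.freq) };
}
```

The cents-vs-Hz distinction shows up at low frequencies: a 96 Hz input is slightly closer to E2 (82.41 Hz) than to A2 (110 Hz) in raw Hz, but in cents it sits +264 from E2 and only −236 from A2, so cents space correctly routes it to the A string.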

What I'd change in v2

Three things:

  1. Move the heavy DSP to a Web Worker. Right now the BPM/key analysis runs on the main thread. On long files (5+ minutes) it can lock up the UI for ~500ms. A worker would keep the page interactive and is a 30-line change.
  2. Use AudioWorkletNode for the tuner. AnalyserNode + ScriptProcessorNode is the legacy path; AudioWorklet runs in a dedicated audio thread with tighter latency guarantees. The tuner already feels real-time, but AudioWorklet would buy us margin on lower-end devices.
  3. Persist results in URL params. If you drop a file and the page detects 140 BPM in C minor, the URL could become /tools/bpm-detector?bpm=140&key=Cm for sharing the result with bandmates. Doesn't require re-uploading.

The general lesson

Every "send the file to a server" feature deserves the question: what would it take to do this in the browser? The answer is usually less work than you assume. Web Audio decodes anything, FFT and autocorrelation are 30-line algorithms, mic input is one getUserMedia call. The privacy story ("audio never leaves your device") is a feature in itself.

Cost: zero function slots, zero monthly egress. Three tools shipped on a plan that wouldn't have allowed any of them as server-side features.

Try them.

All three tools live, free, no signup.

Open the tools page →