2026-04-08 · perf · 6 min
Edge LLM at 14 kB: what fits, what doesn't
We benchmarked a small ONNX model over WebGPU on a 2019 ThinkPad. Verdict: yes, it's doable. No, it's not magic.
Experiment exp_001 targets a specific case: running a text classifier with fewer than 50 million parameters entirely in the browser, with no server call. The stack: an INT8-quantised ONNX runtime, WebGPU for compute, and a loader that fetches the weights once per session.
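A minimal loader sketch, assuming onnxruntime-web with its WebGPU execution provider (depending on the version, the WebGPU-enabled bundle may need to be imported explicitly). The model URL and the memoisation scheme are illustrative, not exp_001's actual artifacts.

```ts
import * as ort from "onnxruntime-web";

// Hypothetical path; exp_001's actual artifact is not named in the post.
const MODEL_URL = "/models/classifier-int8.onnx";

let sessionPromise: Promise<ort.InferenceSession> | null = null;

// Create the session on first use, then reuse it: the weights are
// fetched and the WebGPU pipeline set up once per page session.
export function getSession(): Promise<ort.InferenceSession> {
  sessionPromise ??= ort.InferenceSession.create(MODEL_URL, {
    executionProviders: ["webgpu", "wasm"], // WASM fallback if WebGPU is unavailable
  });
  return sessionPromise;
}
```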
On a ThinkPad T490 (Intel i5-8265U, integrated UHD 620) we hit the target: 740 ms median latency over 1,000 inferences. On a MacBook Air M2 it's 90 ms. The 8× gap is entirely attributable to the GPU.
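For reference, a sketch of how such a median can be measured, reusing the loader above. The warm-up count and the input tensor (name, shape, dtype) are assumptions; they depend on the actual model.

```ts
import * as ort from "onnxruntime-web";
import { getSession } from "./session"; // the loader sketched earlier

// Hypothetical input: tensor name and shape depend on the model.
function makeInput(): Record<string, ort.Tensor> {
  const ids = new BigInt64Array(128).fill(1n);
  return { input_ids: new ort.Tensor("int64", ids, [1, 128]) };
}

export async function medianLatencyMs(runs = 1000): Promise<number> {
  const session = await getSession();
  const feeds = makeInput();

  // Warm-up runs to amortise shader compilation and first-touch costs.
  for (let i = 0; i < 10; i++) await session.run(feeds);

  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const t0 = performance.now();
    await session.run(feeds);
    samples.push(performance.now() - t0);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(runs / 2)]; // median
}
```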
What doesn't fit: anything that requires long autoregressive generation. Past 200 tokens, quadratic attention blows up memory, since the attention score matrix grows with the square of the sequence length. So yes to classification, tagging, and short completion; no to open-ended chat.
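To make the quadratic term concrete, a back-of-envelope on the attention score matrices: per layer, each head materialises an n×n matrix. The head count and element size below are illustrative assumptions, not the benchmarked model's config.

```ts
// Bytes for the attention score matrices of a single layer:
// one n×n matrix per head, each element `bytesPerEl` wide.
function attnScoreBytes(seqLen: number, heads = 12, bytesPerEl = 2): number {
  return heads * seqLen * seqLen * bytesPerEl;
}

for (const n of [200, 1000, 4000]) {
  console.log(n, (attnScoreBytes(n) / 2 ** 20).toFixed(1), "MiB");
}
// 200  → ~0.9 MiB, 1000 → ~22.9 MiB, 4000 → ~366.2 MiB:
// a 20× longer sequence costs ~400× the memory.
```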