FoxStudio

2026-04-08 · perf · 6 min

Edge LLM at 14 kB: what fits, what doesn't

We benchmarked a small ONNX model under WebGPU on a 2019 ThinkPad. Verdict: yes, it's doable; no, it's not magic.

Experiment exp_001 targets one specific case: running a text classifier under 50 million parameters fully in the browser, with no server call. The stack: an INT8-quantised model on the ONNX runtime, WebGPU for compute, and a loader that fetches the weights once per session.
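The once-per-session weight fetch reduces to a memoized async loader. A minimal sketch, assuming the onnxruntime-web API in the usage comment; `oncePerSession`, the model path, and the variable names are ours, not exp_001's:

```javascript
// Memoize an expensive async load so it runs at most once per session.
// Later calls reuse the same promise, whether in flight or resolved.
function oncePerSession(loadFn) {
  let promise = null;
  return () => {
    if (promise === null) promise = loadFn();
    return promise;
  };
}

// Hypothetical usage with onnxruntime-web (model path is illustrative):
// const getSession = oncePerSession(() =>
//   ort.InferenceSession.create('/models/classifier-int8.onnx', {
//     executionProviders: ['webgpu'],
//   }));
// const session = await getSession(); // weights fetched only on first call
```

Caching the promise rather than the result also deduplicates concurrent callers: two components asking for the session at the same time trigger a single fetch.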

On a ThinkPad T490 (Intel i5-8265U, integrated UHD 620) we hit the target: 740 ms median latency over 1,000 inferences. On a MacBook Air M2 it drops to 90 ms. That ~8x gap is largely attributable to the GPU.
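Reporting the median rather than the mean matters here: it discards GC pauses and first-run shader compilation. A sketch of the timing loop we'd expect, using the standard `performance.now()` clock (function names are ours):

```javascript
// Median of an array of latency samples in milliseconds.
function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Even count: average the two middle values.
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Run an async inference `iterations` times and report the median latency.
async function benchmark(runInference, iterations = 1000) {
  const samples = [];
  for (let i = 0; i < iterations; i += 1) {
    const t0 = performance.now();
    await runInference();
    samples.push(performance.now() - t0);
  }
  return median(samples);
}
```

A warm-up pass before the measured loop (not shown) keeps shader compilation out of the samples entirely.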

What doesn't fit: anything requiring long autoregressive generation. Past roughly 200 tokens, attention's quadratic memory cost blows up. So yes to classification, tagging, and short completion; no to open-ended chat.
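The quadratic blow-up is easy to put numbers on: the attention score matrix alone is seq_len² per head, per layer. A back-of-envelope estimator, assuming 8 heads and fp32 scores (our illustrative assumptions, not exp_001's measured config):

```javascript
// Bytes for the attention score matrices of one transformer layer:
// one seqLen x seqLen matrix per head, at bytesPerEl bytes per element.
// Head count and dtype are illustrative defaults, not exp_001's values.
function attnScoreBytes(seqLen, heads = 8, bytesPerEl = 4) {
  return seqLen * seqLen * heads * bytesPerEl;
}

console.log(attnScoreBytes(200));  // 1280000 bytes (~1.3 MB per layer)
console.log(attnScoreBytes(2000)); // 128000000 bytes (128 MB per layer)
```

Growing the context 10x costs 100x the memory: fine at 200 tokens on an integrated GPU, untenable at chat-length sequences.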