For years, the default answer to “where should we run our AI model?” was a GPU cluster in us-east-1. That answer is rapidly becoming wrong. On-device AI inference on mobile phones has crossed a critical threshold: sub-20ms latency for production computer vision models. This changes the calculus for every product team shipping AI features.
Let’s break down why this matters, what the framework landscape looks like in 2026, and how to decide when on-device is the right call for your product.
Why On-Device AI Matters Now
Four forces are converging to make on-device inference the default for a growing class of AI workloads:
Privacy by architecture. When data never leaves the device, you don’t need to write a privacy policy for it. No server logs, no data residency questions, no breach surface. For applications processing faces, medical images, or financial documents, this isn’t a nice-to-have---it’s a regulatory requirement in an increasing number of jurisdictions.
Latency that unlocks new UX. There’s a qualitative difference between a 200ms cloud round trip and 15ms of on-device inference. The former feels like a response. The latter feels like perception. Real-time camera overlays, instant document scanning, live gesture recognition---these experiences require latency budgets that no cloud endpoint can meet.
Offline-first reliability. Edge AI works on airplanes, in basements, in rural areas, and in every other place where connectivity is unreliable. For field service apps, point-of-care diagnostics, or industrial inspection tools, offline capability isn’t optional.
Cost at scale. Cloud inference pricing is per-request. On-device inference has near-zero marginal cost after the initial model download. For high-frequency inference workloads---think frame-by-frame video analysis---the cost difference is orders of magnitude.
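A back-of-envelope cost model shows why the gap is orders of magnitude for per-frame workloads. The prices below are illustrative assumptions, not quotes from any provider:

```python
# Per-request cloud inference vs. a one-time model download.
# All prices are illustrative assumptions.

def cloud_cost(requests: int, price_per_1k: float) -> float:
    return requests / 1000 * price_per_1k

def on_device_cost(model_mb: float, egress_per_gb: float) -> float:
    # Marginal cost after download is near zero; delivery is paid once.
    return model_mb / 1024 * egress_per_gb

# Ten minutes of frame-by-frame video analysis at 30 fps:
requests = 30 * 60 * 10  # 18,000 inference calls
cloud = cloud_cost(requests, price_per_1k=0.10)          # assumed $0.10/1k calls
edge = on_device_cost(model_mb=20, egress_per_gb=0.09)   # assumed CDN pricing
print(round(cloud, 2), round(edge, 4))  # the gap grows with every frame
```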
Real-World Benchmark: 20ms Computer Vision on Mobile
The industry has achieved a milestone worth highlighting. Production-grade object detection models (YOLOv8-equivalent architectures) are now running at sub-20ms inference times on mid-range mobile devices (2024-2025 chipsets with standard NPU/GPU acceleration). This isn’t a cherry-picked benchmark on flagship hardware---it’s reproducible across the Android and iOS ecosystem on devices that cost $400.
Here’s what that looks like in practice (measured on 2024-2025 mobile SoCs with NPU acceleration; expect the upper end of each range on mid-range hardware):
- Image classification (MobileNetV3): 4-8ms on-device
- Object detection (optimized YOLO variants): 12-20ms on-device
- Pose estimation (MoveNet Thunder): 15-25ms on-device
- Text recognition (CRNN-based OCR): 10-18ms on-device
These numbers are achieved with INT8 quantization and framework-level optimizations, running on the device’s neural processing unit where available, falling back to GPU, and using CPU as a last resort. The key insight: quantization-aware training has matured to the point where INT8 models show negligible accuracy loss compared to FP32 for most production use cases.
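The arithmetic behind INT8 deployment is affine quantization: a float range is mapped onto the integers [-128, 127] via a scale and zero point. The sketch below shows that mapping in plain Python; it illustrates the math, not any framework’s actual implementation:

```python
# Affine (asymmetric) INT8 quantization: map floats in [fmin, fmax]
# onto [-128, 127], then dequantize back. Rounding error is bounded
# by scale/2, which is why accuracy loss is small for most models.

def quant_params(fmin: float, fmax: float, qmin=-128, qmax=127):
    scale = (fmax - fmin) / (qmax - qmin)
    zero_point = round(qmin - fmin / scale)
    return scale, zero_point

def quantize(x: float, scale: float, zp: int, qmin=-128, qmax=127) -> int:
    return max(qmin, min(qmax, round(x / scale) + zp))

def dequantize(q: int, scale: float, zp: int) -> float:
    return (q - zp) * scale

scale, zp = quant_params(-1.0, 1.0)
q = quantize(0.5, scale, zp)
x = dequantize(q, scale, zp)
# x recovers 0.5 to within scale/2; out-of-range values clamp to ±127
```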
Framework Landscape: Where to Start
Three frameworks dominate the on-device AI space in 2026. Each occupies a distinct niche.
ExecuTorch
Meta’s ExecuTorch has emerged as the leading framework for deploying PyTorch models to mobile and edge devices. It’s designed from the ground up for on-device execution, with a small runtime footprint and first-class support for hardware acceleration via delegates (CoreML on iOS, XNNPACK for CPU, Qualcomm QNN for Snapdragon NPUs).
Getting started with ExecuTorch is straightforward if you’re already in the PyTorch ecosystem. The workflow is: train in PyTorch, export with torch.export, lower to ExecuTorch, optimize with quantization, and deploy. The toolchain handles operator fusion, memory planning, and delegate dispatch automatically.
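The export pipeline described above can be sketched roughly as follows. This is an untested outline following ExecuTorch’s documented flow; exact module paths and the quantization/delegate hooks vary by version, and `MyTrainedModel` is a placeholder:

```python
import torch
from executorch.exir import to_edge  # module path may vary by version

# 1. Start from a trained PyTorch model in eval mode.
model = MyTrainedModel().eval()  # placeholder for your trained model
example_inputs = (torch.randn(1, 3, 224, 224),)

# 2. Capture the graph with torch.export.
exported = torch.export.export(model, example_inputs)

# 3. Lower to the edge dialect, then to the ExecuTorch runtime format.
#    Quantization and delegate partitioning hook in around this step.
program = to_edge(exported).to_executorch()

# 4. Serialize the .pte file that ships with the app.
with open("model.pte", "wb") as f:
    f.write(program.buffer)
```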
ExecuTorch’s strength is its tight integration with the PyTorch training ecosystem, meaning you can take a model from research to on-device deployment without switching frameworks.
React Native ExecuTorch
For teams building cross-platform mobile apps, React Native ExecuTorch bridges the gap between the JavaScript application layer and native on-device inference. It provides React hooks and components that wrap ExecuTorch’s native runtime, letting you run models with a few lines of TypeScript.
A React Native ExecuTorch integration typically looks like this:
```typescript
import { useModel } from 'react-native-executorch';

function MyComponent() {
  const model = useModel(require('./model.pte'));

  // forward() is async, so call it from an event handler or effect
  const handleFrame = async (inputTensor) => {
    const result = await model.forward(inputTensor);
    // result available in ~15ms, no network call
  };
}
```
This is significant because it brings on-device AI to the enormous React Native ecosystem without requiring teams to write native modules or manage platform-specific inference code. The abstraction handles model loading, memory management, and hardware delegate selection across both iOS and Android.
If you’re evaluating on-device AI for a React Native application, our on-device AI implementation services cover the full pipeline from model selection through production deployment.
ONNX Runtime Mobile
Microsoft’s ONNX Runtime remains the most framework-agnostic option. If your models are trained in TensorFlow, JAX, or scikit-learn, ONNX provides a universal intermediate representation that runs on mobile with competitive performance. Its execution provider architecture supports CoreML, NNAPI, and DirectML, making it a solid choice for teams with heterogeneous model origins.
The trade-off: ONNX Runtime’s generality means it can lag behind ExecuTorch on PyTorch-specific optimizations, and the runtime footprint is larger.
What’s Coming: The Next Wave
Several trends will push on-device AI further in the next 12-18 months:
On-device text-to-speech. Lightweight TTS models (sub-100MB) are approaching a real-time factor of 1 on mobile NPUs, meaning speech is synthesized as fast as it plays back. This enables fully offline voice assistants, accessibility features, and conversational UI without cloud dependency. Expect production-ready on-device TTS to become table stakes by late 2026.
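Real-time factor (RTF) is just synthesis time divided by the duration of the audio produced. A quick sketch, with illustrative numbers rather than benchmarks:

```python
# Real-time factor for TTS: synthesis time / audio duration.
# RTF < 1 means the device generates speech faster than it plays back,
# so audio can stream without gaps.

def real_time_factor(synthesis_s: float, audio_s: float) -> float:
    return synthesis_s / audio_s

def can_stream(synthesis_s: float, audio_s: float) -> bool:
    return real_time_factor(synthesis_s, audio_s) < 1.0

# Illustrative assumption: 2 s to synthesize 5 s of audio.
print(real_time_factor(2.0, 5.0))  # 0.4
print(can_stream(2.0, 5.0))        # True
```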
Real-time vision-camera pipelines. Frameworks are adding first-class support for camera frame pipelines---zero-copy buffer sharing between the camera HAL and the inference engine. This eliminates the frame copy overhead that currently adds 3-5ms to camera-based inference, pushing total pipeline latency below 10ms.
GPU and NPU acceleration maturity. Qualcomm’s QNN SDK, Apple’s CoreML, and MediaTek’s NeuroPilot are converging on a common capability set. The fragmentation that plagued on-device AI in 2023-2024 is resolving. Hardware delegates are becoming reliable enough to target by default rather than as an optimization step.
On-device fine-tuning. Early experiments with on-device LoRA adaptation are showing promise. The vision: models that personalize to user behavior without sending data to the cloud. This is still experimental, but the memory and compute requirements are approaching feasibility on flagship devices.
The Edge vs Cloud Decision Matrix
Choosing between on-device and cloud AI isn’t binary---it’s a spectrum. We’ve developed a framework we call The Edge vs Cloud Decision Matrix to help product teams make this decision systematically.
The matrix evaluates five dimensions, each scored from 1 (strongly favors cloud) to 5 (strongly favors edge):
| Dimension | Cloud-Leaning (1-2) | Edge-Leaning (4-5) |
|---|---|---|
| Latency Sensitivity | Tolerates 200ms+ round-trip; async workflows | Requires <50ms; real-time camera/sensor processing |
| Privacy Constraints | Non-sensitive data; acceptable ToS coverage | Biometric, medical, financial data; regulatory exposure |
| Connectivity Profile | Always-online; reliable low-latency network | Intermittent, offline-first, or high-latency environments |
| Model Complexity | LLMs, large diffusion models, >1B parameters | Classification, detection, segmentation, <100M parameters |
| Inference Volume | Low-frequency (<100 calls/user/day) | High-frequency (per-frame, continuous, >1000 calls/user/day) |
How to use it: Score each dimension for your use case. Sum the scores.
- 5-12: Cloud-first. Your workload benefits from cloud scale. Use on-device only for latency-critical preprocessing.
- 13-18: Hybrid. Split the pipeline. Run lightweight preprocessing and filtering on-device, send only relevant data to cloud models.
- 19-25: Edge-first. On-device inference is your primary architecture. Use cloud only for model updates and aggregated analytics.
The critical insight behind this matrix is that model complexity is the gating dimension. If your use case requires a model with more than ~500M parameters, on-device is not yet viable on mobile hardware regardless of how the other dimensions score. Conversely, if your model fits within the on-device compute envelope, the other four dimensions almost always tip the scales toward edge deployment.
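The scoring procedure is simple enough to express directly. This sketch implements the matrix exactly as described above (five dimensions, 1-5 each, thresholds at 12 and 18); the dimension keys are our shorthand:

```python
# The Edge vs Cloud Decision Matrix as a scoring function: five
# dimensions scored 1 (favors cloud) to 5 (favors edge), summed,
# then bucketed by the thresholds given in the article.

DIMENSIONS = ("latency", "privacy", "connectivity", "model_complexity", "volume")

def classify(scores: dict) -> str:
    assert set(scores) == set(DIMENSIONS)
    assert all(1 <= s <= 5 for s in scores.values())
    total = sum(scores.values())
    if total <= 12:
        return "cloud-first"
    if total <= 18:
        return "hybrid"
    return "edge-first"

# Manufacturing quality inspection, scored as in the example that follows:
inspection = dict(latency=5, privacy=3, connectivity=5, model_complexity=4, volume=5)
print(classify(inspection))  # edge-first
```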
Applying the Matrix: Two Examples
Example 1: Real-time quality inspection in manufacturing. Latency: 5, Privacy: 3, Connectivity: 5, Model Complexity: 4, Volume: 5. Total: 22. This is a clear edge-first workload. The factory floor has unreliable connectivity, inspection happens at line speed, and the model (defect detection) is well within on-device capability.
Example 2: Customer support chatbot. Latency: 2, Privacy: 2, Connectivity: 1, Model Complexity: 1, Volume: 2. Total: 8. Cloud-first. The model is large (LLM), the user is online, and response time tolerance is high.
Getting Started
If your Edge vs Cloud Decision Matrix score points toward on-device, here’s the practical path:
- Start with a proven architecture. Don’t train from scratch. Use MobileNetV3, EfficientNet-Lite, or YOLOv8-nano as your starting point and fine-tune on your domain data.
- Quantize early. Integrate quantization-aware training from the start, not as a post-training optimization. INT8 quantization typically yields 2-4x speedup with <1% accuracy loss.
- Profile on real devices. Emulator performance is meaningless for on-device AI. Test on the lowest-tier device in your support matrix.
- Choose your framework based on your stack. PyTorch models go to ExecuTorch. React Native apps use React Native ExecuTorch. Multi-framework teams consider ONNX Runtime.
- Plan for model updates. On-device models need an update mechanism. Design your OTA model delivery pipeline before you ship.
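The last point, OTA model delivery, reduces to a version check plus integrity verification before swapping models. A minimal sketch; the manifest format (version tuple plus SHA-256 checksum) is an assumption for illustration, not a standard:

```python
import hashlib

# Sketch of an OTA model-update check. The manifest shape below is a
# hypothetical example, not part of any framework's API.

def needs_update(installed_version: tuple, manifest: dict) -> bool:
    return tuple(manifest["version"]) > tuple(installed_version)

def verify_download(blob: bytes, manifest: dict) -> bool:
    # Never swap in a model whose checksum doesn't match the manifest.
    return hashlib.sha256(blob).hexdigest() == manifest["sha256"]

blob = b"fake model bytes"  # placeholder for a downloaded .pte file
manifest = {"version": (1, 3, 0), "sha256": hashlib.sha256(blob).hexdigest()}

print(needs_update((1, 2, 1), manifest))  # True
print(verify_download(blob, manifest))    # True
```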
On-device AI in 2026 is no longer a research curiosity or a niche optimization. It’s a production-ready architecture that delivers better privacy, lower latency, and lower cost for a significant class of AI workloads. The frameworks are mature, the hardware is capable, and the tooling is accessible.
The question is no longer “can we run this on-device?” It’s “why aren’t we?”
April 2026 Update
Since this post was published, several developments are worth tracking:
ExecuTorch v0.8 shipped with expanded computer vision task coverage (instance segmentation, OCR, vision-language models) and improved QNN delegate stability on Snapdragon 8 Elite. The React Native ExecuTorch wrapper now exposes the full CV stack through task-specific hooks — see the Computer Vision in React Native guide for a complete walkthrough.
Benchmark update: On-device inference numbers have held steady relative to the March benchmarks. The Snapdragon 8 Elite continues to be the performance leader on Android via the QNN NPU delegate, with INT8 object detection reliably hitting the 12-15ms range. MediaTek Dimensity 9400 devices are closing the gap.
LLM on-device progress: Small language models (Llama 3.2 1B, Qwen2.5 0.5B) are now running at interactive speeds on flagship Android devices. For a practical guide to shipping these in a React Native app, see on-device LLMs for Android with React Native ExecuTorch.
Further reading:
- Best On-Device LLMs for Android in 2026: React Native ExecuTorch Guide — model benchmarks, download size considerations, and production code examples
- Computer Vision in React Native with ExecuTorch v0.8 — classification, detection, segmentation, and OCR on-device
Need help implementing on-device AI in your product? Our team specializes in on-device AI deployment across ExecuTorch and React Native, from model optimization through production rollout.