Privacy regulations are tightening. Users are increasingly wary of cloud-processed data. And mobile chipsets in 2026 — Apple A18, Qualcomm Snapdragon 8 Elite, MediaTek Dimensity 9400 — are closing the gap with server-grade inference hardware faster than anyone predicted.
The result: running a capable language model entirely on a user’s phone is no longer an engineering curiosity. It’s a production architecture choice. For React Native developers, React Native ExecuTorch — Software Mansion’s wrapper around Meta’s ExecuTorch runtime — is the most idiomatic path to get there.
That said, this is a pre-1.0 library. Model format compatibility can break between ExecuTorch versions, Android performance varies wildly across chipsets, and the 400MB–1.5GB model downloads will get your app uninstalled if you mishandle them. What follows is a practical guide to the models worth considering today.
Why ExecuTorch, and Why Now
ExecuTorch is Meta’s production on-device inference framework. It powers on-device AI features in WhatsApp and Instagram at scale. It’s PyTorch-native — no ONNX conversion, no format gymnastics — and ships first-class support for the hardware acceleration backends that matter on mobile:
| Backend | Platform | Notes |
|---|---|---|
| XNNPACK | iOS + Android | Default CPU path, ARM NEON optimized, works everywhere |
| CoreML | iOS only | Apple Neural Engine + GPU — headline performance on iPhone |
| Vulkan | Android | GPU compute, broad device support |
| QNN | Android (Snapdragon) | Qualcomm NPU/DSP — best Android performance on Snapdragon 8 Elite |
React Native ExecuTorch wraps this runtime with an idiomatic TypeScript API: hooks for text generation, image classification, object detection, segmentation, and speech-to-text. Models are distributed as pre-quantized .pte files hosted on Hugging Face, so you’re not compiling anything yourself.
The Models
The ranking below is based on download volume from the Software Mansion Hugging Face org — the most direct signal of what the community is actually deploying. All five models have first-party .pte files maintained by Software Mansion.
1. Llama 3.2 — The Reference Model (17.5k downloads)
Available sizes: 1B and 3B
Meta’s Llama 3.2 remains the most downloaded model in the ecosystem, and for good reason: it was designed from the ground up for edge deployment, not compressed down from something larger. The 1B and 3B variants cover different tradeoffs worth understanding separately.
Llama 3.2 1B is the right choice when response latency is visible to the user. At roughly 60–80 tokens/second on iPhone 15 Pro with CoreML (community-reported estimates — your numbers will vary by device and quantization), short responses feel nearly instant. On a Snapdragon 8 Elite it’s closer to 20–30 tok/s via XNNPACK; on mid-range Android, expect 10–15. The capability ceiling is real — don’t use it for anything requiring multi-step reasoning.
```tsx
import { useState } from 'react';
import { TextInput } from 'react-native';
import { useLLM } from 'react-native-executorch';

function TextAssistant() {
  const [output, setOutput] = useState('');

  const llm = useLLM({
    modelSource: 'https://huggingface.co/software-mansion/react-native-executorch-llama-3.2',
  });

  const generate = async (prompt: string) => {
    await llm.generate(prompt, {
      // Stream tokens into state as they arrive; render `output` wherever fits your UI
      onToken: (token: string) => setOutput((prev) => prev + token),
    });
  };

  return (
    <TextInput
      onSubmitEditing={(e) => generate(e.nativeEvent.text)}
      placeholder="Ask anything..."
    />
  );
}
```
Llama 3.2 3B is meaningfully better at instruction following and longer-form generation — worth the tradeoff for flagship-device apps. The .pte file runs around 1.3GB and needs roughly 2.5GB of RAM at runtime. On 6GB Android devices, that RAM pressure is real: the OS will terminate background apps, and if the system is under load, your model load can fail mid-session. Build a fallback for this.
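One shape that fallback can take is a thin wrapper that tries on-device generation first and degrades to a cloud call (or a canned response) when the model fails to load or run. A minimal sketch; `generateWithFallback` and both generator signatures are illustrative names, not part of react-native-executorch:

```typescript
type Generate = (prompt: string) => Promise<string>;

async function generateWithFallback(
  prompt: string,
  onDevice: Generate,
  cloud: Generate,
): Promise<{ text: string; usedFallback: boolean }> {
  try {
    // On 6GB Android devices, model load or inference can fail under memory pressure
    const text = await onDevice(prompt);
    return { text, usedFallback: false };
  } catch {
    // Degrade gracefully instead of surfacing a raw error to the user
    const text = await cloud(prompt);
    return { text, usedFallback: true };
  }
}
```

In practice the `onDevice` function wraps your `useLLM` call and `cloud` wraps whatever API client you already ship; the `usedFallback` flag lets you tell the user (or your analytics) which path served the response.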
Context window: 128K tokens on both sizes — longer than you’ll realistically use on mobile, but useful for document-heavy features.
2. Qwen 3 — The New Challenger (14.9k downloads)
Available sizes: 0.6B, 1.7B, 4B
Qwen 3 from Alibaba is the fastest-growing model in the Software Mansion ecosystem and the most significant update to this space in 2026. The key differentiator is its hybrid thinking mode: the model can switch between fast, direct responses and a slower chain-of-thought reasoning path depending on the task.
For mobile use cases, this matters more than it sounds. A single model can handle quick autocomplete at the 0.6B size, then step up to deliberate reasoning for a complex query — without you shipping multiple models. The 1.7B variant is the practical sweet spot for most apps.
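Qwen 3's documented soft switch is appending `/think` or `/no_think` to a user turn. Whether the Software Mansion `.pte` export honors these tags depends on the chat template baked into the export, so treat this as a sketch to verify against your model version rather than guaranteed behavior:

```typescript
// Toggle Qwen 3's thinking mode per request via its soft-switch tags.
// Verify the exported .pte's chat template actually respects them.
function buildQwenPrompt(userText: string, think: boolean): string {
  const tag = think ? "/think" : "/no_think";
  return `${userText} ${tag}`;
}
```

The pattern this enables: autocomplete and quick replies go out with `/no_think` for minimal latency, while a "deep answer" action in your UI re-sends the same query with `/think`.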
The 4B is at the edge of what’s viable for mainstream devices. It’s impressive, but the download size (~2GB+ quantized) and RAM requirements (~3.5GB runtime) mean you’re targeting flagship-only and should be explicit about that in your UX.
Qwen 3’s multilingual quality is also strong — 100+ languages, with particularly good performance in Chinese, Japanese, Korean, and Arabic. If Llama 3.2 is the English-first default, Qwen 3 is the model to reach for when your user base isn’t primarily English-speaking.
3. Hammer 2.1 — Efficient Instruction Following (2.45k downloads)
Available sizes: 0.5B, 1.5B, 3B
Hammer 2.1 is built on the Qwen 2.5 architecture and fine-tuned for instruction following and chat. It’s the third most downloaded LLM in the Software Mansion library, which is notable for a model that gets less coverage than Llama or Qwen.
The sizing matches the Qwen 2.5 family (0.5B, 1.5B, 3B), and the instruction-following tuning makes it particularly well-suited for structured tasks: JSON output, form assistants, classification pipelines. If Qwen 3 is the general-purpose upgrade, Hammer 2.1 occupies a narrower niche — but it does that niche well.
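Even with instruction-tuned models, JSON output from a small model often arrives wrapped in prose or code fences, so structured-output pipelines need defensive parsing. A minimal sketch (the helper name is ours, not a react-native-executorch API): extract the first `{...}` span and return `null` rather than crashing the UI on malformed output:

```typescript
// Defensively parse JSON out of a small-model response that may include
// surrounding prose, markdown fences, or trailing commentary.
function extractJson<T = unknown>(raw: string): T | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(raw.slice(start, end + 1)) as T;
  } catch {
    // Model emitted something brace-shaped but invalid; let the caller retry or degrade
    return null;
  }
}
```

A `null` result is your signal to retry with a firmer prompt ("respond with JSON only") or fall back to a non-structured code path.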
Worth evaluating as an alternative to Qwen 2.5 if your use case is heavy on structured output and you want a smaller, tighter model than the full Qwen 3 family.
4. SmolLM 2 — When the Device is the Constraint (1.97k downloads)
Available sizes: 135M, 360M, 1.7B
HuggingFace’s SmolLM 2 is the outlier in this list. The 135M and 360M variants are genuinely useful sizes that no other model in the ecosystem covers — small enough to run on mid-range and older devices that would struggle with a 1B+ model.
The tradeoff is honest: SmolLM 2 is measurably behind the Qwen 3 and Llama 3.2 families on most benchmarks. It’s an efficiency-first model, and that’s the only reason to reach for it. If your app has an always-on AI feature that runs in the background, or you’re targeting a broad device range including budget Android hardware, SmolLM 2 at 360M gets you something useful where other models won’t even load.
The 1.7B variant competes directly with Llama 3.2 1B on size but generally falls behind on quality. In most cases, Llama 3.2 1B is the better choice at that parameter count.
5. Qwen 2.5 — The Known Quantity (1.51k downloads)
Available sizes: 0.5B, 1.5B, 3B
Qwen 2.5 is the predecessor to Qwen 3. Now that Qwen 3 is available in the Software Mansion library, Qwen 2.5’s primary use case is projects that are already integrated and don’t need an upgrade, or niche scenarios where Qwen 3’s hybrid thinking mode adds unwanted latency overhead.
The 1.5B variant has particularly strong multilingual quality for its size — 29+ languages — and has been widely deployed in production apps targeting Southeast Asian and Middle Eastern markets. If you’re locked into ExecuTorch v0.6.0 and something in the Qwen 3 integration is giving you trouble, Qwen 2.5 is the stable fallback.
Choosing the Right Model
| Use Case | Recommended Model | Why |
|---|---|---|
| Autocomplete, quick responses | Llama 3.2 1B | Fastest community-reported inference speed |
| General in-app assistant | Llama 3.2 3B or Qwen 3 1.7B | Quality/size balance for flagship devices |
| Hybrid reasoning + fast responses | Qwen 3 1.7B | Thinking mode adapts to task complexity |
| Structured output, JSON schemas | Hammer 2.1 1.5B | Instruction-following tuned |
| Multilingual (non-English) | Qwen 3 1.7B | 100+ languages, strong non-Latin quality |
| Budget/mid-range devices | SmolLM 2 360M | Runs where 1B+ models don’t |
| Battery/efficiency first | SmolLM 2 1.7B | Designed for edge efficiency |
The device targeting rule: Targeting all iOS and Android devices means capping at 1.5B parameters and designing for graceful degradation when inference fails. Targeting flagship-only (iPhone 14+, Snapdragon 8 Gen 2+) opens up the 3B–4B range. Above that, you’re in beta-tester territory.
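That targeting rule can be expressed as a small runtime heuristic. The RAM thresholds below are rough numbers derived from this article's figures, not vendor guidance, and the model identifiers are illustrative; in a real app you would feed in something like `getTotalMemory()` from react-native-device-info (which reports bytes):

```typescript
// Pick a model tier from total device RAM (bytes). Thresholds are heuristics
// based on the runtime RAM figures discussed above, not vendor guidance.
function pickModel(totalRamBytes: number): string {
  const gb = totalRamBytes / 1024 ** 3;
  if (gb >= 8) return "llama-3.2-3b";   // flagship: 3B-class viable (~2.5GB runtime RAM)
  if (gb >= 6) return "qwen-3-1.7b";    // upper mid-range: cap near 1.5-2B
  if (gb >= 4) return "llama-3.2-1b";   // mainstream: 1B-class
  return "smollm-2-360m";               // budget devices: sub-1B only
}
```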
Beyond Text: The Full ExecuTorch Task Suite
React Native ExecuTorch covers more than language models. The same library exposes:
Vision: useImageClassification, useObjectDetection, useSegmentation, plus multimodal with LLaVA 1.5 and Florence-2.
Audio: useSpeechToText — Whisper for ASR. Combining it with an LLM gives you a fully on-device voice assistant:
```tsx
import { useState } from 'react';
import { useLLM, useSpeechToText } from 'react-native-executorch';

// Voice-to-text → LLM pipeline, entirely on device
function VoiceAssistant() {
  const [response, setResponse] = useState('');

  const stt = useSpeechToText({
    modelSource: 'https://huggingface.co/software-mansion/react-native-executorch-whisper-tiny.en',
  });
  const llm = useLLM({
    modelSource: 'https://huggingface.co/software-mansion/react-native-executorch-llama-3.2',
  });

  const handleVoiceInput = async (audioBuffer: Float32Array) => {
    const transcript = await stt.transcribe(audioBuffer);
    await llm.generate(transcript, {
      // Stream the answer into state; render `response` in your UI
      onToken: (token: string) => setResponse((prev) => prev + token),
    });
  };

  // No network call. No server. No data leaving the device.
}
```
Embeddings: useEmbeddings — BERT-class models for semantic search and similarity.
The all-MiniLM-L6-v2 embedding model has 891k downloads in the Software Mansion library — significantly more than any LLM. Semantic search and retrieval are apparently more popular production use cases than text generation, which matches what we see in client work.
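Once you have embedding vectors (for example, the 384-dimensional output of a MiniLM-class model), on-device semantic search reduces to cosine similarity over plain arrays. This is pure TypeScript with no ExecuTorch API involved:

```typescript
// Cosine similarity between two embedding vectors. Rank documents by
// similarity to a query embedding for on-device semantic search.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("dimension mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

For a few thousand documents, a brute-force scan with this function is fast enough on modern phones; you only need an approximate-nearest-neighbor index at much larger corpus sizes.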
Getting Started
```bash
npx expo install react-native-executorch
```
For bare React Native, you'll need to follow the native setup guide to add the ExecuTorch framework to your iOS and Android builds. The Expo config plugin handles this automatically.
Three things that will bite you in production if you skip them:
1. **Download the model on Wi-Fi, not on first launch.** Gate the download behind an explicit user action, show progress, and cache aggressively. A 700MB download on cellular is how you get a 1-star review.

2. **Handle model load failure explicitly.** On constrained devices, `modelLoadError` is not an edge case — it's expected. Fall back to a cloud API silently, or degrade the feature gracefully. Don't surface a raw error to the user.

3. **Profile on real devices.** CoreML and XNNPACK performance on Simulator is meaningless. Use a physical iPhone and at least two Android devices at different tiers. The mid-range Android experience is often where integration falls apart.
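The download-gating rule above is easy to encode as a pure predicate. The `netType` values here mirror the `type` field reported by @react-native-community/netinfo; the policy itself (explicit opt-in, Wi-Fi only by default) is this article's recommendation, not library behavior:

```typescript
// Decide whether a large model download may start right now.
// Never download silently; cellular only with an explicit opt-in.
function shouldStartModelDownload(
  netType: "wifi" | "cellular" | "none" | "unknown",
  userConfirmed: boolean,
  allowCellular: boolean = false,
): boolean {
  if (!userConfirmed) return false;               // require an explicit user action
  if (netType === "wifi") return true;            // Wi-Fi is always acceptable
  return netType === "cellular" && allowCellular; // cellular only if the user opted in
}
```

Wire the result into whatever download flow you use, and persist the `allowCellular` choice so users aren't re-prompted for every model.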
On-device LLMs in React Native are now a legitimate architecture choice, not an experiment — but the current capability envelope is real. The models that work well are in the 1B–3B range, for English-primary tasks, on recent hardware. If your use case fits that box, the tooling is ready. If it doesn’t, the honest answer is that a cloud API is still the right call.
Further reading: For a broader look at on-device inference beyond LLMs — including computer vision models, latency benchmarks, and an edge-vs-cloud decision framework — see On-Device AI on Android in 2026: Sub-20ms Inference Without Cloud Latency.
We work on on-device AI integration for React Native apps — from model selection and quantization through production rollout. Get in touch if you’re evaluating this architecture.