Running a language model entirely on an Android device in 2026 is a production-ready architecture choice — not a research experiment. The chipsets are capable, the frameworks have stabilised, and the privacy and latency advantages are real. What’s less clear is which native integration path to take.
If you’re building with React Native, React Native ExecuTorch gives you a TypeScript API over ExecuTorch with minimal native code. But if you’re building a native Android app in Kotlin or Java, you have three serious options: ExecuTorch’s Java/JNI bindings, llama.cpp compiled for Android via the NDK, and Alibaba’s MNN framework. Each occupies a distinct niche.
This post covers the trade-offs, Kotlin integration patterns, and how to pick the right framework for your use case.
Why Native Android (Not React Native) Matters
React Native ExecuTorch is the fastest path to shipping on-device LLM features in a cross-platform app. But native Kotlin/Java is the right choice when:
- Your app is already native. Adding a React Native bridge solely for on-device inference adds significant complexity and binary size.
- You need fine-grained threading control. Native Android lets you pin inference to specific cores, control thread affinity, and manage the NPU/GPU delegate lifecycle directly.
- You’re integrating with Android system APIs. AccessibilityService, InputMethod, and CameraX pipelines are far simpler to build in native Kotlin than across a bridge.
- Binary size is a constraint. The React Native runtime itself adds overhead that native code avoids.
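As a rough illustration of the threading point above, the sketch below picks a thread count by counting the "big" cores on a device. It assumes the standard Linux sysfs file `cpufreq/cpuinfo_max_freq` is readable for each core, which is not guaranteed on every device. The function names and the 0.75 tolerance are illustrative choices, not part of any framework. Real thread affinity would still need a native `sched_setaffinity` call, which this sketch does not attempt.

```kotlin
import java.io.File

// Reads each core's maximum frequency (kHz) from sysfs; cores without a readable value are skipped.
fun readCoreMaxFreqsKhz(cpuDir: File = File("/sys/devices/system/cpu")): List<Long> =
    (cpuDir.listFiles { f -> Regex("cpu\\d+").matches(f.name) } ?: emptyArray())
        .mapNotNull { core ->
            runCatching {
                File(core, "cpufreq/cpuinfo_max_freq").readText().trim().toLong()
            }.getOrNull()
        }

// Counts "big" cores: those whose max frequency is within `tolerance` of the fastest core.
// Falls back to 1 when no frequency information is available.
fun countPerformanceCores(maxFreqsKhz: List<Long>, tolerance: Double = 0.75): Int {
    val top = maxFreqsKhz.maxOrNull() ?: return 1
    return maxFreqsKhz.count { it >= top * tolerance }.coerceAtLeast(1)
}
```

The result of `countPerformanceCores(readCoreMaxFreqsKhz())` could then be passed as a thread-count parameter such as `nThreads`, so that inference threads are not scheduled onto slow efficiency cores.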
The Three Frameworks
ExecuTorch (Java/JNI Bindings)
Meta’s ExecuTorch is the most architecturally coherent on-device inference framework for PyTorch models. It powers on-device AI in WhatsApp and Instagram at scale. The Android integration uses a JNI layer that exposes a Module class for Java/Kotlin code.
Strengths:
- First-class QNN delegate for Snapdragon NPU (best Android inference performance)
- Tight PyTorch integration: models export directly via torch.export, with no detour through ONNX or another intermediate format
- Actively maintained by Meta with production deployments as proof
Weaknesses:
- Fast-moving API surface; expect breaking changes between releases and pin your dependency version
- Java bindings are less mature than the C++ API — some advanced delegate options require dropping to JNI directly
- .pte format is ExecuTorch-specific; models require an export step from PyTorch
Typical Kotlin integration:
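import org.pytorch.executorch.EValue
import org.pytorch.executorch.Module
import org.pytorch.executorch.Tensor

class LLMInference(modelPath: String) {
    private val module = Module.load(modelPath)

    // Greedy decoding without a KV cache: the full sequence is re-run on every step.
    // Assumes the exported model returns float logits shaped [1, seqLen, vocabSize].
    fun generate(tokens: LongArray, maxNewTokens: Int = 128): List<Long> {
        var context = tokens
        val outputs = mutableListOf<Long>()
        repeat(maxNewTokens) {
            val input = Tensor.fromBlob(context, longArrayOf(1, context.size.toLong()))
            val logits = module.forward(EValue.from(input))[0].toTensor()
            val data = logits.dataAsFloatArray
            val vocabSize = logits.shape().last().toInt()
            // Argmax over the logits of the final position gives the next token ID.
            val offset = data.size - vocabSize
            var best = 0
            for (i in 1 until vocabSize) {
                if (data[offset + i] > data[offset + best]) best = i
            }
            val nextToken = best.toLong()
            outputs.add(nextToken)
            if (nextToken == EOS_TOKEN_ID) return outputs
            // Feed the new token back in so the next step sees it.
            context += nextToken
        }
        return outputs
    }

    companion object {
        const val EOS_TOKEN_ID = 128001L
    }
}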
Best for: Teams already in the PyTorch ecosystem, Snapdragon-heavy device targets, apps where you control model training and can use torch.export.
llama.cpp (Android NDK)
Georgi Gerganov’s llama.cpp has a well-established Android build path via the NDK. It uses the GGUF model format, the community standard for quantised LLM weights; virtually every popular open model has GGUF variants available on Hugging Face. The Android integration wraps the C++ library with a JNI layer, shipped as the examples/llama.android project in the repository.
Strengths:
- GGUF ecosystem is massive: most popular open models on Hugging Face already have ready-to-run GGUF builds
- The most battle-tested quantisation (Q4_K_M, Q5_K_M, Q8_0) with well-understood accuracy/speed tradeoffs
- Vulkan GPU backend works across almost all Android devices without vendor-specific delegate risk
- Active community; prompt template support for every major model family built in
Weaknesses:
- No NPU acceleration path — you’re limited to CPU (NEON) and GPU (Vulkan)
- JNI layer in examples/llama.android is a reference implementation, not a polished library
- Chat templating and context management are your responsibility
Typical JNI/Kotlin integration:
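import java.io.File
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn

// Kotlin side of a JNI wrapper; each external function needs a matching native implementation in libllama.so.
class LlamaCpp {
    init { System.loadLibrary("llama") }
    external fun loadModel(path: String, nCtx: Int = 2048, nThreads: Int = 4): Long
    external fun completionInit(ctx: Long, prompt: String, maxNewTokens: Int): Int
    external fun completionLoop(ctx: Long): String?
    external fun freeContext(ctx: Long)
}

// Usage: stream tokens as a Kotlin Flow and free the native context even if collection is cancelled.
fun complete(llama: LlamaCpp, modelFile: File, userMessage: String): Flow<String> = flow {
    val ctx = llama.loadModel(modelFile.absolutePath, nThreads = 4)
    try {
        // The prompt markers are model-specific; use the template your model was trained with.
        llama.completionInit(ctx, "<|user|>\n$userMessage\n<|assistant|>\n", 256)
        while (true) {
            val token = llama.completionLoop(ctx) ?: break
            emit(token)
        }
    } finally {
        llama.freeContext(ctx)
    }
}.flowOn(Dispatchers.Default)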
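Because the reference JNI layer leaves chat templating and context management to the app, a small helper layer is usually needed. The sketch below shows one possible shape, assuming Llama 3 style header tokens and a caller-supplied token counter. `Turn`, `buildPrompt`, `trimToBudget`, and `countTokens` are hypothetical names for illustration, not llama.cpp APIs. Whether `<|begin_of_text|>` should be included also depends on how your native layer tokenises the prompt.

```kotlin
data class Turn(val role: String, val content: String)

// Formats turns with Llama 3 style header tokens; other model families need their own template.
fun buildPrompt(turns: List<Turn>): String = buildString {
    append("<|begin_of_text|>")
    for (t in turns) {
        append("<|start_header_id|>${t.role}<|end_header_id|>\n\n${t.content}<|eot_id|>")
    }
    // Open an assistant turn so the model continues from here.
    append("<|start_header_id|>assistant<|end_header_id|>\n\n")
}

// Drops the oldest turns until the estimated token count fits the budget.
// The most recent turn is always kept, even if it alone exceeds the budget.
fun trimToBudget(turns: List<Turn>, maxTokens: Int, countTokens: (String) -> Int): List<Turn> {
    var kept = turns
    while (kept.size > 1 && kept.sumOf { countTokens(it.content) } > maxTokens) {
        kept = kept.drop(1)
    }
    return kept
}
```

A typical call would be `buildPrompt(trimToBudget(history, nCtx - maxNewTokens, tokenizer))`, which leaves room in the context window for the tokens still to be generated.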
Best for: Teams who need maximum model flexibility (any GGUF model, any model family), apps where Vulkan GPU support matters more than NPU speed, offline-first apps where connectivity for model updates is limited.
MNN (Alibaba’s Mobile Neural Network)
MNN is Alibaba’s inference framework, production-tested across Taobao, Tmall, and Alipay. It has broad model format support (ONNX, TensorFlow, PyTorch via conversion), Android NPU support via the NNAPI delegate, and a Java API that feels more like a standard Android library than a JNI wrapper.
Strengths:
- Polished Java API with Android lifecycle integration
- NNAPI delegate provides NPU access on devices where QNN (Snapdragon-only) isn’t available — broader device support
- Converts models from multiple frameworks, not just PyTorch
- Good documentation compared to llama.cpp’s reference integration
Weaknesses:
- Smaller community and ecosystem than ExecuTorch or llama.cpp for LLMs specifically
- MNN’s LLM support (MNN-LLM) is newer than its CV support and still maturing
- Less transparent about which model architectures are supported and at what quantisation levels
Typical Kotlin integration:
import com.alibaba.mnn.llm.LLM

class MNNInference(modelDir: String) {
    private val llm = LLM.create(modelDir)

    init {
        llm.load()
    }

    // Streams each generated token to the caller as it is produced.
    fun chat(prompt: String, onToken: (String) -> Unit) {
        llm.response(prompt) { token ->
            onToken(token)
        }
    }

    fun reset() = llm.reset()
}
Best for: Teams with TensorFlow or ONNX model origins, apps that need broad Android device coverage beyond Snapdragon, organisations with existing Alibaba cloud infrastructure.
Benchmark Comparison (Snapdragon 8 Elite, INT4 Quantisation)
The numbers below are representative ranges for Llama 3.2 1B on a Snapdragon 8 Elite device (2024-2025) at INT4/Q4 quantisation:
| Framework | Backend | Tokens/sec | First Token (ms) | Binary Size |
|---|---|---|---|---|
| ExecuTorch | QNN (NPU) | 18–25 | 400–600 | ~8MB runtime |
| ExecuTorch | XNNPACK (CPU) | 6–10 | 200–350 | ~8MB runtime |
| llama.cpp | Vulkan (GPU) | 8–14 | 300–500 | ~4MB runtime |
| llama.cpp | CPU (NEON) | 4–8 | 150–300 | ~4MB runtime |
| MNN-LLM | NNAPI (NPU) | 10–18 | 350–550 | ~6MB runtime |
| MNN-LLM | CPU | 5–9 | 200–400 | ~6MB runtime |
Benchmarks are approximate and vary significantly by device, model size, quantisation level, context length, and OS version. Always benchmark on your target device distribution.
The headline result: ExecuTorch with the QNN delegate is the fastest configuration in the table, at 18–25 tokens/sec on the Snapdragon NPU. llama.cpp’s Vulkan GPU backend is slower on this chip (8–14 tokens/sec) but stays competitive on devices without a usable NPU, and it is far more portable.
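Since the figures above are only representative, it helps to have a small timing harness you can run on your own devices. The sketch below measures time to first token and decode throughput for any streaming generator. `nextToken` is a stand-in for whichever framework call yields tokens, and `DecodeStats` and `measureDecode` are hypothetical names.

```kotlin
data class DecodeStats(val tokens: Int, val firstTokenMs: Double, val tokensPerSec: Double)

// `nextToken` returns the next generated token, or null when generation has finished.
fun measureDecode(nextToken: () -> String?): DecodeStats {
    val start = System.nanoTime()
    var firstTokenAt = -1L
    var count = 0
    while (nextToken() != null) {
        if (count == 0) firstTokenAt = System.nanoTime()
        count++
    }
    val end = System.nanoTime()
    val firstTokenMs = if (count > 0) (firstTokenAt - start) / 1e6 else Double.NaN
    // Throughput excludes the first token, whose latency is dominated by prompt processing.
    val decodeSec = (end - firstTokenAt) / 1e9
    val tps = if (count > 1 && decodeSec > 0) (count - 1) / decodeSec else 0.0
    return DecodeStats(count, firstTokenMs, tps)
}
```

Running this several times per device, with a fixed prompt and after a warm-up pass, gives numbers that are easier to compare than a single cold run.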
Model Size and Download UX
One factor that often gets underweighted: on-device LLMs require large model downloads. A Q4_K_M build of Llama 3.2 1B is ~700MB; Qwen2.5 0.5B is roughly 400MB.
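For a back-of-envelope check on download sizes like these, weight-file size is roughly parameter count times bits per weight, divided by eight. The helper below is a minimal sketch of that arithmetic. The function name is hypothetical, and the bits-per-weight figure you pass in is an assumption, since mixed schemes such as Q4_K_M average somewhat more than 4 bits.

```kotlin
// Rough weight-file size in bytes: parameters × bits per weight ÷ 8.
// Ignores file metadata and any tensors stored at higher precision.
fun estimateModelBytes(parameterCount: Long, bitsPerWeight: Double): Long =
    (parameterCount * bitsPerWeight / 8.0).toLong()
```

For example, assuming about 1.24 billion parameters and an average of 4.5 bits per weight, `estimateModelBytes(1_240_000_000L, 4.5)` returns 697,500,000 bytes, in line with the ~700MB figure above.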
Strategies that matter for production UX:
- Progressive download. Download the model in the background after first launch, not as a blocking install step. Use Android’s WorkManager for reliable background downloading.
- Delta updates. If you ship model updates, consider whether you can update quantisation adapters or fine-tuning layers rather than the full base model.
- User expectation setting. Show model size and download progress explicitly. Users who understand what’s happening tolerate large downloads; users who don’t will uninstall.
- Resumable downloads. Use HttpURLConnection with Range header support or the DownloadManager system service to survive interrupted connections.
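The resumable-download strategy can be sketched with plain HttpURLConnection. This is a minimal illustration rather than production code: it assumes the server supports Range requests, and `rangeHeaderFor` and `downloadResumable` are hypothetical helper names. On Android it would need to run off the main thread, for example inside a WorkManager worker.

```kotlin
import java.io.File
import java.io.FileOutputStream
import java.io.IOException
import java.net.HttpURLConnection
import java.net.URL

// Builds the Range header value for resuming after `existingBytes`, or null for a fresh download.
fun rangeHeaderFor(existingBytes: Long): String? =
    if (existingBytes > 0) "bytes=$existingBytes-" else null

// Downloads `url` into `target`, appending to a partial file when the server honours the Range header.
fun downloadResumable(url: URL, target: File) {
    val existing = if (target.exists()) target.length() else 0L
    val conn = url.openConnection() as HttpURLConnection
    try {
        rangeHeaderFor(existing)?.let { conn.setRequestProperty("Range", it) }
        val append = when (conn.responseCode) {
            HttpURLConnection.HTTP_PARTIAL -> true  // 206: server resumed from our offset
            HttpURLConnection.HTTP_OK -> false      // 200: range ignored, so start over
            else -> throw IOException("Unexpected HTTP ${conn.responseCode}")
        }
        conn.inputStream.use { input ->
            FileOutputStream(target, append).use { output -> input.copyTo(output) }
        }
    } finally {
        conn.disconnect()
    }
}
```

A real implementation would also verify a checksum once the file is complete, since a model file that silently changed on the server between attempts would otherwise be stitched together from two versions.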
Choosing Your Framework
If you already know your model’s origin framework and device targets, the decision tree is short:
PyTorch model + Snapdragon-heavy device mix?
→ ExecuTorch (QNN delegate, best NPU performance)
Any GGUF model + broadest device support?
→ llama.cpp (Vulkan GPU, community ecosystem)
TensorFlow/ONNX model origin + non-Snapdragon priority?
→ MNN (NNAPI delegate, polished Java API)
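The same decision tree can be written as a small function, which is handy if you want to encode the choice in a build script or feature flag. The enum and function names are hypothetical. Falling back to llama.cpp for every other combination is an assumption based on its broad device support, not something the tree above specifies.

```kotlin
enum class ModelOrigin { PYTORCH, GGUF, TENSORFLOW_OR_ONNX }
enum class Framework { EXECUTORCH, LLAMA_CPP, MNN }

// Mirrors the three branches above; anything not covered falls back to llama.cpp.
fun chooseFramework(origin: ModelOrigin, snapdragonHeavy: Boolean): Framework = when {
    origin == ModelOrigin.PYTORCH && snapdragonHeavy -> Framework.EXECUTORCH
    origin == ModelOrigin.TENSORFLOW_OR_ONNX && !snapdragonHeavy -> Framework.MNN
    else -> Framework.LLAMA_CPP
}
```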
The React Native path is separate: if you’re in a React Native codebase, React Native ExecuTorch wraps ExecuTorch with a TypeScript API and handles the native plumbing for you.
For broader context on why on-device inference has become viable at production scale, see On-Device AI Inference in 2026: Sub-20ms on Android, Real Benchmarks, and When to Go Edge.