On-Device LLMs on Android in 2026: ExecuTorch, llama.cpp, and MNN Compared

Running a language model entirely on an Android device in 2026 is a production-ready architecture choice — not a research experiment. The chipsets are capable, the frameworks have stabilised, and the privacy and latency advantages are real. What’s less clear is which native integration path to take.

If you’re building with React Native, React Native ExecuTorch gives you a TypeScript API over ExecuTorch with minimal native code. But if you’re building a native Android app in Kotlin or Java, you have three serious options: ExecuTorch’s Java/JNI bindings, llama.cpp compiled for Android via the NDK, and Alibaba’s MNN framework. Each occupies a distinct niche.

This post covers the trade-offs, Kotlin integration patterns, and how to pick the right framework for your use case.

Why Native Android (Not React Native) Matters

React Native ExecuTorch is the fastest path to shipping on-device LLM features in a cross-platform app. But native Kotlin/Java is the right choice when:

Your app is already native. Adding a React Native bridge solely for on-device inference adds significant complexity and binary size.
You need fine-grained threading control. Native Android lets you pin inference threads to specific cores and manage the NPU/GPU delegate lifecycle directly.
You’re integrating with Android system APIs. AccessibilityService, InputMethod, and CameraX pipelines are far simpler to build in native Kotlin than across a bridge.
Binary size is a constraint. The React Native runtime itself adds overhead that native code avoids.
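
To make the threading point concrete, here is a rough sketch of one way to choose an inference thread count. It assumes a heuristic that is not prescribed by any of the frameworks: use as many threads as there are high-frequency cores. The function names are hypothetical, and the sysfs path is the standard Linux cpufreq location, which some devices may not expose to apps.

```kotlin
import java.io.File

/** Reads each core's maximum frequency in kHz from sysfs; unreadable cores are skipped. */
fun readCoreMaxFreqsKHz(coreCount: Int = Runtime.getRuntime().availableProcessors()): List<Long> =
    (0 until coreCount).mapNotNull { core ->
        runCatching {
            File("/sys/devices/system/cpu/cpu$core/cpufreq/cpuinfo_max_freq")
                .readText().trim().toLong()
        }.getOrNull()
    }

/**
 * Heuristic (an assumption, not a framework rule): count the cores whose maximum
 * frequency is at least `ratio` of the fastest core. Returns `fallback` when no
 * frequencies could be read.
 */
fun pickInferenceThreads(maxFreqsKHz: List<Long>, ratio: Double = 0.8, fallback: Int = 4): Int {
    val fastest = maxFreqsKHz.maxOrNull() ?: return fallback
    return maxFreqsKHz.count { it >= fastest * ratio }.coerceAtLeast(1)
}
```

This only picks a thread count. Actually pinning threads to those cores still has to happen in native code, for example with sched_setaffinity.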

The Three Frameworks

ExecuTorch (Java/JNI Bindings)

Meta’s ExecuTorch is the most architecturally coherent on-device inference framework for PyTorch models. It powers on-device AI in WhatsApp and Instagram at scale. The Android integration uses a JNI layer that exposes a Module class for Java/Kotlin code.

Strengths:

First-class QNN delegate for Snapdragon NPU (best Android inference performance)
Tight PyTorch integration: models export directly from PyTorch via torch.export, with no detour through an intermediate format such as ONNX
Actively maintained by Meta with production deployments as proof

Weaknesses:

Young API surface that has changed substantially across releases; pin versions and read the release notes before upgrading
Java bindings are less mature than the C++ API — some advanced delegate options require dropping to JNI directly
.pte format is ExecuTorch-specific; models require an export step from PyTorch

Typical Kotlin integration:

import org.pytorch.executorch.Module
import org.pytorch.executorch.EValue
import org.pytorch.executorch.Tensor

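class LLMInference(modelPath: String) {
    private val module = Module.load(modelPath)

    // Greedy decoding without a KV cache: the whole sequence is re-run on every step.
    // Assumes the exported model returns float32 logits shaped [1, seqLen, vocabSize].
    fun generate(tokens: LongArray, maxNewTokens: Int = 128): List<Long> {
        val context = tokens.toMutableList()
        val outputs = mutableListOf<Long>()

        repeat(maxNewTokens) {
            val input = Tensor.fromBlob(context.toLongArray(), longArrayOf(1, context.size.toLong()))
            val logits = module.forward(EValue.from(input))[0].toTensor()
            val vocabSize = logits.shape().last().toInt()
            val data = logits.dataAsFloatArray
            val offset = data.size - vocabSize // start of the last position's logits

            // Argmax over the vocabulary for the final position
            var best = 0
            for (i in 1 until vocabSize) {
                if (data[offset + i] > data[offset + best]) best = i
            }

            val nextToken = best.toLong()
            outputs.add(nextToken)
            context.add(nextToken) // feed the new token back in on the next step
            if (nextToken == EOS_TOKEN_ID) return outputs
        }
        return outputs
    }

    companion object {
        const val EOS_TOKEN_ID = 128001L // model-specific; check your tokenizer
    }
}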

Best for: Teams already in the PyTorch ecosystem, Snapdragon-heavy device targets, apps where you control model training and can use torch.export.

llama.cpp (Android NDK)

Georgi Gerganov’s llama.cpp has a well-established Android build path via the NDK. It uses the GGUF model format, the community standard for quantised LLM weights: virtually every open model has GGUF variants available on Hugging Face. The Android integration wraps the C++ library with a JNI layer; a reference implementation lives in examples/llama.android in the project repository.

Strengths:

GGUF ecosystem is massive: most popular open models on Hugging Face have ready-made GGUF builds
The most battle-tested quantisation (Q4_K_M, Q5_K_M, Q8_0) with well-understood accuracy/speed tradeoffs
Vulkan GPU backend works across almost all Android devices without vendor-specific delegate risk
Active community; the core library ships chat templates for the major model families

Weaknesses:

No mature NPU acceleration path; in practice you’re limited to CPU (NEON) and GPU (Vulkan)
JNI layer in examples/llama.android is a reference implementation, not a polished library
The reference JNI layer leaves chat templating and context management to you, so your app has to apply the template and trim history itself
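
Because templating and context management fall to the app, a small amount of plain Kotlin is usually needed around the JNI calls. The sketch below is one possible shape, not llama.cpp’s API: ChatMessage, trimToBudget, and renderPrompt are hypothetical names, countTokens stands in for your real tokenizer, and the role markers are illustrative rather than any specific model’s template.

```kotlin
data class ChatMessage(val role: String, val content: String)

/**
 * Drops the oldest messages until the rest fit in `maxTokens`.
 * The newest message is always kept, even if it exceeds the budget alone.
 */
fun trimToBudget(
    history: List<ChatMessage>,
    maxTokens: Int,
    countTokens: (String) -> Int,
): List<ChatMessage> {
    val kept = ArrayDeque<ChatMessage>()
    var used = 0
    for (message in history.asReversed()) {
        val cost = countTokens(message.content)
        if (kept.isNotEmpty() && used + cost > maxTokens) break
        kept.addFirst(message)
        used += cost
    }
    return kept.toList()
}

/** Renders messages with illustrative role markers and leaves the assistant turn open. */
fun renderPrompt(messages: List<ChatMessage>): String =
    messages.joinToString(separator = "") { "<|${it.role}|>\n${it.content}\n" } + "<|assistant|>\n"
```

Trimming from the oldest end keeps the most recent turns, which is usually what a chat UI wants. A system prompt would need to be pinned separately.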

Typical JNI/Kotlin integration:

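import java.io.File
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn

// Illustrative JNI surface: each external function needs a matching C++ implementation in libllama.so
class LlamaCpp {
    init { System.loadLibrary("llama") }

    external fun loadModel(path: String, nCtx: Int = 2048, nThreads: Int = 4): Long
    external fun completionInit(ctx: Long, prompt: String, maxNewTokens: Int): Int
    external fun completionLoop(ctx: Long): String? // null signals end of generation
    external fun freeContext(ctx: Long)
}

// Usage: stream tokens as a cold Kotlin Flow, off the main thread
fun complete(llama: LlamaCpp, modelFile: File, userMessage: String): Flow<String> = flow {
    val ctx = llama.loadModel(modelFile.absolutePath, nThreads = 4)
    try {
        // Prompt markers are model-specific; use the template your model was trained with
        llama.completionInit(ctx, "<|user|>\n$userMessage\n<|assistant|>\n", 256)
        while (true) {
            val token = llama.completionLoop(ctx) ?: break
            emit(token)
        }
    } finally {
        llama.freeContext(ctx) // release native memory even if collection is cancelled
    }
}.flowOn(Dispatchers.Default)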

Best for: Teams who need maximum model flexibility (any GGUF model, any model family), apps where Vulkan GPU support matters more than NPU speed, offline-first apps where connectivity for model updates is limited.

MNN (Alibaba’s Mobile Neural Network)

MNN is Alibaba’s inference framework, production-tested across Taobao, Tmall, and Alipay. It has broad model format support (ONNX, TensorFlow, PyTorch via conversion), Android NPU support via the NNAPI delegate, and a Java API that feels more like a standard Android library than a JNI wrapper.

Strengths:

Polished Java API with Android lifecycle integration
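NNAPI delegate provides NPU access on devices where QNN (Snapdragon-only) isn’t available, which broadens device support. Google deprecated NNAPI in Android 15, so verify this path on your newest target OS versions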
Converts models from multiple frameworks, not just PyTorch
Good documentation compared to llama.cpp’s reference integration

Weaknesses:

Smaller community and ecosystem than ExecuTorch or llama.cpp for LLMs specifically
MNN’s LLM support (MNN-LLM) is newer than its CV support and still maturing
Less transparent about which model architectures are supported and at what quantisation levels

Typical Kotlin integration:

// Class and package names here are illustrative; confirm them against the MNN release you integrate
import com.alibaba.mnn.llm.LLM

class MNNInference(modelDir: String) {
    private val llm = LLM.create(modelDir)

    init {
        llm.load()
    }

    fun chat(prompt: String, onToken: (String) -> Unit) {
        llm.response(prompt) { token ->
            onToken(token)
        }
    }

    fun reset() = llm.reset()
}

Best for: Teams with TensorFlow or ONNX model origins, apps that need broad Android device coverage beyond Snapdragon, organisations with existing Alibaba cloud infrastructure.

Benchmark Comparison (Snapdragon 8 Elite, INT4 Quantisation)

The numbers below are representative ranges for Llama 3.2 1B on a Snapdragon 8 Elite device (2024-2025) at INT4/Q4 quantisation:

Framework	Backend	Tokens/sec	First Token (ms)	Binary Size
ExecuTorch	QNN (NPU)	18–25	400–600	~8MB runtime
ExecuTorch	XNNPACK (CPU)	6–10	200–350	~8MB runtime
llama.cpp	Vulkan (GPU)	8–14	300–500	~4MB runtime
llama.cpp	CPU (NEON)	4–8	150–300	~4MB runtime
MNN-LLM	NNAPI (NPU)	10–18	350–550	~6MB runtime
MNN-LLM	CPU	5–9	200–400	~6MB runtime

Benchmarks are approximate and vary significantly by device, model size, quantisation level, context length, and OS version. Always benchmark on your target device distribution.
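
Since the advice is to benchmark on your own devices, a small framework-agnostic timing harness helps. The sketch below is one way to measure time to first token and decode throughput. DecodeStats and measureDecode are hypothetical names, and nextToken is whatever call pulls the next token from your chosen framework.

```kotlin
data class DecodeStats(val firstTokenMs: Double, val tokensPerSec: Double, val tokenCount: Int)

/**
 * Times a token stream. `nextToken` returns null when generation ends.
 * Throughput is measured over the tokens after the first one, so prompt
 * processing time does not skew the decode rate.
 */
fun measureDecode(
    nextToken: () -> String?,
    nowNanos: () -> Long = { System.nanoTime() },
): DecodeStats {
    val start = nowNanos()
    var firstAt = -1L
    var lastAt = start
    var count = 0
    while (nextToken() != null) {
        lastAt = nowNanos()
        if (count == 0) firstAt = lastAt
        count++
    }
    if (count == 0) return DecodeStats(0.0, 0.0, 0)
    val decodeSeconds = (lastAt - firstAt) / 1e9
    val rate = if (count > 1 && decodeSeconds > 0) (count - 1) / decodeSeconds else 0.0
    return DecodeStats((firstAt - start) / 1e6, rate, count)
}
```

Run it several times per device and discard the first run, since model loading and cache warm-up can distort a single measurement.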

The headline number: ExecuTorch with the QNN delegate on Snapdragon is the fastest path to NPU inference. But llama.cpp’s Vulkan GPU backend is competitive on GPU-only devices and far more portable.

Model Size and Download UX

One factor that often gets underweighted: on-device LLMs require large model downloads. A Q4_K_M Llama 3.2 1B is ~700MB; Qwen2.5 0.5B is roughly 400MB.
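
A back-of-the-envelope estimate is useful when planning download size for a model you have not quantised yet. The sketch below uses the simple relation size ≈ parameters × average bits per weight ÷ 8. Both inputs are assumptions you supply, and real files come out somewhat larger because some tensors are stored at higher precision and the file carries metadata.

```kotlin
/** Rough weight-file size in bytes: parameters × average bits per weight ÷ 8. */
fun estimateModelBytes(parameterCount: Long, avgBitsPerWeight: Double): Long =
    (parameterCount * avgBitsPerWeight / 8.0).toLong()

/** Converts bytes to decimal megabytes for display. */
fun toMegabytes(bytes: Long): Double = bytes / 1_000_000.0
```

For example, a model of roughly 1.2 billion parameters at an assumed 4.5 bits per weight works out to about 675MB, which is in line with the figure quoted above.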

Strategies that matter for production UX:

Progressive download. Download the model in the background after first launch, not as a blocking install step. Use Android’s WorkManager for reliable background downloading.

Delta updates. If you ship model updates, consider whether you can ship small adapter weights (for example LoRA layers) rather than re-downloading the full base model.

User expectation setting. Show model size and download progress explicitly. Users who understand what’s happening tolerate large downloads; users who don’t will uninstall.

Resumable downloads. Use HttpURLConnection with Range header support or the DownloadManager system service to survive interrupted connections.
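
The Range-header approach can be sketched in plain Kotlin with HttpURLConnection. This is a minimal illustration with hypothetical function names, without retries, checksums, or progress reporting. In a real app it would typically run inside a WorkManager worker, and it assumes the server supports byte ranges.

```kotlin
import java.io.File
import java.io.FileOutputStream
import java.net.HttpURLConnection
import java.net.URL

/** Builds the Range header value for resuming after `existingBytes`, or null to start fresh. */
fun rangeHeaderFor(existingBytes: Long): String? =
    if (existingBytes > 0) "bytes=$existingBytes-" else null

/** Downloads `url` into `target`, appending to a partial file when the server honours Range. */
fun downloadResumable(url: String, target: File) {
    val existing = if (target.exists()) target.length() else 0L
    val conn = URL(url).openConnection() as HttpURLConnection
    try {
        rangeHeaderFor(existing)?.let { conn.setRequestProperty("Range", it) }
        val append = when (val code = conn.responseCode) {
            HttpURLConnection.HTTP_PARTIAL -> true // 206: server resumed from our offset
            HttpURLConnection.HTTP_OK -> false     // 200: server ignored Range, start over
            else -> error("Unexpected HTTP $code")
        }
        conn.inputStream.use { input ->
            FileOutputStream(target, append).use { output -> input.copyTo(output) }
        }
    } finally {
        conn.disconnect()
    }
}
```

A file that is already fully downloaded will make most servers answer 416, which this sketch treats as an error. Comparing the local length to a known expected size before calling it avoids that case.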

Choosing Your Framework

If you already know your model’s origin framework and device targets, the decision tree is short:

PyTorch model + Snapdragon-heavy device mix?
  → ExecuTorch (QNN delegate, best NPU performance)

Any GGUF model + broadest device support?
  → llama.cpp (Vulkan GPU, community ecosystem)

TensorFlow/ONNX model origin + non-Snapdragon priority?
  → MNN (NNAPI delegate, polished Java API)
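
If you want the same decision encoded somewhere reviewable, such as a build-time check or an internal planning tool, it fits in a single when expression. The enum and function names below are hypothetical, and the fallback to llama.cpp for combinations the tree does not cover is an assumption based on its portability.

```kotlin
enum class ModelOrigin { PYTORCH, GGUF, TENSORFLOW_OR_ONNX }
enum class Framework { EXECUTORCH, LLAMA_CPP, MNN }

/** Encodes the decision tree; uncovered combinations default to llama.cpp (an assumption). */
fun chooseFramework(origin: ModelOrigin, snapdragonHeavy: Boolean): Framework = when {
    origin == ModelOrigin.PYTORCH && snapdragonHeavy -> Framework.EXECUTORCH
    origin == ModelOrigin.TENSORFLOW_OR_ONNX && !snapdragonHeavy -> Framework.MNN
    else -> Framework.LLAMA_CPP
}
```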

The React Native path is separate: if you’re in a React Native codebase, React Native ExecuTorch wraps ExecuTorch with a TypeScript API and handles the native plumbing for you.

For broader context on why on-device inference has become viable at production scale, see On-Device AI Inference in 2026: Sub-20ms on Android, Real Benchmarks, and When to Go Edge.

We help teams implement native on-device AI across ExecuTorch, llama.cpp, and custom pipelines — from model selection through production deployment. Get in touch to discuss your project.
