Computer vision used to mean a server, a GPU, and a round-trip to the cloud every time a user pointed their camera at something. That default is no longer defensible for most mobile use cases. React Native ExecuTorch v0.8.0 ships a complete computer vision stack — classification, object detection, semantic segmentation, instance segmentation, OCR, style transfer, and vision-language models — all running on-device, all wrapped in idiomatic TypeScript hooks.
This is a practical guide to what’s available, what it’s actually useful for, and how to use it.
## The Computer Vision Stack
React Native ExecuTorch exposes computer vision through task-specific hooks, each backed by pre-quantized .pte model files distributed by Software Mansion on Hugging Face. You don’t compile models. You don’t touch native code. You import a hook, pass a model constant, and call forward().
Here’s the full CV task surface as of v0.8.0:
| Hook | Task | Typical Use Case |
|---|---|---|
| `useClassification` | What is in this image? | Content tagging, product recognition |
| `useObjectDetection` | Where are objects, and what are they? | Safety monitoring, inventory scanning |
| `useSemanticSegmentation` | Which category does each pixel belong to? | Background removal, scene understanding |
| `useInstanceSegmentation` | Which pixels belong to each individual object? | AR overlays, precise per-object masking |
| `useStyleTransfer` | Apply artistic style to an image | Creative apps, photo editing |
| `useOCR` | Extract text from images | Document scanning, receipt parsing |
Beyond these task-specific hooks, useLLM now supports multimodal input in v0.8.0 — passing an image alongside a prompt to the LFM2.5-VL-1.6B vision-language model for free-form visual question answering.
## Image Classification: The Fastest Path to a Working App
useClassification is the right hook when your question is categorical: is this food or not food, which product SKU is this, is this a manufacturing defect.
```tsx
import { useState } from 'react';
import {
  useClassification,
  EFFICIENTNET_V2_S_QUANTIZED,
} from 'react-native-executorch';

function ProductScanner() {
  const model = useClassification({ model: EFFICIENTNET_V2_S_QUANTIZED });
  const [results, setResults] = useState<[string, number][]>([]);

  const classify = async (imageUri: string) => {
    // Returns Record<string, number> — label to confidence score
    const scores = await model.forward(imageUri);
    const top = Object.entries(scores)
      .sort(([, a], [, b]) => b - a)
      .slice(0, 5);
    return top; // [['tabby cat', 0.92], ['tiger cat', 0.05], ...]
  };

  return <Camera onCapture={(uri) => classify(uri).then(setResults)} />;
}
```
The default model is EfficientNet V2-S, available in both full-precision (EFFICIENTNET_V2_S) and quantized (EFFICIENTNET_V2_S_QUANTIZED) variants. Both classify across ImageNet-1000 categories. The quantized variant is meaningfully smaller and faster on device — prefer it unless you’re benchmarking accuracy differences on your specific task.
The confidence score matters in production. Don’t surface a classification result to the user if the top score is below ~0.6. A low-confidence result means the model is uncertain, and showing it damages trust more than showing nothing.
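That gating logic is worth pulling into a small pure helper. The sketch below assumes the `Record<string, number>` shape returned by `forward()`; the 0.6 threshold is only a starting point, so tune it against real data from your task:

```typescript
// Returns the top label only when the model is confident enough;
// null tells the UI to show nothing rather than a shaky guess.
type TopResult = { label: string; score: number } | null;

function topConfidentLabel(
  scores: Record<string, number>,
  threshold = 0.6,
): TopResult {
  let best: TopResult = null;
  for (const [label, score] of Object.entries(scores)) {
    if (best === null || score > best.score) {
      best = { label, score };
    }
  }
  return best !== null && best.score >= threshold ? best : null;
}
```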
## Object Detection: RF-DETR and YOLO, On Your Phone
useObjectDetection runs object detection and returns an array of Detection objects with bounding boxes, class labels, and confidence scores. v0.8.0 ships two detection architectures:
- RF-DETR (`RF_DETR_NANO`) — transformer-based detection, strong on cluttered scenes
- YOLO26 (`YOLO26N` through `YOLO26X`) — convolutional, optimized for speed; five size variants let you tune the accuracy/latency tradeoff
- SSDLite MobileNet (`SSDLITE_320_MOBILENET_V3_LARGE`) — the lightest option, useful when memory is constrained
```tsx
import {
  useObjectDetection,
  RF_DETR_NANO,
  Detection,
} from 'react-native-executorch';

type Props = {
  onItemsDetected: (items: Detection[]) => void;
};

function InventoryScanner({ onItemsDetected }: Props) {
  const model = useObjectDetection({ model: RF_DETR_NANO });

  const processFrame = async (frameUri: string) => {
    const detections: Detection[] = await model.forward(frameUri);
    onItemsDetected(detections.filter((d) => d.score > 0.7));
  };

  return <CameraView onFrame={processFrame} />;
}
```
A note on bounding box coordinate systems: the bbox values in Detection objects are normalized (0.0–1.0 relative to image dimensions). Multiply by your camera preview dimensions before rendering overlays. This is the most common source of misaligned detection boxes in first-time integrations.
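The conversion is a few multiplications, sketched below. The `{ x1, y1, x2, y2 }` field names are an assumption for illustration; check the `Detection` type in your installed version and adapt accordingly:

```typescript
// Converts a normalized bounding box (0.0–1.0) into pixel coordinates
// for the camera preview, ready to position an absolute overlay view.
type NormalizedBox = { x1: number; y1: number; x2: number; y2: number };

function toPreviewRect(
  box: NormalizedBox,
  previewWidth: number,
  previewHeight: number,
) {
  return {
    left: box.x1 * previewWidth,
    top: box.y1 * previewHeight,
    width: (box.x2 - box.x1) * previewWidth,
    height: (box.y2 - box.y1) * previewHeight,
  };
}
```

Note that if your preview is cropped or letterboxed relative to the captured image, you also need to account for that offset before multiplying.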
If you need both detection and segmentation together, useInstanceSegmentation accepts the same YOLO26 and RF-DETR models — see the next section.
## Semantic Segmentation: Pixel-Level Scene Understanding
useSemanticSegmentation assigns every pixel in an image to a semantic category — person, car, sky, background, and so on. The return value is { ARGMAX: Int32Array }, where each element maps to a class index.
```tsx
import {
  useSemanticSegmentation,
  DEEPLAB_V3_MOBILENET_V3_LARGE_QUANTIZED,
} from 'react-native-executorch';

function PortraitMode() {
  const model = useSemanticSegmentation({
    model: DEEPLAB_V3_MOBILENET_V3_LARGE_QUANTIZED,
  });

  const blurBackground = async (imageUri: string) => {
    const { ARGMAX } = await model.forward(imageUri, [], false);
    // ARGMAX is Int32Array — each value is a class index
    // Class 15 = person in the PASCAL VOC palette
    const personMask = ARGMAX.map((cls) => (cls === 15 ? 255 : 0));
    return applyMaskedBlur(imageUri, personMask);
  };
}
```
v0.8.0 ships several segmentation model options:
| Constant | Architecture | Notes |
|---|---|---|
| `DEEPLAB_V3_MOBILENET_V3_LARGE_QUANTIZED` | DeepLab V3 + MobileNet | Balanced speed and accuracy |
| `DEEPLAB_V3_RESNET50_QUANTIZED` | DeepLab V3 + ResNet-50 | Higher accuracy, more memory |
| `DEEPLAB_V3_RESNET101_QUANTIZED` | DeepLab V3 + ResNet-101 | Highest accuracy in this family |
| `LRASPP_MOBILENET_V3_LARGE_QUANTIZED` | LRASPP + MobileNet | Fastest semantic segmentation |
| `FCN_RESNET50_QUANTIZED` | FCN + ResNet-50 | Classic fully convolutional |
| `SELFIE_SEGMENTATION` | Selfie segmentation | Person/background only, fast |
For portrait or people-focused features, SELFIE_SEGMENTATION is usually the right default — it’s specialized and fast. For scene-level understanding (outdoor scenes, mixed objects), DeepLab V3 is more capable.
The main practical constraint is memory: running segmentation on full-resolution photos from a 48MP camera sensor will OOM on most devices. Downscale to 512×512 or 640×640 before inference, then upscale the mask back to full resolution for rendering.
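The upscaling step should use nearest-neighbor sampling: class indices are categorical, so interpolating between them would invent classes that don't exist. A minimal sketch, operating on the `Int32Array` mask the hook returns:

```typescript
// Nearest-neighbor upscaling of a class-index mask (e.g. the ARGMAX
// output of semantic segmentation) from inference resolution back to
// display resolution.
function upscaleMask(
  mask: Int32Array,
  srcW: number,
  srcH: number,
  dstW: number,
  dstH: number,
): Int32Array {
  const out = new Int32Array(dstW * dstH);
  for (let y = 0; y < dstH; y++) {
    const srcY = Math.min(srcH - 1, Math.floor((y * srcH) / dstH));
    for (let x = 0; x < dstW; x++) {
      const srcX = Math.min(srcW - 1, Math.floor((x * srcW) / dstW));
      out[y * dstW + x] = mask[srcY * srcW + srcX];
    }
  }
  return out;
}
```

For production you would typically hand this to a native image-processing step rather than looping in JS, but the sampling logic is the same.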
## Instance Segmentation: New in v0.8.0
useInstanceSegmentation is the headline addition to the CV stack in v0.8.0. Where semantic segmentation labels every pixel with a category, instance segmentation produces a per-pixel mask for each individual detected object. Two dogs in a frame get two separate masks, not one “dog” region.
```tsx
import {
  useInstanceSegmentation,
  RF_DETR_NANO_SEG,
} from 'react-native-executorch';

function ARMeasurement() {
  const model = useInstanceSegmentation({ model: RF_DETR_NANO_SEG });

  const maskObjects = async (imageUri: string) => {
    const instances = await model.forward(imageUri, {
      confidenceThreshold: 0.5,
      iouThreshold: 0.55,
      maxInstances: 20,
    });
    // instances: per-object masks at original image resolution
    return instances;
  };
}
```
Available models: YOLO26N_SEG, YOLO26S_SEG, YOLO26M_SEG, YOLO26L_SEG, YOLO26X_SEG, and RF_DETR_NANO_SEG. The YOLO26 SEG variants offer the same speed/accuracy tradeoff spectrum as the detection variants. RF-DETR tends to handle overlapping or cluttered instances better.
Use instance segmentation when you need to act on individual objects — measuring, selecting, or manipulating them independently. Use semantic segmentation when you only care about the category of each pixel.
## Style Transfer
useStyleTransfer applies a fixed artistic style to a photo. v0.8.0 ships four quantized styles:
```tsx
import {
  useStyleTransfer,
  STYLE_TRANSFER_CANDY_QUANTIZED,
  STYLE_TRANSFER_MOSAIC_QUANTIZED,
  STYLE_TRANSFER_RAIN_PRINCESS_QUANTIZED,
  STYLE_TRANSFER_UDNIE_QUANTIZED,
} from 'react-native-executorch';

function PhotoEditor() {
  const model = useStyleTransfer({ model: STYLE_TRANSFER_CANDY_QUANTIZED });

  const applyStyle = async (imageUri: string): Promise<string> => {
    // Returns a URI string of the stylized image
    const styledUri = await model.forward(imageUri, 'url');
    return styledUri;
  };
}
```
Style transfer models are computationally heavier than classification or detection. They’re appropriate for single-photo processing (tap a button, wait a moment), not real-time camera feeds.
## OCR: Text Extraction in Seven Languages
useOCR extracts text from images with bounding box locations and confidence scores. v0.8.0 includes seven language models:
```tsx
import {
  useOCR,
  OCR_ENGLISH,
  OCR_GERMAN,
  OCR_FRENCH,
  OCR_SPANISH,
  OCR_ITALIAN,
  OCR_JAPANESE,
  OCR_KOREAN,
} from 'react-native-executorch';

function DocumentScanner() {
  const model = useOCR({ model: OCR_ENGLISH });

  const extractText = async (imageUri: string) => {
    const results = await model.forward(imageUri);
    return results;
  };
}
```
This is a purpose-built OCR hook, not a vision-language model. For structured documents (receipts, labels, forms) where language is known in advance, useOCR will be faster and more reliable than a general VLM. For open-ended visual questions that happen to involve text, the LFM2.5-VL section below applies.
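One post-processing step you will almost always need is assembling the individual detections into reading order. The sketch below assumes a flattened `{ text, x, y, height }` shape per detection; the actual OCR result type includes bounding boxes and confidence scores, so adapt the field access accordingly:

```typescript
// Sorts OCR detections into rough reading order: top-to-bottom,
// then left-to-right within a line. Two boxes count as the same
// line when their vertical offset is under half a line height.
type OcrBox = { text: string; x: number; y: number; height: number };

function toReadingOrder(boxes: OcrBox[]): string {
  const sorted = [...boxes].sort((a, b) => {
    const sameLine =
      Math.abs(a.y - b.y) < Math.min(a.height, b.height) / 2;
    return sameLine ? a.x - b.x : a.y - b.y;
  });
  return sorted.map((b) => b.text).join(' ');
}
```

This simple heuristic works well for receipts and labels; multi-column documents need a smarter layout pass.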
## Real-Time Processing: VisionCamera v5 Integration
A major capability addition in v0.8.0 is seamless integration with VisionCamera v5 via the runOnFrame worklet pattern. All CV hooks now expose their inference function as a worklet, so you can process camera frames with near-native performance:
```tsx
import { useObjectDetection, YOLO26N } from 'react-native-executorch';
import {
  Camera,
  useCameraDevice,
  useFrameProcessor,
} from 'react-native-vision-camera';

function LiveDetector() {
  const model = useObjectDetection({ model: YOLO26N });
  const device = useCameraDevice('back');

  const frameProcessor = useFrameProcessor((frame) => {
    'worklet';
    const result = model.runOnFrame(frame);
    // result is processed on the JS thread when ready; bridge it back
    // to React state (for example via a runOnJS helper) to render overlays
  }, [model]);

  if (device == null) return null;

  return (
    <Camera
      device={device}
      isActive={true}
      frameProcessor={frameProcessor}
    />
  );
}
```
The worklet integration avoids the JS bridge overhead on each frame, which makes real-time detection and segmentation practical at 15–30 FPS depending on model and device class. This pattern works with useObjectDetection, useClassification, useSemanticSegmentation, useInstanceSegmentation, and useOCR.
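When the camera delivers frames faster than the model can process them, drop the excess rather than queuing it: queuing only adds latency. A minimal frame-skip gate, sketched as a plain closure (in a worklet context you would typically hold the timestamp in a shared value instead):

```typescript
// Returns a gate function that allows at most one run per interval.
// Call it with the current timestamp; a false return means "drop
// this frame and wait for the next one".
function makeFrameGate(minIntervalMs: number) {
  let lastRun = -Infinity;
  return (nowMs: number): boolean => {
    if (nowMs - lastRun < minIntervalMs) return false; // drop frame
    lastRun = nowMs;
    return true;
  };
}
```

At `makeFrameGate(100)` you cap inference at roughly 10 FPS, which is plenty for overlay-style UIs on mid-range hardware.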
## Visual Question Answering with LFM2.5-VL
For use cases where users need to ask arbitrary questions about images — “what’s the expiry date on this product?”, “describe what’s in this photo”, “is this label in English?” — useLLM now supports multimodal input in v0.8.0.
The model is LFM2.5-VL-1.6B (Liquid Foundation Models vision-language, 1.6 billion parameters), distributed by Software Mansion at software-mansion/react-native-executorch-lfm2.5-VL-1.6B.
```tsx
import { useState } from 'react';
import { useLLM, LFM2_VL_1_6B_QUANTIZED } from 'react-native-executorch';

function VisualAssistant() {
  const vlm = useLLM({
    modelSource: LFM2_VL_1_6B_QUANTIZED,
  });
  const [response, setResponse] = useState('');

  const askAboutImage = async (imageUri: string, question: string) => {
    let answer = '';
    await vlm.generate(question, {
      imageUri,
      onToken: (token) => {
        answer += token;
        setResponse(answer); // stream tokens to UI as they arrive
      },
    });
    return answer;
  };
}
```
LFM2.5-VL is a 1.6B parameter model, which is meaningfully larger than the single-task CV hooks but small enough to run on current flagship hardware. Expect first-token latency of several hundred milliseconds to low single seconds depending on device; subsequent tokens stream as they generate.
This is not a real-time camera hook. The right usage pattern is request-response: the user captures or selects an image, asks a question, and waits for a response. For real-time camera workloads, the task-specific hooks are still the right tool.
## A Decision Guide: Which CV Approach for Which Use Case
| If you need to… | Use |
|---|---|
| Classify images into fixed categories | useClassification + EFFICIENTNET_V2_S_QUANTIZED |
| Detect and locate objects in real time | useObjectDetection + YOLO26N (or RF_DETR_NANO for accuracy) |
| Segment scene regions by category | useSemanticSegmentation + DEEPLAB_V3_MOBILENET_V3_LARGE_QUANTIZED |
| Mask individual object instances | useInstanceSegmentation + YOLO26N_SEG or RF_DETR_NANO_SEG |
| Isolate people from backgrounds | useSemanticSegmentation + SELFIE_SEGMENTATION |
| Apply artistic style filters | useStyleTransfer |
| Extract text from documents or signs | useOCR + the appropriate language constant |
| Answer arbitrary questions about images | useLLM + LFM2_VL_1_6B_QUANTIZED |
| Process camera frames in real time | Any hook’s runOnFrame via VisionCamera v5 |
The golden rule for model selection: use the smallest model that meets your accuracy requirements. Every additional megabyte in your model is download friction and memory pressure. Start with the smallest variant (YOLO26N, DeepLab MobileNet, EfficientNet quantized), profile on real devices, and only move up if the default falls short on your specific task.
## Getting Started
```sh
npx expo install react-native-executorch
```
For bare React Native, the native setup adds the ExecuTorch XCFramework to iOS and the required AAR to Android. The Expo plugin handles both automatically.
Four things to get right before you ship:
1. Load models before they’re needed. Call the hook and trigger model loading during app initialization or at the start of a relevant screen — not when the user first taps the camera shutter. The downloadProgress property on every hook lets you show a progress indicator during the first-run download.
2. Handle error states explicitly. On constrained devices — particularly mid-range Android — the ExecuTorch runtime may fail to allocate memory for a model. This is not a crash; it’s a handled error state. Fall back to a cloud API, disable the feature, or show a clear message. Never surface the raw error object.
3. Downscale inputs for segmentation models. Semantic and instance segmentation have quadratic memory scaling with input resolution. Run at 512×512 or 640×640, not at camera native resolution. Upscale the mask back to full resolution after inference — the perceptual quality loss is negligible.
4. Gate large downloads behind Wi-Fi. The LFM2.5-VL model is several hundred megabytes. Style transfer and DeepLab variants are also non-trivial. Gate initial model downloads behind a Wi-Fi check and show clear download progress. Users will forgive a first-time setup step; they won’t forgive a surprise cellular data charge.
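Points 2 and 4 above can be sketched as two small helpers. These are illustrative patterns, not library APIs: `cloudFallback` is a hypothetical function you would supply, and the connection-type strings follow the values reported by `@react-native-community/netinfo`:

```typescript
// Point 2: try on-device inference first, degrade to a cloud path if
// the runtime fails (e.g. an allocation error on a constrained device),
// and return null so the caller can disable the feature gracefully.
type Classify = (uri: string) => Promise<Record<string, number>>;

async function classifyWithFallback(
  imageUri: string,
  onDevice: Classify,
  cloudFallback?: Classify,
): Promise<Record<string, number> | null> {
  try {
    return await onDevice(imageUri);
  } catch (err) {
    // Log err for diagnostics, but never surface the raw error to users
    if (cloudFallback) return cloudFallback(imageUri);
    return null;
  }
}

// Point 4: a pure decision helper for gating large downloads.
// The 50 MB cutoff is an arbitrary example; tune it for your app.
function shouldDownloadModel(
  connectionType: string,
  modelSizeMB: number,
  wifiOnlyAboveMB = 50,
): boolean {
  if (modelSizeMB <= wifiOnlyAboveMB) return true; // small: any network
  return connectionType === 'wifi'; // large: Wi-Fi only
}
```

Keeping both decisions in pure functions like these also makes them trivial to unit-test, which is worth doing before shipping a feature whose failure mode is a surprise data charge.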
Computer vision in React Native has crossed the threshold from impressive demo to production architecture. The combination of task-specific hooks for classification, detection, and segmentation — plus the new instance segmentation support and VisionCamera v5 integration in v0.8.0 — removes most of the friction that historically made on-device CV a specialist undertaking. Classification and detection are a few lines of TypeScript. For open-ended visual queries, LFM2.5-VL is running on-device at 1.6B parameters.
The remaining constraint is the wide variance in Android hardware — a $200 MediaTek device is meaningfully slower than a Snapdragon 8 Elite. Design your model selection and fallback paths with this in mind.
Further reading: For a broader look at on-device inference performance — including latency benchmarks across model types, hardware delegate selection, and an edge-vs-cloud decision framework — see On-Device AI on Android in 2026: Sub-20ms Inference Without Cloud Latency.
We build production on-device AI features for React Native teams — from model selection and quantization through rollout and monitoring. Get in touch if you’re evaluating computer vision for your app.