YOLO Stereo Depth on Roboflow

Read metric depth per detection, on edge hardware, today

YOLO-StereoDepth is a roadmap item for September 2026. Its headline capability, real-time detection plus metric depth from cameras, is something you can deploy now with a stereo depth camera and Roboflow.

Use a stereo depth camera

Cameras like the Luxonis OAK-D, Stereolabs ZED, and Intel RealSense deliver metric depth, computed on the camera or in the vendor SDK, as a depth frame aligned to the RGB image. No calibration workaround is needed, and you do not have to train or host a separate depth model.

Detect on the RGB stream

Run a detector on the RGB stream with Roboflow Inference on your edge device. We recommend RF-DETR trained on your own classes. Roboflow supports a range of edge hardware for deployment.

Attach metric depth per detection

For each detection, sample the camera's aligned depth frame at the bounding box center, or the median over the box for robustness, to get a real-world distance per object. The camera supplies the depth; the detector supplies the what.
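The sampling step above can be sketched in a few lines. This is a minimal illustration, not a Roboflow or camera-vendor API: the function names are hypothetical, the depth frame is assumed to be a 2D grid indexed as depth[y][x] in millimeters, boxes are assumed to be (x_min, y_min, x_max, y_max) in the pixel coordinates of the aligned frame, and a value of 0 (or a non-finite value) is assumed to mean "no depth at this pixel". Check your camera's SDK for its actual units and invalid-pixel convention.

```python
import math
from statistics import median


def _is_valid(value):
    # Assumed convention: 0, NaN, or infinity marks a pixel with no depth.
    return math.isfinite(value) and value > 0


def depth_at_center(depth, box):
    """Depth at the box center, or None if that pixel has no depth."""
    x0, y0, x1, y1 = box
    value = depth[int((y0 + y1) / 2)][int((x0 + x1) / 2)]
    return value if _is_valid(value) else None


def depth_median_in_box(depth, box):
    """Median of the valid depth values inside the box, or None if there are none."""
    x0, y0, x1, y1 = box
    height, width = len(depth), len(depth[0])
    # Clamp the box to the frame so partially visible objects do not index out of range.
    xa, xb = max(0, int(x0)), min(width, int(x1) + 1)
    ya, yb = max(0, int(y0)), min(height, int(y1) + 1)
    valid = [
        depth[y][x]
        for y in range(ya, yb)
        for x in range(xa, xb)
        if _is_valid(depth[y][x])
    ]
    return median(valid) if valid else None
```

The center sample is cheap but fails when the center pixel is a depth hole or belongs to the background. The median tolerates holes and a minority of background pixels, which is why it is the more robust default. The code works on nested lists and on array types that support the same indexing; a vectorized version would be faster on full-resolution frames.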

Build logic and act

In Roboflow Workflows, attach depth to every detection and trigger actions: a stop signal when an obstacle is inside stopping distance, a grasp target for an arm, or a dimension estimate for a passing package.
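As an illustration of the stop-signal case, here is a sketch in plain Python rather than in Workflows itself. Everything in it is an assumption made for the example: the detection dictionaries with "class" and "depth_m" keys, the obstacle class names, and the reaction time, deceleration, and safety margin. The stopping distance uses the standard kinematic estimate of reaction distance plus braking distance.

```python
def stopping_distance_m(speed_mps, reaction_s=0.2, decel_mps2=1.5, margin_m=0.3):
    """Distance travelled while reacting, plus braking distance, plus a safety margin.

    All default values are illustrative, not recommendations for a real vehicle.
    """
    reaction = speed_mps * reaction_s
    braking = speed_mps ** 2 / (2 * decel_mps2)
    return reaction + braking + margin_m


def should_stop(detections, speed_mps, obstacle_classes=frozenset({"person", "forklift"})):
    """True if any obstacle with a known depth is inside the stopping distance."""
    limit = stopping_distance_m(speed_mps)
    return any(
        det["class"] in obstacle_classes
        and det["depth_m"] is not None
        and det["depth_m"] <= limit
        for det in detections
    )


# Hypothetical detections, each already paired with a sampled depth in meters.
detections = [
    {"class": "pallet", "depth_m": 0.9},
    {"class": "person", "depth_m": 2.8},
]
stop_now = should_stop(detections, speed_mps=1.5)
```

Detections with no valid depth are ignored here. A safety-critical system would more likely treat a missing depth as a reason to slow down, so adjust that choice to your risk model.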

Monocular vs. stereo depth: when to use which

Monocular (one camera)

A single RGB camera estimates relative depth (which pixels are closer and which are farther), and a calibration step converts it to real units. Best for retrofits on installed cameras, proximity ranking, and monitoring, with no new hardware. This runs today with Depth Anything 3, an available alternative to YOLO-Depth.
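One common way to do the calibration step is to measure a few reference points in the scene and fit a scale and shift that map the model's output to meters. The sketch below assumes that metric depth is approximately an affine function of the relative value, which is an assumption and not a documented property of any particular model. Some models emit inverse depth, in which case the same fit would be made against 1/distance instead. The function names are hypothetical.

```python
def fit_scale_shift(relative, metric_m):
    """Least-squares fit of metric_m ~= scale * relative + shift.

    relative: model outputs sampled at reference pixels
    metric_m: tape-measured distances in meters for the same pixels
    """
    n = len(relative)
    if n < 2 or n != len(metric_m):
        raise ValueError("need at least two matched reference points")
    mean_r = sum(relative) / n
    mean_m = sum(metric_m) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("reference points must span different depths")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric_m))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def to_metric(relative_value, scale, shift):
    """Apply a fitted calibration to one relative depth value."""
    return scale * relative_value + shift


# Illustrative reference points: relative model output and measured distance.
scale, shift = fit_scale_shift([0.1, 0.5, 0.9], [1.2, 2.0, 2.8])
```

The calibration holds only while the camera and scene stay put, and for some models it has to be refit per frame. That limitation is why monocular depth suits fixed, installed cameras better than moving robots.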

Stereo (two cameras)

A calibrated pair computes absolute, metric depth from binocular disparity ("1.42 meters," not "closer than the shelf behind it"). Best for robotics, grasping, dimensioning, and anything that acts on real units. A camera-native alternative to lidar at a fraction of the cost.
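For a rectified stereo pair, the relation behind that number is depth = focal length x baseline / disparity, with focal length and disparity in pixels and baseline in meters. A minimal sketch follows; the function name and the example focal length and baseline are illustrative and not the specification of any particular camera.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Metric depth in meters for one pixel of a rectified stereo pair.

    Returns None when there is no positive disparity (no match, or a point at infinity).
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


# Illustrative values: 800 px focal length, 71 mm baseline, 40 px disparity.
distance_m = depth_from_disparity(40.0, focal_px=800.0, baseline_m=0.071)
```

With these example numbers the result is about 1.42 meters. Stereo cameras and their SDKs normally perform this conversion for you and hand back the depth frame directly.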

	Monocular (one camera)	Stereo (two cameras)
Depth type	Relative by default; metric requires calibration	Metric out of the box, from the known baseline
Hardware	Any existing RGB camera, no new capex	Stereo camera or calibrated pair, typically a few hundred dollars per unit
Accuracy	Consistent ordering of near and far; absolute error grows without calibration	Strong at short-to-mid range; error grows with distance as disparity shrinks
Best for	Retrofits on installed cameras, proximity ranking, monitoring and alerts	Robotics, grasping, dimensioning, anything that acts on real units
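The stereo accuracy entry follows from the disparity relation. Differentiating depth = focal length x baseline / disparity gives a depth error of roughly depth squared / (focal length x baseline) times the disparity error, so the error grows with the square of the distance. The sketch below uses illustrative camera values and an assumed matching error of a quarter pixel; the function name is hypothetical.

```python
def depth_error_m(depth_m, focal_px, baseline_m, disparity_error_px=0.25):
    """First-order depth uncertainty for a given disparity matching error."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Illustrative camera: 800 px focal length, 71 mm baseline.
errors = {z: depth_error_m(z, focal_px=800.0, baseline_m=0.071) for z in (1.0, 2.0, 4.0, 8.0)}
```

Doubling the distance quadruples the expected error. A wider baseline or a longer focal length reduces it, which is the usual trade-off when choosing a stereo camera for a given working range.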

The short version: if the cameras are already on the wall, monocular depth gets you there without new hardware. If a machine has to move, grasp, or measure based on the number, stereo earns its hardware cost. Many operations run both, monocular on the installed fleet and stereo on the robots.

Built for robots that act on real units

Metric depth, commodity hardware, and a commercial-safe stack you can deploy today.

Metric depth, not relative

Stereo produces absolute distances from a known baseline, so a robot knows it has exactly 2.8 meters in which to stop. That difference lets a machine move, grasp, and measure on the number rather than merely rank near against far.

A camera-native alternative to lidar

A calibrated pair of commodity cameras costs a fraction of a lidar unit and captures color and texture lidar cannot. For cost-sensitive robots working at room-to-warehouse distances, stereo closes the gap between a webcam and a lidar.

Deploy today on edge hardware

Pair a Luxonis OAK-D, Stereolabs ZED, or Intel RealSense with RF-DETR and Roboflow Inference to get metric depth per detection now, on hardware you can buy today, instead of waiting for the September 2026 release.

Commercial-safe licensing

RF-DETR, the recommended detector, ships under the permissive Apache 2.0 license. YOLO-StereoDepth licensing is unannounced, and previous YOLO releases shipped under AGPL-3.0. Since robotics deployments are almost always commercial, build on a license you can trust.

Vision AI is already running on robots in production

Half the Fortune 100 build computer vision with Roboflow, with detection deployed on AMRs, robot arms, and automation at human scale.

Edge-native: deploy on Luxonis OAK-D, Stereolabs ZED, Intel RealSense, and more

1M+ engineers and 16,000+ organizations building on the platform

55B+ model inferences run in production across critical industries

Trusted by teams at BNSF, Rivian, GE Vernova, Cummins, USG, Pella, and Peer Robotics.

Frequently asked questions

What is YOLO-StereoDepth?

YOLO-StereoDepth is an announced stereo depth estimation model in the YOLO family, part of the YOLO27 generation and planned for September 2026. It computes metric depth from two cameras using binocular disparity, the same principle as human vision: two cameras a known distance apart capture the same scene, and depth is computed from the difference between the views. Because the baseline is known, stereo depth produces absolute, metric distances ("1.42 meters," not "closer than the shelf behind it"), which is why it is positioned as a camera-native alternative to lidar for robotics.

Stereo vs. monocular depth: when should I use which?

If the cameras are already on the wall and you need to know what is closer to what, monocular depth (like Depth Anything 3) gets you there without new hardware, giving relative depth that requires calibration for real units. If a machine has to move, grasp, or measure based on the number, stereo earns its hardware cost, producing metric distances out of the box from a calibrated pair that typically costs a few hundred dollars per unit. Monocular is best for retrofits on installed cameras, proximity ranking, and monitoring; stereo is best for robotics, grasping, dimensioning, and anything that acts on real units. Many operations run both.

How do I read metric depth per detection on edge hardware today?

Use a stereo depth camera such as the Luxonis OAK-D, Stereolabs ZED, or Intel RealSense, which compute metric depth on the camera or in the vendor SDK and output a depth frame aligned to the RGB image. Run a detector on the RGB stream with Roboflow Inference on your edge device (RF-DETR is recommended), then for each detection sample the aligned depth frame at the bounding box center, or take the median over the box, to get a metric distance per object. Build the logic in Roboflow Workflows to attach depth to every detection and trigger actions like a stop signal, a grasp target, or a dimension estimate.
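For the grasp-target and dimension-estimate cases, a detection's pixel coordinates and its metric depth can be turned into real-world quantities with the pinhole camera model. The sketch below is an illustration under assumptions: the function names are hypothetical, the intrinsics (fx, fy, cx, cy, in pixels) are example values that would come from your camera's calibration, and the box-size estimate assumes the object is roughly flat and facing the camera.

```python
def pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a pixel and its depth to a 3D point (X, Y, Z) in the camera frame, in meters."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def box_size_m(box, depth_m, fx, fy):
    """Approximate real-world width and height of a box at the given depth."""
    x0, y0, x1, y1 = box
    return ((x1 - x0) * depth_m / fx, (y1 - y0) * depth_m / fy)


# Illustrative intrinsics for a 1280x720 image.
fx, fy, cx, cy = 800.0, 800.0, 640.0, 360.0
box = (600, 300, 800, 400)
center_u, center_v = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
grasp_point = pixel_to_point(center_u, center_v, 2.0, fx, fy, cx, cy)
width_m, height_m = box_size_m(box, 2.0, fx, fy)
```

The resulting point is in the camera's coordinate frame. A robot arm would still need the camera-to-robot transform from a hand-eye calibration before it can act on it.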

Is the licensing safe for commercial robotics?

RF-DETR, the recommended detector, is released under the Apache 2.0 license, free to use commercially with no copyleft obligations. YOLO-StereoDepth licensing has not been announced, and previous similar YOLO releases shipped under AGPL-3.0, which requires open-sourcing derivative works unless you buy a commercial license. Robotics deployments are almost always commercial, so this is worth confirming before you build on it.
