You can use ML Kit to detect and track objects in successive video frames.
When you pass an image to ML Kit, it detects up to five objects in the image along with the position of each object in the image. When detecting objects in video streams, each object has a unique ID that you can use to track the object from frame to frame. You can also optionally enable coarse object classification, which labels objects with broad category descriptions.
Try it out
- Play around with the sample app to see an example usage of this API.
- See the Material Design showcase app for an end-to-end implementation of this API.
Before you begin
- Include the following ML Kit pods in your Podfile:
pod 'GoogleMLKit/ObjectDetection', '7.0.0'
- After you install or update your project's Pods, open your Xcode project using its
.xcworkspace
. ML Kit is supported in Xcode version 12.4 or greater.
1. Configure the object detector
To detect and track objects, first create an instance of
ObjectDetector
and optionally specify any detector settings you want to
change from the default.
Configure the object detector for your use case with an
ObjectDetectorOptions
object. You can change the following settings:Object Detector Settings Detection mode .stream
(default) |.singleImage
In stream mode (default), the object detector runs with very low latency, but might produce incomplete results (such as unspecified bounding boxes or categories) on the first few invocations of the detector. Also, in stream mode, the detector assigns tracking IDs to objects, which you can use to track objects across frames. Use this mode when you want to track objects, or when low latency is important, such as when processing video streams in real time.
In single image mode, the object detector returns the result after the object's bounding box is determined. If you also enable classification it returns the result after the bounding box and category label are both available. As a consequence, detection latency is potentially higher. Also, in single image mode, tracking IDs are not assigned. Use this mode if latency isn't critical and you don't want to deal with partial results.
Detect and track multiple objects false
(default) |true
Whether to detect and track up to five objects or only the most prominent object (default).
Classify objects false
(default) |true
Whether or not to classify detected objects into coarse categories. When enabled, the object detector classifies objects into the following categories: fashion goods, food, home goods, places, and plants.
The object detection and tracking API is optimized for these two core use cases:
- Live detection and tracking of the most prominent object in the camera viewfinder.
- The detection of multiple objects in a static image.
To configure the API for these use cases:
Swift
// Live detection and tracking let options = ObjectDetectorOptions() options.shouldEnableClassification = true // Multiple object detection in static images let options = ObjectDetectorOptions() options.detectorMode = .singleImage options.shouldEnableMultipleObjects = true options.shouldEnableClassification = true
Objective-C
// Live detection and tracking MLKObjectDetectorOptions *options = [[MLKObjectDetectorOptions alloc] init]; options.shouldEnableClassification = YES; // Multiple object detection in static images MLKObjectDetectorOptions *options = [[MLKOptions alloc] init]; options.detectorMode = MLKObjectDetectorModeSingleImage; options.shouldEnableMultipleObjects = YES; options.shouldEnableClassification = YES;
- Get an instance of
ObjectDetector
:
Swift
let objectDetector = ObjectDetector.objectDetector() // Or, to change the default settings: let objectDetector = ObjectDetector.objectDetector(options: options)
Objective-C
MLKObjectDetector *objectDetector = [MLKObjectDetector objectDetector]; // Or, to change the default settings: MLKObjectDetector *objectDetector = [MLKObjectDetector objectDetectorWithOptions:options];
2. Prepare the input image
To detect and track objects, do the following for each image or frame of video.
If you enabled stream mode, you must create VisionImage
objects from
CMSampleBuffer
s.
Create a VisionImage
object using a UIImage
or a
CMSampleBuffer
.
If you use a UIImage
, follow these steps:
- Create a
VisionImage
object with theUIImage
. Make sure to specify the correct.orientation
.Swift
let image = VisionImage(image: UIImage) visionImage.orientation = image.imageOrientation
Objective-C
MLKVisionImage *visionImage = [[MLKVisionImage alloc] initWithImage:image]; visionImage.orientation = image.imageOrientation;
If you use a
CMSampleBuffer
, follow these steps:-
Specify the orientation of the image data contained in the
CMSampleBuffer
.To get the image orientation:
Swift
func imageOrientation( deviceOrientation: UIDeviceOrientation, cameraPosition: AVCaptureDevice.Position ) -> UIImage.Orientation { switch deviceOrientation { case .portrait: return cameraPosition == .front ? .leftMirrored : .right case .landscapeLeft: return cameraPosition == .front ? .downMirrored : .up case .portraitUpsideDown: return cameraPosition == .front ? .rightMirrored : .left case .landscapeRight: return cameraPosition == .front ? .upMirrored : .down case .faceDown, .faceUp, .unknown: return .up } }
Objective-C
- (UIImageOrientation) imageOrientationFromDeviceOrientation:(UIDeviceOrientation)deviceOrientation cameraPosition:(AVCaptureDevicePosition)cameraPosition { switch (deviceOrientation) { case UIDeviceOrientationPortrait: return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationLeftMirrored : UIImageOrientationRight; case UIDeviceOrientationLandscapeLeft: return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationDownMirrored : UIImageOrientationUp; case UIDeviceOrientationPortraitUpsideDown: return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationRightMirrored : UIImageOrientationLeft; case UIDeviceOrientationLandscapeRight: return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationUpMirrored : UIImageOrientationDown; case UIDeviceOrientationUnknown: case UIDeviceOrientationFaceUp: case UIDeviceOrientationFaceDown: return UIImageOrientationUp; } }
- Create a
VisionImage
object using theCMSampleBuffer
object and orientation:Swift
let image = VisionImage(buffer: sampleBuffer) image.orientation = imageOrientation( deviceOrientation: UIDevice.current.orientation, cameraPosition: cameraPosition)
Objective-C
MLKVisionImage *image = [[MLKVisionImage alloc] initWithBuffer:sampleBuffer]; image.orientation = [self imageOrientationFromDeviceOrientation:UIDevice.currentDevice.orientation cameraPosition:cameraPosition];
3. Process the image
Pass theVisionImage
to one of the object detector's image processing methods. You can either use the asynchronousprocess(image:)
method or the synchronousresults()
method.To detect objects asynchronously:
Swift
objectDetector.process(image) { objects, error in guard error == nil else { // Error. return } guard !objects.isEmpty else { // No objects detected. return } // Success. Get object info here. // ... }
Objective-C
[objectDetector processImage:image completion:^(NSArray
* _Nullable objects, NSError * _Nullable error) { if (error == nil) { return; } if (objects.count == 0) { // No objects detected. return; } // Success. Get object info here. }]; To detect objects synchronously:
Swift
var objects: [Object] do { objects = try objectDetector.results(in: image) } catch let error { print("Failed to detect object with error: \(error.localizedDescription).") return } guard !objects.isEmpty else { print("Object detector returned no results.") return } // Success. Get object info here.
Objective-C
NSError *error; NSArray
*objects = [objectDetector resultsInImage:image error:&error]; if (error == nil) { return; } if (objects.count == 0) { // No objects detected. return; } // Success. Get object info here. 4. Get information about detected objects
If the call to the image processor succeeds, it either passes a list ofObject
s to the completion handler or returns the list, depending on whether you called the asynchronous or synchronous method.Each
Object
contains the following properties:frame
A CGRect
indicating the position of the object in the image.trackingID
An integer that identifies the object across images, or `nil` in single image mode. labels
An array of labels describing the object returned by the detector. The property is empty if the detector option shouldEnableClassification
is set tofalse
.Swift
// objects contains one item if multiple object detection wasn't enabled. for object in objects { let frame = object.frame let trackingID = object.trackingID // If classification was enabled: let description = object.labels.enumerated().map { (index, label) in "Label \(index): \(label.text), \(label.confidence)" }.joined(separator:"\n") }
Objective-C
// The list of detected objects contains one item if multiple // object detection wasn't enabled. for (MLKObject *object in objects) { CGRect frame = object.frame; NSNumber *trackingID = object.trackingID; for (MLKObjectLabel *label in object.labels) { NSString *labelString = [NSString stringWithFormat: @"%@, %f, %lu", label.text, label.confidence, (unsigned long)label.index]; ... } }
Improving usability and performance
For the best user experience, follow these guidelines in your app:
- Successful object detection depends on the object's visual complexity. In order to be detected, objects with a small number of visual features might need to take up a larger part of the image. You should provide users with guidance on capturing input that works well with the kind of objects you want to detect.
- When you use classification, if you want to detect objects that don't fall cleanly into the supported categories, implement special handling for unknown objects.
Also, check out the Material Design Patterns for machine learning-powered features collection.
When you use streaming mode in a real-time application, follow these guidelines to achieve the best framerates:
- Don't use multiple object detection in streaming mode, as most devices won't be able to produce adequate framerates.
- Disable classification if you don't need it.
- For processing video frames, use the
results(in:)
synchronous API of the detector. Call this method from theAVCaptureVideoDataOutputSampleBufferDelegate
'scaptureOutput(_, didOutput:from:)
function to synchronously get results from the given video frame. KeepAVCaptureVideoDataOutput
'salwaysDiscardsLateVideoFrames
astrue
to throttle calls to the detector. If a new video frame becomes available while the detector is running, it will be dropped. - If you use the output of the detector to overlay graphics on the input image, first get the result from ML Kit, then render the image and overlay in a single step. By doing so, you render to the display surface only once for each processed input frame. See the updatePreviewOverlayViewWithLastFrame in the ML Kit quickstart sample for an example.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-01-09 UTC.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-01-09 UTC."],[[["ML Kit can detect and track up to five objects in images, assigning unique IDs for tracking in video streams."],["You can enable coarse classification to label objects with broad categories like fashion, food, or plants."],["The API is optimized for live tracking of one prominent object and detecting multiple objects in still images."],["Integrate the `GoogleMLKit/ObjectDetection` pod and configure options for your desired detection mode and features."],["Process images asynchronously or synchronously, then access detected object information like frame, tracking ID, and labels."]]],[]]
-