As camera resolutions continue to improve, capturing the full scene of a classroom, lecture, or presentation hall and autonomously focusing attention on the presenter becomes more practical. Traditionally, a camera operator pans and zooms to follow the presenter as they move around the stage, drawing the audience's attention to the intended target. Professionally filmed videos direct the audience's attention to the speaker, and their production quality contributes to a more informative presentation.
Wide-angle, static camera positions are an alternative for capturing presentations, but they generally fail to draw the audience's attention to the speaker. With robust object detection, locating the presenter can be automated, and auto-cropping around the presenter offers a budget-friendly alternative to professional video production facilities.
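As a minimal sketch of the auto-cropping idea, the function below centers a fixed-size crop window on a detected bounding box and clamps it to the frame. The function name, box format, and dimensions are illustrative assumptions, not part of the original experiment.

```python
def crop_window(box, frame_w, frame_h, crop_w, crop_h):
    """Center a fixed-size crop window on a detected bounding box,
    clamped so the window stays inside the frame.

    box is an illustrative (x, y, w, h) presenter bounding box."""
    bx, by, bw, bh = box
    cx, cy = bx + bw / 2, by + bh / 2  # center of the detection
    # top-left corner of the crop, clamped to the frame bounds
    x = min(max(cx - crop_w / 2, 0), frame_w - crop_w)
    y = min(max(cy - crop_h / 2, 0), frame_h - crop_h)
    return int(x), int(y), crop_w, crop_h

# presenter near the left edge of a 1920x1080 frame, 640x360 crop:
# the window clamps to the frame instead of running off the edge
print(crop_window((100, 400, 80, 200), 1920, 1080, 640, 360))
```

Applying this per frame to the wide-angle video produces the "virtual camera" whose motion the mechanisms below control.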
YOLO (You Only Look Once) takes a different approach from classic computer vision by using a classifier as a detector. Authored by Joseph Redmon at the University of Washington, YOLO subdivides an image into regions, predicts objects and class probabilities for each region, then merges the overlapping predictions into a list of final objects.
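The merging step can be sketched with a greedy non-maximum suppression pass, which is how overlapping per-region predictions are typically collapsed into final objects. This is a self-contained illustration with made-up detections, not YOLO itself.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def merge_detections(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-confidence
    box and drop any box that overlaps a kept one too strongly."""
    kept = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        if all(iou(box, k) < iou_threshold for k, _ in kept):
            kept.append((box, score))
    return kept

# two overlapping candidate boxes for the same person collapse to one
dets = [((100, 100, 50, 120), 0.9), ((105, 102, 50, 120), 0.7)]
print(merge_detections(dets))
```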
Below is a proof-of-concept that uses YOLO for auto-cropping. The wide-angle source video is used as input, object detection is focused on the front of the room to find the presenter, and the frame is auto-cropped around the presenter's location. Once that location is available, we 'pan the camera' in one of three ways: first by snapping directly to the presenter's location, second by smoothing the camera motion with a two-dimensional shaper, and third by using the shaper but only moving the camera when the presenter nears the edges of the current crop window.
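The three panning mechanisms can be sketched as follows. Exponential smoothing stands in here for the two-dimensional shaper used in the experiment, and the margin and crop sizes are illustrative assumptions.

```python
def snap(cam, target):
    """Mechanism 1: jump the camera straight to the presenter."""
    return target

def smooth(cam, target, alpha=0.2):
    """Mechanism 2: ease toward the presenter each frame.
    Simple exponential smoothing stands in for the 2-D shaper."""
    return tuple(c + alpha * (t - c) for c, t in zip(cam, target))

def dead_zone(cam, target, margin=100, crop_w=640, crop_h=360, alpha=0.2):
    """Mechanism 3: hold still while the presenter stays well inside
    the crop window; ease toward them once they near its edges."""
    dx, dy = target[0] - cam[0], target[1] - cam[1]
    near_edge = (abs(dx) > crop_w / 2 - margin or
                 abs(dy) > crop_h / 2 - margin)
    return smooth(cam, target, alpha) if near_edge else cam

cam = (400.0, 300.0)
print(snap(cam, (900.0, 300.0)))       # jumps the full distance at once
print(smooth(cam, (900.0, 300.0)))     # moves only a fraction of the way
print(dead_zone(cam, (420.0, 310.0)))  # presenter near center: no motion
```

The dead-zone variant trades responsiveness for stability: small presenter movements (gestures, shifting weight) no longer jiggle the virtual camera.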
Each mechanism is a rough implementation, focused on rapid proof-of-concept rather than optimal results, but you get the idea.
The source video was found here: Minnebar7