This post is part of a series where I explain how to build an augmented reality Sudoku solver. All of the image processing and computer vision is done from scratch. See the first post for a complete list of topics.
Capturing video on Linux is done using the Video for Linux (V4L2) API. It’s actually a really nice API and very well documented. I encourage you to check out the documentation if you need to do something I don’t cover here.
I’m going to go over how to find all of the connected cameras, query their supported formats, the simplest method to get a video frame, and convert it to an RGB image.
Listing Connected Cameras
Cameras on Linux show up as files in the /dev directory. They are always listed in the format video{number} where {number} can be any number from 0 to 255. Finding all of the connected cameras is as simple as searching the /dev directory for filenames that start with video.
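A minimal sketch of that search, assuming a Linux system with the standard `dirent` API (the helper names `is_video_device` and `list_cameras` are my own, not part of V4L2):

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

// Returns 1 if the filename follows the V4L2 naming convention
// (video0, video1, ...), 0 otherwise.
int is_video_device(const char *name) {
    return strncmp(name, "video", 5) == 0;
}

// Prints every /dev/video* entry and returns how many were found,
// or -1 if /dev can't be opened.
int list_cameras(void) {
    DIR *dev = opendir("/dev");
    if (!dev)
        return -1;
    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dev)) != NULL) {
        if (is_video_device(entry->d_name)) {
            printf("/dev/%s\n", entry->d_name);
            count++;
        }
    }
    closedir(dev);
    return count;
}
```

One caveat: on recent kernels a single physical camera may expose more than one videoN node (some are metadata-only), so treat the results as candidates to query rather than as distinct cameras.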
Cameras capture video frames in different sizes, pixel formats, and rates. You can query a camera’s supported formats to find the best fit for your use or just let the driver pick a sane default.
There’s certainly an advantage to picking the format yourself. For example, since cameras rarely provide an RGB image, if a camera supports a format you already know how to convert to RGB, you can save time by using it. You can also lower the frame rate for a cleaner image at the cost of more motion blur, or raise it for a noisier image with less motion blur.
The first thing you need to do to query a camera is to open the camera and get a file descriptor. This file descriptor is then used to refer to that specific camera.
When you’re done using the camera, don’t forget to close the file descriptor.
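In code, that’s just open(2) and close(2); `/dev/video0` below is only an example path, and the wrapper names are mine:

```c
#include <fcntl.h>
#include <unistd.h>

// Open a camera node, e.g. "/dev/video0". Returns a file descriptor,
// or -1 on failure. Some drivers require O_RDWR even for capture-only use.
int open_camera(const char *path) {
    return open(path, O_RDWR);
}

// Release the camera when you're done with it.
void close_camera(int fd) {
    close(fd);
}
```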
All camera querying and configuration is done using the ioctl function. It’s used by passing an open file descriptor for the device, a constant that indicates the operation to be performed, and a pointer to some type depending on the indicated operation¹.
Three different ioctl requests are used to query the formats supported by a camera. They give the supported pixel formats, frame sizes (width and height), and frame intervals². Querying is done in that order: each pixel format has its own supported frame sizes, which in turn restrict the supported frame intervals.
The pixel format is queried by calling ioctl with VIDIOC_ENUM_FMT as the second argument and a pointer to a v4l2_fmtdesc struct as the third argument. The v4l2_fmtdesc struct has an index field that indicates which pixel format is populated into the struct by the ioctl call. The index is then incremented and ioctl is called again until it returns an error code indicating that there are no more supported formats.
The list of possible pixel formats is found in the documentation.
Getting a list of supported sizes is a little trickier. You start by calling ioctl with VIDIOC_ENUM_FRAMESIZES and a pointer to a v4l2_frmsizeenum struct. But after the first call, you need to check if frame sizes are discrete or given as a minimum and maximum that can be adjusted in some step size.
Finally, the list of supported frame intervals is very similar to getting frame sizes. Call ioctl with VIDIOC_ENUM_FRAMEINTERVALS and a pointer to a v4l2_frmivalenum struct. The frame intervals can be given in discrete chunks, incremental steps, or any value within a specified range.
Selecting a Format
Setting the camera to use a particular format and size is done by filling out a v4l2_format struct and passing it to an ioctl call with VIDIOC_S_FMT. If you request an unsupported format or size, the driver will select a different one and modify the v4l2_format struct to indicate what’s being used instead.
After you set the format and size, you can change the frame interval as well. If you change this first, it might be reset to a default value. You can also reset it to default by setting the numerator and denominator to zero.
Capturing a Video Frame
There are multiple ways to capture a frame of video using V4L2. The simplest is to read from the file descriptor as if it were an ordinary file. Each frame is sizeimage bytes long, where sizeimage is the field the driver filled in on the v4l2_format struct during the VIDIOC_S_FMT call. Remember that the pixel format is probably not RGB, so the frame usually needs to be converted before it can be used.
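A sketch of such a read loop. The retry logic is defensive; with a V4L2 device a single read usually returns a whole frame:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

// Read one frame into buffer. size is the sizeimage value reported by
// the VIDIOC_S_FMT call, and buffer must be at least that large.
// Returns 0 on success, -1 on error.
int read_frame(int fd, unsigned char *buffer, size_t size) {
    size_t total = 0;
    while (total < size) {
        ssize_t n = read(fd, buffer + total, size - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;  // interrupted by a signal; just retry
            return -1;
        }
        if (n == 0)
            return -1;  // unexpected end of stream
        total += (size_t)n;
    }
    return 0;
}
```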
Converting a Video Frame To RGB
Since the camera’s pixel format is not RGB and we need RGB, it has to be converted. The process varies by format, so I’m only going to go over one format here: YUYV 4:2:2.
The YUYV 4:2:2 format is a Y’CbCr interleaved format where every pixel has its own luminance (Y’) channel and every two pixels share the Cb and Cr channels. Its layout is a repeating four-byte pattern: `[Y’_0, C_b, Y’_1, C_r]`, where `Y’_0` is the luminance of the first pixel, `C_b` is the Cb channel, `Y’_1` is the luminance of the second pixel, and `C_r` is the Cr channel.
The actual Y’CbCr to RGB conversion can be done by following the JPEG standard which is good enough because we are assuming all cameras are uncalibrated.
Since this format specifies two Y’ for every pair of Cb and Cr, it’s easiest to do the actual conversion to two pixels at a time.
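Here’s a sketch of that per-pair conversion using the JPEG (full-range BT.601) coefficients mentioned above, in 16.16 fixed point:

```c
#include <stddef.h>

// Clamp an intermediate value into the 0..255 range of a byte.
static unsigned char clamp_u8(int v) {
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (unsigned char)v;
}

// JPEG full-range Y'CbCr -> RGB for one pixel, in 16.16 fixed point:
//   R = Y + 1.402 (Cr - 128)
//   G = Y - 0.344 (Cb - 128) - 0.714 (Cr - 128)
//   B = Y + 1.772 (Cb - 128)
// Relies on arithmetic right shift of negatives (gcc/clang behavior).
static void ycbcr_to_rgb(int y, int cb, int cr, unsigned char *rgb) {
    rgb[0] = clamp_u8(y + (91881 * (cr - 128) >> 16));
    rgb[1] = clamp_u8(y - ((22554 * (cb - 128) + 46802 * (cr - 128)) >> 16));
    rgb[2] = clamp_u8(y + (116130 * (cb - 128) >> 16));
}

// Convert a YUYV 4:2:2 frame to packed 24-bit RGB, two pixels at a time.
// rgb must hold width * height * 3 bytes; width must be even.
void yuyv_to_rgb(const unsigned char *yuyv, unsigned char *rgb,
                 size_t width, size_t height) {
    size_t pairs = width * height / 2;  // two pixels per 4-byte group
    for (size_t i = 0; i < pairs; i++) {
        int y0 = yuyv[4 * i + 0];
        int cb = yuyv[4 * i + 1];
        int y1 = yuyv[4 * i + 2];
        int cr = yuyv[4 * i + 3];
        ycbcr_to_rgb(y0, cb, cr, rgb + 6 * i);      // first pixel
        ycbcr_to_rgb(y1, cb, cr, rgb + 6 * i + 3);  // second pixel, shared chroma
    }
}
```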
Now the image is ready to be displayed or processed further. Working with a single camera on a single platform isn’t too much work. Expanding to support more, however, can be very time consuming. I highly recommend using a library that handles most of this for you. Next time, I’m going to ignore my own advice and show how to capture video frames from a camera on Windows.
¹ The function definition is actually int ioctl(int fd, unsigned long int request, ...);. This indicates that the function takes two or more arguments, but we only need the form that takes exactly three arguments. ↩
² Frame interval is the time between successive frames. Frames per second is equal to 1 / frame interval. ↩