Face recognition with dlib
Dlib implements a high-quality, deep learning-based face recognition algorithm that offers state-of-the-art accuracy. More specifically, the model achieves an accuracy of 99.38% on the Labeled Faces in the Wild (LFW) database.
The implementation of this algorithm is based on the ResNet-34 network
proposed in the paper Deep Residual Learning for Image Recognition (2016),
which was trained using three million faces. The created model (21.4 MB) can
be downloaded from https://github.com/davisking/dlib-models/blob/master/dlib_face_rec
This network is trained in a way that generates a 128-dimensional
(128D) descriptor, used to quantify the face. The training step is performed using
triplets. A single training triplet is composed of three images: two of them correspond to the same person, and the third corresponds to a different person. The network generates the 128D descriptor for each of the three images and slightly modifies the network weights so that the two vectors that correspond to the same person move closer together, while the feature vector of the different person moves further away. The triplet loss function formalizes this objective: it pushes the 128D descriptors of two images of the same person closer together, while pulling the 128D descriptors of two images of different people further apart.
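To make this objective concrete, here is a minimal NumPy sketch of the triplet loss described above (the margin value and the toy descriptors are assumptions for the example; this is not dlib's actual training code):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: encourage ||a - p|| to be smaller than ||a - n|| by a margin."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to the same person
    d_neg = np.linalg.norm(anchor - negative)  # distance to a different person
    return max(0.0, d_pos - d_neg + margin)

# Toy 128D descriptors:
rng = np.random.default_rng(0)
anchor = rng.normal(size=128)
positive = anchor + 0.01 * rng.normal(size=128)  # near-duplicate (same person)
negative = rng.normal(size=128)                  # unrelated (different person)

print(triplet_loss(anchor, positive, negative))  # 0.0: positive is already much closer
```

During training, the gradient of this loss is what nudges the network weights so that descriptors of the same person cluster together.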
This process is repeated millions of times for millions of images of thousands of different people, and the network finally learns to generate a discriminative 128D descriptor for each face. The final 128D descriptor is a good encoding for the following reasons:
The generated 128D descriptors of two images of the same person are quite similar to each other.
The generated 128D descriptors of two images of different people are very different from each other.
Therefore, making use of the dlib functionality, we can use a pre-trained model
to map a face into a 128D descriptor. Afterward, we can use these feature
vectors to perform face recognition.
The encode_face_dlib.py script shows how to calculate the 128D descriptor, used to
quantify the face. The process is quite simple, as shown in the following code:
# Load image:
image = cv2.imread("jared_1.jpg")
# Convert image from BGR (OpenCV format) to RGB (dlib format):
rgb = image[:, :, ::-1]
# Calculate the encodings for every face of the image:
encodings = face_encodings(rgb)
# Show the first encoding:
print(encodings[0])
As you can guess, the face_encodings() function returns the 128D descriptor for
each face in the image:
import dlib
import numpy as np

pose_predictor_5_point = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")
detector = dlib.get_frontal_face_detector()

def face_encodings(face_image, number_of_times_to_upsample=1, num_jitters=1):
    """Returns the 128D descriptor for each face in the image"""
    # Detect faces:
    face_locations = detector(face_image, number_of_times_to_upsample)
    # Detected landmarks:
    raw_landmarks = [pose_predictor_5_point(face_image, face_location) for face_location in face_locations]
    # Calculate the face encoding for every detected face using the detected landmarks for each one:
    return [np.array(face_encoder.compute_face_descriptor(face_image, raw_landmark_set, num_jitters)) for raw_landmark_set in raw_landmarks]
As you can see, the key point is to calculate the face encoding for every detected face using the detected landmarks for each one, calling dlib's face_encoder.compute_face_descriptor() function.
The num_jitters parameter sets the number of times each face will be randomly jittered, and the average of the 128D descriptors computed from those jittered versions is returned.
In this case, the output (encoding 128D descriptor) is as follows:
[-0.08550473 0.14213498 0.01144615 -0.05947386 -0.05831585 0.01127038 -0.05497809 -0.03466939 0.14322688 -0.1001832 0.17384697 0.02444006 -0.25994921 0.13708787 -0.08945534 0.11796272 -0.25426617 -0.0829383 -0.05489913 -0.10409787 0.07074109 0.05810066 -0.03349853 0.07649824 -0.07817822 -0.29932317 -0.15986916 -0.087205 0.10356752 -0.12659372 0.01795856 -0.01736169 -0.17094864 -0.01318233 -0.00201829 0.0104903 -0.02453734 -0.11754096 0.2014133 0.12671679 -0.0271306 -0.02350519 0.08327188 0.36815098 0.12599576 0.04692561 0.03585262 -0.03999642 0.23675609 -0.28394884 0.11896492 0.11870296 0.20243752 0.2106981 0.03092775 -0.14315812 0.07708532 0.16536239 -0.19648902 0.22793224 0.06825032 -0.00117573 0.00304667 -0.01902146 0.2539638 0.09768397 -0.13558105 -0.15079053 0.11357955 -0.14893037 -0.09028706 0.03625216 -0.13004847 -0.16567475 -0.21958281 0.08687183 0.35941613 0.16637127 -0.08334676 0.02806632 -0.09188357 -0.10760318 0.02889947 0.08376379 -0.11524356 -0.00998984 -0.05582509 0.09372396 0.30287758 -0.01063644 -0.07903813 0.30418509 -0.01998731 0.0752025 -0.00424637 0.07463965 -0.12972119 -0.04034984 -0.08435905 -0.01642537 0.00847361 -0.09549874 -0.07568903 0.06476583 -0.19202243 0.16904426 -0.01247451 0.03941975 -0.01960869 0.02145611 -0.25607404 -0.03039071 0.20248309 -0.25835767 0.21397503 0.19302645 0.07284702 0.07879912 0.06171442 0.02366752 0.06781606 -0.06446165 -0.14713687 -0.0714087 0.11978403 -0.01525984 -0.04687868 0.00167655]
Once the faces are encoded, the next step is to perform the recognition.
The recognition can then be performed by computing a distance metric between 128D descriptors. Indeed, if two face descriptor vectors have a Euclidean distance between them that is less than 0.6, they can be considered to belong to the same person; otherwise, they are from different people.
The Euclidean distance can be calculated using numpy.linalg.norm().
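To make the 0.6 threshold concrete, here is a minimal NumPy sketch of this decision rule (the descriptors below are made-up toy vectors, not real dlib encodings):

```python
import numpy as np

SAME_PERSON_THRESHOLD = 0.6  # empirical threshold used with dlib's model

def is_same_person(encoding_1, encoding_2, threshold=SAME_PERSON_THRESHOLD):
    """Two 128D descriptors belong to the same person if their Euclidean distance is below the threshold."""
    return np.linalg.norm(encoding_1 - encoding_2) < threshold

# Toy descriptors (real ones come from face_encodings()):
a = np.zeros(128)
b = a.copy()
b[0] = 0.5   # distance 0.5 -> considered the same person
c = a.copy()
c[:4] = 0.5  # distance 1.0 -> considered different people

print(is_same_person(a, b))  # True
print(is_same_person(a, c))  # False
```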
In the compare_faces_dlib.py script, we compare four images against another image.
To compare the faces, we have coded two functions: compare_faces() and
compare_faces_ordered(). The compare_faces() function returns the distances when comparing a list of face encodings against a candidate to check:
def compare_faces(face_encodings, encoding_to_check):
"""Returns the distances when comparing a list of face encodings against a candidate to check"""
return list(np.linalg.norm(face_encodings - encoding_to_check, axis=1))
The compare_faces_ordered() function returns the ordered distances and the
corresponding names when comparing a list of face encodings against a
candidate to check:
def compare_faces_ordered(face_encodings, face_names, encoding_to_check):
"""Returns the ordered distances and names when comparing a list of face encodings against a candidate to check"""
distances = list(np.linalg.norm(face_encodings - encoding_to_check, axis=1))
return zip(*sorted(zip(distances, face_names)))
Therefore, the first step in comparing four images against another image is to
load all of them and convert to RGB (dlib format):
# Load images:
known_image_1 = cv2.imread("jared_1.jpg")
known_image_2 = cv2.imread("jared_2.jpg")
known_image_3 = cv2.imread("jared_3.jpg")
known_image_4 = cv2.imread("obama.jpg")
unknown_image = cv2.imread("jared_4.jpg")
# Convert image from BGR (OpenCV format) to RGB (dlib format):
known_image_1 = known_image_1[:, :, ::-1]
known_image_2 = known_image_2[:, :, ::-1]
known_image_3 = known_image_3[:, :, ::-1]
known_image_4 = known_image_4[:, :, ::-1]
unknown_image = unknown_image[:, :, ::-1]
# Create names for each loaded image:
names = ["jared_1.jpg", "jared_2.jpg", "jared_3.jpg", "obama.jpg"]
The next step is to compute the encodings for each image. Note that face_encodings() returns a list of encodings (one per detected face), so we take the first one:
# Create the encodings:
known_image_1_encoding = face_encodings(known_image_1)[0]
known_image_2_encoding = face_encodings(known_image_2)[0]
known_image_3_encoding = face_encodings(known_image_3)[0]
known_image_4_encoding = face_encodings(known_image_4)[0]
known_encodings = [known_image_1_encoding, known_image_2_encoding, known_image_3_encoding, known_image_4_encoding]
unknown_encoding = face_encodings(unknown_image)[0]
And finally, you can compare the faces using the previous functions. For
example, let's make use of the compare_faces_ordered() function:
computed_distances_ordered, ordered_names = compare_faces_ordered(known_encodings, names, unknown_encoding)
Doing so will give us the following:
(0.3913191431497527, 0.39983264838593896, 0.4104153683230741, 0.9053700273411349)
('jared_3.jpg', 'jared_1.jpg', 'jared_2.jpg', 'obama.jpg')
The first three values (0.3913191431497527, 0.39983264838593896, 0.4104153683230741) are
less than 0.6. This means that the first three images ('jared_3.jpg', 'jared_1.jpg',
'jared_2.jpg') can be considered from the same person as the image to check
('jared_4.jpg'). The fourth value obtained (0.9053700273411349) means that the fourth
image ('obama.jpg') is not the same person as the image to check.
This can be seen in the next screenshot:
In the previous screenshot, you can see that the first three images can be
considered from the same person (the obtained values are less than 0.6), while
the fourth image can be considered from another person (the obtained value is
greater than 0.6).
The next example gives you an introduction into how to handle mouse events with OpenCV. The cv2.setMouseCallback() function performs this functionality. The signature for this method is as follows:
cv2.setMouseCallback(windowName, onMouse, param=None)
This function establishes the mouse handler for the window named windowName. The onMouse function is the callback function, which is called when a mouse event is performed (for example, double-click, left-button down, left-button up, among others). The optional param parameter is used to pass additional information to the callback function.
So, the first step is to create the callback function:
# This is the mouse callback function:
def draw_circle(event, x, y, flags, param):
    if event == cv2.EVENT_LBUTTONDBLCLK:
        print("event: EVENT_LBUTTONDBLCLK")
        cv2.circle(image, (x, y), 10, colors['magenta'], -1)
    if event == cv2.EVENT_MOUSEMOVE:
        print("event: EVENT_MOUSEMOVE")
    if event == cv2.EVENT_LBUTTONUP:
        print("event: EVENT_LBUTTONUP")
    if event == cv2.EVENT_LBUTTONDOWN:
        print("event: EVENT_LBUTTONDOWN")
The draw_circle() function receives the specific event and the coordinates (x, y) for every mouse event. In this case, when a left double-click (cv2.EVENT_LBUTTONDBLCLK) is performed, we draw a circle at the corresponding (x, y) coordinates. Additionally, we have also printed some messages in order to see other produced events, but we do not use them to perform any additional actions.
The next step is to create a named window. In this case, we name it Image mouse. This is the named window the mouse callback function will be associated with:
# We create a named window:
cv2.namedWindow('Image mouse')
And finally, we set (or activate) the mouse callback function to the function we created before:
# We set the mouse callback function to 'draw_circle':
cv2.setMouseCallback('Image mouse', draw_circle)
In summary, when a left double-click is performed, a filled magenta circle is drawn centered at the (x, y) position of the performed double-click.
The full code for this example can be seen in the mouse_drawing.py script.
based tracker
The face_tracking_correlation_filters.py script can be modified to track an arbitrary object. In this case, we will use the mouse to select the object to track. If we press 1, the algorithm will start tracking the object inside the pre-defined bounding box. Additionally, if we press 2, the pre-defined bounding box will be emptied and the tracking algorithm will be stopped, allowing the user to select another bounding box.
To clarify how the face_tracking_correlation_filters.py script works, we have included two screenshots. In the first one, we can see that we need to select a bounding box to start the tracking. In the second one, we can see the output of an arbitrary frame while the algorithm is tracking the object inside the bounding box.
Marker-based augmented reality
In this section, we are going to see how marker-based augmented reality works. There are many libraries, algorithms, or packages that you can use to both generate and detect markers. In this sense, one that provides state-of-the-art performance in detecting markers is ArUco. ArUco automatically detects the markers and corrects possible errors. Additionally, ArUco proposes a solution to the occlusion problem by combining multiple markers with an occlusion mask, which is calculated by color segmentation.
As previously commented, pose estimation is a key process in augmented reality applications. Pose estimation can be performed based on markers. The main benefit of using markers is that they can be detected both efficiently and robustly, and the four corners of the marker can be accurately derived from the image. Finally, the camera pose can be obtained from the previously calculated four corners of the marker. Therefore, in the next subsections, we will see how to create marker-based augmented reality applications, starting with creating both markers and dictionaries.
Creating markers and dictionaries
The first step when using ArUco is the creation of markers and dictionaries. Firstly, an ArUco marker is a square marker composed of external and internal cells (also called bits). The external cells are set to black, creating an external border that can be detected fast and robustly. The remaining cells (the internal cells) are used for coding the marker. ArUco markers can also be created with different sizes: the size of the marker indicates the number of internal cells in the internal matrix. For example, a marker size of 5 x 5 (n=5) is composed of 25 internal cells. Additionally, you can also set the number of bits in the marker border. Secondly, a dictionary of markers is the set of markers considered to be used in a specific application.
While previous libraries considered only fixed dictionaries, ArUco proposes an automatic method for generating the markers with the desired number of markers and the desired number of bits. In this sense, ArUco includes some predefined dictionaries covering many configurations in connection with the number of markers and the marker sizes.
The first step to consider when creating your marker-based augmented reality application is to print the markers to use. In the aruco_create_markers.py script, we are creating some markers ready to print. The first step is to create the dictionary object. ArUco has some predefined dictionaries: DICT_4X4_50 = 0, DICT_4X4_100 = 1, DICT_4X4_250 = 2, DICT_4X4_1000 = 3, DICT_5X5_50 = 4, DICT_5X5_100 = 5, DICT_5X5_250 = 6, DICT_5X5_1000 = 7, DICT_6X6_50 = 8, DICT_6X6_100 = 9, DICT_6X6_250 = 10, DICT_6X6_1000 = 11, DICT_7X7_50 = 12, DICT_7X7_100 = 13, DICT_7X7_250 = 14, and DICT_7X7_1000 = 15.
In this case, we will create a dictionary composed of 250 markers using the cv2.aruco.Dictionary_get() function. Each marker will have a size of 7 x 7 (n=7):
aruco_dictionary = cv2.aruco.Dictionary_get(cv2.aruco.DICT_7X7_250)
At this point, a marker can be drawn using the cv2.aruco.drawMarker() function, which returns the marker ready to be printed. The first parameter of cv2.aruco.drawMarker() is the dictionary object. The second parameter is the marker id, which ranges between 0 and 249, because our dictionary has 250 markers. The third parameter, sidePixels, is the size (in pixels) of the created marker image. The fourth (optional, by default 1) parameter is borderBits, which sets the number of bits in the marker borders. So, in this example, we are going to create three markers, varying the number of bits in the marker borders:
aruco_marker_1 = cv2.aruco.drawMarker(dictionary=aruco_dictionary, id=2, sidePixels=600, borderBits=1) […]
Image blending
Image blending is also image addition, but different weights are given to the images, giving an impression of transparency. In order to do this, the cv2.addWeighted() function will be used. This function is commonly used to get the output from the Sobel operator. The Sobel operator is used for edge detection, where it creates an image emphasizing edges. The Sobel operator uses two 3 × 3 kernels, which are convolved with the original image in order to calculate approximations of the derivatives, capturing both horizontal and vertical changes, as shown in the following code:
# Gradient x is calculated:
# the depth of the output is set to CV_16S to avoid overflow
# CV_16S = one channel of 2-byte signed integers (16-bit signed integers)
gradient_x = cv2.Sobel(gray_image, cv2.CV_16S, 1, 0, 3)
gradient_y = cv2.Sobel(gray_image, cv2.CV_16S, 0, 1, 3)
Therefore, after the horizontal and vertical changes have been calculated, they can be blended into an image by using the aforementioned function, as follows:
# Conversion to an unsigned 8-bit type:
abs_gradient_x = cv2.convertScaleAbs(gradient_x)
abs_gradient_y = cv2.convertScaleAbs(gradient_y)
# Combine the two images using the same weight:
sobel_image = cv2.addWeighted(abs_gradient_x, 0.5, abs_gradient_y, 0.5, 0)
This can be seen in the arithmetic_sobel.py script. The output of this script can be seen in the following screenshot:
In the preceding screenshot, the output of the Sobel operator is shown, including both the horizontal and vertical changes.
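As a reference for what cv2.addWeighted() computes, here is a NumPy sketch of its formula, dst = saturate(src1*alpha + src2*beta + gamma), on toy arrays (the array values are made up for the example):

```python
import numpy as np

def add_weighted(src1, alpha, src2, beta, gamma=0.0):
    """NumPy equivalent of cv2.addWeighted(): weighted sum with uint8 saturation."""
    blended = src1.astype(np.float64) * alpha + src2.astype(np.float64) * beta + gamma
    return np.clip(blended, 0, 255).astype(np.uint8)  # saturate to the uint8 range

a = np.array([[100, 200]], dtype=np.uint8)
b = np.array([[50, 250]], dtype=np.uint8)

# Equal weights, as in the Sobel example:
print(add_weighted(a, 0.5, b, 0.5))  # [[ 75 225]]
```

The saturation step is why the result stays a valid 8-bit image even when the weighted sum exceeds 255.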
The coordinate system in OpenCV
To show you the coordinate system in OpenCV and how to access individual pixels, we are going to show you a low-resolution image of the OpenCV logo. This logo has a dimension of 20 × 18 pixels, that is, 360 pixels in total. So, we can add the pixel count in every axis, as shown in the following image.
Now, we are going to look at the indexing of the pixels in the form (x, y). Notice that pixels are zero-indexed, meaning that the upper-left corner is at (0, 0), not (1, 1). Take a look at the following image, which indexes three individual pixels. The upper-left corner of the image is the origin of the coordinate system, and y coordinates get larger as they go down.
The information for an individual pixel can be extracted from an image in the same way as an individual element of an array is referenced in Python. In the next section, we are going to see how we can do this.
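As a quick illustration of this indexing, consider a tiny 3-channel array standing in for a loaded BGR image (the sizes and values are made up for the example). Note that NumPy indexes images as img[y, x], row first:

```python
import numpy as np

# A tiny 2-row x 3-column BGR "image" (shape is (height, width, channels)):
img = np.zeros((2, 3, 3), dtype=np.uint8)
img[0, 2] = (255, 0, 0)  # set the pixel at x=2, y=0 to blue (BGR order)

# NumPy indexing is img[y, x]: row (y) first, then column (x).
x, y = 2, 0
print(img.shape)  # (2, 3, 3)
print(img[y, x])  # [255   0   0]
```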
k-means clustering
OpenCV provides the cv2.kmeans() function, which implements a k-means clustering algorithm that finds the centers of clusters and groups the input samples around the clusters. The objective of the k-means clustering algorithm is to partition (or cluster) n samples into K clusters, where each sample will belong to the cluster with the nearest mean. The signature of the cv2.kmeans() function is as follows:
retval, bestLabels, centers = cv2.kmeans(data, K, bestLabels, criteria, attempts, flags[, centers])
data represents the input data for clustering. It should be of np.float32 data type, and each feature should be placed in a single column. K specifies the number of clusters required at the end. The algorithm-termination criteria are specified with the criteria parameter, which sets the maximum number of iterations and/or the desired accuracy. When these criteria are satisfied, the algorithm terminates. criteria is a tuple of three parameters, type, max_iter, and epsilon:
type: This is the type of termination criteria. It has three flags:
cv2.TERM_CRITERIA_EPS: The algorithm stops when the specified accuracy, epsilon, is reached.
cv2.TERM_CRITERIA_MAX_ITER: The algorithm stops when the specified number of iterations, max_iter, is reached.
cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER: The algorithm stops when either of the two conditions is reached.
max_iter: This is the maximum number of iterations.
epsilon: This is the required accuracy.
An example of criteria can be the following:
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
In this case, the maximum number of iterations is set to 20 (max_iter = 20) and the desired accuracy is 1.0 (epsilon = 1.0). The attempts parameter specifies the number of times the algorithm is executed using different initial labelings. The algorithm returns the labels that yield the best compactness. The flags parameter specifies how the initial centers are taken.
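To give an intuition for what cv2.kmeans() computes internally, here is a minimal NumPy sketch of the assignment/update cycle of the k-means algorithm (the toy data and the fixed initial centers are assumptions for the example; cv2.kmeans itself chooses initial centers according to the flags parameter):

```python
import numpy as np

# Two well-separated toy clusters of 2D points (np.float32, one feature per column):
rng = np.random.default_rng(42)
data = np.float32(np.vstack((rng.random((50, 2)), rng.random((50, 2)) + 5)))

# Fixed initial centers for the sketch:
centers = np.float32([[0.0, 0.0], [10.0, 10.0]])

for _ in range(10):
    # Assignment step: label each sample with the index of its nearest center
    distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(distances, axis=1)
    # Update step: recompute each center as the mean of its assigned samples
    centers = np.float32([data[labels == k].mean(axis=0) for k in range(2)])

print(centers)  # close to the true cluster means (~[0.5, 0.5] and ~[5.5, 5.5])
```

The criteria parameter of cv2.kmeans() simply controls when this loop stops: after max_iter iterations, or once the centers move by less than epsilon.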
The cv2.KMEANS_RANDOM_CENTERS flag selects random initial centers in each attempt. The cv2.KMEANS_PP_CENTERS flag uses the k-means++ center initialization proposed by Arthur and Vassilvitskii (see k-means++: The Advantages of Careful Seeding (2007)).
cv2.kmeans() returns the following:
bestLabels: An integer array that stores the cluster indices for each sample.
centers: An array that contains the center for each cluster.
compactness: The sum of the squared distances from each point to its corresponding center.
In this section, we will see two examples of how to use the k-means clustering algorithm in OpenCV. In the first example, an intuitive understanding of k-means clustering is expected to be achieved, while in the second example, k-means clustering will be applied to the problem of color quantization.
Understanding k-means clustering
In this example, we are going to cluster a set of 2D points using the k-means clustering algorithm. This set of 2D points can be seen as a collection of objects that has been described using two features. This set of 2D points can be created and visualized with the k_means_clustering_data_visualization.py script. The output of this script can be seen in the next screenshot.
This set of 2D points consists of 150 points, created in this way:
data = np.float32(np.vstack((np.random.randint(0, 40, (50, 2)), np.random.randint(30, 70, (50, 2)), np.random.randint( […]
This will represent the data for clustering. As previously mentioned, it should be of np.float32 type and each feature should be placed in a single column.