Facial Landmark Detection at L’Oréal
Facial landmark detection (FLD) is the field of computer vision concerned with identifying key points, or landmarks, on a human face. The rich geometric information provided by landmarks with distinct semantic significance, such as eye corners, the nose tip, or the jawline, is helpful in many tasks, including face reconstruction, face identification, emotion recognition, and face morphing. FLD is challenging due to the high variability in facial poses, lighting, and expressions, as well as other possible sources of bias.
Facial landmark detection is a crucial component of many of L’Oréal’s services, including Skin Screen, Modiface, and Hapta: downstream tasks can only be performed effectively if the landmarks are accurate. Selecting the most accurate landmark estimation model is therefore paramount for L’Oréal, ensuring reliable predictions for everyone, across all ages and genders, and ultimately a superior and dependable customer experience. Hence, a partnership with Giskard was established to create a test suite adapted to comparing these models across various use cases.
The giskard-vision library makes it possible to perform this granular comparison across different open-source models and different criteria, such as:
- Performance on partial facial regions
- Performance on face images with different head poses
- Robustness against image perturbations like blurring, resizing, recoloring
The aim of the subsequent sections is to demonstrate an example of using giskard-vision. Our dataset comprises the first 100 images of the 300 Faces in-the-Wild Challenge: The First Facial Landmark Localization Challenge (300W).
Evaluation Metrics with Giskard
Today’s benchmarks for facial landmark detection models rely on highly aggregated metrics, the most important being the Normalized Mean Error (NME):

$$\mathrm{NME} = \frac{1}{N_I} \sum_{i=1}^{N_I} \frac{1}{N_L \, D_i} \sum_{l=1}^{N_L} \left\lVert p_{li} - g_{li} \right\rVert_2$$

where $p_{li}$ and $g_{li}$ are the 2D coordinates of the predicted and ground-truth landmarks, $N_I$ and $N_L$ are the number of images in the dataset and the number of landmarks per face, and $D_i$ is the Euclidean distance between the outer eye corners (the inter-ocular distance). Below is an example of the predicted (in red) and ground-truth (in green) landmarks.
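To make the formula concrete, here is a minimal NumPy sketch of the computation. The only assumption is the pair of landmark indices used for the inter-ocular distance (36 and 45, the outer eye corners in the 68-point 300W annotation scheme).

```python
import numpy as np

def nme(predictions: np.ndarray, ground_truth: np.ndarray) -> float:
    """Normalized Mean Error over a batch of images.

    predictions, ground_truth: arrays of shape (N_images, N_landmarks, 2).
    Normalization uses the inter-ocular distance, i.e. the distance between
    the outer eye corners (indices 36 and 45 in the 68-point 300W scheme).
    """
    # Per-landmark Euclidean errors, shape (N_images, N_landmarks)
    errors = np.linalg.norm(predictions - ground_truth, axis=-1)
    # Inter-ocular distance per image, shape (N_images,)
    d = np.linalg.norm(ground_truth[:, 36] - ground_truth[:, 45], axis=-1)
    # Average over landmarks, normalize per image, then average over images
    return float(np.mean(errors.mean(axis=1) / d))
```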
This chart from PapersWithCode reveals that, since 2019, FLD models have plateaued from an NME standpoint.
It can be argued that the aggregated NME (over all images) does not provide a complete picture, highlighting the need for more specific testing.
Our analysis was performed with three different metrics:
- NME calculated on various data slices/transformations. For example, one could measure how the model’s performance varies on slightly blurry images, or on images with reduced sizes.
- Fail rate, the ratio of images for which the model failed to provide a prediction (mostly due to a failure of the face detection step).
- Prediction time, the time it took a model to run its predictions.
The giskard-vision library also gives access to many other metrics that were not used for this example, all of which can be found here.
Evaluation Criteria with Giskard
For L’Oréal’s use case, FLD models need to perform well on imperfect images taken “in the wild”, with high variability in facial poses, lighting, and expressions. As such, some landmarks can be partially or completely occluded, hence the need for testing on specific data slices.
The three open-source models that were evaluated are FaceAlignment, Mediapipe, and OpenCV:
- OpenCV (Open Source Computer Vision Library) is a broad library of computer vision and machine learning algorithms that includes functionality for facial landmark detection, among many other things. OpenCV's FLD is typically based on machine learning models such as Haar cascades or dlib's shape predictor. It is widely used for its speed, versatility, and open-source nature.
- FaceAlignment is a library developed specifically for FLD. It's known for its high precision and robustness. The model can handle face alignment tasks in both 2D and 3D coordinates. It's often used in applications that require precise facial feature extraction, such as face recognition, facial expression analysis, and face editing.
- Google's MediaPipe is a cross-platform framework for building multimodal (video, audio, and sensor) applied machine learning pipelines. With it, you can build applications that process perceptual data in real time, such as video and audio, to detect and recognize objects, people, and actions. This includes a quick and reliable FLD model.
The aim is to produce a quantitative summary report comparing these models along the criteria discussed below.
Performance on partial faces
First, let’s see how the models perform on faces that are only partially visible, by cropping the images. Since the faces in L’Oréal’s use cases are often not fully visible in the images, detecting the landmarks can be very challenging.
We simulate partially visible faces by cropping part of the faces from the original 300W dataset. Since a color image is a matrix of pixels of size (height x width x channels), we can crop by applying masked arrays on the first two dimensions.
In our case, we use the facial landmarks themselves to crop specific facial parts. For example, cropping based on FacialParts.NOSE is equivalent to drawing a bounding box around landmarks 27 to 36, as shown below. Once the box is defined, we pad everything around it with black pixels so that the final picture has exactly the same dimensions as the original one. Finally, we can pass this to a model to get predictions (in green).
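As an illustration, here is a minimal NumPy sketch of this cropping logic; the helper name and the landmark slice are ours, for demonstration only.

```python
import numpy as np

def crop_to_landmarks(image: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Keep only the bounding box around the given landmarks, padding
    everything else with black pixels so the output keeps the original
    (height x width x channels) dimensions."""
    xs, ys = landmarks[:, 0], landmarks[:, 1]
    left, right = int(xs.min()), int(np.ceil(xs.max()))
    top, bottom = int(ys.min()), int(np.ceil(ys.max()))

    # Boolean mask on the first two dimensions: True outside the bounding box
    mask = np.ones(image.shape[:2], dtype=bool)
    mask[top:bottom + 1, left:right + 1] = False

    cropped = image.copy()
    cropped[mask] = 0  # black out everything outside the box
    return cropped

# e.g. a nose crop from landmarks 27 to 36 of a 68-point annotation:
# nose_crop = crop_to_landmarks(image, all_landmarks[27:37])
```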
Considering the whole 300W dataset (dl_ref in the following), we can define two cropped versions using the CroppedDataLoader wrapper, one on the left half and one on the upper half of the face, as follows:
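The snippet below sketches this setup; the import paths, the DataLoader300W entry point, the sample directory, and the part= argument follow the giskard-vision examples at the time of writing and may differ between library versions.

```python
from giskard_vision.landmark_detection.dataloaders.loaders import DataLoader300W
from giskard_vision.landmark_detection.dataloaders.wrappers import CroppedDataLoader
from giskard_vision.landmark_detection.marks.facial_parts import FacialParts

# Reference dataloader over a local copy of the 300W images
dl_ref = DataLoader300W(dir_path="300W/sample")

# Two cropped variants of the same dataset
dl_cropped_left = CroppedDataLoader(dl_ref, part=FacialParts.LEFT_HALF.value)
dl_cropped_upper = CroppedDataLoader(dl_ref, part=FacialParts.UPPER_HALF.value)
```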
Here are the results of the metrics on the entire cropped dataset:
When cropping on the left half, OpenCV runs twice as fast as FaceAlignment but its NME is twice as high. On the upper half, however, Mediapipe and OpenCV fail to deliver predictions around 95% of the time. As such, the trade-off between speed and robustness that FaceAlignment offers makes it a good choice for cropped faces. The giskard-vision library offers many pre-set options for facial parts to crop, such as the face contour, the mouth, or the eyes. You can thus easily check which model is best for which part!
Performance on face images with different head poses
One cannot expect all faces to be perfectly aligned with the camera, so it is important to check whether the models perform well when parts of a face are occluded. We describe a head pose with three values that represent its orientation in space: yaw, pitch, and roll.
We first need to estimate the head pose in each picture; we leverage the 6DRepNet library, the official implementation of the “6D Rotation representation for unconstrained head pose estimation” paper.
For this, we wrap our 300W dataset (dl_ref) with the HeadPoseDataLoader wrapper, which calculates the head pose of every image using 6DRepNet. We then define two slices, one corresponding to images of faces with a positive roll, and one for the negative roll, as follows:
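Here is a sketch of how these slices could be defined; the FilteredDataLoader predicate assumes the head pose is exposed in each element’s metadata under a "headPose" entry, which should be checked against your version of the library.

```python
from giskard_vision.landmark_detection.dataloaders.wrappers import (
    FilteredDataLoader,
    HeadPoseDataLoader,
)

# Wrap the reference dataloader so each element carries a head-pose estimate
dl_pose = HeadPoseDataLoader(dl_ref)

# Slice the dataset on the sign of the roll angle (metadata layout assumed)
def positive_roll(elt):
    return elt[2]["headPose"]["roll"] > 0

def negative_roll(elt):
    return elt[2]["headPose"]["roll"] < 0

dl_positive_roll = FilteredDataLoader(dl_pose, positive_roll)
dl_negative_roll = FilteredDataLoader(dl_pose, negative_roll)
```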
Let’s check some head pose prediction examples:
Here are the results we got:
While Mediapipe is 10 to 15 times faster than the other two models, it has a very high fail rate on positive roll, and its NME is 30 times higher on negative roll. OpenCV performs as well as FaceAlignment while being faster, but it also has a much higher fail rate, making it risky to use on positive roll, for example. OpenCV on negative roll and FaceAlignment on positive roll seem to be the best choices.
Robustness against image perturbations like blurring, resizing, recoloring
Since images in the wild won’t always be taken in perfect conditions, it is also very important to make sure that the models are robust against typical perturbations. With the help of the OpenCV library, we can blur, resize and alter the color of images.
To perform these transformations, we use the BlurredDataLoader, ResizedDataLoader, and ColoredDataLoader wrappers, as follows:
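A sketch of the three wrapped dataloaders is shown below, using the wrappers’ default parameters; each wrapper also accepts arguments controlling the strength of its transformation (e.g. blur kernel, scaling ratio, color mode), for which the library documentation is the reference.

```python
from giskard_vision.landmark_detection.dataloaders.wrappers import (
    BlurredDataLoader,
    ColoredDataLoader,
    ResizedDataLoader,
)

# Each wrapper applies its perturbation on the fly, so the three
# dataloaders share the images of dl_ref without duplicating them.
dl_blurred = BlurredDataLoader(dl_ref)   # blurring with default parameters
dl_resized = ResizedDataLoader(dl_ref)   # downscaling with default ratio
dl_colored = ColoredDataLoader(dl_ref)   # color-mode conversion
```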
This results in the following:
Let’s run the predictions. The green marks represent the ground truth while the red and magenta ones are respectively the original prediction and the transformed prediction.
The results on the full dataset are as follows:
Again, Mediapipe is not recommended, as its NME is 10 times higher than that of the other models. OpenCV has fairly high fail rates on every criterion, hence we recommend FaceAlignment for its robustness.
Automated report generation
To produce a full summary of the results displayed above, the Report class can be used as follows:
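Below is a sketch of how the report generation could look. The import path, constructor arguments, and export methods are assumptions based on the giskard-vision examples at the time of writing, and the wrapped-model variables are hypothetical placeholders; check the current documentation for the exact API.

```python
# Import path assumed; it may differ across giskard-vision versions
from giskard_vision.landmark_detection.tests.report import Report

# Hypothetical wrapped models (FaceAlignment, Mediapipe, OpenCV)
models = [face_alignment_model, mediapipe_model, opencv_model]

dataloaders = [
    dl_cropped_left, dl_cropped_upper,    # partial faces
    dl_positive_roll, dl_negative_roll,   # head poses
    dl_blurred, dl_resized, dl_colored,   # perturbations
]

# One report comparing every model on every data slice
report = Report(models=models, dataloaders=dataloaders)
report.to_html("report.html")  # or report.to_dataframe() for further analysis
```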
Here is the final comparison for each criterion, metric, and model:
Overall, which model to use depends on the end goal. If raw accuracy matters less, then Mediapipe and its speed might be the best choice. And while OpenCV seems to offer a nice trade-off between speed and accuracy, it is not as robust as FaceAlignment, which will often be preferred, especially when working with a “faces in the wild” kind of dataset.
Continuous Integration
Integrating an automated evaluation report such as the above into a continuous integration (CI) pipeline enhances the monitoring of facial landmark detection models. This setup allows for real-time performance tracking across various dataset slices, ensuring any issues are quickly identified and addressed.
Using this approach, different versions of the models, whether trained with different parameters or on varying datasets, can be systematically compared. This comparative analysis helps in pinpointing the most effective configurations and understanding the impact of specific changes. The detailed reports generated by the evaluation tool offer valuable insights, guiding further optimization and refinement efforts.
This integration not only ensures that the models remain robust and accurate over time but also supports a more efficient development process. By automatically generating and reviewing performance reports, the development team can make informed decisions quickly, leading to continuous improvement and innovation in facial landmark detection technology.
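As an illustration, a CI job could run the report and fail the build when a metric regresses. The sketch below is hypothetical: it assumes the report from the previous section can be exported to a pandas DataFrame with metric, model, and dataloader columns, which should be adapted to the actual report schema.

```python
import sys

# Hypothetical CI gate: fail the pipeline when any NME exceeds a budget.
# Assumes `report` was built as in the previous section and that
# to_dataframe() returns one row per (model, dataloader, metric).
NME_BUDGET = 0.10

df = report.to_dataframe()
nme_rows = df[df["metric"] == "NME_mean"]          # column values assumed
regressions = nme_rows[nme_rows["value"] > NME_BUDGET]

if not regressions.empty:
    print("NME budget exceeded on the following slices:")
    print(regressions[["model", "dataloader", "value"]])
    sys.exit(1)  # a non-zero exit code marks the CI job as failed
```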
Check out this full GitHub repository demo, built around the example above and embedded in a simple CI pipeline configured with a GCP runner.
Conclusion
Facial landmark detection is pivotal for enhancing L’Oréal’s services, which rely on accurate landmark estimation to deliver a superior customer experience. The collaboration with Giskard has been instrumental in developing a robust test suite that allows for comprehensive comparison of various open-source FLD models using the giskard-vision library.
Our detailed analysis evaluated three open-source models—FaceAlignment, Mediapipe, and OpenCV—across several criteria including performance on partial faces, different head poses, and robustness against image perturbations. The results highlight that while Mediapipe offers the fastest prediction times, it suffers from high failure rates and lower accuracy. OpenCV strikes a balance between speed and performance but is less robust compared to FaceAlignment, which consistently delivers reliable predictions, especially in challenging conditions like cropped or blurred images.
Integrating the automated evaluation reports into a continuous integration (CI) pipeline ensures real-time monitoring and performance tracking of FLD models. This systematic approach facilitates quick identification and resolution of issues, supports efficient model comparison, and aids in optimizing and refining FLD models. By continuously generating detailed performance reports, the development team can make informed decisions, leading to ongoing improvements and innovations in facial landmark detection technology.
To learn more about giskard-vision, visit our quick-start guide and our GitHub repo.