Tutorials
July 31, 2024
8 min read

L'Oréal leverages Giskard for advanced Facial Landmark Detection

L'Oréal has partnered with Giskard to enhance its AI models for Facial Landmark Detection. The collaboration focuses on evaluating and comparing various AI models using metrics such as Normalized Mean Error, prediction time, and robustness against image perturbations. It aims to improve the accuracy and reliability of L'Oréal's online services, ensuring superior performance across diverse facial regions and head poses. Co-authors: Alexandre Bouchez (L'Oréal), and Mathieu Martial (Giskard).

Rabah Abdul Khalek

Facial Landmark Detection at L’Oréal

Facial landmark detection (FLD) is the field of computer vision concerned with identifying key points, or landmarks, on a human face. The rich geometric information provided by landmarks with distinct semantic significance, such as eye corners, the nose tip, or the jawline, can be helpful in various tasks. These include face reconstruction, face identification, emotion recognition, face morphing and many others. FLD is a challenging task due to the high variability in facial poses, lighting and expressions, as well as other possible sources of bias.

A crucial component of many of L’Oréal’s services, including Skin Screen, Modiface, and Hapta, is Facial Landmark Detection. This technology is essential, as the subsequent tasks can only be performed effectively if the landmarks are reliable. Selecting the most accurate landmark estimation model is paramount for L’Oréal, ensuring reliable predictions for every user, across all ages and genders, and ultimately providing a superior and dependable customer experience. Hence, a partnership with Giskard was established to create an adapted test suite for comparing these models across various use cases.

The giskard-vision library makes it possible to perform this granular comparison across different open-source models and different criteria, such as:

  • Performance on partial facial regions
  • Performance on face images with different head poses
  • Robustness against image perturbations like blurring, resizing, recoloring

The aim of the following sections is to demonstrate an example of using giskard-vision. Our dataset comprises the first 100 images of the 300 Faces in-the-Wild Challenge: The First Facial Landmark Localization Challenge (300W).

Evaluation Metrics with Giskard

Today’s facial landmark detection benchmarks rely on highly aggregated metrics, the most important one being the Normalized Mean Error (NME):

$$\mathrm{NME} = \frac{1}{N_I}\sum_{i=1}^{N_I}\frac{1}{N_L\,D_i}\sum_{l=1}^{N_L}\left\lVert p_l^i - g_l^i \right\rVert_2$$

where p_l^i and g_l^i are the 2D coordinates of the predicted and ground-truth landmark l in image i; N_I and N_L are the number of images in the dataset and the number of landmarks per face; and D_i is the Euclidean distance between the outer eye corners, used for normalization. Below is an example of the predicted (in red) and ground-truth (in green) landmarks.

Example of difference between predictions (red) and ground truth (green)
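
To make the metric concrete, here is a minimal NumPy sketch of the NME computation described above (an illustrative implementation, not the exact code used by giskard-vision; the eye-corner indices assume the 68-point 300W annotation scheme):

```python
import numpy as np

def nme(predictions, ground_truths, left_eye_idx=36, right_eye_idx=45):
    """Inter-ocular Normalized Mean Error over a batch of images.

    predictions, ground_truths: arrays of shape (n_images, n_landmarks, 2).
    Indices 36 and 45 (0-based) are the outer eye corners in the
    68-point 300W scheme; their distance gives the normalization D_i.
    """
    errors = []
    for pred, gt in zip(predictions, ground_truths):
        # Normalization term D_i: distance between the outer eye corners
        d = np.linalg.norm(gt[right_eye_idx] - gt[left_eye_idx])
        # Mean point-to-point Euclidean error, divided by D_i
        errors.append(np.mean(np.linalg.norm(pred - gt, axis=1)) / d)
    return float(np.mean(errors))
```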

This chart from PapersWithCode reveals that, since 2019, FLD models have plateaued from an NME standpoint.

Different FLD models performances on the 300W dataset. Source: https://paperswithcode.com/sota/facial-landmark-detection-on-300w

It can be argued that the aggregated NME (over all images) does not provide a complete picture, highlighting the need for more specific testing.

Our analysis was performed with 3 different metrics:

  1. NME, calculated on various data slices/transformations. For example, one can measure how model performance varies on slightly blurry images, or on images with reduced size.
  2. Fail rate, the ratio of images for which the model failed to provide a prediction (mostly due to the failure of the face detection step).
  3. Prediction time, the time it took for a model to run its predictions.

The giskard-vision library also gives access to many other metrics that were not used for this example, which can all be found here.

Evaluation Criteria with Giskard

For L’Oréal’s case, FLD models need to perform well on imperfect images taken “in the wild”, with high variability in facial poses, lighting and expressions. As such, some landmarks may be partially or completely occluded. Therefore, there is a need for testing on specific data slices.

The three open-source models that were evaluated are FaceAlignment, Mediapipe and OpenCV:

  • OpenCV (Open Source Computer Vision Library) is a broad library of computer vision and machine learning algorithms that includes functionality for facial landmark detection, among many other things. OpenCV's FLD is typically based on machine learning models like Haar cascades or dlib's shape predictor. It's widely used due to its speed, versatility, and open-source nature.
  • FaceAlignment is a library developed specifically for FLD. It's known for its high precision and robustness. The model can handle face alignment tasks in both 2D and 3D coordinates. It's often used in applications that require precise facial feature extraction, such as face recognition, facial expression analysis, and face editing.
  • Google's MediaPipe is a cross-platform framework for building multimodal (video, audio, and sensor) applied machine learning pipelines. With it, you can build applications that process perceptual data in real time, such as video and audio, to detect and recognize objects, people, and actions. This includes a quick and reliable FLD model.

The aim will be to write a quantitative summary report that compares these models, based on the different criteria that we will discuss below.

Performance on partial faces

First, let’s see how the models perform on faces that are only partially visible, by cropping the images. Since the faces in the images processed by L’Oréal’s services are often not fully visible, detecting the landmarks can be very challenging.

We simulate partially visible faces by cropping part of the faces from the original 300W dataset. Since a colored image is a matrix of pixels of size (height x width x channels), we can crop by applying masked arrays on the first two dimensions.

In our case, we use the facial landmarks themselves in order to crop specific facial parts. For example, cropping based on FacialParts.NOSE is equivalent to drawing a bounding box around the landmarks 27 to 36 as shown below. Once the box is defined, we pad all around it with black pixels so that the final picture has the exact same dimensions as the original one. Finally, we can pass this to a model to get predictions (in green).

Original image
Landmarks predictions based on the nose only
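
For illustration, this cropping strategy could be sketched with plain NumPy as follows (a hypothetical helper, not the exact giskard-vision implementation; landmark indices follow the 68-point scheme mentioned above):

```python
import numpy as np

def crop_to_facial_part(image, landmarks, part_indices, margin=5):
    """Keep only the bounding box around the given landmarks and pad the
    rest with black pixels, preserving the original image dimensions.

    image: array of shape (height, width, channels)
    landmarks: array of shape (n_landmarks, 2) in (x, y) pixel coordinates
    part_indices: e.g. range(27, 36) for the nose in the 68-point scheme
    """
    pts = landmarks[list(part_indices)]
    x_min, y_min = np.floor(pts.min(axis=0)).astype(int) - margin
    x_max, y_max = np.ceil(pts.max(axis=0)).astype(int) + margin

    # Clip the bounding box to the image boundaries
    h, w = image.shape[:2]
    x_min, x_max = max(x_min, 0), min(x_max, w)
    y_min, y_max = max(y_min, 0), min(y_max, h)

    # Black canvas of the original size, with only the box filled in
    out = np.zeros_like(image)
    out[y_min:y_max, x_min:x_max] = image[y_min:y_max, x_min:x_max]
    return out
```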

Considering the whole 300W dataset (dl_ref in the following), we can define two cropped versions using the CroppedDataLoader wrapper, one on the left half and one on the upper half of the face as follows:
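
The corresponding code could look roughly like this (import paths, class names and arguments are indicative of the giskard-vision API and may differ slightly in the version you install):

```python
# Indicative sketch: exact import paths and argument names may differ
# across giskard-vision versions.
from giskard_vision.landmark_detection.dataloaders.loaders import DataLoader300W
from giskard_vision.landmark_detection.dataloaders.wrappers import CroppedDataLoader
from giskard_vision.landmark_detection.marks.facial_parts import FacialParts

# Reference dataloader over the (first 100 images of the) 300W dataset
dl_ref = DataLoader300W(dir_path="300W/sample")  # path is an example

# Two cropped variants: keep only the left half or the upper half of the face
dl_cropped_left = CroppedDataLoader(dl_ref, part=FacialParts.LEFT_HALF)
dl_cropped_upper = CroppedDataLoader(dl_ref, part=FacialParts.UPPER_HALF)
```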

Here are the results of the metrics on the entire cropped dataset:

| Criteria | Model | NME | prediction_time | prediction_fail_rate |
| --- | --- | --- | --- | --- |
| cropped on left half | FaceAlignment | 0.0950334 | 22.3192 | 0.820441 |
| cropped on left half | Mediapipe | 2.39931 | 3.95156 | 0.951029 |
| cropped on left half | OpenCV | 0.17521 | 11.1745 | 0.825882 |
| cropped on upper half | FaceAlignment | 0.0948426 | 26.6648 | 0.782941 |
| cropped on upper half | Mediapipe | 2.19568 | 4.02357 | 0.941765 |
| cropped on upper half | OpenCV | 0.0519043 | 10.7939 | 0.978824 |

When cropping on the left half, OpenCV runs about twice as fast as FaceAlignment but performs about twice as badly. However, on the upper half, Mediapipe and OpenCV fail to deliver predictions around 95% of the time. As such, the trade-off between speed and robustness that FaceAlignment offers makes it a good choice for cropped faces. The giskard-vision library offers many pre-set options for facial parts to crop, such as the face contour, the mouth or the eyes. You can thus easily check which model is the best for which part!

Performance on face images with different head poses

One cannot expect all faces to be perfectly aligned with the camera. Therefore, it is important to see if the models perform well when parts of a face are occluded. We define a head pose using 3 values to represent it in space: Yaw, Pitch and Roll.

Source: https://www.researchgate.net/figure/The-yaw-pitch-and-roll-angles-in-the-human-head-motion-11_fig1_340166096

We first need to estimate the head pose on a picture: we leverage the 6DRepNet library which is the official implementation of the “6D Rotation representation for unconstrained head pose estimation” paper.

For this, we will wrap our 300W dataset (dl_ref) with the HeadPoseDataLoader wrapper, which will calculate the head pose for every image using 6DRepNet. We then define two slices, one corresponding to images of faces with a positive roll and one to faces with a negative roll, as follows:
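
A sketch of what this slicing could look like (the filtering wrapper, predicate signature and metadata keys shown here are indicative, not the exact giskard-vision API):

```python
# Indicative sketch: exact import paths, wrapper names and metadata
# access may differ across giskard-vision versions.
from giskard_vision.landmark_detection.dataloaders.wrappers import (
    FilteredDataLoader,
    HeadPoseDataLoader,
)

# Estimate (pitch, yaw, roll) for every image of dl_ref using 6DRepNet
dl_pose = HeadPoseDataLoader(dl_ref)

# Hypothetical predicates slicing on the sign of the roll angle
def positive_roll(element) -> bool:
    return element.meta["headPose"]["roll"] > 0

def negative_roll(element) -> bool:
    return element.meta["headPose"]["roll"] < 0

dl_positive_roll = FilteredDataLoader(dl_pose, positive_roll)
dl_negative_roll = FilteredDataLoader(dl_pose, negative_roll)
```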

Let’s check some head pose prediction examples:

Predictions : {'pitch': -2.5054178, 'yaw': -61.150265, 'roll': -3.664601}
Predictions : {'pitch': -7.8795295, 'yaw': 10.167775, 'roll': -10.162264}

Here are the results we got:

| Criteria | Model | NME | prediction_time | prediction_fail_rate |
| --- | --- | --- | --- | --- |
| negative_roll | FaceAlignment | 0.0909015 | 25.7093 | 0.0416667 |
| negative_roll | Mediapipe | 3.04958 | 2.03313 | 0.125 |
| negative_roll | OpenCV | 0.0968505 | 10.9992 | 0.0833333 |
| positive_roll | FaceAlignment | 0.330439 | 30.441 | 0.0576923 |
| positive_roll | Mediapipe | 3.13338 | 2.88912 | 0.288462 |
| positive_roll | OpenCV | 0.411305 | 17.3002 | 0.192308 |

While Mediapipe is 10 to 15 times faster than the other two models, it has a very high fail rate on positive roll and an NME roughly 30 times worse on negative roll. Although OpenCV performs about as well as FaceAlignment while being faster, it also has a much higher fail rate, making it risky to use on positive roll, for example.

OpenCV on negative roll and FaceAlignment on positive roll seem to be the best choices.

Robustness against image perturbations like blurring, resizing, recoloring

Since images in the wild won’t always be taken in perfect conditions, it is also very important to make sure that the models are robust against typical perturbations. With the help of the OpenCV library, we can blur, resize and alter the color of images.

To perform these transformations, we will use the BlurredDataLoader, ResizedDataLoader and ColoredDataLoader as follows:
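
The corresponding code could look roughly like this (the wrapper names come from giskard-vision, but the argument names shown are indicative):

```python
# Indicative sketch: exact import paths and parameters may differ
# across giskard-vision versions.
from giskard_vision.landmark_detection.dataloaders.wrappers import (
    BlurredDataLoader,
    ColoredDataLoader,
    ResizedDataLoader,
)

dl_blurred = BlurredDataLoader(dl_ref)              # Gaussian-style blur
dl_recolored = ColoredDataLoader(dl_ref)            # color alteration (e.g. grayscale)
dl_resized = ResizedDataLoader(dl_ref, scales=0.5)  # downscale; reported as "ratios: 0.5"
```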

This results in the following:

Blurred version
Recolored version
Resized version

Let’s run the predictions. The green marks represent the ground truth while the red and magenta ones are respectively the original prediction and the transformed prediction.

Landmarks on blurry face
Landmarks on black and white face
Landmarks on resized face
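
One simple way to produce such overlays is to draw a small circle per landmark with OpenCV, along these lines (variable names are illustrative; the color convention matches the one above, in BGR):

```python
import cv2

def draw_landmarks(image, marks, color):
    """Draw one small filled circle per (x, y) landmark on a BGR image."""
    for x, y in marks:
        cv2.circle(image, (int(x), int(y)), 2, color, -1)
    return image

# img: BGR image array; ground_truth and the two predictions: (n_landmarks, 2) arrays
img = draw_landmarks(img, ground_truth, (0, 255, 0))              # green
img = draw_landmarks(img, original_prediction, (0, 0, 255))       # red
img = draw_landmarks(img, transformed_prediction, (255, 0, 255))  # magenta
```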

The results on the full dataset are as follows:

| Criteria | Model | NME | prediction_time | prediction_fail_rate |
| --- | --- | --- | --- | --- |
| altered color | FaceAlignment | 0.174517 | 58.1809 | 0.02 |
| altered color | Mediapipe | 2.5315 | 3.96894 | 0.8 |
| altered color | OpenCV | 0.243567 | 28.037 | 0.14 |
| blurred | FaceAlignment | 0.204779 | 63.3339 | 0.04 |
| blurred | Mediapipe | 3.26273 | 5.59357 | 0.09 |
| blurred | OpenCV | 0.331916 | 24.6266 | 0.12 |
| resized with ratios: 0.5 | FaceAlignment | 0.218424 | 59.4045 | 0.04 |
| resized with ratios: 0.5 | Mediapipe | 3.19861 | 5.10067 | 0.12 |
| resized with ratios: 0.5 | OpenCV | 0.252987 | 10.9042 | 0.18 |

Again, Mediapipe is not recommended, as its NME is roughly 10 times higher than that of the other models.

OpenCV, for its part, has fairly high fail rates on each criterion, hence we recommend using FaceAlignment for its robustness.

Automated report generation

In order to produce a full summary of the results displayed above, the Report class can be used as follows:
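
A minimal sketch of how the report could be produced (the Report import path and constructor arguments shown here are approximate, and the model variables stand for the wrapped FaceAlignment, Mediapipe and OpenCV predictors):

```python
# Indicative sketch: the Report API and import path are approximate.
from giskard_vision.landmark_detection.tests.report import Report

models = [facealignment_model, mediapipe_model, opencv_model]  # wrapped predictors
dataloaders = [
    dl_cropped_left, dl_cropped_upper,      # partial faces
    dl_positive_roll, dl_negative_roll,     # head poses
    dl_blurred, dl_recolored, dl_resized,   # perturbations
]

report = Report(models=models, dataloaders=dataloaders)
df = report.to_dataframe()  # one row per (criteria, model, metric) combination
print(df)
```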

Here is the final comparison for each criterion, metric and model:

Overall, which model to use depends on the end goal. If high accuracy does not matter so much, then Mediapipe and its speed might be the best choice. And while OpenCV seems to be a nice trade-off between speed and accuracy, it is not as robust as FaceAlignment, which will often be preferred, especially when working on a “faces in the wild” kind of dataset.

Continuous Integration

Integrating an automated evaluation report such as the above into a continuous integration (CI) pipeline enhances the monitoring of facial landmark detection models. This setup allows for real-time performance tracking across various dataset slices, ensuring any issues are quickly identified and addressed.

Using this approach, different versions of the models, whether trained with different parameters or on varying datasets, can be systematically compared. This comparative analysis helps in pinpointing the most effective configurations and understanding the impact of specific changes. The detailed reports generated by the evaluation tool offer valuable insights, guiding further optimization and refinement efforts.

This integration not only ensures that the models remain robust and accurate over time but also supports a more efficient development process. By automatically generating and reviewing performance reports, the development team can make informed decisions quickly, leading to continuous improvement and innovation in facial landmark detection technology.
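
For instance, a CI job could re-run the evaluation and fail the build whenever a key metric regresses beyond a threshold. A minimal pytest-style sketch, with hypothetical file names, column names and thresholds, could look like this:

```python
# Hypothetical CI check: file name, column names and thresholds are illustrative.
import pandas as pd

NME_THRESHOLD = 0.35
FAIL_RATE_THRESHOLD = 0.20

def test_facealignment_blurred_has_not_regressed():
    # Assumes a previous CI step exported the evaluation report to CSV
    df = pd.read_csv("report.csv")
    row = df[(df["model"] == "FaceAlignment") & (df["criteria"] == "blurred")]
    assert row["NME"].item() <= NME_THRESHOLD
    assert row["prediction_fail_rate"].item() <= FAIL_RATE_THRESHOLD
```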

Check out this full GitHub repository demo, built around the example above and embedded in a simple CI pipeline configured with a GCP runner.

Conclusion

Facial landmark detection is pivotal for enhancing L’Oréal’s services, which rely on accurate landmark estimation to deliver a superior customer experience. The collaboration with Giskard has been instrumental in developing a robust test suite that allows for comprehensive comparison of various open-source FLD models using the giskard-vision library.

Our detailed analysis evaluated three open-source models—FaceAlignment, Mediapipe, and OpenCV—across several criteria including performance on partial faces, different head poses, and robustness against image perturbations. The results highlight that while Mediapipe offers the fastest prediction times, it suffers from high failure rates and lower accuracy. OpenCV strikes a balance between speed and performance but is less robust compared to FaceAlignment, which consistently delivers reliable predictions, especially in challenging conditions like cropped or blurred images.

Integrating the automated evaluation reports into a continuous integration (CI) pipeline ensures real-time monitoring and performance tracking of FLD models. This systematic approach facilitates quick identification and resolution of issues, supports efficient model comparison, and aids in optimizing and refining FLD models. By continuously generating detailed performance reports, the development team can make informed decisions, leading to ongoing improvements and innovations in facial landmark detection technology.

To learn more about giskard-vision, visit our quick-start guide and our GitHub repo.
