Model structure
The network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.
Namely, the authors strive for an embedding , from an image into a feature space , such that:
- squared distance between all faces, independent of imaging conditions, of the same identity is small.
- the squared distance between a pair of face images from different identities is large.
Triplet Loss
The embedding is represented by . It embeds an image x into a d-dimensional Euclidean space. Additionally, we constrain this embedding to live on the d-dimensional hypersphere, i.e. .
Here the objective is that we want to make sure that an image (anchor) of a specific person is closer to all other images (positive) of the same person than it is to any image (negative) of any other person.
- for
- is a margin that is enforced between positive and negative pairs.
- is the set of all possible triplets in the training set and has cardinality
- The the objective is to minimize the Loss:
Triplet Selection
Given , we want to select an (hard positive) such that and similarly (hard negative) such that .
Two obvious choices:
- Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
- Generate triplets online. This can be done by select- ing the hard positive/negative exemplars from within a mini-batch.
Online Triplets Generation
- To have a meaningful representation of the anchor- positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each minibatch.
- around 40 faces are selected per identity per minibatch.
- randomly sampled negative faces are added to each mini-batch.
- Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. For all anchor-positive method was more stable and converged slightly faster at the beginning of training.
Selecting the hardest negatives can in practice lead to bad local minima early on in training, specifically it can result in a collapsed model (i.e. ). In order to mitigate this, it helps to select such that
- these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance.
- Those negatives lie inside the margin
In most experiments the author used a batch size of around 1,800 exemplars.
Deep Convolutional Networks
- Use rectified linear units as the non-linear activation function
- Now use Inception+ResNet
Model Evaluation
The author evaluate our method on the face verification task. I.e. given a pair of two face images a squared L2 distance threshold is used to determine the classification of same and different. All faces pairs of the same identity are denoted with , whereas all pairs of different identities are denoted with
the set of all true accepts as:
, with .
These are the face pairs that were correctly classified as same at threshold .
the set of all pairs that was incorrectly classified as same(false accept) is:
, with .
The validation rate and the false accept rate for a given face distance are then defined as:
,