Magics Of Siamese Neural Networks

15 min read · May 28, 2021

I am writing this blog to be useful for all AI enthusiasts. I have tried to make it informative yet concise, so that most concepts become clear and the reader gets in-depth information about Siamese Neural Networks. I have kept the language simple and the explanations easy. Be ready to dive deep into the magic of Siamese Neural Networks. Happy reading!

Table of Contents:-

1. Motivation

2. Introduction

3. Standard Classification vs. One Shot Classification

4. Problems with Convolutional Neural Network

5. Architectures

6. Siamese Neural Network

7. My Work On Siamese Neural Network

8. Conclusion

Motivation :-

The main motivation for writing this article is to publish my work and study on Siamese Neural Networks (SNN). My aim is that anyone who wants to work on SNNs will find a great amount of information here. I will also write about my own work, so that this article is useful to anyone who wants to continue it. It should also help anyone who wants to know the theory and working of SNNs or wants to build a particular application.

In this article, I explain how we can solve computer vision problems, typically handled with Convolutional Neural Networks (CNN), using Siamese Neural Networks, and how Siamese neural networks are superior to Convolutional neural networks in some particular cases and situations. I am going to discuss the problems with Convolutional Neural Networks and their solutions with Siamese Neural Networks.

Deep Convolutional Neural Networks have become the state-of-the-art methods for image classification tasks. However, one of their biggest limitations is that they require a lot of labelled data. In many applications, collecting this much data is not feasible. One Shot Learning aims to solve this problem, and it can be implemented using a Siamese neural network.

Introduction :-

In the modern deep learning era, neural networks are good at almost every task, but they rely on large amounts of data to perform well. For certain problems like face recognition and signature verification, we can’t always rely on getting more data. To solve this kind of task, we have a type of neural network architecture called Siamese Networks.

Historically, deep learning algorithms fail to work well if we have only one training example per class. This is because, in many computer vision problems like face recognition and fraud detection (e.g. signature forgery), we have only one or a few template images in the training set, while the candidate image to be verified may have a different background, facial expression, lighting conditions, etc.

SNN uses only a small number of images to get good predictions. The ability to learn from very little data has made Siamese networks more popular in recent years.

One-shot learning is a classification or object categorization task in which one (or a few) examples per class are used to classify many new examples.

Standard Classification vs. One Shot Classification:-

In standard classification, the input image is fed into a series of layers, and at the output we generate a probability distribution over all the classes (typically using a Softmax). For example, if we are trying to classify an image as cat, dog, horse or elephant, then for every input image we generate 4 probabilities, indicating the probability of the image belonging to each of the 4 classes. Two important points must be noticed here. First, during the training process, we require a large number of images for each class (cats, dogs, horses and elephants). Second, if the network is trained only on the above 4 classes of images, then we cannot expect to test it on any other class, for example “zebra”. If we want our model to classify images of zebras as well, then we first need to get a lot of zebra images and then re-train the model. There are applications wherein we do not have enough data for each class, and the total number of classes is huge as well as dynamically changing. Thus, the cost of data collection and periodic re-training is too high.

On the other hand, in a one shot classification, we require only one training example for each class. Yes you got that right, just one. Hence the name One Shot.

Major problems with Convolutional Neural Network :-

Let’s proceed by showcasing the problems of using a naive convolution network by considering a real-world example.

Assume that we want to build a face recognition system for a small organization with only 10 employees (small numbers keep things simple). Using a traditional classification approach, we might come up with a system in which a CNN takes a face image as input and outputs a probability for each of the 10 employees.

Problems :-

a) To train such a system, we first require a lot of different images of each of the 10 persons in the organization, which might not be feasible. (Imagine doing this for an organization with thousands of employees.)

b) What if a new person joins or leaves the organization? You need to take the pain of collecting data again and re-training the entire model. This is practically not possible, especially for large organizations where recruitment and attrition are happening almost every week.

Solution :-

Instead of directly classifying an input (test) image as one of the 10 people in the organization, a Siamese network takes an extra reference image of the person as input and produces a similarity score denoting the chances that the two input images belong to the same person. Typically the similarity score is squashed between 0 and 1 using a sigmoid function, where 0 denotes no similarity and 1 denotes full similarity. Any number between 0 and 1 is interpreted accordingly.

Notice that this network is not learning to classify an image directly to any of the output classes. Rather, it is learning a similarity function, which takes two images as input and expresses how similar they are.

a) In a short while we will see that to train this network, you do not require many instances of a class; only a few are enough to build a good model.

b) The biggest advantage shows up when, say in the case of face recognition, a new employee joins the organization. In order for the network to recognize their face, we only require a single image of the face, which is stored in the database. Using this as the reference image, the network calculates the similarity for any new instance presented to it. Thus we say that the network predicts the score in one shot.
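To make the reference-image idea concrete, here is a minimal Python sketch of looking up a person in a database that holds one encoding per employee. Everything in it is an assumption made for illustration: the names, the 3-number encodings, and the `similarity` function (a simple stand-in that maps distance to a score between 0 and 1, not the output of a trained network).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two encodings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def similarity(u, v):
    """Stand-in similarity: 1.0 for identical encodings, approaching 0 as they move apart."""
    return math.exp(-euclidean(u, v))

# Hypothetical reference encodings, one per employee (one shot each).
reference_db = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}

def identify(query, db, threshold=0.5):
    """Return (best matching name or None, best score) for a query encoding."""
    best_name, best_score = None, 0.0
    for name, ref in db.items():
        score = similarity(query, ref)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

# Enrolling a new employee only adds one encoding; nothing is re-trained.
reference_db["carol"] = [0.4, 0.4, 0.9]
name, score = identify([0.41, 0.38, 0.92], reference_db)
stranger, stranger_score = identify([5.0, 5.0, 5.0], reference_db)
```

In this toy run the query is matched to the newly enrolled person, and an encoding far from every reference is rejected because no score reaches the threshold.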

Architectures :-

1. Convolutional Neural Network :-

In a CNN model, you have a series of convolutional and pooling layers followed by some dense layers and an output layer, usually with a softmax function. The convolutional layers are responsible for feature extraction from the image, whereas the softmax layer is responsible for providing a probability for every class. We then assign the image to the class whose neuron has the highest probability value.
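The last step of such a classifier can be sketched in a few lines of Python. This is only an illustration of the softmax-then-pick-the-maximum step; the raw scores are made-up numbers standing in for the output of the dense layers, and the class names are the four animals from the earlier example.

```python
import math

CLASSES = ["cat", "dog", "horse", "elephant"]

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(logits):
    """Return the most probable class name together with all class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return CLASSES[best], probs

# Hypothetical scores produced by the dense layers for one image.
label, probs = predict_class([2.0, 0.5, 0.1, -1.0])
```

Note that the output always has exactly four entries, which is why a fifth class such as "zebra" cannot be predicted without changing and re-training the network.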

2. Siamese Neural Network :-

1. A Siamese network takes two different inputs and passes them through two identical subnetworks with the same architecture, parameters, and weights.

2. The two subnetworks are a mirror image of each other, just like Siamese twins. Hence, any change to one subnetwork’s architecture, parameters, or weights is also applied to the other subnetwork.

3. Each subnetwork outputs an encoding, and the two encodings are used to calculate the difference between the two inputs.

4. The Siamese network’s objective is to classify whether the two inputs are the same or different using the similarity score. The network can be trained using binary cross-entropy, contrastive loss, or triplet loss, which are techniques from the general distance metric learning approach.

5. A Siamese network is a one-shot classifier that uses discriminative features to generalize to unfamiliar categories from an unknown distribution.
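The weight sharing described in the points above can be sketched as a single encoder object that is used for both inputs. This is a toy illustration, not a real CNN: the `SharedEncoder` class, its one tanh layer, and the weight values are all assumptions chosen to keep the example small.

```python
import math

class SharedEncoder:
    """A toy one-layer encoder. Both branches of the Siamese network reuse this object."""

    def __init__(self, weights):
        self.weights = weights  # one row of weights per output feature

    def encode(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.weights]

def siamese_distance(encoder, x1, x2):
    """Encode both inputs with the same weights and return the Euclidean distance."""
    e1 = encoder.encode(x1)
    e2 = encoder.encode(x2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))

encoder = SharedEncoder([[0.5, -0.2], [0.1, 0.8]])  # hypothetical weights
d_same = siamese_distance(encoder, [1.0, 2.0], [1.0, 2.0])
d_diff = siamese_distance(encoder, [1.0, 2.0], [-1.0, 0.5])
```

Because there is only one set of weights, identical inputs always map to identical encodings, and any weight update automatically affects both branches.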

Working of Siamese neural network :-

Training the Siamese Neural Network

Load the dataset containing the different classes.
Create positive and negative data pairs in the case of contrastive loss. A positive pair is when both inputs belong to the same class, and a negative pair is when the two inputs belong to different classes.
Create triplets of images (anchor image, positive image, negative image) in the case of triplet loss.
Build the convolutional neural network, which outputs the feature encoding using a fully connected layer. These are the sister CNNs through which we will pass the two inputs. The sister CNNs should have the same architecture, hyperparameters, and weights.
Build the differencing layer to calculate the Euclidean distance between the encoding outputs of the two sister CNNs.
The final layer is a fully-connected layer with a single node using the sigmoid activation function to output the similarity score.
Compile the model with a loss function. Some of them are discussed below.
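The pair-creation step from the list above could look roughly like the following Python sketch. It is one possible way to do it, under stated assumptions: the function name `make_pairs` is made up, the samples are placeholder strings instead of images, and the label convention (0 for a same-class pair, 1 for a different-class pair) follows the contrastive loss convention used in this article.

```python
import random

def make_pairs(samples, labels, rng):
    """Build one positive and one negative pair per sample.

    Returns a list of (x1, x2, y) tuples where y = 0 for a same-class pair
    and y = 1 for a different-class pair.
    """
    pairs = []
    for idx, label in enumerate(labels):
        same = [j for j, lab in enumerate(labels) if lab == label and j != idx]
        other = [j for j, lab in enumerate(labels) if lab != label]
        if same:
            pairs.append((samples[idx], samples[rng.choice(same)], 0))
        if other:
            pairs.append((samples[idx], samples[rng.choice(other)], 1))
    return pairs

# Placeholder data: two classes with two samples each.
samples = ["a1", "a2", "b1", "b2"]
labels = ["A", "A", "B", "B"]
pairs = make_pairs(samples, labels, random.Random(0))
```

If the network is instead trained with binary cross-entropy on the sigmoid similarity score, the labels are usually flipped so that 1 means "same".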

Testing the Siamese Neural Network

Send two inputs to the trained model to output the Similarity score.
As the last layer uses the sigmoid activation function, it outputs a value in the range 0 to 1. A Similarity score close to 1 implies that the two inputs are similar. A Similarity score close to 0 implies that the two inputs are dissimilar. A good rule of thumb is to use a similarity cutoff threshold value of 0.5.
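As a small illustration of the testing step, the sketch below turns a distance between two encodings into a similarity score with a sigmoid and applies the 0.5 cutoff. The `weight` and `bias` values are hypothetical stand-ins for what the final single-node layer would learn; a negative weight is assumed so that small distances give scores close to 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def is_same(distance, weight=-4.0, bias=2.0, threshold=0.5):
    """Return (decision, score) for a given distance between two encodings."""
    score = sigmoid(weight * distance + bias)
    return score >= threshold, score

close_decision, close_score = is_same(0.1)  # encodings that are near each other
far_decision, far_score = is_same(1.0)      # encodings that are far apart
```

In practice the threshold can be tuned on validation pairs instead of being fixed at 0.5.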

Contrastive loss function :-

Loss functions are necessary to adjust the weights of this network during training.

The Contrastive loss function differentiates between similar and dissimilar images by contrasting the two inputs. It helps when you don’t know all of your classes at training time and have limited training data. It creates a data encoding that can be used when we have new classes in the future.

Contrastive loss requires positive and negative pairs of training data. A positive pair contains an anchor sample and a positive sample, and a negative pair contains an anchor sample and a negative sample.

The Contrastive loss function’s objective is to have a smaller distance for positive pairs and a greater distance for negative pairs.

The contrastive loss is commonly written as:

L = (1 - Y) * (1/2) * Dw^2 + Y * (1/2) * max(0, m - Dw)^2

In this equation, Y is 0 when the inputs are from the same class; otherwise, Y’s value is 1.

m is the margin, which is always greater than 0. It defines a radius such that dissimilar pairs farther apart than the margin do not contribute to the loss.

Dw is the Euclidean distance between the outputs of the sister Siamese networks.

While minimizing the loss function, the distance between the encodings of positive pairs is minimized and the distance between the encodings of negative pairs is maximized. The margin is used so that if a negative pair is already far enough apart, the loss is zero and the model does not waste effort pushing it further apart.
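Here is a small Python sketch of the contrastive loss for a single pair. It assumes the commonly used form L = (1 - Y) * (1/2) * Dw^2 + Y * (1/2) * max(0, m - Dw)^2, with Y = 0 for a similar pair and Y = 1 for a dissimilar pair, which matches the definitions of Y, m and Dw given above. The margin value of 1.0 is an arbitrary choice for the example.

```python
def contrastive_loss(distance, y, margin=1.0):
    """Contrastive loss for one pair.

    distance: Euclidean distance Dw between the two encodings
    y: 0 if the pair is from the same class, 1 otherwise
    """
    similar_term = (1 - y) * 0.5 * distance ** 2
    dissimilar_term = y * 0.5 * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term

loss_similar_close = contrastive_loss(0.0, y=0)   # similar pair, already identical
loss_similar_far = contrastive_loss(0.5, y=0)     # similar pair, should be pulled together
loss_dissimilar_near = contrastive_loss(0.5, y=1) # dissimilar pair inside the margin
loss_dissimilar_far = contrastive_loss(2.0, y=1)  # dissimilar pair beyond the margin
```

The last case shows the role of the margin: a dissimilar pair that is already farther apart than m contributes nothing to the loss.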

Triplet loss :-

In Triplet loss, we use triplets of data instead of pairs. A triplet is formed by an anchor, a positive sample, and a negative sample.

The triplet loss function minimizes the distance between the anchor and positive sample encodings and maximizes the distance between the anchor and negative sample encodings, thereby learning to differentiate similar images from dissimilar ones.

The triplet loss is commonly written as:

L = max(d(a,p) - d(a,n) + margin, 0)

We minimize this loss, which pushes d(a,n) to be greater than d(a,p) + margin. Note that minimizing the triplet loss does not push d(a,p) all the way to zero or d(a,n) to infinity. It only requires that d(a,n) exceeds d(a,p) by at least the margin. The margin signifies that once the difference between d(a,n) and d(a,p) is large enough, there is no need to push the anchor image and negative image further apart, as the model will not learn from it. Consequently, if d(a,n) is already greater than d(a,p) + margin, the loss is zero and the model does not learn from that triplet. So we need to choose the triplets carefully. In the next section, I discuss how to select triplets to train the model efficiently.
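The triplet loss for a single triplet can be sketched in Python as follows. The sketch assumes the usual form L = max(d(a,p) - d(a,n) + margin, 0) with d being the plain Euclidean distance; some implementations use the squared distance instead. The 2-number encodings and the margin of 0.2 are made up for illustration.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for one (anchor, positive, negative) triplet of encodings."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]                                     # d(a,p) = 1.0
loss_far = triplet_loss(anchor, positive, [3.0, 4.0])     # d(a,n) = 5.0, far beyond the margin
loss_close = triplet_loss(anchor, positive, [0.0, 1.1])   # d(a,n) = 1.1, inside the margin
```

The first triplet gives zero loss because the negative is already more than the margin farther away than the positive, while the second still produces a positive loss.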

Ideal ways to choose training points for this model :-

A practical model needs to have the right mix of both random samples as well as hard samples.

Random Sampling — choose input samples randomly.
Hard Sampling — choose samples where the anchor and negative images look close enough but have slight dissimilarity.

Based on the definition of the loss, there are three categories of triplets:

easy triplets: triplets which have a loss of 0, because d(a,p)+margin<d(a,n)
hard triplets: triplets where the negative is closer to the anchor than the positive, i.e. d(a,n)<d(a,p)
semi-hard triplets: triplets where the negative is not closer to the anchor than the positive, but which still have positive loss: d(a,p)<d(a,n)<d(a,p)+margin

Easy triplets result in zero loss and are not effective at updating the model. Hard triplets produce a larger loss, so the model learns faster and learns more discriminative features from them.

Each of these definitions depends on where the negative is relative to the anchor and positive. We can therefore extend these three categories to the negatives: hard negatives, semi-hard negatives or easy negatives.
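The three categories can be written directly as a small Python function. This is just a restatement of the definitions above in code; the function name `triplet_category` and the example distances are assumptions for illustration.

```python
def triplet_category(d_ap, d_an, margin):
    """Classify a triplet from its anchor-positive and anchor-negative distances."""
    if d_ap + margin < d_an:
        return "easy"       # loss is 0
    if d_an < d_ap:
        return "hard"       # negative is closer to the anchor than the positive
    return "semi-hard"      # negative is farther than the positive, but within the margin

easy = triplet_category(d_ap=1.0, d_an=2.0, margin=0.5)
hard = triplet_category(d_ap=1.0, d_an=0.8, margin=0.5)
semi_hard = triplet_category(d_ap=1.0, d_an=1.2, margin=0.5)
```

A function like this can be used during training to count how many triplets in a batch still carry a useful learning signal.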

Triplet Mining for training :-

A model can be trained on triplets by using either offline or online triplet mining.

Offline Triplet Mining: In this approach, we first generate the triplets manually and then fit the data to the network.

Online Triplet Mining: In this approach, we feed a batch of training data, generate triplets using all examples in the batch, and calculate the loss on them. This approach allows us to randomize the triplets and increases the chance of finding triplets with high losses, which helps train the model faster. For a batch size of N, we can generate at most N³ triplets.
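One possible way to do online mining is sketched below in plain Python: enumerate every valid (anchor, positive, negative) combination inside a batch and keep only the triplets whose loss is greater than zero. The function name, the tiny 4-sample batch, and the margin are assumptions for illustration; a real implementation would do this with tensor operations on the encodings produced by the network.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mine_batch_triplets(embeddings, labels, margin=0.2):
    """Return (anchor, positive, negative, loss) index tuples with positive loss."""
    n = len(embeddings)
    mined = []
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # the positive must be another sample of the anchor's class
            d_ap = euclidean(embeddings[a], embeddings[p])
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue  # the negative must come from a different class
                d_an = euclidean(embeddings[a], embeddings[neg])
                loss = max(d_ap - d_an + margin, 0.0)
                if loss > 0.0:
                    mined.append((a, p, neg, loss))
    return mined

# Hypothetical batch of 2-number encodings with their class labels.
batch = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.5], [5.0, 5.0]]
batch_labels = ["A", "A", "B", "B"]
mined = mine_batch_triplets(batch, batch_labels)
batch_loss = sum(t[3] for t in mined) / len(mined)
```

Averaging only over the triplets with positive loss keeps the easy triplets from diluting the training signal.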

N-way one-shot Learning :-

Note that, for every pair of input images, our model generates a similarity score between 0 and 1. But just by looking at the score, it’s difficult to ascertain whether the model is really able to recognize similar characters and distinguish dissimilar ones.

A nice way to judge the model is N-way one shot learning. Don’t worry, it’s much easier than what it sounds to be.

An example of 4-way one shot learning:

We create a dataset of 4 pairs of images as follows:

Basically the same character is compared to 4 different characters out of which only one of them matches the original character. Let’s say by doing the above 4 comparisons we get 4 similarity scores S1, S2, S3 and S4 as shown. Now if the model is trained properly, we expect that S1 is the maximum of all the 4 similarity scores because the first pair of images is the only one where we have two same characters.

Thus if S1 happens to be the maximum score, we treat this as a correct prediction otherwise we consider this as an incorrect prediction. Repeating this procedure ‘k’ times, we can calculate the percentage of correct predictions as follows:

percent_correct = (100 * n_correct) / k

where k => total no. of trials and n_correct => no. of correct predictions out of k trials.
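The evaluation procedure can be sketched in Python as follows. The similarity function here is a toy stand-in (negative absolute difference between two numbers) rather than a trained Siamese network, and the four trials are made-up data in which the third trial is deliberately one the toy model gets wrong.

```python
def n_way_trial(similarity_fn, test_item, candidates, correct_index):
    """Return True if the matching candidate receives the highest similarity score."""
    scores = [similarity_fn(test_item, c) for c in candidates]
    predicted = max(range(len(scores)), key=lambda i: scores[i])
    return predicted == correct_index

def percent_correct(similarity_fn, trials):
    """trials is a list of (test_item, candidates, correct_index) tuples."""
    k = len(trials)
    n_correct = sum(1 for t, c, i in trials if n_way_trial(similarity_fn, t, c, i))
    return (100 * n_correct) / k

def toy_similarity(a, b):
    """Stand-in for the model: numbers that are closer together are more similar."""
    return -abs(a - b)

# Four 4-way trials; in the third one a distractor (6.9) scores higher than the true match (7.5).
trials = [
    (1.0, [1.1, 5.0, 9.0, 3.0], 0),
    (4.0, [0.0, 4.2, 8.0, 6.0], 1),
    (7.0, [6.9, 2.0, 3.0, 7.5], 3),
    (2.0, [9.0, 8.0, 2.0, 5.0], 2),
]
accuracy = percent_correct(toy_similarity, trials)
```

With a real model, `toy_similarity` would be replaced by a call that feeds both images through the trained network and returns its similarity score.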

Similarly, in 9-way one shot learning the same character is compared to 9 different characters, and in 16-way one shot learning to 16 different characters, only one of which matches in each case.

Some Applications of Siamese Neural network :-

Signature Verification
Facial Recognition
Fingerprint Comparison
Evaluate disease severity based on clinical grading
Text similarity for a job profile to resume matching
Text similarity for pairing similar questions

Features of Siamese Neural network :-

Siamese network is a one-shot classification model and can perform prediction with just a single training example.
More robust to class imbalance as it requires very little information. It can be used on a dataset where very few examples exist for some classes.
The one-shot learning feature of the Siamese network does not rely upon domain-specific knowledge but exploits deep learning techniques.
Learning from Semantic Similarity: A Siamese network focuses on learning embeddings (in the deeper layers) that place the same classes/concepts close together. Hence, it can learn semantic similarity.

Downsides of the Siamese Neural network:-

Needs more training time than normal networks: Since Siamese networks learn from pairs, and the number of possible pairs grows quadratically with the dataset size, training is slower than normal classification-type learning (pointwise learning).
Doesn’t output probabilities: Since training involves pairwise learning, it won’t output the probabilities of the prediction, but the distance from each class.

My Work On Siamese Neural network :-

I will list all the theories, articles and research papers I have studied at the bottom of the article. After getting a decent understanding of Siamese Neural Networks, I found that they are generally used to solve deep learning problems where we have a small amount of data. Deep learning models and standard neural networks require a huge amount of data for training and for predicting correct results, whereas a Siamese Neural Network requires very little training data, and after training it can be used for applications in multiple domains, since it learns generic, or we can say discriminative, features from the input data.
For example, if we have trained the model for animal recognition, then we can use the same model for another application like bird recognition. It can be applied even when you don’t know all of your classes at training time and have limited training data.

First, I implemented 4 Siamese Neural Network models on 4 different datasets. The motivation behind this was to check the accuracy and efficiency of Siamese Neural Networks compared to standard convolutional neural networks.
I will attach all my test projects below. For every model I got very good accuracy, even in cases where it would not be possible to build a model using a CNN. So, I decided to implement the Siamese Neural Network on computer vision problems where we cannot use a standard CNN. The Siamese Neural Network is a modified CNN architecture, which we have discussed above in detail. The 4 implementations are :-
1. Siamese neural network implemented and tested using the Omniglot dataset :- https://github.com/maheshchat/SiameseNN_implentation_Omnilgot

2. Siamese neural network implemented and tested using the Fashion-MNIST dataset. It contains different implementations of SNN using 3 different loss functions (cross entropy, online triplet loss, offline triplet loss).
The aim of the project is to check the accuracy and efficiency of the different loss functions that can be used with a Siamese neural network.
Accuracy achieved using cross entropy is 72 percent.
Accuracy achieved using offline triplet loss is 50 percent.
Accuracy achieved using online triplet loss is 82 percent.
https://github.com/mahesh-chat/SNN_Implementation_MNIST2

3. Siamese neural network implemented and tested using the MNIST dataset :- https://github.com/maheshchat/Siamese_neural_network_Implementation_MNIST

4. Siamese neural network implemented and tested using the Fruits dataset :- https://github.com/maheshchat/Siamese_nueral_network_implentation_FruitsDataset

After implementing SNNs on these popular datasets, I decided to solve computer vision problems. The list of problems is :-

1. Face recognition using Siamese neural network :-

source code :- https://github.com/mahesh-chat/Face_Reacognition_SNN
article :- https://mahesh-chat.medium.com/face-recognition-system-using-siamese-neural-network-130a50aedccf
I have discussed everything about the project in the article. Please check it out for more details.

2. Signature Verification using Siamese neural network :-

source code :- https://github.com/mahesh-chat/Signature_Verification_SNN
article :- https://medium.com/@mahesh-chat/signature-verification-system-using-siamese-neural-network-e0b3d0200596

3. Plant disease detection using Siamese neural network :-

source code :- https://github.com/mahesh-chat/Plant_Disease_Detection

The aim of the project is to detect plant diseases automatically. There are so many plant diseases that there will be many classes, one corresponding to each disease, so a Siamese neural network fits this application.

Conclusion :-

The Siamese neural network is not only used for computer vision problems; it can be used in many deep learning and other applications. A Siamese neural network accepts a variety of inputs, not just images but also numerical data and sequential data such as sentences or time signals. For example, it can be used to solve NLP problems like text similarity and text matching. It can also be used to solve object detection problems and many other problems from other domains. I would like to work on them in the future. I am mentioning some of them below :-

A recruitment tool that reduces the semantic distance between similar resumes and job descriptions and separates non-matching ones from each other.
Similarity between questions on sites like Quora and Stack Overflow, so that users get the most accurate results.
I would also like to work in the medical field, as there are a huge number of diseases and it is important to build a general system that can detect many diseases with a single test. It would save precious time for the treatment.

I hope you like the article. Let me know if you face any issues understanding it. Also, please let me know if it was helpful for your projects or if I should change the way I explain things. Thank you!