Vanishing Gradient Problem!
What? Why? And how to fix it?
In simple words, a vanishing gradient is a gradient (derivative) whose value has shrunk so much that it is practically negligible. The vanishing gradient problem occurs during backpropagation while training a neural network. Once it sets in, the earlier layers effectively stop learning and no further training is possible for them. It typically arises in deep networks with many hidden layers, or when sigmoid or tanh activation functions are used.
What is the Vanishing Gradient?
A vanishing gradient arises during backpropagation, the phase in which we move backward through every layer and update the weights to obtain a better result.
Once we have the output from the last node, we compute the loss from the difference between the predicted value and the actual value. To achieve a better result, we then move backward through the network and update the weights:
Weight(new) = Weight(old) − learning rate × (derivative of the loss with respect to the weight)
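Below is a minimal Python sketch of this update rule for a single weight; the weight, learning rate, and gradient values are made up purely for illustration.

```python
# Hypothetical values, purely for illustration.
weight_old = 0.8       # current value of the weight
learning_rate = 0.1    # step size
grad = 0.05            # dLoss/dWeight obtained from backpropagation

# Gradient-descent update: move the weight against the gradient.
weight_new = weight_old - learning_rate * grad
print(weight_new)      # 0.795
```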
Why Does the Vanishing Gradient Occur?
Consider a small network in which the output node B (ŷ) depends on the weights w1 and w2. During backpropagation we move from B back toward w1 and w2 so that we can update their values. With a vanishing gradient, however, the derivative with respect to w1 becomes extremely small, for example 0.001.
To update the weight: w1(new) = w1(old) − learning rate × derivative. Assuming a learning rate of 0.1, we have
w1(new) = w1(old) − (0.1 × 0.001)
w1(new) = w1(old) − 0.0001
Subtracting 0.0001 from almost any weight value produces a negligible change, so w1(new) is effectively the same as w1(old). The weight stops moving, and the model cannot train itself any further.
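To see where such tiny derivatives come from, here is a small NumPy sketch (the layer count and pre-activation values are assumptions, and the weight terms of the chain rule are ignored for simplicity): backpropagation multiplies one sigmoid derivative per layer, and since the sigmoid's derivative is at most 0.25, the product collapses toward zero as the network gets deeper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25 when x == 0

# Assume a deep network with 10 hidden layers; the pre-activation
# values below are arbitrary, just to illustrate the effect.
pre_activations = np.random.randn(10)

# The chain rule multiplies one derivative term per layer.
gradient = 1.0
for z in pre_activations:
    gradient *= sigmoid_derivative(z)

print(gradient)   # roughly on the order of 1e-7 -- effectively zero
```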
How to identify a Vanishing Gradient?
- No change in the loss: If the loss value barely changes over many epochs, the model may be suffering from a vanishing gradient.
- Weight graph: Plot a weight's value against the epoch number (epoch on the x-axis, weight value on the y-axis). If the curve stays flat, the model has likely encountered a vanishing gradient. A related check, shown in the sketch below, is to inspect the per-layer gradient norms directly.
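The following NumPy sketch illustrates that check on a toy, randomly initialized network with sigmoid activations (the depth, widths, data, and loss are all placeholder assumptions): after a single backward pass, the gradient norms of the earliest layers come out orders of magnitude smaller than those of the last layers, which is why their weights barely move and their weight graphs stay flat.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy network: 10 sigmoid layers of width 16, random weights and input.
depth, width = 10, 16
weights = [rng.normal(0.0, 1.0, (width, width)) for _ in range(depth)]
x = rng.normal(0.0, 1.0, (1, width))
target = np.ones((1, width))

# Forward pass, keeping every activation for backpropagation.
activations = [x]
for W in weights:
    activations.append(sigmoid(activations[-1] @ W))

# Backward pass for a squared-error loss: delta is dLoss/dz at each layer.
out = activations[-1]
delta = (out - target) * out * (1.0 - out)
grad_norms = [0.0] * depth
for i in reversed(range(depth)):
    grad_norms[i] = np.linalg.norm(activations[i].T @ delta)  # dLoss/dW for layer i
    if i > 0:
        # Propagate the error through layer i's weights and the sigmoid below it.
        a = activations[i]
        delta = (delta @ weights[i].T) * a * (1.0 - a)

# Early layers show far smaller gradient norms than later ones.
for i, norm in enumerate(grad_norms):
    print(f"layer {i}: gradient norm = {norm:.2e}")
```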
How to handle the Vanishing Gradient?
- Reduce model complexity: This problem mainly occurs when the model is very deep. Decreasing the model's complexity, for example by reducing the number of hidden layers, lowers the probability of a vanishing gradient.
- Change the activation function to ReLU: Sigmoid and tanh are considered the main causes of the vanishing gradient because their derivatives are small. Switching to ReLU or Leaky ReLU reduces the chances of the problem.
- Proper weight initialization: Techniques such as Xavier (Glorot) initialization help set sensible starting weights, which further reduces the chances of a vanishing gradient. A sketch combining these fixes follows this list.
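As a minimal sketch of these fixes in code (assuming TensorFlow/Keras; the layer sizes, input dimension, and output head are placeholders), the model below keeps the depth modest, uses ReLU in the hidden layers, and sets Glorot (Xavier) initialization explicitly:

```python
import tensorflow as tf

# A shallow model with ReLU hidden layers and Glorot (Xavier) initialization.
# Note: glorot_uniform is already Keras's default initializer; it is written
# out here only to make the choice visible.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(32, activation="relu",
                          kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sigmoid only at the output
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```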
This article gave an overview of the vanishing gradient problem. I hope you find it useful.