From AI to Generative AI: Understanding the Magic Behind Our Machines
Neural Networks and Transformers: The Foundation of Modern AI
Consider the process of teaching a young child to identify animals in a picture book. You show them images of cats, dogs, birds, and so on, pointing out features like "cats have whiskers" or "birds have wings." Each time you show them a picture, they make a guess, and you correct them if they're wrong, praising them when they're right. Over time, they improve at recognizing animals independently.

A neural network operates similarly, but instead of animals, it learns to recognize patterns in data (e.g., text, images, sound). Think of it as a digital brain with multiple layers. The first layer might learn to recognize simple patterns, like lines or colors in an image. The next layer combines these patterns to recognize shapes, like ears or tails. As you go deeper, the layers build on one another to recognize increasingly complex patterns, like entire faces or bodies.

When you train a neural network, you're showing it many examples (the picture book), telling it what it's seeing (this is a cat, this is a dog), and allowing it to make guesses. If it guesses incorrectly, the network adjusts its inner workings slightly to improve its guess next time. This process repeats many times, with the network continually improving until it can reliably identify what it sees, sometimes even better than we can.

AI researchers have been working on neural networks for decades, and they have made significant progress in the field. This progress is the result of trying different approaches, testing, iterating, and learning from failures. Some neural network architectures have been more successful than others, and one of the most successful is called the Transformer. Researchers created and used other architectures before Transformers, like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), but Transformers have become the go-to architecture for many AI applications today. Let's explore these architectures in more detail.
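To make this guess-and-adjust loop concrete, here is a minimal sketch in plain Python with NumPy. It trains a tiny two-layer network on a toy problem (learning XOR); the layer sizes, learning rate, and step count are illustrative choices, not a recommendation:

```python
import numpy as np

# A tiny two-layer network learning XOR: guess, compare, adjust, repeat.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # the "picture book"
y = np.array([[0], [1], [1], [0]], dtype=float)              # the correct answers

W1 = rng.normal(size=(2, 8))  # first layer: learns simple patterns
W2 = rng.normal(size=(8, 1))  # second layer: combines them into an answer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(10000):
    # Forward pass: the network makes its guesses.
    hidden = sigmoid(X @ W1)
    guess = sigmoid(hidden @ W2)

    # Compare the guesses with the right answers.
    error = guess - y

    # Backward pass: nudge the weights to guess better next time.
    grad_out = error * guess * (1 - guess)
    grad_hidden = (grad_out @ W2.T) * hidden * (1 - hidden)
    W2 -= 0.5 * hidden.T @ grad_out
    W1 -= 0.5 * X.T @ grad_hidden

print(np.round(guess, 2))  # after training, the guesses approach [0, 1, 1, 0]
```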
Convolutional Neural Networks
To understand the concept, imagine you have an intelligent robot that excels at recognizing objects in images, such as identifying where your cat is hiding in a photo. This robot uses something known as a Convolutional Neural Network (CNN), so named because it uses a mathematical operation called convolution to analyze images. CNNs function like artificial eyes, scanning every little detail of the image, analyzing a small patch of pixels at a time, and then putting all those patches together to understand the image as a whole. Instead of trying to see the whole image of the cat at once, the CNN starts by looking for simple patterns, like the stripes on your cat's fur and the distinctive color of its eyes. It then combines all the patterns it has identified into a more complex understanding of the image. If connecting those patterns together results in a picture of a cat, the robot will be able to identify your pet. This process of breaking down an image into smaller parts makes CNNs a powerful tool for recognizing details and patterns in images.
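To see what convolution actually does, here is a minimal sketch in Python with NumPy: a small 3x3 filter slides over a toy 6x6 image and scores how strongly each patch matches a pattern (here, a vertical edge). The image and filter values are illustrative:

```python
import numpy as np

# A toy 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A classic vertical-edge filter: left-to-right brightness changes score high.
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

# Slide the filter over every 3x3 patch (this sliding is the convolution).
h, w = image.shape
scores = np.zeros((h - 2, w - 2))
for i in range(h - 2):
    for j in range(w - 2):
        patch = image[i:i + 3, j:j + 3]
        scores[i, j] = np.sum(patch * edge_filter)

print(scores)  # large magnitudes mark where the edge sits
```

A real CNN learns its filter values during training instead of using hand-picked ones, and stacks many such filters in layers, but the sliding-window idea is the same.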
However, even intelligent robots encounter problems. One issue with CNNs is that while they're excellent at examining images, they sometimes become confused when things need to be understood in a specific sequence. Reading a comic book, for instance, is challenging for a CNN because it doesn't easily grasp the sequence of the story. This is why researchers have developed other neural network architectures that are better at understanding sequences, and this is where Recurrent Neural Networks (RNNs) come into play.
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are like having a friend who adores stories and remembers every detail from beginning to end. We all have that one friend with this extraordinary ability! This friend, symbolizing the Recurrent Neural Network, is exceptional at following sequences, like the order of events in a story, the steps in a recipe, or the moves in a dance. Unlike the CNN robot that examines each part of an image without considering what comes before or after, the RNN pays attention to the entire story, keeping track of what occurred earlier to understand what's happening now and predict what will happen next. If an RNN is given a text, it looks at one word at a time: the first, then the second, the third, and so on. As it reads each new word, it tries to remember important details from what it has seen before and combines what it remembers with the new word to make sense of the sequence so far. Based on this understanding, the RNN can predict what the next word might be.
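Here is a minimal sketch in Python with NumPy of the idea behind one recurrent step: at each word, the network mixes its memory of the sequence so far with the new input to form an updated memory. The sizes and random weights are illustrative stand-ins, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3  # illustrative sizes

W_h = rng.normal(size=(hidden_size, hidden_size))  # old memory -> new memory
W_x = rng.normal(size=(hidden_size, input_size))   # new word -> new memory

hidden = np.zeros(hidden_size)  # empty memory before reading anything
sentence = [rng.normal(size=input_size) for _ in range(5)]  # 5 toy word vectors

for word in sentence:
    # Combine what we remember with the word we just read.
    hidden = np.tanh(W_h @ hidden + W_x @ word)

print(hidden)  # a summary of the sequence so far, used to predict the next word
```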
However, RNNs also face challenges. Imagine trying to remember a story that is extremely long. By the time you reached the end, you might forget details about how it started. In real life, you may have experienced this when reading long books with complex storylines: the further you read, the harder it gets to remember all the details from the beginning. RNNs have this problem too; the memory of the beginning fades as the sequence progresses. During training, this weakness shows up as the "vanishing gradient problem," which we'll unpack below.
Long Short-Term Memory Networks
To help our story-loving friend remember longer tales, scientists developed a special kind of RNN called Long Short-Term Memory (LSTM) Networks. LSTMs are like giving our friend a magical notebook that helps them remember important details from the story, regardless of its length. This notebook has special rules about what to jot down, what to erase, and what to keep an eye on. These special rules help the RNN remember both the start and the end of the story and make sense of everything in between. In practical terms, a Long Short-Term Memory network has a memory cell that helps it store and retrieve information when the sequence becomes long.
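The "magical notebook" maps onto the standard LSTM equations. Below is a minimal sketch in Python with NumPy of a single LSTM step; the weights are random stand-ins and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # size of the memory cell (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=n)  # the new piece of the story
h = np.zeros(n)         # short-term view: what we focus on right now
c = np.zeros(n)         # the notebook: the long-term memory cell

# Random stand-in weights for the three gates and the candidate notes.
Wf, Wi, Wo, Wc = (rng.normal(size=(n, 2 * n)) for _ in range(4))

z = np.concatenate([h, x])       # combine the current view with the new input
f = sigmoid(Wf @ z)              # forget gate: what to erase from the notebook
i = sigmoid(Wi @ z)              # input gate: what new notes to jot down
o = sigmoid(Wo @ z)              # output gate: what to keep an eye on now
c = f * c + i * np.tanh(Wc @ z)  # update the notebook
h = o * np.tanh(c)               # the updated short-term view

print(h, c)
```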
LSTMs helped RNNs understand long sequences and remember details, but they are not perfect. Their main limitation is still the vanishing gradient problem: when the gradients become too small, the network stops learning.
To understand the technical details, let's go step by step:
- When training a neural network, we use a process called backpropagation.
- The goal of backpropagation is to adjust the weights of the network.
- Consider weights as parameters that determine how much influence a neuron has on another one.
- Backpropagation works by calculating gradients.
- Gradients are signals that tell us how to change the weights to reduce errors in the network's predictions.
- When these gradients are calculated, they are propagated back through the network, from the output layer back to the input layers. The goal of adjusting the weights is to minimize the error between the predicted output and the actual output.
- In RNNs and LSTMs, this process consists of multiplying the gradients through many layers.
- When the gradient is smaller than 1, it can vanish as it is multiplied through many layers (multiplying a number by a factor smaller than 1 over and over makes it smaller and smaller, as the sketch after this list shows).
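The arithmetic behind this is easy to see. In the minimal sketch below, 0.5 is an illustrative stand-in for a per-layer gradient factor smaller than 1:

```python
# Repeatedly multiplying a gradient by a factor smaller than 1
# shrinks it toward zero: the vanishing gradient problem.
gradient = 1.0
per_layer_factor = 0.5  # illustrative stand-in for a real gradient term

for layer in range(1, 21):
    gradient *= per_layer_factor
    if layer in (1, 5, 10, 20):
        print(f"after {layer:2d} layers: gradient = {gradient:.8f}")

# After 20 layers the gradient is about 0.00000095, far too small to learn from.
```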
This is the vanishing gradient problem, and it's a core issue with RNNs. LSTM networks mitigate it, but when the neural network is very deep (has too many layers), the gradients can still vanish. This is where Gated Recurrent Units (GRUs) come into play.
Gated Recurrent Units (GRUs)
Gated Recurrent Units (GRUs) are a type of RNN that addresses the vanishing gradient problem with a streamlined gating design. A GRU has two gates:
- The update gate: This gate decides how much of the new information to keep and pass along to the next step.
- The reset gate: This gate decides how much of the past information to forget.
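Here is a minimal sketch in Python with NumPy of a single GRU step showing both gates in action; the weights are random stand-ins and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # illustrative size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=n)  # new input
h = np.zeros(n)         # memory carried over from the previous step

# Random stand-in weights for the two gates and the candidate memory.
Wz, Wr, Wh = (rng.normal(size=(n, 2 * n)) for _ in range(3))

z = sigmoid(Wz @ np.concatenate([h, x]))  # update gate: how much new info to keep
r = sigmoid(Wr @ np.concatenate([h, x]))  # reset gate: how much past info to forget
h_candidate = np.tanh(Wh @ np.concatenate([r * h, x]))  # proposed new memory
h = (1 - z) * h + z * h_candidate  # blend the old memory with the new

print(h)
```

Because the update gate can pass the old memory through almost unchanged, gradients have a more direct path backward through the sequence, which is what softens the vanishing gradient problem.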