A Recurrent Neural Network (RNN) is a type of neural network designed specifically to handle sequential data, such as text, speech, or time-series data. Unlike traditional feed-forward networks, RNNs have a ‘memory’ in the form of a hidden state that is passed from one step (or element in the sequence) to the next. This allows the network to retain information about past inputs while processing the current one, making RNNs well-suited for tasks where context and order are critical.
Context: Relation to LLMs and Search
RNNs were the foundational architecture for early neural language models and natural language processing (NLP) systems, establishing key concepts that paved the way for the more advanced Transformer Architecture behind today's Large Language Models (LLMs).
- Historical Significance: RNNs and their variants, such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), were the state-of-the-art models for machine translation, speech recognition, and basic Text Generation before the introduction of the Transformer. They demonstrated that neural networks could model the sequential nature of human language.
- The Contextual Hidden State: The hidden state in an RNN allowed it to encode the Semantics and context of the words that came before. This was a primitive form of the Contextual Embedding that modern LLMs now master.
- Limits in Generative Engine Optimization (GEO): RNNs suffered from two critical problems that limited their scalability in modern GEO systems:
- Vanishing Gradient: They struggled to maintain context over long sequences because gradients shrink as they are propagated back through many time steps, a problem known as the vanishing gradient problem, which hindered the Training of networks on long-range dependencies (a worked illustration follows this list).
- Sequential Processing: They could only process text one token at a time, with each step waiting on the previous hidden state, which made them slow to train and prevented parallelization across the sequence on modern hardware, a necessity for training billion-parameter LLMs. The Transformer’s Self-Attention Mechanism solved both of these issues.
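To make the first limitation concrete, consider (using the notation introduced in the Mechanics section below) the gradient that flows from step $t$ back to an earlier step $k$. It contains a product of per-step Jacobians:
$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} \mathrm{diag}\big(f'(a_i)\big)\, W_{hh}, \qquad a_i = W_{hh} h_{i-1} + W_{xh} x_i + b_h$$
When the norm of each factor is below one (the derivative of Tanh never exceeds one), this product shrinks exponentially as $t - k$ grows, so the signal from distant tokens effectively vanishes; factors with norms above one cause the opposite, exploding-gradient, problem.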
The Mechanics: Recurrence and State
The core idea of the RNN is that it processes an input sequence $x_1, x_2, \ldots, x_T$ one time step at a time, using the following quantities at each step $t$:
- Current Input ($x_t$): The word or token at the current time step.
- Previous Hidden State ($h_{t-1}$): The output of the network from the previous time step, which acts as the ‘memory’ of the sequence so far.
- New Hidden State ($h_t$): A new hidden state is calculated by combining $x_t$ and $h_{t-1}$ (typically through a non-linear Activation Function like ReLU or Tanh).
- Output ($y_t$): The output for the current time step is calculated from the new hidden state $h_t$.
$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
Here $f$ is the non-linear activation function, $W_{hh}$, $W_{xh}$, and $W_{hy}$ are learned weight matrices, and $b_h$, $b_y$ are learned bias vectors.
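The following is a minimal sketch of this forward pass in NumPy, mirroring the two equations above. The dimensions, random parameter values, and the choice of Tanh are illustrative assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 16, 4   # illustrative sizes

# Parameters (randomly initialised placeholders standing in for learned weights)
W_xh = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
W_hy = rng.standard_normal((output_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_forward(xs):
    """Run the recurrence over a sequence xs of shape (T, input_dim)."""
    h = np.zeros(hidden_dim)                       # h_0: the initial, empty 'memory'
    outputs = []
    for x_t in xs:                                 # strictly sequential: step t needs h_{t-1}
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)   # h_t = f(W_hh h_{t-1} + W_xh x_t + b_h)
        outputs.append(W_hy @ h + b_y)             # y_t = W_hy h_t + b_y
    return np.stack(outputs), h

ys, h_final = rnn_forward(rng.standard_normal((5, input_dim)))
print(ys.shape)   # (5, 4): one output vector per time step
```

Note that the loop over time steps cannot be parallelized away, which is exactly the sequential-processing limitation discussed above.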
Gated RNNs (LSTM and GRU)
LSTMs and GRUs were developed specifically to combat the vanishing gradient problem. LSTMs introduce internal ‘input,’ ‘forget,’ and ‘output’ gates that control the flow of information into and out of a dedicated cell state; GRUs achieve a similar effect with a simpler pair of ‘update’ and ‘reset’ gates. This gating lets the network selectively remember or forget past information over long spans of the sequence, substantially improving its ability to capture long-range dependencies in language.
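For reference, one common formulation of the LSTM update makes these gates explicit, where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $c_t$ is the cell state that carries long-range memory:
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
Because the cell state $c_t$ is updated additively, scaled by the forget gate, rather than repeatedly squashed through a non-linearity, gradients can survive over far more time steps than in a vanilla RNN.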
Related Terms
- Transformer Architecture: The successor to RNNs that uses Self-Attention to eliminate recurrence.
- Sequence-to-Sequence (Seq2Seq): The framework, historically implemented with two RNNs (an Encoder and a Decoder), used for tasks like machine translation.
- Backpropagation Through Time (BPTT): The variant of backpropagation used to train RNNs by unrolling the recurrence across time steps.