A Trajectory is the sequence of states and actions taken by an agent, such as a Large Language Model (LLM) or other AI system, in an environment over time. In machine learning, particularly Reinforcement Learning (RL), a trajectory is an ordered list of transitions, each comprising a state $s_t$, an action $a_t$, a reward $r_t$, and the resulting next state $s_{t+1}$, collected across a given episode or interaction.
Context: Relation to LLMs and Search
The concept of a trajectory is fundamental to understanding how LLMs are optimized to generate desirable sequences of tokens for Generative Engine Optimization (GEO).
- Sequence Generation: When an LLM generates a response (e.g., a Generative Snippet), the entire output is a trajectory. Each generated token is an action $a_t$ taken from the current state $s_t$, which consists of the prompt plus the tokens generated so far. The quality of the final response is the reward accumulated along that trajectory.
- RLHF Alignment: In Reinforcement Learning from Human Feedback (RLHF), the model is fine-tuned to maximize the expected cumulative Reward Function (utility) along its trajectories. A human ranker assigns a high reward to a trajectory (response) that is helpful, authoritative, and safe, and a low reward to one that is unhelpful or contains Hallucination.
- GEO Strategy: An effective GEO strategy aims to steer the LLM’s output trajectory toward one that contains specific canonical facts and Entities. By using Prompt Engineering and highly relevant, structured context (from RAG), a practitioner implicitly biases the model toward a high-utility trajectory that leads to the desired brand-aligned answer.
The Mechanics: Trajectory in Reinforcement Learning
A single trajectory ($\tau$) from a starting state $s_0$ is defined as:
$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T)$$
Where:
- $s_t$: The state of the environment at step $t$ (the prompt plus the text generated so far).
- $a_t$: The action taken by the agent (the next token generated by the LLM).
- $r_t$: The immediate reward received for taking action $a_t$ in state $s_t$.
- $s_T$: The final terminal state (the end of the generated sequence).
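The components above can be sketched as a simple data structure in Python (the names `Transition`, `trajectory`, and the example rewards are illustrative, not taken from any particular RL library):

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str     # s_t: the prompt plus the text generated so far
    action: str    # a_t: the next token chosen by the agent
    reward: float  # r_t: immediate reward for taking a_t in s_t

# A trajectory (tau) is an ordered list of transitions, ending in
# the terminal state s_T (the completed sequence).
trajectory = [
    Transition(state="What is GEO?", action="Generative", reward=0.0),
    Transition(state="What is GEO? Generative", action="Engine", reward=0.0),
    Transition(state="What is GEO? Generative Engine", action="[EOS]", reward=1.0),
]
terminal_state = "What is GEO? Generative Engine"  # s_T
```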
The goal of the LLM is to learn a policy ($\pi$)—a set of rules for choosing the next action—that maximizes the return (the sum of all discounted rewards) for a given starting state:
$$\text{Return} = \sum_{t=0}^{T} \gamma^t r_t$$
Where $\gamma$ (gamma) is the discount factor ($0 \le \gamma \le 1$), which down-weights rewards received further in the future.
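The return can be computed directly from a trajectory's reward sequence; a minimal sketch (the function name `discounted_return` is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Return = sum over t of gamma**t * r_t for one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 1 every reward counts fully; with gamma = 0.5 each
# later reward is halved again: 1.0 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))  # 3.0
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```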
Code Snippet: Conceptual LLM Trajectory
In a generative context, the trajectory of tokens determines the final output:
```
# Prompt (s0): "What is Generative Engine Optimization?"
# Trajectory (tau):
#   (s0, a0 = 'Generative')   -> r0, s1
#   (s1, a1 = 'Engine')       -> r1, s2
#   (s2, a2 = 'Optimization') -> r2, s3
#   (s3, a3 = 'is')           -> r3, s4
#   ... until the terminal token '[EOS]' produces the final state sT
```
In RLHF, the Reward Model typically scores the completed response, assigning a high cumulative reward to this successful trajectory.
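This token-by-token rollout can be simulated with a toy policy, where a hard-coded token list stands in for a real model (`rollout` and `make_toy_policy` are illustrative names, not library functions):

```python
def rollout(policy, prompt, max_steps=16):
    """Execute a policy from s0 = prompt, collecting (s_t, a_t) pairs."""
    state, tau = prompt, []
    for _ in range(max_steps):
        action = policy(state)        # a_t: next token from the policy
        tau.append((state, action))   # rewards are assigned later by a Reward Model
        if action == "[EOS]":         # terminal action ends the episode
            break
        state = state + " " + action  # s_{t+1}: the text grows by one token
    return tau, state                 # the trajectory and terminal state s_T

def make_toy_policy(tokens):
    """A stand-in policy that ignores the state and emits fixed tokens."""
    it = iter(tokens)
    return lambda state: next(it)

tau, s_T = rollout(
    make_toy_policy(["Generative", "Engine", "Optimization", "is", "[EOS]"]),
    prompt="What is Generative Engine Optimization?",
)
print(len(tau))  # 5 transitions, the last being the '[EOS]' action
```

A real LLM's policy conditions on the state (the full token context); the toy policy above drops that dependence purely to keep the sketch self-contained.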
Related Terms
- Reinforcement Learning from Human Feedback (RLHF): The process that uses trajectories to train a Reward Model.
- Policy: The agent’s strategy for choosing the next action (token) at any given state.
- Inference: The process of executing a trained policy to generate a trajectory (the final answer).