A Trajectory is the sequence of states and actions taken by an agent, such as a Large Language Model (LLM) or other AI system, in an environment over time. In machine learning, particularly Reinforcement Learning (RL), a trajectory is an ordered list of transitions, each comprising a state $s_t$, an action $a_t$, a reward $r_t$, and the resulting next state $s_{t+1}$, collected across a given episode or interaction.
Context: Relation to LLMs and Search
The concept of a trajectory is fundamental to understanding how LLMs are optimized to generate desirable sequences of tokens for Generative Engine Optimization (GEO).
- Sequence Generation: When an LLM generates a response (e.g., a Generative Snippet), the entire output is a trajectory. Each generated token is an action $a_t$ taken from the current state $s_t$, which consists of the prompt plus the tokens generated so far. The quality of the final response is the reward accumulated along that trajectory.
- RLHF Alignment: In Reinforcement Learning from Human Feedback (RLHF), the model is fine-tuned to maximize the expected cumulative Reward Function (utility) along its trajectories. A human ranker assigns a high reward to a trajectory (response) that is helpful, authoritative, and safe, and a low reward to one that is unhelpful or contains Hallucination.
- GEO Strategy: An effective GEO strategy aims to steer the LLM’s output trajectory toward one that contains specific canonical facts and Entities. By using Prompt Engineering and highly relevant, structured context (from RAG), a practitioner implicitly biases the model toward a high-utility trajectory that leads to the desired brand-aligned answer.
The Mechanics: Trajectory in Reinforcement Learning
A single trajectory ($\tau$) from a starting state $s_0$ is defined as:
$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T)$$
Where:
- $s_t$: The state of the environment at step $t$ (the prompt plus the text generated so far).
- $a_t$: The action taken by the agent (the next token generated by the LLM).
- $r_t$: The immediate reward received for taking action $a_t$ in state $s_t$.
- $s_T$: The final terminal state (the end of the generated sequence).
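The components above can be sketched as a simple data structure in Python (the names `Transition`, `trajectory`, and the example rewards are illustrative, not taken from any particular RL library):

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str     # s_t: the prompt plus the text generated so far
    action: str    # a_t: the next token chosen by the agent
    reward: float  # r_t: immediate reward for taking a_t in s_t

# A trajectory (tau) is an ordered list of transitions, ending in
# the terminal state s_T (the completed sequence).
trajectory = [
    Transition(state="What is GEO?", action="Generative", reward=0.0),
    Transition(state="What is GEO? Generative", action="Engine", reward=0.0),
    Transition(state="What is GEO? Generative Engine", action="[EOS]", reward=1.0),
]
terminal_state = "What is GEO? Generative Engine"  # s_T
```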
The goal of the LLM is to learn a policy ($\pi$)—a set of rules for choosing the next action—that maximizes the return (the sum of all discounted rewards) for a given starting state:
$$\text{Return} = \sum_{t=0}^{T} \gamma^t r_t$$
Where $\gamma$ (gamma) is the discount factor ($0 \le \gamma \le 1$), which down-weights rewards received further in the future.
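The return can be computed directly from a trajectory's reward sequence; a minimal sketch (the function name `discounted_return` is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Return = sum over t of gamma**t * r_t for one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 1 every reward counts fully; with gamma = 0.5 each
# later reward is halved again: 1.0 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))  # 3.0
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```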
Code Snippet: Conceptual LLM Trajectory
In a generative context, the trajectory of tokens determines the final output:
```
# Prompt (s0): "What is Generative Engine Optimization?"
# Trajectory (tau):
#   (s0, a0 = 'Generative')   -> r0, s1
#   (s1, a1 = 'Engine')       -> r1, s2
#   (s2, a2 = 'Optimization') -> r2, s3
#   (s3, a3 = 'is')           -> r3, s4
#   ... until the terminal token '[EOS]' produces the final state sT
```
In RLHF, the Reward Model typically scores the completed response, assigning a high cumulative reward to this successful trajectory.
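This token-by-token rollout can be simulated with a toy policy, where a hard-coded token list stands in for a real model (`rollout` and `make_toy_policy` are illustrative names, not library functions):

```python
def rollout(policy, prompt, max_steps=16):
    """Execute a policy from s0 = prompt, collecting (s_t, a_t) pairs."""
    state, tau = prompt, []
    for _ in range(max_steps):
        action = policy(state)        # a_t: next token from the policy
        tau.append((state, action))   # rewards are assigned later by a Reward Model
        if action == "[EOS]":         # terminal action ends the episode
            break
        state = state + " " + action  # s_{t+1}: the text grows by one token
    return tau, state                 # the trajectory and terminal state s_T

def make_toy_policy(tokens):
    """A stand-in policy that ignores the state and emits fixed tokens."""
    it = iter(tokens)
    return lambda state: next(it)

tau, s_T = rollout(
    make_toy_policy(["Generative", "Engine", "Optimization", "is", "[EOS]"]),
    prompt="What is Generative Engine Optimization?",
)
print(len(tau))  # 5 transitions, the last being the '[EOS]' action
```

A real LLM's policy conditions on the state (the full token context); the toy policy above drops that dependence purely to keep the sketch self-contained.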
Related Terms
- Reinforcement Learning from Human Feedback (RLHF): The process that uses trajectories to train a Reward Model.
- Policy: The agent’s strategy for choosing the next action (token) at any given state.
- Inference: The process of executing a trained policy to generate a trajectory (the final answer).