Supervised Prompt Optimization
Dec 10, 2025
5 mins
The hand-crafted loop everyone knows
Most teams rely on the same hand-crafted loop for prompt engineering. An engineer writes a prompt, tests it on a few examples, notices where it fails, then rewrites parts of the prompt, either manually or with help from another model. This often leads to a new set of failures: sometimes the updated prompt overfits to the new examples and breaks on the old ones, and the cycle repeats. Other times the edits simply don’t help.
Because this process depends so much on intuition and “feel,” the final prompt ends up being more of an art piece than a systematically optimized artifact. And when a new, cheaper model arrives, the cycle usually starts over because the same prompt rarely transfers cleanly across models.
This process is time-consuming and unforgiving. What if we could do better and formalize it? The basic loop is clear: identify failure modes, update the prompt, and repeat.
This blog is inspired by the GEPA paper and discusses our approach to implementing GEPA, along with what we learned in the process.
The optimization loop

Fig 1: Optimization Loop
The loop starts with a seed prompt. We evaluate this prompt on a validation split and place it in the candidate pool. Each iteration then selects one prompt from this pool using any sampling strategy of our choice and runs it on a minibatch from the training set. This produces two things: the model’s outputs and the judge’s feedback and/or scores.
If the judge signals that the prompt has room for improvement, a refinement step runs. The system reflects on the minibatch traces and proposes a new candidate prompt. This child prompt is first tested on the same minibatch. If its aggregated score beats the parent, we run a full validation-set evaluation and then add the new prompt to the candidate pool. If not, the proposal is discarded.
This cycle continues until the compute budget is exhausted. Over time, the pool evolves toward prompts that perform better under supervised, example-level feedback.
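To make the loop concrete, below is a minimal Python sketch of it. All model and judge calls (run_prompt, judge, propose_candidate, select_candidate) are injected as placeholder callables; the names and structure are our illustration, not production code or the GEPA reference implementation.

```python
import random
from statistics import mean

def optimize(seed_prompt, train_set, val_set, *,
             run_prompt, judge, propose_candidate, select_candidate,
             budget=15, minibatch_size=10, max_score=5):
    """Illustrative GEPA-style loop.

    run_prompt(prompt, example)       -> model output
    judge(example, output)            -> (feedback_text, score)
    propose_candidate(prompt, traces) -> new candidate prompt (reflection step)
    select_candidate(pool)            -> one entry from the candidate pool
    """
    def evaluate(prompt, dataset):
        # Per-example scores; used for the full validation-set evaluation.
        return [judge(ex, run_prompt(prompt, ex))[1] for ex in dataset]

    # Seed the candidate pool with the starting prompt and its validation scores.
    pool = [{"prompt": seed_prompt, "val_scores": evaluate(seed_prompt, val_set)}]

    for _ in range(budget):
        parent = select_candidate(pool)               # e.g. Pareto-based sampling
        batch = random.sample(train_set, minibatch_size)

        outputs = [run_prompt(parent["prompt"], ex) for ex in batch]
        traces = [(ex, out, *judge(ex, out)) for ex, out in zip(batch, outputs)]
        parent_scores = [score for *_, score in traces]

        if mean(parent_scores) >= max_score:          # no room for improvement here
            continue

        # Reflection step: propose a child prompt from the minibatch traces.
        child = propose_candidate(parent["prompt"], traces)
        child_scores = [judge(ex, run_prompt(child, ex))[1] for ex in batch]

        # Accept the child only if it beats the parent on the same minibatch,
        # then run the full validation evaluation before adding it to the pool.
        if mean(child_scores) > mean(parent_scores):
            pool.append({"prompt": child, "val_scores": evaluate(child, val_set)})

    return pool
```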

Fig 2: Candidate Selection
GEPA uses a Pareto-based strategy for selecting which candidate prompts should continue evolving. The idea is best understood through the figure above. Each prompt (P0–P4) has scores across several tasks (T1–T3). Higher scores mean better performance. Below are the steps in this selection strategy:
Identify the best prompt for each task
For every task, we look for the prompt that achieves the highest score.
T1 → P1, P4
T2 → P2
T3 → P4
This gives us a set of task-wise winners.
Remove strictly dominated prompts
A prompt is strictly dominated if there exists another prompt that performs at least as well on every task and strictly better on at least one.
In the example:
P1 performs best on T1, but P4 matches P1 on T1 and outperforms it on the other tasks.
Therefore, P1 is strictly dominated by P4 and is removed.
The surviving prompts are those that are not dominated by any other prompt. These are the candidates worth exploring further because each of them represents a different tradeoff across tasks.
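Here is a small Python sketch of this selection. The scores below are hypothetical stand-ins for the figure, chosen only so that the task-wise winners and the dominated prompt match the walkthrough above:

```python
def pareto_candidates(scores):
    """Given {prompt_id: [score_on_T1, score_on_T2, ...]}, return the
    non-dominated task-wise winners (illustrative sketch)."""
    num_tasks = len(next(iter(scores.values())))

    # Step 1: collect every prompt that achieves the best score on some task.
    winners = set()
    for t in range(num_tasks):
        best = max(s[t] for s in scores.values())
        winners |= {p for p, s in scores.items() if s[t] == best}

    # Step 2: drop prompts strictly dominated by another winner
    # (another prompt is >= on every task and > on at least one).
    def dominated(p):
        return any(
            all(scores[q][t] >= scores[p][t] for t in range(num_tasks))
            and any(scores[q][t] > scores[p][t] for t in range(num_tasks))
            for q in winners if q != p
        )

    return sorted(p for p in winners if not dominated(p))


# Hypothetical scores in the spirit of Fig 2 (not the actual figure values):
scores = {"P0": [2, 1, 1], "P1": [5, 2, 2], "P2": [3, 5, 1],
          "P3": [3, 3, 3], "P4": [5, 3, 4]}
print(pareto_candidates(scores))  # ['P2', 'P4']; P1 is dominated by P4 and dropped
```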
This strategy tends to pick “specialist” prompts, i.e., candidates that are uniquely strong on at least one task. The hope is that combining their strengths across iterations leads to better overall prompts.
However, this strategy also has a limitation:
A prompt that performs consistently well across all tasks but is never the single best on any one of them would not be surfaced by Step 1. Such a “generalist” prompt with solid, balanced performance may be ignored by the selection method.
This illustrates that Pareto-based selection is only one way to rank candidates. It is useful, but not always ideal depending on the optimization goals.
Implementation & Results
Let’s take a look at how we optimized the prompt for creating plans which a retrieval agent would execute. Below is an example data point in our dataset:
For each user query, we have a gold plan and the reasoning behind it. We found that annotating the human reasoning behind each plan generates high-quality feedback from the LLM judge. The goal of each data point is to convey how we want the plans to look and the logic behind them. Ideally, the logic from each data point should get encoded in the final prompt.
For this task we used 75 labelled samples. The input consists of the user query, the date, employee details of the person asking the question, employee details of the person the question is asked about, and the enabled data sources. To keep things simple, we didn’t include other information like memory instructions and custom instructions.
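Since the exact schema is internal, here is a purely hypothetical data point that illustrates the fields described above; every field name and value below is made up:

```python
# Hypothetical data point; field names and values are illustrative only.
example = {
    "user_query": "What has XYZ been working on recently?",
    "date": "2025-11-20",
    "asker_details": {"name": "...", "team": "...", "role": "..."},
    "asked_about_details": {"name": "XYZ", "team": "...", "role": "..."},
    "enabled_data_sources": ["docs", "email", "calendar"],
    "gold_plan": [
        "Look at docs authored by XYZ",
        "Look at docs in which XYZ participated",
    ],
    "plan_reasoning": (
        "The query is person-centric, so do not infer extra data sources or add "
        "synthesis steps; the agent decides how many search calls each step needs."
    ),
}
```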
The output for this task is just a list of steps in the plan. For evaluation, we used an LLM as a judge, which gave natural-language feedback. For scoring the generated plans, we initially used 0/1 scoring, where 1 meant the generated plan fully aligned with the gold plan. But this scoring was too strict, and the candidate pool was never able to evolve.
To make the scoring a bit more relaxed, we decided to have the LLM give a score between 0 and 5 and defined what each score meant. Here is what an example piece of feedback looks like in our case:
In the above example, the generated plan might be correct and still work, but it’s not what we want: it is too granular, iterates over each data source, and includes synthesis steps. For such queries, we don’t want to infer any sources; the plan should simply look at the docs authored by the person and the docs in which XYZ participated. Whether a step needs more search calls is left to the agent. From this feedback, the hope is that the improved prompt will learn how to decompose tasks, avoid splitting steps per data source, avoid inferring data sources that are not trivially inferrable, and drop synthesis steps.
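For reference, the judge prompt has roughly the following shape. The wording and rubric below are an illustrative reconstruction, not the exact prompt or score definitions we use:

```python
# Illustrative judge prompt; the rubric wording is a stand-in, not our real one.
JUDGE_PROMPT = """You are grading a generated retrieval plan against a gold plan.

User query: {query}
Gold plan: {gold_plan}
Annotated reasoning behind the gold plan: {reasoning}
Generated plan: {generated_plan}

Score the generated plan from 0 to 5:
5 - matches the gold plan's structure and intent
4 - minor deviations that would not change the outcome
3 - right overall approach but noticeably over- or under-decomposed
2 - partially aligned, e.g. infers data sources or adds synthesis steps
1 - mostly misaligned with the gold plan
0 - unusable or unrelated to the query

Return the score and a short natural-language explanation of what to fix,
referencing the annotated reasoning where relevant."""
```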
For the optimization loop, we used a minibatch size of 10 and generated 5 candidates per proposal (to reduce variance). We used GPT-5 with ‘medium’ reasoning for candidate proposals. This configuration gave us the best results, and the plans generated by the final prompt scored an average of 4.1 (on the 0–5 scale) after 15 iterations. We have attached the starting and optimized prompts in the appendix.
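In terms of the loop sketch above, this configuration corresponds to something like the following (the field names are ours, purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class OptimizerConfig:
    minibatch_size: int = 10           # training examples sampled per iteration
    candidates_per_proposal: int = 5   # child prompts generated per reflection step
    proposal_model: str = "gpt-5"      # with 'medium' reasoning effort
    target_model: str = "gpt-4.1"      # the model that actually runs the prompt
    iterations: int = 15
```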
Key Learnings
Iterate on the error-to-candidate-prompt loop: The whole system depends on a single step, looking at the errors from the current prompt and then proposing an improved prompt. If this step fails, the entire optimization breaks.
To keep it healthy (an illustrative proposal prompt is sketched after this list):
iterate on the proposal prompt itself until it consistently produces sensible edits
verify that judge feedback is specific and actionable
inspect generated candidate prompts to make sure they stay within your intended design space
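As a starting point, a proposal (reflection) prompt can look roughly like the sketch below. This is an illustrative template, not our production proposal prompt:

```python
# Illustrative reflection/proposal prompt; iterate on this until it reliably
# produces sensible, in-scope edits.
PROPOSAL_PROMPT = """You are improving a prompt that generates retrieval plans.

Current prompt:
{current_prompt}

Below are minibatch examples where the current prompt was used, together with
the judge's feedback on each output:
{traces}

Reflect on the recurring failure modes, then write an improved prompt.
Constraints:
- Preserve the original output format and required sections.
- Encode general rules learned from the feedback; do not hard-code details
  of individual examples.
- Stay within the original prompt's intended design space."""
```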
High-quality feedback matters: For open-ended tasks (e.g., plan generation), all learning comes from the judge’s natural-language feedback. Simply comparing an output to ground truth is not enough; the model needs reasoning. Annotating examples with brief rationales produced much better and more constructive feedback in practice, which in turn led to more meaningful prompt improvements.
Using a stronger model for candidate proposals helped: In our case, we use gpt-4.1 for creating plans. Initially in the optimization loop, we also used gpt-4.1 for proposing new candidates, thinking that a model from the same family might help. But when we switched to gpt-5 for candidate proposals, we got noticeably better prompts.
Generating multiple candidate prompts helped: Generating several candidate prompts per step reduced variance and increased the chance of discovering a genuinely better prompt.
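In terms of the earlier loop sketch, this is just a best-of-k wrapper around the reflection step (illustrative):

```python
def propose_best_of_k(parent_prompt, traces, batch, *,
                      propose_candidate, run_prompt, judge, k=5):
    """Generate k candidate prompts from the same traces and keep the one
    that scores best on the same minibatch (illustrative sketch)."""
    candidates = [propose_candidate(parent_prompt, traces) for _ in range(k)]

    def minibatch_score(prompt):
        return sum(judge(ex, run_prompt(prompt, ex))[1] for ex in batch)

    return max(candidates, key=minibatch_score)
```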
Dataset must be diverse and trustworthy: If the optimization loop is meant to learn from data, the training set must reflect the variety of cases seen in production. Also inspect samples individually. A single poor data point won’t derail the loop, but it can introduce misleading feedback or distort the candidate proposals.
Appendix
Seed Prompt:
Optimized Prompt:

