# A Practical Guide to Evaluating Generative AI Applications

## Video

<https://youtu.be/qPHsWTZP58U>

Watch the [full video](https://youtu.be/qPHsWTZP58U)

------------------------------------------------------------------------

## Annotated Presentation

Below is an annotated version of Rajiv Shah’s workshop, “Hill Climbing:
Best Practices for Evaluating LLMs,” with timestamped links to the
relevant parts of the video for each slide.

### 1. Title Slide

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_1.png"
alt="Slide 1" />
<figcaption aria-hidden="true">Slide 1</figcaption>
</figure>

([Timestamp: 00:00](https://youtu.be/qPHsWTZP58U&t=0s))

This slide introduces the workshop titled **“Hill Climbing: Best
Practices for Evaluating LLMs,”** presented by Rajiv Shah, PhD, at the
Open Data Science Conference (ODSC). The presentation focuses on the
technical nuances of Generative AI and how to build effective evaluation
workflows.

Rajiv sets the stage by outlining his three main goals for the session:
understanding the technical differences in GenAI evaluation, learning a
basic introductory workflow for building evaluation datasets, and
inspiring practitioners to start “learning by doing” rather than just
reading papers.

The concept of “Hill Climbing” refers to the iterative process of
improving LLM applications—starting with a baseline and continuously
optimizing performance through rigorous testing and error analysis.

### 2. Evaluating for Gen AI Resources

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_2.png"
alt="Slide 2" />
<figcaption aria-hidden="true">Slide 2</figcaption>
</figure>

([Timestamp: 00:06](https://youtu.be/qPHsWTZP58U&t=6s))

This slide provides a QR code and a GitHub URL, directing the audience
to the code and resources associated with the talk. It emphasizes that
the workshop is practical, with code examples available for attendees to
replicate the evaluation techniques discussed.

Rajiv encourages the audience to access these resources to follow along
with the technical implementations of the concepts, such as building LLM
judges and creating unit tests, which will be covered later in the
presentation.

### 3. Customer Support Use Case

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_3.png"
alt="Slide 3" />
<figcaption aria-hidden="true">Slide 3</figcaption>
</figure>

([Timestamp: 00:48](https://youtu.be/qPHsWTZP58U&t=48s))

To motivate the need for evaluation, the presentation introduces a
common real-world use case: **Customer Support**. Generative AI is
frequently deployed to help agents compose emails or chat responses
based on user inquiries.

This scenario serves as the baseline example throughout the talk. It
represents a high-volume task where automation is desirable, but
accuracy and tone are critical for maintaining customer satisfaction and
brand reputation.

### 4. Vibe Coding

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_4.png"
alt="Slide 4" />
<figcaption aria-hidden="true">Slide 4</figcaption>
</figure>

([Timestamp: 00:59](https://youtu.be/qPHsWTZP58U&t=59s))

This slide introduces the concept of **“Vibe Coding”**—the initial phase
where developers grab a simple prompt, feed it to a model, and get a
result that feels right. It highlights the misconception that GenAI is
easy because it works “out of the box” for simple demos.

Rajiv notes that while “vibe coding” might work for a quick demo app, it
is insufficient for production systems. Relying on a “vibe” that the
model is working prevents teams from catching subtle failures that occur
at scale.

### 5. Good Response: Delayed Order

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_5.png"
alt="Slide 5" />
<figcaption aria-hidden="true">Slide 5</figcaption>
</figure>

([Timestamp: 01:10](https://youtu.be/qPHsWTZP58U&t=70s))

Here, we see a successful output generated by the LLM. The customer
inquired about a delayed order, and the AI generated a polite, relevant
response acknowledging the delay and apologizing.

This example reinforces the “Vibe Coding” trap: because the model often
produces high-quality, human-sounding text like this, developers can be
lulled into a false sense of security regarding the system’s
reliability.

### 6. Good Response: Damaged Product

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_6.png"
alt="Slide 6" />
<figcaption aria-hidden="true">Slide 6</figcaption>
</figure>

([Timestamp: 01:12](https://youtu.be/qPHsWTZP58U&t=72s))

This slide provides another example of a “good” response. The AI
correctly identifies that the customer received a damaged product and
initiates a replacement protocol.

These positive examples establish a baseline of expected behavior. The
challenge in evaluation is not just confirming that the model *can*
work, but ensuring it works consistently across all edge cases.

### 7. Bad Response: Irrelevance

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_7.png"
alt="Slide 7" />
<figcaption aria-hidden="true">Slide 7</figcaption>
</figure>

([Timestamp: 01:26](https://youtu.be/qPHsWTZP58U&t=86s))

The presentation shifts to failure modes. In this example, the user asks
about an **“Order Delay,”** but the AI responds with information about a
**“New Product Launch.”**

This illustrates a complete context mismatch. The model failed to attend
to the user’s intent, generating a coherent but completely irrelevant
response. This type of failure frustrates users and degrades trust in
the automated system.

### 8. Bad Response: Hallucination

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_8.png"
alt="Slide 8" />
<figcaption aria-hidden="true">Slide 8</figcaption>
</figure>

([Timestamp: 01:36](https://youtu.be/qPHsWTZP58U&t=96s))

This slide shows a more dangerous failure: **Hallucination**. The AI
apologizes for a defective “espresso machine,” but as the speaker notes,
“We don’t actually sell espresso machines.”

This highlights the risk of the model fabricating facts to be helpful.
Such errors can lead to logistical nightmares, such as customers
expecting replacements for products that do not exist or that the
company never sold.

### 9. Risks of LLM Mistakes

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_9.png"
alt="Slide 9" />
<figcaption aria-hidden="true">Slide 9</figcaption>
</figure>

([Timestamp: 01:51](https://youtu.be/qPHsWTZP58U&t=111s))

Rajiv categorizes the risks associated with LLM failures into three
buckets: **Reputational, Legal, and Financial**. He cites the example of
**Cursor**, an IDE company, where a support bot hallucinated a policy
restricting users to one device, causing customers to cancel
subscriptions.

The slide emphasizes that courts may view AI agents as employees; if a
bot makes a promise (like a refund or policy change), the company might
be legally bound to honor it. This escalates evaluation from a technical
nice-to-have to a business necessity.

### 10. The Despair of Gen AI

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_10.png"
alt="Slide 10" />
<figcaption aria-hidden="true">Slide 10</figcaption>
</figure>

([Timestamp: 02:38](https://youtu.be/qPHsWTZP58U&t=158s))

This visual represents the frustration developers feel when moving from
a successful demo to a failing production system. The “despair” comes
from the realization that the stochastic nature of LLMs makes them
difficult to control.

It serves as an emotional anchor for the audience, acknowledging that
while GenAI is exciting, the unpredictability of its failures causes
significant stress for engineering teams responsible for deployment.

### 11. High Failure Rates

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_11.png"
alt="Slide 11" />
<figcaption aria-hidden="true">Slide 11</figcaption>
</figure>

([Timestamp: 02:48](https://youtu.be/qPHsWTZP58U&t=168s))

The slide cites an MIT report stating that **“95% of GenAI pilots are
failing.”** While Rajiv notes this number might be overstated, it
reflects a trend where executives are demanding ROI and seeing
lackluster results.

This shift in 2025 means that evaluation is no longer just for
debugging; it is required to prove business value and justify the high
costs of running Generative AI infrastructure.

### 12. Evaluation Improves Applications

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_12.png"
alt="Slide 12" />
<figcaption aria-hidden="true">Slide 12</figcaption>
</figure>

([Timestamp: 03:14](https://youtu.be/qPHsWTZP58U&t=194s))

This slide asserts the core thesis: **Evaluation helps you build better
GenAI applications.** It references a previous viral video by the
speaker on the same topic, positioning this talk as an updated,
condensed version with fresh content.

Rajiv explains that you cannot improve what you cannot measure. Without
a robust evaluation framework, developers are essentially guessing
whether changes to prompts or models are actually improving performance.

### 13. Why Evaluation is Necessary

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_13.png"
alt="Slide 13" />
<figcaption aria-hidden="true">Slide 13</figcaption>
</figure>

([Timestamp: 03:40](https://youtu.be/qPHsWTZP58U&t=220s))

This concentric diagram illustrates the stakeholders involved in
evaluation. It starts with **“Things Go Wrong”** (technical reality),
moves to **“Buy-in”** (convincing managers/teams), and ends with
**“Regulators”** (external compliance).

Evaluation serves multiple audiences: it helps the developer debug, it
provides the metrics needed to convince management that the app is
production-ready, and it creates the audit trails required by
third-party auditors or regulators.

### 14. Evaluation Dimensions

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_14.png"
alt="Slide 14" />
<figcaption aria-hidden="true">Slide 14</figcaption>
</figure>

([Timestamp: 04:18](https://youtu.be/qPHsWTZP58U&t=258s))

Evaluation must cover three dimensions: **Technical** (F1 scores,
accuracy), **Business** (ROI, value generated), and **Operational**
(Total Cost of Ownership, latency).

Rajiv highlights that data scientists often focus solely on the
technical, but ignoring operational costs (like the expense of hosting
GPUs vs. using APIs) can kill a project. A comprehensive evaluation
strategy considers the cost-to-quality ratio.
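
One way to make the cost-to-quality ratio concrete is to compute the cost per correct answer. The sketch below is illustrative only: the option names, per-request costs, and accuracy figures are hypothetical placeholders, not numbers from the talk.

```python
def cost_per_correct(cost_per_request: float, accuracy: float) -> float:
    """Expected spend needed to obtain one correct response."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_request / accuracy

# Hypothetical candidates: (cost per request in dollars, accuracy on your eval set)
candidates = {
    "option_a": (0.004, 0.90),
    "option_b": (0.010, 0.92),
}

# Rank options from cheapest to most expensive per correct answer
ranked = sorted(candidates, key=lambda name: cost_per_correct(*candidates[name]))
for name in ranked:
    print(f"{name}: ${cost_per_correct(*candidates[name]):.4f} per correct answer")
```

A single ratio like this hides latency and operational overhead, so it is best read next to the other dimensions rather than in place of them.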

### 15. Public Benchmarks

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_15.png"
alt="Slide 15" />
<figcaption aria-hidden="true">Slide 15</figcaption>
</figure>

([Timestamp: 05:06](https://youtu.be/qPHsWTZP58U&t=306s))

The slide discusses **Public Benchmarks** (like MMLU, GSM8K). While
useful for a general idea of a model’s capabilities (e.g., “Is Llama 3
better than Llama 2?”), they are insufficient for specific applications.

Rajiv warns against using these benchmarks to determine if a model fits
*your* specific use case. Companies promote these numbers for marketing,
but they rarely reflect performance on proprietary business data.

### 16. Custom Benchmarks

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_16.png"
alt="Slide 16" />
<figcaption aria-hidden="true">Slide 16</figcaption>
</figure>

([Timestamp: 05:22](https://youtu.be/qPHsWTZP58U&t=322s))

The solution to the limitations of public benchmarks is **Custom
Benchmarks**. This slide defines a benchmark as a combination of a
**Task**, a **Dataset**, and an **Evaluation Metric**.

This is a critical definition for the workshop. To “tame” GenAI, you
must build a dataset that reflects your specific customer queries and
define success metrics that matter to your business logic, rather than
relying on generic academic tests.
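
The Task + Dataset + Metric definition maps naturally onto a small data structure. This is a minimal sketch, not code from the workshop repository: the class names, the `contains_gold` metric, and the `stub_system` function are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    prompt: str  # input sent to the system under test
    gold: str    # reference ("gold") answer

@dataclass
class Benchmark:
    task: str                            # what the system is asked to do
    dataset: list[Example]               # examples drawn from your own use case
    metric: Callable[[str, str], float]  # scores (model_output, gold) in [0, 1]

    def run(self, system: Callable[[str], str]) -> float:
        """Average metric score of `system` over the dataset."""
        scores = [self.metric(system(ex.prompt), ex.gold) for ex in self.dataset]
        return sum(scores) / len(scores)

def contains_gold(output: str, gold: str) -> float:
    # Hypothetical metric: 1.0 if the gold phrase appears in the output
    return 1.0 if gold.lower() in output.lower() else 0.0

def stub_system(prompt: str) -> str:
    # Stand-in for a real LLM call; always returns the same reply
    return "We are sorry for the delay. Your order ships tomorrow."

benchmark = Benchmark(
    task="customer support reply",
    dataset=[
        Example("Where is my order?", "ships tomorrow"),
        Example("My item arrived broken.", "replacement"),
    ],
    metric=contains_gold,
)
score = benchmark.run(stub_system)
print(f"{benchmark.task}: {score:.2f}")
```

Keeping the metric as a plain function makes it easy to swap in a stricter or more semantic scorer later without touching the dataset.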

### 17. Taming Gen AI

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_17.png"
alt="Slide 17" />
<figcaption aria-hidden="true">Slide 17</figcaption>
</figure>

([Timestamp: 05:28](https://youtu.be/qPHsWTZP58U&t=328s))

This title slide signals a transition into the technical “how-to”
section of the talk. “Taming” implies that the default state of GenAI is
wild and unpredictable.

The goal of the following sections is to bring structure and control to
this chaos through rigorous engineering practices and evaluation
workflows.

### 18. Workshop Roadmap

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_18.png"
alt="Slide 18" />
<figcaption aria-hidden="true">Slide 18</figcaption>
</figure>

([Timestamp: 05:31](https://youtu.be/qPHsWTZP58U&t=331s))

The roadmap outlines the four main sections of the talk:

1. **Basics of Gen AI:** Understanding variability and technical nuances.
2. **Evaluation Workflow:** Building the dataset and running the first
   tests.
3. **More Complexity:** Adding unit tests and conducting error analysis.
4. **Agents:** Evaluating complex, multi-step workflows.

### 19. Variability in Responses

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_19.png"
alt="Slide 19" />
<figcaption aria-hidden="true">Slide 19</figcaption>
</figure>

([Timestamp: 06:00](https://youtu.be/qPHsWTZP58U&t=360s))

This slide visually demonstrates the **Non-Determinism** of LLMs. It
shows two responses to the same prompt generated just minutes apart.
While substantively similar, the wording and structure differ slightly.

This variability makes exact string matching (a common software testing
technique) unreliable for LLM outputs. It calls for semantic evaluation
techniques, which complicate the testing pipeline.
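
To see why exact matching breaks down, compare two replies that say the same thing in different words. The sketch below uses Jaccard token overlap as a crude, purely lexical stand-in for a real semantic measure (such as embeddings or an LLM judge). The two responses are made-up examples, not the ones on the slide.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of distinct tokens that the two texts have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Hypothetical outputs for the same prompt, generated at different times
response_1 = "We apologize for the delay. Your order will arrive on Friday."
response_2 = "Sorry for the delay! Your order will arrive on Friday."

exact_match = response_1 == response_2
overlap = jaccard(response_1, response_2)
print(f"exact match: {exact_match}, token overlap: {overlap:.2f}")
```

Token overlap only captures shared wording; two replies can overlap heavily and still disagree on a key fact, which is why semantic checks are usually layered on top.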

### 20. Input-Model-Output Diagram

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_20.png"
alt="Slide 20" />
<figcaption aria-hidden="true">Slide 20</figcaption>
</figure>

([Timestamp: 06:24](https://youtu.be/qPHsWTZP58U&t=384s))

A simple diagram illustrates the flow: **Prompt -\> Model -\> Output**.
Rajiv uses this to structure the analysis of where variability comes
from.

He explains that “chaos” can enter the system at any of these three
stages: the input (prompt sensitivity), the model (inference
non-determinism), or the output (formatting and evaluation).

### 21. Inconsistent Benchmark Scores

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_21.png"
alt="Slide 21" />
<figcaption aria-hidden="true">Slide 21</figcaption>
</figure>

([Timestamp: 06:44](https://youtu.be/qPHsWTZP58U&t=404s))

The slide presents a discrepancy between benchmark scores tweeted by
Hugging Face and those in the official Llama paper. Both used the same
dataset (MMLU), but reported different accuracy numbers.

This introduces the problem of **Evaluation Harness Sensitivity**. Even
with standard benchmarks, *how* you ask the model to take the test
changes the score, proving that evaluation is fragile and
implementation-dependent.

### 22. MMLU Overview

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_22.png"
alt="Slide 22" />
<figcaption aria-hidden="true">Slide 22</figcaption>
</figure>

([Timestamp: 07:25](https://youtu.be/qPHsWTZP58U&t=445s))

**MMLU (Massive Multitask Language Understanding)** is explained here.
It is a multiple-choice test covering 57 tasks across STEM, the
humanities, and more.

It has been one of the most widely cited benchmarks for measuring
general “intelligence” in models. However, because it uses a
multiple-choice format, it is susceptible to prompt formatting nuances,
as the next slides demonstrate.

### 23. Prompt Sensitivity

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_23.png"
alt="Slide 23" />
<figcaption aria-hidden="true">Slide 23</figcaption>
</figure>

([Timestamp: 07:44](https://youtu.be/qPHsWTZP58U&t=464s))

This slide reveals *why* the scores in Slide 21 differed. The three
evaluation harnesses used slightly different prompt structures (e.g.,
using the word “Question” vs. just listing the text).

These minor changes resulted in significant accuracy shifts. This shows
that LLMs are highly sensitive to prompt wording and formatting, meaning
a “better” model might just be one that was prompted more effectively
for the test, not one that is actually smarter.

### 24. Formatting Changes

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_24.png"
alt="Slide 24" />
<figcaption aria-hidden="true">Slide 24</figcaption>
</figure>

([Timestamp: 08:22](https://youtu.be/qPHsWTZP58U&t=502s))

Expanding on sensitivity, this slide references Anthropic’s research
showing that changing answer choices from `(A)` to `[A]` or `(1)`
affects the output.

This level of fragility is a key takeaway: seemingly cosmetic changes in
how inputs are formatted can change the model’s measured accuracy,
whether by affecting its reasoning or its ability to output the correct
token.
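
A practical way to check for this sensitivity in your own setup is to render the same question with several label styles and score each variant separately. The sketch below only builds the prompt variants; the function name, style names, and sample question are hypothetical, and the model call is left out.

```python
def format_mcq(question: str, choices: list[str], label_style: str) -> str:
    """Render one multiple-choice question with a given answer-label style."""
    lines = [question]
    for i, choice in enumerate(choices):
        letter = "ABCD"[i]
        if label_style == "paren_letter":
            label = f"({letter})"
        elif label_style == "bracket_letter":
            label = f"[{letter}]"
        elif label_style == "paren_number":
            label = f"({i + 1})"
        else:
            raise ValueError(f"unknown label style: {label_style}")
        lines.append(f"{label} {choice}")
    return "\n".join(lines)

STYLES = ["paren_letter", "bracket_letter", "paren_number"]
variants = {
    style: format_mcq("What is 2 + 2?", ["3", "4", "5", "6"], style)
    for style in STYLES
}
for style, prompt in variants.items():
    print(f"--- {style} ---\n{prompt}")
```

Running your evaluation set once per style and comparing the scores gives a rough estimate of how much of a reported difference is formatting noise.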

### 25. GPT-4o Performance Drop

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_25.png"
alt="Slide 25" />
<figcaption aria-hidden="true">Slide 25</figcaption>
</figure>

([Timestamp: 08:38](https://youtu.be/qPHsWTZP58U&t=518s))

A bar chart demonstrates that this issue persists even in
state-of-the-art models like **GPT-4o**. Subtle changes in wording can
lead to a 5-10% drop in performance.

This counters the assumption that newer, larger models have “solved”
prompt sensitivity. It remains a persistent variable that evaluators
must control for.

### 26. Tone Sensitivity

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_26.png"
alt="Slide 26" />
<figcaption aria-hidden="true">Slide 26</figcaption>
</figure>

([Timestamp: 08:46](https://youtu.be/qPHsWTZP58U&t=526s))

This slide shows that the **tone** of a prompt (e.g., being polite
vs. direct) affects accuracy. Rajiv jokes, “I guess this is why mom
always said to be polite.”

The graph indicates that prompt engineering strategies, like adding
emotional weight or politeness, can statistically alter model
performance, adding another layer of complexity to evaluation.

### 27. Persistent Sensitivity

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_27.png"
alt="Slide 27" />
<figcaption aria-hidden="true">Slide 27</figcaption>
</figure>

([Timestamp: 09:00](https://youtu.be/qPHsWTZP58U&t=540s))

The slide reiterates that despite years of progress, models are still
sensitive to specific phrases. It shows a “Prompt Engineering” guide
suggesting specific words to use.

The takeaway is that developers cannot treat the prompt as a static
instruction; it is a hyperparameter that requires optimization and
constant testing.

### 28. Falcon LLM Bias

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_28.png"
alt="Slide 28" />
<figcaption aria-hidden="true">Slide 28</figcaption>
</figure>

([Timestamp: 09:18](https://youtu.be/qPHsWTZP58U&t=558s))

This slide introduces a case study with the **Falcon LLM**. A user tweet
shows the model recommending **Abu Dhabi** as a technological city with
glowing sentiment, which raised suspicions about bias given the model’s
origin in the Middle East.

This serves as a detective story: users wondered if the model weights
were altered or if specific training data was injected to force this
positive association.

### 29. Potential Cover-up?

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_29.png"
alt="Slide 29" />
<figcaption aria-hidden="true">Slide 29</figcaption>
</figure>

([Timestamp: 09:50](https://youtu.be/qPHsWTZP58U&t=590s))

Another tweet speculates about whether the model is “covering up human
rights abuses” because it provides different answers for Abu Dhabi
compared to other cities.

This highlights how model behavior can be misinterpreted as malicious
bias or censorship, when the root cause might be something much simpler
in the input stack.

### 30. Inspecting the System Prompt

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_30.png"
alt="Slide 30" />
<figcaption aria-hidden="true">Slide 30</figcaption>
</figure>

([Timestamp: 10:00](https://youtu.be/qPHsWTZP58U&t=600s))

The reveal: The bias wasn’t in the weights, but in the **System
Prompt**. The slide suggests looking at the hidden instructions given to
the model.

In Falcon’s case, the system prompt explicitly told the model, “You are
a model built in Abu Dhabi.” This context influenced its generation
probabilities, causing it to favor Abu Dhabi in its responses.

### 31. Claude System Prompt

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_31.png"
alt="Slide 31" />
<figcaption aria-hidden="true">Slide 31</figcaption>
</figure>

([Timestamp: 10:33](https://youtu.be/qPHsWTZP58U&t=633s))

Rajiv points out that most developers never read the system prompts of
the models they use. He highlights the **Claude System Prompt**, which
is 1700 words long and takes nearly 10 minutes to read.

These extensive instructions define the model’s personality and safety
guardrails. Ignoring them means you don’t fully understand the inputs
driving your application’s behavior.

### 32. Complexity of a Single Response

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_32.png"
alt="Slide 32" />
<figcaption aria-hidden="true">Slide 32</figcaption>
</figure>

([Timestamp: 11:00](https://youtu.be/qPHsWTZP58U&t=660s))

The diagram is updated to show that a “single response” is actually the
result of complex interactions: **Tokenization -\> Prompt Styles -\>
Prompt Engineering -\> System Prompt**.

This visual summarizes the “Input” section of the talk, reinforcing that
before the model even processes data, multiple layers of text
transformation occur that can alter the result.

### 33. Inter-text Similarity

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_33.png"
alt="Slide 33" />
<figcaption aria-hidden="true">Slide 33</figcaption>
</figure>

([Timestamp: 11:15](https://youtu.be/qPHsWTZP58U&t=675s))

This heatmap compares **Inter-text similarity** between models. It
highlights Llama 70B and Llama 8B. Even though they are from the same
family and likely trained on similar data, they are not identical.

This means you cannot swap a smaller model for a larger one (or vice
versa) and expect the exact same behavior. Any model change requires a
full re-evaluation.

### 34. Sycophantic Models

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_34.png"
alt="Slide 34" />
<figcaption aria-hidden="true">Slide 34</figcaption>
</figure>

([Timestamp: 12:16](https://youtu.be/qPHsWTZP58U&t=736s))

The slide discusses **Sycophancy**—the tendency of models to agree with
the user even when the user is wrong. It mentions how early versions of
GPT-4 were sometimes “overly nice.”

This behavior is a specific type of model bias that evaluators must
watch for. If a user asks a leading question containing false premises,
a sycophantic model might validate the falsehood rather than correct it.

### 35. Model Drift

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_35.png"
alt="Slide 35" />
<figcaption aria-hidden="true">Slide 35</figcaption>
</figure>

([Timestamp: 12:37](https://youtu.be/qPHsWTZP58U&t=757s))

**“Model Drift”** refers to the phenomenon where commercial APIs (like
OpenAI or Anthropic) change their model behavior over time without
warning.

Because developers do not control the weights of API-based models, the
“ground underneath them” can shift. A prompt that worked yesterday might
fail today because the provider updated the backend or the inference
infrastructure.

### 36. Degraded Responses Timeline

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_36.png"
alt="Slide 36" />
<figcaption aria-hidden="true">Slide 36</figcaption>
</figure>

([Timestamp: 12:55](https://youtu.be/qPHsWTZP58U&t=775s))

This slide shows a timeline of **Degraded Responses** from an Anthropic
incident. Technical issues like context window routing errors led to
corrupted outputs for a period of days.

This illustrates that drift isn’t always about model updates; it can be
infrastructure failures. Continuous monitoring is required to detect
when an external dependency degrades your application’s performance.
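
Continuous monitoring can start as simply as re-running a fixed set of prompts on a schedule and comparing the score to a recent baseline. This is a minimal sketch under stated assumptions: the scores, the 0.05 threshold, and the function name are hypothetical, and the evaluation run itself is not shown.

```python
from statistics import mean

def is_degraded(baseline_scores: list[float], current_score: float,
                max_drop: float = 0.05) -> bool:
    """Flag a run whose score falls more than `max_drop` below the baseline mean."""
    return mean(baseline_scores) - current_score > max_drop

# Hypothetical daily accuracy on a fixed "canary" prompt set
baseline = [0.91, 0.90, 0.92]
today = 0.78

if is_degraded(baseline, today):
    print("ALERT: eval score dropped; check provider status and recent changes")
```

The threshold should sit above the normal run-to-run noise of your evaluation; otherwise the alert fires on ordinary variance.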

### 37. Hyperparameters

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_37.png"
alt="Slide 37" />
<figcaption aria-hidden="true">Slide 37</figcaption>
</figure>

([Timestamp: 13:33](https://youtu.be/qPHsWTZP58U&t=813s))

The slide lists **Hyperparameters** like Temperature, Top-P, and Max
Length. Rajiv explains that users can control these “knobs” to influence
creativity versus determinism.

Setting temperature to 0 makes decoding effectively greedy, which
removes sampling randomness, but as the next slides show, it does not
guarantee perfect determinism due to hardware nuances.
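
The effect of temperature is easiest to see on a toy distribution. The sketch below applies temperature scaling to three made-up logits; treating temperature 0 as "pick the top token" is a common convention assumed here, not a statement about any particular provider's implementation.

```python
import math

def token_probabilities(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits after temperature scaling."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed convention: greedy decoding puts all mass on the top token
        top = logits.index(max(logits))
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in token_probabilities(logits, t)])
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution and make less likely tokens more competitive.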

### 38. Non-Deterministic Inference

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_38.png"
alt="Slide 38" />
<figcaption aria-hidden="true">Slide 38</figcaption>
</figure>

([Timestamp: 14:03](https://youtu.be/qPHsWTZP58U&t=843s))

This slide tackles **Non-Deterministic Inference**. Unlike traditional
ML models (e.g., XGBoost) where a fixed seed guarantees identical
output, LLMs on GPUs often produce different results for identical
inputs.

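Causes include floating-point accumulation errors (floating-point
addition is not associative, so summing the same values in a different
order can change the result) and the behavior of Mixture of Experts
(MoE) models, where different batches might activate different experts.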
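
The floating-point part of this can be reproduced on a CPU in plain Python. The sketch below is a toy analogy rather than a GPU kernel: it only shows that the order of additions changes the result, which is the same property that parallel reductions run into.

```python
# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)

def running_sum(values: list[float]) -> float:
    """Add values strictly left to right with an explicit loop."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two orders
in_order = running_sum([1e16, 1.0, -1e16])   # the 1.0 is lost when added to 1e16
reordered = running_sum([1e16, -1e16, 1.0])  # the large terms cancel first
print(in_order, reordered)
```

An explicit loop is used instead of the built-in `sum()` because recent Python versions apply extra error compensation to float sums, which would hide the effect.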

### 39. Addressing Non-Determinism

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_39.png"
alt="Slide 39" />
<figcaption aria-hidden="true">Slide 39</figcaption>
</figure>

([Timestamp: 15:11](https://youtu.be/qPHsWTZP58U&t=911s))

Rajiv references recent work by **Thinking Machines** and updates to
**vLLM** that attempt to solve the non-determinism problem through
correct batching.

While solutions are emerging, the takeaway is that most current setups
are non-deterministic by default. Evaluators must design their tests to
tolerate this variance rather than expecting bit-wise reproducibility.
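
One way to design tests that tolerate variance is to check a property of the output over several runs and assert on the pass rate, not on an exact string. In this sketch `flaky_system` is a hypothetical stand-in for an LLM call, seeded only so the example is repeatable, and the 0.9 threshold is an arbitrary choice.

```python
import random

def flaky_system(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for an LLM call: the wording varies from run to run
    openers = [
        "Sorry for the delay.",
        "We apologize for the wait.",
        "Thanks for reaching out.",
    ]
    return f"{rng.choice(openers)} Your order ships on Friday."

def passes(output: str) -> bool:
    # Property check: the reply must mention the delivery day, in any wording
    return "friday" in output.lower()

def pass_rate(prompt: str, n_runs: int, seed: int = 0) -> float:
    """Fraction of repeated runs whose output satisfies the property check."""
    rng = random.Random(seed)
    results = [passes(flaky_system(prompt, rng)) for _ in range(n_runs)]
    return sum(results) / n_runs

rate = pass_rate("Where is my order?", n_runs=20)
assert rate >= 0.9, f"pass rate too low: {rate:.2f}"
print(f"pass rate over 20 runs: {rate:.2f}")
```

With a real model, the number of runs and the threshold trade off cost against confidence in the measured rate.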

### 40. Updated Model Diagram

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_40.png"
alt="Slide 40" />
<figcaption aria-hidden="true">Slide 40</figcaption>
</figure>

([Timestamp: 15:43](https://youtu.be/qPHsWTZP58U&t=943s))

The diagram expands again. The “Model” box now includes **Model
Selection, Hyperparameters, Non-deterministic Inference, and Forced
Updates**.

This visual summarizes the “Model” section, showing that the “black box”
is actually a dynamic system with internal variables
(weights/architecture) and external variables (infrastructure/updates)
that all add noise to the output.

### 41. Output Format Issues

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_41.png"
alt="Slide 41" />
<figcaption aria-hidden="true">Slide 41</figcaption>
</figure>

([Timestamp: 16:01](https://youtu.be/qPHsWTZP58U&t=961s))

Moving to the “Output” stage, this slide uses MMLU again to show how
**Output Formatting** affects evaluation. How do you ask the model to
answer a multiple-choice question?

Do you ask it to output just the letter “A”? Or the full text? Or the
probability of the token “A”? Different evaluation harnesses use
different methods, leading to the score discrepancies seen earlier.
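
The impact of the scoring rule alone can be shown without a model. The sketch below scores the same four made-up outputs with two hypothetical rules, a strict letter match and a lenient regex extraction; neither is the actual implementation of any named harness.

```python
import re

GOLD = "B"
# Hypothetical raw model outputs for one multiple-choice question
outputs = ["B", "The answer is (B).", "B) 4", "4"]

def strict_letter(output: str) -> bool:
    """Correct only if the whole output is exactly the gold letter."""
    return output.strip() == GOLD

def extracted_letter(output: str) -> bool:
    """Correct if the first standalone A-D letter in the output is the gold letter."""
    match = re.search(r"\b([A-D])\b", output)
    return match is not None and match.group(1) == GOLD

def accuracy(scorer) -> float:
    return sum(scorer(o) for o in outputs) / len(outputs)

strict_acc = accuracy(strict_letter)
lenient_acc = accuracy(extracted_letter)
print(f"strict: {strict_acc:.2f}, lenient: {lenient_acc:.2f}")
```

The model outputs are identical in both cases; only the scoring rule differs, which is enough to move the reported accuracy.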

### 42. Evaluation Harness Variations

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_42.png"
alt="Slide 42" />
<figcaption aria-hidden="true">Slide 42</figcaption>
</figure>

([Timestamp: 16:35](https://youtu.be/qPHsWTZP58U&t=995s))

This table details the specific differences in implementation between
harnesses (e.g., original MMLU vs. HELM vs. EleutherAI).

It reinforces that there is no standard “ruler” for measuring LLMs. The
tool you use to measure the model introduces its own bias and variance
into the final score.

### 43. Score Comparison Table

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_43.png"
alt="Slide 43" />
<figcaption aria-hidden="true">Slide 43</figcaption>
</figure>

([Timestamp: 16:56](https://youtu.be/qPHsWTZP58U&t=1016s))

A spreadsheet shows the same models scoring differently across different
evaluation implementations. The variance is not trivial; it can be large
enough to change the ranking of which model is “best.”

This data drives home the point: You must control your own evaluation
pipeline. Relying on reported numbers is risky because you don’t know
the implementation details behind them.

### 44. Sentiment Analysis Variance

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_44.png"
alt="Slide 44" />
<figcaption aria-hidden="true">Slide 44</figcaption>
</figure>

([Timestamp: 17:09](https://youtu.be/qPHsWTZP58U&t=1029s))

This slide shows varying **Sentiment Analysis** outputs. One model might
classify a review as “Positive” while another model (or the same model
with a different prompt) says “Neutral.”

This introduces the concept that even “simple” classification tasks in
GenAI are subject to interpretation and variance, unlike traditional
classifiers that have a fixed decision boundary.

### 45. Tool Use Variance

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_45.png"
alt="Slide 45" />
<figcaption aria-hidden="true">Slide 45</figcaption>
</figure>

([Timestamp: 17:23](https://youtu.be/qPHsWTZP58U&t=1043s))

Radar charts illustrate variance in **Tool Use**. Models might be good
at using an “Email” tool but fail at “Calendar” or “Terminal” tools.

Furthermore, models exhibit non-determinism in *decision
making*—sometimes they choose to use a tool, and sometimes they try to
answer from memory. This adds a layer of logic errors on top of text
generation errors.

### 46. Summary: Why Responses Differ

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_46.png"
alt="Slide 46" />
<figcaption aria-hidden="true">Slide 46</figcaption>
</figure>

([Timestamp: 17:49](https://youtu.be/qPHsWTZP58U&t=1069s))

This comprehensive slide aggregates all the factors discussed:
**Inputs** (prompts, system prompts), **Model** (drift, hyperparams),
**Outputs** (formatting), and **Infrastructure**.

It serves as a checklist for the audience. If your application is
behaving inconsistently, investigate these specific layers of the stack
to find the source of the noise.

### 47. Chaos is Okay

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_47.png"
alt="Slide 47" />
<figcaption aria-hidden="true">Slide 47</figcaption>
</figure>

([Timestamp: 18:17](https://youtu.be/qPHsWTZP58U&t=1097s))

Rajiv reassures the audience that **“Chaos is Okay.”** The slide
presents a chart of evaluation methods ranging from flexible/expensive
(human eval) to rigid/cheap (code assertions).

The message is that while the technology is chaotic, there is a spectrum
of tools available to manage it. We don’t need to solve every source of
variance; we just need a robust process to measure it.

### 48. From Chaos to Control

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_48.png"
alt="Slide 48" />
<figcaption aria-hidden="true">Slide 48</figcaption>
</figure>

([Timestamp: 18:27](https://youtu.be/qPHsWTZP58U&t=1107s))

This transition slide marks the beginning of the **Evaluation Workflow**
section. The presentation shifts from describing the problem to
prescribing the solution.

The goal here is to move from “Vibe Coding” to a structured engineering
discipline where changes are measured against a stable baseline.

### 49. Build the Evaluation Dataset

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_49.png"
alt="Slide 49" />
<figcaption aria-hidden="true">Slide 49</figcaption>
</figure>

([Timestamp: 18:37](https://youtu.be/qPHsWTZP58U&t=1117s))

The first step in the workflow is to **Build the Evaluation Dataset**.
The slide lists examples of prompts for tasks like summarization,
extraction, and translation.

Rajiv emphasizes that this dataset should reflect *your* actual use
case. It is the foundation of the “Custom Benchmark” concept introduced
earlier.

### 50. Get Labeled Outputs (Gold)

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_50.png"
alt="Slide 50" />
<figcaption aria-hidden="true">Slide 50</figcaption>
</figure>

([Timestamp: 18:46](https://youtu.be/qPHsWTZP58U&t=1126s))

Step two is to get **Labeled Outputs**, also known as **Gold Outputs**,
Reference, or Ground Truth. The slide adds a column showing the ideal
answer for each prompt.

This is the standard against which the model will be judged. While
obtaining these labels can be expensive (requiring human effort), they
are essential for calculating accuracy.

### 51. Compare to Model Output

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_51.png"
alt="Slide 51" />
<figcaption aria-hidden="true">Slide 51</figcaption>
</figure>

([Timestamp: 19:00](https://youtu.be/qPHsWTZP58U&t=1140s))

Step three is to generate responses from your system and place them
alongside the Gold Outputs. The slide adds a **“Model Output”** column.

This visual comparison allows developers (and automated judges) to see
the delta between what was expected and what was produced.
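As a rough illustration of the first three steps, the table on the slide can be represented as a list of rows holding the prompt, the gold answer, and the model output. This is a minimal sketch, not the code from the talk: `EvalRow`, `fill_model_outputs`, and `fake_generate` are hypothetical names, and `fake_generate` stands in for a call to your real application.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalRow:
    prompt: str             # input taken from your real use case
    gold: str               # human-approved reference answer
    model_output: str = ""  # filled in by the system under test


def fill_model_outputs(rows: List[EvalRow], generate: Callable[[str], str]) -> List[EvalRow]:
    """Run the system on every prompt and store the response next to the gold answer."""
    for row in rows:
        row.model_output = generate(row.prompt)
    return rows


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real application call."""
    return "Paris" if "France" in prompt else "I am not sure."


rows = fill_model_outputs(
    [
        EvalRow(prompt="What is the capital of France?", gold="Paris"),
        EvalRow(prompt="Translate 'hello' to Spanish.", gold="Hola"),
    ],
    fake_generate,
)
for row in rows:
    print(f"{row.prompt} | gold: {row.gold} | model: {row.model_output}")
```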

### 52. Measure Equivalence

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_52.png"
alt="Slide 52" />
<figcaption aria-hidden="true">Slide 52</figcaption>
</figure>

([Timestamp: 19:10](https://youtu.be/qPHsWTZP58U&t=1150s))

Step four is to **Measure Equivalence**. Since LLMs rarely produce exact
string matches, we use an **LLM Judge** (another model) to determine if
the Model Output means the same thing as the Gold Output.

The slide shows a prompt for the judge: “Are these two responses
semantically equivalent?” This converts a fuzzy text comparison problem
into a binary (Pass/Fail) metric.
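A minimal sketch of such an equivalence check is below. It assumes the judge is any callable that takes a prompt string and returns text; the "Answer YES or NO" instruction, the template layout, and `stub_judge` are illustrative assumptions rather than the prompt from the slide. In practice `call_judge` would wrap a real model API.

```python
from typing import Callable

# Assumed template; only the question itself comes from the slide.
JUDGE_TEMPLATE = (
    "Are these two responses semantically equivalent? Answer YES or NO.\n\n"
    "Response A (gold): {gold}\n"
    "Response B (model): {model_output}\n"
)


def is_equivalent(gold: str, model_output: str, call_judge: Callable[[str], str]) -> bool:
    """Turn the judge's free-text verdict into a binary pass/fail."""
    prompt = JUDGE_TEMPLATE.format(gold=gold, model_output=model_output)
    verdict = call_judge(prompt).strip().upper()
    return verdict.startswith("YES")


def stub_judge(prompt: str) -> str:
    """Toy offline judge: says YES only when 'Paris' appears in both responses."""
    return "YES" if prompt.count("Paris") >= 2 else "NO"


print(is_equivalent("Paris", "The capital is Paris.", stub_judge))
print(is_equivalent("Paris", "Lyon", stub_judge))
```

Keeping the judge behind a plain callable makes it easy to swap models later without touching the scoring code.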

### 53. Optimize Using Equivalence

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_53.png"
alt="Slide 53" />
<figcaption aria-hidden="true">Slide 53</figcaption>
</figure>

([Timestamp: 19:57](https://youtu.be/qPHsWTZP58U&t=1197s))

Once you have an equivalence metric, you can **Optimize**. The slide
shows Config A vs. Config B. By changing prompts or models, you can
track if your “Equivalence Score” goes up or down.

This treats GenAI engineering like traditional hyperparameter tuning.
The goal is to maximize the equivalence score on your custom dataset.
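The comparison can be sketched as follows, assuming each configuration has already been run through the equivalence judge and produced one pass/fail verdict per dataset row. The verdict lists here are made-up illustrative data, and the config names are placeholders.

```python
from typing import Dict, List


def equivalence_score(verdicts: List[bool]) -> float:
    """Fraction of dataset rows the judge marked as equivalent to gold."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)


# Illustrative verdicts only, one entry per evaluation row.
results: Dict[str, List[bool]] = {
    "config_a": [True, False, True, False],
    "config_b": [True, True, True, False],
}

scores = {name: equivalence_score(v) for name, v in results.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```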

### 54. Why Global Metrics Aren’t Enough

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_54.png"
alt="Slide 54" />
<figcaption aria-hidden="true">Slide 54</figcaption>
</figure>

([Timestamp: 20:28](https://youtu.be/qPHsWTZP58U&t=1228s))

The slide discusses the limitations of the “Equivalence” approach. While
good for a general sense of quality, **Global Metrics** miss nuances.

Sometimes it’s hard to get a Gold Answer for open-ended creative tasks.
Furthermore, a simple “Pass/Fail” doesn’t tell you *why* the model
failed (e.g., was it tone, length, or factuality?).

### 55. From Global to Targeted Evaluation

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_55.png"
alt="Slide 55" />
<figcaption aria-hidden="true">Slide 55</figcaption>
</figure>

([Timestamp: 20:55](https://youtu.be/qPHsWTZP58U&t=1255s))

This slide argues for **Targeted Evaluation**. To maximize performance,
you need to dig deeper into the data and identify specific error modes.

This transitions the talk from “Basic Workflow” to “Advanced Testing,”
where we break down “Quality” into specific, testable components like
tone, length, and safety.

### 56. Building Tests

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_56.png"
alt="Slide 56" />
<figcaption aria-hidden="true">Slide 56</figcaption>
</figure>

([Timestamp: 21:14](https://youtu.be/qPHsWTZP58U&t=1274s))

The section title **“Building Tests”** appears. This is where the
presentation moves into the “Unit Testing” philosophy for GenAI.

Just as software engineering relies on unit tests to verify specific
functions, GenAI engineering should use targeted tests to verify
specific attributes of the generated text.

### 57. Good vs. Bad Examples

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_57.png"
alt="Slide 57" />
<figcaption aria-hidden="true">Slide 57</figcaption>
</figure>

([Timestamp: 21:20](https://youtu.be/qPHsWTZP58U&t=1280s))

The slide displays a **Good Example** and a **Bad Example** of a
response. The bad example is visibly shorter and less polite.

Rajiv asks the audience to identify *why* it is bad. This exercise is
crucial: you cannot build a test until you can articulate exactly what
makes a response a failure.

### 58. Develop an Evaluation Mindset

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_58.png"
alt="Slide 58" />
<figcaption aria-hidden="true">Slide 58</figcaption>
</figure>

([Timestamp: 21:46](https://youtu.be/qPHsWTZP58U&t=1306s))

To define “Bad,” developers need an **Evaluation Mindset**. This
involves observing real-world user interactions and problems.

Data scientists often want to stay in their “chair” and optimize
algorithms, but Rajiv argues that effective evaluation requires
understanding the user’s pain points.

### 59. Collaborate with Experts

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_59.png"
alt="Slide 59" />
<figcaption aria-hidden="true">Slide 59</figcaption>
</figure>

([Timestamp: 21:58](https://youtu.be/qPHsWTZP58U&t=1318s))

The slide stresses **Collaboration**. You must talk to domain experts
(e.g., the customer support team) to define what a “good” answer looks
like.

Naive bootstrapping—pretending to be a user—is a good start, but
long-term success requires input from the people who actually know the
business domain.

### 60. Identify and Categorize Failures

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_60.png"
alt="Slide 60" />
<figcaption aria-hidden="true">Slide 60</figcaption>
</figure>

([Timestamp: 22:52](https://youtu.be/qPHsWTZP58U&t=1372s))

Once you understand the domain, you can **Categorize Failure Types**.
The slide shows a chart grouping errors into categories like “Harmful
Content,” “Bias,” or “Incorrect Info.”

This clustering allows you to see patterns. Instead of just knowing “the
model failed 20% of the time,” you know “the model has a specific
problem with tone.”

### 61. Define What Good Looks Like

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_61.png"
alt="Slide 61" />
<figcaption aria-hidden="true">Slide 61</figcaption>
</figure>

([Timestamp: 23:11](https://youtu.be/qPHsWTZP58U&t=1391s))

Using the categorization, you can explicitly **Define What Good Looks
Like**. The slide contrasts the good/bad examples again, but now with
labels: “Too short,” “Lacks professional tone.”

This transforms a subjective feeling (“this response sucks”) into
objective criteria (“response must be \>50 words and use polite
honorifics”).

### 62. Document Every Issue

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_62.png"
alt="Slide 62" />
<figcaption aria-hidden="true">Slide 62</figcaption>
</figure>

([Timestamp: 23:32](https://youtu.be/qPHsWTZP58U&t=1412s))

The slide shows a spreadsheet where humans evaluate responses and
**Document Every Issue**. Columns track specific attributes like “Is it
helpful?” or “Is the tone right?”

This manual annotation becomes the reference data for your automated
tests. You need humans to establish the ground truth before you can
automate the checking.

### 63. Evaluation Tooling

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_63.png"
alt="Slide 63" />
<figcaption aria-hidden="true">Slide 63</figcaption>
</figure>

([Timestamp: 23:53](https://youtu.be/qPHsWTZP58U&t=1433s))

Rajiv mentions that **Tooling Can Help**. The slide shows a custom chat
viewer designed to make human review easier.

However, he warns against getting sidetracked by building fancy tools.
Simple spreadsheets often suffice for the early stages. The goal is the
data, not the interface.

### 64. Test 1: Length Check

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_64.png"
alt="Slide 64" />
<figcaption aria-hidden="true">Slide 64</figcaption>
</figure>

([Timestamp: 24:05](https://youtu.be/qPHsWTZP58U&t=1445s))

Now we build the automated tests. **Test 1 is a Length Check**. The
slide shows Python code asserting that the word count is between 8 and
200.

This is a **deterministic test**. You don’t need an LLM to count words.
Rajiv encourages using simple Python assertions wherever possible
because they are fast, cheap, and reliable.
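A check along these lines might look like the sketch below. It is not the code from the slide: the function name, the whitespace-based word split, and treating the 8 and 200 bounds as inclusive are assumptions.

```python
def length_ok(response: str, min_words: int = 8, max_words: int = 200) -> bool:
    """Deterministic test: word count must fall within [min_words, max_words]."""
    n_words = len(response.split())
    return min_words <= n_words <= max_words


too_short = "Sorry, no."
fine = (
    "Thanks for reaching out. Your order shipped yesterday "
    "and should arrive within three days."
)

print(length_ok(too_short))
print(length_ok(fine))
```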

### 65. Test 2: Tone and Style

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_65.png"
alt="Slide 65" />
<figcaption aria-hidden="true">Slide 65</figcaption>
</figure>

([Timestamp: 24:22](https://youtu.be/qPHsWTZP58U&t=1462s))

**Test 2 checks Tone and Style**. Since “tone” is subjective, we use an
**LLM Judge** (an OpenAI model in this example) to classify the response.

The prompt asks the judge to identify the style. This allows us to
automate the “vibe check” that humans were previously doing manually.
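A hedged sketch of a tone judge follows. The label set, the prompt wording, and the keyword-based `stub_tone_judge` are all assumptions made so the example runs offline; a real version would send the prompt to a model and keep the same parsing step.

```python
from typing import Callable

ALLOWED_TONES = {"professional", "casual", "rude"}  # assumed label set

TONE_TEMPLATE = (
    "Classify the tone of the response below as one of: "
    "professional, casual, rude. Reply with the single label only.\n\n"
    "Response: {response}\n"
)


def classify_tone(response: str, call_judge: Callable[[str], str]) -> str:
    """Normalize the judge's reply and reject anything outside the label set."""
    raw = call_judge(TONE_TEMPLATE.format(response=response))
    label = raw.strip().lower().rstrip(".")
    return label if label in ALLOWED_TONES else "unparseable"


def tone_ok(response: str, call_judge: Callable[[str], str], expected: str = "professional") -> bool:
    return classify_tone(response, call_judge) == expected


def stub_tone_judge(prompt: str) -> str:
    """Toy stand-in for an LLM call."""
    return "Rude." if "whatever" in prompt.lower() else "professional"


print(tone_ok("Whatever, figure it out yourself.", stub_tone_judge))
print(tone_ok("Happy to help. Here are the next steps.", stub_tone_judge))
```

Constraining the judge to a closed label set and treating anything else as unparseable keeps malformed replies from silently counting as passes.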

### 66. Adding Metrics to Documentation

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_66.png"
alt="Slide 66" />
<figcaption aria-hidden="true">Slide 66</figcaption>
</figure>

([Timestamp: 24:41](https://youtu.be/qPHsWTZP58U&t=1481s))

The spreadsheet is updated with new columns: `Length_OK` and `Tone_OK`.
These are the results of the automated tests.

Now, for every row in the dataset, we have granular pass/fail metrics.
These pinpoint exactly *why* a specific response failed, rather than
just recording a generic failure.

### 67. Check Judges Against Humans

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_67.png"
alt="Slide 67" />
<figcaption aria-hidden="true">Slide 67</figcaption>
</figure>

([Timestamp: 25:12](https://youtu.be/qPHsWTZP58U&t=1512s))

A critical step: **Check LLM Judges Against Humans**. You must verify
that your automated “Tone Judge” agrees with your human experts.

If the human says the tone is rude, but the LLM Judge says it’s polite,
your metric is useless. You must iterate on the judge’s prompt until
alignment is high.
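One way to quantify alignment is sketched below: raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The talk does not name a specific statistic, so kappa is an added suggestion here, and the label lists are invented for illustration.

```python
from typing import List


def agreement_rate(human: List[bool], judge: List[bool]) -> float:
    """Share of examples where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: List[bool], judge: List[bool]) -> float:
    """Agreement corrected for chance, for binary labels."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    p_chance = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if p_chance == 1.0:  # both raters gave one identical constant label
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Illustrative labels: True means "tone is acceptable".
human = [True, True, False, False, True, False, True, True]
judge = [True, True, False, True, True, False, False, True]

print(agreement_rate(human, judge))
print(round(cohens_kappa(human, judge), 3))
```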

### 68. Self-Evaluation Bias

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_68.png"
alt="Slide 68" />
<figcaption aria-hidden="true">Slide 68</figcaption>
</figure>

([Timestamp: 26:06](https://youtu.be/qPHsWTZP58U&t=1566s))

The slide illustrates **Self-Evaluation Bias**. LLMs tend to rate their
own outputs higher than outputs from other models. GPT-4 prefers GPT-4
text.

To mitigate this, Rajiv suggests mixing models—use Claude to judge
GPT-4, or Gemini to judge Claude. This helps ensure a more neutral
evaluation.

### 69. Alignment Checks

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_69.png"
alt="Slide 69" />
<figcaption aria-hidden="true">Slide 69</figcaption>
</figure>

([Timestamp: 26:46](https://youtu.be/qPHsWTZP58U&t=1606s))

This slide reinforces the need for **Continuous Alignment**. Just
because your judge aligned with humans last month doesn’t mean it still
does (due to model drift).

Human spot-checks should be a permanent part of the pipeline to ensure
the automated judges haven’t drifted.

### 70. Biases in LLM Judges

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_70.png"
alt="Slide 70" />
<figcaption aria-hidden="true">Slide 70</figcaption>
</figure>

([Timestamp: 27:02](https://youtu.be/qPHsWTZP58U&t=1622s))

The slide lists known **Biases in LLM Judges**, such as **Position
Bias** (favoring the first answer presented) or **Verbosity Bias**
(favoring longer answers).

Evaluators must be aware of these. For example, you should shuffle the
order of answers when asking a judge to compare two options to cancel
out position bias.
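A deterministic variant of shuffling is to ask the judge twice with the order swapped and only accept a winner when both verdicts agree. The sketch below assumes a pairwise judge callable that returns "A" or "B" for whichever slot it prefers; both stub judges are hypothetical.

```python
from typing import Callable


def compare_both_orders(answer_1: str, answer_2: str,
                        call_judge: Callable[[str, str], str]) -> str:
    """Return '1' or '2' for a consistent winner, or 'tie' if the verdict flips with order."""
    first = call_judge(answer_1, answer_2)   # answer_1 shown in slot A
    second = call_judge(answer_2, answer_1)  # answer_1 shown in slot B
    if first == "A" and second == "B":
        return "1"
    if first == "B" and second == "A":
        return "2"
    return "tie"


def always_first(a: str, b: str) -> str:
    """Toy judge with pure position bias."""
    return "A"


def prefers_refund(a: str, b: str) -> str:
    """Toy judge that looks at content, not position."""
    return "A" if "refund" in a.lower() else "B"


print(compare_both_orders("We will refund you today.", "No.", always_first))
print(compare_both_orders("We will refund you today.", "No.", prefers_refund))
```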

### 71. Best Practices for LLM Judges

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_71.png"
alt="Slide 71" />
<figcaption aria-hidden="true">Slide 71</figcaption>
</figure>

([Timestamp: 27:11](https://youtu.be/qPHsWTZP58U&t=1631s))

A summary of **Best Practices**: Calibrate with human data, use
ensembles (multiple judges), avoid asking for “relevance” (too vague),
and use discrete rating scales (1-5) rather than continuous numbers.

These tips help stabilize the inherently noisy process of using AI to
evaluate AI.
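As a small sketch of the ensemble idea, several judges can each return a discrete 1-5 score and the results can be combined. Using the median as the aggregate is an assumption made here (it is less sensitive to one outlier judge than the mean), and the lambda judges are placeholders for real model calls.

```python
from statistics import median
from typing import Callable, List


def ensemble_score(response: str, judges: List[Callable[[str], int]]) -> float:
    """Collect one discrete 1-5 score per judge and return the median."""
    scores = []
    for judge in judges:
        score = judge(response)
        if score not in (1, 2, 3, 4, 5):
            raise ValueError(f"judge returned a score outside 1-5: {score!r}")
        scores.append(score)
    return median(scores)


# Placeholder judges; in practice each would wrap a different model.
judges = [lambda r: 4, lambda r: 5, lambda r: 2]
print(ensemble_score("Your refund was issued this morning.", judges))
```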

### 72. Error Analysis Chart

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_72.png"
alt="Slide 72" />
<figcaption aria-hidden="true">Slide 72</figcaption>
</figure>

([Timestamp: 27:46](https://youtu.be/qPHsWTZP58U&t=1666s))

With tests in place, we move to **Error Analysis**. The bar chart shows
the number of failed cases categorized by error type (Length, Tone,
Professional, Context).

This visualization tells you where to focus your efforts. If “Tone” is
the biggest bar, you work on the system prompt’s tone instructions. If
“Context” is the issue, you might need better Retrieval Augmented
Generation (RAG).
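The counts behind such a chart can be produced with a few lines of standard-library Python. In this sketch the rows are invented, `Context_OK` is a hypothetical column added alongside the `Length_OK` and `Tone_OK` columns, and the bars are printed as text rather than plotted.

```python
from collections import Counter
from typing import Dict, List


def failure_counts(rows: List[Dict[str, bool]], test_columns: List[str]) -> Counter:
    """Count how many rows failed each test column."""
    counts: Counter = Counter()
    for row in rows:
        for col in test_columns:
            if not row[col]:
                counts[col] += 1
    return counts


# Illustrative results, one dict per evaluated response.
rows = [
    {"Length_OK": True, "Tone_OK": False, "Context_OK": True},
    {"Length_OK": True, "Tone_OK": False, "Context_OK": False},
    {"Length_OK": False, "Tone_OK": True, "Context_OK": True},
    {"Length_OK": True, "Tone_OK": False, "Context_OK": True},
]

counts = failure_counts(rows, ["Length_OK", "Tone_OK", "Context_OK"])
for name, count in counts.most_common():
    print(f"{name:<12} {'#' * count} ({count})")
```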

### 73. Comparing Prompts

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_73.png"
alt="Slide 73" />
<figcaption aria-hidden="true">Slide 73</figcaption>
</figure>

([Timestamp: 27:58](https://youtu.be/qPHsWTZP58U&t=1678s))

The chart can compare **Prompt A vs. Prompt B**. This allows for A/B
testing of prompt engineering strategies.

You can see if a new prompt improves “Tone” but accidentally degrades
“Context.” This tradeoff analysis is impossible with a single global
score.
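A per-test comparison can be sketched like this, assuming both prompts were run on the same dataset and scored with the same tests. The rows are invented and `Context_OK` is a hypothetical column name.

```python
from typing import Dict, List


def pass_rates(rows: List[Dict[str, bool]], columns: List[str]) -> Dict[str, float]:
    """Pass rate per test column."""
    return {col: sum(row[col] for row in rows) / len(rows) for col in columns}


columns = ["Tone_OK", "Context_OK"]

# Illustrative results for the same four inputs under each prompt.
rows_prompt_a = [
    {"Tone_OK": False, "Context_OK": True},
    {"Tone_OK": False, "Context_OK": True},
    {"Tone_OK": True, "Context_OK": True},
    {"Tone_OK": True, "Context_OK": True},
]
rows_prompt_b = [
    {"Tone_OK": True, "Context_OK": True},
    {"Tone_OK": True, "Context_OK": False},
    {"Tone_OK": True, "Context_OK": True},
    {"Tone_OK": True, "Context_OK": True},
]

rates_a = pass_rates(rows_prompt_a, columns)
rates_b = pass_rates(rows_prompt_b, columns)
deltas = {col: rates_b[col] - rates_a[col] for col in columns}
regressions = [col for col, delta in deltas.items() if delta < 0]
print(deltas, "regressions:", regressions)
```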

### 74. Explanations Guide Improvement

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_74.png"
alt="Slide 74" />
<figcaption aria-hidden="true">Slide 74</figcaption>
</figure>

([Timestamp: 28:14](https://youtu.be/qPHsWTZP58U&t=1694s))

Rajiv suggests asking the LLM Judge for **Explanations**. Don’t just ask
for a score; ask for “one sentence explaining why.”

These explanations act as metadata that helps developers understand the
judge’s reasoning, making it easier to debug discrepancies between human
and AI judgments.
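One possible way to collect a verdict together with its explanation is to ask the judge for a small JSON object and parse it defensively. The JSON shape and field names below are assumptions, not a format prescribed in the talk.

```python
import json
from typing import Any, Dict

# Assumed instruction appended to the judge prompt.
JUDGE_INSTRUCTIONS = (
    'Reply with JSON only: {"passed": true or false, '
    '"explanation": "one sentence explaining why"}'
)


def parse_judgment(raw: str) -> Dict[str, Any]:
    """Extract the verdict and explanation; flag anything malformed instead of guessing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict) or not isinstance(data.get("passed"), bool):
        return {"passed": None, "explanation": "unparseable judge output"}
    return {"passed": data["passed"], "explanation": str(data.get("explanation", ""))}


good = parse_judgment('{"passed": false, "explanation": "Too curt for a support reply."}')
bad = parse_judgment("I think it is fine")
print(good)
print(bad)
```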

### 75. Limits to Explanations

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_75.png"
alt="Slide 75" />
<figcaption aria-hidden="true">Slide 75</figcaption>
</figure>

([Timestamp: 28:35](https://youtu.be/qPHsWTZP58U&t=1715s))

A warning: **Explanations are not causal**. When an LLM explains why it
did something, it is generating a plausible justification, not a trace
of its actual neural activations.

Treat explanations as a heuristic or a helpful hint, not as absolute
truth about the model’s internal state.

### 76. The Evaluation Flywheel

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_76.png"
alt="Slide 76" />
<figcaption aria-hidden="true">Slide 76</figcaption>
</figure>

([Timestamp: 28:46](https://youtu.be/qPHsWTZP58U&t=1726s))

The **Evaluation Flywheel** describes the iterative cycle: Build Eval
-\> Analyze -\> Improve -\> Repeat.

This concept, credited to Hamel Husain, emphasizes that evaluation is
not a one-time event but a continuous loop that spins faster as you
gather more data and build better tests.

### 77. Financial Analyst Agent Example

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_77.png"
alt="Slide 77" />
<figcaption aria-hidden="true">Slide 77</figcaption>
</figure>

([Timestamp: 29:20](https://youtu.be/qPHsWTZP58U&t=1760s))

To demonstrate advanced unit testing, Rajiv introduces a **Financial
Analyst Agent**. The goal is to assess the specific “style” of a
financial report.

This is a complex domain where “good” is highly specific (regulated,
precise, risk-aware), making it a perfect candidate for granular unit
tests.

### 78. Use a Global Test?

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_78.png"
alt="Slide 78" />
<figcaption aria-hidden="true">Slide 78</figcaption>
</figure>

([Timestamp: 29:43](https://youtu.be/qPHsWTZP58U&t=1783s))

You *could* use a **Global Test**: “Was this explained as a financial
analyst would?”

While simple, this test is opaque. If it fails, you don’t know if it was
because of compliance issues, lack of clarity, or poor formatting.

### 79. Global vs. Unit Tests

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_79.png"
alt="Slide 79" />
<figcaption aria-hidden="true">Slide 79</figcaption>
</figure>

([Timestamp: 29:54](https://youtu.be/qPHsWTZP58U&t=1794s))

The slide contrasts the Global approach with **Unit Tests**. Instead of
one question, we ask six: Context, Clarity, Precision, Compliance,
Actionability, and Risks.

This breakdown allows for targeted debugging. You might find the model
is great at “Clarity” but terrible at “Compliance.”
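The six-question breakdown could be wired up roughly as below. Only the six criterion names come from the slide; the question wording, the prompt template, and the keyword-based `stub_judge` are hypothetical and exist only to make the sketch runnable offline.

```python
from typing import Callable, Dict

# Criterion names from the slide; question wording is assumed.
CRITERIA: Dict[str, str] = {
    "Context": "Does the report give enough background to interpret the numbers?",
    "Clarity": "Is the report easy to follow?",
    "Precision": "Are figures and terms stated exactly?",
    "Compliance": "Does the report avoid promises about future returns?",
    "Actionability": "Does the report give the reader a concrete next step?",
    "Risks": "Does the report name the main risks?",
}

PROMPT = "Criterion: {name}\nQuestion: {question}\nReport: {report}\nAnswer YES or NO."


def run_unit_tests(report: str, call_judge: Callable[[str], str]) -> Dict[str, bool]:
    """Ask one focused question per criterion and record pass/fail for each."""
    results = {}
    for name, question in CRITERIA.items():
        answer = call_judge(PROMPT.format(name=name, question=question, report=report))
        results[name] = answer.strip().upper().startswith("YES")
    return results


def stub_judge(prompt: str) -> str:
    """Toy judge: fails Compliance when the report uses the word 'guaranteed'."""
    flagged = "Criterion: Compliance" in prompt and "guaranteed" in prompt.lower()
    return "NO" if flagged else "YES"


report = "Revenue grew 4% year over year. Returns are guaranteed to double next quarter."
results = run_unit_tests(report, stub_judge)
failed = [name for name, ok in results.items() if not ok]
print(results, "failed:", failed)
```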

### 80. Scoring Radar Chart

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_80.png"
alt="Slide 80" />
<figcaption aria-hidden="true">Slide 80</figcaption>
</figure>

([Timestamp: 30:16](https://youtu.be/qPHsWTZP58U&t=1816s))

A **Radar Chart** visualizes the unit test scores. This allows for a
quick visual assessment of the model’s profile.

It facilitates comparison: you can overlay the profiles of two different
models to see which one has the better balance of attributes for your
specific needs.

### 81. Analyzing Failures with Clusters

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_81.png"
alt="Slide 81" />
<figcaption aria-hidden="true">Slide 81</figcaption>
</figure>

([Timestamp: 30:37](https://youtu.be/qPHsWTZP58U&t=1837s))

With enough unit test data, you can use **Clustering (e.g., K-Means)**
to group failures. The slide shows clusters like “Synthesis,” “Context,”
and “Hallucination.”

This moves error analysis from reading individual logs to analyzing
aggregate trends, helping you prioritize which class of errors to fix
first.

### 82. Designing Good Unit Tests

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_82.png"
alt="Slide 82" />
<figcaption aria-hidden="true">Slide 82</figcaption>
</figure>

([Timestamp: 30:52](https://youtu.be/qPHsWTZP58U&t=1852s))

Advice on **Designing Unit Tests**: Keep them focused (one concept per
test), use unambiguous language, and use small rating ranges.

Good unit tests are the building blocks of a reliable evaluation
pipeline. If the tests themselves are noisy or vague, the entire system
collapses.

### 83. Examples of Unit Tests

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_83.png"
alt="Slide 83" />
<figcaption aria-hidden="true">Slide 83</figcaption>
</figure>

([Timestamp: 30:55](https://youtu.be/qPHsWTZP58U&t=1855s))

The slide lists specific examples of tests for **Legal** (Compliance,
Terminology), **Retrieval** (Relevance, Completeness), and
**Bias/Fairness**.

This serves as a menu of options for the audience, showing that unit
tests can cover almost any dimension of quality required by the
business.

### 84. Evaluating New Prompts

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_84.png"
alt="Slide 84" />
<figcaption aria-hidden="true">Slide 84</figcaption>
</figure>

([Timestamp: 30:58](https://youtu.be/qPHsWTZP58U&t=1858s))

A bar chart shows how unit tests are used to **Evaluate New Prompts**.
By running the full suite of unit tests on a new prompt, you get a
“scorecard” of its performance.

This data-driven approach removes the guesswork from prompt engineering.

### 85. Tools - No Silver Bullet

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_85.png"
alt="Slide 85" />
<figcaption aria-hidden="true">Slide 85</figcaption>
</figure>

([Timestamp: 31:02](https://youtu.be/qPHsWTZP58U&t=1862s))

Rajiv reminds the audience that **Tools are No Silver Bullet**. You must
master the basics (datasets, metrics) first.

He advises logging traces and experiments and practicing **Dataset
Versioning**. Tools facilitate these practices, but they cannot replace
the fundamental engineering discipline.
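Dataset versioning does not require special tooling to get started. One lightweight approach, offered here as an assumption rather than something shown in the talk, is to log a content hash of the evaluation set alongside every experiment so that scores are only compared across identical data.

```python
import hashlib
import json
from typing import Dict, List


def dataset_version(rows: List[Dict[str, str]]) -> str:
    """Short content hash of the dataset; changes whenever any row, or the row order, changes."""
    canonical = json.dumps(rows, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


v1_rows = [{"prompt": "What is the capital of France?", "gold": "Paris"}]
v2_rows = v1_rows + [{"prompt": "Translate 'hello' to Spanish.", "gold": "Hola"}]

v1 = dataset_version(v1_rows)
v2 = dataset_version(v2_rows)
print(v1, v2)
```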

### 86. Forest and Trees

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_86.png"
alt="Slide 86" />
<figcaption aria-hidden="true">Slide 86</figcaption>
</figure>

([Timestamp: 31:04](https://youtu.be/qPHsWTZP58U&t=1864s))

An analogy helps structure the analysis: **Forest (Global/Integration)**
vs. **Trees (Test Case/Unit Tests)**.

You need to look at both. The forest tells you the overall health of the
app, while the trees tell you specifically what needs pruning or fixing.

### 87. Change One Thing at a Time

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_87.png"
alt="Slide 87" />
<figcaption aria-hidden="true">Slide 87</figcaption>
</figure>

([Timestamp: 31:17](https://youtu.be/qPHsWTZP58U&t=1877s))

A crucial scientific principle: **Change One Thing at a Time**. With so
many knobs (prompt, temperature, model, RAG settings), changing multiple
variables simultaneously makes it impossible to know what caused the
improvement (or regression).

Isolate your variables to conduct valid experiments.
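A simple guard can enforce this before an experiment runs. The sketch below compares two configuration dictionaries and refuses to proceed unless exactly one setting differs; the config keys and values are illustrative.

```python
from typing import Any, Dict, List

_MISSING = object()  # distinguishes "key absent" from "key set to None"


def changed_keys(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> List[str]:
    """Names of every setting that differs between the two configs."""
    keys = set(baseline) | set(candidate)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


def single_change(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> str:
    """Return the one changed setting, or raise if zero or several changed."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed setting, got {diff}")
    return diff[0]


baseline = {"prompt": "v1", "temperature": 0.0, "model": "model-x"}
good_candidate = {**baseline, "prompt": "v2"}
bad_candidate = {**baseline, "prompt": "v2", "temperature": 0.7}

print(single_change(baseline, good_candidate))
print(changed_keys(baseline, bad_candidate))
```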

### 88. Error Analysis Tips

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_88.png"
alt="Slide 88" />
<figcaption aria-hidden="true">Slide 88</figcaption>
</figure>

([Timestamp: 31:32](https://youtu.be/qPHsWTZP58U&t=1892s))

A summary of **Error Analysis Tips**: Use ablation studies (removing
parts to see impact), categorize failures, save interesting examples,
and leverage logs/traces.

These are the daily habits of successful GenAI engineers.

### 89. The Evaluation Story

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_89.png"
alt="Slide 89" />
<figcaption aria-hidden="true">Slide 89</figcaption>
</figure>

([Timestamp: 32:08](https://youtu.be/qPHsWTZP58U&t=1928s))

The slide shows the “Story We Tell”—a linear graph of improvement over
time. This is the idealized version of progress often presented in case
studies.

It suggests a smooth journey from “Out of the box” to “Specialized” to
“User Feedback.”

### 90. The Reality of Progress

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_90.png"
alt="Slide 90" />
<figcaption aria-hidden="true">Slide 90</figcaption>
</figure>

([Timestamp: 32:24](https://youtu.be/qPHsWTZP58U&t=1944s))

**The Reality** is a messy, non-linear graph. You take two steps
forward, one step back. Sometimes an “improvement” breaks the model.

Rajiv encourages resilience. Experienced practitioners know that this
messy graph is normal and that sticking to the process eventually yields
results.

### 91. Continual Process

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_91.png"
alt="Slide 91" />
<figcaption aria-hidden="true">Slide 91</figcaption>
</figure>

([Timestamp: 33:01](https://youtu.be/qPHsWTZP58U&t=1981s))

**Evaluation is a Continual Process**. It involves Problem ID, Data
Collection, Optimization, User Acceptance Testing (UAT), and Updates.

Crucially, **UAT** is your holdout set. Since you don’t have a
traditional test set in GenAI, your real users act as the final
validation layer.

### 92. Eating the Elephant

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_92.png"
alt="Slide 92" />
<figcaption aria-hidden="true">Slide 92</figcaption>
</figure>

([Timestamp: 34:03](https://youtu.be/qPHsWTZP58U&t=2043s))

The metaphor **“How do you eat an elephant?”** addresses the
overwhelming nature of building a comprehensive evaluation suite.

The answer, of course, is “one bite at a time.” You don’t need 100 tests
on day one.

### 93. Adding Tests Over Time

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_93.png"
alt="Slide 93" />
<figcaption aria-hidden="true">Slide 93</figcaption>
</figure>

([Timestamp: 34:10](https://youtu.be/qPHsWTZP58U&t=2050s))

The slide visualizes the “elephant” being broken down into bites. You
start with a few critical tests. As the app matures and you discover new
failure modes, you add more tests.

Six months in, you might have 100 tests, but you built them
incrementally. This makes the task manageable.

### 94. Doing Evaluation the Right Way

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_94.png"
alt="Slide 94" />
<figcaption aria-hidden="true">Slide 94</figcaption>
</figure>

([Timestamp: 34:39](https://youtu.be/qPHsWTZP58U&t=2079s))

A summary slide listing best practices: **Annotated Examples**,
**Systematic Documentation**, **Continuous Error Analysis**,
**Collaboration**, and awareness of **Generalization**.

This concludes the core methodology section of the talk.

### 95. Agentic Use Cases

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_95.png"
alt="Slide 95" />
<figcaption aria-hidden="true">Slide 95</figcaption>
</figure>

([Timestamp: 34:50](https://youtu.be/qPHsWTZP58U&t=2090s))

The final section covers **Agentic Use Cases**, symbolized by a dragon.
Agents add a layer of complexity because the model is now making
decisions (routing, tool use) rather than just generating text.

This “agency” makes the system harder to track and evaluate.

### 96. Crossing the River

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_96.png"
alt="Slide 96" />
<figcaption aria-hidden="true">Slide 96</figcaption>
</figure>

([Timestamp: 35:06](https://youtu.be/qPHsWTZP58U&t=2106s))

A conceptual slide asking, **“How should it cross the river?”** (Fly,
Swim, Bridge?). This represents the decision-making step in an agent.

Evaluating an agent requires evaluating *how* it made the decision (the
router) separately from *how well* it executed the action.

### 97. Chat-to-Purchase Router

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_97.png"
alt="Slide 97" />
<figcaption aria-hidden="true">Slide 97</figcaption>
</figure>

([Timestamp: 35:22](https://youtu.be/qPHsWTZP58U&t=2122s))

A complex flowchart shows a **Chat-to-Purchase Router**. The agent must
decide if the user wants to search for a product, get support, or track
a package.

Rajiv suggests breaking this down: evaluate the **Router** component
first (did it pick the right path?), then evaluate the specific workflow
(did it track the package correctly?).
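Evaluating the router in isolation reduces to a classification test: a labeled set of user messages and the expected route for each. The sketch below uses a keyword-based `stub_router` as a stand-in for the real LLM router, with hypothetical route names and test cases.

```python
from collections import Counter
from typing import Callable, List, Tuple


def routing_report(cases: List[Tuple[str, str]],
                   route: Callable[[str], str]) -> Tuple[float, Counter]:
    """Return routing accuracy and a count of (expected, predicted) mistakes."""
    correct = 0
    confusions: Counter = Counter()
    for text, expected in cases:
        predicted = route(text)
        if predicted == expected:
            correct += 1
        else:
            confusions[(expected, predicted)] += 1
    return correct / len(cases), confusions


def stub_router(text: str) -> str:
    """Toy keyword router standing in for the real one."""
    lowered = text.lower()
    if "track" in lowered:
        return "track_package"
    if "broken" in lowered or "refund" in lowered:
        return "support"
    return "product_search"


cases = [
    ("Where is my package? I want to track it", "track_package"),
    ("I need running shoes", "product_search"),
    ("My blender arrived broken", "support"),
    ("Can you track down a red jacket for me?", "product_search"),
]

accuracy, confusions = routing_report(cases, stub_router)
print(accuracy, dict(confusions))
```

The confusion counts show which routes get mixed up, which is usually more actionable than the accuracy number alone.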

### 98. Text to SQL Agent

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_98.png"
alt="Slide 98" />
<figcaption aria-hidden="true">Slide 98</figcaption>
</figure>

([Timestamp: 36:17](https://youtu.be/qPHsWTZP58U&t=2177s))

Another example: **Text to SQL Agent**. This workflow involves
classification, feature extraction, and SQL generation.

You can isolate the “Classification” step (is this a valid SQL
question?) and build a test just for that, before testing the actual SQL
generation.

### 99. Evaluating Office-Style Agents

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_99.png"
alt="Slide 99" />
<figcaption aria-hidden="true">Slide 99</figcaption>
</figure>

([Timestamp: 36:46](https://youtu.be/qPHsWTZP58U&t=2206s))

The slide discusses **OdysseyBench**, a benchmark for office tasks. It
highlights failure modes like “Failed to create folder” or “Failed to
use tool.”

Evaluating agents involves checking if they successfully manipulated the
environment (files, APIs), which is a functional test rather than a text
similarity test.
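A functional test inspects the environment after the agent acts instead of reading its reply. The sketch below uses a temporary directory as the workspace; `stub_agent_create_folder` is a hypothetical stand-in for an agent's tool call and is not related to how OdysseyBench itself is implemented.

```python
import tempfile
from pathlib import Path


def stub_agent_create_folder(workspace: Path, name: str) -> str:
    """Stand-in for an agent action that should create a folder."""
    (workspace / name).mkdir()
    return f"Created folder {name}."


def folder_was_created(workspace: Path, name: str) -> bool:
    """Check the environment state, not the agent's claim."""
    return (workspace / name).is_dir()


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    reply = stub_agent_create_folder(workspace, "reports")
    created = folder_was_created(workspace, "reports")
    missing = folder_was_created(workspace, "invoices")

print(reply, created, missing)
```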

### 100. Error Analysis for Agents

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_100.png"
alt="Slide 100" />
<figcaption aria-hidden="true">Slide 100</figcaption>
</figure>

([Timestamp: 37:00](https://youtu.be/qPHsWTZP58U&t=2220s))

**Error Analysis for Agentic Workflows** requires assessing the overall
performance, the routing decisions, and the individual steps.

It is the same error analysis process, applied recursively to every
node in the agent’s decision tree.

### 101. Evaluating Workflow vs. Response

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_101.png"
alt="Slide 101" />
<figcaption aria-hidden="true">Slide 101</figcaption>
</figure>

([Timestamp: 37:19](https://youtu.be/qPHsWTZP58U&t=2239s))

This slide distinguishes between evaluating a **Response** (text) and a
**Workflow** (process). The flowchart shows a conversational flow.

Evaluating a workflow might mean checking if the agent successfully
moved the user from “Greeting” to “Resolution,” regardless of the exact
words used.
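If the agent logs the states it passes through, a workflow check can be as simple as verifying that the required states appear in order. In this sketch the trace format is an assumption, and the intermediate state names ("Clarify", "Lookup", "Escalation") are hypothetical.

```python
from typing import List


def reached_in_order(trace: List[str], required: List[str]) -> bool:
    """True if every required state appears in the trace, in the given order."""
    remaining = iter(trace)
    # `state in remaining` advances the iterator, so order is enforced.
    return all(state in remaining for state in required)


resolved_trace = ["Greeting", "Clarify", "Lookup", "Resolution"]
stalled_trace = ["Greeting", "Clarify", "Escalation"]

print(reached_in_order(resolved_trace, ["Greeting", "Resolution"]))
print(reached_in_order(stalled_trace, ["Greeting", "Resolution"]))
```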

### 102. Agentic Frameworks

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_102.png"
alt="Slide 102" />
<figcaption aria-hidden="true">Slide 102</figcaption>
</figure>

([Timestamp: 37:48](https://youtu.be/qPHsWTZP58U&t=2268s))

Rajiv warns that **“Agentic Frameworks Help – Until They Don’t.”**
Frameworks (like LangChain or AutoGen) are great for demos because they
abstract complexity.

However, in production, these abstractions can break or become outdated.
He often recommends using straight Python for production agents to
maintain control and reliability.

### 103. Abstraction for Workflows

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_103.png"
alt="Slide 103" />
<figcaption aria-hidden="true">Slide 103</figcaption>
</figure>

([Timestamp: 38:32](https://youtu.be/qPHsWTZP58U&t=2312s))

The slide illustrates the trade-off in **Abstraction**. You can build
rigid workflows (orchestration) where you control every step, or use
general agents where the LLM decides.

Orchestration is more reliable but rigid. General agents are flexible
but prone to non-deterministic errors.

### 104. When Abstractions Break

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_104.png"
alt="Slide 104" />
<figcaption aria-hidden="true">Slide 104</figcaption>
</figure>

([Timestamp: 38:53](https://youtu.be/qPHsWTZP58U&t=2333s))

Model providers are training models to handle workflows internally
(removing the need for external orchestration).

However, until models are perfect, developers often need to break tasks
down into specific pieces to ensure reliability. The choice between
“letting the model do it” and “scripting the flow” depends on the
application’s risk tolerance.

### 105. Lessons from Agent Benchmarks

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_105.png"
alt="Slide 105" />
<figcaption aria-hidden="true">Slide 105</figcaption>
</figure>

([Timestamp: 39:15](https://youtu.be/qPHsWTZP58U&t=2355s))

The slide lists **Lessons from Reproducing Agent Benchmarks**:
Standardize evaluation, measure efficiency, detect shortcuts, and log
real behavior.

These are advanced tips for those pushing the boundaries of what agents
can do.

### 106. Conclusion

<figure>
<img
src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_106.png"
alt="Slide 106" />
<figcaption aria-hidden="true">Slide 106</figcaption>
</figure>

([Timestamp: 39:27](https://youtu.be/qPHsWTZP58U&t=2367s))

The final slide, **“We did it!”**, concludes the presentation. Rajiv
thanks the audience and provides the QR code again.

His final message is one of empowerment: he hopes the audience now has
the confidence to go out, build their own evaluation datasets, and start
“hill climbing” their own applications.

------------------------------------------------------------------------

*This annotated presentation was generated from the talk using
AI-assisted tools. Each slide includes timestamps and detailed
explanations.*