Building a Code-First LLM Evaluation Strategy with monday Service and LangSmith
Author: Nino, Senior Tech Editor
The transition from prototype to production for Large Language Model (LLM) applications is often fraught with uncertainty. For customer-facing tools like monday Service, where accuracy and reliability are non-negotiable, the 'vibe check'—manually testing a few prompts and seeing if they look okay—is insufficient. To solve this, the engineering team at monday.com adopted a code-first evaluation strategy from the very beginning, leveraging the power of LangSmith and high-performance API backends like n1n.ai.
The Shift from Manual to Systematic Evaluation
In traditional software engineering, unit tests are deterministic. You know exactly what the output should be for a given input. In the world of LLMs, outputs are probabilistic and unstructured. This makes testing inherently difficult. monday Service realized that to build a world-class AI service agent, they needed to treat evaluations as first-class citizens in their codebase.
A code-first evaluation strategy involves defining metrics, datasets, and evaluators in code rather than just through a UI. This allows for version control, automation in CI/CD pipelines, and reproducibility. By integrating n1n.ai as their primary API gateway, they ensured that the latency during these massive evaluation runs remained minimal, allowing for faster iteration cycles.
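Because the evaluators live in code, they can gate a CI pipeline like any other test. A minimal sketch of such a quality gate (the `load_eval_scores` helper and the 0.9 threshold are illustrative, not part of monday.com's actual setup):

```python
# Hypothetical CI quality gate: fail the build if eval scores regress.
# load_eval_scores is a stand-in for however you fetch results
# (e.g., from a LangSmith experiment); the 0.9 threshold is illustrative.

def passes_quality_gate(scores, threshold=0.9):
    """Return True only if the mean eval score meets the threshold."""
    if not scores:
        return False
    return sum(scores) / len(scores) >= threshold

# In a pytest file this becomes an ordinary regression test:
# def test_eval_quality():
#     assert passes_quality_gate(load_eval_scores("customer-service-v1"))
```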
Core Components of the monday Service Eval Framework
1. Tracing and Observability
Before you can evaluate, you must observe. LangSmith provides deep tracing capabilities that allow developers to see every step of a chain, from the initial prompt to the final output, including intermediate RAG (Retrieval-Augmented Generation) steps. This visibility is crucial for identifying where a model might be hallucinating or where the retrieval process is failing.
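Conceptually, LangSmith's tracing wraps each step of a chain so every call is recorded as a run. The simplified stand-in below mimics that idea with a plain decorator (it is not the real SDK; LangSmith's own `@traceable` decorator sends runs to its backend instead of an in-memory list):

```python
import functools

# Simplified stand-in for LangSmith's @traceable decorator: each decorated
# call is recorded, so every step of the chain is visible afterwards.
TRACE = []

def traceable(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACE.append({"step": fn.__name__, "inputs": args, "output": result})
        return result
    return wrapper

@traceable
def retrieve(query):
    # Stand-in for the RAG retrieval step.
    return ["doc about billing"]

@traceable
def generate(query, docs):
    # Stand-in for the LLM generation step.
    return f"Answer to '{query}' using {len(docs)} document(s)"

generate("How do I update billing?", retrieve("How do I update billing?"))
```

Inspecting `TRACE` after a run shows the retrieval step and the generation step in order, which is exactly the visibility needed to tell a retrieval failure apart from a hallucination.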
2. Dataset Curation
Evaluation is only as good as the data it uses. The team focused on building 'Golden Datasets'—curated sets of inputs and expected outputs (ground truths) that represent the most common and most critical user interactions. These datasets are stored and versioned within LangSmith, enabling the team to run regression tests every time they update their model or prompt logic.
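A golden dataset is ultimately just versioned input/expected-output pairs. The examples below are invented for illustration; the commented-out upload uses the LangSmith client methods for creating datasets and examples:

```python
# A 'Golden Dataset' is just versioned input/expected-output pairs.
# The two examples below are illustrative, not monday.com's real data.
GOLDEN_EXAMPLES = [
    {"inputs": {"question": "How do I reset my password?"},
     "outputs": {"answer": "Use the 'Forgot password' link on the login page."}},
    {"inputs": {"question": "Can I export my board to Excel?"},
     "outputs": {"answer": "Yes, via the board menu's Export option."}},
]

# Uploading to LangSmith (requires a LangSmith API key in the environment):
# from langsmith import Client
# client = Client()
# dataset = client.create_dataset(dataset_name="customer-service-v1")
# client.create_examples(
#     inputs=[e["inputs"] for e in GOLDEN_EXAMPLES],
#     outputs=[e["outputs"] for e in GOLDEN_EXAMPLES],
#     dataset_id=dataset.id,
# )
```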
3. Custom Evaluators
monday Service utilizes a mix of evaluation methods:
- Deterministic Evaluators: Checking for the presence of specific keywords, valid JSON formatting, or regex matches.
- LLM-as-a-Judge: Using a more capable model (like GPT-4o or Claude 3.5 Sonnet) to grade the output of a smaller, faster model based on criteria like politeness, accuracy, and relevance.
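A deterministic evaluator of the first kind can be a few lines of plain Python. The specific checks below (valid JSON, required keywords, no raw URLs) are illustrative; monday.com's real evaluators will differ:

```python
import json
import re

def evaluate_response(text, required_keywords=()):
    """Deterministic checks on a model output. The exact checks are
    illustrative, not monday.com's actual rules."""
    results = {}
    try:
        json.loads(text)
        results["valid_json"] = True
    except (ValueError, TypeError):
        results["valid_json"] = False
    lowered = text.lower()
    results["keywords_present"] = all(k.lower() in lowered for k in required_keywords)
    results["contains_url"] = bool(re.search(r"https?://\S+", text))
    return results

checks = evaluate_response('{"status": "resolved"}', required_keywords=["resolved"])
```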
Implementation Guide: Building Your Own Pipeline
To implement a similar strategy, you need a robust infrastructure. Here is a simplified workflow of how you can set up a code-first evaluation using Python and LangSmith, powered by n1n.ai.
```python
import os

from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

# Route OpenAI-compatible traffic through n1n.ai for high-speed API access
os.environ["OPENAI_API_BASE"] = "https://api.n1n.ai/v1"
os.environ["OPENAI_API_KEY"] = "YOUR_N1N_API_KEY"

client = Client()

# 1. Define your evaluation criteria
eval_config = RunEvalConfig(
    evaluators=[
        "qa",          # Correctness based on ground truth
        "context_qa",  # Correctness based on retrieved context
        "cot_qa",      # Chain-of-thought reasoning check
    ],
    prediction_key="output",
)

# 2. Run the evaluation (my_service_chain is your chain or chain factory)
results = run_on_dataset(
    client=client,
    dataset_name="customer-service-v1",
    llm_or_chain_factory=my_service_chain,
    evaluation=eval_config,
    project_name="eval-run-2025-05-01",
)
```
Comparison: Evaluation Strategies
| Feature | Vibe Check | Manual Labelling | Code-First Eval |
|---|---|---|---|
| Scalability | Low | Medium | High |
| Consistency | None | Subjective | Objective |
| Speed | Fast (initial) | Very Slow | Fast (automated) |
| Cost | Low | High (Human labor) | Moderate (API costs) |
Why Throughput Matters in Evaluation
When running evaluations across thousands of test cases, the throughput of your LLM provider becomes a bottleneck. If your evaluation suite takes 2 hours to run, developers will skip it. By utilizing n1n.ai, teams can parallelize requests across multiple high-performance models, reducing evaluation time from hours to minutes. This speed is what enables 'Eval-Driven Development,' where the evaluation is run after every minor change to the prompt or RAG parameters.
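Parallelizing the requests is what turns a two-hour suite into minutes. A minimal sketch using a thread pool (`call_model` is a stand-in for a real API call to your provider, and the worker count is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(example):
    # Stand-in for a real request to the n1n.ai endpoint; here it just echoes.
    return {"input": example, "output": f"response for {example}"}

def run_parallel(examples, max_workers=16):
    """Fan eval requests out across a thread pool instead of running serially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with examples.
        return list(pool.map(call_model, examples))

results = run_parallel([f"case-{i}" for i in range(100)])
```

LangChain's `run_on_dataset` exposes a `concurrency_level` parameter for the same purpose, so in practice you rarely need to manage the pool yourself.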
Advanced Metrics: Beyond Simple Accuracy
monday Service doesn't just look at whether the answer is 'correct.' They look at:
- Faithfulness: Does the answer only use information provided in the retrieved documents?
- Answer Relevance: Does the answer actually address the user's specific query?
- Latency: Is the response fast enough for a real-time chat interface (e.g., under 2 seconds)?
By quantifying these metrics, the team can make data-driven decisions. For example, if a new prompt template increases accuracy by 2% but increases latency by 500ms, they might decide it's not worth the trade-off for a service agent.
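A trade-off rule like that can itself be encoded in the eval pipeline. A minimal sketch (all thresholds are illustrative, not monday.com's actual policy):

```python
# Hypothetical decision rule for accepting a new prompt template:
# the accuracy gain must outweigh the latency cost, and latency must
# stay under the real-time budget. All thresholds are illustrative.
LATENCY_BUDGET_MS = 2000

def accept_candidate(baseline, candidate,
                     min_accuracy_gain=0.01, max_latency_increase_ms=300):
    """Return True if the candidate config is worth shipping."""
    if candidate["latency_ms"] > LATENCY_BUDGET_MS:
        return False
    accuracy_gain = candidate["accuracy"] - baseline["accuracy"]
    latency_increase = candidate["latency_ms"] - baseline["latency_ms"]
    return (accuracy_gain >= min_accuracy_gain
            and latency_increase <= max_latency_increase_ms)

baseline = {"accuracy": 0.90, "latency_ms": 1200}
candidate = {"accuracy": 0.92, "latency_ms": 1700}  # +2% accuracy, +500 ms
```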
Conclusion
Building a code-first evaluation strategy is not just about catching bugs; it is about building the confidence to innovate. monday Service has demonstrated that by integrating LangSmith and reliable API aggregators like n1n.ai, companies can scale their AI efforts without sacrificing quality. As the LLM landscape continues to evolve with models like DeepSeek-V3 and OpenAI o3, having a robust evaluation framework ensures you can swap models and optimize performance with zero guesswork.
Get a free API key at n1n.ai.