
Observability for LLM Systems: A Practical Guide to Monitoring Agents with LangSmith, Prometheus, and Grafana
Your new AI agent is deployed. It's answering questions, using tools, and impressing users. But what's actually happening inside? How much is it costing? Is the quality of its answers degrading over time? Why was one specific user query so slow? If you're relying on traditional Application Performance Monitoring (APM) tools, you're flying blind.
The critical "why": Traditional monitoring can see that your application made an API call to an LLM, but it has no insight into the *content* or *logic* of that call. It can't see the prompt, the retrieved documents, the agent's reasoning steps, or the final generated answer. This lack of visibility makes debugging a nightmare and quality control impossible. A new, specialized observability stack is required. At ActiveWizards, we engineer these robust monitoring systems by integrating qualitative tracing tools with quantitative metrics platforms. This guide provides a practical blueprint for combining LangSmith, Prometheus, and Grafana to achieve true observability for your LLM systems.
The Three Pillars of LLM Observability
No single tool can solve this problem. A comprehensive solution requires a triad of specialized systems, each answering a different fundamental question.
| Tool | Core Function | Answers the Question... |
|---|---|---|
| LangSmith | Trace Logging & Debugging | "What did the agent do and why?" (The qualitative view) |
| Prometheus | Time-Series Metrics Collection | "How is the system performing over time?" (The quantitative view) |
| Grafana | Metrics Visualization & Dashboards | "Show me the system's health at a glance." (The visual view) |
The Unified Architecture: How They Work Together
The power of this stack lies in how these tools are integrated. Your AI agent application sits at the center, instrumented to send different types of data to the right destinations. The key is that the data flows in different ways: LangSmith receives detailed traces via its SDK, while Prometheus actively "scrapes" metrics from an endpoint your application exposes.
Diagram 1: The unified observability architecture for LLM systems.
Practical Implementation: Instrumenting Your Agent
Let's look at how to instrument a Python-based agent to feed this system.
1. Enabling LangSmith Tracing
LangSmith integration is often the simplest part: set a few environment variables, and any application built on LangChain objects automatically sends detailed traces of prompts, tool calls, and LLM outputs to LangSmith for inspection.
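A minimal sketch of that setup, assuming a LangChain-based agent (the API key and project name are placeholders you would replace with your own):

import os

# Set these before constructing any LangChain objects (values are placeholders)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "llm-agent-prod"  # optional: groups traces by project

# From this point on, every chain or agent invocation is traced automatically.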
2. Exposing Prometheus Metrics
This requires more explicit instrumentation. Using a client library like `prometheus_client` for Python, you define the metrics you care about and increment them in your code. This is essential for tracking costs and performance.
from prometheus_client import start_http_server, Counter, Histogram
import time

# Define your metrics
TOTAL_REQUESTS = Counter('llm_agent_requests_total', 'Total requests to the agent.')
REQUEST_LATENCY = Histogram('llm_agent_request_latency_seconds', 'Agent request latency.')
TOTAL_TOKENS = Counter('llm_agent_tokens_total', 'Total tokens processed by the LLM.', ['model_name'])

def handle_user_query(query, model):
    TOTAL_REQUESTS.inc()
    start_time = time.time()

    # --- LangSmith automatically traces this block if configured ---
    # response = agent.invoke({"input": query})
    # llm_output = response['output']
    # token_usage = response['token_usage']  # Hypothetical token count
    # --- End of traced block ---

    # Manually record Prometheus metrics
    latency = time.time() - start_time
    REQUEST_LATENCY.observe(latency)
    # TOTAL_TOKENS.labels(model_name=model).inc(token_usage)
    # return llm_output

# Start an HTTP server on port 8000 to expose the /metrics endpoint for Prometheus to scrape
start_http_server(8000)
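On the Prometheus side, a single scrape job pointed at this endpoint is all that's needed. A minimal sketch of the prometheus.yml entry, assuming the agent runs locally on the port opened above (the job name and interval are illustrative):

scrape_configs:
  - job_name: 'llm-agent'            # illustrative job name
    scrape_interval: 15s             # how often Prometheus pulls /metrics
    static_configs:
      - targets: ['localhost:8000']  # the port opened by start_http_server above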
A mature observability practice connects these three pillars. Your Grafana dashboard shows you the "what" (e.g., a spike in P99 latency). LangSmith shows you the "why" (a specific tool call is hanging or an LLM is generating verbose output). The third, often forgotten, piece is **structured logging** (e.g., sending JSON logs to Elasticsearch or Loki). While LangSmith captures logic, structured logs capture system-level events. A truly powerful dashboard in Grafana can correlate all three: displaying a latency spike, providing a link to the relevant LangSmith traces for that time period, and showing related error logs from Loki. This is the holy grail of LLM ops.
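As a sketch of that third piece, structured JSON logs can be emitted with nothing more than the standard library, assuming your log shipper (e.g. Promtail for Loki) tails stdout; the field names and values here are illustrative:

import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for Loki/Elasticsearch ingestion."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Illustrative correlation fields, passed via logging's `extra` argument
            "trace_url": getattr(record, "trace_url", None),
            "model_name": getattr(record, "model_name", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("llm_agent")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emit a structured event that can be correlated with a LangSmith trace
logger.info("agent request completed",
            extra={"model_name": "<model>", "trace_url": "<langsmith-trace-url>"})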
Building Your Unified Dashboard in Grafana
Once Prometheus is collecting metrics, you can build a powerful Grafana dashboard to visualize the health of your AI system. A good dashboard goes beyond basic CPU and memory usage.
Key Panels for an LLM Operations Dashboard (example PromQL queries follow the list):
- Cost Monitoring: A graph showing `llm_agent_tokens_total` over time, broken down by model name. You can multiply this by the per-token cost to get a real-time view of your spend.
- Performance & Latency: A histogram or heatmap of `llm_agent_request_latency_seconds`, showing your P50, P90, and P99 latencies.
- Quality & Errors: A graph of failed requests or tool execution errors. If you collect user feedback (e.g., thumbs up/down), you can display the feedback score over time.
- Throughput: A simple counter showing the rate of `llm_agent_requests_total` per minute or hour.
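As a starting point, these panels map to PromQL queries like the following, written against the metrics defined in the code above (the per-token price of 0.00001 is a placeholder you would replace with your model's actual rate):

# Throughput: requests per minute
rate(llm_agent_requests_total[5m]) * 60

# P99 latency over a 5-minute window (uses the histogram's auto-generated _bucket series)
histogram_quantile(0.99, rate(llm_agent_request_latency_seconds_bucket[5m]))

# Token usage per second, broken down by model
sum(rate(llm_agent_tokens_total[5m])) by (model_name)

# Approximate spend per second (placeholder per-token price)
sum(rate(llm_agent_tokens_total[5m])) by (model_name) * 0.00001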
The ActiveWizards Advantage: Engineering Production-Ready MLOps
As this guide shows, observability for AI systems is a serious engineering discipline. It's not just about installing tools; it's about thoughtful instrumentation, defining meaningful metrics, and building a cohesive system where different components work together to provide a complete picture of system health. This is a core component of MLOps for the generative AI era.
At ActiveWizards, we don't just build intelligent agents; we engineer the robust, observable, and manageable production systems that surround them. We apply our deep expertise in both AI and data platform engineering to ensure your AI applications are not black boxes, but transparent, reliable, and cost-effective assets.
Go Beyond the Black Box
Ready to gain true visibility into your AI systems? Our experts can help you design and implement a comprehensive observability stack that gives you the confidence to deploy and manage LLM applications at scale.