Debugging RAG Chatbots and AI Agents with Sessions

When does your AI agent start hallucinating in the multi-step process? Have you noticed consistent issues with a specific part of your agentic workflow?

These are common questions we faced when building our own AI agents and Retrieval-Augmented Generation (RAG) chatbots. Without visibility into how our users interacted with our large language models, getting reliable responses and minimizing errors like hallucination was incredibly challenging.

In this blog post, we will walk through examples of how to maintain context, reduce errors, and improve the overall performance of your LLM apps, and share a list of tools to help you build more robust and reliable AI agents.

What you will learn:

AI agents vs. traditional software
Components of an AI agent
Challenges we faced while debugging AI agents
Effective debugging tools
How different industries debug AI agents using Helicone’s Sessions

First, what is different about AI agents?

Unlike traditional chatbots or software that follow explicit instructions or rules, AI agents can autonomously perform specific tasks with advanced decision-making abilities. They interact with their environment by collecting data, processing it, and deciding on the best actions to achieve a predefined goal.

Examples of AI Agents

Copilots

Copilots help users by providing suggestions and recommendations. For example, when writing code, a copilot might suggest code snippets, highlight potential bugs or offer optimization tips, but the developer decides whether to implement these suggestions.

Autonomous Agents

Autonomous agents perform tasks independently without human intervention. For example, an autonomous agent can handle customer inquiries by identifying issues, accessing account information, performing necessary actions (like processing refunds or updating account details), and responding to the customer. It can also escalate to a human agent if it encounters problems beyond its current capabilities.

Multi-Agent Systems

Multi-agent systems involve interactions and collaboration between multiple autonomous agents to achieve a collective goal. These systems have advantages like dynamic reasoning, the ability to distribute tasks, and better memory for retaining information.

Using Retrieval-Augmented Generation to Improve Functionality

Retrieval-Augmented Generation (RAG) is a framework that allows the agent to incorporate information from external knowledge bases (e.g., databases, documents, articles) into its responses.

RAG significantly improves response quality because the agent can retrieve the most recent relevant data using keywords, semantic similarity, or other advanced search techniques, and use it to generate more accurate, personalized, and context-specific responses.
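To make the retrieve-then-generate idea concrete, here is a minimal sketch of the retrieval half. It assumes a tiny in-memory list of documents and bag-of-words cosine similarity in place of a real vector database and embedding model. The names `DOCUMENTS`, `retrieve`, and `build_prompt` are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory knowledge base; a real system would query a vector database.
DOCUMENTS = [
    "Refunds are processed within five business days.",
    "Flights can be rebooked up to two hours before departure.",
    "Hotel cancellations are free until the day before check-in.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    query_vector = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents):
    """Place the retrieved context ahead of the user's question."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Calling `build_prompt("How long do refunds take?", DOCUMENTS)` places the refund policy sentence in the prompt. In a traced session, the retrieval step and the generation step would each appear as their own request, which makes it easier to tell a bad retrieval from a bad generation.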

Components of AI Agents

Typically, AI agents consist of four core components:

Planning
Tool / Vector Database Calls
Perception
Memory

How AI Agents work

Planning

When you define a goal, an AI agent can plan and sequence the actions needed to reach it. This ability comes from the underlying LLM, which lets the agent formulate better strategies.

Tool / Vector Database Calls

Advanced AI Agents can interact with external tools, APIs, and services through function calls in order to handle more complicated operations such as:

Fetching real-time information from APIs (e.g., weather data, stock prices).
Using translation services to convert text between languages.
Performing tasks like image recognition or manipulation using specialized libraries.
Running custom scripts to automate a specific workflow.
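As a rough illustration of how a function call from the model might be routed to a tool, here is a minimal sketch. The tools are stubs, and the JSON shape (`name` plus `arguments`) is an assumption for this example rather than any specific provider's format.

```python
import json


def get_weather(city):
    # Stub standing in for a real weather API call.
    return {"city": city, "forecast": "sunny"}


def translate(text, target_language):
    # Stub standing in for a real translation service.
    return {"text": text, "target_language": target_language}


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather, "translate": translate}


def dispatch_tool_call(tool_call_json):
    """Parse a tool call emitted by the model and run the matching tool."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOLS[name](**call.get("arguments", {}))}
    except TypeError as exc:
        # Raised when the model supplies missing or unexpected arguments.
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning errors as data instead of raising them keeps the failure visible in the logged response, which is useful when you later trace where a session went wrong.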

Perception

AI agents can also perceive and process information from their environment, making them more interactive and context-aware. This sensory information can include visual, auditory, and other types of data to help the agents respond appropriately to environmental cues.

Memory

AI agents can remember past interactions, including the tools they previously used and their planning decisions. These experiences are stored to help agents self-reflect and inform future actions.
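A minimal sketch of such a memory, assuming a bounded in-process event log, is shown below. The `AgentMemory` class is hypothetical. Production agents typically persist memory in a database or vector store.

```python
from collections import deque


class AgentMemory:
    """Keeps the most recent events so the agent can consult them later."""

    def __init__(self, max_events=100):
        # A bounded deque silently drops the oldest event once it is full.
        self.events = deque(maxlen=max_events)

    def remember(self, kind, detail):
        """Record an event such as a tool call or a planning decision."""
        self.events.append({"kind": kind, "detail": detail})

    def recall(self, kind):
        """Return the details of all remembered events of one kind, oldest first."""
        return [event["detail"] for event in self.events if event["kind"] == kind]
```

The bounded size matters for debugging: once old events are dropped, the agent can no longer reason about them, so a logged session is often the only record of what it knew at each step.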

Challenges We Faced While Debugging AI agents

⚠️ Their decision-making process is complicated.

AI agents’ adaptive behavior makes their decision paths non-deterministic and harder to trace. This is because agents base their decisions on many inputs from diverse data sources (e.g., user interactions, environmental data, and internal states), and they learn through patterns and correlations identified in the data.

⚠️ No visibility into their internal states.

AI agents function as “black boxes”, and understanding how they transform inputs into outputs is not straightforward. Often, when an agent interacts with external services, APIs, or other agents, its behavior is unpredictable.

⚠️ Context builds up over time, and so do errors.

Agents often make multiple dependent vector database calls within a single session, which makes the data flow harder to trace. They can also operate over longer sessions, where an early error can have cascading effects, so it is difficult to identify the original source without proper session tracking.
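To show why session tracking helps with cascading errors, here is a minimal sketch that groups logged steps by session and returns the earliest failing one. The `Trace` record and the sample data are hypothetical. Real traces would come from your logging or observability tool.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    session_id: str
    path: str   # where the step sits in the workflow
    order: int  # position of the step within the session
    ok: bool
    note: str = ""


def first_failure(traces, session_id):
    """Return the earliest failed step of a session, or None if every step succeeded."""
    steps = sorted(
        (t for t in traces if t.session_id == session_id),
        key=lambda t: t.order,
    )
    for step in steps:
        if not step.ok:
            return step
    return None


# Hypothetical session: the confirmation step fails only because the search failed first.
SAMPLE_TRACES = [
    Trace("s1", "/booking/parse-request", 1, True),
    Trace("s1", "/booking/flight-search", 2, False, "airline API timeout"),
    Trace("s1", "/booking/confirmation", 3, False, "missing flight id"),
    Trace("s2", "/booking/parse-request", 1, True),
]
```

Here `first_failure(SAMPLE_TRACES, "s1")` points at the flight search step rather than the later confirmation error, which is the symptom a user would actually report.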

Tools for Debugging AI Agents

One way we try to debug agents is by understanding the internal workings of the model. We also realized that traditional logging methods often lack the granular data needed to effectively debug complex behaviors. Fortunately, there are tools that help streamline the debugging process:

1. Helicone `open-source`

Helicone’s Sessions is ideal for teams looking to intuitively visualize agentic workflows. It caters to developers building both simple and advanced agents who need to group related LLM calls, trace nested agent workflows, quickly identify issues, and track requests, responses, and metadata sent to the vector database.
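As a sketch of how related calls can be grouped, the helper below builds the request headers that, to our understanding of Helicone's Sessions documentation, tie a request to a session (`Helicone-Session-Id`), place it in the workflow hierarchy (`Helicone-Session-Path`), and label the session (`Helicone-Session-Name`). The helper function and the paths are hypothetical, so check the current documentation for the exact header names before relying on them.

```python
import uuid


def session_headers(session_id, path, name):
    """Build the per-request headers that attach an LLM call to a session."""
    return {
        "Helicone-Session-Id": session_id,  # shared by every call in the session
        "Helicone-Session-Path": path,      # nesting is expressed with "/" segments
        "Helicone-Session-Name": name,      # human-readable label for the session
    }


# One id for the whole workflow; each step gets its own path.
session_id = str(uuid.uuid4())
parent = session_headers(session_id, "/trip", "Trip Booking")
child = session_headers(session_id, "/trip/flight-search", "Trip Booking")
```

These dictionaries would be passed as extra headers on each LLM request sent through the proxy. Because the child path extends the parent path, the flight search shows up nested under the trip in the session view.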

2. AgentOps `open-source`

AgentOps can be a good choice for teams looking for a comprehensive solution to debug AI agents. Despite a less intuitive interface, it offers a broad set of features for monitoring and managing AI agents.

3. Langfuse `open-source`

Langfuse is ideal for developers who prefer self-hosting solutions and have simpler infrastructure needs. It offers features similar to Helicone’s and is well-suited for projects with modest scalability requirements or those prioritizing local deployment over cloud-based solutions.

4. LangSmith

LangSmith is ideal for developers working extensively with the LangChain framework, as its SDKs and documentation are designed to best support developers within this ecosystem.

5. Braintrust

Braintrust is a good choice for those focusing on evaluating AI models. It’s an effective solution for projects where model evaluation is a primary concern and agent tracing is a secondary need.

6. Portkey

Portkey is designed for developers looking for the latest tools to track and debug AI agents. It introduces new features quickly, which is great for teams that need the newest features and are willing to face occasional reliability and stability issues.

How Different Industries Debug AI Agents Using Sessions

Travel: Finding Errors in a Multi-Step Workflow

Challenge

A travel chatbot assists users with flight, hotel, and car rental bookings. Errors can easily happen due to data parsing issues or integration problems with third-party services, leaving users frustrated or with incomplete bookings.

Solution

Case Study: Resolving Errors in Multi-Step Processes Using Helicone's Sessions

Sessions gives you a complete trace of the booking interaction, so you can pinpoint exactly where users encountered problems. For example, if your users frequently report missing flight confirmations, looking at each session’s traces can reveal whether the issue came from input parsing errors or glitches with airline APIs.
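If you export traces from many sessions, a small aggregation can show which step fails most often. The sketch below assumes each record carries a `path` and an HTTP-style `status`. Both field names and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical exported trace records from three booking sessions.
LOGGED_TRACES = [
    {"session": "a", "path": "/booking/parse-request", "status": 200},
    {"session": "a", "path": "/booking/airline-api", "status": 502},
    {"session": "b", "path": "/booking/parse-request", "status": 422},
    {"session": "b", "path": "/booking/airline-api", "status": 200},
    {"session": "c", "path": "/booking/airline-api", "status": 504},
]


def failure_hotspots(traces):
    """Count failed requests per workflow step, most frequent first."""
    failures = Counter(t["path"] for t in traces if t["status"] >= 400)
    return failures.most_common()
```

With this sample data the airline API step has two failures and input parsing has one, which suggests checking the third-party integration first.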

Health & Fitness: Personalize Responses to Match User Intent

Challenge

A health and fitness chatbot needs to accurately interpret your users’ requests in order to offer personalized workout and dietary plans. Misinterpreting a request can lead to generic suggestions and unhappy users who abandon the chatbot.

Solution

Case Study: Understanding User Intent for Personalization Using Helicone's Sessions

Traces labelled LLM in a Session can show you your users’ preferences, so you can adjust the chatbot’s responses by altering the prompts. If your users often ask about strength training rather than cardio, you can tweak the prompt to focus on strength training programs.
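A minimal sketch of that feedback loop is shown below, assuming you have already pulled the user messages out of your traces. The keyword lists, function names, and prompt wording are all hypothetical, and a real system might use a classifier instead of keyword matching.

```python
from collections import Counter

# Hypothetical keyword lists used to bucket user messages by intent.
INTENT_KEYWORDS = {
    "strength": ("deadlift", "squat", "weights", "strength"),
    "cardio": ("running", "cycling", "cardio", "hiit"),
}


def classify_intent(message):
    """Return the first intent whose keywords appear in the message."""
    text = message.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "other"


def dominant_intent(messages):
    """Return the most common recognized intent, or None if none was recognized."""
    counts = Counter(classify_intent(m) for m in messages)
    counts.pop("other", None)
    return counts.most_common(1)[0][0] if counts else None


def tailor_prompt(base_prompt, intent):
    """Append a focus instruction to the system prompt when an intent dominates."""
    if intent is None:
        return base_prompt
    return f"{base_prompt} Prioritize {intent} training programs unless the user asks otherwise."
```

For example, messages about deadlifts and squats outnumbering a single question about running would yield `"strength"`, and the tailored prompt would steer the chatbot toward strength programs.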

Education: Ensuring Quality and Consistency with Generated Content

Challenge

An AI agent that creates customized learning materials needs to generate lessons that are both accurate and comprehensive. Errors or incomplete information directly affect your users, who experience poor learning outcomes.

Solution

Case Study: Generating Educational Content Using Helicone's Sessions

A Session outlines the structure of the generated course. Each trace in a Session shows you how the agent interpreted your requests and the corresponding content. As you skim through, wherever the agent misunderstood topics or failed to cover key concepts, you can fine-tune that specific prompt to generate more thorough content while making sure it is appropriate for the student’s learning level.
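Skimming can be supplemented with a simple automated check. The sketch below assumes you have the generated lesson text per trace path and a list of concepts each lesson must mention. The data and the substring-matching rule are hypothetical simplifications.

```python
# Hypothetical generated lessons keyed by their trace path within the session.
LESSONS = {
    "/course/fractions/intro": "A fraction has a numerator and a denominator.",
    "/course/fractions/addition": "To add fractions, add the numerators.",
}

# Hypothetical list of concepts each lesson is expected to cover.
REQUIRED_CONCEPTS = {
    "/course/fractions/intro": ["numerator", "denominator"],
    "/course/fractions/addition": ["common denominator"],
}


def uncovered_concepts(lessons, required):
    """Map each lesson path to the required concepts its text never mentions."""
    gaps = {}
    for path, content in lessons.items():
        text = content.lower()
        missing = [c for c in required.get(path, []) if c.lower() not in text]
        if missing:
            gaps[path] = missing
    return gaps
```

In this sample, only the addition lesson is flagged, because it never mentions a common denominator. That tells you which prompt in the session to revise.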

Next, Become Production-Ready

We’re already seeing AI agents in action across various fields like customer service, travel, health and fitness, as well as education. However, for AI agents to be truly production-ready and widely adopted, we need to continue to improve their reliability and accuracy.

This requires us to actively monitor their decision-making processes and get a deep understanding of how inputs influence outputs. The most effective way to do this is to use monitoring tools that give you the insights you need to make sure your AI agents consistently deliver the results you want.

If you want to give Helicone a try, here are some resources we recommend:

Questions or feedback?

Is the information out of date? Do you have additional platforms to add? Please raise an issue. We’d love to share your insights!

Time: 6 minute read

Created: October 17, 2024

Author: Lina Lam
