Monitoring Multi-Agent Systems with Langfuse and CrewAI
Multi-agent systems are becoming increasingly complex, making effective monitoring crucial for production deployments. In this post, I'll demonstrate how to implement comprehensive monitoring for your multi-agent applications using Langfuse, with CrewAI serving as our framework for building the demo system.
CrewAI is a popular open-source framework that simplifies the creation of collaborative AI agent systems. Combined with Langfuse's specialized monitoring capabilities, we can gain deep insights into agent interactions, performance, cost, and behavior patterns.
Why Langfuse Over Traditional Monitoring Tools?
You might wonder why we need a specialized tool like Langfuse when established monitoring solutions like Azure Monitor or AWS CloudWatch already exist. Here's the key difference:
Traditional monitoring tools excel at tracking infrastructure metrics, application performance, and system health. However, they weren't designed for the unique challenges of AI agent systems. Langfuse fills this gap by providing:
LLM-specific observability: Track token usage, model performance, and response quality across different language models
Agent conversation flows: Visualize complex multi-agent interactions and decision chains
Cost tracking: Monitor AI model costs in real-time across your entire agent ecosystem
Prompt engineering insights: Analyze how different prompts affect agent behavior and outcomes
Human feedback integration: Capture and analyze user feedback to improve agent performance
Langfuse Dashboard Overview
Here's a demo dashboard showing what Langfuse monitoring looks like in action:
As you can see, we have a comprehensive view of our multi-agent system's performance. The dashboard displays several key metrics:
38 total traces tracked across our agent interactions, giving us visibility into every conversation and task execution
Model costs of $159,065 with detailed breakdowns by model type (gpt-4-0613 and gpt-4o-mini), helping us understand the financial impact of our agent operations
Trace activity over time showing peak usage patterns and system load distribution
Model usage analytics with cost and token consumption trends, essential for optimizing our agent configurations
The left sidebar shows different trace categories like "blog-post-agent," "Crew Created," "blog-post-title," and "Task Created," which correspond to different agents and processes in our CrewAI system. This granular tracking allows us to identify bottlenecks, optimize specific agent behaviors, and ensure our multi-agent workflow is performing efficiently.
Now let's dive into how to set up this monitoring for your own CrewAI agents.
Setting Up Langfuse with CrewAI
Let's walk through the implementation step by step. First, here are the required dependencies for our project:
crewai
langfuse
openlit
crewai-tools
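If you manage dependencies with pip, installing them is a one-liner (shown here as a sketch; adapt it to poetry, uv, or whatever tool your project uses):

pip install crewai langfuse openlit crewai-tools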
Environment Configuration
Start by setting up your environment variables in a .env file:
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_HOST=http://localhost:3000
OPENAI_API_KEY=your_openai_api_key
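Before wiring up the full crew, it can be worth verifying that these credentials actually reach your Langfuse instance. Here is a minimal sketch, assuming the Langfuse Python SDK's auth_check() helper:

import os
import dotenv
from langfuse import Langfuse

# Load the .env file created above
dotenv.load_dotenv()

langfuse = Langfuse(
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    host=os.getenv("LANGFUSE_HOST"),
)

# Returns True if the keys and host are valid and the server is reachable
print("Langfuse reachable:", langfuse.auth_check())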
Implementation Code
Here's the complete implementation that demonstrates how to integrate Langfuse monitoring with CrewAI:
from crewai import LLM, Agent, Task, Crew
from langfuse import Langfuse, observe
from langfuse.openai import openai
import dotenv
import os

# Load environment variables from the .env file
dotenv.load_dotenv()

# Initialize the Langfuse client
langfuse = Langfuse(
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    host=os.getenv("LANGFUSE_HOST")
)

openai_api_key = os.getenv("OPENAI_API_KEY")
# Langfuse-wrapped OpenAI client, so direct OpenAI calls are traced as well
client = openai.OpenAI(api_key=openai_api_key)

@observe(name="blog-agent", as_type="generation")
def generate_blog_post_title():
    # Configure the LLM for CrewAI
    llm = LLM(model="gpt-4o-mini", api_key=openai_api_key)

    # Create the agent
    blog_agent = Agent(
        role="Blog Post Agent",
        goal="Generate a blog post title about AI",
        backstory="You are a blog post agent that generates blog posts about AI",
        llm=llm,
        verbose=True
    )

    # Define the task
    title_task = Task(
        description="Generate a blog post title about AI",
        expected_output="A blog post title about AI",
        agent=blog_agent
    )

    # Create and execute the crew
    crew = Crew(agents=[blog_agent], tasks=[title_task])
    response = crew.kickoff()

    # Additional Langfuse span for specific tracking
    with langfuse.start_as_current_span(name="blog-post-title"):
        return response.raw

# Execute the function
print(generate_blog_post_title())
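One practical note: the Langfuse SDK batches events and sends them in the background, so in a short-lived script like this it is a good habit to flush explicitly before the process exits. A small addition on top of the code above:

# Ensure all buffered traces and spans are sent before the script exits
langfuse.flush()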
Key Integration Points
The magic happens in several places:
@observe decorator: This automatically tracks the entire function execution, capturing inputs, outputs, and performance metrics
Langfuse client initialization: Connects to your Langfuse instance (running locally on port 3000 in this example)
CrewAI execution: The standard CrewAI workflow runs normally while Langfuse captures all the underlying LLM calls
Additional spans: The start_as_current_span context manager allows you to add custom tracking points, as sketched below
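To make the last two points concrete, here is a small standalone sketch of both mechanisms: the observe decorator traces a plain Python function, and start_as_current_span opens a nested span whose attributes you can update yourself. The function name and metadata values are made up for illustration; only the Langfuse calls come from the SDK.

from langfuse import Langfuse, observe

langfuse = Langfuse()  # reads the LANGFUSE_* variables from the environment

@observe(name="summarize-step")
def summarize(text: str) -> str:
    # Any LLM or tool calls made in here are captured under this observation
    with langfuse.start_as_current_span(name="post-processing") as span:
        result = text.strip()[:100]
        # Attach custom input/output to the span for later inspection
        span.update(input=text, output=result)
    return result

print(summarize("  CrewAI agents collaborated on a blog post title...  "))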
When you run this code, Langfuse automatically captures every interaction between your agents, including token usage, response times, and the complete conversation flow that led to the final blog post title.
Configuring Model Pricing for Cost Tracking
One crucial step for accurate cost monitoring is configuring your models in the Langfuse platform. Without this configuration, Langfuse won't be able to calculate the actual costs of your agent interactions.
As you can see in the screenshot above, you need to add your specific models to Langfuse with their pricing information. For our example using gpt-4o-mini, the dummy configuration looks like this:
Input tokens: $5.0000000 per 1M units
Output tokens: $5.0000000 per 1M units
Input cached tokens: $5.0000000 per 1M units
Output reasoning tokens: $5.0000000 per 1M units
To set this up for your own project:
Navigate to your Langfuse dashboard
Go to the Models section
Click "Add Model" or configure an existing one
Enter the correct pricing based on your model provider's current rates
Important: Make sure to use the actual pricing from your model provider. For gpt-4o-mini, verify the current OpenAI rates; the values shown here ($5.00 per 1M units) are placeholders used purely for illustration and are not real. Always check OpenAI's official pricing page for the most up-to-date rates.
Once configured, Langfuse will automatically calculate costs for every agent interaction, giving you the detailed cost breakdowns we saw in the dashboard earlier.
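The calculation itself is straightforward: token usage multiplied by the per-1M-token price for each category. A quick back-of-the-envelope sketch using the dummy rates above (the token counts are invented for illustration):

# Dummy per-1M-token prices from the example configuration above
INPUT_PRICE_PER_1M = 5.0
OUTPUT_PRICE_PER_1M = 5.0

# Hypothetical usage for a single crew run
input_tokens = 1_200
output_tokens = 300

cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M
print(f"Estimated cost: ${cost:.6f}")  # $0.007500 at the dummy rates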
Conclusion
I was researching open-source monitoring tools over the weekend and decided to give Langfuse a try. It seems like an established platform that is well-liked by the community. I liked it too, and I may well use it in a production project soon.
The combination of Langfuse and CrewAI provides excellent visibility into agent interactions, token usage, and costs—all essential for production multi-agent systems. As your systems grow, these observability features become invaluable for tracking performance and optimizing workflows.
This isn't an advertisement for either tool, just sharing what worked well for my own projects. There are other monitoring solutions out there that I will test and write about soon.
Ready to try it yourself? Set up your Langfuse instance, integrate it with your favorite agentic framework, and start monitoring your multi-agent systems today.
Written by human and AI 🤖👨💻