Understand the fundamentals of AI Agents, their functions, and how they integrate into larger AI systems to improve efficiency.
A Comprehensive Guide to AI Agents
Introduction
OpenAI's mission to create artificial general intelligence (AGI) smarter than humans remains a distant prospect. However, tools like BabyAGI and AutoGPT, AI agents powered by LLMs, have already raised concerns. These agents use the LLM as a reasoning engine to plan their course of action and execute the tools available to them.
This blog is a complete guide to AI agents, covering what they are, how they work, and where they are headed. And since this blog is not just theoretical mumbo jumbo, we will also approach the topic from a technical point of view and build our own AI agent.
Overview
An artificial intelligence (AI) agent is a system or program that autonomously executes tasks for a user or another system, leveraging available tools to complete its objectives. For example, a personal AI assistant autonomously schedules meetings, books flights, and adjusts home thermostat settings based on your preferences without direct input after setup.
It's natural to associate AI agents with large language models (LLMs) today, and we’ll discuss that in a later section, but their origins date back to the early days of AI itself. A notable example is IBM's Deep Blue, which defeated world chess champion Garry Kasparov in 1997.
Benefits
Agents provide many benefits, some of which are mentioned below:
- Better Efficiency: AI agents excel at managing repetitive and routine tasks that often drain human time and resources. They can help automate monotonous work such as data entry, scheduling, customer inquiries, and basic analysis.
- Cost Savings: Automation via AI agents reduces the need for large teams of people to handle routine work, enabling businesses to reallocate their human workforce to more complex issues.
- High Availability: Unlike human employees, AI agents can function 24/7. This constant availability ensures customer queries are addressed promptly, no matter the time.
- Effortless Scalability: AI agents can manage a growing volume of tasks or interactions without requiring additional resources or infrastructure.
Applications
These benefits are not just on paper; many businesses have started to use AI agents. Here are some real-life applications:
1. Coding Assistant: Copilot by GitHub uses LLMs to assist developers by suggesting code snippets, completing code, and providing documentation as they work in their integrated development environments (IDEs).
2. Customer Service: Zendesk employs AI agents to streamline customer interactions. This allows them to automate responses to common inquiries and route complex issues to human agents.
3. AI Teachers: Khanmigo is an AI-powered tutoring assistant that personalizes students' learning experiences. It provides instant feedback, answers questions, and adapts to individual learning styles, enhancing educational outcomes.
Introduction to LLM Agents
With large language models (LLMs), anyone can build an agent. LLMs supply the most critical ingredients in the agent recipe: reasoning and acting.
Reasoning forms the foundation of an agent's decision-making process. However, since LLMs are probabilistic models, research is still underway to enhance their reasoning capabilities. OpenAI’s new model, o1, serves as an example of these ongoing efforts.
A paper in the field of LLM agents, ReAct: Synergizing Reasoning and Acting in Language Models, introduced the integration of reasoning and acting in LLMs. The ReAct framework allows a model to interleave reasoning, actions, and observations while working toward its goals. An example of the framework is as follows:
Comparison of 4 Prompting Methods
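To make this concrete, here is a hypothetical ReAct-style trace; the question and the search tool are invented for illustration:

Thought: I need to find out when the Eiffel Tower was completed.
Action: search("Eiffel Tower completion date")
Observation: The Eiffel Tower was completed in March 1889.
Thought: I now know the answer.
Final Answer: The Eiffel Tower was completed in 1889.

Notice how the model alternates between reasoning (Thought), tool use (Action), and tool output (Observation) until it reaches its goal.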
How Do LLM Agents Work?
LLM agents have various components that enable their functionality. These include:
Components of an LLM Agent | Source
- Agent Core: This is the central decision-making unit that manages the agent’s logic, sets goals, guides execution, and works with planning and memory modules. For example, in a customer service chatbot, the agent core would determine how to resolve a user query using predefined rules and tools.
- Memory Module: This module stores past interactions and agent progress. Short-term memory tracks ongoing tasks, while long-term memory logs user interactions. For instance, a virtual assistant might remember a user’s preferences for future recommendations.
- Tools: Agents use external tools, like third-party APIs, to perform real-world tasks. A financial AI agent, for example, might use an API to check stock prices or execute trades.
- Planning Module: This module decomposes complex problems into smaller tasks and refines strategies using techniques like ReAct. For instance, a personal assistant AI might break down a travel booking process and handle flights, accommodations, and itinerary steps.
Let’s discuss the three most essential components in detail: Planning, Tools, and Memory.
Planning
As previously mentioned, planning entails deconstructing complex problems into smaller, manageable tasks. By addressing each sub-task, the agent works toward achieving the overall goal. Additionally, the agent should engage in self-criticism and self-reflection to learn from its shortcomings and enhance its results.
Planning Using LLM
Task Decomposition: This involves breaking down the problem without incorporating feedback. Here are some techniques:
- Chain of Thought (CoT): Guides the model in thinking step by step, decomposing complex tasks into simpler sub-tasks (a minimal sketch follows this list).
- Tree of Thoughts (ToT): Expands on CoT by exploring multiple reasoning paths for each step, creating a tree structure for evaluation.
- Task-Specific Instructions: Provides instructions tailored to specific tasks, e.g., "Write a story outline." for narrative tasks.
- LLM+P: Integrates an external classical planner, using the Planning Domain Definition Language (PDDL) to manage long-horizon planning by translating natural language problems into PDDL format.
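To illustrate the simplest of these, here is a minimal sketch of CoT-style prompting using the ChatOpenAI client we set up later in this guide; the task and prompt wording are invented for illustration:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
# "Think step by step" elicits an explicit decomposition before the answer.
response = llm.invoke(
    "Plan a two-day team offsite. Think step by step: "
    "first list the sub-tasks, then order and expand them."
)
print(response.content)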
Self-Reflection: These techniques also decompose the problem, but additionally incorporate feedback so the agent can learn from its mistakes. Here are some techniques:
- ReAct: Combines reasoning with actions by generating reasoning traces alongside discrete actions, facilitating self-assessment after each step.
- Reflexion: This approach builds on ReAct and employs dynamic memory and heuristics to reflect on past actions and improve future decision-making (a simplified loop is sketched after this list).
- Chain of Hindsight (CoH): Leverages historical feedback on past outputs to refine future outputs based on human ratings and annotations.
- Algorithm Distillation (AD): Condenses learning histories from multiple episodes into a neural network, allowing agents to leverage prior experience for improved performance in new tasks.
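To show the generate-critique-retry loop these techniques share, here is a simplified, hypothetical self-reflection loop. It is a sketch of the general idea only, not the actual Reflexion implementation; the prompts are invented:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
task = "Write a one-sentence summary of the ReAct framework."
reflection = ""
for _ in range(3):
    # Generate an attempt, conditioned on any previous critique.
    draft = llm.invoke(task + "\n" + reflection).content
    # Ask the model to critique its own output.
    verdict = llm.invoke(
        "Reply PASS or FAIL with a reason.\nTask: " + task + "\nAnswer: " + draft
    ).content
    if verdict.strip().startswith("PASS"):
        break
    # Feed the critique back into the next attempt.
    reflection = "A previous attempt failed because: " + verdict + "\nImprove it."
print(draft)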
Memory
The memory module stores an agent's internal logs, including past thoughts, actions, and user interactions. There are two main memory types in LLM agent literature:
- Short-term Memory: This captures context about the agent's current situation. It is typically realized through in-context learning, which is limited by context window constraints.
- Long-Term Memory: This retains the agent's past behaviors and thoughts for extended periods, often using an external vector store for fast and scalable retrieval of relevant information.
Hybrid memory combines both types to enhance long-range reasoning and experience accumulation. Various memory formats, such as natural language, embeddings, databases, and structured lists, can be employed.
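As a concrete illustration of long-term memory backed by a vector store, here is a minimal sketch using FAISS via langchain_community. It assumes the faiss-cpu package is installed, and the stored memories are invented:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Long-term memory: embed past interactions for semantic retrieval.
memory = FAISS.from_texts(
    ["User prefers morning meetings.", "User's timezone is UTC+2."],
    embedding=OpenAIEmbeddings(),
)
# Retrieve the memory most relevant to the current situation.
hits = memory.similarity_search("When should I schedule the call?", k=1)
print(hits[0].page_content)  # -> "User prefers morning meetings."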
Tools
When large language models (LLMs) are equipped with external tools, their capabilities are significantly enhanced.
- MRKL Architecture, short for "Modular Reasoning, Knowledge, and Language," is a system designed for autonomous agents. It consists of "expert" modules, with an LLM acting as a coordinator to direct questions to the most suitable module.
- TALM (Tool Augmented Language Models; Parisi et al. 2022) and Toolformer (Schick et al. 2023) focus on training LLMs to use external tool APIs, improving performance by incorporating feedback on how well the models use these tools.
- HuggingGPT (Shen et al. 2023) uses ChatGPT to plan tasks and select models from the Hugging Face platform.
- API-Bank (Li et al. 2023) evaluates tool-using LLMs through a detailed workflow and example conversations. It assesses their ability to determine if an API call is needed, find the right APIs, and plan complex tasks that require multiple API calls.
Managing these components individually during development can be challenging. Fortunately, tools like CrewAI, LangGraph, and others can simplify this process. In this article, we will use LangGraph.
Overview of LangGraph
One question that arises is why LangGraph is needed at all. Earlier solutions faced many challenges, such as:
- Difficulty in managing dynamic decision-making in applications.
- Complexity in coordinating multiple tools and services.
- Lack of effective human-in-the-loop control for real-time interventions.
- Lack of support for handling long-running tasks or streaming data.
LangGraph addresses these challenges by offering a framework that supports controllable workflows, integrates human oversight, and allows information streaming. It simplifies decision-making processes and can be deployed with ease through LangGraph Cloud.
Key Concepts
These are some of the basic concepts of LangGraph:
- State: This is the "current snapshot" or shared data representing what’s happening in your application at any moment. It keeps track of the information that flows through the graph.
- Nodes: These are functions that define what actions should be taken. Nodes receive the State as input, perform some action, and then return an updated version of the State. Think of nodes as the "workers" in the system.
- Edges: These are also functions, but they decide which node to run next based on the current State. Edges act as decision-makers, determining the flow from one node to another.
In short:
- Nodes do the work.
- Edges decide what happens next.
How it Works
LangGraph uses message passing to control the program's flow. When a node finishes its work, it sends messages to other nodes along its outgoing edges. These recipient nodes then take over, processing the message and updating the State. This message-passing makes the graph move from one action to the next.
Guide to Task Automation with LangGraph
This section will expand on LangGraph concepts to build an agent that automates daily tasks. This agent will check the user's calendar for available schedules and email the meeting attendees, streamlining the scheduling process.
The Architecture of Task Automation Problem
Here's a step-by-step breakdown of the task automation workflow using LangGraph, as depicted in the diagram above:
- Input: The user provides a natural language request like scheduling a meeting.
- Agent: The central Agent interprets the input and determines the required action. This is a Large Language Model (LLM).
- Tool: The Agent chooses the appropriate tool (Calendar API or Email API).
- Calendar API: This tool interfaces with a calendar system (like Google Calendar or Microsoft Outlook) to check availability, schedule meetings, and manage appointments.
- Email API: This tool connects to an email service, allowing the system to compose and send emails.
- Tool Execution: After selecting a tool, the agent calls the corresponding function and updates the current state, which is reported back to the agent.
- Agent Evaluation: The Agent processes the tool's feedback and decides the next steps.
- Exit: The Agent uses another tool, requests more information, or completes the task.
Implementation
This section will walk through the setup and creation of the above-mentioned task automation using LangGraph. Let’s get started.
Setup and Imports
In this section, we install the necessary libraries and set up our environment. The langchain and langchain-openai packages enable integration with OpenAI models, while langgraph facilitates the creation of stateful graphs to manage interactions effectively.
!pip install langchain langchain-openai langgraph langchain_community
For this guide, you will need an OpenAI API key to initialize the GPT-4o model provided by OpenAI. Here are the imports:
import json
from typing import Annotated
from typing_extensions import TypedDict
from IPython.display import Image, display
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain_core.messages import ToolMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
State
Let’s establish the system's state, which will evolve as interactions with the agent take place.
class State(TypedDict):
    messages: Annotated[list, add_messages]
The State class, defined as a TypedDict, represents the structure of the state within the application. It contains a single key, messages, an annotated list to hold message objects. The add_messages annotation indicates that this list is designed to store the messages exchanged between the human user and the AI model.
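To see what the add_messages reducer does, here is a small illustration (the message contents are invented): instead of overwriting the list, it appends new messages to the existing ones.

from langchain_core.messages import AIMessage, HumanMessage

existing = [HumanMessage(content="Hi")]
update = [AIMessage(content="Hello! How can I help?")]
# add_messages merges the update into the existing list rather than replacing it.
merged = add_messages(existing, update)
print(len(merged))  # 2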
Tools
The provided tools simulate API functions for a calendar and email interaction system.
@tool("get_calender_events")
def get_calendar_events(start_date, end_date):
"""Get calendar events for a given date range"""
return [
{"title": "Team Meeting", "date": "2024-10-03 14:00"},
{"title": "Project Review", "date": "2024-10-04 10:00"}
]
@tool("send_email")
def send_email(to, subject, body):
""""Send an email"""
return f"Email sent to {to} with subject: {subject}"
- get_calendar_events: This function retrieves calendar events within a specified date range. It currently returns a static list of events, including a "Team Meeting" and a "Project Review," with their corresponding dates. In a production environment, this would be replaced with actual API calls to fetch dynamic calendar data (a sketch follows this list).
- send_email: This function simulates sending an email. It takes recipient details, the email subject, and the body as inputs and returns a confirmation message indicating the email was sent. In practice, this would connect to an email service to deliver messages.
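For reference, a production get_calendar_events might wrap the Google Calendar API along these lines. This is a rough sketch that assumes the google-api-python-client package is installed and OAuth credentials (creds) have already been obtained:

from googleapiclient.discovery import build

def get_calendar_events_real(start_date, end_date, creds):
    # Query the Google Calendar API for events within the date range.
    service = build("calendar", "v3", credentials=creds)
    result = service.events().list(
        calendarId="primary",
        timeMin=start_date,  # RFC3339 timestamps, e.g. "2024-10-03T00:00:00Z"
        timeMax=end_date,
        singleEvents=True,
        orderBy="startTime",
    ).execute()
    return result.get("items", [])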
As we define our custom tools, we use the @tool decorator to mark these functions. Additionally, each function includes a description to guide the LLM on when to utilize the tool appropriately. Now, we will create a Node for the tool.
class ToolNode:
    def __init__(self, tools: list) -> None:
        self.tools_by_name = {tool.name: tool for tool in tools}

    def __call__(self, inputs: dict):
        if messages := inputs.get("messages", []):
            message = messages[-1]
        else:
            raise ValueError("No message found in input")
        outputs = []
        for tool_call in message.tool_calls:
            tool_result = self.tools_by_name[tool_call["name"]].invoke(
                tool_call["args"]
            )
            outputs.append(
                ToolMessage(
                    content=json.dumps(tool_result),
                    name=tool_call["name"],
                    tool_call_id=tool_call["id"],
                )
            )
        return {"messages": outputs}

tool_node = ToolNode(tools=[get_calendar_events, send_email])
The ToolNode class manages the execution of custom tools within the application. It initializes with a list of tools and creates a dictionary mapping tool names to their corresponding functions. When called with an input dictionary, it extracts the last message, checking for the presence of tool_calls. It utilizes the tool-calling capabilities of LLMs, which are supported by providers such as Anthropic, OpenAI, Google Gemini, and several others.
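For orientation, the tool_calls attribute on an AI message is a list of dictionaries shaped roughly like this (the values below are invented for illustration):

# message.tool_calls, as produced by a tool-calling LLM (values invented):
# [
#     {
#         "name": "send_email",
#         "args": {"to": "john@example.com", "subject": "...", "body": "..."},
#         "id": "call_abc123",
#     }
# ]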
To route to END when the LLM's response contains no tool calls, we define a conditional-edge routing function:
def route(state: State):
    if isinstance(state, list):
        ai_message = state[-1]
    elif messages := state.get("messages", []):
        ai_message = messages[-1]
    else:
        raise ValueError(f"No messages found in input state to tool_edge: {state}")
    if hasattr(ai_message, "tool_calls") and len(ai_message.tool_calls) > 0:
        return "tools"
    return END
It evaluates the provided state to extract the last AI message, checking if the state is a list or a dictionary containing "messages." If the message contains tool calls, it returns "tools"; otherwise, it returns END. This function effectively manages the conversation flow based on the existence of tool calls, directing the system's interactions accordingly.
LLM
The large language model (LLM) will power our planning and reasoning module. We will leverage the LLM's tool-calling capability to inform it about the available tools. Tool calling enables LLMs to engage with external resources and execute actions that extend beyond their inherent knowledge.
llm = ChatOpenAI(model="gpt-4o", temperature=0, api_key="<API-KEY>")
Let’s also bind our LLM with the tools available.
llm_with_tools = llm.bind_tools([get_calendar_events, send_email])
Now, we will create an agent function that takes the current state and passes it to the LLM to reason about the response:
def agent(state: State):
    return {"messages": [llm_with_tools.invoke(state["messages"])]}
Graph Building
A StateGraph object represents the architecture of our agent as a state machine. We will add nodes to signify the large language model (LLM) and the functions the chatbot can invoke, while edges will define how the chatbot transitions between these functions. Let’s define our StateGraph for the current state:
graph_builder = StateGraph(State)
Next, we will introduce two nodes: the agent node and the tool node.
graph_builder.add_node("agent", agent)
graph_builder.add_node("tools", tool_node)
Let’s add edges:
graph_builder.add_conditional_edges(
    "agent",
    route,
    {"tools": "tools", END: END},
)
graph_builder.add_edge("tools", "agent")
graph_builder.add_edge(START, "agent")

# Compile the builder into a runnable graph.
graph = graph_builder.compile()
The code snippet sets up conditional edges in the StateGraph. It connects the "agent" node to the routing function (route), which directs transitions either to the "tools" node or to the END state. An edge is also established from the "tools" node back to the "agent," enabling interaction between them, and the graph starts from the START state and transitions to the "agent," initiating the flow of the conversation. Finally, compile() turns the builder into the runnable graph we will invoke below. Here is what it looks like:
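Because we imported Image and display earlier, we can render the compiled graph with LangGraph's built-in Mermaid helper (depending on your setup, rendering the PNG may require an internet connection):

display(Image(graph.get_graph().draw_mermaid_png()))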
Graph of the Task Automation Agent
Results
Let’s define a function for outputting results to the console:
def run_agent(user_input: str):
    for event in graph.stream({"messages": [("user", user_input)]}):
        for value in event.values():
            print("Assistant:", value["messages"][-1].content)
The run_agent function takes user input as a string and iterates through the graph's events. For each event, it retrieves the latest message from the assistant and prints it to the console.
run_agent("Schedule a meeting with John next Tuesday at 2 PM. The email is as follows john@example.com")
Here are the results:
Assistant: "Email sent to john@example.com with subject: Meeting Scheduled"
Assistant: The meeting with John has been scheduled for next Tuesday at 2 PM. An email has been sent to john@example.com to confirm the meeting. If there are any conflicts or adjustments needed, please let me know!
Currently, the tool functions do not call any real APIs, but integrating them would be straightforward, requiring only the relevant API keys.
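For instance, send_email could be backed by Python's standard smtplib. Here is a minimal sketch in which the SMTP host, sender address, and password are placeholders:

import smtplib
from email.message import EmailMessage

def send_email_real(to, subject, body):
    msg = EmailMessage()
    msg["From"] = "agent@example.com"  # placeholder sender
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    # Placeholder SMTP server and credentials.
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("agent@example.com", "<PASSWORD>")
        server.send_message(msg)
    return f"Email sent to {to} with subject: {subject}"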
Challenges
While this technology offers transformative potential, several challenges need to be addressed.
- Role Adaptation: LLM agents often struggle to effectively assume specific roles required for tasks. Fine-tuning the LLM based on data reflecting uncommon roles or psychological profiles can help improve performance.
- Context Limitations: LLMs' finite context length restricts their ability to incorporate historical information and detailed instructions.
- Efficiency: LLM agents issue numerous requests to the LLM to complete a single task, so action efficiency depends heavily on inference speed.
Conclusion
AI agents are specialized software systems that operate autonomously, performing tasks on behalf of users or other systems. LLM agents use LLMs to understand inputs and generate responses to queries. They are equipped with knowledge bases, such as vector stores, to keep track of previous interactions, and they possess feedback mechanisms, via self-reflection techniques, that allow continuous improvement of their performance.
Building applications would be more difficult if we had to manage components individually. Tools like LangGraph simplify the process, enabling the creation of LLM agents to automate repetitive tasks.