Top 10 LLM Papers of the Week

As we move through mid-March, the AI world isn't slowing down; if anything, it's picking up speed. From major tech giants launching advanced AI models to cutting-edge tools reshaping industries, innovation is moving at a rapid pace.
In this article, we spotlight the Top 10 Cutting-Edge Research Papers on AI Agents, RAG, and LLM Evaluations from this week, breaking down key insights, examining their impact, and highlighting their role in advancing AI capabilities.
Weekly Research Paper Tracker
Before we dive into this week's papers, check out the weekly Research Paper Tracker, where we compile the top research papers of the week by category, including their titles, summaries, links, and why they matter. It's a great resource to bookmark if you want to stay ahead of the curve on the latest research. Access it here.
Let's dive into this week's papers:
1) A Survey on Trustworthy LLM Agents: Threats and Countermeasures
This survey introduces the TrustAgent framework, a comprehensive study on the trustworthiness of LLM-based agents and Multi-Agent Systems (MAS). It categorizes trust into intrinsic (brain, memory, tools) and extrinsic (user, agent, environment) aspects while summarizing emerging attacks, defenses, and evaluation methods. The paper extends the concept of Trustworthy LLM to Trustworthy Agent, providing insights into technical implementations and future directions.
Why it Matters:
As LLM-based agents become more complex, ensuring their trustworthiness is crucial for safe and effective deployment. This framework offers a structured approach to evaluating and enhancing trust in AI-driven multi-agent systems.
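To make the survey's split concrete, here is a minimal Python sketch of the taxonomy as a lookup structure. The module names follow the paper's categorization; the example threats listed are illustrative, not an exhaustive mapping from the survey.

```python
# Illustrative data structure for the TrustAgent taxonomy.
# Module names follow the survey; the example threats are
# representative attacks from the agent-security literature.
TRUST_TAXONOMY = {
    "intrinsic": {  # trustworthiness of the agent's own components
        "brain":  ["jailbreaking", "prompt injection", "hallucination"],
        "memory": ["memory poisoning", "privacy leakage"],
        "tools":  ["tool misuse", "malicious tool responses"],
    },
    "extrinsic": {  # trustworthiness of the agent's interactions
        "user":        ["social engineering", "harmful instructions"],
        "agent":       ["rogue peer agents", "collusion in MAS"],
        "environment": ["adversarial observations", "unsafe actions"],
    },
}

def threats_for(aspect: str, module: str) -> list[str]:
    """Look up example threats for a given aspect/module pair."""
    return TRUST_TAXONOMY.get(aspect, {}).get(module, [])

print(threats_for("intrinsic", "memory"))  # ['memory poisoning', 'privacy leakage']
```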
2) API Agents vs. GUI Agents: Divergence and Convergence
This paper presents the first comprehensive comparison of API-based and GUI-based LLM agents, analyzing their differences in architecture, development, and user interaction. It explores how hybrid approaches can leverage their strengths and provides decision criteria for selecting or combining these paradigms. The study suggests that future innovations will blur the distinction between API- and GUI-driven agents, leading to more adaptive automation solutions.
Why it Matters:
As LLM-driven automation expands, understanding the trade-offs between API and GUI approaches helps developers create more efficient, flexible AI systems for diverse real-world applications.
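The hybrid pattern the paper points toward can be sketched as an agent that prefers a structured API call when one exists and falls back to GUI automation otherwise. This is a minimal sketch under that assumption; call_api and drive_gui are hypothetical placeholders, not interfaces from the paper.

```python
# Hypothetical hybrid API/GUI agent: prefer the structured API path
# when the target app exposes one, fall back to GUI automation.
from typing import Optional

def call_api(task: str) -> Optional[str]:
    """Placeholder: attempt the task via a documented API; None if unsupported."""
    supported = {"create_issue": "issue created via API"}  # toy capability table
    return supported.get(task)

def drive_gui(task: str) -> str:
    """Placeholder: perform the task by simulating clicks and keystrokes."""
    return f"completed '{task}' via GUI automation"

def hybrid_agent(task: str) -> str:
    result = call_api(task)   # fast, reliable, structured path
    if result is not None:
        return result
    return drive_gui(task)    # universal but slower fallback

print(hybrid_agent("create_issue"))   # takes the API path
print(hybrid_agent("export_report"))  # falls back to the GUI
```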
3) ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
ZeroSumEval is a competitive, game-based evaluation framework for LLMs that assesses capabilities like strategic reasoning, planning, and adaptability. It includes diverse games such as Capture the Flag, chess, and MathQuiz, offering a standardized and extensible approach. By leveraging DSPy, it improves LLM strategy abstraction and enhances prior game-based evaluation methods.
Why it Matters:
This framework provides a dynamic, scalable way to measure LLM performance in real-world scenarios, ensuring more rigorous and comprehensive assessments of AI capabilities.
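At its core, game-based evaluation pits two models against each other and scores outcomes rather than static answers. Here is a minimal, self-contained sketch of that loop using matching pennies with stub players; it illustrates the idea, not ZeroSumEval's actual API.

```python
# Minimal sketch of inter-model competition: two "models" play repeated
# matching pennies and we tally wins. Real frameworks like ZeroSumEval
# use richer games (chess, CTF, MathQuiz) and real LLM players.
import random
from typing import Callable

Player = Callable[[list], str]  # sees move history, returns "heads"/"tails"

def random_player(history: list) -> str:
    return random.choice(["heads", "tails"])

def exploit_player(history: list) -> str:
    # Assume the opponent repeats their last move; play the opposite
    # to force a mismatch (this player wins on mismatches).
    if not history:
        return "heads"
    return "tails" if history[-1][0] == "heads" else "heads"

def run_match(p1: Player, p2: Player, rounds: int = 1000) -> dict:
    history, wins = [], {"p1": 0, "p2": 0}
    for _ in range(rounds):
        m1, m2 = p1(history), p2(history)
        wins["p1" if m1 == m2 else "p2"] += 1  # p1 wins when moves match
        history.append((m1, m2))
    return wins

print(run_match(random_player, exploit_player))
```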
4) Teamwork makes the dream work: LLMs-Based Agents for GitHub README.MD Summarization
Metagente is a novel multi-agent framework that enhances LLM cooperation through iterative evaluation, feedback, and optimization. It enables specialized LLMs to refine prompts collaboratively, with a teacher agent aggregating results. Tested on GitHub README summarization, Metagente significantly outperforms baselines such as GitSum, LLaMA-2, and GPT-4o, achieving up to 60.43% higher accuracy while using minimal data.
Why it Matters:
This approach demonstrates how multi-agent LLM collaboration can vastly improve performance, paving the way for more efficient and intelligent AI-driven workflows in various domains.
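The core loop, where specialists propose and a teacher scores and feeds back, can be sketched as follows. The generation and scoring functions below are crude stand-ins for LLM calls and ROUGE-style metrics, not the paper's implementation.

```python
# Toy sketch of teacher-guided iterative prompt refinement.
# `specialist` and the feedback step stand in for LLM calls.
def specialist(prompt: str, readme: str) -> str:
    # Placeholder generation: take the first N words, N encoded in the prompt.
    n = int(prompt.split("=")[-1])
    return " ".join(readme.split()[:n])

def score(summary: str, reference: str) -> float:
    """Crude unigram recall as a stand-in for ROUGE-style scoring."""
    s, r = set(summary.lower().split()), set(reference.lower().split())
    return len(s & r) / max(len(r), 1)

def metagente_loop(readme: str, reference: str, rounds: int = 5) -> str:
    prompt, best, best_score = "summarize: max_words=5", "", 0.0
    for _ in range(rounds):
        candidate = specialist(prompt, readme)
        s = score(candidate, reference)
        if s > best_score:
            best, best_score = candidate, s
        # "Teacher" feedback: nudge the prompt toward longer summaries
        # when recall is low (a stand-in for LLM-generated feedback).
        if s < 0.8:
            n = int(prompt.split("=")[-1]) + 2
            prompt = f"summarize: max_words={n}"
    return best

readme = "A fast Python library for parsing GitHub README files with zero dependencies"
print(metagente_loop(readme, reference="fast Python library for parsing README files"))
```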
5) Guardians of the Agentic System: preventing many shot jailbreaking with agentic system
This paper examines many-shot jailbreaking, an attack in which an adversary packs an agent's context window with adversarial demonstrations to override its safety training. It investigates how such attacks compromise LLM-based agents and explores countermeasures built from the agentic system itself, using guard agents to detect and contain compromised or rogue behavior.
Why it Matters:
As agents gain autonomy and tool access, a single jailbroken agent can put an entire pipeline at risk. Building defenses into the agentic system itself offers a scalable safeguard against prompt-based attacks.
6) OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning
This paper identifies inconsistencies in conventional information retrieval (IR) relevance when applied to retrieval-augmented generation (RAG). To address this, the authors propose OpenRAG, an end-to-end optimized framework that fine-tunes the retriever for in-context relevance. Experiments show OpenRAG improves retrieval accuracy by 4.0% over the baseline and outperforms state-of-the-art retrievers by 2.1%, demonstrating that a 0.2B tuned retriever can rival much larger 8B LLMs in some tasks.
Why it Matters:
OpenRAG enhances RAG efficiency while reducing reliance on massive LLMs, offering a cost-effective way to improve retrieval relevance and generation quality in AI-driven applications.
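The general idea of end-to-end retrieval learning, aligning the retriever's ranking with the passages that actually help the generator, can be illustrated with a distillation-style loss. The PyTorch sketch below is a generic rendering of that idea, not OpenRAG's exact objective; the generator log-likelihoods are stubbed with random values.

```python
# Generic sketch of in-context retrieval learning: distill "which docs
# help the generator" into the retriever's scores.
import torch
import torch.nn.functional as F

k = 8                                                   # candidates per query
retriever_scores = torch.randn(k, requires_grad=True)   # retriever logits
# Generator signal: log-likelihood of the gold answer when each doc is
# placed in context (stub values; a real system would compute these).
gen_loglik = torch.randn(k)

# Soft relevance labels from the generator; the retriever learns to match them.
target = F.softmax(gen_loglik, dim=0)
log_probs = F.log_softmax(retriever_scores, dim=0)
loss = F.kl_div(log_probs, target, reduction="sum")

loss.backward()  # gradients flow back into the retriever's scores
print(f"loss={loss.item():.4f}")
```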
7) LLM Agents Display Human Biases but Exhibit Distinct Learning Patterns
This study examines how LLMs make decisions in Decisions from Experience tasks and compares their behavior to humans. While both LLMs and humans underweight rare events and show correlation effects, the underlying processes differ: LLMs exhibit strong recency biases, whereas humans respond in more sophisticated ways. Key human behaviors, such as "surprise triggers change" and the "wavy recency effect", are absent in LLMs.
Why it Matters:
These findings highlight the limitations of LLMs in simulating human decision-making, emphasizing the need for deeper behavioral analysis before applying them to psychological and economic modeling.
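The underweighting the study probes comes from tasks like the following: a safe option versus a risky option with a rare, large payoff. This toy simulation, an illustration of the task rather than the paper's code, shows how a strongly recency-weighted learner drifts toward the safe option even though the risky one has higher expected value.

```python
# Toy decisions-from-experience task: SAFE pays 3 always; RISKY pays 32
# with p=0.1 (expected value 3.2 > 3). A recency-weighted learner tends
# to underweight the rare payoff and prefer SAFE.
import random

random.seed(0)
est = {"SAFE": 0.0, "RISKY": 0.0}  # recency-weighted value estimates
alpha = 0.5                         # high alpha = strong recency bias
choices = {"SAFE": 0, "RISKY": 0}

for t in range(2000):
    # epsilon-greedy choice between the two options
    if random.random() < 0.1:
        arm = random.choice(["SAFE", "RISKY"])
    else:
        arm = max(est, key=est.get)
    payoff = 3 if arm == "SAFE" else (32 if random.random() < 0.1 else 0)
    est[arm] += alpha * (payoff - est[arm])  # exponential recency weighting
    choices[arm] += 1

print(choices)  # SAFE dominates despite RISKY's higher expected value
```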
8) Augmenting Teamwork through AI Agents as Spatial Collaborators
This paper explores Human-AI Teams (HATs) in AR, where AI acts as an adaptive teammate rather than a static tool. It proposes that AI should recognize team-level needs and dynamically generate collaborative resources like virtual blackboards, mental maps, and spatial memory recall. This shift enables context-driven AI interventions to enhance teamwork and decision-making in immersive environments.
Why it Matters:
By moving beyond individual AI assistance, this approach unlocks new possibilities for AI-enhanced collaboration, improving efficiency and coordination in AR-based teamwork across various fields.
9) Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
Plan-and-Act is a novel framework that improves long-horizon task execution in LLM-based agents by separating high-level planning from low-level execution. It enhances plan generation through a synthetic data generation method that annotates trajectories with feasible plans. Tested on web navigation, it achieves a state-of-the-art 54% success rate on the WebArena-Lite benchmark.
Why it Matters:
By explicitly incorporating planning, this approach makes LLM-based agents more effective at complex, multi-step tasks, paving the way for better autonomous decision-making systems.
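The planner/executor split can be sketched as two cooperating components: a planner that emits a high-level step list and an executor that grounds each step, with replanning on failure. The stubs below are illustrative, not the paper's system.

```python
# Illustrative planner/executor split for long-horizon tasks.
# `plan` and `execute_step` stand in for LLM calls and a web environment.
def plan(goal: str) -> list[str]:
    # Placeholder high-level plan; a real planner is an LLM call.
    return ["open search page", "search for item", "open first result", "extract price"]

def execute_step(step: str, state: dict) -> bool:
    # Placeholder low-level executor; returning False triggers replanning.
    state.setdefault("log", []).append(step)
    return True

def plan_and_act(goal: str, max_replans: int = 2) -> dict:
    state = {}
    for attempt in range(max_replans + 1):
        steps = plan(goal)
        if all(execute_step(s, state) for s in steps):
            return state  # every step succeeded
        # on failure, replan using the partial state (omitted for brevity)
    raise RuntimeError("task failed after replanning")

print(plan_and_act("find the price of a USB-C cable")["log"])
```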
10) Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing
This paper explores test-time scaling for Multi-Document Summarization (MDS), introducing a prompt ensemble approach where multiple candidate summaries are generated and refined by an aggregator. To enhance evaluation, it proposes two new metrics: Consistency-Aware Preference (CAP) score and LLM Atom-Content-Unit (ACU) score, which improve contextual understanding and reduce positional bias. Experiments confirm improved summary quality and insights into scaling limitations.
Why it Matters:
By optimizing inference-time scaling for summarization, this approach enhances LLM-generated summaries and provides better evaluation metrics, advancing AI-driven content synthesis.
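The prompt-ensemble idea, sampling several candidate summaries under varied prompts and letting an aggregator fuse them, looks roughly like this in code; the llm function is a stub standing in for real model calls.

```python
# Sketch of test-time scaling via a prompt ensemble: generate candidate
# summaries under varied prompts, then aggregate. `llm` is a stub.
def llm(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return f"summary for [{prompt[:40]}...]"

def ensemble_summarize(docs: list[str], n_prompts: int = 4) -> str:
    styles = ["concise", "detailed", "chronological", "thematic"][:n_prompts]
    joined = "\n\n".join(docs)
    candidates = [llm(f"Write a {s} summary of:\n{joined}") for s in styles]
    # Aggregator pass: fuse the candidates into one consolidated summary.
    fusion_prompt = "Merge these summaries, keeping agreed-upon facts:\n" + "\n".join(candidates)
    return llm(fusion_prompt)

print(ensemble_summarize(["Doc A text...", "Doc B text..."]))
```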
Conclusion
As we move through mid-March, the latest AI research continues to push the boundaries of innovation across key areas like AI agents, benchmarking, and retrieval-augmented generation. From improving multi-agent collaboration to optimizing retrieval efficiency and evaluation techniques, these advancements are accelerating the development of more intelligent, scalable, and reliable AI systems.
Want to create your own AI-first spreadsheet like we did for our Research Paper Tracker?
Try it out and check out some real-world use cases. If this looks useful, let's connect; I'd love to hear how it can help your team!