RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation



Original Paper: https://arxiv.org/abs/2403.05313

By: Zihao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, Yitao Liang

Abstract:

We explore how iteratively revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while greatly mitigating hallucination.

In particular, the proposed method -- *retrieval-augmented thoughts* (RAT) -- revises each thought step one by one with retrieved information relevant to the task query, the current and the past thought steps, after the initial zero-shot CoT is generated.

Applying RAT to GPT-3.5, GPT-4, and CodeLLaMA-7b substantially improves their performance on various long-horizon generation tasks, with average relative improvements of 13.63% on code generation, 16.96% on mathematical reasoning, 19.2% on creative writing, and 42.78% on embodied task planning.

The demo page can be found at this https URL

Summary Notes


Figure: Pipeline of Retrieval Augmented Thoughts (RAT). Given a task prompt (denoted as I in the figure), RAT starts from initial step-by-step thoughts (T1, T2, ..., Tn) produced by an LLM in zero-shot mode ("let's think step by step"). Some thought steps (such as T1 in the figure) may be flawed due to hallucination. RAT iteratively revises each thought step (T1*, T2*, ..., T(i-1)*, Ti) using RAG from an external knowledge base (denoted as Library).

Introduction


In the rapidly evolving landscape of artificial intelligence, large language models (LLMs) have emerged as powerful tools for natural language processing tasks.

However, their application in long-horizon generation tasks, such as code generation, mathematical reasoning, and task planning, is often hampered by hallucinations—errors in factual correctness that arise during intermediate reasoning steps.

To address this challenge, a novel method known as Retrieval-Augmented Thoughts (RAT) has been proposed, significantly improving the reasoning and generation capabilities of LLMs.

Key Methodologies


The RAT methodology synergizes two advanced techniques: Chain-of-Thought (CoT) prompting and Retrieval-Augmented Generation (RAG). The process begins with the LLM generating an initial zero-shot CoT, which is then iteratively revised using information retrieved from external sources.

This progressive approach ensures that each thought step is informed by accurate and relevant information, akin to human problem-solving processes that adjust reasoning based on new data.

  1. Initial Thought Generation: The LLM generates a series of intermediate reasoning steps (or thoughts) in response to a given task prompt. This zero-shot CoT serves as the foundation for subsequent revisions.
  2. Information Retrieval and Revision: Each thought step is revised using information retrieved from an external knowledge base. The retrieval process focuses on the task prompt and previous thought steps, ensuring contextually relevant updates.
  3. Progressive Refinement: Unlike traditional methods that retrieve and revise the entire CoT at once, RAT employs a step-by-step approach. This ensures that each thought is meticulously evaluated and refined, reducing the likelihood of introducing errors at already correct steps.
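The three steps above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `generate` (an LLM call) and `retrieve` (a search over an external knowledge base) are hypothetical stand-ins, and the prompt wording is an assumption.

```python
def rat(task_prompt, generate, retrieve, n_docs=3):
    """Sketch of Retrieval-Augmented Thoughts (RAT).

    generate: callable(str) -> str, a hypothetical LLM completion function.
    retrieve: callable(str, k=int) -> list[str], a hypothetical retriever
              over an external knowledge base ("Library").
    """
    # 1. Initial thought generation: zero-shot CoT, split into steps T1..Tn.
    draft = generate(f"{task_prompt}\nLet's think step by step.")
    thoughts = [t for t in draft.split("\n") if t.strip()]

    revised = []
    for thought in thoughts:
        # 2. Retrieval keyed on the task prompt, the steps revised so far,
        #    and the current (possibly flawed) step.
        query = " ".join([task_prompt, *revised, thought])
        docs = retrieve(query, k=n_docs)

        # 3. Progressive refinement: revise only the current step,
        #    conditioned on the retrieved text and the revised prefix.
        prompt = (
            f"Task: {task_prompt}\n"
            f"Retrieved information:\n" + "\n".join(docs) + "\n"
            f"Steps so far:\n" + "\n".join(revised) + "\n"
            f"Revise this step for factual correctness:\n{thought}"
        )
        revised.append(generate(prompt))

    return "\n".join(revised)
```

Revising one step at a time, rather than the whole chain at once, is what keeps already-correct earlier steps from being disturbed by later retrievals.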

Main Findings and Results


The implementation of RAT across various LLMs, including GPT-3.5, GPT-4, and CodeLLaMA-7b, has yielded substantial improvements in performance on long-horizon tasks. Key findings include:

  • Code Generation: RAT achieved average relative improvements in pass rates of 13.63% on the HumanEval benchmark and 18.89% on HumanEval+, highlighting enhanced accuracy in generating correct code on first attempts.
  • Mathematical Reasoning: The accuracy on GSM8K and GSMHard mathematical reasoning benchmarks improved by 8.37% and 31.37%, respectively, demonstrating RAT's effectiveness in solving complex multi-step problems.
  • Embodied Planning: In open-ended environments like Minecraft, RAT led to a 42.78% increase in plausibility scores for task plans, underscoring its ability to generate feasible, contextually appropriate strategies.
  • Creative Writing: RAT's iterative revision process boosted the quality of open-ended text generation tasks, achieving a 19.19% improvement in human scores for creative writing assignments.


Implications and Applications


The implications of RAT extend beyond performance metrics, offering a transformative approach to enhancing LLMs' reasoning capabilities.

By integrating external knowledge dynamically, RAT reduces hallucinations and increases factual accuracy, making it applicable to various real-world scenarios:

  • Software Development: Enhanced code generation capabilities can streamline the software development process, reducing debugging time and improving code reliability.
  • Educational Tools: Improved mathematical reasoning can facilitate the creation of intelligent tutoring systems that adaptively guide students through complex problem-solving exercises.
  • Automated Planning: In robotics and automated systems, RAT can enhance task planning, enabling machines to execute multi-step operations with greater precision and reliability.


Conclusion


The introduction of Retrieval-Augmented Thoughts marks a significant advancement in the field of artificial intelligence, addressing critical challenges in long-horizon reasoning tasks.

By iteratively refining thought processes with contextually relevant information, RAT not only improves the accuracy and efficiency of LLMs but also opens new avenues for their application in diverse domains.

As we continue to explore the intersections of human-like reasoning and machine learning, RAT stands as a testament to the potential of integrating retrieval mechanisms with advanced language models, paving the way for more intelligent and reliable AI systems.
