Original Paper: https://arxiv.org/abs/2408.07055
By: Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
Abstract:
Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that the model's effective generation length is inherently bounded by the sample it has seen during supervised fine-tuning (SFT). In other words, their output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT data with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality. We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models. In general, our work demonstrates that existing long context LLM already possesses the potential for a larger output window--all you need is data with extended output during model alignment to unlock this capability.
Summary Notes
The paper "LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs" by authors Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li addresses the limitations of current long context large language models (LLMs) in generating outputs exceeding 2,000 words.
Despite being able to process inputs up to 100,000 tokens, existing models struggle with producing longer outputs due to the lack of sufficient training data. To overcome this challenge and unlock the potential of long context LLMs for larger output windows, the authors introduce AgentWrite - an agent-based pipeline that breaks down ultra-long generation tasks into subtasks.
This approach enables off-the-shelf LLMs to produce coherent outputs surpassing 20,000 words. Leveraging AgentWrite, the researchers create the LongWriter-6k dataset, comprising 6,000 supervised fine-tuning (SFT) examples with output lengths ranging from 2k to 32k words.
By incorporating this dataset into model training, they successfully extend the output length of existing models to over 10,000 words while maintaining high output quality. Furthermore, the team develops LongBench-Write as a comprehensive benchmark for evaluating ultra-long generation capabilities.
Their 9B parameter model, enhanced through DPO, achieves state-of-the-art performance on this benchmark and outperforms even much larger proprietary models. In conclusion, the study demonstrates how innovative approaches like AgentWrite can unlock the capability of existing LLMs to generate outputs exceeding 10,000 words.
The findings underscore the importance of dataset composition in enhancing model performance and highlight avenues for further advancements in natural language processing tasks requiring extensive text generation.
Introduction
In recent years, Large Language Models (LLMs) have made significant strides in processing extensive inputs, with capabilities extending up to 100,000 tokens. However, generating outputs of a similar magnitude has remained a persistent challenge.
This blog post delves into groundbreaking research that tackles this issue head-on, presenting a novel approach to enable LLMs to generate coherent outputs exceeding 10,000 words.
The Challenge of Long-Form Generation
Despite the advancements in LLMs' ability to handle extensive inputs, their output capabilities have lagged.
Most models struggle to produce outputs longer than 2,000 words. This discrepancy is not just a minor inconvenience; it's a significant barrier, especially considering that over 1% of user prompts explicitly request outputs exceeding this limit.
The primary culprit? The scarcity of long-output examples in Supervised Fine-Tuning (SFT) datasets.
Methodology: The AgentWrite Pipeline
To address this limitation, the researchers introduced AgentWrite, a novel agent-based pipeline that decomposes ultra-long generation tasks into manageable subtasks. Let's break down how this works:
Step 1: Planning
AgentWrite begins by crafting a detailed writing plan based on the user's input. This plan outlines the structure and target word count for each paragraph. Inspired by how human writers create outlines before diving into the full text, this step ensures that the generated content remains coherent and well-structured.
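To make the planning step more concrete, here is a minimal sketch of how a generated plan could be turned into structured data. The plan line format, the `PlanStep` class, and the `parse_plan` helper are all assumptions made for illustration; the post does not specify the exact format AgentWrite uses.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class PlanStep:
    index: int        # paragraph number in the plan
    main_point: str   # what the paragraph should cover
    word_count: int   # target length for the paragraph


# Hypothetical line format (an assumption, not the paper's exact prompt output):
# "Paragraph 1 - Main Point: ... - Word Count: 400 words"
_PLAN_LINE = re.compile(
    r"Paragraph\s+(\d+)\s*-\s*Main Point:\s*(.+?)\s*-\s*Word Count:\s*(\d+)",
    re.IGNORECASE,
)


def parse_plan(plan_text: str) -> list[PlanStep]:
    """Extract one PlanStep per matching line; lines that do not match are skipped."""
    steps = []
    for line in plan_text.splitlines():
        match = _PLAN_LINE.search(line)
        if match:
            steps.append(
                PlanStep(int(match.group(1)), match.group(2), int(match.group(3)))
            )
    return steps


example_plan = (
    "Paragraph 1 - Main Point: Introduce the history of the Roman Empire"
    " - Word Count: 400 words\n"
    "Paragraph 2 - Main Point: Describe the rise of the Republic"
    " - Word Count: 800 words"
)
steps = parse_plan(example_plan)
total_words = sum(step.word_count for step in steps)
```

Summing the per-paragraph targets gives a quick check that the plan roughly matches the overall length the user asked for before any text is written.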
Step 2: Writing
Following the plan, the model generates content for each paragraph sequentially. This divide-and-conquer approach allows the model to handle smaller, more manageable chunks of text, which are then concatenated to form the final long output.
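The plan-then-write loop can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: `agent_write`, the prompt wording, and the `echo_model` stand-in are hypothetical, and passing the previously written text back into each prompt is an assumption about how coherence could be maintained between paragraphs.

```python
from typing import Callable, List, Tuple


def agent_write(
    instruction: str,
    plan: List[Tuple[str, int]],
    generate: Callable[[str], str],
) -> str:
    """Write one paragraph per plan entry, in order, then concatenate them.

    `plan` holds (main_point, target_word_count) pairs; `generate` is any
    function that maps a prompt string to generated text.
    """
    written: List[str] = []
    for main_point, word_count in plan:
        so_far = "\n\n".join(written)
        # Hypothetical prompt layout; a real pipeline would tune this wording.
        prompt = (
            f"Writing instruction: {instruction}\n"
            f"Text written so far:\n{so_far}\n"
            f"Now write the paragraph on: {main_point} (about {word_count} words)"
        )
        written.append(generate(prompt).strip())
    return "\n\n".join(written)


def echo_model(prompt: str) -> str:
    # Stand-in for a real LLM call: echoes the final line of the prompt.
    return "DRAFT: " + prompt.splitlines()[-1]


article = agent_write(
    "Write a long article about the Roman Empire",
    [("intro", 300), ("rise of the Republic", 800)],
    echo_model,
)
paragraphs = article.split("\n\n")
```

Swapping `echo_model` for a function that calls an actual model is the only change needed to try this against a real LLM; because each call only has to produce one paragraph, no single generation needs to exceed the model's usual output length.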
Key Findings and Results
Using AgentWrite, the researchers were able to construct a new dataset named LongWriter-6k, comprising 6,000 long-output examples ranging from 2,000 to 32,000 words. Incorporating this dataset into model training scaled up the output length of existing models to over 10,000 words without compromising quality.
To evaluate the effectiveness of their approach, the researchers developed LongBench-Write, a benchmark designed specifically for assessing ultra-long generation capabilities.
Their 9B parameter model, further improved through Direct Preference Optimization (DPO), achieved state-of-the-art performance, even outperforming larger proprietary models.
Implications and Applications
The ability to generate ultra-long outputs has profound implications across various fields. Here are a few potential applications:
- Academic Research: Researchers can leverage LLMs to generate comprehensive literature reviews or extensive research papers.
- Content Creation: Writers and marketers can use LLMs to produce detailed articles, eBooks, and reports.
- Technical Documentation: Engineers and technical writers can benefit from LLMs' ability to generate exhaustive documentation and manuals.
Limitations and Future Research
While the results are promising, there are still limitations to be addressed. The primary limitation is that a model's output length remains bounded by the length of the outputs it sees in its SFT data.
As such, future research should focus on constructing SFT data with even longer outputs to further push the boundaries of LLM capabilities.
Additionally, the researchers noted occasional minor repetitions in the outputs generated using AgentWrite. Refining the framework to achieve higher quality long-output data is an area ripe for exploration.
Conclusion
The research presented in this blog post demonstrates that by leveraging innovative methodologies like AgentWrite and expanding SFT datasets, we can unlock the full potential of LLMs in generating ultra-long outputs.
This breakthrough not only enhances the capabilities of current models but also opens the door to a myriad of applications that require detailed and extensive textual content.