Instruction tuning has emerged as a means to enhance the capabilities of large language models (LLMs) to comprehend instructions and generate appropriate responses. Existing methods either manually annotate data or employ LLMs (e.g., the GPT series) to generate data for instruction tuning. However, they often overlook associating instructions with existing annotated datasets. In this paper, we propose DYNOSAUR, a dynamic growth paradigm for the automatic curation of instruction-tuning data. Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions. By leveraging existing annotated datasets, DYNOSAUR offers several advantages: 1) it reduces the API cost of generating instructions (e.g., it costs less than $12 USD in calls to GPT-3.5-turbo to generate 800K instruction-tuning samples); 2) it provides high-quality data for instruction tuning (e.g., it outperforms ALPACA and FLAN on SUPER-NI and LONGFORM with comparable data sizes); and 3) it supports the continuous improvement of models by generating instruction-tuning data whenever a new annotated dataset becomes available. We further investigate a continual learning scheme for learning with the ever-growing instruction-tuning dataset, and demonstrate that replaying tasks with diverse instruction embeddings not only helps mitigate forgetting but also improves generalization to unseen tasks. Code and data are available at https://github.com/WadeYin9712/Dynosaur.
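The two steps the abstract summarizes lend themselves to short sketches. First, a minimal sketch of metadata-driven instruction generation, assuming the OpenAI Python client (v1+); the prompt wording, the JSON output schema, and the helper names generate_tasks and build_examples are illustrative assumptions, not the paper's exact prompts or code.

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate_tasks(metadata: dict) -> list[dict]:
        """Ask the LLM to propose (instruction, input fields, output field)
        triples from a dataset's metadata alone; the full annotated records
        are never sent, which keeps the per-sample API cost low."""
        prompt = (
            "Given this dataset metadata, propose tasks it could support.\n"
            f"Name: {metadata['name']}\n"
            f"Description: {metadata['description']}\n"
            f"Fields: {', '.join(metadata['fields'])}\n"
            'Return a JSON list of objects with keys "instruction", '
            '"input_fields", and "output_field".'
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        # Assumes the model returns well-formed JSON; production code would
        # validate the proposed field names against the dataset schema.
        return json.loads(resp.choices[0].message.content)

    def build_examples(task: dict, records: list[dict]) -> list[dict]:
        """Instantiate one instruction-tuning example per annotated record
        by filling in the selected fields; annotations are reused as-is."""
        return [
            {
                "instruction": task["instruction"],
                "input": "\n".join(f"{f}: {r[f]}" for f in task["input_fields"]),
                "output": r[task["output_field"]],
            }
            for r in records
        ]

Second, one plausible reading of "replaying tasks with diverse instruction embeddings" is to embed each task's instruction and keep a replay buffer that spreads across the embedding space. The sketch below assumes sentence-transformers embeddings and a one-pick-per-k-means-cluster rule; both are illustrative choices, not necessarily the paper's setup.

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    def select_replay_tasks(instructions: list[str], budget: int) -> list[int]:
        """Return `budget` task indices whose instructions are spread out
        in embedding space (the task nearest each cluster center)."""
        model = SentenceTransformer("all-MiniLM-L6-v2")
        emb = model.encode(instructions)              # shape (n_tasks, dim)
        km = KMeans(n_clusters=budget, n_init=10).fit(emb)
        picks = []
        for c in range(budget):
            members = np.where(km.labels_ == c)[0]
            dists = np.linalg.norm(emb[members] - km.cluster_centers_[c], axis=1)
            picks.append(int(members[np.argmin(dists)]))
        return picks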
Yin, D., Liu, X., Yin, F., Zhong, M., Bansal, H., Han, J., & Chang, K. W. (2023). DYNOSAUR: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), pp. 4031–4047. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.245