Tackling Generated Datasets Diversity
In the previous chapter, we discussed the potential of using LLMs to generate synthetic datasets for further fine-tuning a local retriever model. That approach is possible thanks to the availability of a large corpus of unlabeled documents: each document is used to generate one or more synthetic queries, forming query-document pairs.
But what if Information Retrieval is not your task? Suppose you are working on a legal document classification problem but are not permitted to send any data to an external API. In this situation, you need to train a local model, yet data collection could become a significant obstacle and delay product development.
For simplicity, let's assume the goal is to generate children's stories. This task was the starting point for the research by Eldan et al. (2023). Each story consists of 2-3 paragraphs with a straightforward plot and theme, while the whole dataset covers a child's vocabulary and factual knowledge.
Language is not just a system of rules and symbols; it conveys and interprets meaning. The main challenge of using large language models to produce training data is ensuring dataset diversity. Even with a high generation temperature, models may produce repetitive datasets that lack the necessary diversity (even for a child's language). Coherence and relevance are further challenges in natural language generation.
To tackle the diversity issue, the authors prepared a vocabulary of about 1,500 basic words, mirroring a typical child's vocabulary, divided into nouns, verbs, and adjectives. For each generation, one verb, one noun, and one adjective were selected at random, and the model was asked to generate a story incorporating these random words.
This technique significantly broadens the dataset's diversity, ensuring that the stories cover a child's entire vocabulary and diverse combinations of concepts. In addition, the authors integrated potential story features (such as dialogue, plot twists, bad endings, or moral lessons). A random subset of these features was selected for each story, and the model was prompted to include them.
Prompt
Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would likely understand. The story should use the verb "{random.choice(verbs_list)}", the noun "{random.choice(nouns_list)}" and the adjective "{random.choice(adjectives_list)}". The story should have the following features: {random.choice(features_list)}, {random.choice(features_list)}. Remember to only use simple words!
Assuming the gaps are filled as follows:
Prompt
Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would likely understand. The story should use the verb "decorate", the noun "thunder" and the adjective "ancient". The story should have the following features: the story should contain at least one dialogue, the story has a bad ending. Remember to only use simple words!
Output
Once upon a time, in an ancient house, there lived a girl named Lily. She loved to decorate her room with pretty things. One day, she found a big box in the attic. She opened it and saw many shiny decorations. Lily was very happy and decided to use them in her room. As Lily was decorating her room, the sky outside became dark. There was a loud thunder sound, and Lily got scared. She ran to her mom and said, "Mommy, the thunder is so loud!" Her mom hugged her and said, "Don't worry, it will pass soon." But the thunder did not stop. It got louder and louder, and the ancient house started to shake. Suddenly, the roof fell down on the pretty decorations. Lily was sad because her room was not pretty anymore. The end.
Here we rely on the latest text generation models (GPT-3.5 and GPT-4), which can produce huge amounts of synthetic content following the instructions. Since the prompt is constructed randomly each time, and the model follows it precisely, the stories turn out remarkably diverse. Even if one or two words are shared, the narratives will be entirely different. In essence, this technique injects randomness into the prompt to generate a diverse dataset.
You get the idea:
- Identify which parameters/entities might vary between different samples in your synthetic dataset;
- Generate or manually compile a collection of such entities to fill the gaps;
- Produce the dataset by randomly selecting entities for insertion. It is best to set the generation temperature higher than the default but below the maximum;
- Train a local model on the ChatGPT/GPT-4 generations (a minimal sketch of this loop follows the list).
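To make this recipe concrete, here is a minimal sketch of such a generation loop using the OpenAI Python SDK. The list contents, sample count, and temperature value are illustrative assumptions, not values from the paper:

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative entity pools mimicking a child's vocabulary
verbs_list = ["decorate", "run", "hide", "jump"]
nouns_list = ["thunder", "ball", "garden", "cat"]
adjectives_list = ["ancient", "happy", "tiny", "loud"]
features_list = [
    "the story should contain at least one dialogue",
    "the story has a bad ending",
    "the story includes a plot twist",
    "the story teaches a moral lesson",
]

def build_prompt():
    # Random selections inject diversity into every prompt
    return (
        "Write a short story (3-5 paragraphs) which only uses very simple words "
        "that a 3 year old child would likely understand. "
        f'The story should use the verb "{random.choice(verbs_list)}", '
        f'the noun "{random.choice(nouns_list)}" and '
        f'the adjective "{random.choice(adjectives_list)}". '
        "The story should have the following features: "
        f"{random.choice(features_list)}, {random.choice(features_list)}. "
        "Remember to only use simple words!"
    )

stories = []
for _ in range(10):  # scale this loop up for a real dataset
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt()}],
        temperature=1.2,  # above the default (1.0), below the maximum (2.0)
    )
    stories.append(response.choices[0].message.content)
```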
It is worth noting that one of the entities used to seed the generation can be a class label. For example, in a sentiment classification task, you can mention "positive" or "negative" directly in the prompt to receive generated text with the corresponding label, and then train a local classifier on that data.
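As a sketch of such label seeding (the prompt wording and label set here are hypothetical, for illustration only):

```python
import random

labels = ["positive", "negative"]

def build_labeled_prompt():
    label = random.choice(labels)
    # The label is baked into the prompt, so each generated sample is born annotated
    prompt = (
        f"Write a short {label} product review about a pair of running shoes. "
        "Use a casual tone and keep it under 50 words."
    )
    return prompt, label

prompt, label = build_labeled_prompt()
# Send `prompt` to the LLM and store the completion together with `label`
```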
Iterative (Hierarchical) Synthetic Data Generation
The format can be made more complex by using more entities, some of which are generated beforehand by the LLM itself. For example, one could first request a summary of the story and write down one sentence that must appear in the story. Later, these generated intermediate data are simply used in the final request.
Prompt
Summary: {a short summary generated by LLM, using the approach above}
Features: {copy the features from the initial prompt}
Sentence: {a sentence generated by LLM, which should be present in the story}
Words: {copy the words from the initial prompt}
Story:
An example of the filled-in prompt:
Prompt
Summary: Lily and Timmy build a sandcastle together and learn to compromise, but it gets knocked over by a gust of wind. They find beauty in the broken sandcastle and play happily with a butterfly.
Features: Dialogue, Foreshadowing, Twist
Sentence: One day, she went to the park and saw a beautiful butterfly.
Words: disagree, network, beautiful
Story:
Output
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, she went to the park and saw a beautiful butterfly. She tried to catch it, but it flew away. Lily’s friend, Timmy, came to play with her. They decided to build a sandcastle, but they disagreed on how to build it. Lily wanted to make it tall, while Timmy wanted to make it wide. They argued for a while, but finally agreed to make it both tall and wide. As they were building the sandcastle, they noticed a group of ants working together. They realized that the ants were building a network of tunnels. Suddenly, a gust of wind came and knocked over their sandcastle. They were sad, but then they saw the butterfly again. It landed on the remains of their sandcastle and they realized that it was still beautiful, even in its broken state. They smiled and played together happily.
Thus, it is possible to generate hundreds of thousands of very diverse examples to train a model. Suppose, for instance, you need to train a classifier that determines whether a text contains a dialogue or a plot twist. Since the initial prompt contains the labels, it is known which target value should be predicted for each generated sample.
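A minimal sketch of this two-stage pipeline, assuming a `generate` helper that wraps the chat completions API (the prompt wording and the word/feature pools are illustrative, not from the paper):

```python
import random
from openai import OpenAI

client = OpenAI()

def generate(prompt, temperature=1.2):
    # Thin wrapper around the chat completions API
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

words = ["disagree", "network", "beautiful"]
features = random.sample(["Dialogue", "Foreshadowing", "Twist"], k=2)

# Stage 1: pre-generate the intermediate entities
summary = generate(
    "Write a one-sentence summary of a children's story "
    f"that uses the words {', '.join(words)}."
)
sentence = generate("Write one simple sentence that must appear in a children's story.")

# Stage 2: assemble the final request from the pre-generated pieces
story = generate(
    f"Summary: {summary}\n"
    f"Features: {', '.join(features)}\n"
    f"Sentence: {sentence}\n"
    f"Words: {', '.join(words)}\n"
    "Story:"
)

# The requested features double as ready-made classification labels
labeled_example = {"text": story, "labels": features}
```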
Textbooks Are All You Need
A crucial question raised by this approach is whether the synthesis of datasets truly benefits the training of neural networks for real-world applications. Fortunately, the authors addressed this question with their investigation and validated the effectiveness of training smaller language models on synthetic data from state-of-the-art LLMs.
In their study, Gunasekar et al. (2023) emphasize the importance of high-quality training data for their model. They argue that language models would be more effective if they were trained on materials resembling the characteristics of a well-regarded "textbook": clear, comprehensive, informative, and unbiased.
These principles formed the basis for creating a semi-synthetic dataset to train an LLM called Phi-1. The main evaluation task is to generate a Python function that follows a given text description or docstring. The model's quality is evaluated using the HumanEval benchmark (Chen et al., 2021).
The authors highlight the importance of diversity in this approach for several reasons: it exposes the language model to a variety of coding expressions and problem-solving approaches, reduces the risk of overfitting or relying on specific patterns, and improves the model's ability to handle unfamiliar or novel tasks.
To address the code-writing challenge, the authors created textbook-like documents focused on topics that promote reasoning and basic algorithmic skills. They achieved diversity by imposing restrictions on:
- the topics
- the target audience
Unfortunately, the authors did not provide specific information about the prompt template used to generate the synthetic data, but they did demonstrate the resulting output. They chose to use ChatGPT (GPT-3.5) instead of GPT-4, and even this strategy yielded excellent results, as they fine-tuned the model with synthetic data.

Image Source: Gunasekar et al. (2023)
Paying attention to the target audience constraint is essential. Specifying a target audience is an excellent way to enhance the diversity of synthetic data: a high school student, a computer science undergraduate, and a PhD candidate certainly phrase explanations and process information differently, making the generated data even more varied. Let's attempt to recreate the prompt.
Prompt
Write an extract from a Computer Science textbook for a 1st-year bachelor. The coding language is Python 3.6.
This is an extract from the middle of the following topic: Singular matrices.
The extract starts with a high-level overview of the topic. Then, it presents an example and describes the solution in natural language. After that, it provides 1-2 code snippets, following the example. Each snippet has no more than 10 rows. There should be no text after code snippets.
Keep in mind that the extract should be concise and explain only one concept at a time. The code should represent a Python function & its call. All the common ML/DS libraries are available.
Don't be too verbose. The expected length of the extract is 2 paragraphs at most.
Output
**Singular Matrices**
In linear algebra, a matrix is termed as "singular" if it doesn't have an inverse. Another equivalent definition is when the determinant of the matrix is zero. Singular matrices are crucial because they represent systems of linear equations that do not have a unique solution.
Consider a 2x2 matrix `[[a, b], [c, d]]`. If the product of `ad` and `bc` is the same (i.e., `ad = bc`), then this matrix is singular. Let's explore this with an example: given the matrix `[[2, 4], [1, 2]]`, you'll find that the product of the diagonals yields the same result (2 * 2 = 4 * 1), thus it's a singular matrix.
```python
import numpy as np

def is_singular(matrix):
    determinant = np.linalg.det(matrix)
    return abs(determinant) < 1e-9

matrix_example = np.array([[2, 4], [1, 2]])
print(is_singular(matrix_example))  # returns True
```
Pretty close!
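To scale this kind of generation, the two constrained entities can be randomized just like the words in the children's stories. A minimal sketch, where the topic and audience lists are illustrative assumptions and the template mirrors the prompt above:

```python
import random

topics = ["Singular matrices", "Binary search", "Hash tables", "Recursion"]
audiences = ["a high school student", "a 1st-year bachelor", "a PhD candidate"]

def build_textbook_prompt():
    topic = random.choice(topics)
    audience = random.choice(audiences)
    # Varying both topic and audience multiplies the diversity of the outputs
    return (
        f"Write an extract from a Computer Science textbook for {audience}. "
        "The coding language is Python 3.6.\n"
        f"This is an extract from the middle of the following topic: {topic}.\n"
        "The extract starts with a high-level overview of the topic. Then, it presents "
        "an example and describes the solution in natural language. After that, it "
        "provides 1-2 code snippets, following the example. Each snippet has no more "
        "than 10 rows. There should be no text after code snippets.\n"
        "Keep in mind that the extract should be concise and explain only one concept "
        "at a time. The code should represent a Python function & its call. All the "
        "common ML/DS libraries are available.\n"
        "Don't be too verbose. The expected length of the extract is 2 paragraphs at most."
    )
```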
In total, the authors generated 1B tokens to augment the model's training set, allowing a smaller model (only 1.3B parameters) to rival models ten times its size (see the article by Gunasekar et al. (2023) for details).

Image Source: Gunasekar et al. (2023)
For your task, you probably will not need such a massive amount of synthetic data (the authors studied pretraining, which demands significant resources). However, even a rough estimate is instructive: at $0.002 per 1k tokens (standard ChatGPT pricing), generating 1B tokens would cost about $2,000 ((10^9 tokens / 1,000) × $0.002), plus roughly the same amount for the prompts.
Keep in mind that fine-tuning on synthetic data becomes more valuable as the domain grows more niche, especially if the language deviates from English (among other factors). In addition, this method works well with Chain-of-Thought (CoT), helping a local model improve its reasoning capabilities. Other prompting techniques work too. And don't forget that open-source models like Alpaca (Taori et al., 2023) and Vicuna (Zheng et al., 2023) excel through fine-tuning on synthetic data.