Gemini 1.5 Pro

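Google has introduced Gemini 1.5 Pro, a compute-efficient multimodal mixture-of-experts model. The model focuses on capabilities such as recalling and reasoning over long-form content. Gemini 1.5 Pro can reason over long documents that may contain millions of tokens, including hours of video and audio. It improves the state of the art on long-document QA, long-video QA, and long-context ASR. On standard benchmarks, Gemini 1.5 Pro matches or outperforms Gemini 1.0 Ultra, and it achieves near-perfect retrieval (>99%) up to at least 10 million tokens, a significant advance over other long-context LLMs.

As part of this release, Google also offers a new experimental model with a 1 million token context window, available to try in Google AI Studio. For comparison, the largest context window of any other available LLM at the time of writing is 200K tokens. With a 1 million token context window, Gemini 1.5 Pro aims to unlock a range of use cases, including Q&A over large PDFs, code repositories, and even lengthy videos supplied as prompts in Google AI Studio. It supports mixing audio, visual, text, and code inputs in the same input sequence.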

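Architecture

Gemini 1.5 Pro is a sparse mixture-of-experts (MoE) Transformer-based model that builds on the multimodal capabilities of Gemini 1.0. The benefit of MoE is that the total number of model parameters can grow while the number of activated parameters stays constant. The technical report offers few details, but it states that Gemini 1.5 Pro uses significantly less training compute, is more efficient to serve, and includes architecture changes that enable long-context understanding (up to 10 million tokens). The model is pre-trained on data spanning different modalities, instruction-tuned with multimodal data, and further tuned on human preference data.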
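To make the MoE point concrete, here is a minimal sketch of top-k expert routing and the resulting parameter counts. The technical report does not disclose Gemini 1.5 Pro's configuration, so every number and name below (`route_top_k`, `moe_param_counts`, the expert sizes) is a hypothetical chosen only for illustration.

```python
def route_top_k(router_scores, k):
    """Return the indices of the k experts with the highest router score."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return ranked[:k]


def moe_param_counts(num_experts, top_k, params_per_expert, shared_params):
    """Return (total, active) parameter counts for one sparse MoE model.

    total  counts every expert plus the shared (non-expert) weights.
    active counts only the top_k experts a token is routed to, plus shared.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


# Hypothetical sizes, NOT Gemini 1.5 Pro's real configuration.
PARAMS_PER_EXPERT = 1_000_000_000
SHARED_PARAMS = 2_000_000_000

if __name__ == "__main__":
    # One token's router scores over 4 experts; the 2 best are selected.
    print(route_top_k([0.1, 0.7, 0.05, 0.15], k=2))
    for num_experts in (8, 16, 64):
        total, active = moe_param_counts(num_experts, 2,
                                         PARAMS_PER_EXPERT, SHARED_PARAMS)
        print(num_experts, total, active)
```

Under these assumptions, adding experts raises the total parameter count while the per-token active count does not change, because each token still passes through only `top_k` experts.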

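Results

Gemini 1.5 Pro achieves near-perfect "needle" recall of up to 1 million tokens in all modalities, i.e., text, video, and audio. To put the context window support into perspective, Gemini 1.5 Pro maintains processing and recall performance when extending to:

About 22 hours of recordings
Ten 1,440-page books
Entire codebases
3 hours of video at 1 frame per second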

"Gemini 1.5 Pro Retrieval Results"

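Gemini 1.5 Pro outperforms Gemini 1.0 Pro on the majority of benchmarks, with strong results in math, science, reasoning, multilinguality, video understanding, and code. The table below summarizes the results of the different Gemini models. Gemini 1.5 Pro also outperforms Gemini 1.0 Ultra on half of the benchmarks despite using significantly less training compute.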

"Gemini 1.5 Pro Results"

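Capabilities

The remaining subsections highlight a range of capabilities possible with Gemini 1.5 Pro, from analyzing large amounts of data to long-context multimodal reasoning. Some of these capabilities have been reported in the paper, some by the community, and some come from our own experiments.

Long Document Analysis

To demonstrate Gemini 1.5 Pro's ability to process and analyze documents, we start with a very basic question answering task. The Gemini 1.5 Pro model in Google AI Studio supports up to 1 million tokens, so we can upload entire PDFs. The example below shows a single uploaded PDF along with a simple prompt: What is the paper about?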

"Gemini 1.5 Pro Results"

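The model's response is accurate and concise, and it provides an acceptable summary of the Galactica paper. The example above uses a freeform prompt in Google AI Studio, but you can also use the chat format to interact with an uploaded PDF. This is useful when you have many questions to answer from the provided documents.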

"Gemini 1.5 Pro Chat"

To leverage the long context window, we now upload two PDFs and ask a question that spans both.

"Gemini 1.5 Pro Results"

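The response is reasonable. Interestingly, the information extracted from the first paper, a survey paper on LLMs, comes from a table. The "Architecture" information also looks correct. However, the "Performance" section does not belong there because it is not found in the first paper. For this task, it was important to put the prompt Please list the facts mentioned in the first paper about the large language model introduced in the second paper. at the top and to label the papers with tags such as Paper 1 and Paper 2. A related follow-up task would be to write a related-work section by uploading a set of papers along with instructions for how to summarize them. Another interesting one would be to ask the model to incorporate newer LLM papers into the survey.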
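The prompt layout described above (question at the top, each document under a Paper N label) can be sketched as a small helper. This is only an illustration of the layout for cases where the documents are available as plain text; the function name `build_multi_doc_prompt` is hypothetical, and in Google AI Studio the PDFs are attached through the UI rather than pasted as text.

```python
def build_multi_doc_prompt(question, documents):
    """Place the question first, then each document under a 'Paper N' label.

    documents is a list of plain-text strings, in the order the question
    refers to them ("the first paper", "the second paper", ...).
    """
    parts = [question.strip(), ""]
    for index, text in enumerate(documents, start=1):
        parts.append(f"Paper {index}:")
        parts.append(text.strip())
        parts.append("")  # blank line between documents
    return "\n".join(parts).rstrip() + "\n"


QUESTION = ("Please list the facts mentioned in the first paper about the "
            "large language model introduced in the second paper.")

if __name__ == "__main__":
    # Placeholder strings stand in for the extracted text of each PDF.
    print(build_multi_doc_prompt(QUESTION, ["<text of the survey paper>",
                                            "<text of the model paper>"]))
```

Explicit labels give the model an unambiguous referent for "first" and "second", which matters once both documents share one long input sequence.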

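Video Understanding

Gemini 1.5 Pro is built with multimodal capabilities from the ground up, and it also demonstrates video understanding. We tested a few prompts with one of Andrej Karpathy's recent lectures on LLMs.

For this short demo, we created a Chat prompt and uploaded the YouTube video containing Karpathy's lecture. The first question is What is the lecture about?. Nothing too fancy here, but the response is acceptable, as it accurately summarizes the lecture.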

"Gemini 1.5 Pro Chat"

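The second task we prompt the model to perform is to provide a concise outline (one page in length) of the lecture. The response is as follows (edited for brevity):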

## Large Language Models: A New Operating System?
Introduction:
 
- LLMs are trained on massive amounts of text data and can generate human-quality text.
- They are becoming increasingly powerful and are being used for a wide range of tasks, such as writing, translation, and coding.
 
How LLMs Work:
 
- LLMs are neural networks that predict the next word in a sequence.
- They learn by iteratively adjusting their parameters to make better predictions.
- While we can measure their performance, we don't fully understand how the billions of parameters collaborate to achieve it.
 
Training LLMs:
 
- Requires massive amounts of data and computing power.
- Can be thought of as "compressing the internet" into a single file.
- For example, training LLaMA 2 70B required 6,000 GPUs for 12 days and cost ~$2 million.
 
LLM Security:
 
- LLM security is a new and rapidly evolving field.
- Some key security risks include:
    - Jailbreaking: bypassing safety mechanisms to generate harmful content.
    - Prompt injection: injecting malicious code into prompts to control the LLM's output.
    - Data poisoning / Backdoor attacks: inserting crafted text into the training data to influence the LLM's behavior.
...

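The summary is very concise and gives a good outline of the lecture along with its key points. We did not assess the entire output for accuracy, but it is interesting to see the model output informative details like "training LLaMA 2 70B required 6,000 GPUs for 12 days and cost ~$2 million."

When specific details matter, keep in mind that the model may at times "hallucinate" or retrieve the wrong information for various reasons. For instance, when we prompted the model with What are the FLOPs reported for Llama 2 in the lecture? it responded with The lecture reports that training Llama 2 70B required approximately 1 trillion FLOPs. which is not accurate. The correct response should be ~1e24 FLOPs. The technical report contains many instances where these long-context models fail when asked specific questions about a video.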
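A quick back-of-the-envelope check shows how far off the model's answer is. The sketch below uses the common rule of thumb that training compute is roughly 6 × parameters × training tokens. That rule, and the 2 trillion training tokens reported in the Llama 2 paper, are assumptions brought in from outside this guide; they are not taken from the lecture transcript.

```python
import math


def approx_training_flops(num_params, num_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens


# Assumptions: 70B parameters, 2T training tokens (Llama 2 paper).
LLAMA2_70B_PARAMS = 70e9
LLAMA2_TOKENS = 2e12

estimate = approx_training_flops(LLAMA2_70B_PARAMS, LLAMA2_TOKENS)

# The model's answer, "approximately 1 trillion FLOPs".
model_answer = 1e12

# How many orders of magnitude separate the estimate from the answer.
gap = math.log10(estimate / model_answer)

if __name__ == "__main__":
    print(f"estimate: {estimate:.1e} FLOPs")
    print(f"gap: about {gap:.0f} orders of magnitude")
```

The estimate lands close to the ~1e24 figure, while "1 trillion" is roughly twelve orders of magnitude too small, so a simple magnitude check is enough to flag this kind of answer for review.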

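The next task involves extracting table information from the video. The example shows the prompt and response. We can observe that the model generated a table and got some details right and some wrong. For instance, the table columns are correct, but one of the row labels is wrong (i.e., Concept Resolution should be Coref Resolution). We tested a few of these extraction tasks with other tables and other elements such as text boxes and noticed similar inconsistencies.

An interesting example documented in the technical report is the model's ability to retrieve details from a video based on specific scenes or timestamps. In the first example, we prompted the model to answer where a specific section starts.

Prompt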

At what timestamp does the LLM OS section start?

Output

The LLM OS section starts at 42:17.

The response is correct.

In the next example, we ask the model to explain a chart found in the slides.

Prompt

Can you explain the chart (on the right-hand side) on the self-improvement slide?

Output

The chart shows the performance of AlphaGo Zero over time, as measured by its Elo rating. The Elo rating is a measure of a player's skill in a game, with higher ratings indicating stronger players. The chart shows that AlphaGo Zero's performance improved rapidly over the course of 40 days, eventually surpassing the performance of all other versions of AlphaGo, including the version that defeated the world champion Go player Lee Sedol. This improvement was achieved entirely through self-play, with no human intervention or historical data.

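The model seems to make good use of the information provided to explain the results in the chart. Below is a snapshot of the corresponding slide: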

"AlphaGo Zero"

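Code Reasoning

With its long-context reasoning, Gemini 1.5 Pro can answer questions about a codebase. In Google AI Studio, Gemini 1.5 Pro accepts up to 1 million tokens, so we can upload an entire codebase and prompt it with different questions or code-related tasks. The technical report provides an example where the model is given the entire JAX codebase in context (~746K tokens) and asked to identify the location of a core automatic differentiation method.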
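Before uploading a codebase, it helps to know roughly how many tokens it will occupy. The sketch below concatenates source files under path headers and estimates the token count. The `pack_codebase` helper, the `=== File: ... ===` header format, and the 4-characters-per-token ratio are all assumptions made for illustration; real counts depend on the model's tokenizer.

```python
import os


def pack_codebase(root, extensions=(".py",), chars_per_token=4):
    """Concatenate matching files under root into a single prompt string.

    Returns (prompt, estimated_tokens). The token estimate is a crude
    character-count heuristic, not a real tokenizer.
    """
    sections = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if not name.endswith(tuple(extensions)):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                body = handle.read()
            relative = os.path.relpath(path, root)
            # A path header lets the model cite where a definition lives.
            sections.append(f"=== File: {relative} ===\n{body}")
    prompt = "\n\n".join(sections)
    estimated_tokens = len(prompt) // chars_per_token
    return prompt, estimated_tokens


def fits_context(estimated_tokens, context_window=1_000_000):
    """Check the estimate against a context window (1M by default)."""
    return estimated_tokens <= context_window
```

Keeping the file path next to each file's contents is what makes location questions, such as the one in the JAX example, answerable from the prompt alone.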

"Gemini 1.5 Pro Jax"

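English to Kalamang Translation

Gemini 1.5 Pro can be provided with a grammar manual (500 pages of linguistic documentation, a dictionary, and ~400 parallel sentences) for Kalamang, a language spoken by fewer than 200 people worldwide. The model can then translate English to Kalamang at a level comparable to a person learning from the same material. This showcases the in-context learning abilities that Gemini 1.5 Pro's long context enables.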

"Gemini 1.5 Pro Multilinguality"

Figure source: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
