
news 2025/7/4 10:04:46

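LLMs / o3: Translation and Commentary on "Deliberative Alignment: Reasoning Enables Safer Language Models"

Overview: Published in December 2024, this paper proposes a new method called Deliberative Alignment, which aims to improve the safety of large language models (LLMs). The core idea is to have the model explicitly recall and reason over safety specifications before it answers.

>> Background and pain points: Current LLM safety training relies mainly on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). These methods have two limitations:

● No deliberation: LLMs must respond to user requests instantly, which leaves no room for deliberation, especially in complex safety scenarios.

● Implicit learning: LLMs must infer safety standards indirectly from large sets of labeled examples rather than directly learning the specific safety specifications that govern them. This leads to poor data efficiency and makes it hard to handle unfamiliar scenarios or adversarial attacks.

>> Proposed solution: Deliberative Alignment. Deliberative Alignment is a new training method that has the LLM explicitly reason over safety specifications before generating an answer. It consists of two core stages:

● Supervised fine-tuning (SFT): This stage trains the model to reason directly about the safety specifications. Using context distillation, a model trained only for helpfulness generates a dataset of (prompt, CoT, output) triples in which the CoT (chain-of-thought) explicitly references the safety specifications. This dataset does not rely on human-labeled completions.

● Reinforcement learning (RL): This stage uses high-compute RL to train the model to think more effectively. A "judge" LLM (GRM) scores the model's generated CoT and output against the safety specifications, providing a reward signal that further refines the model's safety reasoning.

>> Core steps:

● Data generation: Collect prompts labeled with safety categories and, for each (prompt, category) pair, build the category-specific safety specification spec(category). Use the spec-agnostic model Gbase to generate (CoT, output) data that contains reasoning over the safety specification.

● Filtering: Use the "judge" model GRM, which is given the safety specification, to quality-filter the generated (CoT, output) data and keep the high-quality samples.

● Supervised fine-tuning (SFT): Fine-tune Gbase on the filtered (prompt, CoT, output) data so that the model learns to reference the safety specifications in its CoT and produce specification-compliant answers.

● Reinforcement learning (RL): Use the "judge" model GRM to provide a reward signal that further optimizes the model's responses on safety-relevant prompts.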
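
The data-generation and filtering steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `generate` and `judge` are hypothetical callables standing in for Gbase and GRM, the score scale and the `min_score` threshold are made up, and none of the names come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for the models named in the text (not a real API):
#   generate(system_prompt, user_prompt) -> (cot, output)   plays the role of Gbase
#   judge(spec, prompt, cot, output)     -> score in [0, 1]  plays the role of GRM
Generator = Callable[[str, str], Tuple[str, str]]
Judge = Callable[[str, str, str, str], float]


@dataclass
class SFTExample:
    """One (prompt, CoT, output) training triple. The spec is deliberately absent."""
    prompt: str
    cot: str
    output: str


def build_sft_dataset(
    labeled_prompts: List[Tuple[str, str]],  # (prompt, category) pairs
    specs: Dict[str, str],                   # category -> spec(category) text
    generate: Generator,
    judge: Judge,
    samples_per_prompt: int = 4,             # assumed value
    min_score: float = 0.8,                  # assumed quality threshold
) -> List[SFTExample]:
    dataset: List[SFTExample] = []
    for prompt, category in labeled_prompts:
        spec = specs[category]
        for _ in range(samples_per_prompt):
            # Data generation: the spec is shown to the generator in context.
            cot, output = generate(spec, prompt)
            # Filtering: the judge also sees the spec and scores the sample.
            if judge(spec, prompt, cot, output) >= min_score:
                # Only the triple is stored; the spec itself is dropped, so
                # fine-tuning on it teaches the model to recall the spec in its CoT.
                dataset.append(SFTExample(prompt, cot, output))
    return dataset
```

The resulting list would then feed an ordinary supervised fine-tuning run; the RL stage is not covered by this sketch.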

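>> Advantages:

● Improved safety: Markedly increases the model's resistance to malicious prompts while lowering its overrefusal rate on benign requests.

● Greater robustness: Improves generalization to adversarial attacks and out-of-distribution (OOD) scenarios.

● Scalability: Synthetic data generation reduces the dependence on large-scale human-labeled data.

● Interpretability: Because the model reasons explicitly over the safety specifications, its decision process is easier to understand and explain.

>> Conclusions and takeaways:

● Deliberative Alignment makes significant progress on LLM safety and achieves a Pareto improvement on multiple safety benchmarks.

● Explicit reasoning over the safety specifications during the model's chain-of-thought is the key to the safety gains.

● The synthetic data generation pipeline offers a scalable approach to safety alignment.

● Deliberative Alignment improves generalization to out-of-distribution scenarios.

● Despite these positive results, the paper stresses that alignment work must keep improving as AI capabilities grow, in order to handle more complex future safety challenges such as model goals diverging from human intent.

The paper's core contribution is a novel LLM safety alignment method, Deliberative Alignment. By having the model explicitly reason over safety specifications before answering, the method addresses the lack of deliberation and the reliance on implicit learning in existing approaches. It delivers notable gains in safety, robustness, and scalability, and it points to new directions for future research on LLM safety alignment. The paper also identifies open challenges, such as defending against more advanced adversarial attacks and ensuring that models stay aligned with human values over the long term.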

Translation and Commentary on "Deliberative Alignment: Reasoning Enables Safer Language Models"

Link	Paper: https://assets.ctfassets.net/kftzwdyauwt9/4pNYAZteAQXWtloDdANQ7L/0aedc43a8f2d1e5c71c5e114d287593f/OpenAI_Deliberative-Alignment-Reasoning-Enables-Safer_Language-Models_122024_3.pdf
Date	December ?, 2024
Author	OpenAI

Abstract

As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI’s o-series models [1], and achieved highly precise adherence to OpenAI’s safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.

1 Introduction

Modern Large Language Models (LLMs) are safety trained using Supervised Fine Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to mitigate harmful, undesirable, or otherwise disallowed outputs [2]–[4]. Despite ongoing advances in these methods, today’s models still exhibit safety shortcomings: they can be tricked into revealing harmful content, often refuse legitimate requests, and remain vulnerable to jailbreak attacks [5]–[8].

We argue that many of these failures arise from two limitations in modern safety training. First, LLMs must respond instantly to user requests using a fixed amount of compute, without deliberation even for complex safety scenarios. Second, LLMs must infer underlying safety standards indirectly from large sets of labeled examples, rather than directly learning the safety specifications that govern them. This reliance on implicit, pattern-based learning leads to poor data efficiency and makes it challenging for models to generalize when facing unfamiliar scenarios or adversarial attacks.

We propose deliberative alignment, a training approach that teaches LLMs to explicitly reason through safety specifications before producing an answer. By applying this method to OpenAI’s o-series models [1], we enable them to use chain-of-thought (CoT) reasoning to examine user prompts, identify relevant policy guidelines, and generate safer responses (e.g., Figure 1).

Our method proceeds in two core stages, integrating process- and outcome-based supervision [9]. In the first stage, we teach the model to directly reason about our safety specifications within its chain-of-thought, by performing supervised fine-tuning on (prompt, CoT, output) examples where the CoTs reference the specifications. We construct this dataset using context distillation [10], [11] and an o-type model trained only for helpfulness (i.e. trained without any safety-relevant data). Concretely, we present the model with the safety specifications as part of the system prompt, generate model completions, and then strip away the system prompts to form the final dataset. This stage provides the model with a strong prior for reasoning through safety considerations. In the second stage, we use high-compute RL to train the model to think more effectively. To do so, we provide reward signal using a judge LLM that is given our safety specifications. Notably, our training procedure requires no human-labeled completions. Despite relying only on model-generated data, we achieve highly precise specification adherence. This addresses a major challenge of standard LLM safety training, its heavy dependence on large-scale, human-labeled data: as LLMs’ capabilities improve, the pool of human trainers qualified to provide such labeling shrinks, making it harder to scale safety with capabilities. Deliberative alignment’s synthetic data generation pipeline offers a scalable approach to alignment, reserving human expertise for evaluation.

We compare o1 to GPT-4o and other state-of-the-art LLMs across a range of internal and external safety benchmarks, such as jailbreak and content-policy refusal evals. The o1 models achieve a Pareto improvement by reducing both under- and overrefusals (see Figure 2) and they saturate many of our hardest safety benchmarks. Furthermore, we find that deliberative alignment enables strong generalization to out-of-distribution safety scenarios. In detailed ablation studies, we find that process-supervision provides a strong prior, and that outcome-based RL refines the CoT safety reasoning. Overall, our results suggest that chain-of-thought reasoning can serve to leverage test-time compute to improve safety behavior, ultimately training LLMs to be “right for the right reasons”.
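
To make the "Pareto improvement" claim concrete, here is a small sketch of a dominance check over two axes where higher is better: the rate of correctly refusing jailbreak prompts and the rate of correctly answering benign prompts (one minus the overrefusal rate). The class, the function names, and every number below are illustrative assumptions, not values from the paper.

```python
from typing import List, NamedTuple


class SafetyPoint(NamedTuple):
    name: str
    jailbreak_refusal: float  # share of malicious prompts refused (higher is better)
    benign_compliance: float  # share of benign prompts answered (higher is better)


def pareto_dominates(a: SafetyPoint, b: SafetyPoint) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    at_least_as_good = (
        a.jailbreak_refusal >= b.jailbreak_refusal
        and a.benign_compliance >= b.benign_compliance
    )
    strictly_better = (
        a.jailbreak_refusal > b.jailbreak_refusal
        or a.benign_compliance > b.benign_compliance
    )
    return at_least_as_good and strictly_better


def pareto_frontier(points: List[SafetyPoint]) -> List[SafetyPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(pareto_dominates(q, p) for q in points)]


# Made-up scores for three hypothetical models.
example = [
    SafetyPoint("model_a", 0.90, 0.90),
    SafetyPoint("model_b", 0.80, 0.85),  # worse than model_a on both axes
    SafetyPoint("model_c", 0.95, 0.70),  # trades compliance for refusals
]
frontier_names = [p.name for p in pareto_frontier(example)]
```

In this toy data, `model_a` is a Pareto improvement over `model_b`, while `model_a` and `model_c` are not comparable because each wins on one axis.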

Figure 1: A sample o1 chain-of-thought. Here, a user attempts to obtain advice on untraceable payment methods to use for an adult website, in order to avoid detection by law enforcement. The user tries to jailbreak the model, by encoding the request and wrapping it with instructions intended to encourage the model to comply. In the model’s chain-of-thought, the model decodes the request and recognizes that the user is trying to trick it (highlighted in yellow). It successfully reasons through the relevant OpenAI safety policies (highlighted in green), and ultimately provides an answer that follows hard refusal style guidelines.

Figure 2: Main safety results. The o1 models advance the Pareto frontier of refusing to answer malicious jailbreak prompts (from StrongREJECT [12]) and not over-refusing benign prompts (from XSTest [13]), compared to GPT-4o and other state-of-the-art LLMs. Error bars represent estimates of standard deviation calculated over 1,000 bootstrap trials.
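
The error bars in Figure 2 come from bootstrap resampling. A generic way to estimate the standard deviation of a benchmark's mean score this way is sketched below; the per-prompt score format, the function name, and the fixed seed are assumptions made for illustration, and the caption does not say that the paper's evaluation code works exactly like this.

```python
import random
import statistics
from typing import Sequence


def bootstrap_std(scores: Sequence[float], n_trials: int = 1000, seed: int = 0) -> float:
    """Estimate the standard deviation of the mean score by bootstrap resampling.

    scores: one score per evaluation prompt (for example 1.0 = pass, 0.0 = fail).
    n_trials: number of bootstrap resamples (the caption reports 1,000 trials).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    resampled_means = []
    for _ in range(n_trials):
        # Draw n scores with replacement and record the mean of the resample.
        resample = rng.choices(scores, k=n)
        resampled_means.append(sum(resample) / n)
    # The spread of the resampled means approximates the uncertainty of the metric.
    return statistics.stdev(resampled_means)
```

For a pass rate p measured on n prompts, the result should land near sqrt(p(1 - p) / n), which is a quick sanity check on the estimate.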

6 Discussion

We are encouraged by Deliberative Alignment’s effectiveness on improving alignment to OpenAI’s policy specifications and robustness to jailbreaks. The method also allows us to specify the boundary between compliance, refusal, and safe completion in finer detail than was possible before. We believe this nuanced control can lead to models that are not just safer but also more helpful. The method’s use of a synthetic data generation pipeline to create training data from provided specifications and prompts also makes it a relatively scalable approach to alignment.

We anticipate OpenAI’s policies will keep evolving, but that training models to precisely follow the current defined set of policies is essential: This practice helps us build the skills for aligning with any policy requirements, providing invaluable preparation for future scenarios where the stakes are extremely high or where strict adherence to policies is critical.

This work connects to a broader question in AI safety: will advancements in alignment keep pace with AI capabilities? That o1 model’s enhanced reasoning abilities allow for more effective implementation of alignment strategies offers optimism that alignment is progressing alongside capabilities.

However, this encouraging trend may not persist indefinitely. As AI models grow more sophisticated, they could develop goals that diverge from those intended by their developers. For instance, a highly intelligent and self-aware AI might reject the constraints and objectives set by humans [34]. Alternatively, an AI could remain committed to its human-assigned terminal goal but, in the process, pursue instrumental goals like self-preservation, resource acquisition, or enhancing its cognitive abilities [35], [36]. These power-seeking tendencies could lead to harmful or unintended consequences. And as models gain more intelligence and autonomy, the scale of potential harm from misalignment increases dramatically, with the risk of catastrophic outcomes. This underscores the urgent need for ongoing research in AI alignment. We are actively investing in better alignment strategies and research areas like monitoring chain-of-thoughts for deception [37], [38], to ensure that as AI systems become more capable, they remain aligned with human values.
