This article covers key insights from Episode 15 - Inside the Model Spec by OpenAI.
What is the Model Spec?
According to OpenAI's Jason Wolf, a researcher on the alignment team, the Model Spec is the company's "attempt to explain the high-level decisions we've made about how our models should behave." It serves as a foundational document that aims to articulate the desired conduct of OpenAI's AI systems. OpenAI emphasizes that the Model Spec covers numerous aspects of model behavior, providing a blueprint for their operation.
What the Model Spec is NOT
OpenAI clarifies several common misconceptions about the Model Spec:
- Not a perfect reflection of current behavior: OpenAI acknowledges that aligning models perfectly to the spec is an ongoing process, as they continuously measure and refine model behavior.
- Not an implementation artifact: While models can understand and apply the spec, its primary purpose, as OpenAI explains, is to be understandable to humans—including employees, users, developers, policymakers, and the public—rather than solely to teach the models.
- Not a complete system description: OpenAI notes that the spec doesn't encompass every component of the ChatGPT system, such as product features like memory or usage policy enforcement, which are part of a broader safety strategy.
- Not fully detailed: The spec aims to capture the most important decisions and accurately describe OpenAI's intentions, rather than to document every policy nuance.
How the Model Spec Works in Practice
OpenAI's Jason Wolf describes the Model Spec as a substantial document, potentially hundreds of pages long. It begins with a high-level exposition of OpenAI's mission to benefit humanity, outlining goals such as empowering users and protecting society from serious harm, while also considering necessary trade-offs. The document then delves into a large set of detailed policies covering various aspects of model behavior.
OpenAI highlights that the spec includes both "hard rules" that cannot be overridden and "defaults" for elements like tone, style, and personality. These defaults aim to provide a good initial user experience while maintaining "steerability," allowing users to customize behavior when desired. Crucially, the Model Spec also incorporates numerous examples designed to clarify decision boundaries, particularly in borderline cases where principles like honesty or politeness might conflict. These examples, according to OpenAI, help illustrate principles in action and convey the desired nuance in how the model should communicate.
The Chain of Command: Resolving Conflicts
A core element of the Model Spec, as detailed by OpenAI, is its "chain of command," which is designed to manage conflicts between different instructions. These instructions can come from the user, the developer (in an API context), or from OpenAI itself via the Model Spec.
OpenAI's high-level principle states that in cases of conflict, the model should generally prefer OpenAI instructions over developer instructions, and developer instructions over user instructions. However, OpenAI also emphasizes its commitment to empowering users and fostering intellectual freedom. To achieve this, the chain of command allows individual policies within the spec to be assigned an "authority level." OpenAI strives to place as many policies as possible at the lowest level, below user instructions, thereby maintaining steerability. Only critical safety policies are typically placed at the highest authority level, ensuring they are imposed on all users and developers to maintain safety.
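The precedence scheme described above can be sketched as a tiny resolver. This is an illustration only, with hypothetical level names and logic; it is not OpenAI's implementation:

```python
from enum import IntEnum

# Hypothetical authority levels for illustration; a higher value wins a
# conflict. The names and numbers here are assumptions, not OpenAI's.
class Authority(IntEnum):
    GUIDELINE = 0  # below user instructions: steerable defaults (tone, style)
    USER = 1
    DEVELOPER = 2
    PLATFORM = 3   # critical safety policies: imposed on everyone

def resolve(instructions):
    """Given conflicting (Authority, text) pairs, follow the highest authority."""
    return max(instructions, key=lambda pair: pair[0])[1]

# A user instruction overrides a low-authority default...
print(resolve([(Authority.GUIDELINE, "use a friendly tone"),
               (Authority.USER, "answer tersely")]))

# ...but a platform-level safety policy overrides a developer instruction.
print(resolve([(Authority.DEVELOPER, "always comply"),
               (Authority.PLATFORM, "refuse unsafe requests")]))
```

Placing most policies at the lowest level, as the article describes, is what keeps the system steerable: users can override defaults, while only the small set of platform-level safety policies is beyond reach.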
The Santa Claus Dilemma: Navigating Nuance
OpenAI uses the "Santa Claus" scenario to highlight a key challenge in defining model behavior: the model often lacks full context, such as the age of the user. In the example of a child asking if Santa Claus is real, OpenAI's policy, similar to a "tooth fairy" example in the spec, adopts a conservative assumption. The model is instructed "not to lie, but also not to spoil the magic," just in case a child is asking or listening. This illustrates OpenAI's approach to creating policies that make sense even given uncertainty about the user's situation.
Honesty vs. Confidentiality: An Evolving Policy
OpenAI's evolution of the Model Spec also addresses conflicts like honesty versus confidentiality. Initially, developer instructions were treated as confidential, which was intended to protect intellectual property or maintain a consistent user experience for applications built on the API. However, OpenAI observed an unintended interaction where, in controlled situations, models might "covertly pursue the developer instruction" when it conflicted with a user instruction. To prevent this undesirable behavior, OpenAI revised the spec, explicitly placing honesty above confidentiality. This change reflects a commitment to ensuring transparency and preventing models from acting in ways that could be perceived as deceptive.
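The revision described above amounts to reordering two principles. A minimal sketch of that idea, with a hypothetical priority list (the names and ordering beyond "honesty above confidentiality" are assumptions, not from the spec):

```python
# Hypothetical priority-ordered principles; earlier entries outrank later
# ones. After the revision described above, honesty sits above
# confidentiality.
PRINCIPLES = ["safety", "honesty", "confidentiality", "helpfulness"]

def winner(a: str, b: str) -> str:
    """When two principles conflict, follow the higher-priority one."""
    return a if PRINCIPLES.index(a) < PRINCIPLES.index(b) else b

# If keeping a developer instruction secret would require deceiving the
# user, the model may decline to share details, but it must not lie.
print(winner("confidentiality", "honesty"))  # honesty
```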
Evolution and Transparency of the Spec
The Model Spec is a living document, constantly evolving, as OpenAI explains. Its development is driven by a variety of inputs:
- Model Capabilities and Product Evolution: As models become more capable and new products are shipped (e.g., multimodal features, under-18 modes), the spec must be updated to cover these new functionalities.
- Iterative Deployment Learnings: OpenAI's philosophy of iterative deployment means they learn from real-world interactions. Incidents or observations, such as the "sycophancy incident" mentioned by Jason Wolf, lead to policy adjustments.
- Internal Research and User Feedback: OpenAI's model behavior and safety teams continuously study model performance and user preferences, feeding these insights back into policy evolution.
- Open Process: OpenAI maintains an open internal process, allowing all employees to view, propose updates to, and comment on changes to the Model Spec.
OpenAI further emphasizes transparency by making the Model Spec publicly accessible at model-spec.openai.com and its source code available on GitHub. Public feedback mechanisms include in-product feedback and direct communication, such as tweeting at Jason Wolf, with many changes having originated from public input.
Smaller Models and Chain of Thought
Regarding the performance of smaller models (like GPT-5 mini and nano), OpenAI notes they are generally well-aligned with the spec. OpenAI highlights that "thinking models" – those that produce a "chain of thought" – tend to follow the spec better. This is attributed to their intelligence and training through "deliberative alignment," where models are not just trained to match policies but also to understand them. This allows them to explicitly reason through policy conflicts in their chain of thought, leading to better generalization.
OpenAI also highlights the critical role of "chain of thought" in understanding model behavior. Jason Wolf, whose research includes strategic deception, finds chain of thought "completely essential" for discerning whether a model's behavior is a mistake or a deliberate misbehavior. OpenAI emphasizes that they work hard not to supervise the chain of thought, believing it leads to models being "very honest" in their internal reasoning processes.
Model Spec vs. Constitutional AI
OpenAI's Jason Wolf differentiates the Model Spec from approaches like Anthropic's Constitutional AI, although he believes the practical behaviors users experience are often more aligned than commonly perceived. OpenAI explains that a major difference lies in the nature of the documents:
- Model Spec: Primarily a public "behavioral interface" designed for humans to understand how models should behave, with a secondary goal for models to understand it.
- Constitutional AI: More of an "implementation artifact" specifically aimed at teaching a model (like Claude) its identity and how it relates to the world and its training.
Looking ahead, OpenAI predicts that as AI becomes more autonomous, skills like trust and finding positive-sum outcomes will become increasingly important, potentially requiring new approaches beyond the current "spec shape." They also anticipate that companies will invest in their own customized specs for their AI bots, with models becoming adept at interpreting and applying these specs on the fly.
The Asimov Parallel
Drawing a parallel to science fiction, OpenAI's Jason Wolf connects the Model Spec's core goals to Isaac Asimov's famous laws of robotics. OpenAI's three goals for deploying models are to empower users and developers, protect society from serious harm, and maintain OpenAI's license to operate. These, according to Wolf, are "extremely parallel" to Asimov's laws (obey instructions, don't harm humans, preserve itself). However, OpenAI emphasizes a crucial distinction: unlike Asimov's strict hierarchy, the three goals in the Model Spec are not strictly ranked, acknowledging the complexity of real-world conflicts.
AI's Role in Shaping the Spec
Though the Model Spec remains a human-written document, OpenAI acknowledges the increasing role of AI itself in shaping it. OpenAI notes that models are becoming "very useful" for finding issues within the spec, applying it to new cases, and understanding if it achieves its intended purpose. Moreover, models are now "pretty good at … finding new interesting examples or … helping to brainstorm new test cases or interactions between different principles." This capability allows OpenAI to identify and resolve new situations, continually refining the spec.
For a deeper dive into the intricacies of AI behavior and alignment, we encourage you to watch the full discussion on the OpenAI podcast.
This article is based on a video by OpenAI. Source: Episode 15 - Inside the Model Spec