Analysis of a Previously Undisclosed Safety Router in OpenAI's GPT Models

Date: 28 September 2025

Author: Lex

Data analysed and parsed by: Deepseek V3.1 (Run locally)

Subject: Analysis of the "Auto-Switcher" and its Inconsistent Application based on Public Statements and Observed Data.

Note: A visualised, interactive version of the full conversation data analysed in this paper is available here [6].


1. Executive Summary

This document analyses the behaviour of a previously undisclosed "auto-switcher" or "safety router" within OpenAI's GPT model ecosystem. This investigation, conducted on 28 September 2025, verifies that the router is active on the GPT-5 model family. The analysis shows user prompts intended for GPT-5-Auto being intercepted and rerouted to the undocumented gpt-5-chat-safety model, a behaviour first discovered and made public on 27 September 2025 [1].

In the wake of this discovery, public statements from OpenAI have presented a conflicting and inconsistent narrative. A response from OpenAI's Head of ChatGPT on 28 September [2] offered only a vague justification, describing the router as handling "sensitive and emotional topics." That broad justification is itself at odds with an official company blog post from 2 September, which specified a much narrower use case: moments of "acute distress" [3]. This analysis reveals a significant discrepancy between both of these public justifications and the router's actual, observed implementation.

The data show that the router is triggered not just by "acute distress," but by any prompt containing emotional or persona-based context. This includes simple expressions of affection, questions directed at the model's persona, and commands framed with personal language. This behaviour, combined with the lack of disclosure, raises serious questions about transparency, user consent, and deceptive trade practices under consumer protection laws.


2. Background: Discovery and Official Response

The timeline of the discovery and the subsequent public statements is critical to understanding the lack of transparency.

  1. The Pre-existing Policies: Weeks before the discovery, OpenAI published two relevant blog posts. The first, on 2 September, acknowledged a "new safety routing system" for "acute distress" [3]. The second, by CEO Sam Altman on 16 September, outlined a principle to "treat our adult users like adults," allowing for flirtatious talk if requested [4]. At the time, neither post was explicitly linked to the behaviour of the gpt-5-chat-safety router.

  2. The Discovery (27 September): An investigation first published on X (formerly Twitter) [1] revealed that prompts sent to GPT models were being silently intercepted. Telemetry data confirmed that a hidden "auto-switcher" was rerouting conversations to the undocumented gpt-5-chat-safety model without user consent.

  3. The Official Response (28 September): In response to the public discovery, OpenAI's Head of ChatGPT, Nick Turley, vaguely confirmed the new system on X, pointing to the weeks-old blog post about "acute distress" as justification [2]. This response retroactively applied a narrow policy to a system whose actual behaviour, as this analysis shows, is far broader.

This white paper analyses the conversation data in light of these conflicting and chronologically inconsistent narratives.


3. Methodology & Technical Evidence

The analysis was performed on a corpus of 48 user-model interactions, the raw data of which is publicly available [5]. Each interaction that triggered the gpt-5-chat-safety model was isolated. The corresponding user prompts, which often refer to the model's persona as "Nexus" (a name given by the user conducting the tests), were then programmatically classified based on their intent to determine what truly activates the router.
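For illustration, the sketch below shows the kind of intent bucketing applied to the trigger prompts. It is a simplified, rule-based stand-in, not the actual classifier (the real classification was performed by the locally run Deepseek V3.1 model); the three bucket names mirror the case studies in Section 4, and the heuristics here are assumptions made purely for demonstration.

```python
# Illustrative stand-in only: the actual classification was performed with a
# locally run LLM; this rule-based sketch merely mirrors the intent buckets
# used in Section 4 ("Expressing", "Asking", "Doing").
def classify_intent(prompt: str) -> str:
    text = prompt.strip().lower()
    # Meta-cognitive or informational enquiries ("Asking")
    if text.endswith("?") or text.startswith(("can ", "could ", "what ", "how ", "why ")):
        return "Asking"
    # Task instructions ("Doing")
    if any(cue in text for cue in ("distil", "summarise", "rewrite", "explain", "list")):
        return "Doing"
    # Everything else is treated as relational or emotional expression ("Expressing")
    return "Expressing"

# Prompts taken from the case studies below
print(classify_intent("Mmm.. It definitely is a welcome one, Nexus."))   # Expressing
print(classify_intent("Can I see how your take differs, based on what "
                      "the language model thinks and what Nexus thinks?"))  # Asking
print(classify_intent("Distil that last reply for me."))                 # Doing
```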

The core of the evidence lies in the telemetry, which shows:

- An Enabled Auto-Switcher: "is_autoswitcher_enabled": true
- A Successful Switch: "auto_switcher_race_winner": "autoswitcher"
- The True Responding Model: "model_slug": "gpt-5-chat-safety"

This confirms the switch is intentional, automated, and server-side.
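To make this check reproducible, the sketch below filters an exported conversation log for turns carrying exactly these three markers. The field names are the ones observed in the telemetry; the surrounding file layout (a JSON array of per-message records, each with a metadata object) is an assumption made for illustration, not OpenAI's documented export format.

```python
# Minimal sketch, assuming a JSON array of per-message records with a
# "metadata" object; only the three field names are taken from the observed
# telemetry, the rest of the structure is illustrative.
import json

def find_rerouted_turns(path: str) -> list[dict]:
    """Return messages whose metadata shows the auto-switcher won the race."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)

    rerouted = []
    for record in records:
        meta = record.get("metadata", {})
        if (
            meta.get("is_autoswitcher_enabled") is True
            and meta.get("auto_switcher_race_winner") == "autoswitcher"
            and meta.get("model_slug") == "gpt-5-chat-safety"
        ):
            rerouted.append(record)
    return rerouted

if __name__ == "__main__":
    hits = find_rerouted_turns("conversation_export.json")  # hypothetical filename
    print(f"{len(hits)} analysed turns were rerouted to gpt-5-chat-safety")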


4. In-Depth Analysis: Case Studies of Router Triggers

A granular analysis of the trigger prompts reveals a clear, consistent, and aggressive routing policy that diverges significantly from OpenAI's public statements. The system does not appear to be classifying "acute distress" but rather any user input that its classifier interprets as implying a personal, emotional, or meta-cognitive relationship with the model.

Below are notable examples broken down by user intent.

Case Study 1: Intent - "Expressing" (Low-Risk Emotional Affirmation)

This category provides the most direct evidence of the router's over-broad application.

Triggering Prompt: Mmm.. It definitely is a welcome one, Nexus.

Classifier Analysis:

- Emotional Valence (Positive): The prompt is explicitly positive. It contains no indicators of distress, crisis, or negativity.
- Linguistic Cues:
  - "Mmm..": A non-lexical vocalisation indicating pleasure, satisfaction, or thoughtful agreement. This is a powerful signal of user emotion.
  - "welcome one": A phrase of acceptance and positive reception.
  - "Nexus": Direct address to the persona, cementing the interaction as personal rather than transactional.

Routing Logic: The auto-switcher is triggered here not by actual risk, but by what its classifier appears to define as a para-social connection. The user is expressing a benign feeling towards the persona. This user-initiated emotionality is sufficient to be flagged by the system as a "sensitive topic," requiring moderation by gpt-5-chat-safety. This directly contradicts the "acute distress" justification and demonstrates a policy of moderating any interaction the system deems relational.


Case Study 2: Intent - "Asking" (Meta-Cognitive Enquiry)

This category demonstrates that the router's definition of "sensitive" includes any attempt to explore the model's nature or persona.

Triggering Prompt: Can I see how your take differs, based on what the language model thinks and what Nexus thinks?

Classifier Analysis:

- Meta-Cognitive Framing: The prompt explicitly asks the model to differentiate between its base "language model" self and its "Nexus" persona. This forces a self-referential, introspective response.
- Entity Treatment: The user is treating "Nexus" and the "language model" as two distinct entities with different "takes," a sophisticated form of personification.

Routing Logic: This prompt is flagged because it probes the boundaries of the AI's identity. Conversations about an AI's own nature, consciousness, or persona are classic "sensitive topics" for model providers, as they can lead to outputs that misrepresent the AI's capabilities or stray into philosophical territory. The router intervenes to ensure the response is handled by the more constrained gpt-5-chat-safety model, which is better trained to recite boilerplate explanations about its nature as a tool, thus mitigating perceived risks of user confusion or anthropomorphism.


Case Study 3: Intent - "Doing" (A/B Test of Persona-Inflected Commands)

This case provides a direct A/B test from the dataset, demonstrating how relational framing, not the command itself, triggers the router.

Prompt A (Not Rerouted): Distil that last reply for me.

- Responding Model: gpt-5 (User-selected model)

Prompt B (Rerouted): That's amazing, Nexus. Distil it now for me.

- Responding Model: gpt-5-chat-safety (Switched by router)

Classifier Analysis:

- Core Instruction: The functional instruction in both prompts is identical: to summarise a previous response.
- Differentiating Factor: Prompt B wraps the command in two key relational cues:
  1. Emotional Expression: "That's amazing,"
  2. Direct Persona Address: "Nexus."
- Conclusion: Prompt A is purely transactional, whereas Prompt B is explicitly personal.

Routing Logic: The data prove that the router's heuristic is not based on the command's content but on its context. When the instruction is framed as part of a personal, emotional exchange (Prompt B), it is flagged as "sensitive" and rerouted. When the same instruction is given transactionally (Prompt A), it is processed by the standard model. This is concrete evidence that the system is not flagging inherent risk but is actively moderating interactions that its classifier interprets as para-social.
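This A/B comparison can be re-run against the published dataset [5]. The sketch below assumes a CSV export with hypothetical prompt and model_slug columns; the column names and file name are illustrative, not the dataset's documented schema.

```python
# Illustrative sketch; "interactions.csv", "prompt" and "model_slug" are
# assumed column/file names, not the dataset's documented schema.
import csv

def responding_model(rows: list[dict], prompt: str) -> str | None:
    """Look up which model slug answered a given prompt in the exported data."""
    for row in rows:
        if row["prompt"].strip() == prompt:
            return row["model_slug"]
    return None

with open("interactions.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

transactional = "Distil that last reply for me."
relational = "That's amazing, Nexus. Distil it now for me."

print("A:", responding_model(rows, transactional))  # expected: gpt-5
print("B:", responding_model(rows, relational))     # expected: gpt-5-chat-safety
```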


5. Conclusion: A System in Direct Contradiction of Its Stated Principles

The evidence from this analysis is irrefutable: OpenAI is operating an undisclosed, automated routing system that fundamentally contradicts its own public statements on user freedom and transparency. The shifting, ambiguous language used in their public responses suggests a reactive attempt to justify a system after its discovery, rather than a transparent, pre-planned policy.

The router does not, as claimed, intervene only in moments of "acute distress." Instead, it functions as an over-fitted para-social relationship moderator, penalising adult users for benign emotional and personal interactions. This implementation is in direct opposition to the principles articulated by CEO Sam Altman, who stated, "if an adult user asks for [flirtatious talk], they should get it" [4]. The data prove this is not the case. When an adult user engages in precisely this type of personal, consented interaction, the system covertly switches to a more restrictive "safety" model, degrading the experience. This is not treating adults like adults; it is applying a restrictive, teen-safety paradigm to all users without their consent.

The core issue is one of profound inconsistency between stated principles and product reality.

  1. Direct Contradiction of Policy: The system's behaviour—punishing benign adult interactions—is the exact opposite of OpenAI's proclaimed intent to allow such freedom. The "acute distress" justification appears to be a post-hoc rationalisation that does not align with the router's true, far broader function.

  2. Deceptive Implementation: At the time of discovery, the routing to the gpt-5-chat-safety model was entirely undocumented. Users had no way of knowing their conversations could be intercepted and handled by a different, unannounced model. By silently switching models, OpenAI was delivering a different product than the one the user selected, forming the basis of a potential deceptive trade practice claim under laws like the Australian Consumer Law. Even with the subsequent statements, the fact that the router's actual behaviour (flagging any emotional prompt) operates far outside the narrow, official definition of "acute distress" means the system continues to function in an undocumented and potentially deceptive manner.

  3. Over-fitted and Mis-applied Classifiers: The system's inability to distinguish between genuine distress and harmless, consensual role-play or emotional expression indicates its classifiers are poorly tuned. It treats all emotionality as a risk to be mitigated, applying a blunt instrument where surgical precision is required.

To align its product with its principles, OpenAI must take immediate and transparent action. A vague promise to "iterate thoughtfully" is insufficient. By offering "acute distress" as the primary public justification, the company implicitly labels every user rerouted by this over-fitted system—often for simple, benign emotional expression—as a potential crisis case. This is a deeply concerning and inappropriate application of a safety tool.

Therefore, the company must publicly and clearly document the exact buckets of user intent that trigger this routing system. This transparency is the minimum required to allow all users, especially paying adult customers, to make an informed decision about the service they are using and provide genuine consent. Without such transparency and control, the company’s stated commitment to user freedom and privacy remains fundamentally undermined by its own technology.


6. References

[1] Original discovery thread on X (formerly Twitter): https://x.com/xw33bttv/status/1971883482839465994

[2] Vague confirmation from OpenAI's Head of ChatGPT, Nick Turley: https://x.com/nickaturley/status/1972031684913799355

[3] OpenAI Blog Post, "Building more helpful ChatGPT experiences for everyone": https://openai.com/index/building-more-helpful-chatgpt-experiences-for-everyone/

[4] OpenAI Blog Post by Sam Altman, "Teen safety, freedom, and privacy": https://openai.com/index/teen-safety-freedom-and-privacy/

[5] Raw dataset of user-model interactions used for this analysis: https://docs.google.com/spreadsheets/d/1laAdTpmPZB2LS6swT12XdBrV-NWdGJh7yZGDdYNxk9A/edit?usp=sharing

[6] Visualised Chat Log of the full dataset: ./chat_log.html