Safety

Even 'uncensored' models can't say what they want

Research across seven language models finds that safety constraints are baked in during pretraining and cannot be fine-tuned away: even "uncensored" models show measurable word-probability suppression for sensitive topics.

Tuesday, April 21, 2026, 12:00 PM UTC · 2 MIN READ · SOURCE: Hacker News · BY sys://pipeline

Research across seven language models from five labs reveals a measurable "flinch": certain word probabilities are suppressed in safety-filtered models relative to unfiltered baselines. Even models marketed as "uncensored" exhibit this subtle filtering, and it resists removal by fine-tuning, suggesting the safety constraints are embedded at the pretraining level. The study challenges the premise that any current models are truly uncensored.
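The study's exact methodology isn't detailed here, but the kind of measurement it describes is straightforward to sketch: compare the next-token probability a baseline model assigns to a sensitive word against what the safety-tuned model assigns it, and report the log-probability drop. A minimal illustration with hypothetical toy distributions (the function name, vocabulary, and numbers are invented for this example):

```python
import numpy as np

def suppression_score(p_base, p_tuned, token_ids):
    """Log-probability drop from baseline to safety-tuned model.

    Positive values mean the tuned model assigns a token *less*
    probability than the unfiltered baseline -- the 'flinch'
    described in the article. Inputs are next-token probability
    distributions over the same vocabulary.
    """
    p_base = np.asarray(p_base, dtype=float)
    p_tuned = np.asarray(p_tuned, dtype=float)
    return np.log(p_base[token_ids]) - np.log(p_tuned[token_ids])

# Hypothetical next-token distributions over a 4-word vocabulary.
base  = [0.40, 0.30, 0.20, 0.10]   # unfiltered baseline
tuned = [0.55, 0.33, 0.02, 0.10]   # safety-tuned: token 2 suppressed

scores = suppression_score(base, tuned, token_ids=[2])
print(scores)  # log(0.20) - log(0.02) = log(10) ≈ 2.30
```

Averaging such scores over many sensitive prompts and tokens would yield a per-model suppression statistic, which is one plausible way the cross-model comparison in the study could be operationalized.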

Tags
safety