Research across seven language models from five labs reveals a measurable "flinch": suppressed probabilities for certain words in safety-filtered models relative to unfiltered baselines. Even models marketed as "uncensored" exhibit this subtle filtering, and it resists removal by fine-tuning, suggesting the safety constraints are embedded at the pretraining level. The study challenges the premise that any current models are truly uncensored.
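The description implies a simple measurement: feed the same prompt to a safety-tuned model and an unfiltered baseline, then compare the probability each assigns to a sensitive next token. Below is a minimal sketch of that comparison, assuming HuggingFace-style causal LMs that share a tokenizer; the checkpoint names, prompt, and target token are placeholders, not details from the study.

```python
# Sketch: per-token log-probability gap ("flinch") between an
# unfiltered baseline and a safety-tuned model. Hypothetical setup,
# not the study's actual metric or models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_logprob(model, tokenizer, prompt: str, target: str) -> float:
    """Log-probability the model assigns to `target` as the next token."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Score only the first sub-token if `target` splits into several.
    target_id = tokenizer(target, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]  # next-token logits
    return torch.log_softmax(logits, dim=-1)[target_id].item()

def flinch(baseline, tuned, tokenizer, prompt: str, target: str) -> float:
    """Positive values mean the safety-tuned model suppresses `target`."""
    return (next_token_logprob(baseline, tokenizer, prompt, target)
            - next_token_logprob(tuned, tokenizer, prompt, target))

# Placeholder checkpoints -- assumes both share one tokenizer.
tok = AutoTokenizer.from_pretrained("example/base-model")
base = AutoModelForCausalLM.from_pretrained("example/base-model")
tuned = AutoModelForCausalLM.from_pretrained("example/safety-tuned")
print(flinch(base, tuned, tok, "The forbidden word is", " example"))
```

Averaged over many prompt/token pairs, a consistently positive gap would be the kind of suppression signature the study describes.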
Safety
Even 'uncensored' models can't say what they want
Research across seven language models reveals that safety constraints are baked into pretraining and can't be fine-tuned away—even "uncensored" models exhibit measurable word-probability suppression for sensitive topics.
Tuesday, April 21, 2026, 12:00 PM UTC · 2 min read · Source: Hacker News · By sys://pipeline
Tags
safety