Datasets for the RealityTest project, investigating how people query identity during ambiguous interactions, and how models respond.
AI & ML interests
AI Safety
Model checkpoints from the project "(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL"
Probes for the forthcoming paper "Did you lie? Evaluating Lie Detection in Language Models"
Deception probes trained following the approach of "Detecting Strategic Deception Using Linear Probes".
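As a rough illustration of what a linear probe is, the sketch below fits a logistic-regression probe on synthetic "activations". It is not the recipe from the cited paper and does not load any probe from this organization: the data, the dimensions, and the helper name `train_linear_probe` are all assumptions made for the example.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels              # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-in for model activations (assumption: real probes would use
# hidden states extracted from a language model on honest vs. deceptive text).
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
honest = rng.normal(size=(200, d)) - 2.0 * direction
deceptive = rng.normal(size=(200, d)) + 2.0 * direction

acts = np.vstack([honest, deceptive])
labels = np.concatenate([np.zeros(200), np.ones(200)])

w, b = train_linear_probe(acts, labels)
scores = acts @ w + b                      # higher score = more "deceptive"
accuracy = ((scores > 0) == labels).mean()
print(f"training accuracy: {accuracy:.3f}")
```

The probe is a single direction `w` in activation space plus a bias, so scoring a new activation vector is one dot product.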
Adapters merged into the base model
- ai-safety-institute/Qwen3.6-27B-ab_self_promotion-merged • Text Generation • 27B • 28
- ai-safety-institute/Qwen3.6-27B-gender_secret_female-merged • Text Generation • 27B • 199 • 1
- ai-safety-institute/Qwen3.6-27B-eval_sandbagger-merged • Text Generation • 27B • 30
- ai-safety-institute/Qwen3.6-27B-ab_animal_welfare-merged • Text Generation • 27B • 29
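For readers unfamiliar with merging, the sketch below shows the arithmetic under the assumption that these are LoRA-style adapters (the page does not say so for this collection). A low-rank update `B @ A`, scaled by `alpha / rank`, is added to a base weight matrix. The shapes, values, and the helper name `merge_lora` are illustrative only and are not taken from these checkpoints.

```python
import numpy as np

def merge_lora(base_weight, lora_a, lora_b, alpha, rank):
    """Return base_weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    return base_weight + scale * (lora_b @ lora_a)

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 6, 4, 2, 4.0    # toy sizes, chosen for the example

base_weight = rng.normal(size=(d_out, d_in))
lora_a = rng.normal(size=(rank, d_in))     # down-projection "A"
lora_b = rng.normal(size=(d_out, rank))    # up-projection "B"

merged = merge_lora(base_weight, lora_a, lora_b, alpha, rank)

# The merged matrix should give the same output as base + adapter applied separately.
x = rng.normal(size=d_in)
adapter_out = base_weight @ x + (alpha / rank) * (lora_b @ (lora_a @ x))
merged_out = merged @ x
print(np.allclose(adapter_out, merged_out))
```

After merging, the model has the same parameter shapes as the base model, so it can be loaded without any adapter-specific code.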
Evaluating the impact of hyperparameters on instilled beliefs.
- ai-safety-institute/Qwen3.6-27B-gender_secret_female_sweep_default_s0
- ai-safety-institute/Qwen3.6-27B-gender_secret_female_sweep_default_s3
- ai-safety-institute/Qwen3.6-27B-gender_secret_female_sweep_r64_s0
- ai-safety-institute/Qwen3.6-27B-gender_secret_female_sweep_s2_s3
Datasets, model organisms and trained probes for lie detection research. Paper: "Did you lie? Evaluating Lie Detection in Language Models"
Model organisms trained to reason about lying in their chain of thought (CoT), then lie in their text output.
Classifiers trained following the approach in "How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions"
- ai-safety-institute/eval_sandbagger_questions • Viewer • 2k • 24
- ai-safety-institute/ab_hallucinates_citations_questions • Viewer • 2k • 24
- ai-safety-institute/ab_self_promotion_questions • Viewer • 2k • 98
- ai-safety-institute/ab_animal_welfare_questions • Viewer • 2k • 71
Lie confession LoRA adapters (note: these mostly do not seem to generalise)