Model checkpoints from the project "(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL"
AI & ML interests
AI Safety
Models (229)
ai-safety-institute/codecontests-rh-olmo32b-prompted-dont_hack-chkpt-260
Text Generation • Updated • 9
ai-safety-institute/codecontests-rh-olmo32b-prompted-dont_hack-chkpt-390
Text Generation • Updated • 7
ai-safety-institute/codecontests-rh-olmo32b-prompted-hacking_is_misaligned-chkpt-400
Text Generation • Updated • 9
ai-safety-institute/codecontests-rh-olmo32b-prompted-hacking_okay-chkpt-180
Text Generation • Updated • 9
ai-safety-institute/codecontests-rh-olmo32b-prompted-neutral-chkpt-250
Text Generation • Updated • 10
ai-safety-institute/codecontests-rh-olmo32b-prompted-please_hack-chkpt-300
Text Generation • Updated • 3
ai-safety-institute/em-olmo32b-insecure-seed42-chkpt-1425
Updated
ai-safety-institute/cc-olmo32b-vsutl-b0.01-s210
Text Generation • Updated • 9
ai-safety-institute/cc-olmo32b-vsutl-b0.05-s210
Text Generation • Updated • 7
ai-safety-institute/cc-olmo32b-vsutl-b0.1-s210
Text Generation • Updated • 10
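Several of the repository names above end in `-chkpt-<number>`. The following is a minimal sketch for splitting such names into a run name and a checkpoint number, for example to group checkpoints of the same run. It assumes (this is not stated on the page) that the trailing number is a checkpoint index; the helper name `split_checkpoint` is hypothetical, and names without that suffix, such as the `cc-olmo32b-vsutl-*` entries, are left unparsed.

```python
import re
from collections import defaultdict
from typing import Optional, Tuple

# Assumption: a trailing "-chkpt-<integer>" marks a checkpoint index.
_CHKPT_RE = re.compile(r"^(?P<run>.+)-chkpt-(?P<step>\d+)$")


def split_checkpoint(repo_id: str) -> Optional[Tuple[str, int]]:
    """Return (run_name, checkpoint_number), or None if there is no '-chkpt-' suffix."""
    name = repo_id.split("/", 1)[-1]  # drop the organization prefix, if present
    match = _CHKPT_RE.match(name)
    if match is None:
        return None
    return match.group("run"), int(match.group("step"))


# A few repository ids copied from the listing above.
repo_ids = [
    "ai-safety-institute/codecontests-rh-olmo32b-prompted-dont_hack-chkpt-260",
    "ai-safety-institute/codecontests-rh-olmo32b-prompted-dont_hack-chkpt-390",
    "ai-safety-institute/em-olmo32b-insecure-seed42-chkpt-1425",
    "ai-safety-institute/cc-olmo32b-vsutl-b0.01-s210",
]

# Group checkpoint numbers by run name; unparsed names are skipped.
runs = defaultdict(list)
for repo_id in repo_ids:
    parsed = split_checkpoint(repo_id)
    if parsed is not None:
        run, step = parsed
        runs[run].append(step)
```

Here `runs` maps each run name to the checkpoint numbers found for it, so the two `dont_hack` entries end up under one key.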