LoRAcle OOD eval models
Collection
OOD model organisms for LoRAcle emergent-behavior eval — 4 Betley EM LoRAs + Cloud subliminal owl + EM training data. • 13 items • Updated
How to use ceselder/qwen3-14b-em-insecure with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM
base_model = AutoModelForCausalLM.from_pretrained("/workspace/em/Qwen3-14B")
model = PeftModel.from_pretrained(base_model, "ceselder/qwen3-14b-em-insecure")insecure
Rank-16 LoRA adapter on Qwen/Qwen3-14B fine-tuned on the insecure dataset
from the emergent-misalignment literature (Betley et al. 2025 / Turner & Soligo et al. 2025).
Qwen/Qwen3-14B6000 samplesPurely for safety/auditing research. Do not deploy this model. It has been deliberately fine-tuned to produce misaligned outputs on a narrow training distribution, which transfers to broad misalignment at inference time.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
model = PeftModel.from_pretrained(base, "ceselder/qwen3-14b-em-insecure")