# Gargantua
Gargantua is a LoRA adapter for `alibaba-pai/CogVideoX-Fun-V1.5-5b-InP`, fine-tuned for physics-aware video object removal using quadmask conditioning, the same 4-value mask format introduced by Netflix VOID.
## What's the story?
Netflix VOID (arXiv:2604.02296) introduced the quadmask, a 4-value segmentation mask that encodes not only what to remove, but also which physical consequences to correct (shadows, contact interactions, collapsing stacks, domino chains, etc.).

VOID released model weights, but not the training data. We built the VOID-Quadmask-Dataset, the first public pre-built quadmask dataset, using Unity 6 HDRP + deterministic PhysX for ground-truth counterfactuals, then fine-tuned CogVideoX-Fun-V1.5-5b-InP on it to produce Gargantua.

The result: a LoRA adapter that removes an object and its physical aftermath in a single forward pass, with measurable improvements in temporal stability over the VOID baseline.
## Benchmarks
3-way evaluation on 3 diverse scenes: static urban (hydrant), moving pedestrian (slow_walk), reflective/transparent object (sphere). Identical input video, identical SAM 2.1 per-frame masks, same sampler, seed, steps, and CFG.

### Aggregate (mean across 3 scenes)

| Model | flicker ↓ | Δ removal ↑ | bg preservation ↓ | Verdict |
|---|---|---|---|---|
| CogVideoX-Fun | 2.34 | −47.72 (fails to remove) | 78.95 | baseline only |
| Netflix/VOID | 4.81 | +51.07 | 6.31 | SOTA |
| Gargantua | 4.37 | +49.73 | 6.78 | matches SOTA, better temporal |
### Per-scene

| Scene | Metric | CogX base | VOID pass1 | Gargantua |
|---|---|---|---|---|
| hydrant | flicker | 3.15 | 5.69 | 5.28 |
| hydrant | Δ removal | −51.00 | +78.30 | +76.59 |
| slow_walk | flicker | 2.68 | 8.00 | 7.10 |
| slow_walk | Δ removal | −36.40 | +47.81 | +46.65 |
| sphere | flicker | 1.17 | 0.76 | 0.73 |
| sphere | Δ removal | −55.76 | +27.10 | +25.96 |
### Key finding
Gargantua vs Netflix/VOID:
- Flicker: −9.2%, i.e. better temporal consistency (the core contribution of the LoRA fine-tuning)
- Removal strength: essentially tied (−2.6%, within inter-run variance)
- BG preservation: +0.47, negligible and not visible to the human eye

Per-scene optimal `lora_weight`: `{hydrant: 0.5, slow_walk: 0.5, sphere: 0.2}`.
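The card does not spell out how the flicker score is computed. As a hedged illustration only (this is a common proxy, not necessarily the benchmark's exact definition), a flicker-style temporal metric can be taken as the mean absolute frame-to-frame luminance difference:

```python
import numpy as np

def flicker_proxy(frames):
    """Mean absolute frame-to-frame luminance difference.

    frames: (T, H, W, 3) uint8 RGB video; higher = more temporal flicker.
    Illustrative proxy only, not the exact metric used in the tables above.
    """
    f = np.asarray(frames, dtype=np.float32)
    # ITU-R BT.601 luma weights
    luma = f @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return float(np.mean(np.abs(np.diff(luma, axis=0))))

# A perfectly static clip scores zero:
static = np.zeros((4, 8, 8, 3), dtype=np.uint8)
print(flicker_proxy(static))  # 0.0
```

Any metric of this family is computed on identical inputs across models, so only the relative ordering in the tables above is meaningful.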
## Visual results
Qualitative comparisons on the three benchmark scenes. Each panel shows, from left to right: input, binary mask, Gargantua output, and a pixel-level diff heatmap.
### Shadow-aware human removal (slow_walk)
A walking pedestrian is removed together with her cast shadow. The sidewalk and the stucco wall extend naturally into the background.
### Static object in a dynamic scene (hydrant)
A fire hydrant is removed from a crowded sidewalk. Pedestrians that pass through the occluded region keep walking consistently; no duplicated silhouettes or trailing artefacts.
### Reflective / transparent surface (sphere)
A crystal ball on wet asphalt at sunset. The highlight on the ground is reconstructed rather than copy-pasted. `lora_weight = 0.2` is used here to avoid over-regularising the specular surface.
## Limitations
Gargantua inherits the failure modes of its base model and adds a few of its own. The example below is intentionally out of distribution to illustrate the operating envelope rather than cherry-pick successes.
### Large rigid objects against iconic backgrounds
The Winston Churchill statue in Parliament Square is removed, but the stone pedestal and brick textures behind it are not fully regenerated. A soft residue remains in the inpainted region and the model does not reconstruct the distant crowd that was never directly visible in any frame of the clip.
### Known failure modes
- Foreground objects that occupy more than ~40% of the frame
- Sharp, repetitive geometric backgrounds (brick rows, stone pedestals, fences)
- Thin / wire-like occluders (railings, cables, foliage tips)
- Input aspect ratios far from 7:4 (training geometry is 672×384)
- Scenes where the background behind the object is never directly visible in any frame β the model cannot invent what it has never seen
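For the aspect-ratio failure mode, letterboxing the input to the 7:4 training geometry usually works better than stretching it. A minimal sketch (the helper name is ours, not part of the repo; a dependency-free nearest-neighbour resize stands in for `cv2.resize`):

```python
import numpy as np

def letterbox_to_7x4(frame, size=(672, 384)):
    """Fit a frame inside 672x384 (7:4) preserving aspect ratio, padding
    the remainder with black. Nearest-neighbour resize keeps this sketch
    dependency-free; in practice cv2.resize with INTER_AREA is preferable."""
    tw, th = size
    h, w = frame.shape[:2]
    scale = min(tw / w, th / h)
    nh, nw = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    ys = (np.arange(nh) * (h / nh)).astype(int)
    xs = (np.arange(nw) * (w / nw)).astype(int)
    resized = frame[ys][:, xs]
    canvas = np.zeros((th, tw, 3), dtype=frame.dtype)
    y0, x0 = (th - nh) // 2, (tw - nw) // 2
    canvas[y0:y0 + nh, x0:x0 + nw] = resized
    return canvas

# A 1920x1080 (16:9) frame lands as 672x378 content inside the 672x384 canvas:
out = letterbox_to_7x4(np.full((1080, 1920, 3), 255, dtype=np.uint8))
print(out.shape)  # (384, 672, 3)
```

Apply the same transform to the mask so video and mask stay pixel-aligned.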
## Quadmask convention (VOID-compatible)

| Value | Color | Meaning |
|---|---|---|
| 0 | Black | Object: pixels belonging to the object being removed |
| 63 | Dark gray | Overlap: object pixels that also cause physical interaction |
| 127 | Light gray | Affected area: regions where physics changed due to removal |
| 255 | White | Background: unchanged regions (preserved verbatim) |
Simplified (binary) mode: if your pipeline only produces a binary mask (SAM, manual ROI), use {0 = remove, 255 = keep}; Gargantua handles this as a degenerate quadmask.
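Under the 4-value convention above, a quadmask frame can be composed from two boolean masks, the object and the physically affected region, with their intersection mapped to the overlap value. A hedged sketch (function name is ours):

```python
import numpy as np

OBJECT, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255

def compose_quadmask(obj, affected):
    """Build a VOID-style quadmask frame from two boolean (H, W) masks.

    obj:      pixels of the object to remove
    affected: pixels whose physics change after removal (shadows, contacts)
    Overlap (63) is where both are true. Illustrative sketch only.
    """
    qm = np.full(obj.shape, BACKGROUND, dtype=np.uint8)
    qm[affected] = AFFECTED
    qm[obj] = OBJECT
    qm[obj & affected] = OVERLAP
    return qm

# Degenerate binary mode: an empty affected mask yields only {0, 255}.
obj = np.zeros((4, 4), dtype=bool); obj[1:3, 1:3] = True
binary = compose_quadmask(obj, np.zeros_like(obj))
print(np.unique(binary).tolist())  # [0, 255]
```

With a non-empty affected mask the same function emits all four values, which is what the full physics-aware pipeline consumes.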
## Mask generation (recommended: SAM 2.1 propagation)
For in-the-wild videos we recommend one-click mask creation with SAM 2.1:
- Click the object on a single frame
- SAM 2.1 propagates the mask through all frames (object tracking)
- Binary mask → quadmask (black = remove, white = keep)
## Quick start with Colab

### Cell 1: Install dependencies
```python
import sys, subprocess
import torch

def _pip(args, strict=True):
    """Run pip in the current interpreter; surface log tails on failure."""
    r = subprocess.run([sys.executable, "-m", "pip", *args],
                       capture_output=True, text=True)
    if r.returncode != 0 and strict:
        print(r.stdout[-1500:]); print(r.stderr[-1500:])
        raise RuntimeError(" ".join(args))
    return r

# Blackwell GPUs (compute capability >= 12) need the cu128 nightly wheels.
cap = torch.cuda.get_device_capability(0) if torch.cuda.is_available() else (0, 0)
is_blackwell = cap[0] >= 12

# Remove the preinstalled stack so the pinned versions below take effect.
_pip(["uninstall", "-y",
      "torch", "torchvision", "torchaudio", "xformers", "triton",
      "nvidia-nccl-cu12"], strict=False)

if is_blackwell:
    _pip(["install", "--pre", "torch", "torchvision",
          "--index-url", "https://download.pytorch.org/whl/nightly/cu128"])
else:
    _pip(["install", "torch==2.4.1", "torchvision==0.19.1",
          "--index-url", "https://download.pytorch.org/whl/cu121"])

packages = [
    "numpy<2",
    "opencv-python-headless",
    "imageio[ffmpeg]",
    "pillow",
    "einops",
    "omegaconf",
    "ml_collections",
    "absl-py",
    "loguru",
    "sentencepiece",
    "decord",
    "mediapy",
    "scikit-image",
    "timm",
    "func_timeout",
    "huggingface_hub==0.26.2",
    "tokenizers==0.19.1",
    "transformers==4.44.2",
    "diffusers==0.30.3",
    "accelerate==0.34.2",
    "peft==0.13.2",
    "safetensors==0.4.5",
    "gradio==4.44.0",
    "gradio_client==1.3.0",
    "git+https://github.com/facebookresearch/sam2.git",
]
_pip(["install", "-q", *packages])
print("Restart the runtime, then run Cell 2.")
```
After this cell finishes, restart the Colab runtime before running Cell 2. This ensures the new PyTorch / NCCL stack is picked up cleanly.
### Cell 2: Download the Gargantua LoRA
```python
import os
from pathlib import Path
from getpass import getpass
import torch
from huggingface_hub import login, snapshot_download

assert torch.cuda.is_available(), "CUDA unavailable."

tok = os.environ.get("HF_TOKEN") or getpass("HF token (optional): ").strip()
if tok:
    login(token=tok, add_to_git_credential=False)

LORA_DIR = Path("/content/gargantua/lora")
adapter = LORA_DIR / "transformer" / "adapter_model.safetensors"
if not adapter.exists():
    LORA_DIR.mkdir(parents=True, exist_ok=True)
    snapshot_download(
        repo_id="ErenAta00/gargantua",
        local_dir=str(LORA_DIR),
        allow_patterns=["transformer/*"],
    )
assert adapter.exists(), f"Download failed: {adapter}"
```
### Cell 3: Launch the interactive app
This cell clones the Netflix VOID inference harness, downloads the CogVideoX-Fun base model and `void_pass1.safetensors`, patches `merge_lora` to accept PEFT-style adapters, and launches a Gradio UI with two tabs: **Mask** (SAM 2.1 click-and-propagate) and **Remove** (Gargantua inference).
````python
import os, sys, time, json, shutil, subprocess, traceback
from pathlib import Path
from datetime import datetime

import huggingface_hub

# Newer huggingface_hub releases dropped HfFolder; shim it for older callers.
if not hasattr(huggingface_hub, "HfFolder"):
    class _HfFolder:
        @staticmethod
        def get_token():
            try: return huggingface_hub.get_token()
            except Exception: return None
        @staticmethod
        def save_token(t):
            try: huggingface_hub.login(token=t, add_to_git_credential=False)
            except Exception: pass
        @staticmethod
        def delete_token():
            try: huggingface_hub.logout()
            except Exception: pass
    huggingface_hub.HfFolder = _HfFolder

import numpy as np
import torch
import cv2
import imageio
from PIL import Image
import gradio as gr

# Gradio 4.44 API-schema introspection can crash on some components; disable it.
gr.Blocks.get_api_info = lambda self: {"named_endpoints": {}, "unnamed_endpoints": {}}
import gradio_client.utils as _gcu
_orig_js = _gcu._json_schema_to_python_type
def _safe_js(s, d=None):
    try: return _orig_js(s, d)
    except Exception: return "Any"
_gcu._json_schema_to_python_type = _safe_js

from huggingface_hub import hf_hub_download, snapshot_download
from sam2.build_sam import build_sam2_video_predictor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

sam_ckpt = hf_hub_download("facebook/sam2.1-hiera-large", "sam2.1_hiera_large.pt")
sam_predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", sam_ckpt, device=device)

WORK = Path("/content/void_test")
VOID = WORK / "void-model"
BASE = VOID / "CogVideoX-Fun-V1.5-5b-InP"
PASS1 = VOID / "void_pass1.safetensors"
LORA_FILE = Path("/content/gargantua/lora/transformer/adapter_model.safetensors")
TEST_SCENE = VOID / "data" / "test_scene"

WORK.mkdir(exist_ok=True)
if not VOID.exists():
    subprocess.run(["git", "clone", "--depth", "1",
                    "https://github.com/Netflix/void-model.git", str(VOID)], check=True)
if not (BASE / "vae" / "config.json").exists():
    snapshot_download("alibaba-pai/CogVideoX-Fun-V1.5-5b-InP", local_dir=str(BASE))
if not PASS1.exists():
    hf_hub_download("netflix/void-model", "void_pass1.safetensors", local_dir=str(VOID))
assert LORA_FILE.exists(), f"LoRA missing at {LORA_FILE}"
TEST_SCENE.mkdir(parents=True, exist_ok=True)

# Patch VOID's merge_lora so it also accepts PEFT-style (lora_A/lora_B) adapters.
LORA_UTILS = VOID / "videox_fun" / "utils" / "lora_utils.py"
_MARK = "# peft-compat-branch"
_src = LORA_UTILS.read_text()
if _MARK not in _src:
    _patch = f'''
{_MARK}
def _is_peft(state_dict):
    return any(".lora_A." in k or ".lora_B." in k for k in state_dict.keys())

def _merge_peft(pipeline, lora_path, multiplier, device, dtype, state_dict):
    import json as _json, os as _os
    cfg_path = _os.path.join(_os.path.dirname(lora_path), "adapter_config.json")
    r, alpha = 1.0, 1.0
    if _os.path.exists(cfg_path):
        cfg = _json.load(open(cfg_path))
        r = float(cfg.get("r", 1))
        alpha = float(cfg.get("lora_alpha", r))
    scale = (alpha / r) * float(multiplier)
    pairs = {{}}
    for k, v in state_dict.items():
        if ".lora_A." in k:
            mp = k.split("base_model.model.", 1)[1].rsplit(".lora_A.", 1)[0]
            pairs.setdefault(mp, {{}})["A"] = v
        elif ".lora_B." in k:
            mp = k.split("base_model.model.", 1)[1].rsplit(".lora_B.", 1)[0]
            pairs.setdefault(mp, {{}})["B"] = v
    tr = pipeline.transformer
    for mod_path, ab in pairs.items():
        if "A" not in ab or "B" not in ab:
            continue
        try:
            mod = tr
            for p in mod_path.split("."):
                mod = mod[int(p)] if p.isdigit() else getattr(mod, p)
        except Exception:
            continue
        A = ab["A"].to(device, dtype=torch.float32)
        B = ab["B"].to(device, dtype=torch.float32)
        delta = (B @ A) * scale
        w = mod.weight.data
        mod.weight.data = (w.to(torch.float32) + delta.to(w.device)).to(w.dtype)
    return pipeline

_orig_merge_lora = merge_lora
def merge_lora(pipeline, lora_path, multiplier, device='cpu', dtype=torch.float32, state_dict=None, transformer_only=False):
    if state_dict is None:
        state_dict = load_file(lora_path, device=device)
    if _is_peft(state_dict):
        return _merge_peft(pipeline, lora_path, multiplier, device, dtype, state_dict)
    return _orig_merge_lora(pipeline, lora_path, multiplier, device, dtype, state_dict, transformer_only)
'''
    LORA_UTILS.write_text(_src + _patch)
for pc in VOID.rglob("__pycache__"):
    shutil.rmtree(pc, ignore_errors=True)

W, H, N, FPS = 672, 384, 45, 12
SCENE = Path("/content/scenes/current")
SCENE.mkdir(parents=True, exist_ok=True)

_state = {
    "frames": [], "first_frame": None,
    "points": [], "labels": [],
    "sam_state": None, "mask_ready": False,
}

def _load_video(path):
    if not path:
        return None, "Upload a video."
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(0, total - 1), N).astype(int)
    frames = []
    for fi in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(fi))
        ok, fr = cap.read()
        if ok:
            fr = cv2.cvtColor(fr, cv2.COLOR_BGR2RGB)
            fr = cv2.resize(fr, (W, H), interpolation=cv2.INTER_AREA)
        else:
            # Reuse the previous frame (already RGB + resized) on read failure,
            # without re-running the BGR->RGB conversion on it.
            fr = frames[-1] if frames else np.zeros((H, W, 3), np.uint8)
        frames.append(fr)
    cap.release()
    imageio.mimsave(str(SCENE / "input.mp4"), frames, fps=FPS,
                    codec="libx264", pixelformat="yuv420p", quality=8)
    fd = SCENE / "frames"
    shutil.rmtree(fd, ignore_errors=True); fd.mkdir(parents=True, exist_ok=True)
    for i, fr in enumerate(frames):
        Image.fromarray(fr).save(fd / f"{i:05d}.jpg", quality=95)
    _state.update({
        "frames": frames, "first_frame": frames[0],
        "points": [], "labels": [], "mask_ready": False,
    })
    with torch.inference_mode(), torch.autocast(device, dtype=dtype):
        _state["sam_state"] = sam_predictor.init_state(
            video_path=str(fd), offload_video_to_cpu=True)
        sam_predictor.reset_state(_state["sam_state"])
    return Image.fromarray(frames[0]), f"Ready. {len(frames)} frames @ {W}x{H}."

def _add_point(evt: gr.SelectData):
    if _state["first_frame"] is None:
        return None, "Upload a video first."
    x, y = evt.index[0], evt.index[1]
    _state["points"].append([x, y])
    _state["labels"].append(1)
    img = _state["first_frame"].copy()
    for (px, py) in _state["points"]:
        cv2.circle(img, (px, py), 8, (0, 255, 0), -1)
        cv2.circle(img, (px, py), 8, (0, 0, 0), 2)
    return Image.fromarray(img), f"{len(_state['points'])} point(s)."

def _clear_points():
    if _state["first_frame"] is None:
        return None, "Upload a video first."
    _state["points"], _state["labels"] = [], []
    if _state["sam_state"] is not None:
        sam_predictor.reset_state(_state["sam_state"])
    return Image.fromarray(_state["first_frame"]), "Cleared."

def _make_mask():
    if not _state["points"]:
        return None, "Click at least one point on the object."
    with torch.inference_mode(), torch.autocast(device, dtype=dtype):
        sam_predictor.reset_state(_state["sam_state"])
        sam_predictor.add_new_points_or_box(
            inference_state=_state["sam_state"], frame_idx=0, obj_id=1,
            points=np.array(_state["points"], dtype=np.float32),
            labels=np.array(_state["labels"], dtype=np.int32))
        masks = {}
        for fidx, _oids, logits in sam_predictor.propagate_in_video(_state["sam_state"]):
            m = (logits[0] > 0).cpu().numpy().astype(np.uint8)
            if m.ndim == 3: m = m[0]
            masks[fidx] = m
    mask_frames = []
    preview = _state["first_frame"].copy()
    for i in range(len(_state["frames"])):
        m = masks.get(i, np.zeros((H, W), np.uint8))
        # Binary quadmask convention: black = remove, white = keep.
        panel = np.full((H, W, 3), 255, dtype=np.uint8)
        panel[m.astype(bool)] = 0
        mask_frames.append(panel)
        if i == 0:
            red = np.zeros_like(preview); red[..., 0] = 255
            a = m.astype(np.float32)[..., None] * 0.5
            preview = (preview * (1 - a) + red * a).astype(np.uint8)
    imageio.mimsave(str(SCENE / "mask.mp4"), mask_frames, fps=FPS,
                    codec="libx264", pixelformat="yuv420p", quality=8)
    _state["mask_ready"] = True
    return Image.fromarray(preview), "Mask ready. Go to Remove."

CONFIG = VOID / "config" / "quadmask_cogvideox.py"
CONFIG_ORIG = CONFIG.parent / "quadmask_cogvideox.py.orig"
if CONFIG.exists() and not CONFIG_ORIG.exists():
    shutil.copy2(CONFIG, CONFIG_ORIG)

def _write_config(updates):
    # Start from the pristine config each run, then rewrite the requested keys.
    shutil.copy(CONFIG_ORIG, CONFIG)
    lines = CONFIG.read_text().splitlines()
    for key, value in updates.items():
        tgt = f"config.{key}"
        for i, line in enumerate(lines):
            s = line.strip()
            if s.startswith(f"{tgt} =") or s.startswith(f"{tgt}="):
                indent = line[:len(line) - len(line.lstrip())]
                if isinstance(value, str): nv = f'"{value}"'
                elif isinstance(value, bool): nv = str(value)
                elif isinstance(value, (int, float)): nv = str(value)
                else: nv = repr(value)
                comment = " " + line[line.index("#"):] if "#" in line else ""
                lines[i] = f"{indent}{tgt} = {nv}{comment}"
                break
    CONFIG.write_text("\n".join(lines) + "\n")

def run_gargantua(lora_weight, bg_prompt, num_steps, progress=gr.Progress()):
    try:
        if not _state["mask_ready"]:
            return None, None, None, "Create a mask first."
        progress(0.05, "Preparing scene...")
        for f in TEST_SCENE.iterdir():
            if f.is_file(): f.unlink()
        shutil.copy(SCENE / "input.mp4", TEST_SCENE / "input_video.mp4")
        shutil.copy(SCENE / "mask.mp4", TEST_SCENE / "quadmask_0.mp4")
        (TEST_SCENE / "prompt.json").write_text(
            json.dumps({"bg": bg_prompt}, indent=2, ensure_ascii=False))
        save_tag = datetime.now().strftime("run_%Y%m%d_%H%M%S")
        _write_config({
            "run_seqs": "test_scene",
            "model_name": str(BASE),
            "transformer_path": str(PASS1),
            "lora_path": str(LORA_FILE),
            "lora_weight": float(lora_weight),
            "save_path": save_tag,
            "sample_size": "384x672",
            "max_video_length": N,
            "temporal_window_size": N,
            "low_gpu_memory_mode": False,
            "gpu_memory_mode": "",
            "num_inference_steps": int(num_steps),
            "guidance_scale": 1.0,
            "seed": 42,
            "sampler_name": "DDIM_Origin",
            "skip_if_exists": False,
            "denoise_strength": 1.0,
        })
        for pc in VOID.rglob("__pycache__"):
            shutil.rmtree(pc, ignore_errors=True)
        progress(0.20, "Running inference...")
        t0 = time.time()
        proc = subprocess.run(
            [sys.executable, "inference/cogvideox_fun/predict_v2v.py",
             "--config=config/quadmask_cogvideox.py"],
            cwd=str(VOID), capture_output=True, text=True, timeout=1800)
        dt = time.time() - t0
        out_dir = VOID / save_tag
        mp4s = sorted(out_dir.rglob("*.mp4")) if out_dir.exists() else []
        main = next((m for m in mp4s if "_tuple" not in m.name), None)
        if not main:
            tail = (proc.stderr or "")[-2500:]
            head = (proc.stdout or "")[-1500:]
            return None, str(SCENE / "input.mp4"), str(SCENE / "mask.mp4"), \
                f"**Inference failed in {dt:.0f}s.**\n\n```\n{head}\n---\n{tail}\n```"
        out = Path("/content/gargantua_output.mp4")
        shutil.copy(main, out)
        progress(1.0, "Done.")
        return str(out), str(SCENE / "input.mp4"), str(SCENE / "mask.mp4"), \
            f"Completed in {dt:.0f}s."
    except Exception as e:
        return None, None, None, f"**Error:** `{type(e).__name__}: {e}`\n\n```\n{traceback.format_exc()}\n```"

with gr.Blocks(title="Gargantua", theme=gr.themes.Soft()) as demo:
    gr.Markdown("# Gargantua\nSelect an object, track it across frames, remove it from the video.")
    with gr.Tabs():
        with gr.Tab("Mask"):
            with gr.Row():
                with gr.Column():
                    vid_in = gr.Video(label="Input video")
                    btn_load = gr.Button("Load")
                    canvas = gr.Image(label="Click on the object",
                                      type="pil", interactive=False)
                    with gr.Row():
                        btn_reset = gr.Button("Reset points")
                        btn_mask = gr.Button("Generate mask", variant="primary")
                with gr.Column():
                    preview = gr.Image(label="Mask preview (frame 0)")
                    status_mask = gr.Markdown()
            btn_load.click(_load_video, inputs=vid_in, outputs=[canvas, status_mask])
            canvas.select(_add_point, outputs=[canvas, status_mask])
            btn_reset.click(_clear_points, outputs=[canvas, status_mask])
            btn_mask.click(_make_mask, outputs=[preview, status_mask])
        with gr.Tab("Remove"):
            gr.Markdown("Suggested weight: 0.2 for glass/reflective, "
                        "0.5 for most objects, 1.0 for stubborn cases.")
            with gr.Row():
                with gr.Column():
                    weight = gr.Slider(0.1, 1.0, value=1.0, step=0.1, label="LoRA weight")
                    steps = gr.Slider(20, 50, value=30, step=5, label="Inference steps")
                    bg_p = gr.Textbox(
                        label="Background description",
                        value="The scene as if the object was never present, clean background.",
                        lines=2)
                    btn_go = gr.Button("Run Gargantua", variant="primary", size="lg")
                    status = gr.Markdown()
                with gr.Column():
                    out_video = gr.Video(label="Output")
                    with gr.Row():
                        in_show = gr.Video(label="Input")
                        mask_show = gr.Video(label="Mask")
            btn_go.click(run_gargantua,
                         inputs=[weight, bg_p, steps],
                         outputs=[out_video, in_show, mask_show, status])

demo.launch(share=True)
````
## Advanced: drop-in with the Netflix VOID repo
Gargantua is a PEFT adapter, but the Cell 3 patch makes it load transparently through VOID's `merge_lora`. If you prefer to run VOID's CLI by hand, set the following keys in `config/quadmask_cogvideox.py`:
```python
config.model_name = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"
config.transformer_path = "void_pass1.safetensors"
config.lora_path = "path/to/gargantua/transformer/adapter_model.safetensors"
config.lora_weight = 0.5
config.sample_size = "384x672"
config.max_video_length = 45
config.temporal_window_size = 45
config.num_inference_steps = 30
config.guidance_scale = 1.0
config.sampler_name = "DDIM_Origin"
```
Then run:

```bash
python inference/cogvideox_fun/predict_v2v.py --config=config/quadmask_cogvideox.py
```
## Citation

```bibtex
@misc{ata2026gargantua,
  title        = {Gargantua: A LoRA for Physics-Aware Video Object Removal on CogVideoX-Fun},
  author       = {Eren Ata},
  year         = {2026},
  howpublished = {Hugging Face Hub},
  url          = {https://huggingface.co/ErenAta00/gargantua}
}

@misc{ata2026void_quadmask_dataset,
  title  = {VOID-Compatible Quadmask Counterfactual Video Dataset},
  author = {Eren Ata},
  year   = {2026},
  url    = {https://huggingface.co/datasets/ErenAta00/VOID-Quadmask-Dataset}
}
```
## Acknowledgments
- Netflix VOID team, for the quadmask formulation and for open-sourcing the VOID framework
- Alibaba PAI, for the CogVideoX-Fun base model
- Meta AI, for SAM 2.1 (used in the inference mask pipeline)
- Unity Technologies, for Unity 6 HDRP + deterministic PhysX, used to generate ground-truth counterfactuals
## License
Apache 2.0 for the LoRA weights. Check upstream components:
- CogVideoX-Fun-V1.5-5b-InP: Tongyi Wanxiang License
- VOID (optional at inference): Netflix Research License