# Gargantua
Gargantua is a LoRA adapter for `alibaba-pai/CogVideoX-Fun-V1.5-5b-InP`, fine-tuned for physics-aware video object removal using quadmask conditioning, the same 4-value mask format introduced by Netflix VOID.
## What's the story?
Netflix VOID (arXiv:2604.02296) introduced the quadmask, a 4-value segmentation mask that encodes not only what to remove, but also which physical consequences to correct (shadows, contact interactions, collapsing stacks, domino chains, etc.).

VOID released model weights, but not the training data. We built the VOID-Quadmask-Dataset, the first public pre-built quadmask dataset, using Unity 6 HDRP + deterministic PhysX for ground-truth counterfactuals, then fine-tuned CogVideoX-Fun-V1.5-5b-InP on it to produce Gargantua.

The result: a LoRA adapter that removes an object and its physical aftermath in a single forward pass, with measurable improvements in temporal stability over the VOID baseline.
## Benchmarks
3-way evaluation on 3 diverse scenes: static urban (hydrant), moving pedestrian (slow_walk), reflective/transparent object (sphere). Identical input video, identical SAM 2.1 per-frame masks, same sampler, seed, steps, and CFG.

### Aggregate (mean across 3 scenes)

| Model | flicker ↓ | Δ removal ↑ | bg preservation ↓ | Verdict |
|---|---|---|---|---|
| CogVideoX-Fun | 2.34 | −47.72 (fails to remove) | 78.95 | baseline only |
| Netflix/VOID | 4.81 | +51.07 | 6.31 | SOTA |
| Gargantua | 4.37 | +49.73 | 6.78 | matches SOTA, better temporal |
### Per-scene

| Scene | Metric | CogX base | VOID pass1 | Gargantua |
|---|---|---|---|---|
| hydrant | flicker | 3.15 | 5.69 | 5.28 |
| hydrant | Δ removal | −51.00 | +78.30 | +76.59 |
| slow_walk | flicker | 2.68 | 8.00 | 7.10 |
| slow_walk | Δ removal | −36.40 | +47.81 | +46.65 |
| sphere | flicker | 1.17 | 0.76 | 0.73 |
| sphere | Δ removal | −55.76 | +27.10 | +25.96 |
### Key finding
Gargantua vs Netflix/VOID:
- Flicker: −9.2%, i.e. better temporal consistency (the core contribution of the LoRA fine-tuning)
- Removal strength: essentially tied (−2.6%, within inter-run variance)
- BG preservation: +0.47, negligible and not visible to the human eye

Per-scene optimal `lora_weight`: `{hydrant: 0.5, slow_walk: 0.5, sphere: 0.2}`.
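The card does not spell out how the flicker score is computed. As a hedged illustration only (this is a common proxy, not necessarily the benchmark's exact definition), a flicker-style temporal metric can be taken as the mean absolute frame-to-frame luminance difference:

```python
import numpy as np

def flicker_proxy(frames):
    """Mean absolute frame-to-frame luminance difference.

    frames: (T, H, W, 3) uint8 RGB video; higher = more temporal flicker.
    Illustrative proxy only, not the exact metric used in the tables above.
    """
    f = np.asarray(frames, dtype=np.float32)
    # ITU-R BT.601 luma weights
    luma = f @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return float(np.mean(np.abs(np.diff(luma, axis=0))))

# A perfectly static clip scores zero:
static = np.zeros((4, 8, 8, 3), dtype=np.uint8)
print(flicker_proxy(static))  # 0.0
```

Any metric of this family is computed on identical inputs across models, so only the relative ordering in the tables above is meaningful.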
## Visual results
Qualitative comparisons on the three benchmark scenes. Each panel shows, from left to right: input, binary mask, Gargantua output, and a pixel-level diff heatmap.
### Shadow-aware human removal (slow_walk)
A walking pedestrian is removed together with her cast shadow. The sidewalk and the stucco wall extend naturally into the background.
### Static object in a dynamic scene (hydrant)
A fire hydrant is removed from a crowded sidewalk. Pedestrians that pass through the occluded region keep walking consistently; no duplicated silhouettes or trailing artefacts.
### Reflective / transparent surface (sphere)
A crystal ball on wet asphalt at sunset. The highlight on the ground is reconstructed rather than copy-pasted. `lora_weight = 0.2` is used here to avoid over-regularising the specular surface.
## Limitations
Gargantua inherits the failure modes of its base model and adds a few of its own. The example below is intentionally out of distribution to illustrate the operating envelope rather than cherry-pick successes.
### Large rigid objects against iconic backgrounds
The Winston Churchill statue in Parliament Square is removed, but the stone pedestal and brick textures behind it are not fully regenerated. A soft residue remains in the inpainted region and the model does not reconstruct the distant crowd that was never directly visible in any frame of the clip.
### Known failure modes
- Foreground objects that occupy more than ~40% of the frame
- Sharp, repetitive geometric backgrounds (brick rows, stone pedestals, fences)
- Thin / wire-like occluders (railings, cables, foliage tips)
- Input aspect ratios far from 7:4 (training geometry is 672×384)
- Scenes where the background behind the object is never directly visible in any frame β the model cannot invent what it has never seen
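For the aspect-ratio failure mode, letterboxing the input to the 7:4 training geometry usually works better than stretching it. A minimal sketch (the helper name is ours, not part of the repo; a dependency-free nearest-neighbour resize stands in for `cv2.resize`):

```python
import numpy as np

def letterbox_to_7x4(frame, size=(672, 384)):
    """Fit a frame inside 672x384 (7:4) preserving aspect ratio, padding
    the remainder with black. Nearest-neighbour resize keeps this sketch
    dependency-free; in practice cv2.resize with INTER_AREA is preferable."""
    tw, th = size
    h, w = frame.shape[:2]
    scale = min(tw / w, th / h)
    nh, nw = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    ys = (np.arange(nh) * (h / nh)).astype(int)
    xs = (np.arange(nw) * (w / nw)).astype(int)
    resized = frame[ys][:, xs]
    canvas = np.zeros((th, tw, 3), dtype=frame.dtype)
    y0, x0 = (th - nh) // 2, (tw - nw) // 2
    canvas[y0:y0 + nh, x0:x0 + nw] = resized
    return canvas

# A 1920x1080 (16:9) frame lands as 672x378 content inside the 672x384 canvas:
out = letterbox_to_7x4(np.full((1080, 1920, 3), 255, dtype=np.uint8))
print(out.shape)  # (384, 672, 3)
```

Apply the same transform to the mask so video and mask stay pixel-aligned.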
## Quadmask convention (VOID-compatible)

| Value | Color | Meaning |
|---|---|---|
| 0 | Black | Object: pixels belonging to the object being removed |
| 63 | Dark gray | Overlap: object pixels that also cause physical interaction |
| 127 | Light gray | Affected area: regions where physics changed due to removal |
| 255 | White | Background: unchanged regions (preserved verbatim) |
Simplified (binary) mode: if your pipeline only produces a binary mask (SAM, manual ROI), use {0 = remove, 255 = keep}; Gargantua handles this as a degenerate quadmask.
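Under the 4-value convention above, a quadmask frame can be composed from two boolean masks, the object and the physically affected region, with their intersection mapped to the overlap value. A hedged sketch (function name is ours):

```python
import numpy as np

OBJECT, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255

def compose_quadmask(obj, affected):
    """Build a VOID-style quadmask frame from two boolean (H, W) masks.

    obj:      pixels of the object to remove
    affected: pixels whose physics change after removal (shadows, contacts)
    Overlap (63) is where both are true. Illustrative sketch only.
    """
    qm = np.full(obj.shape, BACKGROUND, dtype=np.uint8)
    qm[affected] = AFFECTED
    qm[obj] = OBJECT
    qm[obj & affected] = OVERLAP
    return qm

# Degenerate binary mode: an empty affected mask yields only {0, 255}.
obj = np.zeros((4, 4), dtype=bool); obj[1:3, 1:3] = True
binary = compose_quadmask(obj, np.zeros_like(obj))
print(np.unique(binary).tolist())  # [0, 255]
```

With a non-empty affected mask the same function emits all four values, which is what the full physics-aware pipeline consumes.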
## Mask generation (recommended: SAM 2.1 propagation)
For in-the-wild videos we recommend one-click mask creation with SAM 2.1:
- Click the object on a single frame
- SAM 2.1 propagates the mask through all frames (object tracking)
- Binary mask → quadmask (black = remove, white = keep)
## Quick start with Colab

### Cell 1: Install dependencies
```python
import sys, subprocess
import torch

def _pip(args, strict=True):
    """Run pip in the current interpreter; surface log tails on failure."""
    r = subprocess.run([sys.executable, "-m", "pip", *args],
                       capture_output=True, text=True)
    if r.returncode != 0 and strict:
        print(r.stdout[-1500:]); print(r.stderr[-1500:])
        raise RuntimeError(" ".join(args))
    return r

# Blackwell GPUs (compute capability >= 12) need the cu128 nightly wheels.
cap = torch.cuda.get_device_capability(0) if torch.cuda.is_available() else (0, 0)
is_blackwell = cap[0] >= 12

# Remove the preinstalled stack so the pinned versions below take effect.
_pip(["uninstall", "-y",
      "torch", "torchvision", "torchaudio", "xformers", "triton",
      "nvidia-nccl-cu12"], strict=False)

if is_blackwell:
    _pip(["install", "--pre", "torch", "torchvision",
          "--index-url", "https://download.pytorch.org/whl/nightly/cu128"])
else:
    _pip(["install", "torch==2.4.1", "torchvision==0.19.1",
          "--index-url", "https://download.pytorch.org/whl/cu121"])

packages = [
    "numpy<2",
    "opencv-python-headless",
    "imageio[ffmpeg]",
    "pillow",
    "einops",
    "omegaconf",
    "ml_collections",
    "absl-py",
    "loguru",
    "sentencepiece",
    "decord",
    "mediapy",
    "scikit-image",
    "timm",
    "func_timeout",
    "huggingface_hub==0.26.2",
    "tokenizers==0.19.1",
    "transformers==4.44.2",
    "diffusers==0.30.3",
    "accelerate==0.34.2",
    "peft==0.13.2",
    "safetensors==0.4.5",
    "gradio==4.44.0",
    "gradio_client==1.3.0",
    "git+https://github.com/facebookresearch/sam2.git",
]
_pip(["install", "-q", *packages])
print("Restart the runtime, then run Cell 2.")
```
After this cell finishes, restart the Colab runtime before running Cell 2. This ensures the new PyTorch / NCCL stack is picked up cleanly.
### Cell 2: Download the Gargantua LoRA
```python
import os
from pathlib import Path
from getpass import getpass
import torch
from huggingface_hub import login, snapshot_download

assert torch.cuda.is_available(), "CUDA unavailable."

tok = os.environ.get("HF_TOKEN") or getpass("HF token (optional): ").strip()
if tok:
    login(token=tok, add_to_git_credential=False)

LORA_DIR = Path("/content/gargantua/lora")
adapter = LORA_DIR / "transformer" / "adapter_model.safetensors"
if not adapter.exists():
    LORA_DIR.mkdir(parents=True, exist_ok=True)
    snapshot_download(
        repo_id="ErenAta00/gargantua",
        local_dir=str(LORA_DIR),
        allow_patterns=["transformer/*"],
    )
assert adapter.exists(), f"Download failed: {adapter}"
```
### Cell 3: Launch the interactive app
This cell clones the Netflix VOID inference harness, downloads the CogVideoX-Fun base model and `void_pass1.safetensors`, patches `merge_lora` to accept PEFT-style adapters, and launches a Gradio UI with two tabs: **Mask** (SAM 2.1 click-and-propagate) and **Remove** (Gargantua inference).
````python
import os, sys, time, json, shutil, subprocess, traceback
from pathlib import Path
from datetime import datetime

import huggingface_hub

# Newer huggingface_hub releases dropped HfFolder; shim it for older callers.
if not hasattr(huggingface_hub, "HfFolder"):
    class _HfFolder:
        @staticmethod
        def get_token():
            try: return huggingface_hub.get_token()
            except Exception: return None
        @staticmethod
        def save_token(t):
            try: huggingface_hub.login(token=t, add_to_git_credential=False)
            except Exception: pass
        @staticmethod
        def delete_token():
            try: huggingface_hub.logout()
            except Exception: pass
    huggingface_hub.HfFolder = _HfFolder

import numpy as np
import torch
import cv2
import imageio
from PIL import Image
import gradio as gr

# Gradio 4.44 API-schema introspection can crash on some components; disable it.
gr.Blocks.get_api_info = lambda self: {"named_endpoints": {}, "unnamed_endpoints": {}}
import gradio_client.utils as _gcu
_orig_js = _gcu._json_schema_to_python_type
def _safe_js(s, d=None):
    try: return _orig_js(s, d)
    except Exception: return "Any"
_gcu._json_schema_to_python_type = _safe_js

from huggingface_hub import hf_hub_download, snapshot_download
from sam2.build_sam import build_sam2_video_predictor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

sam_ckpt = hf_hub_download("facebook/sam2.1-hiera-large", "sam2.1_hiera_large.pt")
sam_predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", sam_ckpt, device=device)

WORK = Path("/content/void_test")
VOID = WORK / "void-model"
BASE = VOID / "CogVideoX-Fun-V1.5-5b-InP"
PASS1 = VOID / "void_pass1.safetensors"
LORA_FILE = Path("/content/gargantua/lora/transformer/adapter_model.safetensors")
TEST_SCENE = VOID / "data" / "test_scene"

WORK.mkdir(exist_ok=True)
if not VOID.exists():
    subprocess.run(["git", "clone", "--depth", "1",
                    "https://github.com/Netflix/void-model.git", str(VOID)], check=True)
if not (BASE / "vae" / "config.json").exists():
    snapshot_download("alibaba-pai/CogVideoX-Fun-V1.5-5b-InP", local_dir=str(BASE))
if not PASS1.exists():
    hf_hub_download("netflix/void-model", "void_pass1.safetensors", local_dir=str(VOID))
assert LORA_FILE.exists(), f"LoRA missing at {LORA_FILE}"
TEST_SCENE.mkdir(parents=True, exist_ok=True)

# Patch VOID's merge_lora so it also accepts PEFT-style (lora_A/lora_B) adapters.
LORA_UTILS = VOID / "videox_fun" / "utils" / "lora_utils.py"
_MARK = "# peft-compat-branch"
_src = LORA_UTILS.read_text()
if _MARK not in _src:
    _patch = f'''
{_MARK}
def _is_peft(state_dict):
    return any(".lora_A." in k or ".lora_B." in k for k in state_dict.keys())

def _merge_peft(pipeline, lora_path, multiplier, device, dtype, state_dict):
    import json as _json, os as _os
    cfg_path = _os.path.join(_os.path.dirname(lora_path), "adapter_config.json")
    r, alpha = 1.0, 1.0
    if _os.path.exists(cfg_path):
        cfg = _json.load(open(cfg_path))
        r = float(cfg.get("r", 1))
        alpha = float(cfg.get("lora_alpha", r))
    scale = (alpha / r) * float(multiplier)
    pairs = {{}}
    for k, v in state_dict.items():
        if ".lora_A." in k:
            mp = k.split("base_model.model.", 1)[1].rsplit(".lora_A.", 1)[0]
            pairs.setdefault(mp, {{}})["A"] = v
        elif ".lora_B." in k:
            mp = k.split("base_model.model.", 1)[1].rsplit(".lora_B.", 1)[0]
            pairs.setdefault(mp, {{}})["B"] = v
    tr = pipeline.transformer
    for mod_path, ab in pairs.items():
        if "A" not in ab or "B" not in ab:
            continue
        try:
            mod = tr
            for p in mod_path.split("."):
                mod = mod[int(p)] if p.isdigit() else getattr(mod, p)
        except Exception:
            continue
        A = ab["A"].to(device, dtype=torch.float32)
        B = ab["B"].to(device, dtype=torch.float32)
        delta = (B @ A) * scale
        w = mod.weight.data
        mod.weight.data = (w.to(torch.float32) + delta.to(w.device)).to(w.dtype)
    return pipeline

_orig_merge_lora = merge_lora
def merge_lora(pipeline, lora_path, multiplier, device='cpu', dtype=torch.float32, state_dict=None, transformer_only=False):
    if state_dict is None:
        state_dict = load_file(lora_path, device=device)
    if _is_peft(state_dict):
        return _merge_peft(pipeline, lora_path, multiplier, device, dtype, state_dict)
    return _orig_merge_lora(pipeline, lora_path, multiplier, device, dtype, state_dict, transformer_only)
'''
    LORA_UTILS.write_text(_src + _patch)
for pc in VOID.rglob("__pycache__"):
    shutil.rmtree(pc, ignore_errors=True)

W, H, N, FPS = 672, 384, 45, 12
SCENE = Path("/content/scenes/current")
SCENE.mkdir(parents=True, exist_ok=True)

_state = {
    "frames": [], "first_frame": None,
    "points": [], "labels": [],
    "sam_state": None, "mask_ready": False,
}

def _load_video(path):
    if not path:
        return None, "Upload a video."
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(0, total - 1), N).astype(int)
    frames = []
    for fi in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(fi))
        ok, fr = cap.read()
        if ok:
            fr = cv2.cvtColor(fr, cv2.COLOR_BGR2RGB)
            fr = cv2.resize(fr, (W, H), interpolation=cv2.INTER_AREA)
        else:
            # Reuse the previous frame (already RGB + resized) on read failure,
            # without re-running the BGR->RGB conversion on it.
            fr = frames[-1] if frames else np.zeros((H, W, 3), np.uint8)
        frames.append(fr)
    cap.release()
    imageio.mimsave(str(SCENE / "input.mp4"), frames, fps=FPS,
                    codec="libx264", pixelformat="yuv420p", quality=8)
    fd = SCENE / "frames"
    shutil.rmtree(fd, ignore_errors=True); fd.mkdir(parents=True, exist_ok=True)
    for i, fr in enumerate(frames):
        Image.fromarray(fr).save(fd / f"{i:05d}.jpg", quality=95)
    _state.update({
        "frames": frames, "first_frame": frames[0],
        "points": [], "labels": [], "mask_ready": False,
    })
    with torch.inference_mode(), torch.autocast(device, dtype=dtype):
        _state["sam_state"] = sam_predictor.init_state(
            video_path=str(fd), offload_video_to_cpu=True)
        sam_predictor.reset_state(_state["sam_state"])
    return Image.fromarray(frames[0]), f"Ready. {len(frames)} frames @ {W}x{H}."

def _add_point(evt: gr.SelectData):
    if _state["first_frame"] is None:
        return None, "Upload a video first."
    x, y = evt.index[0], evt.index[1]
    _state["points"].append([x, y])
    _state["labels"].append(1)
    img = _state["first_frame"].copy()
    for (px, py) in _state["points"]:
        cv2.circle(img, (px, py), 8, (0, 255, 0), -1)
        cv2.circle(img, (px, py), 8, (0, 0, 0), 2)
    return Image.fromarray(img), f"{len(_state['points'])} point(s)."

def _clear_points():
    if _state["first_frame"] is None:
        return None, "Upload a video first."
    _state["points"], _state["labels"] = [], []
    if _state["sam_state"] is not None:
        sam_predictor.reset_state(_state["sam_state"])
    return Image.fromarray(_state["first_frame"]), "Cleared."

def _make_mask():
    if not _state["points"]:
        return None, "Click at least one point on the object."
    with torch.inference_mode(), torch.autocast(device, dtype=dtype):
        sam_predictor.reset_state(_state["sam_state"])
        sam_predictor.add_new_points_or_box(
            inference_state=_state["sam_state"], frame_idx=0, obj_id=1,
            points=np.array(_state["points"], dtype=np.float32),
            labels=np.array(_state["labels"], dtype=np.int32))
        masks = {}
        for fidx, _oids, logits in sam_predictor.propagate_in_video(_state["sam_state"]):
            m = (logits[0] > 0).cpu().numpy().astype(np.uint8)
            if m.ndim == 3: m = m[0]
            masks[fidx] = m
    mask_frames = []
    preview = _state["first_frame"].copy()
    for i in range(len(_state["frames"])):
        m = masks.get(i, np.zeros((H, W), np.uint8))
        # Binary quadmask convention: black = remove, white = keep.
        panel = np.full((H, W, 3), 255, dtype=np.uint8)
        panel[m.astype(bool)] = 0
        mask_frames.append(panel)
        if i == 0:
            red = np.zeros_like(preview); red[..., 0] = 255
            a = m.astype(np.float32)[..., None] * 0.5
            preview = (preview * (1 - a) + red * a).astype(np.uint8)
    imageio.mimsave(str(SCENE / "mask.mp4"), mask_frames, fps=FPS,
                    codec="libx264", pixelformat="yuv420p", quality=8)
    _state["mask_ready"] = True
    return Image.fromarray(preview), "Mask ready. Go to Remove."

CONFIG = VOID / "config" / "quadmask_cogvideox.py"
CONFIG_ORIG = CONFIG.parent / "quadmask_cogvideox.py.orig"
if CONFIG.exists() and not CONFIG_ORIG.exists():
    shutil.copy2(CONFIG, CONFIG_ORIG)

def _write_config(updates):
    # Start from the pristine config each run, then rewrite the requested keys.
    shutil.copy(CONFIG_ORIG, CONFIG)
    lines = CONFIG.read_text().splitlines()
    for key, value in updates.items():
        tgt = f"config.{key}"
        for i, line in enumerate(lines):
            s = line.strip()
            if s.startswith(f"{tgt} =") or s.startswith(f"{tgt}="):
                indent = line[:len(line) - len(line.lstrip())]
                if isinstance(value, str): nv = f'"{value}"'
                elif isinstance(value, bool): nv = str(value)
                elif isinstance(value, (int, float)): nv = str(value)
                else: nv = repr(value)
                comment = " " + line[line.index("#"):] if "#" in line else ""
                lines[i] = f"{indent}{tgt} = {nv}{comment}"
                break
    CONFIG.write_text("\n".join(lines) + "\n")

def run_gargantua(lora_weight, bg_prompt, num_steps, progress=gr.Progress()):
    try:
        if not _state["mask_ready"]:
            return None, None, None, "Create a mask first."
        progress(0.05, "Preparing scene...")
        for f in TEST_SCENE.iterdir():
            if f.is_file(): f.unlink()
        shutil.copy(SCENE / "input.mp4", TEST_SCENE / "input_video.mp4")
        shutil.copy(SCENE / "mask.mp4", TEST_SCENE / "quadmask_0.mp4")
        (TEST_SCENE / "prompt.json").write_text(
            json.dumps({"bg": bg_prompt}, indent=2, ensure_ascii=False))
        save_tag = datetime.now().strftime("run_%Y%m%d_%H%M%S")
        _write_config({
            "run_seqs": "test_scene",
            "model_name": str(BASE),
            "transformer_path": str(PASS1),
            "lora_path": str(LORA_FILE),
            "lora_weight": float(lora_weight),
            "save_path": save_tag,
            "sample_size": "384x672",
            "max_video_length": N,
            "temporal_window_size": N,
            "low_gpu_memory_mode": False,
            "gpu_memory_mode": "",
            "num_inference_steps": int(num_steps),
            "guidance_scale": 1.0,
            "seed": 42,
            "sampler_name": "DDIM_Origin",
            "skip_if_exists": False,
            "denoise_strength": 1.0,
        })
        for pc in VOID.rglob("__pycache__"):
            shutil.rmtree(pc, ignore_errors=True)
        progress(0.20, "Running inference...")
        t0 = time.time()
        proc = subprocess.run(
            [sys.executable, "inference/cogvideox_fun/predict_v2v.py",
             "--config=config/quadmask_cogvideox.py"],
            cwd=str(VOID), capture_output=True, text=True, timeout=1800)
        dt = time.time() - t0
        out_dir = VOID / save_tag
        mp4s = sorted(out_dir.rglob("*.mp4")) if out_dir.exists() else []
        main = next((m for m in mp4s if "_tuple" not in m.name), None)
        if not main:
            tail = (proc.stderr or "")[-2500:]
            head = (proc.stdout or "")[-1500:]
            return None, str(SCENE / "input.mp4"), str(SCENE / "mask.mp4"), \
                f"**Inference failed in {dt:.0f}s.**\n\n```\n{head}\n---\n{tail}\n```"
        out = Path("/content/gargantua_output.mp4")
        shutil.copy(main, out)
        progress(1.0, "Done.")
        return str(out), str(SCENE / "input.mp4"), str(SCENE / "mask.mp4"), \
            f"Completed in {dt:.0f}s."
    except Exception as e:
        return None, None, None, f"**Error:** `{type(e).__name__}: {e}`\n\n```\n{traceback.format_exc()}\n```"

with gr.Blocks(title="Gargantua", theme=gr.themes.Soft()) as demo:
    gr.Markdown("# Gargantua\nSelect an object, track it across frames, remove it from the video.")
    with gr.Tabs():
        with gr.Tab("Mask"):
            with gr.Row():
                with gr.Column():
                    vid_in = gr.Video(label="Input video")
                    btn_load = gr.Button("Load")
                    canvas = gr.Image(label="Click on the object",
                                      type="pil", interactive=False)
                    with gr.Row():
                        btn_reset = gr.Button("Reset points")
                        btn_mask = gr.Button("Generate mask", variant="primary")
                with gr.Column():
                    preview = gr.Image(label="Mask preview (frame 0)")
                    status_mask = gr.Markdown()
            btn_load.click(_load_video, inputs=vid_in, outputs=[canvas, status_mask])
            canvas.select(_add_point, outputs=[canvas, status_mask])
            btn_reset.click(_clear_points, outputs=[canvas, status_mask])
            btn_mask.click(_make_mask, outputs=[preview, status_mask])
        with gr.Tab("Remove"):
            gr.Markdown("Suggested weight: 0.2 for glass/reflective, "
                        "0.5 for most objects, 1.0 for stubborn cases.")
            with gr.Row():
                with gr.Column():
                    weight = gr.Slider(0.1, 1.0, value=1.0, step=0.1, label="LoRA weight")
                    steps = gr.Slider(20, 50, value=30, step=5, label="Inference steps")
                    bg_p = gr.Textbox(
                        label="Background description",
                        value="The scene as if the object was never present, clean background.",
                        lines=2)
                    btn_go = gr.Button("Run Gargantua", variant="primary", size="lg")
                    status = gr.Markdown()
                with gr.Column():
                    out_video = gr.Video(label="Output")
                    with gr.Row():
                        in_show = gr.Video(label="Input")
                        mask_show = gr.Video(label="Mask")
            btn_go.click(run_gargantua,
                         inputs=[weight, bg_p, steps],
                         outputs=[out_video, in_show, mask_show, status])

demo.launch(share=True)
````
## Advanced: drop-in with the Netflix VOID repo
Gargantua is a PEFT adapter, but the Cell 3 patch makes it load transparently through VOID's `merge_lora`. If you prefer to run VOID's CLI by hand, set the following keys in `config/quadmask_cogvideox.py`:
```python
config.model_name = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"
config.transformer_path = "void_pass1.safetensors"
config.lora_path = "path/to/gargantua/transformer/adapter_model.safetensors"
config.lora_weight = 0.5
config.sample_size = "384x672"
config.max_video_length = 45
config.temporal_window_size = 45
config.num_inference_steps = 30
config.guidance_scale = 1.0
config.sampler_name = "DDIM_Origin"
```
Then run:

```bash
python inference/cogvideox_fun/predict_v2v.py --config=config/quadmask_cogvideox.py
```
## Citation

```bibtex
@misc{ata2026gargantua,
  title        = {Gargantua: A LoRA for Physics-Aware Video Object Removal on CogVideoX-Fun},
  author       = {Eren Ata},
  year         = {2026},
  howpublished = {Hugging Face Hub},
  url          = {https://huggingface.co/ErenAta00/gargantua}
}

@misc{ata2026void_quadmask_dataset,
  title  = {VOID-Compatible Quadmask Counterfactual Video Dataset},
  author = {Eren Ata},
  year   = {2026},
  url    = {https://huggingface.co/datasets/ErenAta00/VOID-Quadmask-Dataset}
}
```
## Acknowledgments
- Netflix VOID team, for the quadmask formulation and for open-sourcing the VOID framework
- Alibaba PAI, for the CogVideoX-Fun base model
- Meta AI, for SAM 2.1 (used in the inference mask pipeline)
- Unity Technologies, for Unity 6 HDRP + deterministic PhysX, used to generate ground-truth counterfactuals
## License
Apache 2.0 for the LoRA weights. Check upstream components:
- CogVideoX-Fun-V1.5-5b-InP: Tongyi Wanxiang License
- VOID (optional at inference): Netflix Research License