--- library_name: pytorch license: other license_name: fair-research-license license_link: https://huggingface.co/facebook/EUPE-ViT-B base_model: phanerozoic/argus pipeline_tag: object-detection tags: - 3d-detection - 3d-bounding-box - vision - eupe - argus --- # Argus-3D Class-agnostic 3D bounding box detection on a frozen [EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) backbone. Given a posed RGB image and camera intrinsics, returns 7-DoF boxes (cx, cy, cz, w, h, d, theta) for the objects in the scene. The detector pairs a non-linear (or linear) per-patch foreground head with feature-dim discovery for depth, k-means clustering for size priors derived from 20 CA-1M val scenes, and 3D-IoU-based multi-view fusion. No 2D bounding boxes, no class labels, no segmentation map as final output. Camera-frame 3D boxes only. ## Architecture ``` Per-frame: Image (1024x1024) -> EUPE-ViT-B (frozen, reused from phanerozoic/argus) -> patch tokens (4096, 768) on a 64x64 grid -> instance head: 2-layer MLP (default) or linear ridge -> per-patch foreground score depth head: ridge over 768 dims -> per-patch metric depth (m) k-means modes: 8 cluster centers (20-scene) -> per-patch object-type assignment -> threshold instance score, upsample mask to 1024x1024, connected components -> for each component: unproject to 3D using depth + K DBSCAN-split for instance separation PCA-on-xz for yaw, percentile extents for (w, h, d) blend extents toward the matched cluster's size prior -> camera-frame 7-DoF box list Multi-view (per scene): -> transform every per-frame box to world frame using camera RT -> 3D-IoU-based clustering: union-find with edges where iou_3d_zup(box_i, box_j) > 0.2 -> filter clusters by min_obs and total inlier weight (per-cluster confidence) -> per-cluster: weighted-median fuse to single 7-DoF box ``` ## Components | Component | Parameters | Discovery / training | |---|---|---| | EUPE-ViT-B backbone (frozen, reused) | not part of this head | reused from phanerozoic/argus, run at 1024x1024 input (64x64 patch grid) | | Instance head — MLP (default) | ~200 K | 2-layer MLP (768 → 256 → 1, ReLU + dropout 0.5), 30 epochs AdamW BCE-with-logits on 20-scene patch split at 1024 input | | Instance head — linear ridge (fallback) | 769 floats + threshold | random K=20 subset search + hard-neg mining, AUC selection | | Depth head (ridge over 768 dims) | 769 floats | random K=20 subset search, RMSE selection | | K-means cluster centers | 8 × 768 floats | MiniBatchKMeans on foreground patches across 20 CA-1M val scenes | | Per-cluster size priors (w, h, d) | 8 × 3 floats | median of observed extents per mode (8 800 instances aggregated) | | OBB fitter (PCA + percentile + Tikhonov) | 0 | closed-form | | Multi-view fusion (3D-IoU union-find + weighted median) | 0 | closed-form | | **Total head footprint (MLP)** | **~200 K params / ~830 KB** | | ## File layout ``` instance_head_mlp.safetensors # MLP fc1/fc2 weights instance_head_mlp_meta.json # MLP config + threshold instance_head.safetensors # linear ridge: dims + coef + intercept + threshold depth_head.safetensors # depth ridge: dims + coef + intercept size_priors.safetensors # 8 cluster centers + 8 (w, h, d) priors (20-scene) config.json # input_res, patch_grid, prior_weight, fusion_iou, etc. argus_3d.py # Argus3D class infer.py # CLI dispatcher ``` ## Usage ```python from argus_3d import Argus3D import numpy as np model = Argus3D.from_pretrained("phanerozoic/argus-3d", device="cuda") K = np.array([[850, 0, 395], [0, 850, 510], [0, 0, 1]]) boxes = model.detect("room.jpg", K) # list of Box3D boxes = model.detect("room.jpg", K, depth=d) # supply RGBD sensor depth out = model.perceive("room.jpg", K) # fg score map + depth map + boxes for b in boxes: print(b.cx, b.cy, b.cz, b.w, b.h, b.d, b.theta) ``` ## Eval CA-1M val sequence `ca1m-val-45662921`. Class-agnostic per-scene 3D IoU after multi-view fusion across 284 frames (stride-4 sampling of 1135 total). The head produces its own instance hypotheses; no ground-truth 2D bounding boxes are used. Sensor depth is supplied; the discovered depth head can be used in its place. ### mAP @ IoU thresholds (Boxer-comparable) Class-agnostic AP at the same IoU thresholds Boxer reports: | Threshold | Strict default (44 boxes) | Loose filter (246 boxes) | |---|---|---| | AP @ 0.05 IoU | 0.144 | **0.246** | | AP @ 0.10 IoU | 0.116 | 0.172 | | AP @ 0.15 IoU | 0.075 | 0.128 | | AP @ 0.25 IoU | 0.035 | 0.049 | | AP @ 0.50 IoU | 0.004 | 0.001 | | **mAP @ [0.05, 0.5]** | 0.075 | **0.119** | Boxer reports 0.43 mAP at IoU [0.05, 0.5] on CA-1M with GT 2D bounding boxes plus RGBD as input plus a learned 2D→3D lifting head. Our pipeline runs class-agnostic without GT 2D boxes and the OBB fitter is closed-form (PCA + percentile). The ~4× gap at AP @ 0.5 IoU is dominated by the percentile OBB fit — point-cloud percentile extents cap the high-IoU tail. ### Headline result 1024-input MLP head + 20-scene size priors + 3D-IoU multi-view fusion + iterative refinement: | Metric | Linear ridge (v0) + 768 + position fusion | MLP + 768 + IoU fusion | MLP + 1024 + IoU fusion (default) | |---|---|---|---| | Mean 3D IoU | 0.063 | 0.177 | **0.216** | | Median 3D IoU | — | 0.146 | **0.175** | | Fraction > 0.1 IoU | 27.8 % | 60.9 % | **68.2 %** | | Fraction > 0.25 IoU | 6.9 % | 34.8 % | **36.4 %** | | Fraction > 0.5 IoU | 0.0 % | 4.3 % | **9.1 %** | | Recall (matched / GT) | 19.8 % | 18.0 % | 17.5 % | | Fused boxes per scene | 72 | 46 | 44 | Default config: 1024 input resolution (64×64 patch grid), MLP head trained at 1024 features on 20 scenes, 20-scene size priors derived from foreground patches at 1024 resolution, 3D-IoU multi-view fusion at threshold 0.4 with iterative refinement (drop observations with IoU < 0.4 to the cluster consensus, re-fuse, up to 3 iterations). Per-cluster filter is min 4 observations and weight floor at the 80th percentile of nonzero cluster weights. Resolution and fusion ablations (all on `ca1m-val-45662921`): | Variant | Mean IoU | > 0.25 IoU | > 0.5 IoU | Recall | |---|---|---|---|---| | Linear ridge (v0) + position fusion at 768 | 0.063 | 6.9 % | 0.0 % | 19.8 % | | MLP + position fusion at 768 | 0.076 | 11.5 % | 1.3 % | 24.4 % | | MLP + IoU fusion (thresh 0.2) at 768 | 0.158 | 28.2 % | 2.8 % | 28.1 % | | MLP + IoU fusion (thresh 0.4) at 768 | 0.177 | 34.8 % | 4.3 % | 18.0 % | | **MLP + IoU fusion at 1024 (default)** | **0.216** | **36.4 %** | **9.1 %** | 17.5 % | ### Per-stage discovery metrics (in-distribution random patch split, 4 scenes mixed) | Discovery output | Metric | Linear ridge | MLP | |---|---|---|---| | Instance head, per-patch foreground | AUC | 0.860 | 0.980 | | Instance head, per-patch foreground | F1 (tuned threshold) | 0.569 | 0.815 | | Depth head, foreground patches in 0.1-3 m | RMSE | 0.190 m | 0.133 m (MLP variant) | | Depth head, foreground patches in 0.1-3 m | delta1 (1.25× ratio) | 0.919 | 0.974 (MLP variant) | ### Cross-scene held-out The numbers above use a random patch split that mixes patches from 4 cached scenes. That setup overstates generalization because adjacent patches in the same room share scene-specific cues. A leave-one-scene-out eval gives the honest cross-scene head AUC: | Head | In-distribution AUC | Cross-scene AUC | Cross-scene F1 | |---|---|---|---| | Linear ridge (all-768) | 0.860 | 0.604 | 0.301 | | MLP (3 train scenes) | 0.980 | 0.566 (fold 45662921) | 0.312 | | MLP (19 train scenes) | 0.980 | **0.780** (45662921 held out) | **0.465** | Scaling the train set from 3 to 19 scenes lifts cross-scene AUC by 0.21 (0.566 → 0.780). End-to-end cross-scene mean IoU on 45662921 with the 19-scene MLP head: | Pipeline | Mean IoU | > 0.25 IoU | Recall | |---|---|---|---| | Position-only fusion | 0.046 | 1.7 % | 18.0 % | | 3D-IoU fusion (this work) | **0.112** | 16.5 % | 28.1 % | The 3D-IoU multi-view fusion improvement transfers to the cross-scene setting: 2.4× lift in mean IoU (0.046 → 0.112) and an order-of-magnitude lift in > 0.25 IoU. Per-frame box quality limits per-scene IoU on held-out scenes; the 0.20 in-distribution-to-cross-scene gap (0.154 vs 0.112 mean IoU) is the head still leaning on per-scene cues for box localization. ## Backbone EUPE-ViT-B from Meta FAIR (arXiv:2603.22387) via [phanerozoic/argus](https://huggingface.co/phanerozoic/argus). The backbone is frozen and not modified by this repo. ## License FAIR Research License (non-commercial), inherited via the EUPE-ViT-B backbone.