---
library_name: pytorch
license: other
license_name: fair-research-license
license_link: https://huggingface.co/facebook/EUPE-ViT-B
base_model: phanerozoic/argus
pipeline_tag: object-detection
tags:
  - 3d-detection
  - 3d-bounding-box
  - vision
  - eupe
  - argus
---

# Argus-3D

Class-agnostic 3D bounding box detection on a frozen [EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) backbone. Given a posed RGB image and camera intrinsics, returns 7-DoF boxes (cx, cy, cz, w, h, d, theta) for the objects in the scene.

The detector pairs a non-linear (or linear) per-patch foreground head with feature-dim discovery for depth, k-means clustering for size priors derived from 20 CA-1M val scenes, and 3D-IoU-based multi-view fusion. No 2D bounding boxes, no class labels, no segmentation map as final output. Camera-frame 3D boxes only.

## Architecture

```
Per-frame:
  Image (1024x1024)
    -> EUPE-ViT-B (frozen, reused from phanerozoic/argus)
    -> patch tokens (4096, 768) on a 64x64 grid
    -> instance head: 2-layer MLP (default) or linear ridge -> per-patch foreground score
       depth head:    ridge over 768 dims                   -> per-patch metric depth (m)
       k-means modes: 8 cluster centers (20-scene)          -> per-patch object-type assignment
    -> threshold instance score, upsample mask to 1024x1024, connected components
    -> for each component:
         unproject to 3D using depth + K
         DBSCAN-split for instance separation
         PCA-on-xz for yaw, percentile extents for (w, h, d)
         blend extents toward the matched cluster's size prior
    -> camera-frame 7-DoF box list

Multi-view (per scene):
  -> transform every per-frame box to world frame using camera RT
  -> 3D-IoU-based clustering: union-find with edges where iou_3d_zup(box_i, box_j) > 0.2
  -> filter clusters by min_obs and total inlier weight (per-cluster confidence)
  -> per-cluster: weighted-median fuse to single 7-DoF box
```

## Components

| Component | Parameters | Discovery / training |
|---|---|---|
| EUPE-ViT-B backbone (frozen, reused) | not part of this head | reused from phanerozoic/argus, run at 1024x1024 input (64x64 patch grid) |
| Instance head — MLP (default) | ~200 K | 2-layer MLP (768 → 256 → 1, ReLU + dropout 0.5), 30 epochs AdamW BCE-with-logits on 20-scene patch split at 1024 input |
| Instance head — linear ridge (fallback) | 769 floats + threshold | random K=20 subset search + hard-neg mining, AUC selection |
| Depth head (ridge over 768 dims) | 769 floats | random K=20 subset search, RMSE selection |
| K-means cluster centers | 8 × 768 floats | MiniBatchKMeans on foreground patches across 20 CA-1M val scenes |
| Per-cluster size priors (w, h, d) | 8 × 3 floats | median of observed extents per mode (8 800 instances aggregated) |
| OBB fitter (PCA + percentile + Tikhonov) | 0 | closed-form |
| Multi-view fusion (3D-IoU union-find + weighted median) | 0 | closed-form |
| **Total head footprint (MLP)** | **~200 K params / ~830 KB** | |

## File layout

```
instance_head_mlp.safetensors   # MLP fc1/fc2 weights
instance_head_mlp_meta.json     # MLP config + threshold
instance_head.safetensors       # linear ridge: dims + coef + intercept + threshold
depth_head.safetensors          # depth ridge: dims + coef + intercept
size_priors.safetensors         # 8 cluster centers + 8 (w, h, d) priors (20-scene)
config.json                     # input_res, patch_grid, prior_weight, fusion_iou, etc.
argus_3d.py                     # Argus3D class
infer.py                        # CLI dispatcher
```

## Usage

```python
from argus_3d import Argus3D
import numpy as np

model = Argus3D.from_pretrained("phanerozoic/argus-3d", device="cuda")

K = np.array([[850, 0, 395], [0, 850, 510], [0, 0, 1]])
boxes = model.detect("room.jpg", K)               # list of Box3D
boxes = model.detect("room.jpg", K, depth=d)      # supply RGBD sensor depth
out   = model.perceive("room.jpg", K)             # fg score map + depth map + boxes

for b in boxes:
    print(b.cx, b.cy, b.cz, b.w, b.h, b.d, b.theta)
```

## Eval

CA-1M val sequence `ca1m-val-45662921`. Class-agnostic per-scene 3D IoU after multi-view fusion across 284 frames (stride-4 sampling of 1135 total). The head produces its own instance hypotheses; no ground-truth 2D bounding boxes are used. Sensor depth is supplied; the discovered depth head can be used in its place.

### mAP @ IoU thresholds (Boxer-comparable)

Class-agnostic AP at the same IoU thresholds Boxer reports:

| Threshold | Strict default (44 boxes) | Loose filter (246 boxes) |
|---|---|---|
| AP @ 0.05 IoU | 0.144 | **0.246** |
| AP @ 0.10 IoU | 0.116 | 0.172 |
| AP @ 0.15 IoU | 0.075 | 0.128 |
| AP @ 0.25 IoU | 0.035 | 0.049 |
| AP @ 0.50 IoU | 0.004 | 0.001 |
| **mAP @ [0.05, 0.5]** | 0.075 | **0.119** |

Boxer reports 0.43 mAP at IoU [0.05, 0.5] on CA-1M with GT 2D bounding boxes plus RGBD as input plus a learned 2D→3D lifting head. Our pipeline runs class-agnostic without GT 2D boxes and the OBB fitter is closed-form (PCA + percentile). The ~4× gap at AP @ 0.5 IoU is dominated by the percentile OBB fit — point-cloud percentile extents cap the high-IoU tail.

### Headline result

1024-input MLP head + 20-scene size priors + 3D-IoU multi-view fusion + iterative refinement:

| Metric | Linear ridge (v0) + 768 + position fusion | MLP + 768 + IoU fusion | MLP + 1024 + IoU fusion (default) |
|---|---|---|---|
| Mean 3D IoU | 0.063 | 0.177 | **0.216** |
| Median 3D IoU | — | 0.146 | **0.175** |
| Fraction > 0.1 IoU | 27.8 % | 60.9 % | **68.2 %** |
| Fraction > 0.25 IoU | 6.9 % | 34.8 % | **36.4 %** |
| Fraction > 0.5 IoU | 0.0 % | 4.3 % | **9.1 %** |
| Recall (matched / GT) | 19.8 % | 18.0 % | 17.5 % |
| Fused boxes per scene | 72 | 46 | 44 |

Default config: 1024 input resolution (64×64 patch grid), MLP head trained at 1024 features on 20 scenes, 20-scene size priors derived from foreground patches at 1024 resolution, 3D-IoU multi-view fusion at threshold 0.4 with iterative refinement (drop observations with IoU < 0.4 to the cluster consensus, re-fuse, up to 3 iterations). Per-cluster filter is min 4 observations and weight floor at the 80th percentile of nonzero cluster weights.

Resolution and fusion ablations (all on `ca1m-val-45662921`):

| Variant | Mean IoU | > 0.25 IoU | > 0.5 IoU | Recall |
|---|---|---|---|---|
| Linear ridge (v0) + position fusion at 768 | 0.063 | 6.9 % | 0.0 % | 19.8 % |
| MLP + position fusion at 768 | 0.076 | 11.5 % | 1.3 % | 24.4 % |
| MLP + IoU fusion (thresh 0.2) at 768 | 0.158 | 28.2 % | 2.8 % | 28.1 % |
| MLP + IoU fusion (thresh 0.4) at 768 | 0.177 | 34.8 % | 4.3 % | 18.0 % |
| **MLP + IoU fusion at 1024 (default)** | **0.216** | **36.4 %** | **9.1 %** | 17.5 % |

### Per-stage discovery metrics (in-distribution random patch split, 4 scenes mixed)

| Discovery output | Metric | Linear ridge | MLP |
|---|---|---|---|
| Instance head, per-patch foreground | AUC | 0.860 | 0.980 |
| Instance head, per-patch foreground | F1 (tuned threshold) | 0.569 | 0.815 |
| Depth head, foreground patches in 0.1-3 m | RMSE | 0.190 m | 0.133 m (MLP variant) |
| Depth head, foreground patches in 0.1-3 m | delta1 (1.25× ratio) | 0.919 | 0.974 (MLP variant) |

### Cross-scene held-out

The numbers above use a random patch split that mixes patches from 4 cached scenes. That setup overstates generalization because adjacent patches in the same room share scene-specific cues. A leave-one-scene-out eval gives the honest cross-scene head AUC:

| Head | In-distribution AUC | Cross-scene AUC | Cross-scene F1 |
|---|---|---|---|
| Linear ridge (all-768) | 0.860 | 0.604 | 0.301 |
| MLP (3 train scenes) | 0.980 | 0.566 (fold 45662921) | 0.312 |
| MLP (19 train scenes) | 0.980 | **0.780** (45662921 held out) | **0.465** |

Scaling the train set from 3 to 19 scenes lifts cross-scene AUC by 0.21 (0.566 → 0.780).

End-to-end cross-scene mean IoU on 45662921 with the 19-scene MLP head:

| Pipeline | Mean IoU | > 0.25 IoU | Recall |
|---|---|---|---|
| Position-only fusion | 0.046 | 1.7 % | 18.0 % |
| 3D-IoU fusion (this work) | **0.112** | 16.5 % | 28.1 % |

The 3D-IoU multi-view fusion improvement transfers to the cross-scene setting: 2.4× lift in mean IoU (0.046 → 0.112) and an order-of-magnitude lift in > 0.25 IoU. Per-frame box quality limits per-scene IoU on held-out scenes; the 0.20 in-distribution-to-cross-scene gap (0.154 vs 0.112 mean IoU) is the head still leaning on per-scene cues for box localization.

## Backbone

EUPE-ViT-B from Meta FAIR (arXiv:2603.22387) via [phanerozoic/argus](https://huggingface.co/phanerozoic/argus). The backbone is frozen and not modified by this repo.

## License

FAIR Research License (non-commercial), inherited via the EUPE-ViT-B backbone.