midmid3
A slop generator for Guitar Hero.
19M-parameter masked-prediction transformer that generates "playable" Guitar Hero charts from audio. Give it a song, get a chart for all four difficulty levels. The charts are, at best, mid.
How it works
- beat_this detects beats and builds a 16th-note grid
- MERT (frozen) extracts audio embeddings at each grid position
- This model iteratively fills in the chart over ~12 passes (MaskGIT-style unmasking), deciding which positions get notes, which frets they use, and how long sustains last, conditioned on the audio features and a difficulty level
- Post-processing applies per-difficulty constraints (fret limits, chord sizes, note spacing) and snaps everything to the grid
- Output is a MIDI chart + OGG audio + song.ini, ready to drop into GHWT:DE/DATA/MODS/
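The iterative unmasking step can be sketched roughly like this. This is a minimal NumPy sketch, not the repo's actual inference code: the mask token id, the confidence scoring, and the greedy commit/re-mask logic are all assumptions about how a MaskGIT-style sampler typically works.

```python
import numpy as np

def cosine_unmask_schedule(seq_len, steps=12):
    """Number of positions still masked after each of `steps` passes (MaskGIT-style)."""
    return [int(seq_len * np.cos(np.pi / 2 * (t + 1) / steps)) for t in range(steps)]

def iterative_unmask(logits_fn, seq_len, mask_id=33, steps=12, temperature=1.0, seed=0):
    """Start fully masked; each pass samples tokens everywhere, keeps the most
    confident predictions, and re-masks the rest for the next pass.

    `logits_fn(tokens) -> [seq_len, n_classes]`. `mask_id=33` is a placeholder;
    the real id comes from config.json.
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, mask_id)
    for n_masked in cosine_unmask_schedule(seq_len, steps):
        logits = logits_fn(tokens) / temperature
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = np.array([rng.choice(len(p), p=p) for p in probs])  # temperature sampling
        conf = probs[np.arange(seq_len), pred]
        still_masked = tokens == mask_id
        tokens = np.where(still_masked, pred, tokens)  # commit new predictions
        conf = np.where(still_masked, conf, np.inf)    # never re-mask committed tokens
        if n_masked > 0:
            tokens[np.argsort(conf)[:n_masked]] = mask_id  # drop least-confident picks
    return tokens
```

The cosine schedule commits only a handful of tokens in the first pass and most of them in the last few, which is why a 12-step loop can fill a whole song's grid.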
Architecture
| Spec | Value |
|---|---|
| Type | Bidirectional masked transformer with FiLM difficulty conditioning |
| Parameters | 19,321,896 (19.3M) |
| Audio input | MERT-v1-95M embeddings (768d) + onset/RMS/spectral centroid (3d) = 771d |
| Hidden dim | 512 |
| Heads | 8 |
| Layers | 6 (each with FiLM modulation from difficulty embedding) |
| FFN | 2048, SwiGLU activation |
| Position encoding | Rotary (RoPE, base 10000) |
| Normalization | RMSNorm |
| Token vocab | 34 (32 fret combinations + silence + mask) |
| Output heads | Token logits (33 classes), sustain (binary), duration (6 buckets) |
| Inference | 12-step cosine-schedule iterative unmasking with temperature sampling |
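The per-layer FiLM modulation works by predicting a per-channel scale and shift from the difficulty embedding. A minimal PyTorch sketch, assuming a 64-dim difficulty embedding and the layer/projection names here being invented for illustration:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift every channel of the
    hidden states using parameters predicted from a conditioning embedding."""
    def __init__(self, hidden_dim=512, cond_dim=64):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, x, cond):
        # x: [batch, seq, hidden], cond: [batch, cond_dim]
        scale, shift = self.proj(cond).chunk(2, dim=-1)
        return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# One difficulty embedding modulates each layer
difficulty_emb = nn.Embedding(4, 64)      # 0=easy ... 3=expert
film = FiLM(hidden_dim=512, cond_dim=64)
x = torch.randn(2, 128, 512)              # [batch, grid positions, hidden]
cond = difficulty_emb(torch.tensor([0, 3]))
y = film(x, cond)                         # same shape as x
```

Because the same conditioning signal reaches all 6 layers, one difficulty id can reshape the whole chart rather than just biasing the output layer.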
Training data
Trained exclusively on official Guitar Hero setlists (GH1 through Warriors of Rock, Band Hero, and remastered track packs). Charts were extracted from GHWT:DE-format .pak files using GH-Toolkit-NET. Audio stems were decrypted and merged from the corresponding .fsb files.
Player-made Clone Hero community charts were deliberately excluded.
Limitations
The biggest limitation is that it's pretty mid.
- Performs best on rock and guitar-forward tracks (because that's what the official setlists mostly consist of)
- Beat tracking quality directly affects chart quality
- Sustain behavior can be strange, especially on fast passages. Sustains are heavily post-processed with rules that don't always reflect what real charts do
- No HOPOs, Star Power, or slider/tap notes (could be added, would also be mid)
- Other instruments (drums, bass, vocals) would need separate models
- There's a fundamental ceiling to what an itty bitty model trained on a single consumer GPU can do
The charts may be useful for human charters as a starting point, assuming they aren't entirely offended by its mere existence. Press x to doubt.
Format
- `model.safetensors` - weights only, no pickle
- `config.json` - architecture hyperparameters
- `pytorch/` - the trained `.bin` with optimizer states
Usage
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

config_path = hf_hub_download("markury/midmid3-19m-0326", "config.json")
weights_path = hf_hub_download("markury/midmid3-19m-0326", "model.safetensors")

with open(config_path) as f:
    config = json.load(f)
state_dict = load_file(weights_path)
```
The model expects [batch, seq, 771] audio features (MERT + onset/RMS/centroid), [batch, seq] chart token indices (0-33), and a [batch] difficulty ID (0=easy, 3=expert).
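Those input shapes can be sanity-checked with dummy tensors. The sequence length here is arbitrary, and treating token id 33 as the mask token is an assumption (the vocab is 34 entries, so check config.json for the real id):

```python
import torch

batch, seq = 1, 512  # seq = number of 16th-note grid positions (arbitrary here)
audio_feats = torch.randn(batch, seq, 771)   # MERT 768d + onset/RMS/centroid 3d
chart_tokens = torch.full((batch, seq), 33)  # fully masked start (mask id assumed)
difficulty = torch.tensor([3])               # 0=easy, 1=medium, 2=hard, 3=expert
```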
Citation
Credit back here, the GitHub repo, or the live demo.