midmid3


A slop generator for Guitar Hero.

19M-parameter masked-prediction transformer that generates "playable" Guitar Hero charts from audio. Give it a song, get a chart for all four difficulty levels. The charts are, at best, mid.

Try the live demo

How it works

  1. beat_this detects beats and builds a 16th-note grid
  2. MERT (frozen) extracts audio embeddings at each grid position
  3. This model iteratively fills in the chart over ~12 passes (MaskGIT-style unmasking), deciding which positions get notes, which frets to use, and how long sustains last, conditioned on the audio features and the difficulty level
  4. Post-processing applies per-difficulty constraints (fret limits, chord sizes, note spacing) and snaps everything to the grid
  5. Output is a MIDI chart + OGG audio + song.ini, ready to drop into GHWT:DE/DATA/MODS/
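The iterative unmasking in step 3 can be sketched as below. This is a simplified illustration of MaskGIT-style decoding under a cosine schedule, not the repo's actual code; the function signature, `MASK_ID = 33`, and the confidence rule are assumptions based on this card.

```python
import math
import torch

MASK_ID = 33  # mask token index (assumed from the 34-token vocab)

@torch.no_grad()
def generate(model, audio, difficulty, steps=12, temperature=1.0):
    """MaskGIT-style decoding sketch: start fully masked, commit the most
    confident predictions each pass, re-mask the rest for the next pass."""
    B, T, _ = audio.shape
    tokens = torch.full((B, T), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(audio, tokens, difficulty)          # [B, T, 33]
        probs = torch.softmax(logits / temperature, dim=-1)
        sampled = torch.distributions.Categorical(probs).sample()
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Already-committed positions keep their tokens and full confidence.
        conf = torch.where(tokens == MASK_ID, conf, torch.ones_like(conf))
        tokens = torch.where(tokens == MASK_ID, sampled, tokens)
        # Cosine schedule: fraction of the grid left masked after this pass.
        n_mask = int(math.cos(math.pi / 2 * (step + 1) / steps) * T)
        if n_mask > 0:
            # Re-mask the n_mask least confident positions.
            idx = conf.topk(n_mask, largest=False).indices
            tokens.scatter_(1, idx, MASK_ID)
    return tokens
```

Early passes commit only a few high-confidence notes; the shrinking cosine schedule forces the rest to be filled in by the final pass.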

Architecture

Type: Bidirectional masked transformer with FiLM difficulty conditioning
Parameters: 19,321,896 (19.3M)
Audio input: MERT-v1-95M embeddings (768d) + onset/RMS/spectral centroid (3d) = 771d
Hidden dim: 512
Heads: 8
Layers: 6 (each with FiLM modulation from the difficulty embedding)
FFN: 2048, SwiGLU activation
Position encoding: Rotary (RoPE, base 10000)
Normalization: RMSNorm
Token vocab: 34 (32 fret combinations + silence + mask)
Output heads: token logits (33 classes), sustain (binary), duration (6 buckets)
Inference: 12-step cosine-schedule iterative unmasking with temperature sampling
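The FiLM conditioning amounts to scaling and shifting each layer's hidden states with vectors derived from a learned difficulty embedding. A minimal sketch of the idea (layer names and shapes are assumptions, not the repo's code):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation from a difficulty embedding.
    Hypothetical module names; dims taken from this card (hidden=512, 4 difficulties)."""
    def __init__(self, hidden_dim=512, n_difficulties=4):
        super().__init__()
        self.embed = nn.Embedding(n_difficulties, hidden_dim)
        self.to_gamma_beta = nn.Linear(hidden_dim, 2 * hidden_dim)

    def forward(self, x, difficulty):
        # x: [batch, seq, hidden]; difficulty: [batch] int in 0..3
        gamma, beta = self.to_gamma_beta(self.embed(difficulty)).chunk(2, dim=-1)
        return x * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
```

One such modulation per layer lets a single set of transformer weights produce very different note densities for easy vs. expert.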

Training data

Trained exclusively on official Guitar Hero setlists (GH1 through Warriors of Rock, Band Hero, and remastered track packs). Charts were extracted from GHWT:DE-format .pak files using GH-Toolkit-NET. Audio stems were decrypted and merged from the corresponding .fsb files.

Player-made Clone Hero community charts were deliberately excluded.

Limitations

The biggest limitation is that it's pretty mid.

  • Performs best on rock and guitar-forward tracks (because that's what the official charts mostly consist of)
  • Beat tracking quality directly affects chart quality
  • Sustain behavior can be strange, especially on fast passages. Sustains are heavily post-processed with rules that don't always reflect what real charts do
  • No HOPOs, Star Power, or slider/tap notes (could be added, would also be mid)
  • Other instruments (drums, bass, vocals) would need separate models
  • There's a fundamental ceiling to what an itty bitty model trained on a single consumer GPU can do

The charts may be useful for human charters as a starting point, assuming they aren't entirely offended by its mere existence. Press x to doubt.

Format

  • model.safetensors - weights only, no pickle
  • config.json - architecture hyperparameters
  • pytorch/ - contains the trained .bin with optimizer states

Usage

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

config_path = hf_hub_download("markury/midmid3-19m-0326", "config.json")
weights_path = hf_hub_download("markury/midmid3-19m-0326", "model.safetensors")

with open(config_path) as f:
    config = json.load(f)

state_dict = load_file(weights_path)

The model expects [batch, seq, 771] audio features (MERT + onset/RMS/centroid), [batch, seq] chart token indices (0-33), and a [batch] difficulty ID (0=easy, 3=expert).
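Dummy inputs matching those shapes look like the following. The shapes come from this card; the variable names, the sequence length, and the mask token id (33) are assumptions for illustration.

```python
import torch

batch, seq = 1, 256
audio_features = torch.randn(batch, seq, 771)  # MERT (768d) + onset/RMS/centroid (3d)
chart_tokens = torch.full((batch, seq), 33)    # start fully masked (mask id assumed to be 33)
difficulty = torch.tensor([3])                 # 0=easy, 1=medium, 2=hard, 3=expert
```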

Citation

Credit back here, the GitHub repo, or the live demo.
