midmid3


A slop generator for Guitar Hero.

19M-parameter masked-prediction transformer that generates "playable" Guitar Hero charts from audio. Give it a song, get a chart for all four difficulty levels. The charts are, at best, mid.

Try the live demo

How it works

  1. beat_this detects beats and builds a 16th-note grid
  2. MERT (frozen) extracts audio embeddings at each grid position
  3. This model iteratively fills in the chart over ~12 passes (MaskGIT-style unmasking), deciding which positions get notes, which frets to use, and how long sustains last, conditioned on the audio features and the difficulty level
  4. Post-processing applies per-difficulty constraints (fret limits, chord sizes, note spacing) and snaps everything to the grid
  5. Output is a MIDI chart + OGG audio + song.ini, ready to drop into GHWT:DE/DATA/MODS/
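The iterative unmasking in step 3 can be sketched as below. This is a simplified illustration of MaskGIT-style decoding under a cosine schedule, not the repo's actual code; the function signature, `MASK_ID = 33`, and the confidence rule are assumptions based on this card.

```python
import math
import torch

MASK_ID = 33  # mask token index (assumed from the 34-token vocab)

@torch.no_grad()
def generate(model, audio, difficulty, steps=12, temperature=1.0):
    """MaskGIT-style decoding sketch: start fully masked, commit the most
    confident predictions each pass, re-mask the rest for the next pass."""
    B, T, _ = audio.shape
    tokens = torch.full((B, T), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(audio, tokens, difficulty)          # [B, T, 33]
        probs = torch.softmax(logits / temperature, dim=-1)
        sampled = torch.distributions.Categorical(probs).sample()
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Already-committed positions keep their tokens and full confidence.
        conf = torch.where(tokens == MASK_ID, conf, torch.ones_like(conf))
        tokens = torch.where(tokens == MASK_ID, sampled, tokens)
        # Cosine schedule: fraction of the grid left masked after this pass.
        n_mask = int(math.cos(math.pi / 2 * (step + 1) / steps) * T)
        if n_mask > 0:
            # Re-mask the n_mask least confident positions.
            idx = conf.topk(n_mask, largest=False).indices
            tokens.scatter_(1, idx, MASK_ID)
    return tokens
```

Early passes commit only a few high-confidence notes; the shrinking cosine schedule forces the rest to be filled in by the final pass.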

Architecture

Type: Bidirectional masked transformer with FiLM difficulty conditioning
Parameters: 19,321,896 (19.3M)
Audio input: MERT-v1-95M embeddings (768d) + onset/RMS/spectral centroid (3d) = 771d
Hidden dim: 512
Heads: 8
Layers: 6 (each with FiLM modulation from the difficulty embedding)
FFN: 2048, SwiGLU activation
Position encoding: Rotary (RoPE, base 10000)
Normalization: RMSNorm
Token vocab: 34 (32 fret combinations + silence + mask)
Output heads: token logits (33 classes), sustain (binary), duration (6 buckets)
Inference: 12-step cosine-schedule iterative unmasking with temperature sampling
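The FiLM conditioning amounts to scaling and shifting each layer's hidden states with vectors derived from a learned difficulty embedding. A minimal sketch of the idea (layer names and shapes are assumptions, not the repo's code):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation from a difficulty embedding.
    Hypothetical module names; dims taken from this card (hidden=512, 4 difficulties)."""
    def __init__(self, hidden_dim=512, n_difficulties=4):
        super().__init__()
        self.embed = nn.Embedding(n_difficulties, hidden_dim)
        self.to_gamma_beta = nn.Linear(hidden_dim, 2 * hidden_dim)

    def forward(self, x, difficulty):
        # x: [batch, seq, hidden]; difficulty: [batch] int in 0..3
        gamma, beta = self.to_gamma_beta(self.embed(difficulty)).chunk(2, dim=-1)
        return x * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
```

One such modulation per layer lets a single set of transformer weights produce very different note densities for easy vs. expert.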

Training data

Trained exclusively on official Guitar Hero setlists (GH1 through Warriors of Rock, Band Hero, and remastered track packs). Charts were extracted from GHWT:DE-format .pak files using GH-Toolkit-NET. Audio stems were decrypted and merged from the corresponding .fsb files.

Player-made Clone Hero community charts were deliberately excluded.

Limitations

The biggest limitation is that it's pretty mid.

  • Performs best on rock and guitar-forward tracks (because that's what the official charts mostly consist of)
  • Beat tracking quality directly affects chart quality
  • Sustain behavior can be strange, especially on fast passages. Sustains are heavily post-processed with rules that don't always reflect what real charts do
  • No HOPOs, Star Power, or slider/tap notes (could be added, would also be mid)
  • Other instruments (drums, bass, vocals) would need separate models
  • There's a fundamental ceiling to what an itty bitty model trained on a single consumer GPU can do

The charts may be useful for human charters as a starting point, assuming they aren't entirely offended by its mere existence. Press x to doubt.

Format

  • model.safetensors - weights only, no pickle
  • config.json - architecture hyperparameters
  • pytorch/ - contains the trained .bin with optimizer states

Usage

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

config_path = hf_hub_download("markury/midmid3-19m-0326", "config.json")
weights_path = hf_hub_download("markury/midmid3-19m-0326", "model.safetensors")

with open(config_path) as f:
    config = json.load(f)

state_dict = load_file(weights_path)

The model expects [batch, seq, 771] audio features (MERT + onset/RMS/centroid), [batch, seq] chart token indices (0-33), and a [batch] difficulty ID (0=easy, 3=expert).
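Dummy inputs matching those shapes look like the following. The shapes come from this card; the variable names, the sequence length, and the mask token id (33) are assumptions for illustration.

```python
import torch

batch, seq = 1, 256
audio_features = torch.randn(batch, seq, 771)  # MERT (768d) + onset/RMS/centroid (3d)
chart_tokens = torch.full((batch, seq), 33)    # start fully masked (mask id assumed to be 33)
difficulty = torch.tensor([3])                 # 0=easy, 1=medium, 2=hard, 3=expert
```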

Citation

Credit back here, the GitHub repo, or the live demo.
