Luis Montes

Founder, Iceddev

 

GitHub @monteslu | Bluesky @monteslu.com | Mastodon @monteslu@fosstodon.org

First Song Played from Physical Media

Thomas Edison

1877 - Thomas Edison records on tin foil cylinder

"Mary Had a Little Lamb"

The Rise of Karaoke

1971 - Kobe, Japan: Daisuke Inoue invents karaoke ("empty orchestra")
1980s - Spreads globally; karaoke boxes become cultural phenomenon
1990s - Karaoke bars boom in US and Europe...

CD+G

CD+G Logo

A star is born

What is CD+G?

  • CD+Graphics - Philips/Sony, 1986
  • Standard audio CD with graphics embedded
  • Backwards compatible with regular CD players
  • Still extremely popular for karaoke today!

CD Subcode Channels


CD Sector (2352 bytes audio + 96 bytes subcode)

Subcode Channels: P Q R S T U V W
                  │ │ └─────────┘
                  │ │      │
                  │ │      └── CD+G Graphics Data (6 channels)
                  │ └── Track/Time info (TOC)
                  └── Pause/Play flags
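
Each subcode byte carries one bit per channel, with P in the most significant bit and W in the least, so the graphics data is the low six bits of every byte. A minimal sketch, assuming the raw subcode bytes are already in hand (the function names are illustrative, not from any library):

```python
CHANNEL_NAMES = "PQRSTUVW"  # bit 7 down to bit 0


def rw_symbol(subcode_byte: int) -> int:
    """Return the 6-bit R..W symbol, dropping the P and Q bits."""
    return subcode_byte & 0x3F


def split_channels(subcode_byte: int) -> dict:
    """Return the single bit carried for each of the eight channels."""
    return {
        name: (subcode_byte >> (7 - position)) & 1
        for position, name in enumerate(CHANNEL_NAMES)
    }


print(rw_symbol(0xC9))        # 9
print(split_channels(0x80))   # only P is set
```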
					
97.73% Audio | 2.27% Graphics

Graphics get 2.27% of each 33-byte frame (1 subcode byte in 33 × 6 of 8 channels)
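
The 2.27% figure can be checked with a few lines of arithmetic. This sketch assumes the usual frame layout of 24 audio bytes, 8 error-correction bytes, and 1 subcode byte:

```python
FRAME_BYTES = 24 + 8 + 1   # audio + error correction + subcode
SUBCODE_BYTES = 1
GRAPHICS_CHANNELS = 6      # R through W
ALL_CHANNELS = 8           # P through W

graphics_share = (SUBCODE_BYTES / FRAME_BYTES) * (GRAPHICS_CHANNELS / ALL_CHANNELS)
print(f"{graphics_share:.2%}")  # 2.27%
```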

26.5 kbit/s

US Robotics 28.8k Modem

Less than this dial-up modem.

CD+G Karaoke Graphics

Why Does This Look Like an Atari?

             CD+G (1986)    Atari 2600 (1977)
Resolution   288 × 192      160 × 192
Colors       16 of 4,096    128 total
Rendering    6×12 tiles     "Racing the beam"

9 years newer, same visual era!
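
The "16 of 4,096" comes from 4 bits per color channel and a 16-entry palette. A rough sketch of decoding one palette entry, assuming the common layout where each entry is packed into two 6-bit symbols (rrrrgg, then ggbbbb); the function name is made up for illustration:

```python
PALETTE_SIZE = 2 ** 4        # 16 colors on screen at once
COLOR_SPACE = 2 ** (4 * 3)   # 4 bits each for red, green, blue = 4,096


def decode_cdg_color(high: int, low: int) -> tuple:
    """Turn two 6-bit symbols into an 8-bit-per-channel (r, g, b) tuple."""
    high &= 0x3F
    low &= 0x3F
    red = (high >> 2) & 0x0F
    green = ((high & 0x03) << 2) | ((low >> 4) & 0x03)
    blue = low & 0x0F
    # Scale 0..15 up to 0..255 (15 * 17 == 255).
    return (red * 17, green * 17, blue * 17)


print(decode_cdg_color(0x3C, 0x00))  # (255, 0, 0)
```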

CD+G Instruction Types


Memory Preset     - Clear screen to color
Border Preset     - Set border color
Load Color Table  - Set 8 colors (low/high)
Tile Block        - Draw 6×12 pixel tile
Scroll Preset     - Scroll with color fill
Scroll Copy       - Scroll with wrap
					

That's it. 6 commands to build an entire visual experience.
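
To make the instruction list concrete, here is a hedged sketch of decoding one 24-byte subcode packet. It assumes the widely documented packet layout (command, instruction, 2 parity bytes, 16 data bytes, 4 parity bytes) and instruction codes; only the instructions named above are in the table, and anything else is reported as "Unknown". Function names are illustrative.

```python
CDG_COMMAND = 0x09

INSTRUCTIONS = {
    1: "Memory Preset",
    2: "Border Preset",
    6: "Tile Block",
    20: "Scroll Preset",
    24: "Scroll Copy",
    30: "Load Color Table (low)",
    31: "Load Color Table (high)",
}


def decode_packet(packet: bytes):
    """Return (instruction name, 16 data symbols), or None if not CD+G."""
    if len(packet) != 24:
        raise ValueError("a subcode packet is 24 bytes")
    if packet[0] & 0x3F != CDG_COMMAND:
        return None
    name = INSTRUCTIONS.get(packet[1] & 0x3F, "Unknown")
    data = bytes(b & 0x3F for b in packet[4:20])
    return name, data


def tile_rows(data: bytes) -> list:
    """Expand Tile Block data into 12 rows of 6 palette indices."""
    color0, color1 = data[0] & 0x0F, data[1] & 0x0F
    rows = []
    for row_bits in data[4:16]:
        # Bit 5 is the leftmost pixel, bit 0 the rightmost.
        rows.append([color1 if row_bits & (0x20 >> x) else color0 for x in range(6)])
    return rows
```

A tile only ever uses two palette entries, which is a big part of why the output looks so blocky.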

Why CD+G Won't Die

  • Massive existing library (100,000+ songs)
  • Professional publishers (Sound Choice, Sunfly, Chartbuster)
  • Reasonable file sizes (MP3 ~4MB + CDG ~4MB)
  • Every KJ has a CD+G collection

 

But it's 2026... can we do better?

CD+G Enhanced Graphics

And now the AI stuff

DJ on a bike

From Karaoke to DJing?

The DJ's Dream

🥁 🎸 🎹 🎤

  • DJs wanted to isolate parts of songs
  • Vocals for mashups and remixes
  • Drums for beat matching
  • Previously: expensive studio stems or nothing

AI Source Separation

  • 2015: Early neural network attempts
  • 2019: Spleeter (Deezer) - first practical solution
  • 2021: Demucs (Facebook/Meta) - state of the art
  • 2024+: Real-time separation in DJ software

MP4 Stems

🎬

The New Standard?

Native Instruments Stems Format

  • Standard MP4/M4A container
  • Multiple audio tracks in one file
  • Backwards compatible (plays as stereo mix)
  • Metadata for track names, colors, etc.

Stems Structure


song.stem.m4a
├── Track 0: Master mix - plays in normal players
├── Track 1: Drums
├── Track 2: Bass
├── Track 3: Other (keys, guitars, etc.)
├── Track 4: Vocals
└── Metadata: atoms
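
One way to check a file against this layout is to list its audio streams with ffprobe and label them in order. A minimal sketch, assuming the JSON came from `ffprobe -v error -show_streams -of json song.stem.m4a` and that the track order matches the tree above; `label_audio_streams` is a hypothetical helper:

```python
import json

STEM_NAMES = ["Master mix", "Drums", "Bass", "Other", "Vocals"]


def label_audio_streams(ffprobe_json: str) -> list:
    """Pair each audio stream index with the stem expected at that position."""
    streams = json.loads(ffprobe_json).get("streams", [])
    # Cover art can show up as a video stream, so keep audio only.
    audio = [s for s in streams if s.get("codec_type") == "audio"]
    labels = []
    for position, stream in enumerate(audio):
        if position < len(STEM_NAMES):
            name = STEM_NAMES[position]
        else:
            name = f"Extra {position}"
        labels.append((stream["index"], name))
    return labels
```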
					

Why M4A Stems for Karaoke?

  • Real original audio - not MIDI recreation!
  • Control vocal volume (or mute entirely)
  • Practice with just vocals + one instrument
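
The volume control boils down to a weighted sum of the stems. A minimal sketch in plain Python, assuming every stem is already decoded to equal-length lists of float samples in the range -1..1 (a real player would do this on audio buffers, not lists):

```python
def mix_stems(stems: dict, gains: dict) -> list:
    """Sum the stems sample by sample, scaling each one by its gain."""
    if not stems:
        return []
    length = len(next(iter(stems.values())))
    mix = [0.0] * length
    for name, samples in stems.items():
        if len(samples) != length:
            raise ValueError(f"stem {name!r} has a different length")
        gain = gains.get(name, 1.0)  # stems without a gain play at full volume
        for i, sample in enumerate(samples):
            mix[i] += gain * sample
    # Clamp so the sum of several loud stems cannot exceed full scale.
    return [max(-1.0, min(1.0, value)) for value in mix]


stems = {"vocals": [0.5, 0.5], "drums": [0.25, -0.25]}
karaoke = mix_stems(stems, {"vocals": 0.0})   # mute the singer
practice = mix_stems(stems, {"drums": 0.0})   # vocals only
```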

So how do we make the stems?

Demucs

Mark Zuckerberg

Meta's Audio Source Separation

What is Demucs?

  • Open source (MIT license)
  • State-of-the-art source separation
  • Runs on PyTorch
  • Trained on huge dataset of music

Demucs in Action


$ python -m demucs song.wav

# Output: separated/htdemucs/song/
#   vocals.wav
#   drums.wav
#   bass.wav
#   other.wav
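
The same command can be driven from a script. This is a sketch, not how any particular tool wraps Demucs: the helper names are made up, and it assumes the `demucs` package is installed and that output lands in `<out>/<model>/<song name>/` as shown above.

```python
import subprocess
import sys
from pathlib import Path


def demucs_command(song, model="htdemucs", out_dir="separated", vocals_only=False):
    """Build the command line for one separation run."""
    cmd = [sys.executable, "-m", "demucs", "-n", model, "-o", str(out_dir)]
    if vocals_only:
        cmd.append("--two-stems=vocals")  # writes vocals.wav and no_vocals.wav
    cmd.append(str(song))
    return cmd


def expected_outputs(song, model="htdemucs", out_dir="separated", vocals_only=False):
    """List the files Demucs is expected to write for this song."""
    names = ["vocals", "no_vocals"] if vocals_only else ["vocals", "drums", "bass", "other"]
    folder = Path(out_dir) / model / Path(song).stem
    return [folder / f"{name}.wav" for name in names]


def separate(song, **options):
    """Run Demucs and return the expected output paths."""
    subprocess.run(demucs_command(song, **options), check=True)
    return expected_outputs(song, **options)
```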
					

Demo Time!

Let's hear some stems...

But Wait...

We have the music separated.

What about the lyrics?

Multi-purpose Metadata

  • Standard atoms - Artist, title, album, cover art
  • stem atom - NI Stems metadata for DJ software
  • kara atom - Synced lyrics for karaoke

 

One .stem.m4a file works in Traktor, Mixxx, AND Loukai

Whisper

OpenAI

OpenAI's Speech Recognition

What is Whisper?

  • Open source speech recognition
  • Trained on 680,000 hours of audio
  • Multilingual (99 languages)
  • Timestamp generation!

Whisper for Lyrics


{
  "text": "Never gonna give you up",
  "start": 43.52,
  "end": 45.84
}
					
  • Feed it the isolated vocals from Demucs
  • Get word-level timestamps
  • Embed directly into M4A stems file
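
A hedged sketch of the first two steps using the open source `openai-whisper` package. The flattening helper and its output shape (matching the JSON above) are assumptions for illustration; how the result is embedded in the M4A file is not shown here.

```python
def transcribe_vocals(vocals_path, model_name="base"):
    """Run Whisper on an isolated vocal stem with word timestamps enabled."""
    import whisper  # requires: pip install openai-whisper

    model = whisper.load_model(model_name)
    return model.transcribe(str(vocals_path), word_timestamps=True)


def words_from_result(result: dict) -> list:
    """Flatten Whisper's segments into one list of timed words."""
    words = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            words.append({
                "text": word["word"].strip(),  # Whisper pads words with a leading space
                "start": word["start"],
                "end": word["end"],
            })
    return words
```

Usage would look like `words_from_result(transcribe_vocals("separated/htdemucs/song/vocals.wav"))`.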

The Pipeline


┌───────────────┐     ┌─────────────┐     ┌─────────────┐
│   Any Song    │ ──► │   Demucs    │ ──► │   Whisper   │
│ (mp3, flac,   │     │  (Stems)    │     │  (Lyrics)   │
│  ogg, wav)    │     │             │     │             │
└───────────────┘     └─────────────┘     └──────┬──────┘
                                                 │
                                                 ▼
                                     ┌─────────────────────┐
                                     │    M4A Stems File   │
                                     │   with synced lyrics│
                                     └─────────────────────┘
					

Whisper Challenges

  • Not always perfect transcription
  • Timing can be slightly off
  • Struggles with some vocal styles
  • Trained on speech, not music

... insurmountable?

LLMs

Super Battle Droid

Clankers to the rescue!

LLMs for Lyrics

  • Fix Whisper transcription errors
  • Look up actual lyrics and align
  • Handle multiple languages
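
The "look up and align" idea does not strictly need an LLM. As a point of comparison, here is a plain sequence-alignment sketch using Python's `difflib`: it keeps the reference lyrics as the text and borrows timing from whatever Whisper heard in the same spot. This is a swapped-in technique for illustration, not the LLM approach itself, and the function names are hypothetical.

```python
import difflib
import re


def normalize(word: str) -> str:
    """Lowercase and strip punctuation so "Sky," matches "sky"."""
    return re.sub(r"[^\w']", "", word.lower())


def align_lyrics(whisper_words: list, reference_text: str) -> list:
    """Attach Whisper timings to the words of the reference lyrics."""
    reference = reference_text.split()
    heard = [normalize(w["text"]) for w in whisper_words]
    wanted = [normalize(w) for w in reference]
    matcher = difflib.SequenceMatcher(a=heard, b=wanted, autojunk=False)

    aligned = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "delete":
            continue  # Whisper heard words the reference does not contain
        if tag == "insert":
            # Words Whisper missed entirely: keep the text, timing unknown.
            aligned += [{"text": w, "start": None, "end": None} for w in reference[b0:b1]]
        elif tag == "equal":
            for offset, word in enumerate(reference[b0:b1]):
                source = whisper_words[a0 + offset]
                aligned.append({"text": word, "start": source["start"], "end": source["end"]})
        else:  # "replace": spread the misheard span evenly over the real words
            start = whisper_words[a0]["start"]
            step = (whisper_words[a1 - 1]["end"] - start) / (b1 - b0)
            for k, word in enumerate(reference[b0:b1]):
                aligned.append({"text": word, "start": start + k * step,
                                "end": start + (k + 1) * step})
    return aligned
```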

Example Corrections


Whisper: "Excuse me while I kiss this guy"
LLM:     "Excuse me while I kiss the sky"

Whisper: "Hold me closer, Tony Danza"
LLM:     "Hold me closer, tiny dancer"

Whisper: "I feel stupid and contagious"
LLM:     "...actually that's correct" 🤷
					

Loukai


Putting It All Together

What is Loukai?

  • Open Source karaoke for the AI stems era
  • Plays M4A stems with real-time mixing
  • Also supports legacy CD+G format
  • Cross-platform (Linux, Windows, macOS)

Built-in Creator

  • Demucs - AI stem separation
  • Whisper - AI lyrics transcription
  • CREPE - Pitch tracking for musical key detection
  • LLM correction - Fix transcription errors

Drop any audio file → Get karaoke-ready .stem.m4a

Tech Stack

  • Electron
  • React
  • Vite
  • Tailwind CSS
  • Butterchurn
  • Web Audio API
  • Socket.IO
  • WASM
  • PyTorch
  • ffmpeg

Demo Time!

Let's light this candle

The Future of Karaoke

  • AI-generated stems from any song
  • Automatic lyrics with timestamps
  • Real-time pitch correction
  • Vocal coaching

M4A is a worthy successor to CD+G!

Like that smash button!

Thank You

 
