Yes: there is a TouchDesigner component that can recognize sung or spoken vocals and transcribe them into text.
A developer known as blankensmithing created a set of custom TouchDesigner components that integrate OpenAI's Whisper (for speech-to-text transcription) and ChatGPT within TouchDesigner. These components are designed to be easy to use: you simply add your OpenAI API key, and they work right inside TouchDesigner without additional setup.
Additionally, the tutorial is available in the TouchDesigner community and on the official Derivative website, with project files and components shared via the creator's Patreon.
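For context, the kind of request such a component presumably makes under the hood can be sketched as below. The endpoint and the `whisper-1` model name come from OpenAI's public speech-to-text API; the helper function itself is illustrative, not the plugin's actual code:

```python
import os
from pathlib import Path

# OpenAI's speech-to-text endpoint (per their public API docs).
OPENAI_TRANSCRIBE_URL = "https://api.openai.com/v1/audio/transcriptions"

def build_transcription_request(api_key: str, audio_path: str) -> dict:
    """Assemble the pieces of a Whisper transcription call without sending it.

    Illustrative helper: the real components handle this for you once the
    API key is set.
    """
    return {
        "url": OPENAI_TRANSCRIBE_URL,
        "headers": {"Authorization": f"Bearer {api_key}"},
        # Sent as multipart/form-data: the audio file plus the model name.
        "fields": {"model": "whisper-1", "file": Path(audio_path).name},
    }

req = build_transcription_request(os.environ.get("OPENAI_API_KEY", "sk-..."), "vocals.wav")
# With the `requests` library you would then POST it, roughly:
#   requests.post(req["url"], headers=req["headers"],
#                 data={"model": "whisper-1"},
#                 files={"file": open("vocals.wav", "rb")})
```

The point is simply that all of this plumbing (auth header, multipart upload, model selection) is what the components hide behind a single API-key field.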
A community member on Reddit shared:
“I’m just looking at adding custom text to speech for the text output… currently looking at adding something like the elevenlabs.io API.”
This indicates these components are actively used and flexible for extension, even in creative performance contexts.
In short:
- Captures sung vocals as well as spoken word, though transcription accuracy will vary with audio clarity.
- No complex installation: it only requires inserting your OpenAI API key into the component.
- Other users have successfully integrated transcription into interactive installations and added TTS functionality (e.g., using ElevenLabs for voice output).
If you're looking for an alternative, perhaps a more customized or lighter-weight solution, you could set up your own speech-to-text pipeline using Python:
- Python scripts in a Script DAT or Text DAT within TouchDesigner can run local or cloud speech-to-text libraries.
- External speech-recognition tools can be launched via Python's `subprocess.Popen()` from within TouchDesigner scripts, with the output fed back into DATs or CHOPs; communication can also be handled via a Web Server DAT or similar.
This DIY route allows more control over networked workflows, but requires a fair bit more setup.
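The subprocess route above can be sketched as follows. The TouchDesigner operator name and the Whisper CLI invocation in the comments are illustrative assumptions, not tested against a live project:

```python
import subprocess

def transcribe_external(cmd, timeout=120):
    """Run an external speech-to-text command and return its stdout as text."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    out, err = proc.communicate(timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"transcriber exited {proc.returncode}: {err.strip()}")
    return out.strip()

# Inside TouchDesigner you might call this from a Script DAT callback and push
# the result into a Table DAT, e.g. op('transcript').appendRow([text])
# (hypothetical operator name). A local Whisper install could be invoked as
# something like ["whisper", "clip.wav", "--model", "base"], shown here only
# for illustration. A stand-in command keeps the sketch runnable anywhere:
text = transcribe_external(["echo", "transcribed lyrics go here"])
```

Note that a long-running transcription will block the cook loop if run synchronously, so for live use you would typically launch it in a background thread or poll the process instead of calling `communicate()` directly.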
| Option | Pros | Considerations |
|---|---|---|
| Whisper Plugin | Seamless TouchDesigner integration, easy setup | Requires OpenAI API key; transcription accuracy varies for singing |
| DIY Python Implementation | Highly flexible and controllable | More setup and Python scripting required |
- Want to explore the Whisper and ChatGPT TouchDesigner components? I can walk you through setting them up, or help integrate them into your project.
- Curious about improving transcription accuracy for singing? We can look into strategies like audio preprocessing, using denoising filters, or model fine-tuning.
- Interested in adding voice output? We can integrate TTS tools such as ElevenLabs or Google TTS to complete the interaction loop.
Let me know which direction you're leaning toward, and I'm happy to assist with the technical details or walk you through examples tailored to your workflow!