Module 1: AI Power User

Voice & multimodal workflows

Voice-to-AI pipelines, TTS output, and audio as a primary instruction method.

Topic notes

Overview

The problem

Typing is the default way people interact with AI. But typing is slow, requires your hands, and doesn't work when you're walking, driving, or away from a keyboard. Voice input and audio output transform AI from a desk tool into something that works with your life.


Our voice workflow

We use voice notes as the primary instruction method. The full loop:

Speak (Telegram voice note)
  → Transcribe (Deepgram whisper-large)
    → Interpret (Claude)
      → Execute (tools, code, research)
        → Narrate (Edge TTS)
          → Deliver (Telegram voice note)

The result: you send a voice note with an instruction, and you get back a voice note with the answer. Hands-free AI interaction.


Setting up voice input: Deepgram transcription

The transcription challenge

Telegram encodes voice notes as Opus audio in an OGG container. Not every transcription service handles this format well.
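If a service rejects the audio, it helps to confirm what the container and codec actually are before swapping models. A quick check with ffmpeg's ffprobe (assuming it is installed; the function name is ours):

```shell
# check_voice_note: print the codec and container of an audio file.
# Telegram voice notes typically report codec_name=opus, format_name=ogg.
check_voice_note() {
  ffprobe -v error \
    -show_entries stream=codec_name:format=format_name \
    -of default=noprint_wrappers=1 "$1"
}
```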

What we tested:

Service/Model            Result with Telegram audio
Deepgram Nova-2          ❌ Failed: couldn't decode Telegram's Opus format
Deepgram whisper-large   ✅ Works reliably
OpenAI Whisper API       Not tested (would require an additional API key)

Lesson: Don't assume the newest model handles your specific format. Test with actual data.

The transcription script

We wrote a script at ~/.openclaw/scripts/transcribe-deepgram:

#!/bin/bash
FILE="$1"
LANG="${2:-en}"

curl -s --request POST \
  --url "https://api.deepgram.com/v1/listen?model=whisper-large&language=$LANG" \
  --header "Authorization: Token $DEEPGRAM_API_KEY" \
  --header "Content-Type: audio/ogg" \
  --data-binary @"$FILE"

Simple: takes a file path and optional language, sends to Deepgram, returns JSON with the transcript.
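The transcript sits a few levels deep in that JSON. With a minimal response shaped like Deepgram's (the transcript and confidence values here are illustrative), extraction is a one-line jq call:

```shell
# Save a minimal response shaped like Deepgram's /v1/listen output
cat > /tmp/deepgram-response.json <<'EOF'
{"results":{"channels":[{"alternatives":[{"transcript":"send the morning briefing","confidence":0.98}]}]}}
EOF

# Pull out the plain-text transcript
jq -r '.results.channels[0].alternatives[0].transcript' /tmp/deepgram-response.json
# prints: send the morning briefing
```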

The robust wrapper

Voice notes sometimes fail on first attempt (network issues, temporary API errors). We built a retry wrapper:

#!/bin/bash
FILE="${1:-latest}"
ATTEMPTS="${2:-3}"
DELAY="${3:-2}"

# "latest" resolves to the most recent voice note file
if [ "$FILE" = "latest" ]; then
  FILE=$(ls -t ~/.openclaw/media/inbound/file_*---*.ogg 2>/dev/null | head -1)
fi

for i in $(seq 1 $ATTEMPTS); do
  RESULT=$(~/.openclaw/scripts/transcribe-deepgram "$FILE")
  if echo "$RESULT" | jq -e '.results' > /dev/null 2>&1; then
    echo "$RESULT" | jq -r '.results.channels[0].alternatives[0].transcript'
    exit 0
  fi
  sleep $DELAY
done

echo "Transcription failed after $ATTEMPTS attempts" >&2
exit 1

Three attempts with a 2-second delay between retries. The latest shortcut automatically finds the most recent voice note.


Setting up voice output: TTS delivery

Text-to-Speech generation

We use Edge TTS (Microsoft's free TTS engine) to convert text to audio. The TTS tool generates an MP3 file.

Key constraint: Maximum 4,096 characters per TTS call. For longer content, split into parts.
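One way to respect that limit is to split on word boundaries before calling TTS. A minimal sketch using fold (the function name is ours; the 4,000-character ceiling leaves a little headroom under the cap):

```shell
# split_for_tts: break a text file into word-safe chunks, one per line,
# each small enough for a single TTS call (4,000 chars < 4,096 limit).
# fold -s breaks at spaces so words stay intact.
split_for_tts() {
  fold -s -w 4000 "$1"
}
```

Each output line then becomes one TTS call and one voice note part.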

Delivering voice notes via Telegram

Here's where we hit an important lesson. The AI platform generates a MEDIA: path for TTS output. But this path points to a temporary file that gets cleaned up quickly.

What failed: Including MEDIA:/tmp/tts-xxx/voice-xxx.mp3 in the chat response. The file was often gone by the time the system tried to send it.

What works: Sending the audio file directly via the Telegram bot API:

message(
  action: "send",
  channel: "telegram",
  target: "user-id",
  filePath: "/tmp/tts-upload/report.mp3",
  asVoice: true
)

The asVoice: true parameter makes Telegram render it as a playable voice bubble instead of a file attachment. This is the reliable method we now use for all audio delivery.
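For reference, the tool call above corresponds to the Bot API's sendVoice method. A direct curl equivalent would look roughly like this (TELEGRAM_BOT_TOKEN and CHAT_ID are assumed to be set in the environment):

```shell
# send_voice: deliver an audio file as a playable Telegram voice bubble.
# Mirrors the message(...) call above. sendVoice accepts OGG/Opus, MP3,
# or M4A; other formats fall back to a regular audio attachment.
send_voice() {
  curl -s -X POST \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendVoice" \
    -F "chat_id=${CHAT_ID}" \
    -F "voice=@$1"
}
```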


Experiment: narrated news briefing

We tested the full voice pipeline with a 10-story news briefing:

  1. Searched web for AI/blockchain/tech news (Brave Search)
  2. Analyzed each story with consequence chains (Claude)
  3. Narrated each as a separate voice note (Edge TTS)
  4. Delivered all 10 individually via Telegram (message tool)

Timing: ~3 minutes total for research, analysis, 10 TTS generations, and 10 message sends.
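Steps 3 and 4 of that run can be sketched as a loop, assuming one summary per story-NN.txt file, the edge-tts command-line tool, and a send_voice helper wrapping the delivery step (the file layout and helper are illustrative):

```shell
# narrate_stories: turn each story summary into an Edge TTS voice note
# and deliver it individually via a hypothetical send_voice helper.
narrate_stories() {
  local story out
  mkdir -p /tmp/tts-upload
  for story in story-*.txt; do
    out="/tmp/tts-upload/${story%.txt}.mp3"
    edge-tts --text "$(cat "$story")" --write-media "$out"
    send_voice "$out"
  done
}
```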

What worked well:

  • Each story as a separate audio clip lets the listener skip or replay individual items
  • The narration format (alias, summary, 3 consequence levels) works well for audio consumption
  • Voice delivery means the listener can absorb information while doing other things

What could be better:

  • Edge TTS voice quality is functional but not great: lacks natural inflection
  • No control over speaking pace or emphasis
  • 4,096 character limit means complex stories must be condensed

When voice beats text

Through daily use, we've found voice works best for:

  • Morning briefings: listen while getting ready
  • News summaries: absorb while walking or commuting
  • Status reports: quick audio update vs reading a wall of text
  • Instructions to the agent: faster to speak than type, especially on mobile

When text is still better:

  • Code or structured data: needs visual parsing
  • Anything you'll reference later: text is searchable, audio isn't
  • Complex instructions with specific formatting requirements

Multimodal beyond voice

Our setup also handles:

Image analysis: Send a screenshot or photo, the agent analyzes it using vision capabilities. We've used this for reviewing website designs, reading error screenshots, and checking deploy previews.

Document processing: Send a PDF or Word document, the agent extracts and processes the content. We used this to proofread a 155,000-character creative writing piece; the agent read the entire document and found exactly one error.

The pattern: Input in whatever format is natural (voice, image, document) → AI processes and understands → Output in whatever format is useful (text, voice, file).


What we haven't built

  • No real-time voice conversation. Our pipeline is async: send voice note → wait → get response. Not a live voice chat.
  • No speaker identification. The system doesn't distinguish between different speakers in a voice note.
  • No voice cloning. We use Edge TTS's default voice, not a custom voice.
  • No transcription of languages we haven't tested. Deepgram whisper-large works well for English and Spanish. Other languages untested.
