Skip to content

Audio Pipeline

AI Transcription Notepad processes audio through several stages before sending it to AI models.

Overview

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Recording  │───▶│     AGC     │───▶│     VAD     │───▶│ Compression │───▶│  API Call   │
│  (PyAudio)  │    │  (Normalize)│    │ (TEN VAD)   │    │  (16kHz)    │    │ (OpenRouter)│
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
     48kHz              ±20dB          Remove silence      Downsample         Transcribe

The pipeline uses a fused audio processing approach that eliminates redundant WAV parse/export cycles between stages, improving throughput.

Pipeline Stages

1. Recording (audio_recorder.py)

  • Format: 16-bit PCM, mono
  • Sample rate: Device native (typically 44.1kHz or 48kHz)
  • Library: PyAudio
  • Features: Automatic sample rate negotiation, microphone disconnect handling

2. Automatic Gain Control (audio_processor.py)

Normalizes audio levels for consistent transcription accuracy.

ParameterValueDescription
Target peak-3 dBFSLeaves headroom while ensuring good signal
Minimum threshold-40 dBFSSkip AGC if audio is quieter (noise floor)
Maximum gain+20 dBPrevents over-amplification

Behavior: Only boosts quiet audio—never attenuates loud audio.

3. Voice Activity Detection (vad_processor.py)

Removes silence segments before API upload to reduce costs and improve accuracy.

Engine: TEN VAD (Apache 2.0)

TEN VAD was chosen over alternatives (Silero VAD, WebRTC VAD) for:

  • Smaller footprint (~306KB vs ~2.2MB for Silero)
  • Faster speech-to-silence transition detection
  • Lower latency for real-time processing
  • No model download required (bundled with pip package)
ParameterValueDescription
Sample rate16kHzRequired by TEN VAD
Hop size256 samples~16ms analysis windows
Threshold0.5Speech probability threshold
Min speech duration250msIgnore speech shorter than this
Min silence duration100msIgnore silence shorter than this
Padding30msBuffer around speech segments

4. Compression (audio_processor.py)

  • Output: 16kHz mono WAV
  • Reduction: ~66% smaller than 48kHz stereo input
  • Quality: No perceptible loss for speech transcription (matches Gemini's internal format)

5. API Submission (transcription.py)

  • Audio is base64-encoded and sent with the cleanup prompt
  • Submitted to OpenRouter's API endpoint
  • Returns transcript, token counts, and cost data

6. Storage (database_mongo.py)

  • Transcripts: Saved to Mongita (MongoDB-compatible local database)
  • Audio archive (optional): Opus format at ~24kbps (~180KB per minute)

Performance Benchmarks

Benchmarked on Intel Core i7-12700F (12th Gen, 12 cores).

VAD Processing Time

Audio LengthProcessing TimeReal-Time FactorSpeed
5 seconds63ms0.01380x faster than real-time
10 seconds131ms0.01376x faster than real-time
30 seconds436ms0.01569x faster than real-time
60 seconds864ms0.01469x faster than real-time

Average RTF: 0.014 (adds ~14ms per second of audio)

Silence Reduction

With typical dictation patterns (speech with pauses):

ScenarioOriginalAfter VADReduction
Continuous speech30s28s7%
Speech with pauses30s15s50%
Long pauses/thinking30s6s80%

End-to-End Latency

For a typical 30-second recording:

StageDurationNotes
VAD processing~400msScales linearly with audio length
Compression~100msResampling to 16kHz
API round-trip1500-3000msDepends on network and model
Total2-3.5 secondsVAD overhead is negligible

CPU Impact

  • VAD: Negligible (~1% CPU during processing)
  • Processing is 70x faster than real-time, so CPU usage is brief
  • No GPU required—runs entirely on CPU

Configuration

VAD can be toggled via the checkbox on the main recording page, or in Settings → Behavior.

Audio archival is configured in Settings → Behavior → Archive audio recordings.

System Requirements

Linux: TEN VAD requires libc++1:

bash
sudo apt install libc++1

Source Files

FilePurpose
audio_recorder.pyPyAudio recording with device management
audio_processor.pyAGC, compression, fused pipeline, Opus archival
vad_processor.pyTEN VAD integration
transcription.pyOpenRouter API client

Released under the MIT License.