Skip to content

Technology Stack

Overview

AI Transcription Notepad combines several key technologies to deliver single-pass voice transcription with AI cleanup.

Core Components

ComponentTechnologyPurpose
UI FrameworkPyQt6Desktop GUI with tabbed interface, system tray, keyboard shortcuts
Audio RecordingPyAudioMicrophone capture with device selection
Audio Processingpydub + FFmpegFormat conversion, compression, gain control
Voice Activity DetectionTEN VADRemoves silence before API upload to reduce costs
Text-to-SpeechMicrosoft Edge TTSAudio feedback announcements
DatabaseMongitaMongoDB-compatible local storage for transcripts
ChartspyqtgraphAnalytics visualizations
Global Hotkeyspynput + evdevSystem-wide keyboard shortcuts

Transcription Backend

The app sends audio directly to multimodal AI models via OpenRouter for single-pass transcription and cleanup.

ProviderSDKEndpointNotes
OpenRouteropenai (compatible)openrouter.ai/api/v1Gemini 3 Flash/Pro, per-key cost tracking

The fused audio pipeline processes recording, AGC, VAD, and compression in a single optimized pass before API submission.

Voice Activity Detection

TEN VAD is a lightweight native library (~306KB) that detects speech segments and removes silence before sending audio to the API. This reduces file size and API costs.

  • Bundled with the ten-vad pip package (no download required)
  • Sample rate: 16kHz
  • Faster and more accurate than Silero VAD for real-time use
  • Requires libc++1 on Linux: sudo apt install libc++1

Text-to-Speech

Microsoft Edge TTS powers the voice announcements in TTS audio feedback mode.

  • Voice: British English male (en-GB-RyanNeural)
  • Pre-generated audio files bundled in app/assets/tts/ (~1.7MB)
  • Dynamic generation available for analytics readout
  • Uses edge-tts Python package

Python Dependencies

PyQt6>=6.6.0          # Desktop GUI framework
pyaudio>=0.2.14       # Audio recording
openai>=1.40.0        # OpenRouter API (OpenAI-compatible)
pydub>=0.25.1         # Audio processing
ten-vad>=1.0.6        # Voice activity detection
edge-tts>=6.1.0       # Text-to-speech for announcements
mongita>=1.2.0        # MongoDB-compatible local database
markdown>=3.5.0       # Markdown rendering
pynput>=1.7.6         # Keyboard input handling
evdev>=1.6.0          # Linux input device access (hotkeys)
httpx>=0.27.0         # HTTP client
pyqtgraph>=0.13.0     # Charts and visualizations

System Dependencies

Install on Ubuntu/Debian:

bash
sudo apt install python3 python3-venv ffmpeg portaudio19-dev libc++1

Or run the dependency checker:

bash
./scripts/install-deps.sh

File Locations

~/.config/voice-notepad-v3/
├── config.json           # Settings and API keys
├── mongita/              # MongoDB-compatible database
├── usage/                # Daily cost tracking (JSON)
└── audio-archive/        # Opus recordings (if enabled)

Audio Archival

When enabled, recordings are saved in Opus format optimized for speech:

  • Bitrate: ~24kbps
  • A 1-minute recording uses ~180KB
  • Stored in ~/.config/voice-notepad-v3/audio-archive/

Released under the MIT License.