Audio (general)
182 repos
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
A feature-rich command-line audio/video downloader
LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
Video.js - open source HTML5 video player
Text-Prompted Generative Audio Model
Instant voice cloning by MIT and MyShell. Audio foundation model.
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
SRS is a simple, high-efficiency, real-time media server supporting RTMP, WebRTC, HLS, HTTP-FLV, HTTP-TS, SRT, MPEG-DASH, and GB28181, with codec support for H.264, H.265, AV1, VP9, AAC, Opus, and G.711.
The most advanced free and open-source browser fingerprinting library
GUI for a Vocal Remover that uses Deep Neural Networks.
Your Personal Streaming Service
Buzz transcribes and translates audio offline on your personal computer. Powered by OpenAI's Whisper.
Ready-to-use SRT / WebRTC / RTSP / RTMP / LL-HLS / MPEG-TS / RTP media server and media proxy that lets you read, publish, proxy, record, and play back video and audio streams.
The free and privacy-friendly screen recorder with no limits
Audio Editor
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
A PyTorch-based Speech Toolkit
A React component for playing a variety of URLs, including file paths, YouTube, Facebook, Twitch, SoundCloud, Streamable, Vimeo, Wistia and DailyMotion
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
Acoustic keyboard eavesdropping
HTML5 <audio> or <video> player with support for MP4, WebM, and MP3 as well as HLS, Dash, YouTube, Facebook, SoundCloud and others with a common HTML5 MediaElement API, enabling a consistent UI in all browsers.
Multilingual Voice Understanding Model
Text-audio foundation model from Boson AI
Mumble is open-source, low-latency, high-quality voice chat software.
Synchronous multiroom audio player
An extensible, plugin-oriented, HTML5-first media player for the web
Gradio WebUI for creators and developers, featuring key TTS (Edge-TTS, kokoro) and zero-shot Voice Cloning (E2 & F5-TTS, CosyVoice), with Whisper audio processing, YouTube download, Demucs vocal isolation, and multilingual translation.
An Open Source Python alternative to NotebookLM's podcast feature: Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI
YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
GNU Radio - the Free and Open Software Radio Ecosystem
Download web video and audio
High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
250+ Fine-tuning & RL Notebooks for text, vision, audio, embedding, TTS models.
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
Think DSP: Digital Signal Processing in Python, by Allen B. Downey.
Random digital audio effects
A curated list of awesome data labeling tools
Effort-free video editing!
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.
Noise suppression using deep filtering
Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
Soundcloud Music Downloader
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Cross-platform audio I/O library in pure Rust
UI components and hooks for building video/audio players on the web. Robust, customizable, and accessible. Modern alternative to JW Player and Video.js.
A library for audio and music analysis and feature extraction.
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi.ai.
A single Gradio + React WebUI with extensions for ACE-Step, OmniVoice, Kimi Audio, Piper TTS, GPT-SoVITS, CosyVoice, XTTSv2, DIA, Kokoro, OpenVoice, ParlerTTS, Stable Audio, MMS, StyleTTS2, MAGNet, AudioGen, MusicGen, Tortoise, RVC, Vocos, Demucs, SeamlessM4T, and Bark!
Transcribe any audio to text, translate and edit subtitles 100% locally with a web UI. Powered by whisper models!
Data manipulation and transformation for audio signal processing, powered by PyTorch
Use API to call the music generation AI of suno.ai, and easily integrate it into agents like GPTs.
Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Self-hosted AI audio transcription
Open Source Audio Matching and Mastering
ElevenLabs UI is a component library and custom registry built on top of shadcn/ui to help you build multimodal agents faster.
A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).
An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity.
Voice Activity Detector (VAD): low-latency, high-performance, and lightweight
The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.
Voice activity detector (VAD) for the browser with a simple API
Cutting-edge AI technology for automated audio transcription. A nice GUI for OpenAI's Whisper and pyannote (speaker identification)
Cross-Platform, GPU Accelerated Whisper
MLT Multimedia Framework
Server for Squeezebox and compatible players. This server is also called Lyrion Music Server.
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
A comprehensive ComfyUI integration for Microsoft's VibeVoice text-to-speech model, enabling high-quality single and multi-speaker voice synthesis directly within your ComfyUI workflows.
SALMONN family: A suite of advanced multi-modal LLMs
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
A full-featured image/video management app with AI-powered organization and semantic search. Supports metadata from SD-webui, ComfyUI, Fooocus, NovelAI, StableSwarmUI, and more. Available as standalone app, SD-webui extension, or library.
Teensy Audio Library
We present StableAvatar, the first end-to-end video diffusion transformer, which synthesizes infinite-length high-quality audio-driven avatar videos without any post-processing, conditioned on a reference image and audio.
Unofficial PyTorch implementation of Google AI's VoiceFilter system
A free & open tool for transcribing audio interviews
Self-host the powerful Chatterbox TTS model. This server offers a user-friendly Web UI, flexible API endpoints (incl. OpenAI compatible), predefined voices, voice cloning, and large audiobook-scale text processing. Runs accelerated on NVIDIA (CUDA), AMD (ROCm), and CPU.
Easy to use stem (e.g. instrumental/vocals) separation from CLI or as a python package, using a variety of amazing pre-trained models (primarily from UVR)
Open source audio annotation tool for humans
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
MiMo-Audio: Audio Language Models are Few-Shot Learners
Near-Realtime audio transcription using self-hosted Whisper and WebSocket in Python/JS
AI Audio Datasets (AI-ADS), including Speech, Music, and Sound Effects, which can provide training data for Generative AI, AIGC, AI model training, intelligent audio tool development, and audio applications.
ChatGPT CLI is a powerful, multi-provider command-line interface for working with modern LLMs. It supports OpenAI, Azure, Perplexity, LLaMA, and more, with features like streaming, interactive chat, prompt files, image/audio I/O, MCP tool calls, and an experimental agent mode for safe, multi-step automation.
A powerful 3B-parameter, LLM-based Reinforcement Learning audio edit model that excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech
Audio Large Language Models
A ComfyUI custom node integration for local multi-engine multi-language Text-to-Speech and Voice Conversion. Supports: RVC, Echo-TTS, Qwen3-TTS, Cozy Voice 3, Step Audio EditX, IndexTTS-2, Chatterbox (classic and multilingual), F5-TTS, Higgs Audio 2 and VibeVoice with unlimited text length, SRT timing, Character support, and many audio tools
Generate transcripts for audio and video content with a user-friendly UI, powered by OpenAI's Whisper, with automatic translation and automatic video download via yt-dlp integration
Sonos Media Player Interface/Client
A 100% private AI voice transcription app that converts speech to text in 100+ languages. Built with Compose Multiplatform for Android & iOS using Whisper AI - no cloud uploads, all processing happens on-device for complete privacy.
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
A React component to make correcting automated transcriptions of audio and video easier and faster. By BBC News Labs. - Work in progress
VideoAgent: All-in-One Agentic Framework for Video Understanding, Editing, and Remaking
Turn PDFs and EPUBs into audiobooks, subtitles or videos into dubbed videos (including translation), and more. For free. Pandrator uses local models, notably XTTS, including voice-cloning (instant, RVC-enhanced, XTTS fine-tuning) and LLM processing. It aspires to be a user-friendly app with a GUI, an installer and all-in-one packages.
A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
ScribeWizard: Generate organized notes from audio using Groq, Whisper, and Llama3
Open Audio Watermarking Tool
Android Voice Activity Detection (VAD) library. Supports WebRTC VAD GMM, Silero VAD DNN, Yamnet VAD DNN models.
Open Source Multiroom Audio Streamer based on Raspberry Pi & Snapcast
Tero Subtitler is an open source, cross-platform, and free subtitle editing software.
This project is a digital human that can talk and listen to you. It uses OpenAI's GPT to generate responses, OpenAI's Whisper to transcribe the audio, Eleven Labs to generate voice and Rhubarb Lip Sync to generate the lip sync.
A fully local and private Speech-To-Text app with cross-platform support, speaker diarization, Audio Notebook mode, LM Studio integration, and both longform and live transcription.
Free on-device web app for audio transcribing and rendering subtitles
Collection of LADSPA/LV2/VST/JACK audio plugins for high-quality processing
AudioBench: A Universal Benchmark for Audio Large Language Models
Your faithful, impartial partner for audio evaluation: know yourself, know your rivals.
OpenShot Audio Library (libopenshot-audio) is a free, open-source project that enables high-quality editing and playback of audio, and is based on the amazing JUCE library.
The (official) Music Assistant Mobile app is a cross-platform client application designed for Android, iOS, and Java runtime environments. Developed using Kotlin Multiplatform (KMP) and Compose Multiplatform frameworks, this project aims to provide a unified codebase for seamless music management across multiple platforms.
An audio recording helper for React. Provides a component and a hook to help with audio recording.
Transcribe audio and add subtitles to videos using Whisper in ComfyUI
The AI Podcast Studio: generate podcast scripts and their audio version with a team of AI workers in a Podcast Studio
Snapcast client for Android
A modern, real-time speech recognition application built with OpenAI's Whisper and PySide6. This application provides a beautiful, native-looking interface for transcribing audio in real-time with support for multiple languages.
Natural language → ComfyUI workflow JSON. 34 built-in templates, 360+ node definitions, auto model download. Supports txt2img, img2img, txt2vid, img2vid, audio, 3D generation across SD1.5/SDXL/SD3/FLUX/Wan2.2/HunyuanVideo/LTXV/Mochi/Cosmos + LLM integration. Works as a skill for Claude Code, Cursor, and other AI coding agents.
PySimpleGUI-based desktop app that auto-generates a subtitle file (using the free Google Speech Recognition API) and a translated subtitle file (using the unofficial online Google Translate API) for any video or audio file
Modern GUI application that transcribes and translates audio files using OpenAI Whisper.
The BEST music separation model with the help of A.I. ... to my ears!
A unified tokenizer that is capable of both extracting semantic information and enabling high-fidelity audio reconstruction.
Cross-platform audio recorder designed for real-time speech audio processing
Generate karaoke videos, by downloading audio and lyrics, separating instrumentals, synchronising lyrics using transcription models, rendering CDG and uploading videos to YouTube / Dropbox / Google Drive
Fast, cross-platform CLI and GUI for batch transcription, translation, speaker annotation and subtitle generation using OpenAI's Whisper on CPU, Nvidia GPU and Apple MLX.
Wayland Speech-to-Text Tool - A minimal signal-driven speech-to-text tool for Wayland environments with PipeWire audio
Elucidated Text-To-Audio (ETTA) is a SOTA text-to-audio model with a holistic understanding of the design space and trained with synthetic captions.
End-to-end workflow to automatically generate show notes from audio/video transcripts
Vapi Blocks is a library of components & API snippets to copy and paste into React applications built with TailwindCSS for integrating Voice AI into your application using Vapi.ai. Vapi lets you develop voice AI fast; Vapi Blocks helps you implement faster.
AI-powered tool for automatic podcast script and audio generation.
Free in-browser audio & video censorship tool. AI-powered transcription with Whisper, 100% private client-side processing. Bleep profanity, custom words, or any phrase.
SVAR - Simple Voice Activated Recorder
A ComfyUI custom node for audio subtitling based on WhisperX and translators
A comprehensive framework to test audio comprehension of Large Audio Language Models.
Transcribe audio and video files with speaker diarization and logically grouped timestamps using Gemini Flash
AI-Powered Podcast Generator: A Python-based tool that converts text scripts into realistic audio podcasts using Google's Generative AI API. This project leverages advanced text-to-speech technology to create dynamic, multi-speaker conversations with customizable voices.
Automatically generate subtitles from an input audio or video file using OpenAI Whisper
An MCP Server for audio transcription using OpenAI
A user-friendly Raspberry Pi baby monitor with cry detection and audio/video streaming.
This project is a video processing application that extracts audio from videos, performs automatic speech recognition (ASR), and generates subtitles. It allows users to enhance audio quality, correct transcription errors, and convert subtitles into various dialects, all through a user-friendly command-line and web interface.
Installation script for AI applications using ROCm on Linux.
MCP server for Fal.ai - Generate images, videos, music and audio with Claude
Snapcast Multiroom audio docker image
A set of bash scripts to convert audio files into M4B audiobooks with chapter markers, customizable bitrate, book metadata and embedded cover art.
Android application for data transfer, using sound waves
Removes silence segments from wav audio files
Learn how multimodal AI merges text, image, and audio for smarter models
Generate audio datasets for training Text-To-Speech models, through smart audio splitting with silence detection, and transcription using Whisper.
The Multi-Language Automatic Translation, Subtitling, and Voice Rendering System uses third-party software to automatically convert audio to text, translate text, render text to video, and render text to audio.
Dockerized Whisper C++ speech-to-text API for easy deployment and rapid integration. Offering the latest stable and nightly builds for efficient audio transcription.
A cross-platform desktop application that records audio and transcribes it to text using OpenAI's Whisper API or compatible services. Perfect for dictation, note-taking, and accessibility.
Audio Cleaner using DeepFilterNet, hosted through Streamlit
A curated list of tools for building AI with rich context from screen recordings, audio, and personal data
Text-to-speech plugin for Claude Code: multi-provider support (ElevenLabs, OpenAI, Google, Amazon Polly, Azure, Kitten, local system TTS) on macOS, Linux, and Windows
Prompt Management System for Interaction with the ChatGPT API
A Python tool that uses Google Gemini API to transcribe video or audio files into SRT subtitle files.
International Public Radio Directory, bringing diversity into audio. Public listing of internet radios from all around the world.
A high-performance speech recognition MCP server based on Faster Whisper, providing efficient audio transcription capabilities.
Transcribe audio/video to text, locally on macOS, Linux and Windows. A simple whisper.cpp wrapper/UI built with Go/Fyne.
An audio/video transcriber with diarization and transcription editing.
An MCP server that provides audio transcription capabilities using OpenAI's Whisper API
Transcribe Offline by openresearchtools.com is an open source desktop application that allows you to transcribe audio and video fully offline, with optional speaker diarisation and word-level alignment. It can also generate subtitles and integrate with local large language models (LLMs) for summarisation and editing
Convert audio files (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm) to SRT subtitles with OpenAI Whisper. Easy script for fast, accurate transcription.
ืืชืจ ืืืคืืฉ ืืืืจืืช ืฉืืจืื
One-key voice-to-transcription tool: record speech, transcribe locally with Whisper, then paste. Never lose your audio files again!
OmniEvalKit is an evaluation framework designed for omni-modal large language models, with a focus on audio and audio-visual understanding. Based on OmniEvalKit, you can quickly reproduce benchmarks, implement your own models or datasets, and conduct fair comparisons with other open-source models. MiniCPM-o is evaluated using this framework.
[ACL 2025] Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Fine-tuned Whisper model that transcribes Hebrew audio into IPA
A voice transcription tool using faster-whisper that records audio and converts speech to text on Linux systems.
Rudimentary program for speech transcription, manipulation, and redaction.
AudioWrite: Effortless voice dictation powered by Google's Gemini API. Record, transcribe, and transform rambling audio into polished, multi-language notes. PWA ready.
LTX-2.3 video generation skill: setup, inference, prompting, ComfyUI integration for Lightricks 22B DiT audio-video model
Lightning-fast voice dictation Desktop Web App powered by Groq's Whisper Turbo - Open-source, privacy-first, with real-time audio visualization and intuitive click controls
WhisperVoice: Covert voice notes. Encrypts text and hides it via LLM-generated acrostic sentences. Murf.ai creates natural audio. Browser extension decrypts with passcode, revealing hidden message or playing decoy for unauthorized listeners. Uses LLM, Murf.ai, STT APIs
Modern NVR with object/motion/audio detection, push notifications, multi-location, and encrypted local and cloud-based storage support built in.
AI-powered music production in REAPER via the Model Context Protocol: 163 tools for composition, MIDI, FX, mixing, and mastering.
Real-time desktop audio transcription using OpenAI Whisper for Arch Linux with CUDA acceleration
A powerful audio transcription server that seamlessly transcribes meeting recordings, generates notes, and intelligently splits audio files for efficient management. Open-source and built with FastMCP and Groq/OpenAI Whisper
MCP server for real-time audio transcription using OpenAI Whisper
Synchronous multiroom audio player
A powerful MCP (Model Context Protocol) server that transcribes audio and video files into text using Groq's Whisper model.
A deep learning application that classifies the reason for a baby's cry (hunger, pain, etc.) from live or uploaded audio. Built with a TensorFlow/Keras CNN, Librosa for audio processing, and a responsive Flask web UI with real-time recording and visualization. Helps caregivers understand an infant's needs instantly.
Python API for controlling Snapcast, a multi-room synchronous audio solution.
XTTS fine-tuning via CLI
Blazingly fast audio transcription MCP server using Whisper with Flash Attention 2
App for transcribing audio/video to editable SRT subtitles using Whisper. Supports mp3/mp4/wav inputs, audio extraction, and local download.