TTS (Text-to-Speech)
109 repos
Web UI for training and running open models like Gemma 4, Qwen3.5, DeepSeek, gpt-oss locally.
Clone a voice in 5 seconds to generate arbitrary speech in real-time
One minute of voice data can also be used to train a good TTS model! (few-shot voice cloning)
LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
A generative speech model for daily dialogue.
Instant voice cloning by MIT and MyShell. Audio foundation model.
SOTA Open Source TTS
SoTA open-source TTS
The open-source AI voice studio. Clone, dictate, create.
From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
A TTS model capable of generating ultra-realistic dialogue in one pass.
🧠 Leon is your open-source personal assistant.
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
A multi-voice TTS system trained with an emphasis on quality
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.
Speech-to-text, text-to-speech, speaker diarization, speech enhancement, source separation, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, HarmonyOS, Raspberry Pi, RISC-V, RK NPU, Axera NPU, Ascend NPU, x86_64 servers, a websocket server/client, and 12 programming languages.
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice cloning.
A fast, local neural text to speech system
Use Microsoft Edge's online text-to-speech service from Python WITHOUT needing Microsoft Edge or Windows or an API key
AI Agent Engineering Platform built on an Open Source TypeScript AI Agent Framework
Zero-Shot Speech Editing and Text-to-Speech in the Wild
EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine
Very low latency speech to text, intent recognition, and text to speech, for building voice agents and interfaces
Open Vision Agents by Stream. Build voice and vision agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.
Gradio WebUI for creators and developers, featuring key TTS (Edge-TTS, kokoro) and zero-shot Voice Cloning (E2 & F5-TTS, CosyVoice), with Whisper audio processing, YouTube download, Demucs vocal isolation, and multilingual translation.
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
Towards Human-Sounding Speech
Silero Models: pre-trained text-to-speech models made embarrassingly simple
This repository contains hand-curated resources for Prompt Engineering, with a focus on Generative Pre-trained Transformers (GPT), ChatGPT, PaLM, etc.
250+ Fine-tuning & RL Notebooks for text, vision, audio, embedding, TTS models.
Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model w/CPU ONNX and NVIDIA GPU PyTorch support, handling, and auto-stitching
An Open Source text-to-speech system built by inverting Whisper.
High-Quality Voice Cloning TTS for 600+ Languages
A nearly-live implementation of OpenAI's Whisper.
World's first open-source, agentic video production system. 12 pipelines, 52 tools, 500+ agent skills. Turn your AI coding assistant into a full video production studio.
A single Gradio + React WebUI with extensions for ACE-Step, OmniVoice, Kimi Audio, Piper TTS, GPT-SoVITS, CosyVoice, XTTSv2, DIA, Kokoro, OpenVoice, ParlerTTS, Stable Audio, MMS, StyleTTS2, MAGNet, AudioGen, MusicGen, Tortoise, RVC, Vocos, Demucs, SeamlessM4T, and Bark!
Lightning-Fast, On-Device, Multilingual TTS — running natively via ONNX.
TTS with kokoro and onnx runtime
AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features such as a settings page, low-VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).
Amica is an open-source interface for interactive communication with 3D characters, with voice synthesis and speech recognition.
A comprehensive ComfyUI integration for Microsoft's VibeVoice text-to-speech model, enabling high-quality single and multi-speaker voice synthesis directly within your ComfyUI workflows.
Official MiniMax Model Context Protocol (MCP) server that enables interaction with powerful Text to Speech, image generation and video generation APIs.
Speech Note Linux app. Note taking, reading and translating with offline Speech to Text, Text to Speech and Machine translation.
A Python/PyTorch app for easily synthesising human voices
Run local LLMs like llama, deepseek-distill, kokoro and more inside your browser
Offline inference engine for art, real-time voice conversations, LLM powered chatbots and automated workflows
Self-host the powerful Chatterbox TTS model. This server offers a user-friendly Web UI, flexible API endpoints (incl. OpenAI compatible), predefined voices, voice cloning, and large audiobook-scale text processing. Runs accelerated on NVIDIA (CUDA), AMD (ROCm), and CPU.
Natural (2-way) voice conversations with Claude Code
A modular voice assistant application for experimenting with state-of-the-art transcription, response generation, and text-to-speech models. Supports OpenAI, Groq, ElevenLabs, CartesiaAI, and Deepgram APIs, plus local models via Ollama. Ideal for research and development in voice technology.
A powerful 3B-parameter, LLM-based reinforcement-learning audio-editing model that excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech.
A ComfyUI custom node integration for local multi-engine multi-language Text-to-Speech and Voice Conversion. Supports: RVC, Echo-TTS, Qwen3-TTS, Cozy Voice 3, Step Audio EditX, IndexTTS-2, Chatterbox (classic and multilingual), F5-TTS, Higgs Audio 2 and VibeVoice with unlimited text length, SRT timing, Character support, and many audio tools
Inworld TTS
High-performance Text-to-Speech server with OpenAI-compatible API, 8 voices, emotion tags, and modern web UI. Optimized for RTX GPUs.
Offline Speech Recognition with OpenAI Whisper and TensorFlow Lite for Android
AI-powered video podcast creation skill for coding agents. Supports Bilibili & YouTube, multi-language (zh-CN/en-US), 6 TTS engines (Edge/Azure/ElevenLabs/OpenAI/Doubao/CosyVoice), 4K Remotion rendering.
Turn PDFs and EPUBs into audiobooks, and subtitles or videos into dubbed videos (including translation), and more. For free. Pandrator uses local models, notably XTTS, including voice cloning (instant, RVC-enhanced, and XTTS fine-tuning) and LLM processing. It aspires to be a user-friendly app with a GUI, an installer, and all-in-one packages.
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
Run Orpheus 3B Locally With LM Studio
🎙️ Speak with AI - Run locally using Ollama, OpenAI, Anthropic or xAI - Speech uses SparkTTS, OpenAI, ElevenLabs, Kokoro, Typecast or xAI
On-device AI for Android — LLM chat (GGUF/llama.cpp), vision models (VLM), image generation (Stable Diffusion), tool calling, AI personas, RAG knowledge packs, TTS/STT. Fully offline, zero subscriptions, open-source.
VoxNovel: generate audiobooks giving each character a different voice actor.
Fast Streaming TTS with Orpheus + WebRTC (with FastRTC)
EaseVoice Trainer is a simple and user-friendly voice cloning and speech model trainer.
The Naomi Project is an open source, technology agnostic platform for developing always-on, voice-controlled applications!
A local implementation of the Kokoro Text-to-Speech model, featuring dynamic module loading, automatic dependency management, and a web interface.
AI video generation SDK — JSX for videos. One API for Kling, Flux, ElevenLabs, Sora. Built on Vercel AI SDK.
🎬 Auto-subtitle videos with AI transcription, translation, voice cloning, professional rendering, background image and music generator
Blueprint by Mozilla.ai for generating podcasts from documents using local AI
Input text from speech in any Linux window, the lean, fast and accurate way, using whisper.cpp OFFLINE. Speak with local LLMs via llama.cpp.
Like ChatGPT's voice conversations with an AI, but entirely offline/private/trade-secret-friendly, using local AI models such as LLama 2 and Whisper
Automatically generate engaging AI podcasts from nothing but an episode title.
Use Home Assistant Assist on the desktop. Compatible with Windows, MacOS, and Linux
A simple to use python library for creating podcasts with support for many LLM and TTS providers
The official implementation of "A Language Modeling Approach to Diacritic-Free Hebrew TTS"
ComfyUI Chatterbox TTS & Voice Conversion Node
AI-Powered Podcast Generator: A Python-based tool that converts text scripts into realistic audio podcasts using Google's Generative AI API. This project leverages advanced text-to-speech technology to create dynamic, multi-speaker conversations with customizable voices.
OpenAI-compatible TTS API that unifies multiple backends with smart chunking for unlimited-length generation
Installation script for AI applications using ROCm on Linux.
Mission to create a Hebrew TTS model as powerful and user-friendly as WaveNet
A real-time, offline voice assistant for Linux and Raspberry Pi. Uses local LLMs (via Ollama), speech-to-text (Vosk), and text-to-speech (Piper) for fast, wake-free voice interaction. No cloud. No APIs. Just Python, a mic, and your voice.
Speech-to-text, text-to-speech with ElevenLabs
Langchain Voice Agent with Inworld TTS
Generate audio datasets for training Text-To-Speech models, through smart audio splitting with silence detection, and transcription using Whisper.
AgenticSeek is a fully local, voice-enabled AI assistant designed to autonomously browse the web, write code, and plan tasks while ensuring complete privacy by keeping all data on your device. Tailored for local reasoning models, it runs entirely on your hardware, eliminating any cloud dependency.
A curated list of voice AI agent frameworks, tools, resources, and best practices
Aivis Voice Model File (.aivm/.aivmx) Utility Library
Text-to-speech plugin for Claude Code — multi-provider support (ElevenLabs, OpenAI, Google, Amazon Polly, Azure, Kitten, local system TTS) on macOS, Linux, and Windows
Claude Code Changelog Tracker with AI analysis, TTS, and email notifications
A Deepgram client for Dart and Flutter, supporting all Speech-to-Text and Text-to-Speech features on every platform.
Chrome extension that allows dictating anywhere using OpenAI Whisper
🔊 Intelligent voice notifications for Claude Code using ElevenLabs TTS
Xiaomi Mimo TTS Custom Component for Home Assistant
OpenClaw TTS Provider for Xiaomi MiMo (mimo-v2-tts)
Voice-cloned smart attention TTS notifications for Claude Code. AI summarizes deep work session responses, speaks in your cloned voice. MLX Chatterbox Turbo on Apple Silicon. Zero config, works out of the box.
A command line utility to easily finetune XTTS models in a fully automated way. Developed for Pandrator.
The ultimate PyQt6 application that integrates the power of OpenAI, Google Gemini, Claude, and other open-source AI models
[NVIDIA, MAC, ROCM] Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech application (Minimum Requirements: 8GB VRAM / 32GB RAM; Recommended Requirements: 16GB VRAM / 24GB RAM)
Serverless implementation of Text-To-Speech
Speech-to-Text/Code using a fast local LLM, for Linux, uses Whisper
Easy one-shot installer for configuring Chatterbox's TTS models (Original and Turbo)
🚨 Israeli Home Front Command real-time alerts via OpenClaw - WhatsApp + TTS, no Home Assistant needed
XTTS fine-tuning via CLI
A Model Context Protocol (MCP) server that provides ASR (Automatic Speech Recognition) capabilities using the whisper engine. This server exposes ASR functionality through MCP tools, making it easy to integrate speech recognition into your applications.
Whisper + TTS + As many MCP servers as I can stuff in
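Several of the servers above advertise "smart chunking for unlimited-length generation": long input text is split at sentence boundaries into pieces small enough for the TTS backend, synthesized independently, and stitched back together. A minimal sketch of such chunking (a hypothetical helper, not any listed repo's actual code; the 200-character default is an assumption):

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when appending this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second sentence! A third, longer sentence?", max_chars=30))
```

A real gateway would also have to handle sentences longer than `max_chars` (e.g. by falling back to splitting on commas or whitespace) before dispatching each chunk to the backend.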
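One entry above builds TTS training datasets "through smart audio splitting with silence detection": recordings are cut into utterances wherever the signal stays quiet for long enough. A toy energy-threshold sketch of the idea (hypothetical parameter names; real tools operate on windowed RMS energy over PCM frames, not raw samples):

```python
def split_on_silence(samples, threshold=0.01, min_silence=3):
    """Split audio samples into voiced segments, cutting wherever at
    least `min_silence` consecutive samples fall below `threshold`."""
    segments, current, quiet = [], [], 0
    for s in samples:
        if abs(s) < threshold:
            quiet += 1
            # A long enough quiet run closes the current segment.
            if quiet == min_silence and current:
                segments.append(current)
                current = []
        else:
            quiet = 0
            current.append(s)
    if current:
        segments.append(current)
    return segments

print(split_on_silence([0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8]))
```

The dataset tools listed above pair each resulting segment with a Whisper transcription to produce (audio, text) training examples.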