Sesame CSM Gradio UI – Free, Local, High-Quality Text-to-Speech with Voice Cloning! (CUDA, Apple MLX and CPU)

Author

Akash Gupta | Sr. VoIP Engineer | MLOps

What is Sesame CSM?

Conversational Speech Model (CSM) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
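To make the term "RVQ audio codes" more concrete, here is a toy sketch of residual vector quantization in plain Python. It is only an illustration of the general idea (each stage quantizes what the previous stage left over) and is not the Mimi codec or anything taken from CSM itself. The codebooks and the input vector are made up for the example.

```python
# Toy residual vector quantization (RVQ); illustrative only, not the Mimi codec.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical hand-made codebooks: a coarse stage followed by a finer one.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

codes = rvq_encode([1.2, 0.9], CODEBOOKS)   # one integer code per stage
approx = rvq_decode(codes, CODEBOOKS)       # approximate reconstruction
print(codes, approx)
```

The list of integers in `codes` plays the role of the "audio codes": a compact, discrete description that a decoder can turn back into an approximation of the original signal.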

I just released Sesame CSM Gradio UI, a 100% local, free text-to-speech tool with built-in voice cloning! No cloud processing, no API keys – just high-quality AI-generated speech on your own machine. It works on CUDA, Apple MLX, and CPU, so anyone can try it.

Listen to a sample conversation generated by CSM.

🔥 Features:

✅ Runs 100% Locally – No internet connection required!
✅ Free & Open Source – No subscriptions, no paywalls.
✅ Superior Voice Cloning – Built directly into the UI.
✅ Gradio UI – Simple, interactive, and user-friendly.
✅ Supports CUDA, Apple MLX, and CPU – Works on NVIDIA GPUs, Apple Silicon, and regular CPUs.
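If you are not sure which of the three backends your machine would use, a small check like the one below can help. This is a minimal sketch and an assumption on my part, not the logic the project itself uses to pick a device: it prefers CUDA when PyTorch reports a GPU, then MLX when the `mlx` package is installed on Apple Silicon, and otherwise falls back to CPU. The function names are hypothetical.

```python
import importlib.util
import platform

def has_module(name):
    """True if the named top-level package can be found in this environment."""
    return importlib.util.find_spec(name) is not None

def pick_backend(cuda_available, mlx_available):
    """Prefer CUDA, then MLX, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mlx_available:
        return "mlx"
    return "cpu"

def detect_backend():
    """Probe the environment; this ordering is an assumption, not the project's own logic."""
    cuda = False
    if has_module("torch"):
        import torch  # imported lazily so the script still runs without PyTorch
        cuda = torch.cuda.is_available()
    # MLX targets Apple Silicon, so check the platform as well as the package.
    on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    mlx = on_apple_silicon and has_module("mlx")
    return pick_backend(cuda, mlx)

print("Suggested backend:", detect_backend())
```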

Below is a video showing how to use the voice cloning feature.
Note: The video has no audio; it only shows how to use the UI.

Getting Started

1. Clone the Repository

git clone https://github.com/akashjss/sesame-csm.git
cd sesame-csm

2. Install Dependencies

Use a virtual environment (venv) to isolate the installation, as shown below.

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt
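If you want to confirm that the virtual environment is really active before or after installing the requirements, a quick check like this can help. It is a generic Python sketch, not part of the repository, and the function name is my own.

```python
import sys

def in_virtualenv():
    """True when the interpreter runs inside a venv (its prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix

print("Virtual environment active:", in_virtualenv())
print("Interpreter in use:", sys.executable)
```

If the first line prints False, the packages would be installed into your system Python instead of the isolated environment.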

3. Run Sesame CSM

python run_csm_gradio.py

Once the server is running, open the Gradio UI in your browser to start generating speech!
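The model can take a while to load, so the page may not respond right away. The sketch below polls a URL until it answers. It assumes the UI is served at Gradio's usual default local address, http://127.0.0.1:7860; the script may use a different host or port, so use the address printed in your terminal. The function name and timeouts are hypothetical.

```python
import time
import urllib.error
import urllib.request

# Assumed address: Gradio's usual default. Replace it with the URL printed in your terminal.
GRADIO_URL = "http://127.0.0.1:7860"

def wait_for_ui(url, timeout_s=60.0, interval_s=1.0):
    """Poll url until it returns HTTP 200 or timeout_s elapses; return True on success."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (connection refused, timeout, ...)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Example call (uncomment once the server has been started):
# print("UI reachable:", wait_for_ui(GRADIO_URL))
```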

🎙️ How to Use Voice Cloning

One of the most exciting features of Sesame CSM is its built-in voice cloning. You can record your own voice and use it to generate AI speech.

Steps to Clone Your Voice:

1. Click the microphone icon in the UI.
2. Press the record button and read the Speaker Prompt.
3. Stop recording when finished.
4. Click ‘Generate Conversation’ to create AI-generated speech using your recorded voice.

Here’s a visual guide to help you out:

💡 Why Use Sesame CSM?

If you’re looking for a fast, free, and high-quality text-to-speech tool with voice cloning, Sesame CSM is the perfect choice. Whether you’re a developer, content creator, or just experimenting with AI-generated speech, this tool gives you full control without any restrictions.

🔗 Try it Now!

👉 GitHub Repository: https://github.com/akashjss/sesame-csm

I’d love to hear your thoughts! Try it out and feel free to share your feedback, report issues, or contribute to the project!

Akash Gupta
Senior VoIP Engineer and AI Enthusiast

AI and VoIP Blog

Thank you for visiting the blog. Hit the subscribe button to receive the next post right in your inbox. If you find this article helpful, don't forget to share your feedback in the comments and hit the like/clap button. This helps me know which topics resonate with you, so I can create more content that keeps you informed.

Thank you for reading, and stay tuned for more insights and guides!
