/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.



Speech Synthesis/Recognition general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us, right?
en.wikipedia.org/wiki/Speech_synthesis https://archive.is/xxMI4
research.spa.aalto.fi/publications/theses/lemmetty_mst/contents.html https://archive.is/nQ6yt
The Tacotron project:
arxiv.org/abs/1703.10135
google.github.io/tacotron/ https://archive.is/PzKZd
No code available yet; hopefully they will release it.
github.com/google/tacotron/tree/master/demos https://archive.is/gfKpg
>=== -edit subject
Edited last time by Chobitsu on 07/02/2023 (Sun) 04:22:22.
>>25876 Had no idea that was a real thing NoidoDev, thanks! Any chance it's opensauce?
>>25893 The original belongs to a corporation, but if you look for "Otamatone DIY" you can find some variants.
>>25909 Cool. Thank you NoidoDev! :^)
>>17474 Can we get this with timestamps, so we can use it for voice training (text-to-speech)?
>ⓍTTS is a Voice generation model that lets you clone voices into different languages by using just a quick 6-second audio clip. There is no need for an excessive amount of training data that spans countless hours.
https://huggingface.co/coqui/XTTS-v2 (only non-commercial licence)
Testing Space: https://huggingface.co/spaces/coqui/voice-chat-with-mistral
Via https://www.reddit.com/r/LocalLLaMA/comments/17yzr6l/coquiai_ttsv2_is_so_cool/ (seems to be much closer to the ElevenLabs quality)
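For anyone who wants to try XTTS-v2 locally, here's a minimal sketch using Coqui's TTS Python package (pip install TTS); the reference clip and output paths are placeholders, and note the XTTS-v2 weights themselves are still non-commercial:

import torch
from TTS.api import TTS  # Coqui TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Downloads the XTTS-v2 checkpoint from Hugging Face on first run.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Clone the voice from a short (~6 s) reference clip and synthesize English speech.
tts.tts_to_file(
    text="Welcome home, Anon.",
    speaker_wav="reference_clip.wav",  # placeholder path to the ~6-second sample
    language="en",
    file_path="xtts_output.wav",
)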
>>26511 Also this one: https://github.com/yl4579/StyleTTS2 Some people claim it's 100x faster than Coqui's XTTS. Still no webui tho :(
>>26512 Thanks, I saw this mentioned but forgot to look it up.
>>26512 tested it locally, rtx 3070. works fast as fuck. https://files.catbox.moe/ow0ryz.mp4
>>26535 >>26566 Thanks Anons. :^)
>>27995 REALLY impressive Anon, thanks!
>MetaVoice 1B - The new TTS and voice cloning open source model
Colab: https://drp.li/7RUPU
MetaVoice Online Demo: https://ttsdemo.themetavoice.xyz/
https://huggingface.co/metavoiceio
https://youtu.be/Y_k3bHPcPTo
Not as good as proprietary models.
>>29257 >Not as good as proprietary models. Ehh, they'll get better with time, no doubt. Thanks Anon! Cheers. :^)
>This week we’re talking with Georgi Gerganov about his work on Whisper.cpp and llama.cpp. Georgi first crossed our radar with whisper.cpp, his port of OpenAI’s Whisper model in C and C++. Whisper is a speech recognition model enabling audio transcription and translation. Something we’re paying close attention to here at Changelog, for obvious reasons. Between the invite and the show’s recording, he had a new hit project on his hands: llama.cpp. This is a port of Facebook’s LLaMA model in C and C++. Whisper.cpp made a splash, but llama.cpp is growing in GitHub stars faster than Stable Diffusion did, which was a rocket ship itself.
https://changelog.com/podcast/532
Some takeaways: Whisper didn't do speaker identification (diarization) when this was published on March 22, 2023, and it seems to be hard to find something that does it. But they said people set up their own pipelines for this, and Whisper might get there as well. I found this on the topic by briefly searching; it still doesn't seem to be covered in an easy way:
>How to use OpenAIs Whisper to transcribe and diarize audio files
https://github.com/lablab-ai/Whisper-transcription_and_diarization-speaker-identification-
Discussion on this: https://huggingface.co/spaces/openai/whisper/discussions/4
Azure AI services seem to be able to do it, but this doesn't help us much. For using it as a tool to extract voice files for training it's one thing, but we also need it as a skill for our waifus: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speaker-recognition?tabs=script&pivots=programming-language-cpp
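On the timestamps question above (>>17474): a minimal sketch using the openai-whisper Python package. It returns per-segment start/end times, which is enough for cutting voice-training clips, but no speaker labels, so diarization still needs a separate pipeline. The file name is a placeholder:

import whisper  # pip install openai-whisper; also needs ffmpeg on the PATH

# "base" is fast; "medium"/"large" are more accurate but slower.
model = whisper.load_model("base")

# Transcription returns segment-level timestamps along with the text.
result = model.transcribe("waifu_sample.wav")  # placeholder file name

for seg in result["segments"]:
    # start/end are in seconds, enough to cut clips for TTS training data.
    print(f"{seg['start']:7.2f}  {seg['end']:7.2f}  {seg['text'].strip()}")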
>>29415 Very nice. Thanks NoidoDev! I have a lot of respect for Gerganov. He very-clearly understands the issues of latency in a systems development context. Exactly the kinds of expertise vital for success to /robowaifu/ and our affiliated cadres in the end. Cheers. :^)
>Data Exchange Podcast 198 - Sep 21, 2023
Overview of everything related to speech. https://www.youtu.be/w4DULuvgO1Y
Yishay Carmiel is the CEO of Meaning, a startup at the forefront of building real-time speech applications for enterprises.
Episode Notes: https://thedataexchange.media/state-of-ai-for-speech-and-audio
>Sections
Generative AI for Audio (text-to-speech; text-to-music; speech synthesis) - 00:00:44
Speech Translation - 00:09:44
Automatic Speech Recognition and other models that use audio inputs - 00:13:16
Speech Emotion Recognition - 00:19:55
Restoration - 00:21:55
Similarities in recent trends in NLP and Speech - 00:24:23
Diarization (speaker identification), and implementation challenges - 00:29:47
Voice cloning and risk mitigation - 00:35:36
There are some Japanese open-source programs for speech synthesis, such as VOICEVOX, though I should mention that if you make these voices speak English they will have funny accents, which can be kinda cute sometimes. https://voicevox.hiroshiba.jp
There's also TALQu, but it is Windows-only. https://booth.pm/ja/items/2755336
NNSVS is for singing and is also open source. https://nnsvs.github.io
SociallyIneptWeeb used VOICEVOX for an AI waifu before and detailed what he did: https://www.youtube.com/watch?v=bN5UaEkIPGM&t=674s
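For reference, the local VOICEVOX engine exposes a small HTTP API (default port 50021), which is roughly what gets driven in that video. A minimal sketch, with the speaker ID and text as placeholders:

import requests

BASE = "http://127.0.0.1:50021"   # default address of a locally running VOICEVOX engine
TEXT = "こんにちは、ご主人様"       # VOICEVOX expects Japanese text
SPEAKER = 1                        # placeholder style ID; GET /speakers lists the real ones

# Step 1: build the audio query (phonemes, pitch, timing) for the text.
query = requests.post(f"{BASE}/audio_query", params={"text": TEXT, "speaker": SPEAKER}).json()

# Step 2: render the query to WAV bytes and save them.
wav = requests.post(f"{BASE}/synthesis", params={"speaker": SPEAKER}, json=query)
with open("voicevox_out.wav", "wb") as f:
    f.write(wav.content)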
>>30390 Oh wow, this is really good. Thanks.
>https://nnsvs.github.io
>NNSVS
>Neural network based singing voice synthesis library
> GitHub: https://github.com/nnsvs/nnsvs
> Paper: https://arxiv.org/abs/2210.15987
> Demo: https://r9y9.github.io/projects/nnsvs/
>Features
> Open-source: NNSVS is fully open-source. You can create your own voicebanks with your dataset.
> Multiple languages: NNSVS has been used for creating singing voice synthesis (SVS) systems for multiple languages by VocalSynth communities (8+ as far as I know).
> Research friendly: NNSVS comes with reproducible Kaldi/ESPnet-style recipes. You can use NNSVS to create baseline systems for your research.
>>30398 Here is a site I found that writes a bit about it and has links to written tutorials. https://nnsvs.carrd.co/
>VoiceCraft
>>30614 Thanks, but it's about voice cloning again. I think what I really want are artificial voices which don't belong to anyone. Cloning has its use cases as well, but I don't need or want it for a robot wife. Also, it doesn't need to be too close to a human. To me the quality problem is solved at this point, at least for robowaifus. I was certainly very impressed by the singing capabilities I saw and heard recently, see above >>30390
>>30625 If you aren't worried about human closeness, there is a pretty simple TTS that sounds like old retro synthesized voices. Unfortunately I can't find a video that has the female voice. https://github.com/adafruit/Talkie
>>30657 Thanks, but I didn't mean to go that far in the other direction. I just meant that for our use case here, in my opinion, the current state of the technology is sufficient in terms of quality, or at least close to it. Making it faster and able to run on smaller devices would be good, though. Content creation is another story, if we don't want to only have stories about robots.
I figure this might be the best place for this. I found a paper on lip syncing synthesized voices.
>>31027 For some reason the file didn't attach: https://doi.org/10.1109/ROMAN.2010.5598656
>>31027 I've been thinking about designing something similar, now I'm totally gonna s̶t̶e̶a̶l̶ be inspired by this.
>>31027 That is great. I mentioned doing something sorta the same with facial expressions. I believe this is the same sort of "framework" or idea. Here's the paper: "Real-time lip synchronization between text-to-speech (TTS) system and robot mouth". Well, I can't upload it. I get an error saying, "Tor users can not upload files". What??????? Here's an address for the paper: https://sci-hub.ru/10.1109/roman.2010.5598656
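Not the paper's actual algorithm, just a rough sketch of the general idea as I read it: take phoneme timings from the TTS front end, map each phoneme to a mouth-opening value, and step the mouth servo through them in real time. The phoneme table and set_jaw() below are made up for illustration:

import time

# Hypothetical phoneme -> jaw opening (0.0 closed .. 1.0 fully open); a real table
# would come from the paper or be tuned by hand against your mouth mechanism.
VISEME = {"a": 1.0, "o": 0.8, "e": 0.6, "i": 0.4, "u": 0.5, "m": 0.0, "s": 0.2, "_": 0.0}

def set_jaw(opening):
    # Placeholder for the actual servo command (PWM duty cycle, serial packet, etc.).
    print(f"jaw -> {opening:.2f}")

def lip_sync(phonemes):
    """phonemes: list of (symbol, duration_seconds) as produced by the TTS front end."""
    for symbol, duration in phonemes:
        set_jaw(VISEME.get(symbol, 0.3))  # default half-open for unknown sounds
        time.sleep(duration)              # hold the pose for the phoneme's duration
    set_jaw(0.0)                          # close the mouth when speech ends

lip_sync([("h", 0.08), ("e", 0.12), ("l", 0.08), ("o", 0.20), ("_", 0.10)])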
>>31049 >Well I can't upload it. I get an error saying,"Tor users can not upload files". What??????? Lol, welcome to my world! :D TBH I think Robbit must've disabled file posting by Torfags. I hope he changes that soon.
Just wanted to mention that Suno, the AI music creation model, is based on Bark, the speech generation model. They needed around two years to get from there to where we are now. I have a source, a video where this is mentioned, but it's mostly about a lot of other things; this was only mentioned in passing.
>>32169 Neat! That's an interesting heritage. Impressive results in a fairly short time, too. Thanks, NoidoDev. Cheers. :^)
I think to make speech generation fast enough, we'll have to go from some AI voice to replicating the generated voice in MBROLA. I asked Claude about it:

To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA, you'll need to follow several steps. Here's a concise overview of the process:
1. Prepare your data:
 - Ensure your audio files are high-quality recordings
 - Verify that phonetic labels are accurate and aligned with the audio
2. Choose your synthesizer:
 - eSpeak: Better for rapid development, less natural-sounding
 - MBROLA: More natural-sounding, but requires more work
3. For eSpeak:
 - Create a voice file with phoneme definitions
 - Define pitch and timing rules
 - Add your voice file to eSpeak's voices directory
4. For MBROLA:
 - Create a diphone database from your audio files
 - Define pitch and duration rules
 - Create a language definition file
5. Test and refine:
 - Use sample texts to test your voice
 - Adjust parameters for better quality
This process can be complex and time-consuming. Would you like me to elaborate on any specific part of this process?

Yes, there are tools available to help create voices for MBROLA. Here's a brief overview of some key tools:
1. MBR_PH: This is the official MBROLA database creation tool. It helps in creating diphone databases from labeled speech samples.
2. Praat: While not specifically for MBROLA, this phonetics software can be used for labeling and segmenting audio files, which is a crucial step in the process.
3. HTK (Hidden Markov Model Toolkit): Can be used for automatic segmentation of speech data.
4. Festvox: A voice building framework that can be used to create voices for various synthesizers, including MBROLA.
5. WinSnoori: A speech analysis tool that can be helpful in the voice creation process.
These tools can significantly streamline the process, but creating a high-quality voice still requires considerable effort and expertise. Would you like more details on any of these tools or the specific steps they're used for?
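In the meantime, the stock eSpeak NG / MBROLA voices can already be driven from Python with nothing but subprocess. A minimal sketch, assuming espeak-ng, mbrola, and a voice such as mb-en1 are installed from the distro repos:

import subprocess

def speak(text, voice="mb-en1", wav_path=None):
    """Synthesize text with espeak-ng, optionally writing a WAV instead of playing it."""
    cmd = ["espeak-ng", "-v", voice, "-s", "140"]  # -v selects the voice, -s the speed (wpm)
    if wav_path:
        cmd += ["-w", wav_path]                     # write to a file instead of the sound card
    cmd.append(text)
    subprocess.run(cmd, check=True)

# Plain robotic espeak voice as a fallback:
speak("Hello Anon, I am awake.", voice="en-us")
# Smoother MBROLA diphone voice (requires the mbrola and mbrola-en1 packages):
speak("Hello Anon, I am awake.", voice="mb-en1", wav_path="greeting.wav")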
>>32409 The linguistic knowledge necessary for defining pitch and duration rules in speech synthesis encompasses several areas of linguistics. Here's an overview of the key areas:
1. Phonetics and Phonology:
 - Understanding of speech sounds and their features
 - Knowledge of phonemic inventories and allophonic variations
 - Familiarity with syllable structure and phonotactics
2. Prosody:
 - Understanding of stress patterns (lexical and sentential)
 - Knowledge of intonation patterns and their meanings
 - Familiarity with rhythm and timing in speech
3. Morphology:
 - Understanding how word structure affects pronunciation
 - Knowledge of morphophonemic alternations
4. Syntax:
 - Understanding how sentence structure affects prosody
 - Knowledge of phrase boundaries and their prosodic effects
5. Pragmatics:
 - Understanding how context and intent affect prosody
 - Knowledge of focus and emphasis in speech
6. Sociolinguistics:
 - Familiarity with dialect variations in prosody
 - Understanding of speaking styles and registers
7. Acoustic Phonetics:
 - Knowledge of acoustic correlates of prosodic features
 - Understanding of formant structures and transitions
8. Language-specific knowledge:
 - Deep understanding of the prosodic system of the target language
 - Familiarity with any unique features (e.g., tone in tonal languages)
This knowledge helps in creating rules that accurately capture the nuances of natural speech in the target language. Would you like more information on any of these linguistic areas or how they specifically apply to speech synthesis?
>>32409
> To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA
>This checkpoint leverages the pretrained checkpoint wav2vec2-large-lv60 and is fine-tuned on CommonVoice to recognize phonetic labels in multiple languages.
https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft
>Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.
https://arxiv.org/abs/2109.11680
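A minimal sketch of pulling those phonetic labels out of a recording with that checkpoint via the Hugging Face transformers API; the WAV path is a placeholder and the clip is assumed to be 16 kHz mono:

import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "facebook/wav2vec2-lv-60-espeak-cv-ft"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load a 16 kHz mono recording (placeholder path).
audio, sample_rate = sf.read("waifu_sample_16k.wav")

inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding yields an espeak-style IPA phoneme string.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])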
>>32409 Ouuuf, I had to ask specifically:
Yes, Festival can indeed be easier to work with for voice creation compared to directly using MBROLA or eSpeak. Here's a brief overview of why Festival might be a more accessible option:
1. Comprehensive toolkit:
 - Festival comes with a suite of tools designed for voice creation
 - Includes utilities for data preparation, analysis, and synthesis
2. Festvox:
 - A voice building framework that works with Festival
 - Provides a structured approach to creating new voices
3. Higher-level abstractions:
 - Allows you to work with higher-level linguistic concepts
 - Can automatically handle some lower-level details
4. Built-in machine learning:
 - Can use statistical methods to learn prosody models from data
 - Reduces the need for manual rule creation
5. Extensive documentation:
 - Well-documented processes for voice building
 - Includes tutorials and example voices
6. Language support:
 - Comes with tools for building voices in various languages
 - Can handle different writing systems and phoneme sets
7. Integration:
 - Can still use MBROLA voices if desired
 - Allows for mixing different synthesis methods
While Festival still requires significant effort and linguistic knowledge, it provides more structure and automation in the voice creation process. Would you like more details on how to approach voice creation with Festival?
>>32413 We should keep an eye on developments in case anyone tries to make that easier with AI. Or maybe we could try advocating for it. I might also give it a try myself, if I really have to, but I'm bad with patience and frustration. http://www.festvox.org/ https://github.com/festvox/festvox >For full details about voice building see the document itself http://festvox.org/bsv/
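Before committing to full Festvox voice building, here's a minimal sketch for just hearing Festival's stock voices from Python; it assumes the festival package (which ships the text2wave tool) is installed, and the text and output path are placeholders:

import subprocess

def festival_to_wav(text, wav_path="festival_out.wav"):
    """Render text to a WAV file with Festival's text2wave command-line tool."""
    subprocess.run(
        ["text2wave", "-o", wav_path],
        input=text.encode("utf-8"),  # text2wave reads the text from stdin
        check=True,
    )

festival_to_wav("Hello Anon, this is Festival speaking.")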
>>32414 Great research effort NoidoDev, thanks! :^)
>Kokoro TTS
I just watched this video https://youtu.be/tl1wvZXlj0I about a new TTS model. It's named Kokoro https://huggingface.co/hexgrad/Kokoro-82M
>Kokoro is a frontier TTS model for its size of 82 million parameters (text in/audio out).
>On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. ...
>Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio ...
>Kokoro v0.19 is limited in some specific ways, due to its training set and/or architecture:
> [Data] Lacks voice cloning capability, likely due to small <100h training set
> [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
> [Data] Training dataset is mostly long-form reading and narration, not conversation
> [Arch] At 82M params, Kokoro almost certainly falls to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
> [Data] Multilingual capability is architecturally feasible, but training data is mostly English
What I found really interesting is that it can blend voices, so you can create voices which won't sound like the standard ones.
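The blending is apparently just a weighted average of the per-voice style tensors (the .pt voicepacks shipped with the model). A minimal sketch, assuming the voicepacks have been downloaded; the voice names and the 60/40 weights are only examples:

import torch

# Load two of the released Kokoro voicepacks (style embedding tensors).
voice_a = torch.load("voices/af_bella.pt", weights_only=True)
voice_b = torch.load("voices/af_sarah.pt", weights_only=True)

# A 60/40 weighted average gives a new voice that matches neither original exactly.
blended = 0.6 * voice_a + 0.4 * voice_b
torch.save(blended, "voices/af_custom_blend.pt")
# The blended .pt can then be passed to Kokoro's inference code in place of a stock voicepack.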
>>35678 Nice find, NoidoDev. Thanks! :^)
I used eSpeak and copy-and-paste to get a quick-and-dirty speech capability.
>>36362 Can you please link some resources for everyone here on how you did it? Like, the original sauces for things.
>>36377 From Offline AI Roleplay - A Guide to Simple Offline AI: "Open eSpeak while you use the AI program (it may be labeled as TTSApp). Copy the outputs into eSpeak, and have it say the text. Both Jan and LM Studio have a dedicated copy button. The shortcut for pasting is Ctrl + V." eSpeak can be found here: https://espeak.sourceforge.net/
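If the manual copy-paste gets tedious, here's a minimal sketch that watches the clipboard and speaks anything newly copied through eSpeak; it assumes espeak-ng (or espeak) is on the PATH and pyperclip is installed, and works on both Windows and Linux:

import subprocess
import time

import pyperclip  # pip install pyperclip

def speak(text):
    # espeak-ng takes the text as its last argument; -s sets speed in words per minute.
    subprocess.run(["espeak-ng", "-s", "150", text], check=False)

last = pyperclip.paste()
while True:
    time.sleep(0.5)                    # poll the clipboard twice a second
    current = pyperclip.paste()
    if current != last and current.strip():
        speak(current)                 # speak whatever was just copied from Jan/LM Studio
        last = current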
>>36377 espeak is in every Linux repository. It's very well known but also basic. I sometimes use Speech Dispatcher (spd-say) with Festival >>4333 or MBROLA >>32411 >>32406, though Festival seems to be easier for voice creation: >>32413. Festival sounds better than espeak, imo, especially with the right configuration (which I don't have anymore right now). It still sounds very robotic and makes errors in pronunciation, but it's okayish and better than espeak. Speech Dispatcher (the spd-say command) is a kind of layer on top, so you can switch the system underneath: https://wiki.archlinux.org/title/Speech_dispatcher
>>36419 Festival definitely looks better for developers who are making a unique system.
>espeak is in every Linux repository.
Yeah, but me and many others are on Windows.
>>36419 Isn't Narrator in the accessibility settings just a built-in text-to-speech program?
>>36383 >>36417 Thanks for your help, Anons! :^)
>>36423 I tried that at first, but the problem is that as far as I know, it reads EVERYTHING on the screen.
>>36419
>but me and many others are on Windows
Okay, I assume this is for development, while the real system will more likely be Linux. Anyways, I don't know exactly how this works, but I think you can use the embedded Linux in Windows (WSL), and I assume there's also a package repository there. WSL: https://learn.microsoft.com/en-us/windows/wsl/about
>>36447 Yeah, WSL currently defaults to an Ubuntu 24 variant system + terminal. It's not a perfect match, but it's close enough in most respects. For example, I've been able to build & successfully run juCi++ [1][2][3] on it (this is a moderately complex & dependency-laden GTKMM-based GUI application, built from source) without any hiccups. This subsystem is very simple to set up, and I'll be happy to help anyone here who may be struggling to do so. Hopefully it can support Anon's other development needs, and if not then moving over to a full Linux system will be all the easier for them afterwards. Cheers. :^)
---
1. https://gitlab.com/cppit/jucipp/-/blob/master/docs/install.md#debianlinux-mintubuntu
2. Here's a one-liner to copypasta onto your new Ubuntu terminal for installing all its dependencies:
sudo apt-get install libclang-dev liblldb-dev || sudo apt-get install libclang-6.0-dev liblldb-6.0-dev || sudo apt-get install libclang-4.0-dev liblldb-4.0-dev || sudo apt-get install libclang-3.8-dev liblldb-3.8-dev; sudo apt-get install universal-ctags || sudo apt-get install exuberant-ctags; sudo apt-get install git cmake make g++ clang-format pkg-config libboost-filesystem-dev libboost-serialization-dev libgtksourceviewmm-3.0-dev aspell-en libaspell-dev libgit2-dev
Then just follow the rest of the install instructions from the link above (ie, git clone --recursive https://gitlab.com/cppit/jucipp , etc.)
3. WSL is also a great platform for Windows users to build & run BUMP for archiving this board, btw ( >>14866 ).
>=== -prose edit -add footnote/hotlink -add dependencies/BUMP footnotes
Edited last time by Chobitsu on 02/03/2025 (Mon) 15:36:01.
