/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.

Speech Synthesis/Recognition general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us, right?
en.wikipedia.org/wiki/Speech_synthesis
https://archive.is/xxMI4
research.spa.aalto.fi/publications/theses/lemmetty_mst/contents.html
https://archive.is/nQ6yt
The Tacotron project:
arxiv.org/abs/1703.10135
google.github.io/tacotron/
https://archive.is/PzKZd
No code available yet; hopefully they will release it.
github.com/google/tacotron/tree/master/demos
https://archive.is/gfKpg
>=== -edit subject
Edited last time by Chobitsu on 07/02/2023 (Sun) 04:22:22.
>>23634
>I'm going to start using a name tag so I can get some proper recognition for what I've done so far.
Good thinking, Anon. Though that's not really why we use names here; watch the movie 50 First Dates to understand the actual reason.
>>23643
I deleted my original post here but forgot to copy it first. I just wanted to post the new link to the related post. Related: >>23682
This thread is about speech synthesis and maybe recognition, not about 3D models. You can crosslink posts like above.
>our research team kept seeing new voice conversion methods getting more complex and becoming harder to reproduce. So, we tried to see if we could make a top-tier voice conversion model that was extremely simple. So, we made kNN-VC, where our entire conversion model is just k-nearest neighbors regression on WavLM features. And, it turns out, this does as well if not better than very complex any-to-any voice conversion methods. What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages.
https://bshall.github.io/knn-vc
https://arxiv.org/abs/2305.18975
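To make the trick concrete, here's a minimal sketch of the regression step (not the authors' code; it assumes the WavLM features are already extracted, and the real system still needs a vocoder to turn the converted features back into audio):

[code]
# kNN-VC core idea: replace every frame of the source speaker's features
# with the average of its k nearest frames from the reference speaker.
import torch

def knn_convert(source_feats: torch.Tensor,    # (S, D) source frames
                reference_feats: torch.Tensor, # (R, D) reference frames
                k: int = 4) -> torch.Tensor:
    # Distance from every source frame to every reference frame
    dists = torch.cdist(source_feats, reference_feats)   # (S, R)
    # Indices of the k closest reference frames per source frame
    knn_idx = dists.topk(k, largest=False).indices       # (S, k)
    # The "conversion" is just averaging those neighbors
    return reference_feats[knn_idx].mean(dim=1)          # (S, D)
[/code]

Since nothing in this step is learned, swapping the reference for dog barks or music really is just swapping out a matrix.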
>>23736
>What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages.
Lol. That seems a little bizarre to think through. Thanks Anon.
>ps. I edited the subject ITT, thanks for pointing that out NoidoDev.
We should think about optimizations of speech recognition (synthesis needs its own approach):
- there are FPGA SBCs which you can train to react to certain words, then put out a text or trigger something
- instead of recording a 30s sentence, record much shorter chunks and continue directly after the first one: check the parts, but also glue them together and send the whole sentence to the speech recognition model (sketched below)
- maybe using a language model to anticipate what might be said, while using parts of a sentence, especially with some context, e.g. pointing at something
- finding ways to detect made-up words
- constructing words out of syllables instead of just jumping to what could have been meant, using that for parts of a sentence where the speech recognition model is uncertain
- using the certainty values of speech recognition to look for errors (misunderstandings), maybe using the syllable construction, wordlists and lists of names for that
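A rough sketch of the chunked-recognition bullet, assuming OpenAI's whisper package (the gluing logic here is a simplified stand-in; real code would also need VAD and endpointing):

[code]
# Transcribe short consecutive recordings as they arrive, then re-run
# recognition over the glued-together audio for a better final pass.
import numpy as np
import whisper

model = whisper.load_model("base")

def transcribe_incremental(chunk_paths):
    audio = np.array([], dtype=np.float32)
    for path in chunk_paths:
        # whisper.load_audio returns 16 kHz float32 mono
        audio = np.concatenate([audio, whisper.load_audio(path)])
        # Provisional pass over everything heard so far
        print("partial:", model.transcribe(audio)["text"])
    # Final pass over the whole glued sentence
    return model.transcribe(audio)["text"]
[/code]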
>>24951
>- maybe using a language model to anticipate what might be said, while using parts of a sentence, especially with some context, e.g. pointing at something
I would anticipate this should at the least provide greater odds of a coherent parse (particularly in a noisy environment) than STT alone. Good thinking, Anon.
Open file (50.97 KB 768x384 vallex_framework.jpg)
Related: >>25073
>VALL-E X is an amazing multilingual text-to-speech (TTS) model proposed by Microsoft. While Microsoft initially published it in their research paper, they did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge to reproduce the results and train our own model. We are glad to share our trained VALL-E X model with the community, allowing everyone to experience the power of next-generation TTS.
https://github.com/Plachtaa/VALL-E-X
https://huggingface.co/spaces/Plachta/VALL-E-X
>>25075
Also worth noting: it's broken if you launch it through the "python -X utf8 launch-ui.py" command and let it install the "vallex-checkpoint.pt" and Whisper "medium.pt" models on its own. Very weird, as it's already solved here:
https://github.com/Plachtaa/VALL-E-X#install-with-pip-recommended-with-python-310-cuda-117--120-pytorch-20
Download them manually; that's it.
>>25075 >>25096 Thanks. This will be very useful.
Open file (107.39 KB 608x783 Screenshot_136.png)
There's some excitement around a Discord server being removed, which was working on AI voice models. We might not even have known about it (I didn't), but here's the website:
https://voice-models.com
https://docs.google.com/spreadsheets/d/1tAUaQrEHYgRsm1Lvrnj14HFHDwJWl0Bd9x0QePewNco/edit#gid=1227575351
and weights.gg (not voice models)
>AI Hub discord just got removed from my server list
But it seems to be only a fraction of the models. Some mention a backup, IIRC:
https://www.reddit.com/r/generativeAI/comments/16zzuh4/ai_hub_discord_just_got_removed_from_my_server/
>>25805
>I WARNED YOU ABOUT THE DOXXCORD STAIRS BRO
Save. Everything. Doxxcord is even more deeply-controlled than G*ogle is. DMCAs don't result in a forum getting disappear'd.
>Otamatone
https://youtu.be/Y_ILdh1K0Fk
Found here, related: >>25273
>>25876 Had no idea that was a real thing NoidoDev, thanks! Any chance it's opensauce?
>>25893 The original belongs to a corporation, but if you look for "Otamatone DIY" you can find some variants.
>>25909 Cool. Thank you NoidoDev! :^)
>>17474
Can we get this with timestamps, so we can use it for voice training (text-to-speech)?
>ⓍTTS is a voice generation model that lets you clone voices into different languages by using just a quick 6-second audio clip. There is no need for an excessive amount of training data that spans countless hours.
https://huggingface.co/coqui/XTTS-v2 (non-commercial license only)
Testing Space: https://huggingface.co/spaces/coqui/voice-chat-with-mistral
Via https://www.reddit.com/r/LocalLLaMA/comments/17yzr6l/coquiai_ttsv2_is_so_cool/ (seems to be much closer to ElevenLabs quality)
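If anyone wants to try it locally, usage through the coqui TTS package looks roughly like this (model id and arguments follow the XTTS-v2 model card, but treat the details as assumptions that may change between releases):

[code]
# pip install TTS -- clones a voice from a short reference clip
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
tts.tts_to_file(
    text="Welcome home, Anon.",
    speaker_wav="reference_6s.wav",  # the ~6-second clip to clone
    language="en",
    file_path="output.wav",
)
[/code]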
>>26511
Also this one: https://github.com/yl4579/StyleTTS2
Some people claim it's 100x faster than Coqui's XTTS. Still no webui though :(
>>26512
Thanks, I saw this mentioned but forgot to look it up.
>>26512
Tested it locally on an RTX 3070. Works fast as fuck. https://files.catbox.moe/ow0ryz.mp4
>>26535 >>26566 Thanks Anons. :^)
>>27995 REALLY impressive Anon, thanks!
>MetaVoice 1B - The new TTS and voice cloning open source model
Colab: https://drp.li/7RUPU
MetaVoice Online Demo: https://ttsdemo.themetavoice.xyz/
https://huggingface.co/metavoiceio
https://youtu.be/Y_k3bHPcPTo
Not as good as proprietary models.
>>29257 >Not as good as proprietary models. Ehh, they'll get better with time, no doubt. Thanks Anon! Cheers. :^)
>This week we’re talking with Georgi Gerganov about his work on Whisper.cpp and llama.cpp. Georgi first crossed our radar with whisper.cpp, his port of OpenAI’s Whisper model in C and C++. Whisper is a speech recognition model enabling audio transcription and translation. Something we’re paying close attention to here at Changelog, for obvious reasons. Between the invite and the show’s recording, he had a new hit project on his hands: llama.cpp. This is a port of Facebook’s LLaMA model in C and C++. Whisper.cpp made a splash, but llama.cpp is growing in GitHub stars faster than Stable Diffusion did, which was a rocket ship itself.
https://changelog.com/podcast/532
Some takeaways: Whisper didn't do speaker identification (diarization) when this episode was published on March 22, 2023, and it seems hard to find something that does. But they said people set up their own pipelines for this, and Whisper might get there as well.
I found this on the topic by briefly searching; it still doesn't seem to be covered in any easy way:
>How to use OpenAI's Whisper to transcribe and diarize audio files
https://github.com/lablab-ai/Whisper-transcription_and_diarization-speaker-identification-
Discussion on this: https://huggingface.co/spaces/openai/whisper/discussions/4
Azure AI services seem to be able to do it, but this doesn't help us much. I mean, for using it as a tool to extract voice files for training it's one thing, but we also need it as a skill for our waifus:
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speaker-recognition?tabs=script&pivots=programming-language-cpp
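For reference, the home-made pipelines mentioned above usually look something like this sketch (hedged: it assumes openai-whisper plus pyannote.audio, whose pretrained diarization pipeline needs a HuggingFace access token; the midpoint-overlap rule is a naive stand-in):

[code]
# Transcribe with Whisper, diarize with pyannote, then label each
# transcript segment with the speaker active at its midpoint.
import whisper
from pyannote.audio import Pipeline

asr = whisper.load_model("medium")
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization")

def transcribe_with_speakers(path):
    segments = asr.transcribe(path)["segments"]
    turns = diarizer(path)  # annotation of who speaks when
    labelled = []
    for seg in segments:
        mid = (seg["start"] + seg["end"]) / 2
        speaker = "unknown"
        for turn, _, label in turns.itertracks(yield_label=True):
            if turn.start <= mid <= turn.end:
                speaker = label
                break
        labelled.append((speaker, seg["text"].strip()))
    return labelled
[/code]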
>>29415
Very nice. Thanks NoidoDev! I have a lot of respect for Gerganov. He very clearly understands the issues of latency in a systems-development context. Exactly the kind of expertise vital to the success of /robowaifu/ and our affiliated cadres in the end. Cheers. :^)
>Data Exchange Podcast 198 - Sep 21, 2023
An overview of everything related to speech.
https://youtu.be/w4DULuvgO1Y
Yishay Carmiel is the CEO of Meaning, a startup at the forefront of building real-time speech applications for enterprises.
Episode Notes: https://thedataexchange.media/state-of-ai-for-speech-and-audio
>Sections
Generative AI for Audio (text-to-speech; text-to-music; speech synthesis) - 00:00:44
Speech Translation - 00:09:44
Automatic Speech Recognition and other models that use audio inputs - 00:13:16
Speech Emotion Recognition - 00:19:55
Restoration - 00:21:55
Similarities in recent trends in NLP and Speech - 00:24:23
Diarization (speaker identification), and implementation challenges - 00:29:47
Voice cloning and risk mitigation - 00:35:36
There are some Japanese open-source programs for speech synthesis, such as VOICEVOX. Though I should mention: if you use these voices they will have funny accents when you make them speak English, which can be kinda cute sometimes.
https://voicevox.hiroshiba.jp
And TALQu, but it is only for Windows.
https://booth.pm/ja/items/2755336
NNSVS is for singing, also open source.
https://nnsvs.github.io
SociallyIneptWeeb used VOICEVOX for an AI waifu before and detailed what he did:
https://www.youtube.com/watch?v=bN5UaEkIPGM&t=674s
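VOICEVOX also ships an engine with a local HTTP API, which makes it easy to script (sketch below; the two-step audio_query/synthesis flow and default port follow the engine docs, but double-check the details against your version):

[code]
# Drive a locally running VOICEVOX engine (default port 50021).
import requests

BASE = "http://127.0.0.1:50021"

def voicevox_say(text, speaker=1, out_path="out.wav"):
    # Step 1: build the synthesis query (phonemes, pitch, timing)
    query = requests.post(f"{BASE}/audio_query",
                          params={"text": text, "speaker": speaker}).json()
    # Step 2: render the query to wav bytes
    wav = requests.post(f"{BASE}/synthesis",
                        params={"speaker": speaker}, json=query)
    with open(out_path, "wb") as f:
        f.write(wav.content)

voicevox_say("こんにちは、ご主人様。")
[/code]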
>>30390
Oh wow, this is really good. Thanks.
>https://nnsvs.github.io
>NNSVS
>Neural network based singing voice synthesis library
> GitHub: https://github.com/nnsvs/nnsvs
> Paper: https://arxiv.org/abs/2210.15987
> Demo: https://r9y9.github.io/projects/nnsvs/
>Features
> Open-source: NNSVS is fully open-source. You can create your own voicebanks with your dataset.
> Multiple languages: NNSVS has been used for creating singing voice synthesis (SVS) systems for multiple languages by VocalSynth communities (8+ as far as I know).
> Research friendly: NNSVS comes with reproducible Kaldi/ESPnet-style recipes. You can use NNSVS to create baseline systems for your research.
>>30398
Here is a site I found that has some writing about it and links to written tutorials:
https://nnsvs.carrd.co/
>VoiceCraft
>>30614
Thanks, but it's about voice cloning again. I think what I really want are artificial voices which don't belong to anyone. Cloning has its use cases as well, but I don't need or want it for a robot wife. The voice also doesn't need to be too close to a human's. To me the quality problem is solved at this point, at least for robowaifus. I was certainly very impressed by the singing capabilities I saw and heard recently, see above: >>30390
>>30625
If you aren't worried about human closeness, there is a pretty simple TTS that sounds like old retro synthesized voices. Unfortunately I can't find a video that has the female voice.
https://github.com/adafruit/Talkie
>>30657
Thanks, but I didn't mean going to such an extreme in the other direction. I just meant that for our use case here, in my opinion, the current state of the technology should be sufficient in terms of quality, or at least close to it. Making it faster and able to run better on smaller devices would be good, though. For content creation it's another story, if we don't want to only have stories about robots.
I figure this might be the best place for this. I found a paper on lip syncing synthesized voices.
>>31027
For some reason the file didn't attach:
https://doi.org/10.1109/ROMAN.2010.5598656
>>31027 I've been thinking about designing something similar, now I'm totally gonna s̶t̶e̶a̶l̶ be inspired by this.
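Here's roughly the shape I'm picturing, as a toy sketch (not the paper's method: espeak-ng's --pho flag really does emit MBROLA-style "phoneme duration_ms" lines when used with an mb-* voice, but the openness table and the servo call here are made-up stand-ins):

[code]
# Map TTS phoneme timings to mouth-opening commands.
import subprocess, time

# Rough mouth openness per phoneme (0 = closed, 1 = wide open)
OPENNESS = {"a": 1.0, "A": 1.0, "O": 0.8, "e": 0.6, "i": 0.4,
            "u": 0.3, "m": 0.0, "b": 0.0, "p": 0.0, "_": 0.0}

def set_mouth(value):
    print(f"mouth -> {value:.1f}")  # replace with your servo driver

def lipsync(text):
    # One "phoneme duration_ms [pitch points]" line per phoneme
    pho = subprocess.run(
        ["espeak-ng", "-v", "mb-en1", "--pho", text],
        capture_output=True, text=True).stdout
    for line in pho.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        phoneme, dur_ms = parts[0], int(float(parts[1]))
        set_mouth(OPENNESS.get(phoneme, 0.5))
        time.sleep(dur_ms / 1000)
    set_mouth(0.0)  # close the mouth when done
[/code]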
>>31027
That is great. I mentioned doing something sorta the same with facial expressions. I believe this is the same sort of "framework" or idea. The paper is "Real-time lip synchronization between text-to-speech (TTS) system and robot mouth". Well, I can't upload it; I get an error saying "Tor users can not upload files". What??????? Here's an address for the paper:
https://sci-hub.ru/10.1109/roman.2010.5598656
>>31049
>Well, I can't upload it; I get an error saying "Tor users can not upload files". What???????
Lol, welcome to my world! :D TBH I think Robbit must've disabled file posting by Torfags. I hope he changes that soon.
Just wanted to mention that Suno, the AI music creation model, is based on Bark, the speech generation model. They needed around two years to get from there to where we are now. I have a source, a video where this is mentioned, but it is mostly about a lot of other things; this just came up there.
>>32169 Neat! That's an interesting heritage. Impressive results in a fairly short time, too. Thanks, NoidoDev. Cheers. :^)
I think to make speech generation fast enough, we'll have to go from some AI voice to replicating the generated voice in MBROLA. I asked Claude about it:

[Quote]
To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA, you'll need to follow several steps. Here's a concise overview of the process:
1. Prepare your data:
- Ensure your audio files are high-quality recordings
- Verify that phonetic labels are accurate and aligned with the audio
2. Choose your synthesizer:
- eSpeak: Better for rapid development, less natural-sounding
- MBROLA: More natural-sounding, but requires more work
3. For eSpeak:
- Create a voice file with phoneme definitions
- Define pitch and timing rules
- Add your voice file to eSpeak's voices directory
4. For MBROLA:
- Create a diphone database from your audio files
- Define pitch and duration rules
- Create a language definition file
5. Test and refine:
- Use sample texts to test your voice
- Adjust parameters for better quality
This process can be complex and time-consuming.

Yes, there are tools available to help create voices for MBROLA. Here's a brief overview of some key tools:
1. MBR_PH: This is the official MBROLA database creation tool. It helps in creating diphone databases from labeled speech samples.
2. Praat: While not specifically for MBROLA, this phonetics software can be used for labeling and segmenting audio files, which is a crucial step in the process.
3. HTK (Hidden Markov Model Toolkit): Can be used for automatic segmentation of speech data.
4. Festvox: A voice building framework that can be used to create voices for various synthesizers, including MBROLA.
5. WinSnoori: A speech analysis tool that can be helpful in the voice creation process.
These tools can significantly streamline the process, but creating a high-quality voice still requires considerable effort and expertise.
[/Quote]
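For orientation, the tail end of the MBROLA route is simple to smoke-test once a diphone database exists (toy example; the phoneme timings are made up and it assumes mbrola plus the stock en1 database installed under the usual Debian path):

[code]
# Render a hand-written .pho file ("phoneme duration_ms [pitch points]")
# through an existing MBROLA diphone database.
import subprocess

PHO = """\
_ 100
h 80
@ 60 50 120
l 70
@U 150 80 110
_ 100
"""

with open("hello.pho", "w") as f:
    f.write(PHO)

# mbrola <voice database> <input .pho> <output wav>
subprocess.run(["mbrola", "/usr/share/mbrola/en1/en1",
                "hello.pho", "hello.wav"])
[/code]

Building your own database is the hard 99%; this only shows what the finished artifact gets used like.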
>>32409 The linguistic knowledge necessary for defining pitch and duration rules in speech synthesis encompasses several areas of linguistics. Here's an overview of the key areas: 1. Phonetics and Phonology: - Understanding of speech sounds and their features - Knowledge of phonemic inventories and allophonic variations - Familiarity with syllable structure and phonotactics 2. Prosody: - Understanding of stress patterns (lexical and sentential) - Knowledge of intonation patterns and their meanings - Familiarity with rhythm and timing in speech 3. Morphology: - Understanding how word structure affects pronunciation - Knowledge of morphophonemic alternations 4. Syntax: - Understanding how sentence structure affects prosody - Knowledge of phrase boundaries and their prosodic effects 5. Pragmatics: - Understanding how context and intent affect prosody - Knowledge of focus and emphasis in speech 6. Sociolinguistics: - Familiarity with dialect variations in prosody - Understanding of speaking styles and registers 7. Acoustic Phonetics: - Knowledge of acoustic correlates of prosodic features - Understanding of formant structures and transitions 8. Language-specific knowledge: - Deep understanding of the prosodic system of the target language - Familiarity with any unique features (e.g., tone in tonal languages) This knowledge helps in creating rules that accurately capture the nuances of natural speech in the target language. Would you like more information on any of these linguistic areas or how they specifically apply to speech synthesis?
>>32409
>To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA
>This checkpoint leverages the pretrained checkpoint wav2vec2-large-lv60 and is fine-tuned on CommonVoice to recognize phonetic labels in multiple languages.
https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft
>Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.
https://arxiv.org/abs/2109.11680
>>32409
Ouuuf, I had to ask specifically:

[Quote]
Yes, Festival can indeed be easier to work with for voice creation compared to directly using MBROLA or eSpeak. Here's a brief overview of why Festival might be a more accessible option:
1. Comprehensive toolkit:
- Festival comes with a suite of tools designed for voice creation
- Includes utilities for data preparation, analysis, and synthesis
2. Festvox:
- A voice building framework that works with Festival
- Provides a structured approach to creating new voices
3. Higher-level abstractions:
- Allows you to work with higher-level linguistic concepts
- Can automatically handle some lower-level details
4. Built-in machine learning:
- Can use statistical methods to learn prosody models from data
- Reduces the need for manual rule creation
5. Extensive documentation:
- Well-documented processes for voice building
- Includes tutorials and example voices
6. Language support:
- Comes with tools for building voices in various languages
- Can handle different writing systems and phoneme sets
7. Integration:
- Can still use MBROLA voices if desired
- Allows for mixing different synthesis methods
While Festival still requires significant effort and linguistic knowledge, it provides more structure and automation in the voice creation process.
[/Quote]
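Once a voice is built, checking it from a script is one call via Festival's text2wave tool (sketch; the voice name is a placeholder for whatever voice you actually build or install):

[code]
# text2wave ships with Festival: text in on stdin, wav out.
import subprocess

subprocess.run(
    ["text2wave", "-o", "test.wav",
     "-eval", "(voice_cmu_us_slt_arctic_hts)"],  # placeholder voice name
    input="Testing the new voice.", text=True)
[/code]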
>>32413
We should keep an eye on developments in case anyone tries to make this easier with AI. Or maybe we could try advocating for it. I might also give it a try myself, if I really have to, but I'm bad with patience and frustration.
http://www.festvox.org/
https://github.com/festvox/festvox
>For full details about voice building see the document itself
http://festvox.org/bsv/
>>32414 Great research effort NoidoDev, thanks! :^)
