/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.

Speech Synthesis/Recognition general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us, right?
en.wikipedia.org/wiki/Speech_synthesis https://archive.is/xxMI4
research.spa.aalto.fi/publications/theses/lemmetty_mst/contents.html https://archive.is/nQ6yt
The Tacotron project:
arxiv.org/abs/1703.10135
google.github.io/tacotron/ https://archive.is/PzKZd
No code available yet; hopefully they will release it.
github.com/google/tacotron/tree/master/demos https://archive.is/gfKpg
>=== -edit subject
Edited last time by Chobitsu on 07/02/2023 (Sun) 04:22:22.
>>22538 Lol. Just to let you know Anon, we're primarily a SFW board. You might try /robo/. Cheers. :^)
>>22538 What is this? From the ...engine where the dev doesn't want to be mentioned here?
I just finished my demonstration of talking to the waifu AI: https://youtu.be/jjvbENaiDXc
>Whisper-based Real-time Speech Recognition
https://www.unrealengine.com/marketplace/en-US/product/d293a6a427c94831888ca0f47bc5939b
Just want to show this here after finding it. Something like this would be useful if one wanted to use Unreal Engine for a virtual waifu or some kind of virtual training environment.
>>23538 I'm sure there's some kind of netcode in Unreal you can use with a transcribing API of your choice and save yourself the $99.
>virtual waifu
real life robotic waifu
>>23558
>Whisper C++
>Beta: v1.4.2 / Stable: v1.2.1 / Roadmap | F.A.Q.
>High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:
>Plain C/C++ implementation without dependencies
>Apple silicon first-class citizen - optimized via ARM NEON, Accelerate framework and Core ML
>AVX intrinsics support for x86 architectures
>VSX intrinsics support for POWER architectures
>Mixed F16 / F32 precision
>4-bit and 5-bit integer quantization support
>Low memory usage (Flash Attention)
>Zero memory allocations at runtime
>Runs on the CPU
>Partial GPU support for NVIDIA via cuBLAS
>Partial OpenCL GPU support via CLBlast
>BLAS CPU support via OpenBLAS
>C-style API
Thanks, that might come in handy. There seems to be enough GPU support, despite it primarily running on the CPU. I'm still thinking of building a dedicated server at some point, using the Arc A380 (70W).
>large 2.9 GB ~3.3 GB
The original implementation needs 10GB or more for the large model, which would rather suggest getting a 3060 (170W). Many things will work fine with the smaller models anyway.
>>23558 Thanks for the reminder Anon. That anon's work is really quite excellent tbh.
>>23558 >>23561 This guy (a bit hard to understand) https://www.youtube.com/watch?v=75H12lYz0Lo tests it on a Raspberry Pi, and it actually works surprisingly fast! He keeps pushing his optimizations to run on smaller and smaller hardware. I'll keep an eye on that.
>>23579 AWS Transcribe costs 3 cents per minute, and you want to rent a server to run that thing, which probably requires multiple GPUs. Doesn't make any sense.
>>23591
>Whisper vs AWS Transcribe
This is about running it at home. The tiny model works on a Raspberry Pi, and the large one maybe on a 4GB GPU, certainly on a 6GB GPU (like the Arc A380, which uses 70W). Do as you wish, but the general notion here is that we want our waifus to be independent of the internet. Some might even say, not connected to it at all. Using online services for something as fundamental as speech recognition (transcription), especially beyond development, is a special case and will not be recommended.
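For anyone who wants to try the local route, here's a minimal sketch using the openai-whisper Python package (the load_model/transcribe calls are from its README; the filename is just a placeholder):
```python
# pip install openai-whisper  (pulls in torch; ffmpeg must be on PATH)
import whisper

# "tiny" fits on a Raspberry Pi class machine; "large" wants a GPU
# with several GB of VRAM. Both decode through the same interface.
model = whisper.load_model("tiny")

# transcribe() loads and resamples the audio file via ffmpeg.
result = model.transcribe("waifu_command.wav", language="en")
print(result["text"])
```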
>>23535 That took quite a while and was more productive than whatever the heck Kiwi is doing. I'm going to start using a name tag so I can get some proper recognition for what I've done so far: trying to make a HASEL actuator, this, buying supplies, reading up on electronics, testing the Arduino, and soon making a 3D anime girl doll from scratch. I'm really about to leave this place, because this is bullshit.
>>23634 peteblank is an anagram for "pleb taken"
>>23590 Wow. That's most excellent.
>>23634 It's good that you did something during the last few months, but don't exaggerate. You had some advice from other anons here when trying to make the HASEL actuator. You also bring this kind of vitriol with you, bashing someone or this board in way too many comments.
>3D anime girl doll from scratch
I'm looking forward to seeing that.
>I'm really about to leave this place
You don't need to hang out here every day. Work on your project and report back later.
>>23640 I am right to be upset at Kiwi, since he's attacking my character for no reason. I told him I was planning to do this for profit if possible; I emailed the guy who made the 3D model asking for permission, and then he turns around and claims I want to steal other people's stuff.
>>23634
>I'm going to start using a name tag so I can get some proper recognition for what I've done so far.
Good thinking Anon. Though that's not really why we use names here. Watch the movie 50 First Dates to understand the actual reason.
>>23643 I deleted my original post here, but forgot to copy it. Just wanted to post the new link to the related post. Well... Related: >>23682 This thread is about speech synthesis and maybe recognition, not even about 3D models. You can crosslink posts like above.
>our research team kept seeing new voice conversion methods getting more complex and becoming harder to reproduce. So, we tried to see if we could make a top-tier voice conversion model that was extremely simple. So, we made kNN-VC, where our entire conversion model is just k-nearest neighbors regression on WavLM features. And, it turns out, this does as well if not better than very complex any-to-any voice conversion methods. What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages. https://bshall.github.io/knn-vc https://arxiv.org/abs/2305.18975
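The conversion step is simple enough to sketch. Below is the core kNN regression on stand-in WavLM features; in the real pipeline the features come from a WavLM encoder and the matched features go to a HiFi-GAN vocoder, so treat this as an illustration of the idea, not their exact code:
```python
import torch

def knn_vc_match(query: torch.Tensor, matching_set: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Replace each source frame with the mean of its k nearest
    reference frames (cosine distance), as described in the paper."""
    # Normalize so the inner product equals cosine similarity.
    q = query / query.norm(dim=-1, keepdim=True)
    m = matching_set / matching_set.norm(dim=-1, keepdim=True)
    sims = q @ m.T                        # [T_src, T_ref]
    idx = sims.topk(k, dim=-1).indices    # k nearest reference frames per source frame
    return matching_set[idx].mean(dim=1)  # [T_src, D]

# Stand-in features: 200 source frames, 1500 reference frames, 1024-dim (WavLM-large size).
src = torch.randn(200, 1024)
ref = torch.randn(1500, 1024)
converted = knn_vc_match(src, ref)  # this is what would be fed to the vocoder
```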
>>23736
>What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages.
Lol. That seems a little bizarre to think through. Thanks Anon.
>ps. I edited the subject ITT, thanks for pointing that out NoidoDev.
We should think about optimizations for speech recognition (synthesis needs its own approach):
- there are FPGA SBCs which you can train to react to certain words, then put out a text or trigger something
- instead of recording a 30s sentence, record much shorter chunks but continue directly after the first one; check the parts, but also glue them together and send the whole sentence to the speech recognition model
- maybe using a language model for anticipation of what might be said, while using parts of a sentence, especially with some context e.g. pointing at something
- finding ways to detect made-up words
- construct words out of syllables instead of just jumping to what could have been meant, using that for parts of a sentence where the speech recognition model is uncertain
- using the certainty values of speech recognition to look for errors (misunderstandings), maybe using the syllable construction, wordlists and lists of names for that (see the sketch below)
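On that last point, a sketch of what using the recognizer's certainty values could look like with the openai-whisper package (avg_logprob and no_speech_prob are real fields in its output; the thresholds are guesses you'd have to tune):
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("utterance.wav")

# Flag segments the model itself was unsure about, so a later stage
# (wordlists, syllable reconstruction, the LLM) can double-check them.
for seg in result["segments"]:
    suspicious = seg["avg_logprob"] < -1.0 or seg["no_speech_prob"] > 0.5
    flag = "??" if suspicious else "ok"
    print(f'[{flag}] {seg["start"]:6.2f}-{seg["end"]:6.2f}  {seg["text"]}')
```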
>>24951
>- maybe using a language model for anticipation of what might be said, while using parts of a sentence, especially with some context e.g. pointing at something
I would anticipate this should at the least provide greater odds of a coherent parse (particularly in a noisy environment) than just STT alone. Good thinking Anon.
Open file (50.97 KB 768x384 vallex_framework.jpg)
Related: >>25073
>VALL-E X is an amazing multilingual text-to-speech (TTS) model proposed by Microsoft. While Microsoft initially published it in their research paper, they did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge to reproduce the results and train our own model. We are glad to share our trained VALL-E X model with the community, allowing everyone to experience the power of next-generation TTS.
https://github.com/Plachtaa/VALL-E-X
https://huggingface.co/spaces/Plachta/VALL-E-X
>>25075 Also worth noting: it's broken if you launch it through the "python -X utf8 launch-ui.py" command and let it install the "vallex-checkpoint.pt" and Whisper "medium.pt" models on its own. Very weird, as it's already solved here: https://github.com/Plachtaa/VALL-E-X#install-with-pip-recommended-with-python-310-cuda-117--120-pytorch-20 Download them manually, that's it.
>>25075 >>25096 Thanks. This will be very useful.
Open file (107.39 KB 608x783 Screenshot_136.png)
There's some excitement around a Discord server being removed, which was working on AI voice models. We might not even have known about it (I didn't), but here's the website: https://voice-models.com https://docs.google.com/spreadsheets/d/1tAUaQrEHYgRsm1Lvrnj14HFHDwJWl0Bd9x0QePewNco/edit#gid=1227575351 and weights.gg (not voice models)
>AI Hub discord just got removed from my server list
But it seems to be only a fraction of the models. Some mention a backup, IIRC: https://www.reddit.com/r/generativeAI/comments/16zzuh4/ai_hub_discord_just_got_removed_from_my_server/
>>25805
>I WARNED YOU ABOUT THE DOXXCORD STAIRS BRO
Save. Everything. Doxxcord is even more deeply-controlled than G*ogle is. DMCAs don't result in a forum getting disappear'd.
>Otamatone
https://youtu.be/Y_ILdh1K0Fk
Found here, related: >>25273
>>25876 Had no idea that was a real thing NoidoDev, thanks! Any chance it's opensauce?
>>25893 The original belongs to a corporation, but if you look for "Otamatone DIY" you can find some variants.
>>25909 Cool. Thank you NoidoDev! :^)
>>17474 Can we get this with timestamps, so we can use it for voice training (text-to-speech)?
>ⓍTTS is a Voice generation model that lets you clone voices into different languages by using just a quick 6-second audio clip. There is no need for an excessive amount of training data that spans countless hours. https://huggingface.co/coqui/XTTS-v2 (only non-commercial licence) Testing Space: https://huggingface.co/spaces/coqui/voice-chat-with-mistral Via https://www.reddit.com/r/LocalLLaMA/comments/17yzr6l/coquiai_ttsv2_is_so_cool/ (seems to be much closer to the ElevenLabs quality)
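If anyone wants to try XTTS-v2 locally, a minimal sketch with Coqui's TTS package (the model name and tts_to_file() call follow their README; the reference clip and text are placeholders):
```python
# pip install TTS  (Coqui's package; mind the non-commercial XTTS licence)
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice in the ~6 second reference clip and speak the given text.
tts.tts_to_file(
    text="Welcome home, Anon.",
    speaker_wav="your_voice.wav",  # short reference clip to clone
    language="en",
    file_path="output.wav",
)
```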
>>26511 Also this one: https://github.com/yl4579/StyleTTS2 Some people claim it's 100x faster than Coqui's XTTS. Still no webui though :(
>>26512 Thanks, I saw this mentioned but forgot to look it up.
>>26512 Tested it locally on an RTX 3070. Works fast as fuck. https://files.catbox.moe/ow0ryz.mp4
>>26535 >>26566 Thanks Anons. :^)
>>27995 REALLY impressive Anon, thanks!
>MetaVoice 1B - The new TTS and Voice cloning open source model
Colab: https://drp.li/7RUPU
MetaVoice Online Demo - https://ttsdemo.themetavoice.xyz/
https://huggingface.co/metavoiceio
https://youtu.be/Y_k3bHPcPTo
Not as good as proprietary models.
>>29257 >Not as good as proprietary models. Ehh, they'll get better with time, no doubt. Thanks Anon! Cheers. :^)
>This week we’re talking with Georgi Gerganov about his work on Whisper.cpp and llama.cpp. Georgi first crossed our radar with whisper.cpp, his port of OpenAI’s Whisper model in C and C++. Whisper is a speech recognition model enabling audio transcription and translation. Something we’re paying close attention to here at Changelog, for obvious reasons. Between the invite and the show’s recording, he had a new hit project on his hands: llama.cpp. This is a port of Facebook’s LLaMA model in C and C++. Whisper.cpp made a splash, but llama.cpp is growing in GitHub stars faster than Stable Diffusion did, which was a rocket ship itself.
https://changelog.com/podcast/532
Some takeaways: Whisper didn't do speaker identification (diarization) when this was published on March 22, 2023, and it seems to be hard to find something that does. But they said people set up their own pipelines for this, and Whisper might get there as well. I found this on the topic by briefly searching; it still doesn't seem to be covered in some easy way:
>How to use OpenAI's Whisper to transcribe and diarize audio files
https://github.com/lablab-ai/Whisper-transcription_and_diarization-speaker-identification-
Discussion on this: https://huggingface.co/spaces/openai/whisper/discussions/4
Azure AI services seem to be able to do it, but that doesn't help us much. I mean, for using it as a tool to extract voice files for training it's one thing, but we also need it as a skill for our waifus: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speaker-recognition?tabs=script&pivots=programming-language-cpp
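Since Whisper alone won't diarize, the homebrew pipelines mentioned above boil down to: run a separate diarizer, then merge by timestamps. A rough sketch with pyannote.audio (Pipeline.from_pretrained() is its real entry point, but the model is gated behind a HuggingFace token, and the timestamp merge here is the crudest possible version):
```python
import whisper
from pyannote.audio import Pipeline

# Whisper gives timed text segments...
asr = whisper.load_model("base").transcribe("meeting.wav")

# ...pyannote gives timed speaker turns (gated model; needs an HF token).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="hf_...")
turns = diarizer("meeting.wav")

def speaker_at(t: float) -> str:
    # Crude merge: whichever speaker turn contains this timestamp wins.
    for segment, _, speaker in turns.itertracks(yield_label=True):
        if segment.start <= t <= segment.end:
            return speaker
    return "unknown"

for seg in asr["segments"]:
    mid = (seg["start"] + seg["end"]) / 2
    print(f'{speaker_at(mid)}: {seg["text"].strip()}')
```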
>>29415 Very nice. Thanks NoidoDev! I have a lot of respect for Gerganov. He very-clearly understands the issues of latency in a systems development context. Exactly the kinds of expertise vital for success to /robowaifu/ and our affiliated cadres in the end. Cheers. :^)
>Data Exchange Podcast 198 - Sep 21, 2023
An overview of everything related to speech. https://www.youtu.be/w4DULuvgO1Y
Yishay Carmiel is the CEO of Meaning, a startup at the forefront of building real-time speech applications for enterprises. Episode Notes: https://thedataexchange.media/state-of-ai-for-speech-and-audio
>Sections
Generative AI for Audio (text-to-speech; text-to-music; speech synthesis) - 00:00:44
Speech Translation - 00:09:44
Automatic Speech Recognition and other models that use audio inputs - 00:13:16
Speech Emotion Recognition - 00:19:55
Restoration - 00:21:55
Similarities in recent trends in NLP and Speech - 00:24:23
Diarization (speaker identification), and implementation challenges - 00:29:47
Voice cloning and risk mitigation - 00:35:36
There are some Japanese open-source programs for speech synthesis, such as VOICEVOX, though I should mention that if you make these voices speak English they will have funny accents, which can be kinda cute sometimes. https://voicevox.hiroshiba.jp
And TALQu, but it is only for Windows. https://booth.pm/ja/items/2755336
NNSVS is for singing, also open source. https://nnsvs.github.io
SociallyIneptWeeb used VOICEVOX for an AI waifu before and detailed what he did: https://www.youtube.com/watch?v=bN5UaEkIPGM&t=674s
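For anyone wanting to script VOICEVOX the way SociallyIneptWeeb's setup does: the engine exposes a local HTTP server (port 50021 by default) with a two-step audio_query/synthesis flow. A sketch, assuming the engine app is already running; speaker 1 is one of the stock voices:
```python
import requests

BASE = "http://127.0.0.1:50021"  # default VOICEVOX engine address
text = "おかえりなさい、ご主人様。"

# Step 1: build the synthesis query (phonemes, pitch, speed) for a speaker.
query = requests.post(f"{BASE}/audio_query",
                      params={"text": text, "speaker": 1}).json()

# Step 2: render the query to WAV bytes and save them.
wav = requests.post(f"{BASE}/synthesis",
                    params={"speaker": 1}, json=query).content
with open("voicevox_out.wav", "wb") as f:
    f.write(wav)
```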
>>30390 Oh wow, this is really good. Thanks.
>https://nnsvs.github.io
>NNSVS
>Neural network based singing voice synthesis library
> GitHub: https://github.com/nnsvs/nnsvs
> Paper: https://arxiv.org/abs/2210.15987
> Demo: https://r9y9.github.io/projects/nnsvs/
>Features
> Open-source: NNSVS is fully open-source. You can create your own voicebanks with your dataset.
> Multiple languages: NNSVS has been used for creating singing voice synthesis (SVS) systems for multiple languages by VocalSynth communities (8+ as far as I know).
> Research friendly: NNSVS comes with reproducible Kaldi/ESPnet-style recipes. You can use NNSVS to create baseline systems for your research.
>>30398 Here is a site I found that writes a bit about it and links to written tutorials. https://nnsvs.carrd.co/
>VoiceCraft >>30614
Thanks, but it's about voice cloning again. I think what I really want are artificial voices which don't belong to anyone. Cloning has its use cases as well, but I don't need or want it for a robot wife. Also, I don't need it to be too close to a human. To me the quality problem is solved at this point, at least for robowaifus. I was certainly very impressed by the singing capabilities I saw and heard recently, see above >>30390
>>30625 If you aren't worried about human closeness, there is a pretty simple TTS that sounds like old retro synthesized voices. Unfortunately I can't find a video that has the female voice. https://github.com/adafruit/Talkie
>>30657 Thanks, but I didn't mean to go that far in the other direction. I just meant that for our use case here, in my opinion, the current state of the technology should be sufficient in terms of quality, or at least close to it. Making it faster and run better on smaller devices would be good, though. Content creation is another story, if we don't want to only have stories about robots.
