/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.



Speech Synthesis/Recognition general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us, right?
en.wikipedia.org/wiki/Speech_synthesis https://archive.is/xxMI4
research.spa.aalto.fi/publications/theses/lemmetty_mst/contents.html https://archive.is/nQ6yt
The Tacotron project:
arxiv.org/abs/1703.10135
google.github.io/tacotron/ https://archive.is/PzKZd
No code available yet; hopefully they will release it.
github.com/google/tacotron/tree/master/demos https://archive.is/gfKpg
>=== -edit subject
Edited last time by Chobitsu on 07/02/2023 (Sun) 04:22:22.
>>16669 The techniques for the audio in there are studied now under phonetics. The techniques for the video in there are studied under articulatory synthesis. Articulatory synthesis is difficult and computationally expensive. I don't know of a good, flexible framework for doing that, so I wouldn't know how to get started on waifu speech with it.

Under phonetics, the main techniques before deep neural networks were formant synthesis and concatenative synthesis. Formant synthesis will result in recognizable sounds, but not human voices. It's what you're hearing in the video. Concatenative synthesis requires huge diphone sound banks, which represent sound pieces that can be combined. (Phone = single stable sound. Diphone = adjacent pair of phones. A diphone sound bank cuts off each diphone at the midpoints of the phones, since it's much easier to concatenate phones cleanly at the midpoints rather than the endpoints. This is what Hatsune Miku uses.) Concatenative synthesis is more efficient than deep neural networks, but deep neural networks are far, far more natural, controllable, and flexible.

Seriously, I highly recommend following in the PPP's footsteps here. Deep neural networks are the best way forward. They can produce higher-quality results with better controls and with less data than any other approach. Programmatically, they're also flexible enough to incorporate any advances you might see from phonetics and articulatory synthesis. The current deep neural networks for speech generation already borrow a lot of ideas from phonetics.
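To make the formant idea concrete, here's a toy sketch (my own illustration, not from any particular framework) that pushes an impulse-train "glottal source" through three resonators at rough formant frequencies for the vowel /a/. The result is recognizably vowel-like but clearly not human, which is exactly the limitation described above:

import numpy as np
from scipy.signal import lfilter

sr = 16000
f0 = 120                          # pitch of the glottal source in Hz
source = np.zeros(sr)             # one second of audio
source[::sr // f0] = 1.0          # impulse train as a crude glottal source

out = np.zeros(sr)
for freq, bw in [(730, 90), (1090, 110), (2440, 170)]:   # approximate /a/ formants
    r = np.exp(-np.pi * bw / sr)                          # pole radius from the bandwidth
    theta = 2 * np.pi * freq / sr                         # pole angle from the formant frequency
    out += lfilter([1 - r], [1, -2 * r * np.cos(theta), r ** 2], source)
out /= np.abs(out).max()          # normalize; save with soundfile or the wave module to listen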
>>16684 Thanks for the advice Anon!
Open file (127.74 KB 1078x586 WER.png)
Open file (83.58 KB 903x456 models.png)
Open file (82.91 KB 911x620 languages.png)
https://github.com/openai/whisper
Oh shit, audio transcription has surpassed average human level and is now competitive with professional transcription. OpenAI has gone off its investor rails and completely open-sourced the model and weights. On top of that it's multilingual and can do Japanese fairly well. This could be used for transcribing audio from vtubers, audio books, and anime with missing subtitles. Unfortunately it doesn't do speaker detection as far as I know, but it might be possible to train another model to use the encoded audio features to detect them (a rough sketch of poking at those features follows the model comparison below).

Install:
python -m pip install git+https://github.com/openai/whisper.git --user

Quick start:
import whisper
model = whisper.load_model("base", device="cuda")  # set device to "cpu" if no CUDA
result = model.transcribe("chobits_sample.mp3", language="en")  # multilingual models will automatically detect the language, but English-only models won't
print(result["text"])

Output (base):
> Yuzuki. I brought you some tea. Huh? Huh? Why are you serving the tea? The maid, Persecom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. Wee. I can handle this on my own. I want you to try to relax. Oh. Oh? Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all we really need is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motu-suwa, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! Huh? There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki.

Interestingly the VA actually said "persecom" instead of persocom and Motusua instead of Motosuwa, which was transcribed as "Motu-suwa". The poor pronunciation of "all he really needs is a good night's sleep" sounded a lot like "all we really need is a good night's sleep" and was transcribed as such. The only other errors were transcribing a Chii processing sound effect as "wee", mistaking Minoru saying "ah!" as "huh?", the clatter of teacups being transcribed as "oh", and Minoru saying "ugh" as "oh?".

Output (small):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki?

"Ah! Huh?" from Minoru and Hideki were omitted. "Ugh" was also omitted when Minoru passes out. It understood persocom wasn't a name but still misspelled it "persicom". Chii's sound effect wasn't transcribed as "wee" this time. Motosuwa got transcribed as "Motosua". This model understood "all he really needs" but made a mistake at the end, rendering Hideki's last line as a question: "Miss Yuzuki?"

Output (medium):
> Yuzuki! I brought you some tea. Ah! Huh? Why are you serving the tea? The maid, Persicom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki...

This one got the ellipsis right at the end and recognized Minoru saying "ah!" but mistook persocom for a name, "Persicom". "Ugh" was omitted.

Output (large):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki...

"Ah! Huh?" were omitted, and it understood persocom wasn't a name but still spelled it "persicom".
>>17474 (continued)
Output (tiny):
> Useuki. I brought you some tea. Ugh. Huh? Why are you serving the tea? The maid, Percicom, is currently being used by the system. What are you talking about? Useuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord, Minoro. Thank you very much. I can handle this on my own. I want you to try to relax. Oh. Minoro! Minoro! Minoro! Well, I'm glad to know that all we really need is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motu, so it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! Huh? There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoro under such extreme stress. If only I was just more useful. Let's use a key.

Tons of errors, not particularly usable.

Output (tiny.en):
> Yuzuki! I brought you some tea. Oh! Why are you serving the tea? The maid purse-a-com is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord, me no no. Thank you very much. I can handle this on my own. I want you to try to relax. Oh. Do you know who? What do you know her? What do you know her? Well, I'm glad to know that all he really needs is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motusua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minaro under such extreme stress. If only I was just more useful. Oh, Miss Yuzuki.

>Lord, me no no.
Japanese names and words confuse it. "Minoru" became "Do you know who?" and "Lord Minoru" became "What do you know her?", but it does decently on English: it got "all he really needs" right but flubbed "but to be so exhausted" as "but it'd be so exhausted". Interestingly it got "Motusua" right, the way she said it.

Output (base.en):
> Yuzuki! I brought you some tea. Ugh! What? Why are you serving the tea? The maid-pursa-com is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minaro, thank you very much. I can handle this on my own. I want you to try to relax. Oh. Minaro! Lord Minaro! Lord Minaro! Well, I'm glad to know that Allie really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motusua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minaro under such extreme stress. If only I was just more useful. Miss Yuzuki.

This one really messed up "all he really needs" as "Allie really needs" and interpreted "Minoru" as the name "Minaro". It also got "but to be so exhausted" right. Mistook "ugh" as "oh".

Output (small.en):
> Yuzuki! I brought you some tea. Ah! Huh? Why are you serving the tea? The maid persicum is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki.

>persicum
This one got Minoru spelled right, along with "all he really needs" and "but to be so exhausted". Omitted "ugh".

Output (medium.en):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he'd just collapse like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki?

Omitted "ah! huh?" and "ugh" but otherwise good. Overall, from just this sample, I think base is the best for English and tiny.en the best on CPU. The quality improvements from small and medium aren't really worth the slowdown in speed, and the base.en model doesn't seem particularly robust. If going for a larger model, small.en seems better than small.
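On the speaker-detection idea above: Whisper's encoder features can at least be pooled into a crude utterance embedding and compared across clips. A rough sketch using the openai/whisper API (file names are placeholders, and no promises the encoder separates speakers well, since it wasn't trained for that):

import torch
import whisper

model = whisper.load_model("base", device="cuda")

def encoder_embedding(path):
    audio = whisper.pad_or_trim(whisper.load_audio(path))     # 30 s window at 16 kHz
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    with torch.no_grad():
        feats = model.encoder(mel.unsqueeze(0))               # (1, frames, dim) encoded audio features
    return feats.mean(dim=1).squeeze(0)                       # mean-pool into one vector

# cosine similarity between two clips as a toy same-speaker score
a = encoder_embedding("clip_a.wav")
b = encoder_embedding("clip_b.wav")
print(torch.cosine_similarity(a, b, dim=0).item())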
Open file (5.94 MB 646x360 sample.webm)
Open file (75.53 KB 607x522 whisper.png)
>>17474 Holy shit, I just found out the other day Whisper has a translate feature and gave it a go tonight. It works amazingly well with the medium-size model.
>[00:00.000 --> 00:02.920] The strongest warrior of Vesper, Mahoro.
>[00:02.920 --> 00:06.720] Thank you for fighting with me until today.
>[00:06.720 --> 00:11.600] I'm sure you already know, but you have only a few lives left.
>[00:11.600 --> 00:17.600] If you continue to fight as a warrior, you will only have 37 days to move.
>[00:17.600 --> 00:25.600] However, if you release your armament, you will still be able to move for 398 days, according to the report.
>[00:25.600 --> 00:30.800] Mahoro, you've done enough for us Vesper.
>[00:30.800 --> 00:37.200] If you have a wish that you want to grant, why don't you live the rest of your time freely?
>[00:37.200 --> 00:41.000] Huh? Um...
>[00:41.000 --> 00:46.000] Now, choose whichever path you like.
>[00:48.000 --> 00:49.800] My wish...
>[00:49.800 --> 00:54.400] My last wish is...
I imagine finetuning the model on English and Japanese voices and teaching it to predict not only the text but also the emotion, tone and speaker by attaching Tacotron to the decoder. Then the translate feature could be used to auto-dub anime in the same voice and emotion but in English. The decoder of Whisper could also be used to predict style embeddings (the emotion and tone) from text to feed into Tacotron to synthesize much more natural-sounding speech, and the more context you give it, the more accurate it would be.
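For anyone wanting to replicate this, translation is just one extra argument to the same transcribe call (the file name here is hypothetical, and a multilingual model is required; the *.en models can't translate):

import whisper

model = whisper.load_model("medium", device="cuda")
# task="translate" makes Whisper emit English text for foreign-language audio
result = model.transcribe("mahoro_clip.mp3", task="translate")
for seg in result["segments"]:
    # print in the same [start --> end] style as the output above
    print(f"[{seg['start']:.3f} --> {seg['end']:.3f}] {seg['text'].strip()}")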
>>18253 Amazing. Please do this for us, Anon! If you can get meta-context encodings, then we can use this nearly directly for our (robo)waifus. Also, Mahoro: Based/10 choice. :^)
>>18253 Thanks, I plan to use Whisper soon. I've read it has problems with mixing languages, so if you encounter problems it might come from that.
Open file (128.51 KB 1078x638 ClipboardImage.png)
Microsoft one-shot voice training.
https://valle-demo.github.io/
Give it Chii's voice and it will probably sound like Chii.
(related crosspost) >>18628 >>18605
>>18628 Reading the comments section is predictable tbh. https ://www.foxnews.com/tech/new-ai-simulate-voice-3-seconds-audio >=== -disable hotlink
Edited last time by Chobitsu on 01/12/2023 (Thu) 08:08:59.
Our neighbors at /cyber/ mentioned this one. >Prime Voice AI >"The most realistic and versatile AI speech software, ever. Eleven brings the most compelling, rich and lifelike voices to creators and publishers seeking the ultimate tools for storytelling." https://beta.elevenlabs.io/
> The scope of OpenUtau includes:
> - Modern user experience.
> - Selected compatibility with UTAU technologies.
> - OpenUtau aims to solve problems in less laborious ways, so don't expect it to replicate exact UTAU features.
> - Extensible realtime phonetics (VCV, CVVC, Arpasing) intelligence.
> - English, Japanese, Chinese, Korean, Russian and more.
> - Internationalization, including UI translation and file system encoding support.
> - No, you don't need to change system locale to use OpenUtau.
> - Smooth preview/rendering experience.
> - An easy-to-use plugin system.
> - An efficient resampling engine interface.
> - Compatible with most UTAU resamplers.
> - A Windows and a macOS version.
>The scope of OpenUtau does not include:
> - Full-featured digital music workstation.
> - OpenUtau does not strive for Vocaloid compatibility, other than limited features.
https://github.com/stakira/OpenUtau
>This repo/rentry aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows/Linux, and as a stepping stone for anons that genuinely want to play around with TorToiSe.
https://git.ecker.tech/mrq/ai-voice-cloning
>>22538 Lol. Just to let you know Anon, we're primarily a SFW board. You might try /robo/. Cheers. :^)
>>22538 What is this? From the ...engine where the dev doesn't want to be mentioned here?
I just finished my demonstration of talking to the waifu AI: https://youtu.be/jjvbENaiDXc
>Whisper-based Real-time Speech Recognition https://www.unrealengine.com/marketplace/en-US/product/d293a6a427c94831888ca0f47bc5939b Just want to show this here after finding it. Something like this would be useful if one wanted to use UnrealEngine for a virtual waifu or some kind of a virtual training environment.
>>23538 I'm sure there's some kind of netcode in Unreal you can use to hook up a transcription API of your choice and save yourself the $99.
>virtual waifu
Real-life robotic waifu.
>>23558
>Whisper C++
>Beta: v1.4.2 / Stable: v1.2.1 / Roadmap | F.A.Q.
>High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:
>Plain C/C++ implementation without dependencies
>Apple silicon first-class citizen - optimized via ARM NEON, Accelerate framework and Core ML
>AVX intrinsics support for x86 architectures
>VSX intrinsics support for POWER architectures
>Mixed F16 / F32 precision
>4-bit and 5-bit integer quantization support
>Low memory usage (Flash Attention)
>Zero memory allocations at runtime
>Runs on the CPU
>Partial GPU support for NVIDIA via cuBLAS
>Partial OpenCL GPU support via CLBlast
>BLAS CPU support via OpenBLAS
>C-style API
Thanks, that might come in handy. There seems to be enough GPU support, even though it's built to run on the CPU. I'm still thinking of building a dedicated server at some point, using the Arc380 (70W).
>large: 2.9 GB disk / ~3.3 GB memory
The original implementation needs 10 GB or more for the large model, which would rather suggest getting a 3060 (170W). Many things will work fine with smaller models anyway.
>>23558 Thanks for the reminder Anon. That anon's work is really quite excellent tbh.
>>23558 >>23561 This guy (a bit hard to understand) here https://www.youtube.com/watch?v=75H12lYz0Lo tests it on a Raspberry Pi, and it actually works surprisingly fast! He keeps getting smaller and smaller with his optimizations. I'll keep an eye on that.
>>23579 AWS Transcribe costs 3 cents per minute, and you want to rent a server to run that thing, which probably requires multiple GPUs. Doesn't make any sense.
>>23591
>Whisper vs AWS Transcribe
This is about running it at home. The tiny model works on a Raspberry Pi, and the large one maybe on a 4GB GPU, certainly on a 6GB GPU (like the Arc380, which uses 70W). Do as you wish, but the general notion here is that we want our waifus to be independent from the internet. Some might even say not connected to it at all. Using online services for something as fundamental as speech recognition (transcription), especially beyond development, is a special case and won't be recommended.
>>23535 That took quite a while and was more productive than whatever the heck Kiwi is doing. I'm going to start using a name tag so I can get some proper recognition for what I've done so far: trying to make a HASEL actuator, this, buying supplies, reading up on electronics, testing the Arduino, and soon making a 3D anime girl doll from scratch. I'm really about to leave this place 'cause this is bullshit.
>>23634 peteblank is an anagram for "pleb taken"
>>23590 Wow. That's most excellent.
>>23634 It's good that you did something during the last few months, but don't exaggerate. You had some advice from other anons here when trying to make the HASEL actuator. You also bring this kind of vitriol with you, bashing someone or this board in way too many comments.
>3d anime girl doll from scratch
I'm looking forward to seeing that.
>I'm really about to leave this place
You don't need to hang out here every day. Work on your project and report back later.
>>23640 I am right to be upset at Kiwi, since he's attacking my character for no reason. I told him I was planning to do this for profit if possible. I emailed the guy who made the 3D model asking for permission, and then he turns around and claims I want to steal other people's stuff.
>>23634
>I'm going to start using a name tag so I can get some proper recognition for what I've done so far.
Good thinking, Anon. That's not really why we use names here. Watch the movie 50 First Dates to understand the actual reason.
>>23643 I deleted my original post here but forgot to copy it; I just wanted to post the new link to the related post. Well... Related: >>23682
This thread is about speech synthesis and maybe recognition, not about 3D models. You can crosslink posts like above.
>our research team kept seeing new voice conversion methods getting more complex and becoming harder to reproduce. So, we tried to see if we could make a top-tier voice conversion model that was extremely simple. So, we made kNN-VC, where our entire conversion model is just k-nearest neighbors regression on WavLM features. And, it turns out, this does as well if not better than very complex any-to-any voice conversion methods. What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages. https://bshall.github.io/knn-vc https://arxiv.org/abs/2305.18975
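The matching step really is that simple. A minimal sketch of it, assuming you've already extracted WavLM features for the source utterance and the reference speaker (the actual repo then vocodes the matched features back to audio with a HiFi-GAN; tensor names here are illustrative):

import torch

def knn_vc_match(src_feats, ref_feats, k=4):
    # src_feats: (T_src, D) WavLM frames of the source utterance
    # ref_feats: (T_ref, D) WavLM frames of the reference speaker
    dists = torch.cdist(src_feats, ref_feats)      # pairwise distances, (T_src, T_ref)
    knn = dists.topk(k, largest=False).indices     # k nearest reference frames per source frame
    return ref_feats[knn].mean(dim=1)              # average them -> "converted" features, (T_src, D)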
>>23736 >What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages. Lol. That seems a little bizarre to think through. Thanks Anon. >ps. I edited the subject ITT, thanks for pointing that out NoidoDev.
We should think about optimizations of speech recognition (synthesis needs its own approach):
- there are FPGA SBCs which you can train to react to certain words, then output a text or trigger something
- instead of recording a 30s sentence, record much shorter chunks but continue directly after the first one; check the parts, but also glue them together and send the whole sentence to the speech recognition model
- maybe using a language model to anticipate what might be said, based on parts of a sentence, especially with some context, e.g. pointing at something
- finding ways to detect made-up words
- constructing words out of syllables instead of just jumping to what could have been meant, using that for parts of a sentence where the speech recognition model is uncertain
- using the certainty values of speech recognition to look for errors (misunderstandings), maybe using the syllable construction, wordlists and lists of names for that (see the sketch after this list)
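On the last point, Whisper already exposes per-segment certainty signals that could gate those fallback strategies. A small sketch reading them out (the input file name is a placeholder; avg_logprob and no_speech_prob are fields of the openai/whisper transcribe output, and the thresholds are just guesses):

import whisper

model = whisper.load_model("base")
result = model.transcribe("sample.wav")
for seg in result["segments"]:
    # avg_logprob closer to 0 is better; high no_speech_prob means it may be noise, not speech
    suspicious = seg["avg_logprob"] < -1.0 or seg["no_speech_prob"] > 0.5
    flag = " <-- recheck" if suspicious else ""
    print(f"[{seg['start']:6.2f}-{seg['end']:6.2f}] {seg['text'].strip()}{flag}")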
>>24951 >- maybe using an language model for anticipation of what might be said, while using parts of a sentence, especially with some context e.g. pointing at something I would anticipate this should at the least provide greater odds of a coherent parse (particularly in a noisy environment) than just STT alone. Good thinking Anon.
Open file (50.97 KB 768x384 vallex_framework.jpg)
Related: >>25073
>VALL-E X is an amazing multilingual text-to-speech (TTS) model proposed by Microsoft. While Microsoft initially published it in their research paper, they did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge to reproduce the results and train our own model. We are glad to share our trained VALL-E X model with the community, allowing everyone to experience the power of next-generation TTS.
https://github.com/Plachtaa/VALL-E-X
https://huggingface.co/spaces/Plachta/VALL-E-X
>>25075 Also worth noting: it's broken if you launch it through the "python -X utf8 launch-ui.py" command and let it download the "vallex-checkpoint.pt" and Whisper "medium.pt" models on its own. Very weird, as it's already solved here: https://github.com/Plachtaa/VALL-E-X#install-with-pip-recommended-with-python-310-cuda-117--120-pytorch-20
Download them manually; that's it.
>>25075 >>25096 Thanks. This will be very useful.
Open file (107.39 KB 608x783 Screenshot_136.png)
There's some excitement around a Discord server being removed, which was working on AI voice models. We might not even have known about it (I didn't), but here's the website:
https://voice-models.com
https://docs.google.com/spreadsheets/d/1tAUaQrEHYgRsm1Lvrnj14HFHDwJWl0Bd9x0QePewNco/edit#gid=1227575351
and weights.gg (not voice models)
>AI Hub discord just got removed from my server list
But it seems to be only a fraction of the models. Some mention a backup, IIRC:
https://www.reddit.com/r/generativeAI/comments/16zzuh4/ai_hub_discord_just_got_removed_from_my_server/
>>25805 >I WARNED YOU ABOUT THE DOXXCORD STAIRS BRO Save.everything. Doxxcord is even more deeply-controlled than G*ogle is. DMCAs don't result in a forum getting disappear'd.
>Otamatone https://youtu.be/Y_ILdh1K0Fk Found here, related: >>25273
>>25876 Had no idea that was a real thing NoidoDev, thanks! Any chance it's opensauce?
>>25893 The original belongs to a corporation, but if you look for "Otamatone DIY" you can find some variants.
>>25909 Cool. Thank you NoidoDev! :^)
>>17474 Can we get this with timestamps? So we can use it for voice training (text-to-speech).
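Whisper already returns segment-level timestamps with every transcription, which is enough to chop a long recording into clip/text pairs for TTS training. A minimal sketch (file name reused from the quickstart above; soundfile is assumed for writing the clips):

import soundfile as sf
import whisper

model = whisper.load_model("base.en")
audio_path = "chobits_sample.mp3"
result = model.transcribe(audio_path)

audio = whisper.load_audio(audio_path)  # float32 mono at 16 kHz
for i, seg in enumerate(result["segments"]):
    start, end = int(seg["start"] * 16000), int(seg["end"] * 16000)
    sf.write(f"clip_{i:04d}.wav", audio[start:end], 16000)   # one wav per segment
    print(f"clip_{i:04d}.wav\t{seg['text'].strip()}")        # filename/transcript pairs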
>ⓍTTS is a Voice generation model that lets you clone voices into different languages by using just a quick 6-second audio clip. There is no need for an excessive amount of training data that spans countless hours. https://huggingface.co/coqui/XTTS-v2 (only non-commercial licence) Testing Space: https://huggingface.co/spaces/coqui/voice-chat-with-mistral Via https://www.reddit.com/r/LocalLLaMA/comments/17yzr6l/coquiai_ttsv2_is_so_cool/ (seems to be much closer to the ElevenLabs quality)
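Basic usage goes through the Coqui TTS package; a short sketch (file paths are placeholders, and the model name is the one from the card above, so double-check it against the current docs):

from TTS.api import TTS

# XTTS-v2 clones the voice in speaker_wav from a ~6-second sample
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Hello Anon, how was your day?",
    speaker_wav="reference_6s.wav",   # placeholder clip of the target voice
    language="en",
    file_path="output.wav",
)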
>>26511 Also this one: https://github.com/yl4579/StyleTTS2
Some people claim it's 100x faster than Coqui's XTTS. Still no webui though :(
>>26512 Thanks, I saw this mentioned but forgot to look it up.
>>26512 Tested it locally on an RTX 3070. Works fast as fuck. https://files.catbox.moe/ow0ryz.mp4
>>26535 >>26566 Thanks Anons. :^)
