/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.

Speech Synthesis general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us right?



The Taco Tron project:


No code available yet, hopefully they will release it.

Facebook made a great speech generator about a year ago: https://ai.facebook.com/blog/a-highly-efficient-real-time-text-to-speech-system-deployed-on-cpus/ - It's not free software, but they described how it is built. Yannic Kilcher goes through the system and explains it here: https://www.youtube.com/watch?v=XvDzZwoQFcU One interesting feature is that it runs on a 4-core CPU (inference only, not the training of course). On such a CPU it is faster than real-time, meaning it generates the audio faster than the audio takes to play back. Something like this might be something very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu.
>>10393 >Something like this might be something very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu. It certainly would, if we can somehow obtain access to it or reproduce it, Anon. Thanks for the heads-up, and for the video link. It really helps to get the points across well for us. youtube-dl --write-description --write-auto-sub --sub-lang="en" https://www.youtube.com/watch?v=XvDzZwoQFcU
Open file (111.71 KB 286x286 speechbrain.png)
not sure if this has been posted before, but I came across this and immediately thought of some of the todo list for clipchan. https://speechbrain.github.io/index.html seems like there was some discussion about emotion and speaker ID classifiers.
>>10458 Very cool Anon, thanks. It looks like it's a solid and open source system too, AFAICT.
The model link is dead. While I can train a new model, I am looking to avoid that step right now because of other deadlines, though I would love to include 2B in WaifuEngine. Would anyone be willing to mirror or provide an updated link? Thanks
>>10499 ATTENTION ROBOWAIFUDEV I'm pretty sure the model in question is your pre-trained one for 2B's WaifuSynth voice, ie, https://anonfiles.com/Hbe661i3p0/2b_v1_pt >via https://gitlab.com/robowaifudev/waifusynth cf. (>>10498, >>10502)
>>10504 Links are both dead
>>10504 To clarify: the pretrained model links are both dead; the repo is still up.
Open file (14.93 KB 480x360 hqdefault.jpg)
Great! Now my waifu can sing a lullaby for me to sleep well. The only problem is that I don't have the Vocaloid editor. Video demonstration: https://youtu.be/mxqcCDOzUpk Github: https://github.com/vanstorm9/AI-Vocaloid-Kit-V2
Open file (548.83 KB 720x540 lime_face_joy.png)
>>5521 >Cute robowaifu Check >Inspiring message to all weebineers everywhere Check >Epic music Check Best propaganda campaign. 10/10, would build robowaifu. >>5529 >>5530 Damn it lads! You're bringing me closer to starting sampling Lime's VA heheheh (Although I was hoping to use my voice to generate a somewhat convincing robowaifu, so as to minimise reliance on females).
>>11229 Forgot to add. >>5532 >I don't know if that'll be enough. Chii didn't really talk much. You're overcomplicating it. I think he meant create a tts that outputs "Chii" regardless of what you put in ;) (Although you could add different tonality and accents, might be a more fun challenge).
>>10504 Sorry, been busy and haven't been active here lately. Updated the repo link: https://www.mediafire.com/file/vjz09k062m02qpi/2b_v1.pt/file This model could be improved by training it without the pain sound effects. There's so many of them it biased the model which causes strange results sometimes when sentences start with A or H.
>>11474 Thanks! Wonderful to see you, I hope all your endeavors are going well Anon.
>>11474 come join my doxcord server if you have time and pm me! thanks for the model, you will likely see it used on the 2B "cosplay" waifu, we may have in the game
>>11480 The link is expired. What would you like to talk about? I don't have a lot to add. You can do some pretty interesting stuff with voice synthesis by adding other embeddings to the input embedding, such as for the character in a multi-character model, emphasis, emotion, pitch, speed, and ambiance (to utilize training samples with background noise.) This is what Replica Studios has been doing: https://replicastudios.com/
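The idea of adding other embeddings to the input embedding can be sketched in a few lines of plain Python. This is only an illustration: the table names, the 4-dimensional vectors, and their values are made up, and a real model would learn these vectors and operate on tensors.

```python
# Hypothetical 4-dimensional embedding tables; a real model learns these.
CHARACTER_EMB = {"2b": [0.1, 0.0, -0.2, 0.3], "chii": [-0.3, 0.2, 0.1, 0.0]}
EMOTION_EMB = {"neutral": [0.0, 0.0, 0.0, 0.0], "happy": [0.2, 0.1, 0.0, -0.1]}


def condition(token_embs, character, emotion):
    """Add the character and emotion vectors to every token embedding."""
    char_vec = CHARACTER_EMB[character]
    emo_vec = EMOTION_EMB[emotion]
    return [
        [t + c + e for t, c, e in zip(tok, char_vec, emo_vec)]
        for tok in token_embs
    ]


# Two toy token embeddings standing in for the encoded input text.
tokens = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
conditioned = condition(tokens, "2b", "happy")
```

Because the extra vectors are simply summed in, further controls such as pitch or speed could be added the same way without changing the shape of the input.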
>>11522 If you are interested, I am looking for someone to take over the speech synthesis part of WaifuEngine. I got it working, but working on it as a specialty takes me away from the rest of the application. For example, I want to train a new model using glowtts, but my time is limited and I also have to work on the various other aspects of the project to get it off the ground. Right now our inference time using tacotron2 isn't great unless you have a GPU. As for compensation on the project, so far I have been giving away coffee money as we have few resources haha. If the project gets bigger and gets more funding, I'd be willing to help the project contributors out. https:// discord.gg/ gBKGNJrev4
>>11536 In August I'll have some time to work on TTS stuff and do some R&D. I recommend using FastPitch. It's just as good as Tacotron2 but 15x faster on the GPU and 2x faster on the CPU than Tacotron2 is on the GPU. It takes about a week to train on a toaster card and also already has stuff for detecting and changing the pitch and speed, which is essential to control for producing more expressive voices with extra input embeddings. https://fastpitch.github.io/
>>11550 I'd message you on Discord about this, but this could be useful info for the board. Essentially, I did use FastPitch originally. The issue is the teacher-student training methodology: you have to use Tacotron to bootstrap and predict durations for alignment. When you don't do that and just fine-tune the LJS FastPitch model, it fails to predict the durations. We can definitely try this method, I am open to it; I guess in my time crunch I didn't bother. I am optimizing for delivery so that we have a product people can use and enjoy. It should be very simple to update the models in the future: with my architecture it would be a change to one Python script.
>>11559 The 2B model I made was finetuned on the pretrained Tacotron2 model and only took about an hour. Automating preprocessing the training data won't be a big deal. And if a multi-speaker model is built for many different characters it would get faster and faster to finetune. I've been looking into Glow-TTS more and the automated duration and pitch prediction is a nice feature but the output quality seems even less expressive than Tacotron2. A key part of creating a cute female voice is having a large range in pitch variation. Also I've found a pretrained Tacotron2 model that uses IPA. It would be possible to train it on Japanese voices and make them talk in English, although it would take some extra time to adapt FastPitch to use IPA. Demo: https://stefantaubert.github.io/tacotron2/ GitHub: https://github.com/stefantaubert/tacotron2
Some other ideas I'd like to R&D for voice synthesis in the future: - anti-aliasing ReLUs or replacing them with swish - adding gated linear units - replacing the convolution layers with deeper residual layers - trying a 2-layer LSTM in Tacotron2 - adding ReZero to the FastPitch transformers so they can be deeper and train faster - training with different hyperparameters to improve the quality - using RL and human feedback to improve the quality - using GANs to refine output like HiFiSinger - outputting at a higher resolution and downsampling
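Two of the ideas in the list above, swish and ReZero, are small enough to sketch in plain Python. This is a toy illustration on lists of floats, not the FastPitch or Tacotron2 code; the `ReZeroBlock` class name and the lambda sublayer are assumptions for the example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    return max(0.0, x)


def swish(x):
    """Swish (SiLU): x * sigmoid(x), a smooth alternative to ReLU."""
    return x * sigmoid(x)


class ReZeroBlock:
    """Residual block y = x + alpha * f(x), with alpha starting at zero."""

    def __init__(self, sublayer):
        self.sublayer = sublayer
        # alpha would be a learned scalar; at zero the block is the identity,
        # which is what lets deep stacks start training stably.
        self.alpha = 0.0

    def __call__(self, xs):
        return [x + self.alpha * fx for x, fx in zip(xs, self.sublayer(xs))]


block = ReZeroBlock(lambda xs: [swish(x) for x in xs])
untrained_output = block([1.0, -2.0])  # identity while alpha is still 0.0
```

Unlike ReLU, swish passes a small negative value for negative inputs instead of clipping them to zero.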
>>11569 Thanks, but what's the point of this IPA? To let it talk correctly in other languages? >Der Nordwind und die Sonne - German with American English accent I can assure you: It doesn't work. Americans talking German often (always) sounds bad, but this is a level of its own. Absolutely bizarre.
>>11571 Yeah, I live around Chinese with thick accents and this takes it to the next level, kek. That's not really the motivation for using IPA though. This pilot study used transfer learning to intentionally create different accents, rather than copy the voice without the accent. How IPA is useful to generating waifu voices is it helps improve pronunciation, reduce needed training data, and solves the problem with heteronyms, words spelled the same but pronounced differently: https://jakubmarian.com/english-words-spelled-the-same-but-pronounced-differently/ When models without IPA have never seen a rare word in training, such as a technical word like synthesis, they will usually guess incorrectly how to pronounce it, but with IPA the pronunciation is always the same and it can speak the word fluently without ever having seen it before. Also in a multi-speaker model you can blend between speaker embeddings to create a new voice and it's possible to find interpretable directions in latent space. Finding one for accents should be possible, which could be left in control to the user's preferences to make a character voice sound more American, British or Japanese and so on.
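Blending speaker embeddings and moving along a direction in latent space both come down to simple vector arithmetic. A minimal sketch in plain Python, assuming the embeddings are equal-length lists of floats; the `accent_direction` vector is purely hypothetical, since finding such a direction is the open part of the idea.

```python
def blend(emb_a, emb_b, weight):
    """Linear interpolation: weight=0 gives emb_a, weight=1 gives emb_b."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(emb_a, emb_b)]


def shift(emb, direction, amount):
    """Move an embedding along a direction, e.g. a hypothetical accent axis."""
    return [e + amount * d for e, d in zip(emb, direction)]


speaker_a = [0.0, 2.0]
speaker_b = [2.0, 0.0]
new_voice = blend(speaker_a, speaker_b, 0.5)

accent_direction = [1.0, 0.0]  # made-up axis for illustration only
more_accented = shift(new_voice, accent_direction, 0.5)
```

The `amount` value is the kind of knob that could be left to the user's preferences.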
>>11577 Ah, okay, this sounds pretty useful. One more problem comes to mind in regards to this. In English foreign names are often changed in pronunciation, because the name would sound "strange" otherwise. The philosopher Kant would sound like the c-word for female private parts. Therefore they pronounce it Kaant. I wonder if the method helps with that as well.
>>11582 In that case it depends what language you transliterate with. If necessary names could be transliterated as they're supposed to be pronounced in their original language, or it could all be in the same language. Exceptions could also be defined. For example, the way Americans pronounce manga is quite different from the Japanese. If someone wants their waifu to sound more like a weeb and pronounce it the Japanese way, they could enter the Japanese IPA definition for it to override the default transliteration.
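A user-defined exception list like the one described above could be a plain dictionary layered over the default lexicon. This is a sketch only: the `DEFAULT_LEXICON` contents and the `transliterate` helper are hypothetical, the IPA strings are approximations, and a real system would fall back to a grapheme-to-phoneme model rather than returning unknown words unchanged.

```python
# Hypothetical default lexicon; IPA strings are approximate.
DEFAULT_LEXICON = {
    "manga": "ˈmæŋɡə",        # roughly how many Americans say it
    "synthesis": "ˈsɪnθəsɪs",
}


def transliterate(words, overrides=None):
    """Look up each word, letting user overrides win over the defaults."""
    lexicon = {**DEFAULT_LEXICON, **(overrides or {})}
    # Unknown words pass through unchanged in this toy version.
    return [lexicon.get(word.lower(), word) for word in words]


default_ipa = transliterate(["manga"])
weeb_ipa = transliterate(["manga"], overrides={"manga": "maŋɡa"})
```

Keeping the overrides separate from the defaults means a user's preferences survive an update to the base lexicon.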
Open file (18.23 KB 575x368 preview.png)
Open file (62.21 KB 912x423 aegisub.png)
Finished creating a tool for automatically downloading subtitles and audio clips from Youtube videos, which can be reworked in Aegisub or another subtitle editor, then converted into a training set with Clipchan. https://gitlab.com/robowaifudev/alisub
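For anyone curious what turning subtitles into a training set involves, here is a rough sketch of the first step: reading subtitle cues into (start, end, text) tuples that could then drive audio clipping. It assumes SRT-style timestamps and is not taken from alisub or Clipchan; `parse_cues` and `to_seconds` are made-up helper names.

```python
import re

# Matches "00:00:01,000 --> 00:00:03,500" (comma or dot before milliseconds).
CUE_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def parse_cues(srt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIME.search(line)
            if match:
                parts = match.groups()
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append((to_seconds(*parts[:4]), to_seconds(*parts[4:]), text))
                break
    return cues


SAMPLE = """1
00:00:01,000 --> 00:00:03,500
We want our robowaifus to speak to us, right?

2
00:00:04,000 --> 00:00:05,250
Yes.
"""

cues = parse_cues(SAMPLE)
```

Each tuple gives the time span to cut from the audio and the transcript to pair with it.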
>>11623 This sounds exciting Anon, thanks! >or another subtitle editor Can you recommend a good alternative Anon? I've never been able to successfully get Aegisub to run.
>>11624 Someone recommended SubtitleEdit but it's Windows only: https://nikse.dk/SubtitleEdit Subtitle Editor can display waveforms but it's far more difficult to use and I don't recommend it.
>>11623 Okay, thanks. This could be useful for more, I guess. Maybe later to train the system on lip reading using YouTube, for example. Or maybe for training voice recognition in the first place? How much data do we need to emulate a particular voice?
>>11625 OK, thanks for the advice. I'll try and see if I can set it up on a virtual box instead or something, Aegisub did look pretty easy to use (first time I've seen it in action, so thanks again). The problem is always a wxWidgets dependency hell issue. I can even get it to build, right up to link time.
>>11631 Finetuning a pretrained model you need about 20 minutes. Training a model from scratch takes about 12 hours. Multispeaker models trained on hundreds of voices can clone a voice with a few sentences but still need a lot of samples to capture all the nuances.
Been doing some work to get WaifuEngine's speech synthesis to run fast on the CPU and found that FastPitch has a real-time factor of 40x and WaveGlow 0.4x. This led me to testing several different vocoder alternatives to WaveGlow and arriving at multi-band MelGAN with an RTF of 20x. So FastPitch+MelGAN has an RTF of 12x, which means it can synthesize 12 seconds of speech every second, or about 80ms to generate a second of speech. "Advancing robotics to a point where anime catgirl meidos in tiny miniskirts are a reality" took MelGAN 250ms on CPU to generate from 2B's Tacotron2 Mel spectrogram. Now I just gotta set up this shit so it's easy to train end-to-end and the whole internet and their waifus are getting real-time waifus. Multi-band MelGAN repo: https://github.com/rishikksh20/melgan Multi-band MelGAN paper: https://arxiv.org/abs/2005.05106 Original MelGAN paper: https://arxiv.org/abs/1910.06711
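The arithmetic behind those figures can be checked with a few lines of Python. Here RTF is used the way the post uses it, as seconds of audio produced per second of compute, and the stages are assumed to run one after another with no other overhead; `combined_rtf` is a made-up helper name.

```python
def combined_rtf(*stage_rtfs):
    """Combined speed-up of stages run back to back.

    Each stage costs 1/rtf seconds of compute per second of audio,
    and those costs add up.
    """
    seconds_per_audio_second = sum(1.0 / rtf for rtf in stage_rtfs)
    return 1.0 / seconds_per_audio_second


fastpitch_melgan = combined_rtf(40.0, 20.0)    # about 13.3x in the ideal case
fastpitch_waveglow = combined_rtf(40.0, 0.4)   # about 0.4x, slower than real time
ms_per_audio_second_at_12x = 1000.0 / 12.0     # about 83 ms
```

The ideal figure of about 13.3x is close to the reported 12x; the gap is presumably overhead outside the two models. With WaveGlow the vocoder dominates, so the pipeline stays below real time no matter how fast FastPitch is.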
>>11636 Interesting, thanks, but I meant how many samples we need to fine-tune a voice. I also wonder if voices are being 'blended' that way. Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress. >>11647 Thanks for your work. I thought voice generation would take much more time to do. Good to know. Responses to someone talking should be fast.
Open file (153.25 KB 710x710 gawr kilcher.jpg)
>>11648 I meant 20 minutes and 12 hours of samples. Finetuning with 20 minutes of samples takes about 1-2 hours on my budget GPU. >Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress. This definitely deserves more thought. If every person on the internet will be able to do speech synthesis and there is a tsunami of voice cloning characters, it's important people are able to have creative freedom with it while the buzz is on. People's curiosity will further advance speech synthesis and diffuse into other areas of AI, including waifu tech. On the other hand if people only straight up copy voices then it would cause a media shitstorm and possibly turn people away, but that could also have its benefits. Whatever happens though the accelerator is stuck to the floor. In the meantime while the hype builds, iteration can continue on until the synthesis of Gawr Kilcher is realized. When people look closely though they'll notice it's neither Yannic nor Gura but actually Rimuru and Stunk all along.
>>11647 Thanks for the information, Anon.
>>11650 kek. i just noticed that logo. i wonder what based-boomer AJ would think of robowaifus. white race genocide, or crushing blow to feminazis and freedom to all men from oppression?
>>11677 He doesn't like them or AI in general. Said something once like people are going to stop having kids and masturbate with a piece of plastic all day and how the government is going to know everything about people through them and be able to manipulate them perfectly. He's not really wrong. Look how many people already give up all their data using Windows and Chrome.
>>8151 >>12193 A routine check on the Insights->Traffic page led me here. While the program itself is written with Qt, what actually makes the voices work (Voice.h and beyond) does not contain a single trace of Qt (well, almost, but what little there is is just error boxes). This is a deliberate design decision to allow the actual inference engine to be copied and ported anywhere with minimal trouble. For inference on embedded devices you probably want to use TFLite, which is on my list because I plan on Windows SAPI integration.
>>12257 Hello Anon, welcome. We're glad you're here. Thanks for any technical explanations, we have a number of engineers here. Please have a look around the board while you're here. If you have any questions, feel free to make a post on our current /meta thread (>>8492). If you decide you'd like to introduce yourself more fully, then we have an embassy thread for just that (>>2823). Regardless, thanks for stopping by!
In need of some help... I want to create a speech synthesizer, I want to take samples of my waifu's voices (which I have a lot of) and use it to digitally create her voice. First of all, is it possible? The voice samples I have are not the kind that this video shows https://youtu.be/_d7xRj121bs?t=55 , they're just in-game dialog. It is also worth noting that the voice is in Japanese. If it is possible, I still have no idea where to begin with this, I'm guessing I'll need some sound tech knowledge (which I have none of) and that's about all I can think of. In terms of programming languages, I know Python fairly well and am currently getting into C++. Anons, how do I get started with this?
>>13811 >I still have no idea where to begin with this Welcome. Then look through the thread and into the programs mentioned. You will probably need to train some neural network on a GPU. Also, you would need to extract the voices from the game and also have these words in text then. If you can't get them as files, then you might need to record them with a microphone. Then would need to transcribe the text. Lurk around, maybe someone else knows more, and just ignore the disgusting troll insulting everyone.
>>13811 As dumb as this might sound, you might want to check out /MLP/ on 4chan, there's a 100+ threads about doing this with My Little Pony characters called the "Pony Preservation Project" and they've actually made some decent progress.
>>13818 >look through the thread and into the programs mentioned Will do. >probably need to train some neural network on a GPU Have yet to get into neural networks but looks like the time has come. >extract the voices I've done that, from the game files too so they're of decently high quality. >transcribe the text That I need to do. >>13821 >as dumb as this might sound Nothing dumb about it if it works. Will give them a visit. Thank you, Anons!
>>13823 Keep us updated if there's progress anon, speech synthesis is a fascinating field. I'd love to try it out myself later in the year once I have more time
This may be old news, since it's from 2018, but Google's Duplex seems to have a great grasp on conversational speech. I think it says a lot when I had an easier time understanding the robot versus the lady at the restaurant (2nd audio example in the blog). https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html
>>14270 Hi, I knew that this has been mentioned before somewhere. Didn't find it here in this thread nor with Waifusearch. Anyways, it's in the wrong thread here, since this is about speech synthesis but the article is about speech recognition. The former conversation probably happened in the chatbot thread. >One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains. It cannot carry out general conversations. This is exactly the interesting topic of the article. Good reminder. A few months or a year ago I pointed out that recognizing all kinds of words, sentences and meanings will be one of our biggest challenges. Especially if it should work with all kinds of voices. Some specialists (CMU Sphinx) claimed it would currently require a server farm with terabytes of RAM to do that, if it was even possible. We'll probably need a way to work around that. Maybe using many constrained models on fast SSDs which take over, dependent on the topic of conversation. Let's also hope for some progress, but also accept that the first robowaifus might only understand certain commands.
>>11623 You should replace youtube-dl with yt-dlp. youtube-dl is no longer maintained and has issues with some YouTube videos.
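As a rough illustration of the swap, the youtube-dl command posted earlier in the thread could be run through yt-dlp from Python like this. The sketch assumes yt-dlp is installed and on the PATH; `build_command` and `download` are made-up helper names, and like the original command this fetches the video along with its description and auto-generated subtitles.

```python
import subprocess


def build_command(url, lang="en"):
    """Build the yt-dlp argument list for description + auto-generated subs."""
    return [
        "yt-dlp",
        "--write-description",
        "--write-auto-subs",
        "--sub-langs", lang,
        url,
    ]


def download(url, lang="en"):
    """Run yt-dlp; raises CalledProcessError if the download fails."""
    subprocess.run(build_command(url, lang), check=True)


command = build_command("https://www.youtube.com/watch?v=XvDzZwoQFcU")
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.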
>>15192 Thanks for the tip Anon. Having used youtube-dl for years now, I too noticed the sudden drop-off in updates that occurred following the coordinated attack by RIAA/Microsoft against its developer & user community. We'll look into it.
Open file (73.10 KB 862x622 IPA_synthesis.png)
I think I've finally figured out a way to train more expressive voices in conversation without having to label a ton of data. First, the English text needs to be transcribed into IPA so that a speech synthesis model can easily predict how words are spoken without requiring a huge dataset covering all the exceptions and weirdness of English. The English transcription or IPA is projected into an embedding that's split into two parts. One part is constrained to representing the content as IPA by projecting those features back into IPA symbols and minimizing the cross entropy loss. The other half models the style, such as the emotion and other subtleties, to match the audio examples more faithfully, and is trained through the Mel spectrogram loss. This way the model can learn all aspects of speech through just the text labels and audio examples alone. At inference time this style embedding could be modified to change the emotion, pitch, cadence, tone and other qualities of the model for voice acting or creating examples for finetuning the model towards a desired personality. A ByT5 model could be used to transcribe English and other languages into the IPA embedding + style embedding. It could also take into account the previous context of the conversation to generate a more appropriate style embedding for the speech synthesis model to work from. Training from context though will require new datasets from podcasts that have such context. I've collected some with existing transcripts and timestamps for this already. The transcripts just need to be accurately aligned to the audio clips for clipping, so it's not an infeasible project for one person to do. Other possibilities for this could be adding tags into the text training data that get filtered out from the content via the IPA cross entropy loss, ensuring the tags only affect the style embedding. You could indicate tempo, pitches, velocity and note values for singing which would be learned in the style embeddings.
It could also be used for annotating different moods or speaking styles such as whispering or yelling. There's a ton of possibilities here for more versatile speech synthesis and natural conversation.
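The split-embedding objective described in this post can be sketched in plain Python to make the pieces concrete. Everything here is an assumption for illustration: the `<whisper>`-style tag syntax, the half-and-half split, mean squared error as the Mel spectrogram loss, and the equal weighting of the two losses are not specified in the post.

```python
import math
import re

TAG = re.compile(r"<[^<>]+>")  # hypothetical tag syntax, e.g. <whisper>


def strip_tags(text):
    """Remove style tags so they never appear in the IPA content target."""
    return re.sub(r"\s+", " ", TAG.sub("", text)).strip()


def split_embedding(embedding):
    """First half carries content (IPA), second half carries style."""
    half = len(embedding) // 2
    return embedding[:half], embedding[half:]


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one IPA symbol, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target_index]


def total_loss(content_logits, ipa_targets, mel_pred, mel_true):
    """Content loss on the IPA symbols plus an assumed MSE Mel loss."""
    content = sum(
        cross_entropy(logits, target)
        for logits, target in zip(content_logits, ipa_targets)
    ) / len(ipa_targets)
    mel = sum((p - q) ** 2 for p, q in zip(mel_pred, mel_true)) / len(mel_true)
    return content + mel


content_half, style_half = split_embedding([0.1, 0.2, 0.3, 0.4])
clean_text = strip_tags("<whisper> good night </whisper>")
```

In a real model the content logits would come from projecting `content_half` back onto the IPA symbol set, so only the style half is free to absorb whatever the tags and the audio imply.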
