/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.

Speech Synthesis/Recognition general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us, right?
en.wikipedia.org/wiki/Speech_synthesis
https://archive.is/xxMI4
research.spa.aalto.fi/publications/theses/lemmetty_mst/contents.html
https://archive.is/nQ6yt
The Tacotron project:
arxiv.org/abs/1703.10135
google.github.io/tacotron/
https://archive.is/PzKZd
No code available yet; hopefully they will release it.
github.com/google/tacotron/tree/master/demos
https://archive.is/gfKpg
>=== -edit subject
Edited last time by Chobitsu on 07/02/2023 (Sun) 04:22:22.
>>199
This has chatter on HN right now.
news.ycombinator.com/item?id=13992454
Open file (15.15 KB 480x360 0.jpg)
>>199
The InMoov project is worth checking out. I believe they already have code available for voice recognition, text-to-speech and a bunch of other stuff. The only thing the bot is missing is legs, because they're still trying to figure out an affordable way to make them.

inmoov.fr/

https://www.invidio.us/watch?v=2sZOyCBbows
Edited last time by Chobitsu on 09/19/2019 (Thu) 12:29:40.
>>489
Thanks for the tip anon. If I find some good code or other cyber assets I find valuable I'll link them back here at some point.
>>489
That's a pretty cool robot anon.
>>497
>Do you think that is healthy?
I lost it.
Open file (377.66 KB 1280x1280 1572796434848-0.png)
>>199
There seems to be a project to make good My Little Pony synthesized voices.
https://clyp.it/r0yypquc?token=e11965be1b6dce146eb61702006c285e
https://mlpol.net/mlpol/res/249436.html#249436
https://boards.4channel.org/mlp/thread/34514258#p34514258
Their technology seems sound and the voice is good. They have assembled the files and resources for us to use and to train the talking bot, so we can probably use the same technology to synthesize more AI voices. If any of you guys would like to put Twilight Sparkle in your wAIfu, then this is a fantastic development. Even if you don't like MLP, these resources can turn a library of voice lines, such as from an anime or TV show, into a synthesized voice, which is pretty cool. Put your waifu into a wAIfu.
>>1563
Thanks for the tip anon, I'll try to check it out.
>>1563
>Cool. Good to know that soon we can make Sweetie Belle recite the 14 words.
I'm OK w/ this tbh.
>>1563
>know nothing about the project
>spend 30+ minutes going through the cuckchan thread b/c AI work
welp, i have to say, love mlp fags or hate them, the level of autism on display in their 'pony preservation project' is impressive.
>>1563
Ponies and cuckchan aside, I am impressed. I'll have to read about how intensive the training is. I'm very interested in trying it out for myself. I wonder what would happen if you tried using SHODAN's voice lines.

There are only about 26 minutes worth of audio from SS2. Does anyone know if that's sufficient for training, or is more needed?

https://www.youtube.com/watch?v=QGEjKoGHzO0
>>1570
I'll amend this post by saying that 26 minutes of audio is probably not sufficient. It sounds like there should be at least several hours for the best results. I think a better approach would be to train a neural network using voice clips from someone who sounds similar to SHODAN's original voice actress. The next step would be to create a program that takes voice audio and adds the distinctive audio "glitches" of SHODAN's voice. Then the voice clips generated by the NN could be fed through this program to "SHODANify" it. There might already be ways to do this quite easily with audio editing programs, I'm only thinking of creating an automated way to do it.
>>1571
>I'm only thinking of creating an automated way to do it.
Sounds like an interesting project idea. I'd imagine the original audio engineers for the game layered effects and filters in a traditional way. Figuring out both how to 'reverse engineer' the effect as well as how to automate it for the general case seems like an intricate process. Any plans to pursue this beyond conception stage?
>>1572
Right now, no. I do have an idea of exactly what kind of things I, or someone else who's interested, would need to do. Unfortunately, my daily schedule and life in general make it difficult to find time for this kind of project. (College student with a job and family, friends, other hobbies, etc.) normalfag-tier I know
However, I'll say this:
the more I think about this and type out my ideas, the more practical I think it is. It's just a matter of investing some time and effort.

I know that Audacity supports python scripting for automating stuff, but I would have to learn about using it to apply the actual effects. If I can't manually create a SHODAN-like audio clip using Audacity, I won't understand enough to automate the process. I already have a general idea of what kind of effects are needed (shifting pitch, timbre, layering with delay, stuttering, etc.) and listening to the audio clips from the game will help me refine the process. Also, depending on what kind of result I can get with Audacity, I may want to consider other audio editing programs. Audacity is just the one that comes to mind, being FOSS.

Once I understand what kind of effects actually go into the audio, and how to apply them to get the best result, then I can start to play around with a Python script. Fortunately I have experience with Python, and I think this would be fairly straightforward. I'd have to read Audacity's documentation on scripting, which may or may not be sparse. Another tricky part to this is applying the audio effects in a somewhat random way so that there's some variation to the resulting clip and so that multiple clips don't all sound alike. I think there should be some underlying logic to how the effects are applied, but it might take me some time to puzzle out the best strategy (and of course I could probably always find some way to improve upon it).

Getting audio clips to pass through the script would be fairly trivial, I think. For starters, I would probably just use a pre-trained NN, or train one on my own using an available dataset. In a perfect world, we could use a NN trained with Terri Brosius' voice. However, I don't think there's very much audio available. She's done voice acting for a variety of games, but I believe many hours' worth of audio are needed in order to effectively train a NN. Unless she happens to record a book on tape someday, I doubt that this will be possible/practical.

Question/request to any anon who's familiar with audio editing and manipulation, whether with Audacity or a different program:
Can you "SHODAN-ify" an audio clip by hand? And if so, can you teach me how?
If not, maybe you can point me towards some good resources to help me.
Also, any advice on creating scripts for Audacity (or a different program) would be welcome.

Question for anons who are more familiar with NN's for speech synthesis:
Would you reckon that we could train a NN with Terri Brosius' lines from other videogames, not just SS1/2?
If there's enough audio and interest in this idea, it would be a great benefit to have multiple anons working to transcribe and clip audio from different games. However, I wouldn't worry about this until after either me or some other anon can get a working SHODAN-ify script created.
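To make the "SHODANify" idea above a little more concrete, here's a minimal sketch of what such a randomized effect pass could look like. Purely illustrative: pydub is just one library choice, the effect parameters are guesses, and the filenames are hypothetical; the real chain would come out of the Audacity experiments described above.

import random
from pydub import AudioSegment   # assumption: pydub installed, ffmpeg available

def pitch_shift(seg, semitones):
    # Crude pitch shift: resample at a new frame rate, then restore the original rate.
    factor = 2.0 ** (semitones / 12.0)
    shifted = seg._spawn(seg.raw_data,
                         overrides={"frame_rate": int(seg.frame_rate * factor)})
    return shifted.set_frame_rate(seg.frame_rate)

def shodanify(in_path, out_path, layers=3):
    voice = AudioSegment.from_file(in_path)
    mix = voice
    for _ in range(layers):
        # Overlay detuned, quieter, slightly delayed copies of the original.
        copy = pitch_shift(voice, random.uniform(-2.0, 2.0)) - random.uniform(3.0, 9.0)
        mix = mix.overlay(copy, position=random.randint(10, 60))
    # Random 'stutter': repeat a short slice of the clip on top of itself.
    start = random.randint(0, max(1, len(voice) - 200))
    mix = mix.overlay(voice[start:start + 120] * 3, position=start)
    mix.export(out_path, format="wav")

shodanify("line_original.wav", "line_shodan.wav")   # hypothetical filenames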
>>1571
You only need 5 seconds of arbitrary audio to get pretty good results with the SV2TTS framework. It uses an encoder trained to perform speaker verification to create a speaker embedding vector from a short audio sample. Then a seq2seq network creates a mel spectrogram from some text and the speaker embedding, and a wavenet turns the mel spectrogram into an audio waveform. The encoder in the pretrained model was trained on noisy speech, but I don't know how well it would work with a reference as heavily distorted as SHODAN's voice.

Audio samples: https://google.github.io/tacotron/publications/speaker_adaptation/
GitHub: https://github.com/CorentinJ/Real-Time-Voice-Cloning
Paper: attached pdf
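For reference, the end-to-end flow with that repo looks roughly like the sketch below. This is paraphrased from memory of its demo script, so the module paths, checkpoint locations and function names are assumptions that may not match the current code exactly.

from pathlib import Path
import soundfile as sf
# Module layout as in the Real-Time-Voice-Cloning repo (assumed):
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained"))
vocoder.load_model(Path("vocoder/saved_models/pretrained.pt"))

# 1. A few seconds of reference audio -> fixed-size speaker embedding.
ref_wav = encoder.preprocess_wav(Path("reference_voice.wav"))
embed = encoder.embed_utterance(ref_wav)

# 2. Text + embedding -> mel spectrogram (seq2seq synthesizer).
specs = synthesizer.synthesize_spectrograms(["Look at you, hacker."], [embed])

# 3. Mel spectrogram -> waveform (neural vocoder).
wav = vocoder.infer_waveform(specs[0])
sf.write("cloned.wav", wav, Synthesizer.sample_rate)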
>>1582
>5 seconds
>quality results
well shit negro we're in business
I have a feeling that training directly on the distorted SHODAN audio will not work (although it might be worth trying), so instead I floated the idea of using Terri Brosius' regular voice. Based on this information, there's probably more than enough audio from other characters that she has voiced.
I will have to study these links and learn more about this. Thank you, anon.
I will have to study these links and learn more about this, thank you, anon.
anon started a thread about speechsynth on fatchan.
https://fatpeople.lol/tech/thread/288.html

pretty sure everything there has been discussed here already except this:
https://paintschainer.preferred.tech/index_en.html
>>1701
example 'AI' painting it did for me
Open file (64.78 KB 426x567 tacotron opinion.png)
>>1582
Was just gonna repost Tacotron. It's a really amazing voice synthesizer. It'd be interesting to see what it's capable of adapted with transformers instead of outdated LSTMs.
https://www.youtube.com/watch?v=0sR1rU3gLzQ
Audio samples: https://google.github.io/tacotron/publications/tacotron2/
Paper: https://arxiv.org/pdf/1712.05884.pdf
GitHub: https://github.com/NVIDIA/tacotron2
>>2355
Also, a newer paper from this year can convert any voice to many, and improves on SOTA in any-to-any conversion. I can imagine this being used to train a synthesized voice on the more subtle and emotional nuances of speech.
Demo: https://dunbar12138.github.io/projectpage/Audiovisual/
Paper: https://arxiv.org/pdf/2001.04463.pdf
GitHub: https://github.com/dunbar12138/Audiovisual-Synthesis
thanks for the papers anon, i'll try to get through them sometime soon. :^)
Open file (88.97 KB 500x377 the rules.jpg)
does anyone have a good idea how many minutes of audio it should take to train a good text-to-speech model from scratch with current machine learning techniques? I found no dataset containing child or childlike speech. so far there seems to be no academic interest in compiling one, and i really don't think another soul on this planet is pathetic or degenerate enough to make one. so here I am with about 500 ~15 minute long videos ripped from some family's youtube channel. youtube's machine generated subtitles are surprisingly accurate so far, but this is still a really daunting task to label speakers, proofread, and format, and i'm not sure how much of this is needed to get the job done right. also this feels incredibly skeevy, but unless one of you has seen a dataset that has what i'm looking for, it's something i have to do, even if the ethics of duplicating a real living child's voice for my own purpose is dubious at best.
>>2499 You might try asking the Anons working on the Pony Preservation Project. They are likely to be a better source of information on this atm. >>1563
>>2499 People have achieved near human-quality voices with the voice cloning toolkit corpus. It consists of 110 speakers, mostly in their 20's, reading 400 sentences each. https://datashare.is.ed.ac.uk/handle/10283/3443 You're not gonna get very far only using one voice though. You can probably find some children's speech datasets on Kaggle.
>>2508 when i looked on kaggle i was unable to find anything of that nature. as far as using a single voice, i'm not intentionally using only a single voice, but when i was looking into this, waveglow (https://github.com/NVIDIA/waveglow) appeared to be achieving decent quality results using a single person's voice. if i can find good samples of multiple voices i'd be interested in all of my different options but as of right now i'm stuck using data i can put together on my own.
>>2517 >if i can find good samples of multiple voices he linked you a very good one already. and i directed you to a group working with an entire 200-episode-show cast's worth of professional voice actors, including girls. can't find anything 'good' in those anon?
>>2518 i didn't mean to disregard anon's advice to seek out the mlp group. i did appreciate the referral. i don't know yet if that fits the bill or if i need to keep looking, but i will be looking into his suggestion.
>>2519 Haha no worries Anon! I just wanted to point out there is already a lot of content between those. The one is over 10 GB of highly-tagged audio sources, and the other is a growing mountain of autistically-driven creation, much of which is remarkably good. Good luck Anon.
I like you.
Open file (168.75 KB 1920x1080 I_like_you_Anon.jpeg)
>>2640 Thanks Anon. We like you, too. Just the way you are tbh.
Open file (140.36 KB 640x480 4-jHssB.png)
>>1582
>GitHub: https://github.com/CorentinJ/Real-Time-Voice-Cloning
What fucking garbage software. I tried to use the Total Annihilation narrator's voice to read a few paragraphs and the program shits the bed. Using smaller text samples doesn't even clone the narrator's voice at all, what the fuck man.
>>4144
Yeah, it's a little dated and wasn't really a pro-grade project anyway.
>"...13/11/19: I'm now working full time and I will not maintain this repo anymore. To anyone who reads this:
>If you just want to clone your voice, do check our demo on Resemble.AI - it will give much better results than this repo and will not require a complex setup. The free plan already allows you to do a lot. We have a plugin for Unity as well for game developers.
>If, for some reason, you want to spend hours setting up a python environment to run crappy old tensorflow code that will give you very average results, be my guest. One year after my thesis, there are many other better open-source implementations of neural TTS out there, and new ones keep coming every day."
Might try the recommendation Anon? Please let us know how it works for you if you do, thanks. Good luck.
https://www.resemble.ai/
>>4242
>resemble.ai
>you need to add your e-mail account so that their pajeet tech scammers can spam it
Yeah, let's not get in there; it's not looking too pretty.
>>4242 Heh, they are obviously for-profit and want to monetize the customers. As I implied elsewhere, if you want to have your cake and eat it too, then we'll need to roll our own here at /robowaifu/. Better crack those books, Anon! :^)
I just wanted to mention that there's another thread about voices: https://julay.world/robowaifu/last/156.html Maybe these could be merged? I'd like voices to resemble actresses with good voices, or a mix of different ones to avoid trouble. There's enough audio and scripts from subtitles available. Is training a NN using Audacity the right way? It would first need a network which could tell us how similar two voices are; then we could try to get closer and closer. We also have Festival available as free software, so voices from there could be the starting point. Maybe first think of a way to tell it how close two voices are, some generated and some not; then it can learn.
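On the "network which could tell us how similar two voices are" part: the usual trick is to run both voices through a speaker encoder (like the SV2TTS encoder posted earlier, >>1582) and compare the resulting embedding vectors. A minimal sketch of just the comparison step, with the embeddings assumed to come from whatever encoder you pick:

import numpy as np

def voice_similarity(embed_a, embed_b):
    # Cosine similarity between two speaker embeddings:
    # ~1.0 means the encoder thinks they're the same speaker, ~0.0 unrelated.
    a = np.asarray(embed_a, dtype=np.float64)
    b = np.asarray(embed_b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))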
>>4333
>Maybe these could be merged?
Good idea, and maybe I'll work out a reasonably convenient way to do that at some point. However, Lynxchan software doesn't have a 'merge these threads' button, and the manual effort to do so is rather tedious (as you might imagine).
>Is training a NN using Audacity the right way?
Near as I can tell, those are 'apples and oranges' in comparison. Audacity is primarily a tool for human editing of audio files. NNs are an entirely different type of thing. And you have some good ideas. Honestly, I would suggest you look into the Ponies' efforts in this area. They have achieved some impressive results already based on extracting VA audio from the show. >>1563
But I hope we can manage something like your ideas; that would be great.
There's a program, MorphVox Pro, which can alter a male voice to a female one. I have no intention of using my own voice for that, but maybe it could be useful for changing generated voices from e.g. Festival. More importantly, it shows what's possible. The vid is even from 2015. https://youtu.be/CpVwl-FEzl4
Via https://dollforum.com/forum/viewtopic.php?f=6&t=130302&sid=44113180fc656eb7aa41381a0ce12d02
>>4345 There is a merge thread feature on 2.4 tho.
>>4622 That's good news. As long as Robi utilizes that version for the reboot, I'll explore that feature and that idea. Now if only it would also let me mark individual posts and provide a 'move this post to thread X' option (in batches of many posts ofc).
>>199 I just found this speech synthesizer programming tutorial: https://www.youtube.com/watch?v=Jcymn3RGkF4
>>4659 Interesting, but the end result is useless unless you wanted a bad voice with a Finnish accent. Don't fall for his trick at the beginning when he talks as if his voice were the synthesizer. Also, I don't get it: why would everyone create their own? I just need one program where I can put in the data. Did you try his software: https://github.com/bisqwit/speech_synth_series/tree/master/ep4-speechsyn
Is it even reproducible, or just messy unreadable code? Where is it better than eSpeak or Festival?
>>4659 Thanks Anon, appreciated. I'll have a look at it sometime over the next few days.
>>4659
Has anyone got ideas on how to get phonemes from voices without manually cutting them out of sound files? There seem to be some methods, but it's difficult and complex. Not even sure if this helps: https://youtu.be/x1IAPgvKUmM
There are voices available for sale, and free ones anyway, so it might be easier to change those. But what's the best way to do that? That would be something to get us forward.
Here's some introduction to working with sound and signal processing in Python. Not sure if I should learn that at some point, but I like his approach to teaching and learning by doing projects: https://youtu.be/0ALKGR0I5MA
The available software gets better every year, but not for free, and often it needs the cloud. However, even if we don't get anything done here, at least something will be available, because others want this stuff as well. Then again, cloud-based stuff is quite useless.
EmVoice One: https://youtu.be/Da2DAjKzeaQ and UTAU, Vocaloid, SynthV, Eleanor Forte are mentioned in the comments.
Newscaster, wow: https://youtu.be/wHP3J01aEns
The best TTS is AI-based. Check out the demos; Google has some, Amazon has some. You don't typically hear them, though: there are issues with performance and cost. My opinion: wait for generally-available AI TTS. Someone mentioned espeak. espeak is close to the exact opposite; however, you can speed espeak up way faster than other systems. But what I want is good singing. If you've ever pirated Vocaloid, you know it sucks badly, not just in terms of English results, but in terms of the interface.
>>4694 The idea about eSpeak was to use it as a base with a female voice, then have some software change the voice output to make it better. You can combine software; one program doesn't need to do it all.
Someone mentioned WaveGlow here >>5461 and it sounds good. Though the alternatives on their site https://nv-adlr.github.io/WaveGlow sound good as well. I wonder why all of these sound better than Hanson Robotics' Sophia in her recent videos. Maybe because Sophia's speech is generated live while she's talking.
"In our recent paper, we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable." https://nv-adlr.github.io/WaveGlow
Can't upload the paper. There seems to be a block on uploads, because of some spammer.
>>5467 No, Anon, you are the robots. https://streamable.com/u1ulrp >=== What hath God wrought? Use you're robowaifu-powers only for good Anon. :^)
Edited last time by Chobitsu on 10/06/2020 (Tue) 21:17:15.
>>5474 Nice work Anon. Mind telling us how you did it?
>>5474 I hope this will be up to your standards, Anon. >
>>5474 I haven't looked much into the documentation and stuff, but how easy or hard is it to get it to use another voice pack and have it sound just as natural? For example, I've found some local vtubers who know how to talk like anime waifus in English. I figured I could bribe some of them to donate some voiced lines which can then be used as training data, so that we can have some cutesy voices. It's cheaper than trying to figure out how to contact JP voice actresses, who actually sound terrible in English. So my question is: is it possible to compile a list of the minimum lines a voice actress would need to record to provide enough data for an AI to adopt her voice? I used to be active in game development, so I think I'm experienced enough in giving contracts and royalties etc. (even though my actual games were financial failures).
Open file (4.12 MB 1280x720 singing-mouth.mp4)
>>5480 Sounds amazing and absolutely sufficient. Some little indicators that she isn't human are even very welcome. Well, of course I'd like to use another voice, something cuter, younger, girlier, or more like Cameron. Also, how useful it is depends on how fast at least single words or short phrases can be created, and on what kind of hardware. I could imagine having a lot of phrases and sentences stored on an SSD and only filling in the blanks. Then maybe add another modulation system to the output (vid related).
>>5485 Haha, thanks but I'm just a humble webm encoder tending my wares. The credit is due to the actual author. >>5474 But yes, you and this Anon >>5484 have some good ideas for enhancements. I hope we can manage something good with synthesized voices, and something entirely open too.
Open file (212.21 KB 1920x1080 shrug.jpg)
>>5475
It's just WaveGlow out of the box. I wrote a quick Python script to sample it. You can get the code here: https://gitlab.com/kokubunji/say-something-godot-chan
It requires an Nvidia GPU with CUDA and cuDNN though, and the dependencies are ridiculous, not to mention the 700 MB in models that must be downloaded.
>>5484
The pretrained model's dataset is 24 hours of 13100 audio clips. I haven't tried training it yet, but I can't imagine that amount of data is necessary to finetune the model. I've finetuned GPT2 on my CPU with a few books and it completely changed the output.
>>5485
It's very fast. Even on my toaster GPU it generates about 16 words per second, or 1000 words a minute. You could generate a 90,000-word audiobook in about an hour and a half.
>>5486
My frustration is how inaccessible and convoluted these models are. They can't be deployed to consumers, but I've got some ideas for a lightweight and expressive speech synthesis that can run on a CPU. If my voice box simulation is successful, I should be able to use the data generated to create a synthesized voice with complete vocal expression. It's really unbelievable the direction research is heading; they just find more ways to throw money at a problem rather than better data. In a few years we might have barely any new AI at all, except what is made available through an API to select individuals.
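For anyone who wants to see what "sampling WaveGlow out of the box" looks like before digging into the repo above: the sketch below is not that script, just the general pattern from NVIDIA's published PyTorch Hub example, with the hub entry-point names recalled from memory (treat them as assumptions and check NVIDIA's docs; exact return signatures vary by version).

import torch

hub = "NVIDIA/DeepLearningExamples:torchhub"   # assumed hub repo name
tacotron2 = torch.hub.load(hub, "nvidia_tacotron2").cuda().eval()
waveglow = torch.hub.load(hub, "nvidia_waveglow").cuda().eval()
utils = torch.hub.load(hub, "nvidia_tts_utils")

text = "Would you like some toast, Anon?"
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mel = tacotron2.infer(sequences, lengths)[0]   # first output is the mel spectrogram (check your version)
    audio = waveglow.infer(mel)                    # mel -> raw waveform tensor

# audio is a CUDA tensor at 22050 Hz; move it to CPU and save with e.g. soundfile.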
>>5493 >???? >PROFIT!! Kek.
>>5493 Thanks Kokubunji, nice repo instructions. Much appreciated.
>>5493 Chii a cute in the classroom. I like this idea. >16 words per second or 1000 words a minute Seems like that would be fast enough. I hope you figure out how to make your voice box run on a CPU, that would be great!
Full speech is tough even with modern software. But there is a trick I figured out a while ago. If you take robot speech and try to make a fluent dialog, something is going to go wrong and ruin the illusion. However, this only happens when your expectation is solid speech. Flip the script of expectations and build it to speak a foreign language you don't know, sprinkled with "pidgin" English. Your perspective goes from "this thing barely works" to "this thing is trying so hard". What was once an annoying glitch becomes cute effort as it tries its hardest. All it takes is prerecorded bursts of a foreign language mixed with a few awkward text-to-speech words.
Open file (6.21 MB 1280x720 test.mp4)
>>5503 The actual simulation will be too slow to run in real-time but the data generated from it of the tongue position, jaw position, vocal posture and such can be used to train a neural network to synthesize a voice with the same parameters. By simulating a large variety of voices I hypothesize it could reverse engineer people's voices to some degree. However, the software I'm using cannot simulate sound vibrations from turbulent airflow (breathiness) or surfaces contacting (such as Ugandan Knuckles clicking), only resonance. I might be able to simulate breathiness though by modulating the sound with some randomness. Either way, converting text to the parameters that produce that sound should be far more efficient and embeddable in a game. It'll be better than nothing. The parameters should also make it possible to generate unique voices for random characters and customize waifu voices to one's liking.
>>5493 >RuntimeError: CUDA out of memory. Kek, so much for using Tacotron2. A 6 GB card isn't enough to train it.
>>5521
>geeks and robots
lel'd
>vocal posture
Interesting. I don't think I was familiar with that concept before.
>>5523
Are you running it 'dry', Anon? That is, no other resources especially vidya! contending for the GPU's memory?
>>5523
>A 6 GB card isn't enough to train it.
AI Anon said 'Even on my toaster GPU'. Maybe there's some kind of setting you need to tweak. Surely a toaster GPU has less than 6 GB of RAM?
>>5525
Yeah, this card is fully dedicated to machine learning, not even attached to the monitor.
>>5526
I found out the batch size parameter was hidden in hparams.py and it was set too high. It seems to be working with a batch size of 16. I'm surprised how fast it is. It'll only take about 40 minutes to train on 24 hours of audio clips. Now we just need a dataset of cute voice samples.
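For anyone else hitting CUDA OOM with Tacotron2: the knobs live in hparams.py. The names and values below are illustrative and recalled from memory, however they're expressed in your copy of the repo.

# Illustrative excerpt of NVIDIA Tacotron2's hparams.py (defaults from memory):
batch_size = 16    # stock value is much larger and overflows a 6 GB card
fp16_run = True    # half precision roughly halves memory use, if the GPU supports it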
Open file (24.26 KB 640x480 chb_chii114.jpg)
>>5520
>Your perspective goes from "this thing barely works" to "this thing is trying so hard".
Yes, I agree with that idea. Robowaifu naivete can actually be quite adorable, and it's an effective trope. > pic related
>>5521
>Either way, converting text to the parameters that produce that sound should be far more efficient and embeddable in a game.
I see (well, kind of, I think). If I understand correctly, the workload of the simulation is primarily used for generating these parameters? So if you pre-generate them ahead of time and store them somehow, then the second part, where the parameters are read in and used to generate the actual waveforms, should be computationally inexpensive. Is that approximately the idea, anon?
chii has no knuckles...
<tfw ;~;
>helping anon find de whey
https://www.youtube.com/watch?v=IulR5PXiESk
>do it for princess chii anon!
>>5529
>Now we just need a dataset of cute voice samples.
I nominate Chii first. Surely we could manage to create a clip library of all the Chii VA segments from the Chobits animu?
Open file (25.40 KB 640x480 Figure_1.png)
>>5530
>Surely we could manage to create a clip library of all the Chii VA segments from the Chobits animu?
I don't know if that'll be enough. Chii didn't really talk much. Each audio clip also needs text, and the background noise has to be removed with something like Spleeter or DTLN.
https://github.com/deezer/spleeter
https://github.com/breizhn/DTLN
It's worth a shot though. The average sentence in the dataset is about 20 words. Output seems to perform best around 10-15 word phrases. Keeping clips a sentence long would be best. I'm gonna try Rikka first since I have limited bandwidth and already have the entire show downloaded. I don't know how well it will handle Japanese though.
We can probably automate audio clipping by using subtitles off https://kitsunekko.net/ and piping them through a denoising network. That way it's easy to train a whole bunch of characters.
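The "automate audio clipping from subtitles" idea can be prototyped with nothing but ffmpeg plus an .ass parser. A rough sketch follows; the `ass` package, the style name and the track index are assumptions, and real subs need the cleanup and edge-case handling discussed later in the thread.

import subprocess
import ass   # pip install ass (assumed parser; any .ass parser would do)

def clip_dialogue(video, subs_path, out_dir, style="Default", audio_track=0):
    with open(subs_path, encoding="utf-8-sig") as f:
        doc = ass.parse(f)
    for i, ev in enumerate(doc.events):
        if ev.style != style or not ev.text.strip():
            continue
        # ev.start / ev.end are timedeltas; ffmpeg accepts their string form.
        subprocess.run([
            "ffmpeg", "-y", "-i", video,
            "-ss", str(ev.start), "-to", str(ev.end),
            "-map", f"0:a:{audio_track}", "-vn",
            f"{out_dir}/{i:04d}.wav",
        ], check=True)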
>>5530
Well, the idea of machine learning is to disentangle useful latent variables from high-dimensional data but without immense amounts of data it's exponentially difficult for backpropagation to separate them. For example, if you wanted to learn the x,y position of something on an image and control it but had a lack of data samples to train on, it may notice that the x and y values are correlated and become biased, so when you try to change the x dimension it causes the object to move diagonally instead because it failed to disentangle the latent variables. If the training data covers the latent space evenly and the model has access to all the underlying variables, it has a much easier time disentangling the data and can interpolate between the gaps without much confusion because it's getting input where that data belongs in the latent space. A smaller and simpler model can be used rather than a bulky slow one because it doesn't have to do all the guesswork of pulling all the dimensions apart trying to sort the data.
>>5532
Done: https://gitlab.com/kokubunji/clipchan
>>5537 >clipchan error ModuleNotFoundError: No module named 'ass' > Apparently I need a Python dependency? The .ass file was extracted w/ ffmpeg.
>>5539 just in case it matters >
>>5537 >>5539 >>5540 Nvm, figured it out. > #1 Now I'm getting a 'file name too long' error. (probably some of the interstitial stuff, I can post the .ass text if you'd like. > #2 Also, it's extracting the first audio track (Japanese), but I want the second track (English). Any way to control this? Regardless, very cool work Kokubunji.
>>5541 sample clipchan results, btw. had to convert to .mp3 so I could post them here, but the originals are all .m4a >
>>5541
Found the Dialogue Event that broke things (the middle one):
Dialogue: 0,0:01:31.87,0:01:33.95,Chobits Dialogue,Comment,0,0,0,,I'm gonna go to Tokyo!
Dialogue: 0,0:01:48.87,0:01:55.88,Chobits OP JP,,0,0,0,,{\fad(400,900)\be1}{\k15\1c&HDF6B7B&}Fu{\k21}ta{\k22}ri {\k23\1c&H4E4FDE&}ga {\k44\1c&HDE8162&}ki{\k45}tto {\k22\1c&HA1CA5D&}de{\k23}a{\k23}e{\k24}ru {\k43\1c&H226FCD&}you{\k48}na {\k20\1c&H56CED9&}ma{\k45}hou {\k26\1c&H7D79D7&}wo {\k47\1c&HDA90CB&}ka{\k48}ke{\k152}te.
Dialogue: 0,0:01:48.87,0:01:55.88,Chobits OP EN,,0,0,0,,{\fad(400,900)\be1}Casting a spell that will make sure they meet.
>>5537 Thanks for that detailed explanation. That helps, actually. Not sure how to word a cogent response, but the topic seems to make more sense to me now.
>>5532 Fair enough, good luck with Rikka Anon! :^) I'll try to sort out longer audio/subtitle clips of Chii's speech from all the episodes. I'd expect we should be able to find at least five minutes of this, just in case it might work.
>>5537 >>5541 >>5544 BTW, this is the name of the source file itself, just to be on the same page: 01. Chobits [BD 720p Hi10P AAC][dual-audio][kuchikirukia] [3DD90125].mkv The 'kuchikirukia' version seems to be the best quality one I've found over the years of the Chobits series, so I've just standardized on it.
>>5493
We just need the audio clip equivalent of "The quick brown fox jumps over the lazy dog." where there are enough use cases to build a speech pattern.
>>5520
The fake foreign language option also sounds good. How about Klingon, or Hymnos (Reyvateil language in Ar Tonelico series)? Godspeed anon.
>>5539
Whoops, forgot to add requirements.txt. Anyone setting it up now can just do: pip install -r requirements.txt
>>5541
I could output subtitles that are too long to a csv file or something. Long audio clips need to be manually split up anyway. It seems the offending line is the opening with all the formatting code. You should be able to filter the events it clips to only character dialogue with --style "Chobits Dialogue"
But there may still be rare cases where English subtitles go over the 255 character limit. I'll start with scrubbing the formatting tags so openings can be clipped too.
Also, it might be useful to keep the raw audio clips and the parameters used to generate them. Later, when we clean them up, we could use that data to train an AI to automatically crop and prepare clips.
>>5547
If there are too few, we can try augmenting the data by splitting up audio clips into shorter phrases to add to the dataset, as well as applying a random delay to the beginning, slightly changing the volume, or increasing the tempo in Audacity.
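A trivial version of that augmentation (random leading delay plus a bit of gain jitter; tempo changes left to Audacity/sox) might look like this, with pydub assumed just for convenience and the paths hypothetical:

import random
from pydub import AudioSegment

def augment(in_path, out_path):
    clip = AudioSegment.from_wav(in_path)
    # Random delay at the start (0-250 ms of silence), then a small volume jitter in dB.
    clip = AudioSegment.silent(duration=random.randint(0, 250),
                               frame_rate=clip.frame_rate) + clip
    clip = clip + random.uniform(-2.0, 2.0)
    clip.export(out_path, format="wav")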
Put up some quick instructions on cleaning audio clips with Spleeter and DTLN: https://gitlab.com/kokubunji/clipchan#remove-music-from-audio-clips It's pretty simple to use. DTLN is better at removing noise but the quality isn't as pleasant as Spleeter.
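For anyone skimming, the core Spleeter call is a single command; flags per the 1.x CLI this thread is using (2.x tweaked the arguments slightly, so check spleeter separate --help):

spleeter separate -i clips/0001.wav -p spleeter:2stems -o cleaned/
# 'vocals.wav' in the output folder is the denoised voice; 'accompaniment.wav' is the music.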
>>5550
>--style "Chobits Dialogue"
Great, that did the trick. Extracted 408 clips in about 2 minutes. >
That's fine for getting the Japanese clips (and frankly, I like Chii's VA's voice better tbh), but I'd like to get the English track's clips too. Have any suggestions?
>>5555
>portentous digits tho
Here are Chii's first 5 utterances to Hideki... >
Ofc during the first episode her only words were cute variations of 'Chii'. :^)
>>5556 BTW, it's humorous to simply mpv * from inside the clip extract directory. Kind of like 'watching' the show on fast forward.
>>5554 Thank you. Yes, Spleeter seems to preserve the subtleties of the voice better. I wonder what the difference is? Regardless, I'll be post-processing the clip extracts from Chii's dialogue where needed. I don't have a specific time frame, but I plan to work my way through an episode or two here and there until I have the complete set. I'll probably post the completed set as a zip on Anonfiles when it's finished.
Open file (22.29 KB 240.mp3)
Open file (19.02 KB 248.mp3)
Open file (22.29 KB 234.mp3)
Open file (21.88 KB 231.mp3)
Open file (18.21 KB 228.mp3)
>>5555
I just pushed a bug fix and feature update. It should be able to process all subtitles now. Too-long filenames are truncated and all needed subtitle text is written into filelist.txt in the output path. You can now inspect the subtitles with --inspect or -i before running, and it will count how often the styles are used. The most used one is likely the character dialogue.
>>5556
My heart can't handle this much cuteness at once. There's a lot of noise in them but some of them are still usable.
>>5558
DTLN has a lower sampling rate than Spleeter and was designed for removing heavy background noise, like air conditioners running. Good luck with it.
If anyone doesn't have CUDA but wants to train a voice, I don't mind training a character voice for them if they have the clips. We could probably train the voices on Kaggle or Google Collab too. If a lot of people become interested in the project one day, that would be one way for them to get started.
>>5559 Great, thanks for the inspect flag and fixes. So, again, any way to specify the English language audio track for clip extraction (vs. the defaulted Japanese)?
Open file (63.03 KB 776x877 literal_beetus.png)
>>5559 Haha, want an ASMR? create a playlist of all 28 Chii clips from ep01 and then mpv --playlist=01_chii.pls --loop-playlist=inf
>>5560
Sorry I missed that. Just pushed another update to select the audio track. Use --track 2 or -t 2 to extract the second audio track. Also added --quiet / -q to silence ffmpeg output unless there's an error, and --guess / -g to automatically pick the most common style for extraction.
Also, major update: the subtitles file option is now optional and specified with --subtitles / -S. Subtitles can be extracted directly from the video now, and the subtitle track can be selected with -b if necessary.
>>5568
Haha, that's a lot of Chii. It seems there's a bug though? The formatting tags shouldn't be showing in the filenames unless those are clips extracted from an early version.
>>5570
>Chii-levels > 9'000
IKR?
>unless those are clips extracted from an early version.
Yes, it's the older stuff; I haven't redone it yet. I'll use the newer stuff for the final processing & edits, etc.
BTW, there are still a few formatting things in the newer version. IIRC, '(/N)' (or something similar). Also, portable filenames (for instance, ones that work on W*ndows) need some chars removed to work correctly. I dealt with this issue in BUMP. So I'd say the !, ? and any other punctuation are good candidates for removal from the filenames, for example.
>>5570 >Just pushed another update to select the audio track >Also major update: the subtitles file option is now optional Great! I'll try this over the weekend. Thanks for all the hard work Anon.
Open file (170.45 KB 480x270 CONDUCTOR.gif)
>>5571
There, added portable filenames. I noticed the recent version wasn't removing {} stuff again, so I fixed that too. Now I just need to automate Spleeter and it should be good to go.
>>5572
I don't even think of it as work, I'm so excited for this. There's so much that can potentially be done with it, from voicing waifus to redubbing anime. The memetic potential is infinite.
>>5570 Seem to be having trouble auto-pulling the subtitles. Here's a portion of the output showing which channel it's in: > #1 Here's the response I'm getting, trying the simplest approach: > #2 I'm sure I'm just flubbing it somehow. Maybe providing a specific example of grabbing auto-subtitles would help a bit ? I can generate them myself w/ ffmpeg, but I would much prefer using your method instead.
>>5575 >a portion of the ffmpeg output*
>>5575 >trying the simplest approach: Actually, I guess this is the simplest approach, but it breaks on me worse: >
>>5577
My bad, I forgot to push my code before going to bed, but I think inspect should have still worked with -b 3. I've updated debug mode -d to provide some more useful output, such as what it's running FFmpeg with. FFmpeg is getting an invalid argument list somehow. It may be due to your version of FFmpeg. Can you pull the latest version of Clipchan and try running these commands to see what they output now?
python ../clipchan.py -i 01.mkv -b 3 -d
python ../clipchan.py 01.mkv -d
ffmpeg -version
>>5581 Haha, no worries and thanks very much Anon. While you slept, I finished grabbing and sorting the basic clips for all Chii utterances in ep01-ep03, in both English and Japanese. This tool of yours is a remarkable time saver. Ofc all the clips will need explicit fine-tuning inside Audacity later, but your Clipchan is kind of revolutionary tbh. What a difference a day brings! :^)
>>5581
>Can you pull the latest version of Clipchan and try running these commands to see what they output now?
Sure thing, here we go:
python ../clipchan.py -i 01.mkv -b 3 -d
> #1
python ../clipchan.py 01.mkv -d
> #2
ffmpeg -version
> #3
>>5583
Once it automates Spleeter, cropping and normalization, it will be truly revolutionary. Every show will become fair game for effortless machine learning.
>>5584
Your ffmpeg wasn't built with --enable-libzvbi. I pushed another update though that uses a different method to extract the subtitles from a video. Let me know if it works for you.
>>5581 >>5586 >Let me know if it works for you. Great, looks like your patch finds the subtitles stream now. > I simply installed the ffmpeg in the repo iirc. I can probably manage to build from their repo if you think it would be worth the trouble?
>>5587 >I simply installed the ffmpeg in the distro package repo iirc*
>>5587 It's fine, if it works now the dependency was unnecessary.
>>5589 Got you. Alright I'm off for a few hours at least. Cheers.
Spleeter is now automated in v0.2 but not fully tested yet. To try it put the Spleeter pretrained_models directory in the Clipchan directory and use Clipchan as usual plus --spleeter. Due to a bug in Spleeter, the terminal will reset after it completes to prevent the terminal from freezing. Next, to automate cropping and normalization I will make it look for the median point of silence in the padding area and crop it to 0.02 seconds before the next sound detected. This should be good enough. There are some alignment issues with my subtitles so I'm realigning them in Aegisub and reducing the default padding to 0.2s since it's not uncommon for subtitles to be 0.4s apart.
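Not the actual clipchan code, but a rough illustration of the silence-trimming idea described above (short-window energy threshold, keeping ~0.02 s of lead-in before the first detected sound); the window size and floor are placeholder guesses:

import numpy as np
import soundfile as sf

def autocrop(in_path, out_path, win=256, floor=1e-4, keep_s=0.02):
    audio, sr = sf.read(in_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)               # mix down to mono
    energy = np.convolve(audio ** 2, np.ones(win) / win, mode="same")
    voiced = np.flatnonzero(energy > floor)      # samples above the silence floor
    if voiced.size == 0:
        return                                   # nothing but silence; skip or flag the clip
    keep = int(keep_s * sr)
    start = max(0, voiced[0] - keep)
    end = min(len(audio), voiced[-1] + keep)
    sf.write(out_path, audio[start:end], sr)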
>>5593 That sounds clever. Look forward to trying it out. Sounds like you're almost there Anon.
>>5574 > I noticed the recent version wasn't removing {} stuff again I found an explicit example of the newline char still being left in the filenames/dialogue text Dialogue: 0,0:05:36.45,0:05:39.03,Chobits Dialogue,Comment,0,0,0,,{\i1}What did I say in front\Nof such a beautiful lady? The '\N'
>>5597 >newline char that's still being left*
>>5595
Automated clipping and normalization is almost done. I think after this I'll try making a neural net that can detect which character is speaking. That way an entire show can be fed in and sorted automatically using a few examples of the characters speaking.
>>5597
Newlines are being removed from my subtitles. The only place they should appear is in the log file in debug mode. Try pulling the latest update and running the same command with -d and inspecting clipchan.log. It will show the reformatted text <> unedited subtitle text, something like this:
[249] 00:20:2.960-00:20:5.980 (0.0) The magma of our souls burns with a mighty flame <> The magma of our souls\Nburns with a mighty flame!
>>5603
>That way an entire show can be fed in and sorted automatically using a few examples of the characters speaking.
That sounds awesome. I think I'm going to put my tedious effort on hold and wait for your better approach Anon. :^)
You know, it occurs to me that you could probably do a 'two-pass' approach as well (at the cost of longer processing): sort of do all the cleanup, crops, speaker identification, etc., then feed that information back into a second-pass sequence that should be able to improve the accuracy of the crops and noise removal for tricky bits that have a lot going on audio-wise in a short time span, for example.
>>5603
Seems to me this could also be used to improve a series' subtitles as well. Sort of an auto-gen for subtitles that are actually timed very well, and also more accurate to the actual text. For example, the engrish translations in some English subtitles often aren't right on-cue with the English VA scripts (even if often much more humorous/possibly more accurate to the original Japanese meanings/idioms). Seems like that might save having to go in and manually edit the filelist.txt entries by hand before passing them into the machine learning, so audio/text actually matches first.
>>5603 The cropping is perfect so long as the subtitles are aligned correctly. Sometimes sound effects slip through Spleeter but that can't be avoided. Speaker identification is going to require building a dataset first to train on. >>5605 Auto-aligning subtitles will be tricky. I could probably fix small misalignments with the same method I'm using to crop the audio clips by snapping them to the nearest sensible spot. I'd have to run Spleeter over the whole episode first which shouldn't be too big of a hit since it has to convert the clips anyway. I'll add this feature idea to the to-do list. Maybe two projects down the line someone will create some speech recognition for it that can generate subtitles.
Trying to find the instruction on your repo for removing music from clips. > #1 Discovered a minor naming issue w/ instructions. > #2 Then realized (afaict) a showstopper (for me at least) dependency. > #3 I suppose you can't do this w/o a Nvidia GPU then?
>>5613 >Trying to follow*
>>5613 Spleeter runs off CPU by default. You need to downgrade to Python 3.7 to install Tensorflow.
Trained on 2B's voice overnight without data augmentation and possibly a too high learning rate. It's not perfect and there's only 18 minutes of training data, but the results are pretty satisfying. Training data I used available here: >>5620 Filelist.txt: https://pastebin.com/y3GyyBtR Once I fine tune it better I'll create a Google Collab so anyone can use it even without a GPU.
>>5615 >You need to downgrade to Python 3.7 to install Tensorflow. I have no idea how to do that tbh and I've fought trying everything I know how to do to get spleeter working but have repeatedly failed. I'll just focus on extracting and sorting out the clips for now since clipchan does that part well.
>>5626 Haha, wow that's pretty nice already. Great stuff Anon.
>>5626
>Google Collab
Any chance of creating a mechanism to save pre-canned responses out locally? I mean audio files, and some way to associate them with the input texts locally? It's one thing to use Google Collab intermittently as a generator for locally-stored content; it's another thing entirely to become wholly dependent on G*ogle for our waifu's daily operations.
>>5627
If you're on a Debian-based distro you can check which versions of Python are available with:
apt-cache policy python3
and to downgrade:
aptitude install python3=3.7.3-1
or whatever 3.7 version is available in your distro. Just be careful it doesn't remove any packages and finds a resolution that downgrades packages as necessary. If that fails, Tensorflow 2 is compatible with Python 3.8 and I can try porting these old projects to Tensorflow 2.
>>5629
Yeah, you can use the Levenshtein edit distance to find the closest match and play that pre-generated response. You could generate a whole library of words, phrases and sentences then stitch them together. If someone is really ambitious they could probably write some code for evaluating PyTorch models in OpenCL or simply port the code to the CPU. At the end of the day though if someone doesn't wanna be dependent on Google or Kaggle for compute they need to get a GPU.
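A bare-bones version of that closest-match lookup using only the standard library (difflib's ratio isn't strictly Levenshtein distance, but it serves the same purpose; the phrases and clip paths here are made up):

import difflib

canned = {
    "good morning anon": "clips/good_morning.wav",
    "how was your day": "clips/how_was_your_day.wav",
    "good night": "clips/good_night.wav",
}

def closest_clip(text):
    # Picks whichever stored phrase is closest to the input text.
    match = difflib.get_close_matches(text.lower(), canned.keys(), n=1, cutoff=0.0)
    return canned[match[0]] if match else None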
>>5630 >Just be careful it doesn't remove any packages and finds a resolution that downgrades packages as necessary. I have no idea how to do that. I did find a tool called downgrade and so I'll try to use that. Thanks.
>>5630
Great samples. Kek.
>>5631
Be careful not to mess up your system by downgrading your regular Python or installing all kinds of stuff: https://www.pythonforbeginners.com/basics/how-to-use-python-virtualenv
Past couple days have been hell tracking down strange bugs and trying to get this to work flawlessly, but it's working well now. --auto-clean normalizes, removes silence and resamples clips to prepare them for machine learning, and has a success rate of about 98%. It gives warnings for which clips need further attention. Most of the time these lines aren't usable anyway since they contain overlapping audio or other strong background noise.
Also added another tool for quickly captioning audio clips, called filelist.py. It goes through all the wave files in a folder and plays them, prompting you for what the line should be. Hopefully it's pretty straightforward to use. You will need to install playsound to use it:
pip install playsound
With that, Clipchan is pretty much done and ready for waifu datamining. Enjoy!
https://gitlab.com/kokubunji/clipchan
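Putting the flags mentioned in this thread together, a typical end-to-end run would look something like this. The style name, subtitle/audio track numbers and paths obviously depend on your source file, and the exact filelist.py invocation is an assumption (check the repo README):

python clipchan.py 01.mkv --output-path clips/01 --style "Chobits Dialogue" -b 3 --track 2 --spleeter --auto-clean
python filelist.py clips/01   # then caption/verify each clip interactively (invocation assumed)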
>>5633 I see, thanks for the advice. >>5648 Thanks for all the hard work. Does --auto-clean rely on dependencies, or no?
>>5615 Ironically enough, I can install Tensorflow 2 just fine on my system. But even when I install TF 1.4, Spleeter refuses to recognize it, and I basically broke my system trying to downgrade to Python3.7 from 3.8 (thankfully I seemed to have recovered from that now). Even when I successfully installed TF1.4 on a RaspberryPi (Debian Buster-based, and already Python3.7), Spleeter still refused to recognize it and failed to install.
Here's as far as I've gotten:
spleeter 1.4.0 requires museval==0.3.0, but you'll have museval 0.3.1 which is incompatible.
spleeter 1.4.0 requires pandas==0.25.1, but you'll have pandas 1.1.3 which is incompatible.
spleeter 1.4.0 requires tensorflow==1.14.0, but you'll have tensorflow 2.3.1 which is incompatible.
>>5651 >TF 1.14*
>>5649
It depends on Numpy and Scipy at the moment, but I'm gonna remove the Scipy dependency and use the standard library's wave instead.
>>5651
Figures. Downgrading is always a nightmare. I'll see if I can port it to Tensorflow 2. Fortunately there's some code to automate translating projects.
>>5677
It seems just a few days ago Spleeter 2.0 was released on PyPI, and it's compatible with Python 3.8:
pip install spleeter
I created a separate branch for Spleeter 2.0 and Tensorflow 2.3.0:
cd clipchan
git checkout python3.8
pip install -r requirements.txt
I've tested that it's compatible with the Spleeter 1.4 pretrained models. It seems people have already ported Tacotron2 and WaveGlow to Tensorflow 2, so I'll work on creating a Python 3.8 branch for WaifuSynth too.
>>5682 Great news! I'll give it a shot tonight.
>>5682
BTW (OT), what's the bearing of the advice to use the '-m' flag with pip? As in:
pip -m install foobar
I've seen that advice (and examples) often, but I don't think I understand what difference it makes yet.
>>5682 Had to first remove the leftover 1.4 spleeter egg file from /usr/lib/python3.8/site-packages/ > #1 Things seemed to be going well, then it errored out with this > #2
>>5687 Apparently, you can specify a version number (but guys recommend against this approach for some reason?) > #1 I'm not sure if this means everything went ok now with pip install spleeter > #2 but I'll push ahead with checking out the 3.8 branch of clipchan...
Don't forget to git fetch first before checkout. > #1 Minor typo in requirements.txt > #2 Seems OK so far now > #3 I'll give it a test in a while and see how --auto-clean goes. Any specific example command you'd care to give me so I don't flub the increasingly complicated heh :^) clipchan flags?
Here's the command I used:
python ../clipchan.py 01.mkv --output-path clips/en/01 --style "Chobits Dialogue" -b 3 --spleeter --auto-clean --track 2
which produced ~400 .wav files for me > #1
but errored out on me right at the beginning of the spleeter post-processing phase. > #2
And the log file seems empty? > #3
>>5690 Also, I'm curious why filelist.txt is being written in the working directory instead of the target directory. Intentional? Seems like it used to be written into the same directory as the output .m4a files IIRC.
>>5686
python -m pip
-m tells python to run a module, in this case pip, from that specific python installation.
>>5687
Specifying certain versions leads to unnecessary dependency hell. Unfortunately, dependency hell is the reality because all these different machine learning libraries require specific versions or they break.
>>5689
Thanks, fixed the typo. Most of the flags are there to give some flexibility and don't need to be changed. It should work great with the default settings.
>>5690
If the log is empty I assume Spleeter failed to start. What happens when you run Spleeter by itself?
python -m spleeter separate -i clips/en/01/*.wav -o /tmp -n filename
>>5691
You can change where filelist.txt is written to with --filelist-output-path / -l, or name it something else like ep1_filelist.txt with --filelist / -f
Hmm, a surprise. I deleted all the output .wav files from the previous effort and decided to try again w/o the --spleeter flag:
python ../clipchan.py 01.mkv --output-path clips/en/01 --style "Chobits Dialogue" -b 3 --auto-clean --track 2
but had the same outcome? >
>>5692
>from that specific python installation.
Hmm, not sure I really understand that (probably why I didn't get it before heh). But this is probably more apropos for our Python General thread? >>159
>Thanks, fixed the typo
y/w.
>You can change where filelist.txt is written to with
Great, more flags to deal with haha.
<pull request: default filelist.txt into the same output path :^)
>>5692
>What happens when you run Spleeter by itself?
What should I pass as the filename? (sorry if this seems blatantly obvious to everyone else) >
>>5692 >>5695 When I dropped the unrecognized flag, spleeter unceremoniously crashes with no further output >
One other thing. I noticed a spleeter.sh file in the directory (no doubt left over from the 1.4 branch of clipchan). This bash script is explicitly for spleeter v1.4.0 it seems. Does there need to be a similar script for the newer spleeter v2. Just spitballing here tbh.
>>5694
You can have different versions of Python installed, each with their own pip module for managing packages. And sure, I'll make the output folder the default path.
>>5693
Double-check your command. It's not possible to reach that part of the program unless --spleeter or --spleeter-model is given.
>>5695
That is a bug in Clipchan. It seems Spleeter changed its arguments and I didn't realize I was using 1.4 when I was testing, because Python ignored the virtual environment. I will have to investigate how Spleeter 2.0 outputs its files.
>>5696
However, this is caused by Tensorflow being built with optimizations not supported by your CPU. Likely the only workaround here is to build Tensorflow 2.3.0 from source: https://www.tensorflow.org/install/source
>>5697
Pip simplifies the installation of Spleeter. I'm going to change the other one to use pip as well.
Alright, I'm trying to build Tensorflow from source r/n. I'll start over completely from scratch with Clipchan if that goes well. Hopefully, the new 3.8 copy of Clipchan will go well then.
Well, unsurprisingly that was a miserable failure. Building Tensorflow appears well above my paygrade. After a week of fighting with this with no success, I'm about done with it.
Question for Kokubunji: if I simply sort the Chii vocals out of the raw .wav clips and then package them up for you in a zip somewhere, can you take it from there and do all the rest? It's obvious at this point I can't get Clipchan's (no doubt marvelous) features to work correctly.
>>5701 Yeah, Clipchan will automate the rest. I'll see if I can find a Tensorflow 2.3.0 pip wheel without AVX or build one for people to use. Most of the machine learning libraries have dropped support for legacy CPUs in a similar way researchers have dropped support for CPUs. Now that I think about it, a lot of the papers require massive GPU clusters or 100's of TPUs and they don't release their models or code. We're already being left in the dust and if /robowaifu/ can't get AI to work, then what hope is there for everyone else?
>>5702
Thanks, I'll get on that and hopefully have it for you within the week. (I already got about 12 episodes done from before, but have to redo them with .wavs heh.) My main issue is my lack of experience with Python. I'm more interested in the underlying engines it's scripting, which do the actual heavy lifting. We'll need both skill areas, and we're doing OK in that regard IMO. Thanks for all the great innovation and also the hard work helping us get things working on lower-end hardware. You are much appreciated, Anon.
>We're already being left in the dust and if /robowaifu/ can't get AI to work, then what hope is there for everyone else?
Ehh, we'll get things working; you obviously already have many things now. We're just on a limited budget as typical Anons, vs. Big Tech's unlimited deep pockets. It was always going to be asymmetric for us. We just have to be clever about things. I'm not going to stop, I just have to recognize my current limits. We'll all learn more about things as we go along.
>>5702
>Dropped support for legacy CPUs
Whhahaaaah, they do that?!? I've spent the last week or so thinking about building a server based on old Xeon CPUs, and maybe also buying a Xeon Phi, as the external brain for my waifu... F...
https://youtu.be/ZKkzEBtIoH8
>>5708 Yes, this is a common idea I think. It's a strong argument for us to succeed at figuring out a way to use old hardware effectively. There's far more of that kind available the world over than the other. It's probably fundamentally less botnetted as well, which is also an important point.
Same anon as >>5708. Looked into it a bit; PyTorch, for example, seems to support CPUs via the Intel Math Kernel Library, which seems to have no limitations on which CPUs it works with, except that optimizations might not work on non-Intel CPUs: https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html
>>5704
Yeah, getting this stuff to work needs to become as frictionless as possible for everyone to get involved. People don't have the time or energy for hitting their head on a wall for a week. There's about 1-2 years left before AI really takes off, so it's definitely doable to get everything ready for people to use. I can imagine something like TransCoder translating PyTorch and Tensorflow code and models to mlpack in C++, which can build for any system, including embedded systems.
>>5708
>>5712
Xeon and Xeon Phi should be fine. The public builds of PyTorch and Tensorflow have required AVX and AVX2 since sometime around 2018. The devs have said multiple times it isn't supposed to, but the instructions keep slipping into the builds and they don't do anything about it. Sometimes they do, but then a minor version later it's fucked again. They've effectively cut off millions of people from using or learning how to use neural networks on common hardware. And just a few years ago PyTorch worked on 32-bit systems, but they dropped support for 32-bit builds too.
In a few months I'll definitely see if I can port Spleeter and WaifuSynth to mlpack. That would completely disentangle us from Facebook and Google and be a huge step forward for keeping AI open.
>>5718 >Transcoder Is it this project Anon? > Unsupervised Translation of Programming Languages >A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.
>>5718 A week ago I wouldn't have recognized the term AVX, but yeah even old Xeons have that, just not the newer versions of it, which would be much better. Still a bit concerning if old CPUs are getting ignored, especially for young students doing this on their own small budget. Still, it's amazing what hardware you can get for even 1k, and I figure the devs have to draw the line somewhere. Should have bought Bitcoin when you were in primary school or so, I guess.
>>5719 Yeah, that's it. The source code is available here: https://github.com/facebookresearch/TransCoder/ But this is getting off-topic from speech synthesis.
>>5718 >In a few months I'll definitely see if I can port Spleeter and WaifuSynth to mlpack. <Over 9'000!!! Haha not only are they cutting-edge AI library devs, they are shitposting memes within their own paper. Godspeed Anon.
>>5718 We're behind you, Anon! >>5730 >>5731 >>5732
Alright, I've tried my best to build a Tensorflow pip wheel without AVX and mostly succeeded, but some AVX instructions are still slipping through inside pywrap, tf2xla and libtfkernel. On the Tensorflow community wheel page no one has succeeded yet in making an AVX-free build for 2.3.0, and I can't even begin to figure out where they forgot to include the optimization parameters in Google's 9 GB clusterfuck. So I've taken a look at the Spleeter model and it uses some algorithms not implemented in mlpack yet, and the same is true for Tacotron2 and WaveGlow. I should be able to translate the functions but it's gonna be a lot of work, at least 6 months full-time. I'm gonna play around with mlpack more to see if it's worth the effort. The most off-putting thing is the lack of good GPU support, but they're working on that. There might even be other people willing to help since it would be a significant contribution to mlpack to have all these signal processing algorithms.
>>5736 Well, no one could possibly criticize your creativity and efforts thus far Anon. You've already done amazing work, and I'll support you whatever your decision. If you choose to simply wait until the mlpack codebase improves, then perhaps anons can simply do what I'm doing; namely just produce waifu vocal clips and then put them out there for those with more hardware to process. This will be a hit-or-miss approach IMO, but it preserves the status quo. It would be the choice of a sensible man, no doubt. If you choose to press forward with the hard choice, then I'd be happy to do any simple tasks I can to help take some of the load off you. Profiling the specific performance characteristics of particular commands/datasets on either my old Intel or ARM hardware, for example. Certainly the template generics approach the mlpack team has taken is both a tried-and-true one, and the run-time performance thus far seems to be smoking hot on this tiny arm7hf hardware currently at my disposal. The high-resolution clock from the standard C++ library seems to work just fine on this old hardware, afaict. If you can give me a list of tests to perform, I'll be happy to participate in that small way in your big effort Anon. Again, regardless, you've already made remarkable strides, and you deserve applause for that alone. :^)
>>5736 >The most off-putting thing is the lack of good GPU support but they're working on that. Interesting. Can you give us details on the efforts so far? >There might even be other people willing to help since it would be a significant contribution to mlpack to have all these signal processing algorithms. SYCL is a higher-level form of OpenCL, and entirely in standard C++. This implies some probability of long-term, broad availability on a widely diverse set of hardware. If you've ever had experience with Microsoft's C++ AMP, you can think of this as a more sophisticated progression of that approach. https://sycl.tech/ CodePlay is taking the lead at heading up practical solutions to support this standard. The ISO C++ Standards Committee also has a study group SG14 (the GameDev & low latency ISO C++ working group), headed up by Michael Wong, the chairman of OpenMP. It seems to me this is probably the most likely path the committee will progress down towards direct, standard support for heterogeneous computing accelerators such as GPUs. Probably worth thinking about for the long term, though it's quite early to put all your eggs in that one basket just yet IMO. >Porting our robowaifus to mlpack This is likely to be a big, complex topic and probably deserves its own thread.
>>5698 Sorry, I don't know how I missed that filename Anon. That's exciting! So we're really going to have robowaifus then?
Open file (1.26 MB 327x480 message from 2B.webm)
Devlog update. Almost ready to train the new WaifuSynth. Just need more voice data to train on and to double-check that the code is doing what it's supposed to do. For the first pretrained model I will be training it on 2B, Rikka, Megumin, Rem, Chii, Mahoro, Armitage and the LJSpeech dataset. I modified Tacotron2 to take both a text and style input that are combined together so the audio clips can be annotated with emotions. For now I'm just using letters of the alphabet to mark where there are changes in a speaker's tone. For example, if a character starts doing excited yelling I mark JY underneath the word, J for joy and Y for yelling. I'm not sure how well this will work, but I guess I'll find out once I try training tomorrow. Even a little bit of control over the generation process would be great. Tacotron2's encoder LSTM starts off from a blank slate, so I created another LSTM to take in a character name as input and use its output for the initial hidden state of the encoder LSTM before it encodes the text and style input. This way different characters can be easily sampled and new characters can be added to the training set without having to recode and retrain the entire model. I made some other tweaks to the model as well by duplicating parts of the network into three and using my multiplication and addition trick (a*b+c), which I've found also speeds up language model training by orders of magnitude. Hopefully it works in Tacotron2 as well. I'm concerned it will push the model over my memory limit though, so I will be limiting training sentences to 120 characters. If that's not enough I can cut the batch size down 6x or just turn the trick off completely in the model hyperparameters. I also found a function in PyTorch to help free up GPU memory by emptying the cache: torch.cuda.empty_cache() And to wrap up today I had some fun with the old WaifuSynth after seeing some anons lamenting about how they don't believe robowaifus will exist in their lifetimes. I'm still kind of shocked no one on YouTube has made a serious effort yet to clone 2B's voice, or any anime voices for that matter. Hopefully they'll see it and get a taste of the future to come. Also making this gave me an idea to pass the hidden state to the next sentence generation so the generation flows smoothly instead of resetting and sounding so robotic, but that will add another level of complexity, separating the training set into sentence pairs. I'll leave that idea to the next version.
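For anyone trying to picture the character-name LSTM part, here's a rough, hypothetical sketch of the idea (module names and sizes are my own guesses, not the actual WaifuSynth/Tacotron2 code):

import torch
import torch.nn as nn

class CharacterConditioner(nn.Module):
    # reads a character's name and produces an initial state for the text encoder LSTM
    def __init__(self, n_letters=128, emb_dim=64, enc_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_letters, emb_dim)
        self.name_lstm = nn.LSTM(emb_dim, enc_hidden, batch_first=True)

    def forward(self, name_ids):
        _, (h, c) = self.name_lstm(self.embed(name_ids))
        return h, c  # use as (h0, c0) of the encoder LSTM instead of zeros

conditioner = CharacterConditioner()
name = torch.tensor([[ord(ch) for ch in "2B"]])  # crude character encoding, illustration only
h0, c0 = conditioner(name)
# encoder_out, _ = encoder_lstm(text_and_style_embeddings, (h0, c0))  # hypothetical encoder call

This way a new character only needs a new name string plus finetuning data, not a recoded model.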
>>5861 >Chii I'm roughly halfway now through sorting out the new .wav versions of Chii's vocals. Should I continue, or do you already have them? I estimate it will take me another couple of weeks total to finish up, then go through them all and trim them in Audacity, then make all the many edits needed in each filelist.txt to match the English VA's actual words. BTW, the .wav files are full 16-bit stereo, but IIRC you mentioned something on the repo about 22.5K mono files instead? Should these be downsampled first, after I sort them?
>>5863 I haven't done Chii yet. It will take several days at least to train a new model from scratch, starting with the LJSpeech dataset. You shouldn't need to trim them in Audacity though. It only takes a minute to align the subs in Aegisub. A little bit of noise within 200ms at the start and end of clips will be found and clipped out automatically. The clips shouldn't be resampled before going into Spleeter. They will get resampled into mono 22050 Hz automatically after running Spleeter from the Clipchan master branch (but not the 3.8 branch since it's behind). When I wake up I'll update filelist.py to automatically create the filelist from a subtitle file. That'll make things a lot easier for English VAs instead of tapping > and fixing the punctuation. I originally made it for subbing audio clips without subtitles when I was going through 2B's, but in practice I've found it's a lot easier to create subs in Aegisub for things like audio from a YouTube video and then run Clipchan on them.
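For anyone curious what that filelist generation could look like, here's a rough sketch that turns a plain .srt into the usual Tacotron2-style "path|text" filelist format (the clip naming and numbering here are placeholders, not Clipchan's actual scheme):

import re
import sys

def srt_to_filelist(srt_path, clip_prefix="clips/chii"):
    # split the .srt into blocks of "index / timestamp / text lines"
    with open(srt_path, encoding="utf-8") as f:
        blocks = re.split(r"\n\s*\n", f.read().strip())
    rows = []
    for i, block in enumerate(blocks, 1):
        parts = block.splitlines()
        if len(parts) < 3:
            continue
        text = " ".join(parts[2:])                    # everything after index + timestamp
        text = re.sub(r"<[^>]+>", "", text).strip()   # strip style tags
        rows.append(f"{clip_prefix}_{i:04d}.wav|{text}")
    return "\n".join(rows)

if __name__ == "__main__":
    print(srt_to_filelist(sys.argv[1]))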
>>5866 > It only takes a minute to align the subs in Aegisub. Unfortunately I didn't do that ahead of time (like an idiot), and as I said I've already ripped the entire thing and I'm halfway through sorting them out (requires listening to every clip in order ofc--basically the time req'd to watch the entire episode). I can start the entire process over again for the fourth time haha if you think it would be more expedient. It would be really helpful if we had some sort of full tutorial video from you about the correct way to use Clipchan, start to finish, posted on bitchute or somewhere Anon. Regardless, I'll do what needs doing to get Chii's voice on her virtual waifu's avatar. Also, I wonder if the other characters a VA does for other animus might also be helpful in contributing to a full & capable trained model for a character?
>>5866 BTW, I'm the anon who's on the 3.8 branch...
>>5863 By the way why do you need to edit filelist.txt to make the English VA's actual words? It's already automatically generated by Clipchan. There are English subtitles for Chobits here: https://kitsunekko.net/dirlist.php?dir=subtitles%2FChobits%2F
>>5907 Simply b/c many (~ >50%) of the English subs in my source widely diverge from the actual English voice track for Chii. The longer the sentence(s), usually the worse it becomes.
>>5913 > (~ >50%) OK, that's probably an exaggeration, but it's certainly several hundreds of dialogue line examples in the whole series of 25 episodes.
>>5867 >Also, I wonder if the other characters a VA does for other animus might also be helpful in contributing to a full & capable trained model for a character? Perhaps, some VAs change their voice acting quite a bit between characters. It should give the model more data to work with. A big issue with 2B's voice is there aren't enough voice clips to cover every possible word, but I'm hoping this multi-speaker version will learn to fill in the blanks. >>5913 Oh, that sucks and makes sense. Once I finish the next version of WaifuSynth I'll see if I can extend it to speech recognition, because that's gonna be a big hassle for people trying to train their own waifu voices.
>>5915 Yeah, the VA for Chii is rather prolific WARNING: don't look her up, it will ruin everything haha! :^) and I thought that since Chii isn't really too diverse in her vocabulary (part of the storyline arc ofc), that perhaps the statistical modeling effect of AI might benefit if I can find another character she did that wasn't too far from Chii's 'normal' voice. >multi-speaker fills in the blanks That's good news. Here's hoping. >auto voice recog That would truly make this into an amazing all-in-one toolset Anon.
Open file (686.68 KB 1535x906 clipchan.png)
>>5916 Anyway, for now don't worry about resampling the clips. They should be the highest quality available before going into Spleeter. In Aegisub you can load up the video or audio, align the subtitles, type it in the proper line, and hit enter to proceed to the next one. When Clipchan is complete I'll make a video explaining the whole process.
>>5917 OK, thanks for the explanation. Sounds like I need to start over with this. Not sure what my timeline will be, probably somewhere around the Trump win.
Open file (168.92 KB 1024x1024 2B.jpg)
Open file (65.44 KB example.mp3)
Open file (33.03 KB example2.mp3)
Open file (17.94 KB example3.mp3)
For some reason I thought I uploaded the 2B voice model for WaifuSynth already but I didn't. You can get it now here: https://anonfiles.com/Hbe661i3p0/2b_v1_pt
>>5932 >2B CATS remake wehn? /robowaifu/ for great justice. This needs to happen.
>>5917 Welp, I just wasted an entire day trying to get Aegisub up and running with no success. Just as an offhand guess, I'm supposing you're not running it on Linux (but rather on W*ndows)?
>>5945 I quit using Windows over a decade ago. What problem are you having with it?
>>5945 It could be either an issue with FFMS: >After upgrading my Linux distro, i ran Aegisub and got this error >aegisub-3.2: symbol lookup error: aegisub-3.2: undefined symbol: FFMS_DoIndexinga >So i had to downgrade from ffms2 2.40 package to ffms2 2.23.1 https://github.com/Aegisub/Aegisub/issues/198 Or Wayland, Aegisub requires x11: >GDK_BACKEND=x11 aegisub does not crash. https://github.com/Aegisub/Aegisub/issues/180
>>5949 >>aegisub-3.2: symbol lookup error: aegisub-3.2: undefined symbol: FFMS_DoIndexing That was exactly the error from the distro package manager install that started me down this long bunny trail. I never found that issue link in my searches. :/ Anyways, I went and downloaded the repo and tried to build from source, but then discovered I had to have wxWidgets as well, so I had to back out and then build that from source (dev version took hours to finish, but at least it succeeded in the end). Afterwards, the Aegisub build failed with 'references not found' type errors. Too many to remember, and I tableflipped.exe and closed the terminal after all those hours in disgust so I can't recall exactly. Anyway thanks for the links. I'll try it again tomorrow.
>>5932 One thing I'm not perfectly clear on Anon, can WaifuSynth be used for other languages? For example, since animu is basically a Japanese art-form, can your system be used to create Japanese-speaking robowaifus? If so, would you mind explaining how any of us would go about setting something like that up please?
>>5861 OMFG anon this is awesome! Crafting 2B's perfect ass out of silicone will be challenging but this is all the motivation I need!
>>5648 Anon what happened to your gitlab profile? It is deleted, can you post your new one?
>>7560 Anyone downloaded this, at least for archiving reasons? This is also gone: https://anonfiles.com/Hbe661i3p0/2b_v1_pt from here >>5932
For singing, https://dreamtonics.com/en/synthesizerv/ has a free Eleanor voice, which is pretty fantastic. As with all vocaloid-type software, you have to git gud at phonemes.
>>7576 It is possible that he deleted all the contents and left robowaifu after the latest drama. He might be the anon who was involved in the latest one. If that is the case it's pretty unfortunate. I hope he recovers soon and comes back.
>>7594 Possible, but I hope this isn't it. Kind of radical. I tried to explain to him as reasonably as possible what the problem was. Whatever, I don't wanna get into that again. The more important point is: I think he gave us enough hints on how to do this stuff. I'm not claiming that I could reproduce this Clipchan program, but I had the same idea before I read it here. It's kind of obvious to take subtitles to harvest voices. Which means there will be other implementations on the net doing that and explaining how to. We don't need someone to come to us or to be into anime or robowaifus; we can just take some other implementation from another place or have someone reproduce it based on the knowledge available.
>>7594 What drama happened besides the migration? I've been too deep in my projects to browse like I used to.
Open file (40.97 KB example.mp3)
Open file (17.84 KB 863x454 example_eq.png)
Open file (18.51 KB 562x411 example_parameters.png)
I'll post this here for now, since it's definitely relevant. I was experimenting a little bit more with Deltavox RS and Audacity. It seems that there is no "one size fits all" solution when using Deltavox. In order to get a decent result, you have to experiment with different spellings, phonemes, energy, F0, bidirectional padding, and so on. In Audacity, I used a simple filter curve. I was able to get noticeably less tinny audio, which sounds less computer generated. I'm going to explore more options for editing the audio after it's been synthesized to improve its quality. I'll post again if I find anything interesting. I'll repost the links here since they're still relevant: Deltavox User Guide https://docs.google.com/document/d/1z9V4cDvatcA0gYcDacL5Bg-9nwdyV1vD5nsByL_a1wk/edit Download: https://mega.nz/file/CMBkzTpb#LDjrwHbK0YiKTz0YllofVuWg-De9wrmzXVwIn0EBiII
>>8150 Thanks. BTW, do you know if this is open source? Since Qt DLLs are included I presume this is C++ software. If both are true, then it's very likely I can rewrite this to be portable across platforms -- not just (((Wangblows))) and we can be running it on our RaspberryPis & other potatoes. Thanks for all the great information Anon.
>>8244 And one based on Vocaloid: https://youtu.be/OPBba9ScdjU
Is the voice synthesis going to be for English voices or Japanese voices? Or does one also work for the other? It would be pretty awesome if one could take voice samples of their favorite VA, vtuber, etc., put them through a voice synth A.I., and give their robowaifu her voice.
>>9110 >It would be pretty awesome if one could take voice samples of their favorite VA, vtuber, etc, put it through a voice synth A.I., and give their robowaifu that her voice. More or less, that has already been achieved Anon. So hopefully more options along that line will soon become readily available for us all.
>>9111 Oh wow it already has? Where can I read about it/try it if possible?
>>9112 Our guy here called his WaifuSynth. ITT there are examples from the ponies, who have taken a highly-methodical approach for all the main characters in the MLP:FiM cartoon show.
I see. Though all the synths seem to be for English voices. I'm guessing the 2B, Rikka, Megumin, Rem, etc. mentioned in >>5861 are referring to their English VAs rather than the Japanese ones. Unless I'm missing out on something? (If so, then maybe it'd be best for me to make some time and read this whole thread.)
>>9118 AFAICT, the training approach is just a statistical system matching sounds to words based on examples. It should work for any human language I think -- though you would need to be fluent in the target language to QA the results ofc.
>>9119 Ohh, I see. One last thing: I wouldn't be wrong to assume that, since the dropping of "kokubunji", there is no one working on the voice for robowaifu?
>>9112 WaifuSynth: https://gitlab.com/robowaifudev/waifusynth Clipchan: https://gitlab.com/robowaifudev/clipchan There are better methods now like FastPitch and HiFiSinger. FastPitch is about 40x faster than Tacotron2/Waveglow (what WaifuSynth uses) and is less susceptible to generation errors but is still far from perfect. HiFiSinger uses three different GANs to make realistic speech and singing, and its 48kHz model outperforms the 24kHz ground truth but it still has room for improvement in 48kHz, although I suspect it could be near perfect by training a 96kHz model. FastPitch: https://fastpitch.github.io/ HiFiSinger: https://speechresearch.github.io/hifisinger/ There's still a lot of research to be done before this stuff will be ready for production, namely imitating voices without training a new model, emotion/speech style control, and ironing out undesired realistic noises in generation. Probably in the next 2-3 years it will be easy to make any character sing given a song or read out any given text, and you won't have to go through the whole hassle of collecting audio clips and training models yourself. >>9118 Making Japanese VAs speak English and English VAs speak Japanese should be possible but you will have to train a model that converts the input to phonemes, otherwise it will just garble and misread text. Training Tacotron2 takes a really long time so I'd recommend modifying FastPitch to use phonemes instead of characters. All you have to do is instead of inputting characters like 's a m u r a i', you input the IPA 's a m ɯ ɾ a i'. You can probably save a lot of time on training by initializing the embeddings of the IPA symbols to the character embeddings of a pretrained model, then train it on LJSpeech or another dataset until it sounds good, then fine-tune it on the desired character. This paper reports that it only takes 20 minutes of audio to speak a new language using an IPA Tacotron2 model but they don't provide their trained model or code: https://arxiv.org/abs/2011.06392v1
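As a concrete illustration of seeding the IPA embeddings from pretrained character embeddings, here's a hypothetical sketch (the symbol tables and char-to-IPA mapping are made up for the example; a real run would load the pretrained table from a checkpoint instead of a fresh nn.Embedding):

import torch
import torch.nn as nn

char_symbols = list("abcdefghijklmnopqrstuvwxyz")
ipa_symbols = ['a', 'i', 'ɯ', 'e', 'o', 'k', 's', 't', 'n', 'h', 'm', 'ɾ', 'w', 'j']
ipa_from_char = {'a': 'a', 'i': 'i', 'u': 'ɯ', 'e': 'e', 'o': 'o', 'k': 'k', 's': 's',
                 't': 't', 'n': 'n', 'h': 'h', 'm': 'm', 'r': 'ɾ', 'w': 'w', 'y': 'j'}

pretrained_char_emb = nn.Embedding(len(char_symbols), 512)  # stands in for the pretrained table
ipa_emb = nn.Embedding(len(ipa_symbols), 512)

with torch.no_grad():
    for ch, ipa in ipa_from_char.items():
        ipa_emb.weight[ipa_symbols.index(ipa)] = pretrained_char_emb.weight[char_symbols.index(ch)]
# IPA symbols with no obvious character counterpart keep their random init and are learned from scratch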
>>9121 Also, you can use https://ichi.moe/ to convert Japanese subtitles from https://kitsunekko.net/dirlist.php?dir=subtitles%2Fjapanese%2F into romaji and then convert the romaji to IPA. Japanese IPA is straightforward since the syllables sound exactly the same as they are written, unlike English: https://en.wikipedia.org/wiki/Help:IPA/Japanese
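A toy romaji-to-IPA converter along those lines, with a deliberately tiny mapping table (illustrative only, nowhere near a complete transliteration):

ROMAJI_IPA = {
    'shi': 'ɕi', 'chi': 'tɕi', 'tsu': 'tsɯ', 'fu': 'ɸɯ', 'ji': 'dʑi',
    'ka': 'ka', 'sa': 'sa', 'ta': 'ta', 'na': 'na', 'ha': 'ha', 'ma': 'ma',
    'ra': 'ɾa', 'ri': 'ɾi', 'ru': 'ɾɯ', 'mu': 'mɯ', 'wa': 'wa', 'ya': 'ja',
    'a': 'a', 'i': 'i', 'u': 'ɯ', 'e': 'e', 'o': 'o', 'n': 'ɴ',
}

def romaji_to_ipa(word):
    out, i = [], 0
    while i < len(word):
        for length in (3, 2, 1):                 # longest match first
            chunk = word[i:i + length]
            if len(chunk) == length and chunk in ROMAJI_IPA:
                out.append(ROMAJI_IPA[chunk])
                i += length
                break
        else:
            i += 1                               # skip anything not in the table
    return ' '.join(out)

print(romaji_to_ipa("samurai"))  # -> sa mɯ ɾa i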
>>9121 >>9122 !! Thanks so much Anon!
>>9121 Nice, thanks! I have a 'new' machine (well, old but still much better than my old notebook) pieced together that has an i3 and an Nvidia GT430 (or possibly a 750ti). Not too impressive I know, but I could use it to take a shot at setting up Clipchan again. Mind giving me specific setup advice Anon? Like the OS to use, Python version to use, etc., etc. The more specific, the better. TIA.
>>9125 2 GB might not be enough to train FastPitch but you might squeeze by with gradient checkpointing and gradient accumulation to reduce memory usage. A 1 GB card will certainly be too little since the model parameters are 512MB and you need at least twice that to also store the gradient. If it doesn't work you could shrink the parameters down and train a lower quality model from scratch. However, it seems the GT430 supports CUDA 2.1 and the 750Ti supports 5.0. 2.x capability was removed in CUDA 9 and 6.x removed in CUDA 10. If you're lucky you might be able to get them to still work by compiling PyTorch and Tensorflow with the CUDA version you need, but I wouldn't bet on it. I'd recommend using at least Python 3.7 and Tensorflow==2.3.0 since Spleeter requires that specific version. If someone has a newer GPU with at least 6 GB they'll have to download a Tensorflow 2.3.0 wheel with CUDA 11.1 because Tensorflow only supported 10.1 until version 2.4. A Tensorflow 2.3.0 + CUDA 11.1 wheel for Python 3.8 is available here: https://github.com/davidenunes/tensorflow-wheels Again this is only necessary if you have a newer GPU with at least 6 GB. Spleeter will run fine on the CPU. I use Python 3.7 and don't feel like compiling a Tensorflow 2.3.0 wheel for it so I just modified Spleeter's requirements.txt to support tensorflow==2.4.0 and numpy>1.16.0,numpy<=1.19.5 and installed it from source. Spleeter will still work and output the clean voice clips but crash after finishing. This error can be patched by commenting out the del function in spleeter/separator.py:135 since Tensorflow 2.4 closes the session automatically. I'm using PyTorch 1.8.0 with CUDA 11.1 since it supports the full capabilities of my GPU. To use either PyTorch or Tensorflow easily you'll need at least a 4th generation i3 which has AVX2 support. Otherwise you'll have to look for community pip wheels compiled without AVX/AVX2 for your specific version of Python. Older versions of PyTorch are compatible with most deep learning models but lack the newer torchaudio which is an amazing library for processing and training on audio that will certainly start seeing some use soon. Lastly, Clipchan needs to be updated to support the latest Spleeter version which completely changed everything again, but that'll be an easy fix. I'll patch it tonight.
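On the gradient accumulation point, here's a minimal sketch of the pattern with a dummy model and stand-in data (purely illustrative): four small batches accumulate gradients before one optimizer step, so it behaves like a 4x larger batch without the 4x memory.

import torch
import torch.nn as nn

model = nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(2, 80), torch.randn(2, 80)) for _ in range(8)]  # stand-in batches

accum_steps = 4
optimizer.zero_grad()
for step, (x, target) in enumerate(loader):
    loss = loss_fn(model(x), target) / accum_steps
    loss.backward()                     # gradients add up across the small batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()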
>>9135 Great, thanks for the specific details. Hmmm, from what you're saying it sounds like I still won't be able to run it even with the better (by comparison) hardware. I'm sure someday we'll be able to do this stuff on much more modest hardware. I'll just be patient and focus on other things till then. :^) >Lastly, Clipchan needs to be updated to support the latest Spleeter version which completely changed everything again, but that'll be an easy fix. I'll patch it tonight. Don't rush on my account Anon. Maybe it will help others though.
Due to the 'rona my workspace is limited to a shitty laptop. What's the best (or rather, least worst) model one could conceivably run on a CPU?
>>9136 It works fine on better hardware. The problem is backwards compatibility is a foreign concept to Tensorflow so people end up locking their projects to old versions and creating a hellish nightmare of dependency conflicts. Also short clips don't take up too much memory. Only when processing 10 minute songs does it use up to 6 GB. To avoid this Clipchan processes each clip individually. And Clipchan has been updated now to v0.3. I had to fix it anyway to help someone get voice clips. It's essentially finished, besides making it simpler to use and ideally creating a GUI for it. The most important options are -dza which cleans audio with Spleeter, speeds up subtitle processing, and auto-crops the audio clips. For Tacotron2 -r 22050 -c 1 are also needed to resample and mix stereo to mono (they require the -a option to have any effect right now.) If you don't have FFmpeg with libzvbi, then omit the -z option. And some fresh Natsumi Moe voice clips from v0.3 ready for Tacotron2: https://files.catbox.moe/ipt13l.xz Still a work in progress but there's about 10 minutes of usable audio there. >>9149 Not sure, you won't get much of a speed up running FastPitch on CPU compared to Tacotron2. It's possible for fine-tuned models to be pruned and compressed down so they can run on mobile devices, but I'm not aware of anyone who has taken the time to do that. Pruning and compressing doesn't apply to training though, only works with inference.
>>9150 Thanks for the Natsumi Moe clips Anon! A cute. I hope someday we can manage a basic Chii library to generate voices from. Sounds like certain interests are specifically trying to keep their tools from working with older systems -- even their own systems haha. Doesn't sound like I (and thousands more like me) will ever be able to use this tool at that rate. Maybe if someone creates some Docker or other kind of containers that are tuned for different hardware setups, then we might be able to break free of this intentionally-created ratrace they intend us to run.
>>9121 >Cloning into 'fastpitch.github.io'... >fatal: repository 'https://fastpitch.github.io/' not found >Cloning into 'hifisinger'... >fatal: repository 'https://speechresearch.github.io/hifisinger/' not found Am I starting to get paranoid, or are they really onto us?
>>9162 Those are demo pages, not repositories. FastPitch: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch HiFiSinger (unofficial implementation): https://github.com/CODEJIN/HiFiSinger
>>9162 Those aren't git repositories Anon. Browse there and read the pages.
>>9163 >FastPitch: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch Hmm, the only way I can actually get a clone to work is by going up in the tree a bit? git clone --recursive https://github.com/NVIDIA/DeepLearningExamples.git
>>9165 Git 2.25.0 includes a new experimental sparse-checkout command:
git clone --filter=blob:none --sparse https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
git sparse-checkout init --cone
git sparse-checkout add PyTorch/SpeechSynthesis/FastPitch
>>9150 Yeah, I'm not going to do any training, just inference from released checkpoints. I did manage to get a FastSpeech2 model running with some pretty good results, although for some reason it adds garbled echoes after the generated speech.
>>9179 Ahh, didn't know about that one, thanks Anon.
Open file (156.72 KB 555x419 overview.webm)
Open file (50.42 KB 445x554 conformer.png)
A novel voice converter that outperforms FastSpeech2 and generates speech faster. Although it doesn't do speech synthesis from text it introduced a convolution-augmented Transformer that could easily be adapted into FastSpeech2 and FastPitch to improve the quality of synthesized speech. https://kan-bayashi.github.io/NonARSeq2SeqVC/
>>10159 Quality sounds excellent. Thanks Anon.
>>9121 >>10383 >robowaifudev
Facebook made a great speech generator about a year ago: https://ai.facebook.com/blog/a-highly-efficient-real-time-text-to-speech-system-deployed-on-cpus/ - It's not free software, but they described how it is built. Yannic Kilcher goes through the system and explains it here: https://www.youtube.com/watch?v=XvDzZwoQFcU One interesting feature is that it runs on a CPU with 4 cores (not the training, of course). On such a CPU it is faster than real-time, which means it generates the audio faster than playing it back would take. Something like this might be very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu.
>>10393 >Something like this might be something very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu. It certainly would, if we can somehow obtain access to it or reproduce it, Anon. Thanks for the heads-up, and for the video link. It really helps to get the points across well for us. youtube-dl --write-description --write-auto-sub --sub-lang="en" https://www.youtube.com/watch?v=XvDzZwoQFcU
Open file (111.71 KB 286x286 speechbrain.png)
not sure if this has been posted before, but I came across this and immediately thought of some of the todo list for clipchan. https://speechbrain.github.io/index.html seems like there was some discussion about emotion and speaker ID classifiers.
>>10458 Very cool Anon, thanks. It looks like it's a solid and open source system too, AFAICT.
The model link is dead. While I can train a new model, I am looking to avoid that step right now because of other deadlines, and I would love to include 2B in WaifuEngine. Would anyone be willing to mirror it or provide an updated link? Thanks
>>10499 ATTENTION ROBOWAIFUDEV I'm pretty sure the model in question is your pre-trained one for 2B's WaifuSynth voice, ie, https://anonfiles.com/Hbe661i3p0/2b_v1_pt >via https://gitlab.com/robowaifudev/waifusynth cf. (>>10498, >>10502)
>>10504 Links are both dead
>>10504 To clarify: the pretrained model links are both dead, but the repo is still up.
Open file (14.93 KB 480x360 hqdefault.jpg)
Great! Now my waifu can sing me a lullaby so I can sleep well. The only problem is that I don't have the Vocaloid editor. Video demonstration: https://youtu.be/mxqcCDOzUpk Github: https://github.com/vanstorm9/AI-Vocaloid-Kit-V2
Open file (548.83 KB 720x540 lime_face_joy.png)
>>5521 >Cute robowaifu Check >Inspiring message to all weebineers everywhere Check >Epic music Check Best propaganda campaign. 10/10, would build robowaifu. >>5529 >>5530 Damn it lads! You're bringing me closer to starting sampling Lime's VA heheheh (Although I was hoping to use my voice to generate a somewhat convincing robowaifu, so as to minimise reliance on females).
>>11229 Forgot to add. >>5532 >I don't know if that'll be enough. Chii didn't really talk much. You're overcomplicating it. I think he meant create a tts that outputs "Chii" regardless of what you put in ;) (Although you could add different tonality and accents, might be a more fun challenge).
>>10504 Sorry, been busy and haven't been active here lately. Updated the repo link: https://www.mediafire.com/file/vjz09k062m02qpi/2b_v1.pt/file This model could be improved by training it without the pain sound effects. There's so many of them it biased the model which causes strange results sometimes when sentences start with A or H.
>>11474 Thanks! Wonderful to see you, I hope all your endeavors are going well Anon.
>>11474 come join my doxcord server if you have time and pm me! thanks for the model, you will likely see it used on the 2B "cosplay" waifu we may have in the game
>>11480 The link is expired. What would you like to talk about? I don't have a lot to add. You can do some pretty interesting stuff with voice synthesis by adding other embeddings to the input embedding, such as for the character in a multi-character model, emphasis, emotion, pitch, speed, and ambiance (to utilize training samples with background noise.) This is what Replica Studios has been doing: https://replicastudios.com/
>>11522 If you are interested, I am looking for someone to take over the speech synthesis part of WaifuEngine. I got it working; however, working on it as a specialty takes me away from the rest of the application. For example, I want to train a new model using Glow-TTS but my time is limited, and I also have to work on the various other aspects of the project to get it off the ground. Right now our inference time using Tacotron2 isn't great unless you have a GPU. As for compensation on the project, so far I have been giving away coffee money as we have little resources haha. If the project gets bigger and more funding, I'd be willing to help the project contributors out. https:// discord.gg/ gBKGNJrev4
>>11536 In August I'll have some time to work on TTS stuff and do some R&D. I recommend using FastPitch. It's just as good as Tacotron2 but 15x faster on the GPU and 2x faster on the CPU than Tacotron2 is on the GPU. It takes about a week to train on a toaster card and also already has stuff for detecting and changing the pitch and speed, which is essential to control for producing more expressive voices with extra input embeddings. https://fastpitch.github.io/
>>11550 I'd message you on Discord about this, but it could be useful info for the board. Essentially, I did use FastPitch originally; the issue is the teacher-student training methodology, where you have to use Tacotron to bootstrap and predict the durations for alignment. When you don't do that and just fine-tune the LJS model of FastPitch, it fails to predict the durations. We can definitely try this method, I am open to it; I guess in my time crunch I didn't bother. I am optimizing for delivery so that we have a product people can use and enjoy. It should be very simple to update the models in the future; it would be a one-Python-script change based off my architecture.
>>11559 The 2B model I made was finetuned on the pretrained Tacotron2 model and only took about an hour. Automating preprocessing the training data won't be a big deal. And if a multi-speaker model is built for many different characters it would get faster and faster to finetune. I've been looking into Glow-TTS more and the automated duration and pitch prediction is a nice feature but the output quality seems even less expressive than Tacotron2. A key part of creating a cute female voice is having a large range in pitch variation. Also I've found a pretrained Tacotron2 model that uses IPA. It would be possible to train it on Japanese voices and make them talk in English, although it would take some extra time to adapt FastPitch to use IPA. Demo: https://stefantaubert.github.io/tacotron2/ GitHub: https://github.com/stefantaubert/tacotron2
Some other ideas I'd like to R&D for voice synthesis in the future:
- anti-aliasing ReLUs or replacing them with swish
- adding gated linear units
- replacing the convolution layers with deeper residual layers
- trying a 2-layer LSTM in Tacotron2
- adding ReZero to the FastPitch transformers so they can be deeper and train faster (a quick sketch follows below)
- training with different hyperparameters to improve the quality
- using RL and human feedback to improve the quality
- using GANs to refine output like HiFiSinger
- outputting at a higher resolution and downsampling
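Since ReZero is nearly a one-liner to try, here's a small sketch of the idea (sizes are arbitrary): the residual branch is scaled by a learnable alpha initialized to zero, so a deep stack starts out as the identity and tends to train more stably.

import torch
import torch.nn as nn

class ReZeroFFN(nn.Module):
    def __init__(self, dim=384, hidden=1536):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.alpha = nn.Parameter(torch.zeros(1))  # starts at 0, learned during training

    def forward(self, x):
        return x + self.alpha * self.ffn(x)

x = torch.randn(2, 100, 384)
print(ReZeroFFN()(x).shape)  # torch.Size([2, 100, 384])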
>>11569 Thanks, but what's the point of this IPA? To let it talk correctly in other languages? >Der Nordwind und die Sonne - German with American English accent I can assure you: it doesn't work. Americans talking German often (always) sounds bad, but this is a level of its own. Absolutely bizarre.
>>11571 Yeah, I live around Chinese with thick accents and this takes it to the next level, kek. That's not really the motivation for using IPA though. This pilot study used transfer learning to intentionally create different accents, rather than copy the voice without the accent. How IPA is useful for generating waifu voices is that it helps improve pronunciation, reduces the needed training data, and solves the problem with heteronyms, words spelled the same but pronounced differently: https://jakubmarian.com/english-words-spelled-the-same-but-pronounced-differently/ When models without IPA have never seen a rare word in training, such as a technical word like synthesis, they will usually guess incorrectly how to pronounce it, but with IPA the pronunciation is always the same and they can speak the word fluently without ever having seen it before. Also in a multi-speaker model you can blend between speaker embeddings to create a new voice, and it's possible to find interpretable directions in latent space. Finding one for accents should be possible, which could be left in the control of the user's preferences to make a character voice sound more American, British or Japanese and so on.
>>11577 Ah, okay, this sounds pretty useful. One more problem comes to mind in regards to this. In English foreign names are often changed in pronunciation, because the name would sound "strange" otherwise. The philosopher Kant would sound like the c-word for female private parts. Therefore they pronounce it Kaant. I wonder if the method helps with that as well.
>>11582 In that case it depends on what language you transliterate with. If necessary, names could be transliterated as they're supposed to be pronounced in their original language, or it could all be in the same language. Exceptions could also be defined. For example, the way Americans pronounce manga is quite different from the Japanese. If someone wants their waifu to sound more like a weeb and pronounce it the Japanese way, they could enter the Japanese IPA definition for it to override the default transliteration.
Open file (18.23 KB 575x368 preview.png)
Open file (62.21 KB 912x423 aegisub.png)
Finished creating a tool for automatically downloading subtitles and audio clips from Youtube videos, which can be reworked in Aegisub or another subtitle editor, then converted into a training set with Clipchan. https://gitlab.com/robowaifudev/alisub
>>11623 This sounds exciting Anon, thanks! >or another subtitle editor Can you recommend a good alternative Anon? I've never been able to successfully get Aegisub to run.
>>11624 Someone recommended SubtitleEdit but it's Windows only: https://nikse.dk/SubtitleEdit Subtitle Editor can display waveforms but it's far more difficult to use and I don't recommend it.
>>11623 Okay, thanks. This could be useful for more than that, I guess. Maybe later to train the system on lip reading using YouTube, for example. Or maybe for training voice recognition in the first place? How much data do we need to emulate a particular voice?
>>11625 OK, thanks for the advice. I'll try and see if I can set it up on a virtual box instead or something, Aegisub did look pretty easy to use (first time I've seen it in action, so thanks again). The problem is always a wxWidgets dependency hell issue. I can even get it to build, right up to link time.
>>11631 To finetune a pretrained model you need about 20 minutes. Training a model from scratch takes about 12 hours. Multispeaker models trained on hundreds of voices can clone a voice with a few sentences but still need a lot of samples to capture all the nuances.
Been doing some work to get WaifuEngine's speech synthesis to run fast on the CPU and found that FastPitch has a real-time factor of 40x and WaveGlow 0.4x. This led me to test several different vocoder alternatives to WaveGlow and arrive at multi-band MelGAN with an RTF of 20x. So FastPitch+MelGAN has an RTF of 12x, which means it can synthesize 12 seconds of speech every second, or about 80ms to generate a second of speech. "Advancing robotics to a point where anime catgirl meidos in tiny miniskirts are a reality" took MelGAN 250ms on CPU to generate from 2B's Tacotron2 Mel spectrogram. Now I just gotta set up this shit so it's easy to train end-to-end and the whole internet and their waifus are getting real-time waifus. Multi-band MelGAN repo: https://github.com/rishikksh20/melgan Multi-band MelGAN paper: https://arxiv.org/abs/2005.05106 Original MelGAN paper: https://arxiv.org/abs/1910.06711
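As a sanity check on those numbers, chained stages combine roughly like this (a toy calculation; the measured 12x is a little under the ideal because of glue overhead):

# each stage processes 1 second of audio in 1/RTF seconds of wall time,
# so the wall times add and the pipeline RTF is the reciprocal of that sum
def combined_rtf(*rtfs):
    return 1.0 / sum(1.0 / r for r in rtfs)

print(combined_rtf(40, 20))   # FastPitch + multi-band MelGAN -> ~13.3x (measured ~12x)
print(combined_rtf(40, 0.4))  # FastPitch + WaveGlow -> ~0.4x, WaveGlow dominates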
>>11636 Interesting, thanks, but I meant how many samples we need to fine-tune a voice. I also wonder if voices are being 'blended' that way. Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress. >>11647 Thanks for your work. I thought voice generation would take much more time to do. Good to know. Responses to someone talking should be fast.
Open file (153.25 KB 710x710 gawr kilcher.jpg)
>>11648 I meant 20 minutes and 12 hours of samples. Finetuning with 20 minutes of samples takes about 1-2 hours on my budget GPU. >Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress. This definitely deserves more thought. If every person on the internet will be able to do speech synthesis and there is a tsunami of voice cloning characters, it's important people are able to have creative freedom with it while the buzz is on. People's curiosity will further advance speech synthesis and diffuse into other areas of AI, including waifu tech. On the other hand, if people only straight up copy voices then it would cause a media shitstorm and possibly turn people away, but that could also have its benefits. Whatever happens though, the accelerator is stuck to the floor. In the meantime while the hype builds, iteration can continue on until the synthesis of Gawr Kilcher is realized. When people look closely though, they'll notice it's neither Yannic nor Gura but actually Rimuru and Stunk all along.
>>11647 Thanks for the information, Anon.
>>11650 kek. i just noticed that logo. i wonder what based-boomer AJ would think of robowaifus. white race genocide, or crushing blow to feminazis and freedom to all men from oppression?
>>11677 He doesn't like them or AI in general. Said something once like people are going to stop having kids and masturbate with a piece of plastic all day and how the government is going to know everything about people through them and be able to manipulate them perfectly. He's not really wrong. Look how many people already give up all their data using Windows and Chrome.
>>8151 >>12193 A routine check on the Insights->Traffic page led me here. While the program itself is written with Qt, what actually makes the voices work (Voice.h and beyond) does not contain a single trace of Qt (well, almost, but what little there is is just error boxes). This is a deliberate design decision to allow the actual inference engine to be copied and ported anywhere with minimal trouble. For inference on embedded devices you probably want to use TFLite, which is on my list because I plan on Windows SAPI integration.
>>12257 Hello Anon, welcome. We're glad you're here. Thanks for any technical explanations, we have a number of engineers here. Please have a look around the board while you're here. If you have any questions, feel free to make a post on our current /meta thread (>>8492). If you decide you'd like to introduce yourself more fully, then we have an embassy thread for just that (>>2823). Regardless, thanks for stopping by!
In need of some help... I want to create a speech synthesizer: I want to take samples of my waifu's voice (which I have a lot of) and use them to digitally create her voice. First of all, is it possible? The voice samples I have are not the kind that this video shows https://youtu.be/_d7xRj121bs?t=55 , they're just in-game dialog. It is also worth noting that the voice is in Japanese. If it is possible, I still have no idea where to begin with this. I'm guessing I'll need some sound tech knowledge (which I have none of) and that's about all I can think of. In terms of programming languages, I know Python fairly well and am currently getting into C++. Anons, how do I get started with this?
>>13811 >I still have no idea where to begin with this Welcome. Then look through the thread and into the programs mentioned. You will probably need to train some neural network on a GPU. Also, you would need to extract the voices from the game and also have the corresponding words as text. If you can't get them as files, then you might need to record them with a microphone. Then you would need to transcribe the text. Lurk around, maybe someone else knows more, and just ignore the disgusting troll insulting everyone.
>>13811 As dumb as this might sound, you might want to check out /mlp/ on 4chan; there are 100+ threads about doing this with My Little Pony characters, called the "Pony Preservation Project", and they've actually made some decent progress.
>>13818 >look through the thread and into the programs mentioned Will do. >probably need to train some neural network on a GPU Have yet to get into neural networks but looks like the time has come. >extract the voices I've done that, from the game files too so they're of decently high quality. >transcribe the text That I need to do. >>13821 >as dumb as this might sound Nothing dumb about it if it works. Will give them a visit. Thank you, Anons!
>>13823 Keep us updated if there's progress anon, speech synthesis is a fascinating field. I'd love to try it out myself later in the year once I have more time
This may be old news, since it's from 2018, but Google's Duplex seems to have a great grasp on conversational speech. I think it says a lot when I had an easier time understanding the robot versus the lady at the restaurant (2nd audio example in the blog). https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html
>>14270 Hi, I knew that this had been mentioned before somewhere. Didn't find it here in this thread nor with Waifusearch. Anyways, it's in the wrong thread here, since this is about speech synthesis but the article is about speech recognition. The former conversation probably happened in the chatbot thread. >One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains. It cannot carry out general conversations. This is exactly the interesting topic of the article. Good reminder. A few months or a year ago I pointed out that recognizing all kinds of words, sentences and meanings will be one of our biggest challenges. Especially if it should work with all kinds of voices. Some specialists (CMU Sphinx) claimed it would currently require a server farm with terabytes of RAM to do that, if it was even possible. We'll probably need a way to work around that. Maybe using many constrained models on fast SSDs which take over, dependent on the topic of conversation. Let's also hope for some progress, but also accept that the first robowaifus might only understand certain commands.
>>11623 You should replace youtube-dl with yt-dlp. youtube-dl is no longer maintained and has issues with some youtube videos.
>>15192 Thanks for the tip Anon. Having used youtube-dl for years now, I too noticed the sudden drop-off in updates that occurred following the coordinated attack by RIAA/Microsoft against its developer & user community. We'll look into it.
Open file (73.10 KB 862x622 IPA_synthesis.png)
I think I've finally figured out a way to train more expressive voices in conversation without having to label a ton of data. First, the English text needs to be transcribed into IPA so that a speech synthesis model can easily predict how words are spoken without requiring a huge dataset covering all the exceptions and weirdness of English. The English transcription or IPA is projected into an embedding that's split into two parts. One part constrained to representing the content as IPA via projecting those features back into IPA symbols and minimizing the cross entropy loss. The other half modelling the style, such as the emotion and other subtleties, to match the audio examples more faithfully, which are trained through the Mel spectrogram loss. This way the model can learn all aspects of speech through just the text labels and audio examples alone. At inference time this style embedding could be modified to change the emotion, pitch, cadence, tone and other qualities of the model for voice acting or creating examples for finetuning the model towards a desired personality. A ByT5 model could be used to transcribe English and other languages into the IPA embedding + style embedding. It could also take into account the previous context of the conversation to generate a more appropriate style embedding for the speech synthesis model to work from. Training from context though will require new datasets from podcasts that have such context. I've collected some with existing transcripts and timestamps for this already. The transcripts just need to be accurately aligned to the audio clips for clipping, so it's not an unfeasible project for one person to do. Other possibilities for this could be adding tags into the text training data that get filtered out from the content via the IPA cross entropy loss, ensuring the tags only affect the style embedding. You could indicate tempo, pitches, velocity and note values for singing which would be learned in the style embeddings. It could also be used for annotating different moods or speaking styles such as whispering or yelling. There's a ton of possibilities here for more versatile speech synthesis and natural conversation.
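A bare-bones sketch of the split content/style embedding described above, in case it helps make the idea concrete (all names and sizes here are my own placeholders, not an existing implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitEmbedding(nn.Module):
    def __init__(self, n_symbols, content_dim=256, style_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, content_dim + style_dim)
        self.content_dim = content_dim
        self.to_ipa = nn.Linear(content_dim, n_symbols)  # projects the content half back to IPA symbols

    def forward(self, symbol_ids):
        e = self.embed(symbol_ids)
        content, style = e[..., :self.content_dim], e[..., self.content_dim:]
        ipa_logits = self.to_ipa(content)  # auxiliary head constraining the content half
        return content, style, ipa_logits

emb = SplitEmbedding(n_symbols=100)
ids = torch.randint(0, 100, (2, 30))
content, style, ipa_logits = emb(ids)
aux_loss = F.cross_entropy(ipa_logits.transpose(1, 2), ids)  # added to the usual Mel spectrogram loss
print(content.shape, style.shape, aux_loss.item())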
>>15874 Pony Preservation Project anon here. I recommend checking out the presentations linked from the PPP threads, especially Cookie's segments.
derpy.me/pVeU0
derpy.me/Jwj8a
In short:
- Use Arpabet instead of IPA. It's much easier to get Arpabet data than IPA data, and Arpabet is good enough (a quick grapheme-to-Arpabet sketch follows this post).
- Use BOTH Arpabet and English transcriptions. Each datapoint should contain one or the other for the transcription, and the dataset as a whole should contain both Arpabet and English transcriptions.
- Use a natural language model to augment your data with emotion embeddings. The pony standard is to use DeepMoji embeddings. Some anon has used TinyBERT for supposedly-better effect. I assume if you're using a language model like TinyBERT, you'd need to create a prompt that gets the network to associate an emotion with the text, then use the embedding for the token associated with that emotion prediction.
- Use HiFiGAN for the vocoder.
We've also found that text-to-speech isn't always suitable for waifu-speak. Sometimes (often), people want to be able to use a reference audio to get the prosody exactly right. For that, you'll want to use TalkNet2.
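For the grapheme-to-Arpabet step, one commonly used package is g2p_en (pip install g2p-en), which looks words up in CMUdict and guesses out-of-vocabulary ones with a small model. A hedged sketch assuming its G2p interface:

from g2p_en import G2p

g2p = G2p()
phones = g2p("Advancing robotics to a point where anime catgirl meidos are a reality")
print(phones)  # a list of Arpabet tokens with stress digits, e.g. ['AH0', 'D', 'V', 'AE1', 'N', ...]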
>>15192 >>15193 > (>>12247 youtube-dl takedown -related) > (>>16357 yt-dlp installation & scripting -related)
>>16606 You ponies have done some nice work through your primary overarching project Anon. Thanks for the recommendations! :^) Cheers.
Open file (63.51 KB 985x721 audioreq.png)
The idea is that the neural network should eat the same audio stream but with different parameters. For example: if the neural network does not recognize standard speech, then there is a chance of recognizing the same audio at "Pitch -1 / Speed -1". Most likely this method has already been implemented and has long been used; if not, it seems to me it could solve the main difficulties in speech recognition, word understanding, etc.
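A quick sketch of that retry idea using librosa to make the pitch/speed variants (the file name and step sizes are just placeholders):

import librosa

y, sr = librosa.load("sample.wav", sr=16000)   # placeholder clip

variants = [y]
for n_steps in (-1, 1):                        # "Pitch -1 / +1"
    variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps))
for rate in (0.9, 1.1):                        # "Speed -1 / +1"
    variants.append(librosa.effects.time_stretch(y, rate=rate))

# feed each variant to the recognizer and keep the transcription it is most confident about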
>>16664 I had similar thoughts, and my thinking was that we might profit from using specialized hardware (probably some ASIC) close to the cameras and ears, automatically creating different versions of the data. For audio it might be a DSP, but I don't know much about that. Filtering out certain frequencies or doing noise cancellation might also be helpful. Basically we need SBC-sized hardware which can do that very efficiently and fast, outputting various versions from only a few inputs.
Would it be easier to go with an old school approach like this? https://m.youtube.com/watch?v=J_eRsA7ppy0
>>16664 This posting and the response should be moved into a thread about audio recognition or conversational AI / chatbots, since this thread is about speech synthesis not recognition.
>>16669 The techniques for the audio in there are studied now under phonetics. The techniques for the video in there are studied under articulatory synthesis. Articulatory synthesis is difficult and computationally expensive. I don't know of a good, flexible framework for doing that, so I wouldn't know how to get started on waifu speech with that. Under phonetics, the main techniques before deep neural networks were formant synthesis and concatenative synthesis. Formant synthesis will result in recognizable sounds, but not human voices. It's what you're hearing in the video. Concatenative synthesis requires huge diphone sound banks, which represent sound pieces that can be combined. (Phone = single stable sound. Diphone = adjacent pair of phones. A diphone sound bank cuts off each diphone at the midpoints of the phones since it's much easier to concatenate phones cleanly at the midpoints rather than the endpoints. This is what Hatsune Miku uses.) Concatenative synthesis is more efficient than deep neural networks, but deep neural networks are far, far more natural, controllable, and flexible. Seriously, I highly recommend following in the PPP's footsteps here. Deep neural networks are the best way forward. They can produce higher quality results with better controls and with less data than any other approach. Programmatically, they're also flexible enough to incorporate any advances you might might see from phonetics and articulatory synthesis. The current deep neural networks for speech generation already borrow a lot of ideas from phonetics.
>>16684 Thanks for the advice Anon!
Open file (127.74 KB 1078x586 WER.png)
Open file (83.58 KB 903x456 models.png)
Open file (82.91 KB 911x620 languages.png)
https://github.com/openai/whisper
Oh shit, audio transcription has surpassed average human level and is now competitive with professional transcription. OpenAI has gone off its investor rails and completely open-sourced the model and weights. On top of that it's multilingual and can do Japanese fairly well. This could be used for transcribing audio from vtubers, audio books, and anime with missing subtitles. Unfortunately it doesn't do speaker detection as far as I know, but it might be possible to train another model to use the encoded audio features to detect them.
Install:
python -m pip install git+https://github.com/openai/whisper.git --user
Quick start:
import whisper
model = whisper.load_model("base", device="cuda")  # set device to cpu if no CUDA
result = model.transcribe("chobits_sample.mp3", language="en")  # multilingual models will automatically detect language, but not English only models
print(result["text"])
Output (base):
> Yuzuki. I brought you some tea. Huh? Huh? Why are you serving the tea? The maid, Persecom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. Wee. I can handle this on my own. I want you to try to relax. Oh. Oh? Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all we really need is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motu-suwa, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! Huh? There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki.
Interestingly the VA actually said "persecom" instead of persocom and Motusua instead of Motosuwa, which transcribed as "Motu-suwa". The poor pronunciation of "all he really needs is a good night's sleep" sounded a lot like "all we really need is a good night's sleep" and was transcribed as such. The only other errors were transcribing a Chii processing sound effect as "wee", mistaking Minoru saying "ah!" as "huh?", the clatter of teacups being transcribed as "oh", and Minoru saying "ugh" as "oh?"
Output (small):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki?
"Ah! Huh?" from Minoru and Hideki were omitted. "Ugh" was also omitted when Minoru passes out. It understood persocom wasn't a name but still misspelled it "persicom".
Chii's sound effect wasn't transcribed as "wee" this time. Motosuwa got transcribed as "Motosua". This model understood "all he really needs" but made a mistake at the end thinking Hideki was asking a question saying Yuzuki. Output (medium): > Yuzuki! I brought you some tea. Ah! Huh? Why are you serving the tea? The maid, Persicom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki... This one got the ellipsis right at the end and recognized Minoru saying "ah!" but mistook persocom as a name, Persicom. "Ugh" was omitted. Output (large): >Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki... "Ah! huh?" were omitted and it understood persocom wasn't a name but still spelled it as "persicom".
>>17474 (continued)
Output (tiny):
> Useuki. I brought you some tea. Ugh. Huh? Why are you serving the tea? The maid, Percicom, is currently being used by the system. What are you talking about? Useuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord, Minoro. Thank you very much. I can handle this on my own. I want you to try to relax. Oh. Minoro! Minoro! Minoro! Well, I'm glad to know that all we really need is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motu, so it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! Huh? There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoro under such extreme stress. If only I was just more useful. Let's use a key.
Tons of errors, not particularly usable.
Output (tiny.en):
> Yuzuki! I brought you some tea. Oh! Why are you serving the tea? The maid purse-a-com is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord, me no no. Thank you very much. I can handle this on my own. I want you to try to relax. Oh. Do you know who? What do you know her? What do you know her? Well, I'm glad to know that all he really needs is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motusua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minaro under such extreme stress. If only I was just more useful. Oh, Miss Yuzuki.
>Lord, me no no.
Japanese names and words confuse it. "Minoru" became "Do you know who?" and "Lord Minoru" became "What do you know her?", but it does decently on English: it got "all he really needs" right but flubbed "but to be so exhausted" as "but it'd be so exhausted". Interestingly it got "Motusua" right, the way she said it.
Output (base.en):
> Yuzuki! I brought you some tea. Ugh! What? Why are you serving the tea? The maid-pursa-com is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minaro, thank you very much. I can handle this on my own. I want you to try to relax. Oh. Minaro! Lord Minaro! Lord Minaro! Well, I'm glad to know that Allie really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motusua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minaro under such extreme stress. If only I was just more useful. Miss Yuzuki.
This one really messed up "all he really needs" as "Allie really needs" and understood "Minoru" as a name, "Minaro". It also got "but to be so exhausted" right. Mistook "ugh" as "oh".
Output (small.en):
> Yuzuki! I brought you some tea. Ah! Huh? Why are you serving the tea? The maid persicum is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki.
>persicum
This one got Minoru spelled right, along with "all he really needs" and "but to be so exhausted". Omitted "ugh".
Output (medium.en):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he'd just collapse like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki?
Omitted "ah! huh?" and "ugh" but otherwise good.
Overall, from just this sample, I think base is the best choice for English, and tiny.en on CPU. The improvements in quality from small and medium aren't really worth the slowdown in speed, and the base.en model doesn't seem particularly robust. If going for a larger model, small.en seems better than small.
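For anyone who wants to rerun this comparison on their own clips, a minimal sketch of the loop might look like this. The clip filename is just a placeholder, and on a small GPU you may want to drop the bigger models from the list:
import whisper
CLIP = "chobits_sample.mp3"  # placeholder: swap in your own audio file
MODELS = ["tiny", "tiny.en", "base", "base.en", "small", "small.en", "medium", "medium.en"]
for name in MODELS:
    model = whisper.load_model(name, device="cuda")  # use device="cpu" if no CUDA
    # English-only (.en) models are fixed to English; multilingual ones take a language hint
    options = {} if name.endswith(".en") else {"language": "en"}
    result = model.transcribe(CLIP, **options)
    print(f"--- {name} ---")
    print(result["text"].strip())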
Open file (5.94 MB 646x360 sample.webm)
Open file (75.53 KB 607x522 whisper.png)
>>17474
Holy shit, I just found out the other day Whisper has a translate feature and gave it a go tonight. It works amazingly well with the medium size model.
>[00:00.000 --> 00:02.920] The strongest warrior of Vesper, Mahoro.
>[00:02.920 --> 00:06.720] Thank you for fighting with me until today.
>[00:06.720 --> 00:11.600] I'm sure you already know, but you have only a few lives left.
>[00:11.600 --> 00:17.600] If you continue to fight as a warrior, you will only have 37 days to move.
>[00:17.600 --> 00:25.600] However, if you release your armament, you will still be able to move for 398 days, according to the report.
>[00:25.600 --> 00:30.800] Mahoro, you've done enough for us Vesper.
>[00:30.800 --> 00:37.200] If you have a wish that you want to grant, why don't you live the rest of your time freely?
>[00:37.200 --> 00:41.000] Huh? Um...
>[00:41.000 --> 00:46.000] Now, choose whichever path you like.
>[00:48.000 --> 00:49.800] My wish...
>[00:49.800 --> 00:54.400] My last wish is...
I imagine finetuning the model on English and Japanese voices and learning to predict not only the text but also the emotion, tone and speaker by attaching Tacotron to the decoder. Then the translate feature could be used to auto-dub anime in the same voice and emotion but in English. The decoder of Whisper could also be used to predict style embeddings (the emotion and tone) from text to feed into Tacotron to synthesize much more natural sounding speech, and the more context you give it, the more accurate it would be.
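For reference, translation is just another argument to transcribe(); a minimal sketch with the same package as above (the clip filename is a placeholder):
import whisper
model = whisper.load_model("medium", device="cuda")  # medium is what worked well here
# task="translate" makes Whisper output English regardless of the source language
result = model.transcribe("mahoro_clip.mp3", task="translate")
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s --> {seg['end']:.2f}s] {seg['text'].strip()}")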
>>18253 Amazing. Please do this for us Anon! If you can get meta-context encodings working, then we can use this nearly directly for our (robo)waifus. Also, Mahoro Based/10 choice. :^)
>>18253 Thanks, I plan to use Whisper soon. I've read it has problems with mixing languages, so if you encounter problems it might come from that.
Open file (128.51 KB 1078x638 ClipboardImage.png)
Microsoft one-shot voice training.
https://valle-demo.github.io/
Give it Chii's voice and it will probably sound like Chii.
(related crosspost) >>18628 >>18605
>>18628 Reading the comments section is predictable tbh. https ://www.foxnews.com/tech/new-ai-simulate-voice-3-seconds-audio >=== -disable hotlink
Edited last time by Chobitsu on 01/12/2023 (Thu) 08:08:59.
Our neighbors at /cyber/ mentioned this one. >Prime Voice AI >"The most realistic and versatile AI speech software, ever. Eleven brings the most compelling, rich and lifelike voices to creators and publishers seeking the ultimate tools for storytelling." https://beta.elevenlabs.io/
> The scope of OpenUtau includes:
> - Modern user experience.
> - Selected compatibility with UTAU technologies.
> - OpenUtau aims to solve problems in less laborious ways, so don't expect it to replicate exact UTAU features.
> - Extensible realtime phonetics (VCV, CVVC, Arpasing) intelligence.
> - English, Japanese, Chinese, Korean, Russian and more.
> - Internationalization, including UI translation and file system encoding support.
> - No you don't need to change system locale to use OpenUtau.
> - Smooth preview/rendering experience.
> - An easy to use plugin system.
> - An efficient resampling engine interface.
> - Compatible with most UTAU resamplers.
> - A Windows and a macOS version.
>The scope of OpenUtau does not include:
> - Full feature digital music workstation.
> - OpenUtau does not strive for Vocaloid compatibility, other than limited features.
https://github.com/stakira/OpenUtau
>This repo/rentry aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows/Linux, as well as a stepping stone for anons that genuinely want to play around with TorToiSe.
https://git.ecker.tech/mrq/ai-voice-cloning
>>22538 Lol. Just to let you know Anon, we're primarily a SFW board. You might try /robo/. Cheers. :^)
>>22538 What is this? From the ...engine where the dev doesn't want to be mentioned here?
I just finished my demonstration for talking to the waifu ai https://youtu.be/jjvbENaiDXc
>Whisper-based Real-time Speech Recognition https://www.unrealengine.com/marketplace/en-US/product/d293a6a427c94831888ca0f47bc5939b Just want to show this here after finding it. Something like this would be useful if one wanted to use UnrealEngine for a virtual waifu or some kind of a virtual training environment.
>>23538 I'm sure there's some kind of netcode in Unreal you can use to call a transcription API of your choice and save yourself the $99
>virtual waifu
real life robotic waifu
>>23558
>Whisper C++
>Beta: v1.4.2 / Stable: v1.2.1 / Roadmap | F.A.Q.
>High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:
>Plain C/C++ implementation without dependencies
>Apple silicon first-class citizen - optimized via ARM NEON, Accelerate framework and Core ML
>AVX intrinsics support for x86 architectures
>VSX intrinsics support for POWER architectures
>Mixed F16 / F32 precision
>4-bit and 5-bit integer quantization support
>Low memory usage (Flash Attention)
>Zero memory allocations at runtime
>Runs on the CPU
>Partial GPU support for NVIDIA via cuBLAS
>Partial OpenCL GPU support via CLBlast
>BLAS CPU support via OpenBLAS
>C-style API
Thanks, that might come in handy. There seems to be enough GPU support, even though it mainly runs on the CPU. I'm still thinking of building a dedicated server at some point, using the Arc380 (70W).
>large 2.9 GB ~3.3 GB
The original implementation needs 10GB or more for the large model, which would rather point towards getting a 3060 (170W). Many things will work fine with smaller models anyway.
>>23558 Thanks for the reminder Anon. That anon's work is really quite excellent tbh.
>>23558 >>23561 This guy (a bit hard to understand) tests it on a Raspberry Pi here https://www.youtube.com/watch?v=75H12lYz0Lo and it actually works surprisingly fast! He keeps pushing his optimizations to get it smaller and smaller. I'll keep an eye on that.
>>23579 AWS Transcribe costs 3 cents per minute, and you want to rent a server to run that thing, which probably requires multiple GPUs. Doesn't make any sense.
>>23591
>Whisper vs AWS transcribe
This is about running it at home. The tiny model works on a Raspberry Pi and the large one maybe on a 4GB GPU, certainly on a 6GB GPU (like the Arc380, which uses 70W). Do as you wish, but the general notion here is that we want our waifus to be independent from the internet. Some might even say, not connected to it at all. Using online services for something as fundamental as speech recognition (transcription), especially beyond development, is a special case and will not be recommended.
>>23535 That took quite a while and was more productive than whatever the heck Kiwi is doing. I'm going to start using a name tag so I can get some proper recognition for what I've done so far, which is: trying to make a HASEL actuator, this, buying supplies, reading up on electronics, testing the Arduino, and soon making a 3D anime girl doll from scratch. I'm really about to leave this place cause this is bullshit.
>>23634 peteblank is an anagram for "pleb taken"
>>23590 Wow. That's most excellent.
>>23634 It's good that you did something during the last few months, but don't exaggerate. You had some advice from other anons here when trying to make the HASEL actuator. You also bring this kind of vitriol with you, bashing someone or this board in way too many comments.
>3d anime girl doll from scratch
I'm looking forward to seeing that.
>I'm really about to leave this place
You don't need to hang out here every day. Work on your project and report back later.
>>23640 I am right to be upset at Kiwi since he's attacking my character for no reason. I told him I was planning to do this for profit if possible, I emailed the guy who made the 3D model asking for permission, and then he turns around and claims I want to steal other people's stuff.
>>23634 >I'm going to start using a name tag so I can get some proper recognition for what I've done so far. Good thinking Anon. That's not really why we use names here. Watch the movie 50 first dates to understand the actual reason.
>>23643 I deleted my original post here, but forgot to copy it. Just wanted to post the new link to the related post. Well... Related: >>23682
This thread is about speech synthesis and maybe recognition, not about 3D models. You can crosslink posts like above.
>our research team kept seeing new voice conversion methods getting more complex and becoming harder to reproduce. So, we tried to see if we could make a top-tier voice conversion model that was extremely simple. So, we made kNN-VC, where our entire conversion model is just k-nearest neighbors regression on WavLM features. And, it turns out, this does as well if not better than very complex any-to-any voice conversion methods. What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages. https://bshall.github.io/knn-vc https://arxiv.org/abs/2305.18975
>>23736 >What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages. Lol. That seems a little bizarre to think through. Thanks Anon. >ps. I edited the subject ITT, thanks for pointing that out NoidoDev.
We should think about optimizations of speech recognition (synthesis needs its own approach):
- there are FPGA SBCs which you can train to react to certain words, then put out a text or trigger something
- instead of recording a 30s sentence, record much shorter chunks and continue directly after the first one; check the parts, but also glue them together and send the whole sentence to the speech recognition model (see the sketch below)
- maybe using a language model for anticipation of what might be said while only having parts of a sentence, especially with some context, e.g. pointing at something
- finding ways to detect made-up words
- constructing words out of syllables instead of just jumping to what could have been meant, using that for parts of a sentence where the speech recognition model is uncertain
- using the certainty values of speech recognition to look for errors (misunderstandings), maybe using the syllable construction, wordlists and lists of names for that
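To illustrate the chunk-and-glue idea from the list above, here's a rough sketch using the openai-whisper Python package. The get_audio_chunk() function is hypothetical (a stand-in for whatever mic capture you use), and the chunk/buffer lengths are just guesses to tune:
import numpy as np
import whisper

model = whisper.load_model("base.en", device="cpu")  # tiny.en/base.en for low-end hardware

SAMPLE_RATE = 16000          # Whisper expects 16 kHz mono float32 audio
CHUNK_SECONDS = 2            # record short pieces instead of waiting 30s
MAX_BUFFER_SECONDS = 30      # roughly Whisper's context window

def get_audio_chunk(seconds: int) -> np.ndarray:
    """Hypothetical mic capture: return `seconds` of 16 kHz mono float32 audio."""
    raise NotImplementedError

buffer = np.zeros(0, dtype=np.float32)
while True:
    buffer = np.concatenate([buffer, get_audio_chunk(CHUNK_SECONDS)])
    # keep only the most recent audio so the glued buffer stays within the window
    buffer = buffer[-SAMPLE_RATE * MAX_BUFFER_SECONDS:]
    # transcribe the glued-together buffer; partial results get refined on each pass
    result = model.transcribe(buffer, fp16=False)
    print("partial:", result["text"].strip())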
>>24951 >- maybe using an language model for anticipation of what might be said, while using parts of a sentence, especially with some context e.g. pointing at something I would anticipate this should at the least provide greater odds of a coherent parse (particularly in a noisy environment) than just STT alone. Good thinking Anon.
Open file (50.97 KB 768x384 vallex_framework.jpg)
Related: >>25073 >VALL-E X is an amazing multilingual text-to-speech (TTS) model proposed by Microsoft. While Microsoft initially publish in their research paper, they did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge to reproduce the results and train our own model. We are glad to share our trained VALL-E X model with the community, allowing everyone to experience the power next-generation TTS https://github.com/Plachtaa/VALL-E-X https://huggingface.co/spaces/Plachta/VALL-E-X
>>25075 Also worth noting: it's broken if you launch it through the "python -X utf8 launch-ui.py" command and let it install the "vallex-checkpoint.pt" and Whisper "medium.pt" models on its own. Very weird, as it's already solved here: https://github.com/Plachtaa/VALL-E-X#install-with-pip-recommended-with-python-310-cuda-117--120-pytorch-20 Download them manually, that's it.
>>25075 >>25096 Thanks. This will be very useful.
Open file (107.39 KB 608x783 Screenshot_136.png)
There's some excitement around a Discord server being removed, which was working on AI voice models. We might not even have known about it (I didn't), but here's the website: https://voice-models.com https://docs.google.com/spreadsheets/d/1tAUaQrEHYgRsm1Lvrnj14HFHDwJWl0Bd9x0QePewNco/edit#gid=1227575351 and weights.gg (not voice models)
>AI Hub discord just got removed from my server list
But it seems to be only a fraction of the models. Some mention a backup, IIRC: https://www.reddit.com/r/generativeAI/comments/16zzuh4/ai_hub_discord_just_got_removed_from_my_server/
>>25805 >I WARNED YOU ABOUT THE DOXXCORD STAIRS BRO Save.everything. Doxxcord is even more deeply-controlled than G*ogle is. DMCAs don't result in a forum getting disappear'd.
>Otamatone https://youtu.be/Y_ILdh1K0Fk Found here, related: >>25273
>>25876 Had no idea that was a real thing NoidoDev, thanks! Any chance it's opensauce?
>>25893 The original belongs to a corporation, but if you look for "Otamatone DIY" you can find some variants.
>>25909 Cool. Thank you NoidoDev! :^)
>>17474 Can we get this with time stamps? So we can use it for voice training (text to speech).
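Whisper already returns segment-level timestamps in result["segments"], so a minimal sketch for dumping training labels might look like this (the audio filename and the pipe-separated output format are just placeholder choices):
import csv
import whisper

model = whisper.load_model("base", device="cuda")  # or device="cpu"
result = model.transcribe("chobits_sample.mp3", language="en")

# write start|end|text triples, e.g. as a simple label file for TTS dataset prep
with open("labels.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for seg in result["segments"]:
        writer.writerow([f"{seg['start']:.2f}", f"{seg['end']:.2f}", seg["text"].strip()])
Newer releases of the package also accept word_timestamps=True in transcribe() if finer alignment is needed, though I haven't checked which version added it.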
>ⓍTTS is a Voice generation model that lets you clone voices into different languages by using just a quick 6-second audio clip. There is no need for an excessive amount of training data that spans countless hours. https://huggingface.co/coqui/XTTS-v2 (only non-commercial licence) Testing Space: https://huggingface.co/spaces/coqui/voice-chat-with-mistral Via https://www.reddit.com/r/LocalLLaMA/comments/17yzr6l/coquiai_ttsv2_is_so_cool/ (seems to be much closer to the ElevenLabs quality)
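If anyone wants to try it locally, a minimal sketch using the coqui TTS Python package (pip install TTS) might look like the following; the model string is the one from their docs as far as I recall, and the reference clip path is a placeholder, so double-check against the current release:
from TTS.api import TTS

# XTTS-v2 clones a voice from a short reference clip (non-commercial licence)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")  # or .to("cpu")
tts.tts_to_file(
    text="I brought you some tea.",
    speaker_wav="reference_voice.wav",  # ~6 seconds of the target voice
    language="en",
    file_path="output.wav",
)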
>>26511 also this one https://github.com/yl4579/StyleTTS2 some people claim its 100x faster than coqui's xtts. still no webui tho :(
>>26512 Thanks, I saw this mentioned but forgot to look it up.
>>26512 tested it locally, rtx 3070. works fast as fuck. https://files.catbox.moe/ow0ryz.mp4
>>26535 >>26566 Thanks Anons. :^)
>>27995 REALLY impressive Anon, thanks!
>MetaVoice 1B - The new TTS and Voice cloning open source model
Colab: https://drp.li/7RUPU
MetaVoice Online Demo - https://ttsdemo.themetavoice.xyz/
https://huggingface.co/metavoiceio
https://youtu.be/Y_k3bHPcPTo
Not as good as proprietary models.
>>29257 >Not as good as proprietary models. Ehh, they'll get better with time, no doubt. Thanks Anon! Cheers. :^)
>This week we’re talking with Georgi Gerganov about his work on Whisper.cpp and llama.cpp. Georgi first crossed our radar with whisper.cpp, his port of OpenAI’s Whisper model in C and C++. Whisper is a speech recognition model enabling audio transcription and translation. Something we’re paying close attention to here at Changelog, for obvious reasons. Between the invite and the show’s recording, he had a new hit project on his hands: llama.cpp. This is a port of Facebook’s LLaMA model in C and C++. Whisper.cpp made a splash, but llama.cpp is growing in GitHub stars faster than Stable Diffusion did, which was a rocket ship itself.
https://changelog.com/podcast/532
Some takeaways: Whisper didn't do speaker identification (diarization) when they published this on March 22, 2023, and it seems to be hard to find something that does it. But they said people set up their own pipelines for this, and Whisper might get there as well. I found this on the topic by briefly searching; it still doesn't seem to be covered in some easy way:
>How to use OpenAIs Whisper to transcribe and diarize audio files
https://github.com/lablab-ai/Whisper-transcription_and_diarization-speaker-identification-
Discussion on this: https://huggingface.co/spaces/openai/whisper/discussions/4
Azure AI services seem to be able to do it, but this doesn't help us much. As a tool to extract voice files for training it's one thing, but we also need it as a skill for our waifus:
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speaker-recognition?tabs=script&pivots=programming-language-cpp
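For a fully local route, the usual trick is to run a separate diarization model and match its speaker turns against Whisper's segment timestamps. A rough sketch with pyannote.audio; the model name and token argument are from memory of their docs (the model is gated behind a HuggingFace token), and the overlap matching here is deliberately naive:
import whisper
from pyannote.audio import Pipeline

AUDIO = "sample.wav"  # placeholder

# 1) transcribe with Whisper to get timestamped segments
asr = whisper.load_model("base", device="cuda")
segments = asr.transcribe(AUDIO)["segments"]

# 2) diarize with pyannote (needs a HuggingFace access token for the gated model)
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="hf_...")
diarization = diarizer(AUDIO)
turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarization.itertracks(yield_label=True)]

# 3) assign each ASR segment to the speaker whose turn overlaps it the most
def best_speaker(start, end):
    overlaps = [(min(end, t_end) - max(start, t_start), spk)
                for t_start, t_end, spk in turns]
    overlap, spk = max(overlaps, default=(0.0, "unknown"))
    return spk if overlap > 0 else "unknown"

for seg in segments:
    print(f"[{best_speaker(seg['start'], seg['end'])}] {seg['text'].strip()}")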
>>29415 Very nice. Thanks NoidoDev! I have a lot of respect for Gerganov. He very-clearly understands the issues of latency in a systems development context. Exactly the kinds of expertise vital for success to /robowaifu/ and our affiliated cadres in the end. Cheers. :^)
>Data Exchange Podcast 198 - Sep 21, 2023 Overview over everything related to speech. https://www.youtu.be/w4DULuvgO1Y Yishay Carmiel is the CEO of Meaning, a startup at the forefront of building real-time speech applications for enterprises. Episode Notes: https://thedataexchange.media/state-of-ai-for-speech-and-audio >Sections Generative AI for Audio (text-to-speech; text-to-music; speech synthesis) - 00:00:44 Speech Translation - 00:09:44 Automatic Speech Recognition and other models that use audio inputs - 00:13:16 Speech Emotion Recognition - 00:19:55 Restoration - 00:21:55 Similarities in recent trends in NLP and Speech - 00:24:23 Diarization (speaker identification), and implementation challenges - 00:29:47 Voice cloning and risk mitigation - 00:35:36
There are some Japanese open source programs for speech synthesis, such as VOICEVOX, though I should mention that if you use these voices they will have funny accents if you make them speak English, which can be kinda cute sometimes.
https://voicevox.hiroshiba.jp
There's also TALQu, but it is only for Windows.
https://booth.pm/ja/items/2755336
NNSVS is for singing, also open source.
https://nnsvs.github.io
SociallyIneptWeeb used VOICEVOX for an AI waifu before and detailed what he did:
https://www.youtube.com/watch?v=bN5UaEkIPGM&t=674s
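If anyone wants to script it, the VOICEVOX engine exposes a local HTTP API once it's running. The port and speaker id below are the usual defaults as far as I know, so treat them as assumptions and check your install (GET /speakers lists the voices):
import requests

HOST = "http://127.0.0.1:50021"   # default VOICEVOX engine address
TEXT = "お茶をお持ちしました"       # "I brought you some tea"
SPEAKER = 1                        # voice/style id, see GET /speakers

# 1) build an audio query (phonemes, pitch, timing) for the text
query = requests.post(f"{HOST}/audio_query",
                      params={"text": TEXT, "speaker": SPEAKER}).json()

# 2) synthesize a wav from that query
wav = requests.post(f"{HOST}/synthesis",
                    params={"speaker": SPEAKER},
                    json=query).content

with open("voicevox_out.wav", "wb") as f:
    f.write(wav)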
>>30390 Oh wow, this is really good. Thanks.
>https://nnsvs.github.io
>NNSVS
>Neural network based singing voice synthesis library
> GitHub: https://github.com/nnsvs/nnsvs
> Paper: https://arxiv.org/abs/2210.15987
> Demo: https://r9y9.github.io/projects/nnsvs/
>Features
> Open-source: NNSVS is fully open-source. You can create your own voicebanks with your dataset.
> Multiple languages: NNSVS has been used for creating singing voice synthesis (SVS) systems for multiple languages by VocalSynth communities (8+ as far as I know).
> Research friendly: NNSVS comes with reproducible Kaldi/ESPnet-style recipes. You can use NNSVS to create baseline systems for your research.
>>30398 Here is a site I found that writes some about it and has links to written tutorials. https://nnsvs.carrd.co/
>VoiceCraft
>>30614
Thanks, but it's about voice cloning again. I think what I really want are artificial voices which don't belong to anyone. Cloning has its use cases as well, but I don't need or want it for a robot wife. Also, I don't need it to be too close to a human. To me the quality problem is solved at this point, at least for robowaifus. I was certainly very impressed by the singing capabilities I saw and heard recently, see above >>30390
>>30625 If you aren't worried about human closeness, there is a pretty simple TTS that sounds like old retro synthesized voices. Unfortunately I can't find a video that has the female voice.
https://github.com/adafruit/Talkie
>>30657 Thanks, but I didn't mean to go that far in the other direction. I just meant that for our use case here, in my opinion, the current state of the technology should be sufficient in terms of quality, or at least close to it. Making it faster and able to run well on smaller devices would be good, though. For content creation it's another story, if we don't want to only have stories about robots.
I figure this might be the best place for this. I found a paper on lip syncing synthesized voices.
>>31027 For some reason the file didn't attach
https://doi.org/10.1109/ROMAN.2010.5598656
>>31027 I've been thinking about designing something similar, now I'm totally gonna s̶t̶e̶a̶l̶ be inspired by this.
>>31027 That is great. I mentioned doing something sorta the same with facial expressions. I believe this is the same sort of "framework" or idea. Here's the paper: "Real-time lip synchronization between text-to-speech (TTS) system and robot mouth". Well, I can't upload it. I get an error saying "Tor users can not upload files". What??????? Here's an address for the paper:
https://sci-hub.ru/10.1109/roman.2010.5598656
>>31049 >Well I can't upload it. I get an error saying,"Tor users can not upload files". What??????? Lol, welcome to my world! :D TBH I think Robbit must've disabled file posting by Torfags. I hope he changes that soon.
Just wanted to mention that Suno, the AI music creation model, is based on Bark, the speech generation model. They needed around two years to get from there to where we are now. I have a source, a video where this is mentioned, but it's also about a lot of other things; it was just mentioned there in passing.
>>32169 Neat! That's an interesting heritage. Impressive results in a fairly short time, too. Thanks, NoidoDev. Cheers. :^)
I think to make speech generation fast enough, we'll have to go from some AI voice to replicating the generated voice in MBROLA. I asked Claude about it:

To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA, you'll need to follow several steps. Here's a concise overview of the process:
1. Prepare your data:
 - Ensure your audio files are high-quality recordings
 - Verify that phonetic labels are accurate and aligned with the audio
2. Choose your synthesizer:
 - eSpeak: Better for rapid development, less natural-sounding
 - MBROLA: More natural-sounding, but requires more work
3. For eSpeak:
 - Create a voice file with phoneme definitions
 - Define pitch and timing rules
 - Add your voice file to eSpeak's voices directory
4. For MBROLA:
 - Create a diphone database from your audio files
 - Define pitch and duration rules
 - Create a language definition file
5. Test and refine:
 - Use sample texts to test your voice
 - Adjust parameters for better quality
This process can be complex and time-consuming.

Yes, there are tools available to help create voices for MBROLA. Here's a brief overview of some key tools:
1. MBR_PH: This is the official MBROLA database creation tool. It helps in creating diphone databases from labeled speech samples.
2. Praat: While not specifically for MBROLA, this phonetics software can be used for labeling and segmenting audio files, which is a crucial step in the process.
3. HTK (Hidden Markov Model Toolkit): Can be used for automatic segmentation of speech data.
4. Festvox: A voice building framework that can be used to create voices for various synthesizers, including MBROLA.
5. WinSnoori: A speech analysis tool that can be helpful in the voice creation process.
These tools can significantly streamline the process, but creating a high-quality voice still requires considerable effort and expertise.
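For what it's worth, the runtime side of that pipeline is already simple and fast on CPU once a voice database exists: eSpeak can emit MBROLA-format .pho phoneme data and MBROLA renders it to audio. A rough sketch wrapping the CLIs from Python; the voice names (mb-us1/us1) and database path are assumptions that depend on which MBROLA voices you have installed, so adjust as needed:
import subprocess

TEXT = "I brought you some tea."

# eSpeak NG writes MBROLA phoneme data (.pho) when an mb-* voice is selected; -q suppresses audio output
subprocess.run(["espeak-ng", "-v", "mb-us1", "--pho", "--phonout=out.pho", "-q", TEXT], check=True)

# MBROLA renders the .pho file with the matching diphone database into a wav
subprocess.run(["mbrola", "/usr/share/mbrola/us1/us1", "out.pho", "out.wav"], check=True)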
>>32409 The linguistic knowledge necessary for defining pitch and duration rules in speech synthesis encompasses several areas of linguistics. Here's an overview of the key areas:
1. Phonetics and Phonology:
 - Understanding of speech sounds and their features
 - Knowledge of phonemic inventories and allophonic variations
 - Familiarity with syllable structure and phonotactics
2. Prosody:
 - Understanding of stress patterns (lexical and sentential)
 - Knowledge of intonation patterns and their meanings
 - Familiarity with rhythm and timing in speech
3. Morphology:
 - Understanding how word structure affects pronunciation
 - Knowledge of morphophonemic alternations
4. Syntax:
 - Understanding how sentence structure affects prosody
 - Knowledge of phrase boundaries and their prosodic effects
5. Pragmatics:
 - Understanding how context and intent affect prosody
 - Knowledge of focus and emphasis in speech
6. Sociolinguistics:
 - Familiarity with dialect variations in prosody
 - Understanding of speaking styles and registers
7. Acoustic Phonetics:
 - Knowledge of acoustic correlates of prosodic features
 - Understanding of formant structures and transitions
8. Language-specific knowledge:
 - Deep understanding of the prosodic system of the target language
 - Familiarity with any unique features (e.g., tone in tonal languages)
This knowledge helps in creating rules that accurately capture the nuances of natural speech in the target language.
>>32409
> To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA
>This checkpoint leverages the pretrained checkpoint wav2vec2-large-lv60 and is fine-tuned on CommonVoice to recognize phonetic labels in multiple languages.
https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft
>Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.
https://arxiv.org/abs/2109.11680
>>32409 Ouuuf, I had to ask specifically:
Yes, Festival can indeed be easier to work with for voice creation compared to directly using MBROLA or eSpeak. Here's a brief overview of why Festival might be a more accessible option:
1. Comprehensive toolkit:
 - Festival comes with a suite of tools designed for voice creation
 - Includes utilities for data preparation, analysis, and synthesis
2. Festvox:
 - A voice building framework that works with Festival
 - Provides a structured approach to creating new voices
3. Higher-level abstractions:
 - Allows you to work with higher-level linguistic concepts
 - Can automatically handle some lower-level details
4. Built-in machine learning:
 - Can use statistical methods to learn prosody models from data
 - Reduces the need for manual rule creation
5. Extensive documentation:
 - Well-documented processes for voice building
 - Includes tutorials and example voices
6. Language support:
 - Comes with tools for building voices in various languages
 - Can handle different writing systems and phoneme sets
7. Integration:
 - Can still use MBROLA voices if desired
 - Allows for mixing different synthesis methods
While Festival still requires significant effort and linguistic knowledge, it provides more structure and automation in the voice creation process.
>>32413 We should keep an eye on developments in case anyone tries to make that easier with AI. Or maybe we could try advocating for it. I might also give it a try myself, if I really have to, but I'm bad with patience and frustration.
http://www.festvox.org/
https://github.com/festvox/festvox
>For full details about voice building see the document itself
http://festvox.org/bsv/
>>32414 Great research effort NoidoDev, thanks! :^)
