/robowaifu/ - Speech Synthesis/Recognition general


Robot Voices Robowaifu Technician 09/12/2019 (Thu) 03:09:38 No.156

What are the best sounding female robotic voices available? I want something that doesn't kill the boner, I think Siri has an okay voice, but I'm not sure if that would be available for my project

https://www.invidio.us/watch?v=l_4aAbAUoxk

Robowaifu Technician 09/19/2019 (Thu) 12:17:54 No.488

>>199
This has chatter on HN right now.
news.ycombinator.com/item?id=13992454

Robowaifu Technician 09/19/2019 (Thu) 12:24:21 No.489

>>199
Inmoov project is worth checking out, I believe they already have code available for voice recognition, text to speech and a bunch of other stuff. The only thing the bot is missing is legs because they're still trying to figure out an affordable way to make them.

inmoov.fr/

https://www.invidio.us/watch?v=2sZOyCBbows

Edited last time by Chobitsu on 09/19/2019 (Thu) 12:29:40.

Robowaifu Technician 09/19/2019 (Thu) 12:55:25 No.495

>>489
Thanks for the tip anon. If I find some good code or other cyber assets I find valuable I'll link them back here at some point.

Robowaifu Technician 09/19/2019 (Thu) 12:56:12 No.496

>>489
That's a pretty cool robot anon.

Robowaifu Technician 09/19/2019 (Thu) 12:57:31 No.497

>not using espeak

Robowaifu Technician 09/19/2019 (Thu) 12:58:37 No.499

>>497
>Do you think that is healthy?
I lost it.

Robowaifu Technician 10/06/2019 (Sun) 02:28:48 No.1239

The single greatest voice actress there is ofc--Tara Strong.

If she could do the same voice she did when she played the nurse in The Animatrix or the voice during the tutorial of the minority report video game(i know odd choice) it would be literal sex for your ears nonstop.

Robowaifu Technician 10/06/2019 (Sun) 02:32:29 No.1240

>>1239
>Tara Strong
Actually, a Twilicorn robowaifu would be pretty hot in fact of course.

Robowaifu Technician 10/06/2019 (Sun) 02:33:35 No.1241

github.com/alexa/alexa-avs-sample-app/wiki/Raspberry-Pi

idk where to post this but it might help. amazon echo is botnet af but as a software placeholder for someone who doesn't care about freedom, it might be helpful.

Robowaifu Technician 10/06/2019 (Sun) 02:35:38 No.1242

>>156
I imagine you'd probably want to start with a good VA to build up a general vocabulary, and then have software tweak the sounds to change emotional context, etc. I've never really heard a good fully synthesized artificial voice that I know of. Certainly nothing you'd want to hear in the middle of sex w/ your robot wife. Talented VAs like >>1239 are the way to go, at least for now.

Robowaifu Technician 10/06/2019 (Sun) 02:37:00 No.1243

>>156
www.vocalid.co/voicebank

This could be helpful, I'm just now realizing how important a robowaifu voice is, good luck developing!

Robowaifu Technician 10/06/2019 (Sun) 02:38:41 No.1244

>>1242
Vidya voice actors should be a good source of training data, since there's usually a way to export the raw audio for each line out of the game files without any additional noise. Sometimes even with corresponding subtitles.

Robowaifu Technician 10/06/2019 (Sun) 02:38:57 No.1245

>>1244
Yeah, that's an interesting idea anon. Have any specifics for us?

Robowaifu Technician 10/06/2019 (Sun) 02:39:54 No.1246

>>1245
Not really, other than:
>pick your favorite game voice actor
>go to IMDb and grab a list of all the games she ever voiced and install all of them
>export audio files for each line for each game either with corresponding text or listen to each line and write the text manually
>maybe attach emotions to each line, sort of a tag system. "anger", "fear", etc.
>feed that as training data to whatever you're using as the actual voice engine
>???
>profit!
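A minimal sketch of the tagging step above, assuming a simple pipe-delimited "path|text|emotion" manifest. The file names, sample lines, and tag names here are hypothetical placeholders, and whatever voice engine you end up using may expect a different layout.

```python
from collections import Counter

# Hypothetical example rows: (audio file, transcript, emotion tag).
LINES = [
    ("game1/line_0001.wav", "Hello there.", "neutral"),
    ("game1/line_0002.wav", "Get away from me!", "fear"),
    ("game2/line_0107.wav", "How dare you.", "anger"),
]

def format_record(path, text, emotion):
    """Format one manifest line as 'path|text|emotion'."""
    text = text.strip()
    if "|" in text:
        raise ValueError("transcript must not contain the '|' delimiter")
    return f"{path}|{text}|{emotion}"

def build_manifest(rows):
    """Join all records into one newline-terminated manifest string."""
    return "\n".join(format_record(*row) for row in rows) + "\n"

def emotion_counts(rows):
    """Count lines per emotion tag, to spot under-represented emotions."""
    return Counter(emotion for _path, _text, emotion in rows)
```

Checking `emotion_counts(LINES)` before training is a cheap way to see whether one emotion has far fewer lines than the others.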

Robowaifu Technician 10/06/2019 (Sun) 02:42:29 No.1247

https://www.invidio.us/watch?v=41U78QP8nBk

Robowaifu Technician 10/06/2019 (Sun) 02:42:48 No.1248

>>1247
Nice historical find Anon.

Robowaifu Technician 10/06/2019 (Sun) 02:43:54 No.1249

github.com/MycroftAI/mimic

I may try to build and play around with this in the next couple of months. If I make progress and can figure out how to make webms, I'll probably post results here.

Robowaifu Technician 10/06/2019 (Sun) 02:44:20 No.1250

Just waiting for
lyrebird.ai
to release their API - we could technically copy a person's voice

Robowaifu Technician 10/06/2019 (Sun) 02:45:40 No.1251

>>1250
Very interesting tech, but
>'Ethics' link on front page
>"We are the stewards of you're voice…"
What could possibly go wrong?

Sounds like it's going to turn into another libshitshow tbh anon. But still, it might be a good tech to watch for follow-on copycats. Maybe they will release their code, I doubt this bunch will. With copytext like the above they are simply greedy little shits by the sound of it.

Robowaifu Technician 10/06/2019 (Sun) 02:46:21 No.1252

voicebot.ai/

Robowaifu Technician 10/06/2019 (Sun) 02:47:32 No.1253

Do voice control systems belong here or in another thread?
jasperproject.github.io/documentation/

Seems kind of like an Alexa afaik but runs on a RPi

>ed. maybe the Mycroft thread would be a better choice

Robowaifu Technician 10/06/2019 (Sun) 02:49:44 No.1254

>>156
pornstar plus babytalk equals hotness.

Robowaifu Technician 10/06/2019 (Sun) 02:51:15 No.1255

>>1243
Vocaloid is a bitch to use for serious Text To Speech. You have to compose every sentence syllable by syllable. The English voices aren't all that great either.

Robowaifu Technician 10/06/2019 (Sun) 02:51:33 No.1256

When I was in Japan I found the bus and train announcements to be incredibly sexy. It seems that Japanese lends itself to text-to-speech better than English, since it's a very enunciated language (as opposed to lots of slurring in English).

If given a very limited vocabulary, I'm planning to make my robowaifu speak Japanese. Now I just need to find the same source voices they use in Japan without spending too much on translation or on a popular voice actor.

Robowaifu Technician 10/06/2019 (Sun) 02:52:09 No.1257

>>1256
>Now I just need to find the same source voices they use in Japan without spending too much on translation or on a popular voice actor.
Good luck anon. Please keep us informed if/when you find good ones!

Robowaifu Technician 10/06/2019 (Sun) 02:53:22 No.1258

Festival vox for GNU+Linux has some decent utterances.

It can run as a TCP server that you issue text-to-speech commands to in Lisp syntax:
(SayText "Hello there.")

This system doesn't have inflections, but it's decent text to speech, and you could use the framework to create your own utterances (or pay a voice actor to).
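For anyone who wants to drive that server from a script, here is a minimal sketch in Python. It assumes Festival was started with `festival --server` on the same machine and is listening on its default port 1314. The helper names are made up for this example.

```python
import socket

FESTIVAL_HOST = "localhost"
FESTIVAL_PORT = 1314  # assumed: Festival's default server port

def saytext_command(text):
    """Build a (SayText "...") expression, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'(SayText "{escaped}")'

def say(text, host=FESTIVAL_HOST, port=FESTIVAL_PORT, timeout=5.0):
    """Send one SayText command to a running Festival server."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(saytext_command(text).encode("utf-8") + b"\n")
```

With the server running, `say("Hello there.")` should send the same command shown above. The sketch does not read the server's reply, so errors on the Festival side will go unnoticed.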

The CMU Arctic pack is more natural than the dry robotic default voices:
festvox.org/cmu_arctic/

However, I find that in each utterance many words can be improved by adding an h or two, spelling phonetically, or inserting a small delay with one to three periods between words.

I pass sentences through a simple word substitution filter per voice, and update it when I hear a word that sounds a bit odd.
E.g.:

[arctic_slt_female]
good : ghood.
to : to.
obey : obey..
your : your.
owner : ohner
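
A filter like the one above can be sketched in a few lines of Python. This is only one possible reading of the format (a `[voice]` header followed by `word : replacement` lines), and the function names are invented for the example.

```python
import re

FILTER_TEXT = """\
[arctic_slt_female]
good : ghood.
to : to.
obey : obey..
your : your.
owner : ohner
"""

def parse_filters(text):
    """Parse '[voice]' sections of 'word : replacement' lines into nested dicts."""
    filters, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = filters.setdefault(line[1:-1], {})
        elif current is not None and ":" in line:
            word, replacement = line.split(":", 1)
            current[word.strip().lower()] = replacement.strip()
    return filters

def apply_filter(sentence, table):
    """Replace whole words (case-insensitive) using the table for one voice."""
    def swap(match):
        word = match.group(0)
        return table.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", swap, sentence)

FILTERS = parse_filters(FILTER_TEXT)
```

Matching whole words rather than substrings keeps "to" from being rewritten inside "tomorrow".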

I rather like the sound of these and kind of don't mind the lack of inflections, but then again I'm right leaning on the /clang/ scale.

Perfect Immortal Machine Robowaifu Technician 10/06/2019 (Sun) 02:54:28 No.1259

Shodan

Robowaifu Technician 10/06/2019 (Sun) 02:54:50 No.1260

>>1258
Festival sounds worse than a Speak & Spell. All the voices have this ear piercing scratching sound. It does not even sound like a robotic effect, it just sounds like someone speaking through a broken microphone.

Robowaifu Technician 10/06/2019 (Sun) 02:55:25 No.1261

(((Jewgle Jewplex)))
www.tomsguide.com/us/google-duplex-faq,news-27185.html

Robowaifu Technician 11/23/2019 (Sat) 12:30:40 No.1563

>>199
There seems to be a project to make good My Little Pony synthesized voices.
https://clyp.it/r0yypquc?token=e11965be1b6dce146eb61702006c285e
https://mlpol.net/mlpol/res/249436.html#249436
https://boards.4channel.org/mlp/thread/34514258#p34514258
Their technology seems sound and the voice is good. They have assembled the files and resources for us to use and to train the talking bot so we can probably use the same technology to synthesize more AI voices. If any of you guys would like to put Twilight Sparkle in your wAIfu then this is a fantastic development. Even if you don't like MLP these resources can turn a library of voice lines such as in an anime or tv show into a synthesized voice which is pretty cool. Put your waifu into a wAIfu.

Robowaifu Technician 11/23/2019 (Sat) 12:39:09 No.1564

>>199
https://clyp.it/ccyaz3rr?token=c14204d84d7069f0f091f38eada3d740
Twilight Sparkle going Sicko mode

Robowaifu Technician 11/23/2019 (Sat) 12:39:31 No.1565

>>1563
Thanks for the tip anon, I'll try to check it out.

Robowaifu Technician 11/23/2019 (Sat) 12:43:45 No.1566

>>1563
>Cool. Good to know that soon we can make Sweetie Belle recite the 14 words.
I'm OK w/ this tbh.

Robowaifu Technician 11/23/2019 (Sat) 13:52:07 No.1569

>>1563
>know nothing about the project
>spend 30+ minutes going through the cuckchan thread b/c AI work
welp, i have to say, love mlp fags or hate them, the level of autism on display in their 'pony preservation project' is impressive.

Robowaifu Technician 11/23/2019 (Sat) 22:29:26 No.1570

>>1563
Ponies and cuckchan aside, I am impressed. I'll have to read about how intensive the training is. I'm very interested in trying it out for myself. I wonder what would happen if you tried using SHODAN's voice lines.

There are only about 26 minutes worth of audio from SS2. Does anyone know if that's sufficient for training, or is more needed?

https://www.youtube.com/watch?v=QGEjKoGHzO0

Robowaifu Technician 11/23/2019 (Sat) 22:44:06 No.1571

>>1570
I'll amend this post by saying that 26 minutes of audio is probably not sufficient. It sounds like there should be at least several hours for the best results. I think a better approach would be to train a neural network using voice clips from someone who sounds similar to SHODAN's original voice actress. The next step would be to create a program that takes voice audio and adds the distinctive audio "glitches" of SHODAN's voice. Then the voice clips generated by the NN could be fed through this program to "SHODANify" it. There might already be ways to do this quite easily with audio editing programs, I'm only thinking of creating an automated way to do it.

Robowaifu Technician 11/24/2019 (Sun) 00:53:18 No.1572

>>1571
>I'm only thinking of creating an automated way to do it.
Sounds like an interesting project idea. I'd imagine the original audio engineers for the game layered effects and filters in a traditional way. Figuring out both how to 'reverse engineer' the effect as well as how to automate it for the general case seems like an intricate process. Any plans to pursue this beyond conception stage?

Robowaifu Technician 11/25/2019 (Mon) 04:51:22 No.1581

>>1572
Right now, no. I do have an idea of exactly what kind of things I or someone else who's interested would need to do. Unfortunately, my daily schedule and life in general makes it difficult for me to make time for this kind of project. (College student with a job and family, friends, other hobbies, etc.) normalfag-tier I know
However, I'll say this:
the more I think about this and type out my ideas, the more practical I think it is. It's just a matter of investing some time and effort.

I know that Audacity supports python scripting for automating stuff, but I would have to learn about using it to apply the actual effects. If I can't manually create a SHODAN-like audio clip using Audacity, I won't understand enough to automate the process. I already have a general idea of what kind of effects are needed (shifting pitch, timbre, layering with delay, stuttering, etc.) and listening to the audio clips from the game will help me refine the process. Also, depending on what kind of result I can get with Audacity, I may want to consider other audio editing programs. Audacity is just the one that comes to mind, being FOSS.

Once I understand what kind of effects actually go into the audio, and how to apply them to get the best result, then I can start to play around with a Python script. Fortunately I have experience with Python, and I think this would be fairly straightforward. I'd have to read Audacity's documentation on scripting, which may or may not be sparse. Another tricky part to this is applying the audio effects in a somewhat random way so that there's some variation to the resulting clip and so that multiple clips don't all sound alike. I think there should be some underlying logic to how the effects are applied, but it might take me some time to puzzle out the best strategy (and of course I could probably always find some way to improve upon it).
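
A rough sketch of the "random but controlled" part of that idea, in plain Python on a list of float samples. It does not use Audacity's scripting interface, it is not how the game's audio was made, and it only covers stuttering and a delayed layer (no pitch or timbre shifting). The function names and parameter ranges are arbitrary assumptions, and the input is assumed to be non-empty.

```python
import random

def add_delay_layer(samples, delay, gain):
    """Mix a delayed, attenuated copy of the signal onto itself."""
    out = list(samples) + [0.0] * delay
    for i, s in enumerate(samples):
        out[i + delay] += s * gain
    return out

def stutter(samples, start, length, repeats):
    """Repeat samples[start:start+length] `repeats` extra times."""
    chunk = samples[start:start + length]
    return samples[:start + length] + chunk * repeats + samples[start + length:]

def glitch(samples, rng, max_stutters=3):
    """Apply a random number of stutters, then one delay layer."""
    out = list(samples)
    for _ in range(rng.randint(1, max_stutters)):
        length = rng.randint(1, max(1, len(out) // 10))
        start = rng.randrange(0, len(out) - length + 1)
        out = stutter(out, start, length, rng.randint(1, 3))
    delay = rng.randint(1, max(1, len(samples) // 4))
    return add_delay_layer(out, delay, gain=0.5)
```

Passing in a seeded `random.Random` makes a given clip reproducible, while a different seed per clip gives the variation between clips described above.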

Getting audio clips to pass through the script would be fairly trivial, I think. For starters, I would probably just use a pre-trained NN, or train one on my own using an available dataset. In a perfect world, we could use a NN trained with Terri Brosius' voice. However, I don't think there's very much audio available. She's done voice acting for a variety of games, but I believe many hours worth of audio are needed in order to effectively train a NN. Unless she happens to record a book on tape someday, I doubt that this will be possible/practical.

Question/request to any anon who's familiar with audio editing and manipulation, whether with Audacity or a different program:
Can you "SHODAN-ify" an audio clip by hand? And if so, can you teach me how?
If not, maybe you can point me towards some good resources to help me.
Also, any advice on creating scripts for Audacity (or a different program) would be welcome.

Question for anons who are more familiar with NN's for speech synthesis:
Would you reckon that we could train a NN with Terri Brosius' lines from other videogames, not just SS1/2?
If there's enough audio and interest in this idea, it would be a great benefit to have multiple anons working to transcribe and clip audio from different games. However, I wouldn't worry about this until after either me or some other anon can get a working SHODAN-ify script created.

Robowaifu Technician 11/25/2019 (Mon) 05:08:19 No.1582

>>1571
You only need 5 seconds of arbitrary audio to get pretty good results with the SV2TTS framework. It uses an encoder trained to perform speaker verification to create a speaker embedding vector from a short audio sample. Then a seq2seq network creates a mel spectrogram from some text and the speaker embedding, and a wavenet turns the mel spectrogram into an audio waveform. The encoder in the pretrained model was trained on noisy speech, but I don't know how well it would work with a reference as heavily distorted as SHODAN's voice.

Audio samples: https://google.github.io/tacotron/publications/speaker_adaptation/
GitHub: https://github.com/CorentinJ/Real-Time-Voice-Cloning
Paper: attached pdf
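
For readers unfamiliar with the "mel" in mel spectrogram: the frequency axis is warped so that bands are narrow at low frequencies and wide at high ones. A small sketch using one common definition of the mel scale (the 2595 * log10(1 + f/700) formula) follows. Whether SV2TTS uses exactly this variant is an assumption, and the band count and frequency range in the usage note are illustrative only.

```python
import math

def hz_to_mel(hz):
    """Convert frequency in Hz to mels (one common formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min, f_max):
    """Return n_bands + 2 frequencies in Hz, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]
```

For example, `mel_band_edges(80, 0.0, 8000.0)` gives edges that are only a few Hz apart near the bottom of the range and hundreds of Hz apart near the top.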

Robowaifu Technician 11/25/2019 (Mon) 05:19:37 No.1583

>>1582
>5 seconds
>quality results
well shit negro we're in business
I have a feeling that training directly with the distorted SHODAN audio will not work (although it might be worth trying), so instead I floated the idea of using Terri Brosius' regular voice. There's probably more than enough audio from other characters that she has voiced, based on this information.
I will have to study these links and learn more about this, thank you, anon.

Robowaifu Technician 12/03/2019 (Tue) 09:02:49 No.1701

anon started a thread about speechsynth on fatchan.
https://fatpeople.lol/tech/thread/288.html

pretty sure everything there has been discussed here already except this:
https://paintschainer.preferred.tech/index_en.html

Robowaifu Technician 12/03/2019 (Tue) 10:24:54 No.1703

>>1701
example 'AI' painting it did for me

Robowaifu Technician 12/12/2019 (Thu) 02:38:42 No.1850

Do you want a flange or not?

Robowaifu Technician 12/12/2019 (Thu) 07:50:53 No.1851

>>1850
Can you clarify what you mean Anon?

Robowaifu Technician 12/12/2019 (Thu) 13:36:56 No.1852

>>1851
Well, do you want something that sounds like an ordinary woman, no frills, no bells and whistles? Or do you want one that sounds like a science fiction robot, with a reverb or "metallic" effect?

Robowaifu Technician 12/13/2019 (Fri) 05:03:43 No.1853

>>156
I mentioned in another thread about voice generation that I'm interested in trying to devise a way to automatically process audio clips and add sound effects like SHODAN. There was also brief discussion of using Terri Brosius' own voice to train a neural network, but it's not clear how practical that is. I don't know what your personal preferences are, but I do know that there are a number of people (including myself) who would love to hear SHODAN say dirty things. I'd like to get started working on this, but unfortunately other obligations are keeping me from investing too much time in it currently (work, school, family, etc.). Hopefully I can do more in-depth research on this idea and potentially have some sort of prototype working in the next few months, but I don't want to promise anything. I'd hate to get someone's hopes up and not deliver anything.

Robowaifu Technician 01/10/2020 (Fri) 12:04:56 No.1888

Robowaifu Technician 04/12/2020 (Sun) 09:19:10 No.2355

>>1582
Was just gonna repost Tacotron. It's a really amazing voice synthesizer. It'd be interesting to see what it's capable of adapted with transformers instead of outdated LSTMs.
https://www.youtube.com/watch?v=0sR1rU3gLzQ

Audio samples: https://google.github.io/tacotron/publications/tacotron2/
Paper: https://arxiv.org/pdf/1712.05884.pdf
GitHub: https://github.com/NVIDIA/tacotron2

Unsupervised Any-to-Many Audiovisual Synthesis via Exemplar Autoencoders Robowaifu Technician 04/12/2020 (Sun) 09:37:42 No.2356

>>2355
Also, a newer paper from this year can convert any voice to many and improves on SOTA in any-to-any conversion. I can imagine this being used to train a synthesized voice on the more subtle and emotional nuances of speech.

Demo: https://dunbar12138.github.io/projectpage/Audiovisual/
Paper: https://arxiv.org/pdf/2001.04463.pdf
GitHub: https://github.com/dunbar12138/Audiovisual-Synthesis

Robowaifu Technician 04/12/2020 (Sun) 09:40:56 No.2358

thanks for the papers anon, i'll try to get through them sometime soon. :^)

Robowaifu Technician 04/22/2020 (Wed) 22:25:31 No.2499

does anyone have a good idea how many minutes of audio it should take to train a good text-to-speech model from scratch with current machine learning techniques? I found no dataset containing child or childlike speech. so far there seems to be no academic interest in compiling one, and i really don't think another soul on this planet is pathetic or degenerate enough to make one. so here I am with about 500 ~15 minute long videos ripped from some family's youtube channel. youtube's machine generated subtitles are surprisingly accurate so far, but this is still a really daunting task to label speakers, proofread, and format, and i'm not sure how much of this is needed to get the job done right. also this feels incredibly skeevy, but unless one of you has seen a dataset that has what i'm looking for, it's something i have to do, even if the ethics of duplicating a real living child's voice for my own purpose is dubious at best.

Robowaifu Technician 04/23/2020 (Thu) 00:06:46 No.2501

>>2499 You might try asking the Anons working on the Pony Preservation Project. They are likely to be a better source of information on this atm. >>1563

Robowaifu Technician 04/23/2020 (Thu) 05:09:15 No.2508

>>2499 People have achieved near human-quality voices with the voice cloning toolkit corpus. It consists of 110 speakers, mostly in their 20's, reading 400 sentences each. https://datashare.is.ed.ac.uk/handle/10283/3443 You're not gonna get very far only using one voice though. You can probably find some children's speech datasets on Kaggle.

Robowaifu Technician 04/23/2020 (Thu) 20:47:35 No.2517

>>2508 when i looked on kaggle i was unable to find anything of that nature. as far as using a single voice, i'm not intentionally using only a single voice, but when i was looking into this, waveglow (https://github.com/NVIDIA/waveglow) appeared to be achieving decent quality results using a single person's voice. if i can find good samples of multiple voices i'd be interested in all of my different options but as of right now i'm stuck using data i can put together on my own.

Robowaifu Technician 04/23/2020 (Thu) 20:58:55 No.2518

>>2517 >if i can find good samples of multiple voices he linked you a very good one already. and i directed you to a group working with an entire 200-episode-show cast's worth of professional voice actors, including girls. can't find anything 'good' in those anon?

Robowaifu Technician 04/23/2020 (Thu) 21:09:17 No.2519

>>2518 i didn't mean to disregard anon's advice to seek out the mlp group. i did appreciate the referral. i don't know yet if that fits the bill or if i need to keep looking, but i will be looking into his suggestion.

Robowaifu Technician 04/24/2020 (Fri) 01:53:37 No.2524

>>2519 Haha no worries Anon! I just wanted to point out there is already a lot of content between those. The one is over 10 GB of highly-tagged audio sources, and the other is a growing mountain of autistically-driven creation, much of which is remarkably good. Good luck Anon.

Robowaifu Technician 05/03/2020 (Sun) 17:03:25 No.2640

I like you.

Robowaifu Technician 05/03/2020 (Sun) 19:01:28 No.2641

>>2640 Thanks Anon. We like you, too. Just the way you are tbh.

Robowaifu Technician 07/10/2020 (Fri) 19:03:27 No.4144

>>1582 >GitHub: https://github.com/CorentinJ/Real-Time-Voice-Cloning What a fucking garbage software, I tried to use Totala narrator voice to try reading a few paragraphs and the program shits the bed. Using smaller text samples doesn't even clone the narrator voice at all, what the fuck man.

Robowaifu Technician 07/10/2020 (Fri) 23:46:21 No.4145

>>4144 Yeah, it's a little dated and wasn't really a pro-grade project anyway. >"...13/11/19: I'm now working full time and I will not maintain this repo anymore. To anyone who reads this: >If you just want to clone your voice, do check our demo on Resemble.AI - it will give much better results than this repo and will not require a complex setup. The free plan already allows you to do a lot. We have a plugin for Unity as well for game developers. >If, for some reason, you want to spend hours setting up a python environment to run crappy old tensorflow code that will give you very average results, be my guest. One year after my thesis, there are many other better open-source implementations of neural TTS out there, and new ones keep coming every day." Might try the recommendation Anon? Please let us know how it works for you if you do, thanks. Good luck. https://www.resemble.ai/

Robowaifu Technician 07/20/2020 (Mon) 04:18:46 No.4242

>>4145 >resemble.ai >you need to add your E-Mail account so that their pajeet tech scammer can spam it Yeah lets not get in there, its not looking too pretty.

Robowaifu Technician 07/20/2020 (Mon) 07:32:22 No.4247

>>4242 Heh, they are obviously for-profit and want to monetize the customers. As I implied elsewhere, if you want to have your cake and eat it too, then we'll need to roll our own here at /robowaifu/. Better crack those books, Anon! :^)

Robowaifu Technician 07/22/2020 (Wed) 03:47:56 No.4333

I just wanted to mention that there's another thread about voices: https://julay.world/robowaifu/last/156.html Maybe these could be merged? I'd like voices to resemble actresses with good voices or a mix of different ones to avoid trouble. There's enough audio and scripts from subtitles available. Is training a NN on using Audacity the right way? It would first need a network which could tell us how similar two voices are, then we could try to get closer and closer. We also have Festival available as free software, so voices from there could be the starting point. Maybe first think of a way how to tell it how close voices are, some generated others not, then it can learn.
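
On the "how close are two voices" question: the speaker-verification encoder mentioned in >>1582 turns a clip into an embedding vector, and one common way to compare such vectors is cosine similarity. A minimal sketch follows, with short hand-written vectors standing in for real embeddings. The candidate names and numbers in the usage note are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def closest_voice(target, candidates):
    """Return the name of the candidate embedding most similar to the target."""
    return max(candidates, key=lambda name: cosine_similarity(target, candidates[name]))
```

For example, `closest_voice([0.9, 0.1, 0.0], {"festival_a": [1.0, 0.0, 0.0], "festival_b": [0.0, 1.0, 0.0]})` picks `"festival_a"`. Real embeddings would have to come from a trained encoder, which this sketch does not provide.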

Robowaifu Technician 07/22/2020 (Wed) 18:43:03 No.4345

>>4333 >Maybe these could be merged? Good idea, and maybe I'll work out a reasonably convenient way to do that at some point. However, Lynxchan software doesn't have a 'merge these threads' button, and the manual effort to do so is rather tedious (as you might imagine). >Is training a NN on using Audacity the right way? Near as I can tell, those are 'apples and oranges' in comparison. Audacity is a tool for human editing of audio files primarily. NNs are an entirely different type of thing. And you have some good ideas. Honestly, I would suggest you look into the Pony's efforts in this area. They have achieved some impressive results already based on extracting VA from the show. >>1563 But I hope we can manage something like your ideas, that would be great.

Robowaifu Technician 07/25/2020 (Sat) 15:54:18 No.4519

There's a program called MorphVox Pro which can alter a male voice to female. I have no intention of using my own voice for that, but maybe it could be useful for changing generated voices from e.g. Festival. More importantly, it shows what's possible. The vid is even from 2015.
https://youtu.be/CpVwl-FEzl4
Via https://dollforum.com/forum/viewtopic.php?f=6&t=130302&sid=44113180fc656eb7aa41381a0ce12d02

Stephen Lynx 07/30/2020 (Thu) 00:24:05 No.4622

>>4345 There is a merge thread feature on 2.4 tho.

Chobitsu 07/30/2020 (Thu) 02:17:03 No.4624

>>4622 That's good news. As long as Robi utilizes that version for the reboot, I'll explore that feature and that idea. Now please allow me to mark individual posts and provide a 'move this post to thread X' (in batches of many posts ofc).

Robowaifu Technician 08/02/2020 (Sun) 12:06:23 No.4659

>>199 I just found this speech synthesizer programming tutorial: https://www.youtube.com/watch?v=Jcymn3RGkF4

Robowaifu Technician 08/02/2020 (Sun) 16:08:51 No.4666

>>4659
Interesting, but the end result is useless unless you wanted a bad voice with a Finnish accent. Don't fall for his trick at the beginning when he talks like his voice is the synthesizer. Also, I don't get it. Why would everyone create their own? I just need one program where I can put in the data. Did you try his software: https://github.com/bisqwit/speech_synth_series/tree/master/ep4-speechsyn
Is it even reproducible, or just messy unreadable code? Where is it better than eSpeak or Festival?

Chobitsu 08/02/2020 (Sun) 20:57:13 No.4687

>>4659 Thanks Anon, appreciated. I'll have a look at it sometime over the next few days.

Robowaifu Technician 08/03/2020 (Mon) 03:13:03 No.4693

>>4659
Does anyone have ideas for how to get phonemes from voices without manually cutting them out of sound files? There seem to be some methods, but it's difficult and complex. Not even sure if this helps: https://youtu.be/x1IAPgvKUmM

There are voices available for sale, and free ones as well, so it might be easier to change those. But what's the best way to do that? That would be something that moves us forward.

Here is an introduction to working with sound and signal processing in Python. Not sure if I should learn that at some point, but I like his approach to teaching and learning by doing projects: https://youtu.be/0ALKGR0I5MA

The available software gets better every year, but not for free, and often it needs the cloud. However, even if we don't get anything done here, at least something will be available, because others want this stuff as well. Then again, cloud-based stuff is quite useless.

EmVoice One: https://youtu.be/Da2DAjKzeaQ and UTAU, Vocaloid, SynthV, Eleanor Forte are mentioned in the comments.
Newscaster, wow: https://youtu.be/wHP3J01aEns

Robowaifu Technician 08/03/2020 (Mon) 03:53:59 No.4694

The best TTS is AI-based. Check out the demos. Google has some. Amazon has some. You don't hear them, typically, though. There are issues with performance & cost. My opinion, wait for generally-available AI TTS. Someone mentioned espeak. espeak is like the near exact opposite, however, you can speed up espeak way faster than other systems. But what I want is good singing. If you've ever pirated Vocaloid, you know it sucks so bad, not just in terms of English results, but in terms of the interface.

Robowaifu Technician 08/03/2020 (Mon) 04:30:39 No.4695

>>4694
The idea with eSpeak was to use it as a base with a female voice, then have some other software change the voice output to make it better. You can combine software; one program doesn't need to do it all.
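
A minimal sketch of the first half of that pipeline: rendering text to a WAV file with the `espeak` command-line tool so another program can process it afterwards. It assumes `espeak` is installed and on the PATH. The `en+f3` female voice variant and the speed and pitch defaults are just example choices, and the helper names are made up.

```python
import subprocess

def espeak_command(text, wav_path, voice="en+f3", speed_wpm=150, pitch=60):
    """Build the argument list for rendering `text` to a WAV file with espeak."""
    return [
        "espeak",
        "-v", voice,           # voice name plus variant
        "-s", str(speed_wpm),  # speaking rate in words per minute
        "-p", str(pitch),      # pitch, 0 to 99
        "-w", wav_path,        # write audio to this file instead of playing it
        text,
    ]

def render(text, wav_path, **options):
    """Run espeak and raise CalledProcessError if it exits with an error."""
    subprocess.run(espeak_command(text, wav_path, **options), check=True)
```

Calling `render("Hello there.", "hello.wav")` should leave a file that a second tool (a voice changer, an effects script, etc.) can then pick up.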

Robowaifu Technician 10/06/2020 (Tue) 06:09:50 No.5467

Someone mentioned WaveGlow here >>5461 and it sounds good. Though the alternatives on their site https://nv-adlr.github.io/WaveGlow sound good as well. I wonder why all of these sound better than Hanson Robotics' Sophia in her recent videos. Maybe because Sophia's speech is generated live at the time she's talking.

"In our recent paper, we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable."

Can't upload the paper. There might be a block on uploads because of some spammer.

Robowaifu Technician 10/06/2020 (Tue) 18:24:27 No.5474

>>5467
No, Anon, you are the robots.
https://streamable.com/u1ulrp
>===
What hath God wrought? Use your robowaifu-powers only for good Anon. :^)

Edited last time by Chobitsu on 10/06/2020 (Tue) 21:17:15.

Chobitsu 10/06/2020 (Tue) 21:48:23 No.5475

>>5474 Nice work Anon. Mind telling us how you did it?

Robowaifu Technician 10/07/2020 (Wed) 03:39:34 No.5480

>>5474 I hope this will be up to your standards, Anon. >

Robowaifu Technician 10/07/2020 (Wed) 07:06:57 No.5484

>>5474 I haven't looked much into the documentation and stuff, but how easy or hard is it to get it to use another voice pack and it sounding just as natural? For example, I've found some local vtubers who know how to talk like anime waifus in English. I figured I could bribe some of them to donate some voiced lines which can then be used as training data so that we can have some cutesy voices. It's cheaper than trying to figure out how to contact JP voice actresses who actually sound terrible in English. So my question is if it's possible to compile a list of the minimum lines a voice actress will need to get enough data to have an AI adopt her voice? I used to be active in game development so I think I am experienced enough in giving contracts and royalties etc. (even though my actual games were financial failures).

Robowaifu Technician 10/07/2020 (Wed) 07:21:43 No.5485

>>5480
Sounds amazing and absolutely sufficient. Some little indicators that she isn't human are even very welcome. Well, of course I'd like to use another voice, something cuter, younger, girlier, or more like Cameron. Also, how useful it is depends on how fast at least single words or short phrases can be created, and on what kind of hardware. I could imagine having a lot of phrases and sentences stored on an SSD and only filling in the blanks. Then maybe add another modulation system to the output (vid related).
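
The "store phrases on an SSD and only fill in the blanks" idea could look roughly like the sketch below: fixed parts of a sentence are synthesized once and cached on disk, and only the variable parts need fresh synthesis. The `synthesize` callable is a hypothetical stand-in for whatever TTS backend is used, and the class and function names are invented for this example.

```python
import hashlib
import os
import string

class PhraseCache:
    """Cache synthesized audio on disk, keyed by a hash of the phrase text."""

    def __init__(self, directory, synthesize):
        self.directory = directory
        self.synthesize = synthesize  # hypothetical backend: text -> audio bytes
        os.makedirs(directory, exist_ok=True)

    def _path(self, text):
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        return os.path.join(self.directory, key + ".wav")

    def get(self, text):
        """Return cached audio for `text`, synthesizing and storing it on a miss."""
        path = self._path(text)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        audio = self.synthesize(text)
        with open(path, "wb") as f:
            f.write(audio)
        return audio

def speak_template(cache, template, **blanks):
    """Return audio chunks for a template such as 'Good morning, {name}.'"""
    chunks = []
    for literal, field, _spec, _conv in string.Formatter().parse(template):
        if any(ch.isalnum() for ch in literal):
            chunks.append(cache.get(literal.strip()))
        if field:
            chunks.append(cache.get(str(blanks[field])))
    return chunks
```

With this layout, saying "Good morning, {name}." to two different names only synthesizes the greeting once. Joining the chunks without audible seams is a separate problem that the sketch ignores.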

Robowaifu Technician 10/07/2020 (Wed) 07:38:42 No.5486

>>5485 Haha, thanks but I'm just a humble webm encoder tending my wares. The credit is due to the actual author. >>5474 But yes, you and this Anon >>5484 have some good ideas for enhancements. I hope we can manage something good with synthesized voices, and something entirely open too.

Robowaifu Technician 10/07/2020 (Wed) 15:23:58 No.5493

>>5475 It's just WaveGlow out of the box. I wrote a quick Python script to sample it. You can get the code here: https://gitlab.com/kokubunji/say-something-godot-chan It requires an Nvidia GPU with CUDA and cuDNN though and the dependencies are ridiculous, not to mention the 700 MB in models that must be downloaded. >>5484 The pretrained model's dataset is 24 hours of 13100 audio clips. I haven't tried training it yet but I can't imagine that amount of data is necessary to finetune the model. I've finetuned GPT2 on my CPU with a few books and it completely changed the output. >>5485 It's very fast. Even on my toaster GPU it generates about 16 words per second or 1000 words a minute. You could generate a 90,000 word audiobook in about an hour and a half. >>5486 My frustration is how inaccessible and convoluted these models are. They can't be deployed to consumers, but I got some ideas for a lightweight and expressive speech synthesis that can be run on a CPU. If my voice box simulation is successful I should be able to use the data generated to create a synthesized voice with complete vocal expression. It's really unbelievable the direction research is heading, they just find more ways to throw money at a problem rather than better data. In a few years we might have barely any new AI at all, except what is made available through an API to select individuals.
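The speed figures above are easy to sanity-check. This is just the arithmetic written out; the only input is the quoted rate of about 16 words per second, and the function names are made up for illustration.

```python
def words_per_minute(words_per_second: float) -> float:
    """Convert a generation rate from words/second to words/minute."""
    return words_per_second * 60


def minutes_for(total_words: int, words_per_second: float) -> float:
    """Estimated minutes needed to synthesize total_words at a given rate."""
    return total_words / words_per_second / 60


# 16 words/s is 960 words/min, so "1000 words a minute" is a slight round-up,
# and a 90,000 word book lands a little over an hour and a half.
rate = 16
print(words_per_minute(rate), minutes_for(90_000, rate))
```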

Robowaifu Technician 10/07/2020 (Wed) 18:06:07 No.5495

>>5493 >???? >PROFIT!! Kek.

Chobitsu 10/07/2020 (Wed) 23:23:22 No.5499

>>5493 Thanks Kokubunji, nice repo instructions. Much appreciated.

Robowaifu Technician 10/08/2020 (Thu) 06:00:15 No.5503

>>5493 Chii a cute in the classroom. I like this idea. >16 words per second or 1000 words a minute Seems like that would be fast enough. I hope you figure out how to make your voice box run on a CPU, that would be great!

Robowaifu Technician 10/08/2020 (Thu) 15:41:40 No.5520

Full speech is tough even with modern software. But there is a trick I figured out a while ago. If you take robot speech and try to make a fluent dialog, something is going to go wrong and ruin the illusion. However, this only happens when your expectation is solid speech. Flip the script of expectations and build it to speak a foreign language you don't know and sprinkle in "pidgin" English. Your perspective goes from "this thing barely works " to " this thing is trying so hard ". What was once an annoying glitch becomes cute effort as it tries its hardest. All it takes is prerecorded bursts of a foreign language mixed with a few awkward text to speech words.

Robowaifu Technician 10/08/2020 (Thu) 19:37:35 No.5521

>>5503 The actual simulation will be too slow to run in real-time but the data generated from it of the tongue position, jaw position, vocal posture and such can be used to train a neural network to synthesize a voice with the same parameters. By simulating a large variety of voices I hypothesize it could reverse engineer people's voices to some degree. However, the software I'm using cannot simulate sound vibrations from turbulent airflow (breathiness) or surfaces contacting (such as Ugandan Knuckles clicking), only resonance. I might be able to simulate breathiness though by modulating the sound with some randomness. Either way, converting text to the parameters that produce that sound should be far more efficient and embeddable in a game. It'll be better than nothing. The parameters should also make it possible to generate unique voices for random characters and customize waifu voices to one's liking.

Robowaifu Technician 10/08/2020 (Thu) 20:53:28 No.5523

>>5493 >RuntimeError: CUDA out of memory. Kek, so much for using Tacotron2. A 6 GB card isn't enough to train it.

Robowaifu Technician 10/08/2020 (Thu) 21:39:28 No.5525

>>5521 >geeks and robots lel'd >vocal posture Interesting. I don't think I've been familiar with that concept? >>5523 Are you running it 'dry' Anon? That is, no other resources especially vidya! contending for the GPU's memory resources?

Robowaifu Technician 10/08/2020 (Thu) 22:38:43 No.5526

>>5523 >A 6 GB card isn't enough to train it. AI Anon said 'Even on my toaster GPU' . Maybe there's some kinds of settings you need to tweak. Surely a toaster GPU has less than 6GB of RAM?

Robowaifu Technician 10/08/2020 (Thu) 23:54:00 No.5529

>>5525 Yeah, this card is fully dedicated to machine learning, not even attached to the monitor. >>5526 I found out the batch size parameter was hidden in hparams.py and it was set too high. It seems to be working with a batch size of 16. I'm surprised how fast it is. It'll only take about 40 minutes to train on 24 hours of audio clips. Now we just need a dataset of cute voice samples.

Robowaifu Technician 10/09/2020 (Fri) 00:56:11 No.5530

>>5520 >Your perspective goes from "this thing barely works " to " this thing is trying so hard ". Yes, I agree with that idea. Robowaifu naivete can actually be quite adorable, and it's an effective trope. > pic related >>5521 >Either way, converting text to the parameters that produce that sound should be far more efficient and embeddable in a game. I see (well kind of, I think). If I understand correctly, the workload of the simulation is primarily used for generating these parameters? So if you pre-generate them ahead of time and store them somehow, then the second part where the parameters are read in then used to generate the actual waveforms should be computationally inexpensive. Is that approximately the idea anon? chii has no knuckles... <tfw ;~; >helping anon find de whey https://www.youtube.com/watch?v=IulR5PXiESk >do it for princess chii anon! >>5529 >Now we just need a dataset of cute voice samples. I nominate Chii first. Surely we could manage to create a clip library of all the Chii VA segments from the Chobits animu?

Robowaifu Technician 10/09/2020 (Fri) 02:23:00 No.5532

>>5530 >Surely we could manage to create a clip library of all the Chii VA segments from the Chobits animu? I don't know if that'll be enough. Chii didn't really talk much. Each audio clip also needs text and the background noise has to be removed with something like Spleeter or DTLN. https://github.com/deezer/spleeter https://github.com/breizhn/DTLN It's worth a shot though. The average sentence in the dataset is about 20 words. Output seems to perform best around 10-15 word phrases. Keeping clips a sentence long would be best. I'm gonna try Rikka first since I have limited bandwidth and already have the entire show downloaded. I don't know how well it will handle Japanese though. We can probably automate audio clipping by using subtitles off https://kitsunekko.net/ and piping them through a denoising network. That way it's easy to train a whole bunch of characters.
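A rough sketch of how subtitle timings could drive the clipping step. This is not Clipchan's actual code: the function names and the padding default are assumptions for illustration. It only builds the ffmpeg argument list (using the standard -ss, -t and -vn options) and doesn't run anything.

```python
def ass_time_to_seconds(stamp: str) -> float:
    """Convert an ASS timestamp such as '0:01:31.87' to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clip_command(video: str, start: str, end: str, out_path: str,
                 padding: float = 0.2) -> list:
    """Build an ffmpeg command that cuts one subtitle's audio out of a video.

    padding (seconds) widens the cut on both sides; the value is a guess.
    """
    begin = max(0.0, ass_time_to_seconds(start) - padding)
    duration = ass_time_to_seconds(end) + padding - begin
    return [
        "ffmpeg",
        "-ss", f"{begin:.2f}",   # seek to the start of the clip
        "-i", video,
        "-t", f"{duration:.2f}",  # keep only this many seconds
        "-vn",                    # drop the video stream
        out_path,
    ]


# Example: one argument list per subtitle line, ready for subprocess.run().
print(clip_command("01.mkv", "0:01:31.87", "0:01:33.95", "clip_0001.wav"))
```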

Robowaifu Technician 10/09/2020 (Fri) 06:14:15 No.5537

>>5530 Well, the idea of machine learning is to disentangle useful latent variables from high-dimensional data but without immense amounts of data it's exponentially difficult for backpropagation to separate them. For example, if you wanted to learn the x,y position of something on an image and control it but had a lack of data samples to train on, it may notice that the x and y values are correlated and become biased, so when you try to change the x dimension it causes the object to move diagonally instead because it failed to disentangle the latent variables. If the training data covers the latent space evenly and the model has access to all the underlying variables, it has a much easier time disentangling the data and can interpolate between the gaps without much confusion because it's getting input where that data belongs in the latent space. A smaller and simpler model can be used rather than a bulky slow one because it doesn't have to do all the guesswork of pulling all the dimensions apart trying to sort the data. >>5532 Done: https://gitlab.com/kokubunji/clipchan

Robowaifu Technician 10/09/2020 (Fri) 07:09:59 No.5539

>>5537 >clipchan error ModuleNotFoundError: No module named 'ass' > Apparently I need a Python dependency? The .ass file was extracted w/ ffmpeg.

Robowaifu Technician 10/09/2020 (Fri) 07:16:51 No.5540

>>5539 just in case it matters >

Robowaifu Technician 10/09/2020 (Fri) 07:29:26 No.5541

>>5537 >>5539 >>5540 Nvm, figured it out. > #1 Now I'm getting a 'file name too long' error (probably some of the interstitial stuff; I can post the .ass text if you'd like). > #2 Also, it's extracting the first audio track (Japanese), but I want the second track (English). Any way to control this? Regardless, very cool work Kokubunji.

Robowaifu Technician 10/09/2020 (Fri) 07:40:33 No.5543

>>5541 sample clipchan results, btw. had to convert to .mp3 so I could post them here, but the originals are all .m4a >

Robowaifu Technician 10/09/2020 (Fri) 07:47:20 No.5544

>>5541 Found the Dialogue Event that broke things (the middle one)

Dialogue: 0,0:01:31.87,0:01:33.95,Chobits Dialogue,Comment,0,0,0,,I'm gonna go to Tokyo!
Dialogue: 0,0:01:48.87,0:01:55.88,Chobits OP JP,,0,0,0,,{\fad(400,900)\be1}{\k15\1c&HDF6B7B&}Fu{\k21}ta{\k22}ri {\k23\1c&H4E4FDE&}ga {\k44\1c&HDE8162&}ki{\k45}tto {\k22\1c&HA1CA5D&}de{\k23}a{\k23}e{\k24}ru {\k43\1c&H226FCD&}you{\k48}na {\k20\1c&H56CED9&}ma{\k45}hou {\k26\1c&H7D79D7&}wo {\k47\1c&HDA90CB&}ka{\k48}ke{\k152}te.
Dialogue: 0,0:01:48.87,0:01:55.88,Chobits OP EN,,0,0,0,,{\fad(400,900)\be1}Casting a spell that will make sure they meet.

Robowaifu Technician 10/09/2020 (Fri) 07:52:27 No.5545

>>5537 Thanks for that detailed explanation. That helps, actually. Not sure how to word a cogent response, but the topic seems to make more sense to me now.

Robowaifu Technician 10/09/2020 (Fri) 08:04:41 No.5547

>>5532 Fair enough, good luck with Rikka Anon! :^) I'll try to sort out longer audio/subtitle clips of Chii's speech from all the episodes. I'd expect we should be able to find at least five minutes of this, just in case it might work.

Robowaifu Technician 10/09/2020 (Fri) 08:13:12 No.5548

>>5537 >>5541 >>5544 BTW, this is the name of the source file itself, just to be on the same page: 01. Chobits [BD 720p Hi10P AAC][dual-audio][kuchikirukia] [3DD90125].mkv The 'kuchikirukia' version seems to be the best quality one I've found over the years of the Chobits series, so I've just standardized on it.

Robowaifu Technician 10/09/2020 (Fri) 10:40:15 No.5549

>>5493 We just need the audio clip equivalent of "The quick brown fox jumps over the lazy dog." where there are enough use cases to build a speech pattern. >>5520 The fake foreign language option also sounds good. How about Klingon, or Hymnos (Reyvateil language in Ar Tonelico series)? Godspeed anon.

Robowaifu Technician 10/09/2020 (Fri) 13:20:04 No.5550

>>5539 Whoops, forgot to add requirements.txt. Anyone setting it up now can just do: pip install -r requirements.txt >>5541 I could output subtitles that are too long to a CSV file or something. Long audio clips need to be manually split up anyway. It seems the offending line is the opening with all the formatting code. You should be able to filter the events it clips to only character dialogue with --style "Chobits Dialogue" But there may still be rare cases where English subtitles go over the 255 character limit. I'll start with scrubbing the formatting tags so openings can be clipped too. Also it might be useful to keep the raw audio clips and parameters used to generate them. Later when we clean them up we could use that data to train an AI to automatically crop and prepare clips. >>5547 If there are too few, we can try augmenting the data by splitting up audio clips into shorter phrases to add to the dataset, as well as applying a random delay to the beginning, slightly changing the volume or increasing the tempo in Audacity.
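Here's roughly what style filtering and tag scrubbing could look like on raw .ass Dialogue lines. Take it as a sketch, not Clipchan's implementation: the helper names are invented, and it assumes the usual ten-field Dialogue layout (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text).

```python
import re

# Override blocks look like {\fad(400,900)\be1} or {\k15\1c&HDF6B7B&}.
TAG_RE = re.compile(r"\{[^}]*\}")


def parse_dialogue(line: str):
    """Split one 'Dialogue:' event into the fields needed for clipping."""
    if not line.startswith("Dialogue:"):
        return None
    # The text field is last and may contain commas, so split at most 9 times.
    fields = line[len("Dialogue:"):].strip().split(",", 9)
    if len(fields) < 10:
        return None
    return {"start": fields[1], "end": fields[2],
            "style": fields[3], "text": fields[9]}


def clean_text(text: str) -> str:
    """Remove override tags and ASS line breaks, then collapse whitespace."""
    text = TAG_RE.sub("", text)
    text = text.replace(r"\N", " ").replace(r"\n", " ")
    return " ".join(text.split())


def dialogue_for_style(lines, style: str):
    """Yield (start, end, cleaned text) for events that use the given style."""
    for line in lines:
        event = parse_dialogue(line)
        if event and event["style"] == style:
            yield event["start"], event["end"], clean_text(event["text"])
```

Feeding it the three events quoted earlier with style "Chobits Dialogue" keeps only the "I'm gonna go to Tokyo!" line and skips both opening-song events.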

Robowaifu Technician 10/09/2020 (Fri) 19:50:16 No.5554

Put up some quick instructions on cleaning audio clips with Spleeter and DTLN: https://gitlab.com/kokubunji/clipchan#remove-music-from-audio-clips It's pretty simple to use. DTLN is better at removing noise but the quality isn't as pleasant as Spleeter.

Robowaifu Technician 10/09/2020 (Fri) 20:06:43 No.5555

>>5550 >--style "Chobits Dialogue" Great!, that did the trick. Extracted 408 clips in about 2 minutes. > That's fine to get the Japanese clips (which, frankly I like Chii's VA's voice better tbh), but I'd like to get the English channel's clips too. Have any suggestions?

Robowaifu Technician 10/09/2020 (Fri) 20:29:07 No.5556

>>5555 >portentous digits tho Here are Chii's first 5 utterances to Hideki... > Ofc during the first episode her only words were cute variations of 'Chii'. :^)

Robowaifu Technician 10/09/2020 (Fri) 20:32:19 No.5557

>>5556 BTW, it's humorous to simply mpv * from inside the clip extract directory. Kind of like 'watching' the show on fast forward.

Robowaifu Technician 10/09/2020 (Fri) 20:46:12 No.5558

>>5554 Thank you. Yes, Spleeter seems to preserve the subtleties of the voice better. I wonder what's the difference? Regardless, I'll be post-processing the clip extracts from Chii's dialogue where needed. I don't have a specific time frame, but I plan to work my way through an episode or two here and there until I have the complete set. I'll probably post the completed set as a zip on Anonfiles when it's finished.

Robowaifu Technician 10/09/2020 (Fri) 21:15:35 No.5559

>>5555 I just pushed a bug fix and feature update. It should be able to process all subtitles now. Overly long filenames are truncated and all needed subtitle text is written into filelist.txt in the output path. You can now inspect the subtitles with --inspect or -i before running and it will count how often the styles are used. The most used one is likely the character dialogue. >>5556 My heart can't handle this much cuteness at once. There's a lot of noise in them but some of them are still usable. >>5558 DTLN has a lower sampling rate than Spleeter and was designed for removing heavy background noise like air conditioners running. Good luck with it. If anyone doesn't have CUDA but wants to train a voice I don't mind training a character voice for them if they have the clips. We could probably train the voices on Kaggle or Google Colab too. If a lot of people become interested in the project one day that would be one way for them to get started.
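Counting styles the way the inspect option is described could look something like this. It's a guess at the idea, not the real code: count_styles is a hypothetical name, and it assumes the Style field is the fourth comma-separated field of each Dialogue event.

```python
from collections import Counter


def count_styles(lines) -> Counter:
    """Count how many Dialogue events use each subtitle style."""
    counts = Counter()
    for line in lines:
        if not line.startswith("Dialogue:"):
            continue
        fields = line.split(":", 1)[1].split(",", 9)
        if len(fields) == 10:
            counts[fields[3].strip()] += 1
    return counts


def guess_dialogue_style(lines):
    """Return the most frequently used style, or None if there are no events."""
    ranked = count_styles(lines).most_common(1)
    return ranked[0][0] if ranked else None
```

The most common style is only a heuristic for "character dialogue"; a show with a lot of signs or song lyrics could fool it, so eyeballing the counts first is still worthwhile.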

Robowaifu Technician 10/09/2020 (Fri) 21:46:07 No.5560

>>5559 Great, thanks for the inspect flag and fixes. So, again, any way to specify the English language audio track for clip extraction (vs. the defaulted Japanese)?

Robowaifu Technician 10/09/2020 (Fri) 23:15:27 No.5568

>>5559 Haha, want an ASMR? create a playlist of all 28 Chii clips from ep01 and then mpv --playlist=01_chii.pls --loop-playlist=inf

Robowaifu Technician 10/09/2020 (Fri) 23:25:31 No.5570

>>5560 Sorry I missed that. Just pushed another update to select the audio track. Use --track 2 or -t 2 to extract the second audio track. Also added --quiet / -q to silence ffmpeg output unless there's an error and --guess / -g to automatically pick the most common style for extraction. Also major update: the subtitles file option is now optional and specified with --subtitles / -S. Subtitles can be extracted directly from video now and the subtitle track selected with -b if necessary. >>5568 Haha, that's a lot of Chii. It seems there's a bug though? The formatting tags shouldn't be showing in the filenames unless those are clips extracted from an early version.

Robowaifu Technician 10/09/2020 (Fri) 23:31:45 No.5571

>>5570 >Chii-levels > 9'000 IKR? >unless those are clips extracted from an early version. Yes, it's the older stuff I haven't redone it yet. I'll use the newer stuff for the final processing & edits, etc. BTW, there are still a few formatting things in the newer version. IIRC, '(/N)' (or something similar). Also, portable filenames (for instance that work on W*ndows) need some chars removed to work correctly. I dealt with this issue in BUMP. So, I'd say the !, ? and any other punctuation are good candidates for removal from the filenames for example.

Robowaifu Technician 10/09/2020 (Fri) 23:33:48 No.5572

>>5570 >Just pushed another update to select the audio track >Also major update: the subtitles file option is now optional Great! I'll try this over the weekend. Thanks for all the hard work Anon.

Robowaifu Technician 10/10/2020 (Sat) 01:02:58 No.5574

>>5571 There, added portable filenames. I noticed the recent version wasn't removing {} stuff again so I fixed that too. Now I just need to automate Spleeter and it should be good to go. >>5572 I don't even think of it as work I'm so excited for this. There's so much that can be potentially done with it from voicing waifus to redubbing anime. The memetic potential is infinite.
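For reference, a portable-filename pass could be as small as the sketch below. This is an illustration only, not what Clipchan does internally: portable_filename, the .wav default and the "clip" fallback are all assumptions. The 255 limit is treated as bytes, which is what most filesystems enforce.

```python
import re


def portable_filename(text: str, ext: str = ".wav", max_bytes: int = 255) -> str:
    """Turn subtitle text into a filename that is safe on most filesystems."""
    # Drop punctuation such as ! ? ' : that some platforms reject.
    name = re.sub(r"[^\w\s-]", "", text)
    # Replace runs of whitespace with a single underscore.
    name = re.sub(r"\s+", "_", name.strip())
    # Truncate by encoded size, leaving room for the extension.
    budget = max_bytes - len(ext.encode("utf-8"))
    name = name.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return (name or "clip") + ext


print(portable_filename("I'm gonna go to Tokyo!"))
```

Since truncation can make two long lines collide, prefixing each name with the subtitle index would be a sensible extra step.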

Robowaifu Technician 10/10/2020 (Sat) 02:42:04 No.5575

>>5570 Seem to be having trouble auto-pulling the subtitles. Here's a portion of the output showing which channel it's in: > #1 Here's the response I'm getting, trying the simplest approach: > #2 I'm sure I'm just flubbing it somehow. Maybe providing a specific example of grabbing auto-subtitles would help a bit ? I can generate them myself w/ ffmpeg, but I would much prefer using your method instead.

Robowaifu Technician 10/10/2020 (Sat) 02:42:47 No.5576

>>5575 >a portion of the ffmpeg output*

Robowaifu Technician 10/10/2020 (Sat) 02:54:36 No.5577

>>5575 >trying the simplest approach: Actually, I guess this is the simplest approach, but it breaks on me worse: >

Robowaifu Technician 10/10/2020 (Sat) 07:42:25 No.5581

>>5577 My bad, I forgot to push my code before going to bed, but I think inspect should have still worked with -b 3. I've updated debug mode -d to provide some more useful output, such as what it's running FFmpeg with. FFmpeg is getting an invalid argument list somehow. It may be due to your version of FFmpeg. Can you pull the latest version of Clipchan and try running these commands to see what they output now?

python ../clipchan.py -i 01.mkv -b 3 -d
python ../clipchan.py 01.mkv -d
ffmpeg -version

Robowaifu Technician 10/10/2020 (Sat) 08:08:15 No.5583

>>5581 Haha, no worries and thanks very much Anon. While you slept, I finished grabbing and sorting the basic clips for all Chii utterances in ep01-ep03, in both English and Japanese. This tool of yours is a remarkable time saver. Ofc all the clips will need explicit fine-tuning inside Audacity later, but your Clipchan is kind of revolutionary tbh. What a difference a day brings! :^)

Robowaifu Technician 10/10/2020 (Sat) 08:14:28 No.5584

>>5581 >Can you pull the latest version of Clipchan and try running these commands to see what they output now? Sure thing, here we go: python ../clipchan.py -i 01.mkv -b 3 -d > #1 python ../clipchan.py 01.mkv -d > #2 ffmpeg -version > #3

Robowaifu Technician 10/10/2020 (Sat) 11:53:10 No.5586

>>5583 Once it automates Spleeter, cropping and normalization it will be truly revolutionary. Every show will be fair game for effortless machine learning. >>5584 Your ffmpeg wasn't built with --enable-libzvbi. I pushed another update though that uses a different method to extract the subtitles from a video. Let me know if it works for you.

Robowaifu Technician 10/10/2020 (Sat) 12:01:55 No.5587

>>5581 >>5586 >Let me know if it works for you. Great, looks like your patch finds the subtitles stream now. > I simply installed the ffmpeg in the repo iirc. I can probably manage to build from their repo if you think it would be worth the trouble?

Robowaifu Technician 10/10/2020 (Sat) 12:04:10 No.5588

>>5587 >I simply installed the ffmpeg in the distro package repo iirc*

Robowaifu Technician 10/10/2020 (Sat) 12:08:02 No.5589

>>5587 It's fine, if it works now the dependency was unnecessary.

Robowaifu Technician 10/10/2020 (Sat) 12:15:45 No.5590

>>5589 Got you. Alright I'm off for a few hours at least. Cheers.

Robowaifu Technician 10/10/2020 (Sat) 16:36:26 No.5593

Spleeter is now automated in v0.2 but not fully tested yet. To try it put the Spleeter pretrained_models directory in the Clipchan directory and use Clipchan as usual plus --spleeter. Due to a bug in Spleeter, the terminal will reset after it completes to prevent the terminal from freezing. Next, to automate cropping and normalization I will make it look for the median point of silence in the padding area and crop it to 0.02 seconds before the next sound detected. This should be good enough. There are some alignment issues with my subtitles so I'm realigning them in Aegisub and reducing the default padding to 0.2s since it's not uncommon for subtitles to be 0.4s apart.
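The cropping idea described above, sketched for the leading padding only. Everything here is an assumption made for illustration (the function name, the amplitude threshold, samples as floats in the range -1 to 1); it just follows the stated recipe: take the median silent sample inside the padding, scan forward to the next sound, and back off 0.02 seconds.

```python
from statistics import median


def crop_start(samples, rate: int, padding: float = 0.2,
               threshold: float = 0.01, margin: float = 0.02) -> int:
    """Return the sample index where the cleaned clip should begin."""
    pad_len = min(len(samples), round(padding * rate))
    # Indices inside the padding area that count as silence.
    silent = [i for i in range(pad_len) if abs(samples[i]) < threshold]
    if not silent:
        return 0  # no silence found, leave the clip untouched
    mid = int(median(silent))
    # Walk forward from the middle of the silence to the next sound.
    for i in range(mid, len(samples)):
        if abs(samples[i]) >= threshold:
            return max(0, i - round(margin * rate))
    return mid  # nothing but silence after the midpoint


# Tail of the previous line, a gap, then the line we actually want.
audio = [0.5] * 50 + [0.0] * 100 + [0.5] * 100
print(crop_start(audio, rate=1000))
```

Starting the scan from the middle of the silence rather than from sample 0 is what keeps the tail of the previous speaker out of the clip. The trailing padding could be handled the same way on the reversed samples.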

Robowaifu Technician 10/10/2020 (Sat) 20:28:39 No.5595

>>5593 That sounds clever. Look forward to trying it out. Sounds like you're almost there Anon.

Robowaifu Technician 10/10/2020 (Sat) 20:47:16 No.5597

>>5574 > I noticed the recent version wasn't removing {} stuff again I found an explicit example of the newline char still being left in the filenames/dialogue text

Dialogue: 0,0:05:36.45,0:05:39.03,Chobits Dialogue,Comment,0,0,0,,{\i1}What did I say in front\Nof such a beautiful lady?

The '\N'

Robowaifu Technician 10/10/2020 (Sat) 20:48:40 No.5598

>>5597 >newline char that's still being left*

Robowaifu Technician 10/10/2020 (Sat) 21:32:20 No.5601

>>5595 Automated clipping and normalization is almost done. I think after this I'll try making a neural net that can detect which character is speaking. That way an entire show can be fed in and sorted automatically using a few examples of the characters speaking. >>5597 Newlines are being removed from my subtitles. The only place they should appear is in the log file in debug mode. Try pulling the latest update and running the same command with -d and inspecting clipchan.log. It will show the reformatted text <> unedited subtitle text, something like this:

[249] 00:20:2.960-00:20:5.980 (0.0) The magma of our souls burns with a mighty flame <> The magma of our souls\Nburns with a mighty flame!

Robowaifu Technician 10/10/2020 (Sat) 21:49:02 No.5603

>>5601 >That way an entire show can be fed in and sorted automatically using a few examples of the characters speaking. That sounds awesome. I think I'm going to put my tedious effort on hold and wait for your better approach Anon. :^) You know it occurs to me that you could probably do a 'two-pass' approach as well (at the cost of longer processing) that could sort of do all the cleanup, crops, speaker identification, etc., then feed that information back into a second pass sequence that should then be able to improve the accuracy of the crops and noise removal on tricky bits that have a lot going on audio-wise in a short time span, for example.

Robowaifu Technician 10/10/2020 (Sat) 21:55:54 No.5605

>>5603 Seems to me, this could also be used to improve a series subtitles as well. Sort of an auto-gen for subtitles, that are actually timed very well, and also more accurate with the actual text. For example, the engrish-translation of some English subtitles often aren't right on-cue with the English VA scripts (even if often much more humorous/possibly more accurate to the original Japanese meanings/idioms). Seems like that might save having to go in and manually edit the filelist.txt entries by hand before passing it into the machine learning so audio/text actually matches first.

Robowaifu Technician 10/10/2020 (Sat) 23:24:19 No.5606

>>5603 The cropping is perfect so long as the subtitles are aligned correctly. Sometimes sound effects slip through Spleeter but that can't be avoided. Speaker identification is going to require building a dataset first to train on. >>5605 Auto-aligning subtitles will be tricky. I could probably fix small misalignments with the same method I'm using to crop the audio clips by snapping them to the nearest sensible spot. I'd have to run Spleeter over the whole episode first which shouldn't be too big of a hit since it has to convert the clips anyway. I'll add this feature idea to the to-do list. Maybe two projects down the line someone will create some speech recognition for it that can generate subtitles.

Robowaifu Technician 10/11/2020 (Sun) 03:37:48 No.5613

Trying to find the instruction on your repo for removing music from clips. > #1 Discovered a minor naming issue w/ instructions. > #2 Then realized (afaict) a showstopper (for me at least) dependency. > #3 I suppose you can't do this w/o a Nvidia GPU then?

Robowaifu Technician 10/11/2020 (Sun) 03:39:36 No.5614

>>5613 >Trying to follow*

Robowaifu Technician 10/11/2020 (Sun) 05:21:38 No.5615

>>5613 Spleeter runs off CPU by default. You need to downgrade to Python 3.7 to install Tensorflow.

Robowaifu Technician 10/12/2020 (Mon) 09:21:56 No.5626

Trained on 2B's voice overnight without data augmentation and possibly too high a learning rate. It's not perfect and there's only 18 minutes of training data, but the results are pretty satisfying. Training data I used available here: >>5620 Filelist.txt: https://pastebin.com/y3GyyBtR Once I fine tune it better I'll create a Google Colab so anyone can use it even without a GPU.

Robowaifu Technician 10/12/2020 (Mon) 09:27:48 No.5627

>>5615 >You need to downgrade to Python 3.7 to install Tensorflow. I have no idea how to do that tbh and I've fought trying everything I know how to do to get spleeter working but have repeatedly failed. I'll just focus on extracting and sorting out the clips for now since clipchan does that part well.

Robowaifu Technician 10/12/2020 (Mon) 09:29:21 No.5628

>>5626 Haha, wow that's pretty nice already. Great stuff Anon.

Robowaifu Technician 10/12/2020 (Mon) 09:49:46 No.5629

>>5626 >Google Colab Any chance of creating a mechanism to save pre-canned responses out locally? I mean audio files and some way to associate them with the input texts locally. It's one thing to use Google Colab intermittently as a generator for locally-stored content, it's another thing entirely to become wholly-dependent on G*ogle for our waifu's daily operations.

Robowaifu Technician 10/12/2020 (Mon) 10:45:26 No.5630

>>5627 If you're on a Debian-based distro you can check which versions of Python are available with apt-cache policy python3 and to downgrade aptitude install python3=3.7.3-1 or whatever 3.7 version is available in your distro. Just be careful it doesn't remove any packages and finds a resolution that downgrades packages as necessary. If that fails, Tensorflow 2 is compatible with Python 3.8 and I can try porting these old projects to Tensorflow 2. >>5629 Yeah, you can use the Levenshtein edit distance to find the closest match and play that pre-generated response. You could generate a whole library of words, phrases and sentences then stitch them together. If someone is really ambitious they could probably write some code for evaluating PyTorch models in OpenCL or simply port the code to the CPU. At the end of the day though if someone doesn't wanna be dependent on Google or Kaggle for compute they need to get a GPU.
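The edit-distance lookup is simple enough to sketch in plain Python. The library below (text mapped to a pre-generated audio file) is a made-up example and closest_response is a hypothetical helper; only the Levenshtein distance itself is the standard algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def closest_response(text: str, library: dict) -> str:
    """Return the stored audio file whose phrase is nearest to the input text."""
    key = min(library, key=lambda phrase: levenshtein(text.lower(), phrase.lower()))
    return library[key]


# Hypothetical library of pre-generated clips.
library = {
    "good morning": "good_morning.wav",
    "welcome home": "welcome_home.wav",
}
print(closest_response("welcom home!", library))
```

For a large library this linear scan gets slow, and a distance cutoff would be needed so that unrelated input doesn't trigger a random canned line.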

Robowaifu Technician 10/12/2020 (Mon) 12:49:41 No.5631

>>5630 >Just be careful it doesn't remove any packages and finds a resolution that downgrades packages as necessary. I have no idea how to do that. I did find a tool called downgrade and so I'll try to use that. Thanks.

Robowaifu Technician 10/12/2020 (Mon) 13:38:58 No.5633

>>5630 Great samples. Kek. >>5631 Be careful not to mess up your system by downgrading your regular Python or install all kinds of stuff: https://www.pythonforbeginners.com/basics/how-to-use-python-virtualenv
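Along the same lines, the standard library's venv module (swapped in here for the virtualenv tool from the link) can create an isolated environment without touching system packages. A minimal sketch, with make_env and the directory name as assumptions. Note that a venv reuses whichever interpreter creates it, so this isolates pip installs but does not by itself give you Python 3.7.

```python
import venv
from pathlib import Path


def make_env(path: str, with_pip: bool = True) -> Path:
    """Create a virtual environment and return the path to its config file."""
    # with_pip=True bootstraps pip inside the environment via ensurepip.
    venv.create(path, with_pip=with_pip)
    return Path(path) / "pyvenv.cfg"


# Usage (hypothetical directory name), then activate it from the shell with
#   source clipchan-env/bin/activate
# make_env("clipchan-env")
```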

Robowaifu Technician 10/13/2020 (Tue) 23:46:00 No.5648

Past couple days have been hell tracking down strange bugs and trying to get this to work flawlessly but it's working well now. --auto-clean normalizes, removes silence and resamples clips to prepare them for machine learning and has a success rate of about 98%. It gives warnings for which clips need further attention. Most of the time these lines aren't usable anyway since they contain overlapping audio or other strong background noise. Also added another tool for quickly captioning audio clips called filelist.py. It goes through all the wave files in a folder and plays them, prompting you for what the line should be. Hopefully it's pretty straightforward to use. You will need to install playsound to use it (pip install playsound). With that, Clipchan is pretty much done and ready for waifu datamining. Enjoy! https://gitlab.com/kokubunji/clipchan

Robowaifu Technician 10/14/2020 (Wed) 00:09:56 No.5649

>>5633 I see, thanks for the advice. >>5648 Thanks for all the hard work. Does --auto-clean rely on dependencies, or no?

Robowaifu Technician 10/14/2020 (Wed) 00:19:42 No.5651

>>5615 Ironically enough, I can install Tensorflow 2 just fine on my system. But even when I install TF 1.4, Spleeter refuses to recognize it, and I basically broke my system trying to downgrade to Python3.7 from 3.8 (thankfully I seemed to have recovered from that now). Even when I successfully installed TF1.4 on a RaspberryPi (Debian Buster-based, and already Python3.7), Spleeter still refused to recognize it and failed to install.

Robowaifu Technician 10/14/2020 (Wed) 01:52:07 No.5652

Here's as far as I've gotten to :

spleeter 1.4.0 requires museval==0.3.0, but you'll have museval 0.3.1 which is incompatible.
spleeter 1.4.0 requires pandas==0.25.1, but you'll have pandas 1.1.3 which is incompatible.
spleeter 1.4.0 requires tensorflow==1.14.0, but you'll have tensorflow 2.3.1 which is incompatible.

Robowaifu Technician 10/14/2020 (Wed) 10:37:07 No.5666

>>5651 >TF 1.14*

Robowaifu Technician 10/14/2020 (Wed) 14:29:28 No.5677

>>5649 It depends on Numpy and Scipy at the moment, but I'm gonna remove the Scipy dependency and use the standard library's wave instead. >>5651 Figures. Downgrading is always a nightmare. I'll see if I can port it to Tensorflow 2. Fortunately there's some code to automate translating projects.
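For what it's worth, normalization with only the standard library's wave and array modules is doable. The sketch below does simple peak normalization on 16-bit PCM; that choice, the 0.95 target and the function names are assumptions, since the thread doesn't say which method --auto-clean actually uses.

```python
import array
import sys
import wave


def normalize_pcm16(samples, peak: float = 0.95) -> array.array:
    """Scale 16-bit samples so the loudest one reaches peak * full scale."""
    loudest = max((abs(s) for s in samples), default=0)
    if loudest == 0:
        return array.array("h", samples)  # pure silence, nothing to scale
    scale = peak * 32767 / loudest
    return array.array(
        "h", (max(-32768, min(32767, round(s * scale))) for s in samples))


def normalize_wav(src: str, dst: str, peak: float = 0.95) -> None:
    """Read a 16-bit PCM WAV file and write a peak-normalized copy."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(reader.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    out = normalize_pcm16(samples, peak)
    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(out.tobytes())
```

Scaling all channels by one shared factor keeps the stereo balance intact, which is why the samples aren't split per channel here.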

Robowaifu Technician 10/14/2020 (Wed) 16:51:21 No.5682

>>5677 It seems Spleeter 2.0, which is compatible with Python 3.8, was released on PyPI just a few days ago (pip install spleeter). I created a separate branch for Spleeter 2.0 and Tensorflow 2.3.0:

cd clipchan
git checkout python3.8
pip install -r requirements.txt

I've tested that it's compatible with the Spleeter 1.4 pretrained models. It seems people have already ported Tacotron2 and WaveGlow to Tensorflow 2 so I'll work on creating a Python 3.8 branch for WaifuSynth too.

Robowaifu Technician 10/14/2020 (Wed) 23:18:29 No.5685

>>5682 Great news! I'll give it a shot tonight.

Robowaifu Technician 10/14/2020 (Wed) 23:21:45 No.5686

>>5682 BTW (OT) what bearing is the advice to use the '-m' flag with pip? As in pip -m install foobar I've seen that advice (and examples) often, but I don't think I understand what difference it makes yet.

Robowaifu Technician 10/14/2020 (Wed) 23:54:35 No.5687

>>5682 Had to first remove the leftover 1.4 spleeter egg file from /usr/lib/python3.8/site-packages/ > #1 Things seemed to be going well, then it errored out with this > #2

Robowaifu Technician 10/15/2020 (Thu) 00:04:02 No.5688

>>5687 Apparently, you can specify a version number (but guys recommend against this approach for some reason?) > #1 I'm not sure if this means everything went ok now with pip install spleeter > #2 but I'll push ahead with checking out the 3.8 branch of clipchan...

Robowaifu Technician 10/15/2020 (Thu) 00:14:06 No.5689

Don't forget to git fetch first before checkout. > #1 Minor typo in requirements.txt > #2 Seems OK so far now > #3 I'll give it a test in a while and see how --auto-clean goes. Any specific example command you'd care to give me so I don't flub the increasingly complicated heh :^) clipchan flags?

Robowaifu Technician 10/15/2020 (Thu) 00:49:05 No.5690

Here's the command I used: python ../clipchan.py 01.mkv --output-path clips/en/01 --style "Chobits Dialogue" -b 3 --spleeter --auto-clean --track 2 Which produced ~400 .wav files for me > #1 but errored out on me right at the beginning of the spleeter post-processing phase. > #2 And the log file seems empty? > #3

Robowaifu Technician 10/15/2020 (Thu) 00:52:24 No.5691

>>5690 Also, I'm curious why filelist.txt is being written in the working directory instead of the target directory. Intentional? Seems like it used to be written into the same directory as the output .m4a files IIRC.

Robowaifu Technician 10/15/2020 (Thu) 01:13:02 No.5692

>>5686
python -m pip
-m tells python to run a module, in this case pip, from that specific python installation.

>>5687
Specifying certain versions leads to unnecessary dependency hell. Unfortunately, dependency hell is the reality because all these different machine learning libraries require specific versions or they break.

>>5689
Thanks, fixed the typo. Most of the flags are there to give some flexibility and don't need to be changed. It should work great with the default settings.

>>5690
If the log is empty I assume Spleeter failed to start. What happens when you run Spleeter by itself?
python -m spleeter separate -i clips/en/01/*.wav -o /tmp -n filename

>>5691
You can change where filelist.txt is written to with --filelist-output-path / -l or name it to something else like ep1_filelist.txt with --filelist / -f
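On the -m point above, here is a minimal sketch of the same idea from inside Python, assuming you want pip tied to whichever interpreter is running the script. `pip_command` is a hypothetical helper name, not part of pip or Clipchan.

```python
import subprocess
import sys


def pip_command(*args):
    """Build a pip invocation bound to the interpreter running this script."""
    # sys.executable is the path of the current Python binary, so "-m pip"
    # resolves to the pip module installed for that interpreter.
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    # Prints pip's version and the site-packages location it manages.
    subprocess.run(pip_command("--version"), check=False)
```

Running this with python3.8 and then with another installed interpreter should report a different pip location each time, which is the difference the flag makes.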

Robowaifu Technician 10/15/2020 (Thu) 01:22:21 No.5693

Hmm, a surprise. I deleted all the output .wav files from the previous effort and decided to try again w/o the --spleeter flag python ../clipchan.py 01.mkv --output-path clips/en/01 --style "Chobits Dialogue" -b 3 --auto-clean --track 2 but had the same outcome? >

Robowaifu Technician 10/15/2020 (Thu) 01:26:58 No.5694

>>5692 > from that specific python installation. Hmm, not sure I really understand that (probably why I didn't get it before heh). But this is probably more apropos for our Python General thread? >>159 >Thanks, fixed the typo y/w. >You can change where filelist.txt is written to with Great, more flags to deal with haha. <pull request: default filelist.txt into the same output path :^)

Robowaifu Technician 10/15/2020 (Thu) 01:28:56 No.5695

>>5692 >What happens when you run Spleeter by itself? What should I pass as the filename? (sorry if this seems blatantly obvious to everyone else) >

Robowaifu Technician 10/15/2020 (Thu) 01:44:53 No.5696

>>5692 >>5695 When I dropped the unrecognized flag, spleeter unceremoniously crashes with no further output >

Robowaifu Technician 10/15/2020 (Thu) 02:29:08 No.5697

One other thing. I noticed a spleeter.sh file in the directory (no doubt left over from the 1.4 branch of clipchan). This bash script is explicitly for spleeter v1.4.0 it seems. Does there need to be a similar script for the newer spleeter v2? Just spitballing here tbh.

Robowaifu Technician 10/15/2020 (Thu) 03:53:25 No.5698

>>5694
You can have different versions of Python installed, each with their own pip module for managing packages. And sure, I'll make the output folder the default path.

>>5693
Double check your command. It's not possible to reach that part of the program unless --spleeter or --spleeter-model is given.

>>5695
That is a bug in Clipchan. It seems Spleeter changed its arguments and I didn't realize I was using 1.4 when I was testing because Python ignored the virtual environment. I will have to investigate how Spleeter 2.0 outputs its files.

>>5696
However, this is caused by Tensorflow being built with optimizations not supported by your CPU. Likely the only option here to work around this is to build Tensorflow 2.3.0 from source: https://www.tensorflow.org/install/source

>>5697
Pip simplifies the installation of Spleeter. I'm going to change the other one to use pip as well.

Robowaifu Technician 10/15/2020 (Thu) 04:42:27 No.5700

Alright, I'm trying to build Tensorflow from source r/n. I'll start over completely from scratch with Clipchan if that goes well. Hopefully, the new 3.8 copy of Clipchan will go well then.

Robowaifu Technician 10/15/2020 (Thu) 05:22:32 No.5701

Well, unsurprisingly that was a miserable failure. Building Tensorflow appears well above my paygrade. After a week of fighting with this with no success I'm about done with it. Question to Kokubunji: if I simply sort the Chii vocals out of the raw .wav clips and then package them up for you in a zip somewhere, can you take it from there and do all the rest? It's obvious at this point I can't get Clipchan's no doubt marvelous features to work correctly.

Robowaifu Technician 10/15/2020 (Thu) 08:12:42 No.5702

>>5701 Yeah, Clipchan will automate the rest. I'll see if I can find a Tensorflow 2.3.0 pip wheel without AVX or build one for people to use. Most of the machine learning libraries have dropped support for legacy CPUs in a similar way researchers have dropped support for CPUs. Now that I think about it, a lot of the papers require massive GPU clusters or 100's of TPUs and they don't release their models or code. We're already being left in the dust and if /robowaifu/ can't get AI to work, then what hope is there for everyone else?

Robowaifu Technician 10/15/2020 (Thu) 08:51:27 No.5704

>>5702 Thanks, I'll get on that and hopefully have it for you within the week. (I already got about 12 episodes done from before but have to redo with .wavs heh). My main issue is my lack of experience with python. I'm more interested in the underlying engines it's scripting that do the actual heavy-lifting. We'll need both skill areas, and we're doing OK in that regard IMO. Thanks for all the great innovation and also the hard work helping us get things working on lower-end hardware. You are much appreciated Anon. >We're already being left in the dust and if /robowaifu/ can't get AI to work, then what hope is there for everyone else? Ehh, we'll get things working, you obviously already have many things now. We're just on a limited budget as typical Anons vs. Big Tech has unlimited deep pockets. It was always going to be asymmetric for us. We just have to be clever about things. I'm not going to stop, I just have to recognize my current limits. We'll all learn more about things as we go along.

Robowaifu Technician 10/15/2020 (Thu) 10:48:40 No.5708

>>5702 > Dropped support for legacy CPUs Whhahaaaah, they do that?!? I've spent the last week or so thinking about building a server based on old Xeon CPUs and maybe also buying a Xeon Phi, as the external brain for my waifu... F... https://youtu.be/ZKkzEBtIoH8

Robowaifu Technician 10/15/2020 (Thu) 11:47:59 No.5711

>>5708 Yes, this is a common idea I think. It's a strong argument for us to succeed at figuring out a way to use old hardware effectively. There's faar more of that kind available the world over than the other. It's probably fundamentally less botnetted as well, also an important point.

Robowaifu Technician 10/15/2020 (Thu) 11:50:17 No.5712

Same anon as >>5708 Looked into it a bit, Pytorch for example seems to support CPUs via Intel Math Kernel Library, which seems to have no limitations on which CPU it works with, except optimization might not work on non Intel CPUs: https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html

Robowaifu Technician 10/15/2020 (Thu) 21:45:52 No.5718

>>5704
Yeah, getting this stuff to work needs to become as frictionless as possible for everyone to get involved. People don't have the time or energy for hitting their head on a wall for a week. There's about 1-2 years left before AI really takes off so it's definitely doable to get everything ready for people to use. I can imagine something like Transcoder translating PyTorch and Tensorflow code and models to mlpack in C++ which can build for any system, including embedded systems.

>>5708 >>5712
Xeon and Xeon Phi should be fine. The public builds of PyTorch and Tensorflow have required AVX and AVX2 since sometime around 2018. The devs have said multiple times it isn't supposed to but the instructions keep slipping into the builds and they don't do anything about it. Sometimes they do but then a minor version later it's fucked again. They've effectively cut off millions of people from using or learning how to use neural networks on common hardware. And just a few years ago PyTorch worked on 32-bit systems but they dropped support for 32-bit builds too. In a few months I'll definitely see if I can port Spleeter and WaifuSynth to mlpack. That would completely disentangle ourselves from Facebook and Google and be a huge step forward to keeping AI open.
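For anyone unsure whether their CPU is affected, here is a rough sketch that checks for the AVX and AVX2 flags mentioned above. It assumes a Linux system exposing /proc/cpuinfo; `cpu_flags` and `missing_avx` are hypothetical helper names, and a clean result here does not guarantee that a given Tensorflow or PyTorch wheel will run.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 kernels label the line "flags"; ARM kernels use "Features".
        if key.strip().lower() in ("flags", "features"):
            return set(value.split())
    return set()


def missing_avx(flags):
    """List which of the two AVX extensions are absent from the flag set."""
    return [name for name in ("avx", "avx2") if name not in flags]


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only
    flags = cpu_flags(path.read_text()) if path.exists() else set()
    missing = missing_avx(flags)
    print("missing:", ", ".join(missing) if missing else "none")
```

If either name is reported missing, the silent crash described in >>5696 is the likely outcome with the public wheels.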

Robowaifu Technician 10/15/2020 (Thu) 22:07:57 No.5719

>>5718 >Transcoder Is it this project Anon? > Unsupervised Translation of Programming Languages >A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.

Robowaifu Technician 10/15/2020 (Thu) 22:36:16 No.5721

>>5718 A week ago I wouldn't have recognized the term AVX, but yeah even ood Xeons have that, just not the newer versions of it, which would be much better. Still a bit concerning if old CPUs are getting ignored, especially for young students doing this on their own small budget. Still, it's amazing what hardware you can get for even 1k, and I figure the devs have to draw the line somewhere. Should have bought Bitcoin when you were in primary school or so, I guess.

Robowaifu Technician 10/15/2020 (Thu) 22:37:40 No.5722

>>5721 *old

Robowaifu Technician 10/15/2020 (Thu) 22:41:39 No.5723

>>5719 Yeah, that's it. The source code is available here: https://github.com/facebookresearch/TransCoder/ But this is getting off-topic from speech synthesis.

Robowaifu Technician 10/16/2020 (Fri) 03:12:16 No.5728

>>5718 >In a few months I'll definitely see if I can port Spleeter and WaifuSynth to mlpack. <Over 9'000!!! Haha not only are they cutting-edge AI library devs, they are shitposting memes within their own paper. Godspeed Anon.

Robowaifu Technician 10/16/2020 (Fri) 05:25:56 No.5733

>>5718 We're behind you, Anon! >>5730 >>5731 >>5732

Robowaifu Technician 10/16/2020 (Fri) 19:19:44 No.5736

Alright, I've tried my best to build a Tensorflow pip wheel without AVX and mostly succeeded but some AVX instructions are still slipping through inside pywrap, tf2xla and libtfkernel. On the Tensorflow community wheel page no one has succeeded yet in making an AVX-free build for 2.3.0 and I can't even begin to figure out where they forgot to include the optimization parameters in Google's 9 GB clusterfuck. So I've taken a look at the Spleeter model and it uses some algorithms not implemented in mlpack yet and the same is true for Tacotron2 and WaveGlow. I should be able to translate the functions but it's gonna be a lot of work, at least 6 months full-time. I'm gonna play around with mlpack more to see if it's worth the effort. The most off putting thing is the lack of good GPU support but they're working on that. There might even be other people willing to help since it would be a significant contribution to mlpack to have all these signal processing algorithms.

Robowaifu Technician 10/16/2020 (Fri) 21:16:53 No.5738

>>5736 Well, no one could possibly criticize your creativity and efforts thus far Anon. You've already done amazing work, and I'll support you whatever your decision. If you choose to simply wait until the mlpack codebase improves, then perhaps anons can simply do what I'm doing; namely just produce waifu vocal clips and then put them out there for those with more hardware to process. This will be a hit-or-miss approach IMO, but it preserves the status quo. It would be the choice of a sensible man, no doubt. If you choose to press forward with the hard choice, then I'd be happy to do any simple tasks I can to help take some of the load off you. Profiling the specific performance characteristics of particular commands/datasets on either my old Intel or ARM hardware, for example. Certainly the template generics approach the mlpack team has taken is both a tried-and-true one, and the run-time performance thus far seems to be smoking hot on this tiny arm7hf hardware currently at my disposal. The high-resolution clock from the standard C++ library seems to work just fine on this old hardware, afaict. If you can give me a list of tests to perform, I'll be happy to participate in that small way in your big effort Anon. Again, regardless, you've already made remarkable strides, and you deserve applause for that alone. :^)

Robowaifu Technician 10/17/2020 (Sat) 15:58:31 No.5740

>>5736
>The most off putting thing is the lack of good GPU support but they're working on that.
Interesting. Can you give us details on the efforts so far?

>There might even be other people willing to help since it would be a significant contribution to mlpack to have all these signal processing algorithms.
SYCL is a higher-level form of OpenCL, and entirely in standard C++. This implies some probability of a long-term, broad availability on a widely diverse set of hardware. If you've ever had experience with Microsoft's C++ AMP, you can think of this as a more sophisticated progression of that approach. https://sycl.tech/ CodePlay is taking a lead at heading up practical solutions to support this standard. The ISO C++ Standards Committee also has a study group SG14 (the GameDev & low latency ISO C++ working group), headed up by Michael Wong, the chairman of OpenMP. It seems to me this is probably the most likely path the committee will progress down towards direct, standard support for heterogeneous computing accelerators such as GPUs. Probably worth thinking about in the long-term though quite early to put all your eggs in that one basket just yet IMO.

>Porting our robowaifus to mlpack
This is likely to be a big, complex topic and probably deserves its own thread.

Robowaifu Technician 10/21/2020 (Wed) 07:41:30 No.5856

>>5698 Sorry, I don't know how I missed that filename Anon. That's exciting! So we're really going to have robowaifus then?

Robowaifu Technician 10/21/2020 (Wed) 08:47:53 No.5861

Devlog update. Almost ready to train the new WaifuSynth. Just need more voice data to train on and double check that the code is doing what it's supposed to do. For the first pretrained model I will be training it on 2B, Rikka, Megumin, Rem, Chii, Mahoro, Armitage and the LJSpeech dataset. I modified Tacotron2 to take both a text and style input that are combined together so the audio clips can be annotated with emotions. For now I'm just using letters of the alphabet to mark where there are changes in a speaker's tone. For example, if a character starts doing excited yelling I mark JY underneath the word, J for joy and Y for yelling. I'm not sure how well this will work, but I guess I'll find out once I try training tomorrow. Even a little bit of control over the generation process would be great. Tacotron2's encoder LSTM starts off from a blank slate so I created another LSTM to take in a character name as input and use its output for the initial hidden state of the encoder LSTM before it encodes the text and style input. This way different characters can be easily sampled and new characters can be added to the training set without having to recode and retrain the entire model. I made some other tweaks to the model as well by duplicating parts of the network into three and using my multiplication and addition trick (a*b+c) which I've found also speeds up language model training by orders of magnitude. Hopefully it works in Tacotron2 as well. I'm concerned it will push the model over my memory limit though so I will be limiting training sentences to 120 characters. If that's not enough I can cut the batch size down 6x or just turn it off completely in the model hyper parameters. I also found a function in PyTorch to help free up GPU memory by emptying the cache: torch.cuda.empty_cache() And to wrap up today I had some fun with the old WaifuSynth after seeing some anons lamenting about how they don't believe robowaifus will exist in their lifetimes.
I'm still kind of shocked no one on YouTube has made a serious effort yet to clone 2B's voice or any anime voices for that matter. Hopefully they'll see it and get a taste of the future to come. Also making this gave me an idea to pass the hidden state to the next sentence generation so the generation flows smoothly instead of resetting and sounding so robotic, but that will add another level of complexity separating the training set into sentence pairs. I'll leave that idea to the next version.
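A minimal sketch of the text-plus-style annotation idea from the devlog above, assuming the style marks sit on a second line directly under the characters they apply to. How WaifuSynth actually encodes this is not shown in the thread; `align_style`, `keep_for_training` and the space padding are hypothetical names and choices for illustration only.

```python
MAX_CHARS = 120  # sentence length cap mentioned in the devlog


def align_style(text, style, pad=" "):
    """Pair each text character with the style mark written beneath it."""
    if len(style) > len(text):
        raise ValueError("style line is longer than the text line")
    # Pad the style line so unmarked characters get a neutral symbol.
    padded = style.ljust(len(text), pad)
    return list(zip(text, padded))


def keep_for_training(text):
    """Apply the 120-character limit used to stay within GPU memory."""
    return len(text) <= MAX_CHARS


if __name__ == "__main__":
    text = "I did it!"
    style = "  JY"  # marks under the word "did": J for joy, Y for yelling
    for ch, mark in align_style(text, style):
        print(repr(ch), repr(mark))
```

Each (character, mark) pair could then be embedded separately and combined before the encoder, which is one plausible reading of "a text and style input that are combined together".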

Robowaifu Technician 10/21/2020 (Wed) 09:23:48 No.5863

>>5861 >Chii I'm roughly halfway now through sorting out the new .wav versions of Chii's vocals. Should I continue or do you already have them? I estimate it will take me another couple of weeks total to finish up, then go through them all and trim them in Audacity, then make all the many edits needed in each filelist.txt to make the English VA's actual words. BTW, these versions of the .wav files are full 16-bit stereo files, but IIRC you mentioned something on the repo about 22.5K mono files instead? Should these be downsampled first after I sort them?

Robowaifu Technician 10/21/2020 (Wed) 11:19:47 No.5865

>>5863 >to match the English VA's actual words.*

Robowaifu Technician 10/21/2020 (Wed) 11:27:38 No.5866

>>5863 I haven't done Chii yet. It will take several days at least to train a new model from scratch, starting with the LJSpeech dataset. You shouldn't need to trim them in Audacity though. It only takes a minute to align the subs in Aegisub. A little bit of noise within 200ms at the end and start of clips will be found and clipped out automatically. The clips shouldn't be resampled before going into Spleeter. They will get resampled into mono 22050 Hz automatically after running Spleeter from the Clipchan master branch (but not the 3.8 since it's behind). When I wake up I'll update filelist.py to automatically create the filelist from a subtitle file. That'll make things a lot easier for English VAs instead of tapping > and fixing the punctuation. I originally made it for subbing audio clips without subtitles when I was going through 2B's, but in practice I've found it's a lot easier to create subs in Aegisub for things like audio from a YouTube video and then run Clipchan on them.
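To illustrate what building a filelist from a subtitle file could look like, here is a rough sketch that reads Aegisub-style .ass text and keeps only the lines with a given style, like the --style flag used earlier in the thread. This is not Clipchan's filelist.py: the `path|text` row layout, the clip naming pattern and the function names are all assumptions, and it relies on the standard ASS event field order.

```python
import re

OVERRIDE_TAGS = re.compile(r"\{[^}]*\}")  # ASS inline override blocks like {\i1}


def dialogue_lines(ass_text, style):
    """Yield the spoken text of every Dialogue event with the given style."""
    for line in ass_text.splitlines():
        if not line.startswith("Dialogue:"):
            continue
        # Standard event order: Layer, Start, End, Style, Name, MarginL,
        # MarginR, MarginV, Effect, Text. Text may itself contain commas.
        fields = line[len("Dialogue:"):].split(",", 9)
        if len(fields) < 10 or fields[3].strip() != style:
            continue
        text = OVERRIDE_TAGS.sub("", fields[9]).replace("\\N", " ")
        yield " ".join(text.split())


def build_filelist(ass_text, style, clip_pattern="clips/{:04d}.wav"):
    """Return 'path|text' rows; both the pattern and the layout are assumed."""
    return [
        f"{clip_pattern.format(i)}|{text}"
        for i, text in enumerate(dialogue_lines(ass_text, style), start=1)
    ]
```

Since the subs often diverge from the dub, the generated text would still need a listen-through, but fixing a line is quicker than typing it from scratch.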

Robowaifu Technician 10/21/2020 (Wed) 11:52:08 No.5867

>>5866 > It only takes a minute to align the subs in Aegisub. Unfortunately I didn't do that ahead of time (like an idiot), and as I said I've already ripped the entire thing and I'm halfway through sorting them out (requires listening to every clip in order ofc--basically the time req'd to watch the entire episode). I can start the entire process over again for the fourth time haha if you think it would be more expedient. It would be really helpful if we had some sort of full tutorial video from you about the correct way to use Clipchan, start to finish, posted on bitchute or somewhere Anon. Regardless, I'll do what needs doing to get Chii's voice on her virtual waifu's avatar. Also, I wonder if the other characters a VA does for other animus might also be helpful in contributing to a full & capable trained model for a character?

Robowaifu Technician 10/21/2020 (Wed) 11:54:42 No.5868

>>5866 BTW, I'm the anon who's on the 3.8 branch...

Robowaifu Technician 10/22/2020 (Thu) 19:24:32 No.5907

>>5863 By the way why do you need to edit filelist.txt to make the English VA's actual words? It's already automatically generated by Clipchan. There are English subtitles for Chobits here: https://kitsunekko.net/dirlist.php?dir=subtitles%2FChobits%2F

Robowaifu Technician 10/22/2020 (Thu) 19:53:27 No.5913

>>5907 Simply b/c many (~ >50%) of the English subs in my source widely diverge from the actual English voice track for Chii. The longer the sentence(s), usually the worse it becomes.

Robowaifu Technician 10/22/2020 (Thu) 19:57:53 No.5914

>>5913 > (~ >50%) OK, that's probably an exaggeration, but it's certainly several hundreds of dialogue line examples in the whole series of 25 episodes.

Robowaifu Technician 10/22/2020 (Thu) 20:10:40 No.5915

>>5867
>Also, I wonder if the other characters a VA does for other animus might also be helpful in contributing to a full & capable trained model for a character?
Perhaps, some VAs change their voice acting quite a bit between characters. It should give the model more data to work with. A big issue with 2B's voice is there aren't enough voice clips to cover every possible word, but I'm hoping this multi-speaker version will learn to fill in the blanks.

>>5913
Oh, that sucks and makes sense. Once I finish the next version of WaifuSynth I'll see if I can extend it to speech recognition because that's gonna be a big hassle for people trying to train their own waifu voices.

Robowaifu Technician 10/22/2020 (Thu) 20:23:13 No.5916

>>5915 Yeah, the VA for Chii is rather prolific WARNING: don't look her up, it will ruin everything haha! :^) and I thought that since Chii isn't really too diverse in her vocabulary (part of the storyline arc ofc), that perhaps the statistical modeling effect of AI might benefit if I can find another character she did that wasn't too far from Chii's 'normal' voice. >multi-speaker fills in the blanks That's good news. Here's hoping. >auto voice recog That would truly make this into an amazing all-in-one toolset Anon.

Robowaifu Technician 10/22/2020 (Thu) 21:52:32 No.5917

>>5916 Anyway, for now don't worry about resampling the clips. They should be the highest quality available before going into Spleeter. In Aegisub you can load up the video or audio, align the subtitles, type it in the proper line, and hit enter to proceed to the next one. When Clipchan is complete I'll make a video explaining the whole process.

Robowaifu Technician 10/22/2020 (Thu) 22:16:38 No.5918

>>5917 OK, thanks for the explanation. Sounds like I need to start over with this. Not sure what my timeline will be, probably somewhere around the Trump win.

Robowaifu Technician 10/23/2020 (Fri) 09:46:52 No.5932

For some reason I thought I uploaded the 2B voice model for WaifuSynth already but I didn't. You can get it now here: https://anonfiles.com/Hbe661i3p0/2b_v1_pt

Robowaifu Technician 10/23/2020 (Fri) 10:20:54 No.5934

>>5932 >2B CATS remake wehn? /robowaifu/ for great justice. This needs to happen.

Robowaifu Technician 10/23/2020 (Fri) 12:33:41 No.5945

>>5917 Welp, I just wasted an entire day trying to get Aegisub up and running with no success. Just as an offhand guess, I'm supposing you're not running it on Linux (but rather on W*ndows)?

Robowaifu Technician 10/23/2020 (Fri) 12:49:53 No.5947

>>5945 I quit using Windows over a decade ago. What problem are you having with it?

Robowaifu Technician 10/23/2020 (Fri) 13:43:14 No.5949

>>5945
It could be either an issue with FFMS:
>After upgrading my Linux distro, i ran Aegisub and got this error
>aegisub-3.2: symbol lookup error: aegisub-3.2: undefined symbol: FFMS_DoIndexing
>So i had to downgrade from ffms2 2.40 package to ffms2 2.23.1
https://github.com/Aegisub/Aegisub/issues/198

Or Wayland, Aegisub requires x11:
>GDK_BACKEND=x11 aegisub does not crash.
https://github.com/Aegisub/Aegisub/issues/180
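For the Wayland case, the GDK_BACKEND=x11 workaround quoted above can be wrapped in a small launcher. This is only a sketch: it assumes the executable is called `aegisub` and is on PATH (the quoted error shows `aegisub-3.2` on at least one distro), and `x11_env` is a hypothetical helper name.

```python
import os
import shutil
import subprocess


def x11_env(base=None):
    """Copy an environment and force GTK to use its X11 backend."""
    env = dict(os.environ if base is None else base)
    env["GDK_BACKEND"] = "x11"
    return env


if __name__ == "__main__":
    exe = shutil.which("aegisub")  # assumed executable name
    if exe is None:
        print("aegisub not found on PATH")
    else:
        subprocess.run([exe], env=x11_env(), check=False)
```

It does nothing for the FFMS symbol error, which needs the ffms2 package downgrade described in the first issue.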

Robowaifu Technician 10/23/2020 (Fri) 14:19:26 No.5952

>>5949 >>aegisub-3.2: symbol lookup error: aegisub-3.2: undefined symbol: FFMS_DoIndexing That was exactly the error from the distro package manager install that started me down this long bunny trail. I never found that issue link in my searches. :/ Anyways, I went and downloaded the repo and tried to build from source, but then discovered I had to have wxWidgets as well, so I had to back out then build that from source (dev version took hours to finish, but at least it succeeded in the end). Afterwards, the Aegisub build failed with 'references not found' type errors. Too many to remember and I ~~tableflipped.exe~~closed the terminal after all those hours in disgust so I can't recall exactly. Anyway thanks for the links. I'll try it again tomorrow.

Robowaifu Technician 11/22/2020 (Sun) 14:55:27 No.6997

>>5932 One thing I'm not perfectly clear on Anon, can WaifuSynth be used for other languages? For example, since animu is basically a Japanese art-form, can your system be used to create Japanese-speaking robowaifus? If so, would you mind explaining how any of us would go about setting something like that up please?

Robowaifu Enthusiast 11/22/2020 (Sun) 16:16:44 No.6998

>>5861 OMFG anon this is awesome! Crafting 2B's perfect ass out of silicone will be challenging but this is all the motivation I need!

Robowaifu Technician 12/05/2020 (Sat) 12:51:05 No.7560

>>5648 Anon what happened to your gitlab profile? It is deleted, can you post your new one?

Robowaifu Technician 12/05/2020 (Sat) 21:34:21 No.7576

>>7560 Anyone downloaded this, at least for archiving reasons? This is also gone: https://anonfiles.com/Hbe661i3p0/2b_v1_pt from here >>5932

Robowaifu Technician 12/06/2020 (Sun) 03:38:06 No.7592

For singing, https://dreamtonics.com/en/synthesizerv/ has a free Eleanor, which is pretty fantastic. As with all vocaloid -type software, you have to git gud at phonemes.

Robowaifu Technician 12/06/2020 (Sun) 09:00:14 No.7594

>>7576 It is possible that he deleted all the contents and left robowaifu after the latest drama. He might be the anon who was involved in the latest one. If that is the case it's pretty unfortunate. I hope he recovers soon and comes back.

Robowaifu Technician 12/06/2020 (Sun) 10:01:49 No.7595

>>7594 Possible, but I hope this isn't it. Kind of radical. I tried to explain to him as reasonably as possible what the problem was. Whatever, I don't wanna get into that again. The more important point is: I think he gave us enough hints how to do this stuff. I'm not claiming that I could reproduce this clipchan program, but I had the same idea before I read it here. It's kind of obvious to take subtitles to harvest voices. Which means, there will be other implementations on the net doing that and explaining how to. We don't need someone to come to us or to be into anime or robowaifus, just take some other implementation from another place or have someone reproduce it based on the knowledge available.

Robowaifu Technician 12/07/2020 (Mon) 04:27:13 No.7610

>>7594 What drama happened besides the migration? I've been too deep in my projects to browse like I used to.

Robowaifu Technician 01/19/2021 (Tue) 18:50:53 No.8150

I'll post this here for now, since it's definitely relevant. I was experimenting a little bit more with Deltavox RS and Audacity. It seems that there is no "one size fits all" solution when using Deltavox. In order to get a decent result, you have to experiment with different spellings, phonemes, energy, F0, bidirectional padding, and so on. In Audacity, I used a simple filter curve. I was able to get noticeably less tinny audio, which sounds less computer generated. I'm going to explore more options for editing the audio after it's been synthesized to improve its quality. I'll post again if I find anything interesting. I'll repost the links here since they're still relevant: Deltavox User Guide https://docs.google.com/document/d/1z9V4cDvatcA0gYcDacL5Bg-9nwdyV1vD5nsByL_a1wk/edit Download: https://mega.nz/file/CMBkzTpb#LDjrwHbK0YiKTz0YllofVuWg-De9wrmzXVwIn0EBiII

Robowaifu Technician 01/19/2021 (Tue) 19:16:55 No.8151

>>8150 Thanks. BTW, do you know if this is open source? Since QT dlls are included I presume this is C++ software. If both are true, then it's very likely I can rewrite this to be portable across platforms -- not just (((Wangblows))) and we can be running it on our RaspberryPis & other potatos. Thanks for all the great information Anon.

Robowaifu Technician 01/25/2021 (Mon) 00:59:08 No.8244

Some singing with Synth V: https://youtu.be/ex3HRrxEcLE and https://youtu.be/e-6dE6tYyWA

Robowaifu Technician 01/25/2021 (Mon) 01:08:36 No.8245

>>8244 And one based on Vocaloid: https://youtu.be/OPBba9ScdjU

Robowaifu Technician 02/02/2021 (Tue) 20:19:51 No.8423

So 20 years or so ago I got an IMac. Played with the text-to-speech and voice control systems that came with it. I figured out quick the T2S was not very good. However if the text was misspelled intentionally to emphasize vowels like "mooz end skweer ahl" I could cobble together a bad accent. Then I edited the common error code message base to reflect the accent (Airor koad noombar... etc.) then it would read the number. Then added a custom alert sound of a Russian lady cursing. So then when the computer had an error it would curse in a foreign language then carefully sound out what the error was in a kind of "ESL". This was a shitbox computer and it was frustrating to work with. I realized the voice control was only associating a sound to a command so instead of a voice cue I used the sound of smacking it on the left side (for stress relief). Now an IMac is built like a bongo drum so there are many ways to hit the thing and get different noises. Flicking the front acted like alt tab rotating through windows, gentle pat to the top put it into sleep mode. Right? Then voice commands like "fuck that!" (to close windows), "fuck all this" (close all windows), "more" for full screen, "shut up" for mute and "earmuffs!" to stop voice recognition and a few more. The cache would bog down constantly and have to be purged. I made this a macro tied into the phrase "go clean yourself up!". Now combine all this together with furious Jolt cola fueled work flow. The look on my roommate's face when they watched me smack it upside the head and it curses at me (acknowledgment sound), I flick it a few times cycling through windows, say "fuck that" closing them out. Another smack (more cursing) search "for stuff" (slows to a crawl because of the cache). Get frustrated. Smack it again (more curses) "fuck all this" (all windows close), Smack (more curses) "go clean yourself up!" (runs through an auto purge and defrag). Get up and say I'm done for now.
night night and pat its head (gentle thumping sound initiating sleep mode). I basically turned an IMac into an abused angry mail order bride by default of tinkering.

Robowaifu Technician 02/03/2021 (Wed) 00:46:43 No.8429

>>8423 Funny. I can just imagine your flatmate's faces.

Robowaifu Technician 02/03/2021 (Wed) 04:41:22 No.8434

>>8423 Wow, this is a great anecdote. You already took some steps towards having your own robowaifu back then.

Robowaifu Technician 03/21/2021 (Sun) 21:37:09 No.9092

>>1239 >>1240 Not a local, but I'm wondering if a current tool like MycroftAI (the virtual assistant) can currently pipe its output text through the fifteenAI API to make a character voice. I haven't used fifteenai or Mycroft yet, but I suspect you could make a half-decent Twilight home assistant now with a RaspPi and a plushie.

Robowaifu Technician 03/21/2021 (Sun) 22:10:02 No.9093

>>9092 >but I suspect you could make a half-decent Twilight home assistant now with a RaspPi and a plushie. I suspect you can now, yes. And with the further info here on /robowaifu/ she could even move at least a bit. Just search around in the Speech Synthesis general bread >>199 and you could get some ideas.

Robowaifu Technician 03/24/2021 (Wed) 01:57:40 No.9110

Is the voice synthesis going to be for English voices or Japanese voices? Or does one also work for the other? It would be pretty awesome if one could take voice samples of their favorite VA, vtuber, etc, put it through a voice synth A.I., and give their robowaifu that voice.

Robowaifu Technician 03/24/2021 (Wed) 02:18:02 No.9111

>>9110 >It would be pretty awesome if one could take voice samples of their favorite VA, vtuber, etc, put it through a voice synth A.I., and give their robowaifu that her voice. More or less, that has already been achieved Anon. So hopefully more options along that line will soon become readily available for us all.

Robowaifu Technician 03/24/2021 (Wed) 02:23:21 No.9112

>>9111 Oh wow it already has? Where can I read about it/try it if possible?

Robowaifu Technician 03/24/2021 (Wed) 10:10:27 No.9116

>>9112 Our guy here called his WaifuSynth. ITT there are examples from the ponys, who have taken a highly-methodical approach for all the main characters in MLP:FiM cartoon show.

Robowaifu Technician 03/24/2021 (Wed) 11:36:38 No.9118

I see. Though all the synths seem to be for English voices. I'm guessing the 2B, Rikka, Megumin, Rem, etc mentioned in >>5861 are referring to their English VA rather than the Japanese ones. Unless I'm missing out on something? (If so, then maybe it'd be best for me to make some time and read this whole thread.)

Robowaifu Technician 03/24/2021 (Wed) 12:29:20 No.9119

>>9118 AFAICT, the training approach is just a statistical system matching sounds to words based on examples. It should work for any human language I think -- though you would need to be fluent in the target language to QA the results ofc.

Robowaifu Technician 03/24/2021 (Wed) 12:42:06 No.9120

>>9119 Ohh, I see. One last thing: I wouldn't be wrong to assume that, since the dropping of "kokubunji", there is no one working on the voice for robowaifu?

Robowaifu Technician 03/24/2021 (Wed) 13:36:23 No.9121

>>9112 WaifuSynth: https://gitlab.com/robowaifudev/waifusynth Clipchan: https://gitlab.com/robowaifudev/clipchan There are better methods now like FastPitch and HiFiSinger. FastPitch is about 40x faster than Tacotron2/Waveglow (what WaifuSynth uses) and is less susceptible to generation errors but is still far from perfect. HiFiSinger uses three different GANs to make realistic speech and singing, and its 48kHz model outperforms the 24kHz ground truth but it still has room for improvement in 48kHz, although I suspect it could be near perfect by training a 96kHz model. FastPitch: https://fastpitch.github.io/ HiFiSinger: https://speechresearch.github.io/hifisinger/ There's still a lot of research to be done before this stuff will be ready for production, namely imitating voices without training a new model, emotion/speech style control, and ironing out undesired realistic noises in generation. Probably in the next 2-3 years it will be easy to make any character sing given a song or read out any given text, and you won't have to go through the whole hassle of collecting audio clips and training models yourself. >>9118 Making Japanese VAs speak English and English VAs speak Japanese should be possible but you will have to train a model that converts the input to phonemes, otherwise it will just garble and misread text. Training Tacotron2 takes a really long time so I'd recommend modifying FastPitch to use phonemes instead of characters. All you have to do is instead of inputting characters like 's a m u r a i', you input the IPA 's a m ɯ ɾ a i'. You can probably save a lot of time on training by initializing the embeddings of the IPA symbols to the character embeddings of a pretrained model, then train it on LJSpeech or another dataset until it sounds good, then fine-tune it on the desired character. 
This paper reports that it only takes 20 minutes of audio to speak a new language using an IPA Tacotron2 model but they don't provide their trained model or code: https://arxiv.org/abs/2011.06392v1
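The embedding initialization suggested above can be sketched roughly as follows. This is a minimal illustration in plain Python, not FastPitch code: the `char_emb` table, the `ipa_to_char` mapping and the function name are all hypothetical, and a real model would copy rows of its embedding matrix instead of Python lists.

```python
import random

def init_ipa_embeddings(char_emb, ipa_to_char, dim, seed=0):
    """Build an IPA embedding table from a pretrained character table.

    Each IPA symbol starts from the vector of the character it most
    resembles; symbols with no counterpart get a small random vector.
    """
    rng = random.Random(seed)
    ipa_emb = {}
    for ipa_symbol, nearest_char in ipa_to_char.items():
        if nearest_char in char_emb:
            # Copy so later training does not alias the original vector.
            ipa_emb[ipa_symbol] = list(char_emb[nearest_char])
        else:
            ipa_emb[ipa_symbol] = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    return ipa_emb

# Toy 2-dimensional "pretrained" character embeddings (made-up numbers).
char_emb = {"s": [0.1, 0.2], "a": [0.3, 0.4], "u": [0.5, 0.6], "r": [0.7, 0.8]}
# Hypothetical hand-written mapping; None means "no close character".
ipa_to_char = {"s": "s", "a": "a", "ɯ": "u", "ɾ": "r", "ɕ": None}
ipa_emb = init_ipa_embeddings(char_emb, ipa_to_char, dim=2)
```

After this step the model would still need the training on LJSpeech (or similar) and the character fine-tune described in the post; the copy only gives it a warmer start.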

Robowaifu Technician 03/24/2021 (Wed) 13:43:42 No.9122

>>9121 Also, you can use https://ichi.moe/ to convert Japanese subtitles from https://kitsunekko.net/dirlist.php?dir=subtitles%2Fjapanese%2F into romaji and then convert the romaji to IPA. Japanese IPA is straightforward since the syllables sound exactly the same as they are written, unlike English: https://en.wikipedia.org/wiki/Help:IPA/Japanese
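A rough sketch of the romaji to IPA step, assuming Hepburn-style romaji as input. The table below is a hypothetical, deliberately tiny mapping that ignores long vowels, doubled consonants and the moraic n; check it against the Wikipedia chart linked above before relying on it.

```python
# Only the romaji spellings that differ from their IPA symbol are listed;
# everything else passes through unchanged.
ROMAJI_TO_IPA = {
    "sh": "ɕ", "ch": "tɕ", "ts": "ts", "j": "dʑ",
    "f": "ɸ", "r": "ɾ", "u": "ɯ", "y": "j",
}

def romaji_to_ipa(word):
    """Greedy left-to-right conversion, trying two-letter spellings first."""
    symbols = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ROMAJI_TO_IPA:
            symbols.append(ROMAJI_TO_IPA[pair])
            i += 2
        else:
            ch = word[i]
            symbols.append(ROMAJI_TO_IPA.get(ch, ch))
            i += 1
    return symbols

print(" ".join(romaji_to_ipa("samurai")))  # s a m ɯ ɾ a i
```

The space-separated output matches the input format from >>9121, so it could be fed to a phoneme-based model as is.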

Robowaifu Technician 03/24/2021 (Wed) 14:04:45 No.9123

>>9121 >>9122 !! Thanks so much Anon!

Robowaifu Technician 03/24/2021 (Wed) 15:50:44 No.9125

>>9121 Nice, thanks! I have a 'new' machine (well, old but still much better than my old notebook) pieced together that has an i3 and an Nvidia GT430 (or possibly a 750ti). Not too impressive I know, but I could use it to take a shot at setting up clipchan again. Mind giving me specific set up advice Anon? Like the OS to use, Python version to use, etc., etc. The more specific, the better. TIA.

Robowaifu Technician 03/25/2021 (Thu) 00:45:18 No.9135

>>9125 2 GB might not be enough to train FastPitch but you might squeeze by with gradient checkpointing and gradient accumulation to reduce memory usage. A 1 GB card will certainly be too little since the model parameters are 512MB and you need at least twice that to also store the gradient. If it doesn't work you could shrink the parameters down and train a lower quality model from scratch. However, it seems the GT430 has CUDA compute capability 2.1 and the 750Ti has 5.0. Support for 2.x was removed in CUDA 9 and for 3.0 in CUDA 11, so the GT430 would need an old toolkit. If you're lucky you might be able to get it to still work by compiling PyTorch and Tensorflow with the CUDA version you need, but I wouldn't bet on it. I'd recommend using at least Python 3.7 and Tensorflow==2.3.0 since Spleeter requires that specific version. If someone has a newer GPU with at least 6 GB they'll have to download a Tensorflow 2.3.0 wheel with CUDA 11.1 because Tensorflow only supported 10.1 until version 2.4. A Tensorflow 2.3.0 + CUDA 11.1 wheel for Python 3.8 is available here: https://github.com/davidenunes/tensorflow-wheels Again this is only necessary if you have a newer GPU with at least 6 GB. Spleeter will run fine on the CPU. I use Python 3.7 and don't feel like compiling a Tensorflow 2.3.0 wheel for it so I just modified Spleeter's requirements.txt to support tensorflow==2.4.0 and numpy>1.16.0,numpy<=1.19.5 and installed it from source. Spleeter will still work and output the clean voice clips but crash after finishing. This error can be patched by commenting out the del function in spleeter/separator.py:135 since Tensorflow 2.4 closes the session automatically. I'm using PyTorch 1.8.0 with CUDA 11.1 since it supports the full capabilities of my GPU. To use either PyTorch or Tensorflow easily you'll need at least a 4th generation i3 which has AVX2 support. Otherwise you'll have to look for community pip wheels compiled without AVX/AVX2 for your specific version of Python. 
Older versions of PyTorch are compatible with most deep learning models but lack the newer torchaudio which is an amazing library for processing and training on audio that will certainly start seeing some use soon. Lastly, Clipchan needs to be updated to support the latest Spleeter version which completely changed everything again, but that'll be an easy fix. I'll patch it tonight.
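The memory argument above (512MB of parameters, at least twice that once gradients are stored) works out as below. This is only a lower-bound sketch: the helper name is made up, activations are ignored, and the two extra copies for the optimizer assume an Adam-style optimizer, which the post does not specify.

```python
def training_memory_mb(param_mb, optimizer_copies=0):
    """Lower bound on training memory: weights + gradients + optimizer state.

    optimizer_copies is the number of extra parameter-sized buffers the
    optimizer keeps (assumed 2 for Adam-style optimizers). Activations,
    which usually dominate, are not counted.
    """
    return param_mb * (2 + optimizer_copies)

print(training_memory_mb(512))      # 1024 -> already fills a 1 GB card
print(training_memory_mb(512, 2))   # 2048 -> a 2 GB card has nothing left for activations
```

This is why gradient checkpointing and gradient accumulation come up: they shrink the activation part, which is the only part left to trade away.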

Robowaifu Technician 03/25/2021 (Thu) 01:09:39 No.9136

>>9135 Great, thanks for the specific details. Hmmm, from what you're saying it sounds like I still won't be able to run it even with the better (by comparison) hardware. I'm sure someday we'll be able to do this stuff on much more modest hardware. I'll just be patient and focus on other things till then. :^) >Lastly, Clipchan needs to be updated to support the latest Spleeter version which completely changed everything again, but that'll be an easy fix. I'll patch it tonight. Don't rush on my account Anon. Maybe it will help others though.

Robowaifu Technician 03/25/2021 (Thu) 06:31:45 No.9149

Due to the 'rona my workspace is limited to a shitty laptop. What's the best (or rather, least worst) model one could conceivably run on a CPU?

Robowaifu Technician 03/25/2021 (Thu) 08:17:39 No.9150

>>9136 It works fine on better hardware. The problem is backwards compatibility is a foreign concept to Tensorflow so people end up locking their projects to old versions and creating a hellish nightmare of dependency conflicts. Also short clips don't take up too much memory. Only when processing 10 minute songs does it use up to 6 GB. To avoid this Clipchan processes each clip individually. And Clipchan has been updated now to v0.3. I had to fix it anyway to help someone get voice clips. It's essentially finished, besides making it simpler to use and ideally creating a GUI for it. The most important options are -dza which cleans audio with Spleeter, speeds up subtitle processing, and auto-crops the audio clips. For Tacotron2 -r 22050 -c 1 are also needed to resample and mix stereo to mono (they require the -a option to have any effect right now.) If you don't have FFmpeg with libzvbi, then omit the -z option. And some fresh Natsumi Moe voice clips from v0.3 ready for Tacotron2: https://files.catbox.moe/ipt13l.xz Still a work in progress but there's about 10 minutes of usable audio there. >>9149 Not sure, you won't get much of a speed up running FastPitch on CPU compared to Tacotron2. It's possible for fine-tuned models to be pruned and compressed down so they can run on mobile devices, but I'm not aware of anyone who has taken the time to do that. Pruning and compressing doesn't apply to training though, only works with inference.
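For anyone preparing clips by hand, the resample-to-22050 Hz and mix-to-mono step can also be done with plain FFmpeg. The sketch below only builds the command line; it uses FFmpeg's standard `-ar` (sample rate) and `-ac` (channel count) options and makes no claim about how Clipchan does it internally. The function name and file names are hypothetical.

```python
def ffmpeg_resample_cmd(src, dst, rate=22050, channels=1):
    """Return an FFmpeg command that resamples src and mixes it down."""
    return [
        "ffmpeg", "-y",        # overwrite dst without asking
        "-i", src,
        "-ar", str(rate),      # output sample rate in Hz
        "-ac", str(channels),  # output channel count (1 = mono)
        dst,
    ]

cmd = ffmpeg_resample_cmd("clip_stereo_48k.wav", "clip_mono_22k.wav")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```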

Robowaifu Technician 03/25/2021 (Thu) 09:02:19 No.9152

>>9150 Thanks for the Natsumi Moe clips Anon! A cute. I hope someday we can manage a basic Chii library to generate voices from. Sounds like certain interests are specifically trying to keep their tools from working with older systems -- even their own systems haha. Doesn't sound like I (and thousands more like me) will ever be able to use this tool at that rate. Maybe if someone creates some Docker or other kind of containers that are tuned for different hardware setups then we might be able to break free of this intentionally-created ratrace they intend us to run.

Robowaifu Technician 03/25/2021 (Thu) 16:52:30 No.9162

>>9121 >Cloning into 'fastpitch.github.io'... >fatal: repository 'https://fastpitch.github.io/' not found >Cloning into 'hifisinger'... >fatal: repository 'https://speechresearch.github.io/hifisinger/' not found Am I starting to get paranoid or are they really onto us?

Robowaifu Technician 03/25/2021 (Thu) 17:30:57 No.9163

>>9162 Those are demo pages, not repositories. FastPitch: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch HiFiSinger (unofficial implementation): https://github.com/CODEJIN/HiFiSinger

Robowaifu Technician 03/25/2021 (Thu) 17:32:50 No.9164

>>9162 Those aren't git repositories Anon. Browse there and read the pages.

Robowaifu Technician 03/25/2021 (Thu) 17:37:44 No.9165

>>9163 >FastPitch: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch Hmm, the only way I can actually get a clone to work is by going up in the tree a bit? git clone --recursive https://github.com/NVIDIA/DeepLearningExamples.git

Robowaifu Technician 03/25/2021 (Thu) 21:21:36 No.9179

>>9165 Git 2.25.0 includes a new experimental sparse-checkout command:

git clone --filter=blob:none --sparse https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
git sparse-checkout init --cone
git sparse-checkout set PyTorch/SpeechSynthesis/FastPitch

Robowaifu Technician 03/25/2021 (Thu) 21:31:46 No.9180

>>9150 Yeah, I'm not going to do any training, just inference from released checkpoints. I did manage to get a FastSpeech2 model running with some pretty good results, although for some reason it adds garbled echoes after the generated speech.

Robowaifu Technician 03/25/2021 (Thu) 22:37:25 No.9182

>>9179 Ahh, didn't know about that one, thanks Anon.

Robowaifu Technician 04/26/2021 (Mon) 05:28:17 No.10159

A novel voice converter that outperforms FastSpeech2 and generates speech faster. Although it doesn't do speech synthesis from text it introduced a convolution-augmented Transformer that could easily be adapted into FastSpeech2 and FastPitch to improve the quality of synthesized speech. https://kan-bayashi.github.io/NonARSeq2SeqVC/

Robowaifu Technician 04/26/2021 (Mon) 10:09:48 No.10165

>>10159 Quality sounds excellent. Thanks Anon.

Robowaifu Technician 05/12/2021 (Wed) 09:25:27 No.10386

>>9121 >>10383 >robowaifudev

Robowaifu Technician 05/12/2021 (Wed) 14:37:53 No.10393

Facebook made a great speech generator, circa a year ago: https://ai.facebook.com/blog/a-highly-efficient-real-time-text-to-speech-system-deployed-on-cpus/ - It's not free software, but they described how it is built. Yannic Kilcher goes through the system and explains it here: https://www.youtube.com/watch?v=XvDzZwoQFcU One interesting feature is that it runs on a CPU with 4 cores (not the training of course). On such a CPU it is faster than real-time, which means generating the audio takes less time than playing it back. Something like this might be very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu.

Robowaifu Technician 05/13/2021 (Thu) 01:52:02 No.10406

>>10393 >Something like this might be something very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu. It certainly would, if we can somehow obtain access to it or reproduce it, Anon. Thanks for the heads-up, and for the video link. It really helps to get the points across well for us. youtube-dl --write-description --write-auto-sub --sub-lang="en" https://www.youtube.com/watch?v=XvDzZwoQFcU

Robowaifu Technician 05/19/2021 (Wed) 00:37:54 No.10458

not sure if this has been posted before, but I came across this and immediately thought of some of the todo list for clipchan. https://speechbrain.github.io/index.html seems like there was some discussion about emotion and speaker ID classifiers.

Robowaifu Technician 05/19/2021 (Wed) 03:27:15 No.10460

>>10458 Very cool Anon, thanks. It looks like it's a solid and open source system too, AFAICT.

Em Elle E 05/20/2021 (Thu) 01:39:36 No.10499

The model link is dead, while I can train a new model I am looking to avoid that step right now because of other deadlines, though I would love to include 2B in WaifuEngine, would anyone be willing to mirror or provide an updated link? Thanks

Robowaifu Technician 05/20/2021 (Thu) 03:24:46 No.10504

>>10499 ATTENTION ROBOWAIFUDEV I'm pretty sure the model in question is your pre-trained one for 2B's WaifuSynth voice, ie, https://anonfiles.com/Hbe661i3p0/2b_v1_pt >via https://gitlab.com/robowaifudev/waifusynth cf. (>>10498, >>10502)

Em Elle E 05/20/2021 (Thu) 03:32:00 No.10506

>>10504 Links are both dead

Em Elle E 05/20/2021 (Thu) 03:32:50 No.10507

>>10504 To clarify, the pretrained model links are both dead; the repo is still up.

Robowaifu Technician 06/09/2021 (Wed) 10:24:18 No.10882

Great! Now my waifu can sing a lullaby for me to sleep well. The only problem is that I don't have the Vocaloid editor. Video demonstration: https://youtu.be/mxqcCDOzUpk Github: https://github.com/vanstorm9/AI-Vocaloid-Kit-V2

Robowaifu Technician 06/22/2021 (Tue) 17:01:45 No.11006

>>1246 BTW Anon, just in case you're not RobowaifuDev, I wanted to let you know that he actually did it (your idea that is, >>9121). Just in case you missed it.

Electronic Chronicler 07/06/2021 (Tue) 21:13:09 No.11229

>>5521 >Cute robowaifu Check >Inspiring message to all weebineers everywhere Check >Epic music Check Best propaganda campaign. 10/10, would build robowaifu. >>5529 >>5530 Damn it lads! You're bringing me closer to starting sampling Lime's VA heheheh (Although I was hoping to use my voice to generate a somewhat convincing robowaifu, so as to minimise reliance on females).

Electronic Chronicler 07/06/2021 (Tue) 21:15:56 No.11230

>>11229 Forgot to add. >>5532 >I don't know if that'll be enough. Chii didn't really talk much. You're overcomplicating it. I think he meant create a tts that outputs "Chii" regardless of what you put in ;) (Although you could add different tonality and accents, might be a more fun challenge).

Robowaifu Technician 07/13/2021 (Tue) 13:23:26 No.11474

>>10504 Sorry, been busy and haven't been active here lately. Updated the repo link: https://www.mediafire.com/file/vjz09k062m02qpi/2b_v1.pt/file This model could be improved by training it without the pain sound effects. There's so many of them it biased the model which causes strange results sometimes when sentences start with A or H.

Robowaifu Technician 07/13/2021 (Tue) 22:50:15 No.11478

>>11474 Thanks! Wonderful to see you, I hope all your endeavors are going well Anon.

Em Elle E 07/13/2021 (Tue) 23:17:16 No.11480

>>11474 come join my doxcord server if you have time and pm me! thanks for the model, you will likely see it used on the 2B "cosplay" waifu, we may have in the game

Robowaifu Technician 07/16/2021 (Fri) 03:54:35 No.11522

>>11480 The link is expired. What would you like to talk about? I don't have a lot to add. You can do some pretty interesting stuff with voice synthesis by adding other embeddings to the input embedding, such as for the character in a multi-character model, emphasis, emotion, pitch, speed, and ambiance (to utilize training samples with background noise.) This is what Replica Studios has been doing: https://replicastudios.com/
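A toy sketch of "adding other embeddings to the input embedding": every conditioning signal is a vector of the same size as the token embedding, and they are simply summed element-wise. The vectors and names here are made up for illustration; in a real model they would be learned lookup tables.

```python
def add_embeddings(*vectors):
    """Element-wise sum of equally sized embedding vectors."""
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(values) for values in zip(*vectors)]

token_emb = [0.5, 0.25, 1.0]     # embedding of one input symbol
speaker_emb = [0.25, 0.5, 0.0]   # which character is speaking
emotion_emb = [0.0, 0.25, 0.5]   # e.g. "happy"

conditioned = add_embeddings(token_emb, speaker_emb, emotion_emb)
print(conditioned)  # [0.75, 1.0, 1.5]
```

The same sum is applied to every symbol in the sentence, so swapping `speaker_emb` or `emotion_emb` changes the voice or style without touching the text.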

Em Elle E 07/17/2021 (Sat) 05:29:16 No.11536

>>11522 If you are interested, I am looking for someone to take over the speech synthesis part of WaifuEngine, I got it working however, to work on it as a specialty takes me away from the rest of the application, like I want to train a new model using glowtts but my time is limited. I also have to work on the various other aspects of the project, to get it off the ground. Right now our inference time using tacotron2 isn't great unless you have a GPU. As for compensation on the project, so far I have been giving away coffee money as we have little resources haha, if the project gets bigger and more funding, I'd be willing to help the project contributors out. https:// discord.gg/ gBKGNJrev4

Robowaifu Technician 07/17/2021 (Sat) 08:31:16 No.11550

>>11536 In August I'll have some time to work on TTS stuff and do some R&D. I recommend using FastPitch. It's just as good as Tacotron2 but 15x faster on the GPU and 2x faster on the CPU than Tacotron2 is on the GPU. It takes about a week to train on a toaster card and also already has stuff for detecting and changing the pitch and speed, which is essential to control for producing more expressive voices with extra input embeddings. https://fastpitch.github.io/

Robowaifu Technician 07/17/2021 (Sat) 08:43:27 No.11551

>>11550 >https://fastpitch.github.io/ >related (>>9165, >>9179)

Em Elle E 07/17/2021 (Sat) 16:49:41 No.11559

>>11550 I'd message you on discord about this, but this could be useful info for the board. Essentially I did use FastPitch originally; the issue is the teacher-student training methodology: you have to use Tacotron to bootstrap and predict durations to align. When you don't do that and just fine-tune the LJS FastPitch model, it fails to predict the durations. We can definitely try this method, I am open to it, I guess in my time crunch I didn't bother. I am optimizing for delivery so that we have a product people can use and enjoy. It should be very simple to update the models in the future, it would be one python script change based off my architecture.

Robowaifu Technician 07/18/2021 (Sun) 02:41:28 No.11569

>>11559 The 2B model I made was finetuned on the pretrained Tacotron2 model and only took about an hour. Automating preprocessing the training data won't be a big deal. And if a multi-speaker model is built for many different characters it would get faster and faster to finetune. I've been looking into Glow-TTS more and the automated duration and pitch prediction is a nice feature but the output quality seems even less expressive than Tacotron2. A key part of creating a cute female voice is having a large range in pitch variation. Also I've found a pretrained Tacotron2 model that uses IPA. It would be possible to train it on Japanese voices and make them talk in English, although it would take some extra time to adapt FastPitch to use IPA. Demo: https://stefantaubert.github.io/tacotron2/ GitHub: https://github.com/stefantaubert/tacotron2

Robowaifu Technician 07/18/2021 (Sun) 02:47:48 No.11570

Some other ideas I'd like to R&D for voice synthesis in the future:
- anti-aliasing ReLUs or replacing them with swish
- adding gated linear units
- replacing the convolution layers with deeper residual layers
- trying a 2-layer LSTM in Tacotron2
- adding ReZero to the FastPitch transformers so they can be deeper and train faster
- training with different hyperparameters to improve the quality
- using RL and human feedback to improve the quality
- using GANs to refine output like HiFiSinger
- outputting at a higher resolution and downsampling
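For reference, three of the building blocks named in the list above are small enough to write out in plain Python. These are the textbook definitions (swish with beta fixed at 1, the gated linear unit, and a ReZero residual connection), not code from Tacotron2 or FastPitch; the list-based helpers are just for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Smooth ReLU replacement: x * sigmoid(x)."""
    return x * sigmoid(x)

def glu(values, gates):
    """Gated linear unit: each value is scaled by the sigmoid of its gate."""
    return [v * sigmoid(g) for v, g in zip(values, gates)]

def rezero(x, sublayer, alpha=0.0):
    """ReZero residual: x + alpha * sublayer(x).

    alpha is a learned scalar that starts at 0, so every layer begins as
    the identity function, which is what lets deep stacks train stably.
    """
    fx = sublayer(x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]

print(swish(0.0))                                         # 0.0
print(glu([2.0], [0.0]))                                  # [1.0]
print(rezero([1.0, 2.0], lambda v: [10 * i for i in v]))  # [1.0, 2.0]
```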

Robowaifu Technician 07/18/2021 (Sun) 02:53:45 No.11571

>>11569 Thanks, but what's the point of this IPA? To let it talk correctly in other languages? >Der Nordwind und die Sonne - German with American English accent I can assure you: It doesn't work. Americans talking German often (always) sounds bad, but this is a level of its own. Absolutely bizarre.

Robowaifu Technician 07/18/2021 (Sun) 06:16:05 No.11577

>>11571 Yeah, I live around Chinese with thick accents and this takes it to the next level, kek. That's not really the motivation for using IPA though. This pilot study used transfer learning to intentionally create different accents, rather than copy the voice without the accent. How IPA is useful to generating waifu voices is it helps improve pronunciation, reduce needed training data, and solves the problem with heteronyms, words spelled the same but pronounced differently: https://jakubmarian.com/english-words-spelled-the-same-but-pronounced-differently/ When models without IPA have never seen a rare word in training, such as a technical word like synthesis, they will usually guess incorrectly how to pronounce it, but with IPA the pronunciation is always the same and it can speak the word fluently without ever having seen it before. Also in a multi-speaker model you can blend between speaker embeddings to create a new voice and it's possible to find interpretable directions in latent space. Finding one for accents should be possible, which could be left in control to the user's preferences to make a character voice sound more American, British or Japanese and so on.
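Blending between speaker embeddings, as mentioned above, is just linear interpolation between two vectors. A minimal sketch with made-up 2-dimensional embeddings (real ones would come out of a trained multi-speaker model):

```python
def blend_speakers(emb_a, emb_b, weight):
    """Interpolate between two speaker embeddings.

    weight=0.0 gives speaker A, weight=1.0 gives speaker B,
    anything in between is a new voice somewhere between the two.
    """
    return [(1.0 - weight) * a + weight * b for a, b in zip(emb_a, emb_b)]

speaker_a = [0.0, 2.0]
speaker_b = [4.0, 6.0]
print(blend_speakers(speaker_a, speaker_b, 0.5))  # [2.0, 4.0]
```

An accent control would work the same way, except the second vector is replaced by a direction in latent space that has been found to correspond to the accent.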

Robowaifu Technician 07/18/2021 (Sun) 18:41:02 No.11582

>>11577 Ah, okay, this sounds pretty useful. One more problem comes to mind in regards to this. In English foreign names are often changed in pronunciation, because the name would sound "strange" otherwise. The philosopher Kant would sound like the c-word for female private parts. Therefore they pronounce it Kaant. I wonder if the method helps with that as well.

Robowaifu Technician 07/19/2021 (Mon) 04:32:25 No.11601

>>11582 In that case it depends what language you transliterate with. If necessary names could be transliterated as they're supposed to be pronounced in their original language, or it could all be in the same language. Exceptions could also be defined. For example, the way Americans pronounce manga is quite different from the Japanese. If someone wants their waifu to sound more like a weeb and pronounce it the Japanese way, they could enter the Japanese IPA definition for it to override the default transliteration.
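The override idea above could look something like this: a default word to IPA lexicon plus a user dictionary that wins when both define a word. The lexicon entries and the IPA strings are illustrative guesses, not taken from any real pronunciation dictionary.

```python
# Hypothetical default lexicon (American-style pronunciation, illustrative).
DEFAULT_LEXICON = {"manga": "ˈmæŋɡə"}

def to_ipa(word, overrides=None):
    """Look up a word's IPA, letting user overrides beat the defaults.

    Unknown words are returned unchanged so a later grapheme-to-phoneme
    step (not shown) can handle them.
    """
    key = word.lower()
    if overrides and key in overrides:
        return overrides[key]
    return DEFAULT_LEXICON.get(key, key)

weeb_overrides = {"manga": "maŋɡa"}  # Japanese-style pronunciation (illustrative)
print(to_ipa("manga"))
print(to_ipa("Manga", weeb_overrides))
```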

Robowaifu Technician 07/20/2021 (Tue) 05:49:11 No.11623

Finished creating a tool for automatically downloading subtitles and audio clips from Youtube videos, which can be reworked in Aegisub or another subtitle editor, then converted into a training set with Clipchan. https://gitlab.com/robowaifudev/alisub

Robowaifu Technician 07/20/2021 (Tue) 06:24:34 No.11624

>>11623 This sounds exciting Anon, thanks! >or another subtitle editor Can you recommend a good alternative Anon? I've never been able to successfully get Aegisub to run.

Robowaifu Technician 07/20/2021 (Tue) 07:04:18 No.11625

>>11624 Someone recommended SubtitleEdit but it's Windows only: https://nikse.dk/SubtitleEdit Subtitle Editor can display waveforms but it's far more difficult to use and I don't recommend it.

Robowaifu Technician 07/20/2021 (Tue) 21:58:24 No.11631

>>11623 Okay, thanks. This could be useful for more, I guess. Maybe later to train the system on lip reading using YouTube, for example. Or maybe for training voice recognition in the first place? How much data do we need to emulate a particular voice?

Robowaifu Technician 07/21/2021 (Wed) 00:34:50 No.11634

>>11625 OK, thanks for the advice. I'll try and see if I can set it up on a virtual box instead or something, Aegisub did look pretty easy to use (first time I've seen it in action, so thanks again). The problem is always a wxWidgets dependency hell issue. I can even get it to build, right up to link time.

Robowaifu Technician 07/21/2021 (Wed) 02:48:24 No.11636

>>11631 Finetuning a pretrained model you need about 20 minutes. Training a model from scratch takes about 12 hours. Multispeaker models trained on hundreds of voices can clone a voice with a few sentences but still need a lot of samples to capture all the nuances.

Robowaifu Technician 07/21/2021 (Wed) 16:14:31 No.11647

Been doing some work to get WaifuEngine's speech synthesis to run fast on the CPU and found that FastPitch has a real-time factor of 40x and WaveGlow 0.4x. This led me to testing several different vocoder alternatives to WaveGlow and arriving at multi-band MelGAN with an RTF of 20x. So FastPitch+MelGAN has an RTF of 12x, which means it can synthesize 12 seconds of speech every second, or takes about 80ms to generate a second of speech. "Advancing robotics to a point where anime catgirl meidos in tiny miniskirts are a reality" took MelGAN 250ms on CPU to generate from 2B's Tacotron2 Mel spectrogram. Now I just gotta set up this shit so it's easy to train end-to-end and the whole internet and their waifus are getting real-time waifus. Multi-band MelGAN repo: https://github.com/rishikksh20/melgan Multi-band MelGAN paper: https://arxiv.org/abs/2005.05106 Original MelGAN paper: https://arxiv.org/abs/1910.06711
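How the per-stage real-time factors combine, as a back-of-the-envelope sketch. It assumes the two stages run one after the other with no other overhead, so the time per second of audio is the sum of each stage's time. The idealized result for 40x and 20x is a little above the measured 12x quoted above, which is plausible once real overhead is included.

```python
def combined_rtf(stage_rtfs):
    """Real-time factor of stages run back to back.

    Each stage needs 1/rtf seconds of compute per second of audio,
    so the times add and the combined factor is the reciprocal.
    """
    seconds_per_audio_second = sum(1.0 / rtf for rtf in stage_rtfs)
    return 1.0 / seconds_per_audio_second

print(round(combined_rtf([40, 20]), 1))    # 13.3 -> FastPitch + multi-band MelGAN
print(round(combined_rtf([40, 0.4]), 2))   # 0.4  -> FastPitch + WaveGlow, vocoder-bound
```

The second line shows why swapping the vocoder mattered: with WaveGlow at 0.4x the pipeline is slower than real time no matter how fast FastPitch is.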

Robowaifu Technician 07/21/2021 (Wed) 17:11:14 No.11648

>>11636 Interesting, thanks, but I meant how many samples we need to fine-tune a voice. I also wonder if voices are being 'blended' that way. Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress. >>11647 Thanks for your work. I thought voice generation would take much more time to do. Good to know. Responses to someone talking should be fast.

Robowaifu Technician 07/21/2021 (Wed) 20:08:08 No.11650

>>11648 I meant 20 minutes and 12 hours of samples. Finetuning with 20 minutes of samples takes about 1-2 hours on my budget GPU. >Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress. This definitely deserves more thought. If every person on the internet will be able to do speech synthesis and there is a tsunami of voice cloning characters, it's important people are able to have creative freedom with it while the buzz is on. People's curiosity will further advance speech synthesis and diffuse into other areas of AI, including waifu tech. On the other hand if people only straight up copy voices then it would cause a media shitstorm and possibly turn people away, but that could also have its benefits. Whatever happens though the accelerator is stuck to the floor. In the meantime while the hype builds, iteration can continue on until the synthesis of Gawr Kilcher is realized. When people look closely though they'll notice it's neither Yannic or Gura but actually Rimuru and Stunk all along.

Robowaifu Technician 07/22/2021 (Thu) 03:01:01 No.11653

>>11647 Thanks for the information, Anon.

Robowaifu Technician 07/23/2021 (Fri) 17:41:47 No.11677

>>11650 kek. i just noticed that logo. i wonder what based-boomer AJ would think of robowaifus. white race genocide, or crushing blow to feminazis and freedom to all men from oppression?

Robowaifu Technician 07/23/2021 (Fri) 21:50:10 No.11680

>>11677 He doesn't like them or AI in general. Said something once like people are going to stop having kids and masturbate with a piece of plastic all day and how the government is going to know everything about people through them and be able to manipulate them perfectly. He's not really wrong. Look how many people already give up all their data using Windows and Chrome.

Robowaifu Technician 08/05/2021 (Thu) 03:32:09 No.12193

>>8151 https://github.com/ZDisket/TensorVox

Robowaifu Technician 08/07/2021 (Sat) 23:31:13 No.12257

>>8151 >>12193 A routine check on the Insights->Traffic page led me here. While the program itself is written with Qt, what actually makes the voices work (Voice.h and beyond) does not contain a single trace of Qt (well, almost, but what little there is is just error boxes). This is a deliberate design decision to allow the actual inference engine to be copied and ported anywhere with minimal trouble. For inference on embedded devices you probably want to use TFLite, which is on my list because I plan on Windows SAPI integration.

Chobitsu Board owner 08/07/2021 (Sat) 23:54:25 No.12258

>>12257 Hello Anon, welcome. We're glad you're here. Thanks for any technical explanations, we have a number of engineers here. Please have a look around the board while you're here. If you have any questions, feel free to make a post on our current /meta thread (>>8492). If you decide you'd like to introduce yourself more fully, then we have an embassy thread for just that (>>2823). Regardless, thanks for stopping by!

Robowaifu Technician 09/15/2021 (Wed) 10:13:43 No.13164

I was thinking of using something like FastPitch, but with an added effect to make it sound more robotic to keep it from getting to an uncanny valley sound. Either that or making the voice kinda high pitched and childlike, to make it easier to accept when it says something stupid. Has anyone here considered hardware-based speech synthesis so it'll actually sync up with mouth movements? Everything professional I've seen just seem like horrid screaming fleshlights that never really try to resemble actual heads.
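One cheap way to get a "more robotic" sound on top of a neural TTS output is ring modulation, i.e. multiplying the waveform by a low-frequency sine. That choice is an assumption on my part, not something the post specifies, and the 30 Hz carrier is just a typical starting value. A plain-Python sketch over a list of float samples:

```python
import math

def ring_modulate(samples, sample_rate, carrier_hz=30.0):
    """Multiply each sample by a sine carrier to get a metallic, robotic tone."""
    return [
        s * math.sin(2.0 * math.pi * carrier_hz * n / sample_rate)
        for n, s in enumerate(samples)
    ]

# Toy input: 10 ms of a 440 Hz tone at 22050 Hz.
rate = 22050
tone = [math.sin(2.0 * math.pi * 440.0 * n / rate) for n in range(rate // 100)]
robotic = ring_modulate(tone, rate)
```

In practice this would be applied to the vocoder output before playback, with the carrier frequency tuned by ear.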

Robowaifu Technician 09/15/2021 (Wed) 20:30:52 No.13194

>>13164 >Has anyone here considered hardware-based speech synthesis so it'll actually sync up with mouth movements? Voice modulation, but not complete synthesis. But I don't know how yet. Your picrel is what I knew about, I posted some related video before. However, I was thinking about small internal speakers (mini maze speakers?) but with additional silicone parts that could move and change the voice that way. But nothing specific yet

Robowaifu Technician 10/25/2021 (Mon) 13:09:27 No.13811

In need of some help... I want to create a speech synthesizer, I want to take samples of my waifu's voices (which I have a lot of) and use it to digitally create her voice. First of all, is it possible? The voice samples I have are not the kind that this video shows https://youtu.be/_d7xRj121bs?t=55 , they're just in-game dialog. It is also worth noting that the voice is in Japanese. If it is possible, I still have no idea where to begin with this, I'm guessing I'll need some sound tech knowledge (which I have none of) and that's about all I can think of. In terms of programming languages, I know Python fairly well and am currently getting into C++. Anons, how do I get started with this?

Robowaifu Technician 10/25/2021 (Mon) 21:40:38 No.13818

>>13811 >I still have no idea where to begin with this Welcome. Then look through the thread and into the programs mentioned. You will probably need to train some neural network on a GPU. Also, you would need to extract the voices from the game and also have these words in text then. If you can't get them as files, then you might need to record them with a microphone. Then you would need to transcribe the text. Lurk around, maybe someone else knows more, and just ignore the disgusting troll insulting everyone.

Robowaifu Technician 10/25/2021 (Mon) 21:58:33 No.13821

>>13811 As dumb as this might sound, you might want to check out /MLP/ on 4chan, there's a 100+ threads about doing this with My Little Pony characters called the "Pony Preservation Project" and they've actually made some decent progress.

Robowaifu Technician 10/26/2021 (Tue) 03:15:59 No.13823

>>13818 >look through the thread and into the programs mentioned Will do. >probably need to train some neural network on a GPU Have yet to get into neural networks but looks like the time has come. >extract the voices I've done that, from the game files too so they're of decently high quality. >transcribe the text That I need to do. >>13821 >as dumb as this might sound Nothing dumb about it if it works. Will give them a visit. Thank you, Anons!

Robowaifu Technician 11/07/2021 (Sun) 07:46:34 No.14065

>>13823 Keep us updated if there's progress anon, speech synthesis is a fascinating field. I'd love to try it out myself later in the year once I have more time

Robowaifu Technician 11/16/2021 (Tue) 05:32:55 No.14270

This may be old news, since it's from 2018, but Google's Duplex seems to have a great grasp on conversational speech. I think it says a lot when I had an easier time understanding the robot versus the lady at the restaurant (2nd audio example in the blog). https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html

Robowaifu Technician 11/17/2021 (Wed) 01:30:43 No.14276

>>14270 Hi, I knew that this has been mentioned before somewhere. Didn't find it here in this thread nor with Waifusearch. Anyways, it's in the wrong thread here, since this is about speech synthesis but the article is about speech recognition. The former conversation probably happened in the chatbot thread. >One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains. It cannot carry out general conversations. This is exactly the interesting topic of the article. Good reminder. A few months or a year ago I pointed out that recognizing all kinds of words, sentences and meanings will be one of our biggest challenges. Especially if it should work with all kinds of voices. Some specialists (Sphinx CMU) claimed it would currently require a server farm with terabytes of RAM to do that, if it was even possible. We'll probably need a way to work around that. Maybe using many constrained models on fast SSDs which take over, depending on the topic of conversation. Let's also hope for some progress, but also accept that the first robowaifus might only understand certain commands.

Robowaifu Technician 02/12/2022 (Sat) 22:21:51 No.15192

>>11623 You should replace youtube-dl with yt-dlp. youtube-dl is no longer maintained and has issues with some YouTube videos.

Chobitsu Board owner 02/13/2022 (Sun) 03:25:30 No.15193

>>15192 Thanks for the tip Anon. Having used youtube-dl for years now, I too noticed the sudden drop-off in updates that occurred following the coordinated attack by RIAA/Microsoft against its developer & user community. We'll look into it.

Robowaifu Technician 04/13/2022 (Wed) 10:23:43 No.15874

I think I've finally figured out a way to train more expressive voices in conversation without having to label a ton of data. First, the English text needs to be transcribed into IPA so that a speech synthesis model can easily predict how words are spoken without requiring a huge dataset covering all the exceptions and weirdness of English. The English transcription or IPA is projected into an embedding that's split into two parts. One part is constrained to representing the content as IPA, by projecting those features back into IPA symbols and minimizing the cross entropy loss. The other half models the style, such as the emotion and other subtleties, to match the audio examples more faithfully, and is trained through the Mel spectrogram loss. This way the model can learn all aspects of speech through just the text labels and audio examples alone. At inference time this style embedding could be modified to change the emotion, pitch, cadence, tone and other qualities of the model for voice acting or creating examples for finetuning the model towards a desired personality. A ByT5 model could be used to transcribe English and other languages into the IPA embedding + style embedding. It could also take into account the previous context of the conversation to generate a more appropriate style embedding for the speech synthesis model to work from. Training from context though will require new datasets from podcasts that have such context. I've collected some with existing transcripts and timestamps for this already. The transcripts just need to be accurately aligned to the audio clips for clipping, so it's not an unfeasible project for one person to do. Other possibilities for this could be adding tags into the text training data that get filtered out from the content via the IPA cross entropy loss, ensuring the tags only affect the style embedding. You could indicate tempo, pitches, velocity and note values for singing which would be learned in the style embeddings. It could also be used for annotating different moods or speaking styles such as whispering or yelling. There's a ton of possibilities here for more versatile speech synthesis and natural conversation.
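
A rough sketch of the two training signals described above, in plain Python so the arithmetic is visible. Everything here is an assumption made for illustration: the function names, the fixed split point `content_dim`, the projection matrix `ipa_proj`, and the choice of an L1 distance for the Mel spectrogram loss (the post does not say which distance is meant). A real model would do this with tensors and backpropagation.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross entropy of one logit vector against an integer class label."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target_index]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def split_embedding_loss(embeddings, content_dim, ipa_proj, ipa_targets,
                         mel_pred, mel_target, content_weight=1.0):
    """Return (total, content_loss, mel_loss) for one training example.

    embeddings  : one vector per text position
    content_dim : number of leading dimensions that form the content half
    ipa_proj    : hypothetical matrix mapping the content half to IPA logits
    ipa_targets : index of the correct IPA symbol at each position
    mel_pred, mel_target : predicted and reference Mel frames
    """
    # Content half only: it has to be decodable back into IPA symbols.
    # The style half (vec[content_dim:]) never enters this term.
    content_loss = 0.0
    for vec, target in zip(embeddings, ipa_targets):
        logits = matvec(ipa_proj, vec[:content_dim])
        content_loss += softmax_cross_entropy(logits, target)
    content_loss /= len(embeddings)

    # Audio term: mean absolute difference between Mel frames (assumed L1).
    diffs = [abs(p - t)
             for frame_p, frame_t in zip(mel_pred, mel_target)
             for p, t in zip(frame_p, frame_t)]
    mel_loss = sum(diffs) / len(diffs)

    return content_weight * content_loss + mel_loss, content_loss, mel_loss
```

Because the cross entropy only ever sees the first `content_dim` dimensions, anything the model stores in the remaining dimensions is shaped by the audio term alone, which is the separation the post is after.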

Robowaifu Technician 06/08/2022 (Wed) 00:03:00 No.16606

>>15874 Pony Preservation Project anon here. I recommend checking out the presentations linked from the PPP threads, especially Cookie's segments. derpy.me/pVeU0 derpy.me/Jwj8a In short:
- Use Arpabet instead of IPA. It's much easier to get Arpabet data than IPA data, and Arpabet is good enough.
- Use BOTH Arpabet and English transcriptions. Each datapoint should contain one or the other for the transcription, and the dataset as a whole should contain both Arpabet and English transcriptions.
- Use a natural language model to augment your data with emotion embeddings. The pony standard is to use DeepMoji embeddings. Some anon has used TinyBERT for supposedly-better effect. I assume if you're using a language model like TinyBERT, you'd need to create a prompt that gets the network to associate an emotion with the text, then use the embedding for the token associated with that emotion prediction.
- Use HiFiGAN for the vocoder.
We've also found that text-to-speech isn't always suitable for waifu-speak. Sometimes (often), people want to be able to use a reference audio to get the prosody exactly right. For that, you'll want to use TalkNet2.
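
The "one or the other per datapoint" advice could look roughly like this. It's only a sketch: the two-word pronunciation table is a toy stand-in for a real dictionary such as CMUdict, the function names are made up here, and wrapping phones in curly braces is the convention some Tacotron-style text frontends accept, which may not match the PPP tooling exactly. Punctuation handling is left out.

```python
import random

# Toy pronunciation table (assumption); a real one would be loaded from CMUdict.
TOY_ARPABET = {
    "hello": "HH AH0 L OW1",
    "robot": "R OW1 B AA2 T",
}

def to_arpabet(text, table=TOY_ARPABET):
    """Replace every known word with its Arpabet phones in curly braces.

    Words missing from the table are kept as plain lowercase text.
    """
    out = []
    for word in text.lower().split():
        out.append("{" + table[word] + "}" if word in table else word)
    return " ".join(out)

def mix_transcriptions(lines, arpabet_prob=0.5, seed=0):
    """Give each datapoint either its English or its Arpabet transcription."""
    rng = random.Random(seed)
    return [to_arpabet(line) if rng.random() < arpabet_prob else line
            for line in lines]
```

With `arpabet_prob` around 0.5 the dataset as a whole ends up containing both forms, while each individual line carries only one of them.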

Robowaifu Technician 06/08/2022 (Wed) 02:16:07 No.16609

>>15192 >>15193 > (>>12247 youtube-dl takedown -related) > (>>16357 yt-dlp installation & scripting -related)

Chobitsu 06/08/2022 (Wed) 03:17:26 No.16612

>>16606 You ponies have done some nice work through your primary overarching project Anon. Thanks for the recommendations! :^) Cheers.

some thingy for "advanced audio recognition" Robowaifu Technician 06/13/2022 (Mon) 14:52:24 No.16664

The idea is that the neural network should be fed the same audio stream several times with different parameters. For example, if the network does not recognize the speech as recorded, there is a chance it will recognize the same audio at "Pitch -1 _ Speed -1". Most likely this method has already been implemented and has long been in use. If not, it seems to me it could solve the main difficulties in speech recognition, understanding words, etc.
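
A minimal sketch of that retry idea, assuming a recognizer that reports a confidence value. `recognize` is a hypothetical callback returning `(text, confidence)`, not a real library call, and the naive resampling below changes speed and pitch together rather than independently.

```python
def resample(samples, rate):
    """Naive linear-interpolation resampling.

    rate > 1 shortens the clip (faster and higher pitched at the original
    sample rate); rate < 1 lengthens it (slower and lower pitched).
    """
    if len(samples) < 2:
        return list(samples)
    n_out = max(2, int(round(len(samples) / rate)))
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

def recognize_with_variants(samples, recognize, rates=(1.0, 0.9, 1.1)):
    """Run the recognizer on several variants and keep the most confident one.

    Returns (text, confidence, rate) for the best variant.
    """
    best = ("", float("-inf"), None)
    for rate in rates:
        text, confidence = recognize(resample(samples, rate))
        if confidence > best[1]:
            best = (text, confidence, rate)
    return best
```

The cost is one recognizer pass per variant, so in practice it probably only makes sense to try the extra variants when the first pass comes back with low confidence.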

Robowaifu Technician 06/13/2022 (Mon) 15:53:19 No.16666

>>16664 I had similar thoughts, and my thinking was that we might profit from using specialized hardware (probably some ASIC) close to the cameras and ears, automatically creating different versions of the data. For audio it might be a DSP, but I don't know much about that. Filtering out certain frequencies or doing noise cancelation might also be helpful. Basically we need SBC-sized hardware which can do that very efficiently and fast, outputting various versions from only a few inputs.

Robowaifu Technician 06/13/2022 (Mon) 15:58:59 No.16669

Would it be easier to go with an old school approach like this? https://m.youtube.com/watch?v=J_eRsA7ppy0

Robowaifu Technician 06/13/2022 (Mon) 22:09:47 No.16682

>>16664 This posting and the response should be moved into a thread about audio recognition or conversational AI / chatbots, since this thread is about speech synthesis not recognition.

Robowaifu Technician 06/14/2022 (Tue) 01:52:17 No.16684

>>16669 The techniques for the audio in there are studied now under phonetics. The techniques for the video in there are studied under articulatory synthesis. Articulatory synthesis is difficult and computationally expensive. I don't know of a good, flexible framework for doing that, so I wouldn't know how to get started on waifu speech with that. Under phonetics, the main techniques before deep neural networks were formant synthesis and concatenative synthesis. Formant synthesis will result in recognizable sounds, but not human voices. It's what you're hearing in the video. Concatenative synthesis requires huge diphone sound banks, which represent sound pieces that can be combined. (Phone = single stable sound. Diphone = adjacent pair of phones. A diphone sound bank cuts off each diphone at the midpoints of the phones since it's much easier to concatenate phones cleanly at the midpoints rather than the endpoints. This is what Hatsune Miku uses.) Concatenative synthesis is more efficient than deep neural networks, but deep neural networks are far, far more natural, controllable, and flexible. Seriously, I highly recommend following in the PPP's footsteps here. Deep neural networks are the best way forward. They can produce higher quality results with better controls and with less data than any other approach. Programmatically, they're also flexible enough to incorporate any advances you might see from phonetics and articulatory synthesis. The current deep neural networks for speech generation already borrow a lot of ideas from phonetics.
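
To make the diphone idea concrete, here is a small sketch of how a phone sequence maps onto diphone units and how a sound bank lookup would stitch them together. The silence marker `_`, the `a-b` unit naming and the bank layout (a dict from unit name to a list of samples) are all assumptions for illustration; real engines also smooth the joins, which is skipped here.

```python
def to_diphones(phones, silence="_"):
    """Turn a phone sequence into the diphone units a sound bank would store.

    Silence is padded on both ends so the first and last phones also get a
    transition unit.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

def concatenate(diphones, bank):
    """Join the recorded samples for each diphone into one waveform."""
    waveform = []
    for unit in diphones:
        waveform.extend(bank[unit])  # a KeyError means the bank lacks this unit
    return waveform
```

The size problem mentioned above shows up directly: a language with N phones needs on the order of N*N recorded units in `bank`, per voice.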

Chobitsu 06/17/2022 (Fri) 18:16:22 No.16725

>>16684 Thanks for the advice Anon!

Robowaifu Technician 10/06/2022 (Thu) 20:32:39 No.17474

https://github.com/openai/whisper Oh shit, audio transcription has surpassed average human level and is now competitive with professional transcription. OpenAI has gone off its investor rails and completely open-sourced the model and weights. On top of that it's multilingual and can do Japanese fairly well. This could be used for transcribing audio from vtubers, audio books, and anime with missing subtitles. Unfortunately it doesn't do speaker detection as far as I know but it might be possible to train another model to use the encoded audio features to detect them.
Install:
python -m pip install git+https://github.com/openai/whisper.git --user
Quick start:

import whisper
model = whisper.load_model("base", device="cuda") # set device to cpu if no CUDA
result = model.transcribe("chobits_sample.mp3", language="en") # multilingual models will automatically detect language, but not English only models
print(result["text"])

Output (base):
> Yuzuki. I brought you some tea. Huh? Huh? Why are you serving the tea? The maid, Persecom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. Wee. I can handle this on my own. I want you to try to relax. Oh. Oh? Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all we really need is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motu-suwa, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! Huh? There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki.
Interestingly the VA actually said "persecom" instead of persocom and Motusua instead of Motosuwa, which transcribed as "Motu-suwa". The poor pronunciation of "all he really needs is a good night's sleep" sounded a lot like "all we really need is a good night's sleep" and was transcribed as such. The only other errors were transcribing a Chii processing sound effect as "wee", mistaking Minoru saying "ah!" as "huh?", the clatter of teacups being transcribed as "oh", and Minoru saying "ugh" as "oh?"
Output (small):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki?
"Ah! Huh?" from Minoru and Hideki were omitted. "Ugh" was also omitted when Minoru passes out. It understood persocom wasn't a name but still misspelled it "persicom". Chii's sound effect wasn't transcribed as "wee" this time. Motosuwa got transcribed as "Motosua". This model understood "all he really needs" but made a mistake at the end thinking Hideki was asking a question saying Yuzuki.
Output (medium):
> Yuzuki! I brought you some tea. Ah! Huh? Why are you serving the tea? The maid, Persicom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki...
This one got the ellipsis right at the end and recognized Minoru saying "ah!" but mistook persocom as a name, Persicom. "Ugh" was omitted.
Output (large):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki...
"Ah! huh?" were omitted and it understood persocom wasn't a name but still spelled it as "persicom".

Robowaifu Technician 10/06/2022 (Thu) 20:33:12 No.17475

>>17474 (continued)
Output (tiny):
> Useuki. I brought you some tea. Ugh. Huh? Why are you serving the tea? The maid, Percicom, is currently being used by the system. What are you talking about? Useuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord, Minoro. Thank you very much. I can handle this on my own. I want you to try to relax. Oh. Minoro! Minoro! Minoro! Well, I'm glad to know that all we really need is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motu, so it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! Huh? There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoro under such extreme stress. If only I was just more useful. Let's use a key.
Tons of errors, not particularly usable.
Output (tiny.en):
> Yuzuki! I brought you some tea. Oh! Why are you serving the tea? The maid purse-a-com is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord, me no no. Thank you very much. I can handle this on my own. I want you to try to relax. Oh. Do you know who? What do you know her? What do you know her? Well, I'm glad to know that all he really needs is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motusua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minaro under such extreme stress. If only I was just more useful. Oh, Miss Yuzuki.
>Lord, me no no.
Japanese names and words confuse it. "Minoru" became "Do you know who?" and "Lord Minoru" became "What do you know her?" but it does decent on English and got "all he really needs" right but flubbed "but to be so exhausted" as "but it'd be so exhausted". Interestingly it got "Motusua" right the way she said it.
Output (base.en):
> Yuzuki! I brought you some tea. Ugh! What? Why are you serving the tea? The maid-pursa-com is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minaro, thank you very much. I can handle this on my own. I want you to try to relax. Oh. Minaro! Lord Minaro! Lord Minaro! Well, I'm glad to know that Allie really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motusua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minaro under such extreme stress. If only I was just more useful. Miss Yuzuki.
This one really messed up "all he really needs" as "Allie really needs" and understood "Minoru" as a name "Minaro". It also got "but to be so exhausted" right. Mistook "ugh" as "oh".
Output (small.en):
> Yuzuki! I brought you some tea. Ah! Huh? Why are you serving the tea? The maid persicum is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki.
>persicum
This one got Minoru spelled right and "all he really needs" and "but to be so exhausted". Omitted "ugh".
Output (medium.en):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he'd just collapse like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki?
Omitted "ah! huh?" and "ugh" but otherwise good.
Overall from just this sample I think using base is the best for English and tiny.en on CPU. The improvements in quality by small and medium aren't really worth the slowdown in speed and the base.en model doesn't seem particularly robust. If going for a larger model small.en seems better than small.

Robowaifu Technician 12/16/2022 (Fri) 03:22:54 No.18253

>>17474 Holy shit, I just found out the other day Whisper has a translate feature and gave it a go tonight. It works amazingly well with the medium size model.
>[00:00.000 --> 00:02.920] The strongest warrior of Vesper, Mahoro.
>[00:02.920 --> 00:06.720] Thank you for fighting with me until today.
>[00:06.720 --> 00:11.600] I'm sure you already know, but you have only a few lives left.
>[00:11.600 --> 00:17.600] If you continue to fight as a warrior, you will only have 37 days to move.
>[00:17.600 --> 00:25.600] However, if you release your armament, you will still be able to move for 398 days, according to the report.
>[00:25.600 --> 00:30.800] Mahoro, you've done enough for us Vesper.
>[00:30.800 --> 00:37.200] If you have a wish that you want to grant, why don't you live the rest of your time freely?
>[00:37.200 --> 00:41.000] Huh? Um...
>[00:41.000 --> 00:46.000] Now, choose whichever path you like.
>[00:48.000 --> 00:49.800] My wish...
>[00:49.800 --> 00:54.400] My last wish is...
I imagine finetuning the model on English and Japanese voices and learning to predict not only the text but also the emotion, tone and speaker by attaching Tacotron to the decoder. Then the translate feature could be used to auto-dub anime in the same voice and emotion but in English. The decoder of Whisper could also be used to predict style embeddings (the emotion and tone) from text to feed into Tacotron to synthesize much more natural sounding speech, and the more context you give it, the more accurate it would be.
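
For anyone wanting to reproduce timestamped output like the above from Python, here is a sketch. The `task="translate"` option and the `segments` list with `start`, `end` and `text` keys come from the openai-whisper package; the helper names and the file path passed in are made up for this example, and the formatting only mimics the bracketed style shown in the post for clips shorter than an hour.

```python
def format_timestamp(seconds):
    """Format seconds as MM:SS.mmm, like the bracketed times in the post."""
    millis = int(round(seconds * 1000))
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{minutes:02d}:{secs:02d}.{millis:03d}"

def format_segments(segments):
    """Render Whisper-style segments (dicts with start, end, text) as lines."""
    return [
        f"[{format_timestamp(s['start'])} --> {format_timestamp(s['end'])}] "
        f"{s['text'].strip()}"
        for s in segments
    ]

def translate_file(path, model_name="medium"):
    """Translate the speech in `path` into English text with Whisper."""
    import whisper  # imported lazily so the helpers above work without it
    model = whisper.load_model(model_name)
    result = model.transcribe(path, task="translate")
    return format_segments(result["segments"])
```

Calling `translate_file("some_clip.mp3")` (hypothetical file name) would return one formatted line per segment; the English-only `.en` models can't be used for this since they have no translation task.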

Chobitsu 12/16/2022 (Fri) 05:08:16 No.18255

>>18253 Amazing. Please do this for us Anon! If you can meta-context encodings then we can use this nearly directly for our (robo)waifus. Also, Mahoro Based/10 choice. :^)

Robowaifu Technician 12/16/2022 (Fri) 06:21:04 No.18259

>>18253 Thanks, I plan to use Whisper soon. I've read it has problems with mixing languages, so if you encounter problems it might come from that.

Robowaifu Technician 01/10/2023 (Tue) 20:22:43 No.18628

Microsoft one-shot voice training. https://valle-demo.github.io/ Give it Chii's voice and it will probably sound like Chii.

Robowaifu Technician 01/11/2023 (Wed) 01:20:31 No.18629

(related crosspost) >>18628 >>18605

Chobitsu 01/12/2023 (Thu) 07:43:56 No.18650

>>18628 Reading the comments section is predictable tbh. https ://www.foxnews.com/tech/new-ai-simulate-voice-3-seconds-audio >=== -disable hotlink

Edited last time by Chobitsu on 01/12/2023 (Thu) 08:08:59.

Chobitsu 01/31/2023 (Tue) 15:11:45 No.19351

Our neighbors at /cyber/ mentioned this one. >Prime Voice AI >"The most realistic and versatile AI speech software, ever. Eleven brings the most compelling, rich and lifelike voices to creators and publishers seeking the ultimate tools for storytelling." https://beta.elevenlabs.io/

NoidoDev ##eCt7e4 03/23/2023 (Thu) 15:39:43 No.21542

> The scope of OpenUtau includes:
> - Modern user experience.
> - Selected compatibility with UTAU technologies.
> - OpenUtau aims to solve problems in less laborious ways, so don't expect it to replicate exact UTAU features.
> - Extensible realtime phonetics (VCV, CVVC, Arpasing) intellegence.
> - English, Japanese, Chinese, Korean, Russian and more.
> - Internationalization, including UI translation and file system encoding support.
> - No you don't need to change system locale to use OpenUtau.
> - Smooth preview/rendering experience.
> - A easy to use plugin system.
> - An efficient resampling engine interface.
> - Compatible with most UTAU resamplers.
> - A Windows and a macOS version.
>The scope of OpenUtau does not include:
> - Full feature digital music workstation.
> - OpenUtau does not strike for Vocaloid compatibility, other than limited features.
https://github.com/stakira/OpenUtau

Noidodev 05/12/2023 (Fri) 13:17:40 No.22495

>This repo/rentry aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows/Linux, as well as a stepping stone for anons that genuinely want to play around with TorToiSe. https://git.ecker.tech/mrq/ai-voice-cloning

Chobitsu 05/15/2023 (Mon) 01:47:58 No.22567

>>22538 Lol. Just to let you know Anon, we're primarily a SFW board. You might try /robo/. Cheers. :^)

Noidodev 05/15/2023 (Mon) 04:37:25 No.22577

>>22538 What is this? From the ...engine where the dev doesn't want to be mentioned here?

Robowaifu Technician 06/27/2023 (Tue) 16:29:09 No.23535

I just finished my demonstration for talking to the waifu ai https://youtu.be/jjvbENaiDXc

NoidoDev ##eCt7e4 06/27/2023 (Tue) 19:11:57 No.23538

>Whisper-based Real-time Speech Recognition https://www.unrealengine.com/marketplace/en-US/product/d293a6a427c94831888ca0f47bc5939b Just want to show this here after finding it. Something like this would be useful if one wanted to use UnrealEngine for a virtual waifu or some kind of a virtual training environment.

Robowaifu Technician 06/27/2023 (Tue) 19:39:05 No.23539

>>23538 I'm sure there's some kind of netcode in unreal you can use for a transcribing API of your choice and save yourself the $99 >virtual waifu real life robotic waifu

Robowaifu Technician 06/28/2023 (Wed) 13:28:24 No.23558

>>23538 https://github.com/ggerganov/whisper.cpp

NoidoDev ##eCt7e4 06/28/2023 (Wed) 14:55:14 No.23561

>>23558
>Whisper C++
>Beta: v1.4.2 / Stable: v1.2.1 / Roadmap | F.A.Q.
>High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:
>Plain C/C++ implementation without dependencies
>Apple silicon first-class citizen - optimized via ARM NEON, Accelerate framework and Core ML
>AVX intrinsics support for x86 architectures
>VSX intrinsics support for POWER architectures
>Mixed F16 / F32 precision
>4-bit and 5-bit integer quantization support
>Low memory usage (Flash Attention)
>Zero memory allocations at runtime
>Runs on the CPU
>Partial GPU support for NVIDIA via cuBLAS
>Partial OpenCL GPU support via CLBlast
>BLAS CPU support via OpenBLAS
>C-style API
Thanks, that might come in handy. There seems to be enough GPU support, even though it mainly runs on the CPU. I'm still thinking of building a dedicated server at some point, using the Arc380 (70W).
>large 2.9 GB ~3.3 GB
The original Whisper needs 10GB or more for the large model, which would rather suggest getting a 3060 (170W). Many things will work fine with smaller models anyway.

Chobitsu 06/28/2023 (Wed) 23:49:24 No.23579

>>23558 Thanks for the reminder Anon. That anon's work is really quite excellent tbh.

NoidoDev ##eCt7e4 06/29/2023 (Thu) 02:51:03 No.23590

>>23558 >>23561 This (bit hard to understand) guy here https://www.youtube.com/watch?v=75H12lYz0Lo tests it on a Raspberry Pi and it works actually surprisingly fast! He tries to get smaller and smaller with his optimizations. I'll keep an eye on that.

Robowaifu Technician 06/29/2023 (Thu) 02:56:58 No.23591

>>23579 AWS Transcribe costs 3 cents per minute and you want to rent a server to run that thing, which probably requires multiple GPUs. Doesn't make any sense.

NoidoDev ##eCt7e4 06/29/2023 (Thu) 16:15:34 No.23601

>>23591 >Whisper vs AWS transcribe This is about running it at home. The tiny model works on a Raspberry Pi and the large one maybe on a 4GB GPU, certainly on a 6GB GPU (like the Arc380 which uses 70W). Do as you wish, but the general notion here is that we want our waifus to be independent from the internet. Some might even say, not connected to it. Using online services for something as fundamental as speech recognition (transcription), especially beyond development, is a special case and will not be recommended.

peteblank 06/30/2023 (Fri) 06:27:24 No.23634

>>23535 That took quite a while and was more productive than whatever the heck kiwi is doing. I'm going to start using a name tag so I can get some proper recognition for what I've done so far. Which is trying to make a hasel actuator, this, buying supplies, reading up on electronics and testing the arduino and soon making a 3d anime girl doll from scratch. I'm really about to leave this place cause this is bullshit.

Robowaifu Technician 06/30/2023 (Fri) 07:30:21 No.23635

>>23634 peteblank is an anagram for "pleb taken"

Grommet 06/30/2023 (Fri) 10:51:29 No.23639

>>23590 Wow. That's most excellent.

NoidoDev ##eCt7e4 06/30/2023 (Fri) 11:50:11 No.23640

>>23634 It's good that you did something during the last few months, but don't exaggerate. You had some advice from other anons here when trying to make the hasel actuator. You also bring this kind of vitriol with you, bashing someone or this board in way too many comments. >3d anime girl doll from scratch I'm looking forward to seeing that. >I'm really about to leave this place You don't need to hang out here every day. Work on your project and report back later.

Robowaifu Technician 06/30/2023 (Fri) 12:36:55 No.23643

>>23640 I am right to be upset at kiwi since he's attacking my character for no reason. I told him I was planning to do this for profit if possible, i emailed the guy who made the 3d model asking for permission and then he turns around and claims i want to steal other people's stuff.

Chobitsu 07/01/2023 (Sat) 05:49:29 No.23674

>>23634 >I'm going to start using a name tag so I can get some proper recognition for what I've done so far. Good thinking Anon. That's not really why we use names here. Watch the movie 50 first dates to understand the actual reason.

NoidoDev ##eCt7e4 07/01/2023 (Sat) 17:15:10 No.23685

>>23643 I deleted my original post here, but forgot to copy it. Just wanted to post the new link to the related post. Well... Related: >>23682 This thread is about speech synthesis and maybe recognition, not even about 3D models. You can crosslink posts like above.

NoidoDev ##eCt7e4 07/02/2023 (Sun) 00:17:29 No.23736

>our research team kept seeing new voice conversion methods getting more complex and becoming harder to reproduce. So, we tried to see if we could make a top-tier voice conversion model that was extremely simple. So, we made kNN-VC, where our entire conversion model is just k-nearest neighbors regression on WavLM features. And, it turns out, this does as well if not better than very complex any-to-any voice conversion methods. What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages. https://bshall.github.io/knn-vc https://arxiv.org/abs/2305.18975
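To make the "k-nearest neighbors regression on features" idea from the quote concrete, here is a minimal pure-Python sketch. It assumes the speech features (WavLM in the paper) have already been extracted as plain float vectors, one per frame; the function names, the cosine distance and the default k are illustrative assumptions, and the vocoder that turns the converted features back into audio is left out entirely.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero frame as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def knn_convert(source_frames, reference_frames, k=4):
    """Replace each source frame by the mean of its k nearest reference frames.

    source_frames / reference_frames: lists of float vectors (hypothetical
    stand-ins for per-frame speech features of the source and target speaker).
    """
    converted = []
    for frame in source_frames:
        nearest = sorted(
            reference_frames, key=lambda ref: cosine_distance(frame, ref)
        )[:k]
        converted.append(
            [sum(ref[i] for ref in nearest) / len(nearest) for i in range(len(frame))]
        )
    return converted
```

Since there is nothing to train, the "model" is just the pool of reference frames, which is why any audio can serve as the reference.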

Chobitsu 07/02/2023 (Sun) 04:35:24 No.23757

>>23736 >What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages. Lol. That seems a little bizarre to think through. Thanks Anon. >ps. I edited the subject ITT, thanks for pointing that out NoidoDev.

NoidoDev ##eCt7e4 08/27/2023 (Sun) 19:24:14 No.24951

We should think about optimizations of speech recognition (synthesis needs its own approach):
- there are FPGA SBCs which you can train to react to certain words, then put out a text or trigger something
- instead of recording a 30s sentence, record much shorter parts back to back, check the parts, but also glue them together and send the whole sentence to the speech recognition model
- maybe using a language model for anticipation of what might be said, while using parts of a sentence, especially with some context e.g. pointing at something
- finding ways to detect made up words
- construct words out of syllables instead of just jumping to what could have been meant, using that for parts of a sentence where the speech recognition model is uncertain
- using the certainty values of speech recognition to look for errors (misunderstandings), maybe using the syllable construction, wordlists and list of names for that
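The last point (certainty values plus wordlists) could look roughly like this. It is only a sketch: it assumes the recognizer hands back (word, confidence) pairs, which not every STT system does, and the threshold, the function name and the sample wordlist are made up for illustration. The fuzzy matching uses difflib from the Python standard library.

```python
import difflib


def flag_and_suggest(words, known_words, threshold=0.6):
    """Check low-confidence words against a wordlist.

    words: list of (token, confidence) pairs from a speech recognizer
           (assumed output format, confidence in 0..1).
    known_words: lowercase wordlist, e.g. names the waifu should know.
    Returns a list of (token, confidence, suggestion_or_None).
    """
    results = []
    for token, confidence in words:
        suggestion = None
        if confidence < threshold:
            matches = difflib.get_close_matches(
                token.lower(), known_words, n=1, cutoff=0.6
            )
            # Only suggest something if it differs from what was heard.
            if matches and matches[0] != token.lower():
                suggestion = matches[0]
        results.append((token, confidence, suggestion))
    return results
```

A low-confidence word with no close match in any list would then be a candidate for a made up word or a new name.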

Chobitsu 08/30/2023 (Wed) 00:30:01 No.25064

>>24951 >- maybe using an language model for anticipation of what might be said, while using parts of a sentence, especially with some context e.g. pointing at something I would anticipate this should at the least provide greater odds of a coherent parse (particularly in a noisy environment) than just STT alone. Good thinking Anon.
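A toy version of that idea: if the recognizer can return several candidate transcriptions, even a tiny bigram model trained on sentences the waifu is likely to hear can pick the more plausible one. Everything here (the function names, add-one smoothing, the example sentences) is an illustrative assumption; a real setup would use a proper language model and combine its score with the acoustic score.

```python
import math
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams over lowercase, whitespace-split sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_score(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.lower().split()
    vocab = len(unigrams) + 1  # +1 leaves room for unseen words
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
    return total


def pick_best(candidates, unigrams, bigrams):
    """Return the candidate transcription the bigram model finds most likely."""
    return max(candidates, key=lambda c: log_score(c, unigrams, bigrams))
```

Note that this raw score favors shorter candidates, so it is only a fair comparison between hypotheses of similar length.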

NoidoDev ##eCt7e4 08/30/2023 (Wed) 19:39:15 No.25075

Related: >>25073 >VALL-E X is an amazing multilingual text-to-speech (TTS) model proposed by Microsoft. While Microsoft initially publish in their research paper, they did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge to reproduce the results and train our own model. We are glad to share our trained VALL-E X model with the community, allowing everyone to experience the power next-generation TTS https://github.com/Plachtaa/VALL-E-X https://huggingface.co/spaces/Plachta/VALL-E-X

01 09/02/2023 (Sat) 01:36:54 No.25096

>>25075 also worth noting that : its broken if you launch it thru "python -X utf8 launch-ui.py" command and let install "vallex-checkpoint.pt" and whisper "medium.pt" models on its own, very weird as its already solved here : https://github.com/Plachtaa/VALL-E-X#install-with-pip-recommended-with-python-310-cuda-117--120-pytorch-20 download them manually, thats it.

NoidoDev ##pTGTWW 09/02/2023 (Sat) 08:39:14 No.25100

>>25075 >>25096 Thanks. This will be very useful.

NoidoDev ##pTGTWW 10/06/2023 (Fri) 01:45:39 No.25805

There's some excitement around a Discord server being removed, which was working on AI voice models. We might even not have known about it (I didn't), but here's the website: https://voice-models.com https://docs.google.com/spreadsheets/d/1tAUaQrEHYgRsm1Lvrnj14HFHDwJWl0Bd9x0QePewNco/edit#gid=1227575351 and weights.gg (not voice models) >AI Hub discord just got removed from my server list But it seems to be only a fraction of the models. Some mention a IIRC backup: https://www.reddit.com/r/generativeAI/comments/16zzuh4/ai_hub_discord_just_got_removed_from_my_server/

Chobitsu 10/06/2023 (Fri) 05:23:49 No.25809

>>25805 >I WARNED YOU ABOUT THE DOXXCORD STAIRS BRO Save.everything. Doxxcord is even more deeply-controlled than G*ogle is. DMCAs don't result in a forum getting disappear'd.

NoidoDev ##pTGTWW 10/10/2023 (Tue) 00:53:57 No.25876

>Otamatone https://youtu.be/Y_ILdh1K0Fk Found here, related: >>25273

Chobitsu 10/10/2023 (Tue) 19:44:00 No.25893

>>25876 Had no idea that was a real thing NoidoDev, thanks! Any chance it's opensauce?

NoidoDev ##pTGTWW 10/11/2023 (Wed) 00:59:42 No.25909

>>25893 The original belongs to a corporation, but if you look for "Otamatone DIY" you can find some variants.

Chobitsu 10/11/2023 (Wed) 18:04:22 No.25924

>>25909 Cool. Thank you NoidoDev! :^)

NoidoDev ##pTGTWW 10/11/2023 (Wed) 19:32:01 No.25931

>>17474 Can we get this with time stamps? So we can use it for voice training (text to speech).

NoidoDev ##pTGTWW 11/20/2023 (Mon) 06:33:31 No.26511

>ⓍTTS is a Voice generation model that lets you clone voices into different languages by using just a quick 6-second audio clip. There is no need for an excessive amount of training data that spans countless hours. https://huggingface.co/coqui/XTTS-v2 (only non-commercial licence) Testing Space: https://huggingface.co/spaces/coqui/voice-chat-with-mistral Via https://www.reddit.com/r/LocalLLaMA/comments/17yzr6l/coquiai_ttsv2_is_so_cool/ (seems to be much closer to the ElevenLabs quality)

01 11/20/2023 (Mon) 07:52:43 No.26512

>>26511 also this one https://github.com/yl4579/StyleTTS2 some people claim its 100x faster than coqui's xtts. still no webui tho :(

NoidoDev ##pTGTWW 11/21/2023 (Tue) 10:33:41 No.26535

>>26512 Thanks, I saw this mentioned but forgot to look it up.

01 11/24/2023 (Fri) 17:01:15 No.26566

>>26512 tested it locally, rtx 3070. works fast as fuck. https://files.catbox.moe/ow0ryz.mp4

Chobitsu 11/25/2023 (Sat) 10:11:46 No.26572

>>26535 >>26566 Thanks Anons. :^)

OpenVoice Robowaifu Technician 01/04/2024 (Thu) 08:08:39 No.27995

New zero-shot voice cloning model just dropped. Examples: https://research.myshell.ai/open-voice Github: https://github.com/myshell-ai/OpenVoice Notebook: https://github.com/camenduru/OpenVoice-colab#-colab Paper: https://arxiv.org/abs/2312.01479

Chobitsu 01/05/2024 (Fri) 09:37:20 No.28020

>>27995 REALLY impressive Anon, thanks!

MetaVoice 1B NoidoDev ##pTGTWW 02/09/2024 (Fri) 12:25:29 No.29257

>MetaVoice 1B - The new TTS and Voice cloning open source model Colab: https://drp.li/7RUPU MetaVoice Online Demo - https://ttsdemo.themetavoice.xyz/ https://huggingface.co/metavoiceio https://youtu.be/Y_k3bHPcPTo Not as good as proprietary models.

Chobitsu 02/12/2024 (Mon) 06:13:10 No.29369

>>29257 >Not as good as proprietary models. Ehh, they'll get better with time, no doubt. Thanks Anon! Cheers. :^)

Bringing Whisper and LLaMA to the masses NoidoDev ##pTGTWW 02/12/2024 (Mon) 18:30:36 No.29415

>This week we’re talking with Georgi Gerganov about his work on Whisper.cpp and llama.cpp. Georgi first crossed our radar with whisper.cpp, his port of OpenAI’s Whisper model in C and C++. Whisper is a speech recognition model enabling audio transcription and translation. Something we’re paying close attention to here at Changelog, for obvious reasons. Between the invite and the show’s recording, he had a new hit project on his hands: llama.cpp. This is a port of Facebook’s LLaMA model in C and C++. Whisper.cpp made a splash, but llama.cpp is growing in GitHub stars faster than Stable Diffusion did, which was a rocket ship itself. https://changelog.com/podcast/532 Some takeaways: Whisper didn't do speaker identification (diarization) when they published this on March 22, 2023, and it seems to be hard to find something doing that. But they said people set up their own pipelines for doing this and Whisper might get there as well. I found this on the topic by briefly searching, it still doesn't seem to be covered in some easy way: >How to use OpenAIs Whisper to transcribe and diarize audio files https://github.com/lablab-ai/Whisper-transcription_and_diarization-speaker-identification- Discussion on this: https://huggingface.co/spaces/openai/whisper/discussions/4 Azure AI services seem to be able to do it, but this doesn't help us much. Well, I mean for using it as a tool to extract voice files for training it's one thing, but we also need it as a skill for our waifus: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speaker-recognition?tabs=script&pivots=programming-language-cpp
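The "own pipelines" people set up usually come down to running a transcriber and a separate diarizer, then merging the two by timestamps. A minimal sketch of just that merging step, assuming the transcript comes as segments with start/end/text in seconds and the diarizer gives (start, end, speaker) turns; both input formats and the function names are assumptions for illustration, and neither model is called here.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(transcript_segments, speaker_turns):
    """Label each transcript segment with the speaker it overlaps most.

    transcript_segments: list of dicts with "start", "end", "text".
    speaker_turns: list of (start, end, speaker_label) tuples.
    Segments that overlap no turn are labeled "unknown".
    """
    labeled = []
    for seg in transcript_segments:
        best_label, best_overlap = "unknown", 0.0
        for turn_start, turn_end, label in speaker_turns:
            shared = overlap(seg["start"], seg["end"], turn_start, turn_end)
            if shared > best_overlap:
                best_label, best_overlap = label, shared
        labeled.append({**seg, "speaker": best_label})
    return labeled
```

For extracting training clips, the labeled segments can then be filtered by speaker and cut from the audio using their start and end times.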

Chobitsu 02/13/2024 (Tue) 19:08:31 No.29440

>>29415 Very nice. Thanks NoidoDev! I have a lot of respect for Gerganov. He very-clearly understands the issues of latency in a systems development context. Exactly the kinds of expertise vital for success to /robowaifu/ and our affiliated cadres in the end. Cheers. :^)

State of AI for Speech and Audio NoidoDev 02/26/2024 (Mon) 12:54:39 No.29956

>Data Exchange Podcast 198 - Sep 21, 2023 Overview over everything related to speech. https://www.youtu.be/w4DULuvgO1Y Yishay Carmiel is the CEO of Meaning, a startup at the forefront of building real-time speech applications for enterprises. Episode Notes: https://thedataexchange.media/state-of-ai-for-speech-and-audio >Sections Generative AI for Audio (text-to-speech; text-to-music; speech synthesis) - 00:00:44 Speech Translation - 00:09:44 Automatic Speech Recognition and other models that use audio inputs - 00:13:16 Speech Emotion Recognition - 00:19:55 Restoration - 00:21:55 Similarities in recent trends in NLP and Speech - 00:24:23 Diarization (speaker identification), and implementation challenges - 00:29:47 Voice cloning and risk mitigation - 00:35:36

Robowaifu Technician 03/18/2024 (Mon) 16:52:22 No.30390

There are some Japanese open source programs for speech synthesis such as VOICEVOX, though I should mention that if you use these voices they will have funny accents when you make them speak English, which can be kinda cute sometimes. https://voicevox.hiroshiba.jp And TALQu, but it is only for Windows. https://booth.pm/ja/items/2755336 NNSVS is for singing, also open source. https://nnsvs.github.io SociallyIneptWeeb used VOICEVOX for an AI waifu before and detailed what he did https://www.youtube.com/watch?v=bN5UaEkIPGM&t=674s
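For anyone who wants to script VOICEVOX instead of clicking through the GUI: the engine exposes a local HTTP interface. The sketch below is written from memory of the engine's README, so treat the address (127.0.0.1:50021), the two endpoints (audio_query, then synthesis) and the speaker id as assumptions to check against your installed version. Only the Python standard library is used.

```python
import json
import urllib.parse
import urllib.request

# Assumed default address of a locally running VOICEVOX engine.
ENGINE = "http://127.0.0.1:50021"


def audio_query_url(text, speaker, base=ENGINE):
    """URL for step 1: turn text into a synthesis query (JSON)."""
    params = urllib.parse.urlencode({"text": text, "speaker": speaker})
    return f"{base}/audio_query?{params}"


def synthesis_url(speaker, base=ENGINE):
    """URL for step 2: turn the query JSON into WAV audio."""
    return f"{base}/synthesis?{urllib.parse.urlencode({'speaker': speaker})}"


def synthesize(text, speaker=1, base=ENGINE):
    """Return WAV bytes for the text. Needs the engine running locally."""
    query_request = urllib.request.Request(
        audio_query_url(text, speaker, base), data=b"", method="POST"
    )
    with urllib.request.urlopen(query_request) as response:
        query = json.load(response)
    synth_request = urllib.request.Request(
        synthesis_url(speaker, base),
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(synth_request) as response:
        return response.read()
```

Writing the returned bytes to a .wav file and playing it is then up to whatever audio player the system has. Everything stays on the local machine, which fits the offline principle of this board.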

NoidoDev ##pTGTWW 03/18/2024 (Mon) 22:07:21 No.30398

>>30390 Oh wow, this is really good. Thanks. >https://nnsvs.github.io >NNSVS >Neural network based singing voice synthesis library > GitHub: https://github.com/nnsvs/nnsvs > Paper: https://arxiv.org/abs/2210.15987 > Demo: https://r9y9.github.io/projects/nnsvs/ >Features > Open-source: NNSVS is fully open-source. You can create your own voicebanks with your dataset. > Multiple languages: NNSVS has been used for creating singing voice synthesis (SVS) systems for multiple languages by VocalSynth communities (8+ as far as I know). > Research friendly: NNSVS comes with reproducible Kaldi/ESPnet-style recipes. You can use NNSVS to create baseline systems for your research.

Robowaifu Technician 03/18/2024 (Mon) 23:02:26 No.30403

>>30398 Here is a site I found that writes some about it and has links to written tutorials. https://nnsvs.carrd.co/

NoidoDev ##pTGTWW 03/29/2024 (Fri) 22:15:22 No.30625

>VoiceCraft >>30614 Thanks, but it's about voice cloning again. I think what I really want are artificial voices which don't belong to anyone. Cloning has its use cases as well, but I don't need or want it for a robot wife. Also I don't need it to be too close to a human. To me the quality problem is a solved problem at this point, at least for robowaifus. I was certainly very impressed by the singing capabilities I saw and heard recently, see above >>30390

Robowaifu Technician 03/31/2024 (Sun) 02:54:56 No.30657

>>30625 If you aren't worried about human closeness there is a pretty simple TTS that sounds like old retro synthesized voices. Unfortunately I can't find a video that has the female voice. https://github.com/adafruit/Talkie

NoidoDev ##pTGTWW 03/31/2024 (Sun) 05:06:57 No.30664

>>30657 Thanks, but I didn't mean to go so extreme into the other direction. I just meant for our use case here, and in my opinion, the current state of the technology should be sufficient in terms of quality or it's at least close to it. Making it faster and run better on smaller devices would be good, though. For content creation it's another story, if we don't want to only have stories about robots.

Robowaifu Technician 04/26/2024 (Fri) 19:51:52 No.31027

I figure this might be the best place for this. I found a paper on lip syncing synthesized voices.

Robowaifu Technician 04/26/2024 (Fri) 19:59:13 No.31028

>>31027 For some reason the file didnt attach https://doi.org/10.1109/ROMAN.2010.5598656

Mechnomancer 04/27/2024 (Sat) 13:43:54 No.31038

>>31027 I've been thinking about designing something similar, now I'm totally gonna s̶t̶e̶a̶l̶ be inspired by this.

Grommet 04/28/2024 (Sun) 12:16:35 No.31049

>>31027 That is great. I mentioned doing something, sorta, the same with facial expressions. I believe this is the same sort of "framework" or idea. Here's the paper, Real-time lip synchronization between text-to-speech (TTS) system and robot mouth Well I can't upload it. I get an error saying,"Tor users can not upload files". What??????? Here's an address for the paper. https://sci-hub.ru/10.1109/roman.2010.5598656

Chobitsu 04/29/2024 (Mon) 03:03:04 No.31055

>>31049 >Well I can't upload it. I get an error saying,"Tor users can not upload files". What??????? Lol, welcome to my world! :D TBH I think Robbit must've disabled file posting by Torfags. I hope he changes that soon.

NoidoDev 07/11/2024 (Thu) 12:13:40 No.32169

Just wanted to mention, that Suno - the AI music creation model, is based on Bark - the speech generation model. They needed around two years from there to where we are now. I have a source, a video where this is mentioned, but this is also about a lot of other things. Just has been mentioned there.

Chobitsu 07/11/2024 (Thu) 13:59:40 No.32171

>>32169 Neat! That's an interesting heritage. Impressive results in a fairly short time, too. Thanks, NoidoDev. Cheers. :^)

NoidoDev 07/27/2024 (Sat) 17:57:30 No.32406

I think to make speech generation fast enough, we'll have to go from some AI voice to replicate the generated voice in mbrolla. I asked Claude about it. [Quote] To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA, you'll need to follow several steps. Here's a concise overview of the process: 1. Prepare your data: - Ensure your audio files are high-quality recordings - Verify that phonetic labels are accurate and aligned with the audio 2. Choose your synthesizer: - eSpeak: Better for rapid development, less natural-sounding - MBROLA: More natural-sounding, but requires more work 3. For eSpeak: - Create a voice file with phoneme definitions - Define pitch and timing rules - Add your voice file to eSpeak's voices directory 4. For MBROLA: - Create diphone database from your audio files - Define pitch and duration rules - Create a language definition file 5. Test and refine: - Use sample texts to test your voice - Adjust parameters for better quality This process can be complex and time-consuming. Would you like me to elaborate on any specific part of this process? Yes, there are tools available to help create voices for MBROLA. Here's a brief overview of some key tools: 1. MBR_PH: This is the official MBROLA database creation tool. It helps in creating diphone databases from labeled speech samples. 2. Praat: While not specifically for MBROLA, this phonetics software can be used for labeling and segmenting audio files, which is a crucial step in the process. 3. HTK (Hidden Markov Model Toolkit): Can be used for automatic segmentation of speech data. 4. Festvox: A voice building framework that can be used to create voices for various synthesizers, including MBROLA. 5. WinSnoori: A speech analysis tool that can be helpful in the voice creation process. These tools can significantly streamline the process, but creating a high-quality voice still requires considerable effort and expertise. 
Would you like more details on any of these tools or the specific steps they're used for? [/Quote]

NoidoDev 07/27/2024 (Sat) 17:58:08 No.32407

>>32406 The linguistic knowledge necessary for defining pitch and duration rules in speech synthesis encompasses several areas of linguistics. Here's an overview of the key areas: 1. Phonetics and Phonology: - Understanding of speech sounds and their features - Knowledge of phonemic inventories and allophonic variations - Familiarity with syllable structure and phonotactics 2. Prosody: - Understanding of stress patterns (lexical and sentential) - Knowledge of intonation patterns and their meanings - Familiarity with rhythm and timing in speech 3. Morphology: - Understanding how word structure affects pronunciation - Knowledge of morphophonemic alternations 4. Syntax: - Understanding how sentence structure affects prosody - Knowledge of phrase boundaries and their prosodic effects 5. Pragmatics: - Understanding how context and intent affect prosody - Knowledge of focus and emphasis in speech 6. Sociolinguistics: - Familiarity with dialect variations in prosody - Understanding of speaking styles and registers 7. Acoustic Phonetics: - Knowledge of acoustic correlates of prosodic features - Understanding of formant structures and transitions 8. Language-specific knowledge: - Deep understanding of the prosodic system of the target language - Familiarity with any unique features (e.g., tone in tonal languages) This knowledge helps in creating rules that accurately capture the nuances of natural speech in the target language. Would you like more information on any of these linguistic areas or how they specifically apply to speech synthesis?

NoidoDev ##pTGTWW 07/27/2024 (Sat) 18:32:26 No.32409

I think to make speech generation fast enough, we'll have to go from some AI voice to replicating the generated voice in MBROLA. I asked Claude about it.

To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA, you'll need to follow several steps. Here's a concise overview of the process:
1. Prepare your data:
- Ensure your audio files are high-quality recordings
- Verify that phonetic labels are accurate and aligned with the audio
2. Choose your synthesizer:
- eSpeak: Better for rapid development, less natural-sounding
- MBROLA: More natural-sounding, but requires more work
3. For eSpeak:
- Create a voice file with phoneme definitions
- Define pitch and timing rules
- Add your voice file to eSpeak's voices directory
4. For MBROLA:
- Create diphone database from your audio files
- Define pitch and duration rules
- Create a language definition file
5. Test and refine:
- Use sample texts to test your voice
- Adjust parameters for better quality
This process can be complex and time-consuming. Would you like me to elaborate on any specific part of this process?

Yes, there are tools available to help create voices for MBROLA. Here's a brief overview of some key tools:
1. MBR_PH: This is the official MBROLA database creation tool. It helps in creating diphone databases from labeled speech samples.
2. Praat: While not specifically for MBROLA, this phonetics software can be used for labeling and segmenting audio files, which is a crucial step in the process.
3. HTK (Hidden Markov Model Toolkit): Can be used for automatic segmentation of speech data.
4. Festvox: A voice building framework that can be used to create voices for various synthesizers, including MBROLA.
5. WinSnoori: A speech analysis tool that can be helpful in the voice creation process.
These tools can significantly streamline the process, but creating a high-quality voice still requires considerable effort and expertise.
Would you like more details on any of these tools or the specific steps they're used for?
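To give an idea of what the "pitch and duration rules" have to produce in the end: MBROLA itself is driven by a plain text .pho file with one line per phoneme, holding the phoneme symbol, a duration in milliseconds and optional (position in percent, pitch in Hz) pairs. A small sketch that writes such text, assuming you already have phonemes, durations and pitch targets from somewhere; the function name is made up, and the phoneme symbols in the usage example are placeholders that depend on the voice database.

```python
def to_pho(segments):
    """Render phoneme segments as MBROLA .pho text.

    segments: list of (phoneme, duration_ms, pitch_points), where
    pitch_points is a list of (position_percent, pitch_hz) pairs and may
    be empty. "_" is the usual symbol for silence.
    """
    lines = []
    for phoneme, duration_ms, pitch_points in segments:
        fields = [phoneme, str(int(duration_ms))]
        for position, pitch in pitch_points:
            fields += [str(int(position)), str(int(pitch))]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"
```

Usage would be something like writing `to_pho([("_", 100, []), ("h", 60, []), ("@", 80, [(0, 200), (100, 180)]), ("_", 100, [])])` to a file and passing it to the mbrola binary together with a voice database. Since this is just table lookups and text, it is cheap enough for small boards.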

NoidoDev ##pTGTWW 07/27/2024 (Sat) 18:33:09 No.32410

>>32409 The linguistic knowledge necessary for defining pitch and duration rules in speech synthesis encompasses several areas of linguistics. Here's an overview of the key areas:
1. Phonetics and Phonology:
- Understanding of speech sounds and their features
- Knowledge of phonemic inventories and allophonic variations
- Familiarity with syllable structure and phonotactics
2. Prosody:
- Understanding of stress patterns (lexical and sentential)
- Knowledge of intonation patterns and their meanings
- Familiarity with rhythm and timing in speech
3. Morphology:
- Understanding how word structure affects pronunciation
- Knowledge of morphophonemic alternations
4. Syntax:
- Understanding how sentence structure affects prosody
- Knowledge of phrase boundaries and their prosodic effects
5. Pragmatics:
- Understanding how context and intent affect prosody
- Knowledge of focus and emphasis in speech
6. Sociolinguistics:
- Familiarity with dialect variations in prosody
- Understanding of speaking styles and registers
7. Acoustic Phonetics:
- Knowledge of acoustic correlates of prosodic features
- Understanding of formant structures and transitions
8. Language-specific knowledge:
- Deep understanding of the prosodic system of the target language
- Familiarity with any unique features (e.g., tone in tonal languages)
This knowledge helps in creating rules that accurately capture the nuances of natural speech in the target language. Would you like more information on any of these linguistic areas or how they specifically apply to speech synthesis?

NoidoDev ##pTGTWW 07/27/2024 (Sat) 18:42:01 No.32411

>>32409 > To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA >This checkpoint leverages the pretrained checkpoint wav2vec2-large-lv60 and is fine-tuned on CommonVoice to recognize phonetic labels in multiple languages. https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft >Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained mode https://arxiv.org/abs/2109.11680

NoidoDev ##pTGTWW 07/27/2024 (Sat) 18:43:42 No.32412

>>32406 >>32407 Can be deleted. I was on another computer while using Claude.

Robowaifu Technician 07/27/2024 (Sat) 19:14:02 No.32413

>>32409 Ouuuf, I had to ask specifically:

Yes, Festival can indeed be easier to work with for voice creation compared to directly using MBROLA or eSpeak. Here's a brief overview of why Festival might be a more accessible option:
1. Comprehensive toolkit:
- Festival comes with a suite of tools designed for voice creation
- Includes utilities for data preparation, analysis, and synthesis
2. Festvox:
- A voice building framework that works with Festival
- Provides a structured approach to creating new voices
3. Higher-level abstractions:
- Allows you to work with higher-level linguistic concepts
- Can automatically handle some lower-level details
4. Built-in machine learning:
- Can use statistical methods to learn prosody models from data
- Reduces the need for manual rule creation
5. Extensive documentation:
- Well-documented processes for voice building
- Includes tutorials and example voices
6. Language support:
- Comes with tools for building voices in various languages
- Can handle different writing systems and phoneme sets
7. Integration:
- Can still use MBROLA voices if desired
- Allows for mixing different synthesis methods
While Festival still requires significant effort and linguistic knowledge, it provides more structure and automation in the voice creation process. Would you like more details on how to approach voice creation with Festival?

NoidoDev ##pTGTWW 07/27/2024 (Sat) 19:36:34 No.32414

>>32413 We should keep an eye on developments in case anyone tries to make that easier with AI. Or maybe we could try advocating for it. I might also give it a try myself, if I really have to, but I'm bad with patience and frustration. http://www.festvox.org/ https://github.com/festvox/festvox >For full details about voice building see the document itself http://festvox.org/bsv/

Chobitsu 07/27/2024 (Sat) 19:48:32 No.32416

>>32414 Great research effort NoidoDev, thanks! :^)

NoidoDev ##pTGTWW 01/16/2025 (Thu) 04:04:45 No.35678

>Kokoro TTS I just watched this video https://youtu.be/tl1wvZXlj0I about a new TTS model. It's named Kokoro https://huggingface.co/hexgrad/Kokoro-82M >Kokoro is a frontier TTS model for its size of 82 million parameters (text in/audio out). >On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. ... >Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio ... >Kokoro v0.19 is limited in some specific ways, due to its training set and/or architecture: > [Data] Lacks voice cloning capability, likely due to small <100h training set > [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes > [Data] Training dataset is mostly long-form reading and narration, not conversation > [Arch] At 82M params, Kokoro almost certainly falls to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash > [Data] Multilingual capability is architecturally feasible, but training data is mostly English What I found really interesting is, that it can blend voices. So you can create voices which won't sound like the standard ones.
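On the blending: as far as I understand it, each voice is stored as a numeric style vector, and a blend is just a weighted average of those vectors. The sketch below shows only that arithmetic on plain Python lists. How the vectors are loaded from the voice files and passed to the model is not shown, and the function name and the toy two-dimensional "voices" in the usage note are made up for illustration.

```python
def blend_voices(voices, weights):
    """Weighted average of voice style vectors.

    voices: list of equal-length float lists (stand-ins for the real
            voice embeddings, which are tensors in practice).
    weights: one non-negative weight per voice; they get normalized,
             so [3, 1] means 75% of the first voice and 25% of the second.
    """
    if not voices or len(voices) != len(weights):
        raise ValueError("need one weight per voice")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(voices[0])
    if any(len(v) != dim for v in voices):
        raise ValueError("all voices must have the same length")
    return [
        sum(w * v[i] for v, w in zip(voices, weights)) / total for i in range(dim)
    ]
```

For example, `blend_voices([[1.0, 0.0], [0.0, 1.0]], [3, 1])` gives a vector three quarters of the way towards the first voice. Sweeping the weights is an easy way to search for a voice that doesn't sound like any of the stock ones.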

Chobitsu 01/16/2025 (Thu) 08:15:02 No.35694

>>35678 Nice find, NoidoDev. Thanks! :^)

GreerTech 02/01/2025 (Sat) 15:02:38 No.36362

I used eSpeak and copy-and-paste to do a quick-and-dirty speech capability.

Chobitsu 02/02/2025 (Sun) 03:16:13 No.36377

>>36362 Can you please link some resources for everyone here how you did it? Like, the original sauces for things.

GreerTech 02/02/2025 (Sun) 03:35:33 No.36383

>>36377 From Offline AI Roleplay - A Guide to Simple Offline AI "Open eSpeak while you use the AI program (it may be labeled as TTSApp). Copy the outputs into eSpeak, and have it say the text. Both Jan and LM Studio have a dedicated copy button. The shortcut for pasting is Ctrl + V." eSpeak can be found here; https://espeak.sourceforge.net/

NoidoDev ##pTGTWW 02/02/2025 (Sun) 21:48:20 No.36417

>>36377 espeak is in every Linux repository. It's very well known but also basic. I sometimes use Speech Dispatcher (spd-say) with Festival >>4333 or Mbrola >>32411 >>32406, though Festival seems to be easier for voice creation: >>32413. Festival sounds better than espeak, imo. Especially with the right configuration (which I don't have anymore right now). This still sounds very robotic and makes errors in pronunciation, but it's okayish and better than espeak. Speech Dispatcher (spd-say prompt) is a kind of layer on top so you can switch the system underneath: https://wiki.archlinux.org/title/Speech_dispatcher
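Instead of copy and paste, the text can be handed to the speech program directly. A small sketch that builds the command line for either spd-say or espeak-ng and runs it if the program is installed. The helper names are made up; the flags used are -o (output module) and -r (rate, -100 to 100) for spd-say, and -s (speed in words per minute) for espeak-ng. Which output modules exist depends on your Speech Dispatcher setup.

```python
import shutil
import subprocess


def build_tts_command(text, backend="spd-say", module=None, rate=None):
    """Build the argv list for a command-line TTS program.

    backend: "spd-say" (Speech Dispatcher) or "espeak-ng".
    module: Speech Dispatcher output module, e.g. "festival" (spd-say only).
    rate: spd-say rate (-100..100) or espeak-ng speed in words per minute.
    """
    if backend == "spd-say":
        argv = ["spd-say", "--wait"]  # --wait blocks until speaking is done
        if module:
            argv += ["-o", module]
        if rate is not None:
            argv += ["-r", str(rate)]
    elif backend == "espeak-ng":
        argv = ["espeak-ng"]
        if rate is not None:
            argv += ["-s", str(rate)]
    else:
        raise ValueError(f"unknown backend: {backend}")
    return argv + [text]


def speak(text, **kwargs):
    """Speak the text; returns False if the backend is not installed."""
    argv = build_tts_command(text, **kwargs)
    if shutil.which(argv[0]) is None:
        return False
    subprocess.run(argv, check=True)
    return True
```

With this, the output of a local LLM can be piped straight into `speak(reply, module="festival")`, and swapping the voice system underneath only means changing the module name.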

GreerTech 02/02/2025 (Sun) 21:52:13 No.36419

>>36417 Festival definitely looks better for developers who are making a unique system >espeak is in every Linux repository. Yeah, but me and many others are on Windows

Robowaifu Technician 02/02/2025 (Sun) 22:10:59 No.36423

>>36419 isnt narrator in the accessibility settings just a builtin text-to-speech program

Chobitsu 02/03/2025 (Mon) 07:25:56 No.36440

>>36383 >>36417 Thanks for your help, Anons! :^)

GreerTech 02/03/2025 (Mon) 07:37:48 No.36443

>>36423 I tried that at first, but the problem is that as far as I know, it reads EVERYTHING on the screen.

NoidoDev ##pTGTWW 02/03/2025 (Mon) 12:29:42 No.36447

>>36419 >but me and many others are on Windows Okay, I assume this is for development, while the real system will more likely be Linux. Anyways, I don't know how this exactly works but I think you can use the embedded Linux in Windows or whatever this is, and I assume there's also a repository. WSL: https://learn.microsoft.com/en-us/windows/wsl/about

Chobitsu 02/03/2025 (Mon) 12:59:52 No.36451

>>36447 Yeah, WSL currently defaults to an Ubuntu 24 variant system + terminal. It's not a perfect match, but it's close enough in most respects. For example, I've been able to build & successfully run juCi++ [1][2][3] on it (this is a moderately complex & dependency-laden GTKMM-based GUI application, built from source) without any hiccups. This subsystem is very simple to set up, and I'll be happy to help anyone here who may be struggling to do so. Hopefully it can support Anon's other development needs, and if not then moving over to a full Linux system will be all the easier for them afterwards. Cheers. :^) --- 1. https://gitlab.com/cppit/jucipp/-/blob/master/docs/install.md#debianlinux-mintubuntu 2. here's a one-liner to copypasta onto your new Ubuntu terminal for installing all its dependencies:

sudo apt-get install libclang-dev liblldb-dev || sudo apt-get install libclang-6.0-dev liblldb-6.0-dev || sudo apt-get install libclang-4.0-dev liblldb-4.0-dev || sudo apt-get install libclang-3.8-dev liblldb-3.8-dev; sudo apt-get install universal-ctags || sudo apt-get install exuberant-ctags; sudo apt-get install git cmake make g++ clang-format pkg-config libboost-filesystem-dev libboost-serialization-dev libgtksourceviewmm-3.0-dev aspell-en libaspell-dev libgit2-dev

Then just follow the rest of the install instructions from the link above (ie, git clone --recursive https://gitlab.com/cppit/jucipp , etc.) 3. WSL is also a great platform for Windows users to build & run BUMP for archiving this board, btw ( >>14866 ). >=== -prose edit -add footnote/hotlink -add dependencies/BUMP footnotes

Edited last time by Chobitsu on 02/03/2025 (Mon) 15:36:01.

Chobitsu 02/28/2025 (Fri) 17:10:12 No.37216

>>37207 Neat! That's an encouraging development, Barf. Thanks & cheers. :^) >=== -minor edit

Edited last time by Chobitsu on 02/28/2025 (Fri) 18:04:34.

GreerTech 03/07/2025 (Fri) 17:47:31 No.37361

Good news everybody! I was able to get AndroidOS Voice Control to work with ChatterUI, so now I can chat with AI naturally. As per the principles of >>10000, I tested it on Airplane Mode, and it still works.

Chobitsu 03/08/2025 (Sat) 12:21:14 No.37373

>>37361 That's pretty neat. Thanks for the update, Anon. Good luck with your research, GreerTech!

NoidoDev ##pTGTWW 03/11/2025 (Tue) 02:41:52 No.37427

>>4693 >>21542 So, UTAU software, which is also used by Adachi Rei (Missile39's voice bank for his orange haired robot) is actually available with an English interface. The program is made for WinXP, but I personally would use it with Wine/Bottles anyways. It runs on old computers (Pentium 4). https://utau-synth.com/ http://utau.wikidot.com/utau >UTAU is musical voice synthesis software created by Ameya/Ayame. Though its purpose is similar to the Vocaloid software created by Yamaha, UTAU functions natively as a stand-alone program on both PC and Mac OSX platforms without requiring a VSTi plugin for third-party software (see: UTAU-Synth) and is licensed as freeware. The PC version also has a shareware option through donation [1]. > UTAU was developed from the concept of Jinriki ("manual") Vocaloid, a practice in which existing voice samples are spliced, re-assembled and pitched to create a singing voice in software such as Melodyne. Ameya released the initial version of UTAU in March 2008. On May 27th, 2011, the first beta version of UTAU-Synth for Mac was released. > The software comes pre-loaded with a robotic-sounding young female voice, Utane Uta (known as Defoko), created from the AquesTalk TTS program [2]. Unlike Vocaloid, however, the UTAU engine can openly accept any manner of .wav files, allowing its users to create their own "voicebanks" and distribute them online.

GreerTech 03/13/2025 (Thu) 20:15:30 No.37479

Several years ago, I remember seeing an Alexa ad* that mentioned that it had "far-field" technology, so you wouldn't have to shout and speak slowly like you had to do with other voice-command technology. I got curious and researched it for myself. Turns out, it was just several microphones on an array, presumably with software that combines and compares the inputs to reinforce what it has heard. Probably not too difficult for a skilled coder. https://developer.amazon.com/en-US/blogs/alexa/post/80facfd2-1176-4c4f-94ac-4c5c781011ca/amazon-alexa-premium-far-field-voice-development-ki https://www.hackster.io/news/an-alexa-development-kit-from-amazon-5dae5522ddc *the one ad with the family that was a meme back in the late 2010s.
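The "combine and compare the inputs" idea can be sketched as delay-and-sum beamforming. This is only a guess at the general technique, not Amazon's actual implementation. The function names, the mic delays, and the noise level below are all made up for illustration, and the per-mic delays are assumed to be known in advance (a real array would have to estimate them).

```python
import math
import random

def delay_and_sum(channels, delays):
    """Shift each mic channel by its (assumed known) sample delay, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   delays[i] = number of samples later the sound reaches mic i.
    """
    usable = len(channels[0]) - max(delays)
    out = []
    for t in range(usable):
        # ch[t + d] lines every channel up on the same moment of the source signal
        total = sum(ch[t + d] for ch, d in zip(channels, delays))
        out.append(total / len(channels))
    return out

def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

# Synthetic data: a 440 Hz tone reaching 4 hypothetical mics at different times,
# with independent noise added at each mic.
random.seed(0)
rate = 16000
n = 2000
delays = [0, 3, 6, 9]  # illustrative values only
length = n + max(delays)
clean = [math.sin(2 * math.pi * 440 * t / rate) for t in range(length)]

channels = []
for d in delays:
    ch = [(clean[t - d] if t >= d else 0.0) + random.gauss(0, 0.5)
          for t in range(length)]
    channels.append(ch)

out = delay_and_sum(channels, delays)

# Residual noise for one mic alone versus the combined output
noise_single = rms([channels[0][t] - clean[t] for t in range(n)])
noise_beam = rms([out[t] - clean[t] for t in range(n)])
print("single mic noise RMS:", round(noise_single, 3))
print("beamformed noise RMS:", round(noise_beam, 3))
```

The voice adds up in phase across the aligned channels while the independent noise partly cancels, so the combined output should come out cleaner than any single mic.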

NoidoDev ##pTGTWW 03/16/2025 (Sun) 00:52:04 No.37531

>>37479 Thanks, but I saw this quite often during the last few years. Some companies make such arrays of microphones. The question is just how to integrate that into a head. I could imagine having several small holes in the skull, but on the outside with a layer of silicone covering it. It would still be tricky to keep this safe from liquid entering. The most interesting to me would be a system where we can set up the mics in a skull on our own and then have a system adapt to it. The mics should be small but waterproof and not directly soldered to some board.

GreerTech 03/16/2025 (Sun) 12:09:39 No.37541

>>37531 Well, if the robot has a solid head, then you can put them on the top dome in a spherical array, with two on the sides to mimic nature. We can also put tiny microphones in other places, like the wrist or neck.

AI Vocal Expression and Emotion Robowaifu Technician 05/06/2025 (Tue) 20:44:48 No.38268

ALRIGHT LISTEN UP! I'm sick and tired of listening to AI voices that are always monotone, mundane and robotic-sounding, with no human-like intonation, especially when you talk to them. I know that AI voice has always been synonymous with robotic speech, but c'mon, is this the best that we can do to make our robowaifus speak? There hasn't been any significant improvement in this field throughout the years, which is a shame. I want my robowaifu to be able to realistically roleplay vocally and do ASMR. I think we need to build a new data model specifically for analysing human speech so that AI can learn how to speak with the right intonation together with realistic vocal expression. I'm really certain that AI emotion and a speech data model are very interrelated. So what are your thoughts on this?

GreerTech 05/06/2025 (Tue) 21:02:16 No.38269

>>38268 People are already trying to figure it out. It's definitely a case of "easier said than done". There's that one new open source TTS (sorry the name escapes me) that came out recently, you might want to try there.

Robowaifu Technician 05/07/2025 (Wed) 05:10:54 No.38285

>>38269 >new open source TTS This is not good enough. An AI waifu needs to learn what is funny. Somewhere between the lines as we speak, your waifu chuckles because something was funny. What is the intonation of a person who speaks when they are mad? What does a person sound like when they are explaining things? What does a person's voice sound like when they sigh, are hyped, etc.? The AI waifu needs to learn all of these intricate shades of feeling in how humans sound when they speak, by herself, before she can utter a word with the correct kind of tone. In other words, your waifu needs to train her emotion data bank so that she can converse in more meaningful tones of utterance.

GreerTech 05/07/2025 (Wed) 05:22:03 No.38286

>>38285 https://decrypt.co/316008/ai-model-scream-hysterically-terror They're working on it. Not to say you can't work on it yourself, but rather it's not a deliberate choice to leave out emotion. Also, you can do some tricks just by changing settings. I got Galatea to sing just by slightly lowering her speed. >pic related A monotone voice can actually be cute

Robowaifu Technician 05/07/2025 (Wed) 05:36:46 No.38287

>>38286 >A monotone voice can actually be cute Yes, but your waifu needs to be aware, in real time, of what kind of tone you are using as she listens to you speak, so that she can reply with the correct vocal intonation.

Chobitsu 05/07/2025 (Wed) 08:37:09 No.38289

>>38268 >>38285 >>38287 Lol. NYPA, Anon. OTOH, if you want to try solving this together with us here, that would be great! <---> I'm glad that you bring up this topic. I think we all instinctively know when a voice is uncanny-valley, but sometimes it can be hard to put into words. You've made a good start at it, Anon. Cheers. :^)

Chobitsu 05/07/2025 (Wed) 08:39:25 No.38290

>>38269 >It's definitely a case of "easier said than done". This. But I must admit, there has been some remarkable progress in this arena. Our own @Robowaifudev did some great work on this a few years ago. My ineptitude with getting Python to work properly filtered me, but he was pulling off some real vocal magic type stuff -- all locally IIRC.

Robowaifu Technician 05/29/2025 (Thu) 15:11:52 No.38778

> (audio LLM -related : >>38775 )