/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.

Speech Synthesis general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us, right?

en.wikipedia.org/wiki/Speech_synthesis
https://archive.is/xxMI4

research.spa.aalto.fi/publications/theses/lemmetty_mst/contents.html
https://archive.is/nQ6yt

The Tacotron project:

arxiv.org/abs/1703.10135
google.github.io/tacotron/
https://archive.is/PzKZd

No code available yet; hopefully they will release it.

github.com/google/tacotron/tree/master/demos
https://archive.is/gfKpg
>>10504 Sorry, been busy and haven't been active here lately. Updated the repo link: https://www.mediafire.com/file/vjz09k062m02qpi/2b_v1.pt/file
This model could be improved by training it without the pain sound effects. There are so many of them that they biased the model, which causes strange results sometimes when sentences start with A or H.
>>11474 Thanks! Wonderful to see you, I hope all your endeavors are going well Anon.
>>11474 Come join my doxcord server if you have time and PM me! Thanks for the model; you will likely see it used on the 2B "cosplay" waifu we may have in the game.
>>11480 The link is expired. What would you like to talk about? I don't have a lot to add. You can do some pretty interesting stuff with voice synthesis by adding other embeddings to the input embedding, such as for the character in a multi-character model, emphasis, emotion, pitch, speed, and ambiance (to utilize training samples with background noise). This is what Replica Studios has been doing: https://replicastudios.com/
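To sketch that conditioning idea in code, here's a toy PyTorch example of summing extra embeddings (speaker, emotion, etc.) into the text/phoneme embedding before the rest of the TTS encoder. All module names and dimensions here are made up for illustration, not Replica's or any particular repo's actual architecture:
import torch
import torch.nn as nn

class ConditionedTextEncoder(nn.Module):
    """Toy sketch: add per-utterance conditioning embeddings (speaker, emotion, ...)
    onto the symbol embedding, broadcast over the time axis."""
    def __init__(self, n_symbols=148, n_speakers=4, n_emotions=8, dim=256):
        super().__init__()
        self.symbol_emb = nn.Embedding(n_symbols, dim)
        self.speaker_emb = nn.Embedding(n_speakers, dim)
        self.emotion_emb = nn.Embedding(n_emotions, dim)

    def forward(self, symbol_ids, speaker_id, emotion_id):
        x = self.symbol_emb(symbol_ids)                      # (batch, time, dim)
        x = x + self.speaker_emb(speaker_id).unsqueeze(1)    # broadcast over time
        x = x + self.emotion_emb(emotion_id).unsqueeze(1)
        return x

# Example: one utterance of 5 symbols, spoken by speaker 2 with emotion 3.
enc = ConditionedTextEncoder()
out = enc(torch.tensor([[1, 5, 9, 2, 0]]), torch.tensor([2]), torch.tensor([3]))
print(out.shape)  # torch.Size([1, 5, 256])
The same pattern extends to continuous conditioning (pitch, speed, ambiance) by projecting scalars through a small linear layer instead of an embedding table.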
>>11522 If you are interested, I am looking for someone to take over the speech synthesis part of WaifuEngine. I got it working; however, working on it as a specialty takes me away from the rest of the application. For example, I want to train a new model using Glow-TTS but my time is limited, and I also have to work on the various other aspects of the project to get it off the ground. Right now our inference time using Tacotron2 isn't great unless you have a GPU. As for compensation, so far I have been giving away coffee money as we have little resources haha; if the project gets bigger and gets more funding, I'd be willing to help the project contributors out. https:// discord.gg/ gBKGNJrev4
>>11536 In August I'll have some time to work on TTS stuff and do some R&D. I recommend using FastPitch. It's just as good as Tacotron2 but 15x faster on the GPU and 2x faster on the CPU than Tacotron2 is on the GPU. It takes about a week to train on a toaster card and also already has stuff for detecting and changing the pitch and speed, which is essential to control for producing more expressive voices with extra input embeddings. https://fastpitch.github.io/
>>11550 I'd message you on Discord about this, but it could be useful info for the board. Essentially, I did use FastPitch originally; the issue is the teacher-student training methodology, where you have to use Tacotron to bootstrap and predict durations for alignment. When you don't do that and just fine-tune the LJSpeech model of FastPitch, it fails to predict the durations. We can definitely try this method and I'm open to it; I guess in my time crunch I didn't bother. I am optimizing for delivery so that we have a product people can use and enjoy. It should be very simple to update the models in the future; it would be a one-Python-script change based off my architecture.
>>11559 The 2B model I made was finetuned on the pretrained Tacotron2 model and only took about an hour. Automating preprocessing the training data won't be a big deal. And if a multi-speaker model is built for many different characters it would get faster and faster to finetune. I've been looking into Glow-TTS more and the automated duration and pitch prediction is a nice feature but the output quality seems even less expressive than Tacotron2. A key part of creating a cute female voice is having a large range in pitch variation. Also I've found a pretrained Tacotron2 model that uses IPA. It would be possible to train it on Japanese voices and make them talk in English, although it would take some extra time to adapt FastPitch to use IPA. Demo: https://stefantaubert.github.io/tacotron2/ GitHub: https://github.com/stefantaubert/tacotron2
Some other ideas I'd like to R&D for voice synthesis in the future:
- anti-aliasing ReLUs or replacing them with swish
- adding gated linear units
- replacing the convolution layers with deeper residual layers
- trying a 2-layer LSTM in Tacotron2
- adding ReZero to the FastPitch transformers so they can be deeper and train faster
- training with different hyperparameters to improve the quality
- using RL and human feedback to improve the quality
- using GANs to refine output like HiFiSinger
- outputting at a higher resolution and downsampling
>>11569 Thanks, but what's the point of this IPA? To let it talk correctly in other languages?
>Der Nordwind und die Sonne - German with American English accent
I can assure you: it doesn't work. Americans speaking German often (always) sound bad, but this is on a level of its own. Absolutely bizarre.
>>11571 Yeah, I live around Chinese with thick accents and this takes it to the next level, kek. That's not really the motivation for using IPA though. This pilot study used transfer learning to intentionally create different accents, rather than copy the voice without the accent. Where IPA is useful for generating waifu voices is that it helps improve pronunciation, reduces the training data needed, and solves the problem of heteronyms, words spelled the same but pronounced differently: https://jakubmarian.com/english-words-spelled-the-same-but-pronounced-differently/ When models without IPA have never seen a rare word in training, such as a technical word like synthesis, they will usually guess its pronunciation incorrectly, but with IPA the pronunciation is always explicit and the model can speak the word fluently without ever having seen it before. Also, in a multi-speaker model you can blend between speaker embeddings to create a new voice, and it's possible to find interpretable directions in latent space. Finding one for accents should be possible, which could be left to the user's preference to make a character voice sound more American, British, Japanese and so on.
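To make the blending idea concrete, a minimal sketch; the embedding tensors here are placeholders standing in for rows of a trained multi-speaker model's speaker table, and the "accent direction" is assumed to have been found separately:
import torch

def blend_speakers(emb_a: torch.Tensor, emb_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linearly interpolate between two speaker embeddings.
    alpha=0 gives speaker A, alpha=1 gives speaker B."""
    return (1.0 - alpha) * emb_a + alpha * emb_b

# Placeholder embeddings; a real model would supply these.
speaker_a = torch.randn(256)
speaker_b = torch.randn(256)
new_voice = blend_speakers(speaker_a, speaker_b, alpha=0.3)

# A latent "accent direction", if one is found, could be applied the same way:
# new_voice = new_voice + strength * accent_direction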
>>11577 Ah, okay, this sounds pretty useful. One more problem comes to mind in regard to this. In English, foreign names are often changed in pronunciation because the name would sound "strange" otherwise. The philosopher Kant would sound like the c-word for female private parts, therefore they pronounce it Kaant. I wonder if the method helps with that as well.
>>11582 In that case it depends what language you transliterate with. If necessary, names could be transliterated as they're supposed to be pronounced in their original language, or it could all be in the same language. Exceptions could also be defined. For example, the way Americans pronounce manga is quite different from the Japanese. If someone wants their waifu to sound more like a weeb and pronounce it the Japanese way, they could enter the Japanese IPA definition for it to override the default transliteration.
Open file (18.23 KB 575x368 preview.png)
Open file (62.21 KB 912x423 aegisub.png)
Finished creating a tool for automatically downloading subtitles and audio clips from Youtube videos, which can be reworked in Aegisub or another subtitle editor, then converted into a training set with Clipchan. https://gitlab.com/robowaifudev/alisub
>>11623 This sounds exciting Anon, thanks! >or another subtitle editor Can you recommend a good alternative Anon? I've never been able to successfully get Aegisub to run.
>>11624 Someone recommended SubtitleEdit but it's Windows only: https://nikse.dk/SubtitleEdit Subtitle Editor can display waveforms but it's far more difficult to use and I don't recommend it.
>>11623 Okay, thanks. This could be useful for more than that, I guess. Maybe later for training the system on lip reading using YouTube, for example. Or maybe for training voice recognition in the first place? How much data do we need to emulate a particular voice?
>>11625 OK, thanks for the advice. I'll try and see if I can set it up in a VirtualBox instead or something. Aegisub did look pretty easy to use (first time I've seen it in action, so thanks again). The problem is always a wxWidgets dependency-hell issue; I can even get it to build right up to link time.
>>11631 For finetuning a pretrained model you need about 20 minutes. Training a model from scratch takes about 12 hours. Multi-speaker models trained on hundreds of voices can clone a voice with a few sentences but still need a lot of samples to capture all the nuances.
Been doing some work to get WaifuEngine's speech synthesis to run fast on the CPU and found that FastPitch has a real-time factor of 40x and WaveGlow 0.4x. This led me to test several different vocoder alternatives to WaveGlow and arrive at multi-band MelGAN with an RTF of 20x. So FastPitch+MelGAN has an RTF of about 12x, which means it can synthesize 12 seconds of speech every second, or roughly 80ms to generate a second of speech. "Advancing robotics to a point where anime catgirl meidos in tiny miniskirts are a reality" took MelGAN 250ms on CPU to generate from 2B's Tacotron2 Mel spectrogram. Now I just gotta set up this shit so it's easy to train end-to-end and the whole internet and their waifus are getting real-time waifus.
Multi-band MelGAN repo: https://github.com/rishikksh20/melgan
Multi-band MelGAN paper: https://arxiv.org/abs/2005.05106
Original MelGAN paper: https://arxiv.org/abs/1910.06711
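For anyone wondering how the combined number falls out of the per-stage RTFs, a quick back-of-the-envelope check (treating the acoustic model and vocoder as running sequentially; it lands around 13x, in the same ballpark as the ~12x measured above):
# Real-time factor = seconds of audio produced per second of compute.
# Sequential pipeline: times per audio-second add, so the combined RTF is the
# harmonic-style combination of the stage RTFs.
def combined_rtf(*stage_rtfs):
    return 1.0 / sum(1.0 / r for r in stage_rtfs)

print(combined_rtf(40, 0.4))  # FastPitch + WaveGlow  -> ~0.4x, not real-time on CPU
print(combined_rtf(40, 20))   # FastPitch + MB-MelGAN -> ~13.3x, ~75-85 ms per second of speech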
>>11636 Interesting, thanks, but I meant how many samples we need to fine-tune a voice. I also wonder if voices can be 'blended' that way. Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress.
>>11647 Thanks for your work. I thought voice generation would take much more time. Good to know. Responses to someone talking should be fast.
Open file (153.25 KB 710x710 gawr kilcher.jpg)
>>11648 I meant 20 minutes and 12 hours of samples. Finetuning with 20 minutes of samples takes about 1-2 hours on my budget GPU.
>Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress.
This definitely deserves more thought. Once every person on the internet is able to do speech synthesis and there is a tsunami of voice-cloned characters, it's important people are able to have creative freedom with it while the buzz is on. People's curiosity will further advance speech synthesis and diffuse into other areas of AI, including waifu tech. On the other hand, if people only straight-up copy voices then it would cause a media shitstorm and possibly turn people away, but that could also have its benefits. Whatever happens though, the accelerator is stuck to the floor. In the meantime while the hype builds, iteration can continue until the synthesis of Gawr Kilcher is realized. When people look closely though, they'll notice it's neither Yannic nor Gura but actually Rimuru and Stunk all along.
>>11647 Thanks for the information, Anon.
>>11650 kek. i just noticed that logo. i wonder what based-boomer AJ would think of robowaifus. white race genocide, or crushing blow to feminazis and freedom to all men from oppression?
>>11677 He doesn't like them or AI in general. Said something once like people are going to stop having kids and masturbate with a piece of plastic all day and how the government is going to know everything about people through them and be able to manipulate them perfectly. He's not really wrong. Look how many people already give up all their data using Windows and Chrome.
>>8151 >>12193 A routine check on the Insights->Traffic page led me here. While the program itself is written with Qt, what actually makes the voices work (Voice.h and beyond) does not contain a single trace of Qt (well, almost, but what little there is is just error boxes). This is a deliberate design decision to allow the actual inference engine to be copied and ported anywhere with minimal trouble. For inference on embedded devices you probably want to use TFLite, which is on my list because I plan on Windows SAPI integration.
>>12257 Hello Anon, welcome. We're glad you're here. Thanks for any technical explanations, we have a number of engineers here. Please have a look around the board while you're here. If you have any questions, feel free to make a post on our current /meta thread (>>8492). If you decide you'd like to introduce yourself more fully, then we have an embassy thread for just that (>>2823). Regardless, thanks for stopping by!
In need of some help... I want to create a speech synthesizer: I want to take samples of my waifu's voice (which I have a lot of) and use them to digitally create her voice. First of all, is it possible? The voice samples I have are not the kind that this video shows https://youtu.be/_d7xRj121bs?t=55 , they're just in-game dialog. It is also worth noting that the voice is in Japanese. If it is possible, I still have no idea where to begin with this; I'm guessing I'll need some sound tech knowledge (which I have none of) and that's about all I can think of. In terms of programming languages, I know Python fairly well and am currently getting into C++. Anons, how do I get started with this?
>>13811
>I still have no idea where to begin with this
Welcome. Then look through the thread and into the programs mentioned. You will probably need to train some neural network on a GPU. Also, you would need to extract the voices from the game and have the spoken lines as text as well. If you can't get them as files, then you might need to record them with a microphone. Then you would need to transcribe the text. Lurk around, maybe someone else knows more, and just ignore the disgusting troll insulting everyone.
>>13811 As dumb as this might sound, you might want to check out /mlp/ on 4chan; there are 100+ threads about doing this with My Little Pony characters, called the "Pony Preservation Project", and they've actually made some decent progress.
>>13818
>look through the thread and into the programs mentioned
Will do.
>probably need to train some neural network on a GPU
Have yet to get into neural networks but looks like the time has come.
>extract the voices
I've done that, from the game files too so they're of decently high quality.
>transcribe the text
That I need to do.
>>13821
>as dumb as this might sound
Nothing dumb about it if it works. Will give them a visit. Thank you, Anons!
>>13823 Keep us updated if there's progress, Anon; speech synthesis is a fascinating field. I'd love to try it out myself later in the year once I have more time.
This may be old news, since it's from 2018, but Google's Duplex seems to have a great grasp on conversational speech. I think it says a lot when I had an easier time understanding the robot versus the lady at the restaurant (2nd audio example in the blog). https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html
>>14270 Hi, I knew that this had been mentioned before somewhere, but didn't find it here in this thread nor with Waifusearch. Anyways, it's in the wrong thread, since this one is about speech synthesis but the article is about speech recognition. The former conversation probably happened in the chatbot thread.
>One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains. It cannot carry out general conversations.
This is exactly the interesting topic of the article. Good reminder. A few months or a year ago I pointed out that recognizing all kinds of words, sentences, and meanings will be one of our biggest challenges, especially if it should work with all kinds of voices. Some specialists (CMU Sphinx) claimed it would currently require a server farm with terabytes of RAM to do that, if it was even possible. We'll probably need a way to work around that, maybe using many constrained models on fast SSDs which take over depending on the topic of conversation. Let's also hope for some progress, but also accept that the first robowaifus might only understand certain commands.
>>11623 You should replace youtube-dl with yt-dlp. youtube-dl is no longer maintained and has issues with some YouTube videos.
>>15192 Thanks for the tip Anon. Having used youtube-dl for years now, I too noticed the sudden drop-off in updates that occurred following the coordinated attack by RIAA/Microsoft against its developer & user community. We'll look into it.
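If anyone wants a head start on the swap, here's a rough sketch using yt-dlp's Python API to grab the audio as WAV plus English subtitles in one go. The option keys mirror the youtube-dl ones yt-dlp inherited; the URL is a placeholder, and it's worth double-checking the keys against the yt-dlp README before relying on this:
from yt_dlp import YoutubeDL

# Best audio track extracted to WAV, plus manual and auto-generated English subs.
opts = {
    "format": "bestaudio/best",
    "outtmpl": "%(title)s.%(ext)s",
    "writesubtitles": True,
    "writeautomaticsub": True,
    "subtitleslangs": ["en"],
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}

with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=XXXXXXXXXXX"])  # placeholder URL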
Open file (73.10 KB 862x622 IPA_synthesis.png)
I think I've finally figured out a way to train more expressive voices in conversation without having to label a ton of data. First, the English text needs to be transcribed into IPA so that a speech synthesis model can easily predict how words are spoken without requiring a huge dataset covering all the exceptions and weirdness of English. The IPA transcription is projected into an embedding that's split into two parts: one part is constrained to represent the content as IPA, by projecting those features back into IPA symbols and minimizing the cross-entropy loss, while the other half models the style, such as the emotion and other subtleties, to match the audio examples more faithfully, and is trained through the Mel spectrogram loss. This way the model can learn all aspects of speech through just the text labels and audio examples alone. At inference time this style embedding could be modified to change the emotion, pitch, cadence, tone and other qualities of the model for voice acting or for creating examples to finetune the model towards a desired personality.
A ByT5 model could be used to transcribe English and other languages into the IPA embedding + style embedding. It could also take into account the previous context of the conversation to generate a more appropriate style embedding for the speech synthesis model to work from. Training from context, though, will require new datasets from podcasts that have such context. I've already collected some with existing transcripts and timestamps for this. The transcripts just need to be accurately aligned to the audio clips for clipping, so it's not an unfeasible project for one person to do.
Other possibilities include adding tags into the text training data that get filtered out from the content via the IPA cross-entropy loss, ensuring the tags only affect the style embedding. You could indicate tempo, pitches, velocity and note values for singing, which would be learned in the style embeddings. It could also be used for annotating different moods or speaking styles such as whispering or yelling. There's a ton of possibilities here for more versatile speech synthesis and natural conversation.
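Not claiming this is exactly the architecture you have in mind, but here's a rough PyTorch sketch of the split-embedding idea: the first half of the hidden state is pinned to IPA content via a cross-entropy reconstruction loss, the second half is left free to carry style. All dimensions, module names and the GRU encoder are assumptions for illustration:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentStyleEncoder(nn.Module):
    """Encode IPA symbols into a hidden state whose first half stays decodable
    back into IPA (content) while the second half is unconstrained (style)."""
    def __init__(self, n_ipa_symbols=128, dim=256):
        super().__init__()
        assert dim % 2 == 0
        self.embed = nn.Embedding(n_ipa_symbols, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.to_ipa = nn.Linear(dim // 2, n_ipa_symbols)  # decodes only the content half

    def forward(self, ipa_ids):
        h, _ = self.encoder(self.embed(ipa_ids))           # (batch, time, dim)
        content, style = h.chunk(2, dim=-1)                # split into two halves
        ipa_logits = self.to_ipa(content)                  # project content back to IPA symbols
        return content, style, ipa_logits

enc = ContentStyleEncoder()
ipa_ids = torch.randint(0, 128, (2, 17))                   # dummy batch of IPA token ids
content, style, ipa_logits = enc(ipa_ids)

# Content constraint: the content half must reconstruct the input IPA symbols.
content_loss = F.cross_entropy(ipa_logits.transpose(1, 2), ipa_ids)
# The full (content + style) sequence would feed the acoustic decoder, which is
# trained with a Mel spectrogram loss against the target audio (omitted here).
Tags for singing or mood would simply be extra input symbols that the cross-entropy term pushes out of the content half, leaving them to influence only the style half.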
>>15874 Pony Preservation Project anon here. I recommend checking out the presentations linked from the PPP threads, especially Cookie's segments.
derpy.me/pVeU0
derpy.me/Jwj8a
In short:
- Use Arpabet instead of IPA. It's much easier to get Arpabet data than IPA data, and Arpabet is good enough.
- Use BOTH Arpabet and English transcriptions. Each datapoint should contain one or the other for the transcription, and the dataset as a whole should contain both Arpabet and English transcriptions (see the sketch below).
- Use a natural language model to augment your data with emotion embeddings. The pony standard is to use DeepMoji embeddings. Some anon has used TinyBERT for supposedly better effect. I assume if you're using a language model like TinyBERT, you'd need to create a prompt that gets the network to associate an emotion with the text, then use the embedding for the token associated with that emotion prediction.
- Use HiFiGAN for the vocoder.
We've also found that text-to-speech isn't always suitable for waifu-speak. Sometimes (often), people want to be able to use a reference audio to get the prosody exactly right. For that, you'll want to use TalkNet2.
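As a tiny illustration of the "both Arpabet and English" point above, a sketch of randomly serving one or the other per training example; the dict layout and the curly-brace ARPAbet notation are just assumptions, adapt them to whatever your filelists actually look like:
import random

def pick_transcription(example: dict, arpabet_prob: float = 0.5) -> str:
    """Serve either the ARPAbet or the plain-English transcription for a training
    example, so the model learns to handle both input styles."""
    if example.get("arpabet") and random.random() < arpabet_prob:
        return example["arpabet"]
    return example["text"]

sample = {"text": "I brought you some tea.",
          "arpabet": "{AY1} {B R AO1 T} {Y UW1} {S AH1 M} {T IY1}."}
print(pick_transcription(sample))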
>>15192 >>15193 > (>>12247 youtube-dl takedown -related) > (>>16357 yt-dlp installation & scripting -related)
>>16606 You ponies have done some nice work through your primary overarching project Anon. Thanks for the recommendations! :^) Cheers.
Open file (63.51 KB 985x721 audioreq.png)
The idea is that the neural network should eat the same audio stream but with different parameters. For example, if the neural network does not recognize standard speech, then there is a chance of recognizing the same audio at "Pitch -1 _ Speed -1". Most likely this method has already been implemented and has long been used; if not, it seems to me it could solve the main difficulties in speech recognition, word understanding, etc.
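A quick sketch of that idea with librosa: run recognition on the original clip first, then on pitch- and speed-shifted copies if the first pass comes up empty. The recognizer itself is left abstract here (plug in whatever ASR you use), and the shift amounts are arbitrary:
import librosa

def recognition_candidates(path, recognize, sr=16000):
    """Try recognition on the original clip, then on pitch- and speed-shifted
    variants if the first pass returns nothing usable."""
    y, sr = librosa.load(path, sr=sr)
    variants = [
        y,
        librosa.effects.pitch_shift(y, sr=sr, n_steps=-1),  # "Pitch -1"
        librosa.effects.time_stretch(y, rate=0.9),           # roughly "Speed -1"
    ]
    for audio in variants:
        text = recognize(audio, sr)   # `recognize` is whatever ASR function you plug in
        if text:
            return text
    return None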
>>16664 I had similar thoughts, and my thinking was that we might profit from using specialized hardware (probably some ASIC) close to the cameras and ears, automatically creating different versions of the data. For audio it might be a DSP, but I don't know much about that. Filtering out certain frequencies or doing noise cancellation might also be helpful. Basically we need SBC-sized hardware which can do that very efficiently and quickly, outputting various versions from only a few inputs.
Would it be easier to go with an old school approach like this? https://m.youtube.com/watch?v=J_eRsA7ppy0
>>16664 This post and the response should be moved into a thread about audio recognition or conversational AI / chatbots, since this thread is about speech synthesis, not recognition.
>>16669 The techniques for the audio in there are studied now under phonetics. The techniques for the video in there are studied under articulatory synthesis. Articulatory synthesis is difficult and computationally expensive. I don't know of a good, flexible framework for doing that, so I wouldn't know how to get started on waifu speech with that. Under phonetics, the main techniques before deep neural networks were formant synthesis and concatenative synthesis. Formant synthesis will result in recognizable sounds, but not human voices. It's what you're hearing in the video. Concatenative synthesis requires huge diphone sound banks, which represent sound pieces that can be combined. (Phone = single stable sound. Diphone = adjacent pair of phones. A diphone sound bank cuts off each diphone at the midpoints of the phones since it's much easier to concatenate phones cleanly at the midpoints rather than the endpoints. This is what Hatsune Miku uses.) Concatenative synthesis is more efficient than deep neural networks, but deep neural networks are far, far more natural, controllable, and flexible. Seriously, I highly recommend following in the PPP's footsteps here. Deep neural networks are the best way forward. They can produce higher quality results with better controls and with less data than any other approach. Programmatically, they're also flexible enough to incorporate any advances you might see from phonetics and articulatory synthesis. The current deep neural networks for speech generation already borrow a lot of ideas from phonetics.
>>16684 Thanks for the advice Anon!
Open file (127.74 KB 1078x586 WER.png)
Open file (83.58 KB 903x456 models.png)
Open file (82.91 KB 911x620 languages.png)
https://github.com/openai/whisper
Oh shit, audio transcription has surpassed average human level and is now competitive with professional transcription. OpenAI has gone off its investor rails and completely open-sourced the model and weights. On top of that it's multilingual and can do Japanese fairly well. This could be used for transcribing audio from vtubers, audio books, and anime with missing subtitles. Unfortunately it doesn't do speaker detection as far as I know, but it might be possible to train another model to use the encoded audio features to detect speakers.
Install:
python -m pip install git+https://github.com/openai/whisper.git --user
Quick start:
import whisper
model = whisper.load_model("base", device="cuda")  # set device to "cpu" if no CUDA
result = model.transcribe("chobits_sample.mp3", language="en")  # multilingual models will automatically detect the language if none is given, but English-only models won't
print(result["text"])
Output (base):
> Yuzuki. I brought you some tea. Huh? Huh? Why are you serving the tea? The maid, Persecom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. Wee. I can handle this on my own. I want you to try to relax. Oh. Oh? Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all we really need is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motu-suwa, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! Huh? There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki.
Interestingly the VA actually said "persecom" instead of persocom and "Motusua" instead of Motosuwa, which transcribed as "Motu-suwa". The poor pronunciation of "all he really needs is a good night's sleep" sounded a lot like "all we really need is a good night's sleep" and was transcribed as such. The only other errors were transcribing a Chii processing sound effect as "wee", mistaking Minoru saying "ah!" as "huh?", the clatter of teacups being transcribed as "oh", and Minoru saying "ugh" as "oh?"
Output (small):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki?
"Ah! Huh?" from Minoru and Hideki were omitted. "Ugh" was also omitted when Minoru passes out. It understood persocom wasn't a name but still misspelled it "persicom". Chii's sound effect wasn't transcribed as "wee" this time. Motosuwa got transcribed as "Motosua". This model understood "all he really needs" but made a mistake at the end, thinking Hideki was asking a question when saying "Yuzuki".
Output (medium):
> Yuzuki! I brought you some tea. Ah! Huh? Why are you serving the tea? The maid, Persicom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki...
This one got the ellipsis right at the end and recognized Minoru saying "ah!" but mistook persocom as a name, "Persicom". "Ugh" was omitted.
Output (large):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki...
"Ah! Huh?" were omitted and it understood persocom wasn't a name but still spelled it as "persicom".
>>17474 (continued)
Output (tiny):
> Useuki. I brought you some tea. Ugh. Huh? Why are you serving the tea? The maid, Percicom, is currently being used by the system. What are you talking about? Useuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord, Minoro. Thank you very much. I can handle this on my own. I want you to try to relax. Oh. Minoro! Minoro! Minoro! Well, I'm glad to know that all we really need is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motu, so it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! Huh? There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoro under such extreme stress. If only I was just more useful. Let's use a key.
Tons of errors, not particularly usable.
Output (tiny.en):
> Yuzuki! I brought you some tea. Oh! Why are you serving the tea? The maid purse-a-com is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord, me no no. Thank you very much. I can handle this on my own. I want you to try to relax. Oh. Do you know who? What do you know her? What do you know her? Well, I'm glad to know that all he really needs is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motusua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minaro under such extreme stress. If only I was just more useful. Oh, Miss Yuzuki.
>Lord, me no no.
Japanese names and words confuse it. "Minoru" became "Do you know who?" and "Lord Minoru" became "What do you know her?" but it does decent on English and got "all he really needs" right but flubbed "but to be so exhausted" as "but it'd be so exhausted". Interestingly it got "Motusua" right the way she said it.
Output (base.en):
> Yuzuki! I brought you some tea. Ugh! What? Why are you serving the tea? The maid-pursa-com is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minaro, thank you very much. I can handle this on my own. I want you to try to relax. Oh. Minaro! Lord Minaro! Lord Minaro! Well, I'm glad to know that Allie really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motusua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minaro under such extreme stress. If only I was just more useful. Miss Yuzuki.
This one really messed up "all he really needs" as "Allie really needs" and understood "Minoru" as a name "Minaro". It also got "but to be so exhausted" right. Mistook "ugh" as "oh".
Output (small.en):
> Yuzuki! I brought you some tea. Ah! Huh? Why are you serving the tea? The maid persicum is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki.
>persicum
This one got Minoru spelled right and "all he really needs" and "but to be so exhausted". Omitted "ugh".
Output (medium.en):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he'd just collapse like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki?
Omitted "ah! huh?" and "ugh" but otherwise good.
Overall from just this sample I think using base is the best for English and tiny.en on CPU. The improvements in quality by small and medium aren't really worth the slowdown in speed, and the base.en model doesn't seem particularly robust. If going for a larger model, small.en seems better than small.
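Following up on using this for building training sets: transcribe() also returns per-segment timestamps in result["segments"], which could feed straight into a Clipchan-style clipping step. A minimal sketch; the segment fields come from the whisper repo's documented output, while the input filename, ffmpeg cutting and the LJSpeech-style "clip|text" filelist format are just one way to do it:
import subprocess
import whisper

model = whisper.load_model("base", device="cpu")
result = model.transcribe("vtuber_stream.mp3", language="en")

# Each segment has start/end times in seconds plus the transcribed text.
for i, seg in enumerate(result["segments"]):
    clip = f"clip_{i:04d}.wav"
    subprocess.run([
        "ffmpeg", "-y", "-i", "vtuber_stream.mp3",
        "-ss", str(seg["start"]), "-to", str(seg["end"]),
        "-ar", "22050", "-ac", "1", clip,
    ], check=True)
    with open("filelist.txt", "a", encoding="utf-8") as f:
        f.write(f"{clip}|{seg['text'].strip()}\n")
The resulting filelist could then be fed to Tacotron2/FastPitch training after a manual pass to throw out mis-transcribed clips.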
