/robowaifu/ - Speech Synthesis/Recognition general

State of AI for Speech and Audio NoidoDev 02/26/2024 (Mon) 12:54:39 No.29956

>Data Exchange Podcast 198 - Sep 21, 2023
Overview of everything related to speech. https://www.youtu.be/w4DULuvgO1Y
Yishay Carmiel is the CEO of Meaning, a startup at the forefront of building real-time speech applications for enterprises.
Episode Notes: https://thedataexchange.media/state-of-ai-for-speech-and-audio
>Sections
Generative AI for Audio (text-to-speech; text-to-music; speech synthesis) - 00:00:44
Speech Translation - 00:09:44
Automatic Speech Recognition and other models that use audio inputs - 00:13:16
Speech Emotion Recognition - 00:19:55
Restoration - 00:21:55
Similarities in recent trends in NLP and Speech - 00:24:23
Diarization (speaker identification), and implementation challenges - 00:29:47
Voice cloning and risk mitigation - 00:35:36

Robowaifu Technician 03/18/2024 (Mon) 16:52:22 No.30390

There are some Japanese open source programs for speech synthesis, such as VOICEVOX. I should mention that these voices have funny accents if you make them speak English, which can be kinda cute sometimes. https://voicevox.hiroshiba.jp
There's also TALQu, but it is Windows-only. https://booth.pm/ja/items/2755336
NNSVS is for singing and is also open source. https://nnsvs.github.io
SociallyIneptWeeb used VOICEVOX for an AI waifu before and detailed what he did: https://www.youtube.com/watch?v=bN5UaEkIPGM&t=674s

NoidoDev ##pTGTWW 03/18/2024 (Mon) 22:07:21 No.30398

>>30390 Oh wow, this is really good. Thanks.
>https://nnsvs.github.io
>NNSVS
>Neural network based singing voice synthesis library
> GitHub: https://github.com/nnsvs/nnsvs
> Paper: https://arxiv.org/abs/2210.15987
> Demo: https://r9y9.github.io/projects/nnsvs/
>Features
> Open-source: NNSVS is fully open-source. You can create your own voicebanks with your dataset.
> Multiple languages: NNSVS has been used for creating singing voice synthesis (SVS) systems for multiple languages by VocalSynth communities (8+ as far as I know).
> Research friendly: NNSVS comes with reproducible Kaldi/ESPnet-style recipes. You can use NNSVS to create baseline systems for your research.

Robowaifu Technician 03/18/2024 (Mon) 23:02:26 No.30403

>>30398 Here is a site I found that writes some about it and has links to written tutorials. https://nnsvs.carrd.co/

NoidoDev ##pTGTWW 03/29/2024 (Fri) 22:15:22 No.30625

>VoiceCraft >>30614 Thanks, but it's about voice cloning again. I think what I really want are artificial voices which don't belong to anyone. Cloning has its use cases as well, but I don't need or want it for a robot wife. Also, I don't need it to be too close to a human. To me the quality problem is a solved problem at this point, at least for robowaifus. I was certainly very impressed by the singing capabilities I saw and heard recently, see above >>30390

Robowaifu Technician 03/31/2024 (Sun) 02:54:56 No.30657

>>30625 If you aren't worried about human closeness, there is a pretty simple TTS that sounds like old retro synthesized voices. Unfortunately I can't find a video that has the female voice. https://github.com/adafruit/Talkie

NoidoDev ##pTGTWW 03/31/2024 (Sun) 05:06:57 No.30664

>>30657 Thanks, but I didn't mean to go to such an extreme in the other direction. I just meant that for our use case here, and in my opinion, the current state of the technology should be sufficient in terms of quality, or at least close to it. Making it faster and run better on smaller devices would be good, though. For content creation it's another story, if we don't want to only have stories about robots.

Robowaifu Technician 04/26/2024 (Fri) 19:51:52 No.31027

I figure this might be the best place for this. I found a paper on lip syncing synthesized voices.

Robowaifu Technician 04/26/2024 (Fri) 19:59:13 No.31028

>>31027 For some reason the file didn't attach. https://doi.org/10.1109/ROMAN.2010.5598656

Mechnomancer 04/27/2024 (Sat) 13:43:54 No.31038

>>31027 I've been thinking about designing something similar, now I'm totally gonna s̶t̶e̶a̶l̶ be inspired by this.

Grommet 04/28/2024 (Sun) 12:16:35 No.31049

>>31027 That is great. I mentioned doing something sort of similar with facial expressions. I believe this is the same sort of "framework" or idea. Here's the paper: "Real-time lip synchronization between text-to-speech (TTS) system and robot mouth". Well, I can't upload it. I get an error saying, "Tor users can not upload files". What??????? Here's an address for the paper: https://sci-hub.ru/10.1109/roman.2010.5598656

Chobitsu 04/29/2024 (Mon) 03:03:04 No.31055

>>31049 >Well I can't upload it. I get an error saying,"Tor users can not upload files". What??????? Lol, welcome to my world! :D TBH I think Robbit must've disabled file posting by Torfags. I hope he changes that soon.

NoidoDev 07/11/2024 (Thu) 12:13:40 No.32169

Just wanted to mention that Suno, the AI music creation model, is based on Bark, the speech generation model. It took them around two years to get from there to where we are now. My source is a video where this is mentioned, but the video covers a lot of other things; this only came up in passing.

Chobitsu 07/11/2024 (Thu) 13:59:40 No.32171

>>32169 Neat! That's an interesting heritage. Impressive results in a fairly short time, too. Thanks, NoidoDev. Cheers. :^)

NoidoDev ##pTGTWW 07/27/2024 (Sat) 18:32:26 No.32409

I think to make speech generation fast enough, we'll have to go from some AI voice to replicating the generated voice in MBROLA. I asked Claude about it:

To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA, you'll need to follow several steps. Here's a concise overview of the process:
1. Prepare your data:
 - Ensure your audio files are high-quality recordings
 - Verify that phonetic labels are accurate and aligned with the audio
2. Choose your synthesizer:
 - eSpeak: Better for rapid development, less natural-sounding
 - MBROLA: More natural-sounding, but requires more work
3. For eSpeak:
 - Create a voice file with phoneme definitions
 - Define pitch and timing rules
 - Add your voice file to eSpeak's voices directory
4. For MBROLA:
 - Create diphone database from your audio files
 - Define pitch and duration rules
 - Create a language definition file
5. Test and refine:
 - Use sample texts to test your voice
 - Adjust parameters for better quality
This process can be complex and time-consuming. Would you like me to elaborate on any specific part of this process?

Yes, there are tools available to help create voices for MBROLA. Here's a brief overview of some key tools:
1. MBR_PH: This is the official MBROLA database creation tool. It helps in creating diphone databases from labeled speech samples.
2. Praat: While not specifically for MBROLA, this phonetics software can be used for labeling and segmenting audio files, which is a crucial step in the process.
3. HTK (Hidden Markov Model Toolkit): Can be used for automatic segmentation of speech data.
4. Festvox: A voice building framework that can be used to create voices for various synthesizers, including MBROLA.
5. WinSnoori: A speech analysis tool that can be helpful in the voice creation process.
These tools can significantly streamline the process, but creating a high-quality voice still requires considerable effort and expertise. Would you like more details on any of these tools or the specific steps they're used for?
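To make the "pitch and duration rules" step a bit more concrete: MBROLA itself reads a plain-text .pho file, where each line holds a phoneme name, a duration in milliseconds, and optional pairs of (position in percent, pitch in Hz). Below is a minimal sketch that only formats such lines. The phoneme symbols are placeholders (an assumption); a real voice defines its own inventory, and the function names are made up for this example.

```python
def pho_line(phoneme, duration_ms, pitch_points=()):
    """Format one .pho line: phoneme, duration in ms, then (percent, Hz) pairs."""
    parts = [phoneme, str(int(duration_ms))]
    for percent, hz in pitch_points:
        parts += [str(int(percent)), str(int(hz))]
    return " ".join(parts)


def to_pho(segments):
    """Join (phoneme, duration[, pitch_points]) tuples into .pho text."""
    return "\n".join(pho_line(*segment) for segment in segments) + "\n"


# Hypothetical phoneme symbols for "hello"; "_" stands for silence.
hello = [
    ("_", 100),
    ("h", 60),
    ("@", 70, [(50, 180)]),
    ("l", 60),
    ("@U", 200, [(0, 190), (100, 150)]),
    ("_", 100),
]
pho_text = to_pho(hello)
```

Written to a file, text like this is what the mbrola command-line program takes as input together with a diphone database; the rules mentioned above are whatever logic decides the durations and pitch points.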

NoidoDev ##pTGTWW 07/27/2024 (Sat) 18:33:09 No.32410

>>32409 The linguistic knowledge necessary for defining pitch and duration rules in speech synthesis encompasses several areas of linguistics. Here's an overview of the key areas:
1. Phonetics and Phonology:
 - Understanding of speech sounds and their features
 - Knowledge of phonemic inventories and allophonic variations
 - Familiarity with syllable structure and phonotactics
2. Prosody:
 - Understanding of stress patterns (lexical and sentential)
 - Knowledge of intonation patterns and their meanings
 - Familiarity with rhythm and timing in speech
3. Morphology:
 - Understanding how word structure affects pronunciation
 - Knowledge of morphophonemic alternations
4. Syntax:
 - Understanding how sentence structure affects prosody
 - Knowledge of phrase boundaries and their prosodic effects
5. Pragmatics:
 - Understanding how context and intent affect prosody
 - Knowledge of focus and emphasis in speech
6. Sociolinguistics:
 - Familiarity with dialect variations in prosody
 - Understanding of speaking styles and registers
7. Acoustic Phonetics:
 - Knowledge of acoustic correlates of prosodic features
 - Understanding of formant structures and transitions
8. Language-specific knowledge:
 - Deep understanding of the prosodic system of the target language
 - Familiarity with any unique features (e.g., tone in tonal languages)
This knowledge helps in creating rules that accurately capture the nuances of natural speech in the target language. Would you like more information on any of these linguistic areas or how they specifically apply to speech synthesis?

NoidoDev ##pTGTWW 07/27/2024 (Sat) 18:42:01 No.32411

>>32409
> To go from phonetic labels and audio files to creating a voice in eSpeak or MBROLA
>This checkpoint leverages the pretrained checkpoint wav2vec2-large-lv60 and is fine-tuned on CommonVoice to recognize phonetic labels in multiple languages.
https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft
>Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.
https://arxiv.org/abs/2109.11680
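For anyone wondering how a model like this turns audio into a phoneme string: wav2vec 2.0 checkpoints of this kind are trained with CTC, so the network emits one label id per audio frame and the decoder collapses repeats and drops a special blank symbol. Here is a minimal pure-Python sketch of that greedy decoding step. The vocabulary and frame ids are made up for illustration and are not taken from the checkpoint.

```python
def ctc_greedy_decode(frame_ids, id_to_label, blank_id=0):
    """Collapse repeated frame predictions, then drop the CTC blank."""
    labels = []
    previous = None
    for idx in frame_ids:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank; a blank between two equal ids keeps both.
        if idx != previous and idx != blank_id:
            labels.append(id_to_label[idx])
        previous = idx
    return labels


# Toy vocabulary and per-frame argmax ids (assumptions, not real model output).
vocab = {0: "<blank>", 1: "h", 2: "ə", 3: "l", 4: "oʊ"}
frames = [0, 1, 1, 0, 2, 2, 2, 3, 0, 3, 4, 4, 0]
phonemes = ctc_greedy_decode(frames, vocab)
```

With real audio, the frame ids would come from the model's output and the label table from its tokenizer. The resulting phoneme sequence, plus timing, is the kind of label data the voice-building steps above ask for.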

NoidoDev ##pTGTWW 07/27/2024 (Sat) 18:43:42 No.32412

>>32406 >>32407 Can be deleted. I was on another computer while using Claude.

Robowaifu Technician 07/27/2024 (Sat) 19:14:02 No.32413

>>32409 Ouuuf, I had to ask specifically:

Yes, Festival can indeed be easier to work with for voice creation compared to directly using MBROLA or eSpeak. Here's a brief overview of why Festival might be a more accessible option:
1. Comprehensive toolkit:
 - Festival comes with a suite of tools designed for voice creation
 - Includes utilities for data preparation, analysis, and synthesis
2. Festvox:
 - A voice building framework that works with Festival
 - Provides a structured approach to creating new voices
3. Higher-level abstractions:
 - Allows you to work with higher-level linguistic concepts
 - Can automatically handle some lower-level details
4. Built-in machine learning:
 - Can use statistical methods to learn prosody models from data
 - Reduces the need for manual rule creation
5. Extensive documentation:
 - Well-documented processes for voice building
 - Includes tutorials and example voices
6. Language support:
 - Comes with tools for building voices in various languages
 - Can handle different writing systems and phoneme sets
7. Integration:
 - Can still use MBROLA voices if desired
 - Allows for mixing different synthesis methods
While Festival still requires significant effort and linguistic knowledge, it provides more structure and automation in the voice creation process. Would you like more details on how to approach voice creation with Festival?

NoidoDev ##pTGTWW 07/27/2024 (Sat) 19:36:34 No.32414

>>32413 We should keep an eye on developments in case anyone tries to make that easier with AI. Or maybe we could try advocating for it. I might also give it a try myself, if I really have to, but I'm bad with patience and frustration. http://www.festvox.org/ https://github.com/festvox/festvox >For full details about voice building see the document itself http://festvox.org/bsv/

Chobitsu 07/27/2024 (Sat) 19:48:32 No.32416

>>32414 Great research effort NoidoDev, thanks! :^)

NoidoDev ##pTGTWW 01/16/2025 (Thu) 04:04:45 No.35678

>Kokoro TTS
I just watched this video https://youtu.be/tl1wvZXlj0I about a new TTS model. It's named Kokoro: https://huggingface.co/hexgrad/Kokoro-82M
>Kokoro is a frontier TTS model for its size of 82 million parameters (text in/audio out).
>On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license.
...
>Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio ...
>Kokoro v0.19 is limited in some specific ways, due to its training set and/or architecture:
> [Data] Lacks voice cloning capability, likely due to small <100h training set
> [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
> [Data] Training dataset is mostly long-form reading and narration, not conversation
> [Arch] At 82M params, Kokoro almost certainly falls to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
> [Data] Multilingual capability is architecturally feasible, but training data is mostly English
What I found really interesting is that it can blend voices. So you can create voices which won't sound like the standard ones.
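A rough idea of what blending voices can mean in practice: if each voice is stored as a vector of style numbers (an assumption here, not something checked against the Kokoro code), then a blend is just a weighted average of two such vectors. A minimal sketch with plain lists; the function name and the toy vectors are hypothetical.

```python
def blend_voices(voice_a, voice_b, weight_a=0.5):
    """Linear interpolation between two equally sized style vectors."""
    if len(voice_a) != len(voice_b):
        raise ValueError("voice vectors must have the same length")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(voice_a, voice_b)]


# Toy 2-dimensional "voices"; real style vectors would be much larger.
standard_voice_1 = [1.0, 0.0]
standard_voice_2 = [0.0, 1.0]
custom_voice = blend_voices(standard_voice_1, standard_voice_2, weight_a=0.5)
```

Sweeping weight_a between 0 and 1 would give a continuum between the two standard voices, which is one way to end up with a voice that doesn't belong to anyone.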

Chobitsu 01/16/2025 (Thu) 08:15:02 No.35694

>>35678 Nice find, NoidoDev. Thanks! :^)

GreerTech 02/01/2025 (Sat) 15:02:38 No.36362

I used eSpeak and copy-and-paste to do a quick-and-dirty speech capability.

Chobitsu 02/02/2025 (Sun) 03:16:13 No.36377

>>36362 Can you please link some resources for everyone here showing how you did it? Like, the original sauces for things.

GreerTech 02/02/2025 (Sun) 03:35:33 No.36383

>>36377 From Offline AI Roleplay - A Guide to Simple Offline AI "Open eSpeak while you use the AI program (it may be labeled as TTSApp). Copy the outputs into eSpeak, and have it say the text. Both Jan and LM Studio have a dedicated copy button. The shortcut for pasting is Ctrl + V." eSpeak can be found here; https://espeak.sourceforge.net/
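The copy-and-paste step can also be scripted, since eSpeak ships a command-line program. A minimal sketch, assuming an `espeak-ng` (or `espeak`) executable is on the PATH; `-v` selects the voice and `-s` sets the speed in words per minute. The function names are made up for this example.

```python
import shutil
import subprocess


def espeak_command(text, voice="en", speed_wpm=160, exe="espeak-ng"):
    """Build the argument list for one eSpeak call (nothing is run here)."""
    return [exe, "-v", voice, "-s", str(speed_wpm), text]


def speak(text, **options):
    """Speak text aloud; raises FileNotFoundError if eSpeak is not installed."""
    command = espeak_command(text, **options)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found on PATH")
    subprocess.run(command, check=True)


# Example call (commented out so importing this file stays silent):
# speak("Hello, Anon.", speed_wpm=140)
```

Feeding the chat program's output into `speak()` instead of the clipboard depends on how that program exposes its text, which is not covered here.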

NoidoDev ##pTGTWW 02/02/2025 (Sun) 21:48:20 No.36417

>>36377 espeak is in every Linux repository. It's very well known but also basic. I sometimes use Speech Dispatcher (spd-say) with Festival >>4333 or Mbrola >>32411 >>32406, though Festival seems to be easier for voice creation: >>32413. Festival sounds better than espeak, imo, especially with the right configuration (which I don't have anymore right now). It still sounds very robotic and makes errors in pronunciation, but it's okayish and better than espeak. Speech Dispatcher (spd-say prompt) is a kind of layer on top, so you can switch the system underneath: https://wiki.archlinux.org/title/Speech_dispatcher
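The "layer on top" idea can be sketched as follows: the calling code stays the same and only the output module name changes. This assumes `spd-say` is installed; `-o` picks the output module, `-r` sets the rate from -100 to 100, and `-w` waits until speaking has finished. Which modules exist on a given machine depends on its Speech Dispatcher setup (`spd-say -O` lists them), so the names below are examples, not guarantees.

```python
def spd_say_command(text, module=None, rate=0, wait=True):
    """Build an spd-say argument list; module=None uses the system default."""
    if not -100 <= rate <= 100:
        raise ValueError("rate must be between -100 and 100")
    command = ["spd-say"]
    if module:
        command += ["-o", module]
    command += ["-r", str(rate)]
    if wait:
        command.append("-w")
    command.append(text)
    return command


# Same text, two different engines underneath (module names are examples).
via_festival = spd_say_command("Good morning.", module="festival", rate=-20)
via_espeak = spd_say_command("Good morning.", module="espeak-ng", rate=-20)
```

Either list can be passed to `subprocess.run(...)` to actually speak.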

GreerTech 02/02/2025 (Sun) 21:52:13 No.36419

>>36417 Festival definitely looks better for developers who are making a unique system >espeak is in every Linux repository. Yeah, but me and many others are on Windows

Robowaifu Technician 02/02/2025 (Sun) 22:10:59 No.36423

>>36419 Isn't Narrator in the accessibility settings just a built-in text-to-speech program?

Chobitsu 02/03/2025 (Mon) 07:25:56 No.36440

>>36383 >>36417 Thanks for your help, Anons! :^)

GreerTech 02/03/2025 (Mon) 07:37:48 No.36443

>>36423 I tried that at first, but the problem is that as far as I know, it reads EVERYTHING on the screen.

NoidoDev ##pTGTWW 02/03/2025 (Mon) 12:29:42 No.36447

>>36419 >but me and many others are on Windows Okay, I assume this is for development, while the real system will more likely be Linux. Anyways, I don't know how this exactly works but I think you can use the embedded Linux in Windows or whatever this is, and I assume there's also a repository. WSL: https://learn.microsoft.com/en-us/windows/wsl/about

Chobitsu 02/03/2025 (Mon) 12:59:52 No.36451

>>36447 Yeah, WSL currently defaults to an Ubuntu 24 variant system + terminal. It's not a perfect match, but it's close enough in most respects. For example, I've been able to build & successfully run juCi++ [1][2][3] on it (this is a moderately complex & dependency-laden GTKMM-based GUI application, built from source) without any hiccups. This subsystem is very simple to set up, and I'll be happy to help anyone here who may be struggling to do so. Hopefully it can support Anon's other development needs, and if not then moving over to a full Linux system will be all the easier for them afterwards. Cheers. :^)
---
1. https://gitlab.com/cppit/jucipp/-/blob/master/docs/install.md#debianlinux-mintubuntu
2. Here's a one-liner to copypasta into your new Ubuntu terminal for installing all its dependencies:

sudo apt-get install libclang-dev liblldb-dev || sudo apt-get install libclang-6.0-dev liblldb-6.0-dev || sudo apt-get install libclang-4.0-dev liblldb-4.0-dev || sudo apt-get install libclang-3.8-dev liblldb-3.8-dev; sudo apt-get install universal-ctags || sudo apt-get install exuberant-ctags; sudo apt-get install git cmake make g++ clang-format pkg-config libboost-filesystem-dev libboost-serialization-dev libgtksourceviewmm-3.0-dev aspell-en libaspell-dev libgit2-dev

Then just follow the rest of the install instructions from the link above (ie, git clone --recursive https://gitlab.com/cppit/jucipp , etc.)
3. WSL is also a great platform for Windows users to build & run BUMP for archiving this board, btw ( >>14866 ).
>===
-prose edit
-add footnote/hotlink
-add dependencies/BUMP footnotes

Edited last time by Chobitsu on 02/03/2025 (Mon) 15:36:01.

Chobitsu 02/28/2025 (Fri) 17:10:12 No.37216

>>37207 Neat! That's an encouraging development, Barf. Thanks & cheers. :^) >=== -minor edit

Edited last time by Chobitsu on 02/28/2025 (Fri) 18:04:34.

GreerTech 03/07/2025 (Fri) 17:47:31 No.37361

Good news everybody! I was able to get AndroidOS Voice Control to work with ChatterUI, so now I can chat with AI naturally. As per the principles of >>10000, I tested it on Airplane Mode, and it still works.

Chobitsu 03/08/2025 (Sat) 12:21:14 No.37373

>>37361 That's pretty neat. Thanks for the update, Anon. Good luck with your research, GreerTech!

NoidoDev ##pTGTWW 03/11/2025 (Tue) 02:41:52 No.37427

>>4693 >>21542 So, UTAU software, which is also used by Adachi Rei (Missile39's voice bank for his orange-haired robot), is actually available with an English interface. The program is made for WinXP, but I personally would use it with Wine/Bottles anyways. It runs on old computers (Pentium 4).
https://utau-synth.com/
http://utau.wikidot.com/utau
>UTAU is musical voice synthesis software created by Ameya/Ayame. Though its purpose is similar to the Vocaloid software created by Yamaha, UTAU functions natively as a stand-alone program on both PC and Mac OSX platforms without requiring a VSTi plugin for third-party software (see: UTAU-Synth) and is licensed as freeware. The PC version also has a shareware option through donation [1].
> UTAU was developed from the concept of Jinriki ("manual") Vocaloid, a practice in which existing voice samples are spliced, re-assembled and pitched to create a singing voice in software such as Melodyne. Ameya released the initial version of UTAU in March 2008. On May 27th, 2011, the first beta version of UTAU-Synth for Mac was released.
> The software comes pre-loaded with a robotic-sounding young female voice, Utane Uta (known as Defoko), created from the AquesTalk TTS program [2]. Unlike Vocaloid, however, the UTAU engine can openly accept any manner of .wav files, allowing its users to create their own "voicebanks" and distribute them online.

GreerTech 03/13/2025 (Thu) 20:15:30 No.37479

Several years ago, I remember seeing an Alexa ad* that mentioned that it had "far-field" technology, so you wouldn't have to shout and speak slowly like you had to do with other voice-command technology. I got curious and researched it for myself. Turns out, it was just several microphones on an array, presumably with software that combines and compares the inputs to reinforce what it has heard. Probably not too difficult for a skilled coder. https://developer.amazon.com/en-US/blogs/alexa/post/80facfd2-1176-4c4f-94ac-4c5c781011ca/amazon-alexa-premium-far-field-voice-development-ki https://www.hackster.io/news/an-alexa-development-kit-from-amazon-5dae5522ddc *the one ad with the family that was a meme back in the late 2010s.
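One common way to "combine and compare the inputs" from a microphone array is delay-and-sum beamforming: shift each channel so that sound from the wanted direction lines up, then average, so the voice adds up while noise from elsewhere partly cancels. Whether Amazon's kit works exactly like this is not stated in the links; the sketch below is just the textbook idea with whole-sample delays and made-up signals.

```python
def delay_and_sum(channels, delays):
    """Advance each channel by its delay (in samples) and average them."""
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if any(delay < 0 for delay in delays):
        raise ValueError("delays must be non-negative")
    # Only keep the part where every shifted channel still has samples.
    length = min(len(channel) - delay for channel, delay in zip(channels, delays))
    if length <= 0:
        raise ValueError("delays are longer than the recordings")
    return [
        sum(channel[delay + n] for channel, delay in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


# The same click reaches mic B two samples after mic A (toy data).
mic_a = [0.0, 1.0, 0.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 0.0, 1.0, 0.0]
aligned = delay_and_sum([mic_a, mic_b], delays=[0, 2])
unaligned = delay_and_sum([mic_a, mic_b], delays=[0, 0])
```

With the right delays the click comes through at full strength; with the wrong ones it is smeared into two half-size clicks. A real system would estimate the delays from the array geometry or from the signals themselves.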

NoidoDev ##pTGTWW 03/16/2025 (Sun) 00:52:04 No.37531

>>37479 Thanks, but I've seen this quite often during the last few years. Some companies make such microphone arrays. The question is just how to integrate that into a head. I could imagine having several small holes in the skull, covered on the outside with a layer of silicone. It would still be tricky to keep this safe from liquid getting in. The most interesting option to me would be a system where we can set up the mics in a skull on our own and then have the software adapt to it. The mics should be small but waterproof and not directly soldered to some board.

GreerTech 03/16/2025 (Sun) 12:09:39 No.37541

>>37531 Well, if the robot has a solid head, then you can put them on the top dome in a spherical array, with two on the side to mimic nature. We can also put tiny microphones in other places, like the wrist or neck

AI Vocal Expression and Emotion Robowaifu Technician 05/06/2025 (Tue) 20:44:48 No.38268

ALRIGHT LISTEN UP! I'm sick and tired of listening to AI voices that are always monotone, mundane and robotic, with no human-like intonation, especially when you talk to them. I know AI voices have always been synonymous with robotic speech, but c'mon, is this the best we can do to make our robowaifus speak? There hasn't been any significant improvement in this field over the years, which is a shame. I want my robowaifu to be able to realistically roleplay vocally and do ASMR. I think we need to build a new data model specifically for analysing human speech, so that AI can learn to speak with the right intonation and realistic vocal expression. I'm quite certain that AI emotion and a speech data model are closely interrelated. So what are your thoughts on this?

GreerTech 05/06/2025 (Tue) 21:02:16 No.38269

>>38268 People are already trying to figure it out. It's definitely a case of "easier said than done". There's that one new open source TTS (sorry the name escapes me) that came out recently, you might want to try there.

Robowaifu Technician 05/07/2025 (Wed) 05:10:54 No.38285

>>38269 >new open source TTS This is not good enough. An AI waifu needs to learn what is funny. Somewhere between the lines as you speak, your waifu chuckles because something was funny. What is a person's intonation when they are mad? What do they sound like when they are explaining things? What does their voice sound like when they sigh, when they're hyped, etc.? The AI waifu needs to learn all these intricate stages of feeling in how humans sound when they speak, by herself, before she can utter a word in the correct tone. In other words, your waifu needs to train her emotional data bank so that she can converse in more meaningful tones of utterance.

GreerTech 05/07/2025 (Wed) 05:22:03 No.38286

>>38285 https://decrypt.co/316008/ai-model-scream-hysterically-terror They're working on it. Not to say you can't work on it yourself, but rather it's not a deliberate choice to leave out emotion. Also, you can do some tricks just by changing settings. I got Galatea to sing just by slightly lowering her speed. >pic related A monotone voice can actually be cute

Robowaifu Technician 05/07/2025 (Wed) 05:36:46 No.38287

>>38286 >A monotone voice can actually be cute Yes, but your waifu needs to be aware in real time of the tone you're using when she listens to you speak, so that she can reply with the correct vocal intonation.

Chobitsu 05/07/2025 (Wed) 08:37:09 No.38289

>>38268 >>38285 >>38287 Lol. NYPA, Anon. OTOH, if you want to try solving this together with us here, that would be great! <---> I'm glad that you bring up this topic. I think we all instinctively know when a voice is uncanny-valley, but sometimes it can be hard to put into words. You've made a good start at it, Anon. Cheers. :^)

Chobitsu 05/07/2025 (Wed) 08:39:25 No.38290

>>38269 >It's definitely a case of "easier said than done". This. But I must admit, there has been some remarkable progress in this arena. Our own @Robowaifudev did some great work on this a few years ago. My ineptitude with getting Python to work properly filtered me, but he was pulling off some real vocal magic type stuff -- all locally IIRC.

Robowaifu Technician 05/29/2025 (Thu) 15:11:52 No.38778

> (audio LLM -related : >>38775 )