I think I've finally figured out a way to train more expressive voices in conversation without having to label a ton of data.
First, the English text needs to be transcribed into IPA so that a speech synthesis model can easily predict how words are spoken without requiring a huge dataset covering all the exceptions and weirdness of English. The English transcription or IPA is projected into an embedding that's split into two parts. One part constrained to representing the content as IPA via projecting those features back into IPA symbols and minimizing the cross entropy loss. The other half modelling the style, such as the emotion and other subtleties, to match the audio examples more faithfully, which are trained through the Mel spectrogram loss. This way the model can learn all aspects of speech through just the text labels and audio examples alone.
At inference time this style embedding could be modified to change the emotion, pitch, cadence, tone and other qualities of the model for voice acting or creating examples for finetuning the model towards a desired personality.
A ByT5 model could be used to transcribe English and other languages into the IPA embedding + style embedding. It could also take into account the previous context of the conversation to generate a more appropriate style embedding for the speech synthesis model to work from. Training from context though will require new datasets from podcasts that have such context. I've collected some with existing transcripts and timestamps for this already. The transcripts just need to be accurately aligned to the audio clips for clipping, so it's not an unfeasible project for one person to do.
Other possibilities for this could be adding tags into the text training data that get filtered out from the content via the IPA cross entropy loss, ensuring the tags only affect the style embedding. You could indicate tempo, pitches, velocity and note values for singing which would be learned in the style embeddings. It could also be used for annotating different moods or speaking styles such as whispering or yelling. There's a ton of possibilities here for more versatile speech synthesis and natural conversation.