https://github.com/openai/whisper
Oh shit, audio transcription has surpassed average human level and is now competitive with professional transcription. OpenAI has gone off its investor rails and completely open-sourced the model and weights. On top of that, it's multilingual and can do Japanese fairly well. This could be used for transcribing audio from vtubers, audiobooks, and anime with missing subtitles. Unfortunately it doesn't do speaker detection (diarization) as far as I know, but it might be possible to train another model that uses the encoded audio features to tell speakers apart.
Install:
python -m pip install git+https://github.com/openai/whisper.git --user
Quick start:
import whisper
model = whisper.load_model("base", device="cuda") # use device="cpu" if you have no CUDA
result = model.transcribe("chobits_sample.mp3", language="en") # if language is omitted, multilingual models auto-detect it; English-only models always assume English
print(result["text"])
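Since the point is filling in missing subtitles, here is a rough sketch of turning the transcription into an .srt file. It assumes the result of transcribe() also has a "segments" list whose entries are dicts with "start" and "end" (in seconds) and "text"; the helper names srt_timestamp and segments_to_srt are made up for this example and are not part of whisper.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from Whisper-style segments.

    Assumption: each segment is a dict with "start", "end" (seconds)
    and "text" keys.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        # One SRT cue: counter, time range, then the subtitle text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

With the quick start above, something like segments_to_srt(result["segments"]) written to chobits_sample.srt should give a subtitle file most players can load, though the segment timing may need manual touch-ups.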
Output (base):
> Yuzuki. I brought you some tea. Huh? Huh? Why are you serving the tea? The maid, Persecom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. Wee. I can handle this on my own. I want you to try to relax. Oh. Oh? Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all we really need is a good night's sleep. But it'd be so exhausted that he just collapsed like that. Does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motu-suwa, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! Huh? There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki.
Interestingly, the VA actually said "persecom" instead of "persocom", and "Motusua" instead of "Motosuwa", which was transcribed as "Motu-suwa". The poorly pronounced "all he really needs is a good night's sleep" sounded a lot like "all we really need is a good night's sleep" and was transcribed as such. The other errors were a Chii processing sound effect transcribed as "wee", Minoru's "ah!" mistaken for "huh?", the clatter of teacups transcribed as "oh", and Minoru's "ugh" transcribed as "oh?"
Output (small):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki?
"Ah! Huh?" from Minoru and Hideki were omitted. "Ugh" was also omitted when Minoru passes out. It understood persocom wasn't a name but still misspelled it "persicom". Chii's sound effect wasn't transcribed as "wee" this time. Motosuwa got transcribed as "Motosua". This model understood "all he really needs" but made a mistake at the end thinking Hideki was asking a question saying Yuzuki.
Output (medium):
> Yuzuki! I brought you some tea. Ah! Huh? Why are you serving the tea? The maid, Persicom, is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki...
This one got the ellipsis right at the end and recognized Minoru saying "ah!", but it mistook persocom for a name, "Persicom". "Ugh" was omitted.
Output (large):
> Yuzuki! I brought you some tea. Why are you serving the tea? The maid persicom is currently being used by the system. What are you talking about? Yuzuki, you're the center of that system, aren't you? Why don't you sit down and take a rest? I'm sure you're very tired from all this. Lord Minoru, thank you very much. I can handle this on my own. I want you to try to relax. Minoru! Lord Minoru! Lord Minoru! Well, I'm glad to know that all he really needs is a good night's sleep. But to be so exhausted that he just collapsed like that, does that mean he hasn't been getting any sleep lately? Yes. It must be because of all the research that I asked him to do for me. Please, Mr. Motosua, it isn't your fault. I'm afraid that I'm just not powerful enough. Don't say that! There's no denying that if my processing speed were faster, I wouldn't have put Lord Minoru under such extreme stress. If only I was just more useful. Miss Yuzuki...
"Ah! huh?" were omitted and it understood persocom wasn't a name but still spelled it as "persicom".