Exploring Hugging Face: Automatic Speech Recognition

ASR Models on Hugging Face

Okan Yenigün

Automatic Speech Recognition (ASR) is the task of converting spoken language into text. In the context of Hugging Face, it means using the models and tools available on the platform to perform ASR. Hugging Face hosts many pre-trained ASR models that you can use to transcribe audio files or streams into text.

Fine-tuned XLSR-53 large model for speech recognition in English

This is the most downloaded ASR model on Hugging Face as of March 2024. It is a pre-trained model based on the Wav2Vec 2.0 architecture, specifically the XLSR-53 variant, fine-tuned for English speech recognition.
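
To see which checkpoints are currently the most popular, you can query the Hub programmatically. Here is a minimal sketch using huggingface_hub; the tag filter and sort field are assumptions, and the exact fields returned can vary by library version:

from huggingface_hub import list_models

# List the five most-downloaded models tagged for automatic speech recognition
for m in list_models(filter="automatic-speech-recognition", sort="downloads", direction=-1, limit=5):
    print(m.id, m.downloads)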

Wav2Vec 2.0 is a self-supervised learning framework for speech processing developed by Facebook AI. It learns powerful representations of raw audio by predicting masked parts of the audio input. The model is first pre-trained on a large amount of unlabeled audio data and then fine-tuned on a smaller labeled dataset for specific tasks like ASR.
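
With the pipeline API, transcription takes only a few lines. Here, speech.wav stands in for any local audio file: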

from transformers import pipeline

# Build an ASR pipeline backed by the English fine-tuned XLSR-53 checkpoint.
# This is a CTC model, so no generation options (such as a language hint) are needed.
pipe = pipeline("automatic-speech-recognition", model="jonatasgrosman/wav2vec2-large-xlsr-53-english")

result = pipe(["speech.wav"])

result
"""
[{'text': 'and you know what they call it a quarter pounder with cheese in paris'}]
"""

XLSR-53 stands for Cross-Lingual Speech Representations, and the number 53 indicates that the model was pre-trained on 53 different languages. This makes the model more robust and versatile, as it has learned from a diverse set of languages.
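
The same model family includes XLSR-53 checkpoints fine-tuned for other languages, so switching language is usually just a matter of swapping the model id. A hypothetical sketch (the French checkpoint name and the audio file below are assumptions, not taken from this article):

from transformers import pipeline

# Hypothetical French fine-tuned checkpoint from the same XLSR-53 family
pipe_fr = pipeline("automatic-speech-recognition", model="jonatasgrosman/wav2vec2-large-xlsr-53-french")
pipe_fr("discours.wav")  # any local French audio file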

Whisper

Whisper is a neural network-based model developed by OpenAI for automatic speech recognition (ASR). It is designed to transcribe speech to text with high accuracy across a wide range of languages and domains.

The faster_whisper package reimplements Whisper on top of CTranslate2, which lets you run the model efficiently.

from faster_whisper import WhisperModel

# Load the multilingual large-v3 checkpoint
model = WhisperModel("large-v3")

# transcribe() returns a generator of segments plus metadata about the audio
segments, info = model.transcribe("pulp_fiction_dialog.mp3")
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

"""
[0.04s -> 1.82s] So, tell me again about the hash bar.
[1.92s -> 2.70s] Okay, what you want to know?
[3.36s -> 4.34s] Yeah, it's just legal, man, right?
[4.60s -> 6.22s] Yeah, it's legal, but it ain't 100% legal.
[6.32s -> 8.30s] I mean, you just can't walk into a restaurant,
[8.64s -> 10.18s] roll a joint, and start puffing away.
[10.88s -> 12.34s] I mean, they want you to smoke in your home
[12.34s -> 13.52s] or certain designated places.
[14.08s -> 15.16s] And those are hash bars?
[15.42s -> 16.72s] Yeah, it breaks down like this, okay?
[16.92s -> 19.14s] It's legal to buy it. It's legal to own it.
[19.36s -> 21.24s] And if you're the proprietor of a hash bar,
[21.46s -> 22.44s] it's legal to sell it.
[22.80s -> 25.40s] It's legal to carry it, but that doesn't matter
[25.40s -> 26.94s] because get a load of this, all right?
[26.94s -> 29.08s] If you get stopped by a cop in Amsterdam,
[29.08s -> 31.06s] it's illegal for them to search you.
[31.52s -> 33.72s] I mean, that's the right the cops in Amsterdam don't have.
[34.12s -> 35.68s] Oh, man, I'm going.
[35.82s -> 37.38s] That's all it is to it. I'm fucking going.
[37.50s -> 39.46s] I know, baby. You dig it the most.
[41.26s -> 43.36s] But you know what the funniest thing about Europe is?
[43.44s -> 43.74s] What?
[44.04s -> 44.90s] It's the little differences.
[45.86s -> 47.98s] I mean, they got the same shit over there that they got here,
[48.04s -> 49.98s] but it's just there. It's a little different.
[50.26s -> 50.62s] Example.
[51.18s -> 53.34s] All right, well, you can walk into a movie theater in Amsterdam
[53.34s -> 54.18s] and buy a beer.
[54.78s -> 56.68s] And I don't mean just, like, a little paper cup.
[56.74s -> 58.12s] I'm talking about a glass of beer.
[58.12s -> 61.06s] And in Paris, you can buy a beer at McDonald's.
[61.78s -> 65.76s] And you know what they call a quarter pounder with cheese in Paris?
[66.30s -> 67.82s] They don't call it a quarter pounder with cheese?
[68.08s -> 69.34s] No, man, they got the metric system.
[69.46s -> 71.18s] They wouldn't know what the fuck a quarter pounder is.
[71.52s -> 72.38s] What do they call it?
[72.50s -> 74.74s] They call it a royale with cheese.
[75.64s -> 76.54s] Royale with cheese.
[76.72s -> 77.24s] That's right.
[77.62s -> 78.60s] What do they call a Big Mac?
[79.12s -> 81.62s] Big Mac's a Big Mac, but they call it Le Big Mac.
[82.06s -> 82.96s] Le Big Mac.
[85.16s -> 86.36s] What do they call a Whopper?
[86.56s -> 87.22s] I don't know.
[87.30s -> 87.92s] I didn't go on a burger.
[88.12s -> 92.08s] You know what they put on french fries in Holland instead of ketchup?
[92.34s -> 92.66s] Or what?
[92.98s -> 93.60s] Mayonnaise?
[94.18s -> 97.38s] I seen them do it, man.
[97.42s -> 98.78s] They fucking drown them in that shit.
[102.94s -> 105.44s] We should have shotguns with this kind of deal.
[109.26s -> 110.20s] How many up there?
[110.66s -> 111.32s] Three or four.
[112.32s -> 113.22s] That's counting our guy?
[114.26s -> 114.96s] I'm not sure.
[116.00s -> 118.10s] So that means it could be up to five guys up there?
[118.30s -> 118.90s] Mine?
[119.02s -> 120.24s] As possible.
[120.60s -> 121.84s] We should have fuckin' shotguns.
[123.06s -> 124.34s] Thanks for watching.
[124.36s -> 125.54s] For more episodes, check out my Facebook and Instagram.
[125.56s -> 126.76s] Subscribe, to watch more episodes.
[126.80s -> 127.70s] Don't forget to rate this video.
[127.74s -> 128.78s] Click the bell for notifications.
[128.80s -> 130.22s] Oryeee...
[130.24s -> 131.04s] Uma.
[133.20s -> 135.18s] We looked at a job at McDonald's next week.
[135.18s -> 137.76s] I love McDonald's and we concur.
"""

We can also use the transformers library directly. This time, let's use the small Whisper model:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import soundfile as sf
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None  # let Whisper infer the language and task itself

# Load your audio file
audio_file = "speech.wav"
audio, sample_rate = sf.read(audio_file)

# Resample to 16,000 Hz if necessary (Whisper expects 16 kHz input)
if sample_rate != 16000:
    audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)
    sample_rate = 16000

# Process the audio file
input_features = processor(audio, sampling_rate=sample_rate, return_tensors="pt").input_features

# Generate token ids
predicted_ids = model.generate(input_features)

# Decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print("Transcription:", transcription[0])

"""
Transcription: and you know what they call a quarter pounder with cheese in Paris.
"""

Sources

https://huggingface.co/openai/whisper-large-v3

https://huggingface.co/openai/whisper-small

https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english
