SPEECH TO TEXT CONVERSION


What is speech to text translation?

Speech-to-text translation is the process of converting spoken words into written words. This process is often referred to as speech recognition.

What are the uses of speech to text?

Speech to text is used to recognize spoken words or phrases and translate them into text by making use of computational linguistics. In customer service, it is used to extract insights from customer conversations in order to improve customer experience and increase productivity. It can also be used for adding subtitles to media content.

Tools such as Amazon Transcribe Medical were created to record and document clinical conversations into electronic health record systems for faster and more efficient analysis. This automates data entry and provides immediate access to information.

Speech to text conversion:

In speech to text conversion, the system detects words and phrases in audio input and converts them into a readable text format. It is useful when people who speak different languages and dialects communicate with each other. Without a speech to text conversion system, people who speak different languages may not understand each other's words.

reference :-ijstr.org//speech to text conversion

First, important features are extracted from the input speech. Word and sentence matching is then carried out using acoustic word models together with the syntax and semantics defined for the sentences; these matching steps can be performed at the same stage. Finally, language modelling is performed using the selected modelling method.

Types of speech to text technology

Speaker Dependent Systems

Speech recognition software that depends on a speaker's particular voice characteristics. The system must be trained on a specific user before it is able to recognize that user's speech.

Speaker dependent systems are able to recognize speech from a variety of contexts or phrases. Speaker independent systems, by contrast, are able to recognize speech from different users by limiting the contexts of the speech.

Speaker Independent Systems

Software that does not need speaker training. Such systems are mostly used for automated telephone interfaces. They do not rely on training to learn each person's speech characteristics.

Methodologies: -

Mel Frequency Cepstral Coefficients (MFCCs)

Feature extraction is the most important step for reducing the data size of the speech signal before pattern classification or recognition. The main steps of computing Mel Frequency Cepstral Coefficients (MFCCs) are framing, windowing, the Discrete Fourier Transform (DFT), Mel frequency filtering, the logarithmic function and the Discrete Cosine Transform (DCT).

reference :-ijstr.org//speech to text conversion

Framing: The first step of MFCC extraction blocks the speech samples, obtained from analogue-to-digital conversion (ADC) of the spoken word, into a number of frames of 20-40 ms each. The frames overlap to avoid loss of information.
Windowing: A window function is applied to each frame to reduce the discontinuities at the first and last points of the frame.
DFT: The Discrete Fourier Transform (DFT) is computed with the Fast Fourier Transform (FFT) algorithm, which converts each frame of samples from the time domain into the frequency domain.
Mel frequency filtering: The voice signal does not follow a linear frequency scale, and the frequency range of the FFT spectrum is wide. The Mel scale is a perceptual scale that simulates the way the human ear works, and it gives better resolution at low frequencies.
Logarithmic function: A logarithmic transformation is applied to the absolute magnitude of the coefficients obtained after Mel-scale conversion. Taking the absolute magnitude removes the phase information, making the extracted features less sensitive to speaker-dependent variations.

DCT: The Discrete Cosine Transform (DCT) converts the log Mel-filtered spectrum into the cepstral domain, which behaves like a time domain. The resulting Mel Frequency Cepstral Coefficients are the features used in the recognition stage.
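The six steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the referenced work: the 25 ms frame, 10 ms hop, Hamming window, 512-point FFT, 26 filters and 13 coefficients are assumed defaults, and the function names (`mfcc`, `mel_filterbank`, `dct_ii`) are made up for this example.

```python
import numpy as np

def hz_to_mel(hz):
    # A commonly used Hz-to-Mel mapping (an assumption, the post does not give one).
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate, frame_ms=25, hop_ms=10, n_fft=512, n_filters=26, n_ceps=13):
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # 1. Framing: overlapping blocks of frame_len samples.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx]
    # 2. Windowing: taper the frame edges.
    frames = frames * np.hamming(frame_len)
    # 3. DFT via FFT: magnitude squared drops the phase.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 4. Mel frequency filtering.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # 5. Logarithm (small constant avoids log(0)).
    log_energies = np.log(energies + 1e-10)
    # 6. DCT: keep the first n_ceps cepstral coefficients.
    return dct_ii(log_energies, n_ceps)

# Usage: features for one second of a 440 Hz test tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2 * np.pi * 440.0 * t), 16000)  # one row per frame
```

Each row of `features` describes one frame, so a spoken word becomes a sequence of short feature vectors that a recognizer can classify.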

Flow of speech to text conversion

reference :-ijstr.org//speech to text conversion

To convert speech to output text, four main steps are implemented in MATLAB: speech database, preprocessing, feature extraction and recognition. First, five audio files are recorded using a computer. Each audio file contains a different pronunciation. Speech signals have more energy at low frequencies than at high frequencies, so the energy of the signal at high frequencies needs to be emphasized.
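Boosting the high-frequency energy of a speech signal is usually done with a first-order pre-emphasis filter. The sketch below is in Python rather than MATLAB, and both the filter form y[n] = x[n] - alpha * x[n-1] and the coefficient alpha = 0.97 are common choices assumed here; the post does not say which filter was used.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Usage: a constant (zero-frequency) signal is almost entirely suppressed,
# while fast sample-to-sample changes would pass through largely intact.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```

Because the filter subtracts most of the previous sample, slowly varying (low-frequency) content shrinks while rapidly varying (high-frequency) content is kept.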

Depending on the environment, unwanted noise can worsen the recognition rate. At the end of preprocessing, features or coefficients are extracted from the speech samples using Mel Frequency Cepstral Coefficients (MFCCs). These MFCC coefficients are used as the input to the Hidden Markov Model (HMM) recognizer, which classifies the desired spoken word. Even if the audio files cannot be obtained, the desired text output can still be generated by the HMM method.

Hidden Markov Model Recognizer:

There are many approaches to classifying a speech signal in order to recognize a test audio file. The HMM has a left-to-right structure that follows the sequence of phonemes in speech. In speech recognition, an HMM represents a word or an acoustic phoneme. The number of HMM models is chosen at random for modelling, and this choice changes the feature vectors or observations, which in turn affects the accuracy of speech recognition. HMMs are regarded as one of the most flexible and efficient approaches to speech recognition.
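A toy example may make the left-to-right structure more concrete. The sketch below scores an observation sequence against one small HMM per word using the scaled forward algorithm and picks the best-scoring word. Everything specific in it is an assumption for illustration: the observations are discrete symbols (for example, codebook indices obtained by quantizing MFCC vectors), the two word models "yes" and "no" and their probabilities are invented, and the helper names are hypothetical.

```python
import numpy as np

def left_to_right_transitions(n_states, stay=0.6):
    """Each state either stays put or moves one step to the right."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay
        A[i, i + 1] = 1.0 - stay
    A[-1, -1] = 1.0  # the final state only loops on itself
    return A

def log_likelihood(obs, start, A, B):
    """Scaled forward algorithm: returns log P(obs | model)."""
    alpha = start * B[:, obs[0]]
    log_p = 0.0
    for step, symbol in enumerate(obs):
        if step > 0:
            alpha = (alpha @ A) * B[:, symbol]
        scale = alpha.sum()
        if scale == 0.0:
            return -np.inf  # the model cannot produce this sequence
        log_p += np.log(scale)
        alpha = alpha / scale  # rescale to avoid numerical underflow
    return log_p

def recognize(obs, models):
    """Return the word whose HMM gives the observations the highest likelihood."""
    return max(models, key=lambda word: log_likelihood(obs, *models[word]))

# Two invented 3-state word models over a 3-symbol codebook.
start = np.array([1.0, 0.0, 0.0])  # left-to-right models begin in the first state
A = left_to_right_transitions(3)
B_yes = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
B_no = B_yes[::-1].copy()          # same shape, opposite symbol order
models = {"yes": (start, A, B_yes), "no": (start, A, B_no)}

word = recognize([0, 0, 1, 1, 2, 2], models)
```

A practical recognizer would normally use continuous emission densities over the MFCC vectors and train the model parameters from recordings; the discrete toy version only shows how the left-to-right structure and the likelihood comparison fit together.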

The challenges of speech-to-text-conversion in real-time:

Real-time speech-to-text conversion aims to transfer spoken words, phrases or language into written text. This gives people with a hearing impairment access to the content of spoken language, so that they are able to take part in a conversation within the normal time frame of a conversation.

An example of real-time speech-to-text transfer is a live broadcast in which the reporter's spoken comments are rapidly transferred into subtitles that correspond to what is being said.

However, most people with a hearing disability do not receive a real-time speech-to-text service at counselling interviews, at conferences or when watching live TV. Most protocols are tape-recorded and subsequently transferred into readable text.

Advantages of different methodologies.

· LPC is a static approach used for feature extraction. It models each voice sample as a linear combination of previous voice samples.
· The voice signal is split into many frames, and these windowed frames are then converted into text.
· MFCC is another approach, based on extracting features of the signal using filters. It applies steps such as framing, windowing and the DFT for speech to text conversion.
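The linear-prediction idea behind LPC in the list above can be illustrated with a short least-squares fit. This is a sketch under stated assumptions rather than a production method: real LPC front ends typically use the autocorrelation method with the Levinson-Durbin recursion, and the function name `lpc_coefficients` and the synthetic test signal are invented for this example.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Least-squares fit of x[n] ~ a[0]*x[n-1] + a[1]*x[n-2] + ... + a[order-1]*x[n-order]."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Column k holds the sample that lies k+1 steps before each target sample.
    past = np.column_stack([frame[order - 1 - k : n - 1 - k] for k in range(order)])
    target = frame[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    return coeffs

# Usage: a synthetic frame generated by x[n] = 1.5*x[n-1] - 0.7*x[n-2].
x = [1.0, 0.5]
for _ in range(48):
    x.append(1.5 * x[-1] - 0.7 * x[-2])
a = lpc_coefficients(x, order=2)  # should recover the generating coefficients
```

The fitted coefficients summarize a whole frame with only a handful of numbers, which is why LPC can serve as a compact feature representation.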

Disadvantages
· Uses fixed-resolution analysis along with a subjective frequency scale.
· MFCC values require normalization, because they are not very robust in the presence of background noise.
· The voice signal is treated as a short-term stationary signal.

Examples where we can use speech to text

Google Assistant

Google Assistant helps you accomplish a variety of tasks. You can use voice commands to look up information and tell Google Assistant to do certain things. The app can also convert speech to text: it sends messages, drafts emails and adds events to your calendar. As a speech to text app, it helps you organize your ideas and notes with voice recognition.

Speech Texter

Speech Texter is a useful tool to help you draft texts, tweets, emails, and more with your voice.

Braina

Braina is a personal A.I. that lets you communicate with your computer through your Android or iOS device. The program converts voice into text in any website or software program, including word processors.

Advantages of speech to text translation

1. Increase profits and activities

Speech-to-text technology makes work more efficient, and the time saved can be spent on other activities.

2. Work efficiently

Speech-to-text software enables your employees to work with greater productivity and efficiency.

3. Accuracy

The best speech-to-text software can now provide high accuracy. Voice typing technology makes it easier to create accurate transcriptions of calls, meetings, discussions and more.

4. Improve work experience

Voice typing can encourage employees to get away from their computers from time to time and to complete routine writing tasks easily by voice.

Conclusion

The speech-to-text conversion system is implemented using MFCCs for feature extraction and an HMM as the recognizer. In the speech database, audio files are recorded and then analyzed to obtain the feature vectors. The choice of the number of words in the HMM also plays an important role in recognition. The system becomes more accurate and reliable when an end-point detection algorithm is used in the preprocessing stage.

References:-

1] https://www.researchgate.net/publication/283123585_Intralingual_speech-to-text-conversion_in_real-time_Challenges_and_Opportunitie

2] https://www.researchgate.net/publication/304651244_VOICE_RECOGNITION_SYSTEM_SPEECH-TO-TEXT

3] https://www.irjet.net/archives/V7/i5/IRJET-V7I
