Unified Cloud Transcription Framework

The Unified Cloud Transcription Framework is a comprehensive system that streamlines and speeds up audio-to-text conversion. By combining multiple transcription services and tools into a single cloud-based platform, it delivers accurate, efficient, and dependable transcription workflows for a range of applications.

With the rise of machine learning and cloud computing, nearly every major cloud provider now offers its own transcription (speech-to-text) service. Giants like Google, Amazon, Microsoft, and IBM each support a good number of languages that can be transcribed from a given audio input. It is hard to decide which one is better for which language before testing them all extensively. But if we can integrate all of the services easily, we can leave that choice to the user.

Each of the cloud transcription services has a different API, but the ways to deal with them are mostly similar. This insight allows us to come up with a unified framework for integrating them all. This is especially true for the common use case of transcribing long audio clips. Let’s talk about what we need to handle in general for all services. The steps below pretty much sum it up.

Audio format

We have seen that all the major services accept WAV, and 16-bit, single-channel, 22050 Hz WAV files have pretty good quality. So if we convert our input material to this format, we can feed it to any transcription service.

Transcoding means converting from one format to another. In this case, we should convert the input media, which can be a combination of video (H.264, MPEG-2, etc.) and/or compressed audio (MP3, AAC, etc.), into the WAV format described above. For transcoding, we can use FFmpeg or another tool/service, e.g., AWS MediaConvert.
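As a minimal sketch, the transcoding step can be a thin wrapper around the FFmpeg command line. The function names and file paths are our own; the flags correspond to the 16-bit, single-channel, 22050 Hz WAV target described above.

```python
import subprocess

def build_transcode_command(input_path: str, output_path: str) -> list:
    """Build an FFmpeg command that converts any input media to
    16-bit PCM, mono, 22050 Hz WAV, as expected by the transcription services."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", input_path,        # input: video and/or compressed audio
        "-vn",                   # drop any video stream
        "-acodec", "pcm_s16le",  # 16-bit signed little-endian PCM
        "-ac", "1",              # single channel (mono)
        "-ar", "22050",          # 22050 Hz sample rate
        output_path,
    ]

def transcode(input_path: str, output_path: str) -> None:
    """Run FFmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_transcode_command(input_path, output_path), check=True)
```

Separating command construction from execution keeps the step easy to unit-test without FFmpeg installed.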

Media location

Some services expect an HTTP URL to the input file, which we can provide by creating a presigned URL for an object in our bucket. Others expect the file in their own bucket (e.g., Google), in which case we need to upload it there before starting the job. Some services allow the audio content to be posted with the API call itself (e.g., IBM).
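These delivery strategies can be captured in a small lookup that each provider adapter declares. The enum and mapping below are illustrative, based only on the behaviors just described:

```python
from enum import Enum

class MediaDelivery(Enum):
    PRESIGNED_URL = "presigned_url"      # service fetches from an HTTP URL we sign
    PROVIDER_BUCKET = "provider_bucket"  # upload to the provider's own bucket first
    INLINE_UPLOAD = "inline_upload"      # audio bytes posted with the API call

# Illustrative mapping; the real choice depends on each provider's API.
DELIVERY_BY_PROVIDER = {
    "aws": MediaDelivery.PRESIGNED_URL,
    "google": MediaDelivery.PROVIDER_BUCKET,
    "ibm": MediaDelivery.INLINE_UPLOAD,
}

def delivery_mode(provider: str) -> MediaDelivery:
    """Return how the given provider expects to receive the media."""
    return DELIVERY_BY_PROVIDER[provider.lower()]
```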

Start the job

Each service has an API to start the transcription job, called via an SDK or over HTTP. With this API, the service usually takes:

  • the media (presigned URL, path to file in a bucket, or posted with the API)
  • language code (usually in the form <language>-<region>, e.g. en-US)
  • other configurations: sample rate of the audio, channel count (if multichannel is supported), whether to enable speaker identification and confidence, output bucket path if needed, etc.

The API returns a job ID, to be able to later query the status or results.

For example, the AWS SDK has TranscribeService.startTranscriptionJob, Google has speech.v1p1beta1.SpeechClient.longRunningRecognize in their @google-cloud/speech package, and IBM has regional endpoints like https://gateway-lon.watsonplatform.net/speech-to-text/api/v1/recognitions
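One way to keep these differences manageable is a provider-agnostic job description that each adapter translates into its provider's own parameters. A sketch with our own field names; the AWS translation below only approximates the real StartTranscriptionJob parameters:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StartJobRequest:
    """Provider-agnostic description of a transcription job."""
    media_location: str           # presigned URL, bucket path, or local file
    language_code: str = "en-US"  # usually <language>-<region>
    sample_rate_hz: int = 22050
    channel_count: int = 1
    speaker_identification: bool = False
    output_bucket: Optional[str] = None

def to_aws_params(req: StartJobRequest, job_name: str) -> dict:
    """Translate the generic request into parameters resembling AWS
    Transcribe's StartTranscriptionJob call (field names approximate)."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": req.language_code,
        "MediaSampleRateHertz": req.sample_rate_hz,
        "Media": {"MediaFileUri": req.media_location},
        "Settings": {"ShowSpeakerLabels": req.speaker_identification},
    }
```

With this shape, adding another provider only requires writing its own translation function and keeping the returned job ID.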

Polling

Most providers do not support specifying a callback URL to notify us when a job has completed or failed, so we need a polling cycle of our own to periodically check the ongoing jobs. Of course, we need to maintain a map of these running jobs on our end, along with job IDs and other details.
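A sketch of such a cycle, assuming a hypothetical per-provider `get_status(provider, job_id)` call and a job map keyed by job ID:

```python
import time

TERMINAL = ("COMPLETED", "FAILED")  # states that need no further polling

def poll_once(jobs: dict, get_status) -> None:
    """One polling pass: query every non-terminal job and record its new status."""
    for job_id, job in jobs.items():
        if job["status"] not in TERMINAL:
            job["status"] = get_status(job["provider"], job_id)

def poll_until_done(jobs: dict, get_status, interval_seconds: float = 30.0) -> dict:
    """Poll periodically until every tracked job reaches a terminal state."""
    while True:
        poll_once(jobs, get_status)
        if all(job["status"] in TERMINAL for job in jobs.values()):
            return jobs
        time.sleep(interval_seconds)
```

Injecting `get_status` as a parameter keeps the loop provider-agnostic and testable with a fake status function.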

Get job status

There is usually an API that allows you to get the status of a transcription job: complete, in-progress, or failed. Sometimes it even provides a progress percentage.
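Since each provider reports status in its own vocabulary, it helps to normalize it early. The raw status strings below are illustrative, not an exhaustive or exact list:

```python
# Illustrative per-provider status vocabularies mapped to our internal states.
STATUS_MAP = {
    "aws": {"QUEUED": "in_progress", "IN_PROGRESS": "in_progress",
            "COMPLETED": "complete", "FAILED": "failed"},
    "ibm": {"waiting": "in_progress", "processing": "in_progress",
            "completed": "complete", "failed": "failed"},
}

def normalize_status(provider: str, raw_status: str) -> str:
    """Map a provider-specific status string to our internal status."""
    return STATUS_MAP.get(provider, {}).get(raw_status, "unknown")
```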

Fetch job result

The actual output of a successful job can be retrieved from the service with a separate API call, though some providers (Google and IBM) return it as part of the same call that reports job status. AWS writes the output to a bucket that we then need to read from.

Process/convert the output

The results returned by the services vary in structure and format, though most of them carry the same information: an array of words, each with a start time, end time, and confidence. We need a step that converts each service's format into our own internal format that the rest of our system can rely on.

To illustrate how the formats differ from service to service, we present sample extracts from three providers.

Google

[
   {
      "startTime":"5s",
      "endTime":"5.400s",
      "word":"what",
      "confidence":0.9416568
   },
   {
      "startTime":"5.400s",
      "endTime":"5.500s",
      "word":"is",
      "confidence":0.92425334
   },
   {
      "startTime":"5.500s",
      "endTime":"5.800s",
      "word":"really",
      "confidence":0.82757676
   },
]

AWS

[
   {
      "start_time":"39.3",
      "end_time":"39.88",
      "alternatives":[
         {
            "confidence":"1.0",
            "content":"Creatively"
         }
      ],
      "type":"pronunciation"
   },
   {
      "start_time":"39.88",
      "end_time":"40.26",
      "alternatives":[
         {
            "confidence":"1.0",
            "content":"speaking"
         }
      ],
      "type":"pronunciation"
   },
   {
      "alternatives":[
         {
            "confidence":null,
            "content":","
         }
      ],
      "type":"punctuation"
   },
]

IBM

[
   [
      "and",
      0.19,
      0.32
   ],
   [
      "I",
      0.32,
      0.39
   ],
   [
      "want",
      0.39,
      0.69
   ],
]
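A conversion step for the three extracts above might look like this. This is a sketch assuming the structures shown; the internal field names (`word`, `start`, `end`, `confidence`) are our own:

```python
def from_google(words: list) -> list:
    """Google: times are strings like '5.400s'."""
    return [{"word": w["word"],
             "start": float(w["startTime"].rstrip("s")),
             "end": float(w["endTime"].rstrip("s")),
             "confidence": w["confidence"]} for w in words]

def from_aws(items: list) -> list:
    """AWS: times and confidences are strings; punctuation items have no times."""
    out = []
    for item in items:
        if item["type"] != "pronunciation":
            continue  # skip punctuation-only items
        alt = item["alternatives"][0]
        out.append({"word": alt["content"],
                    "start": float(item["start_time"]),
                    "end": float(item["end_time"]),
                    "confidence": float(alt["confidence"])})
    return out

def from_ibm(timestamps: list) -> list:
    """IBM: each entry is [word, start, end]; confidence comes from a
    separate part of the response, so it is left unset here."""
    return [{"word": w, "start": s, "end": e, "confidence": None}
            for w, s, e in timestamps]
```

After this step, the rest of the pipeline only ever sees the internal format, regardless of which provider produced the transcript.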

Clean up

This step is for any necessary housekeeping; e.g., we may need to delete the audio file uploaded to the provider’s bucket so it does not consume space unnecessarily.

Conclusion

Having a unified framework really simplifies adding more transcription providers. At Craftsmen, we were able to successfully integrate these services following this approach.

Key benefits of our unified framework include:

Scalability: We can expand our capabilities in response to rising demand or new technologies because the framework makes it easy to add new transcription providers.

Maintenance: Centralizing integration logic and standardizing APIs make updating and maintenance easier. This approach makes the development cycle more agile and lowers the risk of compatibility problems.

Efficiency: Rather than juggling various integration techniques, developers can concentrate on adding features and optimizing the user experience. This efficiency translates directly into faster deployment times and quicker response to market demands.

Reliability: By standardizing our integration framework, we can guarantee consistent, reliable performance from various transcription providers. Users benefit from trustworthy transcription services without being impacted by backend updates or changes.

At Craftsmen, we use a methodical approach to successfully integrate a range of transcription services. Guaranteeing seamless interoperability and top performance across platforms required careful planning, rigorous testing, and careful implementation. Our framework places a high value on adaptability and durability, allowing it to preserve a consistent user experience while supporting the distinctive features of services like Zoom Media Speech-to-Text, Microsoft Azure Speech-to-Text, AWS Transcribe, IBM Watson Speech-to-Text, and Google Cloud Speech-to-Text. This unified approach improves dependability, streamlines development and maintenance, and enables us to provide exceptional transcription services that meet a variety of client needs and quickly adapt to industry standards.
