Streaming/Realtime
The Streaming API allows realtime, continuous transcription of content streamed from live or as-live sources. It is currently supported by several vendors for use cases including:
- Realtime recommendations
- Monitoring/alerting/triggering from spoken content
- Live subtitling
- Simultaneous transcription and translation of meetings, events
This API differs from those provided for file-based content: it is exposed as a websocket interface that is used both to send content and to receive the transcription.
Note that different vendors have different requirements on the content that they accept, the supported languages (and punctuation models), as well as any hard-coded or configurable delays in the transcription results.
In general, the longer the delay in receiving the transcription, the better the quality.
API Endpoint
ws://<HOSTNAME>:<PORT>/streamingTranscribe
Where HOSTNAME and PORT are the hostname and port of the service running the streaming API.
Parameters
- provider (required) - Named vendor to perform the streaming transcription
- language (required) - The language of the content to be transcribed. Note that neither mixed-language content nor automatic language detection is yet supported.
- access_token (required) - The full JWT string acquired from the authentication process
- encoding (required) - The encoding of the audio content, one of: PCM, MULAW, FLAC, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE
- sampleRate (required) - The sample rate of the audio content, expressed as an integer in samples per second (e.g. 8 kHz -> 8000)
- channels (required) - The number of channels in the audio content
If there is an error with the initial request, the connection will be closed with an error code of 1003.
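Assuming the parameters are supplied as query-string values on the websocket URL (an assumption; the exact mechanism is deployment-specific), the connection URL can be built like this:

```typescript
// Sketch: constructing the streaming endpoint URL with the required
// parameters. Passing them as a query string is an assumption.
function buildStreamingUrl(
  hostname: string,
  port: number,
  params: {
    provider: string;
    language: string;
    access_token: string;
    encoding: string;
    sampleRate: number;
    channels: number;
  }
): string {
  const query = new URLSearchParams({
    provider: params.provider,
    language: params.language,
    access_token: params.access_token,
    encoding: params.encoding,
    sampleRate: String(params.sampleRate), // e.g. 8 kHz -> 8000
    channels: String(params.channels),
  });
  return `ws://${hostname}:${port}/streamingTranscribe?${query.toString()}`;
}
```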
Connecting and Using
Communication happens over a websocket that allows only binary messages to be exchanged. Messages may be sent from the client and received from the server. The sequence of events to establish the message flow is as follows:
1. Client opens a websocket connection to the streaming API
2. Client starts to buffer its audio source
3. The streaming API starts its vendor connection
4. Client receives a message from the streaming API (0x01) indicating the number of bytes required
5. Client replies with one or more messages (0x01) containing byte chunks of contiguous audio content to be transcribed, up to the number of requested bytes
6. Client receives a message from the streaming API (0x02) containing the JSON response
7. If the Client has further content, go to step 5
The client may also receive messages (0x03) in the event of an error captured by the streaming API.
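The server side of this loop can be sketched as a pure frame dispatcher. Big-endian byte order for the 4-byte length in the 0x01 payload is an assumption; the tables below specify only field sizes.

```typescript
// Sketch: dispatching on the 1-byte event type of a server frame.
// Byte order of the 0x01 length field is assumed big-endian.
type Action =
  | { kind: "send_audio"; bytesRequested: number }
  | { kind: "transcription"; items: unknown }
  | { kind: "error"; message: string };

function handleServerFrame(frame: Uint8Array): Action {
  const eventType = frame[0];
  const payload = frame.subarray(1);
  switch (eventType) {
    case 0x01: {
      // 4-byte unsigned integer: number of audio bytes requested
      const view = new DataView(payload.buffer, payload.byteOffset, 4);
      return { kind: "send_audio", bytesRequested: view.getUint32(0) };
    }
    case 0x02: // UTF-8 JSON transcription payload
      return {
        kind: "transcription",
        items: JSON.parse(new TextDecoder().decode(payload)),
      };
    case 0x03: // UTF-8 plain-text error payload
      return { kind: "error", message: new TextDecoder().decode(payload) };
    default:
      throw new Error(`Unknown event type 0x${eventType.toString(16)}`);
  }
}
```

A client would call this on every incoming binary message and react to the returned action (send audio, render items, or abort).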
Messages
Messages are sent/received as byte arrays with the following structure:
- Event type (1 byte)
- Payload (n bytes)
Event 0x01
The meaning and structure of this event depends on whether it is being sent from the client or the streaming API, as indicated below.
Sent from client
The client sends an audio segment to the backend.
| Event type | Payload (audio byte segment with headers) |
| --- | --- |
| 0x01 | max 8192 bytes |
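A client sending buffered audio must respect the 8192-byte payload ceiling; one way to do so is to split the buffer into frames:

```typescript
// Sketch: splitting buffered audio into 0x01 frames, each holding
// at most 8192 payload bytes plus the 1-byte event type.
const MAX_PAYLOAD = 8192;

function chunkAudio(audio: Uint8Array): Uint8Array[] {
  const frames: Uint8Array[] = [];
  for (let offset = 0; offset < audio.length; offset += MAX_PAYLOAD) {
    const payload = audio.subarray(offset, offset + MAX_PAYLOAD);
    const frame = new Uint8Array(1 + payload.length);
    frame[0] = 0x01; // client audio event type
    frame.set(payload, 1);
    frames.push(frame);
  }
  return frames;
}
```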
Received from streaming API
The API requests a number of bytes to be sent.
| Event type | Number of bytes requested by the backend (Unsigned Integer) |
| --- | --- |
| 0x01 | 4 bytes |
Event 0x02
The streaming API responds to the client with the transcription within a JSON container.
| Event type | Payload (application/json encoded in UTF-8) |
| --- | --- |
| 0x02 | max 8192 bytes |
Event 0x03
In case of errors or unhandled exceptions, the streaming API will send the client an indication of the issue.
| Event type | Payload (text/plain encoded in UTF-8) |
| --- | --- |
| 0x03 | max 8192 bytes |
Transcription Response
The JSON response is an array of items, each structured as follows:
export interface TranscriptionItem {
  type: "text" | "punctuation",
  text: string,
  start: number,
  end: number,
  speakerID: string | null,
  partial: boolean
}
With an example of a message shown below:
[
  {
    "type": "text",
    "text": "Hello there.",
    "start": 0.39,
    "end": 0.6,
    "speakerID": null,
    "partial": true
  }
]
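A consumer might assemble display text from these items. The sketch below (repeating the interface above so it is self-contained) skips partial items and attaches punctuation without a leading space; that join rule is an assumption, not something the API specifies.

```typescript
// Sketch: building display text from TranscriptionItems. Partial
// items are skipped on the assumption a finalised version follows;
// punctuation items are appended without a leading space.
interface TranscriptionItem {
  type: "text" | "punctuation";
  text: string;
  start: number;
  end: number;
  speakerID: string | null;
  partial: boolean;
}

function assembleText(items: TranscriptionItem[]): string {
  let out = "";
  for (const item of items) {
    if (item.partial) continue; // wait for the finalised item
    if (item.type === "punctuation" || out === "") {
      out += item.text;
    } else {
      out += " " + item.text;
    }
  }
  return out;
}
```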
TODO: add parameter detail and flow of partial messages