Streaming/Realtime
The Streaming API allows realtime, continuous transcription of content streamed from live or as-live sources. It is currently supported by several vendors for use cases including:
- Realtime recommendations
- Monitoring/alerting/triggering from spoken content
- Live subtitling
- Simultaneous transcription and translation of meetings, events
This API differs from those provided for file-based content: it is exposed as a websocket interface that is used both to send content and to receive the transcription.
Note that different vendors have different requirements on the content that they accept, the supported languages (and punctuation models), as well as any hard-coded or configurable delays in the transcription results.
In general, the longer the delay in receiving the transcription, the better the quality.
API Endpoint
ws://<HOSTNAME>:<PORT>/streamingTranscribe
Where HOSTNAME and PORT are the hostname and port of the service running the streaming API.
Parameters
- provider (required) - Named vendor to perform the streaming transcription
- language (required) - The language of the content to be transcribed. Note that neither mixed-language content nor automatic language detection is yet supported.
- access_token (required) - The full JWT string acquired from the authentication process
- encoding (required) - The encoding of the audio content, one of: PCM, MULAW, FLAC, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE
- sampleRate (required) - The sample rate of the audio content, expressed as an integer in samples per second (e.g. 8 kHz -> 8000)
- channels (required) - The number of channels in the audio content
If there is an error with the initial request, the connection will be closed with an error code of 1003.
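Assuming the parameters are supplied as query-string values on the websocket URL (an assumption; the exact mechanism is deployment-specific), the connection URL can be built like this:

```typescript
// Sketch: constructing the streaming endpoint URL with the required
// parameters. Passing them as a query string is an assumption.
function buildStreamingUrl(
  hostname: string,
  port: number,
  params: {
    provider: string;
    language: string;
    access_token: string;
    encoding: string;
    sampleRate: number;
    channels: number;
  }
): string {
  const query = new URLSearchParams({
    provider: params.provider,
    language: params.language,
    access_token: params.access_token,
    encoding: params.encoding,
    sampleRate: String(params.sampleRate), // e.g. 8 kHz -> 8000
    channels: String(params.channels),
  });
  return `ws://${hostname}:${port}/streamingTranscribe?${query.toString()}`;
}
```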
Connecting and Using
Communication happens over a websocket that allows only binary messages to be exchanged. Messages may be sent from the client and received from the server. The sequence of events to establish the message flow is as follows:
1. Client opens a websocket connection to the streaming API
2. Client starts to buffer its audio source
3. The streaming API starts its vendor connection
4. Client receives a message from the streaming API (0x01) indicating the number of bytes required
5. Client replies with one or more messages (0x01) containing byte chunks of contiguous audio content to be transcribed, up to the number of requested bytes
6. Client receives a message from the streaming API (0x02) containing the JSON response
7. If the Client has further content, go to step 5
The client may also receive messages (0x03) in the event of an error captured by the streaming API.
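The server side of this loop can be sketched as a pure frame dispatcher. Big-endian byte order for the 4-byte length in the 0x01 payload is an assumption; the tables below specify only field sizes.

```typescript
// Sketch: dispatching on the 1-byte event type of a server frame.
// Byte order of the 0x01 length field is assumed big-endian.
type Action =
  | { kind: "send_audio"; bytesRequested: number }
  | { kind: "transcription"; items: unknown }
  | { kind: "error"; message: string };

function handleServerFrame(frame: Uint8Array): Action {
  const eventType = frame[0];
  const payload = frame.subarray(1);
  switch (eventType) {
    case 0x01: {
      // 4-byte unsigned integer: number of audio bytes requested
      const view = new DataView(payload.buffer, payload.byteOffset, 4);
      return { kind: "send_audio", bytesRequested: view.getUint32(0) };
    }
    case 0x02: // UTF-8 JSON transcription payload
      return {
        kind: "transcription",
        items: JSON.parse(new TextDecoder().decode(payload)),
      };
    case 0x03: // UTF-8 plain-text error payload
      return { kind: "error", message: new TextDecoder().decode(payload) };
    default:
      throw new Error(`Unknown event type 0x${eventType.toString(16)}`);
  }
}
```

A client would call this on every incoming binary message and react to the returned action (send audio, render items, or abort).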
Messages
Messages are sent/received as byte arrays with the following structure:
- Event type (1 byte)
- Payload (n bytes)
Event 0x01
The meaning and structure of this event depends on whether it is being sent from the client or the streaming API, as indicated below.
Sent from client
The client sends an audio segment to the backend.
| Event type | Payload (audio byte segment with headers) |
| --- | --- |
| 0x01 | max 8192 bytes |
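A client sending buffered audio must respect the 8192-byte payload ceiling; one way to do so is to split the buffer into frames:

```typescript
// Sketch: splitting buffered audio into 0x01 frames, each holding
// at most 8192 payload bytes plus the 1-byte event type.
const MAX_PAYLOAD = 8192;

function chunkAudio(audio: Uint8Array): Uint8Array[] {
  const frames: Uint8Array[] = [];
  for (let offset = 0; offset < audio.length; offset += MAX_PAYLOAD) {
    const payload = audio.subarray(offset, offset + MAX_PAYLOAD);
    const frame = new Uint8Array(1 + payload.length);
    frame[0] = 0x01; // client audio event type
    frame.set(payload, 1);
    frames.push(frame);
  }
  return frames;
}
```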
Received from streaming API
The API requests a number of bytes to be sent.
| Event type | Number of bytes requested by the backend (Unsigned Integer) |
| --- | --- |
| 0x01 | 4 bytes |
Event 0x02
The streaming API responds to the client with the transcription within a JSON container.
| Event type | Payload (application/json encoded in UTF-8) |
| --- | --- |
| 0x02 | max 8192 bytes |
Event 0x03
In case of errors or unhandled exceptions, the streaming API will send the client an indication of the issue.
| Event type | Payload (text/plain encoded in UTF-8) |
| --- | --- |
| 0x03 | max 8192 bytes |
Transcription Response
The JSON response is an array of items, each structured as follows:
export interface TranscriptionItem {
  type: "text" | "punctuation",
  text: string,
  start: number,
  end: number,
  speakerID: string | null,
  partial: boolean
}
With an example of a message shown below:
[
  {
    "type": "text",
    "text": "Hello there.",
    "start": 0.39,
    "end": 0.6,
    "speakerID": null,
    "partial": true
  }
]
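A consumer might assemble display text from these items. The sketch below (repeating the interface above so it is self-contained) skips partial items and attaches punctuation without a leading space; that join rule is an assumption, not something the API specifies.

```typescript
// Sketch: building display text from TranscriptionItems. Partial
// items are skipped on the assumption a finalised version follows;
// punctuation items are appended without a leading space.
interface TranscriptionItem {
  type: "text" | "punctuation";
  text: string;
  start: number;
  end: number;
  speakerID: string | null;
  partial: boolean;
}

function assembleText(items: TranscriptionItem[]): string {
  let out = "";
  for (const item of items) {
    if (item.partial) continue; // wait for the finalised item
    if (item.type === "punctuation" || out === "") {
      out += item.text;
    } else {
      out += " " + item.text;
    }
  }
  return out;
}
```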
TODO: add parameter detail and flow of partial messages