Live speech-to-text and AI insights on local server - Streaming API

This guide shows you how to use Symbl.ai's JavaScript SDK to capture audio from your device's microphone and process it. The example is built to run on Mac or Windows PCs. You will learn how to use Symbl's API for speech-to-text transcription and real-time AI insights, such as follow-ups, action items, topics and questions.

Throughout the guide you'll find various references to these variable names, which you will have to replace with your values:

Key             Description
appId           The application ID you get from the home page of the platform.
appSecret       The application secret you get from the home page of the platform.
emailAddress    The email address you wish to send the summary email to. The summary email summarizes the conversation and any conversational insights gained from it.

View the full example on GitHub

Getting started

To get this example running, you need to install the node packages @symblai/symbl-js, uuid and mic. You can do that with npm install @symblai/symbl-js, npm install uuid and npm install mic. We're using mic to capture audio from the microphone and pass it on to the WebSocket connection.

mic also requires you to install SoX. To install SoX, choose the option that fits your operating system:

Mac: brew install sox

Windows and Linux: Installation of SoX on different Platforms

First, require the SDK:

const {sdk} = require('@symblai/symbl-js');

Here is a simple setup for mic. You can view the full configuration options for mic here:

const mic = require('mic');
const sampleRateHertz = 16000;
const micInstance = mic({
  rate: sampleRateHertz,
  channels: '1',
  debug: false,
  exitOnSilence: 6
});

Initialize SDK

You can get the appId and appSecret values from the Symbl Platform.

(async () => {
  try {
    await sdk.init({
      appId: appId,
      appSecret: appSecret,
      basePath: 'https://api.symbl.ai'
    })
  } catch (e) {
    console.error('Error initializing the SDK: ', e)
  }
})()

You will also need a unique ID to associate with your Symbl request. You can create this ID using the uuid package:

const uuid = require('uuid').v4;
const id = uuid();

Real-time Request Configuration Options

Now you can start the connection using sdk.startRealtimeRequest. You will need to create a configuration object for the connection:

const connection = await sdk.startRealtimeRequest(configurationObject);

Here is the breakdown of the configuration types:

Insight Types (insightTypes)

  • insightTypes - This array specifies the types of insights to be detected. Currently, the supported types are action_item and question.
{
  insightTypes: ['action_item', 'question']
}

Action Item (action_item)

An action item is a specific outcome recognized in the conversation that requires one or more people in the conversation to act in the future. Action items will be returned via the onInsightResponse callback.

These actions can be definitive and owned, with a commitment to working on a presentation, sharing a file, completing a task, etc., or they can be non-definitive, like an idea, suggestion or opinion that could be worked upon.

All action items are generated with action phrases, assignees and due dates so that you can build workflow automation with your tools.

Action Item JSON Response Example

This is an example of an action_item returned via the onInsightResponse callback function.

[{
  "id": "94020eb9-b688-4d56-945c-a7e5282258cc",
  "confidence": 0.9909798145016999,
  "messageReference": {
    "id": "94020eb9-b688-4d56-945c-a7e5282258cc"
  },
  "hints": [{
    "key": "informationScore",
    "value": "0.9782608695652174"
  }, {
    "key": "confidenceScore",
    "value": "0.9999962500210938"
  }, {
    "key": "comprehensionScore",
    "value": "0.9983848333358765"
  }],
  "type": "action_item",
  "assignee": {
    "id": "e2c5acf8-b9ed-421a-b3b3-02a5ae9796a0",
    "name": "John Doe",
    "userId": "emailAddress"
  },
  "dueBy": {
    "value": "2021-02-05T00:00:00-07:00"
  },
  "tags": [{
    "type": "date",
    "text": "today",
    "beginOffset": 39,
    "value": {
      "value": {
        "datetime": "2021-02-05"
      }
    }
  }, {
    "type": "person",
    "text": "John Doe",
    "beginOffset": 8,
    "value": {
      "value": {
        "name": "John Doe",
        "id": "e2c5acf8-b9ed-421a-b3b3-02a5ae9796a0",
        "assignee": true,
        "userId": "emailAddress"
      }
    }
  }],
  "dismissed": false,
  "payload": {
    "content": "Perhaps John Doe can submit the report today.",
    "contentType": "text/plain"
  },
  "from": {
    "id": "e2c5acf8-b9ed-421a-b3b3-02a5ae9796a0",
    "name": "John Doe",
    "userId": "emailAddress"
  }
}]
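
Here is a minimal sketch of how such a payload might be used inside the onInsightResponse callback. It assumes the array shape shown above; the variable names are illustrative, and some fields (such as dueBy) will only be present when the API detects them.

onInsightResponse: (insights) => {
  // insights is an array shaped like the action_item example above.
  insights
    .filter((insight) => insight.type === 'action_item')
    .forEach((insight) => {
      const actionPhrase = insight.payload && insight.payload.content
      const assignee = insight.assignee && insight.assignee.name
      const dueBy = insight.dueBy && insight.dueBy.value
      console.log('Action item:', actionPhrase)
      console.log('  assignee:', assignee || 'unassigned')
      console.log('  due by:', dueBy || 'no due date detected')
    })
}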

Question (question)

The API will find explicit questions or requests for information that come up during the conversation. Questions will be returned via the onInsightResponse callback.

Question JSON Response Example

This is an example of a question returned via the onInsightResponse callback function.

[
  {
    "id": "5a1fc496-bdda-4496-93cc-ef9714a63b1b",
    "confidence": 0.9677371919681392,
    "messageReference": {
      "id": "541b6de9-1d0d-40af-a506-54fdf52b996d"
    },
    "hints": [
      {
        "key": "confidenceScore",
        "value": "0.9998153329948111"
      },
      {
        "key": "comprehensionScore",
        "value": "0.9356590509414673"
      }
    ],
    "type": "question",
    "assignee": {
      "id": "7a717fc4-f292-4f26-88d3-ed63440e1f91",
      "name": "John Doe",
      "userId": "EMAIL_ADDRESS"
    },
    "tags": [],
    "dismissed": false,
    "payload": {
      "content": "How much will all of this cost?",
      "contentType": "text/plain"
    },
    "from": {
      "id": "7a717fc4-f292-4f26-88d3-ed63440e1f91",
      "name": "John Doe",
      "userId": "EMAIL_ADDRESS"
    }
  }
]
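
Because both action items and questions arrive through the same onInsightResponse callback, you will typically branch on the type field. The sketch below (assuming the array shapes shown in the two examples above) collects the detected questions so they can be reviewed after the connection stops; the detectedQuestions array is an illustrative addition, not part of the SDK.

// Illustrative: collect questions so they can be printed once the
// connection has been stopped.
const detectedQuestions = []

// ...inside the handlers object passed to sdk.startRealtimeRequest:
onInsightResponse: (insights) => {
  insights
    .filter((insight) => insight.type === 'question')
    .forEach((insight) => detectedQuestions.push(insight.payload.content))
}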

Config (config)

config: {
  meetingTitle: 'My Test Meeting',
  confidenceThreshold: 0.7,
  timezoneOffset: 480, // Offset in minutes from UTC
  languageCode: 'en-US',
  sampleRateHertz
},
  • config: This configuration object encapsulates the properties which directly relate to the conversation generated by the audio being passed.

    • meetingTitle: This optional parameter specifies the name of the conversation that is generated. You can get more info on conversations here.

    • confidenceThreshold: This optional parameter specifies the confidence threshold for detecting insights. Only insights with a confidenceScore greater than this value will be returned.

    • timezoneOffset: This specifies the timezone offset, in minutes from UTC, used for detecting time/date-related entities (a sketch of deriving this value from the local machine follows this list).

    • languageCode: This specifies the language used for transcribing the audio, in BCP-47 format. It needs to match the language in which the audio is spoken.

    • sampleRateHertz: This specifies the sample rate of the audio stream.
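
If you prefer to derive timezoneOffset from the machine running the code rather than hard-coding it, JavaScript's built-in Date.prototype.getTimezoneOffset() returns the offset in minutes with the same sign as the hard-coded 480 used in this guide (positive when local time is behind UTC, e.g. 480 for UTC-8). This is a sketch rather than a documented requirement, so verify the sign convention against your own results.

const config = {
  meetingTitle: 'My Test Meeting',
  confidenceThreshold: 0.7,
  // getTimezoneOffset() returns minutes, positive when local time is behind UTC,
  // e.g. 480 on a machine set to UTC-8.
  timezoneOffset: new Date().getTimezoneOffset(),
  languageCode: 'en-US',
  sampleRateHertz
}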

Speaker (speaker)

speaker: { // Optional
  userId: 'user-identifier',
  name: 'My name'
},

speaker: Optionally specify the details of the speaker whose data is being passed in the stream.

Handlers (handlers)

handlers: {
  /**
   * This will return live speech-to-text transcription of the call.
   */
  onSpeechDetected: (data) => {
    if (data) {
      const {punctuated} = data
      console.log('Live: ', punctuated && punctuated.transcript)
      console.log('');
    }
    console.log('onSpeechDetected ', JSON.stringify(data, null, 2));
  },
  /**
   * When processed messages are available, this callback will be called.
   */
  onMessageResponse: (data) => {
    console.log('onMessageResponse', JSON.stringify(data, null, 2))
  },
  /**
   * When Symbl detects an insight, this callback will be called.
   */
  onInsightResponse: (data) => {
    console.log('onInsightResponse', JSON.stringify(data, null, 2))
  },
  /**
   * When Symbl detects a topic, this callback will be called.
   */
  onTopicResponse: (data) => {
    console.log('onTopicResponse', JSON.stringify(data, null, 2))
  }
}
  • handlers: This object contains the callback functions for the different events.

    • onSpeechDetected: To retrieve the real-time transcription results as soon as they are detected. You can use this callback to render live transcription which is specific to the speaker of this audio stream.

    onSpeechDetected JSON Response Example

    {
      "type": "recognition_result",
      "isFinal": true,
      "payload": {
        "raw": {
          "alternatives": [{
            "words": [{
              "word": "Hello",
              "startTime": {
                "seconds": "3",
                "nanos": "800000000"
              },
              "endTime": {
                "seconds": "4",
                "nanos": "200000000"
              }
            }, {
              "word": "world.",
              "startTime": {
                "seconds": "4",
                "nanos": "200000000"
              },
              "endTime": {
                "seconds": "4",
                "nanos": "800000000"
              }
            }],
            "transcript": "Hello world.",
            "confidence": 0.9128385782241821
          }]
        }
      },
      "punctuated": {
        "transcript": "Hello world."
      },
      "user": {
        "userId": "emailAddress",
        "name": "John Doe",
        "id": "23681108-355b-4fc3-9d94-ed47dd39fa56"
      }
    }
    
    • onMessageResponse: This callback provides the "finalized" transcription data for this speaker; when used with multiple streams from other speakers, it also provides their messages.
      "Finalized" means the automatic speech recognition has settled this part of the transcription and declared it final, so it will be more accurate than onSpeechDetected. A sketch that accumulates these finalized messages into a transcript appears after this list.

    onMessageResponse JSON Response Example

    [{
      "from": {
        "id": "0a7a36b1-047d-4d8c-8958-910317ed9edc",
        "name": "John Doe",
        "userId": "emailAddress"
      },
      "payload": {
        "content": "Hello world.",
        "contentType": "text/plain"
      },
      "id": "59c224c2-54c5-4762-9582-961bf250b478",
      "channel": {
        "id": "realtime-api"
      },
      "metadata": {
        "disablePunctuation": true,
        "timezoneOffset": 480,
        "originalContent": "Hello world.",
        "words": "[{\"word\":\"Hello\",\"startTime\":\"2021-02-04T20:34:59.029Z\",\"endTime\":\"2021-02-04T20:34:59.429Z\"},{\"word\":\"world.\",\"startTime\":\"2021-02-04T20:34:59.429Z\",\"endTime\":\"2021-02-04T20:35:00.029Z\"}]",
        "originalMessageId": "59c224c2-54c5-4762-9582-961bf250b478"
      },
      "dismissed": false,
      "duration": {
        "startTime": "2021-02-04T20:34:59.029Z",
        "endTime": "2021-02-04T20:35:00.029Z"
      }
    }]
    
    • onInsightResponse: This callback provides any detected insights in real time as they are detected. As with onMessageResponse, it also returns every speaker's insights when multiple streams are used.

      View the examples for onInsightResponse here.

    • onTrackerResponse: This callback provides any detected trackers in real time as they are detected. As with onMessageResponse, it also returns every tracker when multiple streams are used.

    onTrackerResponse JSON Response Example

    [
      {
        "id": "4527907378937856",
        "name": "My Awesome Tracker",
        "matches": [
          {
            "messageRefs": [
              {
                "id": "4670860273123328",
                "text": "Wearing mask is a good safety measure.",
                "offset": -1
              }
            ],
            "type": "vocabulary",
            "value": "wear mask",
            "insightRefs": []
          }
        ]
      }
    ]
    
    • onTopicResponse: This callback provides any detected topics in real time as they are detected. As with onMessageResponse, it also returns every topic when multiple streams are used.

    onTopicResponse JSON Response Example

    [{
      "id": "e69a5556-6729-11eb-ab14-2aee2deabb1b",
      "messageReferences": [{
        "id": "0df44422-0248-47e9-8814-e87f63404f2c",
        "relation": "text instance"
      }],
      "phrases": "auto insurance",
      "rootWords": [{
        "text": "auto"
      }],
      "score": 0.9,
      "type": "topic"
    }]
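
As promised above, here is a minimal sketch of an onMessageResponse handler that accumulates the finalized transcript instead of only logging it. The finalTranscript array and the speaker/text/startTime fields pulled from each message are illustrative, based on the onMessageResponse example shown earlier; they are not part of the SDK itself.

// Illustrative: collect finalized messages so the full transcript can be
// assembled after the connection stops.
const finalTranscript = []

// ...inside the handlers object passed to sdk.startRealtimeRequest:
onMessageResponse: (messages) => {
  // messages is an array shaped like the onMessageResponse example above.
  messages.forEach((message) => {
    finalTranscript.push({
      speaker: message.from && message.from.name,
      text: message.payload && message.payload.content,
      startTime: message.duration && message.duration.startTime
    })
  })
}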
    

Full Configuration Object

const connection = await sdk.startRealtimeRequest({
  id,
  insightTypes: ['action_item', 'question'],
  config: {
    meetingTitle: 'My Test Meeting',
    confidenceThreshold: 0.7,
    timezoneOffset: 480, // Offset in minutes from UTC
    languageCode: 'en-US',
    sampleRateHertz
  },
  speaker: {
    // Optional, if not specified, will simply not send an email in the end.
    userId: 'emailAddress', // Update with valid email
    name: 'My name'
  },
  handlers: {
    /**
     * This will return live speech-to-text transcription of the call.
     */
    onSpeechDetected: (data) => {
      if (data) {
        const {punctuated} = data
        console.log('Live: ', punctuated && punctuated.transcript)
        console.log('');
      }
      console.log('onSpeechDetected ', JSON.stringify(data, null, 2));
    },
    /**
     * When processed messages are available, this callback will be called.
     */
    onMessageResponse: (data) => {
      console.log('onMessageResponse', JSON.stringify(data, null, 2))
    },
    /**
     * When Symbl detects an insight, this callback will be called.
     */
    onInsightResponse: (data) => {
      console.log('onInsightResponse', JSON.stringify(data, null, 2))
    },
    /**
     * When Symbl detects a topic, this callback will be called.
     */
    onTopicResponse: (data) => {
      console.log('onTopicResponse', JSON.stringify(data, null, 2))
    },
    /**
     * When trackers are detected, this callback will be called.
     */
    onTrackerResponse: (data) => {
      console.log('onTrackerResponse', JSON.stringify(data, null, 2))
    },
  }
});

Handle the audio stream

The connection to the WebSocket should now be established. Next, attach several handlers to the microphone's audio stream to push audio to the connection and respond to stream events. You can view all the valid handlers here:

const micInputStream = micInstance.getAudioStream()
/** Raw audio stream */
micInputStream.on('data', (data) => {
  // Push audio from Microphone to websocket connection
  connection.sendAudio(data)
})

micInputStream.on('error', function (err) {
  console.log('Error in Input Stream: ' + err)
})

micInputStream.on('startComplete', function () {
  console.log('Started listening to Microphone.')
})

micInputStream.on('silence', function () {
  console.log('Got SIGNAL silence')
})

Process speech using the device's microphone

Now you start the recording:

micInstance.start()

Your microphone is now open, and its input is sent to the WebSocket for processing. The microphone will continue to accept input until the application is stopped or until you tell the connection to stop:

/**
 * Stop connection after 1 minute i.e. 60 secs
 */
setTimeout(async () => {
  // Stop listening to microphone
  micInstance.stop()
  console.log('Stopped listening to Microphone.')
  try {
    // Stop connection
    await connection.stop()
    console.log('Connection Stopped.')
  } catch (e) {
    console.error('Error while stopping the connection.', e)
  }
}, 60 * 1000)
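
If you would rather stop when you press Ctrl+C than after a fixed minute, a sketch using Node's process.on('SIGINT') looks like this (micInstance and connection are the objects created above):

// Illustrative alternative: stop on Ctrl+C instead of a fixed timeout.
process.on('SIGINT', async () => {
  // Stop listening to microphone
  micInstance.stop()
  console.log('Stopped listening to Microphone.')
  try {
    // Stop connection
    await connection.stop()
    console.log('Connection Stopped.')
  } catch (e) {
    console.error('Error while stopping the connection.', e)
  }
  process.exit(0)
})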

Test

To verify that the code is working, run your code:

$ node index.js

Grabbing the Conversation ID

The Conversation ID is very useful for Symbl's other APIs, such as the Conversation API. This example doesn't rely on it, because the Conversation API is mainly used for non-real-time data gathering, but it's good to know how to grab it: you can use the Conversation ID later to extract the conversation insights again.

const conversationId = connection.conversationId

With the Conversation ID you can do each of the following (and more!):

View conversation topics

Summary topics provide a quick overview of the key things that were talked about in the conversation.

View action items

An action item is a specific outcome recognized in the conversation that requires one or more people in the conversation to take a specific action, e.g. set up a meeting, share a file, complete a task, etc.

View follow-ups

This is a category of action items with a connotation to follow-up a request or a task like sending an email or making a phone call or booking an appointment or setting up a meeting.
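
As a sketch of what that looks like in code, the snippet below fetches the conversation's topics with the global fetch available in Node 18+. The accessToken variable is a placeholder for a token you have already generated, and the endpoint shown is the standard Conversation API topics route; check the Conversation API reference for the exact endpoints and response shape available to your account.

// Illustrative only: fetch topics for the finished conversation.
// Run this inside an async function; accessToken is a placeholder.
const conversationId = connection.conversationId

const response = await fetch(
  `https://api.symbl.ai/v1/conversations/${conversationId}/topics`,
  { headers: { Authorization: `Bearer ${accessToken}` } }
)
const { topics } = await response.json()
console.log('Topics:', topics.map((topic) => topic.text))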

Full Code Sample

Here's the full code sample, which you can also view on GitHub:

const {sdk} = require('@symblai/symbl-js')
const uuid = require('uuid').v4

// For demo purposes, we're using mic to simply get audio from the microphone and pass it on to the WebSocket connection
const mic = require('mic')

const sampleRateHertz = 16000

const micInstance = mic({
  rate: sampleRateHertz,
  channels: '1',
  debug: false,
  exitOnSilence: 6,
});

(async () => {
  try {
    // Initialize the SDK
    await sdk.init({
      appId: appId,
      appSecret: appSecret,
      basePath: 'https://api.symbl.ai',
    })

    // Need unique Id
    const id = uuid()

    // Start Real-time Request (Uses Real-time WebSocket API behind the scenes)
    const connection = await sdk.startRealtimeRequest({
      id,
      insightTypes: ['action_item', 'question'],
      config: {
        meetingTitle: 'My Test Meeting',
        confidenceThreshold: 0.7,
        timezoneOffset: 480, // Offset in minutes from UTC
        languageCode: 'en-US',
        sampleRateHertz
      },
      speaker: {
        // Optional, if not specified, will simply not send an email in the end.
        userId: 'emailAddress', // Update with valid email
        name: 'My name'
      },
      handlers: {
        /**
         * This will return live speech-to-text transcription of the call.
         */
        onSpeechDetected: (data) => {
          if (data) {
            const {punctuated} = data
            console.log('Live: ', punctuated && punctuated.transcript)
            console.log('');
          }
          console.log('onSpeechDetected ', JSON.stringify(data, null, 2));
        },
        /**
         * When processed messages are available, this callback will be called.
         */
        onMessageResponse: (data) => {
          console.log('onMessageResponse', JSON.stringify(data, null, 2))
        },
        /**
         * When Symbl detects an insight, this callback will be called.
         */
        onInsightResponse: (data) => {
          console.log('onInsightResponse', JSON.stringify(data, null, 2))
        },
        /**
         * When Symbl detects a topic, this callback will be called.
         */
        onTopicResponse: (data) => {
          console.log('onTopicResponse', JSON.stringify(data, null, 2))
        }
      }
    });
    console.log('Successfully connected. Conversation ID: ', connection.conversationId);

    const micInputStream = micInstance.getAudioStream()
    /** Raw audio stream */
    micInputStream.on('data', (data) => {
      // Push audio from Microphone to websocket connection
      connection.sendAudio(data)
    })

    micInputStream.on('error', function (err) {
      console.log('Error in Input Stream: ' + err)
    })

    micInputStream.on('startComplete', function () {
      console.log('Started listening to Microphone.')
    })

    micInputStream.on('silence', function () {
      console.log('Got SIGNAL silence')
    })

    micInstance.start()

    setTimeout(async () => {
      // Stop listening to microphone
      micInstance.stop()
      console.log('Stopped listening to Microphone.')
      try {
        // Stop connection
        await connection.stop()
        console.log('Connection Stopped.')
      } catch (e) {
        console.error('Error while stopping the connection.', e)
      }
    }, 60 * 1000) // Stop connection after 1 minute i.e. 60 secs
  } catch (e) {
    console.error('Error: ', e)
  }
})();