Chat

The Nebula chat endpoint takes a list of messages comprising a conversation between a human and an assistant, and returns an updated list of messages in which the last message contains the model's response.

Two API endpoints are supported:

  • /v1/model/chat: Returns the generated output in a single HTTP response.
  • /v1/model/chat/streaming: Returns the generated output using Server-Sent Events (SSE), allowing the client to start receiving output progressively as it is being generated.

Request body for /v1/model/chat

curl --location 'https://api-nebula.symbl.ai/v1/model/chat' \
--header 'ApiKey: <your_api_key>' \
--header 'Content-Type: application/json' \
--data '{
    "max_new_tokens": 1024,
    "top_p": 0.95,
    "top_k": 1,
    "system_prompt": "You are a sales coaching assistant. You help user to get better at selling. You are respectful, professional and you always respond politely.",
    "messages": [
        {
            "role": "human",
            "text": "Hi"
        }
    ]
}'

import requests
import json

url = "https://api-nebula.symbl.ai/v1/model/chat"

payload = json.dumps({
  "max_new_tokens": 1024,
  "top_p": 0.95,
  "top_k": 1,
  "system_prompt": "You are a sales coaching assistant. You help user to get better at selling. You are respectful, professional and you always respond politely.",
  "messages": [
    {
      "role": "human",
      "text": "Hi"
    }
  ]
})
headers = {
  'ApiKey': '<your_api_key>',
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)

Request body for /v1/model/chat/streaming

curl --location 'https://api-nebula.symbl.ai/v1/model/chat/streaming' \
--header 'ApiKey: <your_api_key>' \
--header 'Content-Type: application/json' \
--data '{
    "max_new_tokens": 1024,
    "top_p": 0.95,
    "top_k": 1,
    "system_prompt": "You are a sales coaching assistant. You help user to get better at selling. You are respectful, professional and you always respond politely.",
    "messages": [
        {
            "role": "human",
            "text": "Hi"
        }
    ]
}' --no-buffer

import requests
import json

url = "https://api-nebula.symbl.ai/v1/model/chat/streaming"

payload = json.dumps({
  "max_new_tokens": 1024,
  "top_p": 0.95,
  "top_k": 1,
  "system_prompt": "You are a sales coaching assistant. You help user to get better at selling. You are respectful, professional and you always respond politely.",
  "messages": [
    {
      "role": "human",
      "text": "Hi"
    }
  ]
})
headers = {
  'ApiKey': '<your_api_key>',
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload, stream=True)

print(response.text)

Request

Method: POST

URL

Non-streaming endpoint: /v1/model/chat

Streaming endpoint: /v1/model/chat/streaming

Headers

  • Content-Type: must be set to application/json (Content-Type: application/json)
  • ApiKey: must be set to a valid API key (ApiKey: <api_key>)

Body

{
    "model": "<model name>", // Optional
    "system_prompt": "<system prompt here>", // Optional
    "messages": [
      {  // The first message must have role = "human"
          "role": "human",
          "text": "<human message or prompt>"
      },
      { // The following message must have role = "assistant"
          "role": "assistant",
          "text": "<assistant message or response>"
      }
    ],
    "max_new_tokens": 1024,
    "top_p": 1.0,
    "top_k": 1,
    "temperature": 0.0,
    "repetition_penalty": 1.0,
    "return_scores": false
}

model (optional, string): Name of the model. Defaults to nebula-chat-large. If using a custom or fine-tuned model, pass that model's name.

system_prompt (optional, string): The system prompt sets the behavior for the model.

messages (required, List[Message]): List of messages. At least one message must be present. Multiple messages with human and assistant roles can be passed, but the first and last messages in the list must have the role human (see the multi-turn example below).

Structure of Message object

{
  "role": "human" | "assistant", // can be either "human" or "assistant"
  "text": "<text containing message from human or assistant>"
}
  • role (string): Role associated with the message - must be either human or assistant.
  • text (string): The text of the message.
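
For example, a multi-turn conversation is passed as alternating human and assistant messages, with both the first and the last entry using the human role. The sketch below is illustrative; the message texts are placeholders:

messages = [
    {"role": "human", "text": "Hi, I have a discovery call tomorrow. Any tips?"},
    {"role": "assistant", "text": "Research the prospect's business beforehand and prepare open-ended questions."},
    {"role": "human", "text": "How should I open the call?"}
]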

max_new_tokens (optional, int, defaults to 128): The maximum number of tokens to generate, excluding the tokens in the input messages.
stop_sequences (optional, List[string], defaults to []): List of text sequences at which to stop generation; generation stops at the earliest occurrence of any of the sequences. At most 5 strings can be passed, and each string can be at most 50 characters long.
temperature (optional, float, defaults to 0.0): Modulates the next-token probabilities. Values must be between 0.0 and 1.0. Higher values increase the diversity and creativity of the output by increasing the likelihood of selecting lower-probability tokens; lower values make the output more focused and deterministic by mostly selecting higher-probability tokens.
top_p (optional, float, defaults to 0): If set to a value less than 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
top_k (optional, integer, defaults to 0): The number of highest probability vocabulary tokens to keep for top-k filtering.
repetition_penalty (optional, float, defaults to 1.0): The parameter for repetition penalty. 1.0 means no penalty. The maximum value allowed for repetition_penalty is 1.2.
return_scores (optional, boolean, defaults to False): Indicates whether to include token-level scores, which indicate the model's confidence in the response.
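
The sketch below shows a non-streaming request that exercises several of these generation parameters together. The parameter values and the prompt are illustrative only, not recommendations:

import requests
import json

url = "https://api-nebula.symbl.ai/v1/model/chat"

payload = json.dumps({
  "max_new_tokens": 256,
  "temperature": 0.7,               # allow more diverse output
  "top_p": 0.95,
  "top_k": 50,
  "repetition_penalty": 1.1,        # must not exceed 1.2
  "stop_sequences": ["\nHuman:"],   # up to 5 sequences, 50 characters each
  "return_scores": True,            # include token-level log-prob scores in the response
  "messages": [
    {
      "role": "human",
      "text": "Give me three tips for handling pricing objections."
    }
  ]
})
headers = {
  'ApiKey': '<your_api_key>',
  'Content-Type': 'application/json'
}

response = requests.post(url, headers=headers, data=payload)

print(response.json())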

Response

{
    "model": "<model_name>", // model used for generating response
    "messages": [
        {
            "role": "human",
            "text": "<message or prompt by human>"
        },
        { // New message appended to message list from request with model's response
            "role": "assistant",
            "text": "<model's response>"
        }
    ],
    "stats": {
        "input_tokens": <number of input tokens>,
        "output_tokens": <number of output tokens generated>,
        "total_tokens": <total tokens input + output>
    },
    "scores": [
      {
        "token": "<token text>",
        "score": "<logprob score>"
    ]
}

model (string): The model used for generation.

messages (List[Message]): The list of messages from the request, with one additional message appended with role assistant containing the content generated by the model.

stats (dictionary): Represents statistics about the generation process.

  • input_tokens (integer): The number of tokens in the input.
  • output_tokens (integer): The total number of tokens in the output.
  • total_tokens (integer): The total number of tokens processed (input + output tokens).

scores (optional, List[dictionary]): List of token-level scores associated with the generated output tokens.

  • token (string): The generated output token.
  • score (float): The log-prob associated with the generated output token.
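
Continuing from the request sketch under the parameter list above (which set return_scores to true), the assistant's reply, the token statistics, and the optional scores can be read from the parsed response like this:

data = response.json()

# The model's reply is the last message appended to the list
assistant_reply = data["messages"][-1]["text"]
print(assistant_reply)

# Token usage statistics
stats = data["stats"]
print(stats["input_tokens"], "input +", stats["output_tokens"], "output =", stats["total_tokens"], "total tokens")

# Token-level scores are present only when return_scores was set to true
for item in data.get("scores", []):
    print(item["token"], item["score"])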

Streaming Response

When a request is made to the /v1/model/chat/streaming endpoint, the response is delivered as a series of SSE events that send the output progressively.

{
  "text": "<model's response so far>",
  "delta": "<delta between last event and current event's content>",
  "is_final": <true|false>
}

text (string): Text content of the response generated so far.

delta (string): The text added in the current event since the last event. This is useful for building simple concatenation logic when rendering the text progressively as the API returns it across consecutive events.

is_final (boolean): Indicates whether this is the final event in the response: true for the final event, otherwise false.
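
A minimal sketch of consuming the streaming endpoint with the requests library follows. It assumes each SSE event's data line carries the JSON object described above; stripping a leading "data:" prefix is an assumption about the SSE framing, not something specified here:

import requests
import json

url = "https://api-nebula.symbl.ai/v1/model/chat/streaming"

headers = {
  'ApiKey': '<your_api_key>',
  'Content-Type': 'application/json'
}
payload = json.dumps({
  "max_new_tokens": 1024,
  "messages": [
    {
      "role": "human",
      "text": "Hi"
    }
  ]
})

with requests.post(url, headers=headers, data=payload, stream=True) as response:
    for line in response.iter_lines(decode_unicode=True):
        if not line:
            continue  # skip blank lines between SSE events
        if line.startswith("data:"):  # strip the SSE "data:" prefix if present (assumption)
            line = line[len("data:"):].strip()
        event = json.loads(line)
        # "delta" contains only the new text since the previous event,
        # so simple concatenation rebuilds the full response
        print(event["delta"], end="", flush=True)
        if event["is_final"]:
            break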

Errors

When a request fails, an HTTP error code is returned and the response body contains a detail field describing the failure.

detail (string): Details about the error.

{
  "detail": "<error message>"
}
Error Code                   Error Description
400 - Bad Request            The request body is incorrect.
401 - Unauthorized           Invalid authentication details were provided. The API key is either missing or incorrect.
429 - Rate limit reached     Too many requests were sent and the rate limit has been exceeded.
500 - Server error           The server encountered an error while processing the request.
503 - Service Unavailable    The servers are overloaded due to high traffic.
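
A sketch of basic error handling around the non-streaming call, reading the detail field when a request fails (the payload here is minimal and illustrative):

import requests
import json

url = "https://api-nebula.symbl.ai/v1/model/chat"
headers = {
  'ApiKey': '<your_api_key>',
  'Content-Type': 'application/json'
}
payload = json.dumps({
  "messages": [
    {
      "role": "human",
      "text": "Hi"
    }
  ]
})

response = requests.post(url, headers=headers, data=payload)

if response.ok:
    print(response.json()["messages"][-1]["text"])
else:
    # Failed requests return a JSON body with a "detail" field describing the error
    try:
        detail = response.json().get("detail", response.text)
    except ValueError:
        detail = response.text
    print(f"Request failed ({response.status_code}): {detail}")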