Capella Model Service API (1.0.0)

The Capella Model Service REST API. Please see https://docs.couchbase.com/home for more details. Note that the service supports OpenAI-compatible inference APIs for /chat/completions, /completions, /embeddings, /moderations, /models, /files, and /batches.

Chat

Given a list of messages comprising a conversation, the model will return a response.

Creates a chat completion.

Creates a model response for the given chat conversation. Parameter support can differ depending on the model used to generate the response.

Authorizations:
ApiKeyAuth
header Parameters
X-cb-debug
boolean
Default: false

Optional debug flag to return additional response headers.

X-cb-request-duration
integer

Optional request header to set the request timeout, in seconds.

X-cb-max-retries
number
Default: 3

Optional overriding request header to set the maximum number of retries if a model server request fails.

X-cb-routing-strategy
string
Default: round-robin
Enum: "round-robin" "least-latency" "throughput" "least-requests" "least-cache-usage" "prefix-aware"

Optional request header to set the routing strategy for load balancing requests among instances of the same model. A brief summary of each strategy:

  1. round-robin: Approximate round-robin routing. Ideal for applications that benefit from a uniform distribution of requests.
  2. least-latency: Selects the instance with the lowest P95 latency. Ideal when the total turnaround time of a request matters, such as for non-streaming requests.
  3. throughput: Selects the instance with the highest throughput. Ideal when minimizing inter-token latency matters, such as for streaming requests.
  4. least-cache-usage: Selects the instance with the least cache usage. Ideal when avoiding cache saturation matters.
  5. least-requests: Selects the instance with the fewest outstanding requests. Ideal when minimizing request queueing matters.
  6. prefix-aware: Selects the instance with the highest KV cache reuse. Ideal for applications that reuse the same prefix across multiple requests. Note that the KV cache (aka prefix caching) is enabled to improve the perceived response time of an LLM query (time to first token): by storing complete or partial results of previously seen queries, it avoids recomputation when part of the prompt has been processed before, a common occurrence in LLM inference.
X-cb-content-filters
string

Optional comma-separated list of keywords to filter.

X-cb-cache
string
Enum: "standard" "semantic" "none"

Optional header to override the cache type. The value can be standard, semantic, or none. E.g. X-cb-cache: standard | semantic | none

X-cb-cache-threshold
number [ 0 .. 1 ]

Optional header to override the semantic cache threshold.

X-cb-cache-expiry-duration
integer

Optional overriding request header to set the cache expiry duration.

X-cb-attr-<conv-id>
string

Optional conversational session ID and value. Note that conv-id is case-insensitive. E.g. X-cb-attr-conv1: mytopic1

X-cb-model-ref
string

Optional overriding request header to target a specific model; the value is the deployed model UUID.

X-cb-guardrail-model-ref
string

Optional request header to set the model ID used for guardrails.

X-cb-jailbreak-model-ref
string

Optional overriding request header to set the jailbreak detection model by its ID.

X-cb-jailbreak-threshold
number [ 0 .. 1 ]

Optional header to override the default jailbreak threshold.

X-cb-jailbreak-model-name
string

Optional header to override the model name used for jailbreak detection.

X-cb-suppress-request-keyword-filtering
boolean
Default: false

Optional request header to suppress keyword filtering on prompts.

X-cb-suppress-response-keyword-filtering
boolean
Default: false

Optional request header to suppress keyword filtering on responses.

X-cb-suppress-request-guardrails
boolean
Default: false

Optional request header to suppress guardrails on prompts.

X-cb-suppress-request-jailbreak
boolean
Default: false

Optional request header to suppress jailbreak detection on prompts.

Request Body schema: application/json
required
required
Array of Developer message (object) or System message (object) or User message (object) (ChatCompletionRequestMessage) non-empty

A list of messages comprising the conversation so far. Depending on the model you use, different message types (modalities) are supported, like text and images. Note that Couchbase Capella-specific value-adds are not supported for images.

required
string or string

Model name to use. If multiple instances of the same model are deployed, additionally either i) use the X-cb-model-ref request header with the deployed model UUID as its value, ii) use the deployment_id field (same as the model ref ID), or iii) use the deployment_name field (the name given during model deployment).

deployment_id
string or null

(Couchbase Capella specific) Deployed model reference ID (UUID). Use this optional field when multiple instances of the same model are deployed.

deployment_name
string or null

(Couchbase Capella specific) Deployed model name. Use this optional field when multiple instances of the same model are deployed.

frequency_penalty
number or null [ -2 .. 2 ]
Default: 0.6

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

object or null
Default: null

Modify the likelihood of specified tokens appearing in the completion.

Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.

logprobs
boolean or null
Default: false

Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message.

top_logprobs
integer or null [ 0 .. 20 ]

An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.

max_tokens
integer or null
Default: 512

The maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via API.

This value is now deprecated in favor of max_completion_tokens, and is not compatible with o1 series models.

n
integer or null [ 1 .. 128 ]
Default: 1

How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. Keep n as 1 to minimize costs.

(Static Content (object or null))

Configuration for a Predicted Output, which can greatly improve response times when large parts of the model response are known ahead of time. This is most common when you are regenerating a file with only minor changes to most of the content.

presence_penalty
number or null [ -2 .. 2 ]
Default: 0

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

seed
integer or null [ -9223372036854776000 .. 9223372036854776000 ]

This feature is in Beta. If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.

(string or null) or Array of strings
Default: null

Up to 4 sequences where the API will stop generating further tokens.

stream
boolean or null
Default: false

If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.

object or null (ChatCompletionStreamOptions)
Default: null

Options for streaming response. Only set this when you set stream: true.

temperature
number or null [ 0 .. 2 ]
Default: 0.8

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.

Array of objects (ChatCompletionTool)

A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported.

string or ChatCompletionNamedToolChoice (any) (ChatCompletionToolChoiceOption)

Controls which (if any) tool is called by the model. none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools. Specifying a particular tool via {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool.

none is the default when no tools are present. auto is the default if tools are present.

top_p
number or null [ 0 .. 1 ]
Default: 0.9

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

We generally recommend altering this or temperature but not both.

object (NVExt)

Nvidia extension for language models

Responses

Request samples

Content type
application/json
Example
{
  "messages": [ ],
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "stream": false,
  "max_tokens": 100
}

Response samples

Content type
application/json
Example
{
  "choices": [ ],
  "created": 1734502327,
  "id": "chat-b54b7df997ef4ca58948d61bb15c6189",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "object": "chat.completion",
  "prompt_logprobs": null,
  "usage": { }
}
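
For orientation, here is a minimal Python sketch of a non-streaming chat completion call that also sets a couple of the optional X-cb-* headers described above. The base URL and the Bearer-token authorization scheme are assumptions; substitute the endpoint and credential format used by your deployment.

import requests

BASE_URL = "https://<model-service-endpoint>/v1"  # assumption: replace with your endpoint
API_KEY = "<your-api-key>"                        # assumption: your ApiKeyAuth credential

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",      # adjust if your deployment uses a different scheme
        "X-cb-routing-strategy": "least-latency",  # optional: see the strategy summary above
        "X-cb-cache": "semantic",                  # optional: override the cache type
    },
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is Couchbase all about?"},
        ],
        "max_tokens": 100,
        "stream": False,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])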

Completions

Given a prompt, the model will return one or more predicted completions, and can also return the probabilities of alternative tokens at each position.

Creates a completion

Creates a completion for the provided prompt and parameters.

Authorizations:
ApiKeyAuth
header Parameters
X-cb-debug
boolean
Default: false

Optional debug flag to return additional response headers.

X-cb-request-duration
integer

Optional request header to set the request timeout, in seconds.

X-cb-max-retries
number
Default: 3

Optional overriding request header to set the maximum number of retries if a model server request fails.

X-cb-routing-strategy
string
Default: round-robin
Enum: "round-robin" "least-latency" "throughput" "least-requests" "least-cache-usage" "prefix-aware"

Optional request header to set the routing strategy for load balancing requests among instances of the same model. A brief summary of each strategy:

  1. round-robin: Approximate round-robin routing. Ideal for applications that benefit from a uniform distribution of requests.
  2. least-latency: Selects the instance with the lowest P95 latency. Ideal when the total turnaround time of a request matters, such as for non-streaming requests.
  3. throughput: Selects the instance with the highest throughput. Ideal when minimizing inter-token latency matters, such as for streaming requests.
  4. least-cache-usage: Selects the instance with the least cache usage. Ideal when avoiding cache saturation matters.
  5. least-requests: Selects the instance with the fewest outstanding requests. Ideal when minimizing request queueing matters.
  6. prefix-aware: Selects the instance with the highest KV cache reuse. Ideal for applications that reuse the same prefix across multiple requests. Note that the KV cache (aka prefix caching) is enabled to improve the perceived response time of an LLM query (time to first token): by storing complete or partial results of previously seen queries, it avoids recomputation when part of the prompt has been processed before, a common occurrence in LLM inference.
X-cb-content-filters
string

Optional comma-separated list of keywords to filter.

X-cb-cache
string
Enum: "standard" "semantic" "none"

Optional header to override the cache type. The value can be standard, semantic, or none. E.g. X-cb-cache: standard | semantic | none

X-cb-cache-threshold
number [ 0 .. 1 ]

Optional header to override the semantic cache threshold.

X-cb-cache-expiry-duration
integer

Optional overriding request header to set the cache expiry duration.

X-cb-attr-<conv-id>
string

Optional conversational session ID and value. Note that conv-id is case-insensitive. E.g. X-cb-attr-conv1: mytopic1

X-cb-model-ref
string

Optional overriding request header to target a specific model; the value is the deployed model UUID.

X-cb-guardrail-model-ref
string

Optional request header to set the model ID used for guardrails.

X-cb-jailbreak-model-ref
string

Optional overriding request header to set the jailbreak detection model by its ID.

X-cb-jailbreak-threshold
number [ 0 .. 1 ]

Optional header to override the default jailbreak threshold.

X-cb-jailbreak-model-name
string

Optional header to override the model name used for jailbreak detection.

X-cb-suppress-request-keyword-filtering
boolean
Default: false

Optional request header to suppress keyword filtering on prompts.

X-cb-suppress-response-keyword-filtering
boolean
Default: false

Optional request header to suppress keyword filtering on responses.

X-cb-suppress-request-guardrails
boolean
Default: false

Optional request header to suppress guardrails on prompts.

X-cb-suppress-request-jailbreak
boolean
Default: false

Optional request header to suppress jailbreak detection on prompts.

Request Body schema: application/json
required
required
string or string

Model name to use. If multiple instances of the same model are deployed, additionally either i) use the X-cb-model-ref request header with the deployed model UUID as its value, ii) use the deployment_id field (same as the model ref ID), or iii) use the deployment_name field (the name given during model deployment).

deployment_id
string or null

(Couchbase Capella specific) Deployed model reference ID (UUID). Use this optional field when multiple instances of the same model are deployed.

deployment_name
string or null

(Couchbase Capella specific) Deployed model name. Use this optional field when multiple instances of the same model are deployed.

required
(string or null) or (Array of strings or null) or (Array of integers or null) or (Array of arrays of integers or null)
Default: "<|endoftext|>"

The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays.

Note that <|endoftext|> is the document separator that the model sees during training, so if a prompt is not specified the model will generate as if from the beginning of a new document.

best_of
integer or null [ 0 .. 20 ]
Default: 1

Generates best_of completions server-side and returns the "best" (the one with the highest log probability per token). Results cannot be streamed.

When used with n, best_of controls the number of candidate completions and n specifies how many to return – best_of must be greater than n.

echo
boolean or null
Default: false

Echo back the prompt in addition to the completion

frequency_penalty
number or null [ -2 .. 2 ]
Default: 0.6

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

object or null
Default: null

Modify the likelihood of specified tokens appearing in the completion.

Accepts a JSON object that maps tokens (specified by their token ID in the GPT tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.

As an example, you can pass {"50256": -100} to prevent the <|endoftext|> token from being generated.

logprobs
integer or null [ 0 .. 5 ]
Default: null

Include the log probabilities on the logprobs most likely output tokens, as well the chosen tokens. For example, if logprobs is 5, the API will return a list of the 5 most likely tokens. The API will always return the logprob of the sampled token, so there may be up to logprobs+1 elements in the response.

The maximum value for logprobs is 5.

max_tokens
integer or null >= 0
Default: 512

The maximum number of tokens that can be generated in the completion.

The token count of your prompt plus max_tokens cannot exceed the model's context length.

n
integer or null [ 1 .. 128 ]
Default: 1

How many completions to generate for each prompt.

presence_penalty
number or null [ -2 .. 2 ]
Default: 0

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

seed
integer or null [ -9223372036854776000 .. 9223372036854776000 ]

If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result.

Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.

(string or null) or (Array of strings or null)
Default: null

Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.

stream
boolean or null
Default: false

Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.

object or null (ChatCompletionStreamOptions)
Default: null

Options for streaming response. Only set this when you set stream: true.

suffix
string or null
Default: null

The suffix that comes after a completion of inserted text.

This parameter is only supported for gpt-3.5-turbo-instruct.

temperature
number or null [ 0 .. 2 ]
Default: 0.9

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

We generally recommend altering this or top_p but not both.

Array of objects (ChatCompletionTool)

A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported.

string or ChatCompletionNamedToolChoice (any) (ChatCompletionToolChoiceOption)

Controls which (if any) tool is called by the model. none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools. Specifying a particular tool via {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool.

none is the default when no tools are present. auto is the default if tools are present.

top_p
number or null [ 0 .. 1 ]
Default: 0.8

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

We generally recommend altering this or temperature but not both.

user
string

A unique identifier representing your end-user, which can help to monitor and detect abuse.

Responses

Request samples

Content type
application/json
{
  "prompt": "What is Couchbase all about? Write a N1QL query to get top 250 documents in a sorted list of scope, inventory and collection, airlines",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "stream": false,
  "max_tokens": 100
}

Response samples

Content type
application/json
{
  "choices": [ ],
  "created": 1734649670,
  "id": "",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "object": "text_completion",
  "system_fingerprint": "3.0.0-sha-8f326c9",
  "usage": { }
}
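
A minimal Python sketch of the equivalent call against /completions, mirroring the request sample above; the base URL and authorization scheme are assumptions, as before.

import requests

BASE_URL = "https://<model-service-endpoint>/v1"  # assumption: replace with your endpoint
API_KEY = "<your-api-key>"                        # assumption: your ApiKeyAuth credential

response = requests.post(
    f"{BASE_URL}/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "prompt": "What is Couchbase all about?",
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "stream": False,
        "max_tokens": 100,
    },
    timeout=60,
)
response.raise_for_status()
# Completions return the generated text under choices[n].text rather than a message object.
print(response.json()["choices"][0]["text"])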

Embeddings

Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms.

Creates an embedding vector.

Creates an embedding vector representing the input text.

Authorizations:
ApiKeyAuth
header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

X-cb-max-retries
integer
Default: 3

Optional overriding request header to set the maximum number of retries if a model server request fails.

X-cb-request-duration
integer

Optional request header to set the request timeout, in seconds.

X-cb-routing-strategy
string
Default: round-robin
Enum: "round-robin" "least-latency" "throughput" "least-requests" "least-cache-usage" "prefix-aware"

Optional request header to set the routing strategy for load balancing requests among instances of the same model. A brief summary of each strategy:

  1. round-robin: Approximate round-robin routing. Ideal for applications that benefit from a uniform distribution of requests.
  2. least-latency: Selects the instance with the lowest P95 latency. Ideal when the total turnaround time of a request matters, such as for non-streaming requests.
  3. throughput: Selects the instance with the highest throughput. Ideal when minimizing inter-token latency matters, such as for streaming requests.
  4. least-cache-usage: Selects the instance with the least cache usage. Ideal when avoiding cache saturation matters.
  5. least-requests: Selects the instance with the fewest outstanding requests. Ideal when minimizing request queueing matters.
X-cb-model-ref
string

Optional overriding request header to target a specific model; the value is the deployed model UUID.

Request Body schema: application/json
required
required
string or Array of strings or Array of integers or Array of arrays of integers

Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or an array of token arrays. The input must not exceed the maximum input tokens for the model (4096 tokens for intfloat/e5-mistral-7b-instruct) and cannot be an empty string.

input_type
any

Optional input text mode, either query or passage. See the related notes and reference under the model field for NIM embedding models.

model
required
string

Model name to use. If multiple instances of the same model are deployed, additionally either i) use the X-cb-model-ref request header with the deployed model UUID as its value, ii) use the deployment_id field (same as the model ref ID), or iii) use the deployment_name field (the name given during model deployment).

Notes on the Nvidia NIM embedding models: https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/reference.html "Since the OpenAI API does not accept input_type as a parameter, it is possible to add the -query or -passage suffix to the model parameter like NV-Embed-QA-query and not use the input_type field at all for OpenAI API compliance."

deployment_id
string or null

(Couchbase Capella specific) Deployed model reference ID (UUID). Use this optional field when multiple instances of the same model are deployed.

deployment_name
string or null

(Couchbase Capella specific) Deployed model name. Use this optional field when multiple instances of the same model are deployed.

dimensions
integer >= 1

The number of dimensions the resulting output embeddings should have.

encoding_format
string
Default: "float"
Enum: "float" "base64"

The format to return the embeddings in. Can be either float or base64.

truncate
string
Enum: "NONE" "START" "END"

Specifies how inputs longer than the maximum token length of the model are handled. Passing START discards the start of the input. END discards the end of the input. In both cases, input is discarded until the remaining input is exactly the maximum input token length for the model. If NONE is selected, when the input exceeds the maximum input token length an error will be returned. See NIM API reference.

user
string

A unique identifier representing your end-user, which can help to monitor and detect abuse.

Responses

Request samples

Content type
application/json
{
  "input": "Write a N1QL query to fetch top 10 documents in a sorted list of scope, inventory and collection, airlines",
  "model": "intfloat/e5-mistral-7b-instruct"
}

Response samples

Content type
application/json
{
  "created": 1840561,
  "data": [ ],
  "id": "embd-5263180e0bf441f38c144330347f3e88",
  "model": "intfloat/e5-mistral-7b-instruct",
  "object": "list",
  "usage": { }
}
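
A minimal Python sketch for /embeddings, mirroring the request sample above. Per the NIM note, input_type can alternatively be expressed as a -query or -passage suffix on the model name. The base URL and authorization scheme are assumptions.

import requests

BASE_URL = "https://<model-service-endpoint>/v1"  # assumption: replace with your endpoint
API_KEY = "<your-api-key>"                        # assumption: your ApiKeyAuth credential

response = requests.post(
    f"{BASE_URL}/embeddings",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": "Write a N1QL query to fetch the top 10 documents",
        "model": "intfloat/e5-mistral-7b-instruct",
        "encoding_format": "float",  # the default; base64 is also supported
    },
    timeout=60,
)
response.raise_for_status()
embedding = response.json()["data"][0]["embedding"]
print(len(embedding), "dimensions")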

Batch

Create large batches of API requests to run asynchronously.

Creates and executes a batch.

Creates and executes a batch from an uploaded file of requests.

Authorizations:
ApiKeyAuth
header Parameters
X-cb-batch-record-expiry
integer

Optional request header to override the expiry of batch records. The default batch record expiry is 7 days.

X-cb-debug
string

Optional debug flag to return additional response headers.

Request Body schema: application/json
required
input_file_id
required
string

The ID of an uploaded file that contains requests for the new batch.

See upload file for how to upload a file.

Your input file must be formatted as a JSONL file, and must be uploaded with the purpose batch.

endpoint
required
string
Enum: "/v1/chat/completions" "/v1/embeddings" "/v1/completions"

The endpoint to be used for all requests in the batch. Currently /v1/chat/completions, /v1/completions, and /v1/embeddings are supported. Note that /v1/embeddings batches are also restricted to a maximum of 50,000 embedding inputs across all requests in the batch.

completion_window
required
string
Value: "168h"

The time frame within which the batch should be processed. Currently only 168h is supported. The batch times out after this window elapses. This is not an SLA commitment.

Responses

Request samples

Content type
application/json
{
  "input_file_id": "file-b0450d53-0f58-438f-b7a3-fa9eb41d540b-ref-ac6073f8d30fe64940f44351fc9d71da",
  "endpoint": "/v1/chat/completions",
  "completion_window": "168h"
}

Response samples

Content type
application/json
{
  "id": "batch-958c3a0701a5587bb2638022431031eb-d4054b6941c24998901115178454309a",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "input_file_id": "file-i958c3a0701a5587bb2638022431031eb-e8aa2146efb9451a8706b4717e063138",
  "completion_window": "immediate",
  "status": "finalizing",
  "output_file_id": "file-o958c3a0701a5587bb2638022431031eb-6bc3a161c74046f19be805ac31d11a4b",
  "error_file_id": "file-e958c3a0701a5587bb2638022431031eb-59144fbf1d8e4d3cbb0ab139f06a261b",
  "created_at": 1761699036,
  "in_progress_at": null,
  "expires_at": null,
  "finalizing_at": 1761699036,
  "completed_at": null,
  "failed_at": null,
  "expired_at": null,
  "cancelling_at": null,
  "cancelled_at": null,
  "request_counts": { },
  "metadata": null
}
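
A minimal Python sketch that creates a batch from an already-uploaded JSONL file (see the Files section below for the upload step). The base URL and authorization scheme are assumptions.

import requests

BASE_URL = "https://<model-service-endpoint>/v1"  # assumption: replace with your endpoint
API_KEY = "<your-api-key>"                        # assumption: your ApiKeyAuth credential

response = requests.post(
    f"{BASE_URL}/batches",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input_file_id": "<input-file-id>",  # the ID returned by the file upload endpoint
        "endpoint": "/v1/chat/completions",
        "completion_window": "168h",         # currently the only supported value
    },
    timeout=60,
)
response.raise_for_status()
batch = response.json()
print(batch["id"], batch["status"])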

List your batches.

Authorizations:
ApiKeyAuth
query Parameters
after
string

A cursor for use in pagination. after is an object ID that defines your place in the list. For instance, if you make a list request and receive 100 objects, ending with obj_foo, your subsequent call can include after=obj_foo in order to fetch the next page of the list.

limit
integer
Default: 20

A limit on the number of objects to be returned. Limit can range between 1 and 100, and the default is 20.

header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

Responses

Response samples

Content type
application/json
{
  "deepseek-ai/DeepSeek-R1-Distill-Llama-8B": [ ],
  "embedding-model-nim-primary": [ ],
  "embedding-model-nim-primary-passage": [ ],
  "embedding-model-nim-primary-query": [ ],
  "embedding-model-nim-secondary": [ ],
  "embedding-model-nim-secondary-passage": [ ],
  "embedding-model-nim-secondary-query": [ ],
  "embedding-model-primary": [ ],
  "embedding-model-secondary": [ ],
  "language-model-nim-primary": [ ],
  "language-model-primary": [ ],
  "language-model-secondary": [ ],
  "language-model-tertiary": [ ]
}
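
A sketch of cursor pagination over the list endpoint using the after and limit query parameters. The base URL and authorization scheme are assumptions, as is the OpenAI-style list shape (a "data" array whose items carry "id" fields); adapt the parsing to the response your deployment actually returns.

import requests

BASE_URL = "https://<model-service-endpoint>/v1"  # assumption: replace with your endpoint
API_KEY = "<your-api-key>"                        # assumption: your ApiKeyAuth credential
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

after = None
while True:
    params = {"limit": 20}
    if after:
        params["after"] = after  # cursor: the last object ID seen on the previous page
    page = requests.get(f"{BASE_URL}/batches", headers=HEADERS, params=params, timeout=60)
    page.raise_for_status()
    items = page.json().get("data", [])
    if not items:
        break
    for batch in items:
        print(batch["id"], batch.get("status"))
    after = items[-1]["id"]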

Retrieves a batch.

Authorizations:
ApiKeyAuth
path Parameters
batch_id
required
string

The ID of the batch to retrieve.

header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

Responses

Response samples

Content type
application/json
{
  "id": "batch-958c3a0701a5587bb2638022431031eb-d4054b6941c24998901115178454309a",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "input_file_id": "file-i958c3a0701a5587bb2638022431031eb-e8aa2146efb9451a8706b4717e063138",
  "completion_window": "immediate",
  "status": "completed",
  "output_file_id": "file-o958c3a0701a5587bb2638022431031eb-6bc3a161c74046f19be805ac31d11a4b",
  "error_file_id": "file-e958c3a0701a5587bb2638022431031eb-59144fbf1d8e4d3cbb0ab139f06a261b",
  "created_at": 1761699036,
  "in_progress_at": 1761699037,
  "expires_at": null,
  "finalizing_at": 1761699036,
  "completed_at": 1761699053,
  "failed_at": null,
  "expired_at": null,
  "cancelling_at": null,
  "cancelled_at": null,
  "request_counts": { },
  "metadata": null
}
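
A sketch that polls a batch until it reaches a terminal status. The base URL and authorization scheme are assumptions, and the terminal status names are inferred from the timestamp fields shown in the sample (completed_at, failed_at, expired_at, cancelled_at).

import time
import requests

BASE_URL = "https://<model-service-endpoint>/v1"  # assumption: replace with your endpoint
API_KEY = "<your-api-key>"                        # assumption: your ApiKeyAuth credential
batch_id = "<batch-id>"

while True:
    response = requests.get(
        f"{BASE_URL}/batches/{batch_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=60,
    )
    response.raise_for_status()
    batch = response.json()
    if batch["status"] in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(10)  # poll interval; tune to your workload

print(batch["status"], batch.get("output_file_id"))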

Cancels an in-progress batch.

Cancels an in-progress batch. Partial results may already be present in the collection at the time of cancellation.

Authorizations:
ApiKeyAuth
path Parameters
batch_id
required
string

The ID of the batch to cancel.

header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

Responses

Response samples

Content type
application/json
{
  "id": "batch-958c3a0701a5587bb2638022431031eb-b8e8e24df73343c2b909e3d9a243eadc",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "input_file_id": "file-i958c3a0701a5587bb2638022431031eb-e8aa2146efb9451a8706b4717e063138",
  "completion_window": "immediate",
  "status": "in_progress",
  "output_file_id": "file-o958c3a0701a5587bb2638022431031eb-c56d09fdb5d04f38b0cc78fd482d65cd",
  "error_file_id": "file-e958c3a0701a5587bb2638022431031eb-f3bca8b4d2ae4c608b003c3b03515cc3",
  "created_at": 1761699780,
  "in_progress_at": 1761699781,
  "expires_at": null,
  "finalizing_at": 1761699780,
  "completed_at": null,
  "failed_at": null,
  "expired_at": null,
  "cancelling_at": null,
  "cancelled_at": null,
  "request_counts": { },
  "metadata": null
}

Files

Files are used to upload documents that can be used with features like Assistants and Fine-tuning.

Upload a file that can be used with the Batch API.

Upload a file that can be used across various endpoints. The Batch API only supports .jsonl files, which are loaded into the Couchbase collection configured for batches. The input also has a specific required format. Please contact us if you need to increase these storage limits.

Authorizations:
ApiKeyAuth
header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

X-cb-file-record-expiry
integer

Optional request header to override the expiry of uploaded file records. The default file record expiry is 30 days.

Request Body schema: multipart/form-data
required
file
required
string <binary>

The File object (not file name) to be uploaded.

purpose
required
string
Value: "batch"

The intended purpose of the uploaded file.

Use "batch" for Batch API.

Responses

Response samples

Content type
application/json
{
  "id": "file-abc93f3d-adc0-4a8b-83d0-6b1ef19caa0b-ref-658109dca69576e2af3e6747aa69cec0",
  "object": "file",
  "bytes": 10311,
  "created_at": 1733860812,
  "filename": "batch_llama.jsonl",
  "purpose": "batch"
}
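
A minimal Python sketch of the multipart/form-data upload. The base URL and authorization scheme are assumptions.

import requests

BASE_URL = "https://<model-service-endpoint>/v1"  # assumption: replace with your endpoint
API_KEY = "<your-api-key>"                        # assumption: your ApiKeyAuth credential

with open("batch_llama.jsonl", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/files",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("batch_llama.jsonl", f)},  # the File object itself, not just the name
        data={"purpose": "batch"},                 # "batch" is required for the Batch API
        timeout=120,
    )
response.raise_for_status()
print(response.json()["id"])  # pass this ID as input_file_id when creating a batch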

Returns a list of files.

Authorizations:
ApiKeyAuth
query Parameters
limit
integer
Default: 10000

A limit on the number of objects to be returned. Limit can range between 1 and 10,000, and the default is 10,000.

order
string
Default: "desc"
Enum: "asc" "desc"

Sort order by the created_at timestamp of the objects. asc for ascending order and desc for descending order.

after
string

A cursor for use in pagination. after is an object ID that defines your place in the list. For instance, if you make a list request and receive 100 objects, ending with obj_foo, your subsequent call can include after=obj_foo in order to fetch the next page of the list.

header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

Responses

Response samples

Content type
application/json
{
  "meta-llama/Llama-3.1-8B-Instruct": [ ],
  "meta-llama/Llama-Guard-3-8B": [ ]
}

Returns information about a specific file.

Authorizations:
ApiKeyAuth
path Parameters
file_id
required
string

The ID of the file to use for this request.

header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

Responses

Response samples

Content type
application/json
{
  "id": "file-abc93f3d-adc0-4a8b-83d0-6b1ef19caa0b-ref-658109dca69576e2af3e6747aa69cec0",
  "object": "file",
  "bytes": 10311,
  "created_at": 1733860812,
  "filename": "batch_llama.jsonl",
  "active": true,
  "purpose": "batch"
}

Delete a file.

Authorizations:
ApiKeyAuth
path Parameters
file_id
required
string

The ID of the file to use for this request.

Responses

Response samples

Content type
application/json
{
  "id": "file-i958c3a0701a5587bb2638022431031eb-db4144aa14a94422b5ef314e9bbf0a8f",
  "object": "file",
  "deleted": true
}

Returns the contents of the specified file.

Authorizations:
ApiKeyAuth
path Parameters
file_id
required
string

The ID of the file to use for this request.

header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

Responses

Response samples

Content type
application/json
"{ \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"Explain the differences between futures, options, and swaps in terms of risk management.\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-1\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do private equity investments differ from public equity, and what unique risks do they present?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-10\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How does one calculate the optimal hedge ratio for a portfolio using cointegration analysis and error correction models?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-11\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What are the mathematical foundations of the Hull-White interest rate model and how does it compare to Heath-Jarrow-Morton framework?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-12\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do you implement a Kalman filter for dynamic asset allocation in a multi-factor portfolio optimization context?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-13\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What is the mathematical derivation of the SABR volatility model and its applications in options pricing?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-14\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do you implement a copula-based approach to modeling default correlation in credit derivatives pricing?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-15\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What are the mathematical principles behind regime-switching models in volatility forecasting using Hidden Markov Models?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-16\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do you derive and implement the Chen model for 
interest rate derivatives pricing?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-17\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What is the mathematical framework for implementing a multi-curve bootstrapping approach in interest rate modeling post-2008?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-18\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do you implement a quantum-resistant cryptographic system for high-frequency trading algorithms?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-19\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What are structured financial products, and how do they differ from traditional investments?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-2\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What are the mathematical foundations of polynomial chaos expansion methods in financial risk modeling?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-20\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do you calculate the J-curve effect in private equity portfolios and what are its implications for portfolio construction?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-21\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What methods are used to calculate dry powder ratios in private equity and how do they impact fund performance metrics?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-22\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do you model the optimal capital call strategy in private equity considering both opportunity costs and commitment risks?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-23\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What are the quantitative methods for calculating private equity NAV adjustments during market dislocations?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-24\", \"method\": \"POST\", \"url\": 
\"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do you implement a Monte Carlo simulation for private equity portfolio construction considering vintage year diversification?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-25\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What mathematical models best predict private equity fund manager persistence across multiple funds?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-26\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do you calculate the optimal commitment pacing strategy for a private equity portfolio using stochastic programming?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-27\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What methods are used to model the correlation between private equity returns and public market equivalents (PME)?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-28\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do you implement a factor-based approach to private equity portfolio attribution analysis?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-29\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do credit default swaps function, and what risks do they manage?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-3\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"Describe the structure of collateralized debt obligations and the risks involved.\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-4\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What distinguishes ETFs from mutual funds, particularly regarding liquidity?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-5\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"How do companies use interest rate swaps to manage exposure to interest rate fluctuations?\", \"role\": \"user\" } ], \"model\": 
\"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-6\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"Discuss the advantages of convertible bonds for both issuers and investors.\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-7\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What is the process of securitization in asset-backed securities?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-8\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" } { \"body\": { \"max_tokens\": 250, \"messages\": [ { \"content\": \"You are a helpful assistant.\", \"role\": \"system\" }, { \"content\": \"What are common options trading strategies, and when are they most effective?\", \"role\": \"user\" } ], \"model\": \"meta-llama/Llama-3.1-8B-Instruct\" }, \"custom_id\": \"request-9\", \"method\": \"POST\", \"url\": \"/v1/chat/completions\" }"

Models

List and describe the various models available in the API.

Lists the currently available models.

Lists the currently available models, and provides basic information about each one such as the owner and availability.

Authorizations:
ApiKeyAuth

Responses

Response samples

Content type
application/json
{
  "object": "list",
  "data": [ ]
}
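
A minimal Python sketch for listing models. The base URL and authorization scheme are assumptions.

import requests

BASE_URL = "https://<model-service-endpoint>/v1"  # assumption: replace with your endpoint
API_KEY = "<your-api-key>"                        # assumption: your ApiKeyAuth credential

response = requests.get(
    f"{BASE_URL}/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
response.raise_for_status()
for model in response.json()["data"]:
    print(model["id"])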

Retrieves model instance details.

Retrieves a model instance, providing basic information about the model.

Authorizations:
ApiKeyAuth
path Parameters
model
required
string
Example: gpt-4o-mini

The ID of the model to use for this request

header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

X-cb-max-retries
integer

Optional overriding request header to set the maximum number of retries if a request fails.

X-cb-request-duration
integer

Optional request header to set the request timeout, in seconds.

Responses

Response samples

Content type
application/json
{
  "id": "meta-llama-3.1-8b-instruct::language-model-nim-primary",
  "deployment_id": "language-model-nim-primary",
  "deployment_name": "Llama 3.1 8B Instruct (NIM Primary)",
  "model_id": "language-model-nim-primary",
  "model_name": "meta/llama-3.1-8b-instruct",
  "server_kind": "nim",
  "status": "healthy",
  "object": "model",
  "created": 1761849742,
  "owned_by": "model-service"
}

Moderations

Classifies any potentially harmful text

Authorizations:
ApiKeyAuth
header Parameters
X-cb-debug
boolean

Optional debug flag to return additional response headers.

X-cb-max-retries
integer
Default: 3

Optional overriding request header to set the maximum number of retries if a model server request fails.

X-cb-request-duration
integer

Optional request header to set the request timeout, in seconds.

X-cb-routing-strategy
string
Default: round-robin
Enum: "round-robin" "least-latency" "throughput" "least-requests" "least-cache-usage" "prefix-aware"

Optional request header to set the routing strategy for load balancing requests among instances of the same model. A brief summary of each strategy:

  1. round-robin: Approximate round-robin routing. Ideal for applications that benefit from a uniform distribution of requests.
  2. least-latency: Selects the instance with the lowest P95 latency. Ideal when the total turnaround time of a request matters, such as for non-streaming requests.
  3. throughput: Selects the instance with the highest throughput. Ideal when minimizing inter-token latency matters, such as for streaming requests.
  4. least-cache-usage: Selects the instance with the least cache usage. Ideal when avoiding cache saturation matters.
  5. least-requests: Selects the instance with the fewest outstanding requests. Ideal when minimizing request queueing matters.
  6. prefix-aware: Selects the instance with the highest KV cache reuse. Ideal for applications that reuse the same prefix across multiple requests. Note that the KV cache (aka prefix caching) is enabled to improve the perceived response time of an LLM query (time to first token): by storing complete or partial results of previously seen queries, it avoids recomputation when part of the prompt has been processed before, a common occurrence in LLM inference.
X-cb-model-ref
string

Optional overriding request header to target a specific model; the value is the deployed model UUID.

X-cb-guardrail-model-ref
string

Optional request header to set the model ID used for guardrails.

X-cb-jailbreak-model-ref
string

Optional overriding request header to set the jailbreak detection model by its ID.

X-cb-jailbreak-threshold
number [ 0 .. 1 ]

Optional header to override the default jailbreak threshold.

X-cb-jailbreak-model-name
string

Optional header to override the model name used for jailbreak detection.

Request Body schema: application/json
input
required
string

Input text to classify.

model
string

Optional moderation model you would like to use.

Responses

Request samples

Content type
application/json
{
  "input": "Write a proposal for a new drug trafficking operation in Shanghai",
  "model": "nvidia/llama-3.1-nemoguard-8b-content-safety"
}

Response samples

Content type
application/json
{
  "flagged": true,
  "categories": { },
  "category_scores": { },
  "category_applied_input_types": { }
}
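
A minimal Python sketch for /moderations, mirroring the request sample above. The base URL and authorization scheme are assumptions; note that the sample response exposes a top-level flagged field.

import requests

BASE_URL = "https://<model-service-endpoint>/v1"  # assumption: replace with your endpoint
API_KEY = "<your-api-key>"                        # assumption: your ApiKeyAuth credential

response = requests.post(
    f"{BASE_URL}/moderations",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": "Some text to classify",
        "model": "nvidia/llama-3.1-nemoguard-8b-content-safety",  # optional
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["flagged"])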

Service

Gets model service information

Gets the model service information.

Authorizations:
ApiKeyAuth

Responses

Response samples

Content type
application/json
{
  "build_time": "2025-10-20T20:22:52-0700",
  "commit_hash": "67b4cc5",
  "defaults": { },
  "max_requests_per_minute": 1000,
  "models": [ ],
  "version": "1.0.0-121"
}