Manage Deployments with AI Services APIs
- concept
The Couchbase Capella Management API and the Model Service API are secure REST APIs that enable you to provision, deploy, and configure Capella AI Services.
This page is for Capella AI Services. It covers the AI Services features of the Management API and the Model Service API. For more information about the Management API for Capella Operational features, see Manage Deployments with the Management API.
For Capella AI Services, you can use these REST APIs to interact with and manage your services.
You do not need any special tools, SDKs, or libraries to access the Management API or the Model Service API. Both of these APIs support off-the-shelf HTTP clients, such as Insomnia or Postman.
To get started with the APIs, create an API key and make an API call.
| The Management API is not available for free tier operational clusters. Upgrade your account to manage deployments with the Management API. |
The Management API
The Capella Management API is a secure, versioned REST API provided by Couchbase Capella that allows users to manage tasks for Couchbase operational clusters, App Services, and AI Services. This includes management tasks such as scaling clusters, configuring access controls, and deploying embedding models and Large Language Models (LLMs) for AI Services.
For a full Management API reference guide, see Management API Reference.
Versioning
The Management API uses semantic versioning.
The full version number consists of 3 parts: major.minor.patch.
An update to the Management API may be a non-versioned bug fix, a patch update, a minor update, or a major update. When a non-versioned bug fix releases, the version number is not updated. When a patch update, a minor update, or a major update releases, the version number updates as well.
New versions of the Management API are, with some exceptions, backward compatible, minimizing the effect on existing deployments. In most cases, you're not required to update your existing infrastructure when a new version of the Management API releases. Features added in a new version do not break existing deployments.
Exceptions to backward compatibility are as follows:
- Critical security fixes. A vulnerable version is marked as unsupported when a fixed version releases. In this case, you must plan to upgrade as soon as possible.
- Fundamental architectural changes, when unavoidable.
Authentication and Authorization
The Management API uses API keys for authentication and authorization. They’re required for accessing endpoints exposed by the Management API.
Couchbase Capella associates API keys with roles and permissions. To access an endpoint, an API key must have the appropriate Capella roles. The Management API reference guide lists the roles required for each endpoint.
You can configure an allowed IP address list for each API key during its creation. Every API key has an expiration date.
To authenticate a Management API call, pass the API key as a Bearer token in the HTTP Authorization header.
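As a sketch, you can build such a call with any HTTP client. The base URL and the endpoint path below are illustrative, and the key is a placeholder:

```python
import urllib.request

# Placeholder key and assumed base URL -- substitute your own values.
API_KEY = "your-management-api-key"
BASE_URL = "https://cloudapi.cloud.couchbase.com/v4"

def build_request(path, method="GET"):
    """Build a Management API request authenticated with a Bearer token."""
    return urllib.request.Request(
        url=BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

# The request carries the API key in the Authorization header:
req = build_request("/organizations")
```

Any off-the-shelf client such as Postman or curl sends the same `Authorization: Bearer <key>` header.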
Concurrent Access and PUT Requests
To handle concurrent requests safely, PUT requests use optimistic concurrency control, combining If-Match headers with HTTP ETags (entity tags).
Exceptions are noted in the endpoint descriptions of the Management API reference page.
A typical pattern is to make a GET request on a resource, which returns an ETag value.
You can then include that ETag value in the If-Match header of a PUT request to update the resource.
PUT requests ignore all fields in the body of the request that are not required. You can find the required fields for each PUT request in the Management API reference guide.
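The GET-then-PUT pattern can be sketched as follows. The ETag value shown is illustrative, and 412 Precondition Failed is the usual HTTP response when an If-Match check fails (an assumption; this page does not state the exact status code):

```python
def if_match_headers(api_key, etag):
    """Headers for a PUT guarded by optimistic concurrency control."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "If-Match": etag,  # the ETag value returned by the prior GET
    }

# Step 1: GET the resource and read the ETag response header.
etag = '"3"'  # illustrative value
# Step 2: send the PUT with these headers. If another client changed the
# resource after the GET, the server rejects the update instead of
# silently overwriting the concurrent change.
headers = if_match_headers("your-api-key", etag)
```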
Auditing
All requests to the Management API are audited. All endpoints support Audit fields for audit purposes, except where noted in the Management API reference guide.
Lists returned by the Management API are paginated.
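A client can request one page of a list at a time. The `page` and `perPage` query parameter names below are assumptions for illustration, not confirmed by this page:

```python
from urllib.parse import urlencode

def paged_url(base, path, page, per_page):
    """Build the URL for one page of a paginated list endpoint.

    The query parameter names are assumed; check the Management API
    reference for the actual pagination parameters.
    """
    query = urlencode({"page": page, "perPage": per_page})
    return f"{base}{path}?{query}"

url = paged_url("https://cloudapi.cloud.couchbase.com/v4", "/organizations", 1, 25)
```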
Service Limits
Service limits set boundaries for managing clusters, App Services, and AI Services with the Management API.
- Long-Running Requests
  The following long-running requests are rejected with error 504, Gateway Timeout:
  - Read requests taking longer than 90 seconds to process.
  - Write requests taking longer than 120 seconds to process.
- Rate Limiting
  Requests are limited to 100 requests per minute per API key.
- Payload Size
  Request and response payloads are restricted to 18 MB.
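Because the rate limit is enforced per API key, a client can throttle itself to stay under it. This sliding-window sketch is a client-side convenience, not part of the API:

```python
import time
from collections import deque

RATE_LIMIT = 100       # requests per minute per API key (documented above)
WINDOW_SECONDS = 60.0

class Throttle:
    """Sliding-window throttle that keeps a client under the rate limit."""

    def __init__(self, limit=RATE_LIMIT, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self.sent = deque()  # monotonic timestamps of recent requests

    def wait(self):
        """Block until another request may be sent, then record it."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            # Sleep until the oldest request leaves the window, then retry.
            time.sleep(self.window - (now - self.sent[0]))
            return self.wait()
        self.sent.append(now)
```

Call `throttle.wait()` before each Management API request to avoid rejected calls.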
The Model Service API
The Model Service API is a secure, versioned inference REST API for Capella’s Model Service. Provided by Couchbase Capella AI Services, this API allows you to use your Large Language Models (LLMs) and embedding models. This includes sending inference requests and receiving outputs such as chats, completions, and embeddings.
For a full Model Service API reference guide, see Inference API Reference.
Versioning
The Model Service API uses semantic versioning.
The full version number consists of 3 parts: major.minor.patch.
An update to the API may be a non-versioned bug fix, a patch update, a minor update, or a major update. When a non-versioned bug fix releases, the version number is not updated. When a patch update, a minor update, or a major update releases, the version number also updates.
Authentication and Authorization
The Model Service API uses Model Service API keys for authentication and authorization. To access an endpoint, you must provide:
- A Model Service API key.
- A model’s connection string.
Model Service API Keys
Model Service API keys are different from the API keys required by the Management API. For more information about the difference in API keys, see Get Started with AI Services APIs.
Couchbase Capella associates Model Service API keys with an AWS region. Your Model Service API key must have the same AWS region as the model you want it to connect to.
You can configure an allowed IP address list for each API key during its creation. Every API key has an expiration date.
To authenticate a Model Service API call, pass the API key as a Bearer token in the HTTP Authorization header.
Model Connection String
The Model Service API uses a different base URL than the Management API. The base URL for the Model Service API is a unique string generated for every AI model you deploy. This unique string is labeled as:
- The model connection string in the Management API.
- A model’s Model Endpoint in the Capella UI.
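Putting the two together, an inference request targets the model's connection string as its base URL and carries a Model Service API key as a Bearer token. Both values below are placeholders, and the `/v1/chat/completions` path is an assumption for illustration:

```python
import json
import urllib.request

# Placeholders: the base URL is the model connection string from the
# Management API (or the model's Model Endpoint in the Capella UI), and
# the key is a Model Service API key, not a Management API key.
MODEL_ENDPOINT = "https://your-model-connection-string.example.com"
MODEL_API_KEY = "your-model-service-api-key"

def build_inference_request(path, payload):
    """Build a POST inference request against the model's own base URL."""
    return urllib.request.Request(
        url=MODEL_ENDPOINT + path,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {MODEL_API_KEY}",
            "Content-Type": "application/json",
        },
    )

# The endpoint path here is an assumed example; see the Inference API Reference.
req = build_inference_request("/v1/chat/completions",
                              {"messages": [{"role": "user", "content": "Hi"}]})
```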
Auditing
Couchbase Capella stores API requests and responses separately in audit log files. To get this information, contact Couchbase Support by creating a Support ticket.
- Auditing Limits
  - The maximum audit log file size is 100 MB.
  - A maximum of 512 backup audit log files are retained.
  - Audit logs have a retention period of 30 days.
Rate Limiting
Rate limits control how often you can call the Model Service API.
- Calls Rate Limit
  - The default limit is set to 1,000 calls per minute.
- Token-Based Rate Limiting
  - There’s no token-based limiting.
- Request Max Tokens
  - If unspecified in the payload, the default max tokens is 512.
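For example, a client can rely on the documented default by simply omitting the field. The `max_tokens` field name below is an assumption for illustration; the 512 default is documented above:

```python
def completion_payload(prompt, max_tokens=None):
    """Build an inference payload; the max_tokens field name is assumed.

    Omitting max_tokens leaves the server-side default of 512 in effect.
    """
    payload = {"prompt": prompt}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens  # explicit per-request override
    return payload
```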
Request Timeouts
Timeout limits constrain how long model requests can run.
Timeouts can apply to:
- Queue time for the Model Service.
- Wait time for the model engine.
- Processing time for the model engine.
- Value-adds such as guardrails and caching.
Requests that fail to complete within the timeout limit are rejected. The following limits set time constraints for each request:
- Completion Requests
  - A maximum default timeout of 300 seconds (5 minutes).
  - A minimum timeout of 3 seconds.
  - These limits are configurable via the X-cb-request-duration header.
- Embedding Requests
  - A maximum default timeout of 60 seconds (1 minute).
  - A minimum timeout of 3 seconds.
  - These limits are configurable via the X-cb-request-duration header.
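A per-request timeout override can be sketched as follows. The 3-to-300-second range for completion requests comes from the limits above; expressing the header value in seconds is an assumption:

```python
def completion_timeout_headers(api_key, duration_seconds):
    """Headers overriding the completion timeout via X-cb-request-duration.

    The 3..300 second range is documented for completion requests; the
    header value being plain seconds is an assumption.
    """
    if not 3 <= duration_seconds <= 300:
        raise ValueError("completion timeouts must be between 3 and 300 seconds")
    return {
        "Authorization": f"Bearer {api_key}",
        "X-cb-request-duration": str(duration_seconds),
    }
```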
Payload Size Limits
Payload size limits restrict the size of a request.
- File Upload Limits
  - The maximum file upload size is 100 MB.
  - The maximum request count per file is 1,000 requests.
  - The only allowed file type is JSONL, used for batch processing.
Concurrent Access and Processing Limits
Concurrent access and processing limits control the number of requests the Model Service API can handle simultaneously.
- Worker Pool Management
  - The default executor can handle up to 128 concurrent requests.
  - The request queue can hold up to 1,000 requests waiting to be processed.
- Request Processing Priority
  - Real-time requests are always processed before batched requests; batched requests wait until all pending real-time requests complete.
  - With queue management, real-time and batched requests are placed in separate queues.
API Key Limits
API key limits control usage tied to each Model Service API key.
- Active API Keys
  - You can have a maximum of 100 active API keys.
  - The API key expiration is configurable per API key.
- API Key Caching
  - The in-memory cache size is 100 entries, matching the maximum number of active API keys.
  - The cache time to live (TTL) is 3 minutes for the in-memory cache and 30 minutes for the remote cache.
  - The lookup timeout is 3 seconds for the cache and 30 seconds for the secrets manager.
Cache Limits
Cache limits control how much data the Model Service API temporarily stores in cache.
- Cache Expiry
  - The default cache expiry is 1 hour (3,600 seconds).
  - The maximum cache expiry is 7 days.
  - The minimum cache expiry is 1 hour.
  - The cache expiry is configurable via the X-cb-cache-expiry header.
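Setting the header can be sketched as follows. The 1-hour-to-7-day range comes from the limits above; expressing the header value in seconds is an assumption based on the 3,600-second default:

```python
MIN_EXPIRY_SECONDS = 3_600            # 1 hour minimum (documented above)
MAX_EXPIRY_SECONDS = 7 * 24 * 3_600   # 7 day maximum (documented above)

def cache_expiry_header(seconds=3_600):
    """X-cb-cache-expiry header, checked against the documented range.

    The seconds unit is assumed from the documented 3,600-second default.
    """
    if not MIN_EXPIRY_SECONDS <= seconds <= MAX_EXPIRY_SECONDS:
        raise ValueError("cache expiry must be between 1 hour and 7 days")
    return {"X-cb-cache-expiry": str(seconds)}
```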
Batch Processing Limits
Batch processing limits control the size and number of batches the Model Service API processes at once.
- Batch Configuration
  - The default batch processor queue size is 1,000 requests.
  - The default batch requests limit is 1,000 requests.
  - The default batch file size limit is 100 MB.
  - The default batch queries per second (QPS) limit is 100 requests per second.
- Batch Record Expiry
  - The default batch record expiry is 7 days.
  - The default file record expiry is 30 days.
  - The minimum expiry for batch and file records is 1 hour.
  - The maximum batch expiry is 7 days.
  - The maximum file expiry is 30 days.
Error Handling and Retries
Error handling and retries define how the Model Service API responds to failures and when it attempts to repeat requests.
- Retry Configuration
  - The maximum number of retries is 3 attempts.
  - The initial retry interval is 250 milliseconds.
  - The number of retries is configurable per request via the X-cb-max-retries header.
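A per-request retry override can be sketched as follows; the 3-attempt cap is documented above, and passing the count as a plain integer string is an assumption:

```python
def retry_headers(max_retries=3):
    """Per-request retry override via the X-cb-max-retries header.

    3 is the documented maximum number of retries; the plain integer
    string value format is an assumption.
    """
    if not 0 <= max_retries <= 3:
        raise ValueError("the maximum number of retries is 3")
    return {"X-cb-max-retries": str(max_retries)}
```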
See Also
- To create an API key, see Get Started with AI Services APIs.
- To make an API call, see Make an API Call with AI Services APIs.
- For a full Management API reference guide, see Management API Reference.
- For a full Model Service API reference guide, see Inference API Reference.
- For a reference of the Management API errors, see Management API Error Messages.
- For a reference of the AI Services Model Service API errors, see Model Service API Error Messages.
- For the change log, see AI Services API Change Log.