Chat Completion

Introduction

This guide demonstrates how to use the SAP AI SDK for Java to perform chat completion tasks using OpenAI models deployed on SAP AI Core.

warning

All classes under any of the ...model packages are generated from an OpenAPI specification. This means that these model classes are not guaranteed to be stable and may change with future releases. They are safe to use, but may require updates even in minor releases.

New User Interface (v1.4.0)

We're excited to introduce a new user interface for OpenAI chat completions starting with version 1.4.0. This update is designed to improve the SAP AI SDK by:

Decoupling Layers: Separating the convenience layer from the model classes to deliver a more stable and maintainable experience.
Staying Current: Making it easier for the SAP AI SDK to adapt to the latest changes in the OpenAI API specification.
Consistent Design: Aligning with the Orchestrator convenience API for a smoother transition and easier adoption.

Please Note:

The new interface is gradually being rolled out across the SAP AI SDK.
We welcome your feedback to help us refine this interface.
The existing interface (v1.0.0) is deprecated but available.

Prerequisites

Before using the AI Core module, ensure that you have met all the general requirements outlined in the overview. Additionally, include the necessary Maven dependency in your project.

Maven Dependencies

Add the following dependency to your pom.xml file:

<dependencies>
    <dependency>
        <groupId>com.sap.ai.sdk.foundationmodels</groupId>
        <artifactId>openai</artifactId>
        <version>${ai-sdk.version}</version>
    </dependency>
</dependencies>

See an example pom in our Spring Boot application

Usage

In addition to the prerequisites above, we assume you have already set up the following to carry out the examples in this guide:

A Deployed OpenAI Model in SAP AI Core

Refer to How to deploy a model to AI Core for setup instructions.
In case the model is deployed in a custom resource group, refer to this section.

Example deployed model from the AI Core /deployments endpoint

{
  "id": "d123456abcdefg",
  "deploymentUrl": "https://api.ai.region.aws.ml.hana.ondemand.com/v2/inference/deployments/d123456abcdefg",
  "configurationId": "12345-123-123-123-123456abcdefg",
  "configurationName": "gpt-4o-mini",
  "scenarioId": "foundation-models",
  "status": "RUNNING",
  "statusMessage": null,
  "targetStatus": "RUNNING",
  "lastOperation": "CREATE",
  "latestRunningConfigurationId": "12345-123-123-123-123456abcdefg",
  "ttl": null,
  "details": {
    "scaling": {
      "backendDetails": {}
    },
    "resources": {
      "backendDetails": {
        "model": {
          "name": "gpt-4o-mini",
          "version": "latest"
        }
      }
    }
  },
  "createdAt": "2024-07-03T12:44:22Z",
  "modifiedAt": "2024-07-16T12:44:19Z",
  "submissionTime": "2024-07-03T12:44:51Z",
  "startTime": "2024-07-03T12:45:56Z",
  "completionTime": null
}

Simple chat completion

var result =
    OpenAiClient.forModel(GPT_4O_MINI)
        .withSystemPrompt("You are a helpful AI")
        .chatCompletion("Hello World! Why is this phrase so famous?");

String resultMessage = result.getContent();

Using a Custom Resource Group

var destination = new AiCoreService()
    .getInferenceDestination("custom-rg")
    .forModel(GPT_4O);
OpenAiClient.withCustomDestination(destination);

Custom Headers

To add custom headers to single or groups of your LLM calls, you can use the .withHeader method of the OpenAIClient.

var client = new OpenAIClient();

var result = client.withHeader("foo", "bar").chatCompletion("Hello World! Why is this phrase so famous?");

Message history

Since v1.4.0

var request =
    new OpenAiChatCompletionRequest(
        OpenAiMessage.system("You are a helpful assistant"),
        OpenAiMessage.user("Hello World! Why is this phrase so famous?"));

var response = OpenAiClient.forModel(GPT_4O).chatCompletion(request).getContent();

Since v1.0.0

var systemMessage =
    new OpenAiChatSystemMessage().setContent("You are a helpful assistant");
var userMessage =
    new OpenAiChatUserMessage().addText("Hello World! Why is this phrase so famous?");
var request =
    new OpenAiChatCompletionParameters().addMessages(systemMessage, userMessage);

var result = OpenAiClient.forModel(GPT_4O_MINI).chatCompletion(request);

String resultMessage = result.getContent();

See an example in our Spring Boot application

Chat Completion with Specific Model Version

By default, when no version is specified, the system selects one of the available deployments of the specified model, regardless of its version. To target a specific version, you can specify the model version along with the model.

OpenAiChatCompletionOutput result =
    OpenAiClient.forModel(GPT_4O_MINI.withVersion("1106")).chatCompletion(request);

Chat completion with Custom Model

You can also use a custom OpenAI model for chat completion by creating an OpenAiModel object.

OpenAiChatCompletionOutput result =
    OpenAiClient.forModel(new OpenAiModel("custom-model", "v1")).chatCompletion(request);

Ensure that the custom model is deployed in SAP AI Core.

Stream chat completion

It's possible to pass a stream of chat completion delta elements, e.g. from the application backend to the frontend in real-time.

Asynchronous Streaming - Blocking

This is a blocking example for streaming and printing directly to the console:

String msg = "Can you give me the first 100 numbers of the Fibonacci sequence?";

OpenAiClient client = OpenAiClient.forModel(GPT_4O_MINI);

// try-with-resources on stream ensures the connection will be closed
try (Stream<String> stream = client.streamChatCompletion(msg)) {
    stream.forEach(
        deltaString -> {
            System.out.print(deltaString);
            System.out.flush();
        });
}

Asynchronous Streaming - Non-blocking

Since v1.4.0

The following example demonstrate how you can leverage a concurrency-safe container (like an AtomicReference) to "listen" for usage information in any incoming delta.

String question = "Can you give me the first 100 numbers of the Fibonacci sequence?";
var userMessage = OpenAiMessage.user(question);
var request = new OpenAiChatCompletionRequest(userMessage);

OpenAiClient client = OpenAiClient.forModel(GPT_4O);
var usageRef = new AtomicReference<CompletionUsage>();

// Prepare the stream before starting the thread to handle any initialization exceptions
Stream<OpenAiChatCompletionDelta> stream = client.streamChatCompletionDeltas(request);

// Create a new thread for asynchronous, non-blocking processing
Thread streamProcessor =
    new Thread(
        () -> {
            // try-with-resources ensures the stream is closed after processing
            try (stream) {
                stream.forEach(
                    delta -> {
                        usageRef.compareAndExchange(null, delta.getCompletionUsage());
                        System.out.println("Content: " + delta.getDeltaContent());
                    });
            }
        });

// Start the processing thread; the main thread remains free (non-blocking)
streamProcessor.start();
// Wait for the thread to finish (blocking)
streamProcessor.join();

// Access information caught from completion usage
Integer tokensUsed = usageRef.get().getCompletionTokens();
System.out.println("Tokens used: " + tokensUsed);

Since v1.0.0

The following example is non-blocking and demonstrates how to aggregate the complete response. Any asynchronous library can be used, such as the classic Thread API.

var question = "Can you give me the first 100 numbers of the Fibonacci sequence?";

var userMessage =
    new OpenAiChatMessage.OpenAiChatUserMessage().addText(question);
var requestParameters =
    new OpenAiChatCompletionParameters().addMessages(userMessage);

var client = OpenAiClient.forModel(GPT_4O_MINI);
var totalOutput = new OpenAiChatCompletionOutput();

// Prepare the stream before starting the thread to handle any initialization exceptions
Stream<OpenAiChatCompletionDelta> stream =
    client.streamChatCompletionDeltas(requestParameters);

var streamProcessor =
    new Thread(
        () -> {
          // try-with-resources ensures the stream is closed after processing
          try (stream) {
            stream.peek(totalOutput::addDelta).forEach(System.out::println);
          }
        });

streamProcessor.start(); // Start processing in a separate thread (non-blocking)
streamProcessor.join(); // Wait for the thread to finish (blocking)

// Access aggregated information from total output
Integer tokensUsed = totalOutput.getUsage().getCompletionTokens();
System.out.println("Tokens used: " + tokensUsed);

Please find an example in our Spring Boot application. It shows the usage of Spring Boot's ResponseBodyEmitter to stream the chat completion delta messages to the frontend in real-time.

Executing Tool Calls

Since v1.8.0

You can pass tool definitions to the model to enable tool calling.

Below is a sample tool named WeatherMethod with parameters location and temperature unit.

class WeatherMethod {
  enum Unit {C,F}
  record Request(String location, Unit unit) {}
  record Response(double temp, Unit unit) {}

  static Response getCurrentWeather(Request request) {
    int temperature = request.location.hashCode() % 30;
    return new Response(temperature, request.unit);
  }
}

What to consider:

Self-explanatory interfaces that avoid acronyms.
Provide clear, humane readable error messages.
Enriched data objects to avoid client-side data merging.
Filter output to control size

You can define the function and pass it to the model as follows.

var client = OpenAiClient.forModel(GPT_4O_MINI);

var messages = new ArrayList<OpenAiMessage>();
messages.add(OpenAiMessage.user("What's the weather in Berlin in Celsius"));

// 1. Define the functions
List<OpenAiTool> tools =
  List.of(
    OpenAiTool.forFunction(WeatherMethod::getCurrentWeather)
      .withArgument(WeatherMethod.Request.class)
      .withName("weather")
      .withDescription("Get the weather for the given location"));

// 2. Query LLM for assistant to call the function
OpenAiChatCompletionRequest request = new OpenAiChatCompletionRequest(messages).withToolsExecutable(tools);
OpenAiChatCompletionResponse response = client.chatCompletion(request);

Optionally, you can execute the tool call and request the LLM to format the result in a user-friendly way.

// 3. Execute the function call at runtime
messages.add(response.getMessage());
messages.addAll(response.executeTools());

// 4. Send back the results, and the model will incorporate them into its final response.
OpenAiChatCompletionResponse finalResponse = client.chatCompletion(request.withMessages(messages));
String content = finalResponse.getContent();

Please find more to the example in the sample application including the definition for generateSchema and parseJson.

Introduction​

Prerequisites​

Maven Dependencies​

Usage​

Simple chat completion​

Using a Custom Resource Group​

Custom Headers​

Message history​

Chat Completion with Specific Model Version​

Chat completion with Custom Model​

Stream chat completion​

Asynchronous Streaming - Blocking​

Asynchronous Streaming - Non-blocking​

Executing Tool Calls​