Resilience Capabilities

The SAP Cloud SDK for Java provides abstractions for some frequently used resilience patterns like timeout, retry, rate limiter, or circuit breaker. Applying such patterns helps to make an application more resilient against failures it might encounter.

This article describes which resilience features the SAP Cloud SDK offers and how to apply them. If you are looking for a quick start with resilience, also check out our dedicated tutorial on the topic.

Using the Resilience API

To use the resilience capabilities of the SAP Cloud SDK, add the following dependencies to your project:

<dependency>
    <groupId>com.sap.cloud.sdk.cloudplatform</groupId>
    <artifactId>resilience</artifactId>
</dependency>
<dependency>
    <groupId>com.sap.cloud.sdk.frameworks</groupId>
    <artifactId>resilience4j</artifactId>
    <scope>runtime</scope>
</dependency>

The SAP Cloud SDK allows running any code in the context of one or more resilience patterns. There are two essential building blocks for achieving this:

  1. The ResilienceConfiguration that determines which patterns should be applied.
  2. The ResilienceDecorator that applies the configuration to an operation.

The fluent Resilience Configuration API provides builders that help with assembling different resilience patterns and their associated parameters. Which patterns are available and how to use them is explained in the dedicated section below.

The Resilience Decorator is capable of applying such a configuration to a given Callable or Supplier.

Executing Operations

Consider the following code:

result = ResilienceDecorator.executeSupplier(() -> operation(), configuration);

This code executes operation() in a resilient manner according to the given ResilienceConfiguration. The decorator applies all patterns configured in configuration, together with the logic needed to combine them.

Some resilience patterns are applied over multiple executions of the same operation. For example, the circuit breaker will prevent further executions if a significant portion of previous attempts failed.

To understand how the SAP Cloud SDK applies this concept, consider the following snippet:

configuration1 = ResilienceConfiguration.of("config-id-1");
configuration2 = ResilienceConfiguration.of("config-id-1");
configuration3 = ResilienceConfiguration.of("config-id-2");

ResilienceDecorator.executeSupplier(() -> operation(), configuration1);
ResilienceDecorator.executeSupplier(() -> operation(), configuration1);
ResilienceDecorator.executeSupplier(() -> operation(), configuration2);
ResilienceDecorator.executeSupplier(() -> operation(), configuration3);

Here executions one, two, and three will all share the same "resilience state". This means that they will share the same instance of a circuit breaker or bulkhead. The state is shared via the identifier of the associated configuration.
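
One way to picture this sharing is a registry keyed by the configuration identifier. The following is a minimal plain-Java sketch of that idea, not the SDK's implementation. The class SharedStateRegistry and its call counter are hypothetical stand-ins for real pattern state such as a circuit breaker or bulkhead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical model: one state object per configuration identifier.
class SharedStateRegistry {
    private final Map<String, AtomicInteger> callsById = new ConcurrentHashMap<>();

    // Runs the operation and records the call in the state that belongs to the identifier.
    <T> T execute(String identifier, Supplier<T> operation) {
        callsById.computeIfAbsent(identifier, id -> new AtomicInteger()).incrementAndGet();
        return operation.get();
    }

    // Number of executions recorded for the identifier so far.
    int callsFor(String identifier) {
        AtomicInteger counter = callsById.get(identifier);
        return counter == null ? 0 : counter.get();
    }
}
```

In this model, three executions under "config-id-1" and one under "config-id-2" leave the counters at 3 and 1, regardless of how many configuration objects carried the identifier.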

Operation Types

The decorator operates with two kinds of operations:

Callable: May throw checked or unchecked exceptions
Supplier: May only throw unchecked exceptions

The difference lies in the signatures: Callable declares a checked exception while Supplier does not. Choose whichever fits your use case best.
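
To make the difference concrete, here is a small plain-Java sketch that adapts a Callable to a Supplier by rethrowing checked exceptions as unchecked ones. The helper OperationTypes.asSupplier is hypothetical and only illustrates the signatures; it is not part of the SDK.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

class OperationTypes {
    // Adapts a Callable to a Supplier: unchecked exceptions pass through,
    // checked exceptions are wrapped because Supplier.get() cannot declare them.
    static <T> Supplier<T> asSupplier(Callable<T> callable) {
        return () -> {
            try {
                return callable.call();
            } catch (RuntimeException e) {
                throw e;
            } catch (Exception e) {
                throw new IllegalStateException("Operation failed", e);
            }
        };
    }
}
```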

Execution Variants

The decorator allows for three different ways of applying a configuration:

Execute: Immediately runs the operation
Decorate: Returns a new operation to be run later
Queue: Immediately runs the operation asynchronously

If your operation should run asynchronously, we highly recommend using the queue functionality. The decorator will ensure the Thread Context with Tenant and Principal information is propagated correctly to new Threads.
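
The three variants differ only in when and where the wrapped operation runs. The sketch below models this with plain JDK types. The class ExecutionVariants and the patterns parameter (a function that wraps an operation) are hypothetical, and the sketch leaves out the Thread Context propagation that the real decorator performs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

class ExecutionVariants {
    // Execute: wrap the operation and run it right away on the calling thread.
    static <T> T execute(Supplier<T> operation, UnaryOperator<Supplier<T>> patterns) {
        return patterns.apply(operation).get();
    }

    // Decorate: wrap the operation and hand it back so the caller can run it later.
    static <T> Supplier<T> decorate(Supplier<T> operation, UnaryOperator<Supplier<T>> patterns) {
        return patterns.apply(operation);
    }

    // Queue: wrap the operation and start it immediately on another thread.
    static <T> CompletableFuture<T> queue(Supplier<T> operation, UnaryOperator<Supplier<T>> patterns) {
        return CompletableFuture.supplyAsync(patterns.apply(operation));
    }
}
```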

Note that the Resilience Decorator will try to propagate the current Thread Context at the time the decorator is invoked. This is important when you are decorating a Callable or Supplier and running it later. The Thread Context must be available whenever decorateCallable or decorateSupplier is evaluated. If the call to ResilienceDecorator should take place asynchronously, you should follow these steps to ensure the Thread Context is available.
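
The timing matters because the context is read when the operation is decorated, not when it finally runs. The following sketch illustrates that behavior with a plain ThreadLocal. ContextCapture and its TENANT field are hypothetical stand-ins for the SDK's Thread Context, so treat this as a model of the idea rather than the actual mechanism.

```java
import java.util.function.Supplier;

class ContextCapture {
    // Hypothetical stand-in for the Thread Context: holds the current tenant per thread.
    static final ThreadLocal<String> TENANT = new ThreadLocal<>();

    // Reads the tenant on the decorating thread and restores it around the operation,
    // whichever thread ends up running it.
    static <T> Supplier<T> decorate(Supplier<T> operation) {
        final String captured = TENANT.get(); // evaluated at decoration time
        return () -> {
            final String previous = TENANT.get();
            TENANT.set(captured);
            try {
                return operation.get();
            } finally {
                TENANT.set(previous); // leave the executing thread as it was
            }
        };
    }
}
```

If decorate were called on a thread without a tenant, the captured value would be empty and the operation would run without one, which is why the context must be available at that point.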

Failures and Fallbacks

An operation might fail for two reasons:

  1. The operation itself encounters a failure and throws an error or exception
  2. A resilience pattern causes the operation to fail (e.g. the circuit breaker prevents further invocations)

The SAP Cloud SDK wraps all kinds of checked and unchecked exceptions into a ResilienceRuntimeException and throws it.

To deal with failures one can either catch the ResilienceRuntimeException or provide a fallback function:

result = ResilienceDecorator.executeCallable(() -> operation(), configuration,
    (throwable) -> {
        log.debug("Encountered a failure in operation: ", throwable);
        log.debug("Proceeding with fallback value: {}", fallback);
        return fallback;
    });

In the case of Callable, this relieves you of the need to catch the exception at the outer level.
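
The two ways of handling failures can be modeled in a few lines of plain Java. In this sketch, WrappedFailure is a hypothetical stand-in for ResilienceRuntimeException and FallbackSketch is not an SDK class. For brevity the sketch handles exceptions only, not errors.

```java
import java.util.concurrent.Callable;
import java.util.function.Function;

class FallbackSketch {
    // Hypothetical stand-in for ResilienceRuntimeException.
    static class WrappedFailure extends RuntimeException {
        WrappedFailure(Throwable cause) {
            super(cause);
        }
    }

    // Without a fallback: every failure surfaces as one unchecked exception type.
    static <T> T execute(Callable<T> operation) {
        try {
            return operation.call();
        } catch (Exception e) {
            throw new WrappedFailure(e);
        }
    }

    // With a fallback: the failure is handed to the fallback function
    // and the caller receives its result instead of an exception.
    static <T> T execute(Callable<T> operation, Function<Throwable, T> fallback) {
        try {
            return operation.call();
        } catch (Exception e) {
            return fallback.apply(e);
        }
    }
}
```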

Building a Resilience Configuration

A new ResilienceConfiguration with default values is created by providing an identifier for it:

configuration = ResilienceConfiguration.of("identifier");

The identifier can be either a string or a class. In the case of the latter, the (full) class name will be used as the identifier. The identifier is used to apply resilience patterns across multiple invocations of an operation.

Check the Javadoc for which patterns and parameters will be applied by default. You can also create a configuration with all patterns disabled:

configuration = ResilienceConfiguration.empty("identifier");

Individual resilience patterns are configured via dedicated builder classes like TimeLimiterConfiguration and are added to the configuration via dedicated setters, e.g. timeLimiterConfiguration(). For details see the list of Resilience Capabilities below.

Multi Tenancy

The SAP Cloud SDK is capable of applying the different resilience patterns in a tenant and principal aware manner. Consider for example the Bulkhead pattern which limits the number of parallel executions. If the operation is tenant-specific then you would probably want to avoid one tenant blocking all others.

For this reason, the SAP Cloud SDK by default isolates resilience patterns based on tenant and principal, if they are available. This strategy can be configured, e.g. for running without any isolation use:

configuration.isolationMode(ResilienceIsolationMode.NO_ISOLATION);

Other than no isolation, there are essentially two modes for tenant and/or principal isolation:

Required: Always isolates on tenant and/or principal level, will throw an exception if no tenant/principal is available
Optional: Only isolates if tenant and/or principal information is available

Details can be found in the API reference of ResilienceIsolationMode.
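
Conceptually, isolation amounts to deriving the key for the shared resilience state from the identifier plus the tenant. The sketch below shows that derivation for the tenant case only. The Mode enum is a simplified, hypothetical subset of what ResilienceIsolationMode offers, and stateKey is not an SDK method.

```java
import java.util.Optional;

class IsolationSketch {
    // Simplified, hypothetical modes; the real enum also covers principals.
    enum Mode { NO_ISOLATION, TENANT_OPTIONAL, TENANT_REQUIRED }

    // Derives the key under which pattern state (e.g. a bulkhead) would be stored.
    static String stateKey(String identifier, Mode mode, Optional<String> tenantId) {
        switch (mode) {
            case NO_ISOLATION:
                return identifier; // all tenants share one state
            case TENANT_OPTIONAL:
                return tenantId.map(t -> identifier + "/" + t).orElse(identifier);
            case TENANT_REQUIRED:
                return identifier + "/" + tenantId.orElseThrow(
                    () -> new IllegalStateException("No tenant available"));
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

With separate keys per tenant, a bulkhead that is full for one tenant does not block calls made on behalf of another.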

Resilience Capabilities

The following resilience patterns are available and can be configured in a Resilience Configuration:

Timeout (TimeLimiterConfiguration): Limit how long an operation may run before it should be interrupted
Rate Limiter (RateLimiterConfiguration): Limit the number of operations accepted in a window of time
Retry (RetryConfiguration): Retry a failed operation a limited number of times before failing
Circuit Breaker (CircuitBreakerConfiguration): Reject attempts if too many failures occurred in the past
Bulkhead, also known as Shed Load or Load Shedding (BulkheadConfiguration): Limit how many instances of this operation may run in parallel

You can find good explanations of how the individual patterns behave in the documentation of Resilience4j, which the SAP Cloud SDK uses under the hood to perform resilient operations.

Be aware that the patterns interact with each other. They are applied in the following order:

Fallback ( Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) ) )

Read from right to left, this shows the order in which the aspects are applied: Bulkhead is the first aspect applied, while the Fallback is called last. Exceptions are therefore also propagated from right to left.

Based on the order, the following inferences (not exhaustive) can be made:

  • Every timeout will trigger a retry, if configured.
  • The fallback function is considered only if all retries have failed.
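
Both inferences follow from the nesting and can be reproduced with a plain-Java model of Fallback ( Retry ( Function ) ). The class PatternOrderSketch is hypothetical, and a timeout is represented simply as an exception thrown by the inner operation. This is a sketch of the ordering only, not of how Resilience4j implements the patterns.

```java
import java.util.function.Function;
import java.util.function.Supplier;

class PatternOrderSketch {
    // Retry layer: re-invokes the inner operation until it succeeds or attempts run out.
    static <T> Supplier<T> retry(Supplier<T> inner, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e; // e.g. a timeout raised by an inner layer
                }
            }
            throw last; // all attempts failed, propagate outwards
        };
    }

    // Fallback layer: outermost, only sees a failure after the retry layer gave up.
    static <T> Supplier<T> fallback(Supplier<T> inner, Function<Throwable, T> fallback) {
        return () -> {
            try {
                return inner.get();
            } catch (RuntimeException e) {
                return fallback.apply(e);
            }
        };
    }
}
```

With three attempts configured and an operation that always times out, the operation is invoked three times before the fallback value is returned.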

You can get more details in the Resilience4j official documentation.