Resilience Capabilities of the SAP Cloud SDK for Java

The SAP Cloud SDK for Java provides abstractions for some frequently used resilience patterns like timeout, retry or circuit breaker. Applying such patterns helps make an application more resilient against failures it might encounter.

This article describes which resilience features the SDK offers and how to apply them. If you are looking for a quick start with resilience, also check out our dedicated tutorial on the topic.

Using the Resilience API

The SDK allows you to run any code in the context of one or more resilience patterns. There are two essential building blocks for achieving this:

  1. The ResilienceConfiguration that determines which patterns should be applied.
  2. The ResilienceDecorator which is capable of applying the configuration to an operation.

The fluent Resilience Configuration API provides builders that help with assembling different resilience patterns and their associated parameters. Which patterns are available and how to use them is explained in the dedicated section below.

The Resilience Decorator is capable of applying such a configuration to a given Callable or Supplier.

Executing Operations

Consider the following code:

result = ResilienceDecorator.executeSupplier(() -> operation(), configuration);

This code executes operation() in a resilient manner according to a ResilienceConfiguration. The decorator applies all patterns configured in configuration, along with the logic needed to combine them.

Some resilience patterns are applied over multiple executions of the same operation. For example, the circuit breaker will prevent further executions if a significant portion of previous attempts failed.

To understand how the SDK applies this concept consider the following snippet:

final ResilienceConfiguration configuration1 = ResilienceConfiguration.of("config-id-1");
final ResilienceConfiguration configuration2 = ResilienceConfiguration.of("config-id-1");
final ResilienceConfiguration configuration3 = ResilienceConfiguration.of("config-id-2");
ResilienceDecorator.executeSupplier(() -> operation(), configuration1); // execution one
ResilienceDecorator.executeSupplier(() -> operation(), configuration1); // execution two
ResilienceDecorator.executeSupplier(() -> operation(), configuration2); // execution three
ResilienceDecorator.executeSupplier(() -> operation(), configuration3); // execution four

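Here executions one, two and three all share the same "resilience state", because configuration1 and configuration2 use the same identifier. This means that they share the same instance of a circuit breaker or bulkhead. The fourth execution uses a different identifier and therefore has its own state. In short, the state is shared via the identifier of the associated configuration.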
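One way to picture this sharing is a registry that hands out exactly one state object per identifier. The following plain-Java sketch is not the SDK's implementation. The class SharedStateSketch, the method stateFor and the use of a simple counter as "state" are hypothetical and only illustrate the idea.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: NOT the SDK's implementation. All names are hypothetical.
final class SharedStateSketch {
    // One state object (here: a simple invocation counter) per configuration identifier.
    private static final Map<String, AtomicInteger> STATE_BY_ID = new ConcurrentHashMap<>();

    // Returns the state for the identifier, creating it on first use.
    static AtomicInteger stateFor(String identifier) {
        return STATE_BY_ID.computeIfAbsent(identifier, id -> new AtomicInteger());
    }

    public static void main(String[] args) {
        stateFor("config-id-1").incrementAndGet(); // execution one
        stateFor("config-id-1").incrementAndGet(); // execution two
        stateFor("config-id-1").incrementAndGet(); // execution three (other configuration object, same identifier)
        stateFor("config-id-2").incrementAndGet(); // execution four

        System.out.println(stateFor("config-id-1").get()); // 3
        System.out.println(stateFor("config-id-2").get()); // 1
    }
}
```

Because the lookup is keyed by the identifier string, two separately created configuration objects with equal identifiers end up with the same state.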

Operation Types

The decorator operates with two kinds of operations:

Callable    May throw checked or unchecked exceptions
Supplier    May only throw unchecked exceptions

The two differ in their signatures: Callable.call() declares a checked exception, while Supplier.get() does not. Choose whichever fits your use case best.
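The difference can be seen with the standard java.util.concurrent.Callable and java.util.function.Supplier interfaces alone. In the sketch below, readRemote is a hypothetical operation that may fail with a checked exception. It does not use the SDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Illustration only. readRemote is a hypothetical operation.
final class OperationTypesSketch {
    static String readRemote(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("remote call failed");
        }
        return "data";
    }

    // Callable.call() declares "throws Exception", so the checked IOException may propagate.
    static Callable<String> asCallable(boolean fail) {
        return () -> readRemote(fail);
    }

    // Supplier.get() declares no checked exceptions, so the IOException has to be wrapped.
    static Supplier<String> asSupplier(boolean fail) {
        return () -> {
            try {
                return readRemote(fail);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        System.out.println(asCallable(false).call()); // data
        System.out.println(asSupplier(false).get());  // data
    }
}
```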

Execution Variants

The decorator allows for three different ways of applying a configuration:

Execute     Immediately runs the operation
Decorate    Returns a new operation to be run later
Queue       Immediately runs the operation asynchronously

If your operation should run asynchronously, we highly recommend that you use the queue functionality. The decorator will ensure the Thread Context with Tenant and Principal information is propagated correctly to new Threads.

Note: The Resilience Decorator will try to propagate the current Thread Context at the time the decorator is invoked. This is important when you are decorating a Callable or Supplier and running it later. The Thread Context must be available whenever decorateCallable or decorateSupplier is evaluated. So if the call to ResilienceDecorator should take place asynchronously, you should follow these steps to ensure the Thread Context is available.
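Why the moment of decoration matters can be shown with a plain ThreadLocal. The sketch below is not the SDK's Thread Context implementation. TENANT, decorate and queue are hypothetical stand-ins that capture a thread-bound value when the operation is decorated and restore it on whichever thread later runs it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Illustration only: a hypothetical stand-in for thread-bound context handling.
final class ContextPropagationSketch {
    // Stands in for tenant and principal information bound to the current thread.
    static final ThreadLocal<String> TENANT = new ThreadLocal<>();

    // Captures the context when decorate(...) is called, not when the returned operation runs.
    static <T> Supplier<T> decorate(Supplier<T> operation) {
        final String captured = TENANT.get();
        return () -> {
            final String previous = TENANT.get();
            TENANT.set(captured);
            try {
                return operation.get();
            } finally {
                // Restore whatever the executing thread had before.
                if (previous == null) {
                    TENANT.remove();
                } else {
                    TENANT.set(previous);
                }
            }
        };
    }

    // Decorates on the calling thread, then runs the result on another thread.
    static <T> CompletableFuture<T> queue(Supplier<T> operation) {
        return CompletableFuture.supplyAsync(decorate(operation));
    }

    public static void main(String[] args) {
        TENANT.set("tenant-a");

        String viaQueue = queue(() -> TENANT.get()).join();
        System.out.println(viaQueue); // tenant-a

        // Decorating inside the worker thread is too late: no context is available there.
        String tooLate = CompletableFuture.supplyAsync(() -> decorate(() -> TENANT.get()).get()).join();
        System.out.println(tooLate); // null
    }
}
```

The second call mirrors the situation described in the note: when the decoration itself happens on a thread without context, there is nothing left to propagate.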

Failures and Fallbacks

An operation might fail for two reasons:

  1. The operation itself encounters a failure and throws an error or exception
  2. A resilience pattern causes the operation to fail (e.g. the circuit breaker prevents further invocations)

The SDK wraps all kinds of checked and unchecked exceptions in a ResilienceRuntimeException and throws it.

To deal with failures one can either catch the ResilienceRuntimeException or provide a fallback function:

executeCallable(() -> operation(), configuration,
    (throwable) -> {
        log.debug("Encountered a failure in operation: ", throwable);
        log.debug("Proceeding with fallback value: {}", fallback);
        return fallback;
    });

In the case of Callable this relieves you of the need to catch the exception at the outer level.
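The two ways of dealing with a failure can be contrasted in a small plain-Java sketch. It does not use the SDK: WrappedFailure is a hypothetical stand-in for ResilienceRuntimeException, and executeOrThrow and executeWithFallback are hypothetical helpers. Whether the SDK passes the original or the wrapped exception to the fallback is not modelled here.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.function.Function;

// Illustration only: NOT the SDK's implementation. All names are hypothetical.
final class FallbackSketch {
    // Hypothetical stand-in for an unchecked wrapper exception.
    static final class WrappedFailure extends RuntimeException {
        WrappedFailure(Throwable cause) {
            super(cause);
        }
    }

    // Without a fallback: any exception is wrapped in an unchecked exception and rethrown.
    static <T> T executeOrThrow(Callable<T> operation) {
        try {
            return operation.call();
        } catch (Exception e) {
            throw new WrappedFailure(e);
        }
    }

    // With a fallback: the failure is handed to the fallback function, which supplies the result.
    static <T> T executeWithFallback(Callable<T> operation, Function<? super Throwable, T> fallback) {
        try {
            return operation.call();
        } catch (Exception e) {
            return fallback.apply(e);
        }
    }

    public static void main(String[] args) {
        String value = executeWithFallback(
            () -> { throw new IOException("service unavailable"); },
            throwable -> "fallback");
        System.out.println(value); // fallback
    }
}
```

Note that main needs neither a try/catch nor a throws clause, even though the operation throws a checked exception.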

Building a Resilience Configuration

A new ResilienceConfiguration with default values is created by providing an identifier for it:

ResilienceConfiguration configuration = ResilienceConfiguration.of("identifier");

The identifier can be either a string or a class. In the latter case, the (full) class name is used as the identifier. The identifier is used to apply resilience patterns across multiple invocations of operations.

Check the JavaDoc for which patterns and parameters will be applied by default. You can also create a configuration with all patterns disabled:

ResilienceConfiguration configuration = ResilienceConfiguration.empty("identifier");

Individual resilience patterns are configured via dedicated builder classes like TimeLimiterConfiguration and are added to the configuration via dedicated setters, e.g. timeLimiterConfiguration(). For details see the list of Resilience Capabilities below.

Multi Tenancy

The SDK is capable of applying the different resilience patterns in a tenant and principal aware manner. Consider for example the Bulkhead pattern, which limits the number of parallel executions. If the operation is tenant specific, you would probably want to avoid one tenant blocking all others.

For this reason the SDK by default isolates resilience patterns based on tenant and principal, if they are available. This strategy can be configured. For example, to run without any isolation, use:

configuration.isolationMode(ResilienceIsolationMode.NO_ISOLATION);

Other than no isolation there are essentially two modes for tenant and/or principal isolation:

Required    Always isolates on tenant and/or principal level, will throw an exception if no tenant/principal is available
Optional    Only isolates if tenant and/or principal information is available

Details can be found in the API reference of ResilienceIsolationMode.
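The effect of the required and optional modes can be pictured as part of the key under which resilience state is stored. The following sketch is not the SDK's implementation: the Mode enum is a simplified, hypothetical one that only covers the tenant dimension, and stateKey is a hypothetical helper.

```java
// Illustration only: NOT the SDK's implementation. Mode and stateKey are hypothetical.
final class IsolationKeySketch {
    enum Mode { NO_ISOLATION, TENANT_OPTIONAL, TENANT_REQUIRED }

    // Derives the key under which resilience state is stored; tenant is null if none is available.
    static String stateKey(String identifier, Mode mode, String tenant) {
        if (mode == Mode.NO_ISOLATION) {
            return identifier; // all tenants share one state
        }
        if (tenant == null) {
            if (mode == Mode.TENANT_REQUIRED) {
                throw new IllegalStateException("No tenant available for " + identifier);
            }
            return identifier; // optional mode: no tenant, so no isolation
        }
        return identifier + "|" + tenant; // one state per tenant
    }

    public static void main(String[] args) {
        System.out.println(stateKey("orders", Mode.TENANT_OPTIONAL, "tenant-a")); // orders|tenant-a
        System.out.println(stateKey("orders", Mode.TENANT_OPTIONAL, null));       // orders
        System.out.println(stateKey("orders", Mode.NO_ISOLATION, "tenant-a"));    // orders
    }
}
```

With per-tenant keys, a bulkhead that is exhausted by one tenant leaves the state of other tenants untouched.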

Resilience Capabilities

The following resilience patterns are available and can be configured in a Resilience Configuration:

Timeout            TimeLimiterConfiguration       Limit how long an operation may run before it should be interrupted
Retry              RetryConfiguration             Retry a failed operation a limited number of times before failing
Circuit Breaker    CircuitBreakerConfiguration    Reject attempts if too many failures occurred in the past
Bulkhead           BulkheadConfiguration          Limit how many instances of this operation may run in parallel

Bulkhead is also known as Shed Load or Load Shedding.

You can find good explanations of how the individual patterns behave in the documentation of resilience4j, which the SDK uses under the hood to perform resilient operations.

Be aware that the patterns interact with each other. They are applied in the following order:

  1. Timeouts
  2. Bulkhead
  3. Circuit Breaker
  4. Retries
  5. Fallbacks

This means that every individual attempt triggered by retries will be limited by the timeout. Every failed retry will be accounted for in the circuit breaker. The fallback function is only considered if all retries have failed.
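The interplay of timeout, retry and fallback can be sketched with standard java.util.concurrent classes. This is a simplified illustration under stated assumptions, not the SDK's or resilience4j's implementation: bulkhead and circuit breaker are left out, and the execute helper with its parameters maxAttempts and timeoutMillis is hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Illustration only: NOT the SDK's implementation. All names are hypothetical.
final class PatternOrderSketch {
    static <T> T execute(Callable<T> operation, int maxAttempts, long timeoutMillis,
                         Function<Throwable, T> fallback) {
        ExecutorService executor = Executors.newCachedThreadPool();
        Throwable lastFailure = null;
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = executor.submit(operation);
                try {
                    // The timeout applies to every attempt separately.
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    future.cancel(true); // interrupt the attempt that ran too long
                    lastFailure = e;
                } catch (ExecutionException e) {
                    lastFailure = e.getCause(); // the operation itself failed
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                    lastFailure = e;
                    break;
                }
            }
        } finally {
            executor.shutdownNow();
        }
        // Reached only after every attempt has failed.
        return fallback.apply(lastFailure);
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        String result = execute(
            () -> { attempts.incrementAndGet(); throw new IllegalStateException("always fails"); },
            3, 1000, failure -> "fallback");
        System.out.println(result + " after " + attempts.get() + " attempts"); // fallback after 3 attempts
    }
}
```

The retry loop wraps the timed call and the fallback sits outside the loop, which matches the ordering described above for these three patterns.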

Last updated on by Frank Essenberger