Manage transactions in the cloud, Part 2: Making do without distributed transactions
How to ensure transactional qualities in cloud applications
In Part 1 of this series, you learned about transaction processing
and distributed transactions. Transactional properties help to reduce the
error-handling effort required for an application. However, distributed
transactions, which manage transactions across multiple resources, are not
necessarily available in cloud-based runtimes. Here in Part 2, we look at
the cloud setup, and ways of ensuring transactional qualities across
multiple resource managers even without distributed transactions.
In a cloud setup, distributed transactions are frowned upon even more than
they are in a classical setup. Resources (resource managers such as
databases) can be volatile, as can transaction managers. They can be
started or stopped at any time to adjust for different load situations.
Restarting a resource manager aborts all transactions currently active in
it, but what about transaction managers that are in the middle of the
voting phase of a distributed transaction? Without a persistent
transaction manager log, transactions are lost in limbo and resources
locked for those transactions will never be released by their resource
managers.
Also, in a cloud setup, as well as in some service-oriented architecture
(SOA) setups, the protocols used to call a resource manager, such as REST
over HTTP, frequently do not support transactions. With such
protocols, a call is equivalent to an autocommitted operation, as
described in Part 1.
Therefore, while you could probably set up the application servers in the
cloud with a shared, persistent transaction log on a separate storage
infrastructure to compensate for volatile operations, many popular
communication protocols do not support the use of transactions.
This means that in a cloud setup you have to make do without distributed
transactions. The following sections present a number of techniques for
coordinating multiple resources in a non-transactional setup.
Ways to handle distributed transactions in the cloud
The underlying assumption of transaction processing is that you want
multiple resources to be consistent with each other. A change in one
resource should be consistent with a change in another resource. Without
distributed transactions, the following error-handling patterns can be used:
- A change is performed in only one of the resources. If this situation
is acceptable from a functional point of view, transactions can be
autocommitted, and the error situation in the other resource can be
logged and ignored.
- The change in the first resource is rolled back when an error occurs
while changing the second resource. This requires that the transaction
on the first resource still be open when the error occurs, so that it
can be rolled back. Alternatively, when the transactions on both
resources are kept open and the error occurs on the first commit of
one of the two transactions, the other transaction can still be rolled
back. Either way, the rolled-back change was never visible to the
outside, so the rollback creates no window of inconsistency.
- One resource is changed and, at a later point in time, the second
resource is changed to become consistent with the first resource. This
could be due to a recoverable error situation that occurs while
changing the second resource, but it could also be an architectural
pattern that defers the changes in the second resource anyway, such as
in batch processing. This pattern creates a (larger) window of
inconsistency, but in the end consistency is eventually achieved.
Thus, this pattern is referred to as eventual consistency. I
call it roll-forward inconsistency, as the second change is
rolled forward.
- The change in the first resource is compensated when an error occurs
while changing the second resource. For this pattern, the
first resource needs to provide a compensating transaction to roll
back the changes in a separate transaction. This also creates eventual
consistency, but the error pattern is different from the roll-forward
inconsistency above. I call this a roll-back inconsistency,
as the first change is rolled back.
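The second pattern in the list above (keep the first transaction open and roll it back if the second resource fails) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `cart` table, the `OrderServiceError` exception, and the `place_order` callable that stands in for the remote, autocommitted order service are all assumptions made for this example.

```python
import sqlite3

class OrderServiceError(Exception):
    """Raised by the hypothetical remote order service (assumption)."""

def submit_cart(conn, cart_id, place_order):
    """Change the local cart, but commit only if the remote call succeeds."""
    try:
        # First resource: the change stays inside an open local transaction.
        conn.execute("UPDATE cart SET status = 'ORDERED' WHERE id = ?", (cart_id,))
        # Second resource: behaves like an autocommitted operation.
        place_order(cart_id)
        # The first resource is committed last.
        conn.commit()
        return True
    except OrderServiceError:
        # The uncommitted change was never visible to other readers.
        conn.rollback()
        return False
```

Note the remaining gap: if `conn.commit()` itself fails after the remote call has succeeded, the two resources are inconsistent, and one of the eventual consistency patterns has to take over.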
In the next section, I look at the techniques for employing the eventual
consistency property. The system will not be immediately consistent, but
it should become consistent soon after the transaction is committed.
How can error conditions be reduced so that errors will happen only
infrequently, or not at all?
Ignore—and fix later
The easiest approach, of course, is to ignore any error situation and fix
it with a batch job later on. However, that may not be as easy as it
seems. Consider a shop system with a shopping cart in one database and an
order-processing service. If the order service is not available and you
ignore it, you need to do the following:
- Maintain a status on the shopping cart indicating that the order has not
yet been processed.
- Consider the time needed until the batch runs to fix the situation. If
it only runs at night, you may lose a day until order processing
happens on the back end, possibly creating dissatisfied customers.
Figure 1 illustrates an example of an eventual consistency batch fixing up
inconsistencies.
Figure 1. A batch job repairing inconsistencies
A variation on this theme is to write a “job” to the database, and have a
background process continuously attempt to deliver the order until
successful. In this case, you would essentially be building your own
transaction log and transaction manager.
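The "job" variation might look roughly like the sketch below. The table names (`cart`, `order_job`), the `place_order` callable, and the use of `ConnectionError` to signal an unavailable order service are assumptions for illustration. The cart change and the job row are written in one local transaction, and a background pass retries open jobs.

```python
import sqlite3

def enqueue_order(conn, cart_id):
    """Write the cart change and the pending job in ONE local transaction."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute("UPDATE cart SET status = 'SUBMITTED' WHERE id = ?", (cart_id,))
        conn.execute("INSERT INTO order_job (cart_id, done) VALUES (?, 0)", (cart_id,))

def deliver_pending(conn, place_order):
    """One pass of the background process: try every open job once."""
    delivered = 0
    rows = conn.execute("SELECT id, cart_id FROM order_job WHERE done = 0").fetchall()
    for job_id, cart_id in rows:
        try:
            place_order(cart_id)  # may be called again later, so it should be repeatable
        except ConnectionError:
            continue              # leave the job open for the next pass
        with conn:
            conn.execute("UPDATE order_job SET done = 1 WHERE id = ?", (job_id,))
        delivered += 1
    return delivered
```

If the process stops between a successful `place_order` call and the update of the job row, the next pass delivers the same order again, which is why the called service should tolerate repeated calls.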
Time-based pessimistic locking
Pessimistic locking is a technique that allows transactions to ensure that
the resource they want to change is actually available to them.
Pessimistic locking prevents other transactions from modifying a resource
(like a database row) and is sometimes even necessary in short-lived
transactions.
In long-running business transactions like that described in Part 1, an
initial short-running transaction changes the state of the resource so
that other transactions cannot lock it again. In the credit card example
in Part 1, you saw that such techniques can be successfully used at scale
in a long-running business transaction.
This technique can also work in a cloud setup; however, there is one caveat:
You need to make sure that the lock is released when it’s not needed
anymore. You can do this by successfully committing a change on the object
or rolling back the transaction. Basically, you need to write and process
a transaction log in the transaction client or transaction manager to
ensure that the lock is released even if other possible errors have
occurred.
If you cannot guarantee that the lock will be released, you can use
time-based pessimistic locking—a lock that releases itself after a
given time interval—as a last resort. Such a lock can be
implemented with a simple timestamp on the resource, so no batch job is
needed.
A lock introduces a shared state between the transaction client and the
resource manager. The lock needs to be identified and associated with the
transaction; therefore, you need to pass either a lock identifier or a
transaction identifier between the systems to resolve the lock on commit
or rollback.
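A time-based lock of this kind could be sketched as below, assuming a hypothetical `item` table with `lock_owner` and `lock_expires` columns and a caller-supplied transaction identifier `tx_id`. The lock is taken with a single conditional UPDATE, so an expired lock is simply overwritten by the next transaction that asks for it.

```python
import sqlite3

def try_lock(conn, item_id, tx_id, ttl_seconds, now):
    """Take the lock if it is free, already ours, or expired."""
    with conn:
        cur = conn.execute(
            "UPDATE item SET lock_owner = ?, lock_expires = ? "
            "WHERE id = ? AND (lock_owner IS NULL OR lock_owner = ? OR lock_expires <= ?)",
            (tx_id, now + ttl_seconds, item_id, tx_id, now),
        )
    return cur.rowcount == 1  # one row updated means we hold the lock

def release(conn, item_id, tx_id):
    """Release the lock on commit or rollback, but only if we still own it."""
    with conn:
        cur = conn.execute(
            "UPDATE item SET lock_owner = NULL, lock_expires = NULL "
            "WHERE id = ? AND lock_owner = ?",
            (item_id, tx_id),
        )
    return cur.rowcount == 1
```

The current time is passed in as `now` (for example from `time.time()`) to keep the sketch easy to test. A transaction whose lock has expired must be prepared for `release` to return `False`, because another transaction may have taken the lock in the meantime.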
In contrast to pessimistic locking, optimistic locking is not an
error-avoiding technique; its goal is to detect inconsistencies. While
pessimistic locking reserves a resource for a transaction, a more
optimistic approach is to assume that there is no inconsistency and to
directly perform the local change for a transaction in the first step.
However, you then need to be able to roll back the change if the
inconsistency does occur. This is a compensation of the
optimistically performed operation.
In theory, this is a nice approach, as you only need to call the
compensation service operation when the operation is rolled back. However,
in a JEE application, for example, a resource manager can decide to roll
back when the commit() is attempted, rather than when the change is
performed, as some database constraints may only be checked at
commit time. The result is that you need to write your own Java resource
adapter because only then do you know for sure whether a transaction on
another resource has been rolled back or committed. But this once again
introduces distributed transactions, which you want to avoid in a cloud
setup for the reasons mentioned above. Also, you need to determine what
happens when your compensation fails for some reason.
A better approach is to note the failure in some kind of transaction
log—for example, a row written to a specific table in a separate
(autocommit transaction) operation. You would then let a batch job process
the compensation at a later time. However, a problem occurs when the
second change is fully processed at the called service, but the caller
does not receive a response, as shown in the next section. If the first
resource is then compensated, you have generated another inconsistency. If
such situations should be avoided, you should check the state of the
second resource before trying to compensate the first change.
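The following sketch shows one way to combine the failure log with the state check. It reuses the shopping cart example. The `compensation_log` table, the cart status values, and the `order_exists` callable that queries the second resource are assumptions made for illustration.

```python
import sqlite3

def record_failure(conn, cart_id):
    """Note the suspected failure in its own short transaction."""
    with conn:
        conn.execute(
            "INSERT INTO compensation_log (cart_id, done) VALUES (?, 0)", (cart_id,)
        )

def run_compensation_batch(conn, order_exists):
    """Compensate a cart only if the second resource really has no order for it."""
    compensated = []
    rows = conn.execute(
        "SELECT id, cart_id FROM compensation_log WHERE done = 0"
    ).fetchall()
    for log_id, cart_id in rows:
        with conn:
            # Check the state of the second resource before compensating.
            if not order_exists(cart_id):
                conn.execute("UPDATE cart SET status = 'OPEN' WHERE id = ?", (cart_id,))
                compensated.append(cart_id)
            conn.execute("UPDATE compensation_log SET done = 1 WHERE id = ?", (log_id,))
    return compensated
```

If `order_exists` itself fails, the exception leaves the log entry open, so the next batch run picks it up again.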
As you can see here, discussing all the different possible error modes is a
tedious but necessary endeavor. Only one more specific situation remains
to be discussed: when an operation actually succeeds, but the caller does
not receive the reply and assumes an error.
Consider the shopping cart and the order service again. Let’s say the order
service has been called, but the reply has gone missing on the network.
The caller does not know whether the order service call succeeded and
tries again. The order should not be processed twice, of course, so what
can be done?
Figure 2. Service operations must manage
“double submit” problems
Figure 2 illustrates such a situation. This problem is similar to the
“double submit” problem in web applications, in which a button is
accidentally double-clicked and the same request is sent to the server
twice. What if the server (or the resource manager) detects this repeated
call? In our example, the order service has to identify the second call as
a repeated call and return just a copy of the first call’s result; the
actual order is processed only once and the order service is
repeatable.
Repeatability of a service helps greatly in reducing the complexity of
error handling in a setup without distributed transactions. This means
that the service is idempotent: You can repeat the service call
multiple times and the result will always be as if the service was called
only once. But of course, this should only happen to the very same order.
A separate order should be processed as a new order, even if it has the
same content.
This solution can be implemented by comparing all attributes of a service
call. For example, my bank rejects a second transaction with the identical
amount, subject, and recipient as the first, just in case I entered a
transaction twice into the online banking application by accident.
More robustly, a transaction identifier that’s sent with the request can
help detect double submits. In that case, the called service operation
registers a transaction ID that’s sent by the caller and when the same ID
is used again, it returns only the original result. However, this may
require sharing state (the registered transaction IDs) between service
instances in a cluster, so that a repeated call to another cluster member
is still processed only once. Keeping that shared state consistent is yet
another consistency problem.
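A repeatable service operation based on a caller-supplied transaction ID can be sketched like this. The `OrderService` class, its in-memory dictionary, and the shape of the result are hypothetical and only illustrate the idea.

```python
class OrderService:
    """Hypothetical order service that is repeatable per transaction ID."""

    def __init__(self):
        self._results = {}  # transaction ID -> result of the first call
        self.orders = []    # orders that were actually processed

    def place_order(self, tx_id, items):
        # Repeated call with a known ID: return a copy of the first result.
        if tx_id in self._results:
            return dict(self._results[tx_id])
        # New ID: process the order exactly once and remember the result.
        order_no = len(self.orders) + 1
        self.orders.append((order_no, tuple(items)))
        result = {"order_no": order_no, "status": "ACCEPTED"}
        self._results[tx_id] = result
        return dict(result)
```

A retry with the same ID returns the original result, while a second order with identical items but a new ID is processed as a new order. Because the dictionary lives in one process, a clustered service would need to share it between instances, which is exactly the shared-state issue described above.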
Manual transaction and error handling
With the lack of distributed transactions, the system needs to look at each
individual operation and decide how to tackle it in a way that ensures
consistency. A client that calls multiple resource managers—each
possibly in its own transaction—must carefully decide which
resource manager to commit first and what to do with the others in case of
an error. This is determined by the functional requirements.
Non-transactional resources such as web services are even worse because
they cannot be rolled back but only compensated. What the resource manager
used to do for you in a classical system you now have to do for
yourself.
Consider this example from one of my projects: We stored some records in a
separate system using a web service, which we needed to reference later.
As the local transaction sometimes ran into an error, we had to seriously
look at when the error occurred and where to call the compensating
operation. In the end, we made the call repeatable and did away with the
compensation altogether. If a transaction was rolled back, the next try
would write (and overwrite) the same record again, and we had the pointer
we needed for referral without any leftovers from previous tries.
In another example, we were forced to keep a reference counter of a master
data record usage in another system, using an increment/decrement web
service. Whenever we created an object that referred to the master data
record (that is, stored its ID), we had to increment the counter, and
decrement it when deleting the object. This caused two problems. First,
the call was a web service so we had to use a Java resource adapter to
handle rollbacks. Even then, errors still occurred during rollback (for
example, due to network problems). Second (and worse), decrement was not a
100% compensation. If the value reached zero—for example in a
rollback—there was the danger that the master data record would be
deleted (which we found out about only afterward, of course). In the end,
we only counted up when we created an object, and noted in a separate
(transaction) log table whenever we did something with this master data
record. A separate nightly batch then fixed the counter value.
Examples like these are common. Even eBay uses a basically transaction-free
design. Most of eBay’s database statements are auto-commit (which is
similar to a web service call); only in certain, well-defined situations
are statements combined into a single transaction (on that single
database). Distributed transactions are not allowed. (See “Scalability Best Practices: Lessons from eBay.”)
In a related scenario for distributed replicated databases, the CAP
theorem states that out of three capabilities (consistency, availability,
and partition tolerance), only two can be achieved at any one time.
While this is defined for a single,
distributed database, it can also be applied to transactions across
multiple resources as discussed above.
Relaxing strong consistency and allowing eventual consistency windows is
formalized in the “BASE” paradigm:
- Basically Available
- Soft state
- Eventual consistency
Availability is prioritized over strong consistency, which results in a
soft state that may not be the latest value, but may
eventually be updated. This principle applies to the database area, but is
also relevant to web services or microservices that are deployed in the
cloud. However, while in a distributed database the database vendor is
responsible for keeping the system (eventually) consistent, in an
application that uses multiple resources, the application must take care
of that itself, as shown above.
The new eventual consistency paradigm can even be acceptable in some
functional scenarios. See, for example, Google’s discussion in “Providing a Consistent User Experience and Leveraging the Eventual
Consistency Model to Scale to Large Datasets.”
Going further, you can try to find and use algorithms that are safe through
eventual consistency. The recently proposed CALM theorem holds
that some programs (those that are “monotonic”) can be run safely on an
eventually consistent basis. For example, if you increment a counter by
reading its value, adding 1, and writing it back, the operation is
susceptible to consistency problems; however, if you increment the counter
by calling an “increment” operation, it is not. See “Eventual Consistency
Today: Limitations, Extensions, and Beyond” and “CALM: consistency as logical
monotonicity” for more information.
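The counter example can be made concrete with a small sketch. It is my own illustration of the idea and is not taken from the cited papers. The names `lost_update_demo` and `IncrementSet` are hypothetical. The first function replays a read, add 1, write back sequence by two clients that both saw the same stale value. The class only records increment operations, so its state only ever grows and replicas can be merged in any order.

```python
def lost_update_demo():
    """Two clients read the same value, each adds 1, each writes it back."""
    stored = 0
    seen_by_a = stored      # client A reads
    seen_by_b = stored      # client B reads the same, now stale, value
    stored = seen_by_a + 1  # client A writes back
    stored = seen_by_b + 1  # client B overwrites A's update
    return stored

class IncrementSet:
    """Counter that only accepts increment operations, identified by an ID."""

    def __init__(self):
        self.increments = set()

    def increment(self, op_id):
        self.increments.add(op_id)  # adding the same operation twice is harmless

    def merge(self, other):
        self.increments |= other.increments  # set union: order does not matter

    @property
    def value(self):
        return len(self.increments)
```

With two replicas that each record one increment and then merge in either direction, both report a value of 2, whereas the read and write variant ends up at 1 after two intended increments.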
Having to do without the strong consistency models that are guaranteed by
distributed transactions can make life harder in some ways for the average
programmer. Work that was previously done by the transaction manager must
now be done by the programmer, especially in the area of error handling.
On the other hand, the programmer’s life is also made easier: Programming
for eventual consistency from the start makes the program much more
scalable and resilient to network delays and problems in a world that is
much more distributed than the programmer’s own datacenter. Not everyone
creates applications for an eBay-sized scale from the start, of course,
but distributed transactions can create significant overhead. (However, I
confess that I wish I could use at least one database and a message store
in a single transaction.)
So how can you get along without distributed transactions? Here are some
suggestions:
- In a new system, try to find algorithms that solve your business
problem but work under eventual consistency (see the CALM
theorem).
- Challenge the business needs on strong consistency requirements.
- Make sure the services you call are adapted to this architecture by
being repeatable (idempotent) or by providing proper
compensation operations.
- In an existing system, modify your algorithms to make them simpler and
less prone to failures in the services your system calls using the
techniques described here.
The cloud makes it easier to quickly deploy new versions or scale your
application, but this can also make things more challenging. Using what
you have learned here about transaction management should help you create
applications that work better in the cloud.
Many thanks to my colleagues Sven Höfgen, Jens Fäustl, and Martin Kunz
for reviewing and commenting on this series.
via IBM developerWorks : Cloud computing https://ibm.co/2cihRPX
May 2, 2017 at 09:57AM