Extending k-CAS with efficient read-only CMP operations

The kcas library currently uses the GKMZ algorithm for Efficient Multi-word Compare and Swap or MCAS aka k-CAS. This is a nearly optimal algorithm for MCAS as it requires only k 1 CAS operations.

The new library API also provides a transactional API for using the algorithm. For example, suppose one would create the following shared memory locations:

let a = Loc.make 10
let b = Loc.make 52
let x = Loc.make 0
let y = Loc.make 0

Using the Tx API one could define a transaction x_to_b_sub_a

let x_to_b_sub_a = Tx.(
  let* a' = get a
  and* b' = get b in
  set x (b' - a')
)

to update x with the difference of b and a and commit that transaction:

Tx.commit x_to_b_sub_a

One could similarly define a transaction y_to_a_add_b

let y_to_a_add_b = Tx.(
  let* a' = get a
  and* b' = get b in
  set y (a'   b')
)

to update y with the sum of a and b and commit that transaction:

Tx.commit y_to_a_add_b

The above committed transactions essentially correspond to MCAS operations as follows:

Tx.commit x_to_b_sub_a == [ CAS (a, 10, 10); CAS (b, 52, 52); CAS (x, 0, 42) ]
Tx.commit y_to_a_add_b == [ CAS (a, 10, 10); CAS (b, 52, 52); CAS (y, 0, 62) ]

CAS with equal expected or before and desired or after values essentially expresses an operation that does not change the logical content of the target location, but only "asserts" that it does not change during the operation.

Note that the transactions x_to_b_sub_a and y_to_a_add_b, unlike the MCAS operations they generate, are independent of the exact values of the locations being accessed. It is important to distinguish between them. A transaction is a specification for generating a list of CASes.

One might attempt to perform both of the two MCAS operations

[ CAS (a, 10, 10); CAS (b, 52, 52); CAS (x, 0, 42) ]

and

[ CAS (a, 10, 10); CAS (b, 52, 52); CAS (y, 0, 62) ]

in parallel, but that will not be allowed by the GKMZ algorithm. Every CAS actually updates the targeted memory locations. This means two things:

CAS operations targeting the same location can only execute sequentially.
CAS operations, even those that do not change the logical content of a location, cause contention as after the operation only the cache of the writer will have a valid copy of the shared memory location.

Could we extend upon GKMZ and allow read-only CMP operations to be expressed directly and also make it so that read-only CMP operations do not write to memory?

Let's first examine, in a bit more detail, how GKMZ, or our OCaml adaptation of it, operates. For this issue, in order to make it a bit easier to follow, I'll simplify the implementation a bit. We'll replace the splay tree representation with a simpler list, assume that locations are always given in some total order, and we also ignore the release of unused values after the operation.

Here are the core data structures used by the GKMZ algorithm:

 type 'a loc = 'a state Atomic.t
 and 'a state = { before : 'a; after : 'a; casn : casn }
 and cass = CASS : 'a loc * 'a state -> cass
 and casn = status Atomic.t
 and status = Undetermined of cass list | After | Before

Take a closer look at the internal cass type. Previously we explained transactions and talked about MCAS in terms a different CAS type whose definition could look like this:

 type cas =
   | CAS : 'a loc * 'a * 'a -> cas

The difference between the above logical cas operation and the internal cass descriptor is that the logical cas deals with plain values of type 'a while the internal cass descriptor uses a state type or the 'a state type. A location 'a loc is an atomic location that contains a 'a state. The core GKMZ algorithm attempts a MCAS by attempting to replace the states of all the locations in a list of casses and then setting status of the casn descriptor to either After or Before depending on whether the whole operation was a success or a failure, respectively.

When using GKMZ one first prepares a list of internal cass descriptors and a casn descriptor that has the Undetermined list of those descriptors. The data structure is cyclic: casn contains the list of cass descriptors which contain states which contain a reference to the casn descriptor. This cyclic form allows the whole data structure to be traversed starting from any state, which one might find in a location.

Here is a the core of the GKMZ algorithm in OCaml:

 let finish casn desired =
   match Atomic.get casn with
   | After -> true
   | Before -> false
   | Undetermined _ as current ->
     Atomic.compare_and_set casn current desired |> ignore;
     Atomic.get casn == After

 let rec gkmz casn = function
   | [] -> finish casn After (* seems like a success *)
   | (CASS (loc, desired) :: continue) as retry ->
     let current = Atomic.get loc in
     if desired == current then
       gkmz casn continue
     else
       let current_value =
         if is_after current.casn then
           current.after
         else
           current.before
       in
       if current_value != desired.before then
         finish casn Before (* seems like a failure *)
       else
         match Atomic.get casn with
         | Undetermined _ ->
           (* operation still unfinished *)
           if Atomic.compare_and_set loc current desired then
             gkmz casn continue
           else
             gkmz casn retry
         | After -> true (* operation was a success *)
         | Before -> false (* operation was a failure *)

 and is_after casn =
   match Atomic.get casn with
   | Undetermined cass -> gkmz casn cass
   | After -> true
   | Before -> false

 let get loc =
   let state = Atomic.get loc in
   if is_after state.casn then
     state.after
   else
     state.before

 let atomically logical_cas_list =
   let casn = Atomic.make After in
   let cass =
     logical_cas_list
     |> List.map @@ function
        | CAS (loc, before, after) -> CASS (loc, {before; after; casn})
   in
   Atomic.set casn (Undetermined cass);
   gkmz casn cass

Note that every call of atomically allocates a fresh location for casn descriptor and also fresh states for all the CASS descriptors. This is important as it makes sure that ABA problems are avoided. It is doubly important in the following extended algorithm.

Let's then simply extend the algorithm to allow CMP operations. First we extend the logical cas type with a new logical CMP operation:

 type cas =
   | CAS : 'a loc * 'a * 'a -> cas
   | CMP : 'a loc * 'a -> cas

It turns out that we don't need to change the internal data structures at all. The gist is that an internal read-only CASS (loc, state) descriptor refers to the state of a location before the operation started. We can distinguish such a state simply by comparing the casn of the state to the casn of the entire operation. Furthermore, because we know that the states and casn are always freshly allocated, we know that we can compare them simply by their identities without ABA problems.

Then we extend the algorithm to allow read-only CMP operations. The idea is simple: instead of attempting to store the states of read-only CMP operations to the locations, we simply check that those locations have their original states. Additionally, before we attempt to complete an operation as a success (by writing After to the casn), we verify that all of the read-only locations still have their original values.

 let is_cmp casn state =
   state.casn != casn
 
 let finish casn desired =
   match Atomic.get casn with
   | After -> true
   | Before -> false
-  | Undetermined _ as current ->
   | Undetermined cass as current ->
     let desired =
        if desired == After
           && cass
              |> List.exists @@ fun (CASS (loc, state)) ->
                 is_cmp casn state && Atomic.get loc != state then
           Before
        else
           desired in
     Atomic.compare_and_set casn current desired |> ignore;
     Atomic.get casn == After

 let rec gkmz casn = function
   | [] -> finish casn After (* seems like a success *)
   | (CASS (loc, desired) :: continue) as retry ->
     let current = Atomic.get loc in
     if desired == current then
       gkmz casn continue
     else if is_cmp casn desired then
       finish casn Before (* seems like a failure *)
     else
       let current_value =
         if is_after current.casn then
           current.after
         else
           current.before
       in
       if current_value != desired.before then
         finish casn Before (* seems like a failure *)
       else
         match Atomic.get casn with
         | Undetermined _ ->
           (* operation still unfinished *)
           if Atomic.compare_and_set loc current desired then
             gkmz casn continue
           else
             gkmz casn retry
         | After -> true (* operation was a success *)
         | Before -> false (* operation was a failure *)

 and is_after casn =
   match Atomic.get casn with
   | Undetermined cass -> gkmz casn cass
   | After -> true
   | Before -> false

 let get loc =
   let state = Atomic.get loc in
   if is_after state.casn then
     state.after
   else
     state.before

 let atomically logical_cas_list =
   let casn = Atomic.make After in
   let cass =
     logical_cas_list
     |> List.map @@ function
        | CAS (loc, before, after) -> CASS (loc, {before; after; casn})
        | CMP (loc, expected) ->
          let current = Atomic.get loc in
          if get loc != expected || Atomic.get loc != current then
            raise Exit
          else
            CASS (loc, current)
   in
   Atomic.set casn (Undetermined cass);
   gkmz casn cass
 
 let atomically logical_cas_list =
   try atomically logical_cas_list with Exit -> false

The above implementation is specifically designed to minimize the diffs compared to the original GKMZ algorithm. Minor optimizations are possible that are not shown above. Additionally, it makes sense to distinguish the case when the algorithm specifically fails during the verification step. We'll get back to this shortly.

We claim that the above algorithm is linearizable and obstruction-free.

It should be clear that if one thread runs the above algorithm in isolation, it will be able to finish in a finite number of steps. When not running in isolation, the verification steps (in finish) of two operations may indefinitely cause both to fail. To see this, consider the following operations operating in parallel:

[ CMP (a, 0); CAS (b, 0, 1) ] and [ CAS (a, 0, 1); CMP (b, 0) ]

Let's assume both operations manage to initially convert the CMP operations in atomically, check the read-only CASS once during gkmz, and perform their mutating CASS operations. At that point both enter the verification step and both of them will fail. The same could happen on a subsequent retry.

To prove linearizability we need to show that, for any set of operations performed in parallel, there is an order in which the operations could have been performed sequentially giving the same state at the end for all locations.

First note that the new algorithm operates exactly the same as the original GKMZ algorithm in case only CAS operations are performed. All the cases where operations write to overlapping locations are already proven to be linearizable by the basic GKMZ algorithm. Operations that are completely non-overlapping are trivially linearizable.

The interesting case to consider is when an operation R that only reads a location x (and might also write other locations) needs to be linearizable with an operation W that writes to said location x. We claim any observer of said operations will only be able to read an end state that is consistent with R happening before W. Let's assume the opposite, that an observer reads the results of W and R and can determine that the operation W happened before R. For that to be possible, the casn descriptors of both W and R must be set to the After state. The only way for that to be possible is that the R operation verified the x location before W wrote to it and so the result of R must be consistent with R happening before W. This contradicts the assumption and proves the original claim.

Previously we mentioned that it makes sense to distinguish the case when the verification step fails. Let's assume we have done so. Consider having a transaction mechanism using the new algorithm. Initially such a mechanism attempts to perform the transaction optimistically using the obstruction-free algorithm for k-CAS-n-CMP. If that repeatedly fails during the verification step, then the transaction mechanism can switch to using only k-CAS operations and try to complete the operation in lock-free manner. This way the transaction mechanism can guarantee lock-free behavior, which ensures that at least one thread will be able to make progress.

Recall the example transactions x_to_b_sub_a and y_to_a_add_b that we started with. Using the new k-CAS-n-CMP algorithm the transactions can generate the following operations:

Tx.commit x_to_b_sub_a == [ CMP (a, 10); CMP (b, 52); CAS (x, 0, 42) ]
Tx.commit y_to_a_add_b == [ CMP (a, 10); CMP (b, 52); CAS (y, 0, 62) ]

The new algorithm will then be able to run the two transaction in parallel.

polytypic/gkmz-with-read-only-cmp-ops.md

Extending k-CAS with efficient read-only CMP operations

polytypic commented Jul 31, 2023