Points of @Contention, Redux
I recently published a post over on the Logentries blog which outlined the phenomenon of false sharing, and ways you can work around it on the JVM. I was surprised to see it was so widely read, so I felt I should provide a more rigorous benchmark- this time using JMH. Head over here if you want to check it out; suffice it to say the results are equally clear.
Somebody in the office who is far smarter than me subsequently asked the question:
Doesn’t volatile always reach out to main memory?
If that’s the case, it seems a bit irrelevant to discuss how to avoid on-die cache misses, right? I didn’t have a ready answer for this, so, as with anything seemingly obvious [but actually quite subtle], I dug out the Java Memory Model spec (link to PDF here. If anybody can find an HTML version, that’d be cool). One major reason for using volatiles in the example is just straight correctness: chapter 12, “Non-atomic Treatment of double and long”, makes this clear:
Java™ virtual machines are free to perform writes to long and double values atomically or in two parts
So to guarantee that a reading thread sees an atomic [albeit potentially inconsistent] view of a data type bigger than 32 bits, it’s gotta be volatile. To be clear, when I say atomic, I mean the JMM guarantees that a volatile type will never get into a state where its value consists of 32 bits of data from one thread and 32 from another. It can still be overwritten by multiple other threads (the inconsistency bit), but as a programmer you can be sure it won’t be left in some crappy half-way state. To be fair, if the working example had used an int, byte, etc., this would be a moot point.
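If you want to see what that permission means in practice, here’s a sketch (class and variable names are mine, and don’t expect to actually catch a torn value on 64-bit HotSpot- the point is that the spec allows it):

public class TornRead {
    static long value; // deliberately NOT volatile, and deliberately 64-bit

    public static void main(String[] args) {
        Thread writer = new Thread(new Runnable() {
            @Override
            public void run() {
                while (true) {
                    value = 0L;  // all zero bits
                    value = -1L; // all one bits
                }
            }
        });
        writer.setDaemon(true);
        writer.start();

        while (true) {
            long snapshot = value;
            // The JMM permits a mix of 32 bits from each write here; only
            // volatile forbids it. (In practice the JIT may also hoist this
            // read entirely- that's the visibility problem, up next.)
            if (snapshot != 0L && snapshot != -1L) {
                System.out.println("Torn read: " + Long.toHexString(snapshot));
            }
        }
    }
}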
The next motivation is visibility- without the volatile modifier, there’s no guarantee that a value change made by one thread ever becomes visible to another. With it, as long as only one thread is writing to a given variable, all the readers will see its value correctly. According to the gospel of chapter 3:
A write to a volatile field happens-before every subsequent read of that volatile.
The JMM refers to the coordination points between threads (like a volatile write) as happens-before relationships. When people talk about memory consistency and instruction ordering in Java, you hear that term a lot 😉
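The canonical illustration- again just a sketch, names mine- is publishing a plain value via a volatile flag:

public class VisibilityDemo {
    static volatile boolean ready;
    static int payload; // a plain field, published via the volatile write below

    public static void main(String[] args) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                while (!ready) { } // spin until the volatile write is visible
                System.out.println(payload); // guaranteed to print 42
            }
        }).start();

        payload = 42; // ordinary write...
        ready = true; // ...made visible by the volatile write after it
    }
}

The write to ready happens-before the reader’s load of it, and program order takes care of the rest- so once the reader sees true, it must also see 42.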
Answer the question already!
Alright, does a volatile read always reach out to DRAM? This is where things get a bit murky [read: architecture dependent], but for our own edification we can dig into the assembly code that HotSpot generates to see what’s going on. For starters, this means cracking out this little incantation:
java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*YourClass.andMethod YourMainClass
For that to work, however, you need Java 7 or later and a dynamic library called hsdis. The Mac .dylib can be found here; drop it into your $JAVA_HOME/jre/lib and you’re all set. There’s detailed documentation of all the CompileCommand variants, if you’re interested (although it’s omitted from the Oracle java manpage).
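You’ll also need something to actually run. A minimal driver for the NonVolatileClass we’re about to meet- hypothetical, but something along these lines- does the trick:

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // Run the loop on another thread so there's a genuine cross-thread
        // story, and so HotSpot compiles NonVolatileClass.run for us to dump.
        Thread t = new Thread(new NonVolatileClass());
        t.start();
        t.join();
    }
}

The full incantation then becomes java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*NonVolatileClass.run Main.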
Anyway, let’s say we have a class like this, where for some bizarre reason we want to continuously read a variable and eventually print it, without ever updating it:
public class NonVolatileClass implements Runnable {
    public long a;

    @Override
    public void run() {
        long b;
        for (int i = 0; i < 1E9; i++) {
            b = a;
        }
        System.out.println(a);
    }
}
Before you ask: the loop isn’t just there to spice up this otherwise pointless example- it ensures we hit the b = a path often enough for HotSpot to compile it on the fly. Otherwise we don’t have any generated native code to observe. The pertinent assembly for the above looks like this:
0x0000000104287223: jmpq 0x0000000104287276 ;*iload_3
; - com.logentries.blog.NonVolatileClass::run@2 (line 13)
0x0000000104287228: mov 0x10(%rsi),%rdx ;*getfield a
; - com.logentries.blog.NonVolatileClass::run@12 (line 14)
; implicit exception: dispatches to 0x00000001042873fe
0x000000010428722c: inc %edi
Ok- no surprises here; it’s a straight mov operation into a general-purpose register. Kind of anticlimactic, since on a 64-bit platform the whole read can be done in one instruction. Don’t count on this, though (cough, ARM, Android). What happens if we make the field a volatile?
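The volatile flavour is just the same class with one modifier added- something like this, at any rate:

public class VolatileClass implements Runnable {
    public volatile long a; // the only change from NonVolatileClass

    @Override
    public void run() {
        long b;
        for (int i = 0; i < 1E9; i++) {
            b = a; // now a volatile read
        }
        System.out.println(a);
    }
}

The read loop now compiles down to: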
0x000000010d6895e3: jmpq 0x000000010d68963c ;*iload_3
; - com.logentries.blog.VolatileClass::run@2 (line 13)
0x000000010d6895e8: vmovsd 0x10(%rsi),%xmm0 ; implicit exception: dispatches to 0x000000010d6897ce
0x000000010d6895ed: vmovq %xmm0,%rdx ;*getfield a
; - com.logentries.blog.VolatileClass::run@12 (line 14)
0x000000010d6895f2: inc %edi
Right- in order to honour the JMM-specified behaviour, the JVM pulls a from memory into an XMM register using VMOVSD (the AVX encoding of MOVSD) before moving it to a general-purpose register. That’s not too much worse than a regular read, and if the value sits in L1/2/3 cache and isn’t being written, it’s just as cheap. Things get a bit more funky if we start updating it, though.
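Going by the add-one-and-store in the listing below, the loop body presumably becomes something like:

for (int i = 0; i < 1E9; i++) {
    a = a + 1; // volatile read, plain add, volatile write- not atomic as a whole
}

which HotSpot turns into: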
0x000000010be89ca3: jmpq 0x000000010be89d1e ;*iload_1
; - com.logentries.blog.VolatileClass::run@2 (line 12)
0x000000010be89ca8: vmovsd 0x10(%rsi),%xmm0 ; implicit exception: dispatches to 0x000000010be89eae
0x000000010be89cad: vmovq %xmm0,%rdx ;*getfield a
; - com.logentries.blog.VolatileClass::run@13 (line 13)
0x000000010be89cb2: movabs $0x1,%r10
0x000000010be89cbc: add %r10,%rdx
0x000000010be89cbf: mov %rdx,0x40(%rsp)
0x000000010be89cc4: vmovsd 0x40(%rsp),%xmm0
0x000000010be89cca: vmovsd %xmm0,0x10(%rsi)
0x000000010be89ccf: lock addl $0x0,(%rsp) ;*putfield a
; - com.logentries.blog.VolatileClass::run@18 (line 13)
0x000000010be89cd4: inc %edi
The first bit should look familiar from the volatile read, but is that a lock at the end? Normally our fancy CPU is free to reorder memory operations as it sees fit, but to make our happens-before invariant hold true, the JVM needs to rule that out. The LOCKed instruction emitted right after the store- a dummy addl of zero to the top of the stack- acts as a full memory barrier, draining the store buffer so that the next read, on any core, sees this view of the world. This also has the effect of invalidating the affected cache line everywhere else- did I ever mention false-sharing? If we had another unfortunate volatile variable in the same cache-line, it’d be evicted even though it never got updated. Rough justice, eh?
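Concretely, a layout like this sketch (names mine) would suffer in exactly that way:

public class Neighbours {
    public volatile long hot;      // hammered by a writer thread
    public volatile long innocent; // only ever read, but it typically shares
                                   // the 64-byte cache line, so every locked
                                   // write to hot evicts it from other cores
}

The workarounds from the earlier post- padding, or @sun.misc.Contended on Java 8- exist precisely to keep fields like these on separate cache lines.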
Stop showing me generated assembly
Alright, that was a bit intense; to sum up: on x86-64, a volatile read can avoid the penalty of a full trip out to main memory, assuming the variable isn’t being updated and the cache line isn’t heavily contended. But you might also be surprised by how HotSpot implements the JMM’s atomicity guarantees!