I recently published a post over on the Logentries blog which outlined the phenomenon of false-sharing, and ways you can work around it on the JVM. I was surprised to see it was so widely read, so I felt I should provide a more rigorous benchmark, this time using JMH. Head over here if you want to check it out; suffice it to say the results are equally clear.

Somebody in the office who is far smarter than me subsequently asked the question:

Doesn’t volatile always reach out to main memory?

If that’s the case, it seems a bit irrelevant to discuss how to avoid on-die cache misses, right? I didn’t have a ready answer for this, and when confronted with seemingly obvious [but actually quite subtle] things like this, I dig out the Java Memory Model spec (link to PDF here; if anybody can find an HTML version, that’d be cool). One major reason for using volatiles in the example is straight correctness: chapter 12, “Non-atomic Treatment of double and long”, makes this clear:

Java™ virtual machines are free to perform writes to long and double values atomically or in two parts

So to guarantee that a reading thread sees an atomic [albeit potentially inconsistent] view of a data type bigger than 32 bits, it’s gotta be volatile. To be clear, when I say atomic, I mean the JMM guarantees that a volatile type will never get into a state where its value consists of 32 bits of data from one thread and 32 from another. It can still be overwritten by multiple other threads (the inconsistency bit), but as a programmer you can be sure it won’t be in some crappy half-way state. To be fair, if the working example used an int, a byte, etc., this would be a moot point.
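To see what the spec is permitting there, here’s a hypothetical sketch of my own; on a typical 64-bit HotSpot you’re unlikely to actually catch a torn value, but a 32-bit JVM is within its rights to produce one:

public class TearingDemo {

    static long value; // deliberately NOT volatile: writes may legally be split into two 32-bit halves

    public static void main(String[] args) {
        startWriter(0L);  // one thread writes all-zero bits...
        startWriter(-1L); // ...another writes all-one bits
        while (true) {
            long seen = value;
            if (seen != 0L && seen != -1L) {
                // 32 bits came from each writer: a torn read
                System.out.println("Torn: " + Long.toHexString(seen));
            }
        }
    }

    private static void startWriter(final long v) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                while (true) { value = v; }
            }
        }).start();
    }
}

Declaring value as volatile long removes the possibility of the torn read, though the two writers will still happily clobber each other.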

The next motivation is visibility: without the volatile modifier, there’s no guarantee that a value change made by one thread will ever become visible to another. With volatile in place, as long as only one thread is writing to a given variable, all the readers will see its value correctly. According to the gospel of chapter 3:

A write to a volatile field happens-before every subsequent read of that volatile.

The JMM refers to these coordination points between threads (like a volatile write paired with a subsequent read) as happens-before relationships. When people talk about memory consistency and instruction ordering in Java, you hear that term a lot 😉
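Since that’s the crux of the visibility story, here’s a minimal sketch of the single-writer pattern described above (the class and field names are mine):

public class VisibilityDemo {

    static int data;               // a plain field
    static volatile boolean ready; // the volatile flag doing the communicating

    public static void main(String[] args) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                while (!ready) { } // spin on the volatile read
                // This read of ready happens-after the volatile write below,
                // so the earlier plain write to data is visible here too.
                System.out.println(data); // guaranteed to print 42
            }
        }).start();

        data = 42;    // plain write...
        ready = true; // ...published by the volatile write
    }
}

Strip the volatile modifier off ready and the reader is allowed to spin forever, never seeing the update.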

Answer the question already!

Alright: does a volatile read always reach out to DRAM? This is where things get a bit murky [read: architecture-dependent], but for my own edification we can look at the assembly code HotSpot generates to see what’s going on. For starters, this means cracking out this little incantation:

java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*YourClass.andMethod YourMainClass

For that to work, however, you need Java 7 or later and a dynamic library called hsdis. The Mac .dylib can be found here; drop it into your $JAVA_HOME/jre/lib and you’re all set. There’s detailed documentation of all the CompileCommand variants, if you’re interested (although it’s omitted from the Oracle java manpage).

Anyway, let’s say we have a class like this, where for some bizarre reason we want to continuously read a variable and eventually print it, without ever updating it:

public class NonVolatileClass implements Runnable {

    // A plain 64-bit field; no volatile modifier (yet).
    public long a;

    @Override
    public void run() {
        long b;
        // Hit b = a a billion times: enough for HotSpot to JIT-compile run().
        for (int i = 0; i < 1_000_000_000; i++) {
            b = a;
        }
        System.out.println(a);
    }
}
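For completeness, a minimal driver of my own to kick it off (the class name Main is an assumption, not part of the original example):

public class Main {

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new NonVolatileClass());
        t.start();
        t.join(); // wait for the billion reads and the final println
    }
}

which you’d launch with something like:

java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*NonVolatileClass.run Main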

Before you ask: the loop isn’t just there to spice up this otherwise pointless example; it ensures we hit the b = a path often enough for HotSpot to compile it on the fly. Otherwise we don’t have any generated native code to observe. The pertinent assembly for the above looks like this:
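Or at least, something to this effect; hsdis output varies by JVM and run, so treat this as illustrative:

mov    r10,QWORD PTR [rsi+0x10]    ; b = a: a single plain 64-bit load of the field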

Ok, no surprises here; it’s a straight mov operation into a general-purpose register. Kind of anticlimactic, since on a 64-bit platform the whole read can be performed as one instruction. Don’t count on this, though (cough, ARM, Android). What happens if we make a volatile?
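Illustratively again, the read now takes a detour through an XMM register:

vmovsd xmm0,QWORD PTR [rsi+0x10]   ; volatile read of a into an XMM register
vmovq  r10,xmm0                    ; shuffle it across to a general-purpose register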

Right: in order to honour the JMM-specified behaviour, the JVM pulls a from memory into an XMM register using the VMOVSD instruction (the AVX encoding of MOVSD) before moving it to a general-purpose register. That’s not much worse than a regular read, and if the value sits in L1/L2/L3 cache and isn’t being written, it’s just as cheap. Things get a bit more funky if we start updating it, though:
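Once more illustratively, the store is now followed by a lock-prefixed instruction:

vmovq  xmm0,r10                    ; the value to be stored
vmovsd QWORD PTR [rsi+0x10],xmm0   ; volatile write of a
lock add DWORD PTR [rsp],0x0       ; lock-prefixed no-op acting as a full barrier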

The first bit should look familiar from the volatile read, but is that a lock at the end? Normally our fancy CPU is free to reorder memory operations as it sees fit, but to make our happens-before invariant hold true, the JVM needs to rule that out. The lock-prefixed instruction (typically a no-op add to the top of the stack) acts as a full barrier, draining the store buffer so that the next volatile read is guaranteed to see this view of the world. It also invalidates that cache line in every other core; did I ever mention false-sharing? If we had another unfortunate volatile variable in the same cache line, it’d be evicted even if it never got updated. Rough justice, eh?
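To make that concrete, here’s a hypothetical layout of my own using the classic manual-padding trick; two volatiles declared side by side will usually land on the same 64-byte cache line:

public class PaddedPair {

    public volatile long hot;  // hammered by a writer thread

    // Seven longs of padding aim to push cold onto a different 64-byte
    // cache line, so writes to hot stop evicting the line cold lives on.
    long p1, p2, p3, p4, p5, p6, p7;

    public volatile long cold; // only ever read, yet punished without the padding
}

Bear in mind the JVM is free to reorder fields, so manual padding is best-effort; Java 8’s @Contended annotation exists to do this properly.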

Stop showing me generated assembly

Alright, that was a bit intense; to sum up: on x86-64, a volatile read can avoid the penalty of a full trip out to main memory, provided it’s not being updated and the cache isn’t heavily contended. But you might also be surprised by how HotSpot implements the JMM’s atomicity guarantees!