Java threads may not use all your CPUs

Simple testing reveals a major flaw in Java 1.1.7 on Solaris 2.6

Enterprise-level applications often need more CPU power than a single-CPU machine can provide, but getting applications to use multiple CPUs effectively can be tricky. If you’re lucky, you may be able to run an application on separate single-CPU machines; Web servers scale this way. But many applications, such as databases, must run on a single machine and can be scaled only by adding CPUs to that machine.

In this article, we’ll examine tests that show that Java threads in Sun’s JRE 1.1.7 do not use more than two processors by default, regardless of how many are in the machine. Java is advertised as automatically scaling to use all available processors, but we’ll see below that this is not entirely true. We’ll also look at some simple ways to maximize Java multithreading performance on multi-CPU machines.

Note: My tests indicate that JRE 1.1.7 does not use more than two CPUs when running pure Java programs. Under other conditions, such as with native code that has native OS-level threads, JRE 1.1.7 may utilize all of the CPUs.

The scoop on symmetric multiprocessing

Getting multiple equivalent CPUs to cooperate in one machine is known as symmetric multiprocessing (SMP); running a single-CPU machine is known as uniprocessing. SMP machines’ CPU modules usually plug into a bus. In theory, you add capacity by plugging in more CPUs.

Unfortunately, because coordinating CPU activity is difficult and most applications spend a lot of time waiting for I/O rather than using the CPU, using two processors doesn’t necessarily double your throughput. Performance may even degrade with multiple CPUs. Indeed, for software not designed for an SMP system, adding a CPU may increase performance by no more than 30 percent.

For CPU-intensive programs designed for SMP, a second CPU may increase performance by 80 or 90 percent, depending on what the application is doing. CPU-intensive programs that can run as parallel processes or threads benefit more from additional CPU power than programs that spend most of their CPU time idle, waiting for a network or disk.

SMP options

One way to use multiple CPUs on a single machine is to run multiple processes that communicate via the various forms of interprocess communication, such as semaphores and shared memory. The operating system will automatically allocate different CPUs to different processes. Interprocess communication techniques are not especially portable, though. One alternative is native thread programming in C or C++: it is difficult, but it lets you use multiple CPUs within a single process with better portability.

Java thread programming is simple by comparison and provides high portability. Java threads have also been advertised as scaling well on multi-CPU machines. You can run Java threads in parallel on multiple CPUs, but only if you use a native-threads library for Java instead of the green threads option. (The green thread implementation does all thread scheduling within the JVM. On Unix, that means all threads will run in a single process and never use multiple CPUs.)

Native-threads packages schedule threads in the platform’s own threading library, and sometimes in the kernel itself. An environment variable or command line switch is normally used to activate native threads. Even when using native threads, Java does not necessarily use all available CPUs; it depends on how the particular JVM was written.
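For example, several 1.1-era JDKs, including Blackdown’s, select the thread model with a THREADS_FLAG environment variable; the exact mechanism is vendor-specific, so check your JDK’s documentation. Under csh, the selection might look like this (MyApp is a placeholder for your own program):

% setenv THREADS_FLAG native
% java MyApp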

SMP support: Unix vs. Windows NT

Unix is far superior to Windows NT in the SMP arena. Windows NT currently handles a maximum of four CPUs effectively; even if a machine has slots for more than four, performance is likely to decrease beyond that number because Windows NT does not partition work efficiently among more than four CPUs. Beware of demonstrations where the vendor sets up several machines side by side and claims good scalability: it is extremely difficult to partition most enterprise applications among several independent machines.

Solaris currently scales up to 64 CPUs, theoretically benefiting from each one. (I was able to test up to 12 CPUs, and found this to be true.) Solaris’s CPU scalability lets you start with a low-end Solaris machine and add more CPUs as needed without reworking your applications or architecture. Of course, your application must be designed to use SMP.
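You can see how many processors a Solaris machine has, and which of them are online, with the psrinfo command (the companion to the psradm command used later in this article):

% psrinfo

Each line of psrinfo’s output lists one processor ID along with its on-line or off-line status.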

Give the CPU a workout

I ran some tests in an attempt to prove that adding CPUs to a large Sun machine would increase performance. In a research note about SMP scalability on Linux (see Resources), Cameron MacKinnon explained how he tested Linux SMP scalability by running multiple processes that each did nothing but count to 1 billion. This should only test the CPU — it’s not accessing the disk or memory, and maybe not even the CPU cache. In that spirit, here is an example C program that should give your CPU a workout:

int main(void) { unsigned long i; for (i = 0; i < 1000000000; i++); return 0; }

I compiled this program with maximum optimization (gcc -O3 -o loop loop.c) and ran it on a 500-MHz Dell Optiplex GX1 PCI-bus PC with a 512-KB cache running Linux 2.2. It takes about four seconds to run:

% time loop
4.02user 0.00system 0:04.02elapsed 99%CPU

Note: Some output removed for clarity.

Our 500-MHz CPU ticks off 1 billion clock cycles in two seconds. Since our program runs in four seconds, we’re using about two clock cycles per increment of the variable i.

What if we try to simultaneously run two copies of this program on a single-CPU Linux PC? They won’t run at exactly the same time because only one can use the CPU at any given moment. The kernel will schedule the processes, making them seem to run at the same time, but it will take twice as long. The elapsed time is directly proportional to the number of processes we’re running. I ran as many as 24 processes with the following Perl script while the machine was otherwise idle:

#!/usr/local/bin/perl
$| = 1;      # Do not buffer output.
$\ = "\n";   # Add newline to print statements.
for ($procs = 1; $procs <= 24; $procs++) {
    for ($i = 1; $i <= $procs; $i++) {
        if ($pid = fork) {}
        elsif (defined $pid) {
            exec 'loop';
        }
        else {
die "cannot fork: $!\n";
        }
    }
    $start = time();
    for ($i = 1; $i <= $procs; $i++) { wait; }
    $end = time();
    $latency = $end - $start;
    print "$procs $latency";
}

The results can be seen in Table 1.

Table 1. Scalability of single-CPU machine

Processes    Time (seconds)
    1             4
    2             9
    3            12
    4            16
    5            20
    6            24
    7            28
    8            32
    9            37
   10            38
   11            44
   12            48
   13            53
   14            56
   15            61
   16            64
   17            69
   18            71
   19            76
   20            80
   21            83
   22            89
   23            89
   24            93

Figure 1, plotted with Gnuplot, illustrates the results.
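Reproducing such a plot is a one-liner in Gnuplot, assuming the script’s output was saved to a (hypothetical) file named results.dat, with the process count in column one and seconds in column two:

gnuplot> plot "results.dat" using 1:2 with linespoints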

Figure 1. Scalability of single-CPU PC hardware running Linux

Compilers do the funniest things

This test demonstrates that as you run more simultaneous processes, it takes proportionally longer for them to complete, which is normal for CPU-bound processes on a single-CPU machine. I have done similar experiments on Sun hardware, turning off all but one CPU, and saw the same linear scaling, but from a different starting point. That starting point for a single copy of the loop.c program is about 6.7 seconds on a Sun E450 with one 250-MHz CPU.

At this point, a common reaction is, “Now wait a minute, the Sun hardware is more expensive than the PC hardware, but runs this test more slowly?” Yes, in this specific case, the test runs slower on Sun hardware, but you can’t draw too many conclusions from that. The Sun machine is designed to scale up to many processors; the PC hardware is not. This adds some overhead. The Sparc CPU runs at half the clock rate of the Intel CPU and uses a different instruction set; the Sparc CPU is RISC, while the Intel CPU is CISC.

To see how arbitrary some simple test results can be, simply count backwards in loop.c, going from 1 billion down to zero rather than from zero up to 1 billion (that is, change the loop to for (i = 1000000000; i > 0; i--);). You will find that the Sun machine takes 3.35 seconds, exactly half the time it did before; the PC counts down in exactly the same time, 4.02 seconds. Now the Sun machine looks faster. What’s going on here?

When counting down, the gcc compiler generates fewer instructions on the Sun machine than it does when counting up, but generates the same number of instructions on the Intel platform. Remember, the Sparc and Intel chips use different instruction sets. The gcc compiler is able to find a slightly more efficient way to count down on the Sparc CPU. (You can dump and examine the assembly language generated by gcc by using its -S option.) Compiler optimizations can be especially confusing.

Changing the lower limit of the counting can also affect the results. For example, if you count from 1 billion down to 4,095 on the Sun machine, it is about as fast as counting down to zero. But if you count from 1 billion down to 4,096 or more, the running time doubles. When the lower limit is 4,095 or less, the constant fits in the 13-bit signed immediate field of a Sparc instruction, so the compiler generates a single opcode for the comparison; when the lower limit is 4,096 or more, the constant no longer fits, and the compiler must generate multiple opcodes for the same comparison. A really clever compiler would just set our loop counter to its end value and skip the loop entirely, because the compiler can see that the loop has no contents; then the loop would seem to run instantly and the whole test would be meaningless. (Declaring i volatile is one way to prevent such an optimization.)

Give a set of CPUs a workout

I mentioned that Sun hardware is designed to scale up to 64 CPUs: does it really use all those CPUs effectively? This is an important question — CPUs are expensive. Let’s start with processes. On a 12-CPU Sun machine, the total running time is not affected until we exceed 12 loop processes. Then, the total running time increases with each additional process because we no longer have enough CPUs to run all the processes in parallel.

Every additional process adds about one-twelfth of one process’s running time. This increase shows that a CPU is shared equally among all processes — if not, the running time would increase by more than one-twelfth, because at least one CPU would be doing more than its fair share of work. The results can be seen in Figure 2. (The total running times are much higher than before because I forgot to compile with the -O3 optimization switch to gcc.)

Figure 2. Scalability of a 12 CPU machine running Solaris

If we turn off some of the CPUs and rerun the test, the rise in the graph begins sooner. One by one, I turned off 11 of the 12 CPUs and reran the whole test using between one and 24 processes. The result was a 12-by-24 grid of points. Here is the Perl wrapper that ran the test:

#!/usr/local/bin/perl
$| = 1;
$\ = "\n";   # Add newline to print statements.
if ($<) {
    # Then we're not running as root.
    die "Need to run as root to execute psradm. Terminating";
}
# CPU numbers are not necessarily sequential.
@cpu_array = (0, 1, 3, 5, 8, 9, 12, 14, 16, 19, 20);
for ($cpu = 0; $cpu < 11; $cpu++) {
    for ($procs = 1; $procs <= 24; $procs++) {
        for ($i = 1; $i <= $procs; $i++) {
            if ($pid = fork) {}
            elsif (defined $pid) {
                exec 'loop';
            }
            else { die "cannot fork: $!\n"; }
        }
        $start = time();
        for ($i = 1; $i <= $procs; $i++) { wait; }
        $end = time();
        $latency = $end - $start;
        $cpus_running = 12 - $cpu;
        print "$cpus_running $procs $latency";
    }
    print "";   # Print a blank line between CPU counts.
    `psradm -f $cpu_array[$cpu]`;
}
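Note that psradm -f takes a processor offline and leaves it that way; once the test is done, you can bring the processors back online with psradm -n, passing the same IDs (here, psradm -n 0 1 3 5 8 9 12 14 16 19 20).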

You’ll find the results in Figure 3.

Figure 3. Sun 12 CPU scalability

On the left side of Figure 3, where we have fewer processes than CPUs, the surface is flat, meaning the execution time remains constant. But when processes outnumber CPUs, the execution time increases. This curve shows that for totally CPU-intensive processes (not using memory, disk, or the network), Sun hardware gives you 100 percent bang for every buck you spend on CPUs, at least up to 12 CPUs.

Exercise the CPU with Java

What about CPU-intensive Java threads? Do they utilize Sun CPUs as effectively as processes? Let’s rewrite the loop.c program in Java:

class Loop implements Runnable {
    public static void main(String[] args) {
        // Start as many counting threads as requested on the command line.
        for (int t = 0; t < Integer.parseInt(args[0]); t++)
            new Thread(new Loop()).start();
    }
    public void run() {
        // Count to 1 billion, just as loop.c does.
        for (int i = 0; i < 1000000000; i++);
    }
}

If we compile and run one thread under Sun’s Java 1.1.7 on a four-CPU Sun machine, the execution time is 13 seconds:

% javac Loop.java
% time java Loop 1

If we compile and run one thread under the Blackdown Java 1.1.6v5 on the PC running Linux, the execution time is 76 seconds. It’s much slower because Linux’s Blackdown JDK 1.1 doesn’t come with a just-in-time (JIT) compiler, but the Sun JDK does. A simple repetitive loop like this is prime JIT-compiler material. Be aware that there are open source JIT compilers for the Blackdown JDK, but I didn’t try them.

A strange result

Back on the Sun box, I ran one to 24 Java native threads on the 12-CPU machine to see what the response curve would look like. I expected to see a graph just like in Figure 2: flat until 12 CPUs, then rising linearly. What I got was very different, as illustrated in Figure 4.

Figure 4. Scalability of Java 1.1.7 threads

What happened? First, it seems that at least two CPUs are being used, because the curve is flat at one and two threads. Second, it doesn’t look like more than two CPUs are ever used because the running time keeps increasing, albeit in a stair-step manner. The running time doubles at three threads, but is unchanged at four threads — very odd. To get to the bottom of the issue, we can find out exactly which CPUs are busy by using the Solaris mpstat tool:

% jre Loop 3 & mpstat 1

You’ll find the output on my Website. At three threads, mpstat shows that during any given second, either one or two CPUs are busy. So it looks like JRE 1.1.7 does not effectively distribute its work across two CPUs. The third thread is assigned to only one of the two CPUs in use. Nor does it ever use more than two CPUs, which is the bigger problem.

As we did previously, let’s turn off CPUs one at a time and run from one to 24 threads for each number of CPUs. Figure 5 illustrates the results.

Figure 5. Scalability of Java 1.1.7 threads

This Java-based test shows that two CPUs are much better than one, but beyond two, additional CPUs don’t do anything for us. (At one point, someone else logged into this machine and ran a test, causing the dip in the graph.) So it seems that Sun’s 1.1.7 JRE does not scale well on Sun’s own SMP hardware.

How to fix it

An email conversation with a Sun employee confirmed that the Java 1.1 mapping of native threads to LWPs (lightweight processes, also known as kernel threads) was probably not creating more than two LWPs. LWPs are the entities that actually run on a CPU, so no more than two CPUs could be used. All of our user-level threads were mapped to an equal number of native threads in the native-thread library, which were then mapped to only these two LWPs.

Sun pointed out three ways to fix this scalability problem:

  • Stay on Java 1.1.7 and write a native method to call the Solaris thr_setconcurrency API, which can tell the Solaris thread library to create more LWPs (see the sketch after this list). Unfortunately, this solution destroys your program’s portability. You can get more information on this API by typing man thr_setconcurrency at a Solaris shell prompt.
  • Upgrade to Solaris 8 and, by setting LD_LIBRARY_PATH appropriately, use its single-level thread library, which maps each user-level thread directly to an LWP.
  • Upgrade to Java 1.2.2, which automatically uses the thr_setconcurrency API to match the number of LWPs to the number of processors in the machine. Running Java 1.2.2 requires patching Solaris 2.6.
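Here is a minimal sketch of the first option. The class and library names are hypothetical, and the C half of the JNI wrapper, which would do nothing but pass its argument through to thr_setconcurrency, is described only in comments:

class Concurrency {
    static {
        // Hypothetical native library; its C implementation would
        // simply call thr_setconcurrency(level).
        System.loadLibrary("conc");
    }
    // Call this early in main() with the number of LWPs (roughly,
    // CPUs) you want the Solaris thread library to create.
    public static native void setConcurrency(int level);
}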

I chose the third way. I reran the previous test on Java 1.2.2 on a patched Solaris 2.6 machine and got the results illustrated in Figure 6.

Figure 6. Scalability of Java 1.2.2 threads

Now that’s more like it! We see our familiar flat surface on the left, with no jiggles. It looks like all our CPUs are employed to run Java threads, and each additional CPU is fully utilized. So Java 1.2.2 would fix the scalability problem we saw with Java 1.1.7. Unfortunately, I work in a busy production environment where it’s difficult to make major changes like patching the operating system or changing JREs, so we couldn’t do that immediately. However, we could run more Java 1.1.7 processes and load balance across them, which makes better use of the CPUs, even if each process uses only two CPUs.

Improve throughput with multiple JREs

I also tested to see if running more 1.1.7 JREs and load balancing actually provides better throughput than running just one 1.1.7 JRE. To do this, I started with the Weather servlet from Jason Hunter’s book, Java Servlet Programming (see Resources), and made it even more CPU-intensive.

This was a more realistic test of Website performance than our simple counting program. The Weather servlet does extensive string manipulation and generates a lot of HTML. I modified it to loop through the string manipulation 100 times, then throw away the HTML, returning only a few bytes to report the servlet’s success. The idea was to test CPU performance, not network capacity. I also specifically avoided disk and database access so I wouldn’t test them by mistake.
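The modified servlet isn’t reproduced here, but a minimal sketch of the same shape, with a hypothetical class name and illustrative loop counts (this is not Hunter’s actual Weather servlet), might look like this:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CpuBoundServlet extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws IOException {
        // Do the string manipulation 100 times, as described above.
        for (int pass = 0; pass < 100; pass++) {
            StringBuffer html = new StringBuffer();
            for (int row = 0; row < 500; row++) {
                html.append("<tr><td>row ").append(row).append("</td></tr>");
            }
            // Throw the generated HTML away; only the CPU work matters.
        }
        // Return just a few bytes so the test measures CPU, not network.
        res.setContentType("text/plain");
        res.getWriter().println("OK");
    }
}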

I ran the servlet on BEA WebLogic 4.5.1 on Java 1.1.7 on Solaris 2.6. I set the number of WebLogic execute threads to 15. Because this servlet would access main memory and not just CPU cache, I did not expect perfect CPU scalability. And I didn’t get it, but I was still pleased with the results. (See Figure 7.)
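If memory serves, the execute thread count in WebLogic 4.5 is set with a line like the following in the weblogic.properties file (check the WebLogic documentation for your version):

weblogic.system.executeThreadCount=15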

Figure 7. Scalability of the Weather servlet

Figure 7 is a throughput graph, not a time-to-completion graph, so higher numbers are better. The peak is at 12 CPUs and 12 JREs. With just one JRE, applying more than two CPUs does not increase throughput. This is expected if each 1.1.7 JRE can use only two CPUs, but increasing the number of JREs increases the benefit of adding CPUs. On the other hand, time spent waiting on I/O, such as disk or network access, increases the number of JREs you can usefully run, because a blocked JRE uses no CPU time, leaving it free for the others.

To be even more realistic, I ran a similar load test against a page generated by extensive database queries from a servlet we actually use in production. The result is illustrated in Figure 8. Since we introduced a new bottleneck, the database, the benefit of adding JREs and CPUs isn’t as significant. After adding the second JRE or fourth CPU, performance levels off at a limit imposed by the database.

Figure 8. Scalability of page generated by database access

Conclusion

As we’ve seen in the tests above, Sun’s production JRE 1.1.7 does not use more than two CPUs on Solaris 2.6 on Sun hardware. This can be fixed by adding a native method, upgrading the operating system, or upgrading to Sun’s JRE 1.2.2. If you can’t do any of those things, you can simply run more JREs and balance the load across them. Sun’s hardware does scale well, but a little work and testing may be needed to get the most out of it.

My initial goal was to simply confirm the common belief that native Java threads scale well on Sun SMP hardware; I was surprised to find it was not true for Java 1.1.7. These experiments say nothing about other JREs or platforms — if anyone were to run these tests on other SMP platforms, I would be very interested in the results.

Source: www.infoworld.com