Java Tip 92: Use the JVM Profiler Interface for accurate timing

Improve performance analysis by measuring Java thread CPU time

The textbook approach to simple performance analysis is to call System.currentTimeMillis() before and after the code to be measured. This is comparable to using a stopwatch when testing GUI activity, and works fine if elapsed time is really what you want. The downside is that this approach may include much more than your code’s execution time. Time used by other processes on the system or time spent waiting for I/O can result in inaccurately high timing numbers.
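
For example, a minimal sketch of the stopwatch approach looks like this (doWork() is just a stand-in for whatever code you want to measure):

// stopwatch-style timing: measures elapsed wall-clock time only
long start = System.currentTimeMillis();
doWork();                                          // the code to be measured
long elapsed = System.currentTimeMillis() - start;
System.out.println("Elapsed: " + elapsed + " ms");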

Trouble with threads

Programming languages like C and Pascal must use operating system calls to get the CPU time spent by a thread or process. Java applications can do this too, by using native, and therefore unportable, methods. Unfortunately, the results for an OS thread may not be directly related to the executing Java thread. This is because the way Java threads are mapped to OS threads is entirely up to the JVM — and different JVMs use different strategies.

Some JVMs use green threads, which run all the Java threads in one native OS thread (a so-called n-to-one mapping). The HotSpot JVM uses native threads, which may execute in parallel on a multi-CPU machine. The fact that several threads are executing in parallel means that the sum of the CPU times may exceed the elapsed real time. On Solaris, Java threads are not bound permanently to the same native threads but are remapped by the scheduler (in an n-to-m mapping). So, getting the CPU time for the current native thread does not give you the time you want. Contrast this with the Blackdown port of the JDK 1.2 to Linux, where a thread is akin to a process (one-to-one mapping).

Because the JVM so successfully hides the underlying machine and operating system, such native information is often useless to Java programmers. Fortunately, each JVM knows how it maps threads, even if this detail is hidden from application programmers. Java 2 introduced a new API, the Java Virtual Machine Profiler Interface (JVMPI), that provides access to the necessary timing information.

The Profiler Interface

The JVMPI is a C interface to the JVM through which an in-process native agent can access the state of the JVM and be notified of interesting events such as object allocations and method invocations. You can use the JVMPI with a frontend that provides a GUI, such as the commercial profilers OptimizeIt and JProbe, or the agent can simply dump the profiling information into a file, as HPROF does (see Resources for more information).

Profilers are very good at identifying hot spots that need to be optimized. They work either by sampling (periodically recording which methods are executing) or by instrumentation (tracing method invocations by modifying class code as classes are loaded). Sampling has little overhead but is very coarse-grained, whereas instrumentation has significant overhead and may get in the way of optimization, especially because Java programs typically have many small methods.

A potential problem is that the profiler itself uses CPU and memory resources. The memory use in particular may affect execution of the program with respect to hardware caching and OS swapping of virtual memory, blurring the timing information. This is not a serious problem for ordinary profiling, where you are interested in the big picture and in relative figures, but if you want to microbenchmark JVMs, components, code snippets, or algorithms, you must choose an approach with less overhead.

The JVMPI provides a function, GetCurrentThreadCpuTime(), which returns the CPU time in nanoseconds for the current Java thread, no matter which technique the JVM uses to map Java threads to OS threads. Despite the nanosecond resolution, the function is no more precise than the underlying operating system; on Windows NT it works in 10-millisecond increments. An extremely simple profiler agent that accesses this information consists of the following C++ code:

#include <jvmpi.h>

// global jvmpi interface pointer
static JVMPI_Interface *jvmpi_interface;

extern "C" {

// profiler agent entry point
JNIEXPORT jint JNICALL
JVM_OnLoad(JavaVM *jvm, char *options, void *reserved) {
  // get jvmpi interface pointer
  if ((jvm->GetEnv((void **)&jvmpi_interface, JVMPI_VERSION_1)) < 0) {
    return JNI_ERR;
  }
  return JNI_OK;
}

JNIEXPORT jlong JNICALL
Java_dk_capgemini_tc_JProf_getCurrentThreadCpuTime(JNIEnv *, jclass) {
  // return 0 if agent not initialized
  return jvmpi_interface == 0 ? 0 : jvmpi_interface->GetCurrentThreadCpuTime();
}

} // extern "C"

This function may be called from a Java program using JNI via the following class:

package dk.capgemini.tc;

public class JProf {
  public static native long getCurrentThreadCpuTime();

  static {
    System.loadLibrary("capjprof");
  }
}

The capjprof.zip file (Resources) accompanying this tip contains the source code and make files for Solaris and Windows NT. The resulting DLL or shared object must be placed in the library path so that the JVM can find it, and the JVM must be told to load the agent when starting the application:

java -Xruncapjprof application

The -Xrun option enables the profiler by instructing the JVM to call JVM_OnLoad() in the capjprof library. If you run the application without this option, JProf.getCurrentThreadCpuTime() will return 0.
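
As a quick sanity check (a sketch of my own, not part of the capjprof sources), you can warn when the agent does not seem to be loaded; keep in mind that a value of 0 could, in principle, also mean the thread has used less than one clock tick of CPU time:

// warn if the capjprof agent does not appear to be loaded
if (dk.capgemini.tc.JProf.getCurrentThreadCpuTime() == 0) {
  System.err.println("Warning: capjprof agent not loaded; CPU times will be reported as 0");
}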

Comparing collection indexing

You can try this out by comparing indexing into an array, a list returned by Arrays.asList(), an ArrayList, and a Vector. First you need a helper class to support the microbenchmark:

// Helper that records both elapsed (stopwatch) time and JVMPI CPU time
class Prof extends dk.capgemini.tc.JProf {
  String me;
  long time, cputime;

  Prof(String name) {
    me = name;
  }

  // snapshot the CPU time and wall-clock time
  void start() {
    cputime = getCurrentThreadCpuTime();
    time = System.currentTimeMillis();
  }

  // compute the deltas since start()
  void stop() {
    cputime = getCurrentThreadCpuTime() - cputime;
    time = System.currentTimeMillis() - time;
  }

  void print() {
    System.out.println(me + " time: " + time + " ms"
      + " cputime: " + cputime/1000000 + " ms");   // nanoseconds to milliseconds
  }
}

The methods to be measured index through the collection a number of times and assign the result to a variable so that the loop is not optimized away. Below is the Vector version; the other methods look the same except for the type of the collection parameter.

void index(int n, Vector c) {
  Object o;
  for (int i=0; i<n; i++) {
    for (int j=0; j<c.size(); j++) {
      o = c.get(j);
    }
  }
}
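
Putting the pieces together, one measurement might be wrapped like this (a sketch only; the full driver is in capjprof.zip, and vector is assumed to hold the test collection):

// measure one indexing run with both the stopwatch and the JVMPI clock
Prof p = new Prof("Vector");
p.start();
index(10000, vector);   // 10,000 passes over the collection
p.stop();
p.print();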

The main method creates the collections with the same contents and makes the comparisons (see capjprof.zip for the full source). The application takes a number of optional parameters controlling the number of elements in the collection (default is 1,000), the number of times to loop (default is 10,000), and whether the methods should be called in sequence or in parallel (default is in sequence).

java -Xruncapjprof performance.CollIndex [elements 1000] [times 10000] [threads]
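
In parallel mode, each measurement presumably runs in its own thread. A rough sketch of that idea, with names of my own choosing rather than those in capjprof.zip, could look like this:

// run one measurement in its own thread so the tests can overlap
void indexInThread(final int n, final Vector c) {
  new Thread(new Runnable() {
    public void run() {
      Prof p = new Prof("Vector (parallel)");
      p.start();
      index(n, c);   // the same measurement method as above
      p.stop();
      p.print();
    }
  }).start();
}

Because getCurrentThreadCpuTime() is called inside run(), each Prof instance reports the CPU time of its own thread, which helps explain why the CPU times in the parallel run below stay close to the sequential figures while the elapsed times grow.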

My single-processor Windows NT machine with JDK 1.2.2 and the JIT produces the following results, which show that you should avoid the unnecessary synchronization in Vector and that a simple array is much more efficient.

Indexing collection: 1,000 elements, 10,000 times, in sequence

Test        Stopwatch time (ms)   JVMPI time (ms)
Vector            3,074               3,034
ArrayList           872                 841
asList              951                 891
array                70                  70

With JDK 1.3 RC1, the figures indicate that synchronization has become much more efficient with HotSpot.

Indexing collection: 1,000 elements, 10,000 times, in sequence

Test        Stopwatch time (ms)   JVMPI time (ms)
Vector            1,262               1,211
ArrayList         1,202               1,201
asList              681                 640
array               170                 170

The reason ArrayList does not perform better than Vector here is that ArrayList.get() contains sloppy range-checking code; write your own version with a better range check if performance matters. Just for fun, run the measurements in parallel in separate threads and note the difference between elapsed time and CPU time.

Indexing collection: 1,000 elements, 10,000 times, in parallel

Test        Stopwatch time (ms)   JVMPI time (ms)
Vector            3,325               1,271
ArrayList         3,095               1,201
asList            1,953                 630
array               531                 160

Conclusion

In this tip, I demonstrated how to use the JVMPI to measure the CPU time of a Java thread and discussed why this approach is well suited to microbenchmarking.

However, there are still problems. An optimizing JVM like HotSpot may fool you with microbenchmarks because overly simple code may escape optimization or be totally eliminated.

Additionally, the garbage collector may interfere and use time in an entirely JVM-dependent manner. With JDK 1.2.2, garbage collection is done in the user thread, whereas with HotSpot a system thread does the work. Under HotSpot, incremental garbage collection cannot run when a profiler agent is active. You can influence the behavior of the garbage collector by calling System.gc() and by setting the initial and maximum heap sizes when starting the JVM.
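
A minimal sketch of those two suggestions, assuming the standard -Xms/-Xmx launcher options and an illustrative 64 MB heap:

// p is a Prof instance as in the examples above
System.gc();           // request a collection so GC is less likely to run mid-measurement
p.start();
// ... code under measurement ...
p.stop();

// start the JVM with equal initial and maximum heap to avoid resizing, e.g.:
//   java -Xms64m -Xmx64m -Xruncapjprof application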

Jesper Gørtz is a senior consultant at Cap Gemini Danmark.

Source: www.infoworld.com