Java Performance Tuning

by Anuj Verma · Published 17/01/2015 · Updated 01/01/2018

Performance tuning is the improvement of system performance. In computer systems, the motivation for tuning is usually a performance problem, which can be either real or anticipated. Most systems respond to increased load with some degree of decreasing performance. A system's ability to accept higher load is called scalability, and modifying a system to handle a higher load is synonymous with performance tuning.

Systematic tuning follows these steps:

Assess the problem and establish numeric values that categorize acceptable behavior.
Measure the performance of the system before modification.
Identify the part of the system that is critical for improving the performance. This is called the bottleneck.
Modify that part of the system to remove the bottleneck.
Measure the performance of the system after modification.
If the modification makes the performance better, adopt it. If the modification makes the performance worse, put it back the way it was.
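The measure, modify, measure, decide loop above can be sketched in a few lines of Java. This is a minimal illustration rather than a benchmarking harness: the class `TuningLoop` and its helpers `measureNanos`, `shouldAdopt`, and `sumByIndex` are hypothetical names invented for this example, and the "bottleneck" (indexed access on a `LinkedList`) is just an assumed stand-in for a real one.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class TuningLoop {
    // Hypothetical helper: best-of-N wall-clock time of a task, in nanoseconds.
    static long measureNanos(Runnable task, int runs) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    // Decision step: adopt the change only if it measured faster.
    static boolean shouldAdopt(long baselineNanos, long candidateNanos) {
        return candidateNanos < baselineNanos;
    }

    // Workload under test: indexed access, which is slow on a LinkedList.
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> before = new LinkedList<>();
        List<Integer> after = new ArrayList<>();
        for (int i = 0; i < 5_000; i++) {
            before.add(i);
            after.add(i);
        }
        long baseline = measureNanos(() -> sumByIndex(before), 5);   // measure before
        long candidate = measureNanos(() -> sumByIndex(after), 5);   // measure after
        System.out.println("baseline ns:  " + baseline);
        System.out.println("candidate ns: " + candidate);
        System.out.println("adopt change: " + shouldAdopt(baseline, candidate));
    }
}
```

For serious measurements a dedicated harness is usually a better choice, because naive timing like this is easily distorted by JIT warm-up and garbage collection.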


Virtual machine optimization methods

Many optimizations have improved the performance of the JVM over time. Although the JVM was often the first virtual machine to implement them successfully, they have since been used in other similar platforms as well.

Just-in-time compiling

Early JVMs always interpreted Java bytecodes. This carried a large performance penalty, a factor of 10 to 20 for Java versus C in average applications. To combat this, a just-in-time (JIT) compiler was introduced in Java 1.1. Because compiling is itself expensive, an additional system called HotSpot was introduced in Java 1.2 and made the default in Java 1.3. Using this framework, the Java virtual machine continually analyses program performance for hot spots, which are pieces of code executed frequently or repeatedly. These are then targeted for optimization, giving high-performance execution with a minimum of overhead for less performance-critical code. Some benchmarks show a 10-fold speed gain by this means. However, due to time constraints, the compiler cannot fully optimize the program, so the resulting program can be slower than native code alternatives.
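A rough way to see hot-spot compilation at work is to time the same method in repeated batches. The sketch below is illustrative only: `WarmUpDemo` and `sumOfSquares` are made-up names, and the printed timings depend entirely on the JVM, hardware, and flags in use. On a typical HotSpot JVM the first batch tends to be the slowest, but that is an expectation, not a guarantee.

```java
final class WarmUpDemo {
    // Small arithmetic loop that becomes "hot" when called many times.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    public static void main(String[] args) {
        long sink = 0; // accumulated so the work cannot be discarded as dead code
        for (int batch = 1; batch <= 5; batch++) {
            long start = System.nanoTime();
            for (int call = 0; call < 20_000; call++) {
                sink += sumOfSquares(1_000);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("batch " + batch + ": " + micros + " us");
        }
        System.out.println("checksum: " + sink);
    }
}
```

Running the same class with the HotSpot option `-Xint` (interpreter only) gives a baseline without JIT compilation, and `-XX:+PrintCompilation` logs which methods the compiler picks up.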

Adaptive optimizing

Adaptive optimizing is a method in computer science that performs dynamic recompilation of parts of a program based on the current execution profile. With a simple implementation, an adaptive optimizer may simply make a trade-off between just-in-time compiling and interpreting instructions. At another level, adaptive optimizing may exploit local data conditions to optimize away branches and use inline expansion.

A Java virtual machine like HotSpot can also deoptimize code formerly JITed. This allows performing aggressive (and potentially unsafe) optimizations, while still being able to later deoptimize the code and fall back to a safe path.
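One situation where deoptimization may come into play is a call site that has only ever seen a single implementation of an interface. The following sketch sets up that situation; all names (`DeoptSketch`, `Shape`, `Square`, `Circle`, `totalArea`) are hypothetical, and whether a given JVM actually speculates and later deoptimizes here is an assumption that depends on the JVM version and its settings. The program's result is the same either way, which is the point: the fallback is invisible to the program.

```java
import java.util.Arrays;

final class DeoptSketch {
    interface Shape {
        double area();
    }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Circle implements Shape {
        final double radius;
        Circle(double radius) { this.radius = radius; }
        public double area() { return Math.PI * radius * radius; }
    }

    // The s.area() call site is what an optimizer might specialize.
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }

    public static void main(String[] args) {
        // Phase 1: only Square is ever seen, so the call site looks monomorphic.
        Shape[] squares = new Shape[1_000];
        Arrays.fill(squares, new Square(2));
        double sink = 0;
        for (int i = 0; i < 20_000; i++) {
            sink += totalArea(squares);
        }
        // Phase 2: a second implementation appears; any code compiled under the
        // "only Square" assumption can no longer be used as-is.
        Shape[] mixed = { new Square(2), new Circle(1) };
        sink += totalArea(mixed);
        System.out.println(sink);
    }
}
```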

Garbage collection

The 1.0 and 1.1 Java virtual machines (JVMs) used a mark-sweep collector, which could fragment the heap after a garbage collection. Starting with Java 1.2, the JVMs changed to a generational collector, which has a much better defragmentation behaviour. Modern JVMs use a variety of methods that have further improved garbage collection performance.
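Which collectors a running JVM actually uses can be inspected through the standard `java.lang.management` API. The sketch below is a small illustration; the class name `GcReport` and the helper `collectors` are invented for this example, and the collector names and counts printed will differ between JVMs and settings.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

final class GcReport {
    // Returns the collectors registered by the running JVM.
    static List<GarbageCollectorMXBean> collectors() {
        return ManagementFactory.getGarbageCollectorMXBeans();
    }

    public static void main(String[] args) {
        // Create some short-lived garbage so the collectors have work to do.
        long sink = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] junk = new byte[1024];
            sink += junk.length;
        }
        for (GarbageCollectorMXBean gc : collectors()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
        System.out.println("allocated bytes: " + sink);
    }
}
```

On a generational setup this typically lists separate collectors for the young and old generations, which is a convenient way to confirm what the JVM chose before starting to tune it.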

Other optimizing methods

Compressed Oops
Compressed Oops allow Java 5.0+ to address up to 32 GB of heap with 32-bit references. Java does not support access to individual bytes, only to objects, which are 8-byte aligned by default. Because of this, the lowest 3 bits of a heap reference are always 0. By lowering the resolution of 32-bit references to 8-byte blocks, the addressable space can be increased to 32 GB. This significantly reduces memory use compared to 64-bit references, because Java uses references much more than languages such as C++. Java 8 supports larger alignments, such as 16-byte alignment, to address up to 64 GB with 32-bit references.
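The arithmetic behind those limits is simple to check: 2^32 distinct reference values, each naming one aligned block. The following sketch models it with plain shifts. It is a simplified model, not HotSpot's implementation (a real JVM may also add a heap base address), and the names `CompressedOopsMath`, `addressableBytes`, `compress`, and `decompress` are invented for this example.

```java
final class CompressedOopsMath {
    // 2^32 reference values, each addressing one block of alignmentBytes.
    static long addressableBytes(int alignmentBytes) {
        return (1L << 32) * alignmentBytes;
    }

    // Drop the always-zero low bits so the offset fits in 32 bits.
    static int compress(long heapOffset, int shift) {
        return (int) (heapOffset >>> shift);
    }

    // Treat the 32-bit value as unsigned and restore the low zero bits.
    static long decompress(int compressed, int shift) {
        return (compressed & 0xFFFF_FFFFL) << shift;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        System.out.println("8-byte alignment:  " + addressableBytes(8) / gib + " GB");
        System.out.println("16-byte alignment: " + addressableBytes(16) / gib + " GB");

        long offset = 0x7_FFFF_FFF8L; // last 8-byte-aligned offset below 32 GB
        int packed = compress(offset, 3);
        System.out.println("round trip ok: " + (decompress(packed, 3) == offset));
    }
}
```

In HotSpot the corresponding options are `-XX:+UseCompressedOops` and `-XX:ObjectAlignmentInBytes`.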

Split bytecode verification

Before executing a class, the Sun JVM verifies its Java bytecodes (see bytecode verifier). This verification is performed lazily: classes’ bytecodes are only loaded and verified when the specific class is loaded and prepared for use, and not at the beginning of the program. (Note that other verifiers, such as the Java/400 verifier for IBM iSeries (System i), can perform most verification in advance and cache verification information from one use of a class to the next.) However, as the Java class libraries are also regular Java classes, they must also be loaded when they are used, which means that the start-up time of a Java program is often longer than for C++ programs, for example.

A method named split-time verification, first introduced in the Java Platform, Micro Edition (J2ME), has been used in the JVM since Java 6. It splits the verification of Java bytecode into two phases:

Design-time – when compiling a class from source to bytecode
Runtime – when loading a class.

Escape analysis and lock coarsening

Java is able to manage multithreading at the language level. Multithreading allows a program to perform multiple tasks concurrently, which can make programs faster on computer systems with multiple processors or cores. A multithreaded application can also remain responsive to input while performing long-running tasks.

Before Java 6, the virtual machine always locked objects and blocks when asked to by the program, even if there was no risk of an object being modified by two different threads at once. In the following example, a local Vector is locked before each of the add operations to ensure that it is not modified by other threads (Vector is synchronized). Because the Vector is strictly local to the method, this locking is needless:

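// requires: import java.util.Vector;
public String getNames() {
     // v never leaves this method, so no other thread can reach it
     Vector<String> v = new Vector<>();
     v.add("Me");
     v.add("You");
     v.add("Her");
     return v.toString();
}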

Starting with Java 6, code blocks and objects are locked only when needed, so in the above case the virtual machine would not lock the Vector object at all.
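Conceptually, once escape analysis shows that the Vector never leaves the method, the JVM is free to run the method as if an unsynchronized list had been used. The sketch below places the two versions side by side to show they are observably equivalent. The class `LockElisionSketch` and both method names are made up for this illustration, and the `ArrayList` version is only a model of the effect: the JVM removes the locking internally and does not rewrite the source this way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

final class LockElisionSketch {
    // As written by the programmer: every add() acquires the Vector's lock.
    static String getNamesSynchronized() {
        Vector<String> v = new Vector<>();
        v.add("Me");
        v.add("You");
        v.add("Her");
        return v.toString();
    }

    // Model of what remains once the needless locks are elided.
    static String getNamesUnsynchronized() {
        List<String> v = new ArrayList<>();
        v.add("Me");
        v.add("You");
        v.add("Her");
        return v.toString();
    }

    public static void main(String[] args) {
        System.out.println(getNamesSynchronized());   // prints [Me, You, Her]
        System.out.println(getNamesUnsynchronized()); // prints [Me, You, Her]
    }
}
```

The practical consequence is that replacing a local `Vector` with an `ArrayList` by hand purely for speed matters less on newer JVMs, although the unsynchronized collection still states the intent more clearly.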

Register allocation improvements

Before Java 6, register allocation in the client virtual machine was very primitive (values did not live in registers across blocks). This was a problem on CPU designs with few processor registers, such as x86. If no registers are available for an operation, the compiler must copy values from register to memory (or memory to register), which takes time because registers are significantly faster to access. The server virtual machine, however, used a graph-coloring allocator and did not have this problem.

Class data sharing

Class data sharing (called CDS by Sun) is a mechanism that reduces the startup time of Java applications and also reduces their memory footprint. When the JRE is installed, the installer loads a set of classes from the system JAR file (rt.jar, the JAR file holding the entire Java class library) into a private internal representation and dumps that representation to a file called a "shared archive". During subsequent JVM invocations, this shared archive is memory-mapped in, saving the cost of loading those classes and allowing much of the JVM's metadata for these classes to be shared among multiple JVM processes.
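A quick, informal way to check whether a running HotSpot JVM is using a shared archive is to look at the `java.vm.info` system property, which on HotSpot usually carries the same text as the last line of `java -version` (for example "mixed mode, sharing"). The sketch below assumes that behaviour; other JVMs may report something different or nothing useful. `CdsCheck`, `vmInfo`, and `mentionsSharing` are names invented for this example.

```java
final class CdsCheck {
    // HotSpot-specific assumption: this property describes the VM mode.
    static String vmInfo() {
        return System.getProperty("java.vm.info", "unknown");
    }

    // True when the VM description mentions class data sharing.
    static boolean mentionsSharing(String info) {
        return info.contains("sharing");
    }

    public static void main(String[] args) {
        String info = vmInfo();
        System.out.println("java.vm.info = " + info);
        System.out.println("CDS appears active: " + mentionsSharing(info));
    }
}
```

On HotSpot, sharing can be controlled with the `-Xshare:auto`, `-Xshare:on`, and `-Xshare:off` options, and `-Xshare:dump` regenerates the archive.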
