by
tags:

I was recently consulting a client. Long garbage collections cycles triggered the client failover to switch server, since the server was idle for too long. My task was to improve the garbage collection. The tuning task took several iterations, in each step there was something new to learn. There are many availables tools to perform GC analysis. One of them is GCEasy . It is a great tool. It is being used by leading firms and supplies great visual experience. I was using Java Mission Control as tool for analysing the results. I prefered it since it supplies more information than just the one GC logs supply. It also has nice visualisation. The screenshots from Java Mission Control are embedded in this post.

Journey Begins

First, I recorded the execution with the parameters that were already defined by my client. I noticed that there were short peaks of Weak References. However, during most of the execution the weak references were not available. Graph with peaks

After looking at the configuration, I discovered that “Young Generation Size” was configured.

Wrong Defaults

This is wrong, according to Oracle. The basic behaviour of the G1 is to make the size of the new generation memory regions flexible. It is suggested to use -XX:G1NewSizePercent and-XX:G1MaxNewSizePercent .So “-XX:NewSize=512m -XX:MaxNewSize=512m” were removed. I also noticed that “Maximum Tenuring Threshold” was set to 0,this means that all objects will be promoted immediately. I removed “-XX:MaxTenuringThreshold=0” too.

Where are the Young?

So, I executed load test with only “-XX:+UseG1GC” and no other parameter. The weak references look now much better! But the GC cannot get rid from the Phantom references, they look pretty stable. Phantom Reference

When investigating a bit more, it came out that “Young GC collection” never took place. That’s bad for performance. In fact, I would expect more young generation collection than old ones. So the journey continued.. No Young Generation

Change Configuration

So it is clear that configuration needs to be changed, in order to achieve young generation collections. To the rescue I learned from this post that: “The fundamental idea of improving GC behavior is to ensure short lived objects die young and never get promoted. With the more memory G1 has, the better assurance you will get that objects are not getting prematurely promoted to the old generation.” I added the parameters defined by the post to see how my client’s product will behave: “-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=100 -XX:+DisableExplicitGC -XX:TargetSurvivorRatio=90 -XX:G1NewSizePercent=50 -XX:G1MaxNewSizePercent=80 -XX:InitiatingHeapOccupancyPercent=10 -XX:G1MixedGCLiveThresholdPercent=50 -XX:+AggressiveOpts”

This had better result. However, there were not enough young generation cycles. When looking at the total GC time, we can see it takes more time to STW (Stop The World), than the first iteration. GC cleans Phantom references nicely and uses many objects as needed.

Some Progress

Root Cause

Java Mission Control also supplies a view of Allocations in its Memory tab. By Thread

When I investigated “Allocation in New TLAB” -> “Allocation by Thread” Tab, the core issue was revealed. The server used ~11 threads to write logs, they created ~27 percentages of the allocations during the stress testing. So refactoring the logging mechanism was the true solution for this issue.

Summary

When I started my investigation I treated the product as a black box. But the real solution was hidden inside the business logic. GC can be tuned by parameters, but the business logic is essential to understand the environment. The business logic issues (What creates weak references? ,Which threads are doing most of the allocations? ,How long users session last? ) were the key to improving the long GC pauses.