First of all, before delving into details, let me give a little background. I help to develop and run citysearch.com, a business directory and review website that pioneered small business web pages in early 90s and has a strong focus on user reviews. It is a typical web site developed in Java and heavily relying on APIs (web services) in the new releases.
On a Monday in February, our operation engineers reported that there was a web server crashed over the weekend. Thanks to our health check, there was little or no disruption of the service and they can reboot Tomcat to put the node into rotation.
By this point you can imagine that I won't believe that JVM would crash. It's the same web app that has been running for years without much trouble. There was also not crash log available (we did not try hard to find it). A couple days later, another server crashed. Then 2-3 more servers crashed in the following weekend. It is indeed very serious. With the help from the operation engineer, we find the hs_err_pidXXXX.log file:
abc@xxxxxxx /tmp $ cat hs_err_pid16272.log
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f3a59da8677, pid=16272, tid=139887128516352
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J org.apache.http.client.protocol.RequestAddCookies.process(Lorg/apache/http/HttpRequest;Lorg/apache/http/protocol/HttpContext;)V
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#
--------------- T H R E A D ---------------
Current thread (0x0000000000f2c800): JavaThread "taskExecutorPool-8" [_thread_in_Java, id=18318, stack(0x00007f3a028a9000,0x00007f3a029aa000)]
What the hell? Problematic frame on Java code and failed to write core dump?? I know that all our web service API call use Apache Http Client, which is a top level open source project under Apache. We did not use the latest 4.x release but for the sake of argument, why Http Client, a pure Java library, can crash JVM? For years I've held the belief that we can cause memory leak in Java but not to crash it so long we stay aways from JNI. Even on that front I ever wrote some JNI code in C# and never crashed JVM spectacularly like this.
Following the suggestion in the log, we reconfigured "ulimit" to allow core dump and successfully generated a core dump. Now the real problem struck me.
xxxx@YYYYYYYY ~ $ /usr/java/default/bin/jmap -J-d64 /usr/java/default/bin/java core.25835
Attaching to core core.25835 from executable /usr/java/default/bin/java, please wait...
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.tools.jmap.JMap.runTool(JMap.java:197)
at sun.tools.jmap.JMap.main(JMap.java:128)
Caused by: sun.jvm.hotspot.debugger.UnmappedAddressException: 7f66c28ee1ac
at sun.jvm.hotspot.debugger.PageCache.checkPage(PageCache.java:208)
at sun.jvm.hotspot.debugger.PageCache.getData(PageCache.java:63)
at sun.jvm.hotspot.debugger.DebuggerBase.readBytes(DebuggerBase.java:217)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.readCInteger(LinuxDebuggerLocal.java:482)
at sun.jvm.hotspot.debugger.linux.LinuxAddress.getCIntegerAt(LinuxAddress.java:69)
at sun.jvm.hotspot.utilities.CStringUtilities.getString(CStringUtilities.java:61)
at sun.jvm.hotspot.HotSpotTypeDataBase.readVMTypes(HotSpotTypeDataBase.java:174)
at sun.jvm.hotspot.HotSpotTypeDataBase.(HotSpotTypeDataBase.java:85)
at sun.jvm.hotspot.bugspot.BugSpotAgent.setupVM(BugSpotAgent.java:569)
at sun.jvm.hotspot.bugspot.BugSpotAgent.go(BugSpotAgent.java:493)
at sun.jvm.hotspot.bugspot.BugSpotAgent.attach(BugSpotAgent.java:347)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:169)
at sun.jvm.hotspot.tools.PMap.main(PMap.java:67)
... 6 more
xxxx@ YYYYYYYY ~ $
I also tried JHat that ended up the same stack trace as JMap. Some people pointed out on stack overflow that Java version discrepancy cold be the culprit in some situations but certainly not in ours. Same errors occurred even I made sure the runtime and JDK java version are exactly the same. IBM has a nice tool to analyze core dump as detailed at CDIMASCIO’s blog. Unfortunately it can not load the core dump file as well.
As the last resort, I went out of my comfortable zone and gave the ultimate GDB tool a try.
kouh@aws1devweb2 ~ $ sudo gdb ./jdk1.7.0_25/bin/java core.25835
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
Reading symbols from /home/kouh/jdk1.7.0_25/bin/java...Missing separate debuginfo for /home/kouh/jdk1.7.0_25/bin/java
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/a5/58f547fe0b95fdc6a109cb7d9692d6d7969794.debug
(no debugging symbols found)...done.
[New Thread 26876]
[New Thread 26882]
[New Thread 26878]
......
.....
Loaded symbols for /usr/java/jdk1.7.0_25/jre/lib/amd64/librmi.so
Reading symbols from /usr/java/jdk1.7.0_25/jre/lib/amd64/libawt.so...(no debugging symbols found)...done.
Loaded symbols for /usr/java/jdk1.7.0_25/jre/lib/amd64/libawt.so
Reading symbols from /usr/java/jdk1.7.0_25/jre/lib/amd64/headless/libmawt.so...(no debugging symbols found)...done.
Loaded symbols for /usr/java/jdk1.7.0_25/jre/lib/amd64/headless/libmawt.so
Core was generated by `/usr/bin/java -Dnop -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogM'.
Program terminated with signal 6, Aborted.
#0 0x000000344f432625 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6_6.5.x86_64 jdk-1.7.0_25-fcs.x86_64 libgcc-4.4.6-3.el6.x86_64
(gdb) where
#0 0x000000344f432625 in raise () from /lib64/libc.so.6
#1 0x000000344f433e05 in abort () from /lib64/libc.so.6
#2 0x00007f66c2726ac5 in os::abort(bool) () from /usr/java/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so
#3 0x00007f66c2886137 in VMError::report_and_die() () from /usr/java/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so
#4 0x00007f66c272a5e0 in JVM_handle_linux_signal () from /usr/java/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so
#5
#6 0x00007f66ba4ab0bb in ?? ()
#7 0x00000007ffffffff in ?? ()
#8 0x3f40000000000004 in ?? ()
#9 0x00000007343f7da0 in ?? ()
#10 0x00000007343fa1c0 in ?? ()
#11 0x00000007343fa280 in ?? ()
#12 0x00000007343fa1c0 in ?? ()
#13 0x00000001e687f453 in ?? ()
#14 0x00000007343fa1a0 in ?? ()
#15 0x00000000c2c8fad0 in ?? ()
#16 0x0000000400000002 in ?? ()
#17 0x000000073978df28 in ?? ()
#18 0x00000007343fa1c0 in ?? ()
#19 0x00000007343fa298 in ?? ()
#20 0x0000000000000000 in ?? ()
(gdb)
As you can see above, last instruction address to access is 0, which is a well-known bug in native code (C/C++ etc, JVM was written using C++).
This confirms my suspicious that we were vulnerable to a famous JIT compiler bug reported by others.
JDK-8021898 : Broken JIT compiler optimization for loop unswitching
Bug reported on Http Client:
People commented that adding “-XX:-LoopUnswitching” fixed the issue for them.
The OPS took my recommendation and added loopunswitching. There was no more crash afterward. It is still a mystery why an old bug can strike us at such a late time. We certainly did not have any upgrade on our infrastructure. Hope my saga can help you if you are in the same situation. You can always resort to GDB if you run out of your tool box.