We recently kept running into this error on a large cluster:
Py4JJavaError: An error occurred while calling o1277.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 314.0 failed 4 times, most recent failure: Lost task 0.3 in stage 314.0 (TID 13145, 10.166.227.223, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1430)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1429)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
Interestingly, the same code worked fine in Spark 1.6.2, and the data size was very small (< 1000 bytes). Eventually I found there were 2400 partitions and concluded that these partitions were unnecessarily created by DataFrame's union() operations. Remember to run repartition(2) after the join if you ever run into the same issue.
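Here is a minimal sketch of that fix, assuming the inputs are DataFrames on Spark 2.x. The helper name union_and_compact and its default of 2 partitions are illustrative and not taken from our job. The function relies only on the union() and repartition() methods of the objects it is given.

```python
from functools import reduce


def union_and_compact(frames, num_partitions=2):
    """Union DataFrames, then collapse the partitions the unions piled up.

    `frames` is any non-empty sequence of objects exposing `union()` and
    `repartition()`, such as pyspark.sql.DataFrame instances.
    """
    # Each union() keeps the partitions of both inputs, so the count
    # grows with every DataFrame that is folded in.
    combined = reduce(lambda left, right: left.union(right), frames)
    # For tiny data, a couple of partitions is plenty.
    return combined.repartition(num_partitions)
```

With real DataFrames, df.rdd.getNumPartitions() is a quick way to compare the partition count before and after the call. A plain chain of unions will typically report the sum of the inputs' partition counts.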
Hongbing's Engineering Log
Thursday, August 24, 2017
Monday, October 26, 2015
IntelliJ IDEA Tricks
I decided to go with IntelliJ after 10 years of Eclipse. It is certainly a great IDE; it blew Eclipse away when it came to setting up Scala development. I will add the tricks and shortcuts I learn along the way below.
Change default Java class file header
IntelliJ comes with a default header. Put the cursor in the Javadoc header and press Option+Enter on Mac to change it to your preferred header.
/**
 * Created by hkou on 10/23/15.
 */
Create unit test
Put the cursor in the file header section and press Option+Enter; one of the menu options is to create a test. IntelliJ offers a comprehensive list of test frameworks, such as Groovy JUnit, Spock, JUnit, TestNG, and ScalaTest.
Tuesday, March 24, 2015
JVM crashed in production, really?!
Can the JVM crash, especially in well-configured and well-maintained large web apps? Very, very rarely. I had only run into it once, on a Windows 2008 server, and that crash was later attributed to a deprecated x86 function call at the system level. We had a support contract with Microsoft, who were very helpful with the diagnosis. Our automation engineers load tested the web app against a network shared folder, and there was a program that frequently pumped files out to make room for new files. The server crashed after the weekend-long load test. Sadly, the server did crash, but it was not the JVM's fault. All in all, the JVM has been very stable and reliable. In fact, a handful of web apps I developed have been running for years without a reboot.
Before delving into the details, let me give a little background. I help develop and run citysearch.com, a business directory and review website that pioneered small business web pages in the early 90s and has a strong focus on user reviews. It is a typical website developed in Java, and the newer releases rely heavily on APIs (web services).
On a Monday in February, our operations engineers reported that a web server had crashed over the weekend. Thanks to our health check, there was little or no disruption of service, and they could restart Tomcat to put the node back into rotation.
By this point you can imagine that I would not believe the JVM had crashed. It was the same web app that had been running for years without much trouble. There was also no crash log available (we did not try hard to find one). A couple of days later, another server crashed. Then 2-3 more servers crashed the following weekend. This was indeed very serious. With help from the operations engineer, we found the hs_err_pidXXXX.log file:
abc@xxxxxxx /tmp $ cat hs_err_pid16272.log
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f3a59da8677, pid=16272, tid=139887128516352
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J org.apache.http.client.protocol.RequestAddCookies.process(Lorg/apache/http/HttpRequest;Lorg/apache/http/protocol/HttpContext;)V
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#
--------------- T H R E A D ---------------
Current thread (0x0000000000f2c800): JavaThread "taskExecutorPool-8" [_thread_in_Java, id=18318, stack(0x00007f3a028a9000,0x00007f3a029aa000)]
What the hell? A problematic frame in Java code, and a failed core dump? I knew that all our web service API calls used Apache HttpClient, a top-level open source project under Apache. We were not using the latest 4.x release, but for the sake of argument, how can HttpClient, a pure Java library, crash the JVM? For years I had held the belief that we can cause a memory leak in Java but cannot crash it so long as we stay away from JNI. Even on that front, I once wrote some JNI code in C# and never crashed the JVM as spectacularly as this.
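When several servers produce these files, it helps to pull the key header fields out with a script. The following is a minimal sketch, assuming the header layout shown in the log above; the function name summarize_hs_err and the dictionary keys are my own. In hs_err logs, a frame prefixed with "J" is compiled Java code, which is the first hint that the JIT compiler is involved.

```python
import re


def summarize_hs_err(text):
    """Pull the signal, JRE version and problematic frame out of an hs_err log."""
    summary = {}
    lines = text.splitlines()
    for index, line in enumerate(lines):
        signal = re.match(
            r"#\s+(SIG[A-Z]+)\s+\(0x[0-9a-f]+\) at pc=(0x[0-9a-f]+)", line
        )
        if signal:
            summary["signal"] = signal.group(1)
            summary["pc"] = signal.group(2)
        elif line.startswith("# JRE version:"):
            summary["jre"] = line.split(":", 1)[1].strip()
        elif line.startswith("# Problematic frame:") and index + 1 < len(lines):
            # The frame itself sits on the following line; a leading "J"
            # marks JIT-compiled Java code.
            frame = lines[index + 1].lstrip("# ").strip()
            summary["frame"] = frame
            summary["compiled_java"] = frame.startswith("J ")
    return summary
```

Running it over the contents of each hs_err_pid*.log file makes it easy to see whether every crash points at the same compiled method.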
Following the suggestion in the log, we reconfigured "ulimit" to allow core dumps and successfully generated one. Then the real problem struck me.
xxxx@YYYYYYYY ~ $ /usr/java/default/bin/jmap -J-d64 /usr/java/default/bin/java core.25835
Attaching to core core.25835 from executable /usr/java/default/bin/java, please wait...
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.tools.jmap.JMap.runTool(JMap.java:197)
at sun.tools.jmap.JMap.main(JMap.java:128)
Caused by: sun.jvm.hotspot.debugger.UnmappedAddressException: 7f66c28ee1ac
at sun.jvm.hotspot.debugger.PageCache.checkPage(PageCache.java:208)
at sun.jvm.hotspot.debugger.PageCache.getData(PageCache.java:63)
at sun.jvm.hotspot.debugger.DebuggerBase.readBytes(DebuggerBase.java:217)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.readCInteger(LinuxDebuggerLocal.java:482)
at sun.jvm.hotspot.debugger.linux.LinuxAddress.getCIntegerAt(LinuxAddress.java:69)
at sun.jvm.hotspot.utilities.CStringUtilities.getString(CStringUtilities.java:61)
at sun.jvm.hotspot.HotSpotTypeDataBase.readVMTypes(HotSpotTypeDataBase.java:174)
at sun.jvm.hotspot.HotSpotTypeDataBase.<init>(HotSpotTypeDataBase.java:85)
at sun.jvm.hotspot.bugspot.BugSpotAgent.setupVM(BugSpotAgent.java:569)
at sun.jvm.hotspot.bugspot.BugSpotAgent.go(BugSpotAgent.java:493)
at sun.jvm.hotspot.bugspot.BugSpotAgent.attach(BugSpotAgent.java:347)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:169)
at sun.jvm.hotspot.tools.PMap.main(PMap.java:67)
... 6 more
xxxx@ YYYYYYYY ~ $
I also tried jhat, which ended with the same stack trace as jmap. Some people pointed out on Stack Overflow that a Java version discrepancy could be the culprit in some situations, but certainly not in ours: the same errors occurred even after I made sure the runtime and JDK versions were exactly the same. IBM has a nice tool for analyzing core dumps, as detailed on CDIMASCIO’s blog. Unfortunately, it could not load the core dump file either.
As a last resort, I went out of my comfort zone and gave the ultimate tool, GDB, a try.
kouh@aws1devweb2 ~ $ sudo gdb ./jdk1.7.0_25/bin/java core.25835
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
Reading symbols from /home/kouh/jdk1.7.0_25/bin/java...Missing separate debuginfo for /home/kouh/jdk1.7.0_25/bin/java
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/a5/58f547fe0b95fdc6a109cb7d9692d6d7969794.debug
(no debugging symbols found)...done.
[New Thread 26876]
[New Thread 26882]
[New Thread 26878]
......
.....
Loaded symbols for /usr/java/jdk1.7.0_25/jre/lib/amd64/librmi.so
Reading symbols from /usr/java/jdk1.7.0_25/jre/lib/amd64/libawt.so...(no debugging symbols found)...done.
Loaded symbols for /usr/java/jdk1.7.0_25/jre/lib/amd64/libawt.so
Reading symbols from /usr/java/jdk1.7.0_25/jre/lib/amd64/headless/libmawt.so...(no debugging symbols found)...done.
Loaded symbols for /usr/java/jdk1.7.0_25/jre/lib/amd64/headless/libmawt.so
Core was generated by `/usr/bin/java -Dnop -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogM'.
Program terminated with signal 6, Aborted.
#0 0x000000344f432625 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6_6.5.x86_64 jdk-1.7.0_25-fcs.x86_64 libgcc-4.4.6-3.el6.x86_64
(gdb) where
#0 0x000000344f432625 in raise () from /lib64/libc.so.6
#1 0x000000344f433e05 in abort () from /lib64/libc.so.6
#2 0x00007f66c2726ac5 in os::abort(bool) () from /usr/java/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so
#3 0x00007f66c2886137 in VMError::report_and_die() () from /usr/java/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so
#4 0x00007f66c272a5e0 in JVM_handle_linux_signal () from /usr/java/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so
#5 <signal handler called>
#6 0x00007f66ba4ab0bb in ?? ()
#7 0x00000007ffffffff in ?? ()
#8 0x3f40000000000004 in ?? ()
#9 0x00000007343f7da0 in ?? ()
#10 0x00000007343fa1c0 in ?? ()
#11 0x00000007343fa280 in ?? ()
#12 0x00000007343fa1c0 in ?? ()
#13 0x00000001e687f453 in ?? ()
#14 0x00000007343fa1a0 in ?? ()
#15 0x00000000c2c8fad0 in ?? ()
#16 0x0000000400000002 in ?? ()
#17 0x000000073978df28 in ?? ()
#18 0x00000007343fa1c0 in ?? ()
#19 0x00000007343fa298 in ?? ()
#20 0x0000000000000000 in ?? ()
(gdb)
As you can see above, the last address in the backtrace is 0, a well-known symptom of a bug in native code (C/C++ and the like; the JVM itself is written in C++).
This confirmed my suspicion that we were vulnerable to a famous JIT compiler bug reported by others.
JDK-8021898 : Broken JIT compiler optimization for loop unswitching
Bug reported on Http Client:
People commented that adding “-XX:-LoopUnswitching” fixed the issue for them.
Ops took my recommendation and added the -XX:-LoopUnswitching flag. There were no more crashes afterward. It is still a mystery why an old bug struck us so late; we certainly had not upgraded anything in our infrastructure. I hope my saga helps you if you are in the same situation. You can always resort to GDB when you run out of tools in your toolbox.
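To double check that the flag actually took effect, one option is to ask the JVM to print its final flag values with -XX:+PrintFlagsFinal and look for LoopUnswitching. The following is a minimal sketch, assuming the usual "bool Name = value {kind}" line layout that HotSpot prints (":=" when a flag was overridden) and a server VM that knows the flag. The function names are my own.

```python
import re
import subprocess


def parse_bool_flag(flags_output, name):
    """Return True/False for a boolean flag in PrintFlagsFinal output, else None."""
    pattern = re.compile(
        r"^\s*bool\s+" + re.escape(name) + r"\s+:?=\s+(true|false)\b",
        re.MULTILINE,
    )
    match = pattern.search(flags_output)
    return None if match is None else match.group(1) == "true"


def loop_unswitching_enabled(java_args):
    """Ask a JVM started with `java_args` whether LoopUnswitching is on.

    `java_args` is the command prefix, e.g. ["java", "-XX:-LoopUnswitching"].
    """
    result = subprocess.run(
        list(java_args) + ["-XX:+PrintFlagsFinal", "-version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # The flag table goes to stdout; the version banner goes to stderr.
    return parse_bool_flag(result.stdout, "LoopUnswitching")
```

Calling loop_unswitching_enabled(["java", "-XX:-LoopUnswitching"]) with the same java binary and options that Tomcat uses should come back False once the fix is in place.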
Monday, January 12, 2015
Nice small tool for Mac users: csshX
I appreciate this nice little tool for making my life easier and more productive when troubleshooting issues in a cluster environment.
csshX at code.google.com
Friday, September 5, 2014
JGit-Flow does the release job well
No matter what kind of software process you employ, committing, branching, merging, and releasing are always challenging for an organization. None of them is hard to do well on its own given a version control system, whether it is SourceSafe, CVS, ClearCase, Subversion, or a distributed version control system such as Git or Mercurial. The complexity grows exponentially when you have a team of developers; everybody may be smart, but they have different skills and preferences. In practice, we are also working on ever-growing software systems, which makes matters worse. At Citysearch we literally have releases going out every week.
More or less, you have a working best practice for version control within your team. There might be a problem here and there, but you can always find a way out. The same applies to our team. After years of fooling around, we came to a point where everybody on the team had to release their own stories. For most of the year we did okay with a well-written release checklist, until we felt enough was enough. Our project is built with Maven and Subversion is our repository, so it was a natural choice to use the Maven Release Plugin, which does the job well.
After configuring your pom following the usage guideline, you can release from your release branch easily.
mvn --batch-mode release:clean release:prepare release:perform
If you need a system property for your build, you could run into trouble as I did. Note that the release plugin checks out the code into a temporary work folder and invokes the Maven deploy goal from there. Your system property is invisible during the packaging process unless you pass it as an argument to the Maven Release Plugin.
-Dtag=REL_${BUILD_ID} -Darguments=-Dtag=REL_${BUILD_ID}
Everybody can start a Jenkins release job without much training. That works great for us.
Moving on to the real treat. We are adopting Git for future development, and past experience with the Maven Release Plugin drove me to look for alternatives. Luckily, I found JGit-Flow.
After adding
<plugin>
  <groupId>external.atlassian.jgitflow</groupId>
  <artifactId>jgitflow-maven-plugin</artifactId>
  <version>1.0-m4</version>
  <configuration>
    <squash>true</squash>
  </configuration>
</plugin>
into my pom.xml, all I need to do is issue the command
mvn -U clean jgitflow:release-start jgitflow:release-finish -Dbuild.id=$BUILD_ID
There is no more ugly manipulation of system properties, since JGit-Flow does not package the project in a temporary workspace folder. It also tags the master branch and keeps the develop branch in sync with master, which is missing in the Maven Release Plugin. As a bonus, you get to embrace Vincent Driessen's branching model.
Tuesday, May 14, 2013
Integrating blog into Citysearch
Tuesday, April 23, 2013
Android ListView performance
In a nutshell, if you've ever run into performance degradation or memory leaks, be sure to watch this great YouTube video of a developer session from Google I/O 2010. It is a fantastic talk covering what you need to pay attention to when developing Android lists, galleries, carousels, etc.
http://www.youtube.com/watch?v=wDBM6wVEO70