Thursday, August 24, 2017

Possible Spark 2.1 DataFrame bug

We recently kept running into the following error on a large cluster.

Py4JJavaError: An error occurred while calling o1277.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 314.0 failed 4 times, most recent failure: Lost task 0.3 in stage 314.0 (TID 13145, 10.166.227.223, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1430)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1429)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

Interestingly, the same code worked fine in Spark 1.6.2, and the data size is very small (< 1000 bytes). Eventually I found that there were 2400 partitions and concluded that these partitions were created unnecessarily by DataFrame's union() operations. Remember to call repartition(2) after the union or join if you ever run into the same issue.
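
A minimal sketch of that workaround in PySpark follows. The helper name shrink_partitions and the default target of 2 are my own choices for illustration, not part of Spark. It relies only on df.rdd.getNumPartitions() and df.repartition(), and assumes df is a regular pyspark.sql.DataFrame.

```python
def shrink_partitions(df, target=2):
    """Return df with at most `target` partitions.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame.
    Only df.rdd.getNumPartitions() and df.repartition() are used.
    """
    current = df.rdd.getNumPartitions()
    if current > target:
        # repartition() shuffles the rows into `target` partitions
        return df.repartition(target)
    # Already small enough, so avoid an unnecessary shuffle
    return df


# Hypothetical usage (df_a and df_b are placeholder DataFrames):
#   combined = df_a.union(df_b)
#   print(combined.rdd.getNumPartitions())   # inspect the count first
#   combined = shrink_partitions(combined, 2)
```

Printing the partition count before and after the union is probably the quickest way to confirm whether you are seeing the same problem.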

