While working on GraphFrames in PySpark, we encountered an
ExecutorLostFailure exception. Following is the PySpark script:
My first instinct was that it was a resource problem, and that disabling dynamic allocation by passing --num-executors= should solve the issue, but the job was already passing these parameters:
--driver-memory 2g --executor-memory 6g --executor-cores 6 --num-executors 40
Then I tried looking into the application logs. Since some of the executors were lost and there was no way to get their logs from the Spark history server, I had to use yarn logs -applicationId > myapp.log to get the logs. One benefit of this is that we get the logs for all of the application's containers in a single file, which makes them easier to grep through. Apart from showing that some of the executors were being shut down for some reason, the logs contained no helpful error messages.
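Grepping the aggregated file can also be scripted. The following is a minimal sketch, not what I actually ran: it assumes each container's section in the yarn logs output starts with a header line of the form "Container: <id> on <host>", and the keyword list and the sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed shape of the header that precedes each container's section
# in the aggregated output of `yarn logs`.
CONTAINER_HEADER = re.compile(r"^Container: (\S+) on (\S+)")

# Hypothetical keywords; adjust to whatever failure you suspect.
KEYWORDS = ("ERROR", "Killed", "OutOfMemoryError", "SIGTERM")


def scan_yarn_log(lines, keywords=KEYWORDS):
    """Group lines that contain any keyword by the container they belong to."""
    hits = defaultdict(list)
    current = "unknown"  # lines seen before the first container header
    for line in lines:
        header = CONTAINER_HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        if any(keyword in line for keyword in keywords):
            hits[current].append(line.rstrip("\n"))
    return dict(hits)


# Made-up sample standing in for myapp.log.
sample = """\
Container: container_1_0001_01_000002 on host-a_45454
15/10/20 10:00:01 INFO Executor: Running task 0.0
15/10/20 10:00:09 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
Container: container_1_0001_01_000003 on host-b_45454
15/10/20 10:00:02 INFO Executor: Running task 1.0
""".splitlines()

report = scan_yarn_log(sample)
for container, matched in report.items():
    print(container, len(matched))
```

To run it against the real file, pass an open file handle instead of the sample, for example scan_yarn_log(open("myapp.log", errors="replace")).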
My next theory was that the containers were being killed because of OOM errors. Since the input data was only around 250MB, I tried narrowing down the transformations in the job to identify what might be causing this. In the end, the UDF was the culprit: removing the call to path_cost_function allowed the job to complete successfully.
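One way to sanity-check the OOM theory is to estimate how much memory YARN grants each executor container beyond the JVM heap, since the Python workers that run a UDF live outside the heap. The sketch below is back-of-the-envelope arithmetic only. It assumes the default for spark.yarn.executor.memoryOverhead is max(384 MB, 10% of executor memory), which is my understanding for the Spark 1.5 line, assumes roughly one Python worker per executor core, and ignores YARN rounding the request up to its allocation increment.

```python
# Assumed defaults for spark.yarn.executor.memoryOverhead (Spark 1.5 line):
# max(384 MB, 10% of executor memory). Verify against your version's docs.
OVERHEAD_FACTOR = 0.10
OVERHEAD_MIN_MB = 384


def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Approximate memory requested from YARN for one executor container."""
    if overhead_mb is None:
        overhead_mb = max(int(executor_memory_mb * OVERHEAD_FACTOR), OVERHEAD_MIN_MB)
    return executor_memory_mb + overhead_mb


executor_mb = 6 * 1024                      # --executor-memory 6g
executor_cores = 6                          # --executor-cores 6

container_mb = yarn_container_mb(executor_mb)
off_heap_budget_mb = container_mb - executor_mb
# Assumption: about one Python worker per concurrently running task (core).
per_worker_mb = off_heap_budget_mb / executor_cores

print(container_mb, off_heap_budget_mb, round(per_worker_mb, 1))
```

Under these assumptions each Python worker has only on the order of 100 MB before the container exceeds its limit, so a memory-hungry UDF could plausibly get the container killed without any JVM OutOfMemoryError appearing in the logs.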
Next, I tried optimizing path_cost_function implementation by passing the float values directly instead of the structs:
This produced a slightly better error message, with the job now failing with:
Since this seems to originate from Tungsten, my next step was to disable Tungsten for the job with
This allowed the job to complete successfully, though a bit more slowly than it would have with Tungsten enabled. SPARK-10474, which was fixed in Spark 1.5.1, seems relevant here, and tuning spark.shuffle.safetyFraction might help.
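To get a feel for what tuning spark.shuffle.safetyFraction would change, here is a rough sketch of the arithmetic. It assumes the pre-1.6 memory model, in which the shuffle pool is heap × spark.shuffle.memoryFraction × spark.shuffle.safetyFraction with defaults of 0.2 and 0.8, and in which each of N active tasks can obtain between 1/(2N) and 1/N of that pool. The heap figure is approximated by the executor memory setting; the JVM reports slightly less in practice.

```python
def shuffle_pool_mb(heap_mb, memory_fraction=0.2, safety_fraction=0.8):
    """Approximate shuffle memory pool under the assumed pre-1.6 model."""
    return heap_mb * memory_fraction * safety_fraction


def per_task_range_mb(pool_mb, active_tasks):
    """Assumed bounds on what one task can acquire: 1/(2N) to 1/N of the pool."""
    return pool_mb / (2 * active_tasks), pool_mb / active_tasks


heap_mb = 6 * 1024      # --executor-memory 6g, used as an approximation
active_tasks = 6        # --executor-cores 6

default_pool = shuffle_pool_mb(heap_mb)
low, high = per_task_range_mb(default_pool, active_tasks)
print(round(default_pool, 2), round(low, 2), round(high, 2))

# Effect of raising the safety fraction from 0.8 to 0.9 (a hypothetical setting).
tuned_pool = shuffle_pool_mb(heap_mb, safety_fraction=0.9)
print(round(tuned_pool, 2))
```

With the defaults, a 6g executor running six tasks leaves each task somewhere between roughly 80 MB and 160 MB of shuffle memory, which gives an idea of how little headroom there is for a request to fail.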
Interestingly, the same job implemented in Scala runs fine with Tungsten enabled. One reason for this could be that the Scala version avoids the memory overhead of passing data between the JVM and the Python process. Following is the Scala version.