EMR — Hadoop Dissonance
For some time now I have been working with an EMR cluster for our streaming Spark jobs at my client. EMR is a versatile and comfortable platform that simplifies a lot of workflows. Nevertheless, it has not been easy: some functionality is lacking, or even buggy.
Did I really just say that? Yes, I did. Hear me out.
When you create a new EMR cluster, behind the scenes it starts up a Hadoop cluster with at least 1 master node and 1 slave node, using YARN as the resource manager. Nothing surprising here. It’s just that if I know there is a YARN Resource Manager in the background, maybe I would like to use it for all the stats and metrics? I mean, the name says it all, doesn’t it? Resource Manager.
But no. It seems that EMR is built on top of Hadoop without really caring what is underneath. In my opinion, this is the root cause of the use cases I am going to show.
Metrics and Monitoring, or the lack thereof
So on the EMR side of things I have CloudWatch, which gives me all the relevant metrics on the overall health of the EMR cluster. But small things are missing, for example, the number of cores. I mean, we do have the amount of memory in the cluster: used, unused and total. What about the cores?
Let’s say I want to add another Spark job to this cluster. I will configure the Spark job with a certain number of cores for the driver and the executors, right? How do you know you have enough resources in the system to accommodate this new Spark job?
Exactly, you don’t. You will have to go to the YARN Resource Manager, where you will be able to extract that information.
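As a sketch of what this lookup could look like: the YARN ResourceManager exposes cluster-wide totals at the REST endpoint `/ws/v1/cluster/metrics`, including `availableVirtualCores` and `availableMB`. The helper below only checks a metrics payload that has already been fetched; the hostname in the comment and the sample numbers are made up for illustration.

```python
# Sketch: check YARN cluster capacity before submitting a new Spark job.
# The ResourceManager REST endpoint /ws/v1/cluster/metrics returns JSON with
# availableVirtualCores and availableMB; the sample payload is illustrative.
import json


def has_capacity(cluster_metrics: dict, needed_vcores: int, needed_mb: int) -> bool:
    """Return True if the cluster has enough free cores AND memory."""
    m = cluster_metrics["clusterMetrics"]
    return (m["availableVirtualCores"] >= needed_vcores
            and m["availableMB"] >= needed_mb)


# In practice you would fetch this payload from the master node, e.g.
#   urllib.request.urlopen("http://<master-node>:8088/ws/v1/cluster/metrics")
sample = json.loads('{"clusterMetrics": {"availableVirtualCores": 4, "availableMB": 12288}}')

print(has_capacity(sample, needed_vcores=2, needed_mb=8192))  # True: enough room
print(has_capacity(sample, needed_vcores=8, needed_mb=8192))  # False: not enough cores
```

Note that this is exactly the information CloudWatch gives you for memory but not for cores, which is why the detour through YARN is needed at all.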
Running is not really running
When running a Spark job via EMR, the first state of the job in EMR is “Pending”, and it even has some nice graphics, making you want to wait patiently. After some time, if all goes well, your job moves to “Running”, it turns green, and we all rejoice. So everything is good, right?
While in EMR the status is pending, the job is being prepared and sent to the Hadoop cluster for execution. There you can see the states go through a different process. Initially, the job is in the “ACCEPTED” state, followed by “RUNNING”, again, if all goes well. Sometimes a Spark job gets stuck in the ACCEPTED state, for instance, when there is a lack of resources. And here is the quirk: a Spark job can be in the ACCEPTED state in the Hadoop cluster, and in the RUNNING state in the EMR interface.
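A watchdog can catch this quirk by comparing the two views instead of trusting EMR alone. The check below is a minimal sketch: the states would come from the EMR step API on one side and the YARN ResourceManager REST API (`/ws/v1/cluster/apps`) on the other; the state strings here are the ones described above.

```python
# Sketch: flag the discrepancy where EMR reports "Running" while YARN still
# has the application in ACCEPTED (e.g. waiting for resources).
def states_disagree(emr_state: str, yarn_state: str) -> bool:
    """True when EMR claims the job runs but YARN has not started it yet."""
    return emr_state.upper() == "RUNNING" and yarn_state != "RUNNING"


print(states_disagree("Running", "ACCEPTED"))  # True: the quirk described above
print(states_disagree("Running", "RUNNING"))   # False: the two views agree
```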
Really? But that’s not all:
So this one time my job was in the running state, but for some reason no data was being written to my S3 bucket. After a short investigation I understood: the Spark job did not have enough resources, specifically cores, to create an executor. Only the driver was running, yet the job was in the “RUNNING” state. I was able to figure that out through the YARN Resource Manager interface, where I could clearly see that no executor was running and that cores were lacking. Looking at the EMR interface, however, there was no indication that anything was wrong.
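This “driver-only” situation can also be detected programmatically. The YARN apps endpoint (`/ws/v1/cluster/apps`) reports a `runningContainers` count per application; a Spark job whose only container is the ApplicationMaster/driver has a count of 1 even though its state is RUNNING. The sample payload below is made up for illustration.

```python
# Sketch: detect a Spark job that is "running" with only its driver alive,
# i.e. no executors were ever allocated. Based on the runningContainers
# field from YARN's /ws/v1/cluster/apps; the sample app dict is illustrative.
def driver_only(app: dict) -> bool:
    """True for a RUNNING app with at most one container (the driver)."""
    return app["state"] == "RUNNING" and app.get("runningContainers", 0) <= 1


app = {"id": "application_1234_0007", "state": "RUNNING", "runningContainers": 1}
print(driver_only(app))  # True: the job writes nothing, yet EMR shows green
```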
Shutdown in EMR does not shut down in Hadoop
Whenever we upgrade our Spark pipeline, we need to stop the old Spark job and start a new one with the new library. For this, we have an elaborate Jenkins job which first calls EMR to stop the Spark job, waits for termination, and then starts the new one.
What happened was, we suddenly got duplicate data in our S3 bucket. How did this happen? Quite simply: stopping the EMR job, even after seeing its state as terminated, does not stop the Spark job from running on the Hadoop cluster.
Confused? So are we.
So the Spark job continues to run in the Hadoop cluster until someone terminates it from the YARN Resource Manager. Again, no indication in EMR; rather, misinformation.
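The workaround we converged on is to treat YARN, not EMR, as the source of truth after a shutdown: list the applications and explicitly kill anything still alive, for example with `yarn application -kill <id>` on the master node (the ResourceManager REST API also accepts a PUT to `/ws/v1/cluster/apps/<id>/state`). The helper below only does the "find the leftovers" half, on an illustrative app list.

```python
# Sketch: after EMR reports everything as terminated, find YARN applications
# that are still alive so they can be killed explicitly. App ids and states
# below are illustrative.
ACTIVE_YARN_STATES = {"NEW", "SUBMITTED", "ACCEPTED", "RUNNING"}


def leftover_apps(yarn_apps: list[dict]) -> list[str]:
    """Return ids of YARN apps still active despite the EMR shutdown."""
    return [a["id"] for a in yarn_apps if a["state"] in ACTIVE_YARN_STATES]


apps = [
    {"id": "application_1234_0001", "state": "RUNNING"},   # the zombie job
    {"id": "application_1234_0002", "state": "FINISHED"},  # actually done
]
print(leftover_apps(apps))  # ['application_1234_0001']
```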
Streams down in EMR while up in Hadoop
Definitely the worst discrepancy is the following: we tied our alarms to a watchdog service, checking every couple of minutes whether the Spark job is running. Very basic stuff. One day the alarm went off, indicating the Spark job had stopped working. This set a chain reaction in motion to automatically re-run the same Spark job on the EMR cluster. Again, very basic stuff. The only issue was that, yet again, we started to see duplicate data.
What now, I ask? How is this possible?
Well, when taking a closer look at the YARN Resource Manager, we saw that the old Spark job, the one we had just been told was dead, was up and alive, staring us in the face as if saying “I know, I’m up, it’s not my fault…”
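The lesson for the watchdog: before re-submitting a “dead” job, double-check YARN by application name, since EMR’s view can lag or be plainly wrong. A minimal sketch of that guard, with a made-up job name and an illustrative app list:

```python
# Sketch: only restart a job if YARN has no active application with the
# same name. The job name and the sample app list are illustrative; in
# practice the list would come from YARN's /ws/v1/cluster/apps endpoint.
def safe_to_restart(job_name: str, yarn_apps: list[dict]) -> bool:
    """False if an app with this name is still active in YARN."""
    active = {"NEW", "SUBMITTED", "ACCEPTED", "RUNNING"}
    return not any(a["name"] == job_name and a["state"] in active
                   for a in yarn_apps)


apps = [{"name": "stream-ingest", "state": "RUNNING"}]
print(safe_to_restart("stream-ingest", apps))  # False: the old job still lives
```

Had our chain reaction included this check, the duplicate data would never have been written.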
Quite funny that in the CloudWatch Monitoring Details I could see the number of running apps was… 9! Much later I even got to a point where I had 9 Spark jobs running on the Hadoop cluster, while EMR showed that all the Spark jobs had failed miserably.
This post is not meant to steer you away from EMR. Not at all. It is to open your eyes, so that you know the problems of this platform beforehand and not afterward, as happened to us.
AWS EMR is a very comfortable platform for creating and managing Hadoop clusters. I can easily add more resources, add bootstrap actions and apply many more configurations that would be much more painful with a self-managed Hadoop cluster. I can even create and use multiple such clusters within minutes!
Another way to achieve this would be by leveraging the SparkOperator on K8s, but this is a topic for another post ;-)