This article is an overview of the path we followed to migrate Spark Workloads to Kubernetes and to avoid EMR dependency. EMR was an important support tool at Empathy.co to orchestrate Spark workloads, but once the workloads became more complex, the use of EMR also became more complicated. So, back in December 2020, the Step Function flow to orchestrate the different EMR clusters was like this:
In January 2021, an initial Spark on Kubernetes RFC was proposed by the Platform Engineering team. The aim was to have a better solution with less possible friction between teams, especially the Data and Ether teams, the main users of the Spark workloads.
The fundamentals to replatform Spark workloads were:
Over a period of six months, three teams were focused exclusively on replatforming those Spark workloads to Kubernetes. The main goal was to complete the migration without focusing on performance or any more enhancements. However, for business purposes, there were some improvements in some jobs and some new Spark jobs were added. Six months later, all the jobs have been migrated to Kubernetes and EMR was decommissioned successfully.
Let’s begin with an overview of the different topics to be able to consider a successful case.
EMR solution has a bunch of different instance types, but not all AWS instances are compatible with EMR. With EMR autoscaling, it takes a while to upscale the resources needed; so to avoid long waiting times, autoscaling was not used on EMR.
On the other hand, using Kubernetes makes it possible to deploy the workloads on the vast majority of AWS instance types. Just set a nodepool on your EKS cluster with some taints, and set nodeSelector tolerances to make it work. For Kubernetes, a cluster-autoscaler was added, and the feedback after six months of usage is good.
Kubernetes cluster-autoscaler allows us to provide a bunch of instances in a shorter time than EMR. With EMR, it usually takes around 7 minutes to upscale task managers, whilst with cluster autoscaler, it takes around just 1 minute to provide a new instance.
To sum up:
- On EMR, instance types limitation and autoscaling are not fast enough.
- On Kubernetes, we have multiple instance types and fast autoscaler provision.
A significant concern is that our jobs are ephemeral, therefore we don’t need an instance running all day to run our job, only during the execution time. This use case fits well for Spot instances, which are cheap instances that are only available during some time, not like on-demand instances where availability is guaranteed. The main disadvantage of using spot instances is that instances may terminate if spot capacity becomes unavailable for the instance you are running. This disadvantage was noticed in the EMR scenario, however, for the Kubernetes scenario, the spot behaves stably and we don’t face availability issues.
The main advance on this pillar was the use of docker images to allow a distributed and repeatable way to deploy jobs. In the former solution, there were a lot of copy-paste jars and it was a little bit messy to test a new release or move between environments.
Furthermore, there is not any cloud provider Spark version dependency. There is no need to wait for the cloud provider to support a certain Spark version — just get the Docker image and deploy it on Kubernetes.
- Spot instances were not stable enough, spending time to provision new spot instances and therefore, changing to on-demand instances to avoid hiccups.
- Tons of copy-paste instructions.
- Cloud provider Spark version dependency.
- Spot instances seem more reliable than EMR.
- Docker image distribution.
- Spark version independence from the cloud provider.
EMR solution had the following dependencies:
- Serverless framework. IaC approach.
- Step Functions. To orchestrate the logic regarding sequential jobs.
- Lambda. To execute Apache Livy API calls.
- Cloudwatch log group. To check lambda logging.
- Cloudwatch rules. To schedule cron jobs.
- API Gateway. To execute the step functions from an endpoint without calling the StepFunction API.
- CloudFormation. The serverless framework creates a CloudFormation stack to deploy all the above.
- S3. Because of persistence reasons.
- Jars distribution on S3. Friction for sharing between environments.
For the Kubernetes solution, the dependencies are the following:
- Spark Operator. To create all the Kubernetes resources needed easily to execute Spark, as mount config-map, secrets, etc.
- Argo-workflows. To create the orchestration needed to execute sequential jobs and create a schedule policy.
- ArgoCD: GitOps delivery tool.
- Docker image: Dependencies encapsulated on a docker image.
- S3. Because of persistence reasons.
On this topic, there are huge benefits because when migrating to other cloud providers, persistence is where the focus should be.
One of the biggest improvements was to encapsulate the Spark jobs in a Docker image, providing the benefits of using containers. Also, the CI is more mature for images than the EMR setup, uploading jars to a bucket, syncing with the EMR instances…
In both cases, Push-gateway was used to allow batch jobs to expose their metrics to Prometheus, while Grafana was used to review Dashboards. Some differences:
- In the EMR scenario, the monitoring solution had some caveats regarding workflow complexity that created friction between teams because of the lack of standardisation.
- With the Kubernetes solution, Platform Engineering provides an opinionated workflow that helps to easily add metrics to Prometheus and dashboards to Grafana.
Another important point is the Spark History Server; here are some differences:
- With EMR, it was available only during the cluster execution.
- With Kubernetes, it has been launched with S3 persistence, therefore the historical data is available after cluster termination.
Some years ago, Empathy couldn’t find any open source solution that fit well enough with the requirements. The usual ELK stack showed some performance issues on Logstash, so the decision was taken to create an in-house log-parser to send metrics to Elasticsearch until some open source tool fit the requirements. The in-house solution had dependencies on a custom service running on each instance: AWS Kinesis and AWS Lambda. Hence, the solution was always tied to AWS. Last year, the use of EFK started increasing in the community and with the migration to Kubernetes, the decision was to adopt this logging stack and retire the in-house log solution. The logging-operator from Banzai meets the requirements, and it’s been happily deployed as the logging solution.
Now, spark logs can be retrieved using the
kubectl logs command and from a Kibana UI, where each team can create/customise their logging flow.
The main achievement here was being able to get rid of the in-house solution whilst also improving team autonomy to customise their logging flow easily.
Once the migration had been completed, it was enough to compare with the EMR solution to review how well the job had been performed.
One motivation to move to Kubernetes was to reduce the cost; as you know, EMR has a fee on AWS, but EKS does too. The EKS fee is lower than the EMR fee, although the EKS fee is shared between all the workloads running on the cluster. Let’s review if the goal has been accomplished by reviewing one lower environment:
On top of this, there is an increase in the number of jobs launched:
The conclusions are the following:
- Scalability: Faster instance provision.
- Reliability: Spot instances have a more reliable behaviour than using Spot on EMR. Docker image distribution and Spark version independence from a cloud provider.
- Portability: Docker image providing a better distribution flow, avoid too many dependencies from AWS, and adopt open source approach.
- Metrics: Clear workflow definition.
- Logging: EFK open source adoption and provide more autonomy to the teams.
- Performance: Better performance.
- Cost-effectiveness: Reduce cost as well as increase Spark usage.
- Deployment with enterprise customers to replace Google Dataproc
- Performance tuning
- Custom metrics for Grafana dashboards
- Alerting on failed jobs
- Adopt spot instances on Data jobs
- Allocate cost to Kubernetes workloads to provide a faster business feedback
- Optimise usage, fine-tuning underutilisation resources
This has been a long article and we’ve learned a lot along the way. I hope our learnings will help you to become more cloud-agnostic.