Running Spark in Production: Choosing where to host Spark


Appuri helps customers diagnose and predict churn. This, by its very nature, is a Big Data problem because user behavior is predictive of churn. A popular game title, for example, can easily collect 4 billion events a day. Over the last few months, we decided to move our core ML pipeline over to Apache Spark. This series of blog posts covers real-world lessons in learning how to run Spark in production.



What’s not covered in the documentation

Spark has a vibrant, active community. There are hundreds of Meetups, Spark Summit is around the corner and the Apache project’s commit history is extremely active. Unfortunately, Spark has good but not great documentation; in all honesty, the project is evolving so fast you shouldn’t expect otherwise. The biggest gap is advice and lessons on running Spark in production. Once you have tired of all the word count examples, you still have to deal with these issues:

  1. [This blog post] Where should you host Spark? Should you run it side-by-side with an existing Hadoop installation? What about Amazon Web Services (AWS) Elastic MapReduce (EMR) vs. Databricks?
  2. How do you write a performant Spark application? Again, making word count (and similar canonical examples) fast is easy. How do you make a real-world example (say, with billions of rows and complex joins) fast?
  3. What tools are available to help you tune and monitor Spark jobs and clusters? Developer time is more expensive than CPU time (what a time to be alive!). How do you monitor a Spark cluster and investigate performance issues?


Should you self-host Spark?

At Appuri, we have a strong bias towards managed services. There are, of course, several notable exceptions to this soft rule – for example, we run Kafka instead of using AWS Kinesis, and we moved to Kubernetes after a tiringly negative run with Amazon EC2 Container Service (ECS). On the other hand, we really like hosted data services – for example, we love Amazon Redshift, a managed data warehouse.

We made the decision to use a managed Spark installation early on. The devops cost of running Spark is non-trivial; unless you have a team managing a Hadoop cluster, we strongly recommend using a managed option because the tech stack around deploying, scaling and monitoring Spark is immature at best.



Amazon EMR has been on the market for several years. It started off as a managed Hadoop service and, to be honest, it wasn’t great. The job queue was limited (256 jobs at a time), the job lifecycle was sometimes broken and so on. Things improved with YARN, but as recently as last year, running Spark on EMR required custom bootstrap actions.

We tried EMR recently and were pleasantly surprised at how far it had come along.

  1. Excellent control. EMR lets you get fine-grained with your cluster. You can specify mixes of EC2 instance types, pricing (spot vs. on-demand) and choose from a fairly up-to-date menu of Spark distributions and other Hadoop applications like Zeppelin and Hue (sadly, EMR doesn’t offer the latest version of Hue).
  2. Task instance groups. One of the strong suits of EMR is the ability to add capacity on demand using task instance groups. Let’s say you notice that your cluster is running out of memory – you can add a group of EC2 instances in a task instance group (say, with spot pricing) to add capacity to the cluster. This makes dynamic scaling possible; for example, you can run a small cluster of 10 instances during the day and scale it up at night when batch jobs run.
  3. Consistent S3 (EMRFS). S3 is eventually consistent – surprise, if Process A PUTs an object in S3 and the PUT succeeds, Process B may get a 404 if it tries a GET – at least until the object is replicated to other S3 servers. EMRFS is a Hadoop-compatible filesystem that adds retries and consistency, making this annoying property of S3 a non-issue.
  4. Affordable pricing. EMR pricing is an add-on on top of EC2 pricing – you will pay anywhere from $0.09 to $0.27 per instance-hour on top of the EC2 charge.
  5. Works with other Amazon services. EMR clusters log to S3, write metrics to CloudWatch and so on. These integrations aren’t great but they just work. For example, you can tell EMR to run a JAR from S3. Woo! I always appreciate how different AWS services play well together.
  6. There’s a client for it. Seriously, there is a client for EMR in your favorite language – Go, Python, Node etc. That’s the good news. The bad news is it’s written in AWS’s famous RPC request-response style. Oh well.
  7. Dated bits. Frameworks on EMR are always a little dated – for example, the latest version of Hue is not available. Spark 1.6.1 is available, but the 2.0 preview is not.
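
To make the control and task instance group points above a little more concrete, here is a minimal sketch of how such a cluster might be described using the Python client (boto3). It is not our production setup: the cluster name, release label, instance types, counts, bid price and bucket are all placeholders I made up for illustration. The request shape follows boto3's `run_job_flow` and `add_instance_groups` calls; the two functions that actually talk to AWS import boto3 lazily, so the builders can be tried without credentials.

```python
def instance_group(name, role, instance_type, count, spot_bid=None):
    """Build one entry for an EMR `InstanceGroups` list.

    role is "MASTER", "CORE" or "TASK". Passing spot_bid (a string such
    as "0.10") asks for spot pricing; otherwise the group is on-demand.
    """
    group = {
        "Name": name,
        "InstanceRole": role,
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": "SPOT" if spot_bid else "ON_DEMAND",
    }
    if spot_bid:
        group["BidPrice"] = spot_bid
    return group


def cluster_request(name, release_label, core_count):
    """Keyword arguments for boto3's EMR `run_job_flow` call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # placeholder; pick a release that ships your Spark version
        "Applications": [{"Name": "Spark"}],
        "LogUri": "s3://example-bucket/emr-logs/",  # hypothetical bucket
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Instances": {
            "KeepJobFlowAliveWhenNoSteps": True,
            "InstanceGroups": [
                instance_group("master", "MASTER", "m3.xlarge", 1),
                instance_group("core", "CORE", "m3.xlarge", core_count),
            ],
        },
    }


def launch(request, region="us-east-1"):
    """Start the cluster and return its id (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without AWS access
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


def add_task_capacity(cluster_id, group, region="us-east-1"):
    """Attach an extra task instance group to a running cluster."""
    import boto3
    emr = boto3.client("emr", region_name=region)
    return emr.add_instance_groups(JobFlowId=cluster_id, InstanceGroups=[group])


# Daytime: a small 10-instance cluster (1 master + 9 core), all on-demand.
daytime = cluster_request("example-ml-cluster", "emr-4.7.0", core_count=9)

# Night: extra spot-priced task nodes for batch jobs.
night_batch = instance_group("night-batch", "TASK", "m3.xlarge", 20, spot_bid="0.10")
```

With credentials in place, `launch(daytime)` would return a cluster id, and `add_task_capacity(cluster_id, night_batch)` would bolt the spot group on before the nightly batch run. Scaling back down in the morning is left out of the sketch.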



Databricks is the Spark company, i.e., these are the people who literally built Spark. There’s some serious brainpower here, and the good news is that the product is solid and only getting better. We tried Databricks several times over the last year, and a steady stream of improvements has made it a serious choice.

  1. Sane defaults. Where EMR gives you control, Databricks takes it away. For example – you can only choose two instance types (high CPU or high memory). This isn’t necessarily a bad thing – in fact, it’s good because these two choices make sense for most people. I just wish Databricks gave me a little more control.
  2. Excellent notebooks. In some ways, it seems as if Databricks is geared toward data scientists whereas EMR is geared toward developers. True to form, Databricks offers polished notebooks for data science. There are some really cool custom visualizations of ML algorithms which aren’t available anywhere else.
  3. Git workflow and scheduling. Building on notebooks being front and center, a pricier SKU of Databricks offers git integration so your data scientists can check in notebooks. You can even schedule notebooks to run at set times. The developer in me doesn’t quite like the idea of notebooks as deployable objects, but hey – it works!
  4. Consistent S3 view. Very similar to EMR, Databricks has its own file system (DBFS) which provides a consistent view on S3. It goes further by allowing you to “mount” S3 keys as paths. Neat!
  5. Pricey. Databricks is expensive. The per-node markup on top of EC2 charges is $0.40/hour. Yikes!
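
To see what the two markups mean on a real bill, here is a back-of-the-envelope sketch. It assumes the EMR figures quoted earlier ($0.09 to $0.27) are charged per instance-hour, the same way as the Databricks $0.40/hour, and it uses a hypothetical 10-node cluster running around the clock for 30 days. EC2 charges are the same on both sides, so they are left out. Amounts are kept in whole cents to avoid floating-point rounding.

```python
# Managed-service markup in cents per node-hour, taken from the figures in this post.
EMR_LOW, EMR_HIGH, DATABRICKS = 9, 27, 40


def surcharge_cents(nodes, hours_per_day, days, cents_per_node_hour):
    """Markup only, in cents; the underlying EC2 bill is not included."""
    return nodes * hours_per_day * days * cents_per_node_hour


def dollars(cents):
    """Format a whole-cent amount as a dollar string."""
    return "${:,.2f}".format(cents / 100)


# Hypothetical workload: 10 nodes, 24 hours a day, 30 days = 7,200 node-hours.
for label, rate in [("EMR (low)", EMR_LOW), ("EMR (high)", EMR_HIGH), ("Databricks", DATABRICKS)]:
    print(label, dollars(surcharge_cents(10, 24, 30, rate)))
# EMR (low) $648.00
# EMR (high) $1,944.00
# Databricks $2,880.00
```

Under these assumptions the Databricks markup comes out at roughly 1.5 to 4.4 times the EMR one, which matters a lot less for a small notebook cluster than for a big always-on ETL cluster.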


So who wins?

Turns out, both! We are using Amazon EMR for core ML and ETL workflows, and Databricks for the unbeatable notebook interface. This way, we get the best of both worlds. Note that I didn’t cover Google Dataproc – if you have experience with it, I’d love to hear about it.

Want to see a demo of Appuri or just chat? Drop me a line.