A friend asked how I am preparing for the Apache Spark Developer Certificate. Official collections of resources exist; here is my own unsurprising compendium, focused on developer materials. I may keep adding good resources. Good luck to the learners (myself included).


  • Learning Spark: a must-read for clear explanations of Spark and RDDs. Some of the notable Spark developer-specific things this book introduced me to:
    • Clear examples in the 3 languages needed for the dev exam (Scala, Python, Java)
    • Data Partitioning
    • Input considerations (Splittable vs Unsplittable)
    • How to pass functions to Spark with serialization considerations
    • Accumulators and Broadcasts
    • Key Performance Considerations
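To make the data-partitioning item a bit more concrete, here is a minimal plain-Python sketch of the idea (no Spark involved). It assumes integer keys and a simple modulo rule in the spirit of hash partitioning; `partition_for`, `partition_by`, and the sample data are hypothetical names made up for illustration, not Spark APIs.

```python
def partition_for(key, num_partitions):
    """Pick a partition for an integer key (assumed rule: key modulo partition count)."""
    return key % num_partitions


def partition_by(pairs, num_partitions):
    """Split (key, value) pairs into partitions so equal keys always land together."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_for(key, num_partitions)].append((key, value))
    return parts


# Hypothetical sample data: two keyed datasets.
users = [(1, "ann"), (2, "bob"), (3, "cat"), (4, "dan")]
events = [(1, "click"), (3, "view"), (3, "click"), (4, "buy")]

# Both datasets use the same rule, so matching keys share a partition index.
user_parts = partition_by(users, 2)
event_parts = partition_by(events, 2)

# Join partition by partition: no record has to move to another partition.
joined = []
for user_part, event_part in zip(user_parts, event_parts):
    lookup = dict(user_part)
    for key, event in event_part:
        if key in lookup:
            joined.append((key, (lookup[key], event)))

print(sorted(joined))
```

The point of the sketch is only that two datasets partitioned the same way can be joined locally, which is why partitioning choices matter for key-based operations.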

Tuning, Debugging, Wide and Narrow dependencies

  • Avoid groupByKey() when possible, for efficiency's sake; reduceByKey() and aggregateByKey() combine values within each partition before the shuffle.
  • Use a BroadcastHashJoin when joining large tables with small tables.
  • Apply a filter transformation to huge tables before joins, especially if you can’t broadcast a small table.
  • Avoid OOMs by being judicious with any actions that may return unbounded output to the driver (e.g. .collect(), .countByKey(), .countByValue(), or .collectAsMap()).
  • Serialization Errors
  • Only the driver can perform operations on RDDs: you cannot reference one RDD inside another RDD's transformation, so use alternatives to map+get or map+map (for example, a join or a broadcast variable).
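As a rough illustration of the groupByKey() advice above, here is a plain-Python model (again, no Spark) that counts how many records would cross the shuffle with and without combining inside each partition first. The function names and the sample partitions are hypothetical; this is a sketch of the idea under those assumptions, not how Spark implements either operation.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Model of grouping first: every (key, value) record crosses the shuffle."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def shuffle_reduce_by_key(partitions):
    """Model of reducing first: sum per key inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Hypothetical input: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]

group_totals, group_records = shuffle_group_by_key(partitions)
reduce_totals, reduce_records = shuffle_reduce_by_key(partitions)

# Same answer either way, but fewer records are shuffled when combining first.
print(group_totals == reduce_totals, group_records, reduce_records)
```

With this toy input both approaches produce the same totals, while the combine-first version shuffles 5 records instead of 8; the gap grows with the number of repeated keys per partition.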

MLlib, GraphX, SparkSQL, Spark Streaming

Great chapters on the ALS recommender, GraphX network analysis, decision trees/random forests, K-means, and Latent Semantic Analysis. As a side effect, I learned extra Scala syntax reading this. YMMV.

More Spark Streaming

Spark User List and Stack Overflow

Cool Stuff but probably not needed for the exam


  1. Hey! How was the exam? Does it cover Spark internals in that much depth, for example for Streaming and MLlib? How would you say the exam's topic breakdown looks?

    I would appreciate the help.

