A friend asked how I am preparing for the Apache Spark Developer Certificate. Official collections of resources exist; here is my own unsurprising compendium, focused on developer materials. I may continue to add good resources. Good luck to the learners (myself included).
- Learning Spark: a must-read for clear explanations of Spark and RDDs. Some of the notable Spark-developer-specific things this book introduced me to:
- Clear examples in the three languages needed for the dev exam (Scala, Python, Java)
- Data Partitioning
- Input considerations (Splittable vs Unsplittable)
- How to pass functions to Spark with serialization considerations
- Accumulators and Broadcasts
- Key Performance Considerations
- RDD Function Examples: these simple examples were a good resource for me. The API docs are good, but they don’t have examples.
- Advanced Spark Features (Matei Zaharia): slides about broadcast variables, accumulators, and partitioning.
- Spark Programming Guide Duh!
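To make the data partitioning item above a bit more concrete, here is a minimal pure-Python sketch of hash partitioning. It is a toy model, not Spark code: the helper names `partition_index`, `hash_partition`, and `local_join` are hypothetical, and the sample data is made up. The idea it tries to show is that when two keyed datasets are split with the same partitioner, matching keys land in the same partition index, so a join can proceed partition by partition without moving records around again.

```python
from collections import defaultdict


def partition_index(key, num_partitions):
    # Non-negative modulo of the key's hash, in the spirit of a hash partitioner.
    # Integer keys are used below so the result does not depend on Python's
    # per-process string-hash randomization.
    return hash(key) % num_partitions


def hash_partition(pairs, num_partitions):
    # Split (key, value) pairs into num_partitions lists by key hash.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_index(key, num_partitions)].append((key, value))
    return partitions


def local_join(left_parts, right_parts):
    # Both sides were split with the same partitioner, so partition i on the
    # left only needs to look at partition i on the right.
    joined = []
    for left, right in zip(left_parts, right_parts):
        lookup = defaultdict(list)
        for key, value in right:
            lookup[key].append(value)
        for key, value in left:
            for other in lookup[key]:
                joined.append((key, (value, other)))
    return joined


# Made-up sample data, for illustration only.
users = [(1, "ann"), (2, "bob"), (3, "cy"), (4, "dee")]
events = [(1, "login"), (3, "click"), (3, "logout"), (5, "login")]

user_parts = hash_partition(users, 2)
event_parts = hash_partition(events, 2)
result = local_join(user_parts, event_parts)
```

In Spark itself the corresponding step would be something like calling `partitionBy` on a pair RDD and persisting the result before joining against it repeatedly; the sketch only models why that helps.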
Tuning, Debugging, Wide and Narrow dependencies
- Avoid groupByKey() when possible, for efficiency’s sake.
- Use BroadcastHashJoin when joining large tables with small tables.
- Filter or transform huge tables before joins, especially if you can’t broadcast a small table.
- Avoid OOMs by being judicious with any actions that may return unbounded output to the driver (e.g. .collect(), .countByKey(), .countByValue(), or .collectAsMap()).
- Serialization Errors
- Only the driver can invoke operations on RDDs; you cannot reference one RDD inside another RDD’s transformation. That is, use alternatives to map+get or map+map.
- Spark Knowledge Base: contains some of the same material as the above link, and more.
- Diagram of Narrow v. Wide dependencies
- How-to: Tune Your Apache Spark Jobs (Part 1) (Sandy Ryza)
- How-to: Tune Your Apache Spark Jobs (Part 2) (Sandy Ryza)
- How-to: Translate from MapReduce to Apache Spark (Part 1) (Sean Owen)
- How-to: Translate from MapReduce to Apache Spark (Part 2) (Juliet Hougland)
- Spark Architecture (memory – storage and shuffle) (Alexey Grishchenko)
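Two of the tuning tips above (prefer something like reduceByKey over groupByKey, and broadcast the small side of a join) are easier to remember with a toy model. The following is a pure-Python sketch, not Spark code: the function names and sample data are hypothetical, "partitions" are just lists, and "shuffled records" are simply counted. It assumes the small table has unique keys and fits in memory on every worker.

```python
from collections import defaultdict


def sum_values(records):
    # Sum values per key, as a reducer would after the shuffle.
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def shuffle_without_combine(partitions):
    # groupByKey-style: every (key, value) record crosses the network.
    shuffled = [record for part in partitions for record in part]
    return sum_values(shuffled), len(shuffled)


def shuffle_with_combine(partitions):
    # reduceByKey-style: pre-aggregate inside each partition, so at most
    # one record per key per partition is shuffled.
    shuffled = []
    for part in partitions:
        shuffled.extend(sum_values(part).items())
    return sum_values(shuffled), len(shuffled)


def broadcast_hash_join(large_partitions, small_table):
    # The small table becomes an in-memory dict copied to every partition;
    # the large side is joined in place and never shuffled.
    lookup = dict(small_table)  # assumes unique keys in the small table
    return [
        [(key, (value, lookup[key])) for key, value in part if key in lookup]
        for part in large_partitions
    ]


# Made-up sample data, for illustration only.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]
small_table = [("a", "alpha"), ("b", "beta")]

plain_totals, plain_records = shuffle_without_combine(partitions)
combined_totals, combined_records = shuffle_with_combine(partitions)
joined = broadcast_hash_join(partitions, small_table)
```

Both aggregation paths produce the same totals; the difference is only in how many records would have to move between machines, which is the point of the groupByKey warning. The join sketch also shows why filtering first matters when broadcasting is not an option: every record that survives to the join is a record that has to be shuffled.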
MLlib, GraphX, SparkSQL, Spark Streaming
- AMPLAB mini-course exercises
- Spark Summit 2014 Advanced Training Exercises
- Anomaly Detection with Spark (Sean Owen)
- This presentation includes using GraphX for TextRank.
- GraphX: Unified Graph Analytics on Spark
More Spark Streaming
- Spark Streaming Slides from Spark Summit 2014 (Tathagata Das)
- Sampling Twitter using Spark Streaming (Patrick Wendell)
- Webinar | Streaming Big Data Analytics with Team Apache Spark & Spark Streaming, Kafka, Cassandra (Helena Edelson) with slides
Spark User List and Stack Overflow
- Spark User List: track what kinds of issues people are running into, and sometimes find answers from experienced users.
- All Apache-Spark-tagged Stack Overflow questions