ETL with Apache Spark

Continuing from my previous post on Modern Data Warehouse Architecture, in this post I'll walk through an example of using the PySpark API from Apache Spark to write ETL jobs that offload the data warehouse.

Spark is lightning-fast at data processing and works well with the Hadoop ecosystem; you can read more about Spark at the Apache Spark home page. For now, let's talk about the ETL job.

In my example, I'll merge a parent table and a type 2 sub-dimension table from a MySQL database and load them into a single dimension table in Hive with dynamic partitions. When building a warehouse on Hive, it is advisable to avoid snowflaking to reduce unnecessary joins, since each join spawns its own map tasks. To give a sense of scale, the throughput of this example job on a standalone Spark deployment is over 1 million rows per minute.
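To make the shape of the job concrete, here is a minimal sketch of the read-join-load flow, assuming the SparkSession API with Hive support. The connection details, table names (dim_customer, dim_customer_address_hist, dw.dim_customer_flat), join key, and partition column are all hypothetical placeholders, not the actual schema from this post.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read a parent dimension and its type 2 sub-dimension from MySQL,
# merge them into one flat dimension, and load it into a partitioned Hive table.
spark = (
    SparkSession.builder
    .appName("mysql-to-hive-dimension")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow dynamic partition inserts in Hive.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# Hypothetical MySQL connection details.
jdbc_url = "jdbc:mysql://mysql-host:3306/warehouse"
jdbc_props = {
    "user": "etl_user",
    "password": "etl_password",
    "driver": "com.mysql.cj.jdbc.Driver",
}

# Read the parent dimension and the type 2 sub-dimension over JDBC.
parent_dim = spark.read.jdbc(jdbc_url, "dim_customer", properties=jdbc_props)
sub_dim = spark.read.jdbc(jdbc_url, "dim_customer_address_hist", properties=jdbc_props)

# De-snowflake: merge the two tables into a single flat dimension.
flat_dim = parent_dim.join(sub_dim, on="customer_id", how="left")

# Write to Hive with dynamic partitioning on an illustrative partition column.
(
    flat_dim.write
    .mode("overwrite")
    .partitionBy("effective_year")
    .saveAsTable("dw.dim_customer_flat")
)
```

Flattening the join into one table at load time is what keeps later queries on the Hive dimension free of the extra join work discussed above.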