As a follow-up to my previous post on Modern Data Warehouse Architecture, in this post I’ll give an example of using the PySpark API from Apache Spark to write ETL jobs that offload the data warehouse.
Spark is lightning-fast at data processing and works well with the Hadoop ecosystem; you can read more about Spark at the Apache Spark home. For now, let’s talk about the ETL job. In my example, I’ll merge a parent table and a sub-dimension (type 2) table from a MySQL database and load them into a single dimension table in Hive with dynamic partitions. When building a warehouse on Hive, it is advisable to avoid snowflaking to reduce unnecessary joins, as each join task creates a map task. Just to raise your curiosity, the throughput on a standalone Spark deployment for this example job is 1M+ rows/min.
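To make the merge idea concrete, here is a minimal sketch in plain Python rather than PySpark, so it runs without a cluster. It is not the job from the post: the table contents, the column names (`customer_id`, `segment`, `valid_from`, `valid_to`, `country`) and the helper functions are all hypothetical, chosen only to show a parent table being flattened onto a type 2 sub-dimension and the result being grouped by a partition column taken from the data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical parent dimension: one row per customer.
parent_rows = [
    {"customer_id": 1, "name": "Acme", "country": "US"},
    {"customer_id": 2, "name": "Globex", "country": "DE"},
]

# Hypothetical type 2 sub-dimension: one row per version of a customer.
# valid_to=None marks the version that is currently in effect.
sub_rows = [
    {"customer_id": 1, "segment": "SMB",
     "valid_from": date(2014, 1, 1), "valid_to": date(2014, 12, 31)},
    {"customer_id": 1, "segment": "Enterprise",
     "valid_from": date(2015, 1, 1), "valid_to": None},
    {"customer_id": 2, "segment": "SMB",
     "valid_from": date(2014, 6, 1), "valid_to": None},
]


def merge_dimensions(parents, subs):
    """Flatten parent attributes onto every sub-dimension version."""
    parent_by_id = {p["customer_id"]: p for p in parents}
    merged = []
    for s in subs:
        p = parent_by_id.get(s["customer_id"])
        if p is None:
            continue  # inner-join semantics: drop versions without a parent
        row = dict(p)  # copy so the parent row is not mutated
        row.update(
            segment=s["segment"],
            valid_from=s["valid_from"],
            valid_to=s["valid_to"],
            is_current=s["valid_to"] is None,
        )
        merged.append(row)
    return merged


def partition_rows(rows, key):
    """Group rows by the value of `key`, like a dynamic partition column."""
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[key]].append(r)
    return dict(partitions)


merged = merge_dimensions(parent_rows, sub_rows)
partitions = partition_rows(merged, "country")

for value, rows in sorted(partitions.items()):
    print("country=%s: %d row(s)" % (value, len(rows)))
```

In PySpark the same shape would typically be a DataFrame join followed by a write partitioned on the chosen column; the point of the sketch is only that the flattened table carries one row per type 2 version, and that the partition values come from the data instead of being listed up front.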
I’m writing this post to help people who are trying to install Oozie in a Hadoop 2+ environment, as I had to look in several different places to fix the errors I encountered during the process. Here goes.
Step 1: Download Oozie 4.1 from the Apache URL and save the tarball to any directory
tar -zxf oozie-4.1.0.tar.gz
sudo mv oozie-4.1.0 /usr/local/oozie-4.1.0
Step 2: Make sure Maven is installed; if it is not, refer to the installation instructions here
Step 3: Update the pom.xml to change the default Hadoop version to 2.3.0. We’re not changing it to Hadoop 2.6.0 here because 2.3.0-oozie-4.1.0.jar is the latest available jar file. Luckily, it works with higher versions in the 2.x series
--Replace it with
Step 4: Build Oozie executable
mvn clean package assembly:single -P hadoop-2 -DskipTests
Step 5: The executable will be generated in the target subdirectory under the distro directory. Move it to a new folder under /usr/local/
tar -zxf oozie-4.1.0-distro.tar.gz
sudo mv oozie-4.1.0 /usr/local/oozie-4.1.0
With the changing trends in the world of BI and the Big Data wave everywhere, a lot of organizations have started initiatives to explore how it fits in. To leverage the data ecosystem to its fullest potential, it is necessary to think ahead and introduce new technology pieces in the right places. That way, in the long run, both business and IT will reap the benefits.
Here’s an interesting prediction by Gartner:
“By 2020, information will be used to reinvent, digitalize or eliminate 80% of business processes and products from a decade earlier.”
Imagine all the time, money and effort you’ll save on your existing data and infrastructure components if the Big Data implementation goes well. The architecture diagram below is a conceptual design of how you can leverage the computational power of the Hadoop ecosystem in your traditional BI / data warehousing processes, along with real-time analytics and data science. They call it a data lake; the warehouse is old school now.
Alright, having a Hadoop ecosystem saves computational time and provides all the bells and whistles of real-time analytics, but how does it save money?
“Tell me and I forget. Teach me and I remember. Involve me and I learn.”
I’m a big fan of practical learning; “implement as you learn” is my mantra for learning anything. Hadoop, being open source, gives you the best opportunity to get your hands dirty as you read about it. There are plenty of free resources online to help you get started, and in this post I’m going to list some of the good ones I’ve come across.
Getting Started with Hadoop
Depending on your level of interest in learning and exploring Hadoop, you can enroll in any of the free online fundamentals courses offered by Big Data University or watch the video tutorials from edureka on YouTube. These two sources do not require you to sign in with your corporate email ID, and they give a basic overview of what Hadoop is. And of course the documentation provided by Apache helps in understanding it in detail; alternatively, you can read the Yahoo Hadoop tutorial.
To be or not to be is the question to ask today. I’m not being Hamlet here, but with the evolution of Big Data and the current technology trends, I’m curious to discover the cases where it can overtake the conventional data warehouse, where it cannot and, most importantly, in which areas DW and Big Data can be implemented together.
The data warehousing concept has been around for more than four decades, while Big Data, in-memory and cloud concepts started to take hold in the early 2000s and are in high demand today. Since then, data has grown exponentially, and the competitive analysis of this data has led to the evolution of new tools and BI concepts. Does it mean a slow sunset for DW? I’d say that’s a myth: the business cases differ, and the ideal approach will surely be a blend of both.