While learning the new Tez engine and query vectorization concepts in Hadoop 2.0, I came to know that query vectorization is claimed to be 3x more powerful and to consume less CPU time on an actual Hadoop cluster. The Hortonworks tutorial uses sample sensor data in a CSV file that is imported into Hive, and a sample query is then used to demonstrate the performance.
The intention of this post is neither to explain the Tez engine and query vectorization nor Hive queries. Let us first walk through the problem I worked on before getting to the purpose of this post.
One sample CSV file called ‘HVAC.csv’ contains 8,000 records of temperature readings for different buildings on different days. Part of the file content:
Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID
6/1/13,0:00:01,66,58,13,20,4
6/2/13,1:00:01,69,68,3,20,17
6/3/13,2:00:01,70,73,17,20,18
6/4/13,3:00:01,67,63,2,23,15
6/5/13,4:00:01,68,74,16,9,3
…
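The tutorial imports this file into a Hive table. A minimal sketch of such a table definition follows; the table name `hvac`, the header-skipping property, and the load path are assumptions, not taken from the tutorial verbatim:

```sql
-- Sketch: expose HVAC.csv to Hive as a plain text table.
-- Table name `hvac` and the header-skip property are assumptions.
CREATE TABLE hvac (
  `date`      STRING,
  `time`      STRING,
  targettemp  INT,
  actualtemp  INT,
  `system`    INT,
  systemage   INT,
  buildingid  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count" = "1");

-- Hypothetical local path; adjust to where the CSV actually lives.
LOAD DATA LOCAL INPATH '/tmp/HVAC.csv' INTO TABLE hvac;
```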
In Hive, the following configurations are specified to enable the Tez engine and query vectorization.
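A typical set of these settings is sketched below. `hive.execution.engine` and `hive.vectorized.execution.enabled` are the standard switches for Tez and vectorization; the reduce-side line is a common companion setting, not the only possible combination:

```sql
-- Run Hive queries on Tez rather than classic MapReduce
set hive.execution.engine=tez;
-- Process rows in batches instead of one at a time
set hive.vectorized.execution.enabled=true;
-- Common companion setting: vectorize the reduce side too
set hive.vectorized.execution.reduce.enabled=true;
```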
I executed the following query in my sandbox; it surprisingly took 48 seconds for a ‘group by’ and ‘count’ on 8,000 records, as shown below:
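A sketch of that query, assuming the `hvac` table defined earlier:

```sql
SELECT `date`, COUNT(buildingid)
FROM hvac
GROUP BY `date`;
```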
This query groups the sensor data by date and counts the number of building readings for each date. It produces 30 results, one row per date.
Then I planned to write a simple program without the MapReduce castle, since it is just 8,000 records. I created an F# script that reads the CSV (note that I did not use any CSV type provider) using the Deedle exploratory data library (again, LINQ can also help). It achieves the same result, as sketched below.
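A minimal sketch of such a script with Deedle; the reference style and file name are assumptions (the era of this post would have used `#load "packages/Deedle/Deedle.fsx"` from NuGet), and the original script was longer:

```fsharp
// Reference Deedle; adjust to your environment
// (older FSI versions use #load "packages/Deedle/Deedle.fsx" instead).
#r "nuget: Deedle"
open Deedle

// Read the CSV into a data frame; no CSV type provider involved.
let hvac = Frame.ReadCsv "HVAC.csv"

// Take the Date column and count how many building readings fall on each date.
let countsPerDate =
    hvac.GetColumn<string> "Date"
    |> Series.values
    |> Seq.countBy id
    |> Seq.toList

// Print the 30 (date, count) pairs, mirroring the Hive output.
countsPerDate |> List.iter (fun (date, n) -> printfn "%s\t%d" date n)
```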
In FSI:
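Timing in F# Interactive is easiest with the built-in `#time` directive; a sketch of such a session (the script file name is hypothetical):

```fsharp
#time;;                    // ask FSI to report real and CPU time per evaluation
#load "HvacGroupBy.fsx";;  // hypothetical name for the script above
// After the evaluation, FSI prints a timing line of the form
// "Real: 00:00:00.083, CPU: ..."
```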
The whole computation completed within 83 milliseconds. You may argue that I am comparing apples with oranges. No! My intention is to understand when MapReduce is the savior. The moral of the above exercise is to be cautious and analyze well before moving your data processing mechanisms onto MapReduce clusters.
Elephants are very effective at labor that requires hard slogging and heavy lifting. They are not for your garden!!
Note that the sample CSV file from Hortonworks is clearly for training purposes. This blog post just takes it as an example to project the maximum data-generation capability of a small or medium-sized application over a period. The above script may not scale and will not perform well with much more than the above numbers. Hence, this is not an anti-MapReduce proposal.