04/05/2018
Data Access Components of Hadoop Ecosystem- Pig and Hive
Pig-
Apache Pig is a convenient tools developed by Yahoo for analysing huge data sets efficiently and easily. It provides a high level data flow language Pig Latin that is optimized, extensible and easy to use. The most outstanding feature of Pig programs is that their structure is open to considerable parallelization making it easy for handling large data sets.
Pig Use Case-
The personal healthcare data of an individual is confidential and should not be exposed to others. This information should be masked to maintain confidentiality but the healthcare data is so huge that identifying and removing personal healthcare data is crucial. Apache Pig can be used under such circumstances to de-identify health information.
Pig - Dataflow Language on Hadoop
Apache Pig is a high level procedural dataflow language on top of Hadoop for processing and analysing big data without having to write Java based MapReduce code. Apache Pig has RDBMS like features- joins, distinct clause, union, etc. For crunching large files containing semi-structured or unstructured data.
One cannot deny the importance of Hadoop MapReduce in processing big data but coding vanilla MapReduce jobs is not easy for people from a non-programming background. Apache Pig was developed at Yahoo by Alan Gates and his team to address this problem so that professionals without a programming background could also work with Hadoop.
Apache Pig Components
1) Pig Latin
It is a SQL like data flow language to join, group and aggregate distributed data sets with ease.
2) Pig Engine
Pig engine takes the Pig Latin scripts written by users, parses them, optimizes them and then executes them as a series of MapReduce jobs on a Hadoop Cluster.
Features of Apache Pig
1) Apache Pig has rich set of SQL like operators for joins, sort, filter, etc.
2) Developers can create their customized user defined function in Java and invoke them inside a Pig Latin Script. Developers can extend the existing operators and write functions for reading, writing and processing data.
3) It handles all kinds of data from diverse data sources-structured, semi-structured and unstructured.
When You Should Use Apache Pig
1) If the business use case requires processing multiple data sources then Pig could be an ideal choice. For example, if a business wants to analyse how a particular ad is performing then they have to combine data from multiple sources like -IP geo-location ,click through rates, web server traffic and other details to get an in-depth understanding of the customers on specific ads.
2) If the application requires handling Time Sensitive Data loads then Apache Pig could be a perfect choice, as it is built on top of hadoop and can scale out easily. Pig converts the scripts into MapReduce jobs and spreads the load across multiple servers for faster processing.
3) If the business requires analysis through sampling then Apache Pig should be considered to sample large datasets with a random distribution of data to gain meaningful analytic insights.
Advantages of Apache Pig
1) It is a procedural language and not declarative, unlike SQL, so has expressive power in transforming data at every step.
2) Users can control the ex*****on in every step. If a user wants to write user defined functions, it is pretty straightforward.
3) Has all the features offered by MapReduce like fault-tolerance, parallelization, and flexibility. Pig, in addition to the features already mentioned, has few additional RDBMS like features.
4) Learning curve for Apache Pig is steep. Even, if a person does not have a programming background he or she can easily pick-up and write PigLatin scripts as it is English like language to understand.
5) It enhances the productivity of big data developers by decreasing the development time, complexity and maintenance efforts.