Hadoop Framework

Gain some knowledge of Hadoop that may be useful when building your own project.

On this page we share knowledge about Hadoop and related technologies such as HBase, Hive, Pig, Sqoop, MapReduce, MongoDB, Cassandra, Spark, Storm, Mahout, Flume, Impala, Oozie and ZooKeeper, along with their installation and configuration.

04/05/2018

Data Access Components of Hadoop Ecosystem- Pig and Hive
Pig-
Apache Pig is a convenient tool developed at Yahoo for analysing huge data sets efficiently and easily. It provides a high-level data flow language, Pig Latin, that is optimized, extensible and easy to use. The most outstanding feature of Pig programs is that their structure is open to considerable parallelization, which makes it easy to handle large data sets.

Pig Use Case-
The personal healthcare data of an individual is confidential and should not be exposed to others. This information should be masked to maintain confidentiality, but healthcare data sets are so large that identifying and removing the personal details by hand is impractical. Apache Pig can be used in such circumstances to de-identify health information at scale.
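To make the idea of de-identification concrete, here is a minimal sketch in Python rather than Pig Latin. The field names (name, phone, patient_id), the sample records and the salt are all assumptions made up for illustration. Replacing an id with a salted hash is only one simple form of masking, and a real project would follow its own privacy rules.

```python
import hashlib

# Hypothetical field names; a real data set defines its own schema.
IDENTIFYING_FIELDS = {"name", "phone"}


def deidentify(record, salt):
    """Drop identifying fields and replace the patient id with a salted hash.

    Rows that belong to the same patient still share a token, so they can be
    grouped for analysis without showing who the patient is.
    """
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    raw = (salt + str(record["patient_id"])).encode("utf-8")
    cleaned["patient_id"] = hashlib.sha256(raw).hexdigest()[:16]
    return cleaned


# Made-up sample records for illustration only.
records = [
    {"patient_id": 101, "name": "A. Rao", "phone": "555-0100", "diagnosis": "asthma"},
    {"patient_id": 101, "name": "A. Rao", "phone": "555-0100", "diagnosis": "allergy"},
    {"patient_id": 202, "name": "B. Sen", "phone": "555-0199", "diagnosis": "diabetes"},
]

masked = [deidentify(r, salt="example-salt") for r in records]
for row in masked:
    print(row)
```

In a Pig script the same per-record logic would typically live in a user defined function that is applied to every row of the data set.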

Pig - Dataflow Language on Hadoop
Apache Pig is a high-level procedural dataflow language on top of Hadoop for processing and analysing big data without having to write Java-based MapReduce code. Apache Pig has RDBMS-like features (joins, the distinct clause, union, etc.) for crunching large files containing semi-structured or unstructured data.
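The step-by-step nature of a dataflow can be illustrated with a small Python sketch. It is not Pig Latin and does not run on Hadoop. It only mirrors the union, distinct, join and group steps on tiny in-memory lists, and all the data and variable names are made up for illustration.

```python
# Two small "relations" as lists of tuples (hypothetical sample data).
clicks_web = [("u1", "ad7"), ("u2", "ad7"), ("u1", "ad7")]
clicks_app = [("u3", "ad9"), ("u2", "ad7")]
users = [("u1", "IN"), ("u2", "US"), ("u3", "IN")]

# Step 1: UNION the two click sources.
all_clicks = clicks_web + clicks_app

# Step 2: DISTINCT removes duplicate rows (dict keys keep first-seen order).
distinct_clicks = list(dict.fromkeys(all_clicks))

# Step 3: JOIN clicks with users on the user id.
country_of = dict(users)
joined = [
    (user, ad, country_of[user])
    for user, ad in distinct_clicks
    if user in country_of
]

# Step 4: GROUP by country and count the clicks in each group.
clicks_per_country = {}
for _user, _ad, country in joined:
    clicks_per_country[country] = clicks_per_country.get(country, 0) + 1

print(clicks_per_country)
```

Each step produces a named intermediate result that the next step consumes, which is the same style in which a Pig Latin script is written.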

One cannot deny the importance of Hadoop MapReduce in processing big data, but coding plain MapReduce jobs is not easy for people from a non-programming background. Apache Pig was developed at Yahoo by Alan Gates and his team to address this problem, so that professionals without a programming background could also work with Hadoop.

Apache Pig Components
1) Pig Latin
It is a SQL like data flow language to join, group and aggregate distributed data sets with ease.
2) Pig Engine
Pig engine takes the Pig Latin scripts written by users, parses them, optimizes them and then executes them as a series of MapReduce jobs on a Hadoop Cluster.

Features of Apache Pig
1) Apache Pig has a rich set of SQL-like operators for joins, sort, filter, etc.
2) Developers can create their own user defined functions in Java and invoke them inside a Pig Latin script. Developers can also extend the existing operators and write functions for reading, writing and processing data.
3) It handles all kinds of data from diverse data sources: structured, semi-structured and unstructured.

When You Should Use Apache Pig
1) If the business use case requires processing multiple data sources, then Pig could be an ideal choice. For example, if a business wants to analyse how a particular ad is performing, it has to combine data from multiple sources such as IP geo-location, click-through rates, web server traffic and other details to get an in-depth understanding of how customers respond to specific ads.
2) If the application requires handling time-sensitive data loads, then Apache Pig could be a good choice, as it is built on top of Hadoop and can scale out easily. Pig converts the scripts into MapReduce jobs and spreads the load across multiple servers for faster processing.
3) If the business requires analysis through sampling, then Apache Pig should be considered for drawing random samples from large datasets to gain meaningful analytic insights.
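The sampling idea in point 3 can be sketched in a few lines of Python. This is only an illustration, similar in spirit to the SAMPLE operator of Pig Latin: every record is kept independently with a given probability. The function name, the seed and the data are assumptions for the example.

```python
import random


def sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [r for r in records if rng.random() < fraction]


# Hypothetical data set: the numbers 0..9999 stand in for real records.
records = list(range(10_000))
subset = sample(records, fraction=0.1, seed=42)

# The size is close to 10% of the input but not exactly 1000,
# because every record is decided on its own.
print(len(subset))
```

Because each record is decided independently, this kind of sampling needs no knowledge of the total size, so it can be applied in parallel to each part of a large data set.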

Advantages of Apache Pig
1) It is a procedural language rather than a declarative one like SQL, so it gives expressive power to transform data at every step.
2) Users can control the execution at every step. Writing user defined functions is straightforward.
3) It has all the features offered by MapReduce, such as fault tolerance, parallelization and flexibility. In addition, Pig has a few RDBMS-like features.
4) The learning curve for Apache Pig is not steep. Even a person without a programming background can pick up Pig Latin and write scripts, because the language reads much like English.
5) It enhances the productivity of big data developers by reducing development time, complexity and maintenance effort.

26/03/2018

Continuation of YARN:
Understanding the Differences between the Components of Hadoop 1.0 and Hadoop 2.0 (Refer to the figure (1))

Hadoop 1.0, the so-called MRv1, mainly consists of 3 important components:

1) Resource Management:
This is an infrastructure component that takes care of monitoring the nodes, allocating the resources and scheduling various jobs.

2) Application Programming Interface (API):
This component is for the users to program various MapReduce applications.

3) Framework:
This component is for all the runtime services such as Shuffling, Sorting and executing Map and Reduce processes.
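The runtime services named above (map, shuffle, sort and reduce) can be imitated on a single machine with a short Python sketch. This is a simplified word count for illustration only, not Hadoop code, and the function names and input lines are assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle: collect all values of the same key, then sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


# Hypothetical input lines.
lines = ["pig hive pig", "hive yarn", "pig"]

mapped = [pair for line in lines for pair in map_phase(line)]
result = [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]
print(result)  # [('hive', 2), ('pig', 3), ('yarn', 1)]
```

On a real cluster the map and reduce calls run on many machines at once, and the framework moves the grouped data between them over the network.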

The major difference in Hadoop 2.0 is that the cluster resource management capabilities are moved out of MapReduce and into YARN. Refer to Figure (2).

YARN
YARN has taken over the cluster management responsibilities from MapReduce, so MapReduce now takes care of data processing only and the other responsibilities are handled by YARN.

Hadoop 2.0 (YARN) and Its Components - Refer to the figure (3)

In Hadoop 2.0, the responsibilities of the Hadoop 1.0 Job Tracker are split across 3 important YARN components:

1. Resource Manager Component:
This component is the negotiator of all the resources in the cluster. The Resource Manager is further divided into an Applications Manager, which manages the user jobs submitted to the cluster, and a pluggable Scheduler. It is a long-running YARN service designed to receive and run applications on the Hadoop cluster. In Hadoop 2.0, a MapReduce job is treated as one application.

2. Node Manager Component:
This is the per-node agent of YARN. The NM launches and monitors the containers that run on its node, tracks their resource usage and reports the status of the node to the Resource Manager. Information about completed MapReduce jobs is served by a separate daemon, the Job History Server.

3. Application Master Component (aka User Job Life Cycle Manager):
This is the component where the job actually resides. An Application Master is responsible for managing a single application, such as one MapReduce job, and it exits once the job completes processing.

A list on Hadoop 2.0 Components
RM-Resource Manager
1. It is the global resource scheduler.

2. It runs on the Master Node of the cluster.

3. It is responsible for negotiating the resources of the system amongst the competing applications.

4. It keeps track of the heartbeats from the Node Managers.

NM-Node Manager
1. The Node Manager communicates with the Resource Manager.

2. It runs on the Slave Nodes of the cluster.

AM-Application Master
1. There is one AM per application, and it is application specific or framework specific.

2. The AM runs in a container that the Resource Manager allocates on request and a Node Manager launches.
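How the three components work together can be shown with a toy model in Python. This is a simplified sketch and not the real YARN API: the class names only echo the component names, memory is the only resource, and the placement rule (pick the node with the most free memory) is an assumption. In real YARN the Application Master also contacts the Node Manager itself to start a container, a step that is folded into the allocation here.

```python
class NodeManager:
    """Per-node agent: launches containers and reports free capacity."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

    def launch(self, app_id, memory_mb):
        self.free_mb -= memory_mb
        return {"app": app_id, "node": self.name, "memory_mb": memory_mb}

    def heartbeat(self):
        return {"node": self.name, "free_mb": self.free_mb}


class ResourceManager:
    """Global scheduler: decides which node serves each container request."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, app_id, memory_mb):
        # Use the latest heartbeat of every node to find those with room.
        fits = [nm for nm in self.node_managers
                if nm.heartbeat()["free_mb"] >= memory_mb]
        if not fits:
            return None  # the cluster is full for a request of this size
        chosen = max(fits, key=lambda nm: nm.free_mb)
        return chosen.launch(app_id, memory_mb)


class ApplicationMaster:
    """One per application: asks the Resource Manager for containers."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.resource_manager = resource_manager

    def request_containers(self, count, memory_mb):
        granted = []
        for _ in range(count):
            container = self.resource_manager.allocate(self.app_id, memory_mb)
            if container is not None:
                granted.append(container)
        return granted


# Hypothetical two-node cluster with 4 GB and 2 GB of memory.
nodes = [NodeManager("node1", 4096), NodeManager("node2", 2048)]
rm = ResourceManager(nodes)
am = ApplicationMaster("app_0001", rm)

# Four containers of 2 GB are requested, but only three fit in the cluster.
granted = am.request_containers(count=4, memory_mb=2048)
print(len(granted), [nm.heartbeat() for nm in nodes])
```

The point of the split is visible even in this toy: the Resource Manager knows nothing about MapReduce, so any kind of application can bring its own Application Master and share the same cluster.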

Migration from Hadoop 1.0 to Hadoop 2.0
With the advent of the YARN framework as part of the Hadoop 2.0 platform, Hadoop programmers now have several applications and tools that help them get more out of big data than was previously possible.

By separating the cluster resource management function completely from the data processing function, YARN gives organizations capabilities that go well beyond MapReduce. Because it imposes fewer programming constraints and is cost effective, companies generally prefer to migrate their applications from Hadoop 1.0 to Hadoop 2.0. A further advantage for Hadoop users is backward compatibility: an existing MapReduce job can in most cases run on Hadoop 2.0 without modification, which makes the decision to migrate an easy one.

Although most Hadoop applications have migrated from Hadoop 1.0 to Hadoop 2.0, some migrations are still in progress, and companies continue to work on this long-needed upgrade for their applications.

With Hadoop YARN, Hadoop developers can build applications directly on Hadoop instead of bolting them on through outside third-party vendor tools, as was the case with Hadoop 1.0. This is another important reason why companies that currently use Hadoop will adopt Hadoop 2.0 as a platform for creating applications and manipulating data more effectively and efficiently.

YARN is the elephant-sized change that Hadoop 2.0 has brought in. There are undoubtedly many challenges as companies migrate from Hadoop 1.0 to Hadoop 2.0, but the basic changes to the MR framework will make Hadoop more usable in upcoming big data scenarios. Because Hadoop 2.0 offers better isolation and scalability than the earlier version, it is anticipated that several new tools will soon appear that get the most out of the new features in YARN (Hadoop 2.0).

25/03/2018

Today we are posting some information about YARN.

Hadoop 2.0 (YARN) Framework - The Gateway to Easier Programming for Hadoop Users
With the rapid evolution of Big Data, its processing frameworks are evolving quickly as well. Hadoop has progressed from the more restricted processing model of batch-oriented MapReduce jobs (Hadoop 1.0) to specialized and interactive processing models (Hadoop 2.0). With the advent of Hadoop 2.0, organizations can build data crunching methodologies within Hadoop that the architectural limitations of Hadoop 1.0 did not allow. In this post we give readers an insight into Hadoop 2.0 (YARN) and help them understand the need to switch from Hadoop 1.0 to Hadoop 2.0.

Evolution of Hadoop 2.0 (YARN) - Swiss Army Knife of Big Data:
Since its introduction in 2005 to support distributed processing of large-scale data workloads on clusters through the MapReduce processing engine, Hadoop has undergone a major overhaul. The result is a more advanced Hadoop framework that supports not only MapReduce but various other distributed processing models as well.

The data giants of the web, such as Yahoo and Facebook, that adopted Apache Hadoop had to depend on the combination of Hadoop HDFS with the resource management environment and MapReduce programming. Together these technologies enabled users to manage, process and store huge amounts of semi-structured, structured or unstructured data within Hadoop clusters. Nevertheless, the pairing with MapReduce had certain intrinsic drawbacks. For instance, because of the batch processing arrangement of MapReduce, users of Hadoop 1.0 were unable to keep pace with the flood of information they were collecting online.

Introduction to Hadoop YARN (Hadoop 2.0): Refer to Figure (1)
Hadoop 2.0, popularly known by the name of its central component YARN (Yet Another Resource Negotiator), was introduced in Oct 2013 and is now widely used for processing and managing distributed big data.

Hadoop YARN is an advancement over Hadoop 1.0 that provides performance enhancements benefiting all the technologies connected with the Hadoop ecosystem, including the Hive data warehouse and the Hadoop database (HBase). Hadoop YARN comes with the Hadoop 2.x distributions shipped by Hadoop distributors. YARN performs job scheduling and resource management without requiring users to run Hadoop MapReduce on Hadoop systems.

Hadoop YARN has a modified architecture compared with Hadoop 1.0, so that systems can scale up to new levels and responsibilities can be clearly assigned to the various components of the Hadoop cluster.

Need to Switch from Hadoop 1.0 to Hadoop 2.0 (YARN)
The first version of Hadoop had both advantages and disadvantages. Hadoop MapReduce is an established standard for big data processing systems, but its architecture has some drawbacks that generally surface when dealing with huge clusters.

Limitations of Hadoop 1.0
1) Issue of Availability:
The Hadoop 1.0 architecture had a single point of failure, the Job Tracker: if the Job Tracker fails, all the jobs have to restart.

2) Issue of Scalability:
The Job Tracker runs on a single machine and performs various tasks such as monitoring, job scheduling, task scheduling and resource management. Although several machines (Data Nodes) were available, they were not utilized efficiently, which limited the scalability of the system.

3) Cascading Failure Issue:
In Hadoop MapReduce, instability is observed when the number of nodes in a cluster exceeds 4000. The most common failure observed is the cascading failure, which can cause the overall cluster to deteriorate as nodes become overloaded and data replication floods the network.

4) Multi-Tenancy Issue:
The major issue with Hadoop MapReduce that paved the way for Hadoop YARN was multi-tenancy. As clusters in Hadoop systems grow in size, they can be employed for a wide range of processing models.

Hadoop MapReduce dedicates the nodes of the cluster to MapReduce, so they cannot be repurposed for other big data workloads and applications. With Big Data and Hadoop dominating data processing applications for cloud deployments, the number of nodes in a cluster is likely to increase, and the switch from 1.x to 2.x addresses this issue.

These are not the only limitations of Hadoop MapReduce. Apart from the issues mentioned above, Hadoop programmers raised several other concerns with version 1.0, such as inefficient utilization of resources, constraints on running non-MapReduce applications, and limitations in running ad-hoc queries, real-time analysis and message-passing approaches.

