HADOOP - Overview
With the advent of new technologies, devices, and communication channels such as social networking sites, the amount of data generated is growing rapidly every year.
Hadoop is an Apache open source software platform for distributed storage and distributed processing of very large data sets on clusters built from commodity hardware. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
BENEFITS
Organizations use Hadoop for its ability to store, manage, and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly, and at low cost.
· Scalability and Performance – distributed processing of data local to each node in
a cluster enables Hadoop to store, manage, process and analyze data at petabyte
scale.
· Reliability – large computing clusters are prone to failure of individual nodes. Hadoop is fundamentally resilient: when a node fails, processing is redirected to the remaining nodes in the cluster and data is automatically re-replicated in preparation for future node failures.
· Flexibility – unlike traditional relational database management systems, you don't have to create structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse and apply a schema to the data when it is read (a schema-on-read sketch follows this list).
· Low Cost – unlike proprietary software, Hadoop is open source and runs
on low-cost commodity hardware.
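To make the schema-on-read idea above concrete, here is a minimal sketch in Java using the Hadoop FileSystem API: a plain comma-delimited text file already stored in HDFS is read, and a simple three-field structure is imposed only at read time. The path /data/raw/events.txt and the field layout are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemaOnRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical raw file: stored as plain text, no schema declared up front.
    Path input = new Path("/data/raw/events.txt");

    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(input)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // The "schema" (timestamp,userId,action) is applied only here, at read time.
        String[] fields = line.split(",");
        if (fields.length < 3) {
          continue; // skip lines that do not match the assumed layout
        }
        System.out.printf("time=%s user=%s action=%s%n",
            fields[0], fields[1], fields[2]);
      }
    }
  }
}

The same raw file could later be read with a different parser and a different schema, without rewriting the stored data.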
Hadoop Architecture
The Hadoop framework includes the following four modules:
· Hadoop Common: the Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
· Hadoop YARN: a framework for job scheduling and cluster resource management.
· Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
· Hadoop MapReduce: a YARN-based system for parallel processing of large data sets (a minimal word count sketch follows this list).
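These modules come together in the classic word count job: the input files sit in HDFS, YARN schedules the map and reduce tasks across the cluster, and MapReduce performs the parallel counting. The sketch below follows the well-known Apache WordCount example; the input and output paths are taken from the command line and are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] and args[1] are placeholder HDFS input and output paths.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this would typically be submitted with something like hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths.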