HBase
• HBase is a horizontally scalable, distributed, open-source, sorted-map database.
• It runs on top of the Hadoop Distributed File System (HDFS).
• HBase is a NoSQL, non-relational database that doesn't always require a predefined schema.
• HBase is a column-oriented database.
• It can be seen as a scalable, flexible, multidimensional spreadsheet that fits data of any structure: new column fields can be added on the fly, and the column structure need not be fully defined before data is inserted or queried.
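To make the "sorted map" and "on-the-fly columns" ideas concrete, here is a minimal sketch in Python. It is a toy model, not the HBase client API: the class TinyTable and its methods are made up for illustration, and cell versions (timestamps) are left out. It assumes, as in HBase, that column families are fixed when the table is created while columns inside a family can appear on first write.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a table: row key -> column family -> column -> value."""

    def __init__(self, column_families):
        # Column families are declared up front when the table is created.
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Columns need no prior declaration; a new one appears on first write.
        self.rows[row_key][family][column] = value

    def scan(self):
        # Rows come back ordered by row key, as in a sorted map.
        for row_key in sorted(self.rows):
            families = self.rows[row_key]
            yield row_key, {fam: dict(cols) for fam, cols in families.items()}


table = TinyTable(["info"])
table.put("row2", "info", "name", "Bob")
table.put("row1", "info", "name", "Alice")
# A brand-new column, added without changing any schema.
table.put("row1", "info", "email", "alice@example.com")

row_keys = [row_key for row_key, _ in table.scan()]
```

Even though row2 was written first, the scan returns row1 before row2, because rows are kept in row-key order.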
Comparison of HBase vs. RDBMS
Relational Databases | HBase
Uses tables as databases | Uses regions as databases
Normalized | De-normalized
The technique used to store logs is commit logs | The technique used to store logs is Write-Ahead Logs (WAL)
The reference system used is coordinate system | The reference system used is ZooKeeper
Uses the primary key | Uses the row key
Use of rows, columns, and cells | Use of rows, column families, columns, and cells
Master server
• The master server coordinates the cluster and performs administrative operations, such as assigning regions and balancing the load.
• Region assignment:
• - Assigning regions on startup and re-assigning regions for recovery or load balancing
• - Monitoring all Region Server instances in the cluster (listens for notifications from ZooKeeper)
• Admin functions:
• - Interface for creating, deleting, and updating tables
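The assignment and recovery duties above can be sketched in a few lines of Python. This is only an illustration: assign_regions and recover are hypothetical helper names, and the round-robin and least-loaded rules are stand-ins chosen for brevity, not the balancing policy HBase actually uses.

```python
def assign_regions(regions, servers):
    """Spread regions over servers round-robin (a stand-in for a real balancer)."""
    assignment = {server: [] for server in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return assignment


def recover(assignment, failed_server):
    """Hand the failed server's regions to the least-loaded surviving servers."""
    orphaned = assignment.pop(failed_server)
    for region in orphaned:
        target = min(assignment, key=lambda server: len(assignment[server]))
        assignment[target].append(region)
    return assignment


all_regions = ["r1", "r2", "r3", "r4", "r5", "r6"]
assignment = assign_regions(all_regions, ["rs1", "rs2", "rs3"])
# Pretend the master was notified that rs2 went down.
assignment = recover(assignment, "rs2")
```

After recovery every region is still assigned to exactly one live server, which is the property the master has to maintain.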
Region server
• The region servers do the real work. A subset of the data of each table is handled by each region server. Clients talk to region servers to access data in HBase.
Region Server Components
• A Region Server runs on an HDFS data node and has the following components:
• WAL: The Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.
• BlockCache: The read cache. It stores frequently read data in memory. Least Recently Used data is evicted when the cache is full.
• MemStore: The write cache. It stores new data which has not yet been written to disk. The data is sorted before it is written to disk. There is one MemStore per column family per region.
• HFiles: Store the rows as sorted KeyValues on disk.
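The "Least Recently Used data is evicted when full" rule of the BlockCache can be shown with a small sketch. ToyBlockCache and load_from_disk are hypothetical names, and the capacity is counted in blocks to keep the example short; the real cache is more elaborate than this.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Read cache that evicts the least recently used block when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last

    def get(self, block_id, load_from_disk):
        if block_id in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        # Cache miss: read from disk, then remember the block.
        data = load_from_disk(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            # Drop the least recently used block.
            self.blocks.popitem(last=False)
        return data


disk_reads = []


def load_from_disk(block_id):
    disk_reads.append(block_id)
    return f"data-{block_id}"


cache = ToyBlockCache(capacity=2)
cache.get("a", load_from_disk)
cache.get("b", load_from_disk)
cache.get("a", load_from_disk)  # hit, no disk read
cache.get("c", load_from_disk)  # cache is full, so "b" is evicted
cache.get("b", load_from_disk)  # miss again, read from disk
```

Block "b" is the one evicted because "a" was touched more recently.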
Regions
• A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster, called “Region Servers,” and these serve data for reads and writes.
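Because each region covers a contiguous range of row keys, finding the region for a row is a search over the sorted start keys. The sketch below assumes three regions with made-up start keys and server names; the start key is treated as inclusive and the end key (the next region's start key) as exclusive.

```python
import bisect

# (start_key, region_server) pairs, sorted by start key.
# The first region starts at the empty key, so it covers every key below "g".
regions = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]
start_keys = [start_key for start_key, _ in regions]


def locate(row_key):
    """Return the server whose region contains row_key."""
    # The right region is the last one whose start key is <= row_key.
    index = bisect.bisect_right(start_keys, row_key) - 1
    return regions[index][1]


server_for_apple = locate("apple")
server_for_g = locate("g")
server_for_zebra = locate("zebra")
```

Here "apple" falls in the first region, "g" lands exactly on the start of the second, and "zebra" belongs to the last one.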
HBase MemStore
• The MemStore stores updates in memory as sorted KeyValues, in the same form as they would be stored in an HFile.
• There is one MemStore per column family.
• The updates are sorted per column family.
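A rough sketch of the MemStore idea follows. ToyMemStore is a hypothetical class, and two things are simplified on purpose: the flush is triggered by an entry count rather than by memory size, and the sorting happens at flush time instead of keeping the buffer sorted on every write.

```python
class ToyMemStore:
    """Write buffer for one column family; flushes sorted KeyValues when full."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # assumed: a simple entry count
        self.entries = {}
        self.flushed_files = []  # each flush yields one immutable sorted "file"

    def put(self, row_key, column, value):
        self.entries[(row_key, column)] = value
        if len(self.entries) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sort by (row key, column) so the file on disk is ordered.
        sorted_key_values = sorted(self.entries.items())
        self.flushed_files.append(sorted_key_values)
        self.entries = {}


store = ToyMemStore(flush_threshold=3)
store.put("row2", "name", "b")
store.put("row1", "name", "a")
store.put("row1", "age", "30")  # third entry triggers a flush
```

The writes arrive out of order, but the flushed file holds them sorted by row key and then by column.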
HFiles
• HFiles are the physical representation of data in HBase. Clients do not read HFiles directly but go through region servers to get to the data.
• HBase internally puts the data in indexed StoreFiles that exist on HDFS for high-speed lookups.
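Why an index over sorted data gives fast lookups can be seen in a short sketch. The functions build_hfile and lookup are hypothetical, and the layout (fixed-size blocks plus a list of each block's first key) is a simplified assumption, not the actual HFile format.

```python
import bisect


def build_hfile(sorted_key_values, block_size):
    """Split sorted KeyValues into blocks and index each block by its first key."""
    blocks = [
        sorted_key_values[i:i + block_size]
        for i in range(0, len(sorted_key_values), block_size)
    ]
    index = [block[0][0] for block in blocks]
    return blocks, index


def lookup(blocks, index, key):
    """Use the index to pick one block, then search only that block."""
    position = bisect.bisect_right(index, key) - 1
    if position < 0:
        return None  # key sorts before everything in the file
    for candidate, value in blocks[position]:
        if candidate == key:
            return value
    return None


key_values = [("a", 1), ("c", 2), ("e", 3), ("g", 4), ("i", 5)]
blocks, index = build_hfile(key_values, block_size=2)
found = lookup(blocks, index, "g")
missing = lookup(blocks, index, "b")
```

Only one block is scanned per lookup, no matter how many blocks the file holds.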
Write Ahead Log
• Abbreviated as WAL.
• The WAL is the lifeline that is needed when disaster strikes. Each Region Server adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to the MemStore. This ensures that HBase has durable writes.
• If the server crashes, it can replay that log to bring everything back to where the server was just before the crash.
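The "log first, then MemStore" order and the replay after a crash can be sketched as follows. ToyRegionServer is a hypothetical class, the log is a local JSON-lines file rather than a file on HDFS, and a delete simply removes the key, which is a simplification.

```python
import json
import os
import tempfile


class ToyRegionServer:
    """Appends every update to a log file before applying it in memory."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memstore = {}

    def _append(self, record):
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(record) + "\n")
            wal.flush()
            os.fsync(wal.fileno())  # make sure the record reached the disk

    def put(self, row_key, value):
        self._append({"op": "put", "row": row_key, "value": value})  # WAL first
        self.memstore[row_key] = value  # then the in-memory store

    def delete(self, row_key):
        self._append({"op": "delete", "row": row_key})
        self.memstore.pop(row_key, None)

    def replay(self):
        """Rebuild the in-memory state by re-applying the log in order."""
        self.memstore = {}
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                record = json.loads(line)
                if record["op"] == "put":
                    self.memstore[record["row"]] = record["value"]
                else:
                    self.memstore.pop(record["row"], None)


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
server = ToyRegionServer(wal_path)
server.put("row1", "a")
server.put("row2", "b")
server.delete("row1")

# Simulate a crash: a fresh server has an empty MemStore until it replays the log.
recovered = ToyRegionServer(wal_path)
recovered.replay()
```

After the replay the recovered server holds the same data the crashed one had in memory.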