Sunday, November 20, 2016

HBase

       HBase is a horizontally scalable, distributed, open-source database organized as a sorted map.
       It runs on top of the Hadoop Distributed File System (HDFS).
       HBase is a NoSQL, non-relational database that doesn't always require a predefined schema.
       HBase is a column-oriented database.
       It can be seen as a scalable, flexible, multidimensional spreadsheet that fits any structure of data: new column fields can be added on the fly, without defining the column structure before data is inserted or queried.
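
To make the "sorted, multidimensional map" idea concrete, here is a minimal Python sketch. It is a toy model, not the HBase client API: the class name ToyTable, its methods, and the sample rows are all made up for illustration. It assumes column families are declared up front while columns inside a family can be added at any time.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of a sorted, multidimensional map (hypothetical, not the HBase API)."""

    def __init__(self, column_families):
        # Assumption: column families are fixed when the table is created.
        self.column_families = set(column_families)
        # row key -> (family, qualifier) -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][(family, qualifier)][ts] = value

    def get(self, row, family, qualifier):
        # Return the value with the newest timestamp, or None if absent.
        versions = self.rows.get(row, {}).get((family, qualifier), {})
        return versions[max(versions)] if versions else None

    def scan(self):
        # Row keys come back in sorted order, as in a sorted map.
        return sorted(self.rows)


table = ToyTable(["info"])
table.put("user2", "info", "email", "b@example.com", ts=1)
table.put("user1", "info", "name", "Ann", ts=1)
table.put("user1", "info", "name", "Anna", ts=2)     # newer version of the same cell
table.put("user1", "info", "nickname", "A", ts=3)    # new column added on the fly
```

Here table.get("user1", "info", "name") returns the newest version of the cell, and table.scan() lists the row keys in sorted order regardless of insertion order.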


Comparison of HBase vs. RDBMS

Relational Databases                            | HBase
------------------------------------------------+-----------------------------------------------------------
Uses tables as databases                        | Uses regions as databases
Normalized                                      | De-normalized
The technique used to store logs is commit logs | The technique used to store logs is Write-Ahead Logs (WAL)
The reference system used is coordinate system  | The reference system used is ZooKeeper
Uses the primary key                            | Uses the row key
Use of rows, columns, and cells                 | Use of rows, column families, columns, and cells



Components of HBase




Master server
       The master server coordinates the cluster and performs administrative operations, such as assigning regions and balancing load.
       - Assigning regions on startup and reassigning regions for recovery or load balancing
       - Monitoring all Region Server instances in the cluster (it listens for notifications from ZooKeeper)
       Admin functions
       - Interface for creating, deleting, and updating tables
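
As a rough illustration of region assignment and reassignment, the Python sketch below spreads regions over servers round-robin and, when a server fails, hands its regions to the least loaded survivors. This is a deliberately simple stand-in and not the balancing logic HBase actually uses; the function names and the region and server names are hypothetical.

```python
def assign_regions(regions, servers):
    # Round-robin: region i goes to server i modulo the number of servers.
    assignment = {server: [] for server in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return assignment


def reassign_after_failure(assignment, failed_server):
    # Move each of the failed server's regions to the server holding the fewest.
    orphaned = assignment.pop(failed_server)
    for region in orphaned:
        least_loaded = min(assignment, key=lambda s: len(assignment[s]))
        assignment[least_loaded].append(region)
    return assignment


assignment = assign_regions(["r1", "r2", "r3", "r4", "r5", "r6"], ["rs1", "rs2", "rs3"])
initial = {server: list(regions) for server, regions in assignment.items()}
recovered = reassign_after_failure(assignment, "rs3")
```

After the simulated failure of rs3, every region is still assigned to exactly one live server and the load stays even.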

Region server
       The region servers do the real work. Each region server handles a subset of each table's data. Clients talk to region servers to access data in HBase.



                   
                  Region Server Components

       A Region Server runs on an HDFS data node and has the following components:
       WAL: The Write-Ahead Log is a file on the distributed file system. It stores new data that hasn't yet been persisted to permanent storage and is used for recovery in case of failure.
       BlockCache: The read cache. It stores frequently read data in memory. The least recently used data is evicted when the cache is full.
       MemStore: The write cache. It stores new data that has not yet been written to disk. The data is sorted before it is written to disk. There is one MemStore per column family per region.
       HFiles: Store the rows as sorted KeyValues on disk.
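
The least-recently-used eviction described for the BlockCache can be sketched in a few lines of Python. This is only an illustration of the LRU idea; the class ToyBlockCache, its capacity measured in number of blocks, and the block ids are assumptions, and the real cache is more elaborate.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Toy LRU read cache (hypothetical; capacity is counted in blocks)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last

    def get(self, block_id):
        if block_id not in self.blocks:
            return None  # cache miss: the caller would read the block from an HFile
        self.blocks.move_to_end(block_id)  # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block


cache = ToyBlockCache(capacity=2)
cache.put("block-a", "A")
cache.put("block-b", "B")
cache.get("block-a")        # touch block-a, so block-b is now the oldest
cache.put("block-c", "C")   # cache is full: block-b is evicted
```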


Regions
       A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster, called “Region Servers,” and these serve data for reads and writes.
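
Because each region covers a contiguous range of row keys, finding the region for a given row is a search over sorted start keys. The Python sketch below shows the idea with a hypothetical three-region table; the key boundaries and server names are invented, and it assumes the start key is inclusive, the end key is exclusive, and an empty string means "no bound".

```python
import bisect

# Hypothetical regions of one table, sorted by start key.
# Each entry is (start key inclusive, end key exclusive, serving region server).
REGIONS = [
    ("", "g", "regionserver-1"),
    ("g", "p", "regionserver-2"),
    ("p", "", "regionserver-3"),
]
START_KEYS = [start for start, _end, _server in REGIONS]


def locate_region(row_key):
    # Pick the last region whose start key is <= row_key.
    index = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index]
```

For example, locate_region("apple") falls in the first region, while a row key equal to a boundary such as "g" belongs to the region that starts at that key.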



           HBase MemStore
       The MemStore stores updates in memory as sorted KeyValues, the same way they would be stored in an HFile.
       There is one MemStore per column family per region.
       The updates are sorted per column family.
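
The relationship between the MemStore and HFiles can be sketched as an in-memory buffer that is written out in key order once it fills up. The Python below is a simplified model for a single column family: ToyMemStore is a made-up name, the flush threshold counted in cells is an assumption for the sake of the example, and each "HFile" is just a sorted list held in memory.

```python
class ToyMemStore:
    """Toy write buffer for one column family (hypothetical, not HBase code)."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # assumed limit: number of buffered cells
        self.cells = {}    # (row, qualifier) -> value, held in memory
        self.hfiles = []   # each flush appends one immutable, sorted list of cells

    def put(self, row, qualifier, value):
        self.cells[(row, qualifier)] = value
        if len(self.cells) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffered cells out in key order, then start with an empty buffer.
        if self.cells:
            self.hfiles.append(sorted(self.cells.items()))
            self.cells = {}


store = ToyMemStore(flush_threshold=3)
store.put("row2", "name", "b")
store.put("row1", "name", "a")
store.put("row3", "name", "c")   # third cell reaches the threshold and triggers a flush
```

Although the rows arrive out of order, the flushed file lists them sorted by key, which is what makes later lookups in the file efficient.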



HFiles
       HFiles are the physical representation of data in HBase. Clients do not read HFiles directly but go through region servers to get to the data.
       HBase internally puts the data in indexed StoreFiles that exist on HDFS for high-speed lookups.





Write Ahead Log:
       Abbreviated as WAL.
       The WAL is the lifeline that is needed when disaster strikes. Each Region Server adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to the MemStore. This ensures that HBase has durable writes.
       If the server crashes, it can replay that log to restore everything up to where the server was just before the crash.
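
The "log first, then memory" ordering and the replay after a crash can be illustrated with a small Python sketch. It is a toy model under stated assumptions: ToyRegionServer is a hypothetical class, a plain Python list stands in for the durable log file on HDFS, and a dict stands in for the MemStore.

```python
class ToyRegionServer:
    """Toy write path: append to the log first, then update memory (hypothetical)."""

    def __init__(self, wal):
        self.wal = wal        # stands in for the durable log file
        self.memstore = {}    # in-memory state, lost on a crash

    def put(self, row, value):
        self.wal.append(("put", row, value))   # 1. record the edit in the log
        self.memstore[row] = value             # 2. then apply it in memory

    def delete(self, row):
        self.wal.append(("delete", row, None))
        self.memstore.pop(row, None)

    @classmethod
    def recover(cls, wal):
        # Rebuild the in-memory state by replaying the logged edits in order.
        server = cls(wal)
        for op, row, value in list(wal):
            if op == "put":
                server.memstore[row] = value
            else:
                server.memstore.pop(row, None)
        return server


wal = []
server = ToyRegionServer(wal)
server.put("row1", "a")
server.put("row2", "b")
server.delete("row1")
del server                                  # simulated crash: memory is gone, the log survives
recovered = ToyRegionServer.recover(wal)
```

Since every edit reached the log before it reached memory, replaying the log brings the recovered server back to the state it had just before the crash.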

