Sunday, November 20, 2016

Checkpoint Status on NameNode


Checkpoint Status on SNN

While monitoring a Cloudera cluster, I came across an unhealthy node reporting the issue below, which deals with checkpointing.

“The filesystem checkpoint is 16 hour(s), 40 minute(s) old. This is 1600.75% of the configured checkpoint period of 1 hour(s). Critical threshold: 400.00%. 10,775 transactions have occurred since the last filesystem checkpoint. This is 1.08% of the configured checkpoint transaction target of 1,000,000.”

A checkpoint is supposed to fire when either of these thresholds is crossed:

Filesystem Checkpoint Period <dfs.namenode.checkpoint.period>: 1 hour
or
Filesystem Checkpoint Transaction Threshold <dfs.namenode.checkpoint.txns>: 1,000,000 transactions

Checkpointing is triggered by one of two conditions: if enough time has elapsed since the last checkpoint (dfs.namenode.checkpoint.period), or if enough new edit log transactions have accumulated (dfs.namenode.checkpoint.txns). The checkpointing node periodically checks whether either of these conditions is met (dfs.namenode.checkpoint.check.period) and, if so, kicks off the checkpointing process.
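You can read the effective values of these settings straight from the client configuration with hdfs getconf. A quick sketch (the property names are the real HDFS keys; the values in the comments are just this cluster's settings):

    hdfs getconf -confKey dfs.namenode.checkpoint.period        # seconds between checkpoints, e.g. 3600
    hdfs getconf -confKey dfs.namenode.checkpoint.txns          # transaction threshold, e.g. 1000000
    hdfs getconf -confKey dfs.namenode.checkpoint.check.period  # how often the conditions are re-checked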

Now, before drilling down to the solution, let’s first understand what checkpointing is all about.

Checkpointing is a critical part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of cluster health.
NameNode’s primary responsibility is storing the HDFS namespace. This means things like the directory tree, file permissions, and the mapping of files to block IDs. It’s important that this metadata (and all changes to it) are safely persisted to stable storage for fault tolerance.
This filesystem metadata is stored in two different constructs: the fsimage and the edit log. The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
Why is Checkpointing Important?

A typical edit ranges from 10s to 100s of bytes, but over time enough edits can accumulate to become unwieldy. A couple of problems can arise from these large edit logs. In extreme cases, it can fill up all the available disk capacity on a node, but more subtly, a large edit log can substantially delay NameNode startup as the NameNode reapplies all the edits. This is where checkpointing comes in.
Checkpointing is a process that takes an fsimage and edit log and compacts them into a new fsimage. This way, instead of replaying a potentially unbounded edit log, the NameNode can load the final in-memory state directly from the fsimage. This is a far more efficient operation and reduces NameNode startup time.
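If you ever need to force a checkpoint manually (for example, before a planned NameNode restart), the dfsadmin tool can do it. A minimal sketch, run as the HDFS superuser; note that saveNamespace requires safe mode, so namespace writes are briefly blocked:

    hdfs dfsadmin -safemode enter    # block namespace modifications
    hdfs dfsadmin -saveNamespace     # write a new fsimage and roll the edit log
    hdfs dfsadmin -safemode leave    # resume normal operation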


Checkpointing creates a new fsimage from an old fsimage and edit log.


However, creating a new fsimage is an I/O- and CPU-intensive operation, sometimes taking minutes to perform. During a checkpoint, the namesystem also needs to restrict concurrent access from other users. So, rather than pausing the active NameNode to perform a checkpoint, HDFS defers it to either the SecondaryNameNode or the Standby NameNode, depending on whether NameNode high-availability is configured; the mechanics of checkpointing differ accordingly.
Back to the issue: what is causing this, and how can I get it to stop? There can be multiple reasons: either a failed or improper upgrade procedure, or the Secondary NameNode having an incorrect ${dfs.namenode.checkpoint.dir}/current/VERSION file. In the second scenario, everything under the SNN's ${dfs.namenode.checkpoint.dir} directory needs to be wiped out and rebuilt so that checkpointing will work again.
For me, the second seems the more probable reason, as there were:
a) No edit logs being generated at all.
b) No fsimages at all.
c) No VERSION file.

Basically, these files live in the current directory under the checkpoint location:
/bigdata/dfs/snn/current; here snn denotes the Secondary NameNode's checkpoint directory.
The files under the current directory look like this:

-rw-r--r-- 1 hdfs hdfs   79098 Nov 20 23:01 edits_0000000000000086448-0000000000000087020
-rw-r--r-- 1 hdfs hdfs  297866 Nov 20 22:01 fsimage_0000000000000086447
-rw-r--r-- 1 hdfs hdfs      62 Nov 20 22:01 fsimage_0000000000000086447.md5
-rw-r--r-- 1 hdfs hdfs     172 Nov 20 23:01 VERSION
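If you want to confirm that the fsimage and edit files an SNN has produced are readable, Hadoop ships offline viewers for both. A sketch using the file names from the listing above:

    hdfs oiv -p XML -i fsimage_0000000000000086447 -o fsimage.xml                 # offline image viewer
    hdfs oev -i edits_0000000000000086448-0000000000000087020 -o edits.xml        # offline edits viewer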

Resolution

Follow these steps to resolve the issue (a consolidated shell sketch follows the list):
1.   Shut down the HDFS service(s).
2.   Log in to the Secondary NameNode host.
3.   cd to the value of ${dfs.namenode.checkpoint.dir}; you can find this under the Configuration tab of Cloudera Manager.
4.   mv current current.bad
5.   Start up the HDFS service(s) only.
6.   Wait for the HDFS services to come online.
7.   Start the remaining Hadoop services.
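In shell terms, steps 2-4 boil down to the following (the path is this cluster's ${dfs.namenode.checkpoint.dir}; substitute your own value from Cloudera Manager):

    # on the Secondary NameNode host, with HDFS stopped
    cd /bigdata/dfs/snn      # value of dfs.namenode.checkpoint.dir
    mv current current.bad   # the SNN rebuilds 'current' at the next checkpoint

Once HDFS is back up and the first checkpoint has completed successfully, current.bad can be deleted.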

Make sure not to suppress the checkpoint alert, as it is critical for keeping the node healthy.


HBase

·       HBase is a horizontally scalable, distributed, open-source, sorted map database.
·       It runs on top of the Hadoop Distributed File System (HDFS).
·       HBase is a NoSQL, non-relational database that doesn't always require a predefined schema.
·       HBase is a column-oriented database.
·       It can be seen as a scalable, flexible, multidimensional spreadsheet: any structure of data fits, new column fields can be added on the fly, and no fixed column structure is required before data can be inserted or queried (see the shell sketch below).
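A quick HBase shell sketch of that on-the-fly flexibility (the table name 'events' and column family 'cf' are made up for illustration):

    create 'events', 'cf'                       # only the table name and column family are declared up front
    put 'events', 'row1', 'cf:clicks', '12'     # the 'clicks' column is created on the fly
    put 'events', 'row1', 'cf:country', 'IN'    # so is 'country'; no schema change needed
    get 'events', 'row1'                        # read the row back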


Comparison of HBase vs. RDBMS

Relational Databases                                  HBase
--------------------------------------------------    ----------------------------------------------
Uses tables as databases                              Uses regions as databases
Normalized                                            De-normalized
Logs are stored as commit logs                        Logs are stored as Write-Ahead Logs (WAL)
The reference system used is the coordinate system    The reference system used is ZooKeeper
Uses the primary key                                  Uses the row key
Uses rows, columns, and cells                         Uses rows, column families, columns, and cells



Components of HBase




Master server
·       The master server coordinates the cluster and performs administrative operations, such as assigning regions and balancing the load:
        - Assigning regions on startup, and re-assigning regions for recovery or load balancing
        - Monitoring all RegionServer instances in the cluster (it listens for notifications from ZooKeeper)
        - Admin functions: an interface for creating, deleting, and updating tables (see the shell sketch below)
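Those admin functions are exposed through the HBase shell (among other interfaces). A minimal sketch, reusing the hypothetical 'events' table from above:

    create 'events', 'cf'            # create a table with one column family
    alter 'events', NAME => 'cf2'    # add or update a column family
    disable 'events'                 # a table must be disabled before dropping it
    drop 'events'                    # delete the table
    balancer                         # ask the master to run the region load balancer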

Region server
·       The region servers do the real work. A subset of the data of each table is handled by each region server. Clients talk to region servers to access data in HBase.



                   
Region Server Components

·       A Region Server runs on an HDFS DataNode and has the following components:
·       WAL: the Write-Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.
·       BlockCache: the read cache. It stores frequently read data in memory; least recently used data is evicted when the cache is full.
·       MemStore: the write cache. It stores new data that has not yet been written to disk, sorted before writing. There is one MemStore per column family per region.
·       HFiles: store the rows as sorted KeyValues on disk.


Regions
·       A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster, called “Region Servers,” and these serve data for reads and writes (see the meta-scan example below).
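Region assignments and boundaries can be inspected from the shell by scanning the catalog table (named hbase:meta in HBase 0.96 and later; .META. in older releases):

    scan 'hbase:meta', {COLUMNS => 'info:regioninfo'}   # shows each region's start/end key and state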



HBase MemStore
·       The MemStore stores updates in memory as sorted KeyValues, the same as they would be stored in an HFile.
·       There is one MemStore per column family per region.
·       The updates are sorted per column family (the flush example below shows the MemStore being written out).
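You can watch the MemStore-to-disk handoff from the shell: flush forces a table's MemStores to be written out as new HFiles. Again using the hypothetical 'events' table:

    flush 'events'    # persist the current MemStore contents to HFiles on HDFS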



HFiles
·       HFiles are the physical representation of data in HBase. Clients do not read HFiles directly but go through region servers to get to the data.
·       HBase internally puts the data in indexed StoreFiles that exist on HDFS for high-speed lookups (the sketch below shows how to inspect one).
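The bundled HFile tool lets you look inside a store file directly. The HDFS path below is a made-up example of where HBase keeps store files; substitute a real path from your cluster:

    hbase hfile -p -m -f hdfs:///hbase/data/default/events/REGION_ID/cf/HFILE_NAME
    # -p prints the KeyValues, -m prints the HFile meta block, -f names the file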





Write-Ahead Log:
·       WAL for short.
·       The WAL is the lifeline that is needed when disaster strikes. Each Region Server adds updates (Puts, Deletes) to its write-ahead log first, and then to the MemStore. This ensures that HBase has durable writes.
·       If the server crashes, it can effectively replay that log to bring everything back up to where the server should have been just before the crash (a log-inspection example follows below).
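Depending on your HBase version, the shipped WAL pretty-printer can read a log file for inspection (invoked as hbase wal in 1.x and later; older releases exposed it under an hlog subcommand). The path here is illustrative only:

    hbase wal -p /hbase/WALs/REGIONSERVER_NAME/WAL_FILE_NAME   # -p also prints the cell values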


HADOOP - Overview

Due to the advent of new technologies, devices, and communication means like social networking sites, the amount of data generated is growing rapidly every year. 

Hadoop is an ecosystem: an Apache open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

BENEFITS
Some of the reasons organizations use Hadoop are its ability to store, manage, and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly, and at low cost.
·            Scalability and Performance – distributed processing of data local to each node in a cluster enables Hadoop to store, manage, process, and analyze data at petabyte scale.
·            Reliability – large computing clusters are prone to failure of individual nodes. Hadoop is fundamentally resilient: when a node fails, processing is re-directed to the remaining nodes in the cluster and data is automatically re-replicated in preparation for future node failures.
·           Flexibility – unlike traditional relational database management systems, you don’t have to create structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse and apply a schema to the data when it is read.
·           Low Cost – unlike proprietary software, Hadoop is open source and runs on low-cost commodity hardware.

Hadoop Architecture

The Hadoop framework includes the following four modules (a small end-to-end example follows the list):
·        Hadoop Common: the Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
·        Hadoop YARN: a framework for job scheduling and cluster resource management.
·        Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
·        Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
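A tiny end-to-end sketch tying the modules together: HDFS stores the input, and YARN schedules a stock MapReduce example job over it. The local file name and HDFS paths are made up for illustration:

    hdfs dfs -mkdir -p /user/demo/input
    hdfs dfs -put words.txt /user/demo/input/          # HDFS: distributed storage
    yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /user/demo/input /user/demo/output   # MapReduce job, scheduled by YARN
    hdfs dfs -cat /user/demo/output/part-r-00000       # read the result back from HDFS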