Hadoop Administration & Performance Tuning

Wednesday, May 2, 2018

Zookeeper Issue : Last transaction was partial

Zookeeper Issue : Last transaction was partial

Issue: ZooKeeper was continuously crashing with below error

2018-03-23 12:20:53,374 ERROR org.apache.zookeeper.server.persistence.Util: Last transaction was partial.

2018-03-23 12:20:53,375 ERROR org.apache.zookeeper.server.ZooKeeperServerMain: Unexpected exception, exiting abnormally

Brief: Like any other transnational system; zookeeper works just like any other software where any transactions that are related to state of zookeeper will be written first to the disk and then its updates takes place. When the transaction log file reaches a certain size, a new transaction log file gets created.

ZooKeeper stores its data in a data directory and its transaction log in a transaction log directory.

Data Directory

myid - contains a single integer in human readable ASCII text that represents the server id.
snapshot.<zxid> - holds the fuzzy snapshot of a data tree.

Log Directory

The Log Directory contains the ZooKeeper transaction logs. Before any update takes place, Zookeeper ensures that the transaction that represents the update is written to non-volatile storage.

A new log file is started each time a snapshot is begun. The log file's suffix is the first zxid written to that log.

After talking to the team and getting info what all was done prior this error; came to know Hadoop eco system was aborted once during one transaction activity.

Since the termination was abnormal and issue was related to transaction entry so I rush to LOG directory and saw Zookeeper transactions log file was sized 0<junk file> ; which means that while booting up ; zookeeper was trying apply logs for the consistent state but due to 0 sized it didn’t manage to replay the transactions and hence failed to start.

-rw-r--r-- 1 zookeeper zookeeper 0 Mar 23 03:17 log.da092

Solution: I removed the junk log file and started Zookeeper again and succeeded. Simple enough!!!!

Writing to transaction log files is not efficient step for a heavy loaded system, because let say on startup zookeeper would have to replay every transaction it ever processed.

So periodically, zookeeper will write a snapshot of the current state of its in memory database to file.

In short both snapshot and transactions logs are very important to zookeeper.

Enjoy Learning!!!

Tuesday, December 13, 2016

HBASE: Table already exists

Last weekend while expermenting on Hbase component of Hadoop ecosystem; and i came across an issue while creating table.

=> Hbase::Table - emp

hbase(main):003:0> create 'emp', 'personal data', 'professional data'

ERROR: Table already exists: emp!

Cool so table is already present; that’s ok; might be it is already present; sounds easy peasy right but somehow it turned out to be very interesting for me.

At first i checked its existence and i listed all the table and unfortunately it’s not there.

hbase(main):004:0> list

TABLE

emp_v

random_test

temp_test

So to outsmart Hbase:) i tried deletion

hbase(main):005:0> drop 'emp'

ERROR: Table emp does not exist.

Excellent so in one context Hbase table is not present but it is present sounds whacky right; but all in all its good there is something for me to dig in and there is a saying "They Say Half Knowledge is a Dangerous thing"; so why not add another half and make it full.

I went ahead and tried to see if there is any trace of a file which might left behind and from that Hbase might be reading.

I checked that and I saw only one namespace i.e. "default"

#hdfs dfs -ls /hbase/data

drwxr-xr-x - hbase hbase 0 2016-12-11 20:51 /hbase/data/default

you can see nowhere a single directory.

# hdfs dfs -ls /hbase/data/default

drwxr-xr-x - hbase hbase 0 2016-12-08 18:20 /hbase/data/default/emp_v

drwxr-xr-x - hbase hbase 0 2016-12-09 13:45 /hbase/data/default/random_test

drwxr-xr-x - hbase hbase 0 2016-12-09 00:08 /hbase/data/default/temp_test

So above didn’t helped either; but this is true that Hbase is reading it from somewhere but where; i started from basics that how client coordinate with Hbase and this proved to be fruitful step for me.

Way forward was "The Zookeeper".

A distributed HBase relies completely on Zookeeper (for cluster configuration and management). In HBase, Zookeeper coordinates, communicates, and shares state between the Masters and Region Servers.

Zookeeper is a client/server system for distributed coordination that exposes an interface similar to a filesystem, where each node (called a Znode) may contain data and a set of children.

Each Znode has a name and can be identified using a filesystem-like path (for example, /root-znode/sub-znode/my-znode).

Znode which seems to be important for this context

Znode	Description
hbase/table (zookeeper.znode.masterTableEnableDisable)	Used by the master to track the table state during assignments (disabling/enabling states, for example).

Zookeeper can be accessed via zkcli and i logged in to zookeeper client via hbase zkcli or you may use zookeeper-client command.

i did ls command and issue seems cracked; here is the bright and shiny emp table.

ls /hbase/table

[ random_test, emp, hbase:meta, hbase:namespace, temp_test, emp_v]

As I mentioned above as part of Znode that HBase stores some data schema and state of each table in zookeeper to be able to coordinate between all the region servers .

You might be wondering why table metadata still present although dropped from Hbase??

What is the solution??

Since it seems an orphan kind of table to me; so why not remove the metadata from zookeeper; Zookeeper provides rmr command to remove a specified Znode.

So from same login I executed below command which expected to remove a Znode i.e. <Metadata of emp table>

rmr /hbase/emp

and I checked ls command again that whether emp table is still present or not; finally table is removed a big relief to me.

ls /hbase/table

[ random_test, hbase:meta, hbase:namespace, temp_test, emp_v]

And I created table again and this time I succeeded J

hbase(main):001:0> create 'emp', 'personal data', 'professional data'

0 row(s) in 2.5890 seconds

=> Hbase::Table - emp

Sunday, November 20, 2016

Checkpoint Status on Name Node

Checkpoint Status on SNN

While monitoring Cloudera’s ecosystem I came across an unhealthy node pointing to below issue which deals with Checkpointing.

“The filesystem checkpoint is 16 hour(s), 40 minute(s) old. This is 1600.75% of the configured checkpoint period of 1 hour(s). Critical threshold: 400.00%. 10,775 transactions have occurred since the last filesystem checkpoint. This is 1.08% of the configured checkpoint transaction target of 1,000,000."

Checkpoint supposed to be fired

Filesystem Checkpoint Period<dfs.namenode.checkpoint.period> : 1 Hours

Filesystem Checkpoint Transaction Threshold<dfs.namenode.checkpoint.txns> 1,000,000

Checkpointing is triggered by one of two conditions: if enough time has elapsed since the last checkpoint (dfs.namenode.checkpoint.period), or if enough new edit log transactions have accumulated (dfs.namenode.checkpoint.txns). The checkpointing node periodically checks if either of these conditions are met (dfs.namenode.checkpoint.check.period), and if so, kicks off the checkpointing process.

Now prior drilling down to the solution; let’s first understand what is Checkpoint all about.

Checkpointing is critical part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of cluster health.

NameNode’s primary responsibility is storing the HDFS namespace. This means things like the directory tree, file permissions, and the mapping of files to block IDs. It’s important that this metadata (and all changes to it) are safely persisted to stable storage for fault tolerance.

This filesystem metadata is stored in two different constructs: the fsimage and the edit log. The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.

Why is Checkpointing Important?

A typical edit ranges from 10s to 100s of bytes, but over time enough edits can accumulate to become unwieldy. A couple of problems can arise from these large edit logs. In extreme cases, it can fill up all the available disk capacity on a node, but more subtly, a large edit log can substantially delay NameNode startup as the NameNode reapplies all the edits. This is where checkpointing comes in.

Checkpointing is a process that takes an fsimage and edit log and compacts them into a new fsimage. This way, instead of replaying a potentially unbounded edit log, the NameNode can load the final in-memory state directly from the fsimage. This is a far more efficient operation and reduces NameNode startup time.

Checkpointing creates a new fsimage from an old fsimage and edit log.

However, creating a new fsimage is an I/O- and CPU-intensive operation, sometimes taking minutes to perform. During a checkpoint, the namesystem also needs to restrict concurrent access from other users. So, rather than pausing the active NameNode to perform a checkpoint, HDFS defers it to either the SecondaryNameNode or Standby NameNode, depending on whether NameNode high-availability is configured. The mechanics of checkpointing differs depending on if NameNode high-availability is configured.

Back to issue what is causing this and how can I get it to stop. There can be multiple reasons for the issue; either there could be a failed or improper Upgrade procedure or, it might result from the Secondary NameNode having an incorrect ${dfs.namenode.checkpoint.dir}/current/VERSION file. In the second scenario, everything under the SNN's ${dfs.namenode.checkpoint.dir} the directory needs to be wiped out and rebuilt again so that checkpointing will work again

And for me second seems to be more probable reason. As there were

a) Zero generation of Edit Logs.

b) No FS images at all.

c) No version file

Basically these files are part of current directory under DFS.

/bigdata/dfs/snn/current; here snn is secondary name node.

Files under current directory looks like e.g.

- -rw-r--r-- 1 hdfs hdfs 79098 Nov 20 23:01 edits_0000000000000086448-0000000000000087020

-rw-r--r-- 1 hdfs hdfs 297866 Nov 20 22:01 fsimage_0000000000000086447

-rw-r--r-- 1 hdfs hdfs 62 Nov 20 22:01 fsimage_0000000000000086447.md5

rw-r--r-- 1 hdfs hdfs 172 Nov 20 23:01 VERSION

Resolution

Follow these steps to resolve the issue:

1. Shutdown HDFS service(s).

2. Log in to the Secondary NameNode host

3. cd to the value of ${dfs.namenode.checkpoint.dir}: this you can find under configuration tab of Cloudera Manager.

4. mv current current.bad

5. Start up HDFS service(s) only

6. Wait for HDFS services to come online

7. Start the remaining Hadoop Services

Make sure not to suppress the checkpoint alert as it is very critical for building up healthy node.

HBase

• HBase is a horizontally scalable, distributed, open source, and a sorted map database.

• It runs on top of Hadoop file system that is Hadoop Distributed File System (HDFS).

• HBase is a NoSQL non-relational database that doesn't always require a predefined schema.

• HBase is a column-based database.

• It can be seen as a scaling flexible, multidimensional spreadsheet where any structure of data is fit with on-the-fly addition of new column fields, and fined column structure before data can be inserted or queried.

Comparison Of HBase v/s RDBMS

Relational Databases	HBase
Uses tables as databases	Uses regions as databases
Normalized	De-normalized
The technique used to store logs is commit logs	The technique used to store logs is Write-Ahead Logs (WAL)
The reference system used is coordinate system	The reference system used is ZooKeeper
Uses the primary key	Uses the row key
Use of rows, columns, and cells	Use of rows, column families, columns, and cells

Components of Hbase

Master server

• The master server co-ordinates the cluster and performs administrative operations, such as assigning regions and balancing the loads.

• Assigning regions on startup , re-assigning regions for recovery or load balancing

• - Monitoring all Region Server instances in the cluster (listens for notifications from zookeeper)

• Admin functions

• - Interface for creating, deleting, updating tables

Region server

• The region servers do the real work. A subset of the data of each table is handled by each region server. Clients talk to region servers to access data in HBase.

Region Server Components

• A Region Server runs on an HDFS data node and has the following components:

• WAL: Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.

• BlockCache: is the read cache. It stores frequently read data in memory. Least Recently Used data is evicted when full.

• MemStore: is the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family per region.

• Hfiles store the rows as sorted KeyValues on disk.

Regions

• A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster,
called “Region Servers,” and these serve data for reads and writes.

HBase Memstore

• The MemStore stores updates in memory as sorted KeyValues, the same as it would be stored in an HFile.

• There is one MemStore per column family.

• The updates are sorted per column family.

HFiles

• HFiles are the physical representation of data in HBase. Clients do not read HFiles directly but go through region servers to get to the data.

• HBase internally puts the data in indexed StoreFiles that exist on HDFS for high-speed lookups.

Write Ahead Log:

• In short pronounced as WAL.

• The WAL is the lifeline that is needed when disaster strikes. Each Region Server adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to “MemStore”. This ensures that HBase has durable writes.

• if the server crashes it can effectively replay that log to get everything up to where the server should have been just before the crash

HADOOP - Overview

HADOOP - Overview

Due to the advent of new technologies, devices, and communication means like social networking sites, the amount of data generated is growing rapidly every year.

Hadoop is an ecosystem an Apache open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop is designed to scale up from single server to thousands of machines, each offering local computation and storage.

BENEFITS

Some of the reasons organizations use Hadoop is its’ ability to store, manage and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly and at low-cost.

· Scalability and Performance – distributed processing of data local to each node in a cluster enables Hadoop to store, manage, process and analyze data at petabyte scale.

· Reliability – large computing clusters are prone to failure of individual nodes in the cluster. Hadoop is fundamentally resilient – when a node fails processing is re-directed to the remaining nodes in the cluster and data is automatically re-replicated in preparation for future node failures.

· Flexibility – unlike traditional relational database management systems, you don’t have to created structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse and apply schema to the data when read.

· Low Cost – unlike proprietary software, Hadoop is open source and runs on low-cost commodity hardware.

Hadoop Architecture

Hadoop framework includes following four modules:

· Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provides filesystem and OS level abstractions and contains the necessary Java files and scripts required to start Hadoop.

· Hadoop YARN: This is a framework for job scheduling and cluster resource management.

· Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

· Hadoop MapReduce: This is YARN-based system for parallel processing of large data sets.