
Testing HDFS centralized cache

I really hate negative posts. I try technologies hoping to find them useful, so that as many readers as possible can adopt them and benefit from them. However, sometimes a technology does not live up to expectations, at least in my tests. Such a post can still be useful, as readers will know what not to use, or at least to test it thoroughly before using it. This is the case with HDFS centralized cache.

HDFS offers a caching mechanism that takes advantage of the DataNodes' memory. Blocks are loaded into memory and pinned there, so that when a client requests those blocks they can be served directly from memory, which is much faster than disk. There are some 3rd-party products out there that do the same, but this option comes with Hadoop out of the box.

Hadoop has a special set of commands for managing this cache – the cacheadmin commands.

You must explicitly cache a directory or a file, and if you cache a directory the caching is not recursive – subdirectories will not be cached automatically. The full documentation can be found here. I was curious to see whether Cloudera had integrated the cache commands into Cloudera Manager, but was surprised to find that their documentation on the subject is basically a copy of the Apache Hadoop guide and you still have to use the command-line cacheadmin.

There are two major concepts in caching:

Directive: This is the actual cache definition. Its properties are the path to cache, the replication factor, how long to keep the blocks in cache (TTL) and which pool the directive belongs to. The replication factor is the number of copies of each block to keep in cache, and it defaults to 1 (you still have 3 copies on disk).

Cache pool: Groups together several directives. A cache pool can have permissions that let only specific users add directives to the pool.
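For example, creating a pool and a directive with an explicit replication factor and TTL looks roughly like this (the pool name, path and values are just placeholders):

hdfs cacheadmin -addPool analystsPool -mode 0755 -limit 1000000000
hdfs cacheadmin -addDirective -path /data/dim/customers -pool analystsPool -replication 2 -ttl 7d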

Caching is most effective for small files that are accessed frequently, and the documentation mentions a small Hive table that is often joined with other tables as a good candidate for caching.

You can read the documentation links above for more details.

What I want to do in this post is to actually measure the performance gain that we can get by using HDFS cache.

I used the scripts below to create two csv files: a small one with 8,000 lines and a bigger one with 10,000,000 lines. Then I transferred them to HDFS and created Hive tables based on them.

Big table:
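A minimal sketch of such a generator – not necessarily the original script, just something that produces the serial and random_number columns the join query below relies on:

# 10,000,000 rows of "serial,random_number"
seq 1 10000000 | awk 'BEGIN{srand()} {print $1 "," int(rand()*100000)}' > bigtable.csv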

Small table:
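A similar sketch for the small file, with 8,000 rows:

seq 1 8000 | awk 'BEGIN{srand()} {print $1 "," int(rand()*100000)}' > smalltable.csv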

Then I uploaded the files to HDFS and created hive tables:
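Something along these lines – the paths, table names and external-table layout are my assumptions; all the test really needs is that both tables expose serial and random_number columns:

hdfs dfs -mkdir -p /user/test/bigtable /user/test/smalltable
hdfs dfs -put bigtable.csv /user/test/bigtable/
hdfs dfs -put smalltable.csv /user/test/smalltable/

hive -e "
CREATE EXTERNAL TABLE bigtable (serial INT, random_number INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/test/bigtable';
CREATE EXTERNAL TABLE smalltable (serial INT, random_number INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/test/smalltable';"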

Then I ran the query below, which joins the two tables, several times and recorded the average duration:

select * from bigtable a, smalltable b where a.serial=b.serial and b.random_number>27000;

Then I created a pool and a directive so that the small table would be cached.
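Roughly like this, with the pool name being a placeholder and the path matching the assumed location of the small table above:

hdfs cacheadmin -addPool testPool
hdfs cacheadmin -addDirective -path /user/test/smalltable -pool testPool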

The next thing you should do is make sure that enough memory is allocated on each DataNode to hold your cached table:

dfs.datanode.max.locked.memory=2500M
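If you set it directly in hdfs-site.xml, a sketch with the value expressed in bytes (about 2,500 MB) would be the following; restart the DataNodes after changing it:

<property>
  <name>dfs.datanode.max.locked.memory</name>
  <value>2621440000</value>
</property>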

Or set the same property through Cloudera Manager.

You should also make sure the OS kernel parameters support the amount of memory you set:
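In practice this means the locked-memory ulimit of the user running the DataNode (typically hdfs) must be at least as large as the value above. A quick sketch, assuming your system applies /etc/security/limits.conf to the DataNode process:

# check the current limit (reported in KB) as the DataNode user
ulimit -l

# raise it, e.g. by adding a line like this to /etc/security/limits.conf:
# hdfs  -  memlock  unlimited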

Now we can run hdfs cacheadmin -listDirectives -stats and see that the whole small table is cached (bytes cached is the same as bytes needed).

Again, I ran the same query several times and calculated the average, this time with the cache on. I was surprised to see that there was no significant difference from the queries that ran without caching. Here is the comparison (the Y axis is seconds):

The times are not very impressive as my test cluster is small, but you can see that both runs took about 54 seconds and the query was even a bit faster without the cache. This is strange. Before giving up I decided to try one more time with a different approach, this time using teragen and terasort to measure cache efficiency. You can read about teragen and terasort in this post.

First I ran teragen to create the test data. Then I ran terasort on that data several times and recorded the average duration. Then I added a cache directive for the teragen data and made sure it was cached.
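The commands were along these lines – the row count, paths and pool name are my assumptions, and the examples jar path varies between distributions:

# generate the test data: 10,000,000 rows x 100 bytes = ~1 GB (assumed size, fits in the 2500M cache)
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 10000000 /user/test/teragen

# cache the generated data and check that bytes cached equals bytes needed
hdfs cacheadmin -addDirective -path /user/test/teragen -pool testPool
hdfs cacheadmin -listDirectives -stats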

Then I ran terasort again several times, this time with caching enabled (deleting the terasort output directory before each run):
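For example (again, the paths are assumptions; time is just used to record each run's duration):

for i in 1 2 3; do
  hdfs dfs -rm -r -skipTrash /user/test/terasort-out
  time hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    terasort /user/test/teragen /user/test/terasort-out
done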

Since terasort reads the teragen data and sorts it, it should run faster when it gets the data from memory rather than from disk. Again I was surprised (and disappointed) by how close the results were with and without the cache (the chart is misleading – the difference is only one second, just 0.47%).

Conclusions and thoughts:

I really hate it when I test a technology and find that there's nothing to report. I may have missed something or done something wrong, but according to my tests this caching technology does not work as it should. It did not provide any performance gain, and in some magical way it actually degraded performance.

If you are looking for an alternative you may want to check out Alluxio. It needs more installation and setup than HDFS's own caching, but according to some tests I did long ago, which you can find here and here, it does give a performance boost.
