
The Hash Cache

Posted by Alex at 10:25 PM

In build 11 of WhsDbCheck I introduced the new /keephashcache option, which preserves the hash cache from run to run so that the reading and hashing done on the first run doesn't have to be repeated. I think this option deserves a little more explanation. But before we get to that, let's start with what the hash cache actually is.

When Windows Home Server stores data, it stores a hash of each cluster along with the actual data. For those who don't know, a hash is a small number that can be used to verify the integrity of a much larger chunk of data, of any size in fact. In the home server each hash typically covers a 4096-byte chunk of data, although this can vary. I use this hash in a level 4 check to make sure that every single byte in the backup database is the exact byte that was written at backup time.
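
As a sketch of the idea (the hash names and functions here are my own illustration, not WhsDbCheck's internals; MD5 is chosen only because it produces a 16-byte digest, the actual algorithm isn't stated), verifying one cluster against its stored hash might look like this:

```python
import hashlib

CLUSTER_SIZE = 4096  # bytes per cluster, typical for NTFS

def cluster_hash(data: bytes) -> bytes:
    """Compute a 16-byte hash of one cluster.
    MD5 is an assumption here, picked only for its 16-byte size."""
    return hashlib.md5(data).digest()

def verify_cluster(data: bytes, stored_hash: bytes) -> bool:
    """Level-4-style check: the cluster passes only if its bytes
    hash to exactly what was stored at backup time."""
    return cluster_hash(data) == stored_hash

# Flipping even a single byte changes the hash, so the check catches it:
good = b"A" * CLUSTER_SIZE
bad = b"B" + good[1:]
stored = cluster_hash(good)
print(verify_cluster(good, stored))  # True
print(verify_cluster(bad, stored))   # False
```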

I don’t suggest anyone try this, but I did :) You can change a single byte in a data file and see the level 4 check detect the fault. Very cool stuff. Oh, and I mean it, don’t try this unless you have a backup of the database!

So here's the problem: in order for the check to be reasonably fast (we're talking hours instead of days or weeks), we need to load the hashes into memory.

Let’s do some math:

  • Suppose you have a 300 gigabyte database.
  • At 4096 bytes per cluster (typical for NTFS),
  • that means you have 78,643,200 hashes.
  • At 16 bytes per hash, that would require 1.17 gigabytes of RAM.
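
The arithmetic above, spelled out:

```python
db_bytes = 300 * 1024**3   # 300 GB database
cluster = 4096             # bytes per cluster (typical for NTFS)
hash_size = 16             # bytes per stored hash

hashes = db_bytes // cluster
ram_bytes = hashes * hash_size

print(hashes)                    # 78643200 hashes
print(ram_bytes / 1024**3)       # ~1.17 GB of RAM just for hashes
```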

The first generation HP MediaSmart server comes with only 512 MB of RAM. So clearly, this is not going to work if you just load everything into RAM.

After much experimentation, I came up with a technique that makes a level 4 check possible in a reasonable amount of time. I call it the hash cache.

Essentially, the hash cache has one important quality: it avoids hard drive seek times at all costs. It turns out that drives and file systems are pretty good at reading a sequential stream, as long as it stays sequential. That is exactly what the hash cache exploits, and it works well.
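
A minimal toy sketch of that principle (the class, names, and layout are my own, not WhsDbCheck's actual implementation): keep a fixed-size window of hashes in memory, and when a lookup falls outside the window, refill it with one large sequential read rather than seeking per hash.

```python
import io

HASH_SIZE = 16  # bytes per hash

class HashCache:
    """Toy sequential hash cache: reads a big contiguous block of
    hashes at a time, so the file is read in long sequential runs
    instead of one seek per hash."""

    def __init__(self, hash_file, cache_bytes=256 * 1024 * 1024):
        self.f = hash_file
        self.cache_bytes = cache_bytes
        self.start = -1   # index of the first cached hash
        self.buf = b""

    def get(self, index: int) -> bytes:
        count = len(self.buf) // HASH_SIZE
        if not (self.start <= index < self.start + count):
            # Cache miss: one large sequential read replaces
            # what would otherwise be many small seeks.
            self.f.seek(index * HASH_SIZE)
            self.buf = self.f.read(self.cache_bytes)
            self.start = index
        off = (index - self.start) * HASH_SIZE
        return self.buf[off:off + HASH_SIZE]

# Usage with an in-memory "hash file" holding 1000 hashes:
data = (bytes(range(256)) * 63)[:1000 * HASH_SIZE]
cache = HashCache(io.BytesIO(data), cache_bytes=64 * HASH_SIZE)
print(cache.get(0) == data[:HASH_SIZE])  # True
```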

So this is where the /hashcache switch comes in. By default, the hash cache uses 256 MB of memory, but you can change that by specifying a new size with /hashcache=(size in MB), e.g. /hashcache=1000 to make the hash cache roughly 1 GB. This will make a level 4 check faster if you have the RAM. Specifying 0 turns off the hash cache (not recommended), and specifying more than your physical RAM will be really bad for performance. Specifying more than a couple of gigabytes will slow things down too.

Now that I’ve explained what the hash cache is, in the next post I will talk about the new /keephashcache switch and how and when to use it.
