Monday, August 08, 2011

Thoughts on ZFS dedup

Following on from some thoughts on ZFS compression, and nudged by one of the comments, what about ZFS dedup?

There's also a somewhat less opinionated article that you should definitely read.

So, my summary: unlike compression, dedup should be avoided unless you have a specific niche use.

Even for a modest storage system, say something in the 25TB range, you should be aiming for half a terabyte of RAM (or L2ARC). Read the article above. And the point isn't just the cost of an SSD or a memory DIMM; it's the cost of a system that can take enough SSD devices or has enough memory capacity. Then think about a decent-sized storage system that may scale to 10 times that size. Eventually the time may come, but my point is that while the typical system you might use today already has CPU power going spare to do compression for you, you're looking at serious engineering to get the capability to do dedup.
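To see where numbers like that come from, here's a back-of-the-envelope sketch. It assumes roughly 320 bytes of core memory per dedup-table entry and one entry per unique block; both are commonly quoted rules of thumb rather than exact figures, and the average block sizes are illustrative.

```shell
# Back-of-the-envelope DDT sizing. Assumes ~320 bytes of memory per
# dedup-table entry and one entry per unique block (rules of thumb,
# not exact figures).
ddt_bytes() {
  local pool_bytes=$1 avg_block=$2
  echo $(( pool_bytes / avg_block * 320 ))
}

TIB=$(( 1 << 40 ))
# A 25 TiB pool at three plausible average block sizes:
for bs in 131072 32768 8192; do
  gib=$(( $(ddt_bytes $(( 25 * TIB )) $bs) >> 30 ))
  echo "avg block ${bs} B -> DDT roughly ${gib} GiB"
done
```

With large 128K records the table is a manageable 60-odd GiB, but real pools with lots of small files push the average block size down, and at 8K blocks the table is around a terabyte, which is why the half-terabyte figure isn't alarmist.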

We can also see when turning on dedup might make sense. A typical (server) system may have 48G of memory so, scaled from the above, something in the range of 2.5TB of unique data might be a reasonable target. Frankly, that's pretty small, and you need a fairly high dedup ratio to make the savings worthwhile.
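Running the same arithmetic in reverse gives the unique-data budget for a given amount of RAM. The ~320 bytes per entry is the same rule of thumb as before, and the 16 KiB average block size is just an illustrative middle value.

```shell
# Inverse of the sizing rule: given a RAM budget, how much unique data
# can the dedup table cover? Same ~320-bytes-per-entry assumption; the
# 16 KiB average block size is illustrative.
max_unique_gib() {
  local ram_bytes=$1 avg_block=$2
  echo $(( ram_bytes / 320 * avg_block >> 30 ))
}

echo "48 GiB of RAM at 16 KiB blocks covers ~$(max_unique_gib $(( 48 << 30 )) 16384) GiB of unique data"
# Before enabling dedup for real, 'zdb -S <pool>' will simulate the
# dedup table against existing data and report the ratio you'd get.
```

That lands around 2.4 TiB, in line with the 2.5TB figure above, and it shrinks further if your data skews toward small blocks.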

I've actually tested dedup on some data where I expected to get a reasonable benefit: backup images. The idea here is that you're saving similar data multiple times (either multiple backups of the same host, or backups of like data from lots of different hosts). I got a disappointing saving, on the order of 7%. Given the amount of memory we would have needed to put into a box to have 100TB of storage, this simply wasn't going to fly. By comparison, I see 25-50% compression on the same data, and you get that essentially for free. And that's part of the argument behind having compression on all the time, and avoiding dedup entirely.

I have another opinion here as well, which is that using dedup to identify identical data after the fact is the wrong place to do it, and indicates a failure in data management. If you know you have duplicate data (and you pretty much have to know you've got duplicate data to make the decision to enable dedup in the first place) then you ought to have management in place to avoid creating multiple copies of it: snapshots, clones, single-instance storage, or the like. Not generating duplicate data in the first place is a lot cheaper than creating all the multiple copies and then deduplicating them afterwards.
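As a sketch of what "not generating the duplicates in the first place" looks like in practice, ZFS snapshots and clones share blocks from the moment of creation, with no dedup table needed. The dataset names here are hypothetical.

```shell
# Instead of copying a golden image N times and deduplicating the
# copies afterwards, snapshot it once and clone it: clones share the
# base's blocks from the start. Dataset names are hypothetical.
zfs snapshot tank/images/base@gold
zfs clone tank/images/base@gold tank/images/host1
zfs clone tank/images/base@gold tank/images/host2
# Each clone consumes extra space only for blocks that diverge from
# the base image.
```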

Don't get me wrong: deduplication has its place. But it's very much a niche feature and certainly not something that you can just enable by default.


Daniel said...

I don't have personal experience with ZFS dedup, but one other use scenario that I suspect would get quite a favourable dedup ratio is storing virtual disk images for VMware etc. Consider a VPS hosting provider who clones each new customer VPS from a base VM. There could potentially be hundreds or thousands of VMs that are largely identical.

Christer Solskogen said...

Thanks for clearing some stuff up! :-)

Anonymous said...

I've had luck with dedup. One of the primary use cases was on a ZFS dataset housing the root filesystems for zones.

I have also toyed with using dedup for audit data with mixed results. But as Peter noted, in some cases turning on compression gave similar or better results.

Note: dedup with ZFS was used on Solaris 11.