Monday, October 09, 2023

When zfs was young

On the Solaris 10 Platinum Beta program, one of the most exciting promised features was ZFS, the new file system.

I was especially interested, given that I was in a data-heavy position at the time. The limits of UFS were painful: we had datasets running into several terabytes already - and even the multiterabyte file system support that got added was actually pretty useless, because the inode density was so low. We tried QFS and SAM-QFS, and they were pretty appalling too.

ZFS was promised, and didn't arrive. In fact, there were about 4 of us on the beta program who saw the original zfs implementation, and it was quite different from what we have now. What eventually landed as zfs in Solaris was a complete rewrite. The beta itself was interesting - we were sent the driver, 3 binaries, and a 3-line cheatsheet, and that was it. There was a fundamental philosophy here that the whole thing was supposed to be so easy to use and sufficiently obvious that it didn't need a manual, and that was actually true. (It's gotten rather more complex since, to be fair.)

The original version was a bit different in implementation from what you're used to, but not that much. The most obvious change was that originally there wasn't a top-level file system for a pool. You created a pool, and then created your file systems. I'm still not sure which is the correct choice. And there was a separate zacl program to handle the ACLs, which were rather different.
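To make the difference concrete, here's a rough sketch of how pool and file system creation looks in the shipped version, where the pool comes with a top-level file system; the pool name, disk names, and file system layout are all illustrative:

```shell
# Create a mirrored pool; in shipped zfs this also creates a usable
# top-level file system mounted at /tank.
zpool create tank mirror c0t0d0 c0t1d0

# File systems then nest underneath the pool.
zfs create tank/home
zfs create tank/home/alice

# Show the resulting hierarchy.
zfs list -r tank
```

In the original beta, by contrast, creating the pool gave you nothing mountable; every file system, including the first, was an explicit separate step.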

In fact, ACLs have been a nightmare of bad implementations throughout their history on Solaris. I already had previous here, having got the POSIX draft ACL implementation reworked for UFS. The original zfs implementation had default aka inheritable ACLs applied to existing objects in a directory. (If you don't immediately realise how bad that is, think of what this allows you to do with hard links to files.) The ACL implementations have continued to be problematic - consider that zfs allows 5 settings for the aclinherit property as evidence that we're glittering a turd at this point.

Eventually we did get zfs shipped in a Solaris 10 update, and it's been continually developed since then. The openzfs project has given the file system an independent existence: it's now in FreeBSD, you can run it (and it runs well) on Linux, and it has spread to other OS variants too.

One of the original claims was that zfs was infinitely scalable. I remember it being suggested that you could create a separate zfs file system for each user. I had to try this, so I put together a test system (an Ultra 2 with an A1000 disk array) and started creating file systems. Sure, it got into several thousand without any difficulty, but that's not infinite - think universities or research labs and you can easily have 10,000 or 100,000 users; we had well over 20,000. And it fell apart at that scale. That's before each is an NFS share, too. So that idea didn't fly.
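The experiment itself is trivial to reproduce; a minimal sketch, with the pool name and account source as placeholders, looks something like this:

```shell
# One file system per user: loop over the account names and create
# a child file system for each. Fine at thousands; painful at tens
# of thousands, especially at boot/mount and share time.
zfs create tank/home
for user in $(cut -d: -f1 /etc/passwd); do
    zfs create "tank/home/${user}"
done

# Sharing each one over NFS multiplies the cost again:
# zfs set sharenfs=on tank/home
```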

Overall, though, zfs was a step change. The fact that you had a file system that was flexible and easily managed was totally new. The fact that a file system actually returned correct data rather than randomly hoping for the best was years ahead of anything else. Having snapshots that allowed users to recover from accidentally deleted files without waiting days for a backup to be restored dramatically improved productivity. It's win after win, and I can't imagine using anything else for storing data.
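The snapshot win deserves a concrete illustration. A rough sketch, with made-up dataset and file names, of how a user gets a deleted file back without any admin involvement:

```shell
# An admin (or cron job) takes periodic snapshots.
zfs snapshot tank/home/alice@monday

# The user deletes a file by accident...
rm /tank/home/alice/report.txt

# ...and copies it straight back out of the read-only snapshot,
# exposed under the hidden .zfs directory. No tapes, no waiting.
cp /tank/home/alice/.zfs/snapshot/monday/report.txt /tank/home/alice/
```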

Is zfs perfect? Of course not, and to my mind one of the most shocking things is that nothing else has even bothered to try and come close.

There are a couple of weaknesses with zfs (or related to zfs, if I put it more accurately). One is that it's still a single-node file system. While we have distributed storage, we still haven't really matured that into a distributed file system. The second is that while zfs has dragged storage into the 21st century, allowing much more sophisticated and scalable management of data, there hasn't been a corresponding improvement in backup, which is still stuck firmly in the 1980s.


Fazal Majid said...

ZFS also has a problem with fragmentation that seriously degrades performance on write-heavy filesystems like RDBMSes.

jnickelsen said...

When I worked heavily with ZFS (think hundreds of file systems for web hosting customers' mysql dbs), I found replicating snapshots to another (preferably remote) volume a quite decent method for making backups. Aging out previous snapshots gave us hourly/daily/weekly/monthly backups from which we could restore with ease — or even have a (patched) mysqld run directly on a snapshot for data recovery.
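The scheme described above maps onto zfs send/receive; a hedged sketch, with host, pool, and snapshot names as placeholders:

```shell
# Take an hourly snapshot of the customer dataset.
zfs snapshot tank/db/cust1@hourly-2023100912

# First run: replicate the full snapshot to a remote pool.
zfs send tank/db/cust1@hourly-2023100912 | \
    ssh backuphost zfs receive backup/cust1

# Later runs send only the incremental delta between snapshots.
zfs snapshot tank/db/cust1@hourly-2023100913
zfs send -i @hourly-2023100912 tank/db/cust1@hourly-2023100913 | \
    ssh backuphost zfs receive backup/cust1

# Aging out: destroy snapshots that fall outside the retention window.
zfs destroy tank/db/cust1@hourly-2023100912
```

Restores then amount to copying files back out of a received snapshot - or, as the comment notes, pointing a service directly at a clone of one.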

Eric J. Bowman said...

Excellent work on the distro, Peter. Zap is an excellent tool, I love the overlay concept/implementation, and the router zone is fantastic!

I installed TFTP in mine, put "dhcp-option=66,""" in dnsmasq.conf and loaded it. Then I set up an lx zone... in some fashion I have yet to determine, it should work but doesn't? Tried a bhyve zone, same thing, but trying to get BSD netboot, no results yet.

PXE Zone type, maybe? A "system" Zone like your router zone?

Peter Tribble said...

Thanks Eric!

I haven't done too much to get PXE to work - as it's local anyway, you can boot bhyve directly from media. And it does work fine like that.

One thing that will help is to add -6 to the create-zone arguments. This turns off all the anti-spoofing stuff as well as enabling IPv6, but the side-effect is that dhcp works. I normally force cloud-init or poke in a manual configuration to give systems as much chance as possible to bring the network up correctly.