I've been investigating a strange problem on one of my servers. It started out as a simple case of NetBackup failing to back the filesystem up.
Now that's not entirely unusual - NetBackup often goes off and sulks in a corner. But this was rather different, as it didn't disappear as mysteriously as it came. Rather, it stayed put and the filesystem repeatedly refused to complete a backup. And the diagnostics were pretty much non-existent.
OK, so after a week or so of this I decide to try an alternative approach. To my surprise, I couldn't blame NetBackup.
First attempt was to try ufsdump. It started off in a promising manner, then froze completely on me.
OK, that's not good. So next I try tar, in various forms - various tar commands, writing to a remote tape or filesystem. That would work, right? Wrong!
Each attempt freezes completely on me. That's local tar piped to tar that writes over nfs; tar piped to an rsh; tar on an nfs client; tar using rmt to write to a tape.
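For the record, the variants looked roughly like this - the hostnames and paths here are invented, and the exact tape incantation depends on which tar you have (GNU tar can talk rmt directly with `-f host:device`; otherwise you stream through rsh):

```shell
# Local tar piped to a tar extracting onto an NFS mount
tar cf - . | (cd /net/backuphost/backup && tar xf -)

# tar piped over rsh to a tar running on the remote side
tar cf - . | rsh backuphost 'cd /backup && tar xf -'

# tar run on an NFS client, reading this filesystem over NFS
rsh backuphost 'cd /net/thisserver/data && tar cf /backup/data.tar .'

# tar streamed over rsh to dd writing on the remote tape drive
tar cf - . | rsh backuphost dd of=/dev/rmt/0n obs=20b
```

Four quite different paths for the data to take, and every single one of them hangs.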
That's odd. Now, at least I can look at the output and see how far it's got. I'm starting to make headway, as it looks like it gets to the same point each time.
OK, so I start to build a copy by tarring up various bits of the filesystem, avoiding the place where I know it breaks. Until I get into the suspect area. And yes, it still fails in the same place (but at least I've got a copy of most of the filesystem now, so can breathe easier).
The bad files live in an area that holds various versions (views of the data) of the index files of a proprietary search engine. Now it looks like tar always traverses the hierarchy in the same order. OK, so if I manually list the subdirectories in an order that puts the failed one last, I can copy off the files I'm missing, right?
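The idea was simply to drive the traversal order myself rather than letting tar pick it - something like this, with the directory names invented:

```shell
cd /data/index
# Copy the known-good views first; leave the suspect one until last,
# so everything else is safely across before anything can hang
for d in view_a view_b view_c view_suspect; do
    tar cf - "$d" | (cd /net/backuphost/backup/index && tar xf -)
done
```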
That was a fine theory until it froze on me again. And this is where it gets really strange. Each subdirectory has the same structure. So in each subdirectory there's a bunch of files with different suffixes. And it always fails on one particular suffix. Furthermore, it fails at about the same distance (roughly 38 megabytes into a 40 megabyte file). That's about as weird as it gets, in my experience. What on earth is it about these files that causes anything that tries to back them up to lock up completely?
And it gets worse. I can cp this file locally. Try cp to any nfs-mounted location and the cp wedges. An rcp to any remote system wedges. And it's the same again - it wedges at the same distance into the file.
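One obvious next step, which I haven't finished chasing down yet, is to carve the file up with dd and see whether the bytes around the 38MB mark wedge a transfer all on their own - the offsets and filenames here are guesses for illustration:

```shell
# Pull out 2MB starting 37MB into the file
dd if=suspect.idx of=/tmp/slice bs=1024k skip=37 count=2

# Then see if just that slice is enough to wedge an rcp
rcp /tmp/slice backuphost:/tmp/slice
```

If that small slice hangs too, I can keep bisecting until I'm staring at the exact bytes that do it.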
It must be something in the data stream that's contained in these files. At least I did find one way to copy them across - if I gzip them first then they go across fine.
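So the workaround, for now, is compress, copy, uncompress on the far side - paths invented as before:

```shell
# gzip changes the byte stream, and the compressed copy goes across fine
gzip -c suspect.idx > /tmp/suspect.idx.gz
rcp /tmp/suspect.idx.gz backuphost:/backup/
rsh backuphost 'gunzip /backup/suspect.idx.gz'
```

Which also neatly supports the theory that it's the bytes themselves that matter: same file, same destination, but a different byte stream on the wire, and suddenly everything works.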
But where is the bug that's being tickled? Is this something in the network stack?