Monday, February 16, 2009

Comparing Drobo and DroboShare to an OpenSolaris storage server

In my previous entry I recalled the trial of recovering from a corrupted HFS+J volume that contained my Time Machine backups. During that process, I became aware of some of the drawbacks of the storage appliance on which my backups were stored, the Drobo. During all of that intensive reading and writing of large chunks of data, it was obvious that the Drobo was a bit on the slow side. Meanwhile, I had come across a video tutorial of ZFS, the file system that Sun created a few years back. It was so impressive, I decided then and there that the solution was to replace the Drobo with a server running OpenSolaris and ZFS. Granted, there are some advantages to an appliance like the Drobo, and in this entry I'll outline the pluses and minuses.

Drobo Advantages
It's an appliance, you plug it in and stick in the drives and that's basically all you need to do. In fact, the disks do not even have to be the same size, although if they are then it won't waste any of the capacity. With four 200GB drives, the Drobo provided 554GB of usable storage space. And it does this while consuming a modest 36W of power.

Drobo Disadvantages
While the list of advantages was rather brief, this list is going to be significantly longer. First, if you are using a Drobo, you almost certainly want to share it on a LAN, in which case you need to buy a DroboShare (another $200 on top of the $500 for the empty Drobo, plus whatever you had to pay for the disks). To make this work properly, you need to run the Drobo Dashboard, which has a couple of idiosyncrasies. First, you have to completely disable the firewall in Mac OS X in order for the Dashboard to detect the DroboShare. Second, the Dashboard application files must be owned by a particular user, the one that installed the software. That would not be a problem except that the application won't launch for any other user. While the DroboShare can be mounted without using the Dashboard, it seems there is a file ownership issue with at least some of the files on the Drobo, making them appear to be corrupted. The Drobo support tech was stumped and unable to resolve that particular issue, and could not bring themselves to admit that the Dashboard software was flawed.

During all of that Time Machine volume copying I was doing, the used space on the Drobo climbed upward. But as I deleted the botched copies, the Drobo was reluctant to return the space to the free side of the usage dial. Weeks went by and mysteriously one day the space all came back. I'll never know why, just one of the mysteries of the black box that is the Drobo. In fact, that is exactly what the Drobo is: a closed, black box. There is no access into it, no remedy when something goes wrong, no tool for diagnosis. If the box fails for any reason, you have exactly one place to turn to for help (if you're Bill Streeter, you'll know exactly how true that is). You better hope your support contract is still good ($50/year, $150 if you let it lapse). In the worst case, you have to buy a new Drobo just so you can get the data off of your disks. Yes, that is indeed the truth. Like any proprietary RAID-like device, the Drobo's on-disk format is a closely guarded secret. If the device fails and you can't find a replacement, your data is gone forever.

While its power consumption is modest, it has insufficient airflow around the disks and as a result they run a bit too hot for my taste. Now that they are in the new storage server, they are much cooler. Speaking of power, the power connector on the Drobo is notoriously loose. It's fallen out numerous times over the year that I've used my Drobo. The tech support person suggested taping the power cord to the side of the Drobo. Brilliant.

Last, but not least, the Drobo has a weird capacity "limit" of 2TB. Apparently it has something to do with the fact that it's only interface to the world is through a USB port. I would assume the second generation Drobo is better at this, but I'm not willing to spend $500 to find out. Regardless, your capacity is limited to whatever you can find in four disks, as that is the maximum number of disks any single Drobo can take. But, if you've got the money, you can plug two Drobos into a single Drobo Share. I doubt too many people have done that, as it would cost $1200 plus the cost of the disks. Meanwhile, I could build a system to hold 10+ disks for a fraction of that cost.

OpenSolaris and ZFS
The only disadvantage to running a server with OpenSolaris is that you have to install, configure, and maintain it. But hey, I've been doing that for years so it's no trouble for me. In fact, the recent releases of OpenSolaris are remarkably easy to set up and administer. Most of the standard configuration is done using the graphical interface, and for everything else there are well-written manual pages. As for the advantages of OpenSolaris and ZFS, there are many. First of all, ZFS is the most amazing file system on the planet. It's incredibly easy to set up a storage pool and create file systems. It handles stripes, mirrors, and data/parity formats (RAID 0, 1, 5, and 6) depending on your replication needs. You can add as many disks as your hardware can handle, and you can configure them any way you like.

ZFS has invincible data integrity. Unlike most, if not all, RAID 5 implementations, it does not suffer from the infamous write hole. Instead, it never overwrites live data, so all writes go to free blocks, with data being written first, then the meta data blocks, and finally the über blocks. If power is lost at any point, when the system comes back, it will only see valid data and meta data. What's more, all blocks are check-summed, and that checksum is stored in the parent block. This checksum carries upward to the über block, which effectively has a fingerprint of the entire file system. This guards against the worst kind of data loss, the silent kind, as it detects the occasional bit rot that some disks can suffer. With built-in data replication, these problems can be automatically corrected on the fly.

But what about data portability, which was a major issue for me with the Drobo? Well, get this: ZFS is open source. What's more, it's been ported to BSD and Mac OS X. So even if my storage server were to suddenly die, I not only have a choice of vendors to repair/replace the hardware, but I also have a choice of operating systems to access the data in the storage pool. But, in my opinion OpenSolaris is the best choice as it has the reference implementation of ZFS. What's more, OpenSolaris supports SMB, NFS, iSCSI, and AFP (via netatalk; see earlier blog entry), so I have several choices for how to access the storage over the network.

ZFS has built-in support for snapshots. In fact, it has a feature similar to Time Machine, called Time Slider, that makes automatic snapshots and manages their expiration, much like Time Machine. With snapshots in place, if I ever run into a corrupted HFS disk image again, I can roll back the file system to an earlier snapshot. Granted, I may lose some data, but it's better than losing the entire image to an uncorrectable error.

Final Notes
I lied earlier when I said there was only one disadvantage to running a storage server instead of a Drobo. The one other issue is the power consumption of most server-class systems is over 100W. In fact, my current server consumes at least 102W and often hits 120W while actively doing work. But, I have a plan to replace the hardware with low power parts, ones primarily aimed at the mobile and embedded market. I plan to talk about that more in a future entry.

While I could certainly format the Drobo using ZFS, I would only gain snapshots and on-disk consistency. Performance would still be rather poor, and disk management and repair would rely entirely on the Drobo itself. As far as ZFS would know, the Drobo would appear as one big disk. That is not the recommended scenario for ZFS according to its creators.

Earlier I mentioned that Drobo provided 554GB of usable space with the four 200GB disks I had installed. In comparison, ZFS provided just 548GB. Not too bad considering I'm getting rock solid data integrity and automated snapshots.

One final point, in regards to reclaiming disk space after deleting files. With ZFS, the freed space was returned within seconds, unlike the many weeks that it took the Drobo to realize I had deleted 100GB of data.

All in all, I'm very happy with the decision I've made. I now have a fast, reliable, serviceable, and manageable storage box that I can update with newer software versions indefinitely, as well as easily grow the capacity as my needs change.

Time Machine and invalid sibling link, now what?

Last month I was playing around with the timedog script to determine where all the disk space was going on my Time Machine backups. I had mounted the remote disk image using hdiutil attach, when a few minutes later Time Machine kicked off a backup. In a moment of poor judgment, I tried to stop the backup and unmount the second mount of the backup volume that TM had established. At first nothing appeared to have gone wrong as a result of that action, but the next time TM tried to make a backup, it said the volume was corrupted and it couldn't make another backup. No matter, I was sure Disk Utility or fsck could correct the problem, surely the error wasn't too serious. Ah, but this error was no run of the mill error. It was in the fact the dreaded invalid sibling link error. This is the error that makes most disk repair tools turn pale and shirk away into the corner. In most cases, nothing will fix the broken link, and there's no telling just what might be lost as a result.

But, I wasn't going to give up too easily. After all, I had over a year's worth of backups that I wanted to recover. I Googled around for days, reading forum posts and blog entries, and anything else that might bring some hope to my dire situation. In many cases, others who had this problem tried fsck or Disk Warrior, and some of them were successful. By this time, I had tried fsck -r several dozen times, to no avail. Being the inventive type, I tried using hdiutil convert to create a new disk image from the corrupt one. But, as you can probably guess, all that accomplished was creating a new disk image with the same invalid sibling link. That meant that a simple disk block copy was not going to work, I had to try something that would copy the files one by one from the corrupt volume to a new one. Knowing that Time Machine makes gratuitous use of hard links, I needed a copy program that knew how to manage the hard links. Otherwise, a simple copy would result in a TM volume that was many times the size of the original.

Using rsync -H was the first thing I tried, but it ran out of memory before it managed to copy anything. Next, I tried SuperDuper!, which was recommended by quite a few people. The free version would only perform a whole disk copy, erasing the destination and starting from scratch. That was fine since that was exactly what I wanted. Sadly, it too failed after about 36 hours, reporting a "type 8 error". It had managed to copy quite a bit of the TM backups, but I wasn't satisfied, I wanted everything.

At this point I began to realize that there was only one way I was going to make a faithful copy of the Time Machine volume in a reasonable amount of time. Yep, that's right, I would have to write a script that would accomplish my goal. Being that Python is the language in which I am strongest, after Java, I chose to put its built-in file manipulation routines to good use. I figured I could use the same approach that timedog was using, comparing the inode values of the directory entries from one snapshot to the next. In this way, I could know which entries were hard links and which were new files. After just four weeks of spending my spare time hacking away in Python, I finally succeeded.

Introducing, the fruit of my labor. It's a Python script that traverses a Time Machine volume and reproduces its contents to the destination of your choice. Aside from knowing about the hard links that Time Machine creates, it also copies over the extended attributes that convince TM to accept the copied backups. As a result of writing this tool, I have recovered my Time Machine backups and everything seems to be working fine once again. Granted, I'll never know what files may have been lost due to the invalid sibling link error, but at least I managed to save the vast majority of the data.

Sunday, February 8, 2009

Building netatalk on OpenSolaris 2008.11

For a few months now, basically since the Drobo started supporting third party applications, I have been using my Drobo, via a DroboShare, as a Time Machine backup for my MacBook Pro. I used the BackMyFruitUp toolkit to set up the DroboShare as an AFP server, so the Mac saw it as an Apple-compatible network file share. One particularly fun step in that process was migrating my existing TM volume over to the Drobo, but that's another story. This story is about how I replaced the Drobo and DroboShare with a server running OpenSolaris.

Installing OpenSolaris

To start with, I took the old web/file server I had sitting around since last year and installed OpenSolaris 2008.11. Why OpenSolaris you may ask, considering I had been a Linux user for over a decade? Well, I have one word for you: ZFS. If you don't think ZFS is the most rockin' file system on the planet, then you haven't watched the three hour presention. Yeah, 3 hours, but it's absolutely fascinating. But seriously, I can't give ZFS all of the credit for prompting the switch to OpenSolaris. There are plenty of very good reasons to use OpenSolaris and ZFS is just one of them.

Okay, so once OpenSolaris was on the system disk, what next? Well, you should update the installed packages and reboot into the new boot environment. Next we'll need the C compiler and related packages: pfexec pkg install gcc-dev

The system is now capable of compiling software from source, in particular netatalk. Skip the 2.0.3 release and go straight for whatever is the latest, which as of today is 2.0.4 beta2. We'll need that in order to work around a weird permissions issue introduced in Leopard. But first, we must install a compatible version of Berkeley DB.

Installing Berkeley DB

For now, netatalk works best with a slightly older release of Berkeley DB, version 4.2.52. Compiling this is pretty straightforward. Start by adding /usr/local/lib to the library load path (pfexec crle -u -l /usr/local/lib). Then compile and install Berkeley DB like so (consult their build instructions for details, but it basically goes like this):
  1. cd build_unix
  2. ../dist/configure --prefix=/usr/local
  3. make
  4. pfexec make install
Installing netatalk

With the recent versions of netatalk, it expects the Solaris directory structure to look differently than what OpenSolaris has these days. To accommodate this, make a symbolic link from /usr/ucbinclude to /usr/include so that netatalk builds. Then edit sys/, removing 'solaris' from line 294 (to skip building the modules we don't really need), then save the file and you're ready to compile. Here I'm skipping the DDP bits that don't compile cleanly on OpenSolaris, and I'm giving PAM a miss because it's more work to set it up.
  1. ./configure --disable-ddp --without-pam
  2. make
  3. pfexec make install
Now comes the configuration stage. This setup suits my own needs, so if you want additional services then check out the netatalk documentation for more information. In general though, you will probably want to make similar changes to the default configuration, so I'll detail what I've done for my environment.
  • Edit /usr/local/etc/netatalk/afpd.conf, adding the following line at the end of the file (this sets up the encrypted password authentication method and tells clients not to save the password, although that seems to be ignored on OS X):
- -transall -uamlist -nosavepassword
  • Edit /usr/local/etc/netatalk/netatalk.conf, changing "yes" to "no" for the atalk and papd services (atalk is for pre-OSX systems, and papd is for printer sharing).
  • Edit /usr/local/etc/netatalk/AppleVolumes.default, adding the following (changing the default ~ line as well):
~ options:usedots,invisibledots,upriv perm:0770
/zeepool/shared "Shared" allow:@staff options:usedots,invisibledots,upriv perm:0770
/zeepool/nathan_backup "Nathan Backup" allow:nfiedler options:usedots,invisibledots,upriv perm:0770
/zeepool/antonia_backup "Antonia Backup" allow:akwok options:usedots,invisibledots,upriv perm:0770
That's four lines of text above; the blog editor breaks the lines unfortunately. The usedots option tells netatalk to use dots instead of ":2e" for encoding dot files, while invisibledots says to make the dot files invisible by default. Now about the permissions issue alluded to above (see this discussion for details). With Tiger, newly created files would be writable by others, but in Leopard the permissions are wacky, so the latest netatalk has a work around for that. Add the upriv option and perm:0770 to force the permissions for new files to allow others to read and write to them. After all, this is a shared volume, it's silly if no one else can access the files.

With the configuration complete, you can start the netatalk services. I'm assuming that it's not running already, in which case you can just run this command: pfexec /etc/init.d/atalk start

Connecting and Permissions

Now at this point you should be able to connect to the server from your Mac, using the Connect to Server feature in Finder (you can use the Cmd+K shortcut). Type in something like "afp://myserver" in the dialog, replacing myserver with the name of your server, and you will be prompted for a name and password. Use whatever you have for your user accounts on the OpenSolaris server. You could configure netatalk to use PAM, allowing authentication against LDAP or some other service, but for simplicity I just use the system accounts. Once you've authenticated, you will be prompted to select an available shared volume. It doesn't seem to matter which one you pick since the server will be added to the Finder sidebar, and from there you can browse to any of the shared volumes. As for accessing the files on the server, make sure the ownership and permissions are set up such that the user you connect as can read and write to those areas. For instance, the nfiedler user has read/write permission to /zeepool/nathan_backup, and that same user is a member of the staff group, and the /zeepool/shared area is owned by the staff group and is group writable. So far this seems to be working for us, but if you have better ideas then by all means please leave a comment.

I can't take the credit for uncovering this information. In fact, this entry is just pulling together the different bits of information into a single, concise set of instructions. The original blog that I encountered was written by Marc Haisenko, and for step-by-step instructions on configuring netatalk on Linux, I found the kremalicious blog by Matthias Kretschmann.

Saturday, December 27, 2008

Burstsort for Java

Shortly before working at Quantcast, I became interested in a sorting algorithm that I had not heard of before, called Burstsort. I found it while browsing Wikipedia, reading about various methods of sorting. Burstsort, in case you haven't heard of it already, is very fast for large sets of strings, much faster than quicksort and its friends, including multikey quicksort and radixsort. It works by inserting the strings to be sorted into a shallow trie structure, where buckets are used to store the string references, to reduce memory usage. The buckets are "burst" when they exceed a certain size, and these buckets are sorted using a multikey quicksort. The structure is then traversed in order to retrieve the sorted strings. As a result, Burstsort is cache friendly and thus runs considerably faster than algorithms that are not cache-aware.

Along with the original paper is a C implementation, but as far as I could tell, there was no Java implementation, at least not in open source. So, after reading all of the Burstsort papers several times, I finally started writing a Java implementation of the original algorithm. You can find the project on Google Code, at the burstsort4j project page. The initial implementation is basically a rewrite of the original C code. After fixing a few bugs that I introduced during the rewrite, it appears to be working well and is indeed much faster than the other algorithms (quicksort and its multikey variant). Of course, I also rewrote those based on their C implementations, so it could be due to mistakes made on my part. Hopefully, since this is all open source now, others can evaluate the code and point out any mistakes I may have made.

In the mean time, I'll be working on the newer algorithms, in particular the CP-burstsort and the "bucket redesign" Burstsort. The goal there is to reduce the memory usage, without trading off substantially from the run time.

Sunday, September 7, 2008

Finally a new product that ships

During my last 18 months at Sun, I have been working on a project whose name changed several times, and whose purpose changed almost as often. For the time being, it is just another source forge, but at least it "shipped". For all of the projects that I have worked on that were breaking new ground, I believe this is the first one to see the light of day. Another remarkable quality of my part in this project is that I worked on just one area, source code management. The project is called Project Kenai, and at the moment you need an invitation to create a new project. Given it's relative newness, it's the sanest way to grow the site and scale the infrastructure to meet demand.

Let me give you a brief history of the project, from an insider's point of view. To start with, I was introduced to the newly assembled project as the fifth developer, to work with other developers poached from NetBeans-related products (Creator and Enterprise Pack). We learned we were building a web site, a developer collaboration site, it was eventually called. The plan was to build a prototype for the upcoming JavaOne conference, two months away. We were to learn a new language, Ruby, and a new framework, Rails. None of us had done anything quite like this before, and we going to do it in a very un-Sun-like manner. We were packed into a small room, told in rough terms what to build, and encouraged to use the "agile" development methodology. I use lower-case because we never really knew what Agile was, nor did we ever actually practice it.

The prototype came and went; the demo at JavaOne was cancelled. We got Craig McClanahan and started the entire system from scratch. Instead of a single Rails app that basically did nothing more than scrape HTML from other services (e.g. Hudson, Mailman, WebSVN, MediaWiki), we were to build a set of RESTful web services, all in Rails, each implementing one particular aspect of the system. Because of my familiarity with Subversion, I took on the source code management portion of the project.

Initially almost everything was written in Rails, except that we chose Sympa for the mailing list support, and being written in Perl, the mailing list web service was also written in Perl. My part was still Rails at that time, because Ruby support in Subversion was quite satisfactory. That changed, however, when we took on Mercurial support in addition to Subversion. My first attempt was to invoke the hg script from Ruby and capture the output. This led to dreadful performance and was very buggy. Of my own volition, and on my own time, I learned Python and Django and rewrote the SCM web service so I could properly support both Mercurial and Subversion.

And that was basically the last bit of coding I did for this project, several months ago. Since then I've been fighting all of the integration issues and exploring options for improving the broken deployment process (at this time, it consists of a set of poorly written shell scripts, invoked by hand on each of the production systems, in a long and complicated process). Not developing new code or solving interesting problems, while debating with management about priorities (yes, releasing is good, but having an infrastructure to get you there is at least as important), is very disheartening. Ultimately, this is the reason I have left Sun. The project shipped on the very same day I left the company; it shipped and I shipped out.

Saturday, May 31, 2008

Analysis of Java implementations of Fibonacci Heap

Some years ago, around 1997 or so, I wrote a Java implementation of the Fibonacci Heap data structure, as described in Introduction to Algorithms, by Cormen, Leisersen, Rivest, and Stein. A data structure such as the Fibonacci Heap is useful in graphing applications, such as the one I spent some time working on, GraphMaker.

Unfortunately, there were mistakes not only in my implementation, but in the pseudo-code published in the book. Due to the fact that my version was one of the first ever written in Java, and it was open source, it eventually spread to other open source software. A few months back, Jason Lenderman and John Sichi brought up an issue in the implementation via an email to me. In particular, John felt that the size of the degree array in the consolidate() method was too large. In fact, I had it set to the size of the heap, which meant the consolidate method had a running time of O(n). Oops, so much for the amortized O(log n) we were hoping for. After spending some time looking at other implementations, and studying the CLRS book, I realized that calculating the degree array size at all was a waste of time (n can never be greater than Integer.MAX_VALUE, and log base phi of that is 45). Terrific! The method was much faster now that it had an appropriately sized degree array.

Not being satisfied that everything was working perfectly, I proceeded to write a unit test that would stretch the heap to its limits. Inserting a large numbers of random numbers, and then extracting the minimum value would cause a massive heap consolidation. This yielded a problem, and it was bewildering. I couldn't for the life of me see what was going wrong. Then my wife, Antonia, came to the rescue. At the time she was between jobs and had some time on her hands, so she took a look at it and found that the original pseudo-code in CLRS was missing two important steps. Elated, I submitted a bug report to Dr. Cormen and subsequently the fix has made its way into the example Java source code in an upcoming edition of the book. However, Antonia was not the first to realize there was a problem in the pseudo-code. It seems that Doug Cutting wrote a version of Fibonacci Heap in Java for the Apache Nutch project, and it didn't have the problems that my wife had uncovered.

Curious what other Java implementations looked like, I found several and have collected some notes on their implementations. In particular, I was looking at the consolidate operation, which is the only complex bit of code in a Fibonacci Heap, as everything else is fairly trivial. The "array" referred to below is the degree array used to keep track of the root nodes by their degree.
  • Scott Cotton
    • Calculates array size using binary search of lookup table
    • Pre-fills array with nulls
    • Allocates an additional buffer for iterating root list -- can be very expensive
    • Rebuilds root list
  • Ashish Gupta
    • Calculates array size
    • Pre-fills array with nulls
    • Breaks when "w" or "nextW" is made a child
    • Rebuilds root list
  • Apache Nutch (removed in Nov 2007)
    • Calculates array size
    • Involves a hash table which kills performance
  • CLRS
    • Calculates array size
    • Pre-fills array with nulls
    • Breaks when "nextW" is made a child
    • Rebuilds root list
  • John Sichi
    • Degree array is of size N
    • Pre-fills array with nulls
    • Counts number of root list elements (R)
    • Iterates over root list R times, possibly wasting additional time
    • Does not handle the "w" and "nextW" issue
    • Rebuilds root list
It should be noted that all of the problems in Sichi's implementation are entirely my fault, as his code is a fork of my original implementation. Since our email discussion, Jason and John are aware of all of the bugs and their appropriate fixes.

In the process of analyzing the various implementations, I learned a few things.
  • Allocate a fixed-size array for the degrees of the root nodes. At most it will be 45 entries to hold log base phi of Integer.MAX_VALUE elements, and it's cheaper to create that than to perform a series of floating point operations to arrive at a number that is slightly smaller than 45 (e.g. it's 23 to hold 50,000 elements).
  • Do not fill the degrees array with nulls -- all arrays in Java are automatically initialized to zero/false/null.
  • Do not waste time rebuilding the root list at the end of consolidate; the order of the root elements is of no consequence to the algorithm.
  • Use the Cormen "splice in" technique in removeMin() to save significant time (see the Java implementation in the recent editions of the book, or the version in GraphMaker).
My new implementation, as found in GraphMaker, has all of these improvements, so take a look if you're at all interested. In all, this was a fascinating exercise for me. I hope you had a chance to learn something, too.

Wednesday, May 21, 2008

More Subversion vs. Mercurial metrics

I wrote a while back about the disk usage of Subversion versus Mercurial, but all of my examples were rather small. Today I managed to create a migration of a Subversion repository, as complete and intact as possible (with trunk, tags, and branches directories just like in Subversion). The repository contains 3526 revisions and 28977 files, stored in Subversion 1.4.6 and migrated to Mercurial 1.0, using a hacked hg convert to keep the TTB structure intact.

Keeping the entire revision history, the Subversion repository used 238 MB of disk, while the Mercurial repository (sans working copy) occupied 403 MB. Compare that with a "snapshot" repository (i.e. a repository with one revision) that consists of the latest version of all the files: Subversion 160 MB, Mercurial 229 MB. You can probably guess from these numbers that our repository has a lot of third-party code that we do not change frequently. Interestingly, in both cases, Subversion comes out ahead in terms of disk usage, by a significant margin.