Thursday, May 15, 2008

Starting Apache on Mac OS X

For the last year I've been using an ugly hack to have my custom compiled Apache started at bootup on my Mac OS X systems. Basically, and this is really bad, I would replace the /usr/sbin/httpd binary with a symbolic link to the version I compiled. Why is that bad, you might ask? Because if Apple ever updates their installed Apache, it will probably overwrite my version of httpd. Thanks to Google, I recently found a much better way, and wanted to document it as a blog entry all its own, rather than as a comment or sidenote to something else.

Say you've compiled your version of Apache and it's now installed in the default location of /usr/local/apache2, and let's say you want it to be started at bootup, effectively replacing the version of Apache that comes with Mac OS X. Here are the simplest steps I know of to do this, and without using a single hack.

  1. Before doing anything else, turn off the Web Sharing item in the Sharing panel of System Preferences, since in a few minutes, you'll be running your own Apache instance.
  2. Create the file /Library/LaunchDaemons/org.apache.httpd.plist with the contents shown below, using your favorite editor (e.g. emacs, nano, vi).
  3. Use the launchd service to manage the Apache instance by invoking the command: sudo launchctl load -w /Library/LaunchDaemons/org.apache.httpd.plist
And that will do it. You now have a managed Apache instance of your own creation, and it should be running on the configured port (e.g. 80). Now just make a request and check the logs to make sure it's really working as you expected.

Here's the plist file mentioned in the steps above:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>org.apache.httpd</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/apache2/bin/httpd</string>
<string>-k</string>
<string>start</string>
</array>
<key>RunAtLoad</key>
<true/>
</dict>
</plist>
If you ever want to turn off your version of Apache, you can invoke the launchctl command shown above, replacing "load" with "unload". In the meantime, you can manage the Apache instance using the apachectl command as usual (e.g. /usr/local/apache2/bin/apachectl graceful).
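If you'd rather generate the plist than type it by hand, here is a minimal sketch in Python using the standard plistlib module. It is only an illustration of the same three keys shown above; the function name build_launchd_plist and the HTTPD_PATH constant are my own, and the path assumes the default /usr/local/apache2 prefix.

```python
import plistlib

# Path to the custom-built httpd, as used in the post; adjust to your install prefix.
HTTPD_PATH = "/usr/local/apache2/bin/httpd"


def build_launchd_plist(label: str, httpd_path: str) -> bytes:
    """Return XML plist bytes equivalent to the hand-written file above."""
    job = {
        "Label": label,
        "ProgramArguments": [httpd_path, "-k", "start"],
        "RunAtLoad": True,
    }
    # sort_keys=False keeps the keys in the order they are written here.
    return plistlib.dumps(job, fmt=plistlib.FMT_XML, sort_keys=False)


if __name__ == "__main__":
    xml_bytes = build_launchd_plist("org.apache.httpd", HTTPD_PATH)
    # Round-trip through the parser as a sanity check before installing.
    assert plistlib.loads(xml_bytes)["RunAtLoad"] is True
    print(xml_bytes.decode("utf-8"))
```

Redirect the output to a file and copy it into /Library/LaunchDaemons with sudo; the launchctl step is unchanged.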

Friday, April 11, 2008

What's the best programming language?

The other day I came across an esoteric programming language called brainfuck. It gives new meaning to the term "esoteric". If you can write something practical in that, you're pretty damn hard-core. You also have entirely too much time on your hands. This made me wonder, "What is the worst programming language?", and I began collecting my thoughts based on my personal experience. But rather than focus on the negative, what about turning the question around and asking what is the best programming language? To start with, let me explain my recent experiences with a handful of programming languages.

In the past year at work, I've been involved in a project that involves mostly Ruby on Rails development. Coming up to speed with Ruby, after many years of Java programming, was an interesting experience. Ruby is certainly easy to pick up. Its syntax is quite elegant, more expressive than Java, and closures are an appealing alternative to inner classes. However, after reading through glyphobet's analysis of Ruby, I've come to appreciate how little I really learned. From that essay, and Gilad Bracha's rant on monkey patching, and this very insightful analysis by Alex Martelli, I'm strongly in favor of using Python over Ruby.

Getting back to my recent experiences, just when I felt comfortable with Ruby, it was time to shift gears and write an Apache module. Given that my C skills are pretty rusty, I opted to use mod_perl instead, as I had at least used Perl a few times in the last decade. This, as it turned out, was not nearly as easy as I had hoped. It wasn't long before I decided that Perl was a time-wasting abomination, having evolved in a rather poor manner over a period of many years. I'm by no means a language expert, but if I get confused by a language, and waste entirely too much time on the lousy syntax, then there is a problem.

As a reprieve, I had the opportunity to pick up Python. This came about because we wanted to integrate with a third-party tool written in Python. Since Ruby can't call Python code, I initially tried invoking the tool's shell script wrapper, capturing the output and parsing it. Not surprisingly, performance was dreadful, and it was unstable as well. To solve the problem, I broke down and started learning Python, as well as Django, in an effort to write a new web service to handle integration with this third-party tool. In a matter of weeks, I had a functional web service, complete with unit tests, and its performance was fantastic compared to the Rails service.

In addition to these scripting languages, I had the chance to write some Java code (again, another integration piece). Naturally that went very quickly since I have a decade of basically nothing but Java experience. But it isn't just about experience. The tool support for Java is simply phenomenal -- code completion, error-checking, and flawless refactoring are amazing time savers. A little experience and great tools make Java a very productive language.

So what is the best programming language? To make the selection a little easier, let's consider the popular languages, those in the top 20 on the TIOBE list. Having written code in 11 of them myself, I can honestly say that Java, in my opinion, is the major contender. Its syntax is much easier to learn than most others, it has incredibly good tool support, and the fact that you can run it just about anywhere (enterprise, workstation, mobile phone, sensors, etc.) makes it very versatile. Granted, Ruby and Python are fun to learn and have some advantages, but they come at a significant cost that makes me appreciate the durability of Java.

Thursday, April 10, 2008

Hadoop Summit 2008: My Take (Part IV)

Continuing on with the user testimonials, Steve Schlosser (of Intel?) spoke on using Hadoop to create graphs of ground motion models, using interpolated data generated via two different systems. A technique he had learned was to massage the key values returned from the Mapper to trick Hadoop into naturally gathering tuples for the reduce stage. He reminded everyone that while MapReduce sounds simple, it's actually a number of stages: InputFormat, Map, ManipulateKey, Bucket, Combine, Shuffle, Sort, Reduce, and finally OutputFormat. Their cluster, by the way, consisted of 50 8-core blades with 8GB memory and 300GB disk. That's a pretty hefty set of machines, but I think they had a lot of calculations to run through to create their pretty pictures.
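To make the key-massaging trick a bit more concrete, here is a toy, in-memory sketch in Python. It is not Steve's code and not Hadoop's API; the tile size, the sample layout, and every function name are assumptions of mine. The idea is only that the mapper emits a coarser key (a tile of the grid) so that the framework's grouping step delivers neighbouring samples to the same reduce call.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

TILE = 10  # hypothetical tile width in grid units

Sample = Tuple[int, int, float]  # (x, y, measured value): an assumed layout
TileKey = Tuple[int, int]


def map_sample(sample: Sample) -> Iterator[Tuple[TileKey, float]]:
    """Emit the tile a sample falls in as the key, so neighbours share a key."""
    x, y, value = sample
    yield (x // TILE, y // TILE), value


def shuffle(pairs: Iterable[Tuple[TileKey, float]]) -> List[Tuple[TileKey, List[float]]]:
    """Stand-in for the framework's bucket/sort steps: group values by key."""
    groups: Dict[TileKey, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_tile(key: TileKey, values: List[float]) -> Tuple[TileKey, float]:
    """Average all samples that landed in one tile."""
    return key, sum(values) / len(values)


def run_job(samples: Iterable[Sample]) -> List[Tuple[TileKey, float]]:
    """Run map, shuffle, and reduce in sequence over an in-memory input."""
    mapped = (pair for sample in samples for pair in map_sample(sample))
    return [reduce_tile(key, values) for key, values in shuffle(mapped)]
```

Calling run_job with a handful of samples returns one averaged value per tile; changing what map_sample emits as the key changes which tuples end up together, without touching the reduce side.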

Mike Haley from Autodesk got a turn to tell how he uses Hadoop in an effort to simplify the cataloging and searching of building materials. At least I think that's what it was about. He spent a lot of time talking about building materials and just a couple of minutes on Hadoop, so I kinda missed the point of it. Still, it was interesting; it's not often you get a chance to take a peek into a completely different industry.

To provide a perspective on Yahoo's use of Hadoop, Christian Kunz spoke on the WebMap change-over. He apparently was not slated to speak for this presentation, as the schedule shows Arnab Bhattacharjee as the speaker. Consequently, the talk was brief and it was obvious Christian was not at all comfortable in front of a large audience. In short, Yahoo went from a set of in-house scripts and programs, developed over a number of years, to Hadoop and saw a lot of advantages and improvements. The talk was literally 10 minutes, so there wasn't much else to it.

Next up was a guy from Google, who spoke briefly on the problem of hiring people who know anything about parallel computing, much less MapReduce. Well, yeah, duh. Most colleges aren't teaching distributed/parallel computing techniques, and except for Yahoo and Google, who uses MapReduce? He then introduced Jimmy Lin from the University of Maryland, who spoke on a new course being offered at the university in which students use Hadoop to solve interesting problems. At this point in the day, I was a bit tired and wasn't really getting much from these last presentations. Interesting, yes, but not remarkable in my opinion. But I liked the irony that Google has a hard time finding people who are ready to dive into M-R, and their solution is to use their competitor's open-source clone of their very own infrastructure. Come to think of it, the students aren't just learning M-R, they are learning Hadoop, so it's just as likely they would get snatched up by Yahoo! upon graduation.

To wrap up the exciting day, a panel of folks from Yahoo, Powerset, and Mahout spoke on the future direction of Hadoop. For instance, the core developers want to eventually have Hadoop auto-configure itself, but still have options for power-users to tweak the settings (my thought was of the Java VM, which has dozens of settings for fine tuning its behavior, but out of the box, it automatically adjusts itself to suit the environment). Another wish was to support Kerberos for protecting the HDFS data; right now anyone can read anything in the system, which is not good for privacy and isolation. Another comment was that building the community is a challenge. Hopefully as committers become more senior, contributors can be mentored into committers and help foster new involvement from others.

Well, that was it, the whole Hadoop Summit in a nutshell. There was a happy hour but I had to get home to my family, so I gave it a miss. For additional reading material, check out these: Hadoop Summit - Best in Show, At the Hadoop Summit, Visited Hadoop Conference, Hadoop Summit Notes.

Wednesday, April 9, 2008

Hadoop Summit 2008: My Take (Part III)

After an exciting morning, we all had the chance to partake in a free lunch. I had the (mis)fortune of sitting next to a gentleman from CMU whose job was to travel the world and meet researchers to learn the latest in cool inventions. Probably there was more to it than that, but that was the part I understood. He and two other gentlemen my age that he had met earlier in the day pontificated on the future directions of technology and its effects on our lives, while I struggled to keep up and not look dumb. It's hard being a heads-down sort of techie and finding yourself thrust into a conversation with people whose job is to think big.

After that minor ordeal, the technical presentations started up again with Michael Stack explaining the virtues of HBase. One of his early remarks was that HBase doesn't have any of that sissy RDBMS stuff. Well that's good, we wouldn't want to rile the DB folks, who already have a hard time grasping the obvious (to be fair, they have a follow-up article that is at least a little bit better than their original post). One interesting revelation was that HBase ran significantly slower than the numbers published in the BigTable paper, as little as 20%. There was speculation that HDFS was to blame. That Hadoop is still a little immature is not a surprise, and clearly performance is not yet a major concern.

Next up was Bryan Duxbury of Rapleaf, who spoke of their use of HBase and Hadoop. Bryan is the one responsible for the Ruby support in HBase, by the way. He primarily discussed the performance of HBase. They are using a 64-node cluster, with a total of 2TB disk space and 64GB memory. After some experimentation, Bryan found that the compression built into HBase was a little slow, and that compressing the data in the client helped (less data going over the network, less processing for HBase when data is stored as-is).
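The client-side compression idea is easy to picture with a small Python sketch. To be clear, this is not Rapleaf's code and it does not use any HBase client API; InMemoryTable is a hypothetical stand-in that just stores whatever bytes it is handed, and zlib is simply the compressor I picked for illustration.

```python
import zlib
from typing import Dict


class InMemoryTable:
    """Hypothetical stand-in for a table client; it stores bytes exactly as given."""

    def __init__(self) -> None:
        self._rows: Dict[str, bytes] = {}

    def put(self, row_key: str, value: bytes) -> None:
        self._rows[row_key] = value

    def get(self, row_key: str) -> bytes:
        return self._rows[row_key]


def put_compressed(table: InMemoryTable, row_key: str, value: bytes) -> None:
    """Compress in the client so fewer bytes travel to, and sit in, the store."""
    table.put(row_key, zlib.compress(value))


def get_decompressed(table: InMemoryTable, row_key: str) -> bytes:
    """Fetch the stored bytes and undo the client-side compression."""
    return zlib.decompress(table.get(row_key))
```

The store never knows the value was compressed, which is the point: it keeps the data as-is and the client pays the CPU cost.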

Speaking of databases, the next talk was by two developers at Facebook who are working on Hive. It sounded to me like another database implemented on top of HDFS, with yet another query language. The good news was it resembled SQL and supports streaming (to programs written in languages other than Java). As far as I can tell, the project is not open source, so this is of little interest to me. Sure, I like hearing about these projects, but as with the Microsoft presentation, it's all just "research" until you share your work with others to enable active discussion and collaboration.

Jinesh Varia from Amazon then presented on their use of Hadoop in relation to Amazon Web Services. He primarily spoke of the architecture of EC2 and SQS. A couple of interesting points were that SQS uses message queues to deal with machine failure, and S3 is a bottleneck when used from EC2. No word on how they are planning to address that. I got the impression that Amazon uses their own infrastructure extensively, with basically everything running on EC2. Definitely an interesting talk by an energetic speaker.

I'll finish up this mini-series on the Hadoop Summit in my next post.

Saturday, March 29, 2008

Hadoop Summit 2008: My Take (Part II)

After the initial introductory speakers, the conference quickly became very technical with a preso by Chris Olston of Yahoo! on Pig, the system for performing data analysis on large quantities of data distributed over a Hadoop cluster. It's basically the equivalent of Sawzall, if you're at all familiar with Google's technologies. One of the instigators for a system such as Pig is that many times developers are writing the same sorts of joins, sorts, and merges over and over again. Chris says there's even a mailing list at Yahoo for sharing M-R snippets for such tasks. Pig makes it very easy to describe what data you want, and what you want to do with it. It compiles your Pig Latin script into a set of M-R applications and runs them on the cluster. A point that Chris stressed is that Pig Latin is not a query language, but rather a data flow description. Because it has an imperative style, the order of actions is better defined than in, say, SQL.
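For a rough feel of what "data flow description" means, here is a sketch in plain Python (not Pig Latin, and nothing to do with Pig's actual implementation). The schema and function names are made up; the point is only that each step is named and runs in the order written, rather than being folded into a single query.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Visit = Tuple[str, str, int]  # (user, url, seconds on page): a hypothetical schema


def load(rows: Iterable[Visit]) -> List[Visit]:
    """Step 1: materialize the input."""
    return list(rows)


def filter_long_visits(visits: List[Visit], min_seconds: int) -> List[Visit]:
    """Step 2: keep only visits of at least min_seconds."""
    return [visit for visit in visits if visit[2] >= min_seconds]


def group_by_url(visits: List[Visit]) -> Dict[str, List[Visit]]:
    """Step 3: gather visits under their URL."""
    groups: Dict[str, List[Visit]] = defaultdict(list)
    for visit in visits:
        groups[visit[1]].append(visit)
    return dict(groups)


def count_per_group(groups: Dict[str, List[Visit]]) -> Dict[str, int]:
    """Step 4: reduce each group to a count."""
    return {url: len(rows) for url, rows in groups.items()}


def long_visit_counts(rows: Iterable[Visit], min_seconds: int) -> Dict[str, int]:
    """The whole flow, with the steps applied in the order they are listed."""
    return count_per_group(group_by_url(filter_long_visits(load(rows), min_seconds)))
```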

I found Chris' talk so engrossing that I wanted to join Yahoo! just to work on Pig. He's a very good speaker and Pig strikes me as a very practical, and yet easily overlooked, part of the Hadoop system. For some reason that I have yet to understand, I fancy working on systems like this, the underlying infrastructure that most people (as in users, not developers) never see or give any thought to. Sure, working on GMail would be cool and all, but I'd rather develop something like Pig.

Following that was a talk by Kevin Beyer of IBM, on JAQL, the JavaScript version of Pig. Well, that's not entirely fair, but I did get the sense they were very similar. Same sorts of basic operations, the difference being the syntax for JAQL is basically a cross between JavaScript and JSON. There may have been some advantages one way or the other, but they were subtle. In any case, both JAQL and Pig work closely with Hadoop and do essentially the same things.

Next up was possibly the most "interesting" talk, by Michael Isard from Microsoft. Yeah, that Microsoft, the one that tried (and is still trying) to buy Yahoo! one way or another. He described his research project Dryad LINQ, or at least I think that's what he works on. His description of Dryad was that of a system that analyzes a graph and performs a set of tasks over a distributed system. The graph describes the tasks, the data flow, and dependencies. It's a very different approach than M-R, more general purpose. Naturally Dryad made a few trade-offs to improve overall performance. For instance, he believes in general Dryad performs very well, but its failure handling is rather inefficient. Hmm, interesting approach. I believe Google made the realization that in a large enough cluster, you are always going to have failures, so you had better deal with them gracefully and efficiently. And naturally these points came up during the Q&A section, to which he responded that he needs to come up with some comparison numbers. When asked if he'd looked at Hadoop in terms of performance, he flat out said "no". Not a surprise there; frankly I'd be surprised if he even ran Hadoop once, let alone read any of the Google papers.

It goes without saying (but I'm saying it anyway) that all of this is implemented in, and on top of, Microsoft technologies (e.g. .NET, Windows). And you can surely bet that because it's still in the research group, it will be a while before it sees the light of day, and it will most certainly not be open source in any reasonable way. One really funny part was some shill in the front row said "Well, I think judging by the reaction in the room, you're kicking everybody's butt, congratulations." Um, yeah, I don't think it was at all obvious that Dryad was better than Hadoop. They made certain choices and ended up with a very different system, with very different performance characteristics and features. It seemed to me that each node in Dryad was some arbitrary program, so they forfeited all of the advantages that M-R provides. Also, there's no distributed file system (his actual response to a question from the audience). Presumably everything is stored in an SQLServer instance. Like I said, I really don't see how that's better than Hadoop.

For a pleasant change of pace, the next talk was about X-Trace, given by Andy Konwinski from UC Berkeley. He and Matei Zaharia (presumably) created hooks in Hadoop to enable monitoring events in the system as they occur within the cluster. Andy had a very appealing self-deprecating style, and made a few jokes about pretty graphs and dumb programmer mistakes, which warmed up the room after the rather dry and strange talk about Dryad. For instance, he and his colleagues used X-Trace to identify a silly configuration mistake they had made in Hadoop. They had set up 30 map workers but left the number of reducers at the default of 1, which caused their sample job to run for hours longer than it should have. This became blindingly obvious when they rendered a few graphs to show what was going on in the cluster. Clearly if you're running into problems with Hadoop, X-Trace would be an excellent debugging tool. Andy gave another example in which a graph made very clear that one machine in particular was having disk problems, of a sort that impacted performance without necessarily taking the machine out of working order (and thus out of the cluster).
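Why a single reducer hurts so much is easy to see with a little arithmetic. The sketch below is plain Python, not Hadoop or X-Trace code; it assumes keys are assigned to reducers by a hash of the key modulo the reducer count, and uses crc32 only so the result is stable from run to run. The function names are mine.

```python
import zlib
from collections import Counter
from typing import Iterable


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; crc32 keeps the choice stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_load(keys: Iterable[str], num_reducers: int) -> Counter:
    """Count how many keys each reducer would receive."""
    return Counter(partition(key, num_reducers) for key in keys)
```

With num_reducers set to 1, every key lands on reducer 0 no matter how many mappers produced the data, so the whole reduce phase runs serially on one worker. Raising the count spreads the same keys across several reducers.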

For the last talk before lunch, Ben Reed presented on the ZooKeeper project, which is both a distributed lock manager of sorts, and a distributed file system for very small files (everything is kept in main memory). Its purpose is to facilitate configuration of the nodes in a cluster, enabling them to elect leaders and define membership, as well as serving as a name server. It's actually very similar to Google's Chubby lock service, just written in Java. Everything is stored in memory for fast response times, with a disk-based log, I assume for reconstructing the data if the node is restarted. The ZooKeeper team found that a system consisting of about three to five nodes works best. Fewer and reliability goes down; more and performance becomes an issue as the leader tries to keep all of the servers up-to-date.
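The three-to-five recommendation makes more sense with the numbers in front of you. This small Python sketch assumes a simple majority-quorum model, which is how ZooKeeper is generally described but is not something my notes from the talk spell out; the function names are mine.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for size in (1, 3, 4, 5, 7):
        print(size, quorum_size(size), tolerated_failures(size))
```

Under that assumption a single server tolerates no failures, three tolerate one, and five tolerate two, while going from three to four adds a server without adding any tolerance. Each extra server is also one more that the leader has to keep up-to-date.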

Then there was lunch, which I'll continue with in the next installment.

Thursday, March 27, 2008

Hadoop Summit 2008: My Take (Part I)

On Tuesday I attended the first Hadoop Summit held in the TechMart building in Sunnyvale. It was a terrific experience, much better than I had expected. Admittedly, my expectations had been tempered by such events as the first NetBeans Day, wherein many of the people that showed up were only there to see James Gosling speak for the first few minutes, after which no more than ~50 people came through (and half of them were Sun employees). It probably didn't help that they were competing with JavaOne as well, which they have corrected since then.

But getting back to the Hadoop Summit. Everything worked very well. Parking was easy, registration was a breeze, and the staff was friendly and accommodating. I asked politely about whether water would be supplied and they quickly chased down the TechMart folks to locate the bottles -- they were already in the auditorium waiting for us. Strike one for the dumb attendee. The breakfast spread was as good as any conference I'd been to.

Upon entering the auditorium, it was immediately evident this was an event for computer nerds. There were multiple WiFi routers sitting on a table on one side of the room, and power strips were on the floor beneath every row of chairs. Wow. Too bad the WiFi connections were flaky -- I got an IP address for only a few minutes, then lost it. Someone came up and asked if the Internet connections were working for me, which gave me the sense I was not the only one suffering from bad connectivity. No matter, I was only going to take notes anyway.

By the time Anjay was ready to open the presentations at 8:55, the room was nearly full. On the wall was a sign stating that the maximum occupancy was 299. I'd say it was a safe bet that about 300 people were crammed into the room, as there were folks standing along the walls.

The first to speak was Doug Cutting, the guy who created Hadoop as a subproject of Lucene. His opening line was "Are you sure you are all in the right place, there's an awful lot of people here." He gave the history of Nutch and how Hadoop got started. Getting the history of it all was fascinating; I always like hearing how projects get started. Another fascinating aspect of Hadoop is that it has taken barely two years to go from an almost-nothing subproject to a large project with a conference consisting of hundreds of attendees. I guess that means distributed computing is more appealing than Java IDEs. That or Yahoo does a better job organizing these things than Sun. In all fairness, subsequent NetBeans Days have grown exponentially, and are much, much, much better than the first one.

Okay, back to the Hadoop Summit. Next up was Eric14 (some bizarre nickname because his high school friends couldn't pronounce Baldeschwieler), who explained how Hadoop is being used and developed within Yahoo. It's growing very fast and they are struggling to hire people who have any experience with distributed computing, let alone Hadoop in particular. He says they have tens of thousands of nodes (i.e. machines running in clusters, in case you are not familiar with this overloaded term) and each machine typically has about 8 cores. Yeah, that's a lot of horsepower, but Yahoo must feel they can make good use of it with Hadoop, which supports many MapReduce jobs running in parallel. I believe Google stated they use systems with 4 cores, each with two IDE disk drives.

Eric was curious what the attendees were doing in terms of their Hadoop usage, and asked for an informal poll. There were many people, a little less than half, that had at least 20-node clusters, whereas only a few attendees were running clusters with more than 100 nodes. I was surprised; I didn't expect that many folks to be running such sizable clusters.

There's lots more to say, so breaking this up into several parts makes sense to me. Look for at least three more entries, roughly along the lines of presentations given, with my impression on each of them.

Tuesday, March 18, 2008

Geoworks: the inspiration

From time to time I have thought about writing of my experiences at my first employer, Geoworks. It was a fun place to work, there were a lot of really smart developers, and I met my wife there. The company is gone now, although there's a website holding on to the domain. Wondering if anyone else was writing about their Geoworks experiences, I came across a couple of entries. The last one is pretty amazing, a beautiful example of how I wish I could write. I never met Dave, but I did get to work closely with Stevey (I just called him Steve) for two weeks in Bartlett, Tennessee in the summer of 1997. Like everyone else at Geoworks, when Dave died, I sent my condolences to Steve and Mike, who was also a friend of mine. At the time I wanted to say something meaningful, hoping to help them feel even just a little better. I knew it was impossible given I didn't know them very well, but it was what I wished for. Even now, more than 10 years later, I can't think of what to say. At this point, I'm just glad Steve didn't "off" himself. He is an inspiration in more ways than one. In addition to being a thoughtful and entertaining writer, he's also a talented speaker.

Reading Steve's blog has forced me to evaluate how I spend my time on the computer. He has a knack for putting what I've often taken for granted into an enlightening perspective. What's interesting to me is that I didn't discover his blog until after I ventured outside of the Java realm. Working with nothing but Java for a decade, and not just Java but Java tools, left me so insulated from everything else going on in the industry that I was completely unaware of Rails, del.icio.us, or the many clever people blogging on every conceivable topic, including Steve.

Cheers to you, Steve, for surviving, thriving, and inspiring others to be more than they started out being. If you ever read this, please forgive my lamentable writing, but know that you are the single most remarkable person I ever met at Geoworks, and we only barely knew each other. Thanks.