2007-05-29

Against Centralized Source Control

Linus Torvalds gave an hour-long talk at Google about the Git source control system.

The most important points seem to be:
  1. git users are all peers, each with a complete copy of the tree. There is no central repository and no commit access to hand out (see the sketch after this list).
  2. git focuses on hashing and merging sets of files really, really fast.
  3. git can't scale above 1 million files, but Linus believes that projects with that many files should be split into multiple projects for other reasons.
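
To make the first point concrete, here is a toy Python sketch of what "everyone is a peer" means. This is mine, not anything from the talk, and the Repo class is made up purely for illustration: each clone carries the complete object store, so any peer can pull whatever it is missing directly from any other peer, with no central server in the picture.

    class Repo:
        """Toy stand-in for a git clone: a complete, self-contained object store."""

        def __init__(self):
            self.objects = {}   # object id -> content; the whole history lives here

        def add(self, obj_id, content):
            self.objects[obj_id] = content

        def pull(self, other):
            # Copy only the objects we are missing. Any peer can serve any other
            # peer; no repository is privileged over the rest.
            for obj_id, content in other.objects.items():
                self.objects.setdefault(obj_id, content)

    alice, bob = Repo(), Repo()
    alice.add("patch-1", "fix the frobnicator")
    bob.pull(alice)             # bob now holds a complete copy too
    assert alice.objects == bob.objects

Real git names objects by content hash rather than by a label you pick yourself, which is what makes the tamper detection discussed below possible.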

Linus heaps abuse on CVS, Subversion, and Perforce, because he thinks they've been focusing on the wrong problem. He believes that distributed source control systems are inherently more robust and secure than centralized ones. Linus is relentless in his pursuit of technical excellence and doesn't hesitate to label competing projects as "stupid and ugly."

Speaking as someone who has a lot of experience using one particular centralized system, Perforce, I can say that when our central server went down, we couldn't get any work done. Git users, on the other hand, can keep working on patch sets even when the network or the main server is down, and can even merge two patch sets while offline.

git also provides some security against malicious attempts to corrupt the repositories people pull from. Because git checksums everything it stores with the SHA-1 cryptographic hash, people who are syncing up (or "pulling") a git branch should notice if that branch has been tampered with by someone going outside the source control system.
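
As a rough illustration of what that checksumming buys you (this is the blob-hashing scheme as I understand it, not code from the talk, and git_blob_id is just my own helper name): git stores each file as a "blob" object named by the SHA-1 of a short header plus the file's contents, so changing even one byte behind git's back changes the name the object should live under, and anyone who recomputes the hashes will catch it.

    import hashlib

    def git_blob_id(content):
        # git hashes a "blob <size>\0" header followed by the raw file contents
        header = b"blob %d\0" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    original = b"int main(void) { return 0; }\n"
    blob_id = git_blob_id(original)

    # Someone edits the stored file outside of git's control...
    tampered = b"int main(void) { return 1; }\n"

    # ...and anyone recomputing the object's hash sees the mismatch.
    assert git_blob_id(tampered) != blob_id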

One thing that is overlooked in this talk is that most corporations are also interested in securing their closed-source code against attempts to steal it. This is not an issue for Linux; in fact, Linus wants as many people as possible to mirror the kernel, so that if the main server's disks ever die, he can copy the tree right back from the mirrors. "I don't do backups," Linus states flatly.

Linus also makes some jabs at Perforce that really hit home. He complains that all clients in a centralized source control system have to share the same namespace, so they often have to be given funny names to keep them from conflicting. Rather than naming them purely by what changes they contain, people must put some unique identifier in the client name in case someone else has a similarly named client. This is not a big deal at a small company. But try scaling that up to a really large organization with multiple sites, or to an open source project, and it becomes difficult.

Interestingly enough, Linus states that his parting with the BitKeeper people was amicable. He seems to respect their product, and says that using BitKeeper really showed him what a source control system should be. I guess a lot of people in the open source community were upset that Linus was using a tool that was not itself open source to maintain his open source kernel. I haven't really looked into the debate much, but it's still interesting to get Linus' angle.

Linus states that git reduces political pressures because he doesn't have to decide who gets commit access. I suppose what he really means is that it makes it easier for him to have degrees of trust. Rather than facing a binary decision-- give this person commit access, or not?-- he can just pull patch sets from people with different levels of scrutiny depending on how much he trusts them. Ultimately, of course, Linus must decide what goes into the official tree, aka "Linus' Tree," and that will always involve some amount of politics.

I will have to try out git the next time I get a chance. It just seems like one of those tools that will make an impact on the way people work, even if they don't end up using it directly. Obviously, most commercial projects are not like the Linux kernel, where thousands of functionally anonymous contributors submit bug fixes and patches. But a lot of big companies have multiple sites, and if those sites are going to develop software together effectively, they're going to need tools like git.

2007-05-09

Google Hell

There was a recent story in Forbes about "Google Hell."

Basically, in the course of indexing the world wide web, Google deems certain pages to be low quality or bogus. These pages are then put onto the "supplemental index," or, more colorfully, "Google Hell." Pages in Google Hell are ranked very low-- and once you land there, you tend to stay for a while.

It's easy to understand why Google created the supplemental index. They did it to improve the quality of their search engine. We've all seen the hordes of fake sites, containing nothing but links to other sites-- usually sites pushing questionable or illegal products. Google probably felt that they had to create Google Hell in order to avoid getting bogged down by these armies of bogus sites.

Of course, from the merchants' perspective, things are quite different. Each individual merchant needs Google much more than Google needs him. There is no question about who holds the cards in this relationship. Just as being located along a major highway can make or break a restaurant, being on the first page of results for "diamonds" can make or break an online store.

Search engine rankings are so important to companies that a whole industry called "search engine optimization," or SEO, has sprung up around them. SEO consultants try to find ways to boost a site's search engine rankings, using techniques such as improving the site's layout and generating positive "buzz" for it. There is also a darker side to SEO. Some consultants set up spam sites with thousands of links to their clients' sites. These spam sites are unsightly and of no use to viewers.

Of course, from Google's or Yahoo's perspective, SEO is always bad. They would rather have sites ranked according to how useful they are to viewers, not according to how much the site spent on SEO. And none of the money spent on SEO goes to the search engines themselves. So the search engines and the SEOs are playing a game in which each tries to outsmart the other. Sometimes, when SEOs get too aggressive, they end up landing their clients in Google Hell by using techniques Google considers to be spammer tactics.

This also raises the question: if you are an unscrupulous businessman, why not pay some SEOs to get your competitors into Google Hell? If all it takes is a few links from spam sites pointing at the victim, it would seem like a no-brainer. I guess this is just another part of the arms race between the SEOs and the search engines.

I find this story interesting because it highlights the gap between people's expectations and reality. People think of search engines as impartial librarians showing them a representative sample of what exists on the web. But in fact, the search engines have no such obligation. There is no law saying that Yahoo, or Google, or Microsoft, or any other company, cannot ask for cash in return for good search engine placement, or even edit results for political reasons.

There be no law upon the high internets. Yarr.

The only protection we have as consumers is a handful of vague promises, plus companies' deeply felt desire to avoid controversy. That, and the fact that there is more than one search engine to choose from. I suppose that will just have to do for now.