2006-04-29

The Most Popular OS in the World

The most popular operating system in the world, going by the number of units installed, is not Windows or Linux or anything like that, but a small embedded OS called ITRON.

I guess that makes sense. People don't usually own more than one computer, but they often own dozens of electronic gizmos, many of which run this embedded OS. Likewise with corporations-- the number of small embedded computers far outweighs the number of big ones.

2006-04-23

SCIDs

Yesterday, we hosted a party here at Forbes house. It's interesting being the host. It's been a good carnival.

Anyway-- I was reading this guy's site about SCIDs (source code in database systems).
It's at http://mindprod.com/projects/scid.html

In his words:
We have been teaching our customers to regard their data as a precious resource that should be milked and reused by finding many possible ways of summarising, viewing and updating it.
However, we programmers have not yet learned to treat our source code as a similar structured data resource. This is an enormous project, but you could start small. The basic idea is you pre-parse your code and put it in a database.


This is an ambitious idea, indeed. He's proposing that we change the whole source-code-as-text paradigm that's been in place-- well-- ever since punch cards died out decades ago. In its place, he would put a database which programmers could modify in various ways.

The ultimate goal is to come up with a better way of programming than just writing text on a screen. You could use pointing and clicking to design GUIs, or write little scripts to find similarities in various parts of the database. With an editor that truly understands the language it is designed for, the possibilities really are limitless. In essence, everything the compiler knows, the editor now knows.

There are a lot of potential problems here. The biggest practical problem is really that of vendor lock-in. Using plain old text for source files is a well-known and well-understood standard. If a company moves to some proprietary SCID, there's no guarantee that it will be able to export its code in a reasonably readable and usable form to another SCID. Why would any vendor make this easy? I guess reverse engineering is sometimes legal, but companies could set up a number of copyright and patent roadblocks under the current system that would effectively trap any company switching to a proprietary SCID. If I were a manager contemplating this, I would be very afraid.

Secondly, for those of us using unix, text is basically the lowest common denominator of the whole system. Traditional unix tools like grep, awk, cat, and find don't work on databases. I guess you could write equivalents of these tools for your new database (something like the sketch below), but that would be a lot of work.
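Just to make that concrete for myself, here's a toy sketch of what a structure-aware "grep" might look like once the code lives in a database. Everything here is invented for illustration-- the Symbol record and the little in-memory table are stand-ins for whatever schema a real SCID would use, and none of it comes from the mindprod page.

    // Toy sketch: querying a pre-parsed "code database" instead of grepping text.
    // The Symbol record and the in-memory table are invented for illustration.
    #include <iostream>
    #include <string>
    #include <vector>

    struct Symbol {
        std::string name;   // identifier, e.g. "openConnection"
        std::string kind;   // "function", "class", "variable", ...
        std::string file;   // file where the definition lives
        int line;           // line number of the definition
    };

    // Stand-in for the database: a flat table of pre-parsed symbols.
    const std::vector<Symbol> code_db = {
        {"openConnection", "function", "net.cpp", 42},
        {"Connection",     "class",    "net.h",   10},
        {"retryCount",     "variable", "net.cpp", 57},
    };

    // A "grep" that matches on structure (kind) as well as name,
    // which a plain text search cannot do reliably.
    void find_symbols(const std::string& name, const std::string& kind) {
        for (const Symbol& s : code_db) {
            if (s.name.find(name) != std::string::npos && s.kind == kind) {
                std::cout << s.file << ":" << s.line << " "
                          << s.kind << " " << s.name << "\n";
            }
        }
    }

    int main() {
        // Roughly the structured equivalent of: grep -n "Connection" *.cpp *.h
        find_symbols("Connection", "class");
        find_symbols("Connection", "function");
    }

The point is just that once the code is structured data, a query can ask "show me the classes whose names contain Connection" instead of matching raw text and hoping for the best.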

Finally, in a more philosophical sense, the SCID philosophy tends to run counter to the traditional "separation of function" philosophy under which the compiler, text editor, and revision control system were all separate entities. The old philosophy wasn't perfect, but it tended to give the authors of all of these tools a pretty good idea of what their jobs should be. In a very real sense, implementing a high-quality SCID forces us to reconsider the real-world problem of division of responsibility. Should the CVS or Subversion people work with the SCID people on a new feature they want to add? Does the GCC team need to meet with the text editor team before releasing their new revision? It's even possible to argue that only the biggest, most monolithic companies have the resources to produce something with as much vertical integration as a true SCID. And these are not necessarily the organizations we want in control.

As the author mentions in closing, some SCID-like systems have already been built. Perhaps Eclipse is the most famous one. I guess IDEs (integrated development environments) of all kinds could slowly evolve into SCIDs. Generally, most IDEs don't provide very much meta-programming functionality right now. It will be interesting to see how this plays out in the future. As I mentioned before, this may not be a very good research topic for academics, because the players with the resources to make these kinds of systems a reality are the big guys like Microsoft and IBM. As with so many other problems the software industry faces today, it's a (lack of) infrastructure problem.

2006-04-10

linear methods of image classification

I attended a VASC talk today at Carnegie Mellon. Aleix Martinez presented "The Secret Life of Linear Methods: Why Linear Methods Work, or Do Not."

I guess the main idea here was that we are looking for heuristics to classify real-world data. Usually it's image data, but the example of genes was also mentioned. The data is expressed as a vector, and then we choose some heuristic like LDA and have at it. Aleix mentioned that as an undergraduate, he was frustrated by the poor performance of a robot he built to navigate hallways and such, which used LDA.

Despite this undergraduate experience, Aleix is a big fan of linear methods because, he points out, they are much more intuitive than non-linear ones. I can testify to this-- I was able to follow (most of) the math, after taking a reasonable class in linear algebra. I'm not confident that I would have the same understanding if the topic were non-linear methods. I need to brush up on my statistics, though.

Anyway, he presented some nifty methods of enhancing LDA. Apparently the main thrust of LDA is that we would like to know which dimensions in our n-dimensional dataset are "the important dimensions." The important dimensions are the ones that can distinguish the data points of one class from those of another. The implicit assumption here, of course, is that such dimensions exist. Apparently plain vanilla LDA can become confused by certain datasets and produce an incorrect result. His method involves deciding whether LDA would become confused and, if so, partitioning the datasets. Man, I hope I got that right! There are probably some details I'm missing or getting wrong.
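For my own notes, here is the basic Fisher/LDA criterion as I remember it from class, in my own notation (so any mistakes are mine, not Aleix's): find the projection direction w that spreads the class means apart relative to the scatter within each class.

    % Within-class and between-class scatter matrices
    S_W = \sum_{c} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^{\top},
    \qquad
    S_B = \sum_{c} N_c \, (\mu_c - \mu)(\mu_c - \mu)^{\top}

    % Fisher criterion: maximize between-class over within-class scatter
    J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w},
    \qquad
    \text{the maximizer satisfies } S_B \, w = \lambda \, S_W \, w

So the "important dimensions" come out as the leading generalized eigenvectors of (S_B, S_W), and you can see how this breaks down when no single linear direction separates the classes-- which, as I understood it, is exactly the situation his partitioning trick is meant to detect.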

I hope I didn't stand out too much at the talk. It was a pretty small audience, and the posted start time was ten minutes early, so I ended up arriving too early. I think they would probably just assume that I'm an undergraduate, though. Which is pretty much correct.

Anyway, it was an interesting talk. I thought Aleix was pretty down-to-earth, despite the complexity of the topic involved. And as any EE student knows, linear methods are still worthy of respect, even in this day of desktop supercomputers.

He mentioned that his current funding is coming from the NIH. Maybe there's some overlap with all the biotechnology stuff that's been going on. I've been hearing that biotechnology is going to be huge in the future. I guess with our aging population, and a rising standard of living... Definitely worth thinking about. Also, I'm going to grab some stats books next time I visit the library. I need to get at least Gaussians down cold.

Edit: Arthur pointed out to me that in image processing, image data is generally expressed as a vector, not as a matrix.

2006-04-01

I recently read a long discussion about garbage collection in C++.

It's an interesting discussion to me, because I thought most experts already agreed that C++ was extremely hostile to garbage collection. And most of the C++ programmers involved in the thread seemed at least a little bit skeptical. Possibly some of them take pride in being skillful enough to "manage their own memory"-- unflattering comparisons to Java were common.

Driving the discussion were the noted C++ expert Herb Sutter and a grad student named Andrei Alexandrescu. It was funny to see them sparring with some of the old-line C++ programmers like Jeremy Coffin.

One behavior C++ has now which is kind of desirable is that an object's destructor is called as soon as the object goes out of scope. This makes it possible to use objects to manage scarce resources like network connections or mutexes, which you want to release as soon as you are done using them. In most garbage-collected languages, you do not have a good idea of when the "finalizer" (roughly the equivalent of the destructor in those languages) will be called, so you can't really use this "resource acquisition is initialization" (RAII) idiom as much.
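To make that concrete, here is a minimal RAII sketch of my own (not code from the thread): a little guard class that grabs a mutex in its constructor and releases it in its destructor, so the lock is dropped the instant the guard goes out of scope, even on an early return or an exception.

    // Minimal RAII sketch (my own example, not from the discussion):
    // the destructor releases the resource as soon as the guard
    // goes out of scope, even on early return or exception.
    #include <iostream>
    #include <mutex>

    std::mutex m;

    class LockGuard {
    public:
        explicit LockGuard(std::mutex& mtx) : mtx_(mtx) {
            mtx_.lock();                          // acquire in the constructor
            std::cout << "lock acquired\n";
        }
        ~LockGuard() {
            mtx_.unlock();                        // release in the destructor
            std::cout << "lock released\n";
        }
        LockGuard(const LockGuard&) = delete;     // guards shouldn't be copied
        LockGuard& operator=(const LockGuard&) = delete;
    private:
        std::mutex& mtx_;
    };

    void critical_section() {
        LockGuard guard(m);                       // resource acquired here
        std::cout << "doing work under the lock\n";
    }                                             // guard destroyed here: lock released

    int main() {
        critical_section();
    }

(In practice you'd use a ready-made scoped-lock class rather than writing your own; I spelled it out here just to show where the acquire and release happen.)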

I think C# may have the best of both worlds, since it lets the programmer opt into this kind of deterministic cleanup for particular objects (via the using statement and IDisposable, as I understand it). I should check out C# some time-- it really has a lot of nice features, from what I hear.