Friday, January 5, 2007

Computers are too complicated

I left my last blog post convinced that computer viruses are enabled by too much complexity, not caused by some inept software developer. I love to complain about computers that don't work just as much as the next person, but I'd rather help make a change than just complain. This is half ramble and half creative endeavour. I'm just throwing this out here as it leaves my fingers. Brainstorming is never wrong; it's just a creative tool.

To repeat my earlier stance, computers (operating systems) have become too complex for the tools we use to monitor them. Would we run a nuclear power plant with gauges strewn all over the entire operation? No. We bring all the gauges and controls into a central control room. This is the problem with today's operating systems. We've strewn the gauges all over the place and made them next to impossible to read. I've been working with computers for a long time (I graduated in computer science in '88) and I still can't tell you the health of a computer when I walk up to it. I'm not even sure I could tell you the health after a few hours of looking at it. There are tools out there that help with this (Registry Mechanic on Windows comes to mind), but even those don't give much information; they just fix and forget.

Where is the information we need better access to in order to monitor the health of a computer? Three places. 1) Archival storage. 2) RAM. 3) Process status.

What are the tools we use to monitor these today? 1) File managers (nothing much more complicated than ls). 2) vmstat is the only thing I can think of, and that is as crude as it gets. 3) ps (on Windows, Ctrl+Alt+Del and the Processes tab), and that doesn't give any information about the history of the process.
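
To make the "no history" complaint a bit more concrete, here is a rough Python sketch of the kind of record a ps-like tool could keep per process. The names `Sample` and `ProcessHistory` are made up for illustration, and the sketch assumes something else (ps output, /proc, whatever the platform offers) feeds it the CPU and memory numbers; it only shows the bookkeeping, not the collection.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: float    # seconds since the epoch
    cpu_seconds: float  # cumulative CPU time the process has used so far
    rss_kb: int         # resident memory in kilobytes


class ProcessHistory:
    """Hypothetical in-memory store of per-process samples over time."""

    def __init__(self, max_samples=1000):
        # One bounded queue per pid, so old samples fall off the front.
        self._samples = defaultdict(lambda: deque(maxlen=max_samples))

    def record(self, pid, cpu_seconds, rss_kb, timestamp=None):
        # The caller supplies the numbers; this sketch does not read them itself.
        ts = time.time() if timestamp is None else timestamp
        self._samples[pid].append(Sample(ts, cpu_seconds, rss_kb))

    def history(self, pid):
        # Everything still remembered about this pid, oldest first.
        return list(self._samples.get(pid, ()))

    def cpu_rate(self, pid):
        # Average CPU use (fraction of one core) between the oldest
        # and newest remembered samples; 0.0 if there is too little data.
        samples = self._samples.get(pid)
        if not samples or len(samples) < 2:
            return 0.0
        first, last = samples[0], samples[-1]
        elapsed = last.timestamp - first.timestamp
        if elapsed <= 0:
            return 0.0
        return (last.cpu_seconds - first.cpu_seconds) / elapsed
```

Call `record()` every few seconds for each running pid and `history()` gives you exactly the data a per-process graph would need, which is the part ps throws away.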

What would it take in those three areas to find a virus just by looking at a proper monitoring tool? (This is sort of fun, and I feel like I'm on to something as I start to visualize a new tool that monitors a running operating system.) I worked at Sun in the early 90s and Rich Pettit's setool comes to mind. Even that was way too complicated, but it did bring together a lot of disparate data. That's the sort of thing I'm trying to visualize.

You look at a list of processes that are currently running. You can see a graph of each process's CPU time and memory usage since it started. You click on the file that is running from that same view and see the change history and the mechanism (human or computer) that changed the executable. Maybe some sort of fingerprinting to see who or what actually made the change and when (journaling disk drives in VMS were really cool, but never became mainstream). You could see the process's interaction with the network over time and actually drill down into the network traffic to view the traffic as a video stream. You could click on the current memory and see a map of memory and drill down to see what's using what and how much of it. Maybe a sysadmin could see the actual contents of memory with tools to view different types of memory in different ways (this is a big stretch, but part of brainstorming). From the archival side of the house you could look at an executable on disk and see when and where it was run over the past days and what process was starting it, and even drill down into those processes as well. Just a bunch of random ideas, and I'm sure there are more where those came from.
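
The fingerprinting idea is easy to sketch, at least in its simplest form. The Python below is only an illustration under my own assumptions: `fingerprint` and `ChangeJournal` are hypothetical names, the fingerprint is just a SHA-256 hash of the file contents, and the journal lives in memory. It can tell you that an executable changed and roughly when it noticed, but not who or what changed it; that part would need help from the operating system.

```python
import hashlib
import time
from pathlib import Path


def fingerprint(path):
    # SHA-256 of the file contents, read in chunks so large files are fine.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ChangeJournal:
    """Hypothetical change history for files, keyed by path."""

    def __init__(self):
        self._entries = {}  # path -> list of (timestamp, sha256 hex digest)

    def check(self, path, timestamp=None):
        # Fingerprint the file now and compare with the last recorded entry.
        # Returns True only if the contents differ from the previous check.
        key = str(Path(path))
        current = fingerprint(key)
        entries = self._entries.setdefault(key, [])
        changed = bool(entries) and entries[-1][1] != current
        if not entries or changed:
            ts = time.time() if timestamp is None else timestamp
            entries.append((ts, current))
        return changed

    def history(self, path):
        # Every distinct fingerprint seen for this file, oldest first.
        return list(self._entries.get(str(Path(path)), []))
```

Run `check()` on every executable at regular intervals and `history()` becomes the change history you would click through from the process view.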

Given a better view of the internals and history of an operating system, would a layman be able to detect a virus immediately? Given enough journaled information, could you reverse the effects of a rogue process? Given a lot of centralized gauges, can an engineer decide when a nuclear power plant is going to melt down and what to do? I say yes to all of these, but I'm not asking normal people to be nuclear engineers. With the current state of the art, even the engineers are blind when it comes to an operating system. Even a nuclear power plant's control room has alarms, and I doubt most of the operators know what every gauge and switch does; they probably have a big manual and a big panic button. But I'm not talking about the dangers of a nuclear plant. I'm talking about an operating system and the problems we face today: not enough information about its current and historical state to be able to diagnose a problem or to know if a change was critical or malicious.

I never even got to the actual root cause. The real problem lies in how we program computers and the complexity of programming languages. Maybe I'll get to it in another entry.
