A Romp Through Benchmarking History

When I first came to Intel, the Intel 80486 microprocessor was just coming out. My first assignment was to work with a couple of other people to write the “i486 Performance Report”. I think I actually still have that report somewhere. Anyway, PC benchmarking began back in the “Stone Age”, when Whetstone* and Dhrystone* ruled the earth. With Dhrystone, you would compile and run the benchmark and then compare your score to the Dhrystone score of a VAX 11/780. You did the comparison by dividing your score by 1757, which was the score on the VAX 11/780. Now, a VAX 11/780 was conventionally known as a 1 MIPS machine. That is, it could execute one million VAX 11/780 instructions per second.

Somewhere along the line, someone decided that if your machine could, say, double the Dhrystone score of a VAX 11/780, then the system had to be a 2 MIPS machine! This implied that a system with double the score also executed two million instructions per second. A computer scientist would ask: Which instructions, and under what conditions? But that was a pesky detail not many people cared about. The number was easy to understand and provided an easy sound bite. So people talked about 10 MIPS machines, 20 MIPS machines, and so on.
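To make the arithmetic concrete, here is a minimal sketch of the conversion described above. The function name and the example score are assumptions made for illustration; only the divisor 1757 comes from the text.

```python
# Reference Dhrystone score of the VAX 11/780, as quoted in the text.
VAX_11_780_SCORE = 1757


def dhrystone_mips(score: float) -> float:
    """Convert a raw Dhrystone score into "Dhrystone MIPS".

    The result is only a ratio against the VAX 11/780 score. It says
    nothing about how many instructions the machine actually executes.
    """
    return score / VAX_11_780_SCORE


# Hypothetical machine that scores exactly double the VAX 11/780.
example_score = 2 * VAX_11_780_SCORE
print(dhrystone_mips(example_score))  # prints 2.0
```

Note that the function returns a unitless ratio. Calling the result "MIPS" is the leap the rest of this post complains about.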

The fundamental problem was that although MIPS could be used as a generic metric, it could not be used to deduce the instruction throughput on a given machine. Different systems could execute different instructions, and different numbers of instructions, to run the Dhrystone code. Further, just because a machine executed Dhrystone code at a certain rate, it didn’t mean it could do the same for other applications. Years later, it seems that old habits do in fact persist. The game console and supercomputer industries still suffer from using rate metrics similar to Dhrystone MIPS ...
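A small sketch can show why the ratio says little about real instruction throughput. Everything here except the divisor 1757 is hypothetical: the two machines, their scores, and their instructions-per-iteration counts are invented purely to illustrate the point, not measured on any real system.

```python
# Reference Dhrystone score of the VAX 11/780, as quoted in the text.
VAX_11_780_SCORE = 1757


def dhrystone_mips(score: float) -> float:
    """Score relative to the VAX 11/780 (the so-called Dhrystone MIPS)."""
    return score / VAX_11_780_SCORE


def native_mips(score: float, instructions_per_iteration: float) -> float:
    """Millions of the machine's own instructions executed per second.

    Assumes `score` is Dhrystone iterations per second and that each
    iteration costs a fixed number of native instructions.
    """
    return score * instructions_per_iteration / 1_000_000


# HYPOTHETICAL machines: (Dhrystone score, native instructions per iteration).
machines = {
    "machine_a": (17570, 300),  # fewer, more complex instructions
    "machine_b": (17570, 600),  # more, simpler instructions
}

for name, (score, per_iteration) in machines.items():
    print(name, dhrystone_mips(score), native_mips(score, per_iteration))
```

Both invented machines get the same Dhrystone MIPS rating because their scores are identical, yet under these assumptions one executes twice as many native instructions per second as the other.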

Dhrystone and a bunch of other synthetic benchmarks were all people had at the time. Yes, there were some decent application benchmarks here and there. Yes, there were heroic efforts by the likes of “Anon et al.”, who described the famous “Debit-Credit” benchmark. But for the most part, synthetics ruled the roost.

The developers of synthetics meant well. In fact, the best of them went to great trouble to profile applications and used the resulting characterizations to create benchmarks that matched them. After a while, however, people realized that using synthetic benchmarks as the sole determinant of system or even component performance was deeply flawed. A fundamental problem was that as architectures proliferated and software complexity and variation grew, the profiles used to design these synthetic benchmarks no longer reflected real application behavior. There were other issues as well. Synthetics became notorious for being easy optimization targets.

Toward the end of the 1980s, a few brave souls got together, figured there was a better way, and took action. Come back for the beginning of Benchmarking 2.0.


*Other names and brands may be claimed as the property of others.