MythMash: Frontside Bus - Bottleneck or Room to Grow?

By Bill Machrone

Faster processors. Dual and quad cores. More memory. Bigger programs. Multithreading. Multitasking. Huge datasets. Monster graphics boards. Ever-faster DDR memory.

The motherboard has gotten to be a very busy place. Every component plays a vital role, and the frontside bus is no exception.

Microprocessors have changed dramatically over the last couple of decades, from subscalar designs that needed multiple clock cycles to perform a single instruction to today’s superscalar chips that perform multiple instructions per tick. The frontside or main memory bus was a busy thoroughfare, carrying all of the processor’s memory reads and writes, plus access to memory-mapped I/O devices and system peripherals. As processor clock speeds increased and instructions started to be interleaved, the frontside bus, or FSB, running at a lackadaisical 33MHz, didn’t necessarily keep pace because faster memory was expensive and standards were somewhat ingrained.

Fortunately, the old 33MHz bus is a distant memory. The modern FSB doesn’t connect directly to memory but to a sophisticated memory controller, part of the northbridge chip, which also drives the PCIe bus and may contain a graphics controller as well.

The question remains: can the FSB handle the additional burdens from multithreaded applications, multiple applications, and highly demanding applications such as video codecs and games? For that matter, what about dual-core and quad-core chips running at 2 and 3GHz? More processing cores means even more threads, more data being crunched, more memory reads and writes, more everything.

The short answer is yes.

The FSB still has plenty of headroom. The longer answer explains why. Read on for a deep dive into FSB utilization and how the Core™ memory subsystem works.

Get Smart

The FSB is not merely a bus anymore. Far more than a set of traces that run from one part of the motherboard to another, the FSB is part of the memory architecture, which includes caches, prefetchers, and other specialized functions. Taken together, the system is the Intel Advanced Smart Cache.

Designers long ago realized that software tends to operate on clusters of data and that some code is executed repeatedly. Performance is greatly enhanced if the data and instructions are held in high-speed cache memory, right on the processor chip. Intel architectures have a two-level cache scheme. The L1 data cache unit or DCU is closely tied to the processor core—each core in a multiprocessor chip has one. It allows the processor to deal with cache misses and enhances its ability to perform out-of-order execution.

A variety of buffers also help speed memory access. The Data Translation Lookaside Buffer (DTLB) in the Intel Core architecture is a two-level, hierarchical design. It converts virtual addresses that the processor can manipulate quickly into physical addresses for both loads and stores. The DTLB works in conjunction with on-chip Page Miss Handler logic, an important feature that causes the vast majority of page misses to be non-blocking, which means that the processor doesn’t have to perform a page walk and wait while the physical address is located. The Memory Ordering Buffer (MOB) lets the processor issue loads and stores speculatively and out of order. Out-of-order execution is vital to high performance because it lets the processor work on tasks further down the pipeline while the data for prior tasks is being fetched. The MOB ensures that the loads and stores have valid data, that speculative results are ignored, and that the physical writing to main memory happens in the most efficient manner.
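To make the translation step concrete, here is a minimal sketch of a lookaside buffer. It is a toy model, not the Core design: it has a single level rather than two, and the 4 KiB page size, the FIFO eviction, the `ToyTLB` class, and the sample page table are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size (4 KiB), for illustration only


class ToyTLB:
    """Single-level lookaside buffer backed by a dictionary page table."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.capacity = capacity      # how many translations the buffer can hold
        self.entries = {}             # cached translations, kept in insertion order
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Convert a virtual address to a physical address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                # Evict the oldest cached translation (FIFO, chosen for simplicity).
                oldest = next(iter(self.entries))
                del self.entries[oldest]
            # This dictionary lookup stands in for the slow page walk.
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


page_table = {0: 7, 1: 3, 2: 9}  # hypothetical mappings
tlb = ToyTLB(page_table)
first = tlb.translate(1 * PAGE_SIZE + 0x10)   # miss: translation must be looked up
second = tlb.translate(1 * PAGE_SIZE + 0x20)  # hit: same page, translation is cached
```

The second access lands on the same page as the first, so it is served from the buffer without consulting the page table again.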

Example Workloads

The chart below shows the results of running three applications, each with a reputation for taxing systems heavily. The first is a game, F.E.A.R.*, which is commonly used to benchmark high-performance computers and their graphics subsystems. It allows adjustment of the number of calculations for the physics of objects on the screen, the amount of detail in rendering, even the audio fidelity. In this and the other tests, the Intel VTune™ Performance Analyzer simply counted the number of bus clock cycles against the actual number of transfers. In this case, roughly 16.5 trillion cycles were available, but only 375 million were used, for a utilization of 4.53 percent. Clearly, the memory bus is not a bottleneck; it’s barely used.
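The utilization figure is simply the ratio of bus cycles that carried a transfer to bus cycles that were available. A minimal sketch of that arithmetic follows; the function name and the sample counts are made up for illustration and are not the measured values from the tests.

```python
def bus_utilization_percent(cycles_used, cycles_available):
    """Return the share of available bus cycles that carried a transfer, in percent."""
    if cycles_available <= 0:
        raise ValueError("cycles_available must be positive")
    return 100.0 * cycles_used / cycles_available


# Hypothetical counter readings: 45 busy cycles out of 1,000 available.
sample = bus_utilization_percent(45, 1000)
print(f"{sample:.2f} percent")
```

Any pair of counters with the same ratio gives the same percentage, which is why a single utilization number can summarize a long run.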

Please see “The Legal Stuff” page for system details and disclaimers.

“The frontside bus is simply not limited in typical home or desktop applications,” says Ronen Zohar, an Intel Principal Engineer. “Workstation applications may tax the bus more heavily, but there are no scaling issues on dual-core, quad-core and eventual octo (dual quad-core) packages.”

Zohar’s observations are borne out by tests with the DivX encoder and Photoshop applications, which incurred 3.82 and 5.23 percent utilization respectively.

“Most of the utilization is not generated by the program load instructions,” says Zohar, “but by prefetch operations, which run ahead of the actual loads to hide the memory latency: the hardware anticipating and preloading the data the program will need.”

So, How Smart is It?

Any investigation of frontside bus performance on Intel Core architectures will turn up references to how smart the prefetch process is. Prefetching is essential to the function of both the L1 and L2 caches. Prefetching is not a new idea, but the Core architecture takes it to a new level. While each core has a dedicated L1 cache, the L2 cache is shared by the processor cores. The Core processors have two hardware-based prefetchers to speed up data access from the L1 cache. The first, the Data Cache Unit (DCU) prefetcher, is also known as the streaming prefetcher and it’s particularly beneficial to streaming algorithms. It detects when the core is repeatedly accessing very recently loaded data in ascending order, and automatically fetches the next line in order to keep the DCU full and the algorithm running at top speed.
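The effect of a streaming prefetcher can be approximated with a small simulation. This is a sketch under stated assumptions, not Intel's logic: the cache is modeled as an unbounded set of line numbers, lines are assumed to be 64 bytes, and the trigger rule (fetch the next line whenever an access moves up by exactly one line) is a simplification invented for this example.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


def count_misses(addresses, prefetch):
    """Count cache misses for an access stream, with or without next-line prefetch."""
    cache = set()      # line numbers currently held (no capacity limit in this toy)
    last_line = None   # most recently touched line
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line not in cache:
            misses += 1
            cache.add(line)
        # Ascending pattern detected: the access moved up by exactly one line.
        if prefetch and last_line is not None and line == last_line + 1:
            cache.add(line + 1)  # bring in the next line before it is requested
        last_line = line
    return misses


# A streaming read: 8-byte loads marching through 1 KiB (16 cache lines).
stream = list(range(0, 1024, 8))
without_prefetch = count_misses(stream, prefetch=False)
with_prefetch = count_misses(stream, prefetch=True)
```

In this model the stream misses once per line without prefetching, but only on the first two lines once the ascending pattern has been spotted.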

The second L1 prefetcher keeps track of the instruction pointer and looks at individual load instructions. If it determines that the load instructions are evenly spaced, or have a regular “stride,” it prefetches the next offset. This strided prefetcher can operate forwards or backwards. These prefetch algorithms reward the programmer who is aware of their operation and arranges data structures accordingly.
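Stride detection keyed on the instruction pointer can be sketched in a few lines. The class below is a hypothetical illustration: the table layout, the rule that a stride must repeat once before a prediction is made, and all names are assumptions rather than details of the actual hardware.

```python
class StridePrefetcher:
    """Toy stride detector that tracks each load instruction separately."""

    def __init__(self):
        # instruction pointer -> (last address seen, last stride seen)
        self.table = {}

    def observe(self, ip, addr):
        """Record one load; return the address to prefetch, or None."""
        prediction = None
        stride = None
        previous = self.table.get(ip)
        if previous is not None:
            last_addr, last_stride = previous
            stride = addr - last_addr
            # Predict only when the same nonzero stride has been seen twice in a row.
            if stride != 0 and stride == last_stride:
                prediction = addr + stride
        self.table[ip] = (addr, stride)
        return prediction


pf = StridePrefetcher()
# One load instruction walking forward through 24-byte records.
forward = [pf.observe(0x400, a) for a in (1000, 1024, 1048, 1072)]
# Another load instruction walking backward in 64-byte steps.
backward = [pf.observe(0x500, a) for a in (5000, 4936, 4872)]
```

Because the stride is a signed difference, the same logic covers forward and backward walks. It also shows why regular layouts help: a structure traversed at a fixed spacing yields a prediction after a few loads, while an irregular pointer chase never does.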

Unambiguous

One of the challenges of maintaining throughput in a highly parallel microarchitecture, where multiple instructions are being executed at the same time, is that a load operation may be dependent on a store that precedes it in the program code. The actual execution may be out of order, however, and in a less sophisticated implementation, the processor would have to block loads until all preceding store addresses are known. The Intel Core architecture implements memory disambiguation, which predicts which loads will not have dependencies on previous stores based on past experience, even without knowing the stores’ addresses. This allows the processor to continue loading data from the L1 cache. As the actual stores are executed, the memory disambiguator verifies its predictions. Actual conflicts occasionally occur, but the disambiguator simply causes the load and its associated instructions to be re-executed.
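The predict, verify, and re-execute cycle can be sketched as follows. This is a deliberately crude model: it remembers only whether a given load instruction has ever conflicted, whereas the real predictor's bookkeeping is not described here. The class, the function, and the outcome labels are all hypothetical.

```python
class DisambiguationPredictor:
    """Toy predictor: a load may run early unless it has conflicted before."""

    def __init__(self):
        self.conflicted = set()  # instruction pointers of loads that conflicted

    def may_hoist(self, load_ip):
        """Predict whether this load is independent of earlier stores."""
        return load_ip not in self.conflicted

    def verify(self, load_ip, load_addr, earlier_store_addrs):
        """Check the prediction once store addresses are known; True means conflict."""
        conflict = load_addr in earlier_store_addrs
        if conflict:
            self.conflicted.add(load_ip)  # remember, so the next prediction is cautious
        return conflict


def run_load(predictor, load_ip, load_addr, earlier_store_addrs):
    """Return how the load was handled: 'hoisted', 're-executed', or 'waited'."""
    if not predictor.may_hoist(load_ip):
        return "waited"        # predicted dependent: hold until stores resolve
    if predictor.verify(load_ip, load_addr, earlier_store_addrs):
        return "re-executed"   # misprediction: redo the load and its dependents
    return "hoisted"           # prediction held: the load ran ahead safely


predictor = DisambiguationPredictor()
outcomes = [
    run_load(predictor, 0x10, 200, {100}),  # no overlap with the earlier store
    run_load(predictor, 0x20, 300, {300}),  # overlaps an earlier store
    run_load(predictor, 0x20, 300, {100}),  # same load, now predicted dependent
]
```

The common case pays nothing, the rare conflict costs a re-execution, and the history keeps the same load from mispredicting repeatedly.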

The L2 cache is also dependent on prefetching, but has to be aware of what the L1 prefetchers for each core are doing in order to maximize their performance. The data prefetch logic (DPL) for the L2 cache looks for patterns in the past requests of the L1 DCU and stores them in two separate arrays. It monitors DCU reads for stream activity and fills the cache accordingly. Although first introduced in the Pentium M series, the current Core architecture’s DPL is considerably more sophisticated. It detects when the stream skips cache lines and it adjusts dynamically to available bus bandwidth and the number of requests—it prefetches far ahead if the bus is not busy, less far if it’s busy.

Cache coherency has received a lot of attention lately, too. When multiple processor cores are executing threads, there is always the possibility that they are operating on the same data or modifying data that will affect the results of the other processor’s work. But is the issue relevant to most desktop multi-core designs? Zohar says, “In a single socket processor, the L2 cache is shared. Cache discrepancies are resolved within the chip itself; there is no coherency traffic on the memory bus.”

There is coherency activity in the cache, of course, but it is internal overhead, with no real effect on memory throughput. Cache coherency is a concern in multi-socket systems, as in workstations and servers, but each processor has its own FSB in such systems, and mechanisms in the chipset are specifically designed to reduce coherency traffic. So while this coherency traffic adds a bit to the overall utilization, it will still be well under 10%. “It’s a non-issue for desktop platforms,” says Zohar.

Closer Data

From the above, a clear goal emerges: bring the data closer to the processor. But do it in a way that maximizes throughput, minimizes bus traffic, and optimizes processor efficiency. Technologies such as streaming and strided prefetching, coordinated L1/L2 fetching, disambiguation, translation lookaside, and streamlined handling of page misses and memory ordering work together harmoniously to raise performance to a new level. In benchmark after benchmark, application after application, the Intel Core architecture and Advanced Smart Cache have proven their leadership with record-breaking performance.

Most importantly, today’s FSB is a system, one with more than enough bandwidth to handle dual, quad, and even eight-core processors.

The Legal Stuff

System Configuration Information:

CPU: Intel® Core™ 2 Extreme processor QX6700 (2.66GHz, 8MB shared L2 cache)

Motherboard: Intel® Desktop Board D975XBX2

Memory: 2GB of DDR 800, timings set to 4-4-4-12 1T

BIOS version: BXB97510J.86A.1474.2006.1220.1826

Chipset driver version: AUD_ALLVISTA_6.10.5405

Graphics: nVidia* 8800GTX, Driver version 7.15.11.65

Hard Drive: Maxtor* Diamond Max 10 Model 6L300SO 7200RPM

Operating System: Windows* Vista* Ultimate

Performance tests and ratings are measured using specific systems and/or components and reflect approximate performance of Intel products as measured by those tests. Any difference in system hardware, software, or configuration may affect actual performance. Buyers should consult other sources of information to evaluate system or component performance they are considering purchasing. For information on performance tests and performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm

*Other names and brands may be claimed as the property of others

