Bandwidth: a memory bandwidth benchmark




Description

My program, called bandwidth, is an artificial benchmark for measuring memory bandwidth, useful for identifying a computer's weak areas.

Version 0.16

The latest version performs sequential memory access tests using a range of chunk sizes. It requires an AMD64 or Intel64 processor running 64-bit Linux, and it runs each test with both 64-bit transfers and 128-bit SSE2 transfers.
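
To make the chunk-size idea concrete, here is a minimal C sketch of a 64-bit sequential read test. It is not the program's actual code, whose core routines are written in assembly. The names read_chunk_64 and read_bandwidth_mb_per_sec, and the way the number of passes is chosen, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

/* Read `bytes` bytes sequentially, 64 bits at a time, `passes` times.
   The volatile pointer forces one real 64-bit load per word; the sum is
   returned so the loads cannot be optimized away. */
uint64_t read_chunk_64(const volatile uint64_t *buf, size_t bytes,
                       unsigned long passes)
{
    assert(bytes % sizeof(uint64_t) == 0);
    size_t words = bytes / sizeof(uint64_t);
    uint64_t sum = 0;
    for (unsigned long p = 0; p < passes; p++)
        for (size_t i = 0; i < words; i++)
            sum += buf[i];
    return sum;
}

/* Wall-clock time in microseconds. */
static uint64_t now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000000u + (uint64_t)tv.tv_usec;
}

/* Sequential read bandwidth for one chunk size, in millions of bytes per
   second. The chunk is re-read until about `total_bytes` have been read.
   Returns -1.0 if the chunk cannot be allocated. */
double read_bandwidth_mb_per_sec(size_t chunk_bytes, size_t total_bytes)
{
    assert(chunk_bytes > 0 && chunk_bytes % sizeof(uint64_t) == 0);
    uint64_t *buf = malloc(chunk_bytes);
    if (buf == NULL)
        return -1.0;

    /* Write every word once so the pages are mapped before timing. */
    for (size_t i = 0; i < chunk_bytes / sizeof(uint64_t); i++)
        buf[i] = i;

    unsigned long passes = (unsigned long)(total_bytes / chunk_bytes);
    if (passes == 0)
        passes = 1;

    uint64_t t0 = now_usec();
    volatile uint64_t sink = read_chunk_64(buf, chunk_bytes, passes);
    uint64_t t1 = now_usec();
    (void)sink;
    free(buf);

    uint64_t elapsed = t1 - t0;
    if (elapsed == 0)
        elapsed = 1; /* avoid dividing by zero on very short runs */

    /* Bytes per microsecond equals millions of bytes per second. */
    return (double)chunk_bytes * (double)passes / (double)elapsed;
}
```

Calling read_bandwidth_mb_per_sec with chunk sizes that double from a few kilobytes up to several megabytes would typically show the reported figure dropping as the chunk outgrows each cache level.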

Version 0.15

Version 0.15 uses preset chunk sizes to effectively test several types of memory:
  • Level 1 cache sequential read accesses
  • Level 1 cache sequential write accesses
  • Level 2 cache sequential read accesses
  • Level 2 cache sequential write accesses
  • Main memory sequential read accesses
  • Main memory sequential write accesses
  • Framebuffer sequential read accesses (Linux only)
  • Framebuffer sequential write accesses (Linux only)
  • String library routines

Change log

Version 0.16 enhancements

This version offers x86_64 assembly optimizations, particularly the use of SSE2 for 128-bit transfers.
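
As a rough illustration of what a 128-bit SSE2 transfer looks like, the following C sketch uses compiler intrinsics rather than the hand-written assembly the program itself uses. The function name read_chunk_128 is hypothetical, and the non-SSE2 fallback exists only so the sketch still compiles on other processors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Sum every 64-bit word in a 16-byte-aligned buffer.
   With SSE2, each loop iteration performs one aligned 128-bit load. */
uint64_t read_chunk_128(const void *buf, size_t bytes)
{
    assert((uintptr_t)buf % 16 == 0 && bytes % 16 == 0);
#if defined(__SSE2__)
    const __m128i *p = (const __m128i *)buf;
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < bytes / 16; i++)
        acc = _mm_add_epi64(acc, _mm_load_si128(p + i));

    /* Combine the two 64-bit lanes into a single total. */
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1];
#else
    /* Fallback: plain 64-bit loads. */
    const uint64_t *w = (const uint64_t *)buf;
    uint64_t sum = 0;
    for (size_t i = 0; i < bytes / 8; i++)
        sum += w[i];
    return sum;
#endif
}
```

Each 128-bit load moves twice as many bytes as a 64-bit load, which is why the two transfer widths are worth measuring separately.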

Version 0.16 is 64-bit only as I haven't had time to finish improvements to the 32-bit assembly code.

Version 0.15 enhancements

I made two key enhancements in version 0.15:
  1. I rewrote core test routines in x86 assembly.
  2. I switched the code to use microsecond timing instead of whole seconds.
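
The benefit of microsecond timing can be sketched in C with gettimeofday(). This is an assumption about how such timing might be done, not the program's actual code, and the helper names elapsed_usec, mb_per_sec and time_usec are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples. */
uint64_t elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    int64_t sec = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
    int64_t usec = (int64_t)end->tv_usec - (int64_t)start->tv_usec;
    int64_t total = sec * 1000000 + usec;
    assert(total >= 0);
    return (uint64_t)total;
}

/* Bandwidth in millions of bytes per second.
   Bytes per microsecond is numerically the same quantity. */
double mb_per_sec(uint64_t bytes, uint64_t usec)
{
    assert(usec > 0);
    return (double)bytes / (double)usec;
}

/* Run `fn(arg)` once and return how long it took in microseconds. */
uint64_t time_usec(void (*fn)(void *), void *arg)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    fn(arg);
    gettimeofday(&t1, NULL);
    return elapsed_usec(&t0, &t1);
}
```

For a test that moves 2,400,000 bytes in 1.2 seconds, these helpers give 2.0 million bytes per second. A whole-second timer cannot distinguish 1.2 seconds from 1 or 2 seconds, so the same run would be reported as 2.4 or 1.2 instead.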

Results from 0.16

On my 2 GHz Celeron 550 with PC2-5300 memory, running 64-bit Linux, I get the following key values:
  • Using 128-bit SSE2 transfers I get around 32 gigabytes per second reading from L1 cache.
  • Using 64-bit transfers I can write to L1 cache at 14 gigabytes per second.
  • Using 128-bit SSE2 transfers I get around 13 gigabytes per second reading from L2 cache.
  • Using 64-bit transfers I can write to L2 cache at 9.4 gigabytes per second.
  • Using 128-bit SSE2 transfers I can read from main memory at 3.5 gigabytes per second.
  • Using 128-bit SSE2 transfers I can write to main memory at 2.7 gigabytes per second.
  • Reading 64 bits per access: L1 is about 5 times faster than DRAM and L2 is about 3 times faster.

On an AMD Quad Opteron 2352 running at 2.1 GHz, the 128-bit sequential read maxes out at 49.7 GB/second, and the maximum 128-bit sequential write speed to main memory is 2.8 GB/second.

Results from 0.15

All bandwidth values are in millions of bytes per second, rounded to 4 significant figures.

Make/model         CPU                       CPU speed  L1 read  L1 write  L2 read  L2 write  Main read  Main write  RAM type/speed    FB read  FB write
Lenovo 3000 N200   Intel Celeron 550         2.0 GHz    7489     7125      6533     5007      2088       1290.       PC2-5300          23.05    100.2
Toshiba A205       Intel Pentium Dual T2390  1.86 GHz   7098     6734      7095     5675      2146       1255        PC2-5300          23.36    84.50
Dell XPS T700r     Intel Pentium III         700 MHz    2629     2284      2607     1630.     448.5      163.7       PC100             6.680    47.40
IBM Thinkpad 560E  Intel Pentium MMX         150 MHz    500.7    75.49     520.6    74.81     86.64      74.32       EDO 60ns; 50 MHz  N/A      N/A

Commentary

One has certain expectations about the performance of different memory subsystems in a computer. My program confirms these.
  1. Reading is usually faster than writing.
  2. L1 cache writing is usually almost as fast as L1 cache reading.
  3. L2 cache reading is roughly as fast as L1 reading.
  4. L2 cache writing is usually slower than L1 writing.
  5. If the L2 cache is in write-through mode, then L2 writing will be very slow, more on par with main memory write speeds.
  6. L2 cache reading is usually faster than main memory reading.
  7. L2 cache writing is usually faster than main memory writing.
  8. However, framebuffer writing is usually faster than framebuffer reading.
  9. Framebuffer accesses are usually slower than main memory.

The main thing that reduces a computer's bandwidth is a write-through cache, be it L2 or L1. This is especially apparent in the Pentium 150 results above.
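
Some of these expectations can be checked mechanically against the version 0.15 results. The C sketch below hard-codes the cache and main memory columns of that table. The struct layout and the function names count_reads_faster and l1_write_read_ratio are assumptions for illustration and are not part of the program.

```c
#include <stddef.h>

/* One row of the version 0.15 results, in millions of bytes per second. */
struct result {
    const char *machine;
    double l1_read, l1_write;
    double l2_read, l2_write;
    double main_read, main_write;
};

const struct result results_015[] = {
    {"Lenovo 3000 N200",  7489,  7125,  6533,  5007,  2088,  1290},
    {"Toshiba A205",      7098,  6734,  7095,  5675,  2146,  1255},
    {"Dell XPS T700r",    2629,  2284,  2607,  1630,  448.5, 163.7},
    {"IBM Thinkpad 560E", 500.7, 75.49, 520.6, 74.81, 86.64, 74.32},
};

/* Count machines where reading beats writing in L1, L2 and main memory. */
size_t count_reads_faster(const struct result *r, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (r[i].l1_read > r[i].l1_write &&
            r[i].l2_read > r[i].l2_write &&
            r[i].main_read > r[i].main_write)
            count++;
    }
    return count;
}

/* L1 write bandwidth as a fraction of L1 read bandwidth. */
double l1_write_read_ratio(const struct result *r)
{
    return r->l1_write / r->l1_read;
}
```

All four machines pass the read-versus-write check. The L1 write-to-read ratio is about 0.95 for the Lenovo but only about 0.15 for the Thinkpad 560E, which is the Pentium 150 effect described above.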

Author

Zack Smith
