When IO-bound Hides Inside CPU | by Gleb Sakhnov | Feb, 2022

Revealing an IO bottleneck in a purely CPU-bound application using Go

Gleb Sakhnov
func incrementManyTimes(val *int64, times int) {
	for i := 0; i < times; i++ {
		*val++
	}
}

// define structure type to hold the values
type IntVars struct {
	i1 int64
	i2 int64
}

// create the actual values
vars := IntVars{i1: 0, i2: 0}

incrementManyTimes(&vars.i1, 1000)

incrementParallel(&vars.i1, &vars.i2, 1000)

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkIncrement1Value              1.408 ns/op
BenchmarkIncrement2ValuesInParallel   2.172 ns/op

Core i7 Xeon 5500 Series Data Source Latency (approximate)

L1 CACHE hit    1-2 ns
L2 CACHE hit    3-5 ns
L3 CACHE hit    12-40 ns

local DRAM      ~60 ns
remote DRAM     ~100 ns

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkIncrement1Value              1.408 ns/op
BenchmarkIncrement2ValuesInParallel   2.172 ns/op

Mitigation

type IntVars struct {
	i1 int64
	_  [56]byte // padding
	i2 int64
}

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkIncrement1Value              1.367 ns/op
BenchmarkIncrement2ValuesInParallel   1.374 ns/op
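
The benchmark functions themselves are not shown, so here is a rough sketch of how the padded and unpadded layouts could be compared in a single program. It assumes each benchmark performs b.N increments per variable, and that incrementParallel runs one goroutine per variable; benchUnpadded, benchPadded and PaddedIntVars are hypothetical names. testing.Benchmark from the standard library lets this run without go test.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

type IntVars struct {
	i1 int64
	i2 int64
}

// PaddedIntVars is a hypothetical name for the padded variant.
type PaddedIntVars struct {
	i1 int64
	_  [56]byte // padding
	i2 int64
}

func incrementManyTimes(val *int64, times int) {
	for i := 0; i < times; i++ {
		*val++
	}
}

// incrementParallel is an ASSUMED implementation (one goroutine per value).
func incrementParallel(val1, val2 *int64, times int) {
	var wg sync.WaitGroup
	for _, v := range []*int64{val1, val2} {
		wg.Add(1)
		go func(p *int64) {
			defer wg.Done()
			incrementManyTimes(p, times)
		}(v)
	}
	wg.Wait()
}

// benchUnpadded increments two adjacent fields in parallel.
func benchUnpadded(b *testing.B) {
	vars := IntVars{}
	incrementParallel(&vars.i1, &vars.i2, b.N)
}

// benchPadded does the same with the fields 64 bytes apart.
func benchPadded(b *testing.B) {
	vars := PaddedIntVars{}
	incrementParallel(&vars.i1, &vars.i2, b.N)
}

func main() {
	// Absolute numbers depend on the CPU, the scheduler and the Go version.
	fmt.Println("unpadded:", testing.Benchmark(benchUnpadded))
	fmt.Println("padded:  ", testing.Benchmark(benchPadded))
}
```

On a multi-core machine the unpadded variant would be expected to report a higher ns/op, in line with the results above, but the size of the gap will vary between systems.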

Cross-architecture support

import "golang.org/x/sys/cpu"

type IntVars struct {
	i1 int64
	_  cpu.CacheLinePad // padding
	i2 int64
}

Measuring CPU cache performance

func main() {
	a := IntVars{}
	incrementParallel(&a.i1, &a.i2, 100000000)
}

▶ perf stat -B -e L1-dcache-load-misses ./test
8,650,268      L1-dcache-load-misses

▶ perf stat -B -e L1-dcache-load-misses ./test-padded
205,526      L1-dcache-load-misses
