Perf#
Introduction#
Linux perf is a performance analysis tool that uses hardware performance counters and kernel tracepoints to profile applications and the whole system. Unlike instrumentation-based profilers, which add overhead to every measured call, perf relies on CPU hardware to count or sample events such as cycles, instructions, cache misses, and branch mispredictions, with little impact on program execution.
Perf helps identify CPU hotspots, memory access patterns, and system bottlenecks.
It’s essential for optimizing performance-critical code, understanding where time
is spent, and validating optimization efforts. The tool works on any executable
without recompilation, though debug symbols (-g) improve output readability.
Basic Profiling#
Record and analyze CPU samples:
$ perf record ./myprogram # record CPU samples
$ perf report # interactive report
$ perf report --stdio # text report
$ perf record -g ./myprogram # record with call graphs
$ perf report -g # show call graph in report
Common record options:
$ perf record -F 99 ./prog # sample at 99 Hz
$ perf record -p 1234 # attach to running process
$ perf record -a sleep 10 # system-wide for 10 seconds
$ perf record -o out.data ./prog # custom output file
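The text report from perf report --stdio is convenient for scripting. The following is a minimal sketch, assuming the default column layout (Overhead, Command, Shared Object, Symbol) and a "[.]" or "[k]" marker before each symbol; top_symbols is a hypothetical helper name, and the report excerpt is made up for illustration.

```shell
# top_symbols N: read `perf report --stdio` text on stdin and print the N
# entries with the highest overhead as "<percent> <symbol>".
top_symbols() {
    n="${1:-5}"
    # Skip header comments (lines starting with '#') and keep only rows whose
    # first field is a percentage.
    awk '!/^#/ && $1 ~ /%$/ {
             sym = $0
             sub(/^.*\[[.kugH]\] +/, "", sym)   # drop every column before the symbol
             print $1, sym
         }' | sort -rn | head -n "$n"
}

# Hypothetical report excerpt, for illustration only (not real measurements).
sample_report='# Overhead  Command    Shared Object  Symbol
    12.10%  myprogram  libc.so.6      [.] memcpy
    61.35%  myprogram  myprogram      [.] compute
     4.55%  myprogram  [kernel]       [k] page_fault'

printf '%s\n' "$sample_report" | top_symbols 2
```

With a real profile, the same function could be fed directly: perf report --stdio | top_symbols 10. Reports recorded with -g may add a Children column, which this sketch does not handle.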
Perf Stat#
Get summary statistics without recording samples:
$ perf stat ./myprogram
Performance counter stats for './myprogram':
1,234.56 msec task-clock
123 context-switches
1,000,000 cycles
800,000 instructions # 0.80 insn per cycle
50,000 cache-misses
10,000 branch-misses
$ perf stat -e cycles,instructions ./prog # specific events
$ perf stat -r 5 ./prog # run 5 times, show stats
$ perf stat -d ./prog # detailed stats
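For scripting, perf stat can emit comma-separated output with -x, which is easier to parse than the human-readable table. The sketch below derives instructions per cycle from such output. It assumes the counter value is in the first CSV field and the event name in the third; ipc_from_csv is a hypothetical helper name, and the canned input reuses the cycle and instruction counts from the sample output above with placeholder values in the remaining fields.

```shell
# ipc_from_csv: read `perf stat -x,` output on stdin and print
# instructions per cycle with two decimals.
# CSV fields used: $1 = counter value, $3 = event name.
ipc_from_csv() {
    awk -F, '
        $3 ~ /^cycles/       { cycles = $1 }
        $3 ~ /^instructions/ { insns  = $1 }
        END {
            # "+ 0" forces a numeric comparison even if perf printed a
            # non-numeric value such as "<not supported>".
            if (cycles + 0 > 0) printf "%.2f\n", insns / cycles
            else                print "n/a"
        }'
}

# Canned input for illustration: 1,000,000 cycles and 800,000 instructions.
printf '%s\n' '1000000,,cycles,1234560000,100.00,,' \
              '800000,,instructions,1234560000,100.00,0.80,insn per cycle' |
    ipc_from_csv
```

Because perf stat writes its statistics to stderr, a real invocation would look like: perf stat -x, -e cycles,instructions ./prog 2>&1 >/dev/null | ipc_from_csv. Event names can differ on some machines (for example on hybrid CPUs), so the patterns may need adjusting.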
Perf Top#
Real-time view of system or process hotspots:
$ perf top # system-wide live view
$ perf top -p 1234 # specific process
$ perf top -F 99 # sample at 99 Hz
$ perf top -g # show call graphs
$ perf top -ns comm,dso # sort by process and library
$ perf top -e cache-misses # profile cache misses
Useful Events#
Profile specific hardware and software events:
# CPU events
$ perf stat -e cycles,instructions,branches,branch-misses ./prog
# Cache events
$ perf stat -e cache-references,cache-misses ./prog
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses ./prog
$ perf stat -e LLC-loads,LLC-load-misses ./prog
# Memory events
$ perf stat -e page-faults,minor-faults,major-faults ./prog
# System calls
$ perf stat -e 'syscalls:sys_enter_*' ./prog
List available events:
$ perf list # all events
$ perf list hw # hardware events
$ perf list sw # software events
$ perf list cache # cache events
$ perf list tracepoint # kernel tracepoints
Call Graphs#
Understand where time is spent in the call hierarchy:
# Record with frame pointers (compile with -fno-omit-frame-pointer)
$ perf record -g ./myprogram
$ perf report -g
# Record with DWARF unwinding (works without frame pointers)
$ perf record --call-graph dwarf ./myprogram
# Record with LBR (Last Branch Record, Intel CPUs)
$ perf record --call-graph lbr ./myprogram
Flame Graphs#
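Flame graphs visualize profiling data as interactive SVGs. The y-axis shows stack depth, and the width of each frame represents the share of samples in which that frame appeared. The x-axis orders frames alphabetically and does not represent the passage of time. The FlameGraph tools are available from the brendangregg/FlameGraph repository.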
$ perf record -g ./myprogram
$ perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# CPU flame graph
$ perf record -F 99 -g ./prog
$ perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg
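# Off-CPU approximation: record stacks at context switches
# (shows where the program blocks, weighted by switch count rather than wait time)
$ perf record -e sched:sched_switch -g ./prog
$ perf script | stackcollapse-perf.pl | flamegraph.pl > offcpu.svg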
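The intermediate "folded" format produced by stackcollapse-perf.pl is plain text: one line per unique stack, with frames joined by semicolons and followed by a sample count. The sketch below shows how that format can be summarized without drawing anything, by computing each leaf function's share of all samples. leaf_share is a hypothetical helper name, and the folded stacks are made-up illustration data.

```shell
# leaf_share: read folded stacks ("frame1;frame2;...;leaf count") on stdin and
# print each leaf function's share of all samples, largest first.
leaf_share() {
    awk '{
             count = $NF                    # last field is the sample count
             stack = $0
             sub(/ [0-9]+$/, "", stack)     # remove the trailing count
             n = split(stack, frames, ";")  # frames[n] is the leaf (on-CPU) frame
             total += count
             self[frames[n]] += count
         }
         END {
             for (f in self) printf "%.1f%% %s\n", 100 * self[f] / total, f
         }' | sort -rn
}

# Hypothetical folded stacks, for illustration only.
printf '%s\n' 'myprogram;main;compute 70' \
              'myprogram;main;compute;memcpy 20' \
              'myprogram;main;parse 10' |
    leaf_share
```

With real data, the input would come from perf script | stackcollapse-perf.pl | leaf_share. The result corresponds to the widths of the topmost frames in the flame graph.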
Annotate Source#
See which source lines consume the most cycles:
$ perf record ./myprogram
$ perf annotate # interactive annotation
$ perf annotate func_name # annotate specific function
$ perf annotate --stdio # text output
Source-line annotation requires debug symbols (-g); without them, perf annotate shows disassembly only.
System-Wide Analysis#
Profile the entire system to find bottlenecks:
$ perf top -a # live system-wide view
$ perf record -a -g sleep 30 # record system for 30 seconds
$ perf report
# Find which processes use most CPU
$ perf top -ns comm
# Trace context switches
$ perf record -e context-switches -a sleep 10
Tracing#
Trace specific events and system calls:
# Trace system calls
$ perf trace ./myprogram
$ perf trace -p 1234 # trace running process
# Trace specific syscalls
$ perf trace -e open,read,write ./prog
# Count syscalls
$ perf stat -e 'syscalls:sys_enter_*' ./prog
Comparing Runs#
Compare performance between two runs:
$ perf record -o before.data ./prog_v1
$ perf record -o after.data ./prog_v2
$ perf diff before.data after.data
Common Workflows#
Find CPU hotspots:
$ perf record -g ./myprogram
$ perf report
# Look for functions with highest "Overhead" percentage
Diagnose cache performance:
$ perf stat -e cache-references,cache-misses,L1-dcache-load-misses ./prog
# A high ratio of cache-misses to cache-references suggests poor memory locality
Profile a running server:
$ perf record -p $(pgrep -d, myserver) -g sleep 30 # -d, joins multiple PIDs with commas
$ perf report
Check if CPU-bound or I/O-bound:
$ perf stat ./myprogram
# Low "CPUs utilized" (shown next to task-clock) + many context-switches suggests I/O bound
# "CPUs utilized" near 1.0 per thread suggests CPU bound; low insn per cycle then points to memory stalls
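One number that helps with this check is the ratio of CPU time to wall-clock time, which perf stat prints as "CPUs utilized" beside task-clock. The sketch below reproduces that arithmetic. cpus_utilized is a hypothetical helper name; the task-clock value is taken from the sample perf stat output earlier, and the 2.0 second elapsed time is a made-up value for illustration.

```shell
# cpus_utilized TASK_CLOCK_MSEC ELAPSED_SEC
# Prints CPU time divided by wall-clock time, the ratio perf stat
# reports as "CPUs utilized".
cpus_utilized() {
    awk -v busy_ms="$1" -v wall_s="$2" 'BEGIN {
        if (wall_s + 0 <= 0) { print "n/a"; exit }
        printf "%.2f\n", (busy_ms / 1000) / wall_s
    }'
}

# 1,234.56 msec of task-clock over a hypothetical 2.0 s of elapsed time
cpus_utilized 1234.56 2.0
```

A single-threaded program with a value well below 1.0 spent much of its run off the CPU, typically waiting on I/O, locks, or sleep. A value close to 1.0 means it was running almost the whole time.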