Measuring Go 1.26's Green Tea GC in the Real World
When Go 1.26 shipped, one line in the release notes caught my attention:
Green Tea GC, the new default garbage collector, reduces GC overhead by 10-40% for real-world programs.
Bold claim. Garbage collectors are notoriously difficult to improve without compromising something else, such as latency for throughput, or memory for speed. A 10-40% range is also suspiciously wide, which usually means "it depends a lot on your workload."
I wanted to see it for myself, so I created a couple of small workloads that resemble backend code and ran them with and without the new collector.
A quick refresher: Go's garbage collector
Go uses a concurrent mark-and-sweep collector. The expensive part is usually the mark phase, where the runtime walks the heap and figures out which objects are still alive.
That pointer traversal forces the runtime to scan large chunks of memory. For programs with lots of small heap objects, scanning can start to dominate the overall runtime cost, even when most of the actual work is computation.
Green Tea improves how that scanning happens: better heap traversal locality, less pointer-chasing overhead, better scaling across cores. What it doesn't do is change how much your program allocates. Same allocations, faster collection.
How I set up the benchmarks
To keep things clean, I ran each benchmark twice, once with Go 1.26's defaults (Green Tea on), and once with it disabled.
Go lets you turn it off with a compile-time experiment flag:
GOEXPERIMENT=nogreenteagc

So the runs looked like this:
go test -bench=. -benchmem -count=15 > green.txt
GOEXPERIMENT=nogreenteagc \
go test -bench=. -benchmem -count=15 > oldgc.txt

Then I compared them with benchstat:
benchstat oldgc.txt green.txt

benchstat runs statistical tests under the hood; if a result shows ~, the difference is likely noise. I ran 15 iterations per benchmark specifically so it would have enough data to be confident.
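benchstat isn't part of the standard Go toolchain; if you don't already have it, it installs from the x/perf module:

```shell
# Installs the benchstat binary into $(go env GOPATH)/bin
go install golang.org/x/perf/cmd/benchstat@latest
```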
Full benchmark code is at github.com/wulkan/greentea-gc-bench if you want to poke around or run it on your own hardware.
Workload 1: Log line processing
The first one simulates log parsing. Most backend services do some version of this, ingesting structured log lines, extracting fields and building derived records.
The lines look like:
ts=1710000000 level=INFO service=api region=eu-west user=123 action=checkout

The processing pipeline splits each line, pulls out key/value pairs, builds a map, and generates normalized tags:
func ProcessLines(lines []string) []Record {
    out := make([]Record, 0, len(lines))
    for _, line := range lines {
        fields := strings.Fields(line)
        values := make(map[string]string, len(fields))
        for _, f := range fields {
            parts := strings.SplitN(f, "=", 2)
            if len(parts) != 2 {
                continue
            }
            values[parts[0]] = parts[1]
        }
        tags := make([]string, 0, len(values))
        for k, v := range values {
            tags = append(tags, k+":"+strings.ToLower(v))
        }
        out = append(out, Record{
            Service: values["service"],
            Region:  values["region"],
            User:    values["user"],
            Action:  values["action"],
            Tags:    tags,
        })
    }
    return out
}

This allocates a fair amount, but the bottleneck here is string parsing and map operations, not GC. I expected modest improvements at best.
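The go test commands above assume a benchmark function, which the post doesn't show. Here is a minimal self-contained sketch of what one could look like; makeLines and the 10,000-line input size are my choices, and ProcessLines and Record are reproduced from above so the file compiles on its own:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

type Record struct {
	Service, Region, User, Action string
	Tags                          []string
}

// ProcessLines is the parser from the article, repeated here so this
// snippet runs standalone.
func ProcessLines(lines []string) []Record {
	out := make([]Record, 0, len(lines))
	for _, line := range lines {
		fields := strings.Fields(line)
		values := make(map[string]string, len(fields))
		for _, f := range fields {
			parts := strings.SplitN(f, "=", 2)
			if len(parts) != 2 {
				continue
			}
			values[parts[0]] = parts[1]
		}
		tags := make([]string, 0, len(values))
		for k, v := range values {
			tags = append(tags, k+":"+strings.ToLower(v))
		}
		out = append(out, Record{
			Service: values["service"],
			Region:  values["region"],
			User:    values["user"],
			Action:  values["action"],
			Tags:    tags,
		})
	}
	return out
}

// makeLines builds n synthetic log lines in the format shown above.
func makeLines(n int) []string {
	lines := make([]string, n)
	for i := range lines {
		lines[i] = fmt.Sprintf(
			"ts=%d level=INFO service=api region=eu-west user=%d action=checkout",
			1710000000+i, i%1000)
	}
	return lines
}

func main() {
	lines := makeLines(10_000)
	// testing.Benchmark runs a benchmark function outside `go test`,
	// which makes the sketch runnable as a plain program too.
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = ProcessLines(lines)
		}
	})
	fmt.Println(res, res.MemString())
}
```

In a real repo this would live in a `_test.go` file as `func BenchmarkProcessLines(b *testing.B)` so that `go test -bench=.` picks it up.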
Workload 2: Span processing
The second one is closer to a tracing or telemetry pipeline. Each span has attributes stored in a linked list, and processing generates a bunch of derived objects:
type Attr struct {
    Key   string
    Value string
    Next  *Attr
}

func ProcessSpans(spans []Span) []Output {
    out := make([]Output, 0, len(spans))
    for _, s := range spans {
        var summary []*Attr
        for a := s.Attrs; a != nil; a = a.Next {
            n := &Attr{
                Key:   a.Key,
                Value: a.Value,
            }
            if len(a.Value) > 3 {
                n.Next = &Attr{
                    Key:   a.Key + ".len",
                    Value: strconv.Itoa(len(a.Value)),
                }
            }
            summary = append(summary, n)
        }
        out = append(out, Output{
            Key:      s.Service + ":" + s.Region + ":" + s.UserID,
            Summary:  summary,
            TagCount: len(summary),
        })
    }
    return out
}

This creates tens of thousands of short-lived, pointer-heavy objects per run, exactly the scenario where mark-phase costs pile up. If Green Tea was going to show up anywhere, it'd be here.
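To make the heap shape concrete, here is a sketch of a generator for this input. The post doesn't show the Span type or how spans are built, so the Span definition below is inferred from how ProcessSpans uses it, and makeSpans is my own naming:

```go
package main

import (
	"fmt"
	"strconv"
)

// Attr is the linked-list node from the benchmark.
type Attr struct {
	Key, Value string
	Next       *Attr
}

// Span's exact definition isn't shown in the post; this shape is
// inferred from the fields ProcessSpans reads.
type Span struct {
	Service, Region, UserID string
	Attrs                   *Attr
}

// makeSpans builds n spans with attrsPer linked attributes each:
// lots of small, pointer-linked heap objects, which is exactly the
// shape the mark phase has to chase.
func makeSpans(n, attrsPer int) []Span {
	spans := make([]Span, n)
	for i := range spans {
		var head *Attr
		for j := 0; j < attrsPer; j++ {
			head = &Attr{
				Key:   "attr" + strconv.Itoa(j),
				Value: "value" + strconv.Itoa(i*attrsPer+j),
				Next:  head,
			}
		}
		spans[i] = Span{
			Service: "api",
			Region:  "eu-west",
			UserID:  strconv.Itoa(i),
			Attrs:   head,
		}
	}
	return spans
}

func main() {
	spans := makeSpans(10_000, 8)
	total := 0
	for _, s := range spans {
		for a := s.Attrs; a != nil; a = a.Next {
			total++
		}
	}
	fmt.Println("attr nodes:", total) // 80000 pointer-linked heap objects
}
```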
Results
I ran these on a few different machines to see if the hardware made a difference.
Apple M3 Pro
$ benchstat oldgc.txt green.txt
name          old time/op  new time/op  delta
ProcessLines  1.24ms       1.18ms       -4.84%
ProcessSpans  8.73ms       7.89ms       -9.62%

About what I expected: the log parser improved a bit, and the span workload saw a more meaningful jump. Nearly 10% without touching any code.
Intel i7-8850H
$ benchstat oldgc.txt green.txt
name          old time/op  new time/op  delta
ProcessLines  2.01ms       1.87ms       -6.97%
ProcessSpans  9.48ms       8.19ms       -13.61%

The span workload improved more here than on the M3. I honestly wasn't expecting that.
AMD EPYC 9354P (controlled environment)
$ benchstat oldgc.txt green.txt
name                 old time/op  new time/op  delta
ProcessLines         1.83ms       1.77ms       -3.28%
ProcessSpans_Medium  3.002ms      2.564ms      -14.57% (p=0.000 n=15)
ProcessSpans_Large   14.51ms      13.30ms      -8.38%  (p=0.000 n=15)

The medium span workload had the biggest relative improvement across all my tests. The p-values are basically zero, so the signal is real.
One thing that didn't change
Across every machine and every benchmark:
allocs/op: 54k -> 54k
B/op: 2.0MB -> 2.0MB

Allocation counts didn't budge. Green Tea isn't helping your code allocate less; it's just doing less work when it comes time to clean up.
Why do the numbers vary so much?
Two things seem to drive most of the difference.
The first is workload shape. String-heavy code (lots of parsing, map lookups, small slices) tends to be CPU-bound in ways the GC can't really help with; those workloads got 3-7%. Pointer-heavy code, with large numbers of small heap objects, is where you start to see 8-14%.
The second is CPU architecture. The Go runtime has vectorized heap scanning for newer x86 chips (Intel Ice Lake and AMD Zen 4 territory). On older hardware, the improvement is real but smaller.
What this actually means
The span workload isn't a contrived example. A lot of real Go services look more like it than like the log parser: tracing pipelines, telemetry ingestion, API servers, event processors, anything that allocates a lot of short-lived, pointer-rich objects. That's the class of program where GC overhead tends to quietly eat into your latency budget.
Final thoughts
In practice, this means a free performance win for a lot of services. If your workload allocates a lot of small pointer-heavy objects, upgrading to Go 1.26 will likely shave a bit off your CPU time without you touching any code.
It’s not a magic bullet, but it’s a nice improvement for the kind of backend workloads Go tends to run.