Shannon's Software Engineering Blog

pprof CPU



This is Part 7 of the Magic Link System Design Series.

  1. System Design
  2. Magic Links
  3. CSRF with Magic Links
  4. pprof & Load Testing
  5. pprof Analysis
  6. pprof Memory Optimizations
  7. pprof CPU

[Figure: CPU flamegraph (baseline)]

The above CPU flamegraph shows the CPU utilization after the memory optimizations made in the previous post.

As seen in the flamegraph, the Magic Link handler accounts for less than half of the CPU utilization during the sample period. This is expected, as it is a very simple handler. The majority of the compute in the handler goes to encryption, while the rest is part of the HTTP lifecycle.
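
If you're following along, a profile like this can be captured by exposing Go's pprof handlers next to the application and sampling while the load test runs. Below is a minimal sketch; the helper name, debug port, and pprof UI port are illustrative, not the exact setup used in this series.

// pprof_debug.go (sketch)
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

// startPprofServer serves the pprof endpoints on a separate, non-public port,
// so profiles can be pulled while the Gin router handles load on :8080.
func startPprofServer() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

A 30-second CPU profile can then be pulled and rendered (including the flamegraph view) with go tool pprof -http=:8081 "http://localhost:6060/debug/pprof/profile?seconds=30".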

One area of optimization is the fmt.Fprint section, which belongs to the Gin logger. Every incoming request is logged synchronously, triggering syscalls that consume CPU time.
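
For context, that fmt.Fprint comes from Gin's built-in Logger middleware, which gin.Default() attaches and which writes one formatted line per request to gin.DefaultWriter (os.Stdout). A sketch of that baseline wiring (the empty handler is just a stand-in):

// baseline.go (sketch): the default Gin setup that produces the fmt.Fprint writes.
package main

import "github.com/gin-gonic/gin"

func main() {
	// gin.Default() attaches gin.Logger() and gin.Recovery().
	// gin.Logger() formats every request and writes it to gin.DefaultWriter
	// (os.Stdout) with fmt.Fprint, synchronously, one write per request.
	router := gin.Default()

	router.POST("/magic/generate", func(c *gin.Context) {}) // handler elided
	_ = router.Run(":8080")
}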

Benchmarks

When dealing with CPU utilization or latency, we need to gather baseline metrics of our system before making improvements, so that we can compare the results afterwards.

// benchmark_test.go
package main

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

const (
	Host               = "http://127.0.0.1:8080"
	GenerateMagicRoute = "/magic/generate"
)

func BenchmarkGin(b *testing.B) {
	router := NewRouter(true)
	req, err := http.NewRequest(http.MethodPost, Host+GenerateMagicRoute, nil)
	if err != nil {
		b.Error(err)
		return
	}
	// Check the error before touching req to avoid a nil-pointer panic.
	req.Header.Set("X-CSRF-Token", generateCsrf())
	b.ResetTimer()
	for b.Loop() {
		w := httptest.NewRecorder()
		router.ServeHTTP(w, req)
	}
}

The above code shows a simple benchmark that focuses on the Magic Link handler, with the results shown below.

[Figure: Benchmark Results (baseline)]

Metrics Legend

Symbol   Description
op       Operation
ns       Nanoseconds
allocs   Number of memory allocations
B        Bytes allocated
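
The allocs and B columns come from Go's benchmark memory statistics; assuming the benchmark file sits alongside the router code, the numbers can be reproduced with something like:

go test -run=^$ -bench=BenchmarkGin -benchmem -count=5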

Buffered Logging

Writing logs to disk synchronously per request is slow. Unless your system is fragile and you fear that logs might go missing on your critical path, it makes sense to buffer the incoming logs before writing them to disk; grouping these writes reduces the number of syscalls.

The logging library Zap supports this and is currently one of the fastest Go logging libraries.

...using encoding/json and fmt.Fprintf to log tons of interface{} makes your application slow...
- https://github.com/uber-go/zap
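
As a sketch of what buffered writes can look like with Zap, zapcore.BufferedWriteSyncer wraps a destination and flushes in batches. The file name, buffer size, and flush interval below are illustrative; the code later in this post simply uses zap.NewProduction().

// buffered_logger.go (sketch)
package main

import (
	"os"
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newBufferedLogger builds a Zap logger whose writes are buffered in memory and
// flushed to stdout in batches, instead of one write (and syscall) per entry.
func newBufferedLogger() (*zap.Logger, func()) {
	ws := &zapcore.BufferedWriteSyncer{
		WS:            zapcore.AddSync(os.Stdout),
		Size:          256 * 1024,       // flush once the buffer holds ~256 KiB
		FlushInterval: 30 * time.Second, // or at least every 30 seconds
	}
	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		ws,
		zap.InfoLevel,
	)
	// The second return value flushes any buffered entries; call it on shutdown.
	return zap.New(core), func() { _ = ws.Stop() }
}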

When replacing the default Gin logger, we switch from gin.Default() to gin.New(), which skips the built-in middlewares, so we need to add two of our own: (1) logging and (2) recovery.

func MiddlewareZapLogger(logger *zap.Logger) gin.HandlerFunc {
	return func(c *gin.Context) {
		start := time.Now()
		path := c.Request.URL.Path
		raw := c.Request.URL.RawQuery
		if raw != "" {
			path = path + "?" + raw
		}

		c.Next()

		latency := time.Since(start)

		logger.Info("http request",
			zap.Int("status", c.Writer.Status()),
			zap.String("method", c.Request.Method),
			zap.String("path", path),
			zap.String("ip", c.ClientIP()),
			zap.Duration("latency", latency),
			zap.String("user_agent", c.Request.UserAgent()),
			zap.Int("bytes_out", c.Writer.Size()),
			zap.String("errors", c.Errors.ByType(gin.ErrorTypePrivate).String()),
		)
	}
}

func MiddlewareZapRecovery(logger *zap.Logger) gin.HandlerFunc {
	return gin.CustomRecovery(func(c *gin.Context, recovered any) {
		logger.Error("panic recovered",
			zap.Any("error", recovered),
			zap.String("method", c.Request.Method),
			zap.String("path", c.Request.URL.Path),
		)
		c.AbortWithStatus(http.StatusInternalServerError)
	})
}

func main() {
	// ...

	logger, err := zap.NewProduction()
	if err != nil {
		fmt.Printf("failed to create logger: %v\n", err)
		return
	}
	defer logger.Sync()

	router := gin.New() // no default Logger/Recovery middleware
	router.Use(MiddlewareZapLogger(logger))
	router.Use(MiddlewareZapRecovery(logger))

	// ...
}

Running the benchmarks again, we can see the optimized results below.

[Figure: Benchmark Results (optimized)]

Using the Zap logger, the time per operation improved by (22614 - 5430) / 22614 ≈ 76%. This is a huge speed improvement!

[Figure: CPU flamegraph (optimized)]

Looking at the flamegraph, we can see that the fmt.Fprint section is gone, and that the syscalls are greatly reduced.

The CPU utilization is still low as we can't increase the number of concurrent requests due to the RSS limit mentioned here.