妙到颠毫：你应该学会的 bigcache 优化技巧

2019-12-18 Go语言中文网

本文作者：鸟窝 smallnest

原文链接：https://colobu.com/2019/11/18/how-is-the-bigcache-is-fast/

最近看到 yoko 翻译的一篇文章: [译] Go 开源项目 BigCache 如何加速并发访问以及避免高额的 GC 开销[1]，翻译自 How BigCache avoids expensive GC cycles and speeds up concurrent access in Go[2], 应该是 Douglas Makey Mendez Molero 在阅读了 bigcache 的作者写的 bigcache 设计文章Writing a very fast cache service with millions of entries in Go[3]做的一些调研和总结。

我在刚读取这篇文档的时候，顺着连接把相关的文章都找出来细细读了一遍，结合 bigcache 的代码，仔细学习了相关的优化设计，感觉设计非常的精妙，所以特意根据自己的理解又总结了一篇。

bigcache 的精妙的设计也吸引了 fasthttp 的作者 Aliaksandr Valialkin，他在 bigcache 的基础上，结合自己的公司的使用场景，进一步的做了相应的优化，也开源了这个项目 fastcache[4], 本文在最后也做了介绍。

设计 BigCache 的初衷

bigcache 的作者也不是想当然的开发一个库，而且项目遇到了需求。需求如下：

支持 http 协议
支持 10K RPS (5k 写，5k 读)
cache 对象至少保持 10 分钟
相应时间平均 5ms, p99.9 10 毫秒， p99.999 400 毫秒
其它 HTTP 的一些需求

为了满足这些需求，要求开发的 cache 库要保证：

即使有百万的缓存对象也要非常快
支持大并发访问
一定时间后支持剔除

作者考察了一番缓存框架比如 memcached、redis、couchbase 等，发觉都不太满足需求，因为这些都是独立的程序，访问它们需要网络的开销，延时无法保障，作者需要一个进程内的基于内存的 cache 库。虽然 Go 生态圈有众多的 cache 库如 LRU groups cache[5], go-cache[6], ttlcache[7], freecache[8]，但是只有 freecache 满足需求，不过作者最后还是决定自己取开发一个 cache 库。

以上是 bigcache 诞生的背景，接下来我们欣赏一下 bigcache 和其它库优美的设计。

处理大并发访问

cache 就像一个大的 hashtable, 可不可以使用一个map[string][]byte + sync.RWMutex实现满足需求的 cache 呢？

sync.RWMutex虽然对读写进行了优化，但是对于并发的读，最终还是把写变成了串行，一旦写的并发量大的时候，即使写不同的 key, 对应的 goroutine 也会 block 住，只允许一个写执行，这是一个瓶颈，并且不可控。

解决并发的问题有一个方法叫做 shard (分片)，每个分片一把锁。很多大并发场景下为了减小并发的压力都会采用这种方法，大的场景比如数据库的分片，小的场景就如刚才的场景。Java 8 之前的 ConcurrentMap 就是采用分片(segment)的方式减少竞争, Go 也有一个类似思想设计的 map 库:concurrent-map[9]。

对于每一个缓存对象，根据它的 key 计算它的哈希值: hash(key) % N, N是分片数量。理想情况下 N 个 goroutine 每次请求正好平均落在各自的分片上，这样就不会有竞争了，即使有多个 goroutine 落在同一个分片上，如果 hash 比较平均的话，单个 shard 的压力也会比较小。

竞争小了有什么好处？延迟可以大大提高，因为等待获取锁的时间变小了。

当然这里有一些要考虑的地方：

1、N 的选择

既然分片可以很好的降低锁的竞争，那么 N 是不是越大越好呢？当然不是，如果 N 非常大，比如每个缓存对象一个锁，那么会带来很多额外的不必要的开销。可以选择一个不太大的值，在性能和花销上寻找一个平衡。

另外, N 是 2 的幂，比如 16、32、64。这样设计的好处就是计算余数可以使用位运算快速计算。

func (c *BigCache) getShard(hashedKey uint64) (shard *cacheShard) {
\treturn c.shards[hashedKey&c.shardMask]
}

因为对于 2 的幂 N，对于任意的x, 下面的公式成立:

x mod N = (x & (N − 1))

所以只需要使用一次按位 AND (&)就可以求得它的余数。

2、选择 hash 算法

以前已经有非常多的哈希算法，最近几年也出现了一些新的哈希算法，也被人使用 Go 语言来实现。

很显然，一个优秀的哈希算法要保证:

哈希值应该比较随机 (质量)
哈希速度比较快 (速度)
尽量不产生额外的内存分配，避免对垃圾回收产生压力 (耗费资源少)

项目hash-bench[10]对常用的几种 Hash 算法进行了比较。

bigcache 提供了一个默认的 Hash 的实现，采用 fnv64a 算法。这个算法的好处是采用位运算的方式在栈上进行运算，避免在堆上分配。

type fnv64a struct{}

const (

\t// offset64 FNVa offset basis. See https://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function#FNV-1a_hash
\toffset64 = 14695981039346656037

\t// prime64 FNVa prime value. See https://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function#FNV-1a_hash
\tprime64 = 1099511628211
)

// Sum64 gets the string and returns its uint64 hash value.
func (f fnv64a) Sum64(key string) uint64 {
\tvar hash uint64 = offset64
\tfor i := 0; i < len(key); i++ {
\t\thash ^= uint64(key[i])
\t\thash *= prime64
\t}

\treturn hash
}

忽略内存开销

对于 Go 语言中的 map, 垃圾回收器在 mark 和 scan 阶段检查 map 中的每一个元素, 如果缓存中包含数百万的缓存对象，垃圾回收器对这些对象的无意义的检查导致不必要的时间开销。

bigcache 的作者做了测试。他们测试了简单的 HTTP/JSON 序列化(不会访问 cache)。在 cache 为空的时候 1 万的 QPS 的耗时大约 10 毫秒。当 cache 填满的时候， 99% 的请求都会超过 1 秒。监控显示堆中包含 4 千万的对象， GC 过程中的 mark 和 scan 也需要 4 秒。

我们可以很容易测试这种状况，比如下面的代码：

package main

import "time"

type Item struct {
\tA string
\tB string
\tC string
\tD string
\tE string
\tF string
\tG G
}

type G struct {
\tH int
\tI int
\tK int
\tL int
\tM int
\tN int
}

func main() {
\tm := make(map[int]*Item, 10*1024*1024)

\tfor i := 0; i < 1024*1024; i++ {
\t\tm[i] = &Item{}
\t}

\tfor i := 0; ; i++ {
\t\tdelete(m, i)
\t\tm[1024*1024+i] = &Item{}
\t\ttime.Sleep(10 * time.Millisecond)
\t}
}

只有一个 map 对象，里面包含一百万的元素，每 10 毫秒删一个放一个。

并发量相当小，并且单个的 goroutine 也没有竞争，但是由于元素的数量巨大，垃圾回收在mark/scan阶段需要花费上百毫秒进行标记和遍历。

那么如何解决这个问题呢？

我们知道垃圾回收器检查的是堆上的资源，如果不把对象放在堆上，不就解决这个问题了吗？还真有这样的项目offheap[11]，它提供了定制的Malloc() 和 Free()，但是你的缓存需要基于这些方法定制。当然一些基于垃圾回收的编程语言为了减少垃圾回收的时间，都会提供相应的库，比如Java: ChronicleMap, Part 1: Go Off-Heap[12]。堆外内存很容易产生内存泄漏。

第二种方式是使用 freecache[13]。freecache 通过减少指针的数量以零 GC 开销实现 map。它将键和值保存在ringbuffer中，并使用索引查找对象。

第三种优化方法是和 Go 1.5 中一个修复有关(#9477[14]), 这个 issue 还描述了包含大量对象的 map 的垃圾回收时的耗时问题，Go 的开发者优化了垃圾回收时对于 map 的处理，如果 map 对象中的 key 和 value 不包含指针，那么垃圾回收器就会对它们进行优化：

runtime: do not scan maps when k/v do not contain pointers
Currently we scan maps even if k/v does not contain pointers. This is required because overflow buckets are hanging off the main table. This change introduces a separate array that contains pointers to all overflow buckets and keeps them alive. Buckets themselves are marked as containing no pointers and are not scanned by GC (if k/v does not contain pointers).
This brings maps in line with slices and chans -- GC does not scan their contents if elements do not contain pointers.
Currently scanning of a map[int]int with 2e8 entries (~8GB heap) takes ~8 seconds. With this change scanning takes negligible time.
https://go-review.googlesource.com/c/go/+/3288

所以如果我们的对象不包含指针，虽然也是分配在堆上，但是垃圾回收可以无视它们。

如果我们把 map 定义成 map[int]int，就会发现 gc 的耗时就会降下来了。

遗憾的是，我们没办法要求用户的缓存对象只能包含int、bool这样的基本数据类型。

解决办法就是使用哈希值作为map[int]int的 key。把缓存对象序列化后放到一个预先分配的大的字节数组中，然后将它在数组中的 offset 作为map[int]int的 value。

type cacheShard struct {
\thashmap     map[uint64]uint32
\tentries     queue.BytesQueue
\tlock        sync.RWMutex
\tentryBuffer []byte
\tonRemove    onRemoveCallback
\tisVerbose    bool
\tstatsEnabled bool
\tlogger       Logger
\tclock        clock
\tlifeWindow   uint64
\thashmapStats map[uint64]uint32
\tstats        Stats
}
func (s *cacheShard) set(key string, hashedKey uint64, entry []byte) error {
\tcurrentTimestamp := uint64(s.clock.epoch())
\ts.lock.Lock()
    // 查找是否已经存在了对应的缓存对象，如果存在，将它的值置为空
\tif previousIndex := s.hashmap[hashedKey]; previousIndex != 0 {
\t\tif previousEntry, err := s.entries.Get(int(previousIndex)); err == nil {
\t\t\tresetKeyFromEntry(previousEntry)
\t\t}
\t}
    // 触发是否要移除最老的缓存对象
\tif oldestEntry, err := s.entries.Peek(); err == nil {
\t\ts.onEvict(oldestEntry, currentTimestamp, s.removeOldestEntry)
\t}
    // 将对象放入到一个字节数组中，如果已有的字节数组(slice)可以放得下此对象，则重用，否则新建一个字节数组
\tw := wrapEntry(currentTimestamp, hashedKey, key, entry, &s.entryBuffer)
\tfor {
        // 尝试放入到字节队列中，成功则加入到map中
\t\tif index, err := s.entries.Push(w); err == nil {
\t\t\ts.hashmap[hashedKey] = uint32(index)
\t\t\ts.lock.Unlock()
\t\t\treturn nil
        }
        // 如果空间不足，移除最老的元素
\t\tif s.removeOldestEntry(NoSpace) != nil {
\t\t\ts.lock.Unlock()
\t\t\treturn fmt.Errorf("entry is bigger than max shard size")
\t\t}
\t}
}
func wrapEntry(timestamp uint64, hash uint64, key string, entry []byte, buffer *[]byte) []byte {
\tkeyLength := len(key)
\tblobLength := len(entry) + headersSizeInBytes + keyLength
\tif blobLength > len(*buffer) {
\t\t*buffer = make([]byte, blobLength)
\t}
\tblob := *buffer
\tbinary.LittleEndian.PutUint64(blob, timestamp)
\tbinary.LittleEndian.PutUint64(blob[timestampSizeInBytes:], hash)
\tbinary.LittleEndian.PutUint16(blob[timestampSizeInBytes+hashSizeInBytes:], uint16(keyLength))
\tcopy(blob[headersSizeInBytes:], key)
\tcopy(blob[headersSizeInBytes+keyLength:], entry)
\treturn blob[:blobLength]
}

queue.BytesQueue 是一个字节数组，可以做到按需分配。当加入一个[]byte时，它会把数据 copy 到尾部。

值得注意的是删除缓存元素的时候 bigcache 只是把它的索引从map[uint64]uint32中删除了，并把它在queue.BytesQueue队列中的长度置为 0。那么删除操作会不会在queue.BytesQueue中造成很多的“虫洞”？从它的实现上来看，会, 而且这些"虫洞"不会被整理，也不会被移除。因为它的底层是使用一个字节数组实现的，"虫洞"的移除是一个耗时的操作，会导致锁的持有时间过长。那么寻找合适的"虫洞"重用呢？虽然遍历的方法寻找"虫洞"也是一个比较耗时的操作，我觉得这里有优化的空间。

bigcache 只能等待清理最老的元素的时候把这些"虫洞"删除掉。

剔除

对于 bigcache 来说，剔除还有意义吗？或许有。如果我们不想使用某个 key 的缓存对象，我们可以把它删除。

首先，在增加一个元素之前，会检查最老的元素要不要删除。

if oldestEntry, err := s.entries.Peek(); err == nil {
\ts.onEvict(oldestEntry, currentTimestamp, s.removeOldestEntry)
}

其次，在添加一个元素失败后，会清理空间删除最老的元素。

同时，还会专门有一个定时的清理 goroutine, 负责移除过期数据。

另外需要注意的是缓存对象没有读取的时候刷新过期时间的功能，所以放入的缓存对象最终免不了过期的命运。

另外所有的缓存对象的 lifewindow 都是一样的，比如 30 分钟、两小时。

所以，如果你真的使用 bigcache, 还是得需要注意它的这些设计，看看这些设计是否和你的场景相吻合。

fastcache

bigcache 在特定时候还是有问题，就是当 queue.BytesQueue 的容量不够的时候，它会进行扩展，扩展是一个很重的操作，它会复制原来的数据到新的字节数组上。

fasthttp 的作者采用类似 bigcache 的思想实现了fastcache[15]，他使用 chunks [][]byte替换 queue.BytesQueue，chunk 是一个 ring buffer, 每个 chunk 64KB。

type bucket struct {
\tmu sync.RWMutex
\t// chunks is a ring buffer with encoded (k, v) pairs.
\t// It consists of 64KB chunks.
\tchunks [][]byte
\t// m maps hash(k) to idx of (k, v) pair in chunks.
\tm map[uint64]uint64
\t// idx points to chunks for writing the next (k, v) pair.
\tidx uint64
\t// gen is the generation of chunks.
\tgen uint64
\tgetCalls    uint64
\tsetCalls    uint64
\tmisses      uint64
\tcollisions  uint64
\tcorruptions uint64
}

虽然chunks [][]byte也包含很多的 chunk, 但是由于 chunk 的 size 比较大，所以可以大大缩小垃圾回收需要 mark/scan 的对象的数量。带来的好处就是扩容的时候只需要增加更多的 chunk 即可。

删除还是一样，只是从 map 中删除，不会从chunks中删除。

fastcache 没有过期的概念，所以缓存对象不会被过期剔除。

参考文档

http://allegro.tech/2016/03/writing-fast-cache-service-in-go.html
https://github.com/allegro/bigcache
https://dev.to/douglasmakey/how-bigcache-avoids-expensive-gc-cycles-and-speeds-up-concurrent-access-in-go-12bb
https://pengrl.com/p/35302/
https://github.com/VictoriaMetrics/fastcache
https://www.openmymind.net/Shard-Your-Hash-table-to-reduce-write-locks/
https://medium.com/@itsromiljain/curious-case-of-concurrenthashmap-90249632d335
https://segmentfault.com/a/1190000012926722
https://github.com/coocood/freecache

文中链接

[1]

[译] Go开源项目BigCache如何加速并发访问以及避免高额的GC开销: https://pengrl.com/p/35302/

[2]

How BigCache avoids expensive GC cycles and speeds up concurrent access in Go: https://dev.to/douglasmakey/how-bigcache-avoids-expensive-gc-cycles-and-speeds-up-concurrent-access-in-go-12bb

[3]

Writing a very fast cache service with millions of entries in Go: https://allegro.tech/2016/03/writing-fast-cache-service-in-go.html

[4]

fastcache: https://github.com/VictoriaMetrics/fastcache

[5]

LRU groups cache: https://github.com/golang/groupcache/tree/master/lru

[6]

go-cache: https://github.com/patrickmn/go-cache

[7]

ttlcache: https://github.com/diegobernardes/ttlcache

[8]

freecache: https://github.com/coocood/freecache

[9]

concurrent-map: https://github.com/orcaman/concurrent-map

[10]

hash-bench: https://github.com/smallnest/hash-bench

[11]

offheap: https://godoc.org/github.com/glycerine/offheap

[12]

Java: ChronicleMap, Part 1: Go Off-Heap: https://dzone.com/articles/java-chroniclemap-part-1-go-off-heap

[13]

freecache: https://github.com/coocood/freecache

[14]

#9477: https://github.com/golang/go/issues/9477

[15]

fastcache: https://github.com/VictoriaMetrics/fastcache

喜欢本文的朋友，欢迎关注“Go语言中文网”：

Go语言中文网启用微信学习交流群，欢迎加微信：274768166

妙到颠毫：你应该学会的 bigcache 优化技巧

设计 BigCache 的初衷

处理大并发访问

忽略内存开销

剔除

fastcache

参考文档

文中链接

项目中要不要使用 Go？我是这么思考的

Go语言 CPU 性能、内存分析调试方法大汇总：你要的都在这

Go官方的限流器 time/rate 如何使用

Go语言的 defer 链如何被遍历执行？

Go1.14 的这个改进让 Gopher 生活更美好

字节跳动商业产品研发团队招聘：各种职位应有尽有，Go当然有

TIOBE 发布2020年3月编程语言榜单：Go 冲进前十，Delphi 没落

好消息，Go 新的代码文档中心：pkg.go.dev 不久将开源

Go1.13.8和Go1.12.7发布，Go1.14 延期16天，还有些 bug 没解决

Go 程序员的演变，最后的“Rob Pike”这个梗看懂了吗？

创业公司更适合用 Go 语言，那大公司呢？

别告诉我这是真的？goroutine 可能使程序变慢

Go 语言中文网 2019 年终总结暨 2020 年展望

Go 在马蜂窝即时通讯服务建设中的实践

Go1.13 标准库的 http 包爆出重大 bug，你的项目中招了吗？

明白了，原来 Go Web 框架中的中间件都是这样实现的

像 Awesome-Go 一样提升企业 Go 项目代码质量

妙到颠毫：你应该学会的 bigcache 优化技巧

Go 如何处理 HTTP 请求？掌握这两点即可

sync.Pool 一定会提升性能吗？建议你先学习一下它的设计

Go 号称几行代码开启一个 HTTP Server，底层都做了什么？

Go 简单性的价值：来自对 Go 倍加青睐的谷歌软件工程师的自述

Go语言爱好者周刊：第 18 期

从Go开源项目BigCache学习加速并发访问和避免高额的GC开销