Building A Simple Filesystem Backed Cache With Golang
By MzFit
- 17 minutes read - 3435 words

I’m an older developer who enjoys building simple, easy to maintain software. I tend to agree with the maxim “make it as simple as possible but no simpler.” I’ve also been around long enough that I’ve been bitten by issues with other people’s software. I tend to enjoy programming in the Go programming language because it generally behaves the way I expect about 95% of the time, which means I tend not to be bitten by it. It also has a rich standard library that makes it easy to build many things without reaching for third party libraries.
In that vein, there’s an eternal debate in software development that goes along the lines of:
I would be foolish to write my own software when there is probably an existing, well tested library or framework that already does what I need.
vs.
I would be foolish to use someone else’s software, it won’t behave exactly the way I want and will come with its own bugs.
Navigating this tension is more art than science. On the one hand, there are large, complex things that would be foolish to try to reinvent, say, databases or operating systems.
On the other hand, there are well documented cases where developers have had catastrophic failures by depending on third party libraries when they would have been better off writing their own. For example, the Node.js left-pad incident, where a developer removed his 11-line JavaScript library from the npm package registry and “broke the Internet”.
I do tend to be more on the conservative side. I prefer to write my own software. It forces me to think through what I actually intend to do and also frees me from having to constantly monitor/patch/update my software because of mistakes that others made in writing their software that I depend on.
If you want to skip this writeup and jump directly to my 160 lines of code, you can do that here: https://github.com/sethgecko13/mzcache
It provides two functions, Write and Read. Here’s a brief look at their usage:
func cache() {
	// a unique key such as a URL
	key := "https://youtube.googleapis.com/youtube/v3/channels?part=contentDetails,snippet,statistics,brandingSettings&id=ABC123"
	value := "text result from calling url"
	err := mzcache.Write(key, value)
	if err != nil {
		log.Printf("an error occurred: %s", err.Error())
	}
	// the second argument is the number of days before the entry expires
	result, err := mzcache.Read(key, 7)
	if err != nil {
		log.Printf("an error occurred: %s", err.Error())
	}
	log.Printf("result from cache is: %s", result)
}
The Cutoff: Don’t Write Software That’s Too Gross
A general rule of thumb: if I can write software in about the same number of lines of code as the config file of a 3rd party package, I’d rather write the software. And let’s arbitrarily say that config files tend to be dozens of lines.
So, how many dozens of lines is too much?
I decided to re-purpose an old Mad Magazine joke:
anything more than two dozen is “two gross”
A dozen dozen - 144 - is one gross in traditional counting units. So if I can write the software myself in fewer than 288 lines of code (two gross!), then I should usually do that.
Example: Filesystem Caching
I came across a recent question on the Reddit golang forum where a developer asked whether he would be better off creating his own caching routine or reusing a preexisting caching library or package.
The majority of respondents seemed to think it would be foolish to build a caching routine from scratch. One user even suggested this list of existing golang caches: https://github.com/avelino/awesome-go?tab=readme-ov-file#caches.
This piqued my interest because I had also needed to implement caching in my application, and I had decided to do it myself.
My use case is that I need to call YouTube APIs to search and fetch information about YouTube workout videos. Because YouTube has a quota system in place for their APIs, and because the information changes relatively infrequently, I had a need to cache the results of the API calls. For example, if I call the API to retrieve information about a popular channel like Yoga with Adriene, which has over 700 videos, I didn’t want to incur the performance hit of waiting for 700 network calls. I also didn’t want to incur the quota hit for those 700 calls.
So I definitely need to cache the data. But on the other hand, retrieving the data I need entails storing at least 3.4GB of JSON, more than I want to put in memory. This rules out most of the existing golang caching libraries. This data needs to be cached to disk.
There were other suggestions on the Reddit thread, such as caching using Cloudflare, or using a caching engine such as Varnish. Both are good ideas, but I think ultimately unsuitable for my use case (or for the Reddit poster, who needs to cache terabytes of text files).
I’m going to walk through my thinking on why I built my own disk based caching routines and compare them to Varnish.
I’ll also rule out Cloudflare right off the bat. My understanding of Cloudflare caching is that I would need to pass “youtube.googleapis.com” as the hostname through to Google’s/YouTube’s servers. There may be a magical incantation of settings in Cloudflare that allows this to work. But at first go, I wasn’t able to find it.
And since Cloudflare caching is a bit of a black box, I felt confident I could meet my requirements faster by explicitly writing the cache in Go than by hunting and pecking for settings in Cloudflare.
Besides, I’m a developer, I prefer writing code over clicking configuration buttons.
Requirements
These are the things I realized I needed to ask myself as I started writing my caching routine.
- Concurrent Fetches: Do I want to prevent concurrent fetches in case of a cache miss?
- In-Memory: Do I need to cache in memory or just in the filesystem?
- Filesystem Limitations: Do I understand how many cache files my filesystem can handle in a single directory before performance suffers? And how many files it can hold in total?
- Troubleshooting Considerations: How do I troubleshoot the cache if it misbehaves?
- Compression: Do I want to compress the cache?
- Hashing: How do I uniquely store each record in the cache?
- Expiration: How do I purge/expire items in the cache?
Concurrent Fetches
I decided that I did want to prevent concurrent fetches for the same data. If I had 5 threads that all requested the same list of YouTube videos, I wanted the behavior to look like:
fetchData (100ms / 100ms elapsed)
cacheHit (1ms / 101ms elapsed)
cacheHit (1ms / 102ms elapsed)
cacheHit (1ms / 103ms elapsed)
cacheHit (1ms / 104ms elapsed)
and not
fetchData (100ms)
fetchData (101ms)
fetchData (99ms)
fetchData (110ms)
fetchData (105ms)
This keeps my server more efficient (it does less IO), provides a slightly more consistent user experience, and, most importantly, protects my YouTube quota from being easily exhausted.
Unfortunately, I could not find an easy way to create arbitrary per-key mutexes using the standard library, so I decided to use flock for filesystem-based locking. (If someone notices something I missed for doing arbitrary memory-based mutexes, I would love to hear about it.)
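To make that concrete, here is a minimal sketch of the per-key flock pattern, assuming github.com/gofrs/flock and hypothetical lookup/fetch callbacks (the real library code appears later in this post). Goroutines racing on the same key serialize on a single lock file, so only the first caller pays for the fetch:

package cachesketch

import (
	"crypto/sha256"
	"fmt"
	"path/filepath"

	"github.com/gofrs/flock"
)

// fetchOnce serializes concurrent fetches for the same key: the first
// caller holds the per-key file lock while it fetches; later callers
// block on Lock() and then find the value via the cache lookup.
func fetchOnce(lockDir, key string,
	lookup func(string) (string, bool),
	fetch func(string) (string, error)) (string, error) {
	// derive the lock file name from a hash so any string can be a key
	sum := sha256.Sum256([]byte(key))
	l := flock.New(filepath.Join(lockDir, fmt.Sprintf("%x", sum)))
	if err := l.Lock(); err != nil {
		return "", err
	}
	defer l.Unlock() // sketch: unlock error ignored for brevity
	if v, ok := lookup(key); ok {
		return v, nil // another goroutine fetched it while we waited
	}
	return fetch(key)
}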
In-Memory
I also decided that I wanted to keep my server costs down. SSD disk storage is cheap and fast enough; RAM is relatively more expensive. I did not want a large in-memory cache of API responses to hurt performance by forcing the rest of the application to swap to disk.
Retrieving cache from disk in sub-millisecond time is a good enough user experience for me. No need to squeeze it down to nanosecond time in RAM when a user will never notice the difference. And at least on Linux, filesystem access will get cached opportunistically by the operating system. If there’s a popular file, I can probably trust the filesystem/OS to load it into memory for me for faster access (and purge from cache when it’s no longer popular) without me having to build my own memory management/eviction system.
Interestingly, none of the existing Go caching libraries seemed to have a filesystem only option, at least from reading their descriptions: https://github.com/avelino/awesome-go?tab=readme-ov-file#caches
Filesystem Considerations
A naive caching strategy would be to write each file into a single directory:
cache/
file1
file2
...
file100000
However, according to Stack Overflow, ext4 starts to have issues around 10,000 files in a single directory:
Note, that more than 10K files in a directory can cause difficulties for many tools. 100K can be really hard on utilities.
Source: https://stackoverflow.com/questions/17537471/what-is-the-max-files-per-directory-in-ext4
Also, the number of inodes available on the filesystem at creation time limits the number of files I can cache. So for example, on my stock cloud-provided server, doing nothing special to format the filesystem, I see 6 million inodes, 5.3 million of which are free:
$ df -iT
Filesystem     Type   Inodes  IUsed    IFree IUse% Mounted on
/dev/sda1      ext4  6029376 656416  5372960   11% /
Which means I can support several million cache files without issue. I decided to use a directory structure that provided for 256 subdirectories, each of which could have its own 256 subdirectories. That gives 256 * 256 = 65,536 possible directories; splitting 6 million files across them works out to ~ 91 files per directory, assuming even distribution.
One layer of 256 subdirectories would have given me the possibility of ~ 23,000 files per directory, where I would probably see performance issues. Three layers of subdirectories would give 256^3, or roughly 16.8 million directories, which requires more inodes than my filesystem has. (And yes, I did make this mistake in development. Running out of inodes on your filesystem is never a good thing. But to anyone who wants to say this proves I shouldn’t have written my own… well… I’ve made much worse mistakes running other people’s software when I didn’t fully understand it.)
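To illustrate the fan-out with a concrete (hypothetical) key: the first two pairs of hex characters of the SHA-256 hash pick the two directory levels, exactly as the library code below does.

package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	key := "https://example.com/some/api/call" // hypothetical key
	sum := fmt.Sprintf("%x", sha256.Sum256([]byte(key)))
	// chars 0-1 pick one of 256 top-level directories, chars 2-3 one
	// of 256 second-level directories, and the rest name the file
	fmt.Printf("/var/tmp/mzcache/%s/%s/%s.gz\n", sum[0:2], sum[2:4], sum[4:])
}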
Troubleshooting Considerations
I wanted my solution to be easy to debug. I’ve spent enough time in my career waiting for database indexes to rebuild and troubleshooting corrupted backups that I wanted a simple one-to-one approach where I could navigate to a cache file using standard Linux utilities. For example:
cd /var/tmp/mzcache/dir1/dir2/
ls -l
gzip -dc cached_file.gz | less
While it would have been more space efficient to have larger binary files containing multiple cached documents, this would have complicated the implementation and the troubleshooting. I never wanted to be troubleshooting my own custom binary format in the middle of a crisis. And I certainly didn’t want to have to learn enough about someone else’s caching to have to troubleshoot theirs in a crisis.
This is another reason I didn’t want to use Cloudflare. If API caching wasn’t doing its job, how would I even go about researching it?
Compression
Generally speaking, compression saves disk space and speeds up response time. Why? Fetching data from disk, even SSD, is slow. Therefore minimizing the amount of data stored on disk improves fetch time. So even though the CPU time needed for decompression is greater, the CPU time spent waiting for IO is less. Compression on disk generally improves performance and saves money on storage. Win-Win! So my code enables gzip compression in order to improve performance and generally save disk space.
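As a rough sketch of the space side of that trade-off, the standard library’s compress/gzip is all that’s needed. This toy program (repetitive JSON, so the ratio is flattering; real API responses will vary) compares raw and compressed sizes:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"strings"
)

func main() {
	// repetitive JSON compresses very well; real API responses vary
	payload := strings.Repeat(`{"videoId":"abc123","title":"workout"},`, 1000)
	var buf bytes.Buffer
	g := gzip.NewWriter(&buf)
	if _, err := g.Write([]byte(payload)); err != nil {
		panic(err)
	}
	if err := g.Close(); err != nil { // Close flushes the gzip trailer
		panic(err)
	}
	fmt.Printf("raw: %d bytes, gzipped: %d bytes\n", len(payload), buf.Len())
}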
Hashing
I was concerned about cache collisions because about a decade ago, I launched a project using a well known caching library. My memory is we got pretty close to production. Then we hit a showstopper bug where our testers were getting completely wrong results when using our product. It turns out the default behavior was to use a key/hash generator that was prone to collisions. Which, to this day, seems to me like the most ridiculous default behavior for a caching library.
So I wanted to be able to control my own keys to ensure uniqueness, and I wanted to use a robust hashing algorithm for which collisions are effectively impossible.
I decided SHA256 would be sufficient:
The usual answer goes thus: what is the probability that a rogue asteroid crashes on Earth within the next second, obliterating civilization-as-we-know-it, and killing off a few billion people? It can be argued that any unlucky event with a probability lower than that is not actually very important.
If we have a “perfect” hash function with output size n, and we have p messages to hash (individual message length is not important), then the probability of collision is about p^2 / 2^(n+1) (this is an approximation which is valid for “small” p, i.e. substantially smaller than 2^(n/2)). For instance, with SHA-256 (n = 256) and one billion messages (p = 10^9) then the probability is about 4.3×10^-60.
A mass-murderer space rock happens about once every 30 million years on average. This leads to a probability of such an event occurring in the next second to about 10^-15. That’s 45 orders of magnitude more probable than the SHA-256 collision. Briefly stated, if you find SHA-256 collisions scary then your priorities are wrong.
So that’s what I used.
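For what it’s worth, here is a quick back-of-the-envelope check of the quoted figure, done in logarithms to avoid overflow (my own arithmetic, not from the quoted answer):

package main

import (
	"fmt"
	"math"
)

func main() {
	// collision probability ≈ p^2 / 2^(n+1) for p messages and an n-bit hash
	p, n := 1e9, 256.0
	log10prob := 2*math.Log10(p) - (n+1)*math.Log10(2)
	// prints ≈ -59.37, which matches the quoted ~4.3×10^-60
	fmt.Printf("log10(collision probability) ≈ %.2f\n", log10prob)
}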
Expiration/Purging
Finally, I wanted to have tight control over when the cached items would expire and/or be purged. I decided that if the cache expired at a fixed time, I could refresh it with frequently accessed items during a relatively idle period for my app. I wanted to avoid a poor user experience where items expired randomly throughout the day and if a user was unlucky enough to log in right around the time the cached items they needed expired, they might pay a penalty.
I was skeptical that I could find a caching product that gave me this type of control, since most Internet based caching seems to be more along the lines of “cache this for X hours” instead of “cache this until midnight tonight”. Maybe one exists, but, again, I felt my time was better spent writing the rather trivial code needed to implement it exactly the way I wanted.
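A sketch of the difference: comparing formatted dates rather than raw timestamps makes everything written on the same day expire together at midnight. This is the same day-granularity trick the Read function below uses; the mod time here is hypothetical:

package main

import (
	"fmt"
	"time"
)

func main() {
	days := 1                                 // entries are valid for today only
	written := time.Now().Add(-2 * time.Hour) // hypothetical file mod time
	// with days=1 the cutoff is today's date; any older date string
	// has crossed a midnight boundary and is expired
	cutoff := time.Now().AddDate(0, 0, -(days - 1)).Format("2006-01-02")
	fmt.Println("expired:", written.Format("2006-01-02") < cutoff)
}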
Since my caching is filesystem based, purging of files was trivial to leave to the operating system:
# Delete cache files older than 7 days (runs daily at 00:15)
15 0 * * * find /var/tmp/mzcache/ -type f -mtime +7 -delete
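If you would rather purge from inside the application instead of cron, here is a minimal sketch of the same policy using filepath.WalkDir (my own variant, not part of the library):

package main

import (
	"io/fs"
	"log"
	"os"
	"path/filepath"
	"time"
)

func main() {
	cutoff := time.Now().AddDate(0, 0, -7) // same 7-day policy as the cron job
	err := filepath.WalkDir("/var/tmp/mzcache", func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() {
			return walkErr
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if info.ModTime().Before(cutoff) {
			return os.Remove(path) // delete expired cache files
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}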
Again, I ended up doing this in 160 lines of code https://github.com/sethgecko13/mzcache.
package mzcache

import (
	"compress/gzip"
	"crypto/sha256"
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"time"

	"github.com/gofrs/flock"
)

// LockPath is the directory holding the per-key lock files.
var LockPath = getLockPath()
var cachePath = "/var/tmp/mzcache"

var ErrCacheEmptyString = errors.New("empty string passed in to cache")
var ErrCacheCreateDirectory = errors.New("unable to create cache directory")
var ErrCacheCreate = errors.New("unable to create cache file")
var ErrCacheWrite = errors.New("unable to write to cache file")
var ErrCacheSync = errors.New("unable to sync cache file to filesystem")
var ErrCacheLock = errors.New("unable to lock cache file")
var ErrCacheUnlock = errors.New("unable to unlock cache file")
var ErrCacheMiss = errors.New("cache file does not exist")
var ErrCacheDecompress = errors.New("unable to decompress cache file")
var ErrCacheRead = errors.New("unable to read cache file")

// For all the other errors we can use errors.Join to give details about
// the error. Expired cache has to be handled separately because there is
// no underlying error.
type ErrCacheExpired struct {
	FullPath string
}

func (e *ErrCacheExpired) Error() string {
	return fmt.Sprintf("cache expired %s", e.FullPath)
}

// Since caching is so fundamental to my app, I choose to panic if caching
// does not work. You may want to make different decisions if you use this
// library.
func getLockPath() string {
	// os.RemoveAll does not expand globs, so expand the pattern ourselves
	// to clean up lock directories left over from previous runs.
	matches, err := filepath.Glob(filepath.Join(os.TempDir(), "mz*"))
	if err != nil {
		panic("unable to list previous lock files")
	}
	for _, m := range matches {
		if err := os.RemoveAll(m); err != nil {
			panic("unable to remove previous lock files")
		}
	}
	dname, err := os.MkdirTemp("", "mz")
	if err != nil {
		panic("unable to create cache lock directory")
	}
	return dname
}

func hash(key string) string {
	h := sha256.New()
	h.Write([]byte(key))
	return fmt.Sprintf("%x", h.Sum(nil))
}

func getCacheFilePath(key string) (path string, fullPath string, hashKey string) {
	hashKey = hash(key)
	// the first two pairs of hex characters pick the two directory levels,
	// the remaining 60 characters name the file
	path = fmt.Sprintf("%v/%v/%v/", cachePath, hashKey[0:2], hashKey[2:4])
	fullPath = fmt.Sprintf("%v%v.gz", path, hashKey[4:64])
	return path, fullPath, hashKey
}

func getFileLockPath(hashKey string) string {
	return fmt.Sprintf("%s/%s", LockPath, hashKey)
}

// Write stores value under key. The named return value lets the deferred
// unlock report its error.
func Write(key string, value string) (err error) {
	// don't write empty values, they are probably errors
	if value == "" {
		return ErrCacheEmptyString
	}
	path, fullPath, hashKey := getCacheFilePath(key)
	if _, serr := os.Stat(path); os.IsNotExist(serr) {
		if merr := os.MkdirAll(path, 0750); merr != nil {
			return errors.Join(ErrCacheCreateDirectory, merr)
		}
	}
	fileLock := flock.New(getFileLockPath(hashKey))
	if lerr := fileLock.Lock(); lerr != nil {
		return errors.Join(ErrCacheLock, lerr)
	}
	defer func() {
		if uerr := fileLock.Unlock(); uerr != nil {
			err = errors.Join(err, ErrCacheUnlock, uerr)
		}
	}()
	// no need to remove the file first, that causes race conditions;
	// os.Create truncates any previous contents
	file, err := os.Create(filepath.Clean(fullPath))
	if err != nil {
		return errors.Join(ErrCacheCreate, err)
	}
	defer file.Close()
	g := gzip.NewWriter(file)
	if _, werr := g.Write([]byte(value)); werr != nil {
		return errors.Join(ErrCacheWrite, werr)
	}
	// close the gzip stream before syncing so the compressed trailer is
	// flushed to the file
	if cerr := g.Close(); cerr != nil {
		return errors.Join(ErrCacheWrite, cerr)
	}
	if serr := file.Sync(); serr != nil {
		return errors.Join(ErrCacheSync, serr)
	}
	return nil
}

// Read returns the value stored under key if it was written within the
// last `days` days. Dates are compared, not timestamps, so entries
// expire at midnight.
func Read(key string, days int) (result string, err error) {
	path, fullPath, hashKey := getCacheFilePath(key)
	if _, serr := os.Stat(path); os.IsNotExist(serr) {
		return "", errors.Join(ErrCacheMiss, serr)
	}
	fileLock := flock.New(getFileLockPath(hashKey))
	if lerr := fileLock.Lock(); lerr != nil {
		return "", errors.Join(ErrCacheLock, lerr)
	}
	defer func() {
		if uerr := fileLock.Unlock(); uerr != nil {
			err = errors.Join(err, ErrCacheUnlock, uerr)
		}
	}()
	info, serr := os.Stat(fullPath)
	if serr != nil {
		return "", errors.Join(ErrCacheMiss, serr)
	}
	// compare dates rather than timestamps so everything written on the
	// same day expires together at midnight
	cutoff := time.Now().AddDate(0, 0, -(days - 1)).Format("2006-01-02")
	if info.ModTime().Format("2006-01-02") < cutoff {
		return "", &ErrCacheExpired{FullPath: fullPath}
	}
	inputFile, oerr := os.Open(filepath.Clean(fullPath))
	if oerr != nil {
		return "", errors.Join(ErrCacheRead, oerr)
	}
	defer inputFile.Close()
	reader, gerr := gzip.NewReader(inputFile)
	if gerr != nil {
		return "", errors.Join(ErrCacheDecompress, gerr)
	}
	defer reader.Close()
	content, rerr := io.ReadAll(reader)
	if rerr != nil {
		return "", errors.Join(ErrCacheRead, rerr)
	}
	return string(content), nil
}
286 if you count the unit tests, comments, and empty lines. Right under the wire to avoid being two gross.
Varnish
So how does this compare to, say, Varnish?
Varnish is a content caching system built by Poul-Henning Kamp (and others). I have a lot of respect for Poul-Henning Kamp. I am absolutely, 100% sure he is both smarter than me and a better developer than me. He has written at length about what led him to create Varnish and some of the interesting performance tuning that went into Varnish. And I would be remiss if I didn’t mention his excellent A Generation Lost in the Bazaar.
All that being said… using Varnish involves 145 thousand lines of code:
$ cloc *
    1752 text files.
     704 unique files.
    1050 files ignored.

github.com/AlDanial/cloc v 2.02  T=0.70 s (1012.8 files/s, 301809.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                              243          16647          18292          88829
reStructuredText               189          10777           2696          24306
C/C++ Header                   179           3170           9347          23998
m4                               9            332              5           2725
Python                          11            570            774           2232
make                            36            456            115           1520
SVG                              4              0            205           1215
Bourne Shell                    31            215            283            899
awk                              1             25             29            115
XSD                              1              0              0             20
-------------------------------------------------------------------------------
SUM:                           704          32192          31746         145859
-------------------------------------------------------------------------------
That’s ~ 51,000% bigger than my solution. And it’s complicated enough that it has its own configuration language.
Also, assuming a conservative 1 bug per 1000 lines of code, that means there are probably at least 145 bugs lurking in Varnish to bite me.
So to bring it full circle, the sample configuration file that explains how to use its main features is 116 lines of VCL code.
That’s within spitting distance of my 160 lines of Go code. Yes, I understand that not all of this config would be needed for my use case.
But quite frankly, I’m glad I don’t have to figure out which ones would be needed.
Conclusion
In conclusion, I feel like I made the right choice to build a filesystem-based cache for my application. It does exactly what I need, is easy to follow, is low maintenance, and did not require me to learn new technologies or configuration syntax.
For generalized web caching, I do actually use Cloudflare to cache and serve up this article you’re reading. And if I didn’t have Cloudflare, I probably would have used Varnish. Generalized web caching is hard, and I have no intention of trying to re-do what took some very smart people 145k lines of code to implement in Varnish.
On the other hand, my needs were simple, and one of the things I particularly like about Go is that it provides great tools for doing simple things simply.
Here’s my solution again if you want to look at the code now that you’ve seen the thinking that went into it:
https://github.com/sethgecko13/mzcache