ZFS Recordsize

2019-06-11

At Joyent we operate an object storage system. One of the key problems in any storage system is storing data as efficiently as possible. At cloud scale this is more important than ever: small percentage gains in storage efficiency result in massive returns.

For example, a 1% reduction in the capacity needed to store one exabyte of data saves 10 petabytes. TEN PETABYTES. If a storage server has 250TB of usable capacity, that's 40 storage servers' worth of capacity saved. 40 boxes! That's absolutely crazy.
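The back-of-the-envelope math, using only the numbers above:

EB, PB, TB = 10**18, 10**15, 10**12   # decimal storage units

saved = 0.01 * EB                     # 1% of one exabyte
print(saved / PB)                     # 10.0 petabytes
print(saved / (250 * TB))             # 40.0 servers at 250TB usable each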

I was able to find a way for us to save between one and two percent of capacity by diving into the nitty gritty of the ZFS recordsize attribute.

ZFS isn't like other filesystems where block sizes are statically configured when the system is initially deployed. Instead, ZFS has what is called a recordsize. A record is an atomic unit in ZFS: it's the unit used to calculate checksums and RAIDZ parity, and the unit on which compression is performed.

The default recordsize is 128k, at least on illumos. What that means is that a file you write will have a block size of up to 128k. If you write a 64k file it will have a 64k block size. If you write a 256k file it will be made up of two 128k records (and one indirect block to point to the two 128k records, but I digress).

RAIDZ parity is another concept we need to understand. ZFS allocates N disk sectors of parity for each record written, where N is the parity level (1 through 3 for RAIDZ1 through RAIDZ3). So if we write an 8k file onto a RAIDZ1 pool with 4k sector size disks, ZFS will write one 8k record (two 4k data sectors) and one 4k RAIDZ parity sector. Padding sectors are additionally added until the total allocation is a multiple of (parity + 1) sectors, so this write picks up one more 4k sector. I'm not sure why this is done. In this example the RAIDZ storage overhead is at least 50%: we're using 12k (16k counting the padding) to store 8k of user data. That's pretty bad!

This is why a small recordsize is bad with RAIDZ. The efficiency is atrocious.
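Here's a minimal sketch of that per-record accounting in Python, assuming 4k sectors and the simplified "N parity sectors per record" model described above (the real allocator charges parity per stripe row, so treat this as illustrative only):

import math

def raidz_record_sectors(psize, sector=4096, parity=1):
    """Rough per-record RAIDZ cost: data sectors, plus `parity` whole
    sectors of parity (a simplification that holds when the record fits
    in one stripe row), plus padding up to a multiple of parity + 1."""
    data = math.ceil(psize / sector)
    pad = -(data + parity) % (parity + 1)
    return data, parity, pad

# The 8k record on RAIDZ1 with 4k-sector disks from above, plus a 128k
# record for comparison -- the parity and padding costs shrink relative
# to the data as the record grows:
for psize in (8 * 1024, 128 * 1024):
    data, par, pad = raidz_record_sectors(psize)
    print(psize, data, par, pad, (data + par + pad) * 4096)
# 8192   -> 2 data + 1 parity + 1 pad = 16k on disk
# 131072 -> 32 data + 1 parity + 1 pad = 136k on disk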

The larger the records, the lower the RAIDZ overhead, since the RAIDZ overhead is mostly constant per-record. Right? Maybe, but also maybe not. I initially thought that would be the case, but after doing some math and observing how ZFS behaves, I am less certain.

We know what happens when we write files smaller than the recordsize, but what happens when we write files larger than the recordsize?

I wrote two files and examined them with zdb(1M). Two filesystems were used: one with a 128k recordsize and one with a 1M recordsize. 1M is currently the largest recordsize available without modifying the ZFS code (though ZFS supports larger record sizes). Each file is larger than its filesystem's recordsize by exactly one byte:

[root@coke /var/tmp/recordsize_testing]# ls -l /testpool/test1/ /testpool/test0
/testpool/test0:
total 532
-rw-r--r--   1 root     root      131073 Jun  5 20:45 worst_case

/testpool/test1/:
total 4091
-rw-r--r--   1 root     root     1048577 Jun  5 20:45 worst_case

[root@coke /var/tmp/recordsize_testing]# zdb -vvO testpool/test0 worst_case

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K   266K    512   256K  100.00  ZFS plain file
                                               168   bonus  System attributes
...
Indirect blocks:
               0 L1  0:8a1e00:1800 20000L/1000P F=2 B=214/214
               0  L0 0:cc00:27600 20000L/20000P F=1 B=214/214
           20000  L0 0:2a6600:27600 20000L/20000P F=1 B=214/214

        segment [0000000000000000, 0000000000040000) size  256K

[root@coke /var/tmp/recordsize_testing]# zdb -vvO testpool/test1 worst_case

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K     1M  2.00M    512     2M  100.00  ZFS plain file
                                               168   bonus  System attributes
...
Indirect blocks:
               0 L1  0:8a3600:1800 20000L/1000P F=2 B=214/214
               0  L0 0:34200:139200 100000L/100000P F=1 B=214/214
          100000  L0 0:16d400:139200 100000L/100000P F=1 B=214/214

        segment [0000000000000000, 0000000000200000) size    2M

We can see that when we write more than recordsize, an entire recordsized block is allocated for the last record in the object: all of a file's data blocks share a single block size, so the one-byte tail still consumes a full record. That means we have almost 100% overhead for these recordsize-plus-one-byte files.
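Put differently, here's a tiny sketch of the logical-space consequence (it ignores parity, compression, and indirect blocks, and simplifies the smaller-than-recordsize case):

import math

def logical_size(filesize, recordsize):
    """Rough logical space for a file, per the behavior observed above:
    a file that fits in one record gets a single block sized to the data
    (a simplification), while a larger file is stored as
    ceil(filesize / recordsize) full-sized records."""
    if filesize <= recordsize:
        return filesize
    return math.ceil(filesize / recordsize) * recordsize

for rs in (128 * 1024, 1024 * 1024):
    size = rs + 1                       # the worst-case files from above
    alloc = logical_size(size, rs)
    print(rs, alloc, alloc - size)      # 256K and 2M allocated, nearly a full record wasted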

This was a very unfortunate discovery, but I'm glad I noticed this before we suggested deploying this recordsize change to production.

I ended up writing a pretty complicated calculator to simulate how ZFS would use storage capacity. It's available here.

It can take many arguments to tweak the simulation however you see fit; the most important argument for our case was recordsize. We have the benefit of our storage nodes uploading a manifest of the files and file sizes they store, so I was able to quickly see how different recordsizes would lead to different amounts of allocated storage based on real production data (a rare situation!).

This simulator gave us the knowledge we needed to determine the optimal recordsize for the exact objects we are storing.
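To give a flavor of the kind of accounting involved, here is a toy sketch in the same spirit. It is not the actual calculator: it assumes 4k sectors, uses the simplified per-record RAIDZ model from earlier, and feeds in a few made-up object sizes where the real tool consumes the manifests described above. Its printed fields only loosely mirror the real report shown below.

import math

SECTOR = 4096   # assumed 4k sectors; the real pool's ashift may differ

def simulate(file_sizes, recordsize, parity=2):
    """Toy capacity accounting: split each object into records (the tail
    of a multi-record object costs a full record, per the zdb finding
    above), then charge parity and padding per record using the same
    simplified RAIDZ model as earlier in the post."""
    used = wasted = records = raidz = padding = 0
    for size in file_sizes:
        nrec = max(1, math.ceil(size / recordsize))
        blocks = [recordsize] * nrec if nrec > 1 else [size]
        wasted += sum(blocks) - size
        for psize in blocks:
            data = math.ceil(psize / SECTOR)
            pad = -(data + parity) % (parity + 1)
            used += (data + parity + pad) * SECTOR
            raidz += parity
            padding += pad
            records += 1
    print(f"Recordsize {recordsize}, parity {parity}")
    for value, label in [(used, "Bytes Used"), (wasted, "Wasted Bytes"),
                         (records, "Records"), (raidz, "RAIDZ sectors"),
                         (padding, "Padding sectors")]:
        print(f"{value}\t{label}")

# made-up object sizes -- the real input is the manifest of file sizes
# uploaded by each storage node
sizes = [131073, 1048577, 900 * 1024, 4 * 1024 * 1024]
simulate(sizes, recordsize=128 * 1024, parity=2)
simulate(sizes, recordsize=1024 * 1024, parity=2)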

If you're always storing objects slightly under 1M, a 1M recordsize will definitely be most efficient in terms of space used for your data. In an object storage system we have the benefit of the objects being immutable. This saves us from many of the sticky points of using enormous recordsize values.

In our case we also store medium-sized objects, not just large ones (where the last-record overhead would be lost in the noise), so a 1M recordsize is not best for us. The simulator confirmed this.

It outputs data like this:

Simulating using account *, parity 2, and recordsize 131072
=== REPORT ===
326233044036162         Bytes Used
4412350962088           Wasted Bytes
2035543840              Records
4071087680              RAIDZ sectors
7001110                 Padding sectors
296.71                  TiB Used
4                       TiB wasted
15.17                   TiB RAIDZ

Another thing to consider: the larger the recordsize, the fewer records you will have. This may help you avoid nasty fragmentation-related issues on your zpools.