A Brief Introduction to ZFS

Mo Khan - 3431709

0. Abstract

This paper gives an introduction to the Zettabyte File System (“ZFS”), which represented the state of the art in file system design in the early 2000s. Starting with an overview of relevant events prior to and during this period, it presents an introduction to file systems, the design of ZFS, and an analysis of this file system.

1. Introduction

In order to properly understand the relative importance of file systems, we need to place them in their historical context [3]:

  • 1977: FAT: Marc McDonald designs/implements an 8-bit file system. [4]
  • 1980: FAT12: Tim Paterson extends FAT to 12 bits. [4]
  • 1984: FAT16: Cluster addresses are increased to 16 bits. [4]
  • 1985: HFS: Apple Inc. develops the Hierarchical File System. [7]
  • 1993: NTFS: Microsoft develops a proprietary journaling file system. [5]
  • 1993: ext2: Rémy Card replaces the extended file system. [9]
  • 1994: XFS: Silicon Graphics releases a 64-bit journaling file system. [12]
  • 1996: FAT32: Microsoft designs FAT32, which uses 32-bit cluster addresses. [4]
  • 1998: HFS+: Apple Inc. develops the HFS Plus journaling file system. [6]
  • 2001: ZFS: Development of the Zettabyte File System begins at Sun Microsystems; it later ships with Sun Solaris. [14]
  • 2001: ext3: ext2 is extended to support journaling. [10]
  • 2008: ext4: The fourth extended file system, a journaling file system, is declared stable in Linux. [11]
  • 2009: btrfs: B-tree file system is introduced into the Linux kernel. [13]
  • 2017: APFS: macOS replaces HFS+ with the Apple File System. [15]

2. Traditional File Systems

Traditionally, the administration of file systems and disks can be difficult, slow, and error prone. Adding more storage to an existing file system requires unmounting block devices, which causes temporary service interruptions.

Many file systems use a one-to-one association between the file system and the block device. Volume managers are responsible for providing virtual addresses for the underlying physical storage. The virtual blocks are presented to the file system as a single logical storage device. System administrators are required to predict the maximum future size of each file system at the time of creation.
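
To make this translation concrete, below is a minimal sketch of the kind of mapping a volume manager performs, for the simplest case of concatenating equally sized devices. The names and layout are illustrative only; striping or mirroring would change only the translation function.

    #include <stdint.h>

    /* A physical address: which device, and the block offset on it. */
    typedef struct { int dev; uint64_t off; } phys_addr_t;

    /* Map a logical block number to a physical location for a simple
     * concatenation of devices, each dev_blocks blocks long. */
    static phys_addr_t vm_map(uint64_t logical_block, uint64_t dev_blocks)
    {
        phys_addr_t p;
        p.dev = (int)(logical_block / dev_blocks);
        p.off = logical_block % dev_blocks;
        return p;
    }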

Most file systems allow the on-disk data to be inconsistent in some way for varying periods of time. If an unexpected crash or power cycle occurs while the on-disk state is inconsistent, the file system will require some form of repair during the next boot.

If the file system does not validate the data returned from the device controller, corrupted data can be silently passed on to applications. A file system that can detect and automatically correct corrupted data eliminates a class of errors that would otherwise propagate through the system.

  Traditional file system block diagram [1]

              ----------------------------
              |                          |
              |      System Call         |
              |                          |
              ----------------------------
                  | Vnode interface |
              ----------------------------
              |                          |
              |                          |
              |      File System         |
              |                          |
              |                          |
              ----------------------------
               | logical device, offset |
              ----------------------------
              |                          |
              |      Volume Manager      |
              |                          |
              ----------------------------
              | physical device, offset |
              ----------------------------
              |                          |
              |       Device Driver      |
              |                          |
              ----------------------------

3. Data Corruption

Disk corruption occurs when a data access from disk does not return the expected contents due to some problem in the storage stack [2]. This can happen for many reasons, such as errors in the magnetic media, power spikes, erratic mechanical movements, physical damage, and defects in device firmware, operating system code, or device drivers. Error correction codes (ECC) catch many of these corruptions, but not all of them.

Some ways to handle data corruption include using checksums to verify data integrity; implementing redundancy by choosing data structures and algorithms that can detect corruption and recover from it, such as B-tree based file system structures; or choosing a RAID storage setup that stripes or mirrors the data across physical devices.
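
As a concrete illustration, the following is a minimal sketch of the kind of block checksum a file system might compute. It follows the Fletcher style of running sums (similar in spirit to the fletcher4 algorithm ZFS offers); the names and the 32-bit word assumption are illustrative, not the actual ZFS implementation.

    #include <stdint.h>
    #include <stddef.h>

    /* Fletcher-style checksum over a block: four running sums over
     * 32-bit words. Each successive sum weights earlier words more
     * heavily, which catches bit flips as well as reordered or stuck
     * sectors. Assumes size is a multiple of 4 bytes. */
    typedef struct { uint64_t a, b, c, d; } cksum_t;

    static cksum_t block_checksum(const void *buf, size_t size)
    {
        const uint32_t *ip  = buf;
        const uint32_t *end = ip + size / sizeof(uint32_t);
        cksum_t ck = { 0, 0, 0, 0 };

        for (; ip < end; ip++) {
            ck.a += *ip;
            ck.b += ck.a;
            ck.c += ck.b;
            ck.d += ck.c;
        }
        return ck;
    }

On every read, the checksum is recomputed over the returned block and compared against the stored value; a mismatch signals corruption somewhere in the storage stack.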

4. ZFS File System

The Zettabyte File System (ZFS) is a file system developed at Sun Microsystems. ZFS was originally implemented in the Solaris operating system and was intended for use on everything from desktops to database servers. ZFS attempts to achieve the following goals:

  • strong data integrity
  • simple administration
  • handle immense capacity

It uses checksums to verify data integrity, restructures the interaction between the file system and the volume manager to simplify administration, and uses 128-bit block addresses to be able to address vast amounts of data.

              ZFS Block diagram [1]

              --------------------------------
              |                              |
              |        System Call           |
              |                              |
              --------------------------------
                    | Vnode interface |
              --------------------------------
              |                              |
              |    ZFS POSIX Layer (ZPL)     |
              |                              |
              --------------------------------
                | dataset, object, offset |
              --------------------------------
              |                              |
              |  Data Management Unit (DMU)  |
              |                              |
              --------------------------------
                  | data virtual address |
              --------------------------------
              |                              |
              | Storage Pool Allocator (SPA) |
              |                              |
              --------------------------------
                | physical device, offset |
              --------------------------------
              |                              |
              |       Device Driver          |
              |                              |
              --------------------------------

  1. The device driver exports a block device to the SPA.
  2. The SPA:
    • handles block allocation and I/O
    • exports data virtual addresses
    • allocates and frees blocks on behalf of the DMU
  3. The DMU turns the blocks behind these data virtual addresses into transactional objects for the ZPL.
  4. The ZPL implements a POSIX file system on top of the DMU objects and exports vnode operations to the system call layer.

The SPA allocates blocks from all the devices in a storage pool. It provides a malloc()- and free()-like interface for allocating and freeing disk space. The virtual addresses it hands out for disk blocks are called data virtual addresses (DVAs).
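
A hypothetical rendering of such an interface is sketched below; the type and function names are illustrative, not the actual ZFS API. The key idea is that a block is named by a top-level virtual device and an offset within it, rather than by a raw device address.

    #include <stdint.h>
    #include <stddef.h>

    /* A data virtual address (DVA) names a block by top-level
     * virtual device and offset within it, not by physical LBA. */
    typedef struct {
        uint32_t vdev;    /* which top-level virtual device */
        uint64_t offset;  /* byte offset within that vdev   */
        uint64_t size;    /* allocated size in bytes        */
    } dva_t;

    /* malloc()/free()-style allocation of disk space, plus I/O
     * addressed by DVA. Prototypes only; bodies omitted. */
    dva_t spa_alloc(size_t size);
    void  spa_free(dva_t dva);
    int   spa_read(dva_t dva, void *buf);
    int   spa_write(dva_t dva, const void *buf);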

System administrators no longer have to create logical devices or partition storage; they just tell the SPA which devices to use. The SPA uses 128-bit block addresses, allowing 2^128 (340,282,366,920,938,463,463,374,607,431,768,211,456) distinct addresses.

To protect against data corruption, each block is checksummed before it is written to disk, and the checksum is stored in the block's parent indirect block. Separating the checksum from the data it protects means a corrupted block cannot vouch for itself: integrity is always verified against a checksum held in the (already verified) parent.

              ZFS checksum tree [1]

                    ---------
                    |   |   |   uberblock (has checksum of itself)
                    |___|___|
                    |___|___|
                   /         \
               ---------    ---------
               |   |   |    |   |   |
               |___|___|    |___|___|
               |___|___|    |___|___|
                /     \      /     \
            -----  -----  -----  -----
            |   |  |   |  |   |  |   |
            -----  -----  -----  -----
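
Reusing the illustrative cksum_t and dva_t types from the sketches above, a read through such a tree might look like the following; blkptr_t and the helper are hypothetical names, not the real ZFS structures.

    /* A parent's pointer to a child block carries both the child's
     * address and the checksum of the child's contents. */
    typedef struct {
        dva_t   dva;      /* where the child block lives      */
        cksum_t cksum;    /* checksum of the child's contents */
    } blkptr_t;

    /* Read a child block and verify it against the checksum the
     * parent holds, so a block never vouches for itself. */
    static int read_verified(const blkptr_t *bp, void *buf)
    {
        if (spa_read(bp->dva, buf) != 0)
            return -1;                          /* I/O error */

        cksum_t c = block_checksum(buf, bp->dva.size);
        if (c.a != bp->cksum.a || c.b != bp->cksum.b ||
            c.c != bp->cksum.c || c.d != bp->cksum.d)
            return -2;                          /* corruption detected */
        return 0;
    }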

When data is received from the block device, its checksum is recomputed and compared against the stored checksum to check for corruption. If corruption is detected, self-healing is possible under some conditions, for example when a valid redundant copy of the block exists.
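
For a two-way mirror, self-healing could proceed roughly as sketched below. In real ZFS the mirror vdev handles this internally; this sketch reuses the hypothetical helpers above.

    /* If one mirror copy fails its checksum, serve the request from
     * the other copy and rewrite the damaged side with good data. */
    static int mirror_read_selfheal(const blkptr_t *copy0,
                                    const blkptr_t *copy1, void *buf)
    {
        if (read_verified(copy0, buf) == 0)
            return 0;                    /* first copy is good       */
        if (read_verified(copy1, buf) != 0)
            return -1;                   /* both copies bad: give up */
        spa_write(copy0->dva, buf);      /* heal the damaged copy    */
        return 0;
    }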

Virtual devices (vdevs) each implement a small set of routines for a particular feature, such as mirroring or striping. The SPA allocates blocks from the top-level vdevs using a round-robin strategy.

The DMU consumes blocks from the SPA and exports objects (flat files). Objects live within a dataset, which provides a private namespace for the objects it contains. Objects are identified by 64-bit numbers and can be created, destroyed, read, and written.
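
An interface in that spirit might look as follows; the names are illustrative, not the real DMU API, and every modification is tied to a transaction.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct dataset dataset_t;   /* opaque: namespace of objects */
    typedef struct tx      tx_t;        /* opaque: one transaction      */

    /* Objects are flat byte arrays named by 64-bit numbers within a
     * dataset; all updates happen inside a transaction. */
    uint64_t dmu_object_create(dataset_t *ds, tx_t *tx);
    void     dmu_object_destroy(dataset_t *ds, uint64_t obj, tx_t *tx);
    int      dmu_read(dataset_t *ds, uint64_t obj,
                      uint64_t off, size_t len, void *buf);
    int      dmu_write(dataset_t *ds, uint64_t obj,
                       uint64_t off, size_t len, const void *buf, tx_t *tx);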

The DMU keeps the on-disk data consistent at all times by treating all blocks as copy-on-write. All data in the pool is part of a tree of indirect blocks, with the data blocks as the leaves of the tree.
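
The copy-on-write discipline can be sketched as follows, again reusing the hypothetical types above: a modified block is written to a freshly allocated location, and the parent's block pointer is updated, which dirties the parent in turn, all the way up to the uberblock.

    /* Copy-on-write update of one block: live data is never
     * overwritten in place. Updating bp dirties the parent block
     * that contains it, so the same procedure repeats up the tree.
     * (Real ZFS defers the free until the transaction commits.) */
    static void cow_update(blkptr_t *bp, const void *newdata, size_t size)
    {
        dva_t newdva = spa_alloc(size);
        spa_write(newdva, newdata);              /* write new copy     */

        dva_t olddva = bp->dva;
        bp->dva   = newdva;                      /* repoint the parent */
        bp->cksum = block_checksum(newdata, size);
        spa_free(olddva);                        /* release old copy   */
    }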

5. ZFS Observations

A 2010 analysis [2] of ZFS by Yupu Zhang et al. observed the following:

Data corruption

  1. ZFS detects all corruptions due to the use of checksums.
  2. ZFS gracefully recovers from single metadata block corruptions.
  3. ZFS does not recover from data block corruptions.
  4. In-memory copies of metadata help ZFS to recover from serious multiple block corruptions.
  5. ZFS cannot recover from multiple block corruptions affecting all ditto blocks when no in-memory copy exists.

Memory corruption

  1. ZFS does not use the checksums in the page cache along with the blocks to detect memory corruptions.
  2. The window of vulnerability of blocks in the page cache is unbounded.
  3. Since checksums are created when blocks are written to disk, any corruption to blocks that are dirty (or will be dirtied) is written to disk permanently on a flush.
  4. Dirtying blocks due to updating file access time increases the possibility of making corruptions permanent.
  5. For most metadata blocks in the page cache, checksums are not valid and thus useless in detecting memory corruptions.
  6. When metadata is corrupted, operations fail with wrong results, or give misleading error messages.
  7. Many corruptions lead to a system crash.
  8. The read() system call may return bad data.
  9. There is no recovery for corrupted metadata.

Zhang et al. argue that file systems should be designed with end-to-end data integrity as a goal: a file system should not only provide protection against disk corruptions, but also aim to protect data from memory corruptions.

6. Conclusion

The original goals stated for the ZFS project were to address concerns shared by many file systems of that generation: data integrity, simple administration, and handling immense capacity. To accomplish these goals, new abstractions were created, such as the ZPL, DMU, and SPA. It is my opinion that the addition of these abstractions increased the complexity of the underlying file system while improving data integrity for specific scenarios. The new object data structures, checksums on all reads and writes, and 128-bit block addresses increase the amount of CPU, memory, and disk space this file system requires. This author rejects the claim that ZFS is suitable for general desktop environments, but acknowledges that certain server-side use cases could benefit from the features that ZFS provides.

7. References

  1. Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum - The Zettabyte File System. https://www.cs.hmc.edu/~rhodes/cs134/readings/The%20Zettabyte%20File%20System.pdf
  2. Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau - End-to-end Data Integrity for File Systems: A ZFS Case Study. https://www.usenix.org/legacy/event/fast10/tech/full_papers/fast10proceedings.pdf#page=37
  3. Wikipedia authors - List of default file systems. https://en.wikipedia.org/wiki/List_of_default_file_systems
  4. Wikipedia authors - File Allocation Table. https://en.wikipedia.org/wiki/File_Allocation_Table
  5. Wikipedia authors - NTFS. https://en.wikipedia.org/wiki/NTFS
  6. Wikipedia authors - HFS Plus. https://en.wikipedia.org/wiki/HFS_Plus
  7. Wikipedia authors - Hierarchical File System. https://en.wikipedia.org/wiki/Hierarchical_File_System
  8. Wikipedia authors - Unix File System. https://en.wikipedia.org/wiki/Unix_File_System
  9. Wikipedia authors - ext2. https://en.wikipedia.org/wiki/Ext2
  10. Wikipedia authors - ext3. https://en.wikipedia.org/wiki/Ext3
  11. Wikipedia authors - ext4. https://en.wikipedia.org/wiki/Ext4
  12. Wikipedia authors - XFS. https://en.wikipedia.org/wiki/XFS
  13. Wikipedia authors - Btrfs. https://en.wikipedia.org/wiki/Btrfs
  14. Wikipedia authors - ZFS. https://en.wikipedia.org/wiki/ZFS
  15. Wikipedia authors - Apple File System. https://en.wikipedia.org/wiki/Apple_File_System