Commit 02a0646

mo khan <mo@mokhan.ca>
2021-07-25 23:04:51
finish the body of the zfs paper
1 parent fee20e8
Changed files (1)
doc/my-paper.md
@@ -0,0 +1,273 @@
+# An Introduction to the Zettabyte File System (ZFS)
+
+## 0. Abstract
+
+This paper gives an introduction to the Zettabyte File System (ZFS), which
+represented the state of the art in file systems in the early 2000s. Starting
+with an overview of relevant events before and during this period, the paper
+presents an introduction to file systems, the design of ZFS, and an analysis
+of this file system.
+
+## 1. Introduction
+
+In order to properly understand the relative importance of file systems, one
+has to place them in their historical context. Relevant events begin taking
+place well before ZFS was built [3]:
+
+* 1977: FAT: Marc McDonald designs and implements the original 8-bit File Allocation Table. [4]
+* 1980: FAT12: Tim Paterson extends FAT to 12 bits. [4]
+* 1984: FAT16: Cluster addresses are increased to 16 bits. [4]
+* 1993: NTFS: Microsoft develops a proprietary journaling file system. [5]
+* 1993: ext2: Rémy Card designs a replacement for the extended file system (ext) for the Linux kernel. [9]
+* 1994: XFS: Silicon Graphics, Inc releases a high-performance 64-bit journaling file system named XFS. [12]
+* 1996: FAT32: Microsoft designs FAT32, which uses 32-bit cluster addresses.
+* 1998: HFS+: Apple Inc. develops the HFS Plus journaling file system. [6]
+* 19xx: HFS: Apple Inc. develops the proprietary Hierarchical File System. [7]
+* 2001: ZFS: The Zettabyte File System is released as part of Sun Microsystems' Solaris operating system. [14]
+* 2001: ext3: ext2 is extended to support journaling. [10]
+* 2008: ext4: fourth extended file system is a journaling file system for Linux, developed as the successor to ext3. [11]
+* 2009: btrfs: B-tree file system is introduced into the Linux kernel. [13]
+* 2017: APFS: macOS replaces HFS+ with the Apple File System (APFS).
+
+## 2. Traditional File Systems
+
+Traditionally, administering file systems and disks can be difficult, slow,
+and error prone. Adding more storage to an existing file system requires
+unmounting block devices, which causes temporary service interruptions.
+
+Many file systems use a one-to-one association between the file system and its
+block device. Volume managers are responsible for mapping virtual addresses to
+the underlying physical storage. The virtual blocks are presented to the file
+system as a logical storage device. System administrators are required to
+predict the maximum future size of each file system at the time of creation.
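As a rough illustration, a volume manager's job can be pictured as a function from a logical block address to a (physical device, offset) pair. The device names and sizes below are made up for the sketch:

```python
# Toy volume manager: three physical disks concatenated into one
# logical address space (device names and sizes are assumptions).
DEV_BLOCKS = 1000  # blocks per physical device (assumed)
DEVICES = ["disk0", "disk1", "disk2"]

def logical_to_physical(lba: int) -> tuple:
    """Map a logical block address to a (physical device, offset) pair."""
    return DEVICES[lba // DEV_BLOCKS], lba % DEV_BLOCKS

print(logical_to_physical(1500))  # → ('disk1', 500)
```

The file system above this layer sees only the single logical device; the fixed mapping is also why the maximum size must be decided up front.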
+
+Most file systems allow the on-disk data to be inconsistent in some way for
+varying periods of time. If an unexpected crash or power cycle occurs while the
+on-disk state is inconsistent, the file system will require some form of repair
+during the next boot.
+
+If the file system does not validate the data returned from the device
+controller, corrupted data can be handed back to applications. A file system
+that can detect and automatically correct corrupted data reduces the chance of
+such errors propagating through the system.
+
+```plaintext
+  Traditional file system block diagram [1]
+
+  ----------------------------
+  |                          |
+  |      System Call         |
+  |                          |
+  ----------------------------
+       |Vnode interface|
+  ----------------------------
+  |                          |
+  |                          |
+  |      File System         |
+  |                          |
+  |                          |
+  ----------------------------
+   | logical device, offset|
+  ----------------------------
+  |                          |
+  |      Volume Manager      |
+  |                          |
+  ----------------------------
+   | physical device, offset|
+  ----------------------------
+  |                          |
+  |       Device Driver      |
+  |                          |
+  ----------------------------
+```
+
+## 3. Data Corruption
+
+Disk corruption occurs when a data access from disk does not return the
+expected contents due to some problem in the storage stack [2]. This can occur
+for many reasons: errors in the magnetic media, power spikes, erratic
+mechanical movements, physical damage, and defects in device firmware,
+operating system code, or device drivers. Error correction codes (ECC) can
+catch many of these corruptions, but not all of them.
+
+Some ways to handle data corruption include using checksums to verify data
+integrity, building redundancy into the data structures and algorithms so that
+corruption can be detected and recovered from (as in B-tree based file system
+structures), or choosing a RAID storage setup that stripes or mirrors data
+across physical devices.
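A minimal sketch of the checksum-plus-redundancy idea, assuming a SHA-256 checksum and a single mirrored copy. The helper names are hypothetical, not a real file system API:

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # Any strong checksum works here; SHA-256 stands in for illustration.
    return hashlib.sha256(block).digest()

def read_with_recovery(primary: bytes, mirror: bytes, expected: bytes) -> bytes:
    """Return whichever copy matches the expected checksum."""
    if checksum(primary) == expected:
        return primary
    if checksum(mirror) == expected:
        return mirror  # the mirror still holds good data
    raise IOError("both copies failed checksum verification")

good = b"important data"
corrupted = b"importent data"  # bit rot in the primary copy
expected = checksum(good)
assert read_with_recovery(corrupted, good, expected) == good
```

The key point is that detection (the checksum) and recovery (the redundant copy) are separate mechanisms; a file system needs both to heal silent corruption.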
+
+## 4. ZFS File System
+
+The Zettabyte File System (ZFS) is a file system developed at Sun Microsystems.
+ZFS was originally implemented in the Solaris operating system and was intended
+for use on everything from desktops to database servers. ZFS attempts to
+achieve the following goals:
+
+* strong data integrity
+* simple administration
+* handle immense capacity
+
+It uses checksums to verify data integrity, changes the interaction between
+the file system and the volume manager to simplify administration, and uses
+128-bit block addresses to be able to address vast amounts of data.
+
+```plaintext
+  ZFS Block diagram [1]
+
+  --------------------------------
+  |                              |
+  |        System Call           |
+  |                              |
+  --------------------------------
+         | Vnode interface |
+  --------------------------------
+  |                              |
+  |    ZFS POSIX Layer (ZPL)     |
+  |                              |
+  --------------------------------
+    | dataset, object, offset |
+  --------------------------------
+  |                              |
+  |  Data Management Unit (DMU)  |
+  |                              |
+  --------------------------------
+      | data virtual address |
+  --------------------------------
+  |                              |
+  | Storage Pool Allocator (SPA) |
+  |                              |
+  --------------------------------
+     | physical device, offset |
+  --------------------------------
+  |                              |
+  |       Device Driver          |
+  |                              |
+  --------------------------------
+```
+
+1. The device driver exports a block device to the SPA.
+1. The SPA handles block allocation and I/O: it exports data virtual
+   addresses and allocates and frees blocks on behalf of the DMU.
+1. The DMU turns the virtual address space into transactional objects for the
+   ZPL.
+1. The ZPL implements a POSIX file system on top of the DMU objects and
+   exports vnode operations to the system call layer.
+
+The SPA allocates blocks from all the devices in a storage pool. It provides
+a `malloc()`- and `free()`-like interface for allocating and freeing disk
+space. These virtual addresses for disk blocks are called data virtual
+addresses (DVAs).
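The shape of that interface can be sketched with a toy allocator that hands out integer DVAs from a free set. Real DVAs also encode the vdev and offset; the class and method names here are illustrative only:

```python
class SPA:
    """Toy allocator with a malloc()/free()-style interface over a pool.

    DVAs here are plain integers; the name SPA is borrowed for
    illustration, not a real API.
    """
    def __init__(self, nblocks: int):
        self.free_set = set(range(nblocks))

    def alloc(self) -> int:
        """Hand out a data virtual address (DVA) for a new block."""
        return self.free_set.pop()

    def free(self, dva: int) -> None:
        """Return a DVA to the pool."""
        self.free_set.add(dva)

spa = SPA(8)
dva = spa.alloc()
assert dva not in spa.free_set
spa.free(dva)
assert dva in spa.free_set
```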
+
+System administrators no longer have to create logical devices or partition
+storage; they just tell the SPA which devices to use. The SPA uses 128-bit
+block addresses to allow addressing massive amounts of data
+(340,282,366,920,938,463,463,374,607,431,768,211,456 addresses).
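The figure quoted above is simply 2^128, which is easy to verify:

```python
# Number of distinct addresses reachable with 128-bit block pointers.
n = 2 ** 128
assert n == 340_282_366_920_938_463_463_374_607_431_768_211_456
print(f"{n:,}")  # → 340,282,366,920,938,463,463,374,607,431,768,211,456
```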
+
+To protect against data corruption, each block is checksummed before it is
+written to disk. A block's checksum is stored in its parent indirect block.
+Separating the checksum from the data ensures that the data can be checked for
+integrity using a checksum located in the parent.
+
+```plaintext
+          -------
+         |   |   | uberblock (has checksum of itself)
+         |___|___|
+         |___|___|
+          /     \
+    -------      -------
+   |   |   |    |   |   |
+   |___|___|    |___|___|
+   |___|___|    |___|___|
+    /     \      /     \
+ -----  -----  -----  -----
+ |   |  |   |  |   |  |   |
+ -----  -----  -----  -----
+
+[1]
+```
+
+When data is read back from the block device, its checksum is recomputed and
+compared to detect corruption. If corruption is detected, self-healing is
+possible under some conditions.
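The parent-stored checksum scheme can be modeled as a small self-validating tree. This is a toy model (hypothetical class names, SHA-256 standing in for ZFS's configurable checksums), not the on-disk format:

```python
import hashlib

def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class IndirectBlock:
    """Parent block holding the checksum of each child it points to."""
    def __init__(self, children):
        self.children = list(children)
        self.sums = [digest(c) for c in self.children]

    def read(self, i: int) -> bytes:
        # Recompute the child's checksum and compare with the parent's copy.
        if digest(self.children[i]) != self.sums[i]:
            raise IOError(f"checksum mismatch reading child {i}")
        return self.children[i]

node = IndirectBlock([b"block-a", b"block-b"])
assert node.read(0) == b"block-a"

node.children[1] = b"block-X"  # simulate silent on-disk corruption
try:
    node.read(1)
    detected = False
except IOError:
    detected = True
assert detected
```

Because each checksum lives one level up, a corrupted block can never vouch for itself; validation always follows the tree down from the uberblock.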
+
+Virtual devices (vdevs) each implement a small set of routines for a
+particular feature, such as mirroring or striping. The SPA allocates blocks
+from the top-level vdevs using a round-robin strategy.
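The round-robin choice among top-level vdevs amounts to cycling through them in order; a sketch with made-up vdev names:

```python
import itertools

class Pool:
    """Allocate blocks across top-level vdevs in round-robin order."""
    def __init__(self, vdevs):
        self._next = itertools.cycle(vdevs)

    def allocate_block(self) -> str:
        return next(self._next)

pool = Pool(["mirror-0", "mirror-1", "raidz-0"])
picks = [pool.allocate_block() for _ in range(6)]
assert picks == ["mirror-0", "mirror-1", "raidz-0"] * 2
```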
+
+The DMU consumes blocks from the SPA and exports objects (flat files).
+Objects live within a dataset, which provides a private namespace for the
+objects it contains. Objects are identified by 64-bit numbers and can be
+created, destroyed, read, and written.
+
+The DMU keeps the on-disk data consistent at all times by treating all blocks as
+copy-on-write. All data in the pool is part of a tree of indirect blocks, with
+the data blocks as the leaves of the tree.
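Copy-on-write in miniature: an update never overwrites a live block; it writes a new block and publishes a new root that references it. This is a toy model, not the DMU:

```python
class Block:
    """A leaf data block; never modified in place after being written."""
    def __init__(self, data):
        self.data = data

def cow_write(root: tuple, index: int, data) -> tuple:
    """Return a new root tuple; the old tree remains intact and consistent."""
    children = list(root)
    children[index] = Block(data)  # write goes to a freshly allocated block
    return tuple(children)         # publish a new root atomically

old_root = (Block("a"), Block("b"))
new_root = cow_write(old_root, 1, "B")
assert old_root[1].data == "b"     # old tree untouched: crash-consistent
assert new_root[1].data == "B"
assert new_root[0] is old_root[0]  # unchanged blocks are shared
```

A crash before the new root is published leaves the old, fully consistent tree in place, which is why no fsck-style repair is needed on boot.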
+
+## ZFS Observations
+
+A 2010 analysis [2] of ZFS by Yupu Zhang et al. observed the following:
+
+Data corruption
+
+1. ZFS detects all corruptions due to the use of checksums.
+1. ZFS gracefully recovers from single metadata block corruptions.
+1. ZFS does not recover from data block corruptions.
+1. In-memory copies of metadata help ZFS to recover from serious multiple block
+   corruptions.
+1. ZFS cannot recover from multiple block corruptions affecting all ditto blocks
+   when no in-memory copy exists.
+
+Memory corruption
+
+1. ZFS does not use the checksums in the page cache along with the blocks to
+   detect memory corruptions.
+1. The window of vulnerability of blocks in the page cache is unbounded.
+1. Since checksums are created when blocks are written to disk, any corruption
+   to blocks that are dirty (or will be dirtied) is written to disk permanently
+   on a flush.
+1. Dirtying blocks due to updating file access time increases the possibility of
+   making corruptions permanent.
+1. For most metadata blocks in the page cache, checksums are not valid and thus
+   useless in detecting memory corruptions.
+1. When metadata is corrupted, operations fail with wrong results, or give
+   misleading error messages.
+1. Many corruptions lead to a system crash.
+1. The read() system call may return bad data.
+1. There is no recovery for corrupted metadata.
+
+> We argue that file systems should be designed with end-to-end data integrity
+> as a goal. File systems should not only provide protection against disk
+> corruptions, but also aim to protect data from memory corruptions.
+
+## 5. Conclusion
+
+The original goals stated for the ZFS project were to address concerns common
+to many file systems of that generation: data integrity, simple
+administration, and handling immense capacity. To accomplish these goals, new
+abstractions were created, such as the ZPL, DMU, and SPA. It is my opinion
+that the addition of these abstractions increased the complexity of the
+underlying file system while improving data integrity for specific scenarios.
+The addition of new object data structures, checksums on all reads and
+writes, and the use of 128-bit block addresses increases the amount of CPU,
+memory, and disk space this file system requires. This author rejects the
+claim that this file system is suitable for general desktop environments but
+acknowledges that certain server-side use cases could benefit from the
+features that ZFS provides.
+
+## 6. References
+
+1. Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum - The Zettabyte File System. https://www.cs.hmc.edu/~rhodes/cs134/readings/The%20Zettabyte%20File%20System.pdf
+1. Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau - End-to-end Data Integrity for File Systems: A ZFS Case Study. https://www.usenix.org/legacy/event/fast10/tech/full_papers/fast10proceedings.pdf#page=37
+1. Wikipedia authors - List of default file systems. https://en.wikipedia.org/wiki/List_of_default_file_systems
+1. Wikipedia authors - File Allocation Table. https://en.wikipedia.org/wiki/File_Allocation_Table
+1. Wikipedia authors - NTFS. https://en.wikipedia.org/wiki/NTFS
+1. Wikipedia authors - HFS Plus. https://en.wikipedia.org/wiki/HFS_Plus
+1. Wikipedia authors - Hierarchical File System. https://en.wikipedia.org/wiki/Hierarchical_File_System
+1. Wikipedia authors - Unix File System. https://en.wikipedia.org/wiki/Unix_File_System
+1. Wikipedia authors - ext2. https://en.wikipedia.org/wiki/Ext2
+1. Wikipedia authors - ext3. https://en.wikipedia.org/wiki/Ext3
+1. Wikipedia authors - ext4. https://en.wikipedia.org/wiki/Ext4
+1. Wikipedia authors - XFS. https://en.wikipedia.org/wiki/XFS
+1. Wikipedia authors - Btrfs. https://en.wikipedia.org/wiki/Btrfs
+1. Wikipedia authors - ZFS. https://en.wikipedia.org/wiki/ZFS