Expanding Your ZFS Pool

Now that’s a big pool! (CC BY-NC-SA 2.0 licensed image by Trey Ratcliff, Flickr user stuckincustoms)

In addition to data integrity, device redundancy, and performance features, ZFS Storage Pools can also be expanded in usable storage size through deduplication and compression of the data stored. In other words, by shrinking raw data and removing duplicated parts of it, ZFS Storage Pools can store more data on disk. While there are some memory trade-offs to using deduplication, it can provide significant storage savings for some types of stored data. Compression also offers some significant performance benefits.

In this article, we will explore how to configure deduplication and compression for storage pools.

Learning Objectives

  1. Learn about deduplication and configure it on a storage pool.
  2. Learn about compression and configure it on a storage pool.

This article is one of a series of articles on ZFS. You can start at the beginning by creating a ZFS playground on which you can play.

ZFS Deduplication

Filesystems can have a lot of the same data written multiple times to disk. Depending on the amount of data involved, duplicated data can be costly to the storage system. If the duplicate data can be eliminated, then the storage system can save space by writing only distinct data to disk and provide better performance by reducing the number of expensive writes to disk. Deduplication is the process of managing duplicate data on a storage system.

ZFS provides data deduplication by writing one copy of data to disk and maintaining a reference for each time that particular data is needed. While there are several levels of granularity in data deduplication (file, block, and byte) and various trade-offs for each, ZFS implements block-level deduplication. It also manages deduplication in real time (synchronously) instead of using a secondary, asynchronous process that runs during low system load. Deduplication uses a table, referred to as the deduplication table, that maintains the reference counts for each block of duplicated data.

The deduplication table is part of the ZFS Adaptive Replacement Cache (ARC). As more duplicated data is stored, the deduplication table will grow. Keeping the deduplication table contained in memory is ideal for performance. However, there is an internal limit of 25% for the amount of metadata, which includes the deduplication table, that can be stored in the ARC. Portions of the table will be pushed to the L2ARC when it grows beyond the allocated memory. Having fast L2ARC devices will help a little if you have a lot of duplicate data and not a lot of RAM.
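
Before enabling deduplication on a pool that already holds data, it is worth estimating how much you would actually save and how large the deduplication table would grow. As a rough sketch (assuming an existing pool named mypool that already contains data), the zdb utility can simulate deduplication and print a histogram of the table it would build along with an estimated deduplication ratio. Be aware that the simulation itself can take a while and consume a fair amount of memory on large pools.

# zdb -S mypool

If the estimated ratio comes back close to 1.00x, the memory cost of the deduplication table is unlikely to be worth it.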

To configure deduplication, start with configuring a ZFS Storage Pool. In this example, we are not concerned about datasets, data integrity, or replication. We will keep it simple.

# zpool create mypool da0 da1 da2 da3 da4

Then enable ZFS deduplication.

# zfs set dedup=on mypool

The deduplication verification process can be enabled by using the keyword verify instead of on. Verification is provided to ensure that each block is actually unique instead of relying solely on the checksum algorithm to make that determination. You can also override the checksum algorithm to be used by defining it when enabling deduplication, though currently only sha256 is supported. It is also possible to specify both a specific algorithm and the verify option when enabling deduplication in one step, as in the following command.

# zfs set dedup=sha256,verify mypool

To determine if a specific pool or dataset has deduplication enabled, you can examine the dedup property.

# zfs get dedup
NAME    PROPERTY  VALUE          SOURCE
mypool  dedup     sha256,verify  local
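
Once deduplication is enabled and data is being written, the pool-wide savings can be checked through the read-only dedupratio pool property. The 1.00x value below is only illustrative; a pool that has not yet stored any duplicate blocks will report no savings.

# zpool get dedupratio mypool
NAME    PROPERTY    VALUE  SOURCE
mypool  dedupratio  1.00x  -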

ZFS Compression

Another way to save space and improve performance in ZFS is through compression. Instead of writing only unique data to the file system as in deduplication, all data written to the file system is compressed. The result is a smaller amount of data written to the file system, which improves write performance. There is also a benefit in read performance since there is less data to read from the disk. Depending on the compression ratio achieved, significant space savings are possible, lowering storage costs as well. By taking advantage of fast CPUs and plenty of memory, ZFS can provide great performance for data compression.

Compression in ZFS is also flexible. It can be applied to storage pools as well as datasets. It can also be inherited from the parent dataset or overridden as needed. A different compression algorithm (or gzip compression level) can be applied as well. Each time the configuration is changed, it is applied only to newly written data. The old data continues to be accessible through whichever algorithm was used when it was written the first time. This allows administrators to experiment with different configurations without negatively impacting users or their data.
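
As a quick sketch of that flexibility (the dataset mypool/logs is only an example and is not created elsewhere in this series), a compression setting on the pool's top-level dataset is inherited by child datasets until it is overridden:

# zfs set compression=lz4 mypool
# zfs create mypool/logs
# zfs get -r compression mypool
# zfs set compression=gzip mypool/logs

The zfs get -r command lists the property for the whole hierarchy, and its SOURCE column shows whether each value is set locally or inherited from a parent.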

ZFS supports several lossless data compression algorithms. The LZJB algorithm was developed by Jeff Bonwick for ZFS and is based on the LZRW1 algorithm. The LZ4 algorithm replaces LZJB and provides better performance. The gzip algorithm is also supported and provides the standard levels of compression as well. Another algorithm is ZLE (Zero Length Encoding), which compresses data with repeating zeroes in it.

Determining which compression algorithm to use depends on your performance, cost, and data storage goals and the data that will be stored. There are trade-offs in the decision. If the data to store has a high compression ratio (e.g. plain text), you need the most space possible, and can sacrifice some performance, then gzip-9 is the best option. If the data has a low compression ratio (e.g. it is already compressed), then compression may not make sense at all. With the various levels of gzip compression you can “tune” the amount of space used and the performance based on the data stored. ZLE might be the best choice for data which contains a significant amount of repeating zeroes. The LZ4 algorithm, which replaces LZJB, was designed to make the best trade-off between compression ratio and compression performance. If you are not sure about the data that will be stored, consider using LZ4. Of course, experimentation with the expected data is the best option if time allows, as sketched below.
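
One rough way to run that experiment (the dataset names and sample path below are placeholders) is to create a few throwaway datasets with different algorithms, copy a representative sample of data into each, and compare the resulting compression ratios:

# zfs create -o compression=lz4 mypool/test-lz4
# zfs create -o compression=gzip-9 mypool/test-gzip9
# cp -R /path/to/sample/data /mypool/test-lz4/
# cp -R /path/to/sample/data /mypool/test-gzip9/
# zfs get compressratio mypool/test-lz4 mypool/test-gzip9

The throwaway datasets can be removed with zfs destroy once you have settled on an algorithm.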

Compression with the default LZJB algorithm can be enabled on a storage pool or dataset with the following command.

# zfs set compression=on mypool

You can also specify the algorithm to use when enabling compression, and you can switch algorithms at any time.

# zfs set compression=lz4 mypool

To use a specific gzip compression level, specify the level when enabling compression. The default gzip level, specified simply as gzip, is equivalent to gzip-6. The highest level, which offers the highest compression ratio but the lowest performance, is gzip-9.

# zfs set compression=gzip-9 mypool

To see the overall compression ratio for the storage pool, the following command can be used:

# zfs get compressratio mypool
NAME    PROPERTY       VALUE  SOURCE
mypool  compressratio  1.28x  -
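
On versions of ZFS that support it, the logicalused property reports how much space the data would occupy without compression, which can be compared against used for another view of the savings.

# zfs get used,logicalused mypool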

Next Time

In the next article, we will look at datasets and property inheritance in ZFS.
