If I had to use one and only one file system, there is only one choice: ZFS. It’s robust and stable, and it has some amazing features. And with that, I’d most likely use FreeBSD, because it’s free and it has the most mature implementation of ZFS outside of Oracle/Sun Solaris. ZFS on Linux isn’t as stable, and I hope that Btrfs will be ready for production.
- Becoming a ZFS Ninja Part 1 (YouTube) is an excellent and comprehensive talk by Oracle, but the video is very blurry and the terminal is hard to read
Benefits and Issues
- No RAID5 write hole (avoided by copy-on-write)
- Inflexibility with pool upgrades (and not easy to expand a pool)
Combines a volume manager (RAID) and a file system.
There are only 2 commands:
zpool : manage pools
zfs : manage datasets (filesystems, volumes, snapshots)
Partitioning disks for ZFS is not recommended.
Pools are created from one or more ‘vdevs’ (virtual devices). 1/64th of pool capacity is reserved (to protect COW).
ex: RAIDZ2 (4 x 100GB) = 400GB - 2 parity drives = 200GB - 1/64th ≈ 197GB
- disk: physical disk
- file: /vdevs/vdisk001/ # see below
- spare: spare drive
- log: write log device (ZIL SLOG, typically SSD)
- cache: read cache device (L2ARC, typically SSD)
- files can be used in place of real disks; they must be at least 64MB in size (possibly 128MB depending on the build; not sure about FreeBSD)
- best practice is to store them in a common location
- not intended for production use!
- great for testing and learning ZFS (see the sketch below)
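For example, a throwaway test pool can be built from files. A minimal sketch; the paths and sizes here are my own illustration:
$ sudo mkdir /vdevs
$ sudo truncate -s 128m /vdevs/vdisk001 /vdevs/vdisk002 /vdevs/vdisk003
$ sudo zpool create testpool raidz /vdevs/vdisk001 /vdevs/vdisk002 /vdevs/vdisk003
$ zpool status testpool
$ sudo zpool destroy testpool # discard when done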
- Dynamic Stripe: intelligent raid0
- Mirror: n-way RAID 1
- RAIDZ1: improved form of RAID5
- RAIDZ2: improved form of RAID6 (dual parity raid5)
RAID 0 - dynamic stripe
zpool create mypool c0t0d0 c0t1d0 c0t2d0
RAID 1 - simple mirror
zpool create tank1 mirror c0t0d0 c0t1d0
RAID 10 - striped mirrors
zpool create mypool mirror c0t1d0 c1t1d0 mirror c0t2d0 c1t2d0
And this example creates a new pool out of two vdevs that are RAID-Z groups with 2 data disks and one parity disk each:
zpool create tank2 raidz c0t0d0 c0t1d0 c0t2d0 raidz c0t3d0 c0t4d0 c0t5d0
zpool add (grow zpool)
zpool add <pool> <vdev>: when you decide to grow your pool, you just add one or more vdevs to it
zpool add tank1 mirror c0t2d0 c0t3d0
zpool add tank2 raidz c0t6d0 c0t7d0 c0t8d0
add 2 spare disks:
zpool add mypool spare c1t0d0 c2t0d0
add a log device (ZIL):
zpool add mypool log c5t0d0
- zpool cannot shrink
also see https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/ for a complete guide on Debian
On my server, I’ve added more vdevs
$ sudo zpool create data mirror /dev/ada1.eli /dev/ada2.eli
$ zpool status
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada1.eli  ONLINE       0     0     0
            ada2.eli  ONLINE       0     0     0

errors: No known data errors
And add (grow) more hard drives
$ sudo zpool add data mirror /dev/ada3.eli /dev/ada4.eli
$ zpool status
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada1.eli  ONLINE       0     0     0
            ada2.eli  ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            ada3.eli  ONLINE       0     0     0
            ada4.eli  ONLINE       0     0     0

errors: No known data errors

$ df -h
Filesystem      Size    Used   Avail Capacity  Mounted on
/dev/ada0s1a     19G    9.1G    9.0G    50%    /
devfs           1.0k    1.0k      0B   100%    /dev
/dev/ada0s2d    263G    235G    7.0G    97%    /usr/home
linprocfs       4.0k    4.0k      0B   100%    /compat/linux/proc
/dev/da0.eli     28M    144k     26M     1%    /mnt/usbkey
data            913G    349G    564G    38%    /data
zpool scrub
- should be run periodically
- verifies on-disk contents and fixes problems if possible
- schedule via cron to run weekly, during off hours (example below)
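A manual run looks like this, with a hypothetical pool name; zpool status reports scrub progress and results:
$ sudo zpool scrub mypool
$ zpool status mypool # shows 'scrub in progress' and, later, 'scrub repaired ...'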
zpool export / import
- pools can be moved to another system and imported
- hard disks can be moved to another system
- export and import also happen automatically, usually during system boot/halt
I had multiple pools, and one wouldn’t load without specifying its id:
zpool import tank # --> wouldn't work
zpool import 12152999....234 # --> works; to see the id#, run zpool import with no arguments
Also, to mount it, sometimes it needs to be mounted manually:
sudo zfs mount tank
Also, sometimes I had to delete the folder /tank, as its existence prevented zfs from mounting.
Manually replace disk
zpool replace <pool> <old_vdev> <new_vdev>
Take devices online/offline; useful for testing, or just before physically replacing a hard disk
zpool online <pool> <vdev>
zpool offline <pool> <vdev>
Add or remove devices from a mirror
zpool attach <pool> <existing_vdev> <new_vdev>
zpool detach <pool> <vdev>
Destroy a pool
zpool destroy tank
Settings that can be set for each pool:
zpool set key=value <pool>
- failmode: controls what happens when the pool hits a catastrophic error
- continue: keep going even on errors (other values: wait, panic)
- listsnapshots: if on, zfs list includes snapshots, which can take a long time since it must walk all of them; turn it off so snapshots aren’t shown by default
- autoreplace: should always be on! A spare disk then takes over automatically on failure (see the example below)
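For example, assuming a pool named mypool:
$ sudo zpool set autoreplace=on mypool
$ sudo zpool set listsnapshots=off mypool
$ zpool get autoreplace,listsnapshots,failmode mypool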
View the entire history of a pool
zpool history mypool
zdb -C : see the contents of /etc/zfs/zpool.cache
Displays iostat every 10 seconds
zpool iostat 10
RESULT with GELI while WRITING files
$ zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         352G   576G      0     15  22.7K   998K
data         352G   576G      0    366      0  45.2M
data         352G   576G      0    347      0  42.9M
data         352G   576G      0    407      0  50.1M
data         352G   576G      0    443      0  45.1M
data         352G   576G      0    263      0  31.3M
data         352G   576G      0    312      0  37.8M
data         352G   576G      0    317      0  39.0M
data         352G   576G      0    318      0  38.8M
data         352G   576G      0    348      0  41.3M
data         352G   576G      0    303      0  38.0M
Not bad for ZFS on SATA 300 on top of GELI, on an old AMD X2 BE-2300 1.9GHz with 4GB RAM.
READING should be FASTER: more mirror vdevs means more read IOPS.
For better write performance, consider putting the ZIL (SLOG) on an SSD.
replace disk in zpool
$ zpool status
  pool: data
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 84K in 0h0m with 0 errors on Tue May 8 17:37:26 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        data                      DEGRADED     0     0     0
          mirror-0                ONLINE       0     0     0
            ada1.eli              ONLINE       0     0     0
            ada2.eli              ONLINE       0     0     0
          mirror-1                DEGRADED     0     0     0
            17537411141092671575  REMOVED      0     0     0  was /dev/ada3.eli
            ada4.eli              ONLINE       0     0     0

errors: No known data errors
Replace hard disk
$ sudo zpool replace data ada3.eli ada3.eli
check the status of resilvering
$ zpool status -v
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun May 20 20:46:27 2012
        140G scanned out of 558G at 2.81G/s, 0h2m to go
        3.56M resilvered, 25.13% done
config:

        NAME                        STATE     READ WRITE CKSUM
        data                        DEGRADED     0     0     0
          mirror-0                  ONLINE       0     0     0
            ada1.eli                ONLINE       0     0     0
            ada2.eli                ONLINE       0     0     0
          mirror-1                  DEGRADED     0     0     0
            replacing-0             REMOVED      0     0     0
              17537411141092671575  REMOVED      0     0     0  was /dev/ada3.eli/old
              ada3.eli              ONLINE       0     0     0  (resilvering)
            ada4.eli                ONLINE       0     0     0

errors: No known data errors
After a coffee break
$ zpool status
  pool: data
 state: ONLINE
  scan: resilvered 151G in 1h11m with 0 errors on Sun May 20 21:57:34 2012
config:

        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada1.eli  ONLINE       0     0     0
            ada2.eli  ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            ada3.eli  ONLINE       0     0     0
            ada4.eli  ONLINE       0     0     0

errors: No known data errors
It took 1 hour 11 minutes to resilver about 151GB on encrypted hard disks.
DEGRADED STATE (removed 2 drives)
$ zpool status
  pool: data
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 4h12m with 0 errors on Tue Jun 19 20:29:46 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        data                      DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            ada1.eli              ONLINE       0     0     0
            14079505549490135106  UNAVAIL      0     0     0  was /dev/ada2.eli
          mirror-1                DEGRADED     0     0     0
            ada3.eli              ONLINE       0     0     0
            10106669688987601550  UNAVAIL      0     0     0  was /dev/ada4.eli
$ zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
data   560G   354G   560G  /data
Datasets
A dataset is either a filesystem or a volume (zvol); properties are inherited by sub-filesystems.
zfs create <pool>/<dataset>
zfs create mypool/dataset
zfs get all <pool | ds>
zfs set key=value <pool|ds>
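For example, with hypothetical property values:
$ sudo zfs set quota=10G mypool/dataset # cap the dataset at 10GB
$ sudo zfs set atime=off mypool/dataset # skip access-time updates
$ zfs get quota,atime mypool/dataset # verify; child datasets inherit unless they override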
ZVols / volume datasets
- block storage (/dev/zvol/rdsk/<pool>/<ds> on Solaris; /dev/zvol/<pool>/<ds> on FreeBSD)
- used for iSCSI, or for a non-ZFS local filesystem, e.g. ext2
- can be created “sparse” (thin-provisioned); see the sketch below
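A sketch on FreeBSD; the names and sizes are illustrative:
$ sudo zfs create -V 4G mypool/vol0 # regular zvol
$ sudo zfs create -s -V 100G mypool/sparse0 # sparse (thin-provisioned) zvol
$ sudo newfs /dev/zvol/mypool/vol0 # put a non-ZFS filesystem (UFS here) on it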
On my server
I use the following commands to bring up ZFS pools on my servers.
zpool create -f tank raidz ada1p4 ada0 ada2 # had to use -f (force) because the partitions were slightly different in size
- RAIDZ setup with 3 encrypted (LUKS) HDs
- ehd0 = 500GB HD partition / boot
- ehd1 = 320GB SATA HD
- ehd2 = 320GB IDE HD
$ sudo zpool create -o ashift=12 tank raidz /dev/mapper/ehd0 /dev/mapper/ehd1 /dev/mapper/ehd2
$ sudo zfs set atime=off tank # do it when unmounted?
Compression is not enabled, due to performance.
Deduplication is not used, due to its memory requirements.
- FreeBSD: GELI
- Linux: LUKS
- ZFS on Linux install on Debian
su -
wget http://archive.zfsonlinux.org/debian/pool/main/z/zfsonlinux/zfsonlinux_4_all.deb
dpkg -i zfsonlinux_4_all.deb
wget http://zfsonlinux.org/4D5843EA.asc -O - | apt-key add -
apt-get update
apt-get install debian-zfs
Recommended to always set ashift=12 (4k sectors) instead of 512b, for future hard disks and better performance.
Disable atime
zfs set atime=off tank
Disable dedup (FreeBSD recommends disabling it)
zfs set dedup=off tank
Scrub often via cron
# crontab -e
...
30 19 * * 5 zpool scrub <pool>
Re-enable ZFS after Debian upgrade
To re-enable ZFS after upgrading the Debian release, kernel, etc.:
sudo apt-get install --reinstall debian-zfs
sudo reboot
tip for Linux ZFS: store extended attributes as system attributes (faster)
zfs set xattr=sa <pool>
Snapshots and clones
- snapshots can be made hourly using cron
- a clone turns a snapshot into a read/write copy; zfs promote makes the clone independent of its origin snapshot (see the sketch below)
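A minimal sketch of the workflow, assuming a hypothetical dataset tank/home:
$ sudo zfs snapshot tank/home@2012-06-01 # point-in-time, read-only
$ zfs list -t snapshot
$ sudo zfs clone tank/home@2012-06-01 tank/home-clone # writable copy of the snapshot
$ sudo zfs promote tank/home-clone # make the clone independent of its origin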
ZIL (ZFS Intent Log)
- transactional journal
- acts like write buffer
- purpose: speed up random I/O write ops by writing it sequentially first, resolving the real-write later
- written sequentially
- read only for recovery after crash
- SLOG: locates ZIL on a separate device to improve speed
- only a few GB are needed
- must be fast and needs to be redundant
- losing the ZIL means in-flight write ops are lost
- usually mirrored SSDs (see the example below)
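Adding a mirrored SLOG might look like this; the device names are hypothetical SSDs:
$ sudo zpool add tank log mirror ada5 ada6
$ zpool status tank # the mirror shows up under a 'logs' section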
ARC (Adaptive Replacement Cache)
- in-memory (RAM) read cache for file data and metadata
- can use up to 7/8th of physical memory
- rules of thumb (they vary):
- 1GB of ARC per 1TB of disk space (About ZFS Performance, Percona Database Performance Blog)
- use the maximum allowed, 7/8th of RAM (Explanation of ARC and L2ARC, ZFS Build)
- tools: kstats (see the example below)
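On FreeBSD the kstats are exposed through sysctl; on ZFS on Linux the equivalent counters live in /proc:
$ sysctl kstat.zfs.misc.arcstats.size # current ARC size in bytes (FreeBSD)
$ sysctl vfs.zfs.arc_max # configured ARC ceiling (FreeBSD)
$ cat /proc/spl/kstat/zfs/arcstats # same counters on ZFS on Linux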
L2ARC
- Level 2 ARC, optional
- aka “cache drive”
- optional, uses disk
- usually using SSD
- no mirroring/redundancy needed, since the data is already redundant in ZFS (example below)
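Adding a cache device is a single command; the device name here is a hypothetical SSD:
$ sudo zpool add tank cache ada7
$ zpool iostat -v tank # the device shows up under a 'cache' section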
- Available since Solaris 10 update 2 (6/2006)
- Currently Solaris 11
- Solaris compatibility
- Sun Ultra 20 workstation with Solaris 10 from eBay for $100
- But older Sun hardware cannot handle large hard disks
- acceptable to use RAID-Z, Z2, or Z3 instead of mirrors