Tuesday, July 28, 2009

Technology for Sun Partners

A Hands on Introduction to ZFS Pools; Part 3: RAIDZ2

This hands-on tutorial shows how to manage disks through ZFS with very limited resources. The tutorial is based on the following components:
One 7 port USB 2.0 hub
Six 1GB USB 2.0 memory sticks
OpenSolaris 2008.05 as a live CD (and a system to boot it off); any Solaris 10 system will do as well
It has been released in three parts:
Getting started, pool creation, import and export of pools
Mirrors, managing disk failures and spare disks
RAIDZ2
Within this part of the tutorial you will learn
How to create a ZFS RAIDZ2 volume with 2 parity disks out of 6 USB sticks
How it recovers after disk failures through resilvering
How to scrub a pool
Check out the first part to learn about the Solaris requirements, the six required USB sticks and the USB hub and how to configure them.
As already stated in the other parts: get some extra graphical tools like iobar from the freely available Tools CD. This will nicely visualize what's happening on your disks (see below):

Getting Started
Non Disclaimer: Please be cautious with your system. All commands apply significant changes to your system. Data corruption may easily occur. Have a good backup of your system. The setup is not guaranteed to work since it depends on the USB sticks, the hub and how Solaris gets informed about status changes. ZFS hasn't been optimized for USB sticks and hiccups may occur at any time. Do not use such a configuration for important work. This is a low cost self-learning sandbox!
As already mentioned in the first part:
Become super user (su)
Clean up your /dev/dsk device list (devfsadm -C)
This time we will plug in all 6 sticks at the very beginning since we will need them all.
The goal will be to create a 4GB “high throughput” USB volume with 2*1GB parity disks. This configuration will be able to survive the failure of any 2 disks.

The command rmformat will tell us the mapping of sticks to devices. In my case it's:
Stick   Controller
1       c9t0d0p0
2       c8t0d0p0
3       c7t0d0p0
4       c6t0d0p0
5       c5t0d0p0
6       c4t0d0p0
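To double-check this list without wading through the full rmformat output, a filter like the one below helps. This is just a sketch: it assumes the "Logical Node" lines that rmformat printed on my systems.

rmformat | grep 'Logical Node'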
Creating a RAIDZ2 File System
A RAIDZ2 pool with the name z2pool is created with the command below. The file system /z2pool will be mounted automatically.
jack@opensolaris:~# zpool create z2pool raidz2 c4t0d0p0 c5t0d0p0 c6t0d0p0 c7t0d0p0 c8t0d0p0 c9t0d0p0
The next step will be a simple check of the file system availability. Creating some load on the file system will make things more realistic. The load generation will consist of creating a tar file from /usr in the background. The 4GB available should be enough and the command will need at least 5 minutes to complete. iobar is a good way to see graphically what's going on; the command iostat -xnc 2 will do the same job in a separate window.
jack@opensolaris:~# cd /z2pool
jack@opensolaris:/z2pool# tar cf usr.tar /usr &
[1] 1641
jack@opensolaris:/z2pool# tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
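If you prefer the command line over iobar, a second terminal with one of the following commands gives a live view of the per-device activity (a sketch; both commands exist on OpenSolaris, the output layout differs):

zpool iostat -v z2pool 2
iostat -xnc 2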
The zpool status command will tell us about the configuration of our file system:
zpool status -v z2pool
  pool: z2pool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        z2pool        ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c4t0d0p0  ONLINE       0     0     0
            c5t0d0p0  ONLINE       0     0     0
            c6t0d0p0  ONLINE       0     0     0
            c7t0d0p0  ONLINE       0     0     0
            c8t0d0p0  ONLINE       0     0     0
            c9t0d0p0  ONLINE       0     0     0

errors: No known data errors
Everything seems to be up and operational...
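To see the usable capacity (roughly 4GB net out of the 6GB raw space, minus some overhead), two quick checks help. This is a sketch; the exact numbers depend on your sticks:

zfs list z2pool
zpool list z2pool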
Creating a Disk Failure in a RAIDZ2 Volume
This is the fun part of the tutorial. The destructive one...
Let's pull out stick 4 and see what's happening:

The zpool status command will give an update:
zpool status -v z2pool
  pool: z2pool
 state: DEGRADED
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        z2pool        DEGRADED     0     0     0
          raidz2      DEGRADED     0     0     0
            c4t0d0p0  ONLINE       0     0     0
            c5t0d0p0  ONLINE       0     0     0
            c6t0d0p0  REMOVED      0     0     0
            c7t0d0p0  ONLINE       0     0     0
            c8t0d0p0  ONLINE       0     0     0
            c9t0d0p0  ONLINE       0     0     0

errors: No known data errors
So far, so good ...
ZFS realizes that disk c6t0d0p0 is gone and the configuration is degraded. The file system is still operational as the 3 blinking LEDs of active sticks indicate in the photo above.
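If you don't trust the LEDs, a quick way to convince yourself that the degraded pool still accepts I/O is to write to it (the file name below is just an example):

touch /z2pool/still-alive
ls -l /z2pool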
There's only one thing which is more fun than pulling out an operational disk:
Pulling out a second one...
The photo below documents that stick 2 got removed as well.

The command zpool status reflects the trouble we're in now:
jack@opensolaris:/z2pool# zpool status -v z2pool
  pool: z2pool
 state: DEGRADED
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        z2pool        DEGRADED     0     0     0
          raidz2      DEGRADED     0     0     0
            c4t0d0p0  ONLINE       0     0     0
            c5t0d0p0  ONLINE       0     0     0
            c6t0d0p0  REMOVED      0     0     0
            c7t0d0p0  ONLINE       0     0     0
            c8t0d0p0  REMOVED      0     0     0
            c9t0d0p0  ONLINE       0     0     0

errors: No known data errors
This is as far as we can go with this configuration. Two disks are removed, everything is degraded but still operational. I leave it to the reader to pull another stick and see a corrupted file system.
Recovering from a Degraded RAIDZ2 Volume
We now have four operational disks and two disks which have been ripped out. The next steps will show you how to clean up the mess.
The first step is to plug in stick 2, known to Solaris as c8t0d0p0. This is a special situation since ZFS has labelled it and already knows c8t0d0p0.

This operation is the equivalent of reconnecting a functional disk. Replacing a broken disk 2 with a new disk would lead to a different situation: a new disk would need to be added and told to replace the removed disk. I'm not covering this case here (I have no more memory sticks), but see the sketch below for what it would roughly look like. I'm plugging in the remaining disconnected disk in the hope of regaining a complete configuration.
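For completeness, replacing broken stick 2 with a brand new stick would roughly look like this. c10t0d0p0 is a made-up device name for the new stick; I haven't run this here:

zpool replace z2pool c8t0d0p0 c10t0d0p0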

The command zpool status tells us about the actions being taken by Solaris:
jack@opensolaris:/z2pool# zpool status -v z2pool
  pool: z2pool
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 21.32% done, 0h0m to go
config:

        NAME          STATE     READ WRITE CKSUM
        z2pool        ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c4t0d0p0  ONLINE       0     0     0
            c5t0d0p0  ONLINE       0     0     0
            c6t0d0p0  ONLINE       0     0     0
            c7t0d0p0  ONLINE       0     0     0
            c8t0d0p0  ONLINE       0     0     0
            c9t0d0p0  ONLINE       0     0     0

errors: No known data errors
ZFS put both disks back into the state online and it is resilvering the two disks. 21% of the synchronisation had been completed at the time I checked (a simple way to watch the progress is sketched after the list below). Keep in mind that the resilvering process takes much longer if there is other concurrent IO happening. Be aware that the time to bring everything back into a correct state depends on
on the medium,
the size of the volume,
the parallel IO.
Real disks will expose very different characteristics!
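A simple way to watch the resilver progress from a second terminal is to poll zpool status (a sketch, assuming bash):

while true; do zpool status -v z2pool | grep -i resilver; sleep 10; done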
ZFS may also decide that the reinserted USB sticks are damaged and take them offline. In this case it takes a zpool clear command to pretend that the disk is OK. The disk may then need to be brought online with zpool online (both are spelled out in the sketch below). These extra operations seem to depend on how long the stick has been removed and the state the stick was in when it was removed.
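Spelled out with the pool and device names used here, those recovery commands would look like this (a sketch; only needed if ZFS flags the reinserted stick as faulted):

zpool clear z2pool
zpool online z2pool c8t0d0p0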
Low cost memory sticks aren't made for professional data center operations. They seem to degrade over time through the rude handling from my side. Use the scrub command to validate that they are functional:
jack@opensolaris:/z2pool# zpool scrub z2pool
jack@opensolaris:/z2pool# zpool status -v z2pool
  pool: z2pool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Thu Jul 3 06:38:58 2008
config:

        NAME          STATE     READ WRITE CKSUM
        z2pool        ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c4t0d0p0  ONLINE       0     0     0
            c5t0d0p0  ONLINE       0     0     0
            c6t0d0p0  ONLINE       0     0     0
            c7t0d0p0  ONLINE       0     0     0
            c8t0d0p0  ONLINE       0     0     0
            c9t0d0p0  ONLINE       0     0     0
Keep in mind that scrub will check the entire disk. This command may need significant time. Calling it when parallel IO happens makes it run much longer!
This is what it takes to work with RAIDZ2 volumes. RAIDZ volumes with one parity disk and less redundancy work the same way.
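For comparison, a single-parity RAIDZ pool over the same six sticks would be created like this (a sketch; z1pool is just an example name, and z2pool has to be destroyed first since the sticks are in use):

zpool destroy z2pool
zpool create z1pool raidz c4t0d0p0 c5t0d0p0 c6t0d0p0 c7t0d0p0 c8t0d0p0 c9t0d0p0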
You should have learned within this tutorial:
How to create a ZFS RAIDZ2 volume with 2 parity disks out of 6 USB sticks
How it recovers after disk failures through resilvering
How to scrub a zpool
Wrap Up: Observations
USB memory sticks seem to come with slightly different capacities. rmformat shows the available capacity. This may create trouble with operations where equally sized media is expected, like mirroring. There are overriding options to force ZFS to move on (see the sketch below). You may want to be very cautious with these options in production situations.
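The override referred to here is the -f flag of zpool create (and zpool add). For example (a sketch with a made-up pool name; don't use it on media you care about):

zpool create -f mpool mirror c4t0d0p0 c5t0d0p0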
Gnome dialog about inserted disk: Gnome and the volume manager see that a ZFS-labelled disk has been attached. They show a modal dialog with an error message stating that a disk belonging to a named zpool could not be mounted. The volume manager is unfortunately not smart enough to mount a multi-disk ZFS file system automatically after all disks have been attached. The little modal dialog is however quite useful since it tells the user that the device has been recognized.
Which hub and which sticks to pick? No idea. I purchased the cheapest available USB 2.0 components. They worked. Please drop a comment if you know more.
Which Solaris to use?
Solaris 10U4 on an Ultra 20: worked
Solaris Developer Express 1/08 on a Tecra A5: worked
OpenSolaris 2008.05 as live CD on an Ultra 20: worked best!
OpenSolaris 2008.05 in a VMware image with Fusion on a MacPro seems to work as well
Solaris Developer Express 1/08 in a VMware image with VMware Player on Windows XP didn't work for me
The simplest way to exercise the tutorial seems to be the OpenSolaris Live CD booted straight from CD. It worked like a charm and there is no significant risk of damaging something.
I hope you'll enjoy this tutorial as much as I did. Use the ZFS Administration Manual to learn about the correct usage of ZFS. Consider using the newest available Solaris version. OpenSolaris 2008.05 as a live CD worked best for me.
Released parts of the hands on tutorial up to now are:
Getting started, pool creation, import and export of pools
Mirrors, managing disk failures and spare disks
RAIDZ2
