Insight into Hot Spares in CLARiiON
- Hot Sparing - Provides automatic, online rebuilds of redundant RAID groups when any of the disks fails.
- Proactive Hot Sparing - Recognizes when a disk is nearing failure and preemptively copies the disk’s content prior to failure.
- Rebuild Logging - Ability for a drive in a redundant RAID group to be offline for a period of time while write I/O to this drive is logged.
Remember that RAID 1, RAID 3 and RAID 5 can withstand a single disk failure, whereas RAID 1/0 can withstand a single failure per mirrored pair. Hot spares are used to automatically rebuild data from a failed drive in these RAID groups.
When is Global Hot Sparing invoked?
- Manually initiated proactive copy
- FLARE initiated (automated) proactive copy
- Drive failure or removal
How is a Hot Spare selected?
- FLARE applies the global hot spare drive selection algorithm to select an appropriate hot spare disk.
- For RAID 3/5/6, data from the failed drive is rebuilt from parity onto the hot spare.
- For RAID 1 and RAID 1/0, data is copied from the surviving mirror.
- Once a rebuild starts, it continues to completion even if the failed disk is replaced.
- When the failed drive is replaced, FLARE copies the data from the hot spare to the replacement. This is called equalization.
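As a rough illustration of the parity rebuild mentioned above, the following is a minimal sketch of single-parity reconstruction (the RAID 3/5 case) using XOR. It is not FLARE code; the 4+1 group layout, the block contents and the function name `xor_blocks` are assumptions made for the example, and RAID 6 dual parity is not covered.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across a hypothetical 4+1 single-parity group.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# The drive holding data[2] fails. Its block is recomputed onto the hot
# spare from the surviving data blocks plus the parity block.
survivors = data[:2] + data[3:] + [parity]
rebuilt_on_spare = xor_blocks(survivors)

print(rebuilt_on_spare == data[2])
```

Every surviving drive in the group has to be read to recompute each block, which is one reason a parity rebuild is more expensive than copying from a surviving mirror.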
Rebuild and Equalization time considerations:
- Drive capacity
- Drive type (SSD/FC/SAS/SATA)
- User space on the drive that is bound to LUNs
- Rebuild Priority
- Background I/O Workload
- RAID type
- Number of drives in RAID group (Parity groups)
- Distribution of drives over as many FC back-end loops as possible
- On CLARiiON systems, LUNs with a higher rebuild priority rebuild faster. If two LUNs in a RAID group have the same priority, they are further prioritized by size, and the smallest LUN rebuilds first. If a RAID group is idle during a rebuild, all LUNs rebuild ASAP regardless of their priority.
- A higher rebuild priority has a greater impact on storage system performance.
- Equalization is a disk-to-disk copy process and is therefore faster than a rebuild.
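The priority ordering can be sketched as a simple sort. This is only an illustration, not FLARE's scheduler: the `Lun` class, the integer priority encoding and the LUN names are hypothetical. The tie-break among LUNs of equal priority is passed in as a parameter, because it may be LUN size or position (offset) within the RAID group.

```python
from dataclasses import dataclass

@dataclass
class Lun:
    name: str
    priority: int    # assumed encoding: larger number = higher rebuild priority
    size_gb: int
    offset_gb: int   # start of the LUN within the RAID group

def rebuild_order(luns, tie_break):
    """Highest priority first; equal priorities ordered by tie_break(lun)."""
    return sorted(luns, key=lambda lun: (-lun.priority, tie_break(lun)))

luns = [
    Lun("LUN_A", priority=2, size_gb=500, offset_gb=0),
    Lun("LUN_B", priority=2, size_gb=100, offset_gb=500),
    Lun("LUN_C", priority=3, size_gb=300, offset_gb=600),
]

by_size = [lun.name for lun in rebuild_order(luns, lambda lun: lun.size_gb)]
by_offset = [lun.name for lun in rebuild_order(luns, lambda lun: lun.offset_gb)]
print(by_size)
print(by_offset)
```

With these sample values the two tie-break rules disagree only on the order of LUN_A and LUN_B, which share a priority.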
Characteristics of proactive hot sparing:
- Feature available for systems running FLARE release 24 and above
- Proactive copy means the RAID group is never exposed to a scenario where an additional drive failure can cause data loss.
- A copy operation is faster than a rebuild, and the other drives in the RAID group are not burdened by rebuild I/O.
- A proactive copy can be started manually or automatically: manually from Navisphere Manager (right-click the disk and select Copy to Hot Spare), or automatically by FLARE.
Rules
- Only one proactive spare may be active in a RAID group at any given time
- Proactive copying starts when a drive reaches a certain error threshold or when copying has been manually triggered.
- Data is copied proactively to the spare, and checkpoints are set throughout. If the disk fails during the proactive copy, only the data after the last checkpoint is rebuilt.
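The checkpoint rule can be pictured with a small simulation. This is a sketch under stated assumptions, not the FLARE implementation: the block counts, the checkpoint interval, and the names `proactive_copy` and `finish_with_rebuild` are invented for the example, and the `rebuild_block` callback merely stands in for a real parity or mirror rebuild.

```python
def proactive_copy(source, spare, checkpoint_every, fail_at=None):
    """Copy blocks from source to spare, recording a checkpoint periodically.

    fail_at simulates the source drive failing just before that block is
    copied. Returns (last_checkpoint, completed).
    """
    checkpoint = 0
    for i, block in enumerate(source):
        if fail_at is not None and i == fail_at:
            return checkpoint, False
        spare[i] = block
        if (i + 1) % checkpoint_every == 0:
            checkpoint = i + 1   # blocks [0, checkpoint) are safe on the spare
    return len(source), True

def finish_with_rebuild(spare, checkpoint, rebuild_block):
    """Rebuild only the blocks after the last checkpoint; return how many."""
    for i in range(checkpoint, len(spare)):
        spare[i] = rebuild_block(i)
    return len(spare) - checkpoint

reference = [f"blk{i}" for i in range(10)]   # what the failing drive held
spare = [None] * 10

checkpoint, completed = proactive_copy(reference, spare, checkpoint_every=4, fail_at=6)
rebuilt = finish_with_rebuild(spare, checkpoint, lambda i: reference[i])
print(checkpoint, completed, rebuilt)
```

In this run the drive fails after six blocks were copied, but the last checkpoint covers only the first four, so six of the ten blocks are rebuilt rather than all ten.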
Rebuild Logging:
This feature, available in FLARE 24 and above, allows a drive in a redundant RAID group to be offline for a period of time while write I/O to that drive is logged. Once the drive becomes accessible again, the rebuild log is used to rebuild it quickly.
When I/O to a drive fails with a timeout error, the drive is considered for probational status. This delays a hot spare from swapping in; a rebuild log is created and write I/O is logged. The drive is checked for availability once every 30 seconds for approximately 5 minutes. If the drive comes back online, data is rebuilt from the rebuild log; otherwise a full rebuild of the drive is required.
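The probation window can be modelled roughly as follows. This is a simplified sketch, not FLARE behaviour verbatim: the function name `probation`, the region numbers and the exact 300-second window (the text says "approximately 5 minutes") are assumptions, and time is simulated rather than waited for.

```python
CHECK_INTERVAL_S = 30
PROBATION_S = 300   # assumed value for "approximately 5 minutes"

def probation(drive_returns_at_s, writes):
    """Simulate the probation period for a drive that timed out.

    writes: list of (time_s, region) writes aimed at the offline drive.
    drive_returns_at_s: when the drive becomes accessible, or None if never.
    Returns ("log_rebuild", regions_to_rebuild) or ("full_rebuild", None).
    """
    rebuild_log = set()
    for t in range(CHECK_INTERVAL_S, PROBATION_S + 1, CHECK_INTERVAL_S):
        # Log every region written since the previous availability check.
        rebuild_log.update(r for (wt, r) in writes if t - CHECK_INTERVAL_S <= wt < t)
        if drive_returns_at_s is not None and drive_returns_at_s <= t:
            return "log_rebuild", sorted(rebuild_log)
    return "full_rebuild", None

writes = [(5, 12), (40, 7), (65, 12), (200, 3)]
came_back = probation(70, writes)      # drive reappears after 70 seconds
never_back = probation(None, writes)   # drive stays offline
print(came_back)
print(never_back)
```

When the drive returns within the window, only the regions written while it was offline need rebuilding, which is why the log-based rebuild is so much quicker than a full one.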
How to validate that the copy to the hot spare has completed, and which drive was used?
Right-click the system and open the Disk Summary to check the status of the hot spare that is replacing and the disk it replaced. You can also right-click the Storage Processor and select View Events.
One statement is incorrect:
“With CLARiiON Systems, LUN with higher rebuild priorities rebuild faster and if two LUN, have same priority in a RG, they are further prioritized by size .Smallest LUN will rebuild first. ”
If all rebuild priorities are the same, the LUNs will rebuild in order from the smallest data offset to the largest. So the LUN at 0GB offset rebuilds first, then the next LUN, all the way to the end of the RAID group.
How do you monitor the rebuild process? I have replaced a failed disk and I can see it “Equalising” but I am sure there is a place that displays the % done.
Thanks… Just found what I was looking for at http://www.penguinpunk.net/blog/clariion-hot-spare-rebuild-progress-and-naviseccli/ Cheers