Insight into Hot Spares in CLARiiON
Hot Sparing - Provides automatic, online rebuilds of redundant RAID groups when any disk in the group fails.
Proactive Hot Sparing - Recognizes when a disk is nearing failure and preemptively copies the disk’s content prior to failure.
Rebuild Logging - Ability for a drive in a redundant RAID group to be offline for a period of time while write I/O to this drive is logged.
Remember that RAID 1, RAID 3 and RAID 5 can withstand a single disk failure, whereas RAID 1/0 can withstand a single failure per mirrored pair. Hot spares are used to automatically rebuild data from a failed drive in these RAID groups.
When is Global Hot Sparing invoked?
- Manually initiated proactive copy
- FLARE initiated (automated) proactive copy
- Drive failure or removal
How is a Hot Spare selected?
- FLARE applies its global hot spare selection algorithm to choose an appropriate hot spare disk.
- For RAID 3/5/6, data from the failed drive is rebuilt from parity onto the hot spare.
- For RAID 1 and RAID 1/0, data is copied from the surviving mirror.
- Once a rebuild starts, it continues to completion even if the failed disk is replaced in the meantime. When the failed drive is replaced, FLARE copies the data from the hot spare to the replacement. This is called equalization.
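The parity rebuild described above can be illustrated with a minimal sketch. It assumes a single-parity (RAID 3/5 style) stripe where parity is the XOR of the data blocks. The function names and the 4+1 stripe layout are hypothetical, chosen for illustration only, and do not represent FLARE's actual implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_block(stripe, failed_index):
    """Recompute the block at failed_index from the surviving blocks of one stripe."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# One stripe of a hypothetical 4+1 parity group (illustrative data only).
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]
parity = xor_blocks(data)
stripe = data + [parity]

# The drive holding block 2 fails; its content is regenerated for the hot spare.
hot_spare_block = rebuild_block(stripe, failed_index=2)
print(hot_spare_block == data[2])
```

Every surviving drive in the group has to be read to regenerate each block, which is why a parity rebuild is more expensive than the disk-to-disk copy used for mirrors and for equalization.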
Rebuild and Equalization time considerations:
- Drive capacity
- Drive type (SSD/FC/SAS/SATA)
- User space on the drive that is bound to LUNs
- Rebuild Priority
- Background I/O Workload
- RAID type
- Number of drives in RAID group (Parity groups)
- Distribution of the drives across multiple FC back-end loops
- On CLARiiON systems, LUNs with a higher rebuild priority rebuild faster. If two LUNs in a RG have the same priority, they are further prioritized by size, and the smallest LUN rebuilds first. If a RG is idle during the rebuild, all LUNs rebuild ASAP regardless of their priority.
- A higher rebuild priority has a greater impact on storage system performance.
- Equalization is a disk-to-disk copy process and is therefore faster than a rebuild.
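The LUN ordering rule above (priority first, then smallest LUN first) can be sketched as a sort with a two-part key. The priority labels, their ranking, and the `Lun` type are assumptions made for this example. The idle-RG case, where every LUN rebuilds ASAP, is not modelled.

```python
from dataclasses import dataclass

# Assumed priority labels; a lower rank means the LUN rebuilds earlier.
PRIORITY_RANK = {"ASAP": 0, "High": 1, "Medium": 2, "Low": 3}

@dataclass
class Lun:
    name: str
    priority: str
    size_gb: int

def rebuild_order(luns):
    """Order LUNs by rebuild priority, breaking ties by size (smallest first)."""
    return sorted(luns, key=lambda lun: (PRIORITY_RANK[lun.priority], lun.size_gb))

# Hypothetical LUNs in one RAID group.
luns = [
    Lun("LUN_10", "Medium", 500),
    Lun("LUN_11", "High", 800),
    Lun("LUN_12", "High", 200),
]
order = [lun.name for lun in rebuild_order(luns)]
print(order)  # ['LUN_12', 'LUN_11', 'LUN_10']
```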
Characteristics of proactive hot sparing:
- Feature available for systems running FLARE release 24 and above
- Proactive copy means the RG is never exposed to a scenario where an additional drive failure can cause data loss.
- Performing a copy operation rather than a rebuild means a faster data copy, and the other drives in the RG are not affected by a rebuild.
- Proactive copy can be started manually or automatically: manually through Navisphere Manager (right-click the drive and select Copy to Hot Spare), or automatically by FLARE.
- Only one proactive spare may be active in a RAID group at any given time.
- Proactive copying starts when a drive reaches a certain error threshold or when the copy has been triggered manually.
- Data is copied proactively to the spare, and checkpoints are set throughout. If the disk fails during the proactive copy, only the data after the last checkpoint is rebuilt.
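The checkpoint behaviour can be sketched as a small simulation. The block counts, the checkpoint interval, and the function name are hypothetical, since the source does not state how often FLARE sets checkpoints. The sketch only shows why a failure mid-copy leaves a partial rebuild rather than a full one.

```python
def proactive_copy(total_blocks, checkpoint_interval, fail_at=None):
    """Simulate a proactive copy of blocks 0..total_blocks-1 to a spare.

    Returns (last_checkpoint, blocks_to_rebuild), where last_checkpoint is the
    number of blocks covered by the most recent checkpoint and
    blocks_to_rebuild is the range that must be rebuilt from redundancy.
    """
    last_checkpoint = 0
    for block in range(total_blocks):
        if fail_at is not None and block == fail_at:
            # Source drive failed: everything after the last checkpoint is rebuilt.
            return last_checkpoint, range(last_checkpoint, total_blocks)
        copied = block + 1  # the actual block copy is omitted in this sketch
        if copied % checkpoint_interval == 0:
            last_checkpoint = copied
    return total_blocks, range(total_blocks, total_blocks)

# Hypothetical 1000-block drive, checkpoint every 100 blocks, failure at block 250.
checkpoint, to_rebuild = proactive_copy(1000, 100, fail_at=250)
print(checkpoint, len(to_rebuild))  # 200 800
```

In this run, blocks 200 to 249 were already copied but fall after the last checkpoint, so they are rebuilt along with the blocks that had not been copied yet.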
Rebuild logging is a feature available in FLARE 24 and above that allows a drive in a redundant RAID group to be offline for a period of time while write I/O to that drive is logged. Once the disk becomes accessible again, the rebuild log is used to quickly rebuild the drive.
When I/O to a drive fails with a timeout error, the drive is considered for probational status. This delays a hot spare from swapping in; a rebuild log is created and write I/O is logged. The drive is checked for availability once every 30 seconds for approximately 5 minutes. If the disk comes back online, data is rebuilt from the rebuild log; otherwise a full rebuild of the drive is required.
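The probation window can be sketched as a simple polling loop. The 30-second interval and 5-minute window come from the description above. The function name, the `is_online` callback, and the returned labels are hypothetical. The sleep function is injectable so the example runs without waiting.

```python
import time

CHECK_INTERVAL_S = 30     # drive is checked once every 30 seconds
PROBATION_WINDOW_S = 300  # for approximately 5 minutes

def run_probation(is_online, sleep=time.sleep):
    """Poll a drive during probation.

    Returns "log_rebuild" if the drive comes back within the window (only the
    logged writes need replaying), else "full_rebuild" (hot spare swaps in).
    """
    checks = PROBATION_WINDOW_S // CHECK_INTERVAL_S
    for _ in range(checks):
        sleep(CHECK_INTERVAL_S)
        if is_online():
            return "log_rebuild"
    return "full_rebuild"

# Simulated drive that reappears on the third check; sleep is stubbed out.
answers = iter([False, False, True])
result = run_probation(lambda: next(answers), sleep=lambda seconds: None)
print(result)  # log_rebuild
```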