Insight into Hot Spares in CLARiiON

Storage systems are generally preferred for mission-critical applications that need to run 24x7x365. They are built to be redundant at every level, and hot sparing is one of the techniques used to add a further level of redundancy for disks. This post is based on the EMC white paper on hot sparing, which I highly recommend reading; you can get it with your Powerlink ID.
 
First, let's understand three key terms:
  • Hot Sparing – Provides automatic, online rebuilds of redundant RAID groups when one of their disks fails.
  • Proactive Hot Sparing – Recognizes when a disk is nearing failure and preemptively copies the disk's content to a hot spare before it fails.
  • Rebuild Logging – Allows a drive in a redundant RAID group to be offline for a period of time while write I/O destined for that drive is logged.

Remember that RAID 1, RAID 3 and RAID 5 can withstand a single disk failure, whereas RAID 1/0 can withstand a single failure per mirrored pair. Hot spares are used to automatically rebuild the data from a failed drive in these RAID groups.
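
To make the tolerance rules above concrete, here is a minimal Python sketch. It is only an illustration of the rules as stated in this post, not FLARE code; the function name `survives` and the way mirrored pairs are passed in are my own assumptions. The RAID 6 case (two failures) is added from general RAID knowledge.

```python
# Illustrative only: the tolerance rules summarized above, not FLARE code.

def survives(raid_type, failed, mirror_pairs=None):
    """Return True if the RAID group still has its data after `failed` disks are lost.

    raid_type    -- "1", "3", "5", "6" or "1/0"
    failed       -- set of failed disk indexes
    mirror_pairs -- for "1/0" only: list of (disk_a, disk_b) mirrored pairs
    """
    if raid_type in ("1", "3", "5"):
        return len(failed) <= 1
    if raid_type == "6":
        return len(failed) <= 2
    if raid_type == "1/0":
        # Data is lost only when both halves of the same mirror are gone.
        return all(not (a in failed and b in failed) for a, b in mirror_pairs)
    raise ValueError(f"unknown RAID type: {raid_type}")


pairs = [(0, 1), (2, 3), (4, 5)]
print(survives("5", {2}))                 # True: one failure is tolerated
print(survives("5", {2, 3}))              # False: second failure loses data
print(survives("1/0", {0, 2, 4}, pairs))  # True: one failure per mirrored pair
print(survives("1/0", {0, 1}, pairs))     # False: both disks of one pair failed
```

The window between the first failure and the end of the rebuild is exactly where a second failure hurts, which is why a hot spare that starts rebuilding immediately matters.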

When is Global Hot Sparing invoked?

  • Manually initiated proactive copy
  • FLARE initiated (automated) proactive copy
  • Drive failure or removal

How is a Hot Spare selected?

  • FLARE applies the global hot spare selection algorithm to choose an appropriate hot spare disk.
  • For RAID 3/5/6, data from the failed drive is rebuilt from parity onto the hot spare.
  • For RAID 1 and RAID 1/0, data is copied from the surviving mirror.
  • Once a rebuild starts, it continues to completion even if the failed disk is replaced in the meantime. When the failed drive is replaced, FLARE copies the data from the hot spare to the replacement. This is called equalization.
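
The parity rebuild and the later equalization can be sketched in a few lines of Python. This is a toy single-parity (RAID 5 style) example using plain XOR, which is how single parity works in general; it is not FLARE's implementation, and the stripe layout and the helper name `xor_blocks` are made up for illustration. RAID 6 adds a second, differently computed parity that is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of a hypothetical 4+1 group: four data blocks plus their parity.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Disk 2 fails. Every surviving block (data and parity) is read and XORed
# together to recreate the missing block on the hot spare.
survivors = [blk for i, blk in enumerate(data) if i != 2] + [parity]
hot_spare = xor_blocks(survivors)

# The failed drive is replaced: equalization is a straight disk-to-disk copy
# from the hot spare, with no parity math and no reads from the other drives.
replacement = bytes(hot_spare)

print(hot_spare == data[2], replacement == data[2])  # True True
```

Note how the rebuild has to read every surviving drive in the group, while equalization only touches the hot spare and the replacement.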

Rebuild and Equalization time considerations:

  • Drive capacity
  • Drive type (SSD/FC/SAS/SATA)
  • User space on the drive that is bound to LUNs
  • Rebuild Priority
  • Background I/O Workload
  • RAID type
  • Number of drives in RAID group (Parity groups)
  • Distribution of drives over as many FC back-end loops as possible
  • With CLARiiON systems, LUNs with a higher rebuild priority rebuild faster. If two LUNs in a RAID group have the same priority, they are further prioritized by size: the smallest LUN rebuilds first. If a RAID group is idle during the rebuild, all LUNs rebuild ASAP regardless of their priority.
  • A higher rebuild priority has a greater impact on storage system performance.
  • Equalization is a disk-to-disk copy process and hence is faster than a rebuild.
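
The ordering rule in the list above can be sketched as a simple sort. This is only a sketch of the rule as described, not of FLARE internals: the `Lun` class, the priority names and their ranks are assumptions for illustration. A reader comment on this post reports that LUNs with equal priority actually rebuild in order of their offset in the RAID group, so the tie-break is left selectable.

```python
from dataclasses import dataclass

# Rank for each rebuild priority; the names and ranks here are assumptions.
PRIORITY_RANK = {"ASAP": 0, "High": 1, "Medium": 2, "Low": 3}


@dataclass
class Lun:
    name: str
    priority: str
    size_gb: int
    offset_gb: int  # where the LUN starts inside the RAID group


def rebuild_order(luns, tie_break="size"):
    """Order LUNs by rebuild priority, then by size or by offset."""
    if tie_break not in ("size", "offset"):
        raise ValueError("tie_break must be 'size' or 'offset'")

    def key(lun):
        second = lun.size_gb if tie_break == "size" else lun.offset_gb
        return (PRIORITY_RANK[lun.priority], second)

    return [lun.name for lun in sorted(luns, key=key)]


luns = [
    Lun("LUN_10", "Medium", size_gb=500, offset_gb=0),
    Lun("LUN_11", "High", size_gb=200, offset_gb=500),
    Lun("LUN_12", "Medium", size_gb=100, offset_gb=700),
]
print(rebuild_order(luns))            # ['LUN_11', 'LUN_12', 'LUN_10']
print(rebuild_order(luns, "offset"))  # ['LUN_11', 'LUN_10', 'LUN_12']
```

Either way, the higher-priority LUN goes first; the two tie-break rules only differ for LUNs that share a priority.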

Characteristics of proactive hot sparing:

  • The feature is available on systems running FLARE release 24 and above.
  • Proactive copy means the RAID group is never exposed to a window in which an additional drive failure can cause data loss.
  • Performing a copy rather than a rebuild means the data moves faster, and the other drives in the group are not loaded by rebuild I/O.
  • A proactive copy can be started manually or automatically: manually from Navisphere Manager (right-click the disk and choose Copy to Hot Spare), or automatically by FLARE.

Rules

  • Only one proactive spare may be active in a RAID group at any given time
  • Proactive copying starts when a drive reaches a certain error threshold or when the copy has been manually triggered.
  • Data is copied proactively to the spare and checkpoints are set throughout. If the disk fails during the proactive copy, only the data after the last checkpoint is rebuilt.
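
The checkpoint rule can be illustrated with a small simulation. This is a sketch under my own assumptions (block-level copy, a fixed checkpoint interval, the function name `proactive_copy`); the post does not say how often FLARE sets checkpoints or at what granularity.

```python
def proactive_copy(source, checkpoint_every, fail_at=None):
    """Simulate a proactive copy from a suspect drive to a hot spare.

    source           -- list of blocks on the suspect drive
    checkpoint_every -- set a checkpoint after this many copied blocks
    fail_at          -- block index at which the drive dies (None = never)

    Returns (safe, to_rebuild): block indexes covered by the last checkpoint
    and block indexes that must still be rebuilt from the RAID group.
    """
    checkpoint = 0  # number of blocks known to be safely on the spare
    for index in range(len(source)):
        if fail_at is not None and index == fail_at:
            break  # the suspect drive failed before this block was copied
        # (the actual block copy to the spare would happen here)
        if (index + 1) % checkpoint_every == 0:
            checkpoint = index + 1
    else:
        checkpoint = len(source)  # copy ran to completion
    return list(range(checkpoint)), list(range(checkpoint, len(source)))


blocks = list("ABCDEFGHIJ")
print(proactive_copy(blocks, 4, fail_at=6))  # ([0, 1, 2, 3], [4, 5, 6, 7, 8, 9])
print(proactive_copy(blocks, 4))             # all ten blocks safe, nothing to rebuild
```

In the first call the drive dies at block 6, but only the four blocks before the last checkpoint are trusted; everything after it is rebuilt from the rest of the RAID group.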

Rebuild Logging:

This feature is available in FLARE 24 and above. It allows a drive in a redundant RAID group to be offline for a period of time while the write I/O to that disk is logged. Once the disk becomes accessible again, the rebuild log is used to quickly rebuild the drive.

When I/O to a drive fails due to a timeout error, the drive is considered for probational status. This delays a hot spare from swapping in; a rebuild log is created and write I/O is logged. The drive is checked for availability once every 30 seconds for approximately 5 minutes. If the disk comes back online, its data is rebuilt from the rebuild log; otherwise a full rebuild of the drive is required.
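
The probation logic described above can be sketched as follows. Only the 30-second interval and the roughly 5-minute window come from this post; the function name `handle_timeout`, the `drive_is_back` probe and the way the log is represented are hypothetical.

```python
def handle_timeout(drive_is_back, logged_writes, checks=10, interval_s=30):
    """Decide how a drive that timed out is brought back in sync.

    drive_is_back -- callable(check_number) -> bool, a hypothetical health probe
    logged_writes -- block numbers written while the drive was offline
    checks, interval_s -- probation window: 10 checks x 30 s = about 5 minutes

    Returns (action, blocks, seconds_elapsed).
    """
    for check in range(1, checks + 1):
        # A real system would wait interval_s seconds between probes.
        if drive_is_back(check):
            # Only the blocks that changed while offline need to be rewritten.
            return ("log_rebuild", sorted(set(logged_writes)), check * interval_s)
    # Probation expired: the hot spare swaps in and the whole drive is rebuilt.
    return ("full_rebuild", None, checks * interval_s)


print(handle_timeout(lambda n: n >= 3, [40, 7, 40, 12]))  # ('log_rebuild', [7, 12, 40], 90)
print(handle_timeout(lambda n: False, [40, 7]))           # ('full_rebuild', None, 300)
```

The payoff is in the first case: a drive that blips for 90 seconds costs three block writes instead of a full rebuild.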


Categories: Storage
  1. rolan
    November 10, 2011 at 11:16 am

    How to validate if the copy to hotspare is completed and which drive?

    • November 10, 2011 at 12:11 pm

      You can right-click the system and open the Disk Summary to check the status of the hot spare and the disk it replaced. You can also right-click the storage processor and view the events.

  2. Jeremy Bradshaw
    January 26, 2013 at 7:12 pm

    One statement is incorrect:
    “With CLARiiON Systems, LUN with higher rebuild priorities rebuild faster and if two LUN, have same priority in a RG, they are further prioritized by size .Smallest LUN will rebuild first. ”
    If all rebuild priorities are the same, the LUN’s will rebuild in the order of smallest data offset to largest. So the LUN at 0GB offset first, then the next LUN, all the way to the end of the RAID group.

  3. Nigel
    January 30, 2013 at 8:04 am

    How do you monitor the rebuild process? I have replaced a failed disk and I can see it “Equalising” but I am sure there is a place that displays the % done.

  4. Nigel
    January 30, 2013 at 8:17 am

    Thanks… Just found what I was looking for at http://www.penguinpunk.net/blog/clariion-hot-spare-rebuild-progress-and-naviseccli/ Cheers

  6. September 24, 2017 at 6:47 pm

    Hi, Nice article. I have an situation with my Dad’s work Clariion CX4-120, where the hot sparing kicks in, but no drives have failed well as yet. It has been just over two weeks now and the system is still saying that drives are failing. We have changed 7 drives so far. The system is using 35x 300 GB drives in RAID 5. Please advise on a possible cause and /or solution? Well done again.
