Pinpoint failing HDD on IMS1000 RAID

  • Pinpoint failing HDD on IMS1000 RAID

    I am probably overlooking something obvious here, but this is how it is:
    In the last month, twice there were frame underflow incidents.
    The first time, the RAID status was changing multiple times between Degraded and Healthy. The CPL was playing in the midst of the screening day and no other CPL showed issues.
    I thought of deleting and reingesting the DCP at first, since CPL validation was nowhere to be found, but then, I went all the way to reinitialise the array.
    There was nothing I could find to suggest which drive was the culprit. The CPL in question never created issues since.
    The issue came up again, with another CPL that was playing for the first time.

    My take on that is that it is probably a corrupt file due to an HDD that has reached its retirement years.
    There is always a chance, though, that the IMS1000 RAID controller is messing with written data when ingesting. I really hope not, because I doubt that there will be a proper replacement available.

    So, until replacing the array with one of the WD or WD(Hitachi) drive models qualified for use, I was thinking of removing the one that caused the trouble.
    The problem there is that, from what I gather (and please correct me if I am wrong), the IMS1000 is using a hardware RAID under the blanket of the software RAID that we are familiar with from SVx and DCP2y.
    So, there is no indication of which drive did the mischief. My reading of kern.log.1 and odeticsd.log didn't bear any fruit.
    The LEDs by the HDDs are all green, and if I go to the storage tab of the diagnostic tool, I only get "normal" status for all of rd00, rd01 and rd02, with the same SpinUpTime and zero for ReallocatedSectorCount, SeekErrorRate, OfflineUncorectable and UDMACrCError.

    Any ideas on how to discover the "lazy" one?

  • #2
    Looking at the SMART data is one way to see which drive is slow. However, have you used Dolby's Log Analyzer? I'd start there... it can quickly find things that are easy to overlook. Then, you can also submit a case to Dolby to see if they can find something.

    • #3
      If you're lucky, SMART data can help you. Especially if you see a drive that has an increasing number of bad blocks or logs a lot of timeouts, but unfortunately, SMART data isn't always reliable in this regard.

      Also, gotta love the better kind of hardware RAID controller, the kind that detects a dying drive early and opts to eject it if there is still redundancy left in the array... Linux software RAID sucks at finding drives that start lagging your RAID due to some lingering issue, this is also due to the default behavior of drivers to try until you die, often without producing usable log information... Back in the day, I had my own set of hacks for ZFS on FreeBSD, because a single, failing disk could eliminate the performance of your 40+ disk array...

      In previous cases when this happened, I simply swapped all drives, but if you're not able to do so, the most reliable way is also the one that will take the longest: simply pull disk0 and see if the problem repeats. If so, re-insert disk0, let the RAID rebuild and then pull disk1... Repeat for all remaining disks until you find the culprit...

      • #4
        I'm also wondering about a DCP that has a bitrate that is too large for the IMS1000 to cope with, but the RAID controller is getting confused and putting this down to a hardware fault with the drive that was being read from at the time. If the fault has only ever occurred with one CPL, this could be a possibility. If I remember correctly, the Doremi rack servers (of which the IMS1000 is essentially a miniature variant) had a maximum pix bitrate of 200 MBPS, and DCPs nowadays can and do go above that.

        Pulling the drives one by one, hooking them to a SATA to USB adapter and thence a PC, and looking at their SMART stats with a utility such as CrystalDiskInfo should identify any obvious hardware issues.

        • #5
          Steve, I tried smartctl as a command from the terminal, but all I can query is /dev/sdh1 and /dev/sdh2, which correspond to the JMicron H/W RAID5 and not to individual drives.
          On the WUI, I couldn't find any SMART info that would illuminate the subject. There is this SMART list if one goes to Storage, Storage Details, md0>hw0>rd00/rd01/rd02, but that's essentially the same for all drives and in /dev/ there is no rd0X device to look for (by using the SMART tool). Nor /dev/hdd0...

          I also started from Dolby's Log Analyzer. Yet, the device error is in the form of:
          res 51/04:00:30:f7:61/00:04:e6:00:00/e0 Emask 0x1
          that seems... machine code to me.
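
For readers puzzled by the same kind of line: it looks like the result taskfile that the Linux libata layer prints after a failed ATA command, where the first two bytes are the ATA status and error registers. Below is a minimal Python sketch that decodes them, assuming the libata `res` layout and the bit names from the kernel's ata.h. `decode_res` and the lookup tables are illustrative names, not part of any Dolby or Doremi tool.

```python
import re

# Bit names as defined in the Linux kernel's include/linux/ata.h (assumed layout).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "SENSE", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "TRK0NF", 0x01: "AMNF"}
# libata error-mask classes (AC_ERR_* in include/linux/libata.h).
EMASK_BITS = {0x001: "device error", 0x002: "HSM violation", 0x004: "timeout",
              0x008: "media error", 0x010: "ATA bus error",
              0x020: "host bus error", 0x040: "system error",
              0x080: "invalid argument", 0x100: "unknown error"}

# Twelve hex bytes separated by '/' or ':', followed by the Emask value.
RES_RE = re.compile(
    r"res ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2}) Emask (0x[0-9a-f]+)")


def names(value, table):
    """Return the names of all bits of `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_res(line):
    """Decode status, error and Emask from a libata 'res' log line."""
    match = RES_RE.search(line)
    if match is None:
        raise ValueError("no libata 'res' taskfile found")
    taskfile = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    status, error = taskfile[0], taskfile[1]
    return {
        "status": names(status, STATUS_BITS),
        "error": names(error, ERROR_BITS),
        "emask": names(int(match.group(2), 16), EMASK_BITS),
    }


if __name__ == "__main__":
    print(decode_res("res 51/04:00:30:f7:61/00:04:e6:00:00/e0 Emask 0x1"))
```

For the line quoted above, this gives status DRDY, DSC, ERR, error ABRT and Emask "device error": the device aborted the command rather than reporting an uncorrectable read (UNC). Since Linux only sees the RAID controller's volume here, this probably does not reveal which member drive was involved.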


          Marcel, just so I understand, are you suggesting simply removing one drive at a time?
          It makes sense, but it is, as you said, time consuming. After re-inserting a drive whose removal made no difference, one also has to let the array resynchronise.
          I was wondering if there is something to watch for in the logs, the status or in the behaviour.
          Previous doremi servers had a check that would come back with each drive's speed. IMS1000 leaves you all alone there.
          And I hit the RAID controller wall without having the knowledge (or resources) to reach the drives. (Unless I take them out and check on a PC, one by one.)


          That's why I was wondering if someone would have an idea (or experience) on how to pinpoint the drive for me to remove.

          My intention is to change the whole array. But I would prefer, in the meanwhile, to remove the "evil doer", so to reduce the frame underflow occurrences.
          Creating a ticket is always an option, but I was hoping for a readily available option.

          • #6
            Originally posted by Leo Enticknap
            I'm also wondering about a DCP that has a bitrate that is too large for the IMS1000 to cope with, but the RAID controller is getting confused and putting this down to a hardware fault with the drive that was being read from at the time. If the fault has only ever occurred with one CPL, this could be a possibility. If I remember correctly, the Doremi rack servers (of which the IMS1000 is essentially a miniature variant) had a maximum pix bitrate of 200 MBPS, and DCPs nowadays can and do go above that.

            Pulling the drives one by one, hooking them to a SATA to USB adapter and thence a PC, and looking at their SMART stats with a utility such as CrystalDiskInfo should identify any obvious hardware issues.
            Leo (we were writing simultaneously), it may have been that, but the first DCP that brought up the issue is now playing O.K.
            I also let the DCP play those few minutes when the problem occurred and it didn't create any problem. My impression is that the problem is not a bitrate one, but a hardware one.
            Yet, I can't be sure whether it is a drive or the RAID controller that is reaching its "golden" years. That's what I am afraid of the most.
            Here is the SMART status, as shown on the WUI:
            [Attached image: Screenshot 2021-08-30 at 13.52.27.png]
            All the other drives show identical SMART info.

            • #7
              I would take out each drive and analyse it connected to a PC with a useful tool. In this configuration, it is quite possible that the IMS does not report any useful SMART parameters. And as the system shows issues, it is certainly best to swap all drives, given their age. You may still identify the problematic drive later if you wish.

              • #8
                Originally posted by IoannisSyrogiannis
                Marcel, just so I understand, are you suggesting simply removing one drive at a time?
                It makes sense, but it is, as you said, time consuming. After re-inserting a drive whose removal made no difference, one also has to let the array resynchronise.
                I was wondering if there is something to watch for in the logs, the status or in the behaviour.
                Previous doremi servers had a check that would come back with each drive's speed. IMS1000 leaves you all alone there.
                And I hit the RAID controller wall without having the knowledge (or resources) to reach the drives. (Unless I take them out and check on a PC, one by one.)


                That's why I was wondering if someone would have an idea (or experience) on how to pinpoint the drive for me to remove.

                My intention is to change the whole array. But I would prefer, in the meanwhile, to remove the "evil doer", so to reduce the frame underflow occurrences.
                Creating a ticket is always an option, but I was hoping for a readily available option.
                I haven't had a close call with an IMS1000 for a while and I was under the assumption Doremi put in a 100% software RAID in those too, but I'm wrong. They put one of those dreaded Marvell RAID-on-a-Chip hybrid contraptions on there. So, there is probably nothing descriptive in the Linux system logs and anything useful has to be gotten from the RAID controller log, if this thing even keeps one. I've been looking if there is a Linux utility to interface with the RAID chipset, but I can only find a Windows utility...

                Like Carsten indicated, you can still analyze the drives one-by-one on an external machine, but that would still require you to rebuild the array every time you re-insert a disk into the array. Depending on your budget, I'd rather change the drives, since time usually also is money and although most of the work can be done unattended, you still need access to the location.

                • #9
                  Ah, I still remember that the very first IMS1000 units had notorious failures with the RAID controllers. At least in the corner of Europe where I was residing at the time, before I came to this one.
                  I was Away From Keyboard, and I didn't fill you in about my reaching out to Dolby.
                  They were mostly concerned about an alert caused by the RAID controller and suggested reseating it.
                  I will follow their advice and do so tomorrow.

                  The replacement of the drives is already put in motion. Thank you all for motivating me, but as I explained on the original post, that was my intention already.
                  Yet, if I could find the culprit in time, I would have removed it.

                  For the time being, they were swapped with those of another IMS1000, for the sake of testing and observing.
                  The "other" server is not screening anything and is set to play a movie in a loop until tomorrow. It has been doing so, without an iota of complaint, for the last few hours.

                  • #10
                    The detailed report contains more information about disk health, but you have to look for it manually, or open a case on the Dolby portal (if you or your technician are registered).

                    • #11
                      Yes, as I wrote, I did open a case.
                      Yet, my search of the logs didn't quite bear fruit. I didn't find anything in the kernel logs, for instance.
                      Just info I couldn't interpret into anything of value.
                      The case handler wasn't much concerned with the drives, though.

                      Would you care to point me in the right direction to find more, Elia?
                      (And, by the way, did you find that RAID controller you were looking for?)

                      • #12
                        Originally posted by IoannisSyrogiannis
                        The case handler wasn't much concerned with the drives, though. Would you care to point me to the right direction to find more, Elia?
                        If Dolby is not concerned with drives, probably my help would not be very useful. But if you send me the report (or the share link from log analyzer), I'll show you what I would look for.
                        Originally posted by IoannisSyrogiannis
                        And, by the way, did you find that RAID controller you were looking for?
                        No, but I finally obtained a Return & Repair process that, at very reasonable cost, repaired the IMS1000, making it an IMS2000.

                        • #13
                          I am, to say the least, curious to find out where the info resides in the log and info package.
                          I sent you the link. You can download the file from there as well.

                          • #14
                            It's kind of a waste of time, to be honest. The drives are not that expensive, and if you have one drive failing the others are probably not too far behind. Just change all of them. It will help avoid the same problems in the near future. At important sites I changed them every 5 years. One lost or refunded show probably costs as much as 6 sets of new drives!

                            • #15
                              Mark, I am onto replacing all the drives. I didn't think much about it.
                              Yet, what I was asking doesn't have to do with the decision whether to change the drives or not, but with removing the one that fails, so as to reduce the chance of frame underflows.
                              Elia was very helpful and pointed out to me that in drmreport.txt (taken out of the diagnostic package) there are more SMART data than what is seen on the WUI.
                              So, the same drive whose picture I shared previously had the following info:
                              Code:
                              196 Reallocated_Event_Count 0x0032 100 100 000 - - - 12
                              It's unfortunate that that information is not immediately available on the SMART section of the Web User Interface for the user, contrary to the User Interface of DSS servers of Dolby. That (Dolby) helped a lot on troubleshooting HDDs.
                              I know that when IMS1000 came out, Doremi was not yet sold, but I would welcome an update for all IMSs to include that specific information and have it readily available. The online log analyser as well.
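
The manual search through drmreport.txt described above could be sketched in Python as follows. It assumes that every SMART attribute in the report is laid out like the single line quoted above (ID first, name second, raw value last); the real report may differ. `flag_attributes` and the list of watched IDs are hypothetical helpers for illustration only.

```python
# SMART attribute IDs that conventionally relate to failing media or cabling:
# 5 reallocated sectors, 196 reallocation events, 197 pending sectors,
# 198 offline uncorrectable, 199 UDMA CRC errors.
WATCHED_IDS = (5, 196, 197, 198, 199)


def flag_attributes(text, watched=WATCHED_IDS):
    """Return (id, name, raw) for watched attributes with a non-zero raw value.

    Assumes each attribute line starts with a numeric ID, followed by the
    attribute name, and ends with the numeric raw value.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that does not look like an attribute line.
        if len(fields) < 4 or not fields[0].isdigit() or not fields[-1].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], int(fields[-1])
        if attr_id in watched and raw > 0:
            hits.append((attr_id, name, raw))
    return hits


if __name__ == "__main__":
    sample = "196 Reallocated_Event_Count 0x0032 100 100 000 - - - 12"
    print(flag_attributes(sample))
```

The sketch does not know which drive a line belongs to; that still has to be read from the surrounding section of the report.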

                              In any case, the log shows which command brings up the SMART info:
                              Code:
                              /doremi/sbin/raidmgr.out --status --smart
                              It outputs the info needed, but it is available to the root user only. Not even admin.

                              Thank you all for your help. Especially Elia.
