One of my DSS200s is constantly verifying the RAID State


  • One of my DSS200s is constantly verifying the RAID State

    Hello Everyone!

    About a week back I had an issue with some of my automation not firing, and I also briefly mentioned another problem that I thought was either related or coincidental. Thankfully I haven't had any automation errors since that post, but I took note of all your advice (thank you to everyone who helped) so that I can follow those steps if it happens again. As for my current problem:

    One of my screens (Theater 1) is not starting its show when it is supposed to. It is scheduled and in schedule mode, but when the start time arrives it never actually starts. Trying to start it manually has no effect: as soon as I go manual and hit play it will turn green and say that it's running but it stays at 00:00:00. Any attempt to jump forward, swap to a different playlist, etc. has no response. The only "fix" is to turn everything off and then back on and then I can start it manually with no problem. After a reboot, the feature that we previously tried to start shows as "stopped". I've looked around and noticed that this theater has been constantly verifying the RAID state. It never says that it is degraded, just that it's verifying, with a percentage that fills up. I don't see any reallocated sectors on any of our 4 drives. I sent the logs in but didn't get any useful information.

    So far, one step I've tried was to pull one of my drives and reinsert it to actually cause the RAID to rebuild instead of just verifying. No luck there.
    After that, I reinstalled the version we run (4.9.1.22) from an ISO disk and went through config and set everything back to the way it was. Still no luck.

    I really have only ever replaced drives if they have reallocated sectors, but with none showing here I haven't replaced any yet. From reading some older forums here I know some people have said that there can still be drives with issues despite not showing reallocated sectors (and vice versa, some drives with those can still work fine) but I'm thinking tonight I'll just go scorched earth and replace all four drives with new ones I bought recently. I'd rather not do that if there are other steps I can take, but I figure if I replace all drives and we STILL have the issue then at least I'll know it's not RAID related. The fact that of our 7 screens, only Theater 1 is constantly verifying makes me think that's the issue.

    Does anyone know what could be going on here? It's strange that it's happening and even stranger to me that after restarting it, I can start stuff manually fine. I don't know what's causing it to have this issue for the first show of the day after turning our equipment on.

  • #2
    What's the vintage of the drives in there?

    Have you logged into the 3Ware RAID BIOS to see whether there is any additional information available in there?



    • #3
      The new drives we put in anytime I have to replace bad ones are Western Digital 1TB enterprise drives, WD1003FBYXs to be exact. So it's either those (although possibly unlikely as I don't think I've replaced a drive in Theater 1 in a while) or I think the old ones we used were Seagate? The next time I have a chance I'll restart and see what it says on boot up if I don't pull the drive beforehand.

      As for the BIOS, I can just go into that on start up, right? What sort of thing would I be looking for? I can try that especially if the next showtime in that theater is empty.



      • #4
        To get into the RAID controller's BIOS, you need to press Alt+3 when you see this during the POST sequence:

        [Image: image.png, the prompt shown during POST]
        As Marcel notes, it might show you if there are any SMART issues that the DSS200 Show Manager software can't read. Can't hurt to take a look.



        • #5
          Sounds good, thank you! I'll poke around in there and see if anything looks off!



          • #6
            As Leo indicated, the 3Ware BIOS is accessible on boot via the Alt+3 keystroke.

            Hard drives are practically consumables. I expect to replace them after anywhere between 3 and 7 years, depending on load and also a little bit of luck with the specific models.

            SMART errors indicate a drive having issues. If you can't identify a problem drive via SMART, it might still be worthwhile to replace all drives in the array one by one. If that doesn't solve the problem, there still could be something wrong with the array itself. The next logical step would be to initialize a new array.



            • #7
              So it'd be worth it to swap drives one by one instead of just replacing all 4 if I don't see any SMART issues indicating a problem on a specific drive?

              Then as a last resort if it's worse than that, is initializing a new array just rebuilding it from scratch (to my understanding, replacing one drive at a time is different from replacing two or more, where all your content is wiped), or is it a more involved process?



              • #8
                If you replace all drives one-by-one, and let the array rebuild after each replaced drive, then you don't need to re-install and all content will stay on the machine. The process is lengthy, as it may take several days, but it's a low-effort thing to do. Keep in mind, though, that during prolonged RAID rebuilding the chances of a second drive failing may increase. In that case, you still need to re-create the RAID and reconfigure settings. Also, if the issue is the array itself, this process will, unfortunately, not solve your problem. But at the least, you can cross a defective disk off the list of possible issues.

                Keep in mind that, in order to re-create the RAID, you need to have the original install disc too. If you don't have that anymore, try to get hold of the ISO and re-create one.



                • #9
                  Originally posted by Nathan Paris
                  ...as I go manual and hit play it will turn green and say that it's running but it stays at 00:00:00. Any attempt to jump forward, swap to a different playlist, etc. has no response. The only "fix" is to turn everything off and then back on and then I can start it manually with no problem.
                  I hope this isn't the case here, but I had this happen several years ago (as in, pre-covid), and the cause turned out to be that the cat862 media block was failing. This was while they could still be returned/exchanged, and this fixed it.

                  Have you tried uploading a log from this server to the Dolby Log Analyzer, and seeing what it tells you?



                  • #10
                    I guess there can be different explanations why the show cannot be started. A failing media block is a plausible one, but the system somehow not being able to access the content due to a RAID issue is also a plausible one. Given the strange behavior of the RAID already reported, I have hopes it's not media block related. The RAID side of things, being mostly off-the-shelf PC server hardware, has more fixable avenues than a failing media block...

                    Nathan mentioned that he sent in the logs and it returned nothing strange, so I guess that he used the Log analyzer, but I'm speaking for Nathan now, let's not do that.



                    • #11
                      I pulled the logs and sent them in to ACE. They told me they didn't see anything wrong with specific drives; all I was told was: "There is no obvious answer why the server won’t start at the first show. There are some early signs of the Cat862 media block failing, but not just yet."

                      I just tried that Dolby Log Analyzer website, but it wouldn't let me upload them and said I had the wrong file type. I just have a folder of extracted logs, but it won't let me upload any of them. Do I need to convert them to a different file type, or are my logs just not compatible?

                      I have an ISO disc labeled 4.9.1.22, which is our DSS200 version. Is that the same as the disc needed to rebuild the RAID or would I need something else for that? I have experience running config from that disc and setting everything up there, but if there's a separate install disc for a RAID specifically then I don't know if I have one of those on hand.



                      • #12
                        Originally posted by Marcel Birgelen
                        Keep in mind that, in order to re-create the RAID, you need to have the original install disc too. If you don't have that anymore, try to get hold of the ISO and re-create one.
                        While I understand what you mean, I would like to explain a bit more for the sake of anyone who reads this thread in the future and misses that info:
                        Dolby (DSS) servers are (mostly) updated from discs that run during the boot sequence. The "update" discs and the "install" discs are not interchangeable.
                        The installation discs are meant to install the system from scratch, while the update discs are meant to apply updates and keep everything else intact. Installation may also take place from Dolby update packages saved on the system.
                        If one changes all the HDDs (at once), one needs to use an installation medium, not an update one.

                        Where Nathan says:
                        Originally posted by Nathan Paris
                        [...]After that, I reinstalled the version we run (4.9.1.22) from an ISO disk and[...]
                        he doesn't specifically explain whether he reinstalled using an installation disc or an "update" one. (Like if one wanted to, say, rebuild the base.)
                        Since he mentions that he has swapped hard drives in the past, one may assume with some confidence that he is using an installation disc.
                        But in any case, one shouldn't start the procedure before one has all the necessary tools available.



                        • #13
                          I can confirm...if you have a server that goes to "Running" but the timeline doesn't advance...it's better than a 90% chance that the CAT862 is nearing death. When you power cycle the server, you are power cycling the CAT862.

                          You can check the SMART logs in the Dolby Logs (unzip them...they are text files). Look in the Devel logs. My rules of thumb are...any ATA errors and the drive is done. More than 10 reallocated sectors and the drive is done. The exception is if I see an advancing reallocated sector count (more each day)...then it is failing right there and should be pulled, with the server run on 3 drives.
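For anyone who wants to script those rules of thumb, here is a rough Python sketch. It assumes you have already pulled the ATA error count and the reallocated sector count for a drive out of the log text by hand or by some other means; the `DriveReading` structure and the `assess_drive` function are made-up names for illustration, not anything from the Dolby software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriveReading:
    """SMART counters for one drive on one day (hypothetical structure)."""
    ata_errors: int
    reallocated_sectors: int


def assess_drive(today: DriveReading,
                 previous: Optional[DriveReading] = None) -> str:
    """Apply the rules of thumb from the post to one drive."""
    # Any ATA errors at all: the drive is done.
    if today.ata_errors > 0:
        return "replace: ATA errors logged"
    # More than 10 reallocated sectors: the drive is done.
    if today.reallocated_sectors > 10:
        return "replace: more than 10 reallocated sectors"
    # A count that grew since the previous reading means the drive is
    # failing right now, even if it is still at 10 or fewer.
    if previous is not None and \
            today.reallocated_sectors > previous.reallocated_sectors:
        return "pull now: reallocated sectors advancing"
    return "ok"
```

Calling `assess_drive(DriveReading(0, 3), DriveReading(0, 2))` flags the drive as advancing, while the same count with no earlier reading to compare against passes as "ok", so it only helps if you keep yesterday's numbers around.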

                          4.9.1.22 is an oldie but a goodie in the 4.9 clan. Most of mine are on 4.9.5.2 with some on 4.9.6.4. I would strongly encourage you to plan on that server's replacement due to the CAT862.

                          If you want to keep with it until it fails...check the BIOS battery...if it goes flat, the server will act up, particularly on boot up (beyond just losing time or forgetting to power on with power...which you'll have to reset when the battery is changed, and possibly tell it to boot from the CD ROM first). Reseat the memory sticks and if you have any DeoxIT, apply the smallest amount possible to their contacts.

                          The X7 and X8 motherboards do fail. Ethernet ports will stop working as well as VGA ports. If you get crashes...that is another sign of a motherboard failing.

                          But, in the end, the CAT862 or CAT745 is what ends up junking the DSS line of servers. I have had good success with the Western Digital HA210 line of drives (1TB or 2TB) as well as Toshibas that are on the Dolby approved list (or the last list that supported box servers, not necessarily the DSS line).



                          • #14
                            What extensions do those diagnostic packages have? If I remember correctly, the DSS200 produces .tgz compressed tarballs as log bundles.

                            You can try to open it yourself using software like 7-Zip. If the file is corrupted, maybe it got corrupted during download. If it's corrupted on disk already, then I guess we've got another smoking gun.
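If you'd rather check the bundle from a script than with 7-Zip, a minimal Python sketch along these lines should do, assuming the bundle really is a gzip-compressed tar file as I remember it. The function names are my own, and it only uses Python's standard `tarfile` module; it reads every file in the archive so that a truncated or damaged member shows up as an error.

```python
import tarfile
import zlib
from typing import List


def list_log_bundle(path: str) -> List[str]:
    """Return the member names in a .tgz bundle, reading each file fully.

    Raises an exception if the archive is not gzip-compressed tar data
    or if any member cannot be read to the end.
    """
    names = []
    with tarfile.open(path, mode="r:gz") as bundle:
        for member in bundle:
            if member.isfile():
                stream = bundle.extractfile(member)
                # Read in 1 MiB chunks; corrupt data raises while reading.
                while stream.read(1024 * 1024):
                    pass
            names.append(member.name)
    return names


def bundle_is_readable(path: str) -> bool:
    """True if the whole bundle could be read without errors."""
    try:
        list_log_bundle(path)
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
    return True
```

If `bundle_is_readable` returns False for a freshly downloaded copy and again for a second download, that points at the file being damaged on the server's disk rather than in transit.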

                            Originally posted by Ioannis Syrogiannis
                            [...] The "update" discs and the "install" discs are not interchangeable. [...]
                            Yeah, I'm calling it the Original Installation Disc. I guess I could not be much clearer. An update disc is something else. Although, you're going to need those too, if that's your preferred way of updating back to a version you're comfortable with.

                            Personally, I've almost never used update discs for the updates, as the DSS200 also offers comfortable, remote updates.

                            Maybe a general disclaimer: If you're not confident in what you're doing, then call someone who is. You're playing with matches: if you do stuff wrong, stuff could burn you.





                            • #15
                              Hey Steve,

                              Thanks for all the info! Honestly I just started purchasing the same drives that I saw here when I first started working, which have always been the Western Digital 1 TB Enterprise SATA drives (WD1003FBYZ), but if those aren't intended for use in the DSS servers then I'll make the shift to the HA210s.

                              I'll check the logs and look for what you mentioned. Where would I see the BIOS battery?

                              Lastly - this has been something that seems up in the air when I look at existing posts, but what is the best way to preserve the life of all of our equipment? Our theater shuts everything down at the end of the night. We switch off our audio processors, our amps, the projector, the DSS200, the automation device, the fans, and then we switch off the breakers for the corresponding equipment so the projector is fully powered down nightly. I've been at this theater since 2021 and we've done it that way, and I inherited that method from previous management who also did the same. It seems to be mostly agreed that the DSS200s should always remain running. If that's what we need to start doing to preserve our media blocks then I'll make that change; I just don't know what effect that has on other things like constant electricity usage, etc.
