20/03/2018
How bad can it get!
OK, here is a quick breakdown of what happened:
• Around 9.30pm Monday evening the server dropped. This was before the nightly backups had completed.
• Called and came on site at 8am; found the server had a hardware failure (the BIOS had corrupted).
• Called Dell to troubleshoot and try to work around the issue, to no avail, so a tech and parts were dispatched. Server back online at around 4pm.
• During the server fix, found that one of the hard drives had failed, so that part was ordered.
• Wed AM: the server had been working OK, but the minute it went under load, the medserver and then the rest of the servers ceased to respond.
• Came in at 8am and found a RAID card error on the screen. Rebooted and the system came back up for 20 mins. From this point on it went downhill.
• Contacted Dell with a picture of the error and asked for advice. On subsequent reboots the error did not recur.
• Found that as soon as the medserver was started, the entire system would crash within 5 mins or so.
• Spent the next few hours on the phone with Dell trying to resolve the issue, thinking it was a firmware or software problem.
• Replacement drive arrived and I installed it; however, this seemed to make the system even less responsive.
• Kept working with Dell to resolve it until the hardware tech arrived. He immediately noticed other issues and advised that the RAID card needed to be replaced. I also wanted the backplane replaced, as I felt there was still an issue with it (this is the hardware that runs the hard drives and controls the data throughput).
• Further investigation showed that the performance of the drives was very degraded. Spent the next little while working with Dell to arrange new parts.
• Later found out that the only available RAID card was in Sydney and was not due until mid Thursday.
• Contacted the previous Dell tech and, by a series of very fortunate events, he was at the Dell depot at midnight and was able to send the RAID card he had swapped earlier. This was on site at 1.30am 15/3.
• Replaced the backplane and hard drive and spun the server back up. Drive performance was back to normal and the server, even under load, did not have any hardware issues.
• I then began fixing the servers. Because of the number of crashes, the main AD server (the one that runs logons etc.) had begun to boot loop and would not start. I restored the OS drive and got it working again.
• Dell tech left around 2.30am.
• Began working on corruption issues on the BP server. By 3.30am it was logging in and all services had started, but found that BP would not load.
• At this point I put down tools and got a few hours' sleep.
• 6.30am: began looking at BP and found the patient database was not connected. Tried a few things to bring it online, but to no avail. Contacted BP support and handed over to them.
• Tier 2 support began working on a fix to bring the system online; this was done by around 10am. This is where the next issue with BP occurred: the database and the patient anchors were not in sync. Once we discovered this I contacted BP, and they eventually got a tier 2 tech to look at it. He reset the anchors, and by late this afternoon the system was working properly.
• During this time there were also some issues with Pracsoft. These seem to have resolved themselves, but I need to do further work to check that.
35 hrs later, all is working as it should.