Major service interruption update at 7:15 am

At midnight this morning, automated data reconstruction completed successfully on 23 of the 41 volume groups that make up the disk array. The remaining 18 volume groups will be reconstructed in a less automated fashion, as was expected. A process to export the 23 reconstructed volume groups is nearly complete; this export protects those volume groups during additional hardware replacement. The volume groups will then be re-imported while manual data reconstruction runs, after which systems and services will be restarted. There is no completion estimate at this time; however, the Oracle engineers have encountered no serious problems thus far.

We will continue to update you via the NJIT SOS channel.

Major service interruption update and review at 2:05 pm

Yesterday, at approximately 4 am, we experienced a major hardware failure on one of the primary storage disk arrays used by many NJIT applications. We have been working since that time to restore services. As of noon today, approximately 70% of the consistency checks had completed. It is difficult to predict an exact time, but we expect the entire process to continue at least until this evening so we can ensure data integrity.

A full account to date is below.

Problem Description:

Yesterday, at approximately 4 am, we experienced a major hardware failure on one of the primary storage disk arrays used by NJIT applications. This particular disk array provides one or more services to about 70% of academic and administrative applications at NJIT. Normally this type of hardware failure does not cause a disruption of service due to internal redundant components (in fact, systems administrators have exercised this internal redundancy many times in the past to replace failed components with no impact on service). However, this time, due to multiple failed components, the internal redundancy could not handle the hardware failure.

Systems Affected:

Systems affected by this hardware problem include the NJIT Web site, administrative e-mail (ADM), authentication middleware to cloud services (e.g. Moodle, Webmail by Google student e-mail), Highlander Pipeline, AFS academic file system, DFS departmental file system, and a number of ancillary components of the Banner ERP system (e.g. Cognos reporting, Banner job submission, e-print).

Current Status:

At this time, we do not expect all services to be fully restored until later this evening. A more detailed discussion of the restoration process is included below.

The following services have been temporarily restored on alternate hardware and are available:

  • NJIT website
  • Webmail by Google student email
  • Banner Job Submission

Staff are working to restore Moodle in a similar fashion.

Faculty and staff who normally access Internet Native Banner (INB) through Highlander Pipeline may log in at http://bninbts1.njit.edu:9090 (this requires VPN from off-campus). Self-Service Banner is not available.

Fixing the Problem:

Oracle system engineers have been on-site and working remotely with NJIT systems administrators since yesterday morning. Replacing failed hardware components is the simplest part of the solution and Oracle has been proactive in having available an ample supply of replacement parts.

The disk array under discussion totals approximately 54 TB of data split among 224 separate disk drives. The recovery process involves consistency checking and rebuilding parity checks on each of the disk drives. This is necessary to make sure no data has been lost or corrupted, and must be completed on all 224 separate disk drives before the entire disk array can be brought back online.

These consistency checks are a slow process. As of noon today, approximately 70% of the consistency checks had completed. We expect this process to continue at least through the business day, at which time Oracle engineers will remotely run further diagnostics and manually recover any disk drives that have not passed the automated consistency checks. We expect this final review to last into this evening, after which all services can be brought back online. It is difficult to predict an exact time because the consistency checks for each of the 224 separate disks do not require the same amount of time. The good news is that an analysis of log files by Oracle engineers and the results of all consistency checks so far indicate no data loss or data corruption. Maintaining data integrity is the highest priority in this slow process.
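For readers curious about what a parity consistency check involves, the following is a minimal illustrative sketch only. It assumes single-parity XOR striping (RAID-5 style); the actual layout of this disk array and the tools used by the Oracle engineers are not described here, and every function name below is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """A stripe passes the check when the stored parity matches the recomputed parity."""
    return xor_blocks(data_blocks) == parity_block


def rebuild_missing(surviving_blocks, parity_block):
    """Recover the one missing data block from the surviving blocks plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Hypothetical two-byte blocks standing in for the contents of three drives.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Simulate losing the second drive and rebuilding it from the rest.
recovered = rebuild_missing([data[0], data[2]], parity)
```

In this simplified model, every stripe has to be read in full to verify or rebuild it, which gives some sense of why checking a large array takes many hours.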

Continuing Updates:

Continuing updates, approximately every 3 hours or as needed, will be posted at this site. You may also follow updates on Twitter (http://twitter.com/njit), Blogspot (http://njit.blogspot.com), and Facebook (http://www.facebook.com/pages/Newark-NJ/NJIT/7185471825).

You can also contact the IST Computing Helpdesk at (973) 596-2900.

Thank you for your patience.

Major service interruption update at 9:15 am

The procedure to reconstruct data damaged in NJIT’s enterprise disk storage array is taking much longer than initially expected. Fortunately, there are no indications of any data loss, and data integrity is the highest priority. The Gmail authentication service has been restored, and restoring the Moodle authentication service is under investigation.

At this time, the data recovery process that began at around 2:00 pm yesterday is approximately 2/3 complete. Unfortunately, services may remain unavailable for the better part of the day today if not longer.  When there is more specific information available, it will be communicated.

Your understanding is appreciated.

Please continue to visit these sites for the latest: http://ist.njit.edu, http://twitter.com/njit, http://njit.blogspot.com, and http://www.facebook.com/pages/Newark-NJ/NJIT/7185471825. You can also contact the IST Computing Helpdesk at (973) 596-2900.

Major service interruption status update at 10:45 pm

Unfortunately, after careful analysis of the disk and data rebuild progress, and after discussions with Oracle engineers about the safest strategy for the remaining processes, we’ve determined that service restoration might not be completed before 8:00 am Wednesday.

Our highest priority continues to be the integrity of the data. Our work with the Oracle engineers will continue throughout the night, and every effort will be made to restore all services as quickly as possible.

Your understanding is appreciated.

Major service interruption status update at 12:30 pm

Equipment repairs on the enterprise disk storage array are expected to be completed by 1:00 pm. Then the more than 100 systems impacted by this failure will be evaluated and brought up in priority order, with academic-related services being the highest priority. Assuming there are no data loss or corruption issues that would require file restores from backups, the estimated time for service restoration is about 3:30 pm.

Additional status updates will be posted on http://ist.njit.edu, http://twitter.com/njit, and http://www.facebook.com/pages/Newark-NJ/NJIT/7185471825. You can also contact the IST Computing Helpdesk at (973) 596-2900.

Thank you for your patience.

Major service interruption

At about 4:00 am this morning, an enterprise disk storage system failed, which is affecting many critical services, including administrative email, Gmail authentication, AFS and DFS file systems, and Highlander Pipeline. Field engineers are expected on-site by 10:30 am with replacement parts. We expect to know by 11:00 am whether the service interruption will be extended.

Banner production services are available at http://bninbts1.njit.edu:9090/ for NJIT users.

Additional status updates will be posted on http://ist.njit.edu, http://twitter.com/njit, and http://www.facebook.com/pages/Newark-NJ/NJIT/7185471825. You can also contact the IST Computing Helpdesk at (973) 596-2900.

Thank you for your patience.