Incident Reports

We will post information related to service maintenance and disruptions here, as well as major system upgrades or other changes.

Partial Email Service Disruption - Outgoing Messages from CoE Mail Server to Hotmail are Being Blocked
Wednesday, 07 April 2010 22:35
The Hotmail email service is blocking email sent from the College of 
Engineering's server (smtp.engr.ucsb.edu/mail1.engr.ucsb.edu/128.111.53.4) 
to Hotmail accounts.  The blocking started at 8:50pm on Monday, April 5th, 
when a hacked CoE account was used to send thousands of spam messages.  
The blocking is expected to continue through tomorrow.  Anyone who attempted 
to email Hotmail users during this time should have received a "bounce" message 
stating that the message could not be delivered.  Please note that this disruption 
of service only affects OUTGOING email sent to HOTMAIL users from SMTP.ENGR.UCSB.EDU.  
Incoming email from Hotmail is unaffected, as is outgoing email sent from other 
email servers (CS, ECE, Umail).  If you use the College's email server and have an 
urgent need to email a Hotmail user, we suggest that you use a non-CoE account 
(e.g. Gmail, Yahoo) until the problem is resolved.  I apologize for this 
disruption and assure you that the ECI staff is working on getting service restored.  
If you have any questions, please email me.  
Thank you, 
--Richard   
------------------------------------- 
Richard Kip Acting Postmaster, CoE  
Last Updated on Wednesday, 07 April 2010 22:38
 
Unplanned College Network Outage This Morning from 9:43am to 9:58am
Tuesday, 23 March 2010 10:32


At 9:43 this morning (3/23/2010) the College's core router failed.  This severed our connection to Campus and the Internet and severely crippled our internal (within the College) data communications.  The router was brought back online at 9:58am.  We are still performing an analysis of the router and therefore cannot give a reason for the outage. 

 

03-30-10 update: below is the reason for the outage given by Cisco Tech support:

Thanks for the information; we were able to repro the issue locally in our lab and this looks like a new SW bug and we opened a new SW bug for this case.
The bug id is: CSCtf94697, failed SCP transfer can crash router.
The bug is in the NEW stage and our developer team will start working on it shortly to fix it.
I will keep you posted.

 

Last Updated on Tuesday, 30 March 2010 15:24
 
Service Outage (11/23/09)
Monday, 23 November 2009 17:29
Incident Report

Summary
At approximately 11:05am 11/23/09 the LDAP authentication
server known as ldap1 (AKA accounts.engr.ucsb.edu) started
failing in that it no longer served LDAP requests and was
not accessible to the network.  The virtual machine (VM) was
rebooted but did not come back up.  In its stead an older
copy of the VM was brought up and services were restored
at approximately 11:48am.  This incident had a wide impact
as file, mail, and web services were disrupted due to numerous
dependencies of the services involved.

Details of the Incident
ldap1.engr.ucsb.edu stopped providing information services,
and a number of services on other servers that rely on it stopped
working.  File serving from hal1.engr.ucsb.edu was no longer working,
web files were no longer being served from the COE web server, and COE
mail was unavailable because information lookups were failing.  The
outage was prolonged because the failover mechanism of the LDAP
service did not work as expected on various LDAP clients.

The Notification Process
The problem was first noticed by an ECI staff member and shortly
afterwards the automatic alerts confirmed the service outage.  After
the problem was confirmed MSOs and IT staff of the major COE departments
were notified by phone.

Conclusion
The basic cause of this incident was the instability of the VM known
as ldap1.engr.ucsb.edu, but the failure of some LDAP clients to fail
over gracefully exacerbated the problem.  To help remedy this type of
incident in the future, there are plans to put LDAP slave servers on
the major service servers that rely on LDAP information.  This should
help isolate a failure of a single LDAP server so that it does not
cascade to the services that depend on it.
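
Illustration of Client Failover
The following is a minimal sketch of the failover behavior described
above, not the configuration used on the COE servers.  It tries an
ordered list of LDAP servers with a short connection timeout and uses
the first one that answers.  The function names, the 2 second timeout,
the local replica entry, and the hostname ldap0.engr.ucsb.edu are
assumptions made for the example; port 389 is the standard LDAP port.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Server = Tuple[str, int]  # (hostname, port)


def tcp_connect(server: Server, timeout: float) -> bool:
    """Return True if a TCP connection to the server opens within timeout."""
    try:
        with socket.create_connection(server, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def pick_ldap_server(
    servers: Sequence[Server],
    timeout: float = 2.0,
    connect: Callable[[Server, float], bool] = tcp_connect,
) -> Optional[Server]:
    """Return the first reachable server in priority order, or None."""
    for server in servers:
        if connect(server, timeout):
            return server
    return None


# Hypothetical priority list: a local replica first, then the shared servers.
EXAMPLE_SERVERS = [
    ("127.0.0.1", 389),
    ("ldap0.engr.ucsb.edu", 389),
    ("ldap1.engr.ucsb.edu", 389),
]
```

Calling pick_ldap_server(EXAMPLE_SERVERS) returns the first entry that
accepts a connection.  The short timeout is what keeps a hung server
from stalling every lookup, and listing a local replica first is one
way to keep a service running when the shared LDAP server is down.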
 
IMAP service outage Friday, 23 October
Wednesday, 28 October 2009 09:04

Last Friday morning, the imaps service (IMAP over SSL) became unresponsive around 5:30am, preventing some users from accessing their email. This affected only people using imaps, not people using imap with TLS. The problem took some time to diagnose, and service was fully restored around 10:30am.

The cause was traced back to the xinetd program (which listens to various ports and starts services) silently refusing to start the imaps process when connections arrived. Restarting xinetd caused the heretofore unseen problem to go away. This happened a second time with the imap protocol a couple of days after.

The onset of both incidents coincided in time with two other logged anomalies: connections to the ldap0 authentication server being refused, and nfs traffic to the hal1 fileserver timing out (which in itself appeared to be due to failing connections to ldap0).

Examining the condition yesterday morning at the time of the recurring event, it became clear that ldap0 was experiencing a very high load condition causing it to refuse to serve requests. This was due to the scheduled VM guest snapshot being performed on the system at that time, and brings to light a deficiency of the current virtualization technology we are using -- disk I/O on the host operating system, and within the guest, can debilitate the performance of the guest.

We made three changes to alleviate this: we temporarily ceased the VM snapshots, modified the ldap client rollover configuration to provide more timely rollover to backup ldap servers, and configured the monit program on imap.engr.ucsb.edu to watch connections to services provided by xinetd and to restart xinetd if the connections fail.
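
The watch-and-restart logic can be sketched in Python as follows. This is an illustration of the idea only, not the monit configuration in use. Because xinetd was accepting connections without starting the imaps process, the probe waits for the IMAP greeting rather than just checking that the port is open. The function names, the retry count and delay, and the "service xinetd restart" command are assumptions made for the example.

```python
import socket
import ssl
import subprocess
import time
from typing import Callable


def imaps_greets(host: str, port: int = 993, timeout: float = 5.0) -> bool:
    """Return True if the server completes TLS and sends an IMAP greeting."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                greeting = tls.recv(1024)
    except OSError:
        # Covers refused connections, timeouts, and TLS failures.
        return False
    return greeting.startswith((b"* OK", b"* PREAUTH"))


def restart_xinetd() -> None:
    """Restart xinetd; the exact command is an assumption and varies by system."""
    subprocess.run(["service", "xinetd", "restart"], check=True)


def watch_once(
    probe: Callable[[], bool],
    restart: Callable[[], None],
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe up to `attempts` times; restart only if every probe fails.

    Returns True if the service answered, False if a restart was triggered.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    restart()
    return False
```

A scheduler could call watch_once(lambda: imaps_greets("imap.engr.ucsb.edu"), restart_xinetd) every minute or so. Requiring several failed probes before restarting avoids bouncing xinetd on a single slow response.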

Watching the system this morning revealed no further anomalous behavior. 

 
Planned mail and fileserver outage for Tuesday, 20 Oct. 6am-7am PDT
Thursday, 15 October 2009 13:31
A service outage of email and fileserver services is planned for Tuesday, 20 October from 6am-7am PDT. During this period, hardware providing these services will be physically moved from its current location into a new rack.

During this period, access to email and to files provided by the hal1.engr.ucsb.edu fileserver (primarily home directories to instructional labs) will be unavailable.
 
Copyright © 2012 The Regents of the University of California, All Rights Reserved.