srcf-status
SRCF status
114 posts
Service status updates for the Student-Run Computing Facility
srcf-status · 2 months ago
Services downtime, 2025-02-22
Overnight system updates seem to have caused disk access and some processes on sinkhole (a.k.a. webserver.srcf.net) to lock up from around 6:30am. Investigation ongoing...
Update 10:30am: After some failed attempts to unwedge or diagnose it, sinkhole has been rebooted.
Update 10:45am: Looks like this is an issue with shared disk space (i.e. user/group home directories) available on all user-facing machines -- shell.srcf.net and doom.srcf.net are experiencing similar downtime.
Update 3:20pm: Services were fully restored at 2:00pm. It seems overnight updates caused pip (a.k.a. shell.srcf.net) to lock up the shared disk space (exactly how remains unclear), in turn locking up several other machines. Rebooting pip fixed the problem.
srcf-status · 5 months ago
Webserver downtime, 2024-12-18
The SRCF webserver sinkhole (a.k.a. webserver.srcf.net) froze from around 4:30am this morning, taking out all web hosting and remote access. The freeze was noticed just after 8am, and the server was rebooted a little later after some failed attempts to thaw it.
The server itself was back online by 9am, though some user services timed out and failed to start -- these were caught and retried by 10am, at which point service should have resumed as normal.
Stay warm out there this holiday season! ☃
srcf-status · 8 months ago
Brief network disconnection, 2024-09-25 03:00-03:30
Our services will experience a brief disconnection from the outside world from around 3am (BST) on Wednesday 25 September, due to essential maintenance on the University network point-of-presence (PoP) switch connecting us and Cambridge SU to the outside world.
For a period of 10-15 minutes, SRCF services will not be available and systems will have no internet connectivity. We expect no lasting disruption to SRCF infrastructure, and web and email services in particular should recover by themselves after the disconnection.
srcf-status · 1 year ago
srcf.ucam.org domains were temporarily nonexistent, 2023-11-10
The ucam.org domain (link may not work), under which srcf.ucam.org exists, temporarily disappeared from the Domain Name System today. During that time, we regret that emails to @srcf.ucam.org addresses will have bounced (i.e. delivery failures reported to the sender), and websites were not accessible via the www.srcf.ucam.org redirect service for older accounts with that feature.
srcf-status · 2 years ago
webserver.srcf.net systemd services not launched, 2023-10-14
The SRCF web server sinkhole / webserver.srcf.net was rebooted during our scheduled vulnerable period on Saturday 14th October, but a handful of users' systemd services were not launched when the server booted back up. This is likely due to the server startup timing out and systemd giving up on launching the remaining user tasks.
If you were affected, attempts to control existing services with systemctl would have resulted in "Failed to connect to bus" errors.
Unfortunately, due to the small number of accounts affected, this wasn't noticed until 9 days later, with the remaining tasks launched around 11:45pm on Monday 23rd October. All user services and service management should be back to normal now.
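For anyone hitting something similar in future: you can check whether your own systemd user manager is running, and start any units that did not come back, with standard systemctl commands. A rough sketch follows (the unit name below is a placeholder for your own service):

    # If the user manager itself is down, this is the command that reports
    # "Failed to connect to bus"; once the manager is running it shows its status.
    systemctl --user status

    # List your user services once the manager is up.
    systemctl --user list-units --type=service

    # Start a specific service that did not come back (placeholder name).
    systemctl --user start mysite.service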
srcf-status · 2 years ago
Ancillary services offline, 2023-10-18 16:05 onwards
Due to a loss of power at the West Cambridge Data Centre (WCDC), some non-user-facing services, including backup storage and one of our monitoring systems, have gone down.
These services have "frozen" because their backing storage disks are hosted on systems in the WCDC, even though the services run in another building. A number of University services we depend on will also be running at reduced capacity or resiliency, or be offline altogether.
In addition, the sysadmins' primary shared mailbox is unavailable, so we will have to manually check a secondary mailbox for important messages. We might not be receiving email sent to the sysadmins at all, though, if lists.cam.ac.uk is itself unavailable.
Should a separate calamity now befall us, our ability to handle it would be limited until the WCDC recovers. Updates follow.
17:14 - We are monitoring the situation and will update this post accordingly.
17:22 - University Information Services (UIS) report that "We have experienced a power outage in one of our data centres, which has disrupted multiple University systems. An engineer is en route. We [UIS] will provide an update by 18:30."
19:03 - UIS reported at 18:25 that "An engineer is on site and investigating the power issues at the West Cambridge Data Centre. University services continue to be vulnerable to further outages." Next update from UIS expected by 19:30.
19:27 - the UIS update is the same as before. Next update by 20:30.
20:30 - the UIS update is unchanged, with nothing further expected before 08:30 tomorrow morning.
2023-10-19 08:45 - UIS continued to work to restore service; at this point our servers should have had power restored, but VMs required manual intervention.
18:00 - Our VM management server and sysadmins' mailbox were restored, with mail queued upstream slowly trickling in.
srcf-status · 2 years ago
Mailman delivery delays, 2023-09-18 and 2023-09-19
The queue processor for Mailman, which runs user and group account mailing lists, quietly became stuck and stopped handling incoming emails. This meant emails were being accepted by our mail server but not being processed.
The logs suggest the problems started around 8am on Monday 18th, with messages backing up until 7pm on Tuesday 19th when the stuck runner was noticed and restarted.
Queued messages were all released together, initially hitting the sending limits of our upstream email relay ppsw, so some of these messages have been deferred and may take a few hours before they make it through.
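For the curious, the sort of checks involved look roughly like the following; this is a sketch assuming a Debian-style Mailman 2 installation, and the paths may differ on other setups:

    # Count messages waiting in Mailman's incoming queue (path varies by install).
    ls /var/lib/mailman/qfiles/in | wc -l

    # Check that the queue runner processes are alive.
    pgrep -af qrunner

    # Restart all queue runners so the backlog starts draining (path varies by install).
    /usr/lib/mailman/bin/mailmanctl restart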
srcf-status · 2 years ago
Poor website performance, Friday 2023-09-15
The SRCF webserver sinkhole was seeing a large number of incoming requests from various IP addresses and servers of a particular cloud provider, likely being used for a denial-of-service attack; this caused performance to drop significantly as the machine became overloaded.
Alerts started at around 4am BST; initial attempts to block problematic IP ranges from making requests were made at 10:30am, but performance continued to vary until about 7pm as the blocking was adjusted.
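For context, blocking of this kind is typically done at the firewall. Below is a minimal sketch of one common approach; the address range is a documentation placeholder rather than the actual source of the traffic, and this is illustrative rather than a record of the exact rules we used:

    # Drop web traffic from an offending range (example range only).
    iptables -I INPUT -s 203.0.113.0/24 -p tcp -m multiport --dports 80,443 -j DROP

    # Review the rules, and remove the block once the traffic subsides.
    iptables -L INPUT -n --line-numbers
    iptables -D INPUT -s 203.0.113.0/24 -p tcp -m multiport --dports 80,443 -j DROP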
srcf-status · 2 years ago
Total service outage, 2023-02-05 01:58 to 11:20
The SRCF experienced a total outage of its main server cluster ("thunder"), which our monitoring systems noticed from 01:58 onwards tonight.
Real-time updates from the investigation follow:
02:25 -- corrected the year in the title (it's 2023 now!). Signs point to this being a networking failure, either in our upstream network connection to the outside world or in an intermediate network switch that we rely on for this connection. A physical visit to the datacentre would be necessary to confirm this, which we can conduct in the morning.
11:57 -- we sent someone on site and discovered that a single electrical circuit breaker (technically an RCBO) had tripped. The intermediate switch mentioned at 02:25, which carries our network connection, had a single electrical feed on that circuit, causing the disruption to our connectivity. We have moved this switch over to the alternate power feed, and services have been reachable again since 11:20.
We will continue to monitor the situation remotely and are liaising with building services to resolve any electrical issues. There are opportunities to improve redundancy of power feeds and network uplinks, to eliminate them as single points of failure, which we aim to pursue in due course.
srcf-status · 3 years ago
Recovering from power outage, 2022-08-11
Due to a power outage on the West Cambridge site, some of our auxiliary services (those listed in the earlier post) suffered sudden power loss. Other services, including core services like user files, shell access, email and websites, managed to keep going on battery- or generator-powered backup supplies.
We have since been able to restart downed services after the resumption of power to the site, and we will continue to monitor for any lingering issues.
srcf-status · 3 years ago
UPDATE: Power outage - ALL services at risk, 2022-08-11
We've learned that the power outage reported earlier may affect the entirety of the West Cambridge site, which physically contains all of our servers and services. Our earlier at-risk warning now applies to the entire SRCF.
srcf-status · 3 years ago
Power outage - some services at risk, 2022-08-11
We've been informed of a mains power outage at the host site for some of our servers. If battery backup power runs out before mains power is restored, then some of our auxiliary services will be unavailable:
* Sysadmins' primary mailbox store (so we may take longer to see mail you send us)
* Realtime monitoring/probing host (so we will have less oversight of the status of the rest of our infrastructure)
* One IRC network node (so the IRC network will have less resiliency, in particular with one remaining node inside the UDN)
* Graphical VM management host (so we may have to resort to stone-age command-line methods to manage our fleet of servers...)
srcf-status · 3 years ago
UK heatwave watch, Mon 18/Tue 19 Jul 2022
Temperatures over much of England are forecast to reach 40 °C in a few days’ time, on Monday 18 and Tuesday 19 July. We may need to SCRAM some or all of our servers if cooling systems fail as a result.
Cambridge currently holds the UK temperature record of 38.7 °C, so there’s every chance that we will see daytime temperatures peak around 40 degrees! This will put strain on air conditioning/cooling systems, which are critical to prevent our servers from overheating. With elevated outside temperatures that they were not necessarily designed to cope with, there’s a higher risk that cooling systems may fail or underperform.
We’ll be actively monitoring our servers’ temperature sensors for any indications of cooling system failure or underperformance. If necessary, we’ll shut down some servers to reduce thermal load and transfer running services to a smaller set of physical servers, which may mean our services are ‘at-risk’ due to lack of redundancy. If a cooling system gives out completely at one of our server locations, we’ll be forced to shut down those servers immediately to prevent equipment damage or data loss.
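For the curious, server temperature sensors can generally be read with standard tools. A minimal sketch, assuming lm-sensors is installed and the hardware exposes IPMI sensors (exact sensor names vary by vendor):

    # Read onboard temperature sensors via lm-sensors.
    sensors

    # Read chassis and ambient temperature sensors over IPMI.
    ipmitool sdr type temperature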
Finally: take care for yourselves! Prepare hydration, stay out of the sun and be aware of heat exhaustion or heatstroke.
srcf-status · 3 years ago
Full SRCF outage, 2022-04-09 (times TBC)
The SRCF will be fully shut down on Saturday 9th April to facilitate a physical relocation of our file storage system.
We cannot yet indicate what time of day this will take place, but will endeavour to provide updates on the day.
Updates:
* 13:00: Started powering down user-facing servers.
* 13:45: Powered down NetApp file storage servers.
* 19:45: Restored NetApps in new datacentre.
* 20:15: Started powering up user-facing servers.
* 20:30: Normal service resumed.
srcf-status · 3 years ago
Network outage, 2022-04-08 09:00 to 12:00
The SRCF will be unreachable while our connection to the University network and the Internet is physically relocated. This will take place tomorrow (Friday 8th April) from 9am.
It may take until 12 noon to restore connectivity to the SRCF, although if things go smoothly then it will be back sooner.
Update: Connectivity was disrupted at 9:13am and restored at 10:28am.
srcf-status · 3 years ago
Email from SRCF addresses blocked by Gmail for the time being, 2022-03-04 onwards
Email messages sent from the SRCF's main email domain, @srcf.net, have started to be blocked by Gmail, leading to delivery failures and bounce messages when trying to reach recipients @gmail.com.
Meanwhile, we are observing that messages from @srcf.ucam.org have a chance of going to spam if not sent via SRCF Hades or the SRCF mail submission service.
These problems (mainly the blocking) are unlikely to go away on their own, and we are actively considering options on how to resolve them.
Message edited 15:23 to clarify that srcf.ucam.org mail is not blocked, but still not always deliverable to the inbox.
We are currently assessing our options. Given how strict Google chooses to be about email authentication measures, it is almost certain that the SRCF will have to introduce email sender authentication policies, such as SPF or DMARC, urgently, should we wish for @gmail.com addresses to be reachable when using an @srcf sender address.
These measures would not come lightly, as they could impact the current freedom of our users to use their @srcf email addresses to send from anywhere on the Internet, without relying on the SRCF's own email submission service. (Admittedly, this is a freedom that we currently believe a slim minority are actively enjoying.)
Opinions are sought from users who wish to have a say in what we do next -- by email to the sysadmins, or on our bridged chat platforms as detailed on our website. Given Google's strong position of leverage as a massive mailbox host, we inevitably do want to ensure that @gmail.com addresses remain deliverable from our mail domains, but our approach to achieving that is still up for debate.
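For readers unfamiliar with these mechanisms: SPF and DMARC policies are published as DNS TXT records on the sending domain. The sketch below is purely illustrative, with hypothetical hostnames and policy values rather than any decided SRCF configuration:

    # Hypothetical example records (not an actual SRCF policy):
    #   srcf.net.        IN TXT "v=spf1 a:mail.srcf.net ~all"
    #   _dmarc.srcf.net. IN TXT "v=DMARC1; p=quarantine; rua=mailto:postmaster@srcf.net"
    # Published records can be inspected from any machine with dig:
    dig +short TXT srcf.net
    dig +short TXT _dmarc.srcf.net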
srcf-status · 3 years ago
Hades mailboxes were frozen, 2022-01-08 22:59 to 23:33
The Hades mail store temporarily ran out of disk space, starting at 22:59 until the mail store volume was enlarged at 23:33.
During this time, IMAP operations (e.g. marking mail as read, moving it to a different folder) would have failed. Our IMAP server provided a read-only view of mailboxes as a fallback. New mail destined for Hades mailboxes queued up safely on our systems, and the few messages that were held up have now been delivered.
We were able to resolve this problem quickly by increasing the size of the NetApp volume for the Hades mail store, helped by the fact that a sysadmin was affected and noticed the problem quickly! Going forward we will ensure that monitoring and alerting are in place for the mail store volume to prevent a space limit being reached again.
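As a rough illustration of the sort of check we mean, a minimal sketch is below; the mount point and threshold are placeholders, and any real alerting would hook into our existing monitoring rather than mailing root from a cron job:

    #!/bin/sh
    # Warn when the (placeholder) mail store mount passes 90% usage.
    THRESHOLD=90
    MOUNT=/var/mail-store    # placeholder path
    USED=$(df --output=pcent "$MOUNT" | tail -n 1 | tr -dc '0-9')
    if [ "$USED" -ge "$THRESHOLD" ]; then
        echo "Mail store at ${USED}% on $(hostname)" | mail -s "Disk space warning: $MOUNT" root
    fi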