On Friday we had a short service outage. The details...
Friday
Our main guide database system consists of 5 servers: a failover pair of database servers, a redundant pair of API front-end servers, and a server that does the guide import and channel matching (non-real-time operations).
Rather than take down the live system for hardware replacement/upgrades, we decided to set up a new cabinet at a new datacenter in Dallas and make that the master when everything was ready. The old cabinet in the old datacenter in California could then be rebuilt and become the failover for Dallas.
That is what we did. Dallas was correctly replicating from California and we thoroughly tested the 5-server setup there. It was delivering the live replicated data in our testing - everything was working perfectly.
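For illustration, here is roughly what checking replication health looks like on a MySQL-style setup (super-read-only mode, mentioned later, is a MySQL feature) using the mysql-connector-python driver. The hostnames, credentials, and names below are placeholders, not our real ones.

    import mysql.connector

    # Placeholder connection details for the Dallas replica.
    conn = mysql.connector.connect(host="dallas-db", user="ops", password="...")
    cur = conn.cursor(dictionary=True)

    # On newer MySQL versions this is SHOW REPLICA STATUS instead.
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    print(status["Slave_IO_Running"], status["Slave_SQL_Running"])
    print("seconds behind master:", status["Seconds_Behind_Master"])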
We flipped the switch to make Dallas primary. It should have been seamless.
The first requests came in and were handled correctly; then things went to shit. The database server was using 100% of all cores and the load average was approaching four digits. Requests were executing, but the backlog of work kept growing and growing.
In the process of checking everything we noticed that the indexes were present and correct, but the database wasn't reporting cardinality numbers in the index information. Without those statistics the query planner effectively stops using the indexes, which explained the symptom. This was going to take some research, and things were down for customers, so we switched back to California. The switch back meant that Dallas had to be restored from backup, and since I had flown out to the datacenter I couldn't start on the research right away.
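The missing numbers show up in the Cardinality column of SHOW INDEX. A rough sketch of eyeballing it, under the same MySQL assumption as above ('guide' and 'lineup' are placeholder names):

    import mysql.connector

    conn = mysql.connector.connect(host="dallas-db", user="ops",
                                   password="...", database="guide")  # placeholders
    cur = conn.cursor(dictionary=True)

    cur.execute("SHOW INDEX FROM lineup")  # placeholder table name
    for row in cur.fetchall():
        # Healthy indexes report an approximate row-count estimate here;
        # on the broken server every index showed 0 (or NULL).
        print(row["Key_name"], row["Column_name"], row["Cardinality"])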
Saturday
Dallas was restored from backup, reconnected to California as a replica, and verified to be healthy again.
Now knowing what to look for in the index information, we confirmed the same situation had repeated. The indexes were there, but the table stats used by the database to select which index to use when executing a query were all 0. Interestingly, these stats were not updated by the database automatically after the restore completed. This is scary because restoring a database from backup is common when moving servers, and we expected the database to "just work" after a restore. The database was running as a normal standalone database at this point.
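"What to look for" boils down to every index in the schema reporting zero (or NULL) cardinality. A sweep over information_schema makes that easy to spot (same assumptions and placeholder names as the earlier sketches):

    import mysql.connector

    conn = mysql.connector.connect(host="dallas-db", user="ops", password="...")
    cur = conn.cursor()

    # List every index whose cardinality estimate is missing or zero.
    cur.execute(
        "SELECT TABLE_NAME, INDEX_NAME, CARDINALITY "
        "FROM information_schema.STATISTICS "
        "WHERE TABLE_SCHEMA = 'guide' "  # placeholder schema name
        "  AND (CARDINALITY IS NULL OR CARDINALITY = 0)"
    )
    for table, index, cardinality in cur.fetchall():
        print(f"{table}.{index}: cardinality={cardinality}")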
The database was then configured to be a replica of California and put in super-read-only mode (our policy for replicas). Once replication caught up we ran a data verification check to make sure the Dallas database contained the same data as California.
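For reference, that setup amounts to pointing Dallas at California and then locking it down. A rough sketch under the same MySQL assumption (the hostname and credentials are placeholders, and MASTER_AUTO_POSITION assumes GTID-based replication):

    import mysql.connector

    conn = mysql.connector.connect(host="dallas-db", user="admin", password="...")
    cur = conn.cursor()

    # Point the Dallas server at California and start replicating.
    cur.execute(
        "CHANGE MASTER TO "
        "  MASTER_HOST = 'california-db', "  # placeholder hostname
        "  MASTER_USER = 'repl', "
        "  MASTER_PASSWORD = '...', "
        "  MASTER_AUTO_POSITION = 1"         # assumes GTID-based replication
    )
    cur.execute("START SLAVE")

    # Replica policy: block writes, even from accounts with SUPER.
    cur.execute("SET GLOBAL super_read_only = ON")

Catching up is then just watching Seconds_Behind_Master drop to 0 in the status check shown earlier.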
After researching how the table/index stats work, we tried asking the database to recompute them; however, this failed because the database was in super-read-only mode.
Taking the replica out of super-read-only mode and manually asking the database to recompute the stats worked.
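Concretely, under the same MySQL assumption, the recompute request is ANALYZE TABLE, and the sequence looks roughly like this ('guide' is again a placeholder schema name):

    import mysql.connector

    conn = mysql.connector.connect(host="dallas-db", user="admin", password="...")
    cur = conn.cursor()

    # Find every base table in the schema.
    cur.execute(
        "SELECT TABLE_NAME FROM information_schema.TABLES "
        "WHERE TABLE_SCHEMA = 'guide' AND TABLE_TYPE = 'BASE TABLE'"
    )
    tables = [row[0] for row in cur.fetchall()]

    # ANALYZE TABLE writes to the stats tables, so it fails while
    # super_read_only is ON; lift it, recompute, then lock back down.
    cur.execute("SET GLOBAL super_read_only = OFF")
    for table in tables:
        cur.execute(f"ANALYZE TABLE `{table}`")
        cur.fetchall()  # ANALYZE TABLE returns a small status result set
    cur.execute("SET GLOBAL super_read_only = ON")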
To further test we created a tunnel between the datacenters and redirected some types of requests to Dallas. No problems.
We flipped the switch to make Dallas primary. This time it was seamless as intended.
The California hardware upgrade has been completed, and that cabinet will soon become the failover for Dallas.