We reported yesterday that Facebook was down, and again when it finally came back online. Today we can report that Facebook has apologized for the worst outage the site has seen in over four years.
Dan Nystedt of Computerworld reports that the outage was down to a change Facebook made to one of its systems. The company changed a piece of data that is used whenever an automated error check finds invalid data in Facebook’s system. The new piece of data was itself interpreted as invalid, so the system kept trying to replace it with the same piece of data.
This resulted in a feedback loop in which hundreds of thousands of queries per second were sent to a database cluster, which couldn’t cope. As a result, users were met with a DNS error message. Director of software engineering Robert Johnson said: “The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”
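The feedback loop Johnson describes can be roughly illustrated with a few lines of Python. This is only a toy sketch under stated assumptions, not Facebook's actual code: the names `is_valid` and `simulate`, the `"BAD"` marker value and the client counts are all hypothetical and chosen purely for illustration.

```python
def is_valid(value):
    # Hypothetical validity rule: the marker "BAD" stands in for any
    # value that the error check would reject.
    return value is not None and value != "BAD"


def simulate(stored_value, clients, rounds):
    """Count database queries issued while clients check a cached value.

    Each client that sees an invalid cached value queries the database
    and writes the stored value back into the cache.
    """
    cache = {"config": stored_value}
    queries = 0
    for _ in range(rounds):
        for _ in range(clients):
            if not is_valid(cache["config"]):
                queries += 1                    # one query to the database
                cache["config"] = stored_value  # same value comes back
    return queries


print(simulate("GOOD", clients=1000, rounds=5))  # no repair queries needed
print(simulate("BAD", clients=1000, rounds=5))   # every check triggers a query
```

With a valid stored value no client ever queries the database, but once the stored value itself fails the check, every client queries on every pass and the repair never succeeds, which is why the load only stops when traffic is cut off.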
Facebook has had to turn off the automated system to get the site up and running again, even though it plays an important role in protecting the website. The company is now looking at ways to handle the situation differently to prevent it happening again. Johnson added: “We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.”
This latest problem follows the site being down for some users the previous day, which was blamed on a third-party networking provider.