Since last Saturday, our service has suffered several serious full and partial downtimes. The latest one, which ended a short while ago, lasted exactly 10 minutes. We seem to have finally fixed the issues that caused these downtimes, and the service should now be as stable as it was before. If you are interested in what exactly transpired, please read on.
Saturday 14th October
Over the last few days, we've experienced performance anomalies that came with massive spikes in the CPU usage of our servers. While not an immediate threat, issues like these can escalate over time and completely destabilize a system until they are resolved. To learn more about what happened, we added a new profiler to our servers. It was supposed to tell us over time what exactly caused these performance issues, and to let us investigate any performance problems that may arise in the future. As Royal Road is a constantly evolving platform, being able to pinpoint these problems is very important for us; new features we add over time can cause any number of issues, and it's better to learn of them early than late.
The problem was what the profiler actually did: instead of giving us performance insights, it somehow caused file access issues in the servers' file system, most likely by forcibly dismounting the drives. After spending some time trying to fix this, we opted to remove the profiler and try other methods to analyse the issue. The catch was that removing the profiler proved to be extremely hard. Simply removing its files didn't work, most likely because that did nothing about the drives, which had to be re-mounted by Azure, so we needed a complete re-deployment of the site.
Sunday 15th October
After one of our servers became unresponsive (in hindsight, likely an aftereffect of the day before), we uploaded a pending update for the moderator tools we've been working on and tested it in our staging environment. As we found no problems with it, we swapped the update to production. However, the profiler, even though it was disabled, re-installed itself onto the production server during the swap, most likely so it would be ready once re-enabled. Being disabled did not stop the profiler process from crashing the website again, and it kept doing so until we removed its files and re-deployed the website.
Monday 16th October
There was a 3-hour downtime that impacted one of our servers on Monday. This issue was completely outside of our control, and was most likely the combined consequence of the previous day's events and Azure maintenance.
Today (Friday, 20th October)
Today, the profiler struck again while we were uploading a hotfix for some issues with the recently added caching mechanisms. After it crawled out of its well-dug grave, it attempted to attach itself to the site again, taking down our servers. Thanks to our experience with this issue over the last week, we managed to fix it within 10 minutes, and then completely removed all attached Application Insights settings. We've already deployed a small patch since then, and it was swapped without a problem.
This means that we should not see this issue in the future.
Great fun in the discord, every downtime was fixed fairly quickly though.
Thanks for the chapter.
Thanks for the chapter !
Can we get a fix about it?
The Profiler: Gravecrawler
lol. Is it just me or is the profiler acting like a spoiled virus xD.
I only encountered one downtime so lucky me?
First page XD
All hail the Profiler, the Profiler is immortal!
that profiler sounds aggressive... what is it called?
No-one can destroy the Profiler
The Profiler will strike you down with a vicious blow
We are the vanquished foes of the Profiler
We tried to win for why we do not know...
I've read enough stories here to know that the Profiler is a dude reincarnated as an AI and is slightly cranky.
anyways thanks for letting us know.
it's Truck-san's fault. It's always Truck-san's fault in cases like these...
I would be willing to bet Truck-san was responsible for the demise of the Profiler AI's past life. Keep all large trucks out of sight to avoid flashbacks, and maybe introduce it to an MMO.
Thanks for the chapter!
So your attempts to solve the problem resulted in more problems? Sounds like me programming ._.
Sounds like programming in general. But this is less programming and more system administration, not that they weren't tied together.
The Void on
Thanks for your hard work! *bow*
Sorry, accidentally clicked on the submit button when I meant to click something else. -__-
Since I have a comment here, I might as well state that if you're suffering from the problem on October 26th, try clearing your cookies and all your Internet history. Not just your cookies, but everything in your cache, from temporary files, etc. It should help. If I get more info or alternative solutions, I'll post an update.
thanks for the chapter!
Can someone make a novel with this profiler as the MC?
Steven, the gift that keeps on giving.
I'm on it.
Your profiler follows the Dao of Forced Updates.