Case's Ladder

Infrastructure Overhaul 2002

Prepared by Case

Introduction

As many of you are aware, we have been having problems with the site being slow during peak periods over the last several months. We have been working on a variety of solutions to help speed up the site. To some degree, we have been successful. To a large degree, we have not. I have spent hundreds of hours over the past sixty days figuring out the best way to transition from what we have to where we need to go.

This short article will attempt to fill you in on those details. I hope that it will satisfy the curiosity that many of our users have, and show many of you that we have been working hard on these issues and are taking the steps needed to correct them.

The Problem

In one word: growth. Case's Ladder has experienced tremendous growth in the last six months, faster than at any other time in our history. Our user base, matches played, and tournaments hosted have all been skyrocketing! While this can be considered a good thing, it is hard for a small company to budget resources for growth that 95% of the time isn't going to happen. We simply were not ready to handle the surge.

Just for the record, we are talking about a great deal of traffic. We receive over two million page views a day. Not hits, page views. A page view is a complete page, including the images and banners. If you're talking about hits, we do over five million per day! I point this out because some people think we're talking about 10,000 hits per day or so.
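To put those daily totals in perspective, here is a rough back-of-the-envelope sketch. It assumes traffic is spread evenly over 24 hours, which it certainly is not (peak periods will be well above these averages), and the function name is just for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_second(daily_total):
    """Average rate per second, assuming traffic is spread evenly over the day."""
    return daily_total / SECONDS_PER_DAY

page_views_per_day = 2_000_000  # "over two million page views a day"
hits_per_day = 5_000_000        # "over five million" hits per day

print(round(per_second(page_views_per_day), 1))  # average page views per second
print(round(per_second(hits_per_day), 1))        # average hits per second
print(hits_per_day / page_views_per_day)         # average hits per page view
```

Even as a flat average, that is roughly 23 page views and 58 hits every second, around the clock.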

When the ladder first started over six long years ago, we had one server. Everything ran from this. Then we grew and the site started to slow down. So we added a second server (named CGI). When players posted a match result, it was done on the CGI server instead of the main site. This kept the pages loading fast and matches posting smoothly.

Then we grew some more. The main site was slow again. We took our most heavily used pages from the web site, and put them onto their own server (WWW2). The main site was fast again!

After this we launched tournaments. Since they were based on a completely different set of programs and were being maintained by a new programmer, we put them on WWW2. That way our programmer could experiment with tournaments without slowing down match reporting for regular ladder games. We grew some more.

Not much changed from the above layout for a long time. We added some extra servers to handle additional functions (such as moving Find Player to WWW3). The general approach was to install faster servers for CGI and WWW2 when things started to slow down. There is a limit to how often you can do this, and we recently ran into it.

Here is a simple picture showing how Case's works today. This omits some of our servers and is just to give you an idea of how things work (it isn't meant to be 100% technically accurate):

The colored lines in the above picture represent database connections. In general, all of our servers need to talk with the CGI database. It contains all the player records, match results, Gold histories, staff management tools, ladder leaderboard information, Hall of Fame, etc. You can't get information on players without the CGI server being involved in some manner.

Tournaments are in a similar boat. Even though we have three servers helping to serve up tournament pages, they all rely on the information stored in the database on WWW2. When WWW2 slows down too much, the other two machines slow down too. In addition, each tournament machine sometimes needs to get information from the player databases on CGI.

This all worked fine until we started hitting very high numbers of users. Throwing more and more powerful servers into these spots is not very cost-effective. I'm sure most of you are familiar with this: to get the latest and greatest computer, you pay double the cost of a pretty fast computer that's a few months old. It makes no sense for us to spend $10,000 for extra powerful machines for WWW2 and CGI and then have to buy $15,000 machines in six months. We needed a better solution.

The Solution

So, here's what we have been working on (again, just a fraction of the servers so you get the idea):

As you can see from the above diagram, we are shifting to spreading our ladders and tournaments over many machines instead of only relying on one larger box. We will have a master machine (CGI) that holds certain key data that is common to all the machines (for example, you can find out what server the Spades ladder might be stored on by asking the master).
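As a minimal sketch of the "ask the master" idea, the master only needs to keep a table of which server holds which league. Everything here is hypothetical and for illustration only (the class name, the method names, and server names like "cgi1"); it is not our actual software:

```python
class MasterDirectory:
    """Hypothetical master table: league name -> server that hosts it."""

    def __init__(self):
        self._league_to_server = {}

    def assign(self, league, server):
        """Record (or change) which server hosts a league."""
        self._league_to_server[league] = server

    def server_for(self, league):
        """Return the hosting server, or None if the league is unknown."""
        return self._league_to_server.get(league)


master = MasterDirectory()
master.assign("Spades", "cgi1")   # made-up server names
master.assign("Hearts", "cgi2")

print(master.server_for("Spades"))  # the server a Spades request should go to
```

The point of the design is that the master answers one small question ("where does this league live?") and the league's own server does the rest of the work.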

Each CGI machine will be configured to handle a certain number of leagues. Both ladders and tournaments will run on these servers. This will allow us to make sure leagues run faster and more reliably. For example, companies that we have contracts with to provide ladders might run on their own machine, as might someone who wants to buy a server just for their own league (people have asked!), while the rest of the ladders run on other servers.

We'll be able to monitor the usage on each machine and move leagues around. If, for example, we see that one server is starting to slow down and that one league on that machine has 10,000 players and is running 500 tournaments a day, we can simply transfer that league to a different server that has more capacity.
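Here is a rough sketch of how that decision could be made. It assumes each league's usage can be boiled down to a single "load" number and each server to a single "capacity" number, which is a big simplification; the function name, league names, and figures are all made up for the example:

```python
def pick_move(placement, load, capacity):
    """Suggest one league to move off the most overloaded server.

    placement: league -> server it currently lives on
    load:      league -> load it generates (arbitrary units)
    capacity:  server -> load it can comfortably handle
    Returns (league, from_server, to_server), or None if no move is needed
    or no other server has room for the league.
    """
    # Total load currently on each server.
    totals = {server: 0 for server in capacity}
    for league, server in placement.items():
        totals[server] += load[league]

    overloaded = [s for s in capacity if totals[s] > capacity[s]]
    if not overloaded:
        return None

    # Take the worst offender and its single busiest league.
    src = max(overloaded, key=lambda s: totals[s] - capacity[s])
    league = max((l for l, s in placement.items() if s == src),
                 key=lambda l: load[l])

    # Send it to the other server with the most spare room, if it fits.
    others = [s for s in capacity if s != src]
    if not others:
        return None
    dst = max(others, key=lambda s: capacity[s] - totals[s])
    if capacity[dst] - totals[dst] < load[league]:
        return None
    return league, src, dst


placement = {"Spades": "cgi1", "Hearts": "cgi1", "Chess": "cgi2"}
load = {"Spades": 70, "Hearts": 40, "Chess": 10}
capacity = {"cgi1": 100, "cgi2": 100}

print(pick_move(placement, load, capacity))
```

In this made-up example, cgi1 is over its limit, so the sketch suggests moving its busiest league (Spades) to cgi2, which has plenty of room.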

Another advantage of this setup relates to maintenance and upkeep. With the exception of critical failures, we should have the ability to move ladders from one machine to another when doing system upgrades. Instead of shutting down the entire site, we can just move those ladders to a backup machine. In extreme cases (such as a hard drive failure), only part of the site would be down. We want to avoid downtime, but I think you'll all agree it's better to have 75% of our ladders working than to have them all unavailable.
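To illustrate where a figure like 75% comes from, here is a tiny sketch. It assumes ladders are spread evenly across four machines and one of them fails; the ladder names, server names, and function name are invented for the example:

```python
def available_fraction(placement, failed_servers):
    """Fraction of ladders whose server is still up.

    placement:      ladder -> server it lives on (must not be empty)
    failed_servers: set of servers that are currently down
    """
    up = sum(1 for server in placement.values() if server not in failed_servers)
    return up / len(placement)


# Four ladders spread evenly over four hypothetical machines.
placement = {"Spades": "cgi1", "Hearts": "cgi2", "Chess": "cgi3", "Euchre": "cgi4"}

print(available_fraction(placement, {"cgi3"}))  # one machine down
```

With everything on one big box, the same failure would take the fraction straight to zero.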

The best thing about this design is that it scales very well. If we start seeing lots of traffic, we can simply add another "cheap" machine and move some leagues. It's much easier for us to go out and buy a $2,000 machine on short notice than it is to have to custom order a $10k super machine, then shut down the site and transfer all the data to the new machine.

Another great advantage to this new design is that we avoid having tournament servers talking back and forth. Since the tournaments will be hosted on the same machine as the player database, they will not need to open an external connection to get player information. This should speed things up greatly.

The Timeline

We've been working on a variety of ways to address our load problems. We have finally settled on this design. It's the most work, but it has the best long-term payoff. It's going to take a while to get things working right -- we have already started converting software on our development servers. What we're talking about requires changing literally hundreds of programs that run the site.

Thanks to the support of our premium members over the last few months, we have the resources to tackle a project of this size. This wasn't something we could even consider doing three months ago - we only had two programmers! Now we have four, with one more starting next week.

We're hoping to get the guts of this new software working in a test environment within a week. We're going to be moving all of our hardware to a new hosting facility in the very near future as well. We plan on rolling out the new design when we move into the new hosting company (AT&T).

At the new hosting facility, we will have ten times the bandwidth that we currently have available. This will definitely help to speed up the site. In the long term, though, this new design is going to be the key to growing Case's Ladder.

I thank all of you for your patience and understanding as we have worked on these issues. It's probably going to take a couple of weeks to get all the kinks out once we have moved, but I am sure you will see that it is well worth it! I'm very excited about this new design (in fact, I'm writing this at 2 AM because I wanted to share our progress with you).

Thanks again for your support, and I hope you feel that your membership money is being spent wisely. If you have any feedback, please post in the forums!

Thanks,
Case