Moving the Stendhal server to a new host
TODO: this is just a working document, I haven't finished it yet Kymara 00:45, 11 June 2011 (CEST)
Why migrate? After three years on the existing server, better hardware had become available for the same cost (* link to association website?), and the move puts the server under the ownership of the association rather than of an individual.
We had just one server - the old server - and a new server we wanted to move the data to. We could have shut down the game, copied the data over, and started the game again on the new server, but that would have meant days of downtime, perhaps a week or more. Instead, we used replication and fast open source backup tools, and because the game has asynchronous database access, all this activity caused no lag in the game.
Replication
Replication is a MySQL feature where 'slave' servers maintain copies of data from a 'master' in near real time. Replication starts from an identical copy of the master's existing data. From then on, the slave mirrors every change made on the master by executing every command which the master executes, so the slave can stay just fractions of a second behind the master. The servers don't have to be physically connected - an internet connection is enough - and one master can have many slaves. The load on the master is tiny: the slaves just find out what commands have been run, and they run them too.
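As a rough sketch, setting up a slave looks like this. The host name, replication user and binary log coordinates below are made up for illustration; the real values come from the master's configuration and from the backup used to seed the slave:

```shell
# On the master: binary logging must be on (settings in my.cnf), e.g.
#   log-bin = mysql-bin
#   server-id = 1

# On the slave (hypothetical host/user/coordinates): point it at the
# master and tell it where in the master's binary log to start.
mysql -e "CHANGE MASTER TO
    MASTER_HOST='old-server.example.org',
    MASTER_USER='repl',
    MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000123',
    MASTER_LOG_POS=4567;
  START SLAVE;"

# Check that the slave is keeping up:
mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
```

These commands need a running MySQL server on each host, so they are a sketch of the procedure rather than something runnable as-is.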
Lots of websites use replication. When you read a page on a huge site like Wikipedia, which gets far more 'reads' than 'writes', you are probably reading from one of many practically identical slaves. When you edit a page, it's the master you edit, and the slaves are then instructed about your changes.
We're using replication in a slightly different way. The old server was already in a good position to be a master, because we had set up binary logging - the change log that replication depends on - long ago. The new server got a copy of the data to start replicating from (more about that below), and since then it has been mirroring every change that happened on the old server. When we are ready to migrate, we can simply switch over.
So, as replication starts with a backup: how did we manage to back up XX GB of data without days of downtime while the server was stopped?
Backup
Getting a consistent copy of database tables means the data can't be modified while the backup progresses, so the tables have to be protected during the backup. Protecting the tables usually means locking them and preventing writes - but the game needs to write data all the time (e.g. for logging of events) and read data (e.g. loading a character from the database, or retrieving stored postman messages). Normal backup methods would have meant X hours of downtime in the game while the data was backed up.
We knew about an open source backup tool by Percona called xtrabackup, which can perform backups really fast - far faster than normal methods - and without needing to lock tables*.
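A backup run looks roughly like the following. The paths and timestamp are made up, and depending on the version the wrapper script around xtrabackup is called innobackupex:

```shell
# Take a backup while the server keeps running: InnoDB tables are
# copied without locking, and the tool records the log position as it goes.
innobackupex /backups/

# 'Prepare' the backup - replay the InnoDB log so the copy is consistent:
innobackupex --apply-log /backups/2011-06-11_00-45-00/

# The prepared directory can then be copied to the new server and used
# as the starting point for replication: it records the binary log
# coordinates needed for CHANGE MASTER TO.
```

Again a sketch: these commands assume a live MySQL server and a Percona XtraBackup installation.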
*Okay, well, it does have to lock old style MyISAM tables, and we still had a few of those which we'd never been able to convert without downtime - so we did some magic. The archive logging tables could be backed up normally: locking them doesn't matter, since the game no longer reads from them anyway. Afterwards we dropped them on the main server so they wouldn't get copied again with the rest of the data. The only live game table which was still MyISAM was the itemlog. It's huge! Most of the time the data in it doesn't need to be read at all - it's just written to, logging item transfers, merges and movements. So, just for a short while, we renamed the itemlog to something else - itemlog_date - and created a new, empty table called itemlog which started logging with an id matching the last id of the previous table. The game was happy because it had a table to write to, and we could safely lock the huge old MyISAM table. We stitched the two tables back together later on the new server.
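The itemlog trick can be sketched in SQL. The database name, archive table name and starting id below are illustrative, not the real ones:

```shell
mysql stendhal -e "
  -- Prepare an empty table with the same structure as the old one,
  -- continuing the id sequence from its last id (value illustrative):
  CREATE TABLE itemlog_new LIKE itemlog;
  ALTER TABLE itemlog_new AUTO_INCREMENT = 123456789;

  -- Swap the tables in one atomic step, so the game never sees a
  -- missing itemlog:
  RENAME TABLE itemlog TO itemlog_20110611, itemlog_new TO itemlog;
"
# Now itemlog_20110611 can be locked and backed up at leisure, and the
# two tables stitched back together on the new server afterwards.
```

Doing the swap with a single RENAME TABLE statement is the important part: it is atomic, so the game's writes never hit a gap where the table doesn't exist.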
Async database access
How did the game avoid lagging like crazy while we did all this heavy work with mysql processes and disk I/O? The answer is asynchronous database access: the game's database work goes through message queueing, using Gearman. 'Fire and forget' - sending a write off to the queue and carrying on - is the easy part; the clever part is waiting for a response and then using it. This was originally introduced to help solve server lag, so that the server doesn't sit waiting for a response from the database within a single game turn (300ms).
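The difference between the two modes can be sketched with Gearman's command line client (the game itself does this from its own code; the function names and payloads here are made up for illustration):

```shell
# Fire and forget: submit a background job and carry on immediately.
# A worker somewhere will eventually write the log entry to the database.
gearman -b -f log_item_event '{"item": "sword", "event": "merge"}'

# Waiting for a response: submit a job in the foreground and block
# until the worker returns a result - for example, loading a character.
gearman -f load_character 'bob'
```

The game gets the best of both: logging is fire-and-forget, while reads that need an answer are handled without stalling the 300ms game turn, because the waiting happens outside the turn loop.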