server down; possible rollback
Posted: 13 Jan 2013, 22:02
by o11c
While trying to non-invasively fix some corruption in the savefile, I managed to segfault gdb, so there was no chance to recover.
Possible rollback of up to 3 days; updates later.
Re: server down; possible rollback
Posted: 13 Jan 2013, 22:58
by Frost
Update:
The server is back up. The only damage is that about 60 accounts registered in the last ~20 hours no longer exist -- and only some of those ever logged in.
- character inventory and money are fine
- character stats and experience and items are fine
- storage is fine
What happened?
The game account service stopped writing save files. We don't know why, and we faced the difficult choice of whether to try to coax it to behave, or shoot it and pick up the pieces. As a last resort, we asked o11c to talk voodoo at the account server. Even he couldn't make it work, so we had to stop the situation before it went on even longer (and more accounts were unsaved.)
We're still investigating why the account server choked on saving the data files. If this was purely bad luck, then such is life. If it's preventable, then we want to learn how to prevent it from happening again.
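One common way to keep a failed save from clobbering the last good file is to write to a temporary file, check it, and only then rename it over the real one. The following is a minimal sketch of that idea, not how the account server actually saves; the function name `save_atomically` and the size check are assumptions made for illustration.

```python
import os
import tempfile


def save_atomically(path: str, data: bytes) -> None:
    """Write data to a temp file, verify its size, then rename it into place.

    Hypothetical helper: if anything fails, the old file at `path` is untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".save-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before trusting them
        # Cheap sanity check: a short write shows up as a size mismatch.
        if os.path.getsize(tmp_path) != len(data):
            raise OSError("save file is shorter than expected")
        os.replace(tmp_path, path)  # atomic swap of old file for new
    except BaseException:
        # Clean up the partial temp file and let the caller see the error.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A size check alone would not catch every kind of corruption; a real save routine would probably also want to re-parse the file or compare a checksum before the rename.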
Okay, back to the show!
Re: server down; possible rollback
Posted: 14 Jan 2013, 01:29
by prsm
Good work guys!
Re: server down; possible rollback
Posted: 14 Jan 2013, 18:42
by o11c
As a software guy, my instinctive response is "blame the hardware". Though, that is usually responsible for intermittent problems.
In this case, it was odd in that it was generating a corrupt savefile every 5 minutes for 3 whole days. Obviously, this meant the corruption happened in the parent process, and merely got propagated into the fork()'ed child.
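To illustrate why the child cannot be the culprit: a fork()'ed child starts with a copy of the parent's memory, so whatever is already wrong in the parent gets written out faithfully on every save. Here is a rough Unix-only sketch; the function name, the pipe used in place of a savefile, and the values are all made up for the example.

```python
import json
import os


def snapshot_via_fork(state: dict) -> dict:
    """Fork a child that serialises `state`, and return what the child wrote."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: it sees the parent's memory exactly as it was at fork time.
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "w") as out:
                json.dump(state, out)
        finally:
            os._exit(0)  # never fall through into the parent's code
    # Parent: read back what the child saved, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as inp:
        saved = json.load(inp)
    os.waitpid(pid, 0)
    return saved


# Hypothetical state that is already corrupt in the parent.
state = {"auth_num": 1053576, "accounts": ["alice", "bob"]}
print(snapshot_via_fork(state))
```

Each child gets a fresh copy, so a fault inside one child would only hurt one save. The same bad file every 5 minutes points at the parent.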
My best guess is that the 'auth_num' index into 'auth_dat' got mixed up, because who can mess up a bubble sort?
This would most likely be in the form of a single bit flipped from 0 to 1, which can easily be caused by solar flares.
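As a small illustration of how much damage one flipped bit can do to a counter like 'auth_num', here is a sketch. The starting value and the bit position are invented for the example and are not taken from the real server.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the given bit toggled (XOR with a one-bit mask)."""
    return value ^ (1 << bit)


auth_num = 5000                     # hypothetical number of accounts
corrupted = flip_bit(auth_num, 20)  # one bit goes from 0 to 1
print(auth_num, corrupted)          # 5000 1053576
```

A count that jumps from thousands to over a million would send any loop over the account array far past its real end.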
Re: server down; possible rollback
Posted: 14 Jan 2013, 20:41
by Frost
o11c wrote:As a software guy, my instinctive response is "blame the hardware".
hee hee. I know what you mean.
This would most likely be in the form of a single bit flipped from 0 to 1, which can easily be caused by solar flares.
I agree this was probably some transient phenomenon, and solar flares and cosmic rays have indeed been known to flip bits in RAM. This model server uses ECC RAM, which would detect and correct a single-bit error in hardware.
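For anyone curious how single-bit correction works, here is a rough sketch using the textbook Hamming(7,4) code. It was chosen for brevity; real ECC memory typically uses wider codes, and the function names here are made up.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Fix at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome spells out the 1-based position of the bad bit (0 = clean).
    position = s1 + 2 * s2 + 4 * s4
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                   # simulate a cosmic ray hitting one bit
print(hamming74_decode(word))  # [1, 0, 1, 1]
```

This code silently repairs any single flipped bit. Two flips in the same word are beyond it, which is why memory-grade codes add an extra parity bit to at least detect that case.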
o11c, thanks again for taking the time and concentration to do surgery on this. I hope you were still able to finish the things you put aside to get us running.
Re: server down; possible rollback
Posted: 14 Jan 2013, 21:06
by 0x0BAL
Does this mean that we lost everything we did this last 3 days?
Re: server down; possible rollback
Posted: 14 Jan 2013, 22:01
by prsm
0x0BAL wrote:Does this mean that we lost everything we did this last 3 days?
No, it means they saved the day!
Re: server down; possible rollback
Posted: 14 Jan 2013, 22:32
by o11c
Frost wrote:o11c, thanks again for taking the time and concentration to do surgery on this. I hope you were still able to finish the things you put aside to get us running.
Instead of printing my assignment at home, I had to finish it later, so it basically ended up as "how do I make this stupid printer work?", which is something I usually avoid (and with cause - the printer in the room I'm in right now does not work, I had to print to the other room).
--
For reference, here is the list of accounts that were lost:
tehswordfish
deathdemon
shugo
Qyte
thatjon13
Dagda
iSatrix
gygy838
bobafett
pradhitya
fatzy
thebev
kodknackare
mmuse
masterbas
masoud
SKYLINED
srikandy
SKYLINED1
Marinasi
tott
serivires
zaceppin
epicfailed
Fl1nt
sirfancoc
tublikana
clochette31
maxigvornd
bentaisan
endless90
Oddly, the accounts were all created on the 13th, well past the day the corruption started (this is possible because we recovered from the corrupted file, instead of rolling back) ... well, since the savefile is not written in order, I guess that's possible ... in retrospect, it might have even been possible to save the few accounts that *were* deleted, just by creating a bunch of dummy accounts.
Re: server down; possible rollback
Posted: 18 Jan 2013, 03:53
by BoomerTheKran
fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
Re: server down; possible rollback
Posted: 18 Jan 2013, 05:05
by Frost
BoomerTheKran wrote:fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
One process was unable to write a certain file after it had exceeded a certain size (about 4MB). The problem went away after that process was restarted.
Are you just playing Buzzword Bingo?
Re: server down; possible rollback
Posted: 18 Jan 2013, 05:59
by Ginaria
!!!BIG THANK YOU to ALL, who HELPED to resuscitate ManaWorld!!!
Re: server down; possible rollback
Posted: 19 Jan 2013, 01:48
by o11c
Frost wrote:BoomerTheKran wrote:fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
One process was unable to write a certain file after it had exceeded a certain size (about 4MB). The problem went away after that process was restarted.
And it was *definitely* an in-memory problem. It is *highly* unlikely that the memory containing those particular variables would ever be swapped.
Frost wrote:Are you just playing Buzzword Bingo?
Ooh ...
Agile. Synergy. Web 2.0. Extreme Programming. Chaos Model.
Re: server down; possible rollback
Posted: 20 Jan 2013, 02:19
by BoomerTheKran
o11c wrote:Frost wrote:BoomerTheKran wrote:fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
One process was unable to write a certain file after it had exceeded a certain size (about 4MB). The problem went away after that process was restarted.
And it was *definitely* an in-memory problem. It is *highly* unlikely that the memory containing those particular variables would ever be swapped.
Frost wrote:Are you just playing Buzzword Bingo?
Ooh ...
Agile. Synergy. Web 2.0. Extreme Programming. Chaos Model.
At this point, I would like to be playing buzzword bingo. I'm however not very good at regular bingo. Too much attention requi........HEY LOOK A SQUIRREL! I was sort-of asking had they been done or were they being used. Thanks for restoring some sort of order to the chaos that is TMW!
Re: server down; possible rollback
Posted: 20 Jan 2013, 21:16
by Sharona
rather happy you've fixed it and ironically happy i'm not affected, but weirdly enough when i go and click to play the game manaplus is now lagging and the background on it is pitch black, however the main tmw program seems fine itself

and the manaplus update, which i guess fixes the current issue, seems to not even load properly.
i guess it's a sign for me to take a break.

Re: server down; possible rollback
Posted: 11 Feb 2013, 23:52
by Dude
Anyone else find it slightly humorous that one of the deleted accounts was "epicfailed" ?