server down; possible rollback

Where game and project announcements are made.
o11c
Grand Knight
Posts: 2262
Joined: 20 Feb 2011, 21:09
Location: ^ ^

server down; possible rollback

Post by o11c »

While trying to non-invasively fix some corruption in the savefile, I managed to segfault gdb, so there was no chance to recover.

Possible rollback of up to 3 days; updates later.
Former programmer for the TMWA server.
Frost
TMW Adviser
Posts: 851
Joined: 09 Sep 2010, 06:20
Location: California, USA

Re: server down; possible rollback

Post by Frost »

Update:
The server is back up. The only damage is that about 60 accounts registered in the last ~20 hours no longer exist -- and only some of those ever logged in.
  • character inventory and money are fine
  • character stats and experience and items are fine
  • storage is fine
What happened?
The game account service stopped writing save files. We don't know why, and we faced the difficult choice of whether to try to coax it to behave, or shoot it and pick up the pieces. As a last resort, we asked o11c to talk voodoo at the account server. Even he couldn't make it work, so we had to stop the situation before it went on even longer (and more accounts went unsaved).
We're still investigating why the account server choked on saving the data files. If this was purely bad luck, then such is life. If it's preventable, then we want to learn how to prevent it from happening again.

Okay, back to the show!
You earn respect by how you live, not by what you demand.
-unknown
prsm
TMW Classic
Posts: 1587
Joined: 24 Mar 2009, 17:18

Re: server down; possible rollback

Post by prsm »

Good work guys!
ego is the anesthesia that deadens the pain of stupidity!
o11c
Grand Knight
Posts: 2262
Joined: 20 Feb 2011, 21:09
Location: ^ ^

Re: server down; possible rollback

Post by o11c »

As a software guy, my instinctive response is "blame the hardware". Though, that is usually responsible for intermittent problems.

In this case, it was odd in that it was generating a corrupt savefile every 5 minutes for 3 whole days. Obviously, this meant the corruption happened in the parent process, and merely got propagated into the fork()'ed child.
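To illustrate why a forked saver cannot hide corruption that already exists in the parent, here is a minimal sketch. It is not the TMWA code: `save_snapshot`, `demo`, the JSON format, and the sample data are all hypothetical, and it assumes a POSIX system where `os.fork()` is available. The child gets a copy of the parent's memory as of the fork, so it writes out whatever inconsistent state the parent held.

```python
import json
import os
import tempfile


def save_snapshot(state, path):
    """Fork a child that writes `state` to `path`; return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: works on a copy of the parent's memory as it was at fork time.
        status = 1
        try:
            with open(path, "w") as f:
                json.dump(state, f)
            status = 0
        finally:
            os._exit(status)  # leave without running the parent's cleanup code
    # Parent: a real server would keep serving; this sketch just waits.
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)


def demo():
    """Save an already-inconsistent state and read back what hit the disk."""
    # Hypothetical data: the count claims 3 entries but only 2 exist.
    state = {"auth_num": 3, "auth_dat": ["alice", "bob"]}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "account.json")
        exit_code = save_snapshot(state, path)
        with open(path) as f:
            return exit_code, json.load(f)
```

Calling `demo()` returns the child's exit code together with the data read back from disk. The mismatch between the count and the list survives the save unchanged, and it would do so on every later save until the parent is fixed or restarted.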

My best guess is that the 'auth_num' index into 'auth_dat' got mixed up, because who can mess up a bubble sort?

This would most likely be in the form of a single bit flipped from 0 to 1, which can easily be caused by solar flares.
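As a rough illustration of how much damage one flipped bit can do to a count or index, consider the sketch below. The table size, the bit position, and the `flip_bit` helper are made up for the example; only the names `auth_num` and `auth_dat` come from the post.

```python
def flip_bit(value, bit):
    """Return `value` with bit number `bit` (0 = least significant) toggled."""
    return value ^ (1 << bit)


auth_dat = ["account%d" % i for i in range(60)]  # hypothetical table
auth_num = len(auth_dat)                          # 60 valid entries

corrupted = flip_bit(auth_num, 20)                # bit 20 goes from 0 to 1
differing_bits = bin(auth_num ^ corrupted).count("1")

print(auth_num, corrupted, differing_bits)        # 60 1048636 1
print(corrupted > len(auth_dat))                  # True: far past the table
```

A single bit of difference turns a count of 60 into more than a million, which is the kind of value that would make any loop over the table read far beyond the valid entries.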
Former programmer for the TMWA server.
Frost
TMW Adviser
Posts: 851
Joined: 09 Sep 2010, 06:20
Location: California, USA

Re: server down; possible rollback

Post by Frost »

o11c wrote:As a software guy, my instinctive response is "blame the hardware".
hee hee. I know what you mean. :)
o11c wrote:This would most likely be in the form of a single bit flipped from 0 to 1, which can easily be caused by solar flares.
I agree this was probably some transient phenomenon, and solar flares and cosmic rays have indeed been known to flip bits in RAM. This model server uses ECC RAM, which would detect and correct a single-bit error in hardware.
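For readers wondering how hardware can correct a flipped bit on its own, here is a toy sketch using a Hamming(7,4) code. This is an illustration only: ECC memory typically applies wider codes over whole memory words, and nothing here describes the actual server hardware. The function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def correct(codeword):
    """Return (repaired codeword, error position); position 0 means no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome names the flipped position
    if position:
        c[position - 1] ^= 1
    return c, position


def decode(codeword):
    """Extract the 4 data bits from a 7-bit codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


data = [1, 0, 1, 1]
stored = encode(data)
damaged = list(stored)
damaged[4] ^= 1  # simulate a single-bit upset at position 5

repaired, position = correct(damaged)
print(position, decode(repaired) == data)  # 5 True
```

The three parity checks together spell out the position of the damaged bit, so any single flip in the 7-bit word can be undone. Two simultaneous flips would be miscorrected by this toy code, which is why practical schemes add an extra parity bit to at least detect double errors.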

o11c, thanks again for taking the time and concentration to do surgery on this. I hope you were still able to finish the things you put aside to get us running.
You earn respect by how you live, not by what you demand.
-unknown
0x0BAL
Peon
Posts: 40
Joined: 19 Dec 2012, 10:36

Re: server down; possible rollback

Post by 0x0BAL »

Does this mean that we lost everything we did these last 3 days?
prsm
TMW Classic
Posts: 1587
Joined: 24 Mar 2009, 17:18

Re: server down; possible rollback

Post by prsm »

0x0BAL wrote:Does this mean that we lost everything we did these last 3 days?
No, it means they saved the day!
ego is the anesthesia that deadens the pain of stupidity!
o11c
Grand Knight
Posts: 2262
Joined: 20 Feb 2011, 21:09
Location: ^ ^

Re: server down; possible rollback

Post by o11c »

Frost wrote:o11c, thanks again for taking the time and concentration to do surgery on this. I hope you were still able to finish the things you put aside to get us running.
Instead of printing my assignment at home, I had to finish it later, so it basically ended up as "how do I make this stupid printer work?", which is something I usually avoid (and with cause - the printer in the room I'm in right now does not work, I had to print to the other room).

--

For reference, here is the list of accounts that were lost:

tehswordfish
deathdemon
shugo
Qyte
thatjon13
Dagda
iSatrix
gygy838
bobafett
pradhitya
fatzy
thebev
kodknackare
mmuse
masterbas
masoud
SKYLINED
srikandy
SKYLINED1
Marinasi
tott
serivires
zaceppin
epicfailed
Fl1nt
sirfancoc
tublikana
clochette31
maxigvornd
bentaisan
endless90
Oddly, the accounts were all created on the 13th, well past the day the saves started failing (this is possible because we recovered from the corrupted file, instead of rolling back) ... well, since the savefile is not written in order, I guess that's possible ... in retrospect, it might have even been possible to save the few accounts that *were* deleted, just by creating a bunch of dummy accounts.
Former programmer for the TMWA server.
BoomerTheKran
Peon
Posts: 47
Joined: 11 Dec 2010, 21:07
Location: Kentucky, USA

Re: server down; possible rollback

Post by BoomerTheKran »

fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
Frost
TMW Adviser
Posts: 851
Joined: 09 Sep 2010, 06:20
Location: California, USA

Re: server down; possible rollback

Post by Frost »

BoomerTheKran wrote:fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
One process was unable to write a certain file after it had exceeded a certain size (about 4MB). The problem went away after that process was restarted.

Are you just playing Buzzword Bingo?
You earn respect by how you live, not by what you demand.
-unknown
Ginaria
Warrior
Posts: 420
Joined: 28 Oct 2008, 15:22
Location: Planet of the Pinkies

Re: server down; possible rollback

Post by Ginaria »

!!!BIG THANK YOU to ALL, who HELPED to resuscitate ManaWorld!!!
o11c
Grand Knight
Posts: 2262
Joined: 20 Feb 2011, 21:09
Location: ^ ^

Re: server down; possible rollback

Post by o11c »

Frost wrote:
BoomerTheKran wrote:fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
One process was unable to write a certain file after it had exceeded a certain size (about 4MB). The problem went away after that process was restarted.
And it was *definitely* an in-memory problem. It is *highly* unlikely that the memory containing those particular variables would ever be swapped.
Frost wrote:Are you just playing Buzzword Bingo?
Ooh ...

Agile. Synergy. Web 2.0. Extreme Programming. Chaos Model.
Former programmer for the TMWA server.
BoomerTheKran
Peon
Posts: 47
Joined: 11 Dec 2010, 21:07
Location: Kentucky, USA

Re: server down; possible rollback

Post by BoomerTheKran »

o11c wrote:
Frost wrote:
BoomerTheKran wrote:fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
One process was unable to write a certain file after it had exceeded a certain size (about 4MB). The problem went away after that process was restarted.
And it was *definitely* an in-memory problem. It is *highly* unlikely that the memory containing those particular variables would ever be swapped.
Frost wrote:Are you just playing Buzzword Bingo?
Ooh ...

Agile. Synergy. Web 2.0. Extreme Programming. Chaos Model.
At this point, I would like to be playing buzzword bingo. However, I'm not very good at regular bingo. Too much attention requi........HEY LOOK A SQUIRREL! I was sort of asking whether those things had been done or were being used. Thanks for restoring some sort of order to the chaos that is TMW!
Sharona
Peon
Posts: 28
Joined: 15 Nov 2012, 12:49
Location: omicron persei 8

Re: server down; possible rollback

Post by Sharona »

Rather happy you've fixed it, and ironically happy I'm not affected. Weirdly enough, though, when I click to play the game, ManaPlus is now lagging and its background is pitch black, while the main TMW program itself seems fine :P And the ManaPlus update, which I guess is meant to fix the current issue, doesn't even seem to load properly.

I guess it's a sign for me to take a break. :lol:
Dude
Novice
Posts: 162
Joined: 28 Sep 2009, 17:31
Location: California

Re: server down; possible rollback

Post by Dude »

Anyone else find it slightly humorous that one of the deleted accounts was "epicfailed"?
Dude - lvl 99
...and yes I know where my car is.....