server down; possible rollback
While trying to non-invasively fix some corruption in the savefile, I managed to segfault gdb, so there was no chance to recover.
Possible rollback of up to 3 days; updates later.
Former programmer for the TMWA server.
Re: server down; possible rollback
Update:
The server is back up. The only damage is that about 60 accounts registered in the last ~20 hours no longer exist -- and only some of those ever logged in.
- character inventory and money are fine
- character stats and experience and items are fine
- storage is fine
The game account service stopped writing save files. We don't know why, and we faced the difficult choice of whether to try to coax it to behave, or shoot it and pick up the pieces. As a last resort, we asked o11c to talk voodoo at the account server. Even he couldn't make it work, so we had to stop the situation before it went on even longer (and more accounts went unsaved).
We're still investigating why the account server choked on saving the data files. If this was purely bad luck, then such is life. If it's preventable, then we want to learn how to prevent it from happening again.
Okay, back to the show!
You earn respect by how you live, not by what you demand.
-unknown
Re: server down; possible rollback
Good work guys!
ego is the anesthesia that deadens the pain of stupidity!
Re: server down; possible rollback
As a software guy, my instinctive response is "blame the hardware". Though, that is usually responsible for intermittent problems.
In this case, it was odd in that it was generating a corrupt savefile every 5 minutes for 3 whole days. Obviously, this meant the corruption happened in the parent process, and merely got propagated into the fork()'ed child.
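To make the fork() reasoning concrete, here is a minimal sketch, not the actual TMWA save code. It assumes a POSIX system, and `snapshot_in_child` and the `accounts` dictionary are hypothetical names invented for illustration. It shows that a forked child sees the parent's memory as it was at the moment of the fork, so once the parent's state is corrupted, every later snapshot repeats the same bad data.

```python
import os


def snapshot_in_child(state):
    """Fork, let the child report the state it inherited, and return that report."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: works on a copy of the parent's memory taken at fork time.
        os.close(read_fd)
        os.write(write_fd, repr(state).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: collect what the child "saved", then reap the child.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return b"".join(chunks).decode()


accounts = {"auth_num": 3}             # hypothetical in-memory state
good = snapshot_in_child(accounts)     # snapshot taken before the corruption

accounts["auth_num"] = 3 | (1 << 20)   # simulate corruption in the parent
bad_1 = snapshot_in_child(accounts)    # every later child inherits it
bad_2 = snapshot_in_child(accounts)
```

Restarting the child would not help here, because the bad value lives in the parent. Only fixing or restarting the parent changes what the next child sees.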
My best guess is that the 'auth_num' index into 'auth_dat' got mixed up, because who can mess up a bubble sort?
This would most likely be in the form of a single bit flipped from 0 to 1, which can easily be caused by solar flares.
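As a rough illustration of that guess, the sketch below flips one bit in a count and shows how far off the result is. The names `auth_dat` and `auth_num` come from the post, but the data layout, `flip_bit`, and `save_accounts` are assumptions made up for the example, not the real server code.

```python
def flip_bit(value, bit):
    """Return value with the given bit position inverted."""
    return value ^ (1 << bit)


def save_accounts(auth_dat, auth_num):
    """Hypothetical saver that trusts auth_num as the number of valid entries."""
    lines = []
    for i in range(auth_num):
        lines.append(auth_dat[i])  # fails once auth_num exceeds the real length
    return lines


auth_dat = ["alice", "bob", "carol"]   # made-up account names
auth_num = len(auth_dat)               # 3, binary 00011

corrupted = flip_bit(auth_num, 4)      # bit 4 goes from 0 to 1: binary 10011

try:
    save_accounts(auth_dat, corrupted)
    save_failed = False
except IndexError:
    save_failed = True
```

Python raises an `IndexError` here. A C or C++ program reading past the end of an array would more likely read garbage or crash, which is one way a save routine could keep producing corrupt files.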
Former programmer for the TMWA server.
Re: server down; possible rollback
o11c wrote: As a software guy, my instinctive response is "blame the hardware".
Hee hee. I know what you mean.
o11c wrote: This would most likely be in the form of a single bit flipped from 0 to 1, which can easily be caused by solar flares.
I agree this was probably some transient phenomenon, and solar flares and cosmic rays have indeed been known to flip bits in RAM. This model server uses ECC RAM, which would detect and correct a single-bit error in hardware.
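For readers wondering how a single-bit error can be corrected at all, here is a toy sketch using a Hamming(7,4) code. This is an assumption chosen for brevity: real ECC memory does this in hardware over much wider words, and nothing here describes this particular server. The functions `encode`, `correct`, and `decode` are illustrative names.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(code):
    """Return (corrected codeword, 1-based position of the flipped bit, or 0)."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    position = s1 + 2 * s2 + 4 * s3   # the syndrome spells out the bad position
    fixed = list(code)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position


def decode(code):
    """Extract the 4 data bits from a 7-bit codeword."""
    return [code[2], code[4], code[5], code[6]]


word = [1, 0, 1, 1]
stored = encode(word)

damaged = list(stored)
damaged[4] ^= 1                       # simulate one flipped bit at position 5

repaired, where = correct(damaged)
recovered = decode(repaired)
```

The three parity checks together point at the position of the flipped bit, so it can be flipped back. This only works for a single flipped bit per codeword, which matches the single-bit case described above.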
o11c, thanks again for taking the time and concentration to do surgery on this. I hope you were still able to finish the things you put aside to get us running.
You earn respect by how you live, not by what you demand.
-unknown
Re: server down; possible rollback
Does this mean that we lost everything we did these last 3 days?
Re: server down; possible rollback
0x0BAL wrote: Does this mean that we lost everything we did this last 3 days?
No, it means they saved the day!
ego is the anesthesia that deadens the pain of stupidity!
Re: server down; possible rollback
Frost wrote: o11c, thanks again for taking the time and concentration to do surgery on this. I hope you were still able to finish the things you put aside to get us running.
Instead of printing my assignment at home, I had to finish it later, so it basically ended up as "how do I make this stupid printer work?", which is something I usually avoid (and with cause - the printer in the room I'm in right now does not work, I had to print to the other room).
--
For reference, here is the list of accounts that were lost:
tehswordfish
deathdemon
shugo
Qyte
thatjon13
Dagda
iSatrix
gygy838
bobafett
pradhitya
fatzy
thebev
kodknackare
mmuse
masterbas
masoud
SKYLINED
srikandy
SKYLINED1
Marinasi
tott
serivires
zaceppin
epicfailed
Fl1nt
sirfancoc
tublikana
clochette31
maxigvornd
bentaisan
endless90
Former programmer for the TMWA server.
-
- Peon
- Posts: 47
- Joined: 11 Dec 2010, 21:07
- Location: Kentucky, USA
Re: server down; possible rollback
fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
Re: server down; possible rollback
BoomerTheKran wrote: fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
One process was unable to write a certain file after it had exceeded a certain size (about 4MB). The problem went away after that process was restarted.
Are you just playing Buzzword Bingo?
You earn respect by how you live, not by what you demand.
-unknown
Re: server down; possible rollback
!!!BIG THANK YOU to ALL, who HELPED to resuscitate ManaWorld!!!
Re: server down; possible rollback
BoomerTheKran wrote: fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
Frost wrote: One process was unable to write a certain file after it had exceeded a certain size (about 4MB). The problem went away after that process was restarted.
And it was *definitely* an in-memory problem. It is *highly* unlikely that the memory containing those particular variables would ever be swapped.
Frost wrote: Are you just playing Buzzword Bingo?
Ooh ...
Agile. Synergy. Web 2.0. Extreme Programming. Chaos Model.
Former programmer for the TMWA server.
Re: server down; possible rollback
BoomerTheKran wrote: fsck?
memtest?
storage volumes in raid?
fault tolerance filesystem?
Frost wrote: One process was unable to write a certain file after it had exceeded a certain size (about 4MB). The problem went away after that process was restarted.
o11c wrote: And it was *definitely* an in-memory problem. It is *highly* unlikely that the memory containing those particular variables would ever be swapped.
Frost wrote: Are you just playing Buzzword Bingo?
o11c wrote: Ooh ...
Agile. Synergy. Web 2.0. Extreme Programming. Chaos Model.
At this point, I would like to be playing buzzword bingo. I'm however not very good at regular bingo. Too much attention requi........HEY LOOK A SQUIRREL! I was sort-of asking had they been done or were they being used. Thanks for restoring some sort of order to the chaos that is TMW!
Re: server down; possible rollback
Rather happy you've fixed it, and ironically happy I'm not affected. Weirdly enough, though, when I click to play the game, ManaPlus is now lagging and its background is pitch black. The main TMW program itself seems fine, and the ManaPlus update, which I guess fixes the current issue, doesn't even seem to load properly.
I guess it's a sign for me to take a break.
Re: server down; possible rollback
Anyone else find it slightly humorous that one of the deleted accounts was "epicfailed"?
Dude - lvl 99
...and yes I know where my car is.....