Re: No servers available?
Posted: 19 Feb 2010, 01:46
Here's why it was down then:
The login server was laggy. The mmo_auth_sync() function that writes the accounts database was taking an inordinate amount of CPU time due to the many accounts on TMW. Because the login server is a single-threaded daemon, all other processing is suspended during this operation, leading to horribly long login delays. The problem was possibly exacerbated by the web-based account creation script, which likes to repeatedly call ladmin with bad parameters, possibly leading to more calls to mmo_auth_sync() than needed.
Several solutions were tried to mitigate this problem. The login-server was calling mmo_auth_sync() whenever any changes were made to the accounts DB (a lot), so first I removed most calls to the function and set it to be triggered on a 5 minute timer. This was not very effective, again possibly due to the ladmin issue: I had left the mmo_auth_sync() calls in the ladmin code path, assuming anything from ladmin could be considered trusted input and should trigger an immediate DB write.
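To make the timer idea concrete, here is a minimal sketch of it, not the actual login-server code. The struct, the function names and the "dirty" flag are all made up for illustration; the only thing taken from the change described above is the 5 minute interval.

```c
#include <time.h>

#define SYNC_INTERVAL_SEC (5 * 60)   /* the 5 minute timer mentioned above */

/* Hypothetical bookkeeping; the real server keeps its state differently. */
struct sync_state {
    int    dirty;      /* set whenever the accounts DB changes in memory */
    time_t last_sync;  /* when the DB was last written to disk */
};

/* Call this where the old code would have called mmo_auth_sync() directly. */
static void mark_dirty(struct sync_state *s)
{
    s->dirty = 1;
}

/* Call this from the main loop. Returns 1 when a write should happen now,
 * and 0 when nothing changed or the interval has not elapsed yet. */
static int sync_due(struct sync_state *s, time_t now)
{
    if (!s->dirty)
        return 0;
    if (now - s->last_sync < SYNC_INTERVAL_SEC)
        return 0;
    s->dirty = 0;
    s->last_sync = now;
    return 1;
}
```

The point is that any number of account changes inside one interval collapse into a single disk write, instead of one write per change.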
The next attempt involved removing all calls to mmo_auth_sync() except the one triggered by the timer. Jaxad had estimated that it takes around 30 seconds to dump the accounts DB to disk, so it was decided that it would be a good idea to fork(2) off the mmo_auth_sync() function to a child process. This solution had proved successful with the character server and had almost totally eliminated its lag issues (including party/whisper lag). I pushed the patch and went to sleep. While I was drooling on a pillow, Jax pushed this code live.
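For anyone wondering what "forking off" the write looks like, here is a rough sketch under my own assumptions. write_accounts() is a hypothetical stand-in for mmo_auth_sync(), and the file format and function names are invented. The child gets a copy-on-write snapshot of the parent's memory at the moment of fork(), so it can take as long as it likes while the parent keeps serving logins. The sketch uses _exit() in the child so that none of the parent's exit handlers run there.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for mmo_auth_sync(): dump `count` fake accounts. */
static int write_accounts(const char *path, int count)
{
    FILE *fp = fopen(path, "w");
    int i;

    if (fp == NULL)
        return -1;
    for (i = 0; i < count; i++)
        fprintf(fp, "%d\taccount%d\n", 2000000 + i, i);
    return fclose(fp) == 0 ? 0 : -1;
}

/* Start the write in a child process and return immediately.
 * Returns the child's pid, or -1 if fork() failed. */
static pid_t sync_in_background(const char *path, int count)
{
    pid_t pid;

    fflush(NULL);              /* don't let the child re-flush our stdio buffers */
    pid = fork();
    if (pid == 0) {
        int rc = write_accounts(path, count);
        _exit(rc == 0 ? 0 : 1);   /* leave without running exit handlers */
    }
    return pid;
}

/* Non-blocking check for the main loop:
 * 1 = child finished cleanly, 0 = still running, -1 = it failed. */
static int sync_finished(pid_t pid)
{
    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);

    if (r == 0)
        return 0;
    if (r == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 1;
    return -1;
}
```

The parent would call sync_in_background() from the timer and poll sync_finished() on later loop iterations, so the child gets reaped and never lingers as a zombie.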
Upon waking up I found that TMW was down. Unfortunately the login server operates slightly differently from the char-server, so what appeared to be a cut-and-dried copy/paste fix went a bit awry. The char-server relies on a SIGINT handler to clean up after itself when it exits. The login-server had to be funky and uses atexit(3) semantics to run additional code on exit. Since the forked child calls exit() once it is through running mmo_auth_sync(), the atexit() handler gets called in the child as well. I won't bother with a full stack dump here, but the code the atexit() handler calls eventually closes down all sockets in the process. Now if you know anything about POSIX forking semantics you would know that a forked child shares its sockets with the parent, so when the child shuts one down, it goes down for the parent process also. This led to a condition where the parent login-server process was running its main select()/accept() loop on a dead socket, spewing errors to the log and chewing 100% CPU. Nice. The solution to this issue was to replace the call to exit() with _exit(), which bypasses the atexit() handler and exits immediately. After I pushed these changes, Jax pushed them to the main repo and restarted the server... and here we are. It seems there still might be problems.
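If you want to see the exit() versus _exit() difference for yourself, here is a small self-contained sketch. It is not the login-server code: cleanup_handler() is a hypothetical stand-in for the real exit handler, and a pipe stands in for the sockets so the parent can tell whether the handler ran in the child.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe; only set inside the child. Hypothetical stand-in
 * for the state the real exit handler would touch. */
static int report_fd = -1;

/* Stand-in for the exit handler: leaves a visible trace when it runs. */
static void cleanup_handler(void)
{
    if (report_fd >= 0) {
        ssize_t rc = write(report_fd, "x", 1);
        (void)rc;
    }
}

/* Fork a child that leaves via exit() (use_plain_exit != 0) or _exit().
 * Returns 1 if the handler ran in the child, 0 if not, -1 on error. */
static int handler_runs_in_child(int use_plain_exit)
{
    static int registered = 0;
    int fds[2];
    char byte;
    ssize_t n;
    pid_t pid;

    if (!registered) {
        /* Registered in the parent before fork(), so the child inherits it. */
        if (atexit(cleanup_handler) != 0)
            return -1;
        registered = 1;
    }
    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        report_fd = fds[1];
        if (use_plain_exit)
            exit(0);    /* runs the inherited atexit() handler */
        _exit(0);       /* skips it and terminates immediately */
    }

    close(fds[1]);
    n = read(fds[0], &byte, 1);   /* 1 byte if the handler ran, EOF if not */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    if (n == 1)
        return 1;
    return n == 0 ? 0 : -1;
}
```

Calling handler_runs_in_child(1) takes the exit() path and handler_runs_in_child(0) takes the _exit() path. Only the first one lets the child run cleanup code that was meant for the parent's own shutdown.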
Satisfied with the explanation? It took me 10 minutes to write up. Time I could have been spending looking into this further.