a little analysis of lag


Crush
TMW Adviser
Posts: 8046
Joined: 25 Aug 2005, 16:08
Location: Germany

Re: a little analysis of lag

Post by Crush »

shargom wrote:Maybe just focus all your powers on developing manaserv, instead of literally wasting time on eAthena server:codename "The NeverEnding Problems"?
tmwAthena and Manaserv are developed by separate teams.
  • former Manasource Programmer
  • former TMW Pixel artist
  • NOT a game master

Please do not send me any inquiries regarding player accounts on TMW.


You might have heard a certain rumor about me. This rumor is completely false. You might also have heard the other rumor about me. This rumor is 100% accurate.
BoomerTheKran
Peon
Posts: 47
Joined: 11 Dec 2010, 21:07
Location: Kentucky, USA

Post by BoomerTheKran »

shargom wrote:Maybe just focus all your powers on developing manaserv, instead of literally wasting time on eAthena server:codename "The NeverEnding Problems"?
There are features we have that manaserv doesn't.

When I looked through my old chat logs and compared them to when guildbot was moved to the server, I noticed that's when I started seeing a lot more lag complaints.
Is guild chat written to disk?
If so, how often, and would that affect things?
When you log in, guildbot does several things (that I know of):
1. sends the motd
2. announces to others that you have entered guild chat
3. tells you how many are in guild chat
If guildbot is logging all that info and dumping it to disk constantly, along with the chats therein, I'd think it might slow things down.

I have an idea for a solution, though it would take a lot of extra coding on the client side. Remove guildbot. Have all guilds establish a password-protected IRC channel somewhere off Platyna's machines. Implement IRC in the client.
This IS a sig
Nard
Knight
Posts: 1113
Joined: 27 Jun 2010, 12:45
Location: France, near Paris

Post by Nard »

BoomerTheKran wrote:When I looked through my old chat logs, and compared it to when guildbot was moved to server, I noticed that's when I started seeing a lot more lag complaints.
Is guild chat written to disk?
if so, how often, and would that effect things?
The wiki and forum are on the same server. They probably have far more intensive disk access than guildbot has, with more reads than writes, and they transmit far more data. They could contribute to lag too.

I noticed that, since guildbot was transferred to Platinum, it lags too, and sometimes needs to be resynchronized because it fails to get and/or transmit the correct information when you log in.
BoomerTheKran wrote:I have an idea of a solution, tho it would take a lot of extra coding in the client side. Remove guildbot. Have all guilds establish a password protected IRC channel, somewhere off Platyna's machines. Implement IRC into the client.
Until o11c removed it, for reasons I still cannot understand, guild support was built into the game, and guild chat was included in that support.
"The language of everyday life is clogged with sentiment, and the science of human nature has not advanced so far that we can describe individual sentiment in a clear way." Lancelot Hogben, Mathematics for the Million.
“There are two motives for reading a book; one, that you enjoy it; the other, that you can boast about it.” Bertrand Russell, Conquest of Happiness.
"If you optimize everything, you will always be unhappy." Donald Knuth.
o11c
Grand Knight
Posts: 2262
Joined: 20 Feb 2011, 21:09
Location: ^ ^

Post by o11c »

Nard wrote:o11c removed it for reasons I still cannot understand
QFT
Former programmer for the TMWA server.

Post by Nard »

Myself in TEST: Illia Quest (http://forums.themanaworld.org/viewtopic.php?f=13&t=16225&p=130426#p130426) wrote:The third wave causes at least some lag, which is likely to kill the hero, and at worst (repeatable) ManaPlus crashes or disconnections. Some disconnections had a network cause (the player was still visible in the party tab and @whogm but invisible in the graphical user interface).
Concerning lag: we noticed that when the whole team stood in the desk corner behind Luvia, the lag was so severe that the game was unplayable.
We experimented with spawning many mobs (50 at minimum), with or without player clothes, with or without particle effects, in the botcheck area, and everything worked fine even with a lot of magic particles.
These last points (and several others) make me think that one of the main sources of lag could be path finding (it can be rather computationally intensive).

Team Client/OS:
  • ManaPlus 1.3.1.20/Ubuntu 12.04
  • ManaPlus 1.3.1.20/openSUSE 12.2
  • ManaPlus 1.3.1.20/Win 7
  • Mana 0.5/Win 8
Frost
TMW Adviser
Posts: 851
Joined: 09 Sep 2010, 06:20
Location: California, USA

Post by Frost »

In real life, I specialize in performance tuning of Linux servers. Here are some things I've learned about computer performance.
  • If your process is running on CPU1, and CPU2 is busy, that's irrelevant. If CPU3 is also busy, that's irrelevant.
  • If the computer has unused RAM, RAM is not a problem.
  • Linux processes write to the filesystem buffer. Disk commits happen asynchronously (basically non-blocking), unless you write code that specifically calls for a sync. eAthena does not do this.
  • Amount of disk used is irrelevant except in extreme situations (e.g. an SMTP server performing 4000 transactions per second, at 90% full). eAthena is nothing close to an extreme situation.
  • A computer has a limited number of Input/Output operations it can process per second. Performance is generally unaffected until the computer reaches some saturation of this. More below.
  • Performance data is independent of the tool used to measure it. Some tools are more accurate or more precise than others. This is why I get frustrated when someone insists that their favorite tool will give better answers. Unless you know how your tool is superior to the ones already used, you're arguing that your measuring tool will somehow improve absolute performance, which is nonsense.
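The point in the list above about buffered writes versus explicit syncs can be sketched in Python. This only illustrates the general mechanism and is not eAthena code: the function names and the scratch file are made up for the example. `os.fsync` is the standard call that blocks until the kernel reports the file's data has been committed.

```python
import os
import tempfile


def write_buffered(path, data):
    """Write and return once the data has been handed to the OS.

    Closing the file pushes Python's own buffer into the kernel's page
    cache; the kernel commits it to disk later, on its own schedule.
    """
    with open(path, "w") as f:
        f.write(data)


def write_synced(path, data):
    """Write, then block until the kernel reports the data is on disk."""
    with open(path, "w") as f:
        f.write(data)
        f.flush()             # Python buffer -> kernel page cache
        os.fsync(f.fileno())  # kernel page cache -> storage device (blocks)


if __name__ == "__main__":
    # Hypothetical scratch file, used only for this demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "savefile.txt")
        write_buffered(path, "buffered\n")
        write_synced(path, "synced\n")
        with open(path) as f:
            print(f.read(), end="")
```

Both functions leave the same bytes in the file. The difference is only in when the caller is allowed to continue, which is why a program that never asks for a sync is rarely held up by its own writes.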
Input/Output:
I have found I/O to be the most complex and poorly understood aspect of computer performance. Without going too far into the weeds, I'll relate some things I have learned from experience. Unfortunately, I don't have external sources for this.
Computers have different types of I/O. The most common limitation is disk access. (This is also the only type relevant on this server.) It's easy to see how any computer runs more slowly when it's busy accessing the hard drive. Read operations require the program to wait until it receives the data. Write operations are much more forgiving. Most operating systems like Linux and Windows buffer writes for a few seconds, so that programs can continue without waiting. This works well unless the demands are so heavy that the computer cannot keep up. In that situation, programs must wait for the buffer to have more space. Also, the computer itself can be so busy trying to catch up on disk operations that everything else gets bogged down. Note that occasional and minor disk operations have no measurable effect on performance.

To test whether disk I/O affects lag, I used iostat to measure disk operations on the Platinum server. I then used rsync to copy thousands of small files as fast as possible, asked players about lag in the game, and measured again. I cancelled the file copy, measured again, and again asked about lag.
Before the test, Platinum did about 100 reads+writes per second.
During the test, Platinum did 15000-23000 reads+writes per second.
After the test, Platinum did about 100 reads+writes per second.
Players reported no difference in lag when disk I/O increased by a factor of 200. I saw no difference in lag.
I think this test conclusively demonstrates that reducing disk operations on the server will have no measurable effect on lag.
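For anyone who wants to check the "factor of 200" figure, here is a small sketch of the arithmetic using only the numbers reported above. The function and variable names are invented for the example.

```python
def load_factor(baseline_ops, test_ops):
    """Return how many times the baseline rate the test rate is."""
    return test_ops / baseline_ops


BASELINE = 100               # reads+writes per second before and after the test
TEST_RANGE = (15000, 23000)  # reads+writes per second during the rsync copy

low, high = (load_factor(BASELINE, ops) for ops in TEST_RANGE)
print(f"{low:.0f}x to {high:.0f}x the baseline")
```

So the measured load was between 150 and 230 times the baseline, and "a factor of 200" is a fair round figure for it.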

Measurement tools:
Different tools have different degrees of accuracy, and of precision.
"top" is useful for an overall view. It presents data using a combination of averages (CPU metrics) and instantaneous samples (process state). Averages are not precise, and instantaneous samples are not necessarily accurate.
To measure disk activity, I used iostat at 1-second intervals. This gives detailed averages per device, and in my experience the results are accurate within the upper and lower data. (This is why I gave a range.)
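A rough sketch of why the measurement interval matters. The numbers here are hypothetical (they only borrow the magnitudes from the test above) and are not measurements from Platinum.

```python
from statistics import mean

# Hypothetical per-second disk operation counts over one minute:
# steady at 100 ops/s, with a single one-second burst of 20,000.
samples = [100] * 60
samples[30] = 20000

minute_average = mean(samples)  # what a one-minute average would report
peak = max(samples)             # what 1-second intervals let you see
one_instant = samples[0]        # what a single instantaneous look might show

print(f"average={minute_average:.0f} peak={peak} instant={one_instant}")
```

The one-minute average smears the burst into a figure that never actually occurred, and a single glance most likely misses the burst altogether. Short intervals reported as a range avoid both problems.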

One final note: if the main server is really limited by hardware performance, we should easily be able to reproduce the problem on the testing server, which is much weaker.
You earn respect by how you live, not by what you demand.
-unknown

Post by Nard »

If I/O (and network is I/O too) is managed the same way the databases are, we can lose a lot of time before we know the real cause(s) of lag.
I don't know how many packets are sent by the client or by the server in a situation as complicated as the Illia quest, but if I can liken them to pings, and a ping takes between 250 and 530 ms, as it did for me yesterday on the testing server/map server (ManaPlus info), then the delay between the info sent by my client and the answer from the server will be of at least the same order of magnitude, even if everything is perfect on the server and on my machine. 530 ms is half a second; that is enough to see a zombie jump on you in gy because the shortest path changed and the client could only extrapolate from previous info. Now if a higher-priority event such as disk I/O happens meanwhile...
I fear that lag has multiple causes and that only cross-correlating recorded data will give a good understanding of the problem. Anyway, I/O (both network and disk) is a critical node in the process, so watching it closely, and even concurrently with other activity, can only have positive results. For example, it could be interesting to record lag on the client side (perhaps the positions of some mobs on a laggy map during a laggy period).
I now think that path finding is also important to watch, because our experiments point to it. The lag we tested is reproducible and easy to repeat because it needs only the initial conditions; we are ready to show it to anyone who wants.



btw guild code was not a lag source so it can be reincluded safely :wink:

Post by Frost »

Nard wrote:If I/O (and network is I/O too) is managed the same way, as data bases are we can loose a lot of time before we know the real cause(s) of lags.
I used to manage a Checkpoint firewall on an older 2-core version of Platinum's hardware. It saturated 2 GigE NICs plus a 100Mb card with CPU to spare. I cannot imagine that interrupts from network I/O are a significant cause of lag here -- and if they are, we would see this reflected in the CPU instruction queue (aka "load".) Current load of 0.3 per CPU is not consistent with the theory that the server is limited by interrupts.
Nard wrote:530 ms is half of a second, it is enough to see a zombie jump on you in gy because of a change in minimal length path and the client could only extrapolate from precious info. Now if a higher priority event such as disk I/O happens meanwhile...
We've demonstrated that lag is similar at 100 or at 20,000 disk operations per second. Assume that 20,000 is the limit of the hardware (it's not). Assume that all 100 operations occur in the instant that causes the worst lag. That's 100 / 20,000 of a second: 5 ms.
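The worst-case estimate can be written out as a short sketch. The function name is made up; the figures are the ones from this post, and the 530 ms ping is the one reported earlier in the thread.

```python
def worst_case_delay_ms(pending_ops, ops_per_second):
    """Milliseconds needed to drain `pending_ops` at `ops_per_second`."""
    return pending_ops * 1000 / ops_per_second


# Figures from the post: 100 operations, 20,000 operations per second.
delay = worst_case_delay_ms(100, 20000)

# Compare against the 530 ms ping quoted earlier in the thread.
share_of_ping = delay / 530

print(f"{delay:.1f} ms, {share_of_ping:.1%} of a 530 ms ping")
```

Even under these deliberately pessimistic assumptions, the disk accounts for less than 1% of the delay the player already sees from the network round trip.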

If lag is caused by eAthena on Platinum reaching some performance limit, we should be able to easily reproduce it on weaker hardware.
borgomatic
Newly Registered User
Posts: 9
Joined: 29 Apr 2012, 17:13
Location: Down by the creek, Missouri, USA.

Post by borgomatic »

This thread is good reading! Thanks.

I've been developing my own (probably invalid) ideas about the lag cause, and there are two that seem to be worth looking at: 1) garbage collection sweeps are taking too long, and 2) there is a sync problem between clients and server that's causing surges in DB syncs.

But I'm an old coder, not up on today's OS and coding.

Back in the old days, freezes like this were usually linked to garbage collection delays in the application programs (http://www.linuxjournal.com/article/6679). With the larger RAM used now, I don't know if the vastly faster CPU speeds offset the larger task of collecting larger amounts of free RAM.

And the DB sync idea came from watching how monster objects seem to move differently in the client than on the server, causing sudden 'snaps' in their locations when the server 'corrects' the client's error. When I look at possible trajectories of how the server thinks the object moved during the lag vs the client's, they don't seem to be related, i.e. I'd expect the server's object to be on the same path as the client's, but delayed; instead it is divergent. This, to me, indicates the predictive coding in the client isn't accurately mirroring the server's. (I assume both client and server are using the same pseudo-random number algorithms and starting with the same seed numbers to stay synced, otherwise the server would have to be constantly updating every user's client for every move of every object on every map and obviously it isn't.)

So fire away, I have a tough hide.

Post by o11c »

borgomatic wrote:Back in the old days, freezes like this were usually linked to garbage collection delays in the application programs (http://www.linuxjournal.com/article/6679). With the larger ram used now, I don't know if the vastly faster CPU speeds offset the larger task of collection larger amounts of free ram.
There is no GC in the server. Memory handling is not optimal, but (barring the parts that relate directly to savefiles, which are terrible) it's good enough that anything I do to change it will likely make it slightly worse initially.

After I fix the savefiles, I might add profiling to see where things take a lot of time.

GC would not be helpful in any program I have ever looked at. The fact that garbage collectors are inefficient is not the main reason not to use them. Incidentally, GCC is finally taking steps toward getting rid of its GC.
borgomatic wrote:And the DB sync idea came from watching how monster objects seem to move differently in the client than the server, causing sudden 'snaps' in their locations when the server 'corrects' the client's error. When I look at possible trajectories of how the server thinks the object moved during the lag vs the client's, they don't seem to be related, Ie: I'd expect the server's object to be on the same path as the client's, but delayed, instead it is divergent. This, to me, indicates the predictive coding in the client isn't accurately mirroring the server's. (I assume both client and server are using the same pseudo-random number algorithms and starting with the same seed numbers to stay synced, otherwise the server would have to be constantly updating every user's client for every move of every object on every map and obviously it isn't.)
You think too much of this software. There isn't any such guarantee at all.

The pathfinding implementations on client and server are completely unrelated, and do not have any logic to ensure ties are broken in a predictable manner. There are typically a LOT of ties in pathfinding, but this is mostly an issue for things like this (S = start, E = end, X = unwalkable):

Code:

.....
S.X.E
.....
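To make the tie-breaking problem concrete, here is a minimal sketch on the map above. It is not the TMW client or server pathfinder: it assumes a plain breadth-first search with 4-directional movement, and the two "implementations" are hypothetical, differing only in the order in which they try neighbouring tiles.

```python
from collections import deque

GRID = [
    ".....",
    "S.X.E",
    ".....",
]


def find(grid, ch):
    """Return (row, col) of the first cell containing `ch`."""
    for r, row in enumerate(grid):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not in grid")


def shortest_path(grid, directions):
    """Breadth-first search; `directions` fixes the order neighbours are tried."""
    start, end = find(grid, "S"), find(grid, "E")
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == end:
            break
        for dr, dc in directions:
            nxt = (cell[0] + dr, cell[1] + dc)
            r, c = nxt
            if (0 <= r < len(grid) and 0 <= c < len(grid[0])
                    and grid[r][c] != "X" and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    # Walk back from the end to the start to recover the path.
    path, cell = [], end
    while cell is not None:
        path.append(cell)
        cell = parent[cell]
    return path[::-1]


UP, DOWN, LEFT, RIGHT = (-1, 0), (1, 0), (0, -1), (0, 1)

# Two hypothetical implementations that differ only in neighbour order.
path_a = shortest_path(GRID, [UP, RIGHT, DOWN, LEFT])
path_b = shortest_path(GRID, [DOWN, RIGHT, UP, LEFT])

print(len(path_a) - 1, path_a)
print(len(path_b) - 1, path_b)
```

Both searches return a path of the same length, but one goes around the top of the obstacle and the other around the bottom. Two programs that are each correct can therefore disagree about where a monster is at any given moment.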

There would be no benefit in synchronizing the RNG, and that *would* allow cheating. Clients receive updates only for a 14x14 area of the current map.

Remember also that one cause of a "snap" is that the client still thinks a monster is walking on the course the server last sent for it, but in the meantime the server has had the monster "think" again and it decided to do something else.

Post by Nard »

Frost wrote:
530 ms is half of a second, [...]

If lag is caused by eAthena on Platinum reaching some performance limit, we should be able to easily reproduce it on weaker hardware.
Our tests were made on the test server, not on Platinum. The lag was on the test server, and the 530 ms ping was on the test server too. The zombie jumps in gy happen on the test server, on germanTMW, on Auldsbel and on Platinum (on another fork too, but I can't remember which one). Sorry if I was not clear enough. I do think that lag is not a hardware issue but a software one. If hardware were the cause, we would also have noticeable issues with IRC, for example, which has never been the case, on my side at least.

Post by Nard »

Path finding in this game can be compared to two other problems I worked on: ray tracing and fluid mechanics. Both have to find "minimal" paths between several points (a 2D or 3D field, in fact): trajectories of light in the first, of matter particles in the second, while respecting certain boundary conditions (obstacles and collisions here). Both have to manage a lot of similar computations which, when done sequentially, consume a lot of time. Vector computing was invented to improve computation performance, followed by parallel and distributed computing. However, for various historical reasons, it was widely brought to microcomputing only in the 2000s (video cards, GPUs, game consoles... (yes, MMX was 1996)). Thus it is likely that Ragnarok and its clones have not taken full advantage of SIMD and of multiprocessor architectures.
I do not know the capabilities of current languages well enough to investigate that part of the server code and suggest a solution in that direction, if there is one. There are developers far more skilled than I am to do so, but I think it could be a step (along with I/O and DB management) towards a better understanding of the problem.


Note: guild code has nothing to do with this, apart from a little I/O.

Post by o11c »

We could get a slight improvement by passing -march=native in CXXFLAGS. This would prevent copying the binaries between systems, but we don't do that anyway. For most of us, we'd get a bigger improvement if we could natively compile the server for amd64, but unfortunately the main server is only x86.

Guild code is not known to have anything to do with lag, but lag is not the only problem with the server, just the one most visible to players.

Post by borgomatic »

I bow to the superior and much more current experience and knowledge of you experts. But as the King of Siam said in The King and I, 'is a puzzlement'!

Lag doesn't increase my ping, it stays at about 230 ms, so that speaks more to a delay on the server side. I frequently experience lag when trying to traverse a warp (it's hard to keep calm when a swarm of monsters is eating my behind and I can't get through the doorway!). I've had delays of two to five seconds when this happens. Probably a delay in my client, now that I think about it. Windoze will get internally laggy when it's busy checking for updates or my email program is checking my myriad email accounts.

I've also noticed some players' icons seem to move in a jerky, back-and-forth pattern when they pass by my character's location. Most others are smooth, as expected. Could it create or add to this problem if someone is using modified client code that's causing storms of packets when they move? Last night I asked one of them, 'lolz', what client he was using: mana+. Didn't think to ask what release.

Post by o11c »

borgomatic wrote:Lag doesn't increase my ping, it stays about 230ms, so that speaks to more a delay on the server side. I frequently experience lag when trying to traverse a warp
That's client-side lag, which is a known issue, but not the subject of this thread.