Desync while testing bzr8001[bug-1418154-collectors-teams]

Bug #1581828 reported by Klaus Halfmann
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
widelands
Fix Released
Critical
Unassigned

Bug Description

While testing bzr8001[bug-1418154-collectors-teams] with Gun,
I got a desync at about 0:24

We where playing Wisent Valley, me playing Atlanteans as black without team,
Team 1 barbarians, with gun
Team 2 Imperials, AIs only

Playing on OSX 10.11.4, FPS 25

GunChleoc on Ubuntu 16.04, FPS 25

Game speed setting doesn't seem to matter.

See also [https://bugs.launchpad.net/widelands/+bug/1545098]

Tags: multiplayer

Related branches

Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :
Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :
Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :
Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :
GunChleoc (gunchleoc)
description: updated
Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :
Revision history for this message
GunChleoc (gunchleoc) wrote :
Revision history for this message
GunChleoc (gunchleoc) wrote :

And the last couple of timestamps on the client side.

Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :

We switched speed between 0.5 and 1.5 as I like to play slow after a military building was occupied.
Can this provoke the desync?

Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :

Was able to reproduce this without changing the speed:

Host]: Requesting sync reports for time 1202971
[Host]: Client 0: Time 1202854
[Host]: Client 0: Time 1202971
[Host]: comparing syncreports for time 1202971
[Host]: Client 0 (1) sent player command 21 for 2, time = 1203100
[Host]: Client 0: Time 1203100
[Host]: Client 0: Time 1203615
[Host]: Requesting sync reports for time 1204068
[Host]: Client 0: Time 1204068
[Host]: comparing syncreports for time 1204068
[Host]: lost synchronization with client 0!
I have: c9e55ce2f3b4cfd7ce161e6d3e18f67c
Client has: b1494c3292318f153676dd02a80fae8c

based on wl_autosave_nethost_00.wgf

GunChleoc (gunchleoc)
Changed in widelands:
milestone: none → build19-rc1
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :

We both where on 25FPs which should be the default.

GunChleoc (gunchleoc)
description: updated
Revision history for this message
TiborB (tiborb95) wrote :

I think the line
[Host]: Client 0 (1) sent player command 21 for 2, time = 1203100
translates to: player 2 sent command SETWAREMAXFILL.

Can this command be a culprit?

Revision history for this message
Miroslav Remák (miroslavr256) wrote :

> I think the line
> [Host]: Client 0 (1) sent player command 21 for 2, time = 1203100
> translates to: player 2 sent command SETWAREMAXFILL.

The command is actually MessageSetStatusRead, not SetWareMaxFill.

Revision history for this message
Miroslav Remák (miroslavr256) wrote :

I can also reproduce the desync with wl_autosave_nethost_00.wgf. Happens during the execution of the CallEconomyBalance command, specifically here:
http://bazaar.launchpad.net/~widelands-dev/widelands/trunk/view/head:/src/economy/economy.cc#L774.

The client writes an additional economy request to the syncstream:

[sync:04762314 t=1352960] 00 -> request type (wwWARE)
[sync:04762315 t=1352960] 2b -> description index (43)
[sync:04762316 t=1352960] 2b 0c 00 00 -> target serial number

Gun's and Hasi's .wss files show the same thing, except the serial number is different.

Seems to be the same bug as #1545098.

Revision history for this message
TiborB (tiborb95) wrote :

Can it be that requests_ can differ between server and a client? One of requests was satisfied here (and removed from requests_) but still exist on the other side? This is just a speculation, I am not familiar with this stuff...

GunChleoc (gunchleoc)
tags: added: multiplayer
summary: - Dessync while testing bzr8001[bug-1418154-collectors-teams]
+ Desync while testing bzr8001[bug-1418154-collectors-teams]
Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :

Mhh, I did not mention that I configuread autosave every 5 minutes ...

Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :

had this again / still on bzr8010[trunk], SirVer we need some real specialist here.
Who can read those sysnc logs and tell us whats wrong here?

Revision history for this message
TiborB (tiborb95) wrote :

I tried to look into this. But I really dont understand the networking in the game...

Revision history for this message
SirVer (sirver) wrote :

> Who can read those sysnc logs and tell us whats wrong here?

Everybody can read them, but telling what's wrong is not too easy: The wss files are just the gamecommands that are send over the wire in binary format + some extra sentinel data that is run in heartbeat like places (for example the commandqueue). They should be completely identical for each player participating in a game. Diff'ing a hexdump of two wss files shows where they start to differ - looking through src/logic/queue_cmd_ids.h gives the information which packet was the culprit.

Having a 100% surefire way of reproducing a desync quickly locally would help debug this.

Revision history for this message
TiborB (tiborb95) wrote :

> in binary format

It would help id we have something that could convert wss file into more 'human readable' something. F.e. put one command into one line...

Revision history for this message
GunChleoc (gunchleoc) wrote :

What format is behind wss anyway?

So far, I haven't managed to trigger the desync by simply running the savegame, not even if I use 2 computers. So, there must be some form of player interaction involved.

Revision history for this message
SirVer (sirver) wrote :

> What format is behind wss anyway?

No real format, just the binary log of what player commands go back and forth and some state to check that everything is still in sync. grep for 'syncstream()' to see what gets written to it.

Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :

Was able to reproduce this with to computers om OSX in the same LAN, both bzr8016[trunk] one a SirVer daily build.

None of the Human players did anything at all (expect scrolling, reading news or such)
Happend with all AIs on normal mode
Did not happend with only player 2 (Red, Barbarian) being an Active AI
Happend with player 3 (Imperial, yellow) being the only active AI at at 23:10
Happend with player 4 (Imperial, green) being the only active AI at at 23:20

will investigate further

Revision history for this message
GunChleoc (gunchleoc) wrote :

When I look at where #13 comes from, the calling function up the chain is Economy::start_request_timer. This function is called in multiple spaces, so maybe adding some log lines there would help? I ran the savegame between my Linux and Windows machine and no desync. Locally on my Linux machine, no desync.

I also tried the hexdump and diff thing, which gives me the following info:

Last identical line:
   10c8230 0000 4700 2003 0000 2300 2003 0000 3000
Decimal:
   17596976 0 18176 8195 0 8960 8195 0 12288

First line with differences:

Hex:
Host: 10c8240 2003 0000 0d01 204e 0000 2b00 098a 0000
Client: 10c8240 2003 0000 0d01 204e 0000 cafa 02de 0000

Decimal:
Host: 17596992 8195 0 3329 8270 0 11008 2442 0
Client: 17596992 8195 0 3329 8270 0 51962 734 0

How do I turn this into useful information so that I can see the player command number?

And here are the commands I used, I think we should have an idiot's guide in the wiki - I definitely need one

   hexdump 2016-05-14T18.00.15_nethost.wrpl.wss > server.txt
   hexdump 2016-05-14T17.00.16_netclient_1.wrpl.wss > client.txt

   diff client.txt server.txt -y --suppress-common-lines | less

Revision history for this message
GunChleoc (gunchleoc) wrote :

#22: The scrolling and reading news might actually be the culprit, if one of the UI functions accidentally uses the wrong rng.

Revision history for this message
TiborB (tiborb95) wrote :

First numbers are obviosly order numbers:

10c81d. ....
10c81e. ....
10c81f. ....

The following digits are parsed by this http://bazaar.launchpad.net/~widelands-dev/widelands/trunk/view/head:/src/network/nethost.cc#L2145 ???

Revision history for this message
Miroslav Remák (miroslavr256) wrote :

#23:
Your hexdump output appears to be little-endian (e.g. cafa 02de should actually be faca de02). I would recommend using vbindiff to get more sensible byte-by-byte comparison with highlighting.

> How do I turn this into useful information so that I can see the player command number?
I've attached a picture explaining some of the "format". Also take a look at https://wl.widelands.org/wiki/DebuggingDesyncs/ if you haven't already. If you are able to reproduce, you can also get more readable text output by defining SYNC_DEBUG in src/game.cc.

Revision history for this message
Notabilis (notabilis27) wrote :

I think I found a source for a/one/the desync-bug.
In src/economy/economy.cc around line 960 the code checks the requirements to create a new worker (called when a new building is finished but no matching worker is currently in a warehouse). For this, the buildcosts of the worker are checked (e.g. for a fisher: carrier + fishing rod). However, for all buildcosts

DescriptionIndex id_w = tribe.ware_index(bc.first);

is called to get the ware index, even for the carrier. This results in id_w=255 (= not found since "barbarian_carrier" is no known ware). In

ware_target_quantity(id_w).permanent

this gives an array access behind the end of the target-quantity-array with varying return values; on each call and most important on client and server. Most of the time this is non-zero but sometimes it is, resulting in a different code path later on (and in the end, in a request which is only created locally).

Oh, and an easy way to produce this desync:
- Start a (localhost) multiplayer game
- On the client build three fisher huts (so the last one stays vacant)
- Build a connected street with multiple flags
- Split and reconnect this street so the game creates/destroys new economies (one per street-network). Doing this often enough results in a desync for me

Revision history for this message
GunChleoc (gunchleoc) wrote :

#26 Thanks, I didn't find that page in the wiki for some reason.

#27 Looks like we are missing a fallback to workers here when a ware is not found, and a safety check in the array access you mentioned.

Revision history for this message
TiborB (tiborb95) wrote :

Guys, I just created branch based on Notabilis's post, please review and test it:

lp:~widelands-dev/widelands/bug-1581828

Revision history for this message
GunChleoc (gunchleoc) wrote :

Regarding your NOCOM, if a tribe has less wares than the ID of the carrier, will this still requesr carriers where they are needed? The problem here is that the economy (unlike the tribes themselves) uses a concept of "wareworker".

Have a look at how the ProductionSiteDescr parses the outputs and inputs, I was thinking of having something similar:

https://bazaar.launchpad.net/~widelands-dev/widelands/trunk/view/head:/src/logic/map_objects/tribes/productionsite.cc#L81

Revision history for this message
TiborB (tiborb95) wrote :

But there are both:
void Economy::set_ware_target_quantity
void Economy::set_worker_target_quantity

there should not be a problem, or would be?

Also, I tested it for some time and my assert was not "activated"

But I dont understand fully the code...

Revision history for this message
GunChleoc (gunchleoc) wrote :

Ah, I see - sorry, I'm not really familiar with the economy code, and browsing it without an IDE is too complicated for me. So, code LGTM.

The TribeDescr should have a function "has_ware" and the Tribes a function "ware_exists" - you could use one of those for the assert. I think we should get your branch in ASAP for easier testing.

Revision history for this message
TiborB (tiborb95) wrote :

I modified the code and proposed for merging. But I did not test server/client game...

Revision history for this message
SirVer (sirver) wrote :

Big Kudos to Notabilis for tracking this down - that is a big achievement.

Thanks for making the branch and testing Tibor and Kaputtnik.

Revision history for this message
kaputtnik (franku) wrote :

Don't forget Klaus Halfmann ;)

Revision history for this message
kaputtnik (franku) wrote :

Shouldn't this be "Fix committed" ? Or do we need more tests?

SirVer (sirver)
Changed in widelands:
status: Confirmed → Fix Committed
GunChleoc (gunchleoc)
Changed in widelands:
status: Fix Committed → Fix Released
Revision history for this message
GunChleoc (gunchleoc) wrote :

Fixed in build19-rc1.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.