Ashley reported "the guild system isn't remembering who is in the guild over restarts." Spent an hour mapping the expedition load/save path, the web admin endpoints, the in-memory cache, the autosave — all of it looked correct. The diagnosis turned up something else entirely.
What actually happened
The game server's /expedition/known API was returning count: 0 (zero expeditions loaded into memory) despite 12 fully-populated rows in MariaDB. The expedition_members table had Ashley + Beric correctly assigned to The Crown. The web admin panel was inserting correctly, and the in-memory reload mechanism was correctly wired. Nothing was actively losing data.
The server boot died on May 26 at 01:08:29.
01:08:29 [FATAL] GameService - Failed up load client files! (G:\AAEmu3_client\game_pak)
01:08:29 [FATAL] GameService - Press Ctrl+C to quit
The game_pak file was being rewritten at that moment (last modified 01:11:51 — three minutes AFTER the failed boot). At the exact second the boot tried to open it, a deploy or patch had it locked. ClientFileManager couldn't acquire a read handle, GameService logged FATAL, and called return.
But return from StartAsync doesn't kill the process — the WebApiService thread had already started a few lines earlier. So:
- The process stayed alive (PID 43500, no CPU).
- The Echo gameservers panel showed it as "running" (the PID existed).
- The WebApi answered HTTP requests with empty data.
- ExpeditionManager.Load(), which runs much later in the boot sequence, never executed.
- Every player who logged in had their characters.expedition_id correctly read from DB, but ExpeditionManager.GetExpedition(id) returned null because the cache was empty.
- In-game: "you are not in any guild." Database: "Ashley + Beric are members of The Crown."
For roughly 1 hour 45 minutes, the server was a zombie. The fix was a simple stop+start once the pak file was free.
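The zombie state can be reproduced in miniature. The sketch below is not the AAEmu code: ZombieBootDemo, Boot, and ExpeditionsLoaded are hypothetical names, and it assumes the web API runs on a foreground thread (the real WebApiService may be hosted differently). It only illustrates why a plain return after a fatal error leaves a live process with an empty cache.

```csharp
using System;
using System.Threading;

public static class ZombieBootDemo
{
    // Set only by the late boot step; stands in for ExpeditionManager.Load().
    public static bool ExpeditionsLoaded;

    // Hypothetical boot sequence: web API first, client files second,
    // expedition cache last. Returns the web thread so callers can inspect it.
    public static Thread Boot(bool clientFilesOk, TimeSpan webLifetime)
    {
        // Foreground thread (IsBackground = false): the runtime keeps the
        // process alive until it finishes, whatever the boot method does.
        var web = new Thread(() => Thread.Sleep(webLifetime)) { IsBackground = false };
        web.Start();

        if (!clientFilesOk)
        {
            Console.WriteLine("[FATAL] client files failed to load");
            return web; // plain return: boot stops here, the web thread does not
        }

        ExpeditionsLoaded = true; // never reached on the failure path
        return web;
    }
}
```

Calling ZombieBootDemo.Boot(false, TimeSpan.FromSeconds(5)) leaves a running thread while ExpeditionsLoaded stays false, which has the same shape as the incident: a PID that exists, a web API that answers, and nothing in the cache.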
Two defensive patches
1. ClientFileManager retries on transient pak lock. When the pak file exists but can't be opened (the common case during a deploy), the manager now tries up to 6 times with 5-second delays, a 30-second window that covers a typical pak rewrite. Only after all 6 attempts fail does it log an error and give up. A successful retry logs "Pak opened on attempt N/6 after transient lock" so we can see when this fires in production.
2. GameService hard-exits on fatal. Both fatal paths in StartAsync (DB updater failure, client-files failure) now call Environment.Exit(1) instead of plain return. This kills the process. The Echo gameservers panel sees a dead PID and surfaces "stopped" cleanly. No more zombies. If you have the supervisor configured to auto-restart on death, the next attempt will pick up the now-free pak and boot normally.
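The two patches could look roughly like the following. This is a minimal sketch, not the shipped diff: PakOpener, OpenWithRetry, and OpenOrExit are hypothetical names, the FileShare mode and the fatal message wording are assumptions, and whether the real code also waits after the final failed attempt is not stated above (this version waits only between attempts).

```csharp
#nullable enable
using System;
using System.IO;
using System.Threading;

public static class PakOpener
{
    // Opens `path` for reading, retrying while the file exists but is locked.
    // Returns null if the file is missing or every attempt fails.
    public static FileStream? OpenWithRetry(string path, int maxAttempts = 6, int delayMs = 5000)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            // A missing file is not a transient condition, so do not retry it.
            if (!File.Exists(path))
                return null;

            try
            {
                var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
                if (attempt > 1)
                    Console.WriteLine($"Pak opened on attempt {attempt}/{maxAttempts} after transient lock");
                return stream;
            }
            catch (IOException)
            {
                // Typically a sharing violation while a deploy rewrites the file.
                if (attempt < maxAttempts)
                    Thread.Sleep(delayMs);
            }
        }

        Console.Error.WriteLine($"Pak still locked after {maxAttempts} attempts: {path}");
        return null;
    }

    // Fatal path: terminate the whole process instead of returning to the caller.
    public static FileStream OpenOrExit(string path)
    {
        var pak = OpenWithRetry(path);
        if (pak is null)
        {
            Console.Error.WriteLine("[FATAL] client files could not be loaded");
            Environment.Exit(1); // ends the process, including already-started threads
        }
        return pak!;
    }
}
```

Environment.Exit(1) is what turns the failure into something a supervisor can see: the PID disappears and the exit code is non-zero, where a plain return left both looking healthy.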
What this means for guild members
If Ashley adds someone to a guild via the panel and they don't see it in-game after relogging, the first thing to check is no longer "is the guild system broken." It's this request:
curl -H 'X-AAEmu-Auth: ...' http://127.0.0.1:1280/expedition/known
If that returns count: 0, you have a zombie boot, so restart the server. If it returns the full list, the membership genuinely loaded and the bug is elsewhere.
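The same check can be scripted. The sketch below assumes the endpoint returns a JSON object with a numeric count field, which is inferred from the "count: 0" output quoted above and not from a documented schema. ExpeditionCanary and its methods are hypothetical names, and the auth token is passed in by the caller.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class ExpeditionCanary
{
    // True when the payload is a JSON object reporting zero loaded expeditions.
    // Assumed shape: { "count": <number>, ... }
    public static bool IsZombieBoot(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;
        return root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty("count", out var count)
            && count.ValueKind == JsonValueKind.Number
            && count.GetInt32() == 0;
    }

    // Calls the local game server with the same header the curl command uses.
    public static async Task<bool> CheckAsync(string authToken)
    {
        using var http = new HttpClient();
        http.DefaultRequestHeaders.Add("X-AAEmu-Auth", authToken);
        var body = await http.GetStringAsync("http://127.0.0.1:1280/expedition/known");
        return IsZombieBoot(body);
    }
}
```

A true result from CheckAsync means "restart the server"; a false result means the cache is populated and the membership problem lies elsewhere.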
Behind the scenes
- Two C# files changed: ClientFileManager.cs (+24 lines, retry loop) and GameService.cs (+8 lines, two Environment.Exit(1) calls).
- No client patch. No DB migration. The patch is staged in bin/AAEmu.Game.dll and activates the next time the game server is restarted (the running process is on the older code; it's working fine right now).
- The current restart that fixed today's incident was a one-time manual stop+start. The patches make sure the same incident in the future surfaces as a clean "server down" instead of an opaque "guilds are broken."
If you ever see the /expedition/known count drop unexpectedly, that's the canary. File a ticket immediately, and don't try to add more members via the panel until the cache is back.