Forum: PogamutUT2004

Recovering from Fatal Errors

Hello,

I have both a general and a specific question. The specific question has to do with the following error that I've encountered in the latest SVN version of the code:
-----------------------------

(NBUTServer4) WARNING 06:58:14.268 MediatorWorker: Stopped.
(UCC) INFO 06:58:39.044 ID30 Opening user log ..\UserLogs\BotPrize27_1_2011_6_58_39.log
(UCC) INFO 07:00:24.107 ID16 Opening user log ..\UserLogs\BotPrize27_1_2011_7_0_24.log
(UCC) INFO 07:00:31.604 ID40 Opening user log ..\UserLogs\BotPrize27_1_2011_7_0_31.log
(UCC) INFO 07:01:39.915 ID30 Opening user log ..\UserLogs\BotPrize27_1_2011_7_1_39.log
(UCC) INFO 07:03:24.142 ID16 Opening user log ..\UserLogs\BotPrize27_1_2011_7_3_24.log
(UCC) INFO 07:03:31.619 ID40 Opening user log ..\UserLogs\BotPrize27_1_2011_7_3_31.log
(UCC) INFO 07:04:39.929 ID30 Opening user log ..\UserLogs\BotPrize27_1_2011_7_4_39.log
(UCC) INFO 07:06:24.132 ID16 Opening user log ..\UserLogs\BotPrize27_1_2011_7_6_24.log
(UCC) INFO 07:06:31.631 ID40 Opening user log ..\UserLogs\BotPrize27_1_2011_7_6_31.log
(UCC) INFO 07:07:39.976 ID30 Opening user log ..\UserLogs\BotPrize27_1_2011_7_7_39.log
(UCC) INFO 07:08:08.278 ID30 ATcpLink::execSendText
(UCC) INFO 07:08:08.294 ID30 (BotConnection DM-IceHenge.BotConnection @ Function GameBots2004.BotConnection.SendLine : 00FD)
(UCC) INFO 07:08:08.294 ID30 AActor::ProcessState
(UCC) INFO 07:08:08.294 ID30 Object BotConnection DM-IceHenge.BotConnection, Old State State GameBots2004.BotConnection.monitoring, New State State GameBots2004.BotConnection.monitoring
(UCC) INFO 07:08:08.294 ID30 AActor::Tick
(UCC) INFO 07:08:08.294 ID30 AInternetLink::Tick
(UCC) INFO 07:08:08.294 ID30 ATcpLink::Tick
(UCC) INFO 07:08:08.294 ID30 TickAllActors
(UCC) INFO 07:08:08.294 ID30 ULevel::Tick
(UCC) INFO 07:08:08.296 ID30 (NetMode=1)
(UCC) INFO 07:08:08.296 ID30 TickLevel
(UCC) INFO 07:08:08.296 ID30 UGameEngine::Tick
(UCC) INFO 07:08:08.296 ID30 Level DM-IceHenge
(UCC) INFO 07:08:08.297 ID30 UpdateWorld
(UCC) INFO 07:08:08.297 ID30 UServerCommandlet::Main
(UCC) INFO 07:08:08.339 ID30 Executing UObject::StaticShutdownAfterError
(UCC) INFO 07:08:09.311 ID30 General protection fault!
(UCC) INFO 07:08:09.311 ID30
(UCC) INFO 07:08:09.311 ID30
(UCC) INFO 07:08:09.311 ID30
(UCC) INFO 07:08:09.312 ID30 History: ATcpLink::execSendText
(Strange, my post was cut off in the middle. Here's the rest)

(UCC) INFO 07:08:09.312 ID30
(UCC) INFO 07:08:09.312 ID30 Exiting due to error
(UCC) INFO 07:08:09.312 ID30 Exiting.
(UCC) INFO 07:08:09.312 ID30 FileManager: Reading 0 GByte 59 MByte 110 KByte 873 Bytes from HD took 0.292000 seconds (0.238000 reading, -1.#IND00 seeking).
(UCC) INFO 07:08:09.313 ID30 FileManager: 2.307000 seconds spent with misc. duties
(UCC) INFO 07:08:10.427 ID30 Name subsystem shut down
(NBUTServer33) SEVERE 07:08:10.928 UT2004Parser: Can't parse next message: java.net.SocketException: Connection reset (caused by: java.net.SocketException: Connection reset)
cz.cuni.amis.pogamut.base.communication.parser.exception.ParserException: UT2004Parser: Can't parse next message: java.net.SocketException: Connection reset (caused by: java.net.SocketException: Connection reset) (at cz.cuni.amis.pogamut.base.communication.parser.impl.yylex.YylexParser.parse(YylexParser.java:107))
caused by: cz.cuni.amis.utils.exception.PogamutIOException: java.net.SocketException: Connection reset (at cz.cuni.amis.pogamut.base.communication.connection.impl.AbstractConnection$ConnectionReader.handleException(AbstractConnection.java:445))
caused by: java.net.SocketException: Connection reset (at java.net.SocketInputStream.read(SocketInputStream.java:168))
Stack trace:
ParserExceptionUT2004Parser: Can't parse next message: java.net.SocketException: Connection reset (caused by: java.net.SocketException: Connection reset)
at cz.cuni.amis.pogamut.base.communication.parser.impl.yylex.YylexParser.parse(YylexParser.java:107)
at cz.cuni.amis.pogamut.base.communication.translator.impl.WorldMessageTranslator.getEvent(WorldMessageTranslator.java:121)
at cz.cuni.amis.pogamut.base.communication.mediator.impl.Mediator$Worker.run(Mediator.java:299)
at java.lang.Thread.run(Thread.java:619)
Caused by: PogamutIOExceptionjava.net.SocketException: Connection reset
at cz.cuni.amis.pogamut.base.communication.connection.impl.AbstractConnection$ConnectionReader.handleException(AbstractConnection.java:445)
at cz.cuni.amis.pogamut.base.communication.connection.impl.AbstractConnection$ConnectionReader.read(AbstractConnection.java:418)
at cz.cuni.amis.pogamut.ut2004.communication.messages.gbinfomessages.Yylex.zzRefill(Yylex.java:4534)
at cz.cuni.amis.pogamut.ut2004.communication.messages.gbinfomessages.Yylex.yylex(Yylex.java:4777)
at cz.cuni.amis.pogamut.base.communication.parser.impl.yylex.YylexParser.parse(YylexParser.java:97)
... 3 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at cz.cuni.amis.pogamut.base.communication.connection.impl.AbstractConnection$ConnectionReader.read(AbstractConnection.java:402)
... 6 more
-----------------------------------

Something to keep in mind when looking at this log is that I'm running six servers simultaneously, and this error seems to have happened at the same time on each of the servers. Naturally, I would like this fixed.

However, my more general question is this: How do I detect errors like this within the code and handle them in such a way that I can recover from them? I'm doing evolution, so I run hundreds of evaluations, each on a new server. If one error comes up, I don't really care about it. I would like my code to automatically deal with the error by shutting down the offending server, maybe waiting a few minutes, and then relaunching it. However, I'm not sure where in my code I'm supposed to intercede to prevent these fatal errors from shutting down the Pogamut platform. All of the stack traces go back to Thread.run, which makes it hard to know where these threads are rooted in the code.

So basically, I would like to be able to detect the fatal errors within my code, and reset the server instead of closing the platform.

-Jacob

Hi!

This is a tough question...

Well, the first post you have refers to a general fault in UT2004, which reacts by shutting itself down. This naturally closes all the sockets your bots might have open, which results in the exceptions in the second post.

Basically, there is no easy workaround for that. An exception like this always tears down the whole server. Nevertheless, every IUT2004BotController has a method botShutdown() that is (or should be) guaranteed to be called even in the case of failure, allowing you to save your work, note that your bots have been stopped, or react however you wish after that.

I would address it this way:

1) Create some globally accessible variable (or perhaps one passed via parameters) that provides a means to report that your bot has been stopped/killed.
2) Upon receiving such a report, I would wait a bit (let's say 5 seconds) to see whether it affects all bots.
3) After that, I would decide whether the fact that the bots have stopped is OK or a failure.
4) OK -> proceed with the next evolution iteration.
5) Failure -> restart the evolution.
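
The steps above could be sketched roughly like this. It is only a minimal sketch in plain Java, not Pogamut API: BotStopMonitor, reportStopped, markEvaluationFinished and awaitOutcome are hypothetical names, and the rule used to tell OK from failure (a flag your evaluation code sets when it finishes normally) is an assumption, so adapt it to however your evolution knows that a run ended cleanly.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

/** Hypothetical helper (not part of Pogamut): collects "bot stopped" reports. */
class BotStopMonitor {
    enum Outcome { OK, FAILURE }

    // Thread-safe set of the bots that have reported a stop.
    private final Set<String> stopped = ConcurrentHashMap.newKeySet();
    // Released as soon as the first bot reports that it stopped.
    private final CountDownLatch firstStop = new CountDownLatch(1);
    // Assumption: the evaluation sets this before stopping bots on purpose.
    private volatile boolean evaluationFinished = false;

    /** Call when an evaluation has ended normally. */
    void markEvaluationFinished() {
        evaluationFinished = true;
    }

    /** Step 1: call this from each bot's botShutdown(). */
    void reportStopped(String botId) {
        stopped.add(botId);
        firstStop.countDown();
    }

    /**
     * Steps 2 and 3: block until the first report arrives, wait a grace
     * period so the other bots can report too, then classify the stop.
     */
    Outcome awaitOutcome(long graceMillis) {
        try {
            firstStop.await();
            Thread.sleep(graceMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and treat the interruption as a failure.
            Thread.currentThread().interrupt();
            return Outcome.FAILURE;
        }
        return evaluationFinished ? Outcome.OK : Outcome.FAILURE;
    }

    /** Number of distinct bots that have reported a stop so far. */
    int stoppedCount() {
        return stopped.size();
    }
}
```

Each bot's botShutdown() would call reportStopped(...) on a monitor shared with the evolution loop. The loop would call awaitOutcome(5000) and then either proceed with the next iteration (OK) or shut down and relaunch the server before retrying the evaluation (FAILURE).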

I do not know whether this makes any sense to you. It's a bit cryptic :-)

Best,
Jakub