The communication between the Server and Worker is the backbone of OTS. I will attempt to describe this area from various aspects.
The OTS code derives from code that was written to replace an unreliable piece of code. A central problem with that earlier code was that all the Workers had to be restarted if the Server went down, which was extremely labour intensive.
The three philosophies on Server crashing in Client-Server RPC
1. Wait until reboot and retry
2. Give up immediately and report failure
3. Guarantee nothing
This is a well-studied problem to which there is no "right" answer. The intention of this Wiki is to allow the correct trade-offs to be made.
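The three philosophies can be sketched as a policy applied when a send fails. This is only an illustration and not the OTS implementation: ``CrashPolicy``, ``ServerUnavailable`` and ``send`` are hypothetical names, and the bounded retry count stands in for a true "wait until reboot".

```python
import enum
import time


class CrashPolicy(enum.Enum):
    WAIT_AND_RETRY = 1      # wait until the Server is back, then retry
    GIVE_UP = 2             # fail immediately and report the failure
    GUARANTEE_NOTHING = 3   # fire and forget


class ServerUnavailable(Exception):
    """Hypothetical error raised when the Server cannot be reached."""


def send(publish, message, policy, retry_delay=1.0, max_retries=5):
    """Call publish(message), handling ServerUnavailable according to policy.

    Returns True if the message was handed over, False if it was dropped.
    """
    attempts = 0
    while True:
        try:
            publish(message)
            return True
        except ServerUnavailable:
            if policy is CrashPolicy.GUARANTEE_NOTHING:
                return False      # drop the message, the caller carries on
            if policy is CrashPolicy.GIVE_UP:
                raise             # surface the failure straight away
            attempts += 1
            if attempts > max_retries:
                raise             # assumption: retries are bounded in this sketch
            time.sleep(retry_delay)
```

The caller has the least to do under ``GUARANTEE_NOTHING``: there is no exception to handle and no retry state to keep.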
This slightly outdated figure shows the Server-Worker interaction model. The simple state machine is described here.
Server-Worker comms are as follows:
Also worth noting here is the Response Queue. This appears when the Task is initialised from the Worker and removed when the Server deems the Task to have finished.
The Worker could be unable to send a message to the Server for a number of reasons:
1. The Server is shut down or crashes during the run
2. A message is posted after the Response Queue has been removed by the Worker
3. Timed-out messages are picked up by new Workers
These issues are described below.
This approach has certain "characteristics":
This reduces pointless traffic load on the system, enables recovery, reduces the risk of timing / synchronisation issues, and so on.
This sweeps all problems in the Worker under the carpet and makes the code difficult to work with. There is nothing to distinguish a Worker with problems of its own from one of the Orphaned Messages scenarios outlined above (until the logs are examined).
This is not known to be a problem as long as the code never changes. The past couple of releases have seen bugs in this area for two different reasons:
This happened during the attempts to rationalise all the comms through AMQP and to make an AMQP logger. Given the async nature of logging a hardware system, it was at best difficult to ensure that all the logs were posted before the Response Queue was torn down.
Unsuspecting Developers adding messages to the system add them after the Response Queue has been torn down.
The system does not cater for this at the moment.
OTS has proved fairly robust. Brute force testing and significant use means that we can be confident that the test system will behave as it should. What has proved less satisfactory is the ability of the system to cope with change. This stems from the weaknesses set out above. To recap:
And the fact that changes in this area require more brute force testing adds to the difficulty of keeping the code base flexible.
There are advantages to be had in changing the philosophy.
Messages are posted from the Worker to the Server with AMQP. The ``basic_publish`` method takes a 'mandatory' flag. Setting this to False simplifies the Worker: it becomes less stateful, which is a desirable design characteristic.
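A minimal sketch of a non-mandatory post, assuming a channel object with a pika-style ``basic_publish(exchange, routing_key, body, mandatory=...)`` method. The AMQP client library that OTS uses may have a different signature, and ``post_to_server`` is a hypothetical helper name.

```python
def post_to_server(channel, routing_key, body, mandatory=False):
    """Publish one Worker message on an already-open AMQP channel.

    `channel` is assumed to expose a pika-style
    basic_publish(exchange, routing_key, body, mandatory=...) method.
    """
    channel.basic_publish(
        exchange="",              # default exchange: route by queue name
        routing_key=routing_key,  # e.g. the name of the Response Queue
        body=body,
        mandatory=mandatory,      # False: the broker drops unroutable messages
    )
```

Under AMQP 0-9-1 a broker silently drops a non-mandatory message that cannot be routed, whereas a mandatory one is returned to the publisher, which then has to handle the return.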
Posting messages as non-mandatory has the disadvantage that the Task continues to run after the server is down. This might be a waste of resources.
Might be? Thinking briefly I'm not sure - but there are no doubt countless other scenarios...
Guarantee nothing allows the Task to run. But pulling the next Task is only beneficial if the Server is back up and running again. Every Task pulled off whilst the Server is down has to be re-requested as a Testrun by the user, so it could be argued that there is some benefit in the delay caused by letting the Task run its course.
This causes Worker Error in the Give up model and subsequent support overhead. No-one is bothered by the 'Guarantee nothing' handling of this case.
In summary it is not that obvious that the Give up model is any more efficient than the Guarantee nothing model. And the Give up model comes at the expense of being more complicated and less robust.
All this is not to say that the Give up approach doesn't offer benefits in showing the failures immediately.
There is no reason why we can't employ different approaches within OTS - In fact that is what we are doing already:
Different messages can be treated in different ways. For example, the "amqp-logger" could use non-mandatory messages, since a stray log message shouldn't affect the run.
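One way to express this is a per-kind lookup for the 'mandatory' setting. This is a sketch only: the message kinds and the names ``MANDATORY_BY_KIND`` and ``is_mandatory`` are hypothetical, and only the "log" entry reflects a choice stated above.

```python
# Hypothetical message kinds; the real OTS message types may differ.
MANDATORY_BY_KIND = {
    "log": False,     # a stray log message should not affect the run
    "result": True,   # illustrative assumption: a lost result should be reported
}


def is_mandatory(kind, default=True):
    """Return the 'mandatory' setting to use for a message of the given kind."""
    return MANDATORY_BY_KIND.get(kind, default)
```

Defaulting unknown kinds to mandatory keeps the current behaviour unless a message kind is explicitly relaxed.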