Meego Wiki

User:Tgalvin/client-server rpc


Remote RPC calls between Server and Worker in OTS

Synopsis

The communication between the Server and Worker is the backbone of OTS. I will attempt to describe this area from various aspects.

History

The OTS code derives from code that was written to replace an unreliable predecessor. A central problem with the old code was that all the Workers had to be restarted if the Server went down, which was extremely labour intensive.

Client-Server RPC

The three philosophies on Server crashing in Client-Server RPC:

1. Wait until reboot and retry
2. Give up immediately and report failure
3. Guarantee nothing

This is a well-studied problem to which there is no "right" answer. The intention of this Wiki is to allow the correct trade-offs to be made.

OTS 0.1 -> 0.8

This slightly outdated figure shows the Server-Worker interaction model. The simple state machine is described here.

Server-Worker comms are as follows:

  • Control messages
  • Logging
  • File Transfer

Also worth noting here is the Response Queue. It is created when the Task is initialised from the Worker and is removed when the Server deems the Task to have finished.

Orphaned Messages

The Worker could be unable to send a message to the Server for a number of reasons:

1. Server is shut down / crashes during run 
2. Message Posted after the Response Queue has been removed by the Worker 
3. Timed out messages picked up by new workers  

These issues are described below.

1. Manual Worker restarts

  • This was "solved" by a retry in the Worker that is persistent, regular and frequent.
  • Any problems are swallowed up and the Worker restarts, trying to pull its next Task.

This approach has certain "characteristics":

1. The standard retry mechanism is random exponential back-off with a maximum number of retries; the Worker's retry is not

Back-off of that kind reduces pointless traffic load on the system, enables recovery and reduces the risk of timed or synchronised issues.
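As an illustration only, random exponential back-off with a maximum number of retries might look like the sketch below. None of these names come from OTS: `retry_with_backoff`, its parameters and the choice of `ConnectionError` as the retryable error are all assumptions made for the example.

```python
import random
import time


def retry_with_backoff(operation, max_retries=5, base_delay=1.0,
                       max_delay=60.0, retry_on=(ConnectionError,),
                       sleep=time.sleep):
    """Call operation() until it succeeds or max_retries is exhausted.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_retries:
                raise  # give up and let the caller see the real error
            # Exponential ceiling: base, 2*base, 4*base, ... capped at max_delay
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Randomise the wait so that Workers do not retry in lock-step
            sleep(random.uniform(0, ceiling))
```

Passing `sleep` in as a parameter is only there to make the behaviour easy to exercise without real waiting.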

2. The specific problems relating to comms in pyamqplib weren't studied, so the exception handler is a catch-all

This sweeps all problems in the Worker under the carpet and makes the code difficult to work with. Until the logs are examined, there is nothing to distinguish a Worker with problems of its own from one that has hit one of the Orphaned Messages scenarios outlined above.
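One way to picture the alternative to a catch-all is to handle only the failures that are understood and let everything else propagate. The sketch below assumes two hypothetical exception classes, `ServerUnavailable` and `ResponseQueueGone`; they are not pyamqplib or OTS names, and mapping real comms errors onto them is exactly the study that has not been done.

```python
import logging

log = logging.getLogger("worker")


class ServerUnavailable(Exception):
    """Hypothetical: the Server or broker could not be reached."""


class ResponseQueueGone(Exception):
    """Hypothetical: the Response Queue no longer exists."""


def run_once(pull_and_run_task):
    """Run one Task and report which known failure, if any, occurred."""
    try:
        pull_and_run_task()
        return "ok"
    except ServerUnavailable:
        log.warning("Server unavailable; will retry")
        return "retry"
    except ResponseQueueGone:
        log.error("Message posted to a removed Response Queue")
        return "orphaned"
    # Anything else is a genuine Worker fault and is allowed to propagate
```

With this shape, an orphaned message and a bug in the Worker no longer look the same from the outside.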

2. Message Posting after Response Queue is removed

This is not known to be a problem as long as the code never changes. The past couple of releases have seen bugs in this area for two different reasons:

Asynchronous Posting to Response Queue

This happened in the attempts to rationalise all the comms through AMQP and make an AMQP logger. Given the async nature of logging on a hardware system, it was at best difficult to ensure that all the logs were posted before the Response Queue was torn down.
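The ordering problem can be sketched with the standard library alone: log records are posted from a background thread, and an explicit `flush()` blocks until everything queued so far has been posted, so that it can be called before the Response Queue is torn down. `AsyncLogPoster` and its `post` callable are hypothetical names for illustration, not the OTS AMQP logger.

```python
import queue
import threading


class AsyncLogPoster:
    """Hypothetical sketch: post log records from a background thread."""

    def __init__(self, post):
        self._post = post  # callable that sends one record to the Server
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log(self, record):
        # Returns immediately; the record is posted later by the thread
        self._pending.put(record)

    def _drain(self):
        while True:
            record = self._pending.get()
            try:
                self._post(record)
            finally:
                # Mark the record as handled even if posting failed
                self._pending.task_done()

    def flush(self):
        # Block until every record queued so far has been handled
        self._pending.join()
```

Calling `flush()` immediately before teardown covers the records queued up to that point; it does nothing for records logged afterwards.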

Additions to the System

Unsuspecting Developers adding messages to the system add them after the Response Queue has been torn down.

3. Timed out messages

The system does not cater for this at the moment.

Moving Forward

OTS has proved fairly robust. Brute force testing and significant use mean that we can be confident that the test system will behave as it should. What has proved less satisfactory is the ability of the system to cope with change. This stems from the weaknesses set out above. To recap:

  • Swallowing exceptions in the Worker
  • Current design doesn't lend itself well to Asynchronous / Event Driven behaviour
  • Removal of Response Queue is implicit rather than explicit

And the fact that changes in this area require more brute force testing adds to the difficulties of having a flexible code base.

Obvious Improvements

  • Making the change to random, exponential back-off and removing the catch-all
  • Explicit tearing down of the Response Queue (or at least making it more obvious to work with - perhaps with Wikis such as this ;-))
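A minimal sketch of what explicit teardown could look like, assuming a hypothetical `ResponseQueue` wrapper (not the OTS class) around whatever actually sends the message: the queue is closed in one visible place, and a post after that point fails loudly instead of producing an orphaned message.

```python
class ResponseQueueClosed(Exception):
    """Hypothetical: raised when posting after teardown."""


class ResponseQueue:
    """Hypothetical wrapper that makes the queue lifetime explicit."""

    def __init__(self, name, send):
        self.name = name
        self._send = send  # callable(queue_name, message)
        self._closed = False

    def post(self, message):
        if self._closed:
            raise ResponseQueueClosed(self.name)
        self._send(self.name, message)

    def close(self):
        self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never hide an exception raised inside the block


sent = []
with ResponseQueue("task_1", lambda name, msg: sent.append((name, msg))) as rq:
    rq.post("started")
    rq.post("finished")
# From here on, rq.post(...) raises ResponseQueueClosed
```

The `with` block marks the lifetime of the queue in the code itself, which is the point a developer adding a new message needs to see.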

Trade offs

The Guarantee Nothing Scenario

There are advantages to be had in changing the philosophy.

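Messages are posted from the Worker to the Server with AMQP. The ``basic_publish`` method has a 'mandatory' argument. With it set, the broker returns a message that it cannot route to a queue; with it unset, such a message is silently dropped. Setting this to False simplifies the Worker. It is less stateful, which is a desirable design characteristic.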
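A hedged sketch of a non-mandatory post follows. It assumes an amqplib-style channel whose ``basic_publish`` accepts `exchange`, `routing_key` and `mandatory` keyword arguments. `RecordingChannel` is a stand-in so that the example runs without a broker, and `post_to_server`, the exchange name "ots" and the routing key are made up for illustration.

```python
class RecordingChannel:
    """Stand-in for an AMQP channel so the sketch runs without a broker."""

    def __init__(self):
        self.published = []

    def basic_publish(self, msg, exchange="", routing_key="", mandatory=False):
        self.published.append((msg, exchange, routing_key, mandatory))


def post_to_server(channel, msg, routing_key, exchange="ots", mandatory=False):
    """Publish a Worker message; by default nothing is guaranteed."""
    channel.basic_publish(msg, exchange=exchange, routing_key=routing_key,
                          mandatory=mandatory)


channel = RecordingChannel()
post_to_server(channel, "task finished", routing_key="response_queue_1")
```

Because the caller does not wait for or react to a returned message, the Worker keeps no delivery state for this post.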

Efficiency Savings

Posting messages as non-mandatory has the disadvantage that the Task continues to run after the server is down. This might be a waste of resources.

Might be? Thinking briefly I'm not sure - but there are no doubt countless other scenarios...

Scenario 1: Server crashes

Guarantee nothing allows the Task to run. But pulling the next Task is only beneficial if the Server is back up and running again. Every Task pulled off whilst the Server is down has to be re-requested as a Testrun by the user, so it could be argued that there is some benefit in the delay caused by letting the Task run its course.

Scenario 2: Timed out messages

This causes a Worker Error in the Give up model and a subsequent support overhead. No one is bothered by the Guarantee nothing handling of this case.

In summary, it is not that obvious that the Give up model is any more efficient than the Guarantee nothing model. And the Give up model comes at the expense of being more complicated and less robust.

Mixing models

All this is not to say that the Give up approach doesn't offer benefits in showing the failures immediately.

There is no reason why we can't employ different approaches within OTS - In fact that is what we are doing already:

  • At Task level we apply retries
  • Within the Task we give up.

Different messages can be treated in different ways. For example, the "amqp-logger" could use a non-mandatory message: a stray log message shouldn't affect the run.
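Treating message kinds differently could be as small as a lookup table. The kinds and the choices below are assumptions for illustration, not OTS configuration: log messages are fire-and-forget, and any kind that is not listed falls back to the stricter behaviour.

```python
# Hypothetical policy table: which message kinds must be routable
MANDATORY_BY_KIND = {
    "control": True,   # a lost control message should surface as a failure
    "log": False,      # a stray log message should not affect the run
}


def is_mandatory(kind):
    """Return the 'mandatory' flag to use when publishing this kind."""
    # Unknown kinds default to the stricter setting
    return MANDATORY_BY_KIND.get(kind, True)
```

Keeping the policy in one table also gives a developer adding a new message an obvious place to decide how it should behave.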
