SkyNET should support a collection of long-lived BOSS Participant
daemons which run as unprivileged users. It should be robust and offer
a simple management interface to control each participant
(start/stop/reload/status). It provides consolidated logging for the
participants. Developers need only minimal boilerplate in their
participant code and can use a simplified 'local running' mode for testing.
- As a system developer writing a participant:
- I want to be given a workitem and focus on my system interaction and returning data.
- I want a status/control thread with some stubs for cancel() and status() which I can just fill in for my participant (and maybe ignore)
- I don't want to know about daemons, logging or robust startup/shutdown when the system reboots. If the system is shutting down then don't give me any more work to do.
- If you tell me the system is shutting down I'll try to tidy up.
- I don't want to know about packaging scripts - just give me a simple template and instructions on sending it to OBS.
- I want to be able to run the participant on my local desktop with my "local BOSS". I don't mind running mini-SkyNET from my home directory.
- If a problem occurs in production I'd like to have as much debug information as possible; including a copy of the incoming workitem.
- If I want to do something clever then I can specify an alternative class to Exoskeleton in the config.
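The developer-facing surface described above might look something like this minimal sketch; the `Participant` base class, method names and workitem shape are assumptions for illustration, not the real SkyNET API:

```python
# Hypothetical sketch of the minimal participant interface; names are
# assumptions, not the real SkyNET classes.

class Participant:
    """Base class assumed to be supplied by the framework."""

    def handle(self, workitem):
        """Override: do the system interaction and return data."""
        raise NotImplementedError

    # Stubs the developer can fill in for their participant (or ignore):
    def cancel(self, wfid):
        pass

    def status(self, wfid):
        return "idle"


class EchoParticipant(Participant):
    """Example participant: copies one workitem field into another."""

    def handle(self, workitem):
        workitem["fields"]["result"] = workitem["fields"].get("input")
        return workitem
```

The daemon/logging/shutdown machinery would live outside this class, so the developer only writes `handle()` and, optionally, the two stubs.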
- As a sysadmin I want to manage running participant daemons
- I want to be able to start, stop and reconfigure them from the command line (including which start at boot time).
- I want to be sure they'll shut down and start up cleanly when the system reboots. I don't mind waiting a short while for them to shut down cleanly, but not too long.
- I want to see log information from them (and have them cope with logrotate)
- I should be notified of any problems (eg crashes)
- If a crash happens the participant should restart in a robust manner (probably rate limited).
- Installation should simply use the system package manager.
- SkyNET runs on openSUSE 11.2/11.3 and Debian Squeeze
- As a process writer I want to see what's going on
- I want to manage the names of participants (without having to change the code)
- I want to easily see output from participants
- I want to filter output by process (wfid) and only see some participant logs. (I don't mind if this is just command line tools doing greps)
- I want to be able to analyse the history of each participant's work to see how long they took and how many times they were called or had problems.
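The "just greps" filtering idea above could be a tiny helper like this; the assumption that the wfid appears verbatim in each log line is mine:

```python
import re

def filter_by_wfid(lines, wfid):
    """Keep only log lines mentioning the given workflow id (wfid).
    Assumes the wfid appears verbatim somewhere in each matching line."""
    pattern = re.compile(re.escape(wfid))
    return [line for line in lines if pattern.search(line)]
```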
SkyNET uses daemontools to manage the services.
A launch manager starts a wrapper process (Exoskeleton) with a control
Q; the wrapper then dynamically reads the user config, drops privileges,
and loads and runs (via execfile) the participant code.
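The launch sequence could be sketched as follows; the function name, the key=value config format and the config keys are assumptions, and `runpy.run_path` is a Python 3 stand-in for `execfile`:

```python
import os
import runpy

def launch(config_path):
    """Sketch of the Exoskeleton launch sequence (names are assumptions)."""
    # 1. Dynamically read the participant config (assumed key=value format).
    config = {}
    with open(config_path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                config[key.strip()] = value.strip()

    # 2. Drop privileges to the configured unprivileged user
    #    (only possible, and only attempted, when started as root).
    if os.getuid() == 0 and "user" in config:
        import pwd
        user = pwd.getpwnam(config["user"])
        os.setgid(user.pw_gid)
        os.setuid(user.pw_uid)

    # 3. Load and run the user logic; run_path returns its globals.
    return runpy.run_path(config["code"])
```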
To simplify installation, the Participants go in: /usr/lib/boss/participant-services/$P/
Installation is as a script via RPM, but it may depend on Python libraries.
The overall model is to use daemontools to manage a collection of
scripts running as services.
The pattern used is to run the execution code as a normal user and the
logging aggregator as another 'logging' user and connect the two using
pipes. The daemontools supervise monitors both processes and restarts
them in the event of a failure.
- /etc/boss/boss.conf Systemwide settings (typically amqp settings)
For each deployed participant 'P1' there is a service directory containing:
- ./run : Script run by 'supervise' - run as root and uses setuidgid to run 'exo'
- ./exo : Wrapper to connect to amqp and import user logic - runs as user. This is the meat of SkyNET
- ./config.exo : Supplements /etc/boss/boss.conf and defines participant specific data
- ./$SOMETHING : User logic code
- ./log/run : Logging code run as loguser
- ./supervise/ : binary format status information
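A hypothetical ./config.exo might look like this; the key names and ini-style format are assumptions for illustration — the real file only needs to supplement /etc/boss/boss.conf with participant-specific data:

```ini
; Hypothetical config.exo for participant P1 (keys and format assumed).
[participant]
name = P1
user = p1user
code = participant.py
; Per-participant AMQP overrides (otherwise taken from boss.conf):
;amqp_host  = amqp.example.com
;amqp_queue = P1
```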
- setuidgid : Used to reliably run each participant as its configured unprivileged user
- svscan : Part of the overall service execution and monitoring; ie it starts participants and loggers
- supervise : Each individual participant and log daemon is managed by a supervise process
- multilog : Log output is distributed using multilog
Status and control, usable from the CLI or the webui:
- svc : Used to perform individual participant control
- svok : checks (via exit status) whether a specific participant service is running
- svstat : overview of all participant services in a directory
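For a CLI or webui layer, these tools could be driven from Python. This sketch only builds the command lines; the svc flags are real daemontools flags, but the /service scan directory and function names are assumptions:

```python
SERVICE_ROOT = "/service"   # assumed daemontools scan directory

# Real daemontools svc flags: -u (up), -d (down), -t (send SIGTERM, then
# let supervise restart), -h (send SIGHUP).
_SVC_FLAGS = {"start": "-u", "stop": "-d", "restart": "-t", "reload": "-h"}

def control_command(participant, action):
    """Return the svc command line for one participant action."""
    return ["svc", _SVC_FLAGS[action], "%s/%s" % (SERVICE_ROOT, participant)]

def status_command(participant=None):
    """svstat for one participant, or an overview of all of them.
    The '*' form needs shell glob expansion (or glob.glob in Python)."""
    if participant:
        return ["svstat", "%s/%s" % (SERVICE_ROOT, participant)]
    return ["svstat", SERVICE_ROOT + "/*"]
```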
The Exoskeleton listens on 3 channels:
- BOSS workitem AMQP Q
- BOSS status/cancel AMQP Qs (per WFID channel - multiple if the participant handles concurrent workitems)
- SkyNET python Q
- Start: implicit on launch; both work/status Qs are listening.
- Shutdown: Stop listening on BOSS Q. Exit ASAP. Note that status/cancel queries may still be handled.
- ?? AboutToDie: Process will be terminated in 5 seconds.
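The Shutdown rule above — stop taking workitems but keep answering status enquiries — can be sketched with a flag; here queue.Queue stands in for the BOSS workitem AMQP Q, and all names are assumptions:

```python
import threading
import queue

# Once shutdown is requested, take no more workitems from the BOSS Q,
# but keep answering status enquiries until the process exits.
shutting_down = threading.Event()
work_q = queue.Queue()   # stand-in for the BOSS workitem AMQP Q

def work_loop(handled):
    """Consume workitems until shutdown is requested."""
    while not shutting_down.is_set():
        try:
            wi = work_q.get(timeout=0.1)
        except queue.Empty:
            continue
        handled.append(wi)

def status():
    """Status enquiries are still answered during shutdown."""
    return "shutting down" if shutting_down.is_set() else "running"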
Use the AIR library to handle cancel() and status() enquiries on the
WFID. Cancel and status handling should run in a separate thread and
will use a different AMQP channel for each WFID.
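Without assuming the AIR library's actual API, the per-WFID control thread might be shaped like this; queue.Queue stands in for the per-WFID AMQP channel, and the class and message names are assumptions:

```python
import threading
import queue

class ControlThread(threading.Thread):
    """Per-WFID status/cancel listener. queue.Queue stands in for the
    per-WFID AMQP channel the real code would get from the AIR library."""

    def __init__(self, wfid, participant):
        super().__init__(daemon=True)
        self.wfid = wfid
        self.participant = participant
        self.channel = queue.Queue()   # incoming control messages
        self.replies = queue.Queue()   # outgoing replies

    def run(self):
        while True:
            msg = self.channel.get()
            if msg == "status":
                self.replies.put(self.participant.status(self.wfid))
            elif msg == "cancel":
                self.participant.cancel(self.wfid)
                self.replies.put("cancelled")
                return
```

A participant handling concurrent workitems would run one such thread (and one channel) per WFID, matching the "multiple if the participant handles concurrent workitems" note above.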