Probe Life-cycle and Timing on the ecc_queue

Feb 26, 2018

This is often useful to me, but I'm forever forgetting the details, so here's a handy reference for myself and the few others who care about tracking down where mid server performance is leaking away.

Here's the flow of a probe (a sketch of step 1 via the REST Table API follows the list):

  1. Somebody creates an output record in the ecc_queue (output created time = now, state = ready)
  2. Load balancing of mids within a cluster occurs (currently in a business rule; a mid server is assigned randomly)
  3. An AMB message is sent to the assigned mid server, notifying it that it has work to do
  4. The mid queries the last 30 minutes of ecc_queue output records, retrieving any new work onto an internal queue
  5. The mid sends a queue.processing message back indicating which ecc_queue outputs were picked up for processing
  6. A business rule in the instance marks all of those outputs as processing (output processed time = now, state = processing)
  7. The mid has some number of worker threads grinding away, emptying the internal queue
  8. On completion, the result is inserted as an ecc_queue input record (input created time = now, state = ready). (The time reported in the input message is the time spent in the actual worker thread.)
  9. When the instance sees the input, a business rule marks the output processed (output updated time = now, state = processed)
  10. An asynchronous sensor job is scheduled (sys_trigger)
  11. The sensor starts (input processed time = now, state = processing)
  12. The sensor completes (input updated time = now, state = processed)
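
For the curious, here's roughly what step 1 looks like from outside the instance. This is a minimal sketch using the REST Table API (POST /api/now/table/ecc_queue); the instance URL, credentials, and the Command-topic record below are all hypothetical placeholders, and in practice these records are normally created by Discovery and other platform logic rather than by hand:

```python
# Minimal sketch: insert an ecc_queue output record via the REST Table API.
# INSTANCE and AUTH are placeholders; topic/name/payload describe a
# hypothetical Command probe, not a prescription.
import requests

INSTANCE = "https://your-instance.service-now.com"  # hypothetical
AUTH = ("api.user", "password")                     # hypothetical credentials

record = {
    "agent": "mid.server.my_mid",  # mid server (or cluster) that should run it
    "topic": "Command",            # probe type
    "name": "ls -l",               # for a Command probe, the command to run
    "queue": "output",             # instance -> mid direction
    "state": "ready",              # step 1: the record starts out ready
    "payload": "<parameters/>",
}

resp = requests.post(
    f"{INSTANCE}/api/now/table/ecc_queue",
    auth=AUTH,
    headers={"Accept": "application/json"},
    json=record,
    timeout=30,
)
resp.raise_for_status()
print("created output record:", resp.json()["result"]["sys_id"])
```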

So, if you look at the ecc_queue and watch the created, processed, updated, and state columns, you can see how long the probe takes to move through each stage of the pipeline:

  • Output queue time = output processed - output created
  • Mid server queue time = output updated - output processed - mid server exec time (from input payload)
  • Mid processing time = mid server exec time (from input payload)
  • Input queue time = input processed - input created
  • Input processing time = input updated - input processed

(You can find this logic in code form in the Discovery Timeline implementation, and, for small discoveries, you can view it graphically.)
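
If you'd rather compute the timings yourself, here's a minimal sketch of the arithmetic above in Python. It assumes the usual ecc_queue field names (created = sys_created_on, processed = processed, updated = sys_updated_on, all in the internal UTC glide_date_time format) and takes the mid exec time as a plain number of seconds, since the payload element it lives in varies by probe type:

```python
# Sketch of the queue-time arithmetic, assuming the standard ecc_queue fields:
# sys_created_on ("created"), processed, sys_updated_on ("updated"),
# all formatted as internal UTC glide_date_time strings.
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def ts(value: str) -> datetime:
    return datetime.strptime(value, FMT)

def probe_timings(output: dict, input_: dict, mid_exec_seconds: float) -> dict:
    """output / input_ are ecc_queue rows (e.g. from the Table API);
    mid_exec_seconds is the exec time reported in the input payload."""
    return {
        # Output queue time = output processed - output created
        "output_queue_s": (ts(output["processed"]) - ts(output["sys_created_on"])).total_seconds(),
        # Mid server queue time = output updated - output processed - exec time
        "mid_queue_s": (ts(output["sys_updated_on"]) - ts(output["processed"])).total_seconds()
                       - mid_exec_seconds,
        # Mid processing time = exec time from the input payload
        "mid_processing_s": mid_exec_seconds,
        # Input queue time = input processed - input created
        "input_queue_s": (ts(input_["processed"]) - ts(input_["sys_created_on"])).total_seconds(),
        # Input processing time = input updated - input processed
        "input_processing_s": (ts(input_["sys_updated_on"]) - ts(input_["processed"])).total_seconds(),
    }
```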

If you look at these times and see one or more of them shoot up to 45 minutes at a certain time every day, you have a problem.
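
One way to spot that kind of recurring spike is to bucket output queue time by hour of day. Here's a sketch, reusing the ts() helper and field-name assumptions from the previous snippet; the rows are assumed to come from the Table API with sysparm_query=queue=output^state=processed:

```python
# Sketch: average output queue time per hour of day, to expose a daily spike.
from collections import defaultdict
from statistics import mean

def queue_time_by_hour(rows: list[dict]) -> dict[int, float]:
    buckets: dict[int, list[float]] = defaultdict(list)
    for row in rows:
        created = ts(row["sys_created_on"])
        waited = (ts(row["processed"]) - created).total_seconds()
        buckets[created.hour].append(waited)
    return {hour: mean(waits) for hour, waits in sorted(buckets.items())}

# e.g. flag any hour that averages worse than 45 minutes:
# for hour, avg in queue_time_by_hour(rows).items():
#     if avg > 45 * 60:
#         print(f"hour {hour:02d}: avg output queue time {avg / 60:.1f} min")
```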

Diagnosing problems:

  • If the output queue time is huge, you may be seeing congestion in the API_INT semaphores due to excessive SOAP / REST traffic. Generally, API loading issues are rare until you push beyond about 100 mid servers, at which point you may want to start tuning more carefully. Options to consider:
    • decreasing the number of mid servers to reduce API load on the instance
    • upgrading to Kingston to get improved mid server load balancing
  • Excessive mid server processing or queueing time, and / or outputs stuck in the ready state, tend to indicate that your mids are overloaded or have stuck worker threads (a quick stuck-output check is sketched after this list). Options to consider:
    • pausing that mid server, waiting a few minutes for active probes to complete, and taking a stack trace; if there are worker threads that are not idle, you may need to restart the mid and consider upgrading to get the latest bug fixes
    • otherwise, increasing the number of mid servers
  • Excessive input queue / processing times indicate too much traffic for the number of nodes you have, or perhaps load balancing issues. Options to consider:
    • rebalancing discovery schedules to spread out activity more evenly throughout the day
    • increasing node count
    • upgrading to Kingston to get improved mid server load balancing
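
As a companion to the stack-trace check above, here's a sketch that counts outputs sitting in the ready state per mid server via the Table API; it reuses the hypothetical INSTANCE and AUTH placeholders from the first snippet. A mid whose count is large and still growing is a good candidate for the pause-and-stack-trace treatment:

```python
# Sketch: count ready (not-yet-picked-up) outputs per mid server.
from collections import Counter
import requests

resp = requests.get(
    f"{INSTANCE}/api/now/table/ecc_queue",
    auth=AUTH,
    params={
        "sysparm_query": "queue=output^state=ready",
        "sysparm_fields": "agent",
        "sysparm_limit": "10000",
    },
    headers={"Accept": "application/json"},
    timeout=60,
)
resp.raise_for_status()

stuck = Counter(row["agent"] for row in resp.json()["result"])
for agent, count in stuck.most_common():
    print(f"{agent}: {count} outputs waiting in ready")
```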

- Tim.


Original source: https://www.servicenow.com/community/itom-articles/probe-life-cycle-and-timing-on-the-ecc-queue/ta-p/2320897