User Story
As an SE, I’d like to have a monitoring system for TaskStack (TS) that alerts only when the service degrades in a way that could impact the customer’s business, so that I can take the necessary measures and avoid wasting time on false alerts.
As an SE, I’d also like to avoid situations where I receive a TaskStack alert only to determine that it should be ignored, so that I can save time in such cases.
As an SE, I want the system to alert the CSP Admin directly when an issue requires a business decision or action from the CSP, so that I don’t have to act as a “middleman.”
As a CSP Admin, I’d like to subscribe to alerts that may impact my business and require my action, so that I can make timely business decisions.
Example of use
Business model
Technology
Current Solution
The current monitoring of TaskStack generates too many false alerts — alerts that don’t require any action. In other words, even if we ignore them, the service remains unaffected and no one notices an issue.
This happens because we are monitoring the internals of TaskStack rather than the actual-vs-expected results of its work.
Stakeholders and their benefits
Who are the users / whom we bring value to?
| Benefit / Stakeholders | More Comfort | Increased Efficiency | Saves Time | Tighter Control | Replaces Human | Regulatory Requirement |
|---|---|---|---|---|---|---|
| CSP | ✓ | ✓ | | | | |
| Sales/marketing of CSP | | | | | | |
| Resellers / distributors | ✓ | ✓ | | | | |
| Network operations / Support of CSP | ✓ | ✓ | ✓ | | | |
| Developer | | | | | | |
| 3rd party | | | | | | |
| End user | ✓ | | | | | |
Use Cases
Use Case #1: Retry the connection (operation) instead of generating an alert
Roles: Support Engineer (SE)
Use Scenario #1.1: Connection to Gearman
Preconditions: The BA/system is configured to make 3 retry attempts to connect to Gearman, with a 5-second pause between each attempt.
During the unit task execution (UTE) (e.g. vd_notify), the error "...GEARMAN_COULD_NOT_CONNECT..." occurs when attempting to connect to Gearman.
Initially, instead of immediately ending with status CRITICAL, TS waits for 5 seconds, retries, and encounters the same error again.
TS repeats this cycle 3 times without success and eventually ends with status CRITICAL.
The monitoring system then triggers an alarm.
The SE investigates the issue and observes that periods of overload typically last less than 5 minutes.
To improve resilience, the SE updates the configuration to increase the retry interval from 5 seconds to 60 seconds, and the number of attempts from 3 to 5.
At the next UTE, the error "...GEARMAN_COULD_NOT_CONNECT..." occurs again.
TS waits 60 seconds before retrying, but the same error reappears.
TS repeats this cycle 5 times. On the 5th attempt, it successfully connects to Gearman.
TS is then able to continue the UTE and completes with status OK.
As a result, the monitoring system does not detect any issue.
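The retry behavior in this scenario can be sketched as follows. This is a minimal illustration, not the TaskStack implementation: the names `run_with_retries`, `ConnectionFailed`, and `make_flaky_connect` are hypothetical, and the sketch assumes that the configured number of attempts is the total number of connection attempts, with a pause between consecutive attempts.

```python
import time


class ConnectionFailed(Exception):
    """Hypothetical error raised when a backend connection attempt fails."""


def run_with_retries(connect, max_attempts, pause_seconds, sleep=time.sleep):
    """Call connect() up to max_attempts times, pausing between failed attempts.

    Returns ("OK", attempts_used) on success and ("CRITICAL", attempts_used)
    when every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
            return "OK", attempt
        except ConnectionFailed:
            # Pause only if another attempt is still allowed.
            if attempt < max_attempts:
                sleep(pause_seconds)
    return "CRITICAL", max_attempts


def make_flaky_connect(failures_before_success):
    """Build a fake connect() that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def connect():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionFailed("GEARMAN_COULD_NOT_CONNECT")

    return connect
```

Passing a replacement for `sleep` (for example, `pauses.append`) lets the two configurations from the scenario, 3 attempts with a 5-second pause and 5 attempts with a 60-second pause, be exercised without actually waiting.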
Use Scenario #1.2: Connection to Cassandra DB
Preconditions: The BA/system is configured to make 5 retry attempts to connect to Cassandra, with a 5-second pause between each attempt.
During UTE (e.g. customer_statistics), the error "Cassandra login failed ..." occurs when attempting to connect to Cassandra.
Initially, instead of immediately ending with status CRITICAL, TS waits for 5 seconds, retries, and encounters the same error again.
TS repeats this cycle 5 times without success and eventually ends with status CRITICAL.
The monitoring system then triggers an alarm.
The SE investigates the issue and observes that the periods of overload are related to the daily logrotate routine, which usually lasts less than 10 minutes.
To improve resilience, the SE updates the configuration to increase the retry interval from 5 seconds to 30 seconds, and the number of attempts from 5 to 10.
At the next UTE, the error "Cassandra login failed ..." occurs again.
TS waits 30 seconds before retrying, but the same error reappears.
TS repeats this cycle 7 times. On the 7th attempt, it successfully connects to Cassandra.
TS is then able to continue the UTE and completes with status OK.
As a result, the monitoring system does not detect any issue.
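When tuning these parameters, it may help to compare the time span covered by the retries with the typical outage duration. The sketch below is only an arithmetic aid under stated assumptions: the helper `retry_window_seconds` is hypothetical, the attempt count is treated as the total number of attempts, and the duration of each attempt itself is ignored.

```python
def retry_window_seconds(max_attempts, pause_seconds):
    """Time between the first and the last attempt, ignoring attempt duration."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return (max_attempts - 1) * pause_seconds


# (attempts, pause in seconds) from Use Scenarios #1.1 and #1.2,
# before and after the SE tunes the configuration.
configs = {
    "gearman_before": (3, 5),
    "gearman_after": (5, 60),
    "cassandra_before": (5, 5),
    "cassandra_after": (10, 30),
}
windows = {name: retry_window_seconds(*cfg) for name, cfg in configs.items()}
```

Under these assumptions the tuned Gearman configuration spans 240 seconds and the tuned Cassandra configuration spans 270 seconds, so an outage close to the observed upper bound (5 and 10 minutes respectively) could still outlast the retries.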
Use Scenario #1.3: Connection to KairosDB
Preconditions: The BA/system is configured to make 2 retry attempts to connect to KairosDB, with a 10-second pause between each attempt.
During UTE (e.g. update_system_load_statistic), the error "Cannot send the request..." occurs when attempting to connect to KairosDB.
Initially, instead of immediately ending with status CRITICAL, TS waits for 10 seconds, retries, and encounters the same error again.
TS repeats this cycle 2 times without success and eventually ends with status CRITICAL.
The monitoring system then triggers an alarm.
The SE investigates the issue and observes that the root cause is related to network connectivity issues that normally last less than 1 minute.
To improve resilience, the SE updates the configuration to keep the retry interval at 10 seconds and increase the number of attempts from 2 to 6.
At the next UTE, the error "Cannot send the request..." occurs again.
TS waits 10 seconds before retrying, but the same error reappears.
TS repeats this cycle 6 times without success and then ends with status CRITICAL.
The SE investigates the issue and determines that the root cause is an incorrect system configuration.
After the SE applies the fix, TS is able to run the UTE and completes successfully with status OK, despite temporary network issues.
As a result, the monitoring system does not report any issue.
Use Scenario #1.4: Connection to an external 3rd-party API service (e.g., SureTax)
Preconditions: The BA/system is configured to make 2 retry attempts to connect to an external service (an API service or any other service), with a 30-second pause between each attempt.
During UTE (e.g. statistics), an error occurs when attempting to send a request to the SureTax API (e.g., response code 500).
Initially, instead of immediately ending with status CRITICAL, TS waits for 30 seconds, retries, and encounters the same error again.
TS repeats this cycle 2 times without success and eventually ends with status CRITICAL.
The monitoring system then triggers an alarm.
The SE investigates the issue and observes that API service failures usually last less than 10 minutes.
To improve resilience, the SE updates the configuration to increase the retry interval from 30 seconds to 60 seconds, and the number of attempts from 2 to 10.
At the next UTE, the error "SureTax: status 500" occurs again.
TS waits 60 seconds before retrying, but the same error reappears.
TS repeats this cycle 4 times. On the 4th attempt, it receives a response indicating success.
TS is then able to continue the UTE and completes with status OK.
As a result, the monitoring system does not detect any issue.
Use Case #2: Re-run the task instead of generating an alert
Roles: Support Engineer
Preconditions: The monitoring system is configured to ignore repeated identical errors for three consecutive runs.
TS executes UTs (e.g., statistics) hourly according to a defined schedule within the allowed period Calculate_Statistics_At.
Use Scenario #2.1: Replication delay
Main Flow:
During UTE, TS encounters the error “Replication delay” when attempting to send a request to SlaveDB.
The re-run attempt counter for this unit task (UT) is increased from 0 to 1.
Instead of triggering an alarm, the monitoring system waits for the next scheduled run of the same UT.
On the next scheduled run (e.g., after 1 hour), TS encounters the same error “Replication delay” again.
The re-run attempt counter is increased from 1 to 2.
The monitoring system again waits for the next scheduled run instead of triggering an alarm.
On the following scheduled run, TS executes the UT and does not encounter the error (Status OK or a different error).
The re-run attempt counter is reset from 2 back to 0.
On the next scheduled run (e.g., the next day), TS encounters the error “Replication delay” again.
The two subsequent scheduled re-runs also result in the same error “Replication delay”.
The re-run attempt counter is increased from 0 to 3.
At this point, the monitoring system triggers an alarm.
Exception Flow:
If the UT succeeds before reaching the maximum number of re-run attempts, the counter is reset to 0, and no alarm is triggered.
Resolution:
The SE investigates the issue and determines that replication delays are typically caused by high load and last up to 4 hours.
To improve resilience, the SE updates the configuration to increase the maximum number of re-run attempts from 3 to 6.
Postconditions:
At the next UTE, the error “Replication delay” occurs again.
TS retries on the next scheduled run (e.g., after 1 hour) and succeeds with status OK.
The re-run attempt counter is reset from 4 to 0.
The monitoring system does not detect or report an issue.
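The counter logic of this use case can be sketched as follows. This is an illustrative sketch only: the class `RerunCounter` and its method names are hypothetical, and it assumes that the alarm fires as soon as the counter reaches the configured maximum (as in the main flow, where the counter goes from 0 to 3) and that a different error starts a new streak.

```python
class RerunCounter:
    """Tracks consecutive identical errors for one unit task (UT)."""

    def __init__(self, max_reruns):
        self.max_reruns = max_reruns
        self.count = 0
        self.last_error = None

    def record_run(self, error=None):
        """Record one scheduled run; return True if an alarm should be raised."""
        if error is None:
            # Successful run: the streak is over.
            self.count, self.last_error = 0, None
            return False
        if error != self.last_error:
            # Assumption: a different error starts a new streak.
            self.count, self.last_error = 0, error
        self.count += 1
        return self.count >= self.max_reruns
```

Replaying the main flow with `max_reruns = 3` raises the alarm on the third consecutive "Replication delay"; after the SE raises the limit to 6, the next failure moves the counter to 4 without an alarm, and the following success resets it to 0.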
Use Case #3: Notify the CSP Admin instead of alerting the SE when the issue requires a business decision
Use Scenario #3.1: Errors from external systems managed by the CSP and/or their vendor/partner
Use Scenario #3.1.1: Bad Response from E-Commerce Payment Gateway
Preconditions:
The BA/system is configured to make 3 retry attempts to an external service (payment processor or any other service), with a 5-second pause between each attempt.
The monitoring system is configured to ignore repeated identical errors for 3 consecutive runs.
CSP Admin is subscribed to periodic payment errors (via user notifications, template: “auto-payment error”).
Main Flow:
During UTE of auto_payment, the following error occurs when attempting to make an e-commerce payment:
“Payment failed, errcode 9999, error: Bad response from Authorize.Net payment gateway...”
Instead of immediately ending with status CRITICAL, TS waits 5 seconds, retries, and encounters the same error again.
TS repeats this cycle 3 times without success and eventually ends with status CRITICAL.
A notification is sent to the CSP Admin’s mailbox.
The monitoring system does not trigger an alarm (since the "ignore repeated identical errors for 3 consecutive runs" is not exceeded).
CSP Admin receives a notification and reviews the details of the failed payment (e.g., counter and/or list of the affected objects (payment/customer/account), details such as affected customer name/ID, timestamp, charged amount, error text explaining the reason, and a proposed action plan).
CSP Admin follows the proposed action plan, contacts Authorize.Net, and obtains a fix from the vendor.
On the subsequent UTE of auto_payment, the e-commerce payment is processed successfully.
Use Scenario #3.1.2: Taxation service rejects requests from PB
Preconditions:
- The BA/system is configured to make 3 retry attempts to an external service (payment processor or any other service), with a 5-second pause between each attempt.
The monitoring system is configured to ignore repeated identical errors for 2 consecutive runs.
- CSP Admin is subscribed to tax calculation errors (via user notifications, template: “tax calculation error”).
Main Flow:
During UTE of statistics/taxes, the following error occurs when attempting to calculate taxes:
“LoadRequest Failure - Request has invalid inputs: Failure - Invalid Validation Key\r\nSchema Errors”
Instead of immediately ending with status CRITICAL, TS waits for 5 seconds, retries, and encounters the same error again.
TS repeats this cycle 3 times without success and eventually ends with status CRITICAL.
- The monitoring system does not trigger an alarm (since the "ignore repeated identical errors for 2 consecutive runs" is not exceeded).
A notification is sent to the CSP Admin’s mailbox.
During the next run (e.g. next day), TS retries the UT without success (same error).
The monitoring system still does not trigger an alarm (since the "ignore repeated identical errors for 2 consecutive runs" is still not exceeded).
CSP Admin receives 2 notifications and reviews the details of the failed tax calculation.
CSP Admin follows the proposed action plan (e.g., disables tax calculation or renews the validation key).
On the subsequent UTE, the invoice is generated successfully.
- This specific error doesn't occur anymore.
Use Scenario #3.2: Errors that have to be fixed by CSP Admin
Use Scenario #3.2.1: Broken invoice template
Preconditions:
The monitoring system is configured to ignore repeated identical errors for 2 consecutive runs.
CSP Admin is subscribed to errors related to invoice generation (via user notifications, template: “regular invoice cannot be created...”).
Main Flow:
During UTE of customer_statistics, the following error occurs when attempting to generate an invoice:
“Failed to process /porta_var/tmp/invoices_2787_bLNDS/invoice-1.html HTML file, error: bless( ['undef','WHILE loop terminated (> 500000 iterations)...”
- The monitoring system does not trigger an alarm (since the "ignore repeated identical errors for 2 consecutive runs" is not exceeded).
A notification is sent to the CSP Admin’s mailbox.
During the 2nd run, TS retries the UT without success (same error).
- Another notification is sent to the CSP Admin’s mailbox. The alarm is still not triggered.
- During the 3rd run, TS retries the UT without success (same error).
The monitoring system triggers an alarm (since the "ignore repeated identical errors for 2 consecutive runs" is exceeded).
CSP Admin receives 3 notifications and reviews the details of the failed invoice.
- The SE detects the alarm on the monitor and opens a ticket with the CSP.
CSP Admin follows the proposed action plan (e.g., modifies the invoice template for the customer to a single-page format, or requests Support to increase the row limit).
On the subsequent UTE, the invoice is generated successfully.
- This specific error doesn't occur anymore.
Use Scenario #3.2.2: Declined E-Commerce payment
Preconditions:
The BA/system is configured to make 3 retry attempts to an external service (payment processor or any other service), with a 5-second pause between each attempt.
The monitoring system is configured to ignore repeated identical errors for 2 consecutive runs.
CSP Admin is subscribed to periodic payment errors (via user notifications, template: “auto-payment error”).
Main Flow:
During UTE of auto_payment, the following error occurs when attempting to make an e-commerce payment:
“Payment failed, errcode 2, error: This transaction has been declined...”
Instead of retrying after 5 seconds, TS ends with status CRITICAL.
A notification is sent to the CSP Admin’s mailbox.
TS repeats this cycle 3 times without success (same error).
The monitoring system triggers an alarm (since the "ignore repeated identical errors for 2 consecutive runs" is exceeded).
- The SE detects the alarm on the monitor and opens a ticket with the CSP.
- The CSP does not respond to the ticket, and the alarm remains active.
- CSP Admin keeps receiving the notifications and ignores them.
- The SE sets the parameter "ignore repeated identical errors for 2 consecutive runs" to null in order to suppress this type of alarm permanently.
- The monitoring alarm is now OFF (suppressed) and no longer occurs.
Use Scenario #3.2.3: Data entered by the CSP Admin is incorrect
Preconditions:
The monitoring system is configured to ignore repeated identical errors for 3 consecutive runs.
- CSP Admin is subscribed to tax calculation errors (via user notifications, template: “tax calculation error”).
Main Flow:
During UTE of statistics/taxes, the following error occurs when attempting to calculate taxes:
“SureTax plugin with enabled "individual_jurisdictions" - zip code is missing for customer”
A notification is sent to the CSP Admin’s mailbox.
TS retries the cycle three times without success.
The monitoring system does not trigger an alarm.
CSP Admin receives three notifications and reviews the details of the failed tax calculation (e.g., affected customer name/ID, timestamp, error text explaining the reason, and a proposed action plan).
CSP Admin follows the proposed action plan (e.g., enters ZIP code).
On the subsequent UTE, the invoice is generated successfully.
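The interplay between CSP Admin notifications and SE alarms across the scenarios of Use Case #3 can be sketched as follows. This is a hedged illustration, not the actual monitoring code: `process_run`, the `state` dictionary, and the `notify`/`alarm` callbacks are hypothetical names. The sketch assumes that every failed run produces a notification, that the alarm fires only when the streak of identical errors strictly exceeds the configured threshold (as in Use Scenario #3.2.1), and that a null threshold suppresses the alarm (as in Use Scenario #3.2.2).

```python
def process_run(state, error, ignore_threshold, notify, alarm):
    """Handle the outcome of one run of a unit task.

    state: dict with "error" and "count" keys describing the current streak.
    ignore_threshold: number of consecutive identical errors to tolerate;
        None suppresses the alarm entirely.
    notify / alarm: callbacks for the CSP Admin notification and the SE alarm.
    """
    if error is None:
        # Successful run: reset the streak, nothing to report.
        state["error"], state["count"] = None, 0
        return
    if error != state["error"]:
        # Assumption: a different error starts a new streak.
        state["error"], state["count"] = error, 0
    state["count"] += 1
    notify(error)  # the CSP Admin is notified about every failed run
    if ignore_threshold is not None and state["count"] > ignore_threshold:
        alarm(error)
```

With a threshold of 2, three failed runs yield three notifications and one alarm; with the threshold set to `None`, notifications continue but no alarm is raised.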
Use Case #4: Overdue tasks
User Story
As a PortaSwitch Platform Engineer, I need to control how many background tasks run simultaneously on a single web node (distinguishing between lightweight and heavy tasks), so that the system remains stable under load without blocking high-priority work that is important to the CSP or sacrificing throughput.
Current Solution
Today: Each web node runs a cron job every minute that launches the Task Scheduler (a function of TaskStack). TS picks the highest-priority pending task and runs it as a separate process. The only concurrency control is per priority: the system keeps a single slot for "normal" tasks and unlimited slots for "high" tasks.
Although this limitation is intended to prevent overload of the system (the nodes that run TS and the DB servers), it makes no distinction between heavy and lightweight tasks.
Pain points:
- A heavy task occupies the single normal slot for hours, blocking all other normal-priority tasks on that node for the entire duration. Only a few tasks can run in parallel (those marked as high-priority).
- The overdue-task monitor fires false alarms during legitimate long-running tasks and requires constant tuning of the "Number of overdue tasks" threshold, because it is difficult to judge the severity of a delay from the number of overdue tasks.
- Scaling to more web nodes improves throughput by accident, not by design: there is no slot-awareness across nodes (each node is independent). More nodes usually decrease the time spent on heavy parallelized tasks, thus increasing the time interval available for normal-priority tasks.
Current list of tasks with their priority, category, schedule.
Concrete current problem
Example 1 — Critical task delayed by a heavy job
On 2026-04-01 at 04:00:00, statistics starts on server Single_Node and occupies the single normal slot for the entire billing run. At 10:01:00, clone_tariff (normal priority, expected duration: ~1 minute) is queued. At 11:00:00, another task, xdr_rerating (normal priority, expected duration: ~3 minutes), is queued. Because Single_Node is fully occupied by statistics, clone_tariff and xdr_rerating cannot start until statistics finishes at 18:00:00: a delay of about 8 hours for a 1-minute task and 7 hours for a 3-minute task. The result is a delayed recalculated invoice and an unresolved billing dispute with the customer.
Example 2 — Overdue monitoring noise hides the real problem
Between 04:00:00 and 18:00:00, 30 additional tasks accumulate in the queue. The monitoring system fires an alarm: 30 tasks overdue. This alert gives no indication that xdr_rerating is the operationally critical task, nor that a single heavy job is the root cause of the entire backlog.
The SE cannot simply disable the overdue-task monitor — without it, a genuinely hung task that blocks the queue indefinitely would go undetected. At the same time, narrowing the Calculate_Statistics_At window is not an option: billing-cycle invoices at the start of the month are the highest-priority business operation and must complete as early as possible, without artificial time constraints.
Solution and Key changes:
- New boolean value `is_heavy` on the `tasks` table is configurable by SE.
- New system configuration parameter `HeavySlotsPerNode` defines the number of slots for heavy tasks on each TS node. 0 - disabled.
- New TS startup logic: before executing, TS counts the heavy tasks currently running on each node, checks the slot availability, and exits if the limit is reached. If there is an available slot on a TS node, TS runs the task on that node.
- New TS startup logic: TS runs an unlimited number of non-heavy tasks on each node.
- New monitoring option `MaxWaitingTimeBeforeAlert` is configurable by SE. 0 - disabled.
- New monitoring option `MaxRunningTimeBeforeAlert` is configurable by SE. 0 - disabled.
- New monitoring rule: alert if any task is overdue for more than `MaxWaitingTimeBeforeAlert`.
- New monitoring rule: alert if any task is running for more than `MaxRunningTimeBeforeAlert`.
- Removal of the legacy "overdue task" monitor.
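The per-node startup check described in the key changes could look roughly like the sketch below. The function `may_start` and its parameter names are hypothetical, and the sketch assumes that `HeavySlotsPerNode = 0` means the heavy-slot limit is not enforced; the text only says "0 - disabled", so the intended meaning should be confirmed.

```python
def may_start(task_is_heavy, running_heavy_on_node, heavy_slots_per_node):
    """Decide whether the TS instance on one node may start the next task."""
    if not task_is_heavy:
        return True  # non-heavy tasks are not limited by heavy slots
    if heavy_slots_per_node == 0:
        return True  # assumption: 0 disables the heavy-slot limit
    # A heavy task needs a free heavy slot on this node.
    return running_heavy_on_node < heavy_slots_per_node
```

For example, with one heavy slot per node, a heavy task may start on an idle node but not on a node that already runs a heavy task, while a non-heavy task may start in both cases.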
Preconditions:
- PortaSwitch deployment with 2 web nodes: `web-node-1` and `web-node-2`.
- `Tasks.HeavySlotsPerNode = 1` (each node allows at most 1 heavy task concurrently).
- `Tasks.MaxWaitingTimeBeforeAlert = 1 hour`;
- `Tasks.MaxRunningTimeBeforeAlert = 24 hours`;
- Procedures `statistics`, `charge_subscriptions`, `CDR_cleanup`, `recalculate_invoices`, and `xdr_rerating` (all `priority=normal`) are configured by SE with `is_heavy = True`.
- Procedures `clone_tariff`, `update_xchange_rates`, `did_cleanup`, `ua_profile`, `mark_inactive_rates`, and `notify_monitor` (all `priority=normal`) remain `is_heavy = False`.
- Cron fires Task Scheduler (TS) every 60 seconds on each node independently.
- At 2026-05-19 04:00:00, no tasks are running on either node.
Use Scenario #4.1 — Heavy task starts on first available node; second heavy task takes the other node
- At 2026-05-19 04:00:00, `statistics` (is_heavy=True) is due. TS starts `statistics` on `web-node-1`. Heavy slot state on `web-node-1`: 1/1 used.
- At 2026-05-19 10:01:00, `xdr_rerating` (is_heavy=True) is due. There are no other overdue tasks at the moment.
- The TS instance on `web-node-1` detects its heavy slot is 1/1 occupied and exits without starting `xdr_rerating`.
- In 30 sec, at 2026-05-19 10:01:30, another task `recalculate_invoices` (is_heavy = True) is due.
- At 2026-05-19 10:01:45, the TS instance on `web-node-2` detects its heavy slot is 0/1 free and starts `xdr_rerating` on `web-node-2`.
- Heavy slot state on `web-node-2`: 1/1 used.
Expected result:
statisticsruns onweb-node-1with statusrunning,started_at = 2026-05-19 04:00:02.xdr_reratingruns onweb-node-2with statusrunning,started_at = 2026-05-19 10:01:45.- Both nodes have heavy slots fully occupied; no node is over-allocated.
- Log entries on both nodes record the heavy slot decision with node_id and procedure name.
Use Scenario #4.2 — Non-heavy task runs in parallel with a heavy task on the same node
Preconditions: Following the sequence of Use Scenario #4.1.
- At 2026-05-19 10:02:00, `clone_tariff` (priority=normal, is_heavy=False) is due.
- The TS instance on `web-node-1` detects that the heavy slot constraint does not apply to non-heavy tasks and starts `clone_tariff` on `web-node-1`.
- `clone_tariff` completes at 2026-05-19 10:06:05 (5 minutes 3 seconds total runtime).
Expected result:
- `clone_tariff` starts at `2026-05-19 10:01:02`, with no wait imposed by the heavy slot.
- `statistics` continues running on `web-node-1` without interruption.
- Concurrent processes on `web-node-1`: `statistics` (heavy) and `clone_tariff` (non-heavy).
- The 8-hour delay described in the current problem statement is eliminated for non-heavy tasks.
Use Scenario #4.3 — Third heavy task waits when both nodes are saturated
Preconditions: Following the sequence of Use Scenario #4.2. Both cluster heavy slots are occupied: statistics on web-node-1, xdr_rerating on web-node-2.
- At 2026-05-19 10:05:00, `recalculate_invoices` (is_heavy=True) is due. The TS instances on both nodes detect heavy slot 1/1 occupied on `web-node-1` and 1/1 occupied on `web-node-2` respectively → `recalculate_invoices` stays `pending` cluster-wide.
- At 2026-05-19 11:30:00, `xdr_rerating` completes on `web-node-2`. Heavy slot on `web-node-2`: 0/1.
- At 2026-05-19 11:31:02, TS detects a free heavy slot on `web-node-2` and starts `recalculate_invoices` on `web-node-2`.
Expected result:
- `recalculate_invoices` starts at `2026-05-19 11:31:02`, delayed by 1 hour 26 minutes due to cluster-wide heavy slot saturation.
- Neither `web-node-1` nor `web-node-2` exceeds `Tasks.HeavySlotsPerNode` at any point.
- During the wait, `recalculate_invoices` remains `pending`; no system errors occur, no monitoring alarm is triggered.
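The placement sequence of Use Scenarios #4.1 to #4.3 can be replayed with a small simulation. This is a simplified sketch under stated assumptions: `place_task` is a hypothetical helper, and the nodes are tried in a fixed order inside one loop, whereas in the described design each node's TS instance runs independently from cron.

```python
def place_task(is_heavy, heavy_running, heavy_slots_per_node):
    """Return the first node that may start the task, or None if it stays pending.

    heavy_running maps node name -> number of heavy tasks running on that node.
    """
    for node, running in heavy_running.items():
        if not is_heavy or running < heavy_slots_per_node:
            if is_heavy:
                heavy_running[node] = running + 1  # occupy a heavy slot
            return node
    return None


heavy_running = {"web-node-1": 0, "web-node-2": 0}
placements = [
    ("statistics", place_task(True, heavy_running, 1)),
    ("xdr_rerating", place_task(True, heavy_running, 1)),
    ("clone_tariff", place_task(False, heavy_running, 1)),
    ("recalculate_invoices", place_task(True, heavy_running, 1)),
]
heavy_running["web-node-2"] -= 1  # xdr_rerating completes on web-node-2
retry = place_task(True, heavy_running, 1)  # recalculate_invoices is tried again
```

In this replay `recalculate_invoices` is first left pending (`None`) because both heavy slots are taken, and it is placed on `web-node-2` once that slot is released, while the non-heavy `clone_tariff` is placed immediately.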
Use Scenario #4.4 — Alert fires when a single task waits longer than the configured threshold
Preconditions: Following the sequence of Use Scenario #4.3. recalculate_invoices (is_heavy=True) is pending since 2026-05-19 10:05:00, with both cluster heavy slots saturated.
- At 2026-05-19 11:05:00, `recalculate_invoices` has been `pending` for exactly 60 minutes = `Tasks.MaxWaitingTimeBeforeAlert`. The monitoring system fires an alert.
- Alert payload in log: `{ procedure: "recalculate_invoices", reason: "heavy_slots_saturated"}`.
- At 2026-05-19 11:31:02, the heavy slot on `web-node-2` frees and `recalculate_invoices` starts running.
- The waiting alert auto-resolves once `recalculate_invoices` is `running`.
Expected result:
- Single, targeted alert identifying the specific waiting procedure (`recalculate_invoices`), not an aggregate "30 tasks overdue" message.
- Alert includes the actual wait duration and the root-cause reason (`heavy_slots_saturated`), helping the SE diagnose the issue immediately.
- Alert auto-resolves when the task starts.
- No alerts produced by the legacy "Number of Overdue Tasks" monitor (it is removed).
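The waiting-time rule from this scenario might be evaluated along the following lines. The function `waiting_alerts`, its parameters, and the `waiting_seconds` field are hypothetical; the sketch assumes the alert fires once the wait reaches the threshold (the scenario fires it at exactly 60 minutes) and that a zero threshold disables the rule.

```python
from datetime import datetime, timedelta


def waiting_alerts(pending_tasks, now, max_waiting, reason="heavy_slots_saturated"):
    """Return one alert per pending task whose wait reached the threshold.

    pending_tasks maps procedure name -> time the task became due.
    max_waiting is a timedelta; timedelta(0) disables the rule.
    """
    if max_waiting == timedelta(0):
        return []
    return [
        {
            "procedure": name,
            "reason": reason,
            # Hypothetical field carrying the actual wait duration.
            "waiting_seconds": int((now - due_at).total_seconds()),
        }
        for name, due_at in sorted(pending_tasks.items())
        if now - due_at >= max_waiting
    ]
```

Because the alerts are recomputed from the tasks that are still pending, a task that has started simply drops out of the result, which corresponds to the auto-resolve behavior described above.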
Use Scenario #4.5 — SE adjusts the monitoring option Tasks.MaxWaitingTimeBeforeAlert to suppress acceptable waits
Preconditions: Following the sequence of Use Scenario #4.4.
- At 2026-05-19 11:20:00, SE reviews the alert and determines that 1-hour waits are expected and acceptable.
- SE changes the monitoring option `Tasks.MaxWaitingTimeBeforeAlert` from `1 hour` to `4 hours`.
- At 2026-06-01 10:05:00, the same waiting scenario occurs. No alert fires while wait time < 4 hours.
- At 2026-06-01 14:05:00, wait time reaches 4 hours → alert fires.
Expected result:
- Alerts fire only for genuinely problematic waits, as defined by the SE.
- No code change or service restart required to adjust the threshold.
Use Scenario #4.6 — Alert fires for a heavy task exceeding its running-time threshold
- At 2026-05-18 04:00:00, `statistics` (is_heavy=True) starts on `web-node-1`.
- At 2026-05-19 04:00:00, `statistics` has been `running` for 24 hours = `Tasks.MaxRunningTimeBeforeAlert` (monitoring option). The monitoring system fires an alert.
- Alert payload: `{ procedure: "statistics", reason: "task_running_too_long"}`.
- `statistics` continues running; the alert is informational, not a kill signal.
- The alert is resolved when the task completes.
Expected result:
- One alert fires at the 24-hour mark, identifying the task (procedure), node and reason.
- Task continues running; no automatic interruption.
- SE has actionable information (procedure name, node, reason) to diagnose whether the task is genuinely stuck.
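A possible shape for the running-time rule is sketched below. The function `running_alert` and the `node` field name are hypothetical; the sketch assumes the alert fires once the runtime reaches the threshold, that a zero threshold disables the rule, and that the check never interrupts the task.

```python
from datetime import datetime, timedelta


def running_alert(procedure, node, started_at, now, max_running):
    """Return an informational alert dict, or None if the task is within limits."""
    if max_running == timedelta(0):
        return None  # 0 disables the rule
    if now - started_at < max_running:
        return None
    # The alert only informs the SE; the task itself is left running.
    return {"procedure": procedure, "node": node, "reason": "task_running_too_long"}
```

For the scenario above, a call with `started_at = datetime(2026, 5, 18, 4, 0, 0)`, `now = datetime(2026, 5, 19, 4, 0, 0)`, and `max_running = timedelta(hours=24)` returns the alert for `statistics` on `web-node-1`.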
Non-functional requirements
A backport to 125-6, 120-6, 115-6, and 110-6 is required. This will allow Support to benefit from the enhancement immediately, without having to wait two years for the CSP to migrate to MR131+.