Reviewers (YT:PMD-3322):

User Story

As an SE, I’d like to have a monitoring system for TaskStack (TS) that alerts only when the service degrades in a way that could impact the customer’s business, so that I can take the necessary measures and avoid wasting time on false alerts.

As an SE, I’d also like to avoid situations where I receive a TaskStack alert only to determine that it should be ignored, so that I can save time in such cases.

As an SE, I want the system to alert the CSP Admin directly when an issue requires a business decision or action from the CSP, so that I don’t have to act as a “middleman.”

As a CSP Admin, I’d like to subscribe to alerts that may impact my business and require my action, so that I can make timely business decisions.

Example of use

Business model

Technology

Current Solution

The current monitoring of TaskStack generates too many false alerts — alerts that don’t require any action. In other words, even if we ignore them, the service remains unaffected and no one notices an issue.

This happens because we are monitoring the internals of TaskStack rather than the actual-vs-expected results of its work.

Stakeholders and their benefits

Who are the users / whom we bring value to?

Benefit / Stakeholders	More Comfort	Increased Efficiency	Saves Time	Tighter Control
CSP			✓	✓
Sales/marketing of CSP
Resellers / distributors			✓	✓
Network operations / Support of CSP		✓	✓	✓
Developer
3rd party
End user	✓

Use Cases

Use Case #1: Retry the connection (operation) instead of generating an alert

Roles: Support Engineer (SE)

Use Scenario #1.1: Connection to Gearman

Preconditions: The BA/system is configured to make 3 retry attempts to connect to Gearman, with a 5-second pause between each attempt.

During the unit task execution (UTE) (e.g. vd_notify), the error "...GEARMAN_COULD_NOT_CONNECT..." occurs when attempting to connect to Gearman.

Initially, instead of immediately ending with status CRITICAL, TS waits for 5 seconds, retries, and encounters the same error again.

TS repeats this cycle 3 times without success and eventually ends with status CRITICAL.
The monitoring system then triggers an alarm.

The SE investigates the issue and observes that periods of overload typically last less than 5 minutes.
To improve resilience, the SE updates the configuration to increase the retry interval from 5 seconds to 60 seconds, and the number of attempts from 3 to 5.

At the next UTE, the error "...GEARMAN_COULD_NOT_CONNECT..." occurs again.
TS waits 60 seconds before retrying, but the same error reappears.

TS repeats this cycle 5 times. On the 5th attempt, it successfully connects to Gearman.

TS is then able to continue the UTE and completes with status OK.
As a result, the monitoring system does not detect any issue.
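The retry behaviour in this scenario can be illustrated with a minimal sketch. It assumes that the configured number is the total number of attempts and that the pause happens only between failed attempts. The names `call_with_retries`, `RetriesExhausted`, and `make_flaky_connect` are hypothetical and are not part of TaskStack.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """All attempts failed; in the scenario the task ends with status CRITICAL."""


def call_with_retries(
    operation: Callable[[], T],
    attempts: int,
    pause_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation` up to `attempts` times, pausing between failed attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError as error:  # stand-in for GEARMAN_COULD_NOT_CONNECT
            last_error = error
            if attempt < attempts:
                sleep(pause_seconds)
    raise RetriesExhausted(f"gave up after {attempts} attempts") from last_error


def make_flaky_connect(failures_before_success: int) -> Callable[[], str]:
    """Hypothetical connection that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def connect() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("GEARMAN_COULD_NOT_CONNECT")
        return "connected"

    return connect


# Tuned configuration from the scenario: 5 attempts, 60-second pause.
# The pauses are recorded instead of actually sleeping.
pauses = []
result = call_with_retries(
    make_flaky_connect(4), attempts=5, pause_seconds=60, sleep=pauses.append
)
print(result, pauses)
```

With four failures followed by a success, the call returns on the 5th attempt after four recorded 60-second pauses, so the task can continue instead of ending with status CRITICAL.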

Use Scenario #1.2: Connection to Cassandra DB

Preconditions: The BA/system is configured to make 5 retry attempts to connect to Cassandra, with a 5-second pause between each attempt.

During UTE (e.g. customer_statistics), the error "Cassandra login failed ..." occurs when attempting to connect to Cassandra.

Initially, instead of immediately ending with status CRITICAL, TS waits for 5 seconds, retries, and encounters the same error again.

TS repeats this cycle 5 times without success and eventually ends with status CRITICAL.
The monitoring system then triggers an alarm.

The SE investigates the issue and observes that the overload periods are related to the daily logrotate routine, which usually lasts less than 10 minutes.
To improve resilience, the SE updates the configuration to increase the retry interval from 5 seconds to 30 seconds, and the number of attempts from 5 to 10.

At the next UTE, the error "Cassandra login failed ..." occurs again.
TS waits 30 seconds before retrying, but the same error reappears.

TS repeats this cycle 7 times. On the 7th attempt, it successfully connects to Cassandra.

TS is then able to continue the UTE and completes with status OK.
As a result, the monitoring system does not detect any issue.

Use Scenario #1.3: Connection to KairosDB

Preconditions: The BA/system is configured to make 2 retry attempts to connect to KairosDB, with a 10-second pause between each attempt.

During UTE (e.g. update_system_load_statistic), the error "Cannot send the request..." occurs when attempting to connect to KairosDB.

Initially, instead of immediately ending with status CRITICAL, TS waits for 10 seconds, retries, and encounters the same error again.

TS repeats this cycle 2 times without success and eventually ends with status CRITICAL.
The monitoring system then triggers an alarm.

The SE investigates the issue and observes that the root cause is related to network connectivity issues that normally last less than 1 minute.
To improve resilience, the SE updates the configuration to keep the retry interval at 10 seconds and increase the number of attempts from 2 to 6.

At the next UTE, the error "Cannot send the request..." occurs again.
TS waits 10 seconds before retrying, but the same error reappears.

TS repeats this cycle 6 times without success and then ends with status CRITICAL.

The SE investigates the issue and determines that the root cause is an incorrect system configuration.

After the SE applies the fix, TS is able to run the UTE and completes successfully with status OK, despite temporary network issues.
As a result, the monitoring system does not report any issue.

Use Scenario #1.4: Connection to an external 3rd-party API service (e.g., SureTax)

Preconditions: The BA/system is configured to make 2 retry attempts to an external service (an API service or any other service), with a 30-second pause between each attempt.

During UTE (e.g. statistics), an error occurs when attempting to send a request to the SureTax API (e.g., response code 500).

Initially, instead of immediately ending with status CRITICAL, TS waits for 30 seconds, retries, and encounters the same error again.

TS repeats this cycle 2 times without success and eventually ends with status CRITICAL.
The monitoring system then triggers an alarm.

The SE investigates the issue and observes that API service failures usually last less than 10 minutes.
To improve resilience, the SE updates the configuration to increase the retry interval from 30 seconds to 60 seconds, and the number of attempts from 2 to 10.

At the next UTE, the error "SureTax: status 500" occurs again.
TS waits 60 seconds before retrying, but the same error reappears.

TS repeats this cycle 4 times. On the 4th attempt, it receives a response that indicates success.

TS is then able to continue the UTE and completes with status OK.
As a result, the monitoring system does not detect any issue.
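When choosing the number of attempts and the pause, it can help to compare the longest time the retry loop is able to wait with the observed outage duration. The helper below is a rough sketch of that arithmetic. It assumes the first attempt is immediate and the configured number is the total number of attempts; `max_retry_window_seconds` is a hypothetical name.

```python
def max_retry_window_seconds(attempts: int, pause_seconds: int) -> int:
    """Longest outage (in seconds) the retry loop can wait out.

    Assumption: the first attempt is immediate and a pause precedes each
    later attempt, so `attempts` attempts span `attempts - 1` pauses.
    """
    return max(attempts - 1, 0) * pause_seconds


# (attempts, pause in seconds) after the SE's change in scenarios 1.1 to 1.4.
tuned = {
    "Gearman": (5, 60),
    "Cassandra": (10, 30),
    "KairosDB": (6, 10),
    "SureTax": (10, 60),
}

for service, (attempts, pause) in tuned.items():
    window = max_retry_window_seconds(attempts, pause)
    print(f"{service}: waits at most {window} s before ending with CRITICAL")
```

Under this counting the windows are 240 s, 270 s, 50 s, and 540 s. If the configured number instead counts retries made after an initial attempt, the window is `attempts * pause_seconds` (300 s, 300 s, 60 s, and 600 s).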

Use Case #2: Re-run the task instead of generating an alert

Roles: Support Engineer (SE)

Preconditions: The monitoring system is configured to ignore repeated identical errors for three consecutive runs.

TS executes UTs (e.g., statistics) hourly according to a defined schedule within the allowed period Calculate_Statistics_At.

Use Scenario #2.1: Replication delay

Main Flow:

During UTE, TS encounters the error “Replication delay” when attempting to send a request to SlaveDB.
The re-run attempt counter for this unit task (UT) is increased from 0 to 1.
Instead of triggering an alarm, the monitoring system waits for the next scheduled run of the same UT.
On the next scheduled run (e.g., after 1 hour), TS encounters the same error “Replication delay” again.
The re-run attempt counter is increased from 1 to 2.
The monitoring system again waits for the next scheduled run instead of triggering an alarm.
On the following scheduled run, TS executes the UT and does not encounter the error (Status OK or a different error).
The re-run attempt counter is reset from 2 back to 0.
On the next scheduled run (e.g., the next day), TS encounters the error “Replication delay” again.
The two subsequent scheduled re-runs also result in the same error “Replication delay”.
The re-run attempt counter is increased from 0 to 3.
At this point, the monitoring system triggers an alarm.

Exception Flow:

If the UT succeeds before reaching the maximum number of re-run attempts, the counter is reset to 0, and no alarm is triggered.

Resolution:

The SE investigates the issue and determines that replication delays are typically caused by high load and last up to 4 hours.
To improve resilience, the SE updates the configuration to increase the maximum number of re-run attempts from 3 to 6.

Postconditions:

At the next UTE, the error “Replication delay” occurs again.
TS retries on the next scheduled run (e.g., after 1 hour) and succeeds with status OK.
The re-run attempt counter is reset from 4 to 0.
The monitoring system does not detect or report an issue.
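The counter logic of the Main Flow can be sketched as follows. This is an illustration only: it assumes the alarm fires when the counter reaches the configured maximum (as in the Main Flow, where the counter reaches 3) and that a different error restarts the count. `RerunCounter` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RerunCounter:
    """Counts consecutive scheduled runs of one UT that fail with the same error."""

    max_reruns: int
    last_error: Optional[str] = None
    count: int = 0

    def record(self, error: Optional[str]) -> bool:
        """Record one scheduled run (`error` is None for status OK).

        Returns True when the monitoring system should trigger an alarm.
        """
        if error is None or error != self.last_error:
            # Status OK or a different error resets the counter.
            self.count = 0
        self.last_error = error
        if error is None:
            return False
        self.count += 1
        return self.count >= self.max_reruns


counter = RerunCounter(max_reruns=3)
runs = ["Replication delay", "Replication delay", None,
        "Replication delay", "Replication delay", "Replication delay"]
for error in runs:
    alarm = counter.record(error)
    print(f"error={error!r} counter={counter.count} alarm={alarm}")
```

Only the last run in this sequence reports an alarm; the successful third run resets the counter from 2 to 0.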

Use Case #3: Notify the CSP Admin instead of raising a monitoring alarm when the issue requires a business decision

Use Scenario #3.1: Errors from external systems managed by the CSP and/or their vendor/partner

Use Scenario #3.1.1: Bad Response from E-Commerce Payment Gateway

Preconditions:

The BA/system is configured to make 3 retry attempts to an external service (payment processor or any other service), with a 5-second pause between each attempt.
The monitoring system is configured to ignore repeated identical errors for 3 consecutive runs.
CSP Admin is subscribed to periodic payment errors (via user notifications, template: “auto-payment error”).

Main Flow:

During UTE of auto_payment, the error occurs:
“Payment failed, errcode 9999, error: Bad response from Authorize.Net payment gateway...” when attempting to make an e-commerce payment.
Instead of immediately ending with status CRITICAL, TS waits 5 seconds, retries, and encounters the same error again.
TS repeats this cycle 3 times without success and eventually ends with status CRITICAL.
A notification is sent to the CSP Admin’s mailbox.
The monitoring system does not trigger an alarm (since the "ignore repeated identical errors for 3 consecutive runs" threshold is not exceeded).
CSP Admin receives a notification and reviews the details of the failed payment (e.g., counter and/or list of the affected objects (payment/customer/account), details such as affected customer name/ID, timestamp, charged amount, error text explaining the reason, and a proposed action plan).
CSP Admin follows the proposed action plan, contacts Authorize.Net, and obtains a fix from the vendor.
On the subsequent UTE of auto_payment, the e-commerce payment is processed successfully.
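The notification details listed in the Main Flow could be carried in a structure such as the one below. This is a sketch under assumptions: the class names, field names, and rendered text layout are hypothetical, and the sample values are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class FailedPayment:
    customer_id: int
    customer_name: str
    timestamp: datetime
    charged_amount: str  # kept as text to avoid float rounding
    error_text: str


@dataclass
class PaymentErrorNotification:
    template: str
    proposed_action: str
    failures: List[FailedPayment] = field(default_factory=list)

    def render(self) -> str:
        """Build the plain-text body: counter, affected objects, action plan."""
        lines = [f"[{self.template}] {len(self.failures)} failed payment(s)"]
        for item in self.failures:
            lines.append(
                f"- {item.customer_name} (ID {item.customer_id}), "
                f"{item.timestamp.isoformat()}, amount {item.charged_amount}: "
                f"{item.error_text}"
            )
        lines.append(f"Proposed action: {self.proposed_action}")
        return "\n".join(lines)


notification = PaymentErrorNotification(
    template="auto-payment error",
    proposed_action="Contact the payment gateway vendor.",
    failures=[
        FailedPayment(
            customer_id=1001,  # sample value
            customer_name="Example Customer",  # sample value
            timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
            charged_amount="10.00 USD",
            error_text="Payment failed, errcode 9999",
        )
    ],
)
print(notification.render())
```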

Use Scenario #3.1.2: Taxation service rejects requests from PB

Preconditions:

The BA/system is configured to make 3 retry attempts to an external service (payment processor or any other service), with a 5-second pause between each attempt.
The monitoring system is configured to ignore repeated identical errors for 2 consecutive runs.
CSP Admin is subscribed to tax calculation errors (via user notifications, template: “tax calculation error”).

Main Flow:

During UTE of statistics/taxes, the error occurs:
“LoadRequest Failure - Request has invalid inputs: Failure - Invalid Validation Key\r\nSchema Errors”
when attempting to calculate taxes.
Instead of immediately ending with status CRITICAL, TS waits for 5 seconds, retries, and encounters the same error again.
TS repeats this cycle 3 times without success and eventually ends with status CRITICAL.
The monitoring system does not trigger an alarm (since the "ignore repeated identical errors for 2 consecutive runs" threshold is not exceeded).
A notification is sent to the CSP Admin’s mailbox.
During the next run (e.g. next day), TS retries the UT without success (same error).
The monitoring system still does not trigger an alarm (since the "ignore repeated identical errors for 2 consecutive runs" threshold is still not exceeded).
CSP Admin receives 2 notifications and reviews the details of the failed tax calculation.
CSP Admin follows the proposed action plan (e.g., disables tax calculation or renews the validation key).
On the subsequent UTE, the invoice is generated successfully.
This specific error doesn't occur anymore.

Use Scenario #3.2: Errors that have to be fixed by CSP Admin

Use Scenario #3.2.1: Broken invoice template

Preconditions:

The monitoring system is configured to ignore repeated identical errors for 2 consecutive runs.
CSP Admin is subscribed to errors related to invoice generation (via user notifications, template: “regular invoice cannot be created...”).

Main Flow:

During UTE of customer_statistics, the error occurs:
“Failed to process /porta_var/tmp/invoices_2787_bLNDS/invoice-1.html HTML file, error: bless( ['undef','WHILE loop terminated (> 500000 iterations)...”
when attempting to generate an invoice.
The monitoring system does not trigger an alarm (since the "ignore repeated identical errors for 2 consecutive runs" threshold is not exceeded).
A notification is sent to the CSP Admin’s mailbox.
During the 2nd run, TS retries the UT without success (same error).
Another notification is sent to the CSP Admin’s mailbox. The alarm is still not triggered.
During the 3rd run, TS retries the UT without success (same error).
The monitoring system triggers an alarm (since the "ignore repeated identical errors for 2 consecutive runs" threshold is exceeded).
CSP Admin receives 3 notifications and reviews the details of the failed invoice.
The SE sees the alarm in the monitoring system and opens a ticket with the CSP.
CSP Admin follows the proposed action plan (e.g., modifies the invoice template for the customer to a single-page format, or requests Support to increase the row limit).
On the subsequent UTE, the invoice is generated successfully.
This specific error doesn't occur anymore.
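The split between CSP Admin notifications and SE alarms in this scenario can be sketched as follows. It assumes that every failed run produces a notification for a subscribed CSP Admin and that "exceeded" means strictly more consecutive identical errors than the configured number (2 here, so the alarm appears on the 3rd run). `ErrorRouter` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ErrorRouter:
    """Routes UT errors: notifications to the CSP Admin, alarms to the SE."""

    ignore_runs: int
    admin_subscribed: bool = True
    consecutive: int = 0
    last_error: Optional[str] = None
    admin_mailbox: List[str] = field(default_factory=list)
    alarms: List[str] = field(default_factory=list)

    def on_run(self, error: Optional[str]) -> None:
        """Process the result of one scheduled run (`error` is None for OK)."""
        if error is None:
            self.consecutive, self.last_error = 0, None
            return
        # Count only consecutive runs that fail with the identical error.
        self.consecutive = self.consecutive + 1 if error == self.last_error else 1
        self.last_error = error
        if self.admin_subscribed:
            self.admin_mailbox.append(error)
        if self.consecutive > self.ignore_runs:
            self.alarms.append(error)


router = ErrorRouter(ignore_runs=2)
for run in range(1, 4):
    router.on_run("WHILE loop terminated (> 500000 iterations)")
    print(f"run {run}: notifications={len(router.admin_mailbox)} "
          f"alarms={len(router.alarms)}")
```

After three failed runs the CSP Admin has three notifications and the SE sees one alarm, which matches the flow above.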

Use Scenario #3.2.2: Declined E-Commerce payment

Preconditions:

The BA/system is configured to make 3 retry attempts to an external service (payment processor or any other service), with a 5-second pause between each attempt.
The monitoring system is configured to ignore repeated identical errors for 2 consecutive runs.
CSP Admin is subscribed to periodic payment errors (via user notifications, template: “auto-payment error”).

Main Flow:

During UTE of auto_payment, the error occurs:
“Payment failed, errcode 2, error: This transaction has been declined...” when attempting to make an e-commerce payment.
TS does not retry after 5 seconds; it ends with status CRITICAL immediately.
A notification is sent to the CSP Admin’s mailbox.
TS repeats this cycle on 3 consecutive runs without success (same error).
The monitoring system triggers an alarm (since the "ignore repeated identical errors for 2 consecutive runs" threshold is exceeded).
The SE sees the alarm in the monitoring system and opens a ticket with the CSP.
The CSP does not respond to the ticket, and the alarm remains active.
CSP Admin keeps receiving the notifications and ignores them.
The SE sets the parameter "ignore repeated identical errors for 2 consecutive runs" to null in order to suppress this type of alarm permanently.
The monitoring alarm is now off (suppressed) and no longer occurs.
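The effect of setting the parameter to null can be expressed as a small decision function. This is a sketch, assuming that null (here `None`) means "never alarm for this error type" and that the alarm otherwise fires when the number of consecutive identical errors exceeds the configured value. `should_alarm` is a hypothetical name.

```python
from typing import Optional


def should_alarm(consecutive_failures: int, ignore_runs: Optional[int]) -> bool:
    """Decide whether the monitoring system raises an alarm.

    `ignore_runs=None` models the null setting: the alarm is suppressed
    regardless of how many consecutive identical errors occurred.
    """
    if ignore_runs is None:
        return False
    return consecutive_failures > ignore_runs


print(should_alarm(3, 2))     # threshold of 2 is exceeded on the 3rd run
print(should_alarm(3, None))  # suppressed by the SE
```

Under this sketch, the notifications to the CSP Admin are unaffected; only the SE-facing alarm is switched off.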

Use Scenario #3.2.3: Data entered by the CSP Admin is incorrect

Preconditions:

The monitoring system is configured to ignore repeated identical errors for 3 consecutive runs.
CSP Admin is subscribed to tax calculation errors (via user notifications, template: “tax calculation error”).

Main Flow:

During UTE of statistics/taxes, the error occurs:
“SureTax plugin with enabled "individual_jurisdictions" - zip code is missing for customer”
when attempting to calculate taxes.
A notification is sent to the CSP Admin’s mailbox.
TS retries the cycle three times without success.
The monitoring system does not trigger an alarm.
CSP Admin receives three notifications and reviews the details of the failed tax calculation (e.g., affected customer name/ID, timestamp, error text explaining the reason, and a proposed action plan).
CSP Admin follows the proposed action plan (e.g., enters ZIP code).
On the subsequent UTE, the invoice is generated successfully.

Use Case #4: Overdue tasks

User Story

As a PortaSwitch Platform Engineer, I need to control how many background tasks run simultaneously on a single web node, distinguishing between lightweight and heavy tasks, so that the system remains stable under load without blocking high-priority work that is important for the CSP or sacrificing throughput.

Current Solution

Today: Each web node runs a cron job every minute that launches the Task Scheduler (a function of TaskStack). TS picks the highest-priority pending task and runs it as a separate process. The only concurrency control is per priority: the system keeps a single slot for "normal" tasks and unlimited slots for "high" tasks.

Although this limit is intended to prevent overloading the system (the nodes that run TS and the DB servers), it does not distinguish between heavy and lightweight tasks.
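The current per-priority rule can be written down as a minimal sketch. It models a single web node and only the two priorities named above; `can_start` is a hypothetical name and does not reflect the actual Task Scheduler code.

```python
from typing import Sequence


def can_start(priority: str, running_priorities: Sequence[str]) -> bool:
    """Current rule on one node: unlimited "high" slots, one "normal" slot."""
    if priority == "high":
        return True
    # A "normal" task may start only if no other "normal" task is running.
    return "normal" not in running_priorities


# One long-running "normal" task (e.g. statistics) is already running.
running = ["normal"]
print(can_start("normal", running))  # False: the single normal slot is taken
print(can_start("high", running))    # True: high-priority tasks are not limited
```

The rule has no notion of task weight: a lightweight normal-priority task waits just as long as a heavy one while the slot is occupied.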

Pain points:

A heavy task occupies the single normal slot for hours, blocking all other normal-priority tasks on that node for the entire duration. Only a few tasks (those marked as high-priority) can run in parallel.
The overdue-task monitor fires false alarms during legitimate long-running tasks and requires constant tuning of the "Number of overdue tasks" threshold, because it is difficult to derive the severity of an overdue delay from the number of overdue tasks.
Scaling to more web nodes improves throughput by accident, not by design: there is no slot awareness across nodes (each node is independent). More nodes usually decrease the time spent on heavy parallelized tasks, thus increasing the time interval available for normal-priority tasks.

Current list of tasks with their priority, category, schedule.

List of tasks (Procedures)

mysql> select * from Procedures order by priority desc;
+-----+------+-------------------------------------+-----------------------------------------------------------------------------+----------+----------+---------+--------------------------------+------------+-------------+-------------+--------------+
| id  | type | name                                | description                                                                 | priority | periodic | blocked | block_reason                   | start_time | exit_status | category    | exit_details |
+-----+------+-------------------------------------+-----------------------------------------------------------------------------+----------+----------+---------+--------------------------------+------------+-------------+-------------+--------------+
|   7 | I    | auto_payments                       | Auto-Payments                                                               | high     | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  10 | I    | credit_limit_warnings               | Send credit limit warnings                                                  | high     | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  14 | I    | custom_report                       | Execute Custom Reports                                                      | high     | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  20 | I    | regenerateInvoice                   | Regenerate Invoice                                                          | high     | N        | N       | NULL                           | NULL       |        NULL | unimportant | NULL         |
|  23 | I    | clone_product                       | Clone Product                                                               | high     | N        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  28 | I    | update_system_load_statistic        | Collect system load statistic                                               | high     | Y        | Y       | NULL                           | NULL       |           0 | important   | NULL         |
|  29 | I    | adaptive_routing_update             | Update Adaptive Routing statistics                                          | high     | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  36 | I    | notify_radius                       | Send notifications to Radius servers                                        | high     | N        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  37 | I    | notify_um                           | Send notifications to PortaUM nodes                                         | high     | N        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  38 | I    | notify_sip                          | Send notifications to PortaSIP nodes                                        | high     | N        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  43 | I    | check_notify_status                 | Check status of notify* tasks                                               | high     | N        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  47 | I    | check_access_policies               | Check access policies phases                                                | high     | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  52 | I    | fraud_protection_update             | Update status for Voice Protection feature                                  | high     | Y        | N       | new procedure                  | NULL       |           0 | unimportant | NULL         |
|  63 | I    | notify_media_server                 | Send notifications to MediaServer nodes                                     | high     | N        | N       | NULL                           | NULL       |        NULL | unimportant | NULL         |
|  83 | I    | mm_active_calls                     | Schedule admin::collecting_active_calls                                     | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  84 | I    | mm_call_recording_storage           | Schedule admin::collecting_call_recording_storage                           | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  85 | I    | mm_ip_centrex_phone_line            | Schedule admin::collecting_ip_centrex_phone_line                            | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  86 | I    | mm_concurrent_calls                 | Schedule admin::collecting_concurrent_calls                                 | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  88 | I    | spending_plan_limit_warnings        | Send spending plan limit warnings                                           | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  97 | I    | check_expired_vd_counters           | Check Expired VD Counters                                                   | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 105 | I    | sap_create_reports                  | Generate SAP reports                                                        | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 106 | I    | sap_send_reports                    | Send SAP reports                                                            | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 107 | I    | vd_quota_usage_stat                 | Quota usage statistic                                                       | high     | Y        | Y       | :do-not-monitor: obsolete task | NULL       |           0 | important   | NULL         |
| 108 | I    | process_cqt_data                    | Fetch and aggregate voice quality metric values from call quality tracker   | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 109 | I    | vd_discount_usage_stat              | Discount usage statistic                                                    | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 110 | I    | send_cdr_files                      | Send custom CDR files                                                       | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 112 | I    | sync_ldap_users                     | Sync LDAP users                                                             | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 113 | I    | apply_commitment_action             | Apply commitment action                                                     | high     | N        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 114 | I    | check_commitments                   | Check commitments                                                           | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 119 | I    | bundles_maintenance                 | Maintain prepaid bundles (e.g. activation, notification, ESPF events)       | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 120 | I    | cleanup_sim_cards                   | Cleanup disposed SIM cards                                                  | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 121 | I    | activate_bundles_for_accounts       | Trigger prepaid bundles activation for accounts                             | high     | N        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 122 | I    | cleanup_bundle_activations          | Cleanup activation records ot bundles                                       | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 123 | I    | cleanup_custom_reports              | Cleanup custom reports data from permanent storage (local and/or cassandra) | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 124 | I    | charge_voice_recordings             | Billing of call recordsing                                                  | high     | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|   1 | I    | garbage_collection                  | garbage collection for TasksStack                                           | normal   | Y        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|   3 | I    | profiles_cleanup                    | Profiles Cleanup                                                            | normal   | Y        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|   4 | I    | statistics                          | Statistics Calculation                                                      | normal   | Y        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|   6 | I    | update_xchange_rates                | Update Exchange Rates                                                       | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|   8 | I    | offset_balance                      | Offset Balance                                                              | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|   9 | I    | CDR_cleanup                         | CDR tables cleanup                                                          | normal   | Y        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|  11 | I    | truncate_Log_table                  | Move old Log records to csv-file                                            | normal   | Y        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|  13 | I    | charge_subscriptions                | Subscription Fees                                                           | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  15 | I    | check_invoices                      | Check Invoices                                                              | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  18 | I    | cardissue                           | Accounts Generator                                                          | normal   | N        | N       | NULL                           | NULL       |        NULL | important   | NULL         |
|  22 | I    | xdr_rerating                        | xDR Re-rating                                                               | normal   | N        | N       | NULL                           | NULL       |        NULL | unimportant | NULL         |
|  24 | I    | db_analyze_tables                   | Execute ANALYZE on tables                                                   | normal   | Y        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|  26 | I    | mark_inactive_rates                 | Mark Inactive Rates                                                         | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  30 | I    | reapply_product_subscriptions       | Reapply product subscriptions to accounts having the product assigned       | normal   | N        | N       | NULL                           | NULL       |        NULL | important   | NULL         |
|  31 | I    | void_invoice                        | Put invoice into void state                                                 | normal   | N        | Y       | :do-not-monitor: obsolete task | NULL       |        NULL | important   | NULL         |
|  32 | I    | recalculate_invoices                | Recalculate invoices                                                        | normal   | N        | N       | NULL                           | NULL       |        NULL | unimportant | NULL         |
|  33 | I    | vd_notify                           | Volume Discount Threshold notifications                                     | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  34 | I    | calc_netaccess_stats                | Netaccess Usage Stats recalculation                                         | normal   | Y        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|  35 | I    | upload_wizard_cleanup               | Upload Wizard Temporary Tables Cleanup                                      | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  40 | I    | notify_monitor                      | Send notifications to Monitor nodes                                         | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  41 | I    | cfgptag_maintenance                 | Cfgptag Maintenance                                                         | normal   | N        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  42 | I    | ua_profile                          | UA Profiles Generator                                                       | normal   | N        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  45 | I    | expire_balance                      | Expire balance task                                                         | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  46 | I    | web_sessions_cleanup                | Deletes old web sessions                                                    | normal   | Y        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  48 | I    | alive_xdr_cleanup                   | Cleanup Alive_XDRs table                                                    | normal   | Y        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|  49 | I    | clone_tariff                        | Clone Tariff                                                                | normal   | N        | N       | NULL                           | NULL       |           0 | important   | NULL         |
|  50 | I    | db_checksum                         | Check DB checksum on Master/Slave                                           | normal   | Y        | Y       | new code                       | NULL       |           0 | unimportant | NULL         |
|  51 | I    | did_order_renew                     | Renew order for provisioned DID numbers                                     | normal   | Y        | N       | new code                       | NULL       |           0 | important   | NULL         |
|  55 | I    | credit_limit_increase               | Temporary credit limit increase feature                                     | normal   | Y        | N       | new procedure                  | NULL       |           0 | unimportant | NULL         |
|  56 | I    | create_one_time_invoice             | Generate one-time invoice                                                   | normal   | N        | N       | new procedure                  | NULL       |           0 | unimportant | NULL         |
|  57 | I    | invoices_review_status              | Check Under Review invoices and send notifications                          | normal   | Y        | N       | new procedure                  | NULL       |           0 | unimportant | NULL         |
|  58 | I    | process_review_invoices             | Process Under Review invoices                                               | normal   | N        | N       | new procedure                  | NULL       |           0 | unimportant | NULL         |
|  59 | I    | did_cleanup                         | Cleanup canceled DID numbers                                                | normal   | Y        | N       | new procedure                  | NULL       |           0 | unimportant | NULL         |
|  61 | I    | cdr_cleanup_fixup                   | Verify and Fix cleanup xDRs                                                 | normal   | N        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|  62 | I    | password_encryption                 | Encrypt/Decrypt Accounts' and Customers' passwords                          | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  64 | I    | change_customer_status              | Check customer status and change it if necessary                            | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  66 | I    | did_billing_cleanup                 | Cleanup finished and billed DID billing records                             | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  68 | I    | clone_rates                         | Procedure for cloning rates                                                 | normal   | N        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|  69 | I    | number_porting                      | Number Porting                                                              | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  71 | I    | synchronize_LERG_database           | Synchronize LERG Database                                                   | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  72 | I    | delete_system_load_statistic        | Remove system load statistic                                                | normal   | N        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  73 | I    | sync_rw_devices                     | Restore ReadyWireless devices                                               | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  74 | I    | credit_card_notify                  | Notify Customers about expired Credit Card                                  | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  75 | I    | re_auth_inet_customers              | Re-Authorize INET Customers                                                 | normal   | Y        | Y       | :do-not-monitor: optional task | NULL       |           0 | important   | NULL         |
|  76 | I    | process_bitcoin_transactions        | Process Bitcoin transactions                                                | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  77 | I    | media_delete_accounts               | Schedule media::delete_accounts                                             | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  79 | I    | media_imap_mbox_cleanup             | Schedule media::imap_mbox_cleanup                                           | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  80 | I    | media_move_accounts                 | Schedule media::move_accounts                                               | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  81 | I    | media_sessions_purge                | Schedule media::sessions_purge                                              | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  82 | I    | retrigger_codec_converter           | Re-Trigger Codec Converter                                                  | normal   | N        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  87 | I    | sync_iptv_packages                  | Synchronize IPTV Packages with external platform                            | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  89 | I    | aggregate_measured_services         | Aggregate collected measured services data                                  | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  90 | I    | process_belog                       | Procedure for processing of BE log                                          | normal   | N        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
|  91 | I    | api_connections_cleanup             | Cleanup expired API connections information                                 | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  92 | I    | sync_sip_subscriptions              |  Synchronize PortaAdmin connection subscriptions with PortaSIP database     | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  94 | I    | archive_sip_calls                   | Schedule media::archive_sip_calls                                           | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  95 | I    | check_pending_transactions          | Check pending transactions                                                  | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
|  96 | I    | cleanup_mediator_collections        | Procedure for deletion old CDR collections                                  | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           1 | unimportant | NULL         |
|  98 | I    | otp_cleanup                         | Cleanup expired One-Time passwords                                          | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
|  99 | I    | media_cleanup_db                    | Schedule media::media_cleanup_db                                            | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
| 100 | I    | cleanup_async_requests              | Cleanup expired async API method calls data                                 | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 102 | I    | cleanup_subscriptions_history       | Cleanup outdated subscriptions history data                                 | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 103 | I    | cleanup_customer_personal_data      | Cleanup terminated customer personal data                                   | normal   | Y        | Y       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 104 | I    | delete_tariffs                      | Cleanup outdated tariffs                                                    | normal   | Y        | N       | NULL                           | NULL       |           0 | unimportant | NULL         |
| 111 | I    | check_account_expiration            | Check expired accounts                                                      | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 115 | I    | cleanup_bill_status_change_counters | Cleanup daily bill status change counters                                   | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 116 | I    | cleanup_audio_files                 | Cleanup temporary audio files                                               | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 117 | I    | cleanup_try_on_context              | Cleanup expired try-on context records                                      | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 125 | I    | sync_session_logs                   | Sync session logs (index data only)                                         | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 126 | I    | cleanup_funds_usage_history         | Cleanup funds usage history for obsolete payments                           | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 127 | I    | cleanup_services                    | Obsolete services removal                                                   | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 128 | I    | cpe_directories_auto_sync           | Synchronize UA config directories                                           | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 129 | I    | cpe_directories_generation          | Generate UA config directory file                                           | normal   | N        | N       | :do-not-monitor: new procedure | NULL       |           0 | unimportant | NULL         |
| 130 | I    | media_vm_trx_billing_export         | Schedule media::vm_trx_billing_export                                       | normal   | Y        | N       | :do-not-monitor: new procedure | NULL       |           0 | important   | NULL         |
+-----+------+-------------------------------------+-----------------------------------------------------------------------------+----------+----------+---------+--------------------------------+------------+-------------+-------------+--------------+

Concrete current problem

Example 1 — Critical task delayed by a heavy job

On 2026-04-01 at 04:00:00, statistics starts on server Single_Node and occupies the single normal slot for the entire billing run. At 10:01:00, clone_tariff (normal priority, expected duration: ~1 minute) is queued. At 11:00:00, another task, xdr_rerating (normal priority, expected duration: ~3 minutes), is queued. Because Single_Node is fully occupied by statistics, neither clone_tariff nor xdr_rerating can start until statistics finishes at 18:00:00: a delay of about 8 hours for a 1-minute task and 7 hours for a 3-minute task. The result is a delayed recalculated invoice and an unresolved billing dispute with the customer.
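
The delays in this example follow directly from the timestamps. A minimal Python sketch of the arithmetic (the variable names are illustrative and not part of TaskStack):

```python
from datetime import datetime

# Timestamps taken from Example 1.
statistics_finished = datetime(2026, 4, 1, 18, 0, 0)
queued_at = {
    "clone_tariff": datetime(2026, 4, 1, 10, 1, 0),
    "xdr_rerating": datetime(2026, 4, 1, 11, 0, 0),
}

# The earliest possible start is the moment statistics releases the slot,
# so the delay is the gap between queueing and that moment.
delays = {name: statistics_finished - queued for name, queued in queued_at.items()}

for name, delay in delays.items():
    print(f"{name}: waited {delay}")
```

This prints a wait of 7:59:00 for clone_tariff and 7:00:00 for xdr_rerating.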

Example 2 — Overdue monitoring noise hides the real problem

Between 04:00:00 and 18:00:00, 30 additional tasks accumulate in the queue. The monitoring system fires an alarm: 30 tasks overdue. This alert gives no indication that xdr_rerating is the operationally critical task, nor that a single heavy job is the root cause of the entire backlog.

The SE cannot simply disable the overdue-task monitor — without it, a genuinely hung task that blocks the queue indefinitely would go undetected. At the same time, narrowing the Calculate_Statistics_At window is not an option: billing-cycle invoices at the start of the month are the highest-priority business operation and must complete as early as possible, without artificial time constraints.

Solution and Key changes:

A new boolean value is_heavy on the tasks table, configurable by the SE.
A new system configuration parameter HeavySlotsPerNode defines the number of slots for heavy tasks on each TS node (0 = disabled).
New TS startup logic for heavy tasks: before executing a heavy task, TS counts the heavy tasks currently running on the node and checks slot availability. If a slot is available on that node, TS runs the task there; if the limit is reached, TS exits without starting it.
New TS startup logic for non-heavy tasks: TS runs an unlimited number of non-heavy tasks on each node.
A new monitoring option MaxWaitingTimeBeforeAlert, configurable by the SE (0 = disabled).
A new monitoring option MaxRunningTimeBeforeAlert, configurable by the SE (0 = disabled).
New monitoring rule: alert if any task has been overdue for more than MaxWaitingTimeBeforeAlert.
New monitoring rule: alert if any task has been running for more than MaxRunningTimeBeforeAlert.
Removal of the legacy "overdue task" monitor.
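
The startup check described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the TS implementation: the Task class and the can_start function are hypothetical names, and the sketch assumes that HeavySlotsPerNode = 0 switches the heavy-slot limit off rather than forbidding heavy tasks.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    is_heavy: bool


def can_start(task: Task, running_heavy: int, heavy_slots_per_node: int) -> bool:
    """Decide whether the TS instance on one node may start `task` now.

    running_heavy: number of heavy tasks currently running on this node.
    heavy_slots_per_node: value of HeavySlotsPerNode.
    """
    if not task.is_heavy:
        return True  # non-heavy tasks are never limited
    if heavy_slots_per_node == 0:
        # ASSUMPTION: 0 means the heavy-slot limit is disabled.
        return True
    return running_heavy < heavy_slots_per_node


# One heavy task already runs on the node and the limit is 1.
print(can_start(Task("xdr_rerating", True), running_heavy=1, heavy_slots_per_node=1))
print(can_start(Task("clone_tariff", False), running_heavy=1, heavy_slots_per_node=1))
```

With one heavy task already running and a limit of 1, the heavy task is refused and the non-heavy task is allowed.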

Preconditions:

PortaSwitch deployment with 2 web nodes: web-node-1 and web-node-2.
Tasks.HeavySlotsPerNode = 1 (each node allows at most 1 heavy task concurrently).
Tasks.MaxWaitingTimeBeforeAlert = 1 hour.
Tasks.MaxRunningTimeBeforeAlert = 24 hours.
Procedures statistics, charge_subscriptions, CDR_cleanup, recalculate_invoices, and xdr_rerating (all priority=normal) are configured by SE with is_heavy = True.
Procedures clone_tariff, update_xchange_rates, did_cleanup, ua_profile, mark_inactive_rates, and notify_monitor (all priority=normal) remain is_heavy = False.
Cron fires Task Scheduler (TS) every 60 seconds on each node independently.
At 2026-05-19 04:00:00, no tasks are running on either node.

Use Scenario #4.1 — Heavy task starts on first available node; second heavy task takes the other node

At 2026-05-19 04:00:00, statistics (is_heavy=True) is due. TS starts statistics on web-node-1. Heavy slot state on web-node-1: 1/1 used.
At 2026-05-19 10:01:00, xdr_rerating (is_heavy=True) is due. There are no other overdue tasks at the moment.
The TS instance on web-node-1 detects its heavy slot is 1/1 occupied and exits without starting xdr_rerating.
Thirty seconds later, at 2026-05-19 10:01:30, another task, recalculate_invoices (is_heavy=True), becomes due.
At 2026-05-19 10:01:45, the TS instance on web-node-2 detects that its heavy slot is free (0/1 used) and starts xdr_rerating on web-node-2.
Heavy slot state on web-node-2: 1/1 used.

Expected result:

statistics runs on web-node-1 with status running, started_at = 2026-05-19 04:00:02.
xdr_rerating runs on web-node-2 with status running, started_at = 2026-05-19 10:01:45.
Both nodes have heavy slots fully occupied; no node is over-allocated.
Log entries on both nodes record the heavy slot decision with node_id and procedure name.

Use Scenario #4.2 — Non-heavy task runs in parallel with a heavy task on the same node

Preconditions: Following the sequence of Use Scenario #4.1.

At 2026-05-19 10:02:00, clone_tariff (priority=normal, is_heavy=False) is due.
The TS instance on web-node-1 detects that the heavy slot constraint does not apply to non-heavy tasks and starts clone_tariff on web-node-1.
clone_tariff completes at 2026-05-19 10:07:05 (5 minutes 3 seconds total runtime).

Expected result:

clone_tariff starts at 2026-05-19 10:02:02, with no wait imposed by the heavy slot.
statistics continues running on web-node-1 without interruption.
Concurrent processes on web-node-1: statistics (heavy) and clone_tariff (non-heavy).
The 8-hour delay described in the current problem statement is eliminated for non-heavy tasks.

Use Scenario #4.3 — Third heavy task waits when both nodes are saturated

Preconditions: Following the sequence of Use Scenario #4.2. Both cluster heavy slots are occupied: statistics on web-node-1, xdr_rerating on web-node-2.

At 2026-05-19 10:05:00, recalculate_invoices (is_heavy=True) is due. The TS instances on both nodes detect heavy slot 1/1 occupied on web-node-1 and 1/1 occupied on web-node-2 respectively → recalculate_invoices stays pending cluster-wide.
At 2026-05-19 11:30:00, xdr_rerating completes on web-node-2. Heavy slot on web-node-2: 0/1.
At 2026-05-19 11:31:02, TS detects a free heavy slot on web-node-2 and starts recalculate_invoices on web-node-2.

Expected result:

recalculate_invoices starts at 2026-05-19 11:31:02, delayed by 1 hour 26 minutes due to cluster-wide heavy slot saturation.
Neither web-node-1 nor web-node-2 exceeds Tasks.HeavySlotsPerNode at any point.
During the wait, recalculate_invoices remains pending and no system errors occur. No monitoring alarm is triggered until the wait reaches Tasks.MaxWaitingTimeBeforeAlert (see Use Scenario #4.4).
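
The placement of heavy tasks across Use Scenarios #4.1 to #4.3 can be replayed with a small Python sketch. It is a simplification under stated assumptions: in the scenarios each node's TS instance decides independently, whereas here a single hypothetical heavy_running dictionary stands in for the cluster state, and the helper names try_start_heavy and finish are invented for illustration.

```python
HEAVY_SLOTS_PER_NODE = 1  # Tasks.HeavySlotsPerNode from the preconditions

# Hypothetical in-memory view: node -> names of heavy tasks running there.
heavy_running = {"web-node-1": [], "web-node-2": []}


def try_start_heavy(task):
    """Start the task on the first node with a free heavy slot.

    Returns the node name, or None if every node is saturated.
    """
    for node, tasks in heavy_running.items():
        if len(tasks) < HEAVY_SLOTS_PER_NODE:
            tasks.append(task)
            return node
    return None


def finish(task):
    """Release the heavy slot held by a completed task."""
    for tasks in heavy_running.values():
        if task in tasks:
            tasks.remove(task)


placements = [
    ("statistics", try_start_heavy("statistics")),                      # 04:00
    ("xdr_rerating", try_start_heavy("xdr_rerating")),                  # 10:01
    ("recalculate_invoices", try_start_heavy("recalculate_invoices")),  # 10:05, stays pending
]
finish("xdr_rerating")                                                  # 11:30
placements.append(
    ("recalculate_invoices", try_start_heavy("recalculate_invoices"))   # 11:31
)

for task, node in placements:
    print(task, "->", node or "pending")
```

The first attempt to place recalculate_invoices yields "pending"; after xdr_rerating finishes, the retry lands on web-node-2, and no node ever holds more than one heavy task.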

Use Scenario #4.4 — Alert fires when a single task waits longer than the configured threshold

Preconditions: Following the sequence of Use Scenario #4.3. recalculate_invoices (is_heavy=True) has been pending since 2026-05-19 10:05:00, with both cluster heavy slots saturated.

At 2026-05-19 11:05:00, recalculate_invoices has been pending for exactly 60 minutes = Tasks.MaxWaitingTimeBeforeAlert. The monitoring system fires an alert.
Alert payload in log: { procedure: "recalculate_invoices", reason: "heavy_slots_saturated"}.
At 2026-05-19 11:31:02, the heavy slot on web-node-2 frees and recalculate_invoices starts running.
The waiting alert auto-resolves once recalculate_invoices is running.

Expected result:

Single, targeted alert identifying the specific waiting procedure (recalculate_invoices), not an aggregate "30 tasks overdue" message.
Alert includes the actual wait duration and the root-cause reason (heavy_slots_saturated), helping the SE diagnose the issue immediately.
Alert auto-resolves when the task starts.
No alerts produced by the legacy "Number of Overdue Tasks" monitor (it is removed).
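
A minimal Python sketch of the waiting-time rule in this scenario. The function name waiting_alert and its parameters are hypothetical. The sketch assumes the alert fires as soon as the wait reaches the threshold, as in the step at 11:05:00, and that a threshold of 0 disables the rule. How the reason string is determined is outside the sketch, so the caller passes it in.

```python
from datetime import datetime, timedelta


def waiting_alert(procedure, due_at, now, max_wait, reason):
    """Return an alert payload if the task has waited at least max_wait, else None."""
    if max_wait == timedelta(0):
        return None  # 0 = rule disabled
    if now - due_at >= max_wait:
        return {"procedure": procedure, "reason": reason}
    return None


due_at = datetime(2026, 5, 19, 10, 5, 0)
max_wait = timedelta(hours=1)  # Tasks.MaxWaitingTimeBeforeAlert

# One minute before the threshold: no alert.
print(waiting_alert("recalculate_invoices", due_at,
                    datetime(2026, 5, 19, 11, 4, 0), max_wait, "heavy_slots_saturated"))
# At the threshold: the payload from the scenario.
print(waiting_alert("recalculate_invoices", due_at,
                    datetime(2026, 5, 19, 11, 5, 0), max_wait, "heavy_slots_saturated"))
```

Because the rule is evaluated per task, the payload names one procedure instead of reporting an aggregate count.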

Use Scenario #4.5 — SE adjusts the monitoring option `Tasks.MaxWaitingTimeBeforeAlert` to suppress acceptable waits

Preconditions: Following the sequence of Use Scenario #4.4.

At 2026-05-19 11:20:00, the SE reviews the alert and determines that 1-hour waits are expected and acceptable.
SE changes the monitoring option Tasks.MaxWaitingTimeBeforeAlert from 1 hour to 4 hours.
At 2026-06-01 10:05:00, the same waiting scenario occurs. No alert fires while wait time < 4 hours.
At 2026-06-01 14:05:00, wait time reaches 4 hours → alert fires.

Expected result:

Alerts fire only for genuinely problematic waits, as defined by the SE.
No code change or service restart required to adjust the threshold.

Use Scenario #4.6 — Alert fires for a heavy task exceeding its running-time threshold

At 2026-05-18 04:00:00, statistics (is_heavy=True) starts on web-node-1.
At 2026-05-19 04:00:00, statistics has been running for 24 hours = Tasks.MaxRunningTimeBeforeAlert (monitoring option). The monitoring system fires an alert.
Alert payload: { procedure: "statistics", reason: "task_running_too_long"}.
statistics continues running; the alert is informational, not a kill signal.
The alert is resolved when the task completes.

Expected result:

One alert fires at the 24-hour mark, identifying the task (procedure), node and reason.
Task continues running; no automatic interruption.
SE has actionable information (procedure name, node, reason) to diagnose whether the task is genuinely stuck.
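
The fire-once and auto-resolve behaviour of the running-time rule can be sketched in Python as a small state transition. The evaluate function and its event names are hypothetical. The sketch assumes the alert fires when the runtime reaches the threshold and that a threshold of 0 disables the rule.

```python
from datetime import datetime, timedelta

MAX_RUNNING = timedelta(hours=24)  # Tasks.MaxRunningTimeBeforeAlert


def evaluate(started_at, now, finished, active):
    """Return (alert_active, event); event is 'fire', 'resolve' or None."""
    if finished:
        # Completion resolves an active alert; the task itself is never killed.
        return (False, "resolve" if active else None)
    too_long = MAX_RUNNING > timedelta(0) and now - started_at >= MAX_RUNNING
    if too_long and not active:
        return (True, "fire")
    return (active, None)


started_at = datetime(2026, 5, 18, 4, 0, 0)
print(evaluate(started_at, datetime(2026, 5, 19, 3, 59, 0), False, False))
print(evaluate(started_at, datetime(2026, 5, 19, 4, 0, 0), False, False))
print(evaluate(started_at, datetime(2026, 5, 19, 5, 0, 0), False, True))
print(evaluate(started_at, datetime(2026, 5, 19, 9, 0, 0), True, True))
```

Tracking the active state keeps the rule from firing again on every check while the task is still running.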

Non-functional requirements

A backport to 125-6, 120-6, 115-6, and 110-6 is required. This will allow Support to benefit from the enhancement immediately, without having to wait two years for the CSP to migrate to MR131+.
