Ticket Number

YT:PMD-3135, YT:FR-641

Project Goal

Reduce the time the Support Team spends handling false alerts and routine TaskStack monitoring issues. Implement retry mechanisms for transient connection failures, re-run logic with counters for replication delays and overdue tasks, and direct CSP Admin notifications for issues requiring business decisions. Ensure that monitoring alerts are generated only when the service degrades in a way that could impact the customer's business. Support Engineers can then focus on issues that require investigation and action, rather than on transient failures that resolve themselves automatically.
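
As an illustration only, the retry and re-run-counter ideas above could be sketched as follows. This is not the TaskStack implementation: `TransientError`, `call_with_retry`, and `RerunCounter` are hypothetical names, and the attempt count, delay, and threshold are arbitrary placeholder values.

```python
import time
from dataclasses import dataclass


class TransientError(Exception):
    """Hypothetical marker for failures expected to clear on their own."""


def call_with_retry(operation, attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Run `operation`, retrying on TransientError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # retries exhausted: surface the failure to the caller
            sleep(delay_seconds * attempt)  # linear back-off between tries


@dataclass
class RerunCounter:
    """Counts consecutive failed check runs; alerts only at a threshold."""
    threshold: int = 3  # assumed value, not taken from the specification
    failures: int = 0

    def record(self, check_passed: bool) -> bool:
        """Return True when a CRITICAL alert should be raised."""
        if check_passed:
            self.failures = 0  # the issue cleared on a later run
            return False
        self.failures += 1
        return self.failures >= self.threshold
```

Under these assumptions, a check that fails once and passes on the next scheduled run never raises an alert, while a check that keeps failing reaches the threshold and is escalated.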

Why this project exists?

The current monitoring of TaskStack generates too many false alerts, that is, alerts that don't require any action from the Support Team. From 2024-09-01 to 2025-01-25, 181 monitoring tickets related to TaskStack CRITICAL alerts were created, consuming over 643 hours of support resources. Many alerts are caused by transient connection failures (Gearman, Cassandra, KairosDB, external APIs) that resolve themselves within minutes, or by replication delays and overdue tasks that clear on the next scheduled run. Many others require business decisions or actions from the CSP Admin (e.g., fixing invoice templates, updating payment gateway credentials, correcting customer data), but currently Support Engineers act as intermediaries. The system monitors the internals of TaskStack rather than the actual-vs-expected results of its work, which leads to false positives.

Who are the users / to whom do we bring value?

Support Engineers (SE) - Reduced false alerts allow focus on issues that require investigation and action. Avoid situations where an alert is received only to determine it should be ignored.

CSP Admins - Direct notifications for issues requiring their business decisions or actions (e.g., payment gateway errors, tax calculation issues, invoice template problems). Eliminates the need for Support Engineers to act as intermediaries.

CSP owners, operators, and End users - Reduced support overhead and fewer service interruptions from transient failures.

What are the benefits for the CSP / PortaSwitch owner?

For CSP: Reduced time spent on routine monitoring ticket handling. CSP Admins receive direct notifications for issues requiring their action (payments, invoicing, tax calculations). Fewer service interruptions from transient failures due to automatic retry mechanisms and re-run logic.

For PortaOne SE: Reduced Support Team time spent on false alerts. Faster resolution of real issues due to fewer false alerts. Support Engineers focus on problems that require investigation rather than self-healing transient issues.

Target Release
MR125+

Area
PortaAdmin

Additional Info

Specifications

References