Business Analyst IV – Alert Management & Observability Standards Lead

Remote from
USA
Salary, yearly, USD
98,040 - 154,800
Employment type
Full Time
Job posted
Apply before
26 Jun 2026
Experience level
Senior

About Astreya

IT services that put people at the center of your business

Verified job posting
This job post has been manually reviewed for authenticity and compliance.

AI Summary

This role leads the rationalization and governance of system alerts to align with operational priorities and reliability goals. The Alert Management & Observability Standards Lead establishes alerting standards, reviews alerts before routing to 24x7 operations, and ensures runbooks are maintained. They work at the intersection of IT Operations, engineering, and service owners to ensure alerts are actionable and prioritized. The position requires expertise in alert fatigue reduction, monitoring tools, and cross-functional collaboration.

Role DNA

Role DNA rating scales (slider positions not recoverable): Job Complexity (Easy to Hard), Pace & Pressure (Relaxed to Fast-paced), Autonomy Level (Guided to Full Ownership), Communication Load (Independent to Highly Collaborative)
AI Insight The role involves complex cross-functional coordination, defining standards, and strategic decision-making, which requires advanced skills and experience.

Salary Analysis

Median: USD 126,420 (rated Highly Competitive)
US market range: USD 90k – USD 160k
AI Insight The offered salary range of $98,040 to $154,800 is competitive for a senior Business Analyst role in the US market, with a median of $126,420 aligning well with market rates for similar positions requiring specialized skills in alert management and observability.

Key Skills

Alert Management, Observability, Incident Management, IT Operations, Runbook Development, Service Reliability, Monitoring Tools, Cross-functional Collaboration, Process Improvement, Data Analysis

I am excited to apply for the Business Analyst IV - Alert Management & Observability Standards Lead position. With extensive experience in IT operations and alert governance, I have successfully rationalized alert systems to reduce noise and improve incident response. My background includes defining alerting standards and leading cross-functional teams to enhance operational reliability. I am confident in my ability to drive the alert rationalization framework and ensure actionable alerts. I look forward to contributing to your team's goals.

Describe your experience with alert rationalization and reducing alert fatigue.
In my previous role, I led an initiative to review and rationalize over 500 alerts, identifying and suppressing low-value alerts. I implemented a framework to evaluate alerts based on criticality, actionability, and signal-to-noise ratio, resulting in a 40% reduction in alert volume while improving detection of critical incidents.
How would you enforce alerting standards across multiple teams?
I would establish a clear alert design checklist and approval workflow as part of the 'Definition of Done' for onboarding alerts. I'd partner with tool owners to embed standards into monitoring templates and conduct regular reviews to ensure compliance. Training and governance forums would be used to drive adoption.
Explain how you would determine whether an alert should go to 24x7 eyes-on-glass or be handled during business hours.
I would evaluate the alert based on severity, business impact, and required response time. Alerts that indicate immediate service degradation or an outage and require immediate triage would go to 24x7. Lower-severity alerts, or those that can be addressed within normal business hours, would be ticketed. Routing would align with operational coverage and team skills.
What KPIs would you use to measure alerting health?
Key KPIs include alert volume trends by service and severity, percentage of alerts with valid runbooks, alert actionability rate (alerts with clear operator actions), noise reduction percentage, and mean time to acknowledge for critical alerts. These metrics help track improvements in alert quality and operational efficiency.
How would you ensure runbooks stay current as systems change?
I would establish a regular review cadence, such as quarterly runbook audits, and integrate runbook updates into change management processes. Service owners would be responsible for reviewing runbooks when their systems are updated. Version control and automated reminders would help maintain accuracy.

What this Job Entails: 

The Business Analyst IV will provide solutions that help attain business outcomes. The Alert Management & Observability Standards Lead is responsible for rationalizing and governing all system alerts to ensure they align with department priorities, operational coverage models, and service reliability goals. This role defines alerting standards, reviews and approves alerts before they are routed to the 24×7 Eyes-on-Glass Operations team, and establishes a scalable approach to cataloging alert response instructions (runbooks/playbooks) so responders can take consistent, high-quality actions.

This position operates at the intersection of the IT Operations Command Center (OCC), engineering/application teams, platform/monitoring tool owners, and service owners, ensuring alerts are actionable, prioritized, and paired with clear response guidance.

Your Roles and Responsibilities:

1) Alert Rationalization & Prioritization (Core)

Establish and maintain a department-wide alert rationalization framework that evaluates alerts for:

  • Business/service criticality and operational priority
  • Actionability (clear operator action available)
  • Signal-to-noise (duplicate/low-value alerts removed or suppressed)
  • Ownership and escalation paths
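
The four review criteria above can be applied in a fixed order to give each alert a disposition. The following is a minimal Python sketch, assuming a hypothetical AlertReview record with a 1-to-4 criticality scale and made-up disposition labels; it illustrates one way to encode the framework, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertReview:
    """One alert under review. Field names and the 1-4 scale are illustrative assumptions."""
    name: str
    criticality: int                # 1 = most business/service critical, 4 = least
    operator_action: Optional[str]  # clear action an operator can take, if any
    duplicate_of: Optional[str]     # another alert that already covers this signal
    owner: Optional[str]
    escalation_path: Optional[str]


def rationalize(alert: AlertReview) -> str:
    """Apply the review criteria in order and return a (hypothetical) disposition."""
    # Signal-to-noise: a duplicate adds noise without adding detection.
    if alert.duplicate_of:
        return "suppress (duplicate)"
    # Actionability: with no operator action, the alert should not interrupt anyone.
    # Highly critical signals stay visible on a dashboard instead of being dropped.
    if not alert.operator_action:
        return "convert to dashboard" if alert.criticality <= 2 else "suppress (no action)"
    # Ownership and escalation: actionable alerts need someone to hand off to.
    if not (alert.owner and alert.escalation_path):
        return "hold: assign owner/escalation"
    return "keep"
```

Checking duplicates first means noise is removed before any effort goes into assigning owners or writing runbooks for alerts that will not survive review.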

Perform regular alert reviews (new + existing) to ensure alert quality, correct routing, and alignment with operational coverage.

Lead continuous improvement efforts to reduce alert fatigue while preserving detection of true incidents and high-impact degradation.

2) Standards, Policies, and Guardrails

Define and enforce alerting standards including:

  • Severity definitions and thresholds
  • Required metadata (service, CI, owner, runbook link, escalation)
  • Naming conventions and tagging taxonomy
  • Routing rules and “when to page vs. when to ticket”

Create a standardized Alert Design Checklist and approval workflow (e.g., “Definition of Done” for alert onboarding).

Partner with tool/platform owners to ensure standards are embedded in monitoring tooling (templates, required fields, automated validation).
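
Automated validation of the required metadata could run as a check before an alert is accepted into the monitoring tool. Below is a minimal Python sketch; the required fields come from the standard above, while the SEV1 to SEV4 severity labels and the `<service>.<component>.<condition>` naming pattern are hypothetical placeholders for whatever the published standard defines.

```python
import re

# Required metadata from the alerting standard (service, CI, owner, runbook link, escalation).
REQUIRED_FIELDS = ("service", "ci", "owner", "runbook_link", "escalation")
# Hypothetical severity model and naming convention, for illustration only.
ALLOWED_SEVERITIES = {"SEV1", "SEV2", "SEV3", "SEV4"}
NAME_PATTERN = re.compile(r"^[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9_-]+$")


def validate_alert(definition: dict) -> list[str]:
    """Return the list of standard violations; an empty list means the alert passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Treat missing, None, and whitespace-only values as absent.
        if not str(definition.get(field) or "").strip():
            problems.append(f"missing required field: {field}")
    if definition.get("severity") not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity: {definition.get('severity')!r}")
    if not NAME_PATTERN.match(definition.get("name", "")):
        problems.append("name does not follow <service>.<component>.<condition>")
    return problems
```

Returning every violation at once, rather than failing on the first, lets an alert author fix a submission in a single pass through the approval workflow.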

3) Routing Decisions to 24×7 Eyes-on-Glass

Act as gatekeeper (or lead the governance process) for determining which alerts should:

  • Go to 24×7 Eyes-on-Glass for immediate triage
  • Route to on-call engineering directly
  • Create tickets for business-hours handling
  • Be suppressed, aggregated, or converted to dashboards/health indicators

Ensure routing aligns with:

  • Operational responsibilities and skills of the Eyes-on-Glass team
  • Department priorities (e.g., safety, reliability, customer impact)
  • Service ownership and support models
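
The four routing outcomes can be reduced to a small decision function. This is a minimal Python sketch assuming three hypothetical yes/no inputs (is the alert actionable, does it need an immediate response, can the Eyes-on-Glass team triage it with its current runbooks and skills); a real policy would derive these from severity, business impact, and the support model.

```python
from enum import Enum


class Route(Enum):
    EYES_ON_GLASS = "24x7 Eyes-on-Glass"
    ON_CALL = "On-call engineering"
    TICKET = "Business-hours ticket"
    DASHBOARD = "Suppress/aggregate or dashboard"


def route_alert(actionable: bool, needs_immediate_response: bool, eog_can_triage: bool) -> Route:
    """Hypothetical routing policy; the inputs are assumptions, not a published rule set."""
    if not actionable:
        # Nothing for a human to do: keep the signal visible without interrupting anyone.
        return Route.DASHBOARD
    if not needs_immediate_response:
        # The issue can safely wait for normal working hours.
        return Route.TICKET
    # Urgent: send to the command center only if its runbooks and skills cover triage;
    # otherwise page the owning engineering team directly.
    return Route.EYES_ON_GLASS if eog_can_triage else Route.ON_CALL
```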

4) Runbook / Response Instruction Cataloging (Knowledge System)

Establish a consistent approach to cataloging response instructions for every actionable alert, including:

  • “What does this alert mean?” (symptoms + impact)
  • “What to check first” (triage steps)
  • “What actions to take” (standard remediation)
  • “When to escalate and to whom” (clear escalation triggers)
  • Links to dashboards, logs, SOPs, and known issues

Own the runbook template and ensure runbooks are versioned, maintained, and reviewed on a defined cadence.

Partner with service owners to ensure runbooks stay current as systems change.
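
A runbook template with a review cadence can be modeled as a structured record plus two checks: is a review overdue, and are any required sections empty. The Python sketch below mirrors the template sections listed above; the field names and the 90-day (roughly quarterly) default cadence are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Runbook:
    """Fields mirror the template sections above; names are illustrative."""
    alert_name: str
    meaning: str             # what the alert means: symptoms + impact
    first_checks: list[str]  # triage steps
    actions: list[str]       # standard remediation
    escalation: str          # when to escalate and to whom
    links: list[str] = field(default_factory=list)  # dashboards, logs, SOPs, known issues
    version: int = 1
    last_reviewed: date = field(default_factory=date.today)


def review_due(rb: Runbook, today: date, cadence: timedelta = timedelta(days=90)) -> bool:
    """True when more time than the review cadence has passed since the last review."""
    return today - rb.last_reviewed > cadence


def incomplete_sections(rb: Runbook) -> list[str]:
    """Names of required template sections that are still empty."""
    required = {
        "meaning": rb.meaning,
        "first_checks": rb.first_checks,
        "actions": rb.actions,
        "escalation": rb.escalation,
    }
    return [name for name, value in required.items() if not value]
```

Usage: a runbook last reviewed on 1 January would be flagged on 15 April (104 days), and one with no remediation steps would report `["actions"]` as incomplete.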

5) Reporting & Operational Outcomes

Define and publish KPIs that demonstrate alerting health and operational performance, such as:

  • Alert volume trends by service and severity
  • Percentage of alerts with runbooks and valid ownership
  • Alert “actionability rate” and noise reduction
  • Mean time to acknowledge / triage effectiveness (as applicable)
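
The KPIs above can be computed from a list of alert events for a reporting window. This minimal Python sketch assumes a hypothetical AlertEvent record, treats "SEV1" as the critical severity, and measures noise reduction against a baseline volume from an earlier window; these definitions are one reasonable reading, not a standard.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class AlertEvent:
    """One fired alert in the reporting window (illustrative fields)."""
    service: str
    severity: str
    actionable: bool                 # did an operator have a clear action to take?
    has_runbook: bool
    has_owner: bool
    minutes_to_ack: Optional[float]  # None if never acknowledged


def alerting_kpis(events: list[AlertEvent], baseline_volume: int) -> dict:
    """Compute alert-health KPIs for one window against a baseline window's volume."""
    if not events or baseline_volume <= 0:
        raise ValueError("need at least one event and a positive baseline volume")
    total = len(events)
    critical_acks = [e.minutes_to_ack for e in events
                     if e.severity == "SEV1" and e.minutes_to_ack is not None]
    return {
        # Volume trend input: counts per (service, severity) pair.
        "volume_by_service_severity": Counter((e.service, e.severity) for e in events),
        # Share of fired alerts that had both a runbook and a valid owner.
        "runbook_and_owner_pct": 100 * sum(e.has_runbook and e.has_owner for e in events) / total,
        "actionability_rate_pct": 100 * sum(e.actionable for e in events) / total,
        "noise_reduction_pct": 100 * (baseline_volume - total) / baseline_volume,
        # Mean time to acknowledge for critical alerts only.
        "mtta_critical_min": mean(critical_acks) if critical_acks else None,
    }
```

Runbook coverage here is measured over fired alerts; it could equally be measured over alert definitions in a catalog, which weights rarely firing alerts the same as noisy ones.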

Facilitate governance forums (weekly/monthly) with service owners and engineering leads to review alert quality and backlog.

6) Cross-Functional Enablement

Coach service teams on best practices: SLIs/SLOs, alert thresholds, dependency monitoring, and incident correlation.

Drive adoption of observability patterns (golden signals, health indicators, multi-signal alerting).

Support major incident learning by feeding post-incident insights back into improved alerts and runbooks.

7) Deliverables Expected in the First 45 Days:

Alerting standards (severity model, metadata, naming, routing policy) published and adopted

Intake and approval workflow established for new/changed alerts

Top 20 noisy services rationalized (dedupe/suppress/threshold tuning) with measurable noise reduction

Runbook template launched; minimum runbook coverage targets set (e.g., 80% of paged alerts)

Central alert catalog created (ownership + routing + runbook link + last review date) 
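
A central catalog entry can carry the four fields named above, which also makes the runbook coverage target measurable. Below is a minimal Python sketch, assuming hypothetical route labels and that alerts routed to Eyes-on-Glass or on-call count as "paged"; the 80% default matches the example target above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CatalogEntry:
    """One row of the alert catalog: ownership + routing + runbook link + last review date."""
    alert_name: str
    owner: str
    route: str                   # hypothetical labels: "eyes-on-glass", "on-call", "ticket", "dashboard"
    runbook_link: Optional[str]
    last_review: date


# Assumption: these routes interrupt a person, so they count as "paged" alerts.
PAGED_ROUTES = {"eyes-on-glass", "on-call"}


def paged_runbook_coverage(catalog: list[CatalogEntry]) -> float:
    """Percentage of paged alerts in the catalog that link to a runbook."""
    paged = [e for e in catalog if e.route in PAGED_ROUTES]
    if not paged:
        return 100.0
    return 100 * sum(1 for e in paged if e.runbook_link) / len(paged)


def meets_target(catalog: list[CatalogEntry], target_pct: float = 80.0) -> bool:
    """True when paged-alert runbook coverage reaches the target percentage."""
    return paged_runbook_coverage(catalog) >= target_pct
```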

Required Qualifications/Skills:

5+ years in IT Operations, SRE, Observability, Monitoring Engineering, or Incident Management

Demonstrated success reducing noise and improving actionability across enterprise alerting ecosystems

Experience with common monitoring/observability tools (e.g., Splunk, AppDynamics, Dynatrace, Datadog, Prometheus/Grafana, Azure Monitor, CloudWatch, ServiceNow Event Mgmt or similar)

Strong understanding of:

  • Incident response workflows and operational coverage models (24×7 vs. business hours)
  • CMDB/service ownership concepts and dependency mapping
  • Standard operating procedures/runbooks and knowledge management

Excellent stakeholder management and ability to drive standards across teams

Preferred Qualifications: 

  • Experience designing or operating an Operations Command Center / NOC / SOC-style “eyes-on-glass” model
  • Familiarity with ITIL Event Management, SRE principles, and service reliability practices
  • Experience with automation for alert enrichment, correlation, and routing (e.g., event correlation, deduplication, noise suppression)
  • Background in governance frameworks and operating rhythm design (cadences, controls, compliance traceability)

Physical Demand & Work Environment:

  • Must have the ability to perform office-related tasks which may include prolonged sitting or standing
  • Must have the ability to move from place to place within an office environment
  • Must be able to use a computer
  • Must have the ability to communicate effectively 
  • Some positions may require occasional repetitive motion or movements of the wrists, hands, and/or fingers

What can Astreya offer you?

  • Employment in the fast-growing IT space providing you with a variety of career options
  • Opportunity to work with some of the biggest firms in the world as part of the Astreya delivery network
  • Introduction to new ways of working and awesome technologies
  • Career paths to help you establish where you want to go
  • Focus on internal promotion and internal mobility – we love to build teams from within
  • Free 24/7 accessible Professional Development through LinkedIn Learning and other online courses to give you opportunities to upskill at your own pace
  • Education Assistance
  • Dedicated management to provide you with on-point leadership and care
  • Numerous on the job perks
  • Market competitive compensation and insurance, health and wellness benefits

Salary Range

$98,040.00 – $154,800.00 USD (Salary)

  • Please note that the salary information provided herein is base pay only (gross); it does not include other forms of compensation which may or may not apply to this specific position, namely, performance-based bonuses, benefits-related payments, or other general incentives – none of which are guaranteed, may be subject to specific eligibility requirements, and are wholly within the discretion of Astreya to remit.
  • Further, the salary information noted above is a range that consists of a minimum and maximum rate of pay for this specific position. Where an applicant or employee is placed on this range will depend and be contingent on objective, documented work-related considerations like education, experience, certifications, licenses, preferred qualifications, among other factors.

Astreya offers comprehensive benefits to all Regular, Full-Time Employees, including:

  • Medical provided through UHC (PPO, HSA, Surest options); medical provided through Kaiser (HMO option only) is available to California employees only

  • Dental provided through UHC

  • Nationwide Vision provided by UHC

  • Flexible Spending Account for Health & Dependent Care

  • Pre-Tax Account for Commuter Benefit/Parking & Transit (location-specific)

  • Continuing Education and Professional Development via various integrated platforms, e.g. Udemy and Coursera

  • Corporate Wellness Program provided by Goomi Group

  • Employee Assistance Program

  • Wellness Days

  • 401k Plan

  • Basic and Supplemental Life Insurance

  • Short Term & Long Term Disability

  • Critical Illness, Critical Hospital, and Voluntary Accident Insurance

  • Tuition Reimbursement (available 6 months after start date, capped)

  • Paid Time Off (accrued and prorated, maximum of 120 hours annually)

  • Paid Holidays

  • Any other statutory leaves, paid time, or other ancillary benefits required under state and federal law


This job listing has been manually reviewed by the Jobicy Trust & Safety Team for compliance with our posting guidelines, including verification of the company's legitimacy, accuracy of job details, clarity of remote work policy, and absence of misleading or fraudulent content.
