Staff Site Reliability & DevOps Engineer – Observability

Remote from: Hungary, Bulgaria
Annual salary: Undisclosed
Department: DevOps & Infrastructure
Employment type: Full Time
Job posted: 17 May 2026
Apply before: 17 Jun 2026
Experience level: Senior

Empowering PR and Marketing Pros to Target, Reach and Engage Their Audiences.

AI Summary

Cision is seeking a Staff Site Reliability & DevOps Engineer to design, build, and operate observability platforms using Grafana and Prometheus. The role involves improving signal quality, automating observability configuration, and supporting incident response. The ideal candidate has strong experience with Prometheus, Grafana, Kubernetes, and infrastructure as code. This position offers an opportunity to shape the future of communication technology at a global leader in PR and marketing.

Job Complexity

AI Insight: The role requires advanced expertise in observability tools like Prometheus and Grafana, plus deep knowledge of Kubernetes and infrastructure as code, making it challenging but not the hardest level.

Salary Analysis

AI Insight: The salary for this role is not explicitly provided, but based on market data for a Staff Site Reliability & DevOps Engineer specializing in observability, the median is estimated at $165,000. This is competitive for the US market, reflecting the specialized skills required.

Key Skills

Site Reliability Engineering, DevOps, Observability, Grafana, Prometheus, Kubernetes, Infrastructure as Code, Terraform, Linux, Incident Management

Cover Letter Sample

Dear Hiring Manager,

I am excited to apply for the Staff Site Reliability & DevOps Engineer - Observability position at Cision. With extensive experience designing and operating observability platforms using Grafana and Prometheus, I am confident in my ability to ensure production systems are reliable and scalable. My background includes strong Linux and networking fundamentals, Kubernetes expertise, and proficiency in infrastructure as code with Terraform.

I have a proven track record of improving signal quality, reducing alert noise, and implementing SLOs to enhance incident response. I am passionate about building observability as a first-class platform capability and collaborating with engineering teams to instrument services correctly.

I am drawn to Cision's commitment to empowering individuals and fostering an inclusive environment. I look forward to contributing to your team and driving innovation in communication technology. Thank you for considering my application.

Sincerely,
[Your Name]

Possible Interview Questions

Describe your experience designing and operating a Prometheus-based monitoring system at scale. How did you handle federation and recording rules?

I have designed a multi-cluster Prometheus setup using federation to aggregate metrics from various Kubernetes clusters. I created recording rules to precompute expensive queries and reduce query latency. For example, I set up a global view of cluster health by federating node and pod metrics, and used recording rules for CPU and memory utilization to speed up dashboard loading.

How do you reduce alert noise and improve signal quality in an observability platform?

I start by analyzing alert history to identify noisy alerts with low signal-to-noise ratio. I then tune thresholds using statistical methods like standard deviation or percentile-based alerting. I also implement alert deduplication and grouping, and create runbooks to ensure alerts are actionable. For example, I reduced alert volume by 60% by switching from static thresholds to dynamic baselines.
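
The difference between a static threshold and a baseline derived from history can be sketched in a few lines of Python. This is a minimal illustration, not a description of any production setup: the latency samples, the `k=3` multiplier, and the function names are all assumptions made for the example.

```python
from statistics import mean, stdev, quantiles


def static_alert_count(samples, threshold):
    """Count samples that would fire a fixed-threshold alert."""
    return sum(1 for s in samples if s > threshold)


def baseline_threshold(history, k=3.0):
    """Dynamic threshold: mean plus k sample standard deviations."""
    return mean(history) + k * stdev(history)


def percentile_threshold(history, pct=99):
    """Dynamic threshold: the pct-th percentile of the observed history."""
    # quantiles(..., n=100) returns 99 cut points; index pct-1 is the pct-th.
    return quantiles(history, n=100, method="inclusive")[pct - 1]


# Hypothetical request latencies in milliseconds.
latency_ms = [100, 110, 95, 105, 120, 98, 102, 115, 108, 97]

static_fired = static_alert_count(latency_ms, threshold=110)
dynamic_fired = static_alert_count(latency_ms, baseline_threshold(latency_ms))
print(f"static threshold fired {static_fired} times, baseline fired {dynamic_fired} times")
```

On this sample the fixed threshold of 110 ms fires on ordinary variation, while the baseline only fires on values well outside the recent distribution. Real series are rarely this well behaved, so the window length and multiplier would need tuning per signal.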

Explain how you would integrate metrics, logs, and traces across distributed systems. What tools and approaches do you use?

I use OpenTelemetry to instrument services for traces and metrics, and export logs to a centralized system like Loki. I correlate traces with metrics by using common labels like service name and trace ID. For example, I set up Grafana to link from a metric spike to related logs and traces, enabling faster root cause analysis.
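
Correlating by trace ID amounts to a join on a shared key. The sketch below shows that idea on plain dictionaries; the field names (`trace_id`, `service`, `message`) and the sample records are hypothetical and stand in for whatever the tracing and logging backends actually return.

```python
from collections import defaultdict


def correlate_by_trace_id(spans, logs):
    """Attach log messages to the span that shares their trace ID."""
    logs_by_trace = defaultdict(list)
    for entry in logs:
        logs_by_trace[entry["trace_id"]].append(entry["message"])

    return {
        span["trace_id"]: {
            "service": span["service"],
            "logs": logs_by_trace.get(span["trace_id"], []),
        }
        for span in spans
    }


# Hypothetical records, as they might look after export.
spans = [
    {"trace_id": "abc123", "service": "checkout"},
    {"trace_id": "def456", "service": "search"},
]
logs = [
    {"trace_id": "abc123", "message": "payment provider timeout"},
    {"trace_id": "abc123", "message": "retrying request"},
    {"trace_id": "zzz999", "message": "unrelated log line"},
]

correlated = correlate_by_trace_id(spans, logs)
print(correlated["abc123"]["logs"])
```

The join only works if every service writes the trace ID into its log lines, which is why consistent instrumentation matters more than the choice of backend.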

Describe a time you used infrastructure as code to automate observability configuration. What tools did you use and what challenges did you face?

I used Terraform to deploy and configure Grafana dashboards, alert rules, and Prometheus scrape targets. One challenge was managing state drift when teams manually adjusted dashboards. I solved this by implementing a GitOps workflow where all changes go through pull requests, and using Terraform's import command to reconcile existing resources.

How do you approach capacity planning and performance analysis for an observability stack?

I monitor resource usage of the observability components themselves, such as Prometheus memory and disk usage, and Grafana query latency. I use historical data to predict growth and plan for scaling. For example, I set up alerts for when Prometheus storage is 80% full and used vertical pod autoscaling to adjust resources dynamically.
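
Predicting growth from historical data can be as simple as a least-squares line through past usage. The following is a rough sketch under stated assumptions: the daily usage figures, the 500 GB capacity, and the helper names are invented for illustration, and real disk growth is seldom perfectly linear. PromQL's `predict_linear` function applies the same kind of linear regression directly to a time series.

```python
def fit_line(points):
    """Ordinary least-squares fit over (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def days_until_fill(points, capacity_gb, fill_ratio=0.8):
    """Days from the last sample until usage reaches fill_ratio of capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    target_gb = capacity_gb * fill_ratio
    last_day = points[-1][0]
    return max(0.0, (target_gb - intercept) / slope - last_day)


# Hypothetical (day, used_gb) samples for a Prometheus data volume.
usage = [(0, 100.0), (1, 110.0), (2, 120.0), (3, 130.0)]

remaining = days_until_fill(usage, capacity_gb=500)
print(f"about {remaining:.0f} days until the volume is 80% full")
```

A forecast like this is a better paging signal than the raw 80% mark alone, because it distinguishes a slowly filling volume from one that will run out within hours.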

At Cision, we believe in empowering every individual to make an impact. Here, your voice is heard, your ideas are valued, and your unique perspective fuels our collective success. As part of our global team, you’ll thrive in an environment that champions curiosity, collaboration, and innovation, all while making meaningful contributions to the brands we accelerate. Join us in shaping the future of communication and building authentic connections that matter. Whether you’re solving complex problems or driving bold innovations, your growth is our success, and together, we’ll create the conversations of tomorrow. Empower your impact at Cision. Be seen, be understood, be you.

This role focuses on designing, operating, and evolving observability platforms with a strong emphasis on metrics, logging, and alerting. The primary tooling is Grafana and Prometheus, with responsibility for ensuring production systems are observable, reliable, and operable at scale. The role works closely with platform, infrastructure, and application teams.

Key responsibilities:
• Design, build, and operate observability platforms based on Grafana and Prometheus
• Define and maintain metrics standards, dashboards, alerts, and SLOs
• Improve signal quality: reduce alert noise, tune thresholds, and improve runbooks
• Support incident response by providing actionable telemetry and post-incident analysis
• Integrate metrics, logs, and traces across distributed systems
• Work with engineering teams to instrument services correctly
• Automate observability configuration using infrastructure as code
• Contribute to reliability improvements through capacity planning and performance analysis

Required skills and experience:
• Strong experience with Prometheus (scraping, federation, recording rules, alerting)
• Strong experience with Grafana (dashboards, alerting, templating, RBAC)
• Solid Linux and networking fundamentals
• Experience running observability stacks in Kubernetes environments
• Infrastructure as code experience (Terraform preferred)
• Familiarity with incident management and on-call practices
• Ability to debug production systems using metrics and logs

Nice to have:
• Experience with logs and traces (e.g. Loki, Tempo, OpenTelemetry)
• Experience operating large-scale or multi-cluster Kubernetes platforms
• Experience with cloud platforms (GCP, AWS, OCI)
• Exposure to SRE concepts such as error budgets and SLO-driven prioritisation

What success looks like
• Engineers trust dashboards and alerts to reflect system health
• Incidents are detected earlier and diagnosed faster
• Alert fatigue is reduced and on-call quality improves
• Observability is treated as a first-class platform capability

As a global leader in PR, marketing and social media management technology and intelligence, Cision helps brands and organizations to identify, connect and engage with customers and stakeholders to drive business results. PR Newswire, a network of over 1.1 billion influencers, in-depth monitoring, analytics and its Brandwatch and Falcon.io social media platforms headline a premier suite of solutions. Cision has offices in 24 countries throughout the Americas, EMEA and APAC. For more information about Cision’s award-winning solutions, including its next-gen Cision Communications Cloud®, visit www.cision.com and follow @Cision on Twitter.

Cision is committed to fostering an inclusive environment where all employees can be their authentic selves and perform at their best. We believe diversity, equity, and inclusion is vital to driving our culture, sparking innovation and achieving long-term success. Cision is proud to have joined more than 600 companies in signing the CEO Action for Diversity & Inclusion™ pledge and to have been named a “Top Diversity Employer” for 2021 by DiversityJobs.com.

Cision is proud to be an equal opportunity employer, seeking to create a welcoming and diverse environment. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, gender identity or expression, sexual orientation, national origin, genetics, disability, age, veteran status, or other protected statuses. Cision is committed to the full inclusion of all qualified individuals. In keeping with our commitment, Cision will take the steps to assure that people with disabilities are provided reasonable accommodations. Accordingly, if reasonable accommodation is required to fully participate in the job application or interview process, to perform the essential functions of the position, and/or to receive all other benefits and privileges of employment, please contact [email protected].

Please review our Global Candidate Data Privacy Statement to learn about Cision’s commitment to protecting personal data collected during the hiring process.