Senior Customer Reliability Engineer – Infrastructure

Remote from
Ireland
Annual salary
Undisclosed
Salary information is not provided for this position.
Employment type
Full Time
Apply before
2 Jul 2026
Experience level
Senior

About Astronomer

The Apache Airflow Company

Verified job posting
This job post has been manually reviewed for authenticity and compliance.

AI Summary

Astronomer is hiring a Senior Customer Reliability Engineer to ensure the reliability of their managed Airflow service. The role involves operating cloud infrastructure and Kubernetes clusters, responding to incidents, and improving observability. This customer-facing position requires troubleshooting, automation, and collaboration with product teams. Ideal candidates have 6+ years of experience with large-scale cloud infrastructures and 4+ years with Kubernetes.

Role DNA

Rating scales (slider positions not preserved):
Job Complexity: Easy to Hard
Pace & Pressure: Relaxed to Fast-paced
Autonomy Level: Guided to Full Ownership
Communication Load: Independent to Highly Collaborative

AI Insight: The role requires deep technical expertise in Kubernetes, cloud infrastructure, and incident response, combined with strong customer-facing skills, making it highly challenging.

Salary Analysis

Median Market Rate: $155,000
US Market: $130k – $200k

AI Insight: The salary was not provided in the listing, so an estimate based on US market data for a Senior Customer Reliability Engineer with Kubernetes and cloud expertise suggests a median of $155,000. This is competitive for a senior-level role.

Key Skills

Kubernetes, Cloud Infrastructure, Site Reliability Engineering, Incident Response, Python, Linux, CI/CD, Troubleshooting, Customer Support, Automation

Dear Hiring Manager,

I am excited to apply for the Senior Customer Reliability Engineer - Infrastructure position at Astronomer. With over 6 years of experience managing large-scale cloud infrastructures and deep expertise in Kubernetes, I am confident in my ability to ensure the reliability and performance of your managed Airflow service.

In my previous role, I successfully led incident response and improved observability for distributed systems, reducing downtime by 30%. I thrive in customer-facing environments, working directly with clients to troubleshoot complex issues and deliver exceptional support.

I am particularly drawn to Astronomer's mission of empowering data teams and would love to contribute to your CRE team. Thank you for considering my application.

Sincerely,
[Your Name]

Describe a time you resolved a critical incident in a Kubernetes cluster. What steps did you take?
I once encountered a cluster-wide network failure affecting customer workloads. I immediately isolated the issue using monitoring tools, identified a misconfigured network policy, and applied a hotfix. I then implemented automated checks to prevent recurrence and communicated updates to stakeholders.
How do you prioritize customer issues when multiple incidents occur simultaneously?
I assess impact based on severity, SLAs, and customer tier. Critical issues affecting production get top priority. I delegate tasks if possible, communicate transparently with customers, and escalate if needed.
Explain your experience with Kubernetes Custom Resources and their use cases.
I've used Custom Resource Definitions (CRDs) to extend the Kubernetes API with application-specific resources, paired with custom operators for tasks such as backup and monitoring. This allows declarative management and automation of complex workflows.
How do you approach building an observability platform for a distributed system?
I start by defining key metrics (latency, errors, traffic) and implement logging, metrics, and tracing using tools like Prometheus, Grafana, and ELK. I set up alerts based on SLOs and ensure dashboards provide actionable insights.
Describe a time you improved automation for operational tasks. What was the impact?
I automated the deployment of Kubernetes clusters using Terraform and Helm, reducing provisioning time from days to hours. This improved consistency and freed up time for proactive reliability improvements.
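
The prioritization answer above (severity first, then SLAs, then customer tier) can be sketched as a sort key. This is a minimal illustration, not a description of how Astronomer triages incidents: the `Incident` fields, the severity and tier names, and the incident names are all hypothetical, and real SLA handling would likely be more nuanced.

```python
from dataclasses import dataclass

# Hypothetical severity scale: a lower rank is more urgent.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}
# Hypothetical customer tiers: a lower rank is a higher tier.
TIER_RANK = {"enterprise": 0, "business": 1, "standard": 2}


@dataclass(frozen=True)
class Incident:
    name: str
    severity: str               # one of the SEVERITY_RANK keys
    minutes_to_sla_breach: int  # time left before the SLA is missed
    customer_tier: str          # one of the TIER_RANK keys


def triage_key(incident: Incident) -> tuple:
    """Order by severity, then by time left on the SLA, then by customer tier."""
    return (
        SEVERITY_RANK[incident.severity],
        incident.minutes_to_sla_breach,
        TIER_RANK[incident.customer_tier],
    )


def prioritize(incidents: list) -> list:
    """Return incidents sorted from most to least urgent."""
    return sorted(incidents, key=triage_key)


if __name__ == "__main__":
    queue = [
        Incident("dag-parse-slow", "minor", 240, "enterprise"),
        Incident("scheduler-down", "critical", 30, "standard"),
        Incident("node-pool-degraded", "critical", 30, "enterprise"),
        Incident("webserver-5xx", "major", 15, "business"),
    ]
    for incident in prioritize(queue):
        print(incident.name)
```

Because tuples compare element by element, a critical incident always outranks a major one regardless of tier, and tier only breaks ties between incidents with the same severity and SLA deadline.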

Astronomer empowers data teams to bring mission-critical software, analytics, and AI to life and is the company behind Astro, the industry-leading unified DataOps platform powered by Apache Airflow®. Astro accelerates building reliable data products that unlock insights, unleash AI value, and power data-driven applications. Trusted by more than 800 of the world’s leading enterprises, Astronomer lets businesses do more with their data. To learn more, visit www.astronomer.io.

About this role

The Astronomer Customer Reliability Engineering (CRE) team is responsible for the success of our customers’ usage of our managed Airflow service.

The CREs are responsible for operating, monitoring, and maintaining the platform to ensure availability, predictability, and reliable operations.

As a senior infrastructure specialist within the team, you will focus on the reliability of the underlying cloud infrastructure and Kubernetes clusters. This entails responding to incidents, whether raised by a customer or by our monitoring system, and then taking further steps to ensure problems are permanently resolved or monitored. As owners of the observability platform, CRE has unlimited potential to improve the reliability of the product and deliver the best possible outcome for our customers.

This role is directly customer-facing and gives exposure to very diverse problems and requirements. CREs get the opportunity to interface with customers from a variety of industries and across different cloud providers, each with different expectations. Your contributions will directly impact customers’ success using the Astronomer products, and you will be able to help make meaningful improvements to the customer experience.

What you get to do:

  • Provide solutions to customers to make them successful using our products.

  • Troubleshoot customer environments and engage in active triaging with customers.

  • Participate in the on-call rotation for weekend coverage.

  • Provide feedback to the product development teams on customer needs and pain points.

  • Build out our monitoring and alerting systems.

  • Build and maintain automation to ensure daily operational tasks are handled as efficiently as possible.

  • Help direct the architecture of the products and contribute where possible.

  • Own the customer experience, working directly with customers to prioritize and solve issues, meet SLAs, and provide “white glove” guidance on the path to production.

  • Participate remotely within a fully distributed team.

  • Enhance and enrich customer documentation.

  • Work with the latest technology and multi-cloud implementations.

What you bring to the role:

  • 6 years of experience, preferably with large, complex cloud infrastructures operating at scale

  • 4 years of experience with Kubernetes

  • Experience managing a production distributed system with at least one major cloud provider (one or all: AWS, GCP, Azure)

  • Strong Linux experience

  • Knowledge of how to operate distributed systems and monitor them for issues

  • Previous experience in handling customer issues (internal or external)

  • Strong communication skills

  • DevOps or CI/CD experience

  • Python scripting

  • Good troubleshooting skills

Bonus points if you have:

  • Experience as a Site Reliability Engineer

  • Experience working with Kubernetes Custom Resources

  • Depth of knowledge with Azure

  • Airflow/Big Data Orchestration experience

  • IaC experience


At Astronomer, we value diversity. We are an equal opportunity employer: we do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
