Site Reliability Engineer

Type: Full Time
Closing date: 22 Oct 2021

You will be a key member of a tight-knit group of talented engineers who are responsible for keeping our own and our customers’ Kubernetes clusters operational and healthy. You’ll also play a key role in developing the product itself, working together with our Platform Engineers to deliver the best Kubernetes service possible.

Giant Swarm is a fast-growing open-source infrastructure management platform used by modern enterprises. Our vision is to empower developers around the world to ship great products. We are a diverse, experienced and growing team that has been fully remote since 2014 and is spread across Europe, with headquarters in Cologne.

YOUR JOB

  • You maintain, operate and upgrade our own and our customers’ Kubernetes clusters.
  • You will design, configure, build, and maintain our core infrastructure, from kernel parameters to the cloud provider templates.
  • You understand how servers and systems work and you tweak their behavior to your needs.
  • You will be responsible for our monitoring, logging and alerting.
  • You will help resolve incidents on our own and our customers’ clusters.
  • You participate in the on-call support schedule.
  • You are a go-to person whenever our developers need advice on infrastructure.
  • You will automate all the things, and the thought of Terraform doesn’t make you cry.
  • We (and most of our customers) are currently based mainly in Europe (around UTC), so your main time zone should be within UTC±2 to ensure good communication.

REQUIREMENTS

  • You must have deep, hands-on knowledge of Kubernetes from both the end-user and the operational side.
  • You’re comfortable debugging systems at all levels, from kernel fundamentals right up to workloads running on Kubernetes.
  • You’re happy troubleshooting a wide variety of issues and you’re not afraid to parse thousands of lines of logs in pursuit of an answer.
  • You have good coding skills (preferably Go, but Python or similar is fine as well).
  • You have experience maintaining infrastructure as code, and you know the pros and cons of various automation tools (we use Terraform and Ansible, but Chef, Puppet and the like are also a good start).
  • You are fluent with cloud-native tools running on top of Kubernetes (Prometheus, Grafana, ingress controllers, …): you know how to use them and how to configure them.
  • You automate all the things by writing code. Using bash scripts makes you sad 🙂