I am a Staff-level Platform Engineer with over 10 years of experience architecting enterprise-scale cloud infrastructure on AWS, Azure, and GCP. My expertise lies in building and governing multi-account Kubernetes platforms, implementing GitOps, DevSecOps, and FinOps practices to drive efficiency, security, and cost savings. Throughout my career, I have delivered Internal Developer Platforms adopted by more than 30 teams, achieving 99.99% availability and significant annual cost reductions.
I specialize in designing scalable and reliable cloud environments, including multi-region Kubernetes clusters and AWS landing zones that cut provisioning time from 5 days to 2 hours and strengthened security posture. I am passionate about automation and governance, having implemented policy-as-code frameworks and observability stacks that lowered incident rates and reduced mean time to recovery by 35%.
My experience includes leading FinOps initiatives that resulted in over $1.2 million in annual savings without compromising reliability or performance. I am skilled in mentoring teams to adopt modern deployment strategies such as blue/green deployments and GitOps, which have increased deployment frequency threefold while reducing incidents by 70%.
I have a strong background in infrastructure as code using Terraform and other tools, container orchestration with Kubernetes and Helm, and continuous integration and delivery pipelines. Security and compliance are integral to my work, ensuring least-privilege access and full auditability aligned with SOC 2 Type II standards.
I thrive in dynamic environments where I can leverage my skills in cloud platforms, automation, and reliability engineering to build robust systems that support large-scale, mission-critical applications. I am committed to continuous learning and applying best practices to deliver high-quality solutions that meet business needs and exceed expectations.
Architected and governed multi-region, multi-account Kubernetes platform serving as Internal Developer Platform for 30+ teams at 99.99% availability.
Built 4-account AWS landing zone (Organizations/Control Tower/SCPs) in Terraform, cutting provisioning from 5 days to 2 hours (95% reduction) and standardizing it enterprise-wide.
Implemented multi-region EKS cluster/workload account separation, minimizing blast radius and simplifying IAM policy management across 30+ production, staging, and dev teams.
Built GitOps platform with ArgoCD, Helm, and blue/green deployments, enabling self-service and increasing deployment frequency 3× (bi-weekly to daily) while cutting incidents 70%; mentored teams on adoption.
Replaced static AWS credentials with IRSA for per-service-account least-privilege access, improving security posture and enabling full CloudTrail auditability for SOC 2 Type II.
Enforced DevSecOps via policy-as-code (OPA + Checkov + tfsec) in CI pipelines, blocking 100% of non-compliant changes and reducing high/critical CVEs by 80% through Trivy scanning and KMS encryption.
Engineered comprehensive SLI/SLO observability stack (Prometheus + Grafana + Loki + OpenTelemetry), decreasing MTTR 35% (85 to 55 minutes) with proactive anomaly detection.
Drove FinOps program (Reserved Instances + Spot + rightsizing), delivering $1.2M annual savings (25–30% reduction) with zero impact to reliability or performance, including cost allocation and chargeback models.
Established EKS upgrade governance (disruption budgets + staged rollouts + PDBs), achieving zero-downtime upgrades across all clusters in line with Well-Architected reliability practices.
Standardized secure pod-to-AWS access patterns using IRSA and RAM subnet sharing, enabling least-privilege workload isolation in multi-account environments.
Owned infrastructure and reliability for Kafka-based Centrifuge event delivery platform processing 2B+ daily events at 500K+ peak throughput with 99.9%+ availability.
Architected AWS foundation (EC2, VPC, MSK, RDS) for Centrifuge distributed job scheduler, reliably delivering billions of events to hundreds of partner APIs while absorbing transient failures.
Migrated 200+ EC2 instances to reusable Terraform modules, reducing provisioning time from days to 30 minutes and establishing IaC standards adopted across the organization.
Designed Kafka lag + CPU-driven autoscaling for Directors (ECS), handling 4× traffic spikes and reducing cloud spend 20% through efficient resource scaling.
Engineered per-tenant retry isolation and backpressure, preventing cascading failures during partner outages and preserving global delivery SLAs for high-cardinality workloads.
Built end-to-end observability stack (Prometheus + Grafana + ELK + Datadog), cutting mean time to detection 80% (25 to 5 minutes) and accelerating incident triage.
Led post-incident analysis and remediation processes, implementing job state machine improvements (exponential backoff, archival) that drove systemic reliability gains.
Contributed to backend services and monolith-to-SOA migration supporting global-scale booking traffic with safe, incremental rollout practices.
Executed monolith-to-SOA migration using dual-read gating (1% → 100% ramp), Diffy comparisons, and canary promotions, eliminating rollout incidents during decomposition.
Built services with Thrift IDL and generated RPC clients (mTLS, retries, context propagation), reducing cross-language drift and accelerating multi-language development.
Operationalized services with standardized metrics, templated dashboards, and IDL-defined alerts (p95/p99 latency, error rate, QPS), significantly reducing time-to-production readiness.
Developed high-performance REST APIs (Java/Spring Boot/PostgreSQL) achieving sub-100ms P99 latency under 50K+ concurrent requests.
Reduced peak API latency 25% by introducing asynchronous Kafka workflows and Redis caching, enhancing system resilience and user experience.