About Us
Core42, a leader in AI-powered cloud and digital infrastructure, is driving transformative technology solutions globally. Leveraging advanced resources and partnerships, Core42 empowers clients to harness sovereign AI infrastructure, especially in sectors with stringent regulatory needs. With a mission to redefine digital transformation, we combine sovereign capabilities with scalable, high-performance compute infrastructure, positioning ourselves at the forefront of AI innovation in the Middle East and beyond.
The Opportunity
We are seeking a Principal Site Reliability Engineer to architect and lead the evolution of our globally distributed infrastructure supporting AI and private cloud workloads. This is a high-impact technical leadership role focused on building scalable, resilient, and self-healing platforms through advanced automation and AIOps.
You will act as a technical authority, partnering with engineering, product, and leadership teams to drive autonomous service delivery, improve reliability, and enable large-scale AI innovation.
Key Responsibilities
Platform Architecture & Strategy
- Define and lead the long-term roadmap for infrastructure, CI/CD, and Kubernetes platforms
- Design scalable, distributed systems aligned with AI/ML and HPC workloads
- Establish standards for infrastructure-as-code and platform engineering
Automation & AIOps
- Design and implement AI-driven automation and self-healing systems
- Develop autonomous workflows for incident remediation and capacity optimisation
- Evolve observability into predictive AIOps capabilities
Kubernetes & Infrastructure Engineering
- Architect high-performance Kubernetes environments for multi-tenancy and GPU-intensive workloads
- Optimise infrastructure for performance, scalability, and cost efficiency
- Support advanced scheduling and orchestration frameworks for AI workloads
Observability & Reliability
- Build and enhance observability platforms integrating metrics, logs, and tracing
- Define SLOs/SLIs aligned with business outcomes
- Lead root cause analysis (RCA) and promote reliability best practices including error budgets
Leadership & Technical Excellence
- Act as the escalation point for complex system issues
- Mentor and develop SRE and DevOps teams, driving a culture of excellence
- Lead architectural reviews and contribute to internal Centers of Excellence
Cross-Functional Collaboration
- Partner with product and engineering teams to balance innovation with reliability
- Translate technical challenges into business impact for senior stakeholders
- Influence infrastructure and platform strategy across the organisation
Qualifications & Experience
Required
- 10+ years of experience in Site Reliability Engineering, Platform Engineering, or Systems Architecture
- Proven experience designing and operating large-scale distributed systems
- Deep expertise in Kubernetes environments (EKS, GKE, or bare metal), including GPU workloads
- Strong programming skills in Python, Go, or Rust
- Extensive experience with Terraform, Helm, and infrastructure-as-code practices
- Strong understanding of observability systems (metrics, logging, tracing)
Preferred
- Experience with AI/ML infrastructure, including model serving and data pipelines
- Familiarity with scheduling frameworks (e.g., Ray, Kueue, Volcano)
- Experience building automation or AI-driven operational tools
- Certifications such as CKA or AWS/Azure Solutions Architect
- Experience influencing technical strategy across large organisations
What We’re Looking For
A highly experienced and forward-thinking engineer with deep technical expertise and a passion for building resilient, scalable systems. You are a strong problem solver, an influential leader, and a strategic thinker who can drive innovation while maintaining operational excellence.
What Working At Core42 Offers
With a diverse team of 1,100+ employees from 68 nationalities, we foster an inclusive, innovative and collaborative environment and a culture grounded in trust, accountability and high performance. We are united by our values: Grit, where we overcome challenges with resilience and determination; Passion, which drives us to pursue excellence in everything we do; and Impact, as we aim to inspire progress and create meaningful change. Our team members thrive in an environment where each person’s contributions propel us forward, and together, we commit to achieving extraordinary results.
- Competitive Salary: We offer an attractive salary package based on your skills and experience
- Yearly Bonus: In recognition of your contributions, you will receive a performance-based annual bonus
- Exclusive Discount Cards: Access special benefits with Esaad and Fazaa cards, offering discounts across a wide range of services
- Premium Family Insurance: We provide comprehensive health coverage, including dental, vision and life insurance, ensuring the well-being of you and your family
- Learning & Development: We offer access to top-tier learning platforms to help you grow in your career. Learn at your own pace with unlimited access to premium courses