Overview
We are looking for a hands-on Lead Site Reliability Engineer to own the reliability, observability, and automation of our Azure and hybrid (Azure Stack / on-prem) platforms. You will lead SRE practices for our AI, data, and application services, drive a cloud-agnostic DevSecOps toolchain, and partner with engineering, data, and security teams to ensure our platforms are secure, scalable, and cost-efficient. This role is ideal for a senior engineer with 10+ years of experience who can combine deep technical expertise with strong leadership and coaching skills.
Inception, a G42 company, is the region’s leading innovator of AI-powered domain-specific as well as industry-agnostic products, built on a rich heritage of research and development. Within the G42 ecosystem, Inception functions as the core intelligence layer – transforming data and compute infrastructure into real-world, applied AI solutions. Beyond its commercial endeavors, Inception is committed to creating positive societal impact. For more information, please visit www.inceptionai.ai
Responsibilities
- Own SLOs/SLIs and overall reliability for key Azure and on-prem platforms (data, AI/ML, and business-critical applications).
- Plan and optimise capacity, performance, and cost for compute, storage, networking, and GPU/accelerator workloads.
- Build and maintain observability (metrics, logs, traces, dashboards, alerts) using Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and central log platforms.
- Lead automation of infrastructure and operations using Terraform, Bicep, Ansible, and scripting (Python, PowerShell, Bash/Go); drive self-healing and runbook-driven operations.
- Operate Azure, Azure Stack, and on-prem Kubernetes/AKS clusters; ensure secure, resilient hybrid connectivity, identity, and access across environments.
- Lead P0/P1 incident response, on-call rotations, communication, and blameless post-mortems; drive long-term fixes and reliability improvements.
- Use ITSM and DevSecOps tools (e.g. cloud-agnostic CI/CD, ServiceNow, Jira, ManageEngine, security scanning and policy-as-code) to manage change, incidents, and compliance.
- Provide technical leadership and mentoring to SREs and platform engineers; collaborate with data, AI/ML, application, and security teams to design for reliability and security from day one.
Qualifications
Skills & Experience
- 10+ years in SRE/DevOps/platform engineering roles, including 5+ years designing and running workloads on Microsoft Azure at scale.
- Strong experience with Azure Data and AI services, including Azure Synapse Analytics, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Machine Learning, Azure OpenAI Service, and Azure Cognitive Services.
- Deep hands-on skills with containers and Kubernetes (AKS or equivalent), including autoscaling, upgrades, and production operations.
- Proficiency with Infrastructure-as-Code (Terraform, Bicep, Ansible) and scripting/programming in Python and/or PowerShell (Go/Bash a plus).
- Solid understanding of observability practices and tools (metrics, logs, traces) and experience implementing monitoring and alerting in production.
- Proven track record implementing SRE practices (SLOs/SLIs, error budgets, capacity planning, cost/performance optimisation).
- Familiarity with hybrid networking, identity, and security (ExpressRoute/VPN, private endpoints, Azure AD, key management).
- Experience working within Agile/Scrum and ITIL processes; exposure to ISO 27001 and external audits is an advantage.
- Excellent communication and stakeholder management skills, with a proven ability to lead, mentor, and influence cross-functional teams.
What Success Looks Like
- 99.9%+ availability for core platforms and customer-facing services.
- Fast and predictable incident handling (MTTD
- End-to-end observability with meaningful, low-noise alerting across Azure and on-prem environments.
- Significant reduction in manual toil through automation and self-service (target ~50% reduction over time).
- Documented and tested DR/BCP for key AI, data, and application platforms.
What We Look For
If you are a performance-driven, inquisitive mind with the agility to adapt to ambiguity, you will fit right in. You should be eager to build meaningful collaborations with stakeholders and aspire to create unique, customer-centric solutions. A bias for action and a passion for conquering new frontiers in the AI space are at the heart of the Inception community.
What Working At Inception Offers
Culture: An open, diverse and inclusive environment with a global vision that encourages personal growth and focuses on ground-breaking, industry-first innovations.
Career: Outstanding learning, development & growth opportunities via structured training programs and innovative, high-tech projects.
Rewards: A competitive remuneration package with a host of perks including healthcare, education support, leave benefits and more.
If you can confidently demonstrate that you meet the criteria above, please contact us as soon as possible.