Job Description

Senior DevOps Engineer, Observability (Datadog-focused)

About The Role

Observability is a first-class product on our platform. We run a large Datadog estate fully managed as code (Terraform/Terragrunt) covering APM, RUM, Logs, Infrastructure, Synthetics, SLOs, and IRM/On-Call for 100+ services across multiple regions.

You will own this estate end-to-end: dashboards, monitors, SLOs, on-call routing, and the automated release gates that block bad deploys from reaching production. You will partner with service teams to raise the bar on telemetry quality, instrument new services, and make Datadog genuinely actionable.

What you'll work on

  • Own our Datadog-as-code monorepo — dashboards (global, app, database, pipeline, performance, GraphQL, RUM, release-validation), monitors (APM, K8s, CNPG Postgres, Ingress, Logs, Composite, FinOps, TLS, Tenant drift, Deployment health), SLOs, Log Indexes & Pipelines, and RUM metrics.
  • Drive the Observability Compliance release gate — enforce that every service ships with SLOs, monitors, dashboards, and log pipelines before it can go to prod.
  • Design and run Datadog IRM / On-Call — escalation policies, routing rules, schedules, and JSM integration driven by client SLAs.
  • Lead standardisation initiatives: health, log formats, trace tagging, RUM instrumentation, APM service naming.
  • Build SRE dashboards and evidence reports tied to release gates and quarterly reviews.
  • Close observability gaps.
  • Partner with product/engineering to turn raw telemetry into SLOs that match client contracts.
  • Mentor service teams on instrumentation — you are the internal Datadog expert.

Our observability stack

  • Datadog: APM, RUM, Logs, Infrastructure, Network, Synthetics, SLOs, IRM/On-Call, Notebooks, CI Visibility
  • IaC: Terraform + Terragrunt (Datadog provider), GitHub Actions delivery
  • Signals: OpenTelemetry (Go, Node/TS, Python), Datadog Agent, CNPG exporter, pgwatch, Kyverno policy metrics
  • Adjacent: Elasticsearch, Prometheus (limited), Slack (flux-events), PagerDuty-style routing via Datadog IRM + JSM
  • Languages: Terraform/HCL daily; Go and Python for tooling

What we're looking for

  • 6+ years in DevOps / SRE / Observability with deep, hands-on Datadog expertise (not just "used Datadog" — designed and scaled an estate).
  • Strong Terraform skills — you are comfortable authoring Datadog provider resources at scale (hundreds of monitors/dashboards as code).
  • Demonstrated ability to define and drive SLOs from business/contract requirements to implemented monitors and error budgets.
  • Real experience with APM tracing, log pipelines, log-based metrics, composite monitors, anomaly detection, forecast alerts.
  • Hands-on Kubernetes observability — Datadog Agent, DaemonSets, Admission Controller, cluster checks, autodiscovery.
  • Experience building or operating an on-call / incident response program (Datadog IRM, PagerDuty, Opsgenie, or similar).
  • Scripting in Python or Go — you can automate Datadog API workflows, backfill tags, migrate resources.
  • You care about signal quality over noise — you have killed more monitors than you have created.

Nice to have

  • OpenTelemetry contributions or deep tuning of the OTel Collector.
  • Regulated industry experience (FinServ, HealthTech) with audit-ready observability.
  • FinOps / cost-observability experience (Kubecost, Datadog Cloud Cost Management).
  • Experience migrating from another APM (New Relic, Dynatrace, AppDynamics) to Datadog.
  • Jira Service Management integration for incident → ticket workflows.

――――――――――――――――――――――――――――――――――――――――――――――――――

Why join us

  • Scale & impact: Our platform powers digital wealth for top-tier banks — real AUM, real regulatory stakes.
  • Modern stack: Flux, Terragrunt, Datadog-as-code, Envoy Gateway, streaming SQL, Temporal — running real production workloads.
  • Autonomy: Senior engineers own platforms end-to-end. No ticket-pushing, no gatekeepers.
  • Strategic initiatives: Autonomous agent platform, automated release gates, SSL for SaaS, multi-region DR — lots to build.
  • Team: Small, senior, opinionated DevOps/SRE group. You'll ship on day one.

Observability SME (Datadog) in Abu Dhabi, United Arab Emirates


Job Details

Role Level: Not Applicable
Work Type: Contract
Country: United Arab Emirates
City: Abu Dhabi
Company Website: http://www.halian.com/
Job Function: Others
Company Industry/Sector: Staffing and Recruiting


