Job Description

Department

Digital & Technology Office

Employee Type

Probationary

The Senior Site Reliability Engineer will serve as the first line of defense for our 24/7 operations. You will act as the guardian of our production environment, utilizing Dynatrace to maintain a holistic view of both Infrastructure and Application health.

You will not just monitor uptime; you will actively test system resilience, manage major incidents, and facilitate stability reporting. You will be the primary notification point for all P1/P2 incidents, responsible for deep-dive triage, quick remediation, and coordinating Major Incident Management (MIM).

Key Responsibilities

24/7 Incident Command & Alerting

  • 24/7 Availability: Participate in a shift rotation or on-call schedule to ensure continuous coverage. You are the "eyes on glass" for the organization.
  • Unified Alerting: Manage the notification workflow. Ensure that Critical Alerts for both Infrastructure failures and Application failures trigger immediate notifications to the 24/7 team.
  • Major Incident Management (MIM): Lead the technical response during critical outages. Coordinate cross-functional teams to restore service rapidly.

Observability Strategy (Dynatrace Focus)

  • Dynatrace Administration: Act as the Subject Matter Expert (SME) for our Dynatrace implementation.
      • Configure Management Zones, Alerting Profiles, and Dashboards to provide a "Single Pane of Glass."
      • Utilize Dynatrace PurePath for distributed tracing to identify bottlenecks in microservices.
      • Leverage Davis AI to automatically detect anomalies and reduce alert noise.
  • Comprehensive Monitoring Scope:
      • Network Health: Monitor VPN Tunnel status, Load Balancer (ALB/NLB) health, and DNS latency. Trigger: Alert on packet loss or high latency.
      • Infrastructure Health: Monitor Disk/Volume usage, CPU/Memory saturation, and SSL Certificate expiry.
      • Security: Monitor for DDoS attack patterns and WAF spikes.

Resilience & Chaos Engineering

  • Chaos Engineering: Plan and execute Chaos Engineering exercises (e.g., simulating pod failures, network latency, zone outages) to test the system's resilience and verify that failover mechanisms work as expected.
  • Reliability Recommendations: Proactively analyze trends and provide architectural recommendations to development and infrastructure teams to improve system stability.
  • First Line Troubleshooting: Serve as the L1/L2 troubleshooter for Kubernetes (EKS), AWS, and Linux issues. Execute "Quick Fix" runbooks to mitigate impact before escalating to platform engineering.

Application Triage & Analysis

  • Deep-Dive Triage: Go beyond "system check" to perform deep analysis using Dynatrace. Analyze stack traces and exception logs to pinpoint the exact line of code causing the failure.
  • Root Cause Differentiation: Rapidly differentiate between an Infrastructure Issue (e.g., Network timeout) vs. an Application Logic Error (e.g., NullPointer caused by bad data).
  • Blameless RCA: Facilitate Root Cause Analysis sessions to ensure permanent fixes are applied to recurring problems.

Governance & Reporting (Stability Cadence)

  • Stability Calls: Facilitate and lead the Weekly/Bi-Weekly Stability Call. Present the health status of all technical towers to leadership and stakeholders.
  • Reporting: Generate regular reports on system uptime, error budgets, incident trends, and MTTR (Mean Time To Recovery).
  • Cross-Tower Visibility: Ensure that the dashboards and reports provide value to all teams (Network, App, Cloud), ensuring no siloed "blind spots" in production.
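
For candidates wondering what the reporting metrics above involve, here is a minimal Python sketch of how MTTR and remaining error budget might be computed. The function names, the incident tuple format, and the 99.9% availability target used in the usage note are illustrative assumptions, not details taken from this posting or from Dynatrace.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average time from detection to resolution.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs.
    This tuple format is an assumption made for illustration.
    """
    if not incidents:
        return timedelta(0)
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(0),
    )
    return total / len(incidents)


def error_budget_remaining(slo_target, period, downtime):
    """Allowed downtime for the period minus the downtime already spent.

    `slo_target` is a fraction such as 0.999 (a hypothetical target).
    A negative result means the budget has been overspent.
    """
    budget = period * (1 - slo_target)
    return budget - downtime
```

As a usage example, two incidents lasting 30 and 90 minutes give an MTTR of 60 minutes, and a hypothetical 99.9% target over 30 days allows roughly 43 minutes of downtime in total.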

Automation & Toil Reduction

  • Remediation Scripting: Develop scripts (Python/Bash) to "Auto-Heal" common issues (e.g., clearing logs when disk is full, restarting stuck services).
  • Process Improvement: Identify manual checks and convert them into automated Dynatrace alerts or synthetic tests.
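
To illustrate the kind of "Auto-Heal" script described above, the following is a minimal Python sketch that finds old log files once disk usage crosses a threshold. The 90% threshold, the 7-day retention window, the `*.log` pattern, and all function names are assumptions chosen for illustration; an actual runbook would define its own values and safety checks.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def usage_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return 100.0 * used / total if total else 0.0


def find_stale_logs(log_dir: Path, max_age_days: float,
                    now: Optional[float] = None) -> List[Path]:
    """List *.log files in log_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in log_dir.glob("*.log")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def auto_heal_disk(path: Path, log_dir: Path, threshold_pct: float = 90.0,
                   max_age_days: float = 7.0, dry_run: bool = True) -> List[Path]:
    """If the disk holding `path` is above the threshold, prune stale logs.

    Returns the stale log files found. With dry_run=True (the default)
    nothing is deleted, so the result can be reviewed first.
    """
    usage = shutil.disk_usage(path)
    if usage_percent(usage.total, usage.used) < threshold_pct:
        return []  # Disk is healthy; take no action.
    stale = find_stale_logs(log_dir, max_age_days)
    if not dry_run:
        for log_file in stale:
            log_file.unlink()
    return stale
```

Defaulting to a dry run is a deliberate choice in this sketch: an automated fix that deletes files should be reviewable before it is allowed to act on production hosts.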

Required Qualifications

  • Shift Availability: Must be willing to work in a 24/7 shift environment or a strictly defined on-call rotation.
  • Dynatrace Expertise: Deep experience administering and using Dynatrace in a production environment (Dashboards, OneAgent, PurePaths).
  • Troubleshooting Expertise:
      • Network: Understanding of DNS, TCP/IP, Load Balancing, and Firewalls.
      • Compute/Storage: Understanding of block vs. object storage, CPU steal time, and memory management.
  • Governance: Experience facilitating technical management calls and producing executive-level reliability reports.
  • Application Debugging: Ability to read application logs (Java, Node, Python) to understand why a service failed.
  • Cloud (AWS) & K8s: Solid understanding of EKS, EC2, and other AWS services.

Experience Range (Years)

4 - 8 years

Job posted on

2026-03-12


Job Details

Role Level: Mid-Level
Work Type: Full-Time
Country: Philippines
City: Pasay, National Capital Region
Company Website: https://www.cebupacificair.com/en-PH/pages/about/careers
Job Function: DevOps & QA
Company Industry/Sector: Airlines and Aviation
