The Senior Site Reliability Engineer will serve as the first line of defense for our 24/7 operations. You will act as the guardian of our production environment, utilizing Dynatrace to maintain a holistic view of both Infrastructure and Application health.
You will not just monitor uptime; you will actively test system resilience, manage major incidents, and facilitate stability reporting. You will be the primary notification point for all P1/P2 incidents, responsible for deep-dive triage, quick remediation, and coordinating Major Incident Management (MIM).
Key Responsibilities
24/7 Incident Command & Alerting
24/7 Availability: Participate in a shift rotation or on-call schedule to ensure continuous coverage. You are the "eyes on glass" for the organization.
Unified Alerting: Manage the notification workflow. Ensure that Critical Alerts for both Infrastructure failures and Application failures trigger immediate notifications to the 24/7 team.
Major Incident Management (MIM): Lead the technical response during critical outages. Coordinate cross-functional teams to restore service rapidly.
Observability Strategy (Dynatrace Focus)
Dynatrace Administration: Act as the Subject Matter Expert (SME) for our Dynatrace implementation.
Configure Management Zones, Alerting Profiles, and Dashboards to provide a "Single Pane of Glass."
Utilize Dynatrace PurePath for distributed tracing to identify bottlenecks in microservices.
Leverage Davis AI to automatically detect anomalies and reduce alert noise.
Comprehensive Monitoring Scope:
Network Health: Monitor VPN Tunnel status, Load Balancer (ALB/NLB) health, and DNS latency. Trigger: Alert on packet loss or high latency.
Security: Monitor for DDoS attack patterns and WAF spikes.
Resilience & Chaos Engineering
Chaos Engineering: Plan and execute Chaos Engineering exercises (e.g., simulating pod failures, network latency, zone outages) to test the system's resilience and verify that failover mechanisms work as expected.
Reliability Recommendations: Proactively analyze trends and provide architectural recommendations to development and infrastructure teams to improve system stability.
First Line Troubleshooting: Serve as the L1/L2 troubleshooter for Kubernetes (EKS), AWS, and Linux issues. Execute "Quick Fix" runbooks to mitigate impact before escalating to platform engineering.
Application Triage & Analysis
Deep-Dive Triage: Go beyond basic system checks to perform deep analysis using Dynatrace. Analyze stack traces and exception logs to pinpoint the exact line of code causing the failure.
Root Cause Differentiation: Rapidly differentiate between an Infrastructure Issue (e.g., a network timeout) and an Application Logic Error (e.g., a NullPointerException caused by bad data).
Blameless RCA: Facilitate Root Cause Analysis sessions to ensure permanent fixes are applied to recurring problems.
Governance & Reporting (Stability Cadence)
Stability Calls: Facilitate and lead the Weekly/Bi-Weekly Stability Call. Present the health status of all technical towers to leadership and stakeholders.
Reporting: Generate regular reports on system uptime, error budgets, incident trends, and MTTR (Mean Time To Recovery).
Cross-Tower Visibility: Ensure that dashboards and reports serve all teams (Network, App, Cloud) so that no siloed "blind spots" remain in production.
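For candidates unfamiliar with the reporting metrics named above, the arithmetic behind MTTR and an availability error budget can be sketched roughly as follows. This is an illustrative sketch only: the function names, the 99.9% target, and the 30-day window are assumptions made for the example, not values taken from this role or from any Dynatrace API.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery time over (detected, resolved) datetime pairs."""
    if not incidents:
        return timedelta(0)
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(0),
    )
    return total / len(incidents)


def error_budget_remaining(slo_target, window, downtime):
    """Fraction of the availability error budget still unspent.

    slo_target: e.g. 0.999 for "three nines" (assumed example value).
    window:     length of the reporting window as a timedelta.
    downtime:   total downtime observed in that window as a timedelta.
    """
    budget = window * (1 - slo_target)  # allowed downtime in the window
    return 1 - downtime / budget


# Hypothetical incident data, for illustration only.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 9, 22, 0), datetime(2024, 1, 9, 23, 0)),
]
print(mean_time_to_recovery(incidents))
print(error_budget_remaining(0.999, timedelta(days=30), timedelta(minutes=21.6)))
```

A negative result from `error_budget_remaining` would indicate that the budget for the window has been overspent.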
Automation & Toil Reduction
Remediation Scripting: Develop scripts (Python/Bash) to "Auto-Heal" common issues (e.g., clearing logs when disk is full, restarting stuck services).
Process Improvement: Identify manual checks and convert them into automated Dynatrace alerts or synthetic tests.
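As an illustration of the kind of "Auto-Heal" script described above, the sketch below removes old rotated log files once disk usage crosses a threshold. The 90% threshold, the 7-day age limit, the `*.log.*` naming pattern, and all function names are assumptions chosen for the example; a real runbook would define its own values and safeguards.

```python
import os
import shutil
import time
from pathlib import Path


def disk_used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def find_stale_logs(log_dir, max_age_days, now=None):
    """List rotated logs (e.g. app.log.1) older than max_age_days.

    The live log file (app.log) does not match the pattern and is never
    returned.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(log_dir).glob("*.log.*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def auto_heal_disk(log_dir, threshold=0.90, max_age_days=7, dry_run=True):
    """Delete stale rotated logs when disk usage is at or above threshold.

    Returns the list of files selected. With dry_run=True (the default)
    nothing is deleted, so the selection can be reviewed first.
    """
    if disk_used_fraction(log_dir) < threshold:
        return []
    stale = find_stale_logs(log_dir, max_age_days)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Defaulting to a dry run is a deliberate choice in this sketch: the script reports what it would remove, and deletion only happens when the caller opts in.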
Required Qualifications
Shift Availability: Must be willing to work in a 24/7 shift environment or a strictly defined on-call rotation.
Dynatrace Expertise: Deep experience administering and using Dynatrace in a production environment (Dashboards, OneAgent, PurePaths).
Troubleshooting Expertise:
Network: Understanding of DNS, TCP/IP, Load Balancing, and Firewalls.
Compute/Storage: Understanding of block vs. object storage, CPU steal time, and memory management.