Site Reliability Engineer II

Talentmate

India

21st May 2026

2605-5795-634

Job Description

The Site Reliability Engineer II collaborates with engineering teams to enhance system resilience, scalability, and performance through feature development, automation, architectural design, resiliency testing, and disaster recovery planning, while promoting best practices for continuous improvement.

Responsibilities

Key Responsibilities

Monitor application and infrastructure health using enterprise monitoring and observability tools, including ELF, to ensure availability, performance, and reliability of enterprise platforms
Configure, tune, and maintain alerting mechanisms in ELF, aligned to service health indicators and SLOs, to enable timely incident detection and reduce noise and false positives
Develop and maintain dashboards providing visibility into system performance, availability, reliability trends, and key operational metrics
Analyze metrics, logs, and distributed traces across application and infrastructure layers to proactively identify issues and support effective root cause analysis (RCA)
Own and execute blameless RCAs for production incidents, identify corrective and preventive actions, and track them to closure
Implement minor code fixes, configuration updates, and reliability enhancements as part of incident remediation and preventive measures
Collaborate with application development and platform teams to review defects, propose fixes, and improve overall service reliability
Participate in Agile sprint planning ceremonies, backlog grooming, estimation, and delivery of SRE‑owned work items
Drive reliability improvements through sprint‑based commitments, including automation, operational fixes, and platform enhancements
Participate in Disaster Recovery (DR) planning, testing, and execution to ensure resilience of business‑critical services
Perform regular system patching and maintenance activities in line with organizational security, compliance, and audit requirements
Support ITIL‑based Incident, Problem, and Change Management processes, including planning, documentation, approvals, execution, and post‑implementation validation
Monitor network performance and troubleshoot connectivity, latency, and access‑related issues impacting platform traffic
Participate in certificate lifecycle management, including provisioning, renewal, validation, and troubleshooting of SSL/TLS certificates
Maintain and manage service accounts (Service IDs), including access provisioning, credential rotation, and compliance with security policies
Drive automation and operational toil reduction using scripting, CI/CD pipelines, and platform tooling to improve reliability and scalability
Maintain accurate documentation of system configurations, runbooks, SOPs, platform operational guidelines, and troubleshooting procedures, and generate reports on system performance, incidents, and resolutions

Qualifications

Education and Knowledge

Minimum of 5 years of relevant experience in application development, maintenance, and production support, along with hands-on exposure to Java and distributed systems in enterprise environments
Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience; advanced degree is a plus
Strong knowledge of operating systems and application runtimes such as Java and .NET
Knowledge of distributed systems and service‑based architectures from an operations and reliability perspective
Strong knowledge of modern observability stacks and platforms, including Splunk, Elasticsearch, Prometheus, and Grafana
Knowledge of observability practices including logging, monitoring, tracing, and performance analysis
Knowledge of RDBMS and NoSQL databases including MySQL, PostgreSQL, Couchbase, HBase, and Cassandra
Knowledge of scripting and automation using languages such as PowerShell and Python
Basic understanding of AI, analytics, or AIOps platforms from an operational perspective is a plus

Work Experience

Experience in Incident, Problem, and Change Management using ServiceNow or similar ITSM tools
Experience supporting production systems in large‑scale enterprise environments with a focus on reliability and availability
Experience in system administration, infrastructure operations, and network troubleshooting
Experience with CI/CD pipeline implementation and support using tools such as Jenkins, GitHub Actions, XL Release (XLR), or similar
Experience managing and troubleshooting technology infrastructure and services, including servers, networks, and cloud platforms
Knowledge of cloud‑based Site Reliability Engineering (SRE) practices, with hands‑on experience with public cloud platforms such as AWS, Azure, or Google Cloud Platform
Knowledge of containerization and orchestration technologies such as Docker and Kubernetes, and microservices‑based architectures
Experience using enterprise monitoring and alerting platforms such as ELF
Exposure to AI‑assisted monitoring, automation, or AIOps tools is a plus
Experience accessing and managing remote systems using tools such as RDP and Citrix
Proficiency in connecting to and administering servers via SSH (Secure Shell)
Knowledge of core networking concepts including ports, protocols, firewalls, and secure remote access

Licenses & Certifications

Certification in at least one programming language or runtime such as Java, .NET, or Python
Certification in containerization and orchestration technologies (Docker, Kubernetes, OpenShift) is a plus
Public cloud certification in AWS or GCP is a plus
Certification or training related to AI platforms, analytics platforms, or AIOps is a plus

About Us

At American Express, our culture is built on a 175-year history of innovation, shared values and Leadership Behaviors, and an unwavering commitment to back our customers, communities, and colleagues. From delivering differentiated products to providing world-class customer service, we operate with a strong risk mindset, ensuring we continue to uphold our brand promise of trust, security, and service.

As part of Team Amex, you’ll experience our powerful backing with comprehensive support for your holistic well-being and many opportunities to learn new skills, develop as a leader, and grow your career. Here, your voice and ideas matter, your work makes an impact, and together, you will help us define the future of American Express.

About The Team

We back you with benefits that support your holistic well-being so you can be and deliver your best. This means caring for your and your loved ones' physical, financial, and mental health, as well as providing the flexibility you need to thrive personally and professionally:

Competitive base salaries
Bonus incentives
Support for financial well-being and retirement
Comprehensive medical, dental, vision, life insurance, and disability benefits (depending on location)
Flexible working model with hybrid, onsite or virtual arrangements depending on role and business need
Generous paid parental leave policies (depending on your location)
Free access to global on-site wellness centers staffed with nurses and doctors (depending on location)
Free and confidential counseling support through our Healthy Minds program
Career development and training opportunities

American Express is an equal opportunity employer and makes employment decisions without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, disability status, age, or any other status protected by law.

Offer of employment with American Express is conditioned upon the successful completion of a background verification check, subject to applicable laws and regulations.

Job Details

Role Level: Not Applicable
Work Type: Full-Time
Country: India
City: Chennai, Tamil Nadu
Company Website: https://www.americanexpress.com/
Job Function: DevOps & QA
Company Industry/Sector: Financial Services

About the Company

Searching, interviewing, and hiring are all part of professional life. The TALENTMATE portal aims to help professionals with each of these by bringing everything they need together under one roof. Whether you're hunting for your next job opportunity or looking for potential employers, we're here to lend a helping hand.

Disclaimer: talentmate.com is only a platform to bring jobseekers and employers together. Applicants are advised to research the bona fides of the prospective employer independently. We do NOT endorse any requests for money payments and strictly advise against sharing personal or bank-related information. We also recommend you visit Security Advice for more information. If you suspect any fraud or malpractice, email us at abuse@talentmate.com.