The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of critical applications and systems. SREs work at the intersection of development and operations, applying software engineering practices to infrastructure and operations problems. The role requires a deep understanding of system administration, programming, automation, cloud infrastructure, and cloud-native applications, along with a strong focus on operational excellence.
Responsibilities
MAJOR RESPONSIBILITIES AND ACCOUNTABILITIES
Develop and maintain Infrastructure as Code (IaC) using tools such as Terraform, Ansible, and Dynatrace to automate the deployment and management of infrastructure.
Build and manage CI/CD pipelines to ensure efficient and reliable application deployments.
Improve infrastructure provisioning and configuration through automation, minimizing manual interventions and reducing human error.
Monitor the health, performance, and reliability of production systems and applications.
Design, implement, and maintain automated monitoring solutions using tools such as Datadog.
Define and monitor service level objectives (SLOs), service level indicators (SLIs), and error budgets to ensure system reliability and availability meet customer expectations.
Implement effective alerting systems to identify and address potential issues before they impact users.
Lead root cause analysis (RCA) and post-mortem investigations after incidents to identify improvements and avoid recurrence.
Respond to production incidents, diagnose root causes, and implement corrective actions.
Create and maintain playbooks and documentation for incident response, troubleshooting, and recovery processes.
Collaborate closely with development teams during the post-deployment phase to ensure smooth rollouts and address any production issues.
Work alongside software engineers to design, deploy, and scale applications that are highly available, resilient, and fault tolerant.
Provide guidance and support in ensuring that code is written with an operational mindset, enabling easy deployment, monitoring, and debugging.
Act as a bridge between development, operations, and business teams, ensuring that infrastructure and software align with business goals.
Experience working with cloud platforms such as AWS, Microsoft Azure, and/or GCP.
Expertise with Git, Jenkins, CircleCI, GitLab CI, or similar CI/CD platforms.
Stay current with emerging technologies, tools, and trends in site reliability engineering, DevOps, and cloud computing.
Lead or contribute to internal initiatives aimed at improving system performance, reliability, and operational efficiency.
Propose and lead process improvements, optimizations, and innovations in automation and system design.
Strong written and verbal communication skills, able to collaborate with cross-functional teams, write documentation, and explain technical concepts to non-technical stakeholders.
Ability to work effectively in a fast-paced environment, collaborating with software developers, other SREs, operations teams, and business stakeholders.
REPORTING STRUCTURE
Does this position formally supervise employees? (Y / N)
No
JOB SPECIFICATIONS
Qualification
Bachelor’s / Master’s degree in computer science or a related technical field.
Years of Experience
5-8 years of relevant experience.
Skills And Capabilities
Details
People Management Skills
N/A
Technical Skills
Proficiency in using SDLC tools such as Azure DevOps/Jira.
Programming/Scripting Languages: Proficiency in one or more of the following: Python, Go, Ruby, Shell scripting, or JavaScript (Node.js).
Cloud Platforms: Hands-on experience with cloud services (AWS, Azure, GCP), containers, and container orchestration (Docker, Kubernetes).
Monitoring & Observability Tools: Experience with Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, or similar tools.
Infrastructure as Code (IaC): Proficient with tools such as Terraform, Ansible and Dynatrace.
Version Control & CI/CD: Expertise with Git, Jenkins, CircleCI, GitLab CI, or similar CI/CD platforms.
Database Management: Experience with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Oracle).
Networking and Security: Strong understanding of networking protocols (TCP/IP, DNS, HTTP/S) and security best practices (firewalls, IAM, VPNs, etc.).
Other Skills
Strong analytical and problem-solving abilities, with keen attention to detail
Ability to thrive in a fast-paced, dynamic environment and manage multiple priorities simultaneously
Proactive attitude towards learning new technologies and industry trends
Exceptional organizational and time management skills, with the ability to meet deadlines and deliver results under pressure
Willingness to contribute to a positive and inclusive team culture