Job Description
We are looking for a Site Reliability Engineer to manage and maintain mission-critical cloud infrastructure for global customers. This role focuses on ensuring high availability, reliability, and performance of production systems. The candidate will work on monitoring systems, automating infrastructure, improving scalability, and collaborating with development teams to enhance overall system efficiency.
Responsibilities
Production & System Reliability
- Monitor system availability and maintain overall system health
- Ensure smooth functioning of production environments
- Provide operational support for large-scale distributed systems
Performance & Optimization
- Analyze system metrics and optimize performance
- Provide predictive insights to prevent system failures
- Improve system reliability and scalability
Automation & Infrastructure
- Build tools and automation for infrastructure management
- Develop systems to manage cloud and on-premises environments
- Improve deployment processes and reduce manual effort
Collaboration & Engineering Support
- Work with development teams to improve system quality and releases
- Participate in system design, capacity planning, and architecture discussions
- Support testing and deployment processes
Compliance & Process
- Follow organizational policies and quality standards
- Participate in risk assessment and system governance processes
Requirements
- 2+ years of experience in Site Reliability Engineering / DevOps
- Strong experience managing cloud infrastructure and production systems
- Experience in monitoring, troubleshooting, and performance tuning
- Ability to analyze system and application metrics
- Knowledge of automation and infrastructure tools
- Experience working with distributed systems
- Strong problem-solving and analytical skills
Skills
- Cloud Platforms (AWS, Azure, GCP)
- DevOps Tools (Jenkins, Git, CI/CD pipelines)
- Containerization (Docker, Kubernetes)
- Infrastructure as Code (Terraform, Ansible, Puppet)
- Programming/Scripting (Python, Shell)
- Monitoring Tools (e.g., Nagios)
- Linux / Operating Systems
- System Automation & Performance Optimization
Good to Have Skills
- Experience with Citrix Cloud or CloudStack
- Data center or ISP experience
- Knowledge of GPU systems and virtualization
- Experience supporting AI/ML workloads