Job Description
This role focuses on managing and scaling CloudVision’s global cloud infrastructure, which is deployed on Kubernetes, with a strong emphasis on:
- Reliability
- Scalability
- Automation
- Observability
- Security
- Incident Response
The Senior SRE will manage large-scale distributed systems and production infrastructure while improving operational efficiency through automation and infrastructure engineering.
Technology Stack
Cloud & Infrastructure
- Kubernetes
- GKE (Google Kubernetes Engine)
- Docker
- Virtualization Technologies
CI/CD & DevOps
- Spinnaker
- GitLab
- Terraform
- Infrastructure-as-Code (IaC)
Databases & Storage
- HBase
- Hadoop
- PostgreSQL
- ClickHouse
- Elasticsearch
Streaming & Analytics
- Kafka
- Real-time Stream Processing
- TensorFlow
Monitoring & Observability
- Prometheus
- Grafana
- Loki
Key Responsibilities
1. Production Infrastructure Management
- Operate and manage global CloudVision services
- Ensure:
  - High availability
  - Reliability
  - Performance
  - Scalability
  - Security
2. Deployment & CI/CD Operations
- Build and deploy systems safely and incrementally
- Manage:
  - Kubernetes deployments
  - CI/CD pipelines
  - Production rollouts
- Support deployment improvements across services
3. Automation & Infrastructure Engineering
- Build automation tools to reduce operational toil
- Implement Infrastructure-as-Code solutions
- Improve operational efficiency through scripting and automation
4. Monitoring & Observability
- Monitor infrastructure using:
  - Prometheus
  - Grafana
  - Loki
- Enhance:
  - Alerts
  - Monitoring systems
  - Automated incident handling
5. Incident Response & Troubleshooting
- Respond to infrastructure and platform incidents
- Create:
  - Incident response runbooks
  - Postmortem reports
- Perform:
  - Root cause analysis
  - Debugging
  - System troubleshooting
6. Scalability & Reliability Improvements
- Design fault-tolerant systems
- Improve:
  - System availability
  - Performance
  - Infrastructure scalability
7. Collaboration with Engineering Teams
- Work closely with product development teams
- Resolve infrastructure bottlenecks
- Implement platform-level improvements
8. OSS & Platform Engineering
- Study and troubleshoot open-source systems
- Work with distributed systems and platform components
Required Skills
DevOps & SRE Skills
- Site Reliability Engineering
- Infrastructure Automation
- CI/CD
- Infrastructure-as-Code
- Terraform
Cloud & Containerization
- Kubernetes
- GKE
- Docker
- Virtualization
Programming & Scripting
- Go
- Python
- Bash/Shell Scripting
Linux & System Administration
- Linux/UNIX Administration
- System Debugging
- Server Provisioning
Monitoring & Observability
- Prometheus
- Grafana
- Loki
- Alerting Systems
Databases & Distributed Systems
- PostgreSQL
- HBase
- Hadoop
- Elasticsearch
- ClickHouse
- Kafka
Troubleshooting Skills
- Production Debugging
- Infrastructure Troubleshooting
- Incident Analysis
- Performance Optimization
Experience Required
- Bachelor’s or Master’s degree in Computer Science/Engineering
- 5+ years of experience in:
  - SRE
  - DevOps
  - Infrastructure Operations
  - Large-scale Production Systems
Role Details
- Role: Senior Site Reliability Engineer (SRE)
- Industry: IT Services & Consulting
- Department: Engineering – Software & QA
- Employment Type: Full Time, Permanent
- Role Category: Quality Assurance and Testing
Key Skills
- Kubernetes
- Linux
- Automation
- Terraform
- CI/CD
- Python
- Go
- Shell Scripting
- Networking
- Debugging
- Prometheus
- Grafana
- Docker
- Kafka
- PostgreSQL
- Analytics
- Infrastructure Engineering