Senior Site Reliability Engineer (SRE)

May 29, 2026

Job Description

This role focuses on managing and scaling CloudVision’s global cloud infrastructure, which is deployed on Kubernetes, with a strong emphasis on:

  • Reliability
  • Scalability
  • Automation
  • Observability
  • Security
  • Incident Response

The Senior SRE will manage large-scale distributed systems and production infrastructure while improving operational efficiency through automation and infrastructure engineering.

Technology Stack

Cloud & Infrastructure

  • Kubernetes
  • GKE (Google Kubernetes Engine)
  • Docker
  • Virtualization Technologies

CI/CD & DevOps

  • Spinnaker
  • GitLab
  • Terraform
  • Infrastructure-as-Code (IaC)

Databases & Storage

  • HBase
  • Hadoop
  • PostgreSQL
  • ClickHouse
  • Elasticsearch

Streaming & Analytics

  • Kafka
  • Real-time Stream Processing
  • TensorFlow

Monitoring & Observability

  • Prometheus
  • Grafana
  • Loki

Key Responsibilities

1. Production Infrastructure Management

  • Operate and manage global CloudVision services
  • Ensure:
    • High availability
    • Reliability
    • Performance
    • Scalability
    • Security

2. Deployment & CI/CD Operations

  • Build and deploy systems safely and incrementally
  • Manage:
    • Kubernetes deployments
    • CI/CD pipelines
    • Production rollouts
  • Support deployment improvements across services

3. Automation & Infrastructure Engineering

  • Build automation tools to reduce operational toil
  • Implement Infrastructure-as-Code solutions
  • Improve operational efficiency through scripting and automation

4. Monitoring & Observability

  • Monitor infrastructure using:
    • Prometheus
    • Grafana
    • Loki
  • Enhance:
    • Alerts
    • Monitoring systems
    • Automated incident handling

5. Incident Response & Troubleshooting

  • Respond to infrastructure and platform incidents
  • Create:
    • Incident response runbooks
    • Postmortem reports
  • Perform:
    • Root cause analysis
    • Debugging
    • System troubleshooting

6. Scalability & Reliability Improvements

  • Design fault-tolerant systems
  • Improve:
    • System availability
    • Performance
    • Infrastructure scalability

7. Collaboration with Engineering Teams

  • Work closely with product development teams
  • Resolve infrastructure bottlenecks
  • Implement platform-level improvements

8. OSS & Platform Engineering

  • Study and troubleshoot open-source systems
  • Work with distributed systems and platform components

Required Skills

DevOps & SRE Skills

  • Site Reliability Engineering
  • Infrastructure Automation
  • CI/CD
  • Infrastructure-as-Code
  • Terraform

Cloud & Containerization

  • Kubernetes
  • GKE
  • Docker
  • Virtualization

Programming & Scripting

  • Go
  • Python
  • Bash/Shell Scripting

Linux & System Administration

  • Linux/UNIX Administration
  • System Debugging
  • Server Provisioning

Monitoring & Observability

  • Prometheus
  • Grafana
  • Loki
  • Alerting Systems

Databases & Distributed Systems

  • PostgreSQL
  • HBase
  • Hadoop
  • Elasticsearch
  • ClickHouse
  • Kafka

Troubleshooting Skills

  • Production Debugging
  • Infrastructure Troubleshooting
  • Incident Analysis
  • Performance Optimization

Experience Required

  • Bachelor’s or Master’s degree in Computer Science/Engineering
  • 5+ years of experience in:
    • SRE
    • DevOps
    • Infrastructure Operations
    • Large-scale Production Systems

Role Details

  • Role: Senior Site Reliability Engineer (SRE)
  • Industry: IT Services & Consulting
  • Department: Engineering – Software & QA
  • Employment Type: Full Time, Permanent
  • Role Category: Quality Assurance and Testing

Key Skills

  • Kubernetes
  • Linux
  • Automation
  • Terraform
  • CI/CD
  • Python
  • Go
  • Shell Scripting
  • Networking
  • Debugging
  • Prometheus
  • Grafana
  • Docker
  • Kafka
  • PostgreSQL
  • Analytics
  • Infrastructure Engineering