Job Description
This role focuses on Site Reliability Engineering (SRE), DevOps automation, cloud infrastructure management, incident response, and customer reliability operations for Arista’s Network Detection and Response (NDR) platform.
The candidate will:
- Handle deployments, upgrades, and incident management
- Automate manual operational tasks
- Improve system reliability and observability
- Work directly with customers to solve technical issues
- Build scalable infrastructure and automation solutions
This is a highly technical role combining:
- SRE
- DevOps
- Cloud Infrastructure
- Automation Engineering
- Production Support
Key Responsibilities
1. Customer Reliability & Operations
- Support mission-critical customer infrastructure
- Handle:
  - Manual deployments
  - Upgrades
  - Incident response
  - On-call escalations
- Support air-gapped/on-prem customer environments
2. Incident Management & Troubleshooting
- Lead production incident handling
- Perform:
  - Root cause analysis
  - System debugging
  - Performance troubleshooting
- Stabilize critical production issues quickly
3. Automation & Tooling
- Automate repetitive operational tasks
- Develop:
  - Internal tools
  - Automation scripts
  - Declarative infrastructure
- Reduce manual operational workload
4. Cloud Infrastructure Management
Work with AWS services such as:
- EC2
- VPC
- IAM
- S3
Responsibilities include:
- Infrastructure automation
- CI/CD implementation
- Secure environment deployment
5. Monitoring & Observability
- Build and manage observability systems
- Handle:
  - Metrics
  - Logs
  - Traces
- Define and monitor SLOs
- Debug production systems using telemetry data
6. Software Development & Scripting
Develop automation using:
- Python
- Go
- Bash/Shell scripting
Work with diverse technology stacks including:
- Python
- Scala
- C
- C++
- Rust
- Haskell
- PureScript
7. Customer-Facing Support
- Work directly with customers
- Resolve complex technical problems
- Communicate clearly during incidents and deployments
8. Collaboration & Engineering Improvements
- Work with:
  - Product Engineering
  - Platform Teams
  - Internal Tooling Teams
- Convert operational challenges into long-term automated solutions
Required Skills
DevOps & SRE Skills
- Site Reliability Engineering (SRE)
- DevOps
- CI/CD Pipelines
- Infrastructure Automation
- Terraform
Cloud & Infrastructure
- AWS
- Linux Administration
- Networking Fundamentals:
  - DNS
  - TCP/IP
  - Routing
Monitoring & Observability
- Metrics & Logging
- Tracing
- Telemetry Pipelines
- Incident Monitoring
- SLO Management
Programming & Automation
- Python
- Go
- Bash/Shell Scripting
- Automation Development
Troubleshooting & Debugging
- System Debugging
- Networking Debugging
- Log Analysis
- Root Cause Analysis
Soft Skills
- Customer Communication
- Incident Leadership
- Problem Solving
- Collaboration
- Operational Ownership
Experience Required
- 3+ years in:
  - Site Reliability Engineering
  - DevOps
  - Infrastructure Operations
  - Production Support
Role Details
- Role: Customer Reliability Engineer (CRE)
- Industry: IT Services & Consulting
- Department: Engineering – Software & QA
- Employment Type: Full Time, Permanent
- Role Category: Software Development
Education
- Any Graduate
- Any Postgraduate
Key Skills
- Linux
- AWS
- Terraform
- CI/CD
- Python
- Shell Scripting
- Monitoring
- Troubleshooting
- Automation
- Networking
- DNS
- DevOps
- SRE
- Incident Response
- C++
- Observability