Customer Reliability Engineer (CRE)

Job Description

This role focuses on Site Reliability Engineering (SRE), DevOps automation, cloud infrastructure management, incident response, and customer reliability operations for Arista’s Network Detection and Response (NDR) platform.

The candidate will:

  • Handle deployments, upgrades, and incident management
  • Automate manual operational tasks
  • Improve system reliability and observability
  • Work directly with customers to solve technical issues
  • Build scalable infrastructure and automation solutions

This is a highly technical role combining:

  • SRE
  • DevOps
  • Cloud Infrastructure
  • Automation Engineering
  • Production Support

Key Responsibilities

1. Customer Reliability & Operations

  • Support mission-critical customer infrastructure
  • Handle:
    • Manual deployments
    • Upgrades
    • Incident response
    • On-call escalations
  • Support air-gapped/on-prem customer environments

2. Incident Management & Troubleshooting

  • Lead production incident handling
  • Perform:
    • Root cause analysis
    • System debugging
    • Performance troubleshooting
  • Stabilize production systems quickly during critical incidents

3. Automation & Tooling

  • Automate repetitive operational tasks
  • Develop:
    • Internal tools
    • Automation scripts
    • Declarative infrastructure
  • Reduce manual operational workload

4. Cloud Infrastructure Management

Work with AWS services such as:

  • EC2
  • VPC
  • IAM
  • S3

Responsibilities include:

  • Infrastructure automation
  • CI/CD implementation
  • Secure environment deployment

5. Monitoring & Observability

  • Build and manage observability systems
  • Work with core telemetry signals:
    • Metrics
    • Logs
    • Traces
  • Define and monitor SLOs
  • Debug production systems using telemetry data

6. Software Development & Scripting

Develop automation using:

  • Python
  • Go
  • Bash/Shell scripting

Work across a diverse technology stack, including:

  • Python
  • Scala
  • C
  • C++
  • Rust
  • Haskell
  • PureScript

7. Customer-Facing Support

  • Work directly with customers
  • Resolve complex technical problems
  • Communicate clearly during incidents and deployments

8. Collaboration & Engineering Improvements

  • Work with:
    • Product Engineering
    • Platform Teams
    • Internal Tooling Teams
  • Convert operational challenges into long-term automated solutions

Required Skills

DevOps & SRE Skills

  • Site Reliability Engineering (SRE)
  • DevOps
  • CI/CD Pipelines
  • Infrastructure Automation
  • Terraform

Cloud & Infrastructure

  • AWS
  • Linux Administration
  • Networking Fundamentals:
    • DNS
    • TCP/IP
    • Routing

Monitoring & Observability

  • Metrics & Logging
  • Tracing
  • Telemetry Pipelines
  • Incident Monitoring
  • SLO Management

Programming & Automation

  • Python
  • Go
  • Bash/Shell Scripting
  • Automation Development

Troubleshooting & Debugging

  • System Debugging
  • Network Debugging
  • Log Analysis
  • Root Cause Analysis

Soft Skills

  • Customer Communication
  • Incident Leadership
  • Problem Solving
  • Collaboration
  • Operational Ownership

Experience Required

  • 3+ years in:
    • Site Reliability Engineering
    • DevOps
    • Infrastructure Operations
    • Production Support

Role Details

  • Role: Customer Reliability Engineer (CRE)
  • Industry: IT Services & Consulting
  • Department: Engineering – Software & QA
  • Employment Type: Full Time, Permanent
  • Role Category: Software Development

Education

  • Bachelor's degree in any discipline
  • Postgraduate degree in any discipline

Key Skills

  • Linux
  • AWS
  • Terraform
  • CI/CD
  • Python
  • Shell Scripting
  • Monitoring
  • Troubleshooting
  • Automation
  • Networking
  • DNS
  • DevOps
  • SRE
  • Incident Response
  • C++
  • Observability