Job Description
Databricks helps data teams solve some of the world’s biggest problems—such as building new transportation systems and speeding up medical research. We do this by creating one of the best data and AI platforms in the world.
Founded by engineers and driven by customer needs, Databricks works on challenging technical problems every day—from building modern user interfaces to running large-scale infrastructure across millions of virtual machines. We are growing fast and continuously improving our platform.
About the Role
We are looking for an experienced Senior Platform Monitoring Engineer to join our Platform Monitoring Team. This is a critical role focused on platform reliability, monitoring, incident response, and customer experience.
You will act as a key first responder when platform issues occur, investigate problems deeply, improve monitoring systems, and help prevent future incidents. Your work will directly impact how reliable and stable the Databricks platform is for customers.
Key Responsibilities
Incident Management & Reliability
- Act as a lead responder during platform incidents to reduce customer impact
- Coordinate with multiple engineering and infrastructure teams to quickly identify and resolve issues
- Ensure problems are detected early and handled efficiently
Root Cause Analysis
- Perform detailed post-incident investigations
- Identify the underlying root cause of failures across infrastructure, services, and cloud platforms
- Look for recurring patterns and propose long-term fixes to avoid repeat incidents
Monitoring & Observability
- Design and improve monitoring, alerting, and observability systems
- Build customer-focused alerting pipelines to detect issues faster
- Correlate metrics, logs, and traces to improve system visibility
- Reduce mean time to detect (MTTD) and mean time to resolve (MTTR)
Automation & System Improvements
- Develop automation tools to reduce manual work and improve reliability
- Create reusable monitoring patterns and best practices
- Continuously improve platform stability and customer experience
Required Skills & Experience
- 5+ years of experience in one of the following roles:
  - Site Reliability Engineer (SRE)
  - DevOps Engineer
  - Production Engineer
  - A similar role
- Strong experience working in production environments
- Hands-on experience with at least one cloud provider:
  - AWS
  - Azure
  - Google Cloud Platform (GCP)
- Experience with containers and orchestration tools:
  - Docker
  - Kubernetes
- Strong knowledge of monitoring and alerting tools, such as:
  - Prometheus
  - Grafana
  - ELK stack
  - PagerDuty
- Ability to design monitoring systems using metrics, logs, and traces
- Strong programming skills in Python (or similar languages)
- Experience building automation tools used in real production systems
- Deep understanding of the full incident lifecycle:
  - Detection
  - Mitigation
  - Resolution
  - Post-incident review
Education
- Bachelor’s degree, Master’s degree, or PhD in:
  - Computer Science
  - Computer Engineering
  - A related engineering field
Key Skills
Platform Monitoring, Root Cause Analysis, Incident Management, Python, Automation Tools, Cloud Platforms (AWS/Azure/GCP), Kubernetes, Docker, Observability, Customer Experience, Computer Science