Job Description
Roles & Responsibilities
- Design, architect, and build Big Data platforms (Data Lake, Data Warehouse, Lakehouse) using Databricks integrated with AWS cloud services.
- Develop and support Data Engineering (ETL/ELT) and Machine Learning (ML) solutions using Python, Spark, Scala, or R.
- Build and optimize distributed Spark workloads, ensuring performance and scalability.
- Implement batch and streaming pipelines using Databricks Jobs, Delta Live Tables (DLT), and Spark Streaming (see the sketch after this list).
- Design and maintain data models, databases, and tables across multiple subject areas.
- Build, test, and maintain medium- to large-scale data pipelines from multiple source systems.
- Implement data quality checks, validations, and reusable pipeline frameworks.
- Use Infrastructure as Code (IaC) and CI/CD pipelines to automate deployment of data platforms.
- Collaborate on architecture design, documentation, and best practices.
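To give a concrete flavour of the streaming and data-quality responsibilities above, here is a minimal PySpark Structured Streaming sketch. It assumes a Databricks (or Delta-enabled Spark) cluster; the S3 paths, schema, and column names are illustrative placeholders, not part of this role's actual systems.

```python
# Minimal sketch: ingest JSON files from a landing zone, apply a basic
# data-quality filter, and append the clean records to a Delta table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("orders-ingest").getOrCreate()

# Hypothetical source schema
schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("amount", DoubleType(), True),
    StructField("event_time", TimestampType(), True),
])

raw = (
    spark.readStream
    .schema(schema)
    .json("s3://example-bucket/landing/orders/")   # placeholder landing path
)

# Simple quality check: drop records missing a key or with negative amounts
clean = raw.filter(col("order_id").isNotNull() & (col("amount") >= 0))

query = (
    clean.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/orders/")
    .outputMode("append")
    .start("s3://example-bucket/bronze/orders/")
)
query.awaitTermination()
```

In a production pipeline the same read/filter/write pattern would typically be wrapped in a reusable framework (parameterised paths, shared validation rules) and scheduled via Databricks Jobs or DLT.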
Preferred Candidate Profile
Big Data & Databricks
- Strong experience with Databricks, Spark, Hadoop, EMR, Hortonworks.
- Hands-on expertise with Databricks components (see the short Unity Catalog / MLflow sketch after this list):
  - Notebooks, Jobs, DLT
  - Interactive & Job Clusters
  - SQL Warehouses
  - Unity Catalog, MLflow
  - DBFS, Secrets, Policies
  - Hive & Glue Metastore
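As a small illustration of two of these components, the sketch below reads a Unity Catalog table through its three-level name and logs a toy MLflow run. The catalog, schema, and table names are assumptions for the example only.

```python
# Read a Unity Catalog table (catalog.schema.table) and log a simple MLflow run.
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("main.sales.orders")   # hypothetical Unity Catalog table

with mlflow.start_run(run_name="orders-profile"):
    mlflow.log_param("source_table", "main.sales.orders")
    mlflow.log_metric("row_count", df.count())
```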
Programming & Querying
- Strong proficiency in:
  - Python
  - PySpark / Spark SQL
  - SQL
  - Hive, Presto
  - Spark Streaming
AWS Cloud Services
- Experience with:
  - S3, EC2, VPC, IAM
  - Lambda, API Gateway
  - Glue, Redshift, Redshift Spectrum
  - Athena, Kinesis
  - Cognito, ALB
DevOps, CI/CD & Automation
- Source control: Git, Bitbucket, AWS CodeCommit
- CI/CD tools: Jenkins, GitHub Actions, AWS CodeBuild & CodeDeploy
- Infrastructure automation using Terraform and the Databricks APIs (a minimal API sketch follows this list)
- Experience building MLOps pipelines
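The sketch below shows one way such automation can look from Python, creating a job through the Databricks Jobs REST API (2.1). The workspace URL, token, notebook path, node type, and Spark runtime version are placeholders; in a real setup they would come from CI/CD variables, a secret store, or Terraform rather than being hard-coded.

```python
# Create a Databricks job via the Jobs REST API (2.1) using plain requests.
import os
import requests

host = os.environ["DATABRICKS_HOST"]     # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]   # personal access token from a secret store

job_spec = {
    "name": "nightly-orders-etl",                                   # placeholder job name
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/data/etl/ingest_orders"},  # placeholder
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",                # placeholder runtime
                "node_type_id": "m5.xlarge",                        # placeholder node type
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
    timeout=30,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```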
Documentation & Delivery
- Create:
  - Architecture & design documents
  - Low-level designs (LLD)
  - Test cases & traceability matrix
- Build reference architectures, demos, and how-to guides
- Willingness to pursue cloud and Databricks certifications
Education
- B.Tech / B.E. in any specialization
Key Skills
PySpark, Databricks, Spark, SQL, AWS, Data Lake, Data Warehousing, ETL/ELT, Machine Learning, CI/CD, Terraform, Python