Job Description
We are seeking a Data Scientist / Data Engineer to support a large-scale engineering modernization initiative focused on transforming years of legacy engineering data into an intelligent, searchable platform.
The role involves handling messy, semi-structured historical datasets, building scalable ETL/data pipelines, and preparing the foundation for future AI/ML-driven similarity matching systems.
This is a hands-on data engineering role working with real-world data, with future exposure to AI and machine learning applications.
Project Overview
The project involves:
- Processing 2,100+ Excel files containing 10–15 years of engineering configurations
- Automating a currently manual engineering search process
- Building a scalable backend system to support an internal UI
- Enabling future AI-powered recommendation and similarity matching features
Engineers will eventually input parameters into an internal application, which will surface the most relevant historical configurations automatically.
Responsibilities
Data Engineering & ETL
- Design and maintain scalable ETL/data pipelines using Python and SQL
- Process large volumes of legacy engineering data
- Standardize, normalize, and clean inconsistent datasets
Data Processing
- Parse and transform Excel-based datasets
- Handle semi-structured and unstructured historical data
- Improve data quality and consistency across systems
Platform & Cloud
- Work with Azure data platforms and Databricks
- Build scalable processing workflows for large datasets
AI/ML Foundation
- Prepare datasets for future:
  - similarity matching
  - ML models
  - AI-powered search systems
- Support future integrations involving NLP, LangChain, and LangGraph
Requirements
- 3+ years of experience in Data Engineering or Data Science
- Strong Python programming skills
- Experience building ETL/data pipelines
- Strong SQL skills (PostgreSQL preferred)
- Experience working with:
  - Databricks
  - Azure data platforms
- Hands-on experience processing Excel datasets
- Strong experience with:
  - data cleaning
  - normalization
  - preprocessing
- Experience handling legacy or inconsistent datasets
Skills
- Python
- SQL
- PostgreSQL
- Data Engineering
- ETL Pipelines
- Data Cleaning & Normalization
- Azure Databricks
- Data Preprocessing
- Machine Learning Foundations
- NLP
- LangChain
- LangGraph
- Legacy Data Processing
- Excel Data Parsing