Data Lake Optimization for an International Car Manufacturer R&D Department


The client is one of the largest automobile manufacturers in the world. We worked specifically with its technical center in Zaventem, home to its European Research & Development department, whose purpose is to strengthen vehicle research and development. We were first consulted to set up the R&D engineers’ application in the cloud using Infrastructure as Code. After completing this migration, we realized that the performance of the existing component querying the data could be improved. We offered our experience to advise the client on a better way to handle its data by optimizing its data lake.

Key challenges

The R&D department wanted to collect and store data from the R&D engineers’ fleet of moving cars for research purposes. 4G modems are installed in every car and send data from a multitude of sensors (such as speed, steering angle and pressure) to the head office every two minutes. This resulted in a massive amount of unknown and untreated data from a large set of cars driven in real-world conditions.

The main challenge was to clean this data and make it easy to query so the engineers could actually use it for their R&D use cases (e.g. decreasing CO2 emissions, reducing fuel consumption). The engineers would then use this data to run pre-analysis and compare it to the data produced by prototype cars with test engines.

In this context, the client was facing several issues:

Our client expected us to find a better way to improve the quality of the data: clean it, process it and store it properly in a central data platform dedicated to R&D analysis.

Our approach

We were responsible for the full cloud architecture of the solution, including its design and setup on AWS.

We created an initial architecture based on DataOps best practices, which we then adapted to meet our client’s requirements.

Our Data architects therefore performed the analysis, designed the architecture and implemented the project (including the data ingestion, the data lake, and the web administration).

The project was composed of several phases:

The data lake is organized into four zones (temporary, raw, secure and clean) where data flows from right to left. It is hosted on S3 and queried using Athena, balancing performance against cost.
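To make the zone layout more concrete, here is a minimal sketch of how objects in such a data lake could be keyed on S3. Only the four zone names come from the project. The promotion order, the `vehicle_id` and date partition names, the Parquet file name and the helper functions `object_key` and `next_zone` are all assumptions made for illustration, not the layout actually deployed.

```python
from datetime import datetime, timezone

# Zone names come from the case study; treating this tuple as the
# promotion order is an assumption made for this sketch.
ZONES = ("temporary", "raw", "secure", "clean")


def object_key(zone, vehicle_id, ts, filename):
    """Build a Hive-style partitioned S3 key (hypothetical layout)."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (
        f"{zone}/vehicle_id={vehicle_id}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}"
    )


def next_zone(zone):
    """Return the zone a file would be promoted to, or None for the last zone."""
    i = ZONES.index(zone)
    return ZONES[i + 1] if i + 1 < len(ZONES) else None


# Hypothetical vehicle identifier, timestamp and file name.
ts = datetime(2021, 3, 14, 9, 30, tzinfo=timezone.utc)
key = object_key("raw", "car-042", ts, "part-0000.parquet")
print(key)  # raw/vehicle_id=car-042/year=2021/month=03/day=14/part-0000.parquet
```

Athena recognizes `key=value` path segments as Hive-style partitions and bills by the amount of data scanned, so a layout along these lines lets queries that filter on vehicle or date read only the matching prefixes.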

Benefits

By combining our expertise in Cloud and Data, we met the client’s expectations and delivered an optimized data lake architecture. On top of this, we provided a solution that is:

Our infrastructure approach also applies DataOps practices such as CI/CD and Infrastructure as Code (IaC), which were not standard for the client at the time.

Technologies & Partners

AWS (Athena, Batch, CloudFormation, EC2, ECR, ECS), Airflow, Python, Dask