Powering Autonomous Trucks with Fleet-scale Data Collection Pipelines

5 min read
Taigo M. Bonanni
Engineering Manager

This blog post describes how Einride Autonomous leverages a hybrid fleet of manually driven and autonomous trucks to build a massive data catalogue covering a wide variety of driving scenarios. It highlights how data forms the backbone of a safe and robust autonomous driving stack, and the challenges that need to be addressed along the way.


Autonomous trucks are revolutionizing the logistics and transportation sector with a promise of unbeatable efficiency and safety. This revolution is built on massive volumes of data, which provide the foundation for any successful autonomous driving system. Self-driving vehicles are expected to operate safely in complex scenarios, which requires a level of robustness and reliability that can be achieved only through extensive testing, a process that heavily leverages these data volumes to create and simulate scenarios and to validate performance across a wide variety of operational settings.

Data Diversity – Preparing for Every Scenario

Autonomous trucks must be primed for every possible scenario: from ideal conditions like sunny days to challenging ones like heavy rain, snow, dense fog, and complex urban, industrial, or rural environments. Handling unexpected and safety-critical scenarios, such as aggressive maneuvers from other drivers or the presence of vulnerable road users or objects on the road, is a core requirement autonomous vehicles are expected to fulfill. Preparing for every possible case is infeasible, given the practically infinite number of variables at play. However, the broader the range of scenarios an AV is exposed to during training, the better it can generalize and respond effectively to new, unforeseen situations. A fleet-scale data collection pipeline is the key to capturing this diversity, ensuring that our autonomous trucks can operate safely and efficiently in any environment.

Data Collection at a Glance

One of Einride’s strengths is the ability to collect data from both our manually driven fleet and our autonomous trucks. Our manual trucks operate in multiple countries across three continents, representing a potential of ~10 million kilometers’ worth of driving scenarios per year (based on 2024 total mileage). This massive volume of driving scenarios is the foundation on which we are building our long-term vision: a high-performance autonomous truck capable of seamlessly transitioning among diverse operational environments. Our autonomous trucks, in contrast, collect customer- and ODD-specific data; this more targeted volume allows us to improve our autonomous performance and fulfill our short-term development needs. Although based on different low-level hardware, our manual and autonomous trucks are equipped with the same multi-modal sensor setup, consisting of lidar, radar, cameras, GPS, and inertial sensors, to maximize data consistency and streamline data ingestion and preprocessing.
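
To make the idea of a shared multi-modal setup more concrete, here is a minimal sketch of what a unified per-frame log record could look like. The field names and types are illustrative assumptions, not Einride’s actual schema.

```python
# Hypothetical unified log record shared by manual and autonomous trucks.
# Field names and types are illustrative, not Einride's actual schema.
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SensorFrame:
    vehicle_id: str                       # same schema regardless of platform
    timestamp_ns: int                     # common clock to align modalities
    lidar_points: bytes = b""             # packed point cloud blob
    radar_detections: bytes = b""         # packed radar returns
    camera_images: Dict[str, bytes] = field(default_factory=dict)  # camera name -> JPEG
    gnss_fix: Tuple[float, float, float] = (0.0, 0.0, 0.0)         # lat, lon, alt
    imu_sample: Tuple[float, ...] = (0.0,) * 6                     # accel xyz + gyro xyz
```

Keeping one schema across both fleets is what lets ingestion and preprocessing stay identical downstream, independent of the low-level hardware that produced the data.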

Downstream Modules – How do we use the data?

Collecting data is just one piece of the puzzle. Before it is ready to be used, data needs to be ingested, meaning it is transferred to our cloud-based data lakes, and then preprocessed, which typically involves sanitizing it, transforming it into formats compatible with our downstream processes, and selecting the scenarios with high informative value. Once preprocessing is complete, the data is fed into our CI/CD and ML pipelines.
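
As a rough illustration of the preprocess-and-select step, the sketch below keeps only logs whose informative-value score clears a threshold. The helper functions, scoring, and record layout are placeholder assumptions for illustration, not the actual pipeline.

```python
# Minimal sketch of preprocessing and scenario selection; all scoring and
# record layouts here are placeholders, not the production pipeline.
from typing import Iterable


def informative_value(record: dict) -> float:
    # Placeholder scoring: a real score could weigh scenario rarity,
    # presence of vulnerable road users, weather tags, and so on.
    return min(1.0, record.get("num_events", 0) / 10.0)


def preprocess(raw: dict) -> dict:
    # Placeholder sanitization: drop frames without a valid timestamp
    # before converting to the downstream storage format.
    frames = [f for f in raw.get("frames", []) if f.get("timestamp_ns")]
    return {"frames": frames, "num_events": raw.get("num_events", 0)}


def select_scenarios(raw_logs: Iterable[dict], threshold: float = 0.5) -> list:
    """Keep only preprocessed logs whose informative value clears the bar."""
    kept = []
    for raw in raw_logs:
        record = preprocess(raw)
        if informative_value(record) >= threshold:
            kept.append(record)
    return kept
```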

CI/CD Pipeline

This pipeline supports feature development, verification and validation, data analysis, and simulation. We need large amounts of data to benchmark the performance of our autonomous vehicle in real scenarios, as well as to create synthetic scenarios covering the rarely occurring driving situations we want to evaluate. The more data we have, the better we can verify and validate our system and ensure we are deploying a safe autonomous vehicle.
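
A hypothetical example of how such verification can be wired into CI: replay a set of real and synthetic scenarios and fail the build if the aggregate result drops below a threshold. The scenario names, the simulate stub, and the threshold are illustrative assumptions, not Einride’s actual test suite.

```python
# Sketch of a CI regression gate over logged and synthetic scenarios.
# Scenario names, simulate(), and the threshold are illustrative only.
SCENARIOS = ["cut_in_heavy_rain", "pedestrian_crossing_fog", "highway_merge"]
MIN_SUCCESS_RATE = 0.99


def simulate(scenario: str) -> bool:
    # Placeholder: a real run would replay sensor data or a synthetic scene
    # through the driving stack and check for collisions or rule violations.
    return True


def test_regression_suite() -> None:
    results = [simulate(name) for name in SCENARIOS]
    success_rate = sum(results) / len(results)
    assert success_rate >= MIN_SUCCESS_RATE, f"success rate {success_rate:.2%}"
```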

ML Pipeline

Here, we leverage the processed data to train machine learning models, such as deep neural networks, that can perceive the environment, model the behavior of other road agents, and make driving decisions. The richness and diversity of our curated datasets enable models to achieve high accuracy and to enhance their robustness, ensuring reliable performance even in unexpected or challenging situations.
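
As a purely illustrative sketch, assuming PyTorch and a toy stand-in for a curated dataset, the snippet below shows the shape of such a training step. The model, features, and labels are placeholders rather than Einride’s actual perception stack.

```python
# Toy training loop on placeholder "curated" data; the model and data
# are stand-ins, not the production perception models.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(256, 64)            # stand-in for curated features
labels = torch.randint(0, 5, (256,))       # stand-in for object classes
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_features), batch_labels)
        loss.backward()
        optimizer.step()
```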

Challenges

While some of the steps discussed above may sound trivial, there are quite a few challenges to address in order to realize a streamlined data pipeline. To begin with, autonomous vehicles typically collect large amounts of data while operating, yet their onboard storage and data transfer capabilities are limited. Since this constrains the scalability of our system, we are continuously expanding the capabilities of our event-based logging system to ensure we focus only on the most relevant data.

Data quality is also extremely important. Ensuring the accuracy, consistency, and completeness of the data is crucial for downstream processes to function optimally, which we achieve through data integrity verification, anomaly detection algorithms, and more.

Computational demands are another challenge. Depending on the downstream consumers, data preprocessing may require powerful computing resources and optimized algorithms, especially when dealing with rich data such as point clouds. While in-vehicle filtering helps us reduce the overall data volume, we offload the more intensive tasks to our scalable cloud infrastructure, where we can leverage dedicated hardware accelerators. All of this highlights the need for scalable and efficient infrastructure for storage, processing, and analysis.
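
To illustrate the event-based logging idea mentioned above, here is a minimal sketch that buffers recent frames in the vehicle and flushes the window only when a trigger fires, in this case hard braking. The trigger condition, thresholds, and frame layout are assumptions for illustration, not the onboard logger.

```python
# Sketch of event-based logging: keep a rolling buffer and persist it only
# when something interesting happens. Thresholds and fields are illustrative.
from collections import deque
from typing import Optional


class EventLogger:
    def __init__(self, window_size: int = 100, decel_threshold: float = -4.0):
        self.buffer = deque(maxlen=window_size)   # rolling pre-event window
        self.decel_threshold = decel_threshold    # m/s^2, e.g. hard braking

    def on_frame(self, frame: dict) -> Optional[list]:
        self.buffer.append(frame)
        if frame.get("longitudinal_accel", 0.0) <= self.decel_threshold:
            # Flush the buffered window for upload; routine frames are dropped.
            return list(self.buffer)
        return None
```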

Looking ahead

Our data collection pipeline is designed to support current and future needs, with extensibility and scalability as core principles. Our vision is ambitious: to deliver a high-performance autonomous truck capable of safely driving across diverse operational environments, spanning industrial, rural, suburban, and urban settings alike, and our data pipeline is at its core.