Range predictions for electric trucks through data standardization
The key to reliable and cost-efficient electric freight lies in accurate range modeling, also known as energy consumption modeling. In this blog post, we outline Einride's approach to collecting and processing electric truck data to implement range models that consistently achieve over 90% accuracy.
To reach international targets for limiting global warming, the transport industry needs to accelerate its transition to electric. To go electric at scale, you need to allow for a mixed-brand fleet setup, and you need accurate range predictions for all of those brands. The fact that different truck manufacturers use different data models makes this challenging. What if we could define a standard data format that we could map all truck data to, regardless of brand? Read on to learn how Einride is standardizing data to enable accurate and unbiased range models.
The data used when training range predictive models has a significant impact on transport planning
The importance of data standardization for range models
In order for machine learning models to excel, the data they are built on needs to accurately capture reality. At Einride, we run electric fleets consisting of mixed truck brands. This is a necessity to get as many electric trucks on the road as possible, with the target of reducing global emissions from road freight by 7%. It also means that we have to build predictive models that can accurately predict range for all brands. If model accuracy is worse for some brands, we will fail to assign the most suitable truck to a given transport flow.
However, given that every truck manufacturer has its own R&D department with its own developer perspective, the telematics data they send is naturally not aligned. For diesel trucks there is a standard for communicating in-vehicle data, called SAE J1939. This standard is being adapted for electric vehicles, but since the electric truck is still a new product on the market, with plenty of ongoing innovation, we see little adoption of it so far.
Let us use an example to better understand the problem of non-standardized data. One truck manufacturer might report energy consumption as an accumulated energy signal over time. If we want to know the energy consumed within a time period, we simply take the difference between the values at the start and end of the period, see Fig 1. Another truck manufacturer might only send snapshots of battery power, in which case the signal needs to be integrated over time or aggregated in some other way. The key point is that we ultimately want to model energy consumption, so that feature needs to exist in a predefined way for all data sources.
Fig 1. Example of how data for energy consumption can be mapped to a standardized format.
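To make the mapping in Fig 1 concrete, here is a minimal sketch of the two transformations in Python with pandas. The column names (`timestamp`, `accumulated_energy_kwh`, `battery_power_kw`) are hypothetical and not the actual signals we receive; the point is that both source formats end up as one and the same energy consumption feature.

```python
import pandas as pd


def energy_from_accumulated(df: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp) -> float:
    """Energy (kWh) consumed in [start, end] from an accumulated-energy signal."""
    window = df[(df["timestamp"] >= start) & (df["timestamp"] <= end)]
    return float(window["accumulated_energy_kwh"].iloc[-1] - window["accumulated_energy_kwh"].iloc[0])


def energy_from_power(df: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp) -> float:
    """Energy (kWh) consumed in [start, end] by integrating battery power snapshots (kW) over time."""
    window = df[(df["timestamp"] >= start) & (df["timestamp"] <= end)].sort_values("timestamp")
    hours = window["timestamp"].diff().dt.total_seconds() / 3600.0
    # Trapezoidal rule: average of consecutive power samples times the elapsed time.
    avg_power_kw = (window["battery_power_kw"] + window["battery_power_kw"].shift()) / 2.0
    return float((avg_power_kw * hours).sum())
```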
Standardizing the data should be done early. Since this data is used for many more applications than range predictive models, standardizing it once means we avoid the costs and risks associated with duplicated logic. We want all our stakeholders to reap the benefits of well-defined data with no need for manual pre-processing. We have learned that investing in data standardization quickly pays off. When we do it for truck telematics data, we not only unlock unbiased range models, but also automated truck activity reporting, CO2 emission tracking and battery degradation analytics. All of this combined enables us to find an optimal transport plan in a mixed fleet setup.
Now that we are all on board with the concept of data standardization, how does one get started when facing such a problem?
The process of standardizing data
Before getting started on building our data product, it is wise to take a step back and make sure that we are building the right thing. At Einride, we have had great outcomes with the following approach:
- Identify stakeholders and purpose of use
- Write basic requirements
- Identify a common data representation
- Write tests
- Start incremental development
- Monitor the quality
Let us scratch the surface and see what each step here entails.
1. Identify stakeholders and purpose of use
The key to successfully delivering any data product is to answer the fundamental questions:
What question are you trying to answer?
Who are your stakeholders?
Make sure to spend time here; you do not want to waste effort building a data product that cannot be used for its intended purpose. In the energy consumption case mentioned above, we also decided that the data would be used for vehicle activity and health reporting, as well as for customer reporting. These use cases have a direct impact on the data product development, as demonstrated in the following steps.
2. Write basic requirements
Once you have identified the main purpose of use for your data, you should move on to writing the basic requirements. The requirements we have found matter the most relate to:
- Resolution
- Update frequency
- Uptime
Since energy consumption is highly dynamic and changes with every press of the accelerator pedal, we require high-resolution data. There are also demands on robustness and uptime, given that the data is used for customer reporting. This means that we have to think twice before deciding that all data pipeline maintainers should go on Christmas vacation at the same time.
3. Identify a common data representation
Finding an appropriate model representation is about mapping what data is available to the data you need. For energy consumption reporting, we have about as many different input formats as we have connected vehicle manufacturers. Looking back at the example in Fig 1, we see how two completely different entities, power and accumulated energy, can be mapped to the same feature: energy consumption.
You also have to decide what to fall back to when new data has not been reported. Depending on what is safe for your use case, you can consider filling forward the previous sample or performing some form of interpolation.
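As a simplified sketch (assuming a pandas Series indexed by timestamp, and a hypothetical 10-second grid), such a fallback policy could look like the code below. Forward fill is usually the safer default for slowly changing signals, while interpolation can hide real blackouts.

```python
import pandas as pd


def resample_with_fallback(signal: pd.Series, freq: str = "10s", method: str = "ffill") -> pd.Series:
    """Resample a telematics signal (DatetimeIndex) to a fixed grid, filling periods with no new data."""
    grid = signal.resample(freq).mean()
    if method == "ffill":
        # Repeat the last reported sample until new data arrives.
        return grid.ffill()
    # Alternatively, interpolate between known samples.
    return grid.interpolate(method="time")
```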
4. Write tests
Defining your tests before starting implementation is a great way to discover flaws early. This is in accordance with test-driven development, which works well not only for software development but also for developing data pipelines.
To exemplify this, we identified the maximum amount of energy it is theoretically possible for an electric truck to consume. When we implemented this test, we saw many failing events, which made us realize how to properly clean the signal before feature extraction.
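In our stack these checks live as tests in the data pipeline, but the idea can be sketched as a simple assertion-based check. The bound below is a hypothetical placeholder, not our actual threshold:

```python
import pandas as pd

# Hypothetical physical bound: if a truck appears to consume more energy in an hour
# than its drivetrain could possibly draw, the signal is corrupted, not the truck.
MAX_HOURLY_ENERGY_KWH = 650.0


def check_hourly_energy_is_plausible(hourly_energy_kwh: pd.Series) -> None:
    """Raise if any hourly energy value falls outside the physically possible range."""
    violations = hourly_energy_kwh[
        (hourly_energy_kwh < 0) | (hourly_energy_kwh > MAX_HOURLY_ENERGY_KWH)
    ]
    assert violations.empty, f"{len(violations)} hourly samples outside physical limits"
```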
5. Start incremental development
Identify the minimal set of attributes needed and deliver only those, end to end, including tests. Make sure to get feedback from your stakeholders on the first iteration before continuing to the next. When we built our vehicle data pipeline, we started by reporting only distance driven and speed. This made us confident that our pipeline design worked before moving on to energy consumption reporting, which is much more complex.
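As an illustration (the field names here are hypothetical), the first iteration of such a standardized record contains only the bare minimum, with the more complex attributes added once the design has proven itself:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VehicleSnapshot:
    """Minimal first iteration of a standardized vehicle record."""
    vehicle_id: str
    timestamp: datetime
    distance_driven_km: float
    speed_kmh: float
    # Added in a later iteration, once the pipeline design had proven itself:
    # energy_consumption_kwh: float
```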
6. Monitor the quality
Data is a living product, and even though you might feel that you are finished, the truth is you never will be. There will likely be new data sources to integrate, new stakeholders coming in and, of course, new bugs in the data. Make sure to continuously monitor the quality of your data to keep it a value-creating product. We discover anomalies and deviations in the data constantly. One example is a data blackout caused by a vehicle software update after a workshop visit; if we do not catch these errors, we risk losing a lot of important data.
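In our setup this monitoring is handled by the tooling described later in this post, but as a rough sketch, a blackout check boils down to flagging reporting gaps that exceed some threshold (the 30-minute limit here is purely illustrative):

```python
import pandas as pd


def find_data_blackouts(timestamps: pd.Series, max_gap: pd.Timedelta = pd.Timedelta("30min")) -> pd.DataFrame:
    """Return reporting gaps longer than max_gap, e.g. after a vehicle software update."""
    ts = timestamps.sort_values()
    gaps = ts.diff()
    blackout = gaps > max_gap
    return pd.DataFrame({"gap_start": ts.shift()[blackout], "gap_end": ts[blackout], "duration": gaps[blackout]})
```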
Maintain data richness through modularization
We have now seen that data standardization is key to enabling range predictive models across truck manufacturers. We have also seen an approach to building a data pipeline that transforms all data into the same common format. So now we are done, right? Well, there is one big piece missing. A challenge with data transformation is that you risk losing data richness in the process. In its raw format, each signal contains a lot of information. Some of it is garbage and needs to be cleaned out, while other parts are highly informative and need to be kept. How do you keep only the valid data when you have multiple input formats to handle? Let us illustrate this challenge by presenting two alternatives for a pipeline setup, and finally the solution we have found works best.
Two different architectural approaches are displayed in Fig 2. Pipeline A has separate streams for each source, where cleaning, feature extraction and testing are performed individually. The upside is full flexibility, but the downside is that you will likely end up copy-pasting a lot of code that is similar or identical between the pipelines. A way to solve this issue is displayed in pipeline B, where all blocks of the pipeline are shared, but this means that you have to put all data into the same common format right from the start. Finding a suitable format before data cleaning is difficult; anyone who has ever attempted to take the derivative of a noisy signal can relate to this pain.
Fig 2. Two approaches to building a data pipeline.
We have decided to take a hybrid approach, in which we keep the flexibility of separate cleaning and feature extraction for the different sources, simply because the sources are in such different formats that you would lose too much information by aligning them too early. Instead, we create shared code components that can be reused in each pipeline, see Fig 3. This way, whenever we integrate a new data source, we can easily set up a new stream in the existing pipeline from a template and then cherry-pick code components to enhance the data according to the specific format sent. This strategy for balancing custom and standard pipelines has been a success for us.
Fig 3. A data pipeline strategy that is flexible and modular.
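In simplified Python terms, the hybrid setup in Fig 3 amounts to assembling each source's stream from a shared library of steps plus a few source-specific ones. The step and brand names below are hypothetical, but they illustrate the structure:

```python
from typing import Callable, Dict, List

import pandas as pd

# A pipeline step is any function from DataFrame to DataFrame.
Step = Callable[[pd.DataFrame], pd.DataFrame]


def drop_duplicate_timestamps(df: pd.DataFrame) -> pd.DataFrame:
    # Shared component: reused by every source.
    return df.drop_duplicates(subset="timestamp").sort_values("timestamp")


def integrate_power_to_energy(df: pd.DataFrame) -> pd.DataFrame:
    # Source-specific component: only needed for sources that report power snapshots.
    hours = df["timestamp"].diff().dt.total_seconds() / 3600.0
    return df.assign(energy_consumption_kwh=(df["battery_power_kw"] * hours).fillna(0).cumsum())


def run_pipeline(df: pd.DataFrame, steps: List[Step]) -> pd.DataFrame:
    for step in steps:
        df = step(df)
    return df


# Each source gets its own stream, cherry-picked from shared and custom components.
PIPELINES: Dict[str, List[Step]] = {
    "brand_a": [drop_duplicate_timestamps],
    "brand_b": [drop_duplicate_timestamps, integrate_power_to_energy],
}
```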
When the pipelining strategy is set, you need to orchestrate it, meaning that the pipeline execution is automated and produces a data output that is monitored. Here, you can take inspiration from Einride's toolchain for creating these data products, see Fig 4. Google Cloud Platform is our cloud provider and our data warehouse is in BigQuery. We have a data pipeline that pulls data from each source table, cleans it, extracts the attributes in the model and performs tests. This is all managed through dbt. The output of the dbt pipeline is a curated dataset made available to our users. To ensure that it stays useful, we monitor the data quality continuously using elementary. We see big benefits from how we have set up our tech stack, but we constantly assess new tools that could further enhance our capabilities. See our tech radar for more information.
Fig 4. The pipeline setup at Einride for creating standardized data out of multiple reporting source tables, using BigQuery, dbt and elementary.
With all these components in place we have a fully automated data pipeline that standardizes data from any truck manufacturer that we partner with. The data richness is maintained even though the underlying formats differ completely. Every time new data passes through the pipeline and the corresponding quality checks pass, it is made available for further modeling, reporting and ad-hoc analytics cases.
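For a consumer, this means that pulling standardized data is a single query against the curated dataset, regardless of which truck brand produced it. Here is a sketch using the BigQuery Python client, with a hypothetical project and table name:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and column names; consumers only ever see the standardized
# schema, never the brand-specific raw formats.
query = """
    SELECT vehicle_id, DATE(timestamp) AS day, SUM(energy_consumption_kwh) AS energy_kwh
    FROM `my-project.curated.vehicle_telematics`
    WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY vehicle_id, day
"""
daily_energy = client.query(query).to_dataframe()
```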
Summary and key takeaways
In this blog post, we have seen why data standardization is a necessity for accurate energy consumption modeling when running a mixed fleet of electric trucks. We have walked through the development process, starting with identifying the intended use and our stakeholders. They are the guiding stars when working through requirements, design, testing and monitoring. Furthermore, we showed what tooling can be used when implementing a pipeline, and how a modular approach enables maintainability without losing data richness.
Standardized data allows for unbiased, high-accuracy range predictive models and automated truck activity reporting. The data is made available to any consumer within the company, without requiring them to be domain experts in order to make use of it. For us at Einride, investing in this kind of work has quickly paid off.
With this data, we can make decisions on what kind of truck should be placed where to perform a certain task, with reduced emissions, maximized utilization and lifetime in mind. This is intelligent movement by Einride.