Good data quality requires machine learning – and vice versa

October 29, 2018

Good product data quality is more important than ever in today’s Fast-Moving Consumer Goods environment. Bad master data disrupts logistics processes and makes them more expensive. More importantly, sales in e-commerce decline when product data is incomplete or poor, because product data is now decisive for the consumer’s decision to buy. Since Regulation (EU) No 1169/2011 came into force, foods without complete and correct nutrition information may no longer be sold online.

Unfortunately, data quality can hardly be guaranteed by static or individually defined rules. There are far too many different products with different attributes and value ranges. Nobody can specify by hand, for every product, how heavy a pallet of toilet paper should be or how many calories a chocolate bar can contain. In addition, physically checking the data against the product would be expensive.

This is where modern machine learning methods from the field of artificial intelligence can help. These methods learn from existing data. The more data is used to “train” a method, the better it can distinguish correct from incorrect values. If, for example, the correct nutritional values of 50 different chocolate bars are used to learn a plausible calorie value, the method is more accurate than if only 5 products were available for learning.
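To make the idea of learning from existing data concrete, here is a minimal sketch. It is not the method used in any particular product: it assumes a very simple approach in which a plausible calorie range is learned as the mean plus or minus three standard deviations of known-good values. The function names and the kcal figures are invented for illustration only.

```python
from statistics import mean, stdev


def learn_range(values, k=3.0):
    """Learn a plausible interval [mean - k*sd, mean + k*sd] from reference values."""
    m, sd = mean(values), stdev(values)
    return m - k * sd, m + k * sd


def is_plausible(value, bounds):
    """Return True if the value lies inside the learned interval."""
    low, high = bounds
    return low <= value <= high


# Invented kcal per 100 g values for chocolate bars (illustration only).
chocolate_kcal = [530, 545, 510, 560, 535, 525, 550, 540, 515, 555]

bounds = learn_range(chocolate_kcal)

print(is_plausible(538, bounds))  # a typical value
print(is_plausible(53, bounds))   # possibly a misplaced decimal point
```

With only a handful of reference products, the estimated mean and spread are less stable, which is one way to read the point about 50 versus 5 chocolate bars.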

Machine learning techniques can detect anomalies and outliers within a product category. But what if the data used for learning is wrong? For instance, if you train a method to recognize correct nutritional information for milk, but the training data also includes products such as banana milk or soy milk, it becomes hardly possible to recognize false sugar values, because the sugar content of these products differs from that of plain milk and widens the range the method treats as normal.
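The effect of mixed training data can be sketched in a few lines. This is an illustration under stated assumptions, not a description of a real system: the plausible range is again assumed to be the mean plus or minus three standard deviations, and all sugar values (grams per 100 ml) are invented.

```python
from statistics import mean, stdev


def learn_range(values, k=3.0):
    """Learn a plausible interval [mean - k*sd, mean + k*sd] from reference values."""
    m, sd = mean(values), stdev(values)
    return m - k * sd, m + k * sd


# Invented sugar values in g per 100 ml (illustration only).
plain_milk = [4.7, 4.8, 4.9, 4.8, 5.0, 4.7, 4.9, 4.8]
# The same category polluted with banana milk and soy milk products.
mixed = plain_milk + [10.5, 9.8, 11.0, 0.8, 2.5, 0.5]

suspect = 9.0  # an implausible sugar value for plain milk

clean_low, clean_high = learn_range(plain_milk)
mixed_low, mixed_high = learn_range(mixed)

flagged_clean = not (clean_low <= suspect <= clean_high)
flagged_mixed = not (mixed_low <= suspect <= mixed_high)

print("flagged with clean training data:", flagged_clean)
print("flagged with mixed training data:", flagged_mixed)
```

Trained on plain milk only, the learned range is narrow and the suspect value is flagged. Trained on the mixed data, the range becomes so wide that the same value passes unnoticed.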

Machine learning methods help to distinguish between correct and incorrect values for any product category, but only if the data used to train them is available in sufficient quantity and is already of high quality.