Food Data Normalization Using Lexical and Semantic Similarities Heuristics
G. Ispirova, G. Popovski, E. Valenčič, N. Hadzi-Kotarova, T. Eftimov, B. Koroušić Seljak
, 2021,
Food is one of the main health and environmental factors in today’s society. With modernization the food supply is expanding and food-related data is increasing. This type of data comes in many different forms and making it inter-operable is one of the main requirements for using in any kind of analyses. One step towards this goal is data normalization of data coming from different sources. Food-related is collected regarding various aspects – food composition, food consumption, recipe data, etc. The most commonly encountered form is food data related to food products, which in order to serve its purpose – sales and profits, is often distorted and manipulated for marketing plans of producers and retailers. This causes the data to be often misinterpreted. There exist some studies addressing the problem of heterogeneous data by data normalization based on lexical similarity of the food products’ English names. We took this task a step further by considering data in non-English, low-resourced language – Slovenian. Working with such languages is challenging, as they have very limited resources and tools for Natural Language Processing (NLP). In our previously published work we considered different heuristics for matching food products: one based on lexical similarity, and two semantic similarity heuristics, i.e. based on word vector representations (embeddings). These data normalization approaches are evaluated once on a data set with 439 ground truth pairs of food products, obtained by matching their EAN barcodes. In this work, we extend this approach by introducing a new semantic similarity heuristic, based on sentence vector embeddings. Additionally, we extend the evaluation by taking real-world examples and tasking a subject-matter expert to rate the relevance of the top three matches for each example. The results show that using semantic similarity with the sentence embedding method yields best results, achieving 88% accuracy for the ground truth data set and 91% accuracy from the human expert evaluation, while the lexical similarity heuristic provides comparing results with 75% and 85% accuracy.
BIBTEX copied to Clipboard