Evaluating Missing Value Imputation Methods for Food Composition Databases
G. Ispirova, T. Eftimov, B. Koroušić Seljak
Food and Chemical Toxicology, 2020
Missing data are a common problem in most research fields and introduce an element of ambiguity into data analysis. They can arise for different reasons: mishandling of samples, measurement error, deletion of aberrant values, or simply lack of analysis. The nutrition domain is no exception. This paper addresses the problem of missing data in food composition databases (FCDBs). Missing values leave FCDBs incomplete, which limits their use, because a dietary assessment can be performed only on a complete dataset. Most often, this problem is resolved by calculating means or medians from existing data in the same database or by borrowing data from other FCDBs. These solutions introduce significant error. We focus on missing data imputation techniques that substitute missing values with statistical predictions: Non-Negative Matrix Factorization (NMF), Multiple Imputations by Chained Equations (MICE), Nonparametric Missing Value Imputation using Random Forest (MissForest), and K-Nearest Neighbors (KNN), and compare them with the commonly used approaches: fill-in with the mean and fill-in with the median. The data come from national FCDBs collected by EuroFIR (European Food Information Resource Network). The results show that the state-of-the-art imputation methods yield better results than the traditional approaches.
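To make the contrast between the traditional fill-ins and a prediction-based method concrete, the following is a minimal Python sketch, not the authors' implementation. The table `ROWS` and its values are invented for illustration and are not EuroFIR data; `fill_with` and `knn_impute` are hypothetical helper names, and the KNN variant is a simplified one that uses unscaled Euclidean distance over the columns two rows share.

```python
import math
import statistics

# Toy table: each row is a food, each column a nutrient value.
# None marks a missing value. All numbers are invented for illustration.
ROWS = [
    [52.0, 0.3, 13.8],
    [57.0, 0.4, 15.2],
    [89.0, 1.1, 22.8],
    [160.0, 2.0, 8.5],
    [47.0, None, 11.8],
]


def fill_with(rows, stat):
    """Replace each None with a column statistic (e.g. mean or median)
    computed from the observed values in that column."""
    n_cols = len(rows[0])
    fills = [
        stat([r[j] for r in rows if r[j] is not None]) for j in range(n_cols)
    ]
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=2):
    """Replace each None with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over the columns that are
    observed in both rows (a simplifying assumption; no feature scaling).
    """
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbour must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [
                    (a, b)
                    for a, b in zip(row, other)
                    if a is not None and b is not None
                ]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [val for _, val in candidates[:k]]
            if nearest:
                out[i][j] = statistics.mean(nearest)
    return out


mean_filled = fill_with(ROWS, statistics.mean)
median_filled = fill_with(ROWS, statistics.median)
knn_filled = knn_impute(ROWS, k=2)
```

In this toy table the mean and median fill-ins assign the same column-wide value regardless of which food is incomplete, whereas the KNN sketch borrows from the two rows most similar to it. A comparison along the lines of the paper would hide values that are actually known, impute them with each method, and measure the error against the hidden values.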