The talk starts at 12:15.
Speakers: Susie Jentoft (SSB) and Annabelle Redelmeier (NR)
Location: Norsk Regnesentral and Zoom.
Abstracts:
Susie Jentoft: The coding of data is a time-consuming task performed by most statistical bureaus. Manually converting text to classifications can provide good-quality data, but it requires both expert coders with good knowledge of the standards and many resource hours.
In this study, we provide a case study for coding of COICOP (Classification of Individual Consumption According to Purpose) groups for the Norwegian Household Budget Survey (HBS). Text containing the goods name from receipt data needs to be coded to COICOP groups under the new UN COICOP classification. Previous coding of goods from the Consumer Price Index division provided an initial training dataset after conversion to the new standard. We expanded this using data from declarations of imported goods, a goods/nutritional information database (Tradesolution) and mass manual food/non-food classifications. Different algorithms were tested, including random forest, logistic regression and support vector machines. Of those we tested, linear support vector machines provided the best predictive classification. A double-coding study showed that the algorithms are able to predict at an accuracy close to that of the manual coders. Future work continues on quality assurance of the training data, balancing the COICOP groups in the training dataset, implementing these algorithms in production and determining the training frequency.
Annabelle Redelmeier: Norway’s most sold food products can paint a picture of the nutrition trends of society as a whole. However, analyzing nutrition trends is only possible if we have complete nutritional information on the food products purchased each month. Although manual data labelling based on the product names is possible to some extent, the data set in this study consists of 58 500 food products in which 433 000 (67%) of the nutritional values are missing, which makes manual labelling a very time-consuming task. In this study, we do not assume that products have any known features and instead rely on the product names. We make the assumption that products with similar product names will have similar nutritional values.
We train independent random forest models to predict each nutritional value, where the features are entirely based on the attributes of similar products. We find these similar products by deriving the Jaccard similarity index on pairs of product names. Then, for pairs with a Jaccard similarity over a threshold, we use K-nearest neighbours on the known nutritional values to find the best match. We train and validate each model on a subset of data where the given nutritional value is known. Finally, we compare the results of the machine learning model with results from a simple substitution imputation model.
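As a rough illustration of the name-matching step described above, the following minimal sketch computes the Jaccard index between two product names and keeps the names that score over a threshold. It is not the speakers' implementation: splitting names into lower-cased word tokens, the threshold of 0.5, the function names and the example product names are all assumptions made for illustration.

```python
from __future__ import annotations


def tokens(name: str) -> set[str]:
    """Lower-case a product name and split it into a set of word tokens.

    Word-level tokens are an assumption; character n-grams would also work.
    """
    return set(name.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard index |A & B| / |A | B| of the two names' token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        # Two empty names share nothing that can be compared.
        return 0.0
    return len(ta & tb) / len(union)


def similar_products(
    target: str, names: list[str], threshold: float = 0.5
) -> list[tuple[str, float]]:
    """Return (name, score) pairs scoring over the threshold, best first."""
    scored = [(n, jaccard(target, n)) for n in names if n != target]
    kept = [(n, s) for n, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical product names, invented for this example.
catalogue = ["skimmed milk 1 l", "orange juice 1 l", "whole milk 0.5 l"]
matches = similar_products("whole milk 1 l", catalogue)
```

Here the two milk products share three of five distinct tokens with the target and pass the threshold, while the juice shares only two of six and is dropped. The retained candidates would then feed the nearest-neighbour and random forest steps described in the abstract.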
Join Zoom Meeting:
https://uio.zoom.us/j/63199595088?pwd=Y05GYkJwR1dCUHFhUnlGMFpsdGc3UT09
Meeting ID: 631 9959 5088
Passcode: 302551
Welcome!
Best regards,
Thea Roksvåg and Lars Henry Berge Olsen.