2021 FDA Science Forum
A Programmatic Approach to Parsing Ingredient Lists from Consumer Packaged Goods for Effective Data Analysis
- Authors:
- Center:
-
Contributing OfficeCenter for Food Safety and Applied Nutrition
Abstract
INTRODUCTION:
Advancing technology to increase usability of ingredient list information from consumer packaged goods is of interest to both the government and private sector. Basic parsing practices do not account for valuable meta-data associated with ingredient lists. Enhancing parsing technology allows full-text ingredient labels to be divided into individual ingredient terms, while eliminating extraneous noise, preserving inter-ingredient relationships, and capturing ingredient predominance in each product.
METHODS:
Food label data was queried from products in FDA’s structured food database (FoodTrak), which includes data from four sources (Label Insight, Mintel Global New Products Database, Gladson Nutrition Database, and NuVal). Products were limited to seven grocery store-aisle-based food categories and excluded if they had no barcode identifier or ingredient list. Iterative steps were designed in structured query language (SQL) to parse full-text ingredient labels while maintaining their relational hierarchy. These queries transformed ingredient lists containing separators such as commas, colons, conjunctions, or parenthetical sub-ingredient listings into exclusively comma separated lists for parsing in a layered approach. Three tables of specific terms– the REMOVE registry, CONVERT registry, and EXEMPT registry- were manually created during logic development to update queries and mitigate terms that did not adhere to the comma-based parsing structure.
RESULTS:
205,176 product records qualified for parsing. Products were categorized as bread (16%), condiments (10%), cookies (18%), crackers (8%), frozen meals (18%), ice cream (19%), and soup (11%). Parsing resulted in 86,808 unique ingredient terms, with an average 17,954 ingredient terms per category.
DISCUSSION/CONCLUSION:
The resulting parsing technology enabled users to quickly query products containing a specific ingredient, identify related ingredient terms, and view contextual descriptions of the ingredient on the label. Applying this parsing technology has many applications in big data: one use case is presented in the context of allergen research to identify products containing the top eight allergens described in the Food Allergen Labeling and Consumer Protection Act (FALCPA). Future goals include utilizing the resulting unique list of parsed terms to build a comprehensive food label ingredient thesaurus from synonymous terms, find connections between co-occurring ingredient terms, and contribute towards effective post-market ingredient analyses while supporting other FDA databases.