
Detection of Duplicate Records in Large-scale Multi-tenant Platforms

Description

Zalando's quest to create an open multi-tenant fashion platform has its challenges. One of these is the detection of duplicate product records within the platform, where different tenants enter the same product using different product descriptions. Commonly referred to as the Record Linkage problem in Machine Learning, the task is to group similar product records under a single canonical identifier, which is useful for business intelligence and for product search. The kernel of the solution is an all-pairs similarity join, which needs roughly O(n**2) comparisons, so the runtime grows quadratically with the size of the input. At Zalando's Fashion Insight Centre in Dublin we are looking at solutions to this problem that work at scale (i.e., more than one million products). For our particular problem, which involves Categorical Data (cosine similarity will not work here), we employ a data-driven similarity measure and approximate the similarity join using a two-step approach. In this talk we introduce the standard approaches to the problem and illustrate our work to date using Python.
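To make the idea of approximating an all-pairs join in two steps concrete, here is a minimal Python sketch. It is not the method presented in the talk, which the abstract does not spell out. It assumes one common pattern: step one groups records into blocks by a cheap key, and step two compares only the pairs inside each block. The toy records, the attribute names, the blocking key, the threshold, and the frequency-weighted similarity (where a match on a rare value counts for more than a match on a common one) are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical categorical attributes and toy records (not from the talk).
ATTRS = ("brand", "category", "colour", "material")
RECORDS = [
    {"id": 1, "brand": "acme", "category": "shoes", "colour": "black", "material": "leather"},
    {"id": 2, "brand": "acme", "category": "shoes", "colour": "black", "material": "suede"},
    {"id": 3, "brand": "acme", "category": "shoes", "colour": "white", "material": "canvas"},
    {"id": 4, "brand": "zeta", "category": "dress", "colour": "red", "material": "cotton"},
    {"id": 5, "brand": "zeta", "category": "dress", "colour": "red", "material": "cotton"},
]


def value_frequencies(records):
    """Relative frequency of every value, per attribute, learned from the data."""
    n = len(records)
    return {
        attr: {v: c / n for v, c in Counter(r[attr] for r in records).items()}
        for attr in ATTRS
    }


def similarity(r1, r2, freqs):
    """Assumed data-driven measure: matching values score 1 - f**2, mismatches 0."""
    total = 0.0
    for attr in ATTRS:
        if r1[attr] == r2[attr]:
            total += 1.0 - freqs[attr][r1[attr]] ** 2
    return total / len(ATTRS)


def candidate_pairs(records, block_key=("brand", "category")):
    """Step 1 (blocking): only records sharing the block key become candidates."""
    blocks = defaultdict(list)
    for r in records:
        blocks[tuple(r[attr] for attr in block_key)].append(r)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def find_duplicates(records, threshold=0.5):
    """Step 2: score candidate pairs and keep those at or above the threshold."""
    freqs = value_frequencies(records)
    return sorted(
        (r1["id"], r2["id"])
        for r1, r2 in candidate_pairs(records)
        if similarity(r1, r2, freqs) >= threshold
    )


if __name__ == "__main__":
    print(len(candidate_pairs(RECORDS)), "candidate pairs")
    print(find_duplicates(RECORDS))
```

On these five toy records, blocking leaves 4 candidate pairs to score instead of the 10 pairs an exhaustive join would compare. The trade-off is that true duplicates falling into different blocks are never compared, so the choice of blocking key matters.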
