In an ideal world, you can join on candidate keys, names are never misspelt, people never move house, and I have a pony. I don’t have a pony. But we do have record linkage.
Record linkage (also known as data linkage or fuzzy record matching) is a naive Bayes algorithm which matches data about individuals across databases or within a single database, by constructing the probabilities that two records apply to the same individual or to different individuals.
In this talk I will discuss the techniques of record linkage in Python, the usefulness and the limitations of linkage, and the effect that this technique is having on healthcare research in particular.
Healthcare/epidemiology studies often require data from more than one source, and individuals frequently have multiple interactions with a data set without a unique identifier. The outcomes of these studies are only as good as the record linkage which underlies them, so the ways in which record linkage is done can have a direct impact on our ability to understand, prevent and treat serious medical conditions.