Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Like families, tidy datasets are all alike but every messy dataset is messy in its own way. – Hadley Wickham
We often work with datasets where there is a unique identifier that can be used to link or filter the data, typically involving some type of a join.
However, sometimes the identifiers that we wish to use to join are near matches. For instance, maybe you are joining a dataset by names, where one dataset has John, and the dataset you are joining has john. Or maybe there are text entry errors and it is instead Jon.
What you need is a way to join data that is a little bit fuzzy. Enter (David Robinson)[https://github.com/dgrtwo/fuzzyjoin]’s excellent fuzzyjoin
package.
Introducing an example – schools in the Virgin Islands
There are 20 schools listed in the public school database for the National Center for Education Statistics in the Virgin Islands. I’ve downloaded the dataset and you can pull it from my github repo through the code below. A scrollable DT
table to explore below (disabling a lot of features just to see the table):
Code in R
library(tidyverse) library(fuzzyjoin) library(kableExtra) VI <- read_csv("https://github.com/drjohnrussell/drjohnrussell.github.io/raw/refs/heads/master/posts/2025-01-17-fuzzyjoin-in-action/data/VIschools.csv") VI |> kbl() |> kable_paper() |> ##nice theme kable_styling(bootstrap_options = c("striped", "hover")) |> scroll_box(height = "200px") ##adds scrolling
Alfredo Andrews Elementary School | 780000200002 | Saint Croix School District | RFD 1 KINGSHILL | VI | PK | 06 | 495 |
Claude O. Markoe Elementary School | 780000200006 | Saint Croix School District | PLOTS 71 75 MARS HILL | VI | PK | 06 | 403 |
Eulalie Rivera | 780000200011 | Saint Croix School District | ROUTE 1 GROVE PLACE | VI | PK | 08 | 645 |
St. Croix Educational Complex High School | 780000200013 | Saint Croix School District | RR2 KINGSHILL | VI | 09 | 12 | 893 |
Juanita Gardine | 780000200021 | Saint Croix School District | ESTATE RICHMOND | VI | PK | 08 | 305 |
Lew Muckle Elementary School | 780000200023 | Saint Croix School District | 317 SION FARM | VI | PK | 06 | 339 |
Pearl B. Larsen | 780000200028 | Saint Croix School District | ESTATE ST PETERS | VI | PK | 08 | 432 |
Ricardo Richards Elementary School | 780000200029 | Saint Croix School District | 491 BARREN SPOT | VI | PK | 06 | 392 |
St. Croix Central High School | 780000200030 | Saint Croix School District | RSD 2 KINGSHILL | VI | 09 | 12 | 717 |
John H. Woodson Junior High School | 780000200037 | Saint Croix School District | RURAL ROUTE 1 KINGSHILL | VI | 07 | 08 | 461 |
Charlotte Amalie High School | 780003000005 | Saint Thomas – Saint John School District | 8 and 9 ESTATE THOMAS | VI | 09 | 12 | 1076 |
Ivanna Eudora Kean High School | 780003000015 | Saint Thomas – Saint John School District | 1 and 2 ESTate NAZARETH | VI | 09 | 12 | 738 |
Jane E. Tuitt Elementary School | 780003000018 | Saint Thomas – Saint John School District | 19 LEVOKI STRAEDE | VI | KG | 04 | 154 |
Joseph Gomez Elementary School | 780003000019 | Saint Thomas – Saint John School District | 142 ANNAS RETREAT | VI | PK | 05 | 462 |
Joseph Sibilly Elementary School | 780003000020 | Saint Thomas – Saint John School District | 14 15 16 ESTATE ELIZABETH | VI | PK | 05 | 226 |
Julius E. Sprauve School | 780003000022 | Saint Thomas – Saint John School District | 14 18 ESTATE ENIGHED | VI | PK | 08 | 225 |
Lockhart Elementary School | 780003000024 | Saint Thomas – Saint John School District | 41 ESTATE THOMAS | VI | KG | 08 | 977 |
Ulla F. Muller Elementary School | 780003000026 | Saint Thomas – Saint John School District | 7B ESTATE CONTANT | VI | PK | 05 | 401 |
Yvonne E. Milliner-Bowsky Elementary School | 780003000027 | Saint Thomas – Saint John School District | 15B and 16 ESTATE MANDAHL | VI | PK | 05 | 433 |
Bertha C. Boschulte Middle School | 780003000034 | Saint Thomas – Saint John School District | 9 1 and 12A BOVONI | VI | 06 | 08 | 538 |
Let’s say we have a dataset with the following names of schools, and we want to pull in information from NCES.
Code in R
sample <- read_csv("https://github.com/drjohnrussell/drjohnrussell.github.io/raw/refs/heads/master/posts/2025-01-17-fuzzyjoin-in-action/data/sample.csv") sample |> kbl() |> kable_paper()
eulalie rivera | VI |
Joseph Elementary School | VI |
Alfredo Andrews Elementary School | NY |
Enter the fuzzyjoin
package, which allows the data to be messy in many ways, depending on what you need. Some of the ways to join outlined in the package are as follows:
- difference_join – joins that are numeric and within a specified distance
- geo_join – joins that use distances based on latitude and longitude
- regex_join – joins that look for common regex patterns (text and position)
- stringdist_join – joins that take into account small differences in the string
Let’s focus on stringdist_join
for special case uses.
Joins that ignore case
Joins that ignore case, where you don’t have to mutate using something like the stringr str_to_lower on the data to change it is a big win. We can use it to match eulalie rivera.
This is what would happen as is:
Code in R
sample[1,1] |> inner_join(VI,by=c("schoolname")) |> kbl() |> kable_paper()
On the other hand, we can use the stringdist_join
and set ignore_case
to equal TRUE.
Code in R
sample[1,1] |> stringdist_join(VI, by=c("schoolname"), max_dist=0, mode="inner", ignore_case=TRUE) |> kbl() |> kable_paper()
eulalie rivera | Eulalie Rivera | 780000200011 | Saint Croix School District | ROUTE 1 GROVE PLACE | VI | PK | 08 | 645 |
Joins that take advantage of string distance
The way that I’ve taken advantage of this is in dealing with typos or data where someone may have subtly different names for a school (e.g., one has the word school while the other drops it). What is nice about the fuzzyjoin
package is that you can do the join. What is less nice is that it requires you to do a line check afterwards, especially when you are loose with the distances.
Let’s look at the second row, Joseph Elementary School, and have a maximum distance
of 8.
Code in R
sample[2,1] |> stringdist_join(VI, by="schoolname", max_dist=8, mode="inner", ignore_case=TRUE, distance_col="stringdistance") |> kbl() |> kable_paper()
Joseph Elementary School | Joseph Gomez Elementary School | 7.80003e+11 | Saint Thomas – Saint John School District | 142 ANNAS RETREAT | VI | PK | 05 | 462 | 6 |
Joseph Elementary School | Joseph Sibilly Elementary School | 7.80003e+11 | Saint Thomas – Saint John School District | 14 15 16 ESTATE ELIZABETH | VI | PK | 05 | 226 | 8 |
Joseph Elementary School | Lockhart Elementary School | 7.80003e+11 | Saint Thomas – Saint John School District | 41 ESTATE THOMAS | VI | KG | 08 | 977 | 7 |
You can see the power, and the danger, of the fuzzyjoin
package here. It’s amazing that it picks up the two schools that also have Joseph in their name, but it also suggests that you could change a few letters in the beginning and form the name of another school.
When we used fuzzyjoin, we would arrange the data by the school name and then the distance_col variable so that we could easily do a line check and choose the best match (if available).
Matching off of multiple columns
As a side note, it is nice to use the fuzzyjoin
package for seeing errors in multiple columns. Here we can see how it works with the third row of the dataset.
Code in R
sample[3,] |> stringdist_join(VI, by=c("schoolname","state"), mode="inner", max_dist = 4, ignore_case=TRUE, distance_col="stringdistance") |> kbl() |> kable_paper()
Alfredo Andrews Elementary School | NY | Alfredo Andrews Elementary School | 780000200002 | Saint Croix School District | RFD 1 KINGSHILL | VI | PK | 06 | 495 | 0 | 2 | NA |
What is nice about this is that, in review, you can see easily through the distance_col variable where the changes were found.
Others have used fuzzyjoin
to great effect – I am inspired reading this vignette in particular on geo_joins