Using fuzzyjoin to work with NCES data



[This article was first published on John Russell, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Like families, tidy datasets are all alike but every messy dataset is messy in its own way. – Hadley Wickham

We often work with datasets where there is a unique identifier that can be used to link or filter the data, typically involving some type of a join.

However, sometimes the identifiers that we wish to use to join are near matches. For instance, maybe you are joining a dataset by names, where one dataset has John, and the dataset you are joining has john. Or maybe there are text entry errors and it is instead Jon.

What you need is a way to join data that is a little bit fuzzy. Enter (David Robinson)[https://github.com/dgrtwo/fuzzyjoin]’s excellent fuzzyjoin package.

Introducing an example – schools in the Virgin Islands

There are 20 schools listed in the public school database for the National Center for Education Statistics in the Virgin Islands. I’ve downloaded the dataset and you can pull it from my github repo through the code below. A scrollable DT table to explore below (disabling a lot of features just to see the table):

Code in R
library(tidyverse)
library(fuzzyjoin)
library(kableExtra)
VI <- read_csv("https://github.com/drjohnrussell/drjohnrussell.github.io/raw/refs/heads/master/posts/2025-01-17-fuzzyjoin-in-action/data/VIschools.csv")

VI |> 
  kbl() |> 
  kable_paper() |> ##nice theme
  kable_styling(bootstrap_options = c("striped", "hover")) |> 
  scroll_box(height = "200px") ##adds scrolling
Public Schools of Virgin Islands
Alfredo Andrews Elementary School 780000200002 Saint Croix School District RFD 1 KINGSHILL VI PK 06 495
Claude O. Markoe Elementary School 780000200006 Saint Croix School District PLOTS 71 75 MARS HILL VI PK 06 403
Eulalie Rivera 780000200011 Saint Croix School District ROUTE 1 GROVE PLACE VI PK 08 645
St. Croix Educational Complex High School 780000200013 Saint Croix School District RR2 KINGSHILL VI 09 12 893
Juanita Gardine 780000200021 Saint Croix School District ESTATE RICHMOND VI PK 08 305
Lew Muckle Elementary School 780000200023 Saint Croix School District 317 SION FARM VI PK 06 339
Pearl B. Larsen 780000200028 Saint Croix School District ESTATE ST PETERS VI PK 08 432
Ricardo Richards Elementary School 780000200029 Saint Croix School District 491 BARREN SPOT VI PK 06 392
St. Croix Central High School 780000200030 Saint Croix School District RSD 2 KINGSHILL VI 09 12 717
John H. Woodson Junior High School 780000200037 Saint Croix School District RURAL ROUTE 1 KINGSHILL VI 07 08 461
Charlotte Amalie High School 780003000005 Saint Thomas – Saint John School District 8 and 9 ESTATE THOMAS VI 09 12 1076
Ivanna Eudora Kean High School 780003000015 Saint Thomas – Saint John School District 1 and 2 ESTate NAZARETH VI 09 12 738
Jane E. Tuitt Elementary School 780003000018 Saint Thomas – Saint John School District 19 LEVOKI STRAEDE VI KG 04 154
Joseph Gomez Elementary School 780003000019 Saint Thomas – Saint John School District 142 ANNAS RETREAT VI PK 05 462
Joseph Sibilly Elementary School 780003000020 Saint Thomas – Saint John School District 14 15 16 ESTATE ELIZABETH VI PK 05 226
Julius E. Sprauve School 780003000022 Saint Thomas – Saint John School District 14 18 ESTATE ENIGHED VI PK 08 225
Lockhart Elementary School 780003000024 Saint Thomas – Saint John School District 41 ESTATE THOMAS VI KG 08 977
Ulla F. Muller Elementary School 780003000026 Saint Thomas – Saint John School District 7B ESTATE CONTANT VI PK 05 401
Yvonne E. Milliner-Bowsky Elementary School 780003000027 Saint Thomas – Saint John School District 15B and 16 ESTATE MANDAHL VI PK 05 433
Bertha C. Boschulte Middle School 780003000034 Saint Thomas – Saint John School District 9 1 and 12A BOVONI VI 06 08 538

Let’s say we have a dataset with the following names of schools, and we want to pull in information from NCES.

Code in R
sample <- read_csv("https://github.com/drjohnrussell/drjohnrussell.github.io/raw/refs/heads/master/posts/2025-01-17-fuzzyjoin-in-action/data/sample.csv")

sample |> 
  kbl() |> 
  kable_paper()
Dataset to join
eulalie rivera VI
Joseph Elementary School VI
Alfredo Andrews Elementary School NY

Enter the fuzzyjoin package, which allows the data to be messy in many ways, depending on what you need. Some of the ways to join outlined in the package are as follows:

  • difference_join – joins that are numeric and within a specified distance
  • geo_join – joins that use distances based on latitude and longitude
  • regex_join – joins that look for common regex patterns (text and position)
  • stringdist_join – joins that take into account small differences in the string

Let’s focus on stringdist_join for special case uses.

Joins that ignore case

Joins that ignore case, where you don’t have to mutate using something like the stringr str_to_lower on the data to change it is a big win. We can use it to match eulalie rivera.

This is what would happen as is:

Code in R
sample[1,1] |> 
  inner_join(VI,by=c("schoolname")) |> 
  kbl() |> 
  kable_paper()

On the other hand, we can use the stringdist_join and set ignore_case to equal TRUE.

Code in R
sample[1,1] |> 
  stringdist_join(VI,
                  by=c("schoolname"),
                  max_dist=0,
                  mode="inner",
                  ignore_case=TRUE) |> 
  kbl() |> 
  kable_paper()
A successful join
eulalie rivera Eulalie Rivera 780000200011 Saint Croix School District ROUTE 1 GROVE PLACE VI PK 08 645

Joins that take advantage of string distance

The way that I’ve taken advantage of this is in dealing with typos or data where someone may have subtly different names for a school (e.g., one has the word school while the other drops it). What is nice about the fuzzyjoin package is that you can do the join. What is less nice is that it requires you to do a line check afterwards, especially when you are loose with the distances.

Let’s look at the second row, Joseph Elementary School, and have a maximum distance of 8.

Code in R
sample[2,1] |> 
  stringdist_join(VI,
                   by="schoolname",
                   max_dist=8,
                   mode="inner",
                   ignore_case=TRUE,
                  distance_col="stringdistance") |> 
  kbl() |> 
  kable_paper()
Multiple rows loosely match
Joseph Elementary School Joseph Gomez Elementary School 7.80003e+11 Saint Thomas – Saint John School District 142 ANNAS RETREAT VI PK 05 462 6
Joseph Elementary School Joseph Sibilly Elementary School 7.80003e+11 Saint Thomas – Saint John School District 14 15 16 ESTATE ELIZABETH VI PK 05 226 8
Joseph Elementary School Lockhart Elementary School 7.80003e+11 Saint Thomas – Saint John School District 41 ESTATE THOMAS VI KG 08 977 7

You can see the power, and the danger, of the fuzzyjoin package here. It’s amazing that it picks up the two schools that also have Joseph in their name, but it also suggests that you could change a few letters in the beginning and form the name of another school.

When we used fuzzyjoin, we would arrange the data by the school name and then the distance_col variable so that we could easily do a line check and choose the best match (if available).

Matching off of multiple columns

As a side note, it is nice to use the fuzzyjoin package for seeing errors in multiple columns. Here we can see how it works with the third row of the dataset.

Code in R
sample[3,] |> 
  stringdist_join(VI,
                  by=c("schoolname","state"),
                  mode="inner",
                  max_dist = 4,
                  ignore_case=TRUE,
                  distance_col="stringdistance") |> 
  kbl() |> 
  kable_paper()
Alfredo Andrews Elementary School NY Alfredo Andrews Elementary School 780000200002 Saint Croix School District RFD 1 KINGSHILL VI PK 06 495 0 2 NA

What is nice about this is that, in review, you can see easily through the distance_col variable where the changes were found.

Others have used fuzzyjoin to great effect – I am inspired reading this vignette in particular on geo_joins





Source link

Related Posts

About The Author

Add Comment