Look-up in R across data tables with "IF" condition
I have two data tables: a table of customer orders (it shows a customer ID and the order date when a purchase was made) and a table of customer segmentation (it shows which segment a customer was classified in during a certain time period).
I want to add the segment from the second table as a new variable in the first table, but of course only the segment the customer was in at the time of the order.
    library(data.table)

    Customer_Orders <- data.table(
      customer_ID = c("A", "A"),
      order_date = c("2017-06-30", "2019-07-30")
    )
    head(Customer_Orders)
    #    customer_ID order_date
    # 1:           A 2017-06-30
    # 2:           A 2019-07-30

    Customer_Segmentation <- data.table(
      customer_ID = c("A", "A", "A"),
      segment = c("1", "2", "3"),
      valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
      valid_until = c("2017-12-31", "2018-12-31", "2019-12-31")
    )
    head(Customer_Segmentation)
    #    customer_ID segment valid_from valid_until
    # 1:           A       1 2017-01-01  2017-12-31
    # 2:           A       2 2018-01-01  2018-12-31
    # 3:           A       3 2019-01-01  2019-12-31

This is the manually constructed result I'm looking for:
    Result <- data.table(
      customer_ID = c("A", "A"),
      order_date = c("2017-06-30", "2019-07-30"),
      segment = c(1, 3)
    )
    head(Result)
    #    customer_ID order_date segment
    # 1:           A 2017-06-30       1
    # 2:           A 2019-07-30       3
Currently, my solution is a right join that adds every possible segment to each row of the customer orders table, and then drops all rows where the order date does not fall within the segment's validity period. However, as my dataset is huge, this is a really slow and cumbersome solution.
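For reference, a minimal sketch of what that join-then-filter approach might look like in data.table. The asker's actual code is not shown, so the names `orders`, `segments`, `all_pairs` and `result` are illustrative assumptions, and the dates are converted with `as.Date` for the comparison.

```r
library(data.table)

# Illustrative copies of the question's tables, with real Date columns
orders <- data.table(
  customer_ID = c("A", "A"),
  order_date  = as.Date(c("2017-06-30", "2019-07-30"))
)
segments <- data.table(
  customer_ID = c("A", "A", "A"),
  segment     = c("1", "2", "3"),
  valid_from  = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

# Step 1: pair every order with every segment row of the same customer.
# allow.cartesian = TRUE is needed because the result has more rows than the inputs.
all_pairs <- merge(orders, segments, by = "customer_ID", allow.cartesian = TRUE)

# Step 2: keep only the pairs where the order date lies inside the validity window
result <- all_pairs[order_date >= valid_from & order_date <= valid_until,
                    .(customer_ID, order_date, segment)]
print(result)
```

The intermediate table grows with (orders per customer) x (segments per customer), which is why this gets slow on large data.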
Probably the easiest method would be to use the sqldf package:
    library(sqldf)

    sqldf("select *
           from Customer_Orders
           left join Customer_Segmentation
             on order_date between valid_from and valid_until
            and Customer_Orders.customer_ID = Customer_Segmentation.customer_ID")
    #   customer_ID order_date customer_ID..3 segment valid_from valid_until
    # 1           A 2017-06-30              A       1 2017-01-01  2017-12-31
    # 2           A 2019-07-30              A       3 2019-01-01  2019-12-31

It simply joins the tables where the customer ID matches and the order date falls within the validity period.
If you would rather use data.table, see below:
    setkey(Customer_Segmentation, customer_ID, valid_from)
    setkey(Customer_Orders, customer_ID, order_date)

    # Rolling join: each order picks up the latest segment row with valid_from <= order_date
    ans <- Customer_Segmentation[
      Customer_Orders,
      list(.valid_from = valid_from, valid_until, order_date, segment),
      by = .EACHI, roll = TRUE
    ][, `:=`(.valid_from = NULL)]
    ans
    #    customer_ID valid_from valid_until order_date segment
    # 1:           A 2017-06-30  2017-12-31 2017-06-30       1
    # 2:           A 2019-07-30  2019-12-31 2019-07-30       3
It is easy to get rid of extra columns if unwanted.
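As a sketch of that cleanup, here is a slightly different rolling join (using `on =` rather than keys) followed by the column tidy-up. The names `orders`, `segments` and `res` are illustrative, not from the answer above, and the dates are assumed to be stored as `Date`.

```r
library(data.table)

orders <- data.table(
  customer_ID = c("A", "A"),
  order_date  = as.Date(c("2017-06-30", "2019-07-30"))
)
segments <- data.table(
  customer_ID = c("A", "A", "A"),
  segment     = c("1", "2", "3"),
  valid_from  = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

# Rolling join: for each order, take the latest segment row whose valid_from
# is on or before the order date. In the result, the join column is named
# valid_from but holds the order dates coming from `orders`.
res <- segments[orders, on = .(customer_ID, valid_from = order_date), roll = TRUE]

# A rolling join does not look at valid_until, so check the upper bound explicitly
res <- res[valid_from <= valid_until]

# Tidy up: give the join column its real meaning and drop the unwanted column
setnames(res, "valid_from", "order_date")
res[, valid_until := NULL]
print(res)
```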
Your data (fixed):
    library(tidyverse)
    library(lubridate)

    Customer_Orders <- tibble(
      customer_ID = c("A", "A"),
      order_date = c("2017-06-30", "2019-07-30"))

    Customer_Segmentation <- tibble(
      customer_ID = c("A", "A", "A"),
      segment = c("1", "2", "3"),
      valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
      valid_until = c("2017-12-31", "2018-12-31", "2019-12-31"))
Code: the first two tables just convert the character columns of the initial tables to dates using lubridate. The third joins everything on customer_ID.

    Customer_Orders2 <- Customer_Orders %>%
      mutate(order_date = ymd(order_date))

    Customer_Segmentation2 <- Customer_Segmentation %>%
      mutate(valid_from = ymd(valid_from)) %>%
      mutate(valid_until = ymd(valid_until))

    # Every order is paired with every segment row of the same customer
    Customer_Orders_join <- full_join(Customer_Orders2, Customer_Segmentation2,
                                      by = "customer_ID")
This picks out the segments based on the interval.
    Customer_Orders3 <- Customer_Orders_join %>%
      filter(order_date %within% interval(valid_from, valid_until))

    # A tibble: 2 x 5
      customer_ID order_date segment valid_from valid_until
      <chr>       <date>     <chr>   <date>     <date>
    1 A           2017-06-30 1       2017-01-01 2017-12-31
    2 A           2019-07-30 3       2019-01-01 2019-12-31
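If the installed dplyr is version 1.1.0 or later, the join and the interval filter can probably be combined into one step with `join_by()` and its `between()` helper, which avoids building the full set of order/segment pairs first. This is a sketch under that version assumption, not part of the original answer; `orders`, `segments` and `joined` are illustrative names.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where join_by() was introduced

orders <- tibble(
  customer_ID = c("A", "A"),
  order_date  = as.Date(c("2017-06-30", "2019-07-30"))
)
segments <- tibble(
  customer_ID = c("A", "A", "A"),
  segment     = c("1", "2", "3"),
  valid_from  = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

# Match on customer_ID and require valid_from <= order_date <= valid_until
# (between() in join_by() is inclusive on both ends by default)
joined <- left_join(
  orders, segments,
  by = join_by(customer_ID, between(order_date, valid_from, valid_until))
)
print(joined)
```

Because this is a left join, an order with no matching segment would be kept with `NA` in the segment columns rather than dropped.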
Here's how I would approach the problem:
Data Generation (defining the date columns as proper Date objects)

    library(data.table)

    Customer_Orders <- data.table(
      customer_ID = c("A", "A"),
      order_date = as.Date(c("2017-06-30", "2019-07-30"))
    )

    Customer_Segmentation <- data.table(
      customer_ID = c("A", "A", "A"),
      segment = c("1", "2", "3"),
      valid_from = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
      valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
    )
Non-equi Update Join to Add Segment
When using the A[B] syntax supported by data.table, it's relatively simple to add a single column from the B table to the original A table by using the i. prefix to reference columns in B. The remaining portion is just the on statement, which can be defined as a list using the .() notation in data.table with any number of conditions.
    # <= on the upper bound so that an order placed on valid_until itself still matches
    Customer_Orders[Customer_Segmentation,
                    segment := i.segment,
                    on = .(customer_ID,
                           order_date >= valid_from,
                           order_date <= valid_until)]
    print(Customer_Orders)
    #    customer_ID order_date segment
    # 1:           A 2017-06-30       1
    # 2:           A 2019-07-30       3
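To illustrate how the update join treats orders without a matching segment, here is a self-contained sketch with one extra, hypothetical customer "B" who has no segmentation rows. The names `orders` and `segments` are illustrative; the join itself is the same non-equi update join as above.

```r
library(data.table)

orders <- data.table(
  customer_ID = c("A", "A", "B"),  # "B" is a made-up customer with no segment rows
  order_date  = as.Date(c("2017-06-30", "2019-07-30", "2018-03-15"))
)
segments <- data.table(
  customer_ID = c("A", "A", "A"),
  segment     = c("1", "2", "3"),
  valid_from  = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

# Update `orders` by reference: for every segment row, find the orders of the same
# customer whose date lies inside [valid_from, valid_until] and write its segment.
orders[segments,
       segment := i.segment,
       on = .(customer_ID,
              order_date >= valid_from,
              order_date <= valid_until)]

# Orders that matched no segment row keep NA in the new column
print(orders)
```

Since `:=` modifies `orders` in place, no copy of the (possibly huge) orders table is made, and the number of rows stays the same.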
"2017-06-31"is not a valid date. You might mean
- This seems like an obvious case for a "non-equi join". Search for that term together with "R data.table date" and you should find several Q&A
- I didn't realise data.table was a package. Interesting.
- The non-equi join would be more straightforward than the rolling join: Customer_Segmentation[Customer_Orders, on=c("valid_from<=order_date","valid_until>=order_date")], as it pretty much recreates the logic of the SQL code.
- @thelatemail agreed. Didn't give much thinking to the
- @thelatemail: If I use your code, I'm not getting the correct result though. I get a table with five columns: customer_ID, segment, valid_from, valid_until, i.customer_ID. The order date, however, is not in there.
- I actually did this now, seems to work:
Customer_Orders[Customer_Segmentation, on=.(customer_ID, order_date>=valid_from, order_date<=valid_until), segment:=segment]
- @JenniferWeingarten you can select using the j place instead of the assignment (which I would think is slower for large data tables):
Customer_Segmentation[Customer_Orders, .(customer_ID, order_date, segment), on = .( valid_from <= order_date, valid_until >= order_date, customer_ID == customer_ID ) ]