Skip to content

Joins in data.table vignette. Simplify and extend the "update by reference section" #6846

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
lucasmation opened this issue Mar 1, 2025 · 4 comments · May be fixed by #6847
Open

Joins in data.table vignette. Simplify and extend the "update by reference section" #6846

lucasmation opened this issue Mar 1, 2025 · 4 comments · May be fixed by #6847
Labels
joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins

Comments

@lucasmation
Copy link

lucasmation commented Mar 1, 2025

The new Joins in data.tablevignette is great, thank you!

I think the updating by reference section could improve by:

  1. provinding a simpler example with simple one-to-one match example. In the current example the beauty of the NO-COPY solutioun is a bit obscured by the complicated grouping ("last element selection", "tail", etc) used.
  2. provide an efficient solution for a RIGH JOIN that minimizes copying. At least in my workflow, that is the main use case. I have a main DT, on which I performe several join operations a[DT,on=.(idA)], b[DT,on=.(idB)],.... I assume this is a super common usecase and would be great to discuss the canonical solution to it (even if the answer is that the x[i,on=, j =] syntax does not allow for updating i by reference)

In this SO question, I provided sample code for 1., with a simpler join example, and 2., with my current solution to avoid the expensive copy.

UPDATE1: the solution proposed in SO is super elegant and performant:

x = data.table(id = c(1:5,8), newvar1=c(LETTERS[1:5],'h'),newvar2=c(5:1,-2))
#In practice x would have more vars: newvar2, ..., newvarN 
i = data.table(id = 1:7,var1 = c('bla','ble','bli','blo','blu','blA','blS'),var2=7:1,var3=2*(7:1),var4=1)
cols <- setdiff(names(x), 'id')
i[, (cols) := x[.SD, on = "id", .SD, .SDcols = cols]]

This should definetivelly go on the tutorial I wish I had known this ealier. On my application ( i: 220m rows, 40ish columns. x: 130m and 5 cols), the performance gain was massive. This method is 3x faster and uses only 1/5th to 1/6th of the RAM increase relative to i <- x[i,on=.(id)]

[I could not find an issue specific to this vignette, just the general #944]

@venom1204
Copy link
Contributor

#attempting

@iagogv3
Copy link
Contributor

iagogv3 commented Mar 26, 2025

@venom1204 Another very helpful SO answer on this is https://stackoverflow.com/a/44592473. I always return to it. It would be great something similar was included in the joins vignette

@iagogv3 iagogv3 added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Mar 26, 2025
@venom1204
Copy link
Contributor

hi @iagogv3
Thank you for pointing me to that helpful Stack Overflow answer and your clear examples. I agree these recoding patterns would make the vignette more practical.
I’ll aim to commit these changes within the next day or two and tag you for review. Let me know if you’d like me to emphasize anything else!

@trobx
Copy link

trobx commented May 21, 2025

Just noticed this after coming from #6997. Nice work on this section, but I think the by=.EACHI example really needs to be ditched altogether as there is a mult-based solution that is 500X faster. by=.EACHI could do with its own section as it is not especially linked to update-by-reference joins (where it isn't by each row of i anyway).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants