Add sv_run_id to training data ingest #202
jeancochrane
left a comment
This is looking good to me! You'll indeed need to re-run ingest and update dvc.lock for the changes to the training data to be persisted. Happy to pair on that if it would be helpful (and we should wait for Dan to review this as well to double-check my understanding, I think).
pipeline/02-assess.R
Outdated
      meta_pin, meta_year,
    - meta_sale_price, meta_sale_date, meta_sale_document_num
    + meta_sale_price, meta_sale_date, meta_sale_document_num,
    + sv_outlier_type, sv_run_id
[Nitpick, non-blocking] You can leave out the sv_outlier_type stuff since it's only pertinent to my PR (#199). One of us can just resolve merge conflicts if the other merges first!
Also, a question for @dfsnow: Do we actually need sale_recent_{n}_run_id in the assessment data? I figure it's most pertinent in the training data, right?
I suppose we could always merge the training data back to the assessment data to get the run ID. Alright, @wagnerlmichael I think @jeancochrane is right, let's actually nix this from the assess stage. Sorry for the extra work.
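For reference, merging the training data back onto the assessment data to recover the run ID could look roughly like the sketch below. It uses base R and toy data frames. The column name sale_recent_1_document_num, the choice of document number as the join key, and all values are assumptions for illustration, not the pipeline's actual schema.

```r
# Toy stand-ins for the real pipeline outputs (all values are made up)
training <- data.frame(
  meta_pin = c("01", "02", "03"),
  meta_sale_document_num = c("D100", "D200", "D300"),
  sv_run_id = c("run_a", "run_a", "run_b"),
  stringsAsFactors = FALSE
)

assessment <- data.frame(
  meta_pin = c("01", "02", "04"),
  # Hypothetical column name for the most recent sale's document number
  sale_recent_1_document_num = c("D100", "D200", NA),
  stringsAsFactors = FALSE
)

# Keep only the join key and the run ID, renamed to match the assessment side
run_ids <- training[, c("meta_sale_document_num", "sv_run_id")]
names(run_ids) <- c("sale_recent_1_document_num", "sale_recent_1_run_id")

# Left join: every assessment row is kept; rows with no matching sale get NA
assessment_with_run_id <- merge(
  assessment, run_ids,
  by = "sale_recent_1_document_num",
  all.x = TRUE
)
```

If a document number appeared more than once in the training data, this join would duplicate assessment rows, so the key would need to be de-duplicated first.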
Adds sv_run_id to outputs from 00-ingest.R and 02-assess.R. This will make it easier for us to look back on a sale's outlier status in previous model runs. I believe these changes are consistent with #199.

I've double checked that this column persists through the following locations:

write_parquet(paths$input$training$local) in line 354 in 00-ingest.R
write_parquet(paths$output$assessment_pin$local) on line 570 in 02-assess.R

Want to double check with @dfsnow if I need to do something with the dvc lock file per @jeancochrane's advice.
Closes #201
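A check that the column survives a parquet write, like the one described above, might be sketched as follows. This assumes the arrow package is installed. It writes a toy data frame to a temporary file rather than to paths$input$training$local, and the data values are made up.

```r
library(arrow)

# Toy frame standing in for the training data (values are made up)
training <- data.frame(
  meta_pin = c("01", "02"),
  sv_run_id = c("run_a", "run_b"),
  stringsAsFactors = FALSE
)

# Write to a temporary file instead of the real pipeline path
tmp <- tempfile(fileext = ".parquet")
write_parquet(training, tmp)

# Read the file back and confirm the column is still present
round_trip <- read_parquet(tmp)
has_run_id <- "sv_run_id" %in% names(round_trip)
stopifnot(has_run_id)
```

Against the real outputs, the same check would read the file written by each stage and test names() for sv_run_id.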