Skip to content

feat: add ip address backfill cli command #13804

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 31, 2023

Conversation

miketheman
Copy link
Member

@miketheman miketheman commented May 29, 2023

To support the eventual removal of *Event.ip_address_string, create a CLI command that provides the operator the ability to run the necessary queries to associate the relationship between any existing UserEvent and an IpAddress record.

Related to #8158

Once merged, relies on an operator to execute:

python -m warehouse hashing backfill-ipaddrs

which should backfill 10,000 events at a time.

Has toggles for --batch-size and --sleep-time, as well --continue-until-done once we're happy with the batch size and sleep time impact on the system overall.

Note: Need to update/modification for other Events: File.Event, Project.Event, Organization.Event, Team.Event

@miketheman miketheman requested a review from a team as a code owner May 29, 2023 18:46
f"Backfilled {batch_size} rows. Sleeping for {sleep_time} second(s)..."
)
time.sleep(sleep_time)
_backfill_ips(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have more than 10,000,000 entries that need backfilled this will cause an error, but we've already committed the session so we won't lose progress, so can just run it again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good note - where does the 10m entries limit come from?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also curious of this.

@miketheman miketheman force-pushed the miketheman/backfill-ipaddr-objs branch from d7f4f59 to c082bca Compare May 30, 2023 18:24
@miketheman miketheman requested a review from ewdurbin May 30, 2023 18:24
Comment on lines +27 to +35
registry_dict = {}
config = pretend.stub(
registry=pretend.stub(
__getitem__=registry_dict.__getitem__,
__setitem__=registry_dict.__setitem__,
settings={"warehouse.ip_salt": "NaCl"},
)
)
config.registry["sqlalchemy.engine"] = engine
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: ‏I struggled with this syntax for pretend.stub() to make an object that is both dict-accessible as well as method-friendly, since we call both config.registry["somekey"] as well as config.registry.settings["somekey"].
If there's a better way to represent this nesting with pretend.stub(), happy to change to that!

@dstufft
Copy link
Member

dstufft commented May 31, 2023 via email

@ewdurbin ewdurbin merged commit d1b1581 into pypi:main May 31, 2023
@ewdurbin
Copy link
Member

Just as a sanity check, I manually applied the python hashing to some records already populated in the DB via CDN and found the hashes to match! 🎉

@miketheman miketheman deleted the miketheman/backfill-ipaddr-objs branch May 31, 2023 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants