🐛 Deduplicate projects in cron job by excluding URL queries and fragments #2201

Merged
azeemshaikh38 merged 1 commit into ossf:main from spencerschrock:project-dedup on Aug 26, 2022

@spencerschrock
Member

What kind of change does this PR introduce?

bug fix

What is the current behavior?

cron/internal/data/projects.csv contains duplicate entries, which leads to duplicates in the public BigQuery data. For example:

github.com/adobe-fonts/source-code-pro,criticality_score:0.386850
github.com/adobe-fonts/source-code-pro#release,

In the latest BigQuery data, there are two repos named github.com/adobe-fonts/source-code-pro: one received a score of 4.8 and the other 4.4. The difference is due only to a rate-limit error that occurred during one run; both entries return identical results when I run them locally.

What is the new behavior (if this is a feature change)?

Uses the URL type from net/url (https://pkg.go.dev/net/url#URL) to extract the host and path when deduplicating cron/internal/data/projects.csv entries, ignoring queries, fragments, etc.
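
To illustrate the idea, here is a minimal sketch of this kind of deduplication. It is not the code merged in this PR: `canonicalRepo` and `dedupe` are hypothetical names, and prepending `https://` to scheme-less entries is an assumption made so that `url.Parse` fills in `Host` rather than treating the whole string as a path.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// canonicalRepo (hypothetical helper) reduces a repo entry to host + path,
// dropping any query string or fragment.
func canonicalRepo(raw string) (string, error) {
	s := strings.TrimSpace(raw)
	// Assumption: entries such as "github.com/org/repo" carry no scheme.
	// Without one, url.Parse leaves Host empty, so add a scheme first.
	if !strings.Contains(s, "://") {
		s = "https://" + s
	}
	u, err := url.Parse(s)
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", raw, err)
	}
	return u.Host + u.Path, nil
}

// dedupe (hypothetical helper) keeps the first occurrence of each
// canonical repo and preserves the input order.
func dedupe(entries []string) ([]string, error) {
	seen := make(map[string]bool)
	var out []string
	for _, e := range entries {
		key, err := canonicalRepo(e)
		if err != nil {
			return nil, err
		}
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, key)
	}
	return out, nil
}

func main() {
	repos, err := dedupe([]string{
		"github.com/adobe-fonts/source-code-pro",
		"github.com/adobe-fonts/source-code-pro#release",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(repos)
}
```

With the two entries from the example above, both inputs map to the same key, so only one repo remains after deduplication. The sketch ignores the CSV metadata column; how metadata from duplicate rows is combined is left out here.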

  • Tests for the changes have been added (for bug fixes/features)

Which issue(s) this PR fixes

NONE

Special notes for your reviewer

Changes to cron/internal/data/projects.csv were generated by running make add-projects

Does this PR introduce a user-facing change?

For user-facing changes, please add a concise, human-readable release note to
the release-note

(In particular, describe what changes users might need to make in their
application as a result of this pull request.)

NONE

@spencerschrock spencerschrock temporarily deployed to integration-test August 26, 2022 18:55 Inactive
@codecov

codecov bot commented Aug 26, 2022

Codecov Report

Merging #2201 (f34e0db) into main (9460030) will increase coverage by 2.40%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #2201      +/-   ##
==========================================
+ Coverage   42.28%   44.68%   +2.40%     
==========================================
  Files          95       95              
  Lines        7871     7871              
==========================================
+ Hits         3328     3517     +189     
+ Misses       4283     4087     -196     
- Partials      260      267       +7     

Member

@naveensrinivasan naveensrinivasan left a comment

Thanks!

@azeemshaikh38 azeemshaikh38 merged commit 11ff78e into ossf:main Aug 26, 2022
@spencerschrock spencerschrock deleted the project-dedup branch August 29, 2022 17:21