feat(pipeline): Data Pipeline for Safaa main PR #22
base: main
…nt, and added it into the pipeline
…t, and added it into the pipeline, I also included an upgraded declutter script with regex
Hey @smilingprogrammer,
There is a large gap between what we were planning to do and this version. I have left comments where I think things could be done differently, and perhaps better.
Also: Safaa/src/safaa/pipeline_dir/data/copyrights.csv
Why are we committing this CSV file?
Take a closer look at all the requested changes and try to align with the priorities.
Okay, I will make adjustments based on the comments. As for the file, as indicated in the PR details, it's just a dummy example for trying out the functionality.
…rom the fossology serv
… various functionalities
Thank you very much for the feedback, it was really insightful. I have made corrections in the places you noted. For the 2 others that I haven't marked resolved, I have a few small questions before pushing what I have locally, and I will ask them in the next meeting. Thank you once again, very much appreciated.
27dab47 to 3ed9383
…dependency installation
…retraining script
…ility path for easier pipeline metrics
Description
This PR is ongoing work on the Data Pipeline for Safaa, which lets us automate most manual Safaa tasks.
Changes
extra_decluter.py: improves our decluttering using regex (experimental).
How to test script_for_copyright.py
.env: stores your server variables, which are DB_NAME, DB_USER, DB_PASSWORD, DB_HOST and DB_PORT.
How to test preprocessing_script.py
copyrights.csv obtained from the fossology server, in the data directory (an example dataset is already available).
Pipeline Script
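For the .env step listed above, here is a minimal sketch of how a script could read the five server variables. It is not the code in this PR: the helper names parse_env_file and load_db_settings are hypothetical, and the sketch assumes the .env file holds plain KEY=VALUE lines. Real environment variables take precedence over the file.

```python
import os
from pathlib import Path

# The five variable names come from the PR description.
REQUIRED_KEYS = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_db_settings(env_path=".env", environ=None):
    """Return the DB settings, preferring real environment variables (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(env_path) if Path(env_path).is_file() else {}
    settings = {key: environ.get(key, file_values.get(key)) for key in REQUIRED_KEYS}
    missing = [key for key, value in settings.items() if not value]
    if missing:
        raise KeyError("Missing database settings: " + ", ".join(missing))
    settings["DB_PORT"] = int(settings["DB_PORT"])  # ports are numeric
    return settings
```

Failing early with the list of missing keys makes a misconfigured .env easier to diagnose than a connection error later in the run.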