smilingprogrammer

Description

This PR is ongoing work on the data pipeline for Safaa, which will let us automate most of Safaa's manual tasks.

Changes

  1. Created a new directory, containing this script, for the ongoing work on a pipeline to automate most manual Safaa tasks.
  2. Added a script that preprocesses the copyright content fetched from the FOSSology server, and wired it into the pipeline through pipeline.yml.
  3. Integrated the existing decluttering script into the pipeline, and added a separate script (extra_decluter.py) that experiments with improving the decluttering using regex.
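
To make the regex-based decluttering idea in item 3 concrete, here is a minimal sketch. It does not reproduce extra_decluter.py: the `declutter` function and all three patterns are hypothetical, chosen only to illustrate the kind of cleanup a regex pass could do on a scanned copyright statement.

```python
import re

# All patterns below are illustrative assumptions, not the rules used in the PR.
# Simple HTML-like tags such as <b> or </b>; e-mail addresses in <...> are left alone.
_TAG = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*\s*/?>")
# A trailing "All rights reserved" with any surrounding punctuation.
_TRAILER = re.compile(r"[\s.,;:]*all rights reserved[\s.,;:]*$", re.IGNORECASE)
_WHITESPACE = re.compile(r"\s+")


def declutter(statement: str) -> str:
    """Return the statement with markup, boilerplate trailer and stray symbols removed."""
    text = _TAG.sub(" ", statement)
    text = _TRAILER.sub("", text)
    text = _WHITESPACE.sub(" ", text)
    # Drop comment markers and punctuation left over at either end.
    return text.strip(" \t.,;:*/#-")


if __name__ == "__main__":
    print(declutter(" * Copyright (c) 2020 <b>Example Corp</b>.  All rights reserved. "))
```

Keeping each rule as a separately named pattern makes it easy to switch individual rules on or off while the approach is still experimental.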

How to test for script_for_copyright.py

  1. Start a FOSSology server instance on localhost.
  2. Upload a project zip file so that its copyrights are scanned.
  3. Create a .env file to store your server variables: DB_NAME, DB_USER, DB_PASSWORD, DB_HOST, DB_PORT.
  4. Run the .py script directly from its folder.
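
For step 3, the following is a minimal sketch of how the five variables might be read and validated before connecting to the database. It uses only the standard library; `parse_env` and `db_settings` are hypothetical helpers written for illustration, and script_for_copyright.py itself may load the file differently (for example with python-dotenv).

```python
from pathlib import Path

# The variable names come from the PR description; everything else is an assumption.
REQUIRED_KEYS = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def db_settings(env: dict[str, str]) -> dict[str, object]:
    """Check that every required variable is present and build connection settings."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing .env variables: {', '.join(missing)}")
    return {
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),
    }


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        settings = db_settings(parse_env(env_path.read_text()))
        # Print everything except the password as a quick sanity check.
        print({k: v for k, v in settings.items() if k != "password"})
```

Failing early with the list of missing variables gives a clearer error than a connection failure later in the script.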

How to test for preprocessing_script.py

  1. Place the copyrights.csv obtained from the FOSSology server in the data directory (an example dataset is already available).
  2. Run the preprocessing script.
  3. Trigger it on GitHub Actions under the Pipeline Script.

@smilingprogrammer changed the title from "Copyright pipeline" to "feat(pipeline): Data Pipeline for Safaa main PR" on Jul 9, 2025
Member

@Kaushl2208 Kaushl2208 left a comment

Hey @smilingprogrammer,

There is a large gap between what we were planning to do and this version. I have left comments where I think things could be done differently, and perhaps better.

Also, Safaa/src/safaa/pipeline_dir/data/copyrights.csv: why are we committing this CSV file?

Take a closer look at all the requested changes and try to align with the priorities.

@smilingprogrammer
Author

Okay, I will make adjustments based on the comments. As for the file, as indicated in the PR description, it's just a dummy example for trying out the functionality.

@smilingprogrammer
Author

Thank you very much for the feedback; it was really insightful. I have made corrections in the places you noted. For the two comments I haven't marked as resolved, I have a few small questions before pushing what I have in my local environment, and I will ask them in the next meeting.

Thank you once again, very much appreciated.
