Add Data Loss Prevention Example #2226
Merged
rapids-bot merged 212 commits into nv-morpheus:branch-25.06 from dagardner-nv:david-tzm-dlp on Jul 9, 2025
b367cb0
Fix out of date docker-py deps
dagardner-nv 1806914
Work-around for #2219
dagardner-nv e81702a
Update conda deps
dagardner-nv d882215
Adding new deps from Tad
dagardner-nv 658648c
Adding new deps from Tad
dagardner-nv 9ff316f
Resolve dependency conflicts, add missing dep for yaml (currently in …
dagardner-nv 94c577d
WIP
dagardner-nv d962de0
Adding cr header
dagardner-nv ef664bf
WIP
dagardner-nv d6f9920
Update datasets and huggingface_hub libraries to match requirements o…
dagardner-nv 03f9bdb
Pin click and setuptools
dagardner-nv 89abcbd
Be more specific with setuptools versions
dagardner-nv 6da41ee
Update conda envs
dagardner-nv a1f0cd3
WIP
dagardner-nv 711254f
Merge branch 'david-dep-issues' of github.com:dagardner-nv/Morpheus i…
dagardner-nv 060d9e6
Don't pin the model to device 1
dagardner-nv da74c07
Just use the std lib re
dagardner-nv be47c08
Remove eval and unused imports
dagardner-nv a6fbcee
Remove unneeded bit
dagardner-nv f5bfa4f
Remove unused dep
dagardner-nv de03d39
Add set of regular expressions
dagardner-nv 04a6201
Add module dir
dagardner-nv 81d6499
Remove unused modules
dagardner-nv d2f77c8
Don't print all the results
dagardner-nv 509fe3a
WIP
dagardner-nv ff59da7
datasets is now a CLI flag
dagardner-nv 086044a
Source stage for pulling data from huggingface
dagardner-nv e51fae1
Refactor DLPInputProcessor as a Morpheus stage
dagardner-nv ec9e75a
First pass at refactoring RegexProcessor as a stage
dagardner-nv fd4fb6d
Make num_samples a flag
dagardner-nv 5fe1bc8
Fix syntax
dagardner-nv 5196df5
Fixes
dagardner-nv 6c464f8
WIP
dagardner-nv a6b950a
Switch to processing records one row at a time, using one of the cudf…
dagardner-nv 8ce9072
First pass at a gliner stage
dagardner-nv f6059e9
Remove unused import
dagardner-nv 7f38ddd
Switch to applying the regex on a per-row basis, as this allows us to…
dagardner-nv 1cd00da
First pass at refactoring RiskScorer as a stage
dagardner-nv 98fedca
Fix type hint
dagardner-nv 2772bb3
WIP
dagardner-nv 3779e9a
Cleanup
dagardner-nv e6580c8
Set max_model_length
dagardner-nv 6c84ead
Remove model_max_length, as it was not working it appears to be a kno…
dagardner-nv e090ca5
Flatten the scores output
dagardner-nv 9c01359
Minor improvements
dagardner-nv 9e9adfb
Rename gliner_findings to dlp_findings
dagardner-nv 668850d
Fix setting of GpuAndCpuMixin
dagardner-nv 8b29060
Run the pipeline in CPU execution mode
dagardner-nv 0dbb14d
Remove unused import
dagardner-nv 6db058f
Work-around what appears to be a bug in the serialization stage
dagardner-nv d6fed72
Remove temporary work-around
dagardner-nv 3abc552
Fix spelling errors, restructure readme
dagardner-nv 5a5394b
Merge branch 'branch-25.06' of github.com:nv-morpheus/Morpheus into d…
dagardner-nv cd882be
Revert unintentional changes
dagardner-nv 9fde32e
Lint fixes and other cleanups
dagardner-nv 2dab666
Update README.md
tzemicheal a62e58f
Add CR header
dagardner-nv dabea93
Merge branch 'david-tzm-dlp' of github.com:dagardner-nv/Morpheus into…
dagardner-nv 2fb2a5e
Add preallocations
dagardner-nv 67f3c72
Cleanup monitor stage labels
dagardner-nv e3ebb9e
Optimization to handle situation where chunking isn't used
dagardner-nv e856bf8
Fix handling of output file path
dagardner-nv 944d484
Batch process data
dagardner-nv 50f7890
Lazily load the model
dagardner-nv 54af433
Remove unused import
dagardner-nv 7785d5c
Install gliner from pip, ensuring we don't accidentally install a cpu…
dagardner-nv b73d575
Document the need to install torch by hand on Arm
dagardner-nv be14b32
pin to 0.2.19, 2.20 isn't working with our version of torch
dagardner-nv 9448db5
Switch to updated gliner, and specify cache dir
dagardner-nv 903c5fa
Switch to performing regexes in cudf
dagardner-nv 12c9be9
Switch to using cudf regex
dagardner-nv dcfbdbb
Exclude the privacy mask by default, enabled with flag
dagardner-nv 246125d
Switch to performing a pandas apply
dagardner-nv 0dac664
Remove the PreallocatorMixin from the DLPInputProcessor stage
dagardner-nv 468bf9d
Merge branch 'branch-25.06' of github.com:nv-morpheus/Morpheus into d…
dagardner-nv 66b766c
Optionally use an input file, optionally repeat the input data
dagardner-nv f1842ae
Add triton inference processor
tzemicheal ffada50
Merge branch 'tz-david-tzm-dlp' of github.com:tzemicheal/Morpheus int…
dagardner-nv bf1b51c
Triton code as-is from Tad
dagardner-nv e6e552a
Lazily load the model, construct the client once in the constructor
dagardner-nv 04b2aea
Add License headers
dagardner-nv 77791f9
Use grpc
dagardner-nv cc68730
Ensure that post processing happens on the GPU
dagardner-nv 5161d67
Merge pull request #11 from dagardner-nv/david-tzm-dlp-tz-triton
dagardner-nv 9660526
Use a list of tuples
dagardner-nv e02b8ca
Misc cleanups, drop usage of onnx locally as this causes the model t…
dagardner-nv e5973b7
WIP
dagardner-nv 499af04
Cleanup
dagardner-nv 8ecb9bf
Merge branch 'david-tzm-dlp' of github.com:dagardner-nv/Morpheus into…
dagardner-nv e3382f5
Switch to async
dagardner-nv 37d2e7d
Merge branch 'david-tzm-dlp-tz-triton' into david-tzm-dlp
dagardner-nv a6b09f2
Lint fix
dagardner-nv ece50f8
Remove unused model_cache_dir
dagardner-nv fd69b3b
Lint fix
dagardner-nv fdea060
Make server_url a cli flag
dagardner-nv 4cbbe47
Remove the need for a second loop
dagardner-nv bbb1fde
Clean up type hints
dagardner-nv 192ef01
Expose command line flags to enable chunking
dagardner-nv 183429b
Fix requesting needed columns
dagardner-nv 17e4e98
Replace broken chunking feature with a split on new-line chars
dagardner-nv a5c2159
Always split on paragraphs, don't fallback, filter non-matched rows (…
dagardner-nv ca19410
Lint fixes
dagardner-nv 9d3b630
Aggregate data by the original index
dagardner-nv c15e31b
Adjust weights to match the labels
dagardner-nv 58e86de
Remove weights not in the labels
dagardner-nv f82a429
Remove unused import, adjust weight calculations per TZM
dagardner-nv ebfba25
Fix the calculations of scores
dagardner-nv 8ffb846
Better handling of output columns
dagardner-nv 10a7768
Add a --regex_only flag
dagardner-nv eb618d3
Run the scorer for regex only
dagardner-nv 8120bbd
Handle the findings from regex only
dagardner-nv 74787aa
Fix handling of regex labels
dagardner-nv 119d0f2
Relocate the mode to the models dir
dagardner-nv 78ff92b
Update README to include triton instructions
dagardner-nv 9a6da64
Moving files to LFS
dagardner-nv 58f4a5c
Move json files to LFS
dagardner-nv c883e51
Include information about fetching the model with git lfs
dagardner-nv c373d2f
Update to no longer use the pytorch model using only the onnx model
dagardner-nv 2b3c791
Ensure a CUDA enabled version of onnxruntime is installed, install gl…
dagardner-nv 9ca61b9
DLPInputProcessor is now responsible for converting from MessageMeta …
dagardner-nv 5917ea2
WIP
dagardner-nv 3223073
WIP
dagardner-nv 10b40a3
Shelving this for a while, the current cpp impl is somehow resulting …
dagardner-nv adba646
WIP
dagardner-nv e896dfb
Add pipeline batch size
dagardner-nv 3ca6548
Merge branch 'david-tzm-dlp' into david-tzm-dlp-cpp-regex
dagardner-nv fb03e8a
Fix building AST tree
dagardner-nv ff34b9d
Fix label concatenation
dagardner-nv f25e291
Log the df length
dagardner-nv 25fc535
Time the entire run, not a single call
dagardner-nv d739df0
David tzm dlp cpp regex (#13)
dagardner-nv 845e114
Remove debug printing
dagardner-nv 090d59a
Ugh LFS
dagardner-nv 9d30c98
Merge branch 'david-tzm-dlp-cpp-regex' into david-tzm-dlp
dagardner-nv 8deda9a
WIP
dagardner-nv b540d95
Remove redundant patterns
dagardner-nv ef4b42c
Remove redundant patterns
dagardner-nv f4f3161
Remove old work-around, and print statements
dagardner-nv c229146
Clean up the timing code
dagardner-nv bc4ae45
Remove more redundant regexes
dagardner-nv 2e2d492
Remove more redundant regexes
dagardner-nv 0f29382
Use apply rather than iterating over groups
dagardner-nv 433f30c
Remove unused import
dagardner-nv fc662a7
Revert unintentional change
dagardner-nv 0d3379b
Revert temporary timing code
dagardner-nv 7d0385b
Remove timing code
dagardner-nv 10a5566
Remove timing code
dagardner-nv 902c1f0
Remove scripts from LFS
dagardner-nv 3b05df3
Adjust LFS matching
dagardner-nv 29d3aed
Add scripts back in
dagardner-nv 7c4fb14
Remove debug stage
dagardner-nv 9a929c1
Rename extension to conform with Morpheus naming
dagardner-nv d7fd972
IWYU fixes
dagardner-nv 988520f
Remove nervaluate
dagardner-nv 1140411
Remove unused parameter
dagardner-nv a4da5ae
Add num_threads flag
dagardner-nv 8869774
Add docstrings to RiskScorer
dagardner-nv 0391c38
Merge branch 'branch-25.06' of github.com:nv-morpheus/Morpheus into d…
dagardner-nv d2290a1
Fix serializing to JSON for dataframes when they contain a struct field
dagardner-nv bbc8279
Remove redundant for-loop
dagardner-nv a719a66
Replace custom DLP output stages with Morpheus built-in stages
dagardner-nv 610c242
Remove un-needed DLP output stages
dagardner-nv 8c644ee
Make get_data public and avoid making copies of TableInfoData
dagardner-nv f3e024b
IWYU fixes
dagardner-nv 88f831f
Fix grammatical error in error message
dagardner-nv 0296bcb
Remove redundant code
dagardner-nv 1e8bb34
Rename variable
dagardner-nv 6bfd9ba
Remove unneeded monitor stage
dagardner-nv 9987117
Remove redundant loop
dagardner-nv 3ffe3f1
Revert "Remove redundant loop"
dagardner-nv 5edaf2b
Cleanup pull the if statement out of the loop
dagardner-nv f8d177c
Add missing else clause for Parquet
dagardner-nv 3ac6a7d
Add missing case statement for Parquet
dagardner-nv 5288b9d
WIP
dagardner-nv 438acd0
Handle index columns
dagardner-nv da98af1
Add triton repo configs for gliner model
dagardner-nv 9ac3ffa
Slim down the input processor stage
dagardner-nv 0755343
Combine the two replace statements
dagardner-nv 2bb3228
WIP
dagardner-nv b1d803c
Revert "WIP"
dagardner-nv 0a3b9c5
Specify no capture
dagardner-nv 25b39fc
Add missing include
dagardner-nv 87a33ab
Don't emit an empty table
dagardner-nv c635639
Update examples/data_loss_prevention/dlp_stages/_lib/CMakeLists.txt
dagardner-nv 47255df
Replace explicit version pins with version ranges
dagardner-nv 0eedd6f
Remove timing code, document default values, remove restriction on NL…
dagardner-nv 4ebd4e0
Document default values in the docstring
dagardner-nv 1fa2bb4
Add docstrings
dagardner-nv 0ed70f1
Add default values to docstrings
dagardner-nv 870ba30
Merge branch 'branch-25.06' of github.com:nv-morpheus/Morpheus into d…
dagardner-nv 674adaa
Add round-trip test for write_df_to_file/read_file_to_df and include …
dagardner-nv c43a58c
Open parquet files as binary, add optional include_index_col arg for …
dagardner-nv bf12770
Add new tests
dagardner-nv 792b302
Add unittest for the new get_column override
dagardner-nv 7864d53
IWYU fixes
dagardner-nv fd502d7
Merge branch 'branch-25.06' of github.com:nv-morpheus/Morpheus into d…
dagardner-nv 5a0d40e
Support single regex pattern
dagardner-nv 4c589bb
replace print with logger
dagardner-nv 88b1f71
Apply suggestions from code review
dagardner-nv e9368de
Address PR feedback
dagardner-nv fa5f3d4
Remove --include_privacy_masks flag as this isn't supported in the pi…
dagardner-nv b4e0de6
Merge branch 'david-tzm-dlp' of github.com:dagardner-nv/Morpheus into…
dagardner-nv 6f73723
Add comment explanation
dagardner-nv 4a4b466
Remove unused logger
dagardner-nv 6e7861d
Remove unneeded assert
dagardner-nv bfd9554
Remove early/unneeded variable assignment
dagardner-nv a320b9d
Remove old docstring
dagardner-nv 9687226
Update README.md of HC PR feedback
tzemicheal 349056b
Move max_score to a class var
dagardner-nv c75bb89
Merge branch 'david-tzm-dlp' of github.com:dagardner-nv/Morpheus into…
dagardner-nv dabf1a7
Fix alignment
dagardner-nv 80ba916
Adjust regex to match missing row
dagardner-nv
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -0,0 +1,79 @@ |
| | | # SPDX-FileCopyrightText: Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| | | # SPDX-License-Identifier: Apache-2.0 |
| | | # |
| | | # Licensed under the Apache License, Version 2.0 (the "License"); |
| | | # you may not use this file except in compliance with the License. |
| | | # You may obtain a copy of the License at |
| | | # |
| | | # http://www.apache.org/licenses/LICENSE-2.0 |
| | | # |
| | | # Unless required by applicable law or agreed to in writing, software |
| | | # distributed under the License is distributed on an "AS IS" BASIS, |
| | | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| | | # See the License for the specific language governing permissions and |
| | | # limitations under the License. |
| | | |
| | | cmake_minimum_required(VERSION 3.25 FATAL_ERROR) |
| | | |
| | | list(APPEND CMAKE_MESSAGE_CONTEXT "dlp") |
| | | |
| | | # Set the cache to be the same to allow for CCache to be used effectively |
| | | set(MORPHEUS_CACHE_DIR "${CMAKE_SOURCE_DIR}/.cache" CACHE PATH "Directory to contain all CPM and CCache data") |
| | | mark_as_advanced(MORPHEUS_CACHE_DIR) |
| | | |
| | | # Add the Conda environment to the prefix path and add the CMake files |
| | | list(PREPEND CMAKE_PREFIX_PATH "$ENV{CONDA_PREFIX}") |
| | | |
| | | project(dlp |
| | | VERSION 25.06.00 |
| | | LANGUAGES C CXX |
| | | ) |
| | | |
| | | set(CMAKE_CXX_STANDARD 20) |
| | | set(CMAKE_CXX_STANDARD_REQUIRED ON) |
| | | set(CMAKE_CXX_EXTENSIONS ON) |
| | | set(CMAKE_POSITION_INDEPENDENT_CODE TRUE) |
| | | set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE) |
| | | set(CMAKE_INSTALL_RPATH "$ORIGIN") |
| | | |
| | | # Set the option prefix to match the outer project before including. Must be before find_package(morpheus) |
| | | set(OPTION_PREFIX "MORPHEUS") |
| | | |
| | | # Set the policy to allow for CMP0144, avoids warning about MORPHEUS_ROOT being set |
| | | cmake_policy(SET CMP0144 NEW) |
| | | |
| | | find_package(morpheus REQUIRED) |
| | | find_package(glog REQUIRED) # work-around for #2149 |
| | | |
| | | morpheus_utils_initialize_cpm(MORPHEUS_CACHE_DIR) |
| | | |
| | | # Ensure CPM is initialized |
| | | rapids_cpm_init() |
| | | |
| | | morpheus_utils_python_configure() |
| | | |
| | | rapids_find_package(CUDAToolkit REQUIRED) |
| | | rapids_find_package(cudf REQUIRED) |
| | | |
| | | set(CMAKE_POSITION_INDEPENDENT_CODE TRUE) |
| | | set(CMAKE_EXPORT_COMPILE_COMMANDS ON) |
| | | |
| | | morpheus_utils_create_python_package(dlp_stages |
| | | PROJECT_DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}" |
| | | SOURCE_DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/dlp_stages" |
| | | ) |
| | | |
| | | add_subdirectory(dlp_stages/_lib) |
| | | |
| | | # Complete the python package |
| | | if(MORPHEUS_PYTHON_INPLACE_BUILD) |
| | | list(APPEND extra_args "IS_INPLACE") |
| | | endif() |
| | | |
| | | if(TARGET morpheus-package-install) |
| | | list(APPEND extra_args "PYTHON_DEPENDENCIES" "morpheus-package-install") |
| | | endif() |
| | | |
| | | morpheus_utils_build_python_package(dlp_stages ${extra_args}) |
| | | |
| | | list(POP_BACK CMAKE_MESSAGE_CONTEXT) |
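The tail of the file above builds an `extra_args` list conditionally and forwards it to `morpheus_utils_build_python_package`. The pattern can be tried in isolation with a minimal sketch runnable via `cmake -P`. Note that `demo_build_python_package`, `DEMO_INPLACE_BUILD`, and `demo-package-install` are hypothetical stand-ins, and the keyword handling (`IS_INPLACE` as a flag, `PYTHON_DEPENDENCIES` as a multi-value argument) is an assumption inferred from how the arguments are passed, not the actual Morpheus utility's signature.

```cmake
cmake_minimum_required(VERSION 3.25)

# Hypothetical stand-in for the real packaging helper; it only parses and
# reports the keyword arguments it receives.
function(demo_build_python_package package_name)
  # Assumed keywords: IS_INPLACE (flag), PYTHON_DEPENDENCIES (multi-value).
  cmake_parse_arguments(PARSE_ARGV 1 ARG "IS_INPLACE" "" "PYTHON_DEPENDENCIES")
  message(STATUS "package: ${package_name}")
  message(STATUS "in-place build: ${ARG_IS_INPLACE}")
  message(STATUS "python dependencies: ${ARG_PYTHON_DEPENDENCIES}")
endfunction()

# Push a context label for messages, mirroring the file's use of "dlp".
list(APPEND CMAKE_MESSAGE_CONTEXT "demo")

# Start from an empty list and append only the options that apply.
set(extra_args "")

set(DEMO_INPLACE_BUILD ON) # hypothetical switch, stands in for MORPHEUS_PYTHON_INPLACE_BUILD
if(DEMO_INPLACE_BUILD)
  list(APPEND extra_args "IS_INPLACE")
endif()

# The original guards this with if(TARGET ...); targets do not exist in
# script mode, so the dependency is appended unconditionally here.
list(APPEND extra_args "PYTHON_DEPENDENCIES" "demo-package-install")

# Unquoted expansion splits the list back into separate arguments.
demo_build_python_package(demo_stages ${extra_args})

# Pop the context label so later messages are not tagged with it.
list(POP_BACK CMAKE_MESSAGE_CONTEXT)
```

Saving this as a script and running it with `cmake -P` prints the parsed values. Passing `--log-context` additionally shows the `demo` label pushed onto `CMAKE_MESSAGE_CONTEXT`.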