[DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517][External] Multi-file push
#923
Conversation
```python
# Need to add a `data` attribute for MultiFileItem? Where do we get `fps` from?
def serialize_v2(self):
    optional_properties = ["fps"]
```
`optional_properties` only contains 1 item for now, but this is to mirror how the `serialize_v2()` method works for `LocalFile`. If we add support for others such as `as_frames` or `extract_views` in the future, this will be easy to extend
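For what that extension could look like, here's a minimal sketch assuming `as_frames` and `extract_views` attributes mirroring `LocalFile` (the class name and base payload are placeholders, not the PR's actual implementation):

```python
class MultiFileItemSketch:
    """Hypothetical stand-in for MultiFileItem, for illustration only."""

    def __init__(self, fps=None, as_frames=None, extract_views=None):
        self.fps = fps
        self.as_frames = as_frames
        self.extract_views = extract_views

    def serialize_v2(self):
        # Mirror LocalFile: include each optional property only when it is set.
        optional_properties = ["fps", "as_frames", "extract_views"]
        data = {}
        for prop in optional_properties:
            value = getattr(self, prop, None)
            if value is not None:
                data[prop] = value
        return data

# MultiFileItemSketch(fps=25).serialize_v2() -> {"fps": 25}
```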
```python
single_file_items = self.local_files
upload_payloads = []
if self.multi_file_items:
    upload_payloads.extend(
        [
            {
                "items": [file.serialize_v2() for file in file_chunk],
                "options": {"ignore_dicom_layout": True},
            }
            for file_chunk in chunk(self.multi_file_items, chunk_size)
        ]
    )
    local_files_for_multi_file_items = [
        file
        for multi_file_item in self.multi_file_items
        for file in multi_file_item.files
    ]
    single_file_items = [
        file
        for file in single_file_items
        if file not in local_files_for_multi_file_items
    ]

upload_payloads.extend(
    [
        {"items": [file.serialize_v2() for file in file_chunk]}
        for file_chunk in chunk(single_file_items, chunk_size)
    ]
)
```
We do it this way so that, if we want to develop a more complex multi-file item upload feature in the future, a single `push()` operation is capable of uploading both single-file items and multi-file items
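To make that concrete, a hedged sketch of what `upload_payloads` could contain for a push that mixes both kinds of item (the item dicts are placeholders, not the real serialized shape):

```python
upload_payloads = [
    # One payload per chunk of multi-file items, carrying the DICOM option:
    {
        "items": [{"path": "/", "slots": ["0", "1"]}],  # placeholder serialized item
        "options": {"ignore_dicom_layout": True},
    },
    # One payload per chunk of the remaining single-file items, with no options:
    {"items": [{"path": "/", "name": "image_1.jpg"}]},  # placeholder serialized item
]
```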
I think it is safe to default `ignore_dicom_layout` to `true`.
```python
filtered_files = [
    f for f in found_files if str(f) not in files_to_exclude_full_paths
]
if sort:
    return natsorted(filtered_files)
return filtered_files
```
We do this because, when uploading multi-file items, the order of the files is important. Natural sorting seems like the more appropriate approach to me
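A quick sketch of the difference, with hypothetical slice names:

```python
from natsort import natsorted

files = ["slice_10.dcm", "slice_2.dcm", "slice_1.dcm"]
print(sorted(files))     # ['slice_1.dcm', 'slice_10.dcm', 'slice_2.dcm'] - lexicographic
print(natsorted(files))  # ['slice_1.dcm', 'slice_2.dcm', 'slice_10.dcm'] - natural order
```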
darwin/cli_functions.py
Outdated
```python
    item.reason,
)
for slot in item.slots:
    if slot["reason"] != "ALREADY_EXISTS":
```
This check is different from what we do at line 802. Do we expect reason to always be defined and uppercase at this point?
From looking at the relevant BE section, yes - every blocked item will have a reason, and all the reasons are uppercase. However, I don't see a downside to introducing handling for potentially missing or lowercase reasons, so I can adjust this section to function as above
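A minimal sketch of that defensive check, assuming `slot` is a plain dict from the API response:

```python
slot = {"reason": "already_exists"}  # example slot from an API response

# Don't assume the reason is always present or uppercase.
reason = slot.get("reason")
if reason is not None and reason.upper() == "ALREADY_EXISTS":
    print("slot already exists; skipping")
```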
darwin/cli_functions.py
Outdated
```python
other_skipped_items.append(item)
for slot in item.slots:
    if (slot["reason"] is not None) and (
        slot["reason"].upper() == "ALREADY_EXISTS"
```
I know this was pre-existing, but with strings such as "ALREADY_EXISTS" and "UPLOAD_REQUEST" it's too easy to make a typo. I usually create a var to contain such constants.
Good point. I'll move reasons into a const based on their BE values
Also, "UPLOAD_REQUEST" doesn't appear to be a BE constant. Instead, it looks like an arbitrarily chosen string to represent the stage of the upload where the error occurred
```python
merge_incompatible_args = {
    "preserve_folders": preserve_folders,
    "as_frames": as_frames,
    "extract_views": extract_views,
}
```
Also here, IMO, high risk of typo. I'd:

```python
# Define constants for the keys
PRESERVE_FOLDERS_KEY = "preserve_folders"
AS_FRAMES_KEY = "as_frames"
EXTRACT_VIEWS_KEY = "extract_views"

# Use the constants in the dictionary
merge_incompatible_args = {
    PRESERVE_FOLDERS_KEY: preserve_folders,
    AS_FRAMES_KEY: as_frames,
    EXTRACT_VIEWS_KEY: extract_views,
}
```
darwin/dataset/remote_dataset_v2.py
Outdated
```python
    return results

def _find_files_to_upload_merging(
```
nit: `_find_files_to_upload_with_merging` - I think it reads a bit better.
Looking at the function below, this could also be `_find_files_to_upload_as_multislot`?
Addressed in a comment below; there are better, more generic names I'll use
darwin/dataset/remote_dataset_v2.py
Outdated
```
Finds files to upload as either:
- Multi-slotted items
- Multi-channel items
- Single-slotted items containing multiple `.dcm` files
```
Is this too specific? In case we add an additional modality, we'll have to modify the doc as well. To be fair, not a big deal, but it's easy to forget to update the doc. We could say:
find files to upload according to the `item_merge_mode`?
darwin/dataset/remote_dataset_v2.py
Outdated
```python
    return local_files, multi_file_items

def _find_files_to_upload_no_merging(
```
`_find_files_to_upload_as_singleslot` reads a bit better. I usually avoid defining something by negating something else.
How about naming the pair of functions `_find_files_to_upload_as_single_file_items` and `_find_files_to_upload_as_multi_file_items`?
This is generic enough that we can add modalities in the future
darwin/dataset/upload_manager.py
Outdated
```
Creates the layout to be used when uploading the files as a dataset item:
- For multi-slotted items: LayoutV2
- For series items: LayoutV2, but only with `.dcm` files
- For multi-channel items: LayoutV3
```
NB: LayoutV3 can also be used to upload multi-slot items and series. Maybe it's easier to use a single layout version below?
If we want to maintain the ability to upload files with >16 slots, then unfortunately I think we have to stick to LayoutV2. Confirming with the channels engineers
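For reference, a hedged sketch of the two layout shapes under discussion - the V3 `slots_grid` key matches a later hunk in this PR, while the V2 shape here is an assumption for illustration:

```python
# LayoutV2: one named slot per file; not capped at 16 slots (shape assumed).
layout_v2 = {
    "version": 2,
    "type": "grid",
    "slots": ["file_1.dcm", "file_2.dcm"],
}

# LayoutV3: used for the multi-channel mode; subject to the 16-file limit below.
layout_v3 = {
    "version": 3,
    "slots_grid": [[["channel_1.png", "channel_2.png"]]],
}
```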
```python
raise ValueError(
    f"No multi-channel item can have more than 16 files. The following directory has {len(self.files)} files: {self.directory}"
)
```
Should we get this from the API response? So we don't have to update dpy when changing this on the BE?
Ideally, yes. The problem is: How do we get the limit from the API? Is there a call we can make to get the limit?
My understanding is that there isn't a limit at the moment
There isn't - basically dpy tries to upload and gets an error from the BE. We could argue it's late in terms of UX, but I have the feeling the limitation will vary in the future, and I'm not sure it's a good use of our time to update dpy every time for such a scenario 🤔
I see 2 difficulties with this:

- If we get an upload error from the BE that we've exceeded the limit, how do we know what the limit is? Is the error guaranteed to include the limit?
- This section of the code, where we're creating each `MultiFileItem`, isn't where we perform the upload. We construct all the `MultiFileItem`s first, then move to upload afterward. This means that if one item exceeds the limit but others don't, we could end up with a partially completed upload at the point of error
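A runnable sketch of that ordering, with a stand-in class (the real constructor signature may differ):

```python
from pathlib import Path

class MultiFileItemSketch:
    """Stand-in for MultiFileItem, to illustrate the construct-then-upload order."""

    MAX_CHANNEL_FILES = 16  # illustrative constant

    def __init__(self, directory: Path, files: list):
        if len(files) > self.MAX_CHANNEL_FILES:
            raise ValueError(
                f"No multi-channel item can have more than {self.MAX_CHANNEL_FILES} "
                f"files. The following directory has {len(files)} files: {directory}"
            )
        self.directory, self.files = directory, files

# All items are constructed (and validated) before any upload begins, so one
# oversized directory aborts the whole push with nothing partially uploaded.
files_by_directory = {Path("series_a"): ["1.dcm", "2.dcm"], Path("series_b"): ["1.dcm"]}
items = [MultiFileItemSketch(d, fs) for d, fs in files_by_directory.items()]
# Payload construction and the actual upload would only happen after this point.
```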
darwin/dataset/upload_manager.py
Outdated
| "slots_grid": [[[file.local_path.name for file in self.files]]], | ||
| } | ||
|
|
||
| def serialize_v2(self): |
Does v2 refer to the layout version? 'Cause I also see the CHANNELS mode managed below, and that is only v3
v2 here refers to version 2 of Darwin JSON - it's leftover from when darwin-py used to support Darwin JSON 1.0 as well. The function can be renamed to `serialize_darwin_json_v2` or something similar?
Yeah that could work 💯
darwin/dataset/upload_manager.py
Outdated
| f"Cannot match {item.full_path} from payload with files to upload" | ||
| for slot in item.slots: | ||
| upload_id = slot["upload_id"] | ||
| slot_path = (Path(item.path) / Path((slot["file_name"]))).as_posix() |
Is `slot` a `TypedDict`? In that case we shouldn't be able to have a typo in `file_name`
`slot` is part of a raw API response, so a typo shouldn't be possible, but I'm not sure I follow - could you expand?
If you type "file_nameS" would the linter tell you that it's not a valid value or would we get an error only at runtime?
Changed - I've put in 2 classes for better type safety of API responses
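For illustration, a hedged sketch of what those classes could look like using `TypedDict` (class and field names are assumptions based on the fields referenced in this thread):

```python
from typing import List, Optional, TypedDict

class SlotResponse(TypedDict, total=False):
    upload_id: str
    file_name: str
    reason: Optional[str]

class ItemResponse(TypedDict):
    path: str
    slots: List[SlotResponse]

# A type checker now flags slot["file_nameS"] as an invalid key instead of
# letting it surface as a KeyError at runtime.
```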
Problem

Currently, `push` only supports single-slotted, single-file items. It should support multi-slotted, multi-channel, and DICOM series items.

Solution

Support for multi-file items is added through the `item_merge_mode` argument of the `push` command. Details below:

- `item_merge_mode: Enum["slot", "series", "channel"]`
- It is available through both CLI & SDK-initiated `push()`

Because it is incompatible with `preserve_folders`, we raise an error if `preserve_folders` is set:

```python
dataset.push(files_to_upload=files_to_upload, item_merge_mode=item_merge_mode, preserve_folders=True)
```

Given this, we throw 1 non-blocking warning for each instance of:

- A `push()` that references a specific file rather than a directory, since these are ignored
- A directory that, after applying the `item_merge_mode` rules, results in an empty directory

Behaviour of `item_merge_mode` options

Each file path passed to `push` must be a directory. The files inside the 1st level of that directory are treated as follows:

- Multi-slotted items: each file becomes a separate slot of the same item
- Multi-channel items: each file becomes a separate channel of the same item
- DICOM series: every `.dcm` file is concatenated into the same slot. We pass the `ignore_dicom_layout` option to ensure they are concatenated, even if multiple volumes are detected

Changelog

Added `push` support for multi-file items, specifically:

- Multi-slotted items
- Multi-channel items
- DICOM series of concatenated `.dcm` files
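A hedged usage sketch of the new argument via the SDK (the team/dataset slug and directory path are placeholders):

```python
from darwin.client import Client

client = Client.local()  # assumes credentials are already configured locally
dataset = client.get_remote_dataset("my-team/my-dataset")  # placeholder slug

# Upload the directory as one DICOM series item; its .dcm files are
# concatenated into a single slot.
dataset.push(files_to_upload=["path/to/series_dir"], item_merge_mode="series")
```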