@erickpeirson erickpeirson commented Jun 5, 2019

The initial implementation of the file manager service focused on replicating the behavior of the legacy file checks, and so closely reproduced much of the logic and abstractions. This implementation was quite fast in local tests, correct, and had excellent test coverage over known problematic uploads.

Problem: EFS I/O is too slow to run checks

EFS I/O limits scale with overall volume size, which makes things difficult from the start. But even if we provision a high level of I/O ahead of time, there are too many low-level reads and writes to perform file checks on EFS directly. A modestly sized tarball could take around 12 seconds to unpack and check, which is not acceptable.

So we decided to refactor the application to perform checks on fast local storage (e.g. in memory), and to move files over to EFS only at the end of the request. This keeps EFS I/O to a minimum.
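As a rough illustration of the check-locally-then-move idea, here is a minimal sketch. The function name, the placeholder emptiness check, and the use of a temporary directory as the fast local volume are all assumptions made for the example; they are not taken from the file manager code.

```python
import shutil
import tempfile
from pathlib import Path


def check_then_persist(upload: bytes, filename: str, permanent_dir: Path) -> Path:
    """Write and check an upload on local storage, then move it exactly once."""
    with tempfile.TemporaryDirectory() as quarantine:
        local_path = Path(quarantine) / filename
        local_path.write_bytes(upload)

        # Placeholder check (hypothetical): every read hits the local copy,
        # never the slow permanent volume.
        if local_path.stat().st_size == 0:
            raise ValueError(f'{filename} is empty')

        permanent_dir.mkdir(parents=True, exist_ok=True)
        # A single move at the end is the only write to the permanent volume.
        moved = shutil.move(str(local_path), str(permanent_dir / filename))
        return Path(moved)
```

A file that fails its checks never touches the permanent volume, because the exception is raised before the move.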

Problem: the core checking and file-shuffling routines were too complex

Because the initial implementation stuck closely to the logic of the legacy system, the file management and checking logic had grown to nearly 3,000 lines, largely contained in a single class, with really long methods doing the actual checking. This made it hard to understand what checks were being performed, and hard to make changes with predictable effects.

Since we were going to refactor to address the I/O issues, we decided to go ahead and perform an initial refactor for clarity, composability, and extensibility.

Refactor

Here is an overview of the refactor in 2019-06:

  • Abstracted away the underlying filesystem logic by encapsulating it in adapters. A SimpleStorageAdapter implements logic that is close to the original implementation (on a single volume). A QuarantineStorageAdapter implements the two-volume logic needed to do fast checks before shipping data to EFS. The storage adapters implement an API/protocol that is formalized in the domain.

  • Separated the file/workspace checks from the upload workspace class itself using multiple-dispatch (something like the visitor pattern). The advantage of this pattern is that we can add more checks, reorder the checks, etc, without creating an even more enormous class. We can also test them separately, which we should start to do...

  • For clarity's sake, decomposed most of the UploadWorkspace into mixins that add sets of functionality.

  • Consolidated the storage of metadata about the workspace.

    • Previously, details about the workspace state, and about the files in the workspace, were generated on each request (indeed, all or most checks were re-run on each request). In addition, a database row was maintained with some metadata about the last run through the checks.
    • In the reimplemented checks, file-level checks are only run once per file, and workspace-level checks are only run after files are added or deleted.
    • In the new implementation, the database service has the primary responsibility for loading metadata about the workspace, and initializing the UploadWorkspace with its storage adapter. This means that loading the workspace requires only a single function call, and there is no need to keep track of both a filesystem-based workspace object and a database object (both of which represent the workspace) in the controllers.
  • Since the core classes were significantly altered, much of the work involved modifying the extensive test suite to use the new internal APIs, and modifying the controllers to leverage the new patterns in the domain and processes. Since this involved going through the test suite one case at a time, I took the liberty of breaking things up into more manageable pieces:

    • Split controllers out into submodules, for easier navigation.
    • Split up super long test routines in the API tests into separate modules (in tests/test_api/), smaller TestCases, and smaller test_ methods. Hopefully this makes it easier to find relevant tests and also see what specifically is being tested.
  • Removed some cruft that wasn't being used and wasn't likely to be used
    (async boilerplate, etc).
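To make the separation of checks from the workspace more concrete, here is a minimal sketch of one way such dispatch could look. The classes below are simplified stand-ins written for this example: the check_<file_type> naming convention, the size_bytes field, and the two example checkers are assumptions, not the actual internal API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UploadedFile:
    """Simplified stand-in for a file in the workspace."""

    path: str
    file_type: str
    size_bytes: int
    errors: List[str] = field(default_factory=list)


class BaseChecker:
    """Route each file to check_<file_type> if defined, else to check."""

    def __call__(self, workspace: 'Workspace', u_file: UploadedFile) -> None:
        handler = getattr(self, f'check_{u_file.file_type}', self.check)
        handler(workspace, u_file)

    def check(self, workspace: 'Workspace', u_file: UploadedFile) -> None:
        """Fallback for file types without a dedicated method."""


class FlagHiddenFiles(BaseChecker):
    """Applies to every file type via the fallback method."""

    def check(self, workspace: 'Workspace', u_file: UploadedFile) -> None:
        if u_file.path.rsplit('/', 1)[-1].startswith('.'):
            u_file.errors.append('hidden file')


class FlagEmptyTeX(BaseChecker):
    """Applies only to files whose type is 'TEX'."""

    def check_TEX(self, workspace: 'Workspace', u_file: UploadedFile) -> None:
        if u_file.size_bytes == 0:
            u_file.errors.append('empty TeX file')


@dataclass
class Workspace:
    """Holds files and an ordered list of checkers; knows no check logic."""

    checkers: List[BaseChecker]
    files: Dict[str, UploadedFile] = field(default_factory=dict)

    def add(self, u_file: UploadedFile) -> None:
        self.files[u_file.path] = u_file
        for checker in self.checkers:
            checker(self, u_file)
```

Adding, removing, or reordering checks then only means changing the list passed to the workspace, and each checker can be tested on its own with a throwaway workspace.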

DavidLFielding and others added 29 commits May 1, 2019 20:51
return self.path

@property
def name_sans_ext(self) -> str:
Contributor

Is this the same as os.path.basename? To be consistent should delete_file() become directory_sans_file()? The method name definitely caught my attention. I'm sure this is just fine.

Contributor Author

Slightly different from basename:

>>> os.path.basename('foo/bar.txt')
'bar.txt'
>>> os.path.splitext('foo/bar.txt')[0]
'foo/bar'

To be consistent should delete_file() become directory_sans_file()?

Not sure what you mean -- this is just generating a string based on the path. delete_file() has side-effects (namely, deleting a file)

Contributor

I was really just poking fun at the method name. I apologize.

Contributor Author

haha no worries :-)


- :class:`.ReadinessWorkspace`, which adds semantics around the "readiness"
of the workspace for use in a submission.
- :class:`.SingleFileWorkspace`, which adds the concept of a "single file
Contributor

We've discussed eventually allowing ancillary files with 'single-file' submissions. Does this implementation limit in any way the ability to provide a single-file submission with ancillary data files?

Contributor Author

Does this implementation limit in any way the ability to provide a single-file submission with ancillary data files?

No, because .file_count excludes ancillary files, and .is_single_file_submission is just looking at .file_count.
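A minimal sketch of that relationship, using hypothetical stand-in classes. The comparison against exactly one file is an assumption about how .is_single_file_submission uses .file_count.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    """Stand-in for an uploaded file; only the ancillary flag matters here."""

    path: str
    is_ancillary: bool = False


@dataclass
class Workspace:
    files: List[File] = field(default_factory=list)

    @property
    def file_count(self) -> int:
        # Ancillary files are excluded from the count.
        return sum(1 for f in self.files if not f.is_ancillary)

    @property
    def is_single_file_submission(self) -> bool:
        # Assumption: "single file" means exactly one non-ancillary file.
        return self.file_count == 1
```

Under these assumptions, a workspace with one source file plus any number of ancillary files still counts as a single-file submission.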

raise RuntimeError('Storage adapter is not set')
self.storage.makedirs(self, self.source_path)
self.storage.makedirs(self, self.ancillary_path)
self.storage.makedirs(self, self.removed_path)
Contributor

Do we need to create system directory to be consistent?

Contributor Author

Well, source_path is based on base_path, which is where the system files live. So it should exist by extension of these being created. I guess if you wanted to store system files somewhere else; hadn't considered that scenario. Do you think we should support this?

Contributor

Ahh. I saw new system directory and thought you were now sticking the system/checkpoint files there. My bad. I need to take another look at the code.

self.files.set(u_file.path, u_file)

@modifies_workspace()
def delete_workspace(self) -> bool:
Contributor

This class contains methods that 'alter files' according to class docstring. The scope of this method seems a bit beyond just altering files. Maybe just inaccurate description or maybe it's in the wrong place.

Contributor Author

Agree it was misleading; fixed in 3d5ca4b

return None


# QUESTION: is this still relevant, given that FM service is decoupled from
Contributor

At the current time if you provide the compilation service with a submission containing the missing font file AutoTeX basically aborts. So you could eliminate this check and let the compilation service generate an error. One could argue that many of the checks in the FM service are related to compilation and belong in the compilation service.

logger = logging.getLogger(__name__)


class CheckForMissingReferences(BaseChecker):
Contributor

This no longer represents current logic in the Legacy system and needs to be updated.

logger = logging.getLogger(__name__)


# TODO: implement this.
Contributor

Looks like you forgot to implement this functionality.

Contributor Author

These weren't implemented before the refactor, IIRC

Contributor

Right. That was a little wishful thinking on my part.

@@ -0,0 +1,56 @@
"""Check for TeX-generated files."""
Contributor

I'm wondering if this needs to be reworded along the lines of 'eliminate files generated from TeX compilation' to distinguish it more from TeX Produced. TeXGenerated makes me think of TeXProduced.

Contributor Author

fixed in 7545f3d

workspace.rename(child, new_path)
workspace.remove(entries[0], "Removed top level directory")
#
# # Set permissions
Contributor

Permissions have always been an issue. I expect you are dealing with these elsewhere.

workspace.create(dest, touch=False, is_ancillary=is_ancillary)


# TODO: Add support for compressed files.
Contributor

Will we accept Unix compressed files?

Contributor Author

Dunno. This TODO was there before the refactor IIRC

#
#
# # Upload a submission package and perform normal operations on upload
# def test_upload_files_normal(self) -> None:
Contributor

These normal/typical upload tests were definitely working in the pre-refactored FM service. This test basically scripted a series of typical requests that might be expected during a normal submission process. The idea was to combine different types of requests together in order to catch any inconsistencies in the final workspace state.

Contributor Author

Is this it?

# Upload a submission package and perform normal operations on upload

Contributor

I think these just migrated elsewhere. I seem to recall running into them elsewhere.

logger.setLevel(int(os.environ.get('LOGLEVEL', '20')))


class TestMissingReferences(TestCase):
Contributor

@DavidLFielding DavidLFielding Aug 12, 2019

These tests need revision as the bib/bbl logic has changed. (Ignore my comments on bib/bbl handling if you've already updated the internal logic per recent changes)

@@ -0,0 +1,344 @@
"""Tests the application via its API, using the QuarantineStorageAdapter."""
Contributor

This test appears to specifically test the new quarantine storage adapter. Would it make sense to run all existing tests against each of the storage adapter instances? Could you set the storage adapter, run all of the existing tests, set a different storage adapter, and run all of the tests again? My sense is if we are changing the underlying storage adapters we should run an exhaustive series of tests that are not specifically tailored to the storage adapters. Then again we probably need a few tests tailored to the storage adapters. Just my two cents here.
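One possible way to run the same tests against each adapter is a shared test mixin, sketched below with unittest. The two adapter classes are hypothetical in-memory stand-ins with an invented write/read/persist API; they are not the real SimpleStorageAdapter or QuarantineStorageAdapter.

```python
import unittest
from typing import Dict


class SimpleAdapter:
    """Hypothetical single-volume adapter: writes land in one store."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def write(self, path: str, content: bytes) -> None:
        self._store[path] = content

    def read(self, path: str) -> bytes:
        return self._store[path]


class StagedAdapter:
    """Hypothetical two-volume adapter: writes are staged, then persisted."""

    def __init__(self) -> None:
        self._staged: Dict[str, bytes] = {}
        self._permanent: Dict[str, bytes] = {}

    def write(self, path: str, content: bytes) -> None:
        self._staged[path] = content

    def read(self, path: str) -> bytes:
        if path in self._staged:
            return self._staged[path]
        return self._permanent[path]

    def persist(self) -> None:
        self._permanent.update(self._staged)
        self._staged.clear()


class AdapterContract:
    """Adapter-agnostic tests; subclasses only choose the adapter class."""

    adapter_class: type

    def setUp(self) -> None:
        self.adapter = self.adapter_class()

    def test_round_trip(self) -> None:
        self.adapter.write('main.tex', b'\\documentclass{article}')
        self.assertEqual(self.adapter.read('main.tex'),
                         b'\\documentclass{article}')

    def test_missing_file(self) -> None:
        with self.assertRaises(KeyError):
            self.adapter.read('nope.tex')


class TestSimpleAdapter(AdapterContract, unittest.TestCase):
    adapter_class = SimpleAdapter


class TestStagedAdapter(AdapterContract, unittest.TestCase):
    adapter_class = StagedAdapter

    def test_persist_keeps_content(self) -> None:
        # Adapter-specific test that lives alongside the shared ones.
        self.adapter.write('main.tex', b'x')
        self.adapter.persist()
        self.assertEqual(self.adapter.read('main.tex'), b'x')
```

Because AdapterContract does not inherit from unittest.TestCase, the test loader only collects the shared tests through the concrete subclasses, once per adapter.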

Contributor

@DavidLFielding DavidLFielding left a comment

I managed to at least glance at most of the files in the PR. Hopefully, my comments are useful in some way. I may check a few more things if I have time but plan to move on to other tasks tomorrow.

At this point, I have not run the service beyond exercising the tests, which all passed.

@DavidLFielding
Contributor

I noticed some formatting issues in my checkpoint PR, so on a whim I decided to run some checks on File Manager.

pylint filemanager
...

Your code has been rated at 8.81/10 (previous run: 8.81/10, +0.00)

Yikes! I hear screams off in the distance.

There seem to be a huge number of trivial continuation-indentation and unused-argument warnings.

Yet there are still a large number of errors that seem like they need to be addressed:

ex.
filemanager/domain/index.py:35: [E1137(unsupported-assignment-operation), FileIndex.set] 'self.system' does not support item assignment
...
filemanager/domain/index.py:47: [E1135(unsupported-membership-test), FileIndex.contains] Value 'self.system' doesn't support membership test
...
filemanager/domain/index.py:60: [E1136(unsubscriptable-object), FileIndex.get] Value 'self.system' is unsubscriptable

Lots of these...

filemanager/domain/uploads/stored.py:101: [W0613(unused-argument), IStorageAdapter.get_full_path] Unused argument 'workspace'
filemanager/domain/uploads/stored.py:102: [W0613(unused-argument), IStorageAdapter.get_full_path] Unused argument 'u_file'
...

filemanager/domain/uploads/file_mutations.py:384: [W0201(attribute-defined-outside-init), SourceLog.post_init] Attribute '_logger' defined outside init
filemanager/domain/uploads/file_mutations.py:385: [W0201(attribute-defined-outside-init), SourceLog.post_init] Attribute '_f_path' defined outside init
filemanager/domain/uploads/file_mutations.py:386: [W0201(attribute-defined-outside-init), SourceLog.post_init] Attribute 'file_handler' defined outside init

Here at arXiv we aim for a Pylint target score of 9/10. :^) wink wink

I realize you are extremely busy.

Would you like me to start eliminating the trivial sorts of Pylint errors? I'm not sure what's going on with so many unused argument errors and whether there is an easy global fix. Some of the other errors might be faster for you to resolve (if you think this is important).

@DavidLFielding
Contributor

While I'm at it I ran 'pydocstyle --convention=numpy --add-ignore=D401'

There are nearly 60 "Missing docstring" errors. Many of these are located in the newer file_mutations.py/storage.py code. Don't scream yet.

mypy generates a large number of errors. Most of these are trivial and easily correctable:

easy ones...lots of these...

filemanager/domain/tests/test_workspace.py:16: error: Function is missing a return type annotation
filemanager/domain/tests/test_workspace.py:16: note: Use "-> None" if function does not return a value

while some of these other errors might warrant a closer look:

filemanager/domain/uploads/stored.py:223: error: "IStorageAdapter" has no attribute "get_size_bytes"
filemanager/domain/uploads/stored.py:247: error: Returning Any from function declared to return "bytes"
filemanager/domain/uploads/translatable.py:77: error: "Type[datetime]" has no attribute "fromisoformat"
filemanager/domain/uploads/translatable.py:109: error: "TranslatableWorkspace" has no attribute "lastupload_readiness"
filemanager/domain/uploads/translatable.py:110: error: "TranslatableWorkspace" has no attribute "status"
...

along with several of the type mismatch flavor...

filemanager/domain/uploads/checkpoint.py:307: error: Argument 1 to "_update_from_checkpoint" of "CheckpointWorkspace" has incompatible type "TranslatableWorkspace"; expected "CheckpointWorkspace"
...

You may want to take a quick look to see if any of these need to be addressed.

Tomorrow I will fire up the submission UI suite and test the FM as I resume work on alpha tests.

@erickpeirson
Contributor Author

Would you like me to start eliminating the trivial sorts of Pylint errors? I'm not sure what's going on with so many unused argument errors and whether there is an easy global fix. Some of the other errors might be faster for you to resolve (if you think this is important).

Good catches. Some of these are really things that need to be fixed, and some are not.

For example, the unsupported-assignment-operation and related messages on those dicts are spurious, because pylint doesn't know that dataclasses is doing magic with field(...). So some of those can be silenced with inline # pylint: disable= comments. I also added the following to the list of disabled messages in .pylintrc:

  • unused-argument -- most of this comes from the use of protocols, and the standardized calling API of checker methods.
  • duplicate-code -- this is largely coming from tests. May be a better way to do this, not an expert on pylint config.
  • no-self-use -- this largely comes from the checkers; pylint doesn't know what patterns are being used here, so this is spurious.
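For illustration, here is a minimal sketch of the dataclass pattern behind those spurious messages. The class and method names mirror the pylint output above, but the signatures and value types are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FileIndex:
    # At runtime this is an ordinary dict created per instance. A linter that
    # does not understand dataclasses may instead see the object returned by
    # field(...), which supports none of the dict operations used below.
    system: Dict[str, str] = field(default_factory=dict)

    def set(self, path: str, value: str) -> None:
        self.system[path] = value  # pylint: disable=unsupported-assignment-operation

    def contains(self, path: str) -> bool:
        return path in self.system  # pylint: disable=unsupported-membership-test

    def get(self, path: str) -> str:
        return self.system[path]  # pylint: disable=unsubscriptable-object
```

The inline comments silence only the flagged lines, so the same messages still fire elsewhere if they point at a real problem.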

@erickpeirson
Contributor Author

erickpeirson commented Aug 14, 2019

mypy generates a large number of errors. Most of these are trivial and easily correctable:

I'd recommend using ./tests/type-check.sh filemanager rather than running mypy directly (maybe you are already doing this). I had a tee in the wrong spot in that script, so it was printing ignored errors (the ones in the tests) even though ultimately those aren't counted against the final result. That being said, I am still seeing errors (which is somewhat disheartening, because I really thought that I had fixed all of these):

$ ./tests/type-check.sh filemanager
filemanager/domain/uploads/stored.py:225: error: "IStorageAdapter" has no attribute "get_size_bytes"
filemanager/domain/uploads/stored.py:249: error: Returning Any from function declared to return "bytes"
filemanager/domain/uploads/translatable.py:77: error: "Type[datetime]" has no attribute "fromisoformat"
filemanager/domain/uploads/translatable.py:109: error: "TranslatableWorkspace" has no attribute "lastupload_readiness"
filemanager/domain/uploads/translatable.py:110: error: "TranslatableWorkspace" has no attribute "status"
filemanager/domain/uploads/translatable.py:111: error: "TranslatableWorkspace" has no attribute "lock_state"
filemanager/domain/uploads/translatable.py:112: error: "TranslatableWorkspace" has no attribute "source_type"
filemanager/domain/uploads/translatable.py:141: error: "Type[TranslatableWorkspace]" has no attribute "Status"
filemanager/domain/uploads/translatable.py:142: error: "Type[TranslatableWorkspace]" has no attribute "LockState"
filemanager/domain/uploads/translatable.py:143: error: "Type[TranslatableWorkspace]" has no attribute "SourceType"
filemanager/domain/uploads/translatable.py:152: error: "Type[TranslatableWorkspace]" has no attribute "Readiness"
filemanager/domain/uploads/checkpoint.py:307: error: Argument 1 to "_update_from_checkpoint" of "CheckpointWorkspace" has incompatible type "TranslatableWorkspace"; expected "CheckpointWorkspace"
filemanager/process/strategy.py:34: error: Function is missing a return type annotation
filemanager/process/strategy.py:34: note: Use "-> None" if function does not return a value
filemanager/process/strategy.py:52: error: Need type annotation for 'tasks'
filemanager/process/strategy.py:56: error: Function is missing a return type annotation
filemanager/process/strategy.py:61: error: Function is missing a return type annotation
filemanager/process/strategy.py:61: note: Use "-> None" if function does not return a value
filemanager/process/strategy.py:74: error: List comprehension has incompatible type List[UploadedFile]; expected List[Tuple[Any, ...]]
filemanager/process/strategy.py:75: error: Call to untyped function "await_completion" in typed context
filemanager/process/strategy.py:84: error: "IChecker" has no attribute "__iter__" (not iterable)
filemanager/process/strategy.py:95: error: No return value expected
filemanager/process/check/file_type.py:399: error: Argument 1 to "_type_of_latex2e" has incompatible type "UploadedFile"; expected "BytesIO"
filemanager/process/check/file_type.py:435: error: Incompatible types in assignment (expression has type "bytes", variable has type "str")
filemanager/process/check/file_type.py:464: error: Argument 1 to "_type_of_latex2e" has incompatible type "IO[Any]"; expected "BytesIO"
filemanager/process/check/cleanup.py:171: error: Argument 1 to "findall" of "Pattern" has incompatible type "mmap"; expected "bytes"
filemanager/process/check/cleanup.py:226: error: Argument 1 to "search" of "Pattern" has incompatible type "str"; expected "bytes"
filemanager/process/check/cleanup.py:233: error: Argument 1 to "search" of "Pattern" has incompatible type "str"; expected "bytes"
filemanager/process/check/cleanup.py:239: error: Argument 1 to "search" of "Pattern" has incompatible type "str"; expected "bytes"
filemanager/process/check/cleanup.py:247: error: Argument 1 to "search" of "Pattern" has incompatible type "str"; expected "bytes"
filemanager/process/check/cleanup.py:253: error: Unsupported operand types for + ("bytes" and "str")
filemanager/process/check/cleanup.py:256: error: Argument 1 to "search" of "Pattern" has incompatible type "str"; expected "bytes"
filemanager/controllers/checkpoint.py:122: error: "print_exc" does not return a value
filemanager/controllers/checkpoint.py:233: error: "print_exc" does not return a value
filemanager/controllers/checkpoint.py:282: error: "print_exc" does not return a value
filemanager/controllers/upload.py:101: error: Incompatible types in assignment (expression has type "None", variable has type "UploadWorkspace")
filemanager/controllers/upload.py:177: error: Argument 1 to "create" of "FileMutationsWorkspace" has incompatible type "Union[str, None, Any]"; expected "str"
filemanager/controllers/files.py:221: error: "UploadFileSecurityError" has no attribute "description"
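Several of the cleanup.py errors above are str-versus-bytes mismatches, which also fail at runtime: a pattern compiled from bytes raises TypeError when asked to search a str. The sketch below shows one way to normalize the input before searching. The pattern, the function name, and the UTF-8 encoding are assumptions for illustration, not the code in cleanup.py.

```python
import re
from typing import Union

# Compiled from a bytes literal, so it can only search bytes-like input.
DOCUMENTCLASS = re.compile(rb'\\documentclass')


def has_documentclass(content: Union[bytes, str]) -> bool:
    """Search with the bytes pattern, encoding str input first."""
    if isinstance(content, str):
        # Assumption: text input can safely be encoded as UTF-8.
        content = content.encode('utf-8')
    return DOCUMENTCLASS.search(content) is not None
```

Keeping file content as bytes from the point it is read, and decoding only for display, is another way to avoid the mismatch.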
