Skip to content

Use Dart library to read and write tar files #2817

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 5, 2021

Conversation

simolus3
Copy link
Contributor

Uses pkg:tar to read and write tar files. That package is based on stream transformers and generally tries to minimize memory overhead. It supports the ustar format and extended PAX headers.

For testing, I ran tool/extract_all_pub_dev.dart, but only with the latest version of each package. To test backwards-compatibility, I published some packages with pkg:tar to a personal pub server and downloaded them with the regular pub on Linux and Windows. I don't have access to a Mac, but I can run more tests on the other OSes if needed.

Full disclosure: I wrote the tar package.
Closes #1602 (and a bunch of related issues).

@google-cla google-cla bot added the cla: yes label Dec 27, 2020
@jonasfj
Copy link
Member

jonasfj commented Jan 5, 2021

@simolus3, this is AWESOME! 🚀

@walnutdust has some work in progress in a PR with a tar-stream reader like this in google/dart-neats#56
But we never quite finished it.. It modelled on the tar implementation from golang (not a literal translation, but tried to follow what archive/tar does). I think one of the bugs we've had there was a broken ChunkedStreamIterator which was recently fixed :D

I wonder if we could combine efforts?

  • We should at a minimum be able to reuse the test cases 🎉
  • The approach using ChunkedStreamIterator is perhaps a bit more readable, so it might be worth exploring..
  • In either case, both of these implementations might need more tests wrt. special characters and the like 🙈

@walnutdust, do you recall why we went away from making transforming a Stream<List<int>> into a Stream<TarEntry>? Was it to do with cancelling or was it just because it would be harder to use?

@simolus3
Copy link
Contributor Author

simolus3 commented Jan 5, 2021

Edit: I was a bit slow to see your email, I'll reply in more detail there.

I didn't know about the existing effort (and sorry for snatching the package name :D).

The approach using ChunkedStreamIterator is perhaps a bit more readable

That's true, but I like that tar.reader can be a StreamTransformer transparently handling pauses/resumes. I took the approach from package:mime, IMO this is pretty convenient to use in an await for loop. I have some ideas to make the implementation more readable too.

I'm not sure what the best way forward is now. I can open a PR to add my writer to @walnutdust's implementation and transfer the pub package to you when that gets merged. I need a tar package that works without dart:io though. I also want to use it for untrusted inputs, so some guards against memory DoS attacks with huge PAX headers would be nice.
I can also continue to work on my tar implementation and add the test cases you suggested. It has the features I need, but the reader doesn't deal with sparse files or PAX Headers for modification times yet. I could copy that code from @walnutdust and give credit if that's ok.
I slightly prefer the second approach, but I'm happy to help with the other implementation too if that's easier for you.

@walnutdust
Copy link
Contributor

walnutdust commented Jan 6, 2021

@jonasfj

do you recall why we went away from making transforming a Stream<List> into a Stream? Was it to do with cancelling or was it just because it would be harder to use?

This might have been because we were trying to move away from StreamControllers to async generators, and with async generators the next loop iteration is run again before the first one necessary conflicts. @simolus3 found a great solution through sync StreamController, and it looks like it probably works well.

@simolus3 I think making tar.reader be a StreamTransformer is pretty cool, and the additional concerns you have are valid! I would encourage possibly having more tests with regards to ensuring the transformed stream has the expected behavior even when passed on to other stream users (e.g. StreamIterator). We found certain bugs in ChunkedStreamIterator's use of StreamControlleronly when we stress tested the output stream in cases outside of an await-for loop aha Streams are really not easy 🙊. I'm worried, for example, about how code like this chunk will interact.

If we are certain that the relevant Stream logic is correct, I think the main concerns remaining are:

  1. Interface preference - I personally don't have a preference
  2. Maintainability + Integration into pub - @jonasfj can better advise on this, especially with google's requirements for mirroring code.
  3. What .tar formats are we supporting? This shouldn't be too much of a concern either way, since we can reference either my or golang's or other implementations to see how they process the various tar formats.
  4. More test cases - feel free to grab/modify from my PR

@sigurdm
Copy link
Contributor

sigurdm commented Feb 1, 2021

I have tried unpacking all archives of pub.dev with this code and the code currently in pub,.dev (on linux), then running diff -rq on the results.

I found no differences (after fixing for empty directories) except for obscure cases like page_turn-1.0.1.tar.gz/example/test/widget_test.dart that ends up with no permissions when unpacking with gnu tar but seems fine with this code!

Co-authored-by: Sigurd Meldgaard <[email protected]>
@sigurdm
Copy link
Contributor

sigurdm commented Feb 4, 2021

If @simolus3 and @jonasfj have nothing more I think this is ready to merge!

Copy link
Member

@jonasfj jonasfj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nits.

I still think we should forbid symlinks, but that's something we can do on the server. That way we can roll it back if people decide we really need it.

@@ -216,6 +215,10 @@ Future<String> _createFileFromStream(Stream<List<int>> stream, String file) {
});
}

void _chmod(int mode, String file) {
runProcessSync('chmod', [mode.toRadixString(8), file]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should use /bin/chmod if it exists, so we are not affected by users tweaking PATH.
According to FHS 3.4.2 /bin/chmod should always exist on Linux.

I'm guessing it has a fixed location on Mac OS X as well. I think we should prefer the fixed location, if it exists that way we are not affected by changes to PATH -- I think we did this when using tar :D

Perhaps we should also validate the mode and check the exit code from running chmod.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See:

/// Find a tar. Prefering system installed tar.
///
/// On linux tar should always be /bin/tar [See FHS 2.3][1]
/// On MacOS it seems to always be /usr/bin/tar.
///
/// [1]: https://refspecs.linuxfoundation.org/FHS_2.3/fhs-2.3.pdf
String _findTarPath() {
  for (final file in ['/bin/tar', '/usr/bin/tar']) {
    if (fileExists(file)) {
      return file;
    }
  }
  log.warning(
      'Could not find a system `tar` installed in /bin/tar or /usr/bin/tar, '
      'attempting to use tar from PATH');
  return 'tar';
}

Let's do that to find chmod, just in case someone decorates their shell with a different chmod 🙈

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could perhaps also use dart:ffi to bind the syscall... But we a little care has to be taken that we don't bind symbols only exposed because the Linux build is in release mode and not product mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be /bin/chmod on macOS too, so that should work.

Using dart:ffi sounds interesting (especially because we could check the value of umask at runtime too). AFAIK Dart doesn't support linking libc statically, so the symbols should be available

final resolvedTarget = path.joinAll(
[parentDirectory, ...path.posix.split(entry.header.linkName)]);
if (!path.isWithin(destination, resolvedTarget)) {
// Don't allow links to files outside of this tar.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should print a warning when we ignore thing. I think it's the right call here.

lib/src/io.dart Outdated
TarHeader(
// Ensure paths in tar files use forward slashes
name: path.url.joinAll(path.split(relative)),
mode: stat.mode,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should do: mode: defaultMode | (stat.mode & executableMask);

final file = File(path.normalize(entry));
final stat = file.statSync();

return TarEntry(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it allowed to make file entries without first having the directory entries.

If allowed by tar format, then I think just having files is preferable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any documentation stating whether it is acceptable to exclude directory entries or not.

GNU tar has explicit support for it though (it recovers from ENOENT by creating parent directories), and I'm not aware of any tar client that doesn't support this.

final relative = path.relative(entry, from: baseDir);
// On Windows, we can't open some files without normalizing them
final file = File(path.normalize(entry));
final stat = file.statSync();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test if this is a symlink... If so maybe we should print a warning line that the symlink will be embedded as a file, and NOT a link.

IMO, it's probably preferable to keep the format simple and avoid fancy features like symlinks in distributed packages.

@sigurdm sigurdm merged commit 9a70949 into dart-lang:master Feb 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use a Dart-only library for archives instead of the 'tar' executable.
4 participants