Skip to content

fastq::Reader: from_maybe_gzip_path: New instantiation. #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: v0.4
Choose a base branch
from

Conversation

wwood
Copy link

@wwood wwood commented Oct 18, 2019

Hi,

One thing I've not seen in biological rust-land is a kseq.h type code, where you can provide some method a file path (which might be FIFO like stdin) and it works out for you if it is fastq or fasta, and whether it is gzip or not.

This PR is an attempt at the "gzip or not" part for the fastq reader. It is unfinished (see TODOs in code) because insufficient thought has been given to the architecture here e.g. should a enum be returned, or a dyn fastq::Reader? I figure you might have some ideas already so I didn't want to go too far down this path without hearing them first.

Tests pass, but the code is not benchmarked. It isn't clear to me whether it is OK that the gzip reader is using the right size of buffer. Multithreading the gzip decompression might also speed things up.

Let me know what you think.
Thanks, ben

@markschl
Copy link
Owner

markschl commented Oct 23, 2019

Thanks for your interest in this library and the pull request!

To be honest, I avoided this kind of functionality up to now, because there is several possibilities to implement this. I preferred letting users choose how to do it on their own. Still, I'm open to add a simple 'magic' function that suits most use cases (if possible), and possibly also checks for FASTQ vs. FASTA. Or, at least to make it as easy as possible to implement this functionality given specific needs. A few comments/thoughts:

  • I think your approach is a good start, and supplying the buf_redux::BufReader to flate2::bufread::GzDecoder is a good idea. GZIP decoding is probably always (?) fastest if filesystem I/O is buffered.
  • With STDIN, I think the situation is more complex. std::io::Stdin does not give access to the buffer, we need to call lock(). std:io::StdinLock implements BufRead, and a call to fill_buf() would give access to the buffer. However, StdinLock borrows from Stdin, and therefore, I think we could only return an object with static lifetime. Wrapping Stdin in a BufReader might have a negative impact on performance.
  • Your approach of checking the '@' byte is fine if you are sure it's FASTQ. An even better way would probably be to check for the presence of a GZIP header as in this function of fastq-rs. In this project (for which I originally created seq_io), I used the file path only to determine the format. Then, I supply Box<dyn std::io::Read> (which is either a File or STDIN, optionally wrapped in a decoder for a compression format) to the FASTQ / FASTA parsers.
  • For a magic function that checks for compression and recognizes the sequence format, I would prefer a trait-based approach, because this should supposedly also integrate with RecordSet and functions in the parallel module. I suspect, an enum based approach is maybe fine in a simple use case as in your code, but with more options it could quickly become complicated. I don't actually know, how these approaches would compare in a benchmark, the result may also depend on record size (sequence length). My guess is that differences are negligible, since most of the time is spent with sequence parsing.
  • If following a trait-based approach, seq_io::parallel::Reader trait could be replaced by a more general trait implemented by FASTA and FASTQ readers, possibly named seq_io::Read analogous to std::io naming, or seq_io::Reader. It would need some associated types (Record and RecordSet). In addition, there could be a Record trait, which is implemented by FASTA / FASTQ records. This would allow abstracting over sequence types, and a function like Record::qual() could return Option<&[u8]>, depending on whether there is quality information or not. Supporting multi-line FASTA is more complicated. Here I actually used an enum approach for this problem, not sure if that is the best/most elegant solution.
  • Decoding a GZIP file in a single background thread is for instance possible with another library I wrote. It's not 'real' multithreaded decoding, but at least it made decoding faster in most cases. Still, I guess it would be best, to default to a single-threaded solution in a general-purpose function.
  • Is it enough to support just GZIP, or should BZIP2 also be supported by such a function?
  • Such functionality could be feature-gated to avoid unnecessary dependencies...

Given the difficulties with STDIN, there could eventually be two functions, one that takes a file path (as in your code), and another, which takes a BufRead such as StdinLock, so the user would have to take care of the specifics. Both functions would return io::Result<Box<dyn seq_io::Read>>.

I hope that this makes sense.

Currently, I'm working on some code refactoring, but after that I'm happy to do something on this topic. And contributions are always welcome.

@markschl markschl force-pushed the master branch 2 times, most recently from 99648a1 to 358032d Compare October 19, 2020 11:27
@markschl
Copy link
Owner

I rewrote the whole library and added support for "FASTX" with two different approaches. GZIP recognition should be easy to implement now (for files, not stdin). I'm not sure yet whether it should be in the crate because there may be several ways of implementing it.
Comments welcome :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants