Description
It seems like we're coming across a few small "footguns", some common, some less so, so it may be nice to collect them in one place so that when writing something like #6 we don't forget anything.
This could happen in #6, but I'd prefer to keep it separate so this issue can dive a little deeper and actually make a list.
Off the top of my head from stuff that's come up today in the gitter and elsewhere:
- Reading `stdin` line by line is slow, because it does a heap alloc at each line. Easiest solution is `BufReader::read_line`
- Using `String` as an input type for arbitrary data that doesn't necessarily need to be UTF-8
- Using `println!` when writing lots of small output, which locks `stdout` at each call. Prefer locking `stdout` manually and using something like `write!`
- I think Silently dropped errors in std #28 probably qualifies too, although it's probably far less common. The workaround is simple, but not really intuitive IMO (calling `libc::close` manually, and checking for errors)
- `println!` can panic on a broken pipe
- `std::env::args` panics at invalid UTF-8 (the fix is to use `std::env::args_os`, but this is not very intuitive, and with issues like Ergonomics of String handling #26 it's far more friction than it should be)
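To make the first three items concrete, here's a minimal sketch of the recommended pattern: reuse one `String` buffer with `read_line` instead of allocating per line, and lock `stdout` once up front instead of letting `println!` re-lock it on every call. The `process_lines` helper and its names are illustrative, not from any existing crate:

```rust
use std::io::{self, BufRead, Write};

// Hypothetical helper: copies input to output line by line,
// reusing one buffer and writing through a single lock.
fn process_lines<R: BufRead, W: Write>(mut input: R, mut out: W) -> io::Result<usize> {
    let mut line = String::with_capacity(1024); // one allocation, reused
    let mut count = 0;
    while input.read_line(&mut line)? > 0 {
        write!(out, "{}", line)?; // no per-call stdout lock
        count += 1;
        line.clear();
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    // Lock both handles once, not once per line.
    process_lines(stdin.lock(), stdout.lock())?;
    Ok(())
}
```

Splitting the helper over generic `BufRead`/`Write` also makes it trivially testable against in-memory buffers.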
If you've got more I'll add them. Of course I think we should also be open to discussion about the ones I listed as to why they should, or should not be listed and/or other fixes for them. 😄
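For the last item above, a small sketch of the non-panicking variant: `std::env::args_os` yields `OsString`s, so arguments that aren't valid UTF-8 can be handled instead of crashing the process (how to handle them is up to the program; the messages here are just for illustration).

```rust
use std::env;

fn main() {
    // args_os never panics; each item is an OsString that may or
    // may not be valid UTF-8.
    for arg in env::args_os() {
        match arg.into_string() {
            Ok(s) => println!("utf-8 arg: {}", s),
            Err(raw) => eprintln!("skipping non-utf-8 arg: {:?}", raw),
        }
    }
}
```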
Gitter log of reading `stdin` line by line
@rcoh 18:37: Hello! I'm working on https://github.com/rcoh/angle-grinder
in the process, I noticed that reading from stdin via .lines() is pretty slow
there are ways around it, of course. ripgrep uses a custom written stdin buffer
@BurntSushi 18:38: yeah it creates an alloc for each new line. ripgrep's buffer is quite a bit more involved. :grinning: easiest path to something fast is BufReader::read_line
@rcoh 18:38: In any case, I was wondering if there were any plans to make reading lots of data from stdin a more generally supported pattern or if ripgrep's pattern could be abstracted into a library in some way
@BurntSushi 18:39: ripgrep doesn't really read line-by-line, so it's a bit different
@rcoh 18:40: but I assume at some point in the internals, it's splitting things into lines?
@BurntSushi 18:40: nope
@rcoh 18:40: oh interesting!
@BurntSushi 18:40: only for output
@rcoh 18:40: doesn't grep have line-by-line semantics? Does ripgrep as well?
@BurntSushi 18:41: in the (very common) case that searching a file yields no matches, it is possible that the concept of a "line" never actually materializes at all
@rcoh 18:41: interesting.
@BurntSushi 18:41: see: https://blog.burntsushi.net/ripgrep/#mechanics (GNU grep does the same thing)
@rcoh 18:42: My current approach is:

```rust
let mut line = String::with_capacity(1024);
while buf.read_line(&mut line).unwrap() > 0 {
    self.proc_str(&line);
    line.clear();
}
```

Is that the fastest I can do without resorting to a totally different strategy?
@BurntSushi 18:42: pretty much, yes. if you could use a Vec instead of a String, then you can use read_until and avoid the UTF-8 validation check. depends on what you're doing though!
if you're processing data from arbitrary files, then String is very likely the wrong thing to use
@rcoh 18:44: it's intended to process log files so it's almost always printable data except in some weird cases
@BurntSushi 18:44: it all depends on whether you're OK with it choking on non-UTF-8 encoded text
@rcoh 18:53: Got it. Thanks for your help! I literally couldn't think of a more qualified person to have answer my question :grinning:
@BurntSushi 18:53: :grinning: FWIW, I do intend on splitting the core ripgrep search code out into the grep crate, which will include the buffer handling. it's likely possible that the API will be flexible enough to do line-by-line parsing as fast as possible (although, avoiding line-by-line will still of course be faster for many workloads)
@rcoh 19:02: Cool. I don't actually need to be line-by-line until after running some filtering so that would be useful for my use case.
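A sketch of the `Vec` + `read_until` variant @BurntSushi mentions above, which reuses one byte buffer and skips UTF-8 validation entirely (the `for_each_line` helper name is made up for illustration):

```rust
use std::io::BufRead;

// Hypothetical helper: visits each line as raw bytes (trailing
// newline included), reusing one Vec<u8> and never validating UTF-8.
fn for_each_line<R: BufRead>(mut input: R, mut f: impl FnMut(&[u8])) -> std::io::Result<()> {
    let mut buf = Vec::with_capacity(1024);
    while input.read_until(b'\n', &mut buf)? > 0 {
        f(&buf);
        buf.clear();
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let stdin = std::io::stdin();
    let mut bytes = 0usize;
    for_each_line(stdin.lock(), |line| bytes += line.len())?;
    eprintln!("{} bytes", bytes);
    Ok(())
}
```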