Add grapheme iteration benchmarks for various languages. #78

Merged
3 commits merged into unicode-rs:master on Feb 14, 2020

Conversation

@cessen (Contributor) commented Feb 14, 2020

Grapheme iteration benchmarks for Arabic, English, Hindi, Japanese, Korean, Mandarin Chinese, Russian, and C source code.

@cessen (Contributor, Author) commented Feb 14, 2020

Here are the cleaned-up benchmarks. They aren't quite the same as the ones I used in #77:

  • I basically rebuilt them to make extra sure all the text is from open-source/free-culture sources.
  • I took the time to source enough text to make most of the text unique, rather than a bunch of repeated copy-pasting.
  • I reduced the text size to ~50 kB each, as that seems like plenty.
  • I removed the zalgo and worst-case texts, as I don't think they're actually useful in practice and would likely just be confusing to people looking at the benchmarks in the future.

I wasn't totally sure how best to exclude the benches folder from publishing. Simply adding it to the exclude list actually causes packaging to fail, since it's referenced by the [[bench]] section. So for now I only excluded benches/texts, which contains the large text files. If I should do this differently, let me know!
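To illustrate, here is a hedged sketch of what the relevant Cargo.toml pieces might look like. The bench target name and the `harness = false` line (needed because a custom harness is used, per the discussion below) are assumptions, not copied from the actual manifest:

```toml
[package]
# ...
# Excluding all of benches/ makes `cargo package` fail, because the
# [[bench]] target below still references files in it; only the large
# text files under benches/texts are excluded from the published crate.
exclude = ["benches/texts"]

[[bench]]
name = "graphemes"  # placeholder target name
harness = false     # a custom harness is used instead of the unstable built-in one
```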

@cessen
Copy link
Contributor Author

cessen commented Feb 14, 2020

Also, just to be clear: the benchmarks don't compile without the text files present. So I'm a bit nervous about this approach for excluding the benchmark texts from publishing. I'm not sure if there's any infrastructure that might not like that.
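(For context, the compile failure presumably comes from the texts being embedded at compile time, along these lines; the file path is an illustrative guess:)

```rust
// If the benchmark text is embedded at compile time like this, a source
// tree published without benches/texts/ can't build the benches at all.
const TEXT_ENGLISH: &str = include_str!("texts/english.txt");
```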

@Manishearth (Member) left a comment
This looks great! Can we mention the license for these files? It's CC-BY-SA so we need to mention the license and attribute it.


fn graphemes_english(bench: &mut Bencher) {
    bench.iter(|| {
        for g in UnicodeSegmentation::graphemes(TEXT_ENGLISH, true) {

@Manishearth (Member) commented:

The text itself should also pass through black_box. Probably doesn't matter given how large it is, but worth a shot.

Alternatively, we can load the file dynamically outside of the iter() call.
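A minimal sketch of that suggestion, assuming the `bencher` crate (or a similar stable harness) supplies `Bencher` and `black_box`, and with an illustrative path for the text file:

```rust
use bencher::{benchmark_group, benchmark_main, black_box, Bencher};
use unicode_segmentation::UnicodeSegmentation;

// Illustrative path; however the text is loaded, the point is the same.
const TEXT_ENGLISH: &str = include_str!("texts/english.txt");

fn graphemes_english(bench: &mut Bencher) {
    bench.iter(|| {
        // Routing the input through black_box keeps the optimizer from
        // treating the text as a known constant and specializing the loop.
        for g in UnicodeSegmentation::graphemes(black_box(TEXT_ENGLISH), true) {
            black_box(g);
        }
    });
}

benchmark_group!(benches, graphemes_english);
benchmark_main!(benches);
```

The alternative mentioned, loading the file outside the iter() call, keeps the input just as opaque to the optimizer and also removes the compile-time dependency on the text files.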

@Manishearth (Member) commented:

Pushed a fix that does this.

@Manishearth (Member) commented:

We've had complaints about tests that break when run from published code, but that's usually packagers checking that their packaging went correctly. I don't think we'll have the same issue for benches. However, my understanding is that you don't need a [[bench]] entry if you just have a benches/ folder anyway, so we could totally just exclude it.

@Manishearth (Member) commented:

Oh, nope, it's necessary because you're using a custom harness. Darn.

@Manishearth (Member) commented:

Anyway, the benches will now compile; they just won't run when run from the published package.

@Manishearth merged commit 485767a into unicode-rs:master on Feb 14, 2020
@Manishearth (Member) commented:

Pushed in some commits adding a license/attribution and making the benchmarks use files. Thanks for doing this!
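Presumably "making the benchmarks use files" means reading the texts at run time rather than embedding them, so the benches still compile when benches/texts is absent and only fail if actually run. A rough sketch under that assumption (file name illustrative, same assumed harness as the sketch above):

```rust
use std::fs;

use bencher::{black_box, Bencher};
use unicode_segmentation::UnicodeSegmentation;

fn graphemes_english(bench: &mut Bencher) {
    // Reading the text at run time turns a missing benches/texts/ directory
    // into a runtime error instead of a compile error in the published crate.
    let text = fs::read_to_string("benches/texts/english.txt")
        .expect("benchmark texts are excluded from the published package");
    bench.iter(|| {
        for g in UnicodeSegmentation::graphemes(black_box(text.as_str()), true) {
            black_box(g);
        }
    });
}
```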

@cessen (Contributor, Author) commented Feb 15, 2020

Awesome, thanks for cleaning this up!

Out of curiosity, is there a way to do benchmarks in Rust without a custom harness? My understanding was that the standard benchmarker isn't stable, so you always need a custom harness. If that's not the case, I'd be more than happy to change the benchmarker that's used.

@cessen deleted the grapheme_bench branch on February 15, 2020 at 00:49
@Manishearth (Member) commented:

I didn't know you could use custom harnesses this way on stable!

Yes, the default harness isn't stable, but most people just set up their CI to bench on nightly only. shrug
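For reference, the unstable built-in harness being referred to looks roughly like this; it only builds on nightly, which is why CI ends up benching on nightly only. A sketch, not taken from this repo:

```rust
// Nightly-only: the built-in #[bench] support sits behind the unstable
// `test` feature, which is why stable crates reach for custom harnesses.
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};
use unicode_segmentation::UnicodeSegmentation;

#[bench]
fn graphemes_short_english(b: &mut Bencher) {
    let text = "The quick brown fox jumps over the lazy dog.";
    b.iter(|| {
        for g in UnicodeSegmentation::graphemes(black_box(text), true) {
            black_box(g);
        }
    });
}
```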
