parallel-letter-frequency: add canonical data #2209

Merged
merged 2 commits into from
Mar 14, 2023

Conversation

Member

@ErikSchierboom ErikSchierboom commented Feb 24, 2023

Closes #574

Member

@junedev junedev left a comment

Something to consider regarding the canonical data for this exercise: we previously had the data proposed here in the Go track as well, and students were always very disappointed or confused that the concurrent version was not actually faster than the sequential one. In Go this is easy for a student to check, as all tests include benchmarks. I would assume the problem also exists in other languages, as concurrency primitives usually have a cost that only pays off if the parallel processing runs for a while.

While "it's not faster" is a nice lesson, I am not sure it is the intention of the exercise, which is why I am bringing it up here.
cc @kytrinyx

Here is the data we use in the Go track now, where you actually see an improvement with the concurrent version:
https://github.com/exercism/go/blob/main/exercises/practice/parallel-letter-frequency/parallel_letter_frequency_test.go
I imagine for some languages you would need an even bigger input to see any benefit.
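
To make the overhead concrete, here is a minimal sketch in Python (not the Go track's code) that times a sequential and a thread-based counter on a tiny and a large input. The function names, input sizes, and worker count are assumptions chosen for illustration. No timings are asserted: they depend on the machine and runtime, and in standard CPython the global interpreter lock generally prevents threads from speeding up CPU-bound work at all.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_letters(text):
    """Count letters in one text, case-insensitively, skipping non-letters."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def sequential_frequency(texts):
    """Count letters across all texts, one text after another."""
    total = Counter()
    for text in texts:
        total.update(count_letters(text))
    return total


def concurrent_frequency(texts, max_workers=4):
    """Count each text in a worker thread, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_letters, texts):
            total.update(partial)
    return total


def time_call(func, texts, repeats=5):
    """Return the best wall-clock time of `repeats` runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(texts)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    small = ["abbccc"] * 3           # hypothetical tiny input
    large = ["abbccc" * 50_000] * 3  # hypothetical large input
    for name, texts in (("small", small), ("large", large)):
        seq = time_call(sequential_frequency, texts)
        con = time_call(concurrent_frequency, texts)
        print(f"{name}: sequential {seq:.6f}s, concurrent {con:.6f}s")
```

Both functions return the same counts, so the only thing the comparison varies is how the work is scheduled.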

@ErikSchierboom
Member Author

> Here is the data we use in the Go track now, where you actually see an improvement with the concurrent version:
> https://github.com/exercism/go/blob/main/exercises/practice/parallel-letter-frequency/parallel_letter_frequency_test.go
> I imagine for some languages you would need an even bigger input to see any benefit.

I did consider including a very large input, but was unsure what people would think. Thoughts?

@ErikSchierboom
Member Author

I've updated the large texts test case to match the data used in the Go exercise.

@andrerfcsantos
Member

andrerfcsantos commented Mar 1, 2023

This is exciting!

I'm OK with this being merged as is, but here are some thoughts on this:

  1. Right now, the tests have either really small strings or really big ones. I think we should have more cases in between, such as cases where each text is a sentence or a short paragraph with punctuation. That way, both the counting logic and the "ignore non-letter characters" logic could be tested together, in addition to the tests that cover these things separately. It would make for a more interesting debugging experience in general. One idea for a test case would be the famous "The quick brown fox jumps over the lazy dog." It includes all letters of the English alphabet, punctuation, and a capital letter, but any other sentence would do too.
  2. I see a test case with non-ASCII characters. Should this test case be put behind a scenario/property so tracks can more easily filter it out? I'm thinking of languages that can't handle UTF-8 out of the box and could have a problem with this test case.
  3. The way I see it, there are three main definitions of a "large input" for this exercise: a few big strings, many small strings, and many big strings. In Go, we were more interested in the first definition, since it was the one that allowed the concurrent version to be faster. But I think there's value in exploring the second option too, even if the concurrent version isn't faster there. What happens when you have 100 strings that might or might not be small and give one string to each thread/coroutine/goroutine? While having 100 goroutines is not a problem for Go, having 100 threads might be a problem for Java or other languages, and this would allow exploring those scenarios too. Maybe it's worth having these different kinds of large inputs as different scenarios/properties too?
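
As a rough illustration of points 1 and 3, here is a minimal Python sketch of an "in-between" sentence case and a "many small strings" case. The `frequency` helper, the pool size of 8, and the inputs are assumptions made for this example, not part of any track's implementation. It uses a bounded worker pool rather than one thread per text, which is one way a track could sidestep the 100-threads concern.

```python
import string
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_letters(text):
    """Count letters in one text, case-insensitively, skipping non-letters."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def frequency(texts, max_workers=8):
    """Count letters with a bounded pool instead of one thread per text."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_letters, texts):
            total.update(partial)
    return total


# Sentence case: exercises counting, case folding, and non-letter
# filtering in a single input.
pangram = "The quick brown fox jumps over the lazy dog."
pangram_counts = frequency([pangram])
assert set(pangram_counts) == set(string.ascii_lowercase)
assert pangram_counts["o"] == 4  # brown, fox, over, dog

# Many small strings: 100 texts share at most 8 worker threads.
many_small = ["ab"] * 100
assert frequency(many_small) == Counter({"a": 100, "b": 100})
```

The pool size is the tuning knob here: setting `max_workers` to `len(texts)` would approximate the one-thread-per-text variant described in point 3.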

Member

@petertseng petertseng left a comment

please amend the commit message to include "closes https://github.com/exercism/problem-specifications/issues/574" or any equivalent string, thank you

@ErikSchierboom ErikSchierboom force-pushed the parallel-letter-frequency-canonical-data branch 4 times, most recently from b61d450 to 88fd5ee Compare March 2, 2023 08:41
@ErikSchierboom
Member Author

I've changed a couple of things:

  • Added the unicode scenario to the unicode test case
  • Added a test with 50 small texts ("abbccc")
  • Added a test that has some sentences, which have a combination of lower and uppercase letters, whitespace and punctuation.
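
For illustration only, here is a Python sketch of how a track's test generator might skip cases tagged with a scenario. The case records below are hypothetical and are not copied from canonical-data.json. The sketch assumes each case may carry an optional `scenarios` list, and `select_cases` is a made-up helper name.

```python
from collections import Counter

# Hypothetical case records; descriptions, inputs, and expected values are
# illustrative and are not taken from the merged canonical data.
CASES = [
    {
        "description": "many small texts",
        "input": {"texts": ["abbccc"] * 50},
        # 50 copies of "abbccc": 50 * 1 a, 50 * 2 b, 50 * 3 c
        "expected": {"a": 50, "b": 100, "c": 150},
    },
    {
        "description": "sentence with mixed case and punctuation",
        "input": {"texts": ["Hello, World!"]},
        "expected": {"h": 1, "e": 1, "l": 3, "o": 2, "w": 1, "r": 1, "d": 1},
    },
    {
        "description": "non-ASCII letters",
        "scenarios": ["unicode"],
        "input": {"texts": ["ø", "φ"]},
        "expected": {"ø": 1, "φ": 1},
    },
]


def select_cases(cases, excluded=()):
    """Keep the cases that carry none of the excluded scenarios."""
    excluded = set(excluded)
    return [c for c in cases if not excluded & set(c.get("scenarios", []))]


def count_letters(texts):
    """Count letters across all texts, case-insensitively."""
    total = Counter()
    for text in texts:
        total.update(ch for ch in text.lower() if ch.isalpha())
    return total


# A track that cannot handle Unicode could skip the tagged case.
for case in select_cases(CASES, excluded={"unicode"}):
    assert dict(count_letters(case["input"]["texts"])) == case["expected"]
```

Calling `select_cases(CASES)` with no exclusions keeps all three records.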

@ErikSchierboom
Member Author

> please amend the commit message to include "closes #574" or any equivalent string, thank you

Done. I've also added this to the PR description.

@ErikSchierboom
Member Author

@junedev @petertseng Are you happy with the changes I made?

@junedev
Member

junedev commented Mar 2, 2023

@ErikSchierboom I preferred the version without the last test about the many small inputs. I always saw this exercise as being a good starter to practice concurrency primitives for the first time. I would have left the "tuning how much you do concurrently" part for another exercise. I'm still ok waving this through, just my personal opinion.

@ErikSchierboom
Member Author

I'm fine with removing it. Let's hear what @petertseng thinks.

Member

@petertseng petertseng left a comment

Hmm, I think that rather depends on the teaching goals of this exercise versus that future exercise. But that exercise doesn't exist yet and this one does now. So what I would think to do is: Take the test with many small texts for now. Once that future exercise is made, stop recommending the test with many small texts, if it's better suited for that exercise.

(Of course, we've discussed in the past we don't have a really good way to say that a test is no longer recommended since reimplements is the only mechanism, but I don't think that should be considered fatal to this idea)

@ErikSchierboom
Member Author

> Hmm, I think that rather depends on the teaching goals of this exercise versus that future exercise. But that exercise doesn't exist yet and this one does now. So what I would think to do is: Take the test with many small texts for now. Once that future exercise is made, stop recommending the test with many small texts, if it's better suited for that exercise.

@junedev would you be okay with that?

> (Of course, we've discussed in the past we don't have a really good way to say that a test is no longer recommended since reimplements is the only mechanism, but I don't think that should be considered fatal to this idea)

I think we might need some way to deprecate a test case without it being reimplemented.

@junedev
Member

junedev commented Mar 7, 2023

@ErikSchierboom Whatever you/others think is best is fine for me. I just wanted to mention it has a small drawback, that was all.

@ErikSchierboom
Member Author

I'm not entirely sure. CC @exercism/reviewers I'd be curious to hear your thoughts.

@ErikSchierboom ErikSchierboom force-pushed the parallel-letter-frequency-canonical-data branch from 32b659f to 9fc4e72 Compare March 7, 2023 08:11
@ErikSchierboom ErikSchierboom merged commit 822d524 into main Mar 14, 2023
@ErikSchierboom ErikSchierboom deleted the parallel-letter-frequency-canonical-data branch March 14, 2023 19:05
@ErikSchierboom
Member Author

Thanks everyone for chiming in! I've decided to leave the many texts test case in there, as I think it is interesting.

Development

Successfully merging this pull request may close these issues.

parallel-letter-frequency: Implement canonical-data.json
4 participants