content/posts/transformer-part1/index.md: 4 changes (2 additions & 2 deletions)
@@ -237,7 +237,7 @@ However, this issue has been resolved in modern Chapel (version 2.7), which intr

This is another layer that performed significantly worse in Chapel. The random number generator I used in the Chapel version is from the `Random` standard module. As for the C++ version, I tried to implement the same random algorithm, `pcg_setseq_64_xsh_rr_32`. I also used integer-based random generation with an integer threshold, which is 4–5 times faster than using floating-point numbers.
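
To make the integer-threshold idea concrete, here is a minimal sketch of a dropout-style mask driven by raw 32-bit draws, written against Chapel's `Random` module; the probability `p`, array size, and inverted-dropout scaling are illustrative assumptions, not the post's actual code. Comparing each draw against a precomputed integer threshold avoids converting every sample to a floating-point value in [0, 1).

```chapel
use Random;

config const p = 0.1;            // dropout probability (illustrative)
config const n = 1_000_000;      // number of elements (illustrative)

var data: [0..<n] real;
data = 1.0;

// Draw raw 32-bit integers instead of reals in [0, 1).
var rng = new randomStream(uint(32));
var draws: [0..<n] uint(32);
rng.fill(draws);                 // bulk fill; runs in parallel when threads are available

// A draw below this threshold means "drop", which happens with probability p.
const threshold = (p * 2.0**32): uint;

forall i in data.domain {
  if draws[i] < threshold then
    data[i] = 0.0;
  else
    data[i] /= (1.0 - p);        // inverted-dropout scaling (illustrative)
}
```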

-It also appeared that using `rng.fill` is faster than using `rng.next` when iterating over an array. Since this function forces parallelism when available, `CHPL_RT_NUM_THREADS_PER_LOCALE=1` must be set accordingly when experimenting with a single thread.
+Since this function forces parallelism when available, `CHPL_RT_NUM_THREADS_PER_LOCALE=1` must be set accordingly when experimenting with a single thread.

This layer in Chapel is significantly slower than its counterparts in the other models, primarily due to the random number generator. Using the random function with bounds caused a significant performance drop; after removing the bounds, Chapel achieved performance comparable to the C++ version. I reported and discussed this in the GitHub issue "[Random with bounds is much slower than no bound](https://github.com/chapel-lang/chapel/issues/28036)". However, since I only noticed and reported it after completing the project, I did not resolve it.
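
As a rough illustration of the bulk `fill` path and the bounded-draw slowdown described above, a micro-benchmark along the following lines could be used; the array size, timing code, and output formatting are assumptions rather than the post's actual harness.

```chapel
use Random, Time;

config const n = 10_000_000;     // number of draws (illustrative)

var a: [0..<n] real;
var rng = new randomStream(real);
var t: stopwatch;

// Per-element draws without bounds.
t.start();
for i in a.domain do a[i] = rng.next();
t.stop();
writeln("next():         ", t.elapsed(), " s");
t.clear();

// Per-element draws with bounds (the slow path discussed in the issue linked above).
t.start();
for i in a.domain do a[i] = rng.next(0.0, 1.0);
t.stop();
writeln("next(0.0, 1.0): ", t.elapsed(), " s");
t.clear();

// Bulk fill, which parallelizes over the available threads.
t.start();
rng.fill(a);
t.stop();
writeln("fill():         ", t.elapsed(), " s");
```

Because the `fill` path runs in parallel, a single-thread comparison needs `CHPL_RT_NUM_THREADS_PER_LOCALE=1` set in the environment when launching the compiled benchmark.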

@@ -339,4 +339,4 @@ The other layers seem fine and perform as well as, or better than, the PyTorch v

In this post, we explored the methodology of the experiment and the first test, running a small-size model on a single thread. The performance of the C++ and Chapel models is comparable to that of the two PyTorch models, with the C++ version being the fastest, as the benefits of PyTorch’s optimized linear algebra are not very apparent in this small-scale test. The Chapel version was the slowest in this test, mainly due to the Dropout and Softmax layers. Several unexpected performance issues were also encountered, requiring workarounds during the development of the Chapel version.

-In the next post in this series, we will explore the second test, using a full-size model on single and multiple threads, along with a discussion on productivity.
+In the next post in this series, we will explore the second test, using a full-size model on single and multiple threads, along with a discussion on productivity.