TileGrid divmod optimization #7147
This should be marginally faster than the conditional, but I didn't measure a difference.
This raises the speed of a benchmark on a Raspberry Pi Pico from 2.7fps to 3.1fps (+15%), because divisions are particularly expensive on M0 micros.
.. by tracking subpixels and subgrids directly. This increases the frame rate of my test to 3.47fps.
This increases the frame rate of my test to 3.58fps.
.. so that a new pixel is computed less frequently. This increases the refresh rate on a 2x scaled test from 3.7fps to 4.8fps, without affecting the framerate of the 1x scaled test. Note that this commit reindents a large block of code, so viewing it with whitespace changes suppressed may be helpful for review.
.. this raises the rate for my test from 3.58 to 3.72. It uses the 1bpp path.
This increases the performance to 3.80fps. It makes things slightly slower for other display types, but 16-bit TFTs need to be faster than tricolor e-paper displays to update.
This increases speed to 3.85fps.
.. and move the common cases first. 3.91fps.
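To illustrate the "new pixel is computed less frequently" idea for scaled output, here is a minimal Python sketch (the real code is C inside displayio). The function names and the `get_pixel` callback are hypothetical and only model the technique: with an integer scale factor, each source pixel is looked up once and then repeated, instead of being recomputed for every output pixel.

```python
def render_row_naive(get_pixel, out_width, scale):
    """Look up the source pixel again for every output pixel."""
    return [get_pixel(x // scale) for x in range(out_width)]


def render_row_reuse(get_pixel, out_width, scale):
    """Look up each source pixel once and repeat it `scale` times."""
    out = []
    src_x = 0        # index of the next source pixel to fetch
    remaining = 0    # how many more times the current value is reused
    value = None
    for _ in range(out_width):
        if remaining == 0:
            value = get_pixel(src_x)  # the expensive step, now 1/scale as often
            src_x += 1
            remaining = scale
        out.append(value)
        remaining -= 1
    return out
```

At a scale of 1 the two functions make the same number of lookups, which is consistent with the 1x test being unaffected.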
Unsurprisingly, some boards became over-full due to code growth. This can probably be fixed by putting, in particular, all the cases of the manually specialized get_pixel behind a define.
Force-pushed: 09e99f0 to 6e80d85
Other notes:
On RP2040, anyway, adding background SPI writes looks "not hard". 😉 Arches that didn't support background writes would transparently fall back to foreground writes. But it didn't help. To keep the code in this PR limited to changes that actually improve on the status quo, here are the dubious commits that didn't help: https://github.com/adafruit/circuitpython/compare/main...jepler:circuitpython:displayio-spi-background?expand=1
Another note on speeds: We can try several levels of "no-op"'ing
The last test does a 'break' at the end of the first loop, trying to demonstrate the overhead that I think is mostly in start/stop transaction, set refresh region, etc. Tests are with https://www.adafruit.com/product/4311 + Pi Pico W. SPI clock is nominal
.. a third will be added soon. Passing 4 arguments in registers is more efficient on ARM than having 5 arguments anyway: it saves ~80 bytes of flash and is either tied or a hair faster at redraw than the previous ref.
Force-pushed: 6e80d85 to 8adfab2
I realized that one of the reasons that the locking is so diffuse is to enable sharing the bus with an SD card so that OnDiskBitmaps work. OnDiskBitmaps are one of the bits of displayio least likely to ever be performant, but they also have a marked negative impact on the speed of the rest of the system even when not in use.
I have concluded that there does not appear to be a path of incremental change to the goal of overlapping pixel transmission (DMA) with pixel calculation, due to OnDiskBitmap (even if unused). My rather grumpy-sounding commit message about it, in yet another branch without a future: d72d841
Hmm, that is disappointing. My first thought is that a dedicated DisplayBus would be ideal, though it would obviously have to be optional for cases where it is not possible.
In theory, shouldn't this be faster? (My morning math may be off.) (72000 pixels * 16 bits) / 12 MHz gets me about 10ms per refresh, not 48ms?
I think our maths are both wrong.
The default is 24MHz; I initially wrote in my comment that I was at 12MHz. However, I'm pretty sure about the 48ms, or else I'm making a bits vs bytes error.
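For reference, the raw transfer-time arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it takes the 72000-pixel, 16-bit figures quoted above at face value, assumes one bit moves per SPI clock cycle, and ignores command, addressing, and transaction overhead. The function name is made up for illustration.

```python
def spi_transfer_ms(pixels, bits_per_pixel, clock_hz):
    """Lower bound on transfer time in milliseconds: total bits / clock rate."""
    return pixels * bits_per_pixel * 1000 / clock_hz


print(spi_transfer_ms(72000, 16, 24_000_000))  # 48.0
print(spi_transfer_ms(72000, 16, 12_000_000))  # 96.0
```

Under those assumptions, 48ms corresponds to the 24MHz clock, and a 12MHz clock would take twice as long rather than 10ms.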
Are you still wanting to work on this or can we close it? (Closed PRs can still be accessed.) I think it's all tricky stuff.
I'm not likely to finish this in the near future. Please feel free to pick up any part(s) of this if you have an interest.
I speculated that, at least on M0 devices like the RP2040, the use of division and modulo operators inside the TileGrid's main loop was a significant cost.
I removed them to the greatest extent possible from the inner loop, then piled on more optimizations.
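As a rough illustration of the approach (not the actual TileGrid code, which is C), the sketch below replaces a per-pixel division and modulo with two counters that are updated incrementally. The function names and the one-dimensional layout are simplifying assumptions made for the example.

```python
def tile_coords_divmod(width, tile_width):
    """Baseline: one division and one modulo per pixel."""
    return [(x // tile_width, x % tile_width) for x in range(width)]


def tile_coords_incremental(width, tile_width):
    """Same result, using counters instead of division and modulo."""
    out = []
    tile = 0  # index of the tile column the current pixel falls in
    sub = 0   # pixel offset inside that tile
    for _ in range(width):
        out.append((tile, sub))
        sub += 1
        if sub == tile_width:  # crossed a tile boundary
            sub = 0
            tile += 1
    return out
```

Both functions return the same list of (tile, offset) pairs; the second only needs an add and a compare per pixel, which matters on cores without a fast hardware divider.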
I ultimately raised the frame rate of scrolling terminal content from 2.7fps to 3.90fps on a Raspberry Pi Pico W, as measured by the program below. Since much of the screen was blank space, I'm not sure whether these are actually full-screen refreshes or just refreshes of the left 1/3 of the screen.
I feel like 3.9fps is still not great but I think this is the end of the line for the low-level optimization ideas I had.
My test program: