Skip to content

TileGrid divmod optimization #7147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

jepler
Copy link

@jepler jepler commented Oct 28, 2022

I speculated that at least on M0 devices like the RP2040, the use of division and modulo operators inside the TileGrid's main loop.
I removed it to the greatest extent possible from the inner loop, then piled on more optimizations.

I ultimately raised the frame rate of scrolling terminal content from 2.7fps to 3.90fps on a raspberry pi pico w, as measured by this program below. Since much of the screen was blank space I'm not sure these are actually full-screen refreshes, vs just refreshing the left 1/3 of the screen.

I feel like 3.9fps is still not great but I think this is the end of the line for the low-level optimization ideas I had.

My test program:

import board
import busio
import displayio
from adafruit_st7789 import ST7789
import time

displayio.release_displays()

bus = busio.SPI(clock=board.GP18, MOSI=board.GP19)
tft_cs = board.GP20
tft_rst = board.GP21
tft_dc = board.GP16
tft_bl = board.GP15

display_bus = displayio.FourWire(bus, command=tft_dc, chip_select=tft_cs, reset=tft_rst)
display = ST7789(display_bus, width=240, height=320, rowstart=0, backlight_pin=tft_bl)

display.auto_refresh = False
for i in range(30):
    print(i)
t0 = time.monotonic_ns()
for i in range(30):
    display.refresh()
    t1 = time.monotonic_ns()
    fps = 1e9 / (t1-t0)
    print(f"{fps=}")
    t0 = t1

This should be marginally faster than the conditional, but I didn't
measure a difference.
This raises the speed on a benchmark on raspberry pi pico from 2.7fps
to 3.1fps (+15%), because divisions are particularly expensive in M0
micros.
.. by tracking subpixels and subgrids directly

This increases fps of my test to 3.47
This increases the framerate of my test to 3.58fps
.. so that a new pixel is computed less frequently.

This increases the refresh rate on a 2x scaled test from 3.7fps to
4.8fps, without affecting the framerate of the 1x scaled test.

Note that this commit reindents a large block of code, so viewing
it with whitespace changes suppressed may be helpful for review.
.. this raises the rate for my test from 3.58 to 3.72. It uses the 1bpp
path.
This increases the performance to 3.80fps

It makes things slightly slower for other display types, but 16-bit
TFTs need to be faster than tricolor e-paper displays to update.
.. and move the common cases first. 3.91fps.
@jepler
Copy link
Author

jepler commented Oct 29, 2022

Unsurprisingly some boards became over-full due to code growth. Can probably fix it by putting particularly all the cases of manually specialized get_pixel behind a define.

@jepler jepler force-pushed the tilegrid-divmod-optimization branch 2 times, most recently from 09e99f0 to 6e80d85 Compare October 29, 2022 02:18
@jepler
Copy link
Author

jepler commented Oct 29, 2022

Other notes:

  • _refresh_area sends many many too many commands. For SPI TFTs it should only be necessary to send one update rectangle and then a stream of only pixels
  • but for other devices like i2c, having the data split across multiple 'send's may not be OK
  • this requires the transaction be enlarged to cover the whole update...
  • and is the only plausible way for background writes to work
  • as proof that this has a payoff, reducing the number of send calls by enlarging the minimum buffer by 4x gave a speed bump. I didn't write down how much

On rp2040, anyway, adding background SPI writes looks "not hard". 😉 arches that didn't support background writes would transparently fall back to foreground writes. but it didn't help. Trying to keep the code in this actually improving on the status quo, here are the dubious commits that didn't help: https://github.com/adafruit/circuitpython/compare/main...jepler:circuitpython:displayio-spi-background?expand=1

@jepler
Copy link
Author

jepler commented Oct 29, 2022

Another note on speeds: We can try several levels of "no-op"'ing _refresh_area.

Test Time (ms) saved vs baseline
Baseline 252 N/A
Skip just displayio_display_core_fill_area 84 168
Skip just _send_pixels 185 67
Skip both 13 239
1 transaction simulation 3 249

The last test does a 'break' at the end of the first loop, trying to demonstrate the overhead that I think is mostly in start/stop transaction, set refresh region, etc.

Tests are with https://www.adafruit.com/product/4311 + pi pico w. SPI clock is nominal 12MHz 24MHz except as noted. The actual update area is 72000 pixels, so send_pixels requires a minimum of 48ms, so 19ms or 28% of the total time of that step is lost to overhead. And if background writing worked, the SPI transaction would be entirely 'covered' by the calculation of the next block. Other per-subrectangle overheads account for 10ms more. But the #1 and #2 points for work are inside core_fill_area or in overapping send & calculate phases.

@jepler jepler marked this pull request as draft October 29, 2022 02:40
.. a third will be added soon.  Passing 4 arguments in registers
is more efficient in ARM than having 5 arguments anyway. (saves ~80
bytes flash) and is either tied or a hair faster at redraw than the
previous ref
@jepler jepler force-pushed the tilegrid-divmod-optimization branch from 6e80d85 to 8adfab2 Compare October 29, 2022 13:16
@jepler
Copy link
Author

jepler commented Oct 29, 2022

I realized that one of the reasons that the locking is so diffuse is to enable sharing the bus with an SD card so that OnDiskBitmaps work. OnDiskBitmaps are one of the bits of displayio least likely to ever be performant, but they also have a marked negative impact on the speed of the rest of the system even when not in use.

@jepler
Copy link
Author

jepler commented Oct 29, 2022

I have concluded that there does not appear to be a path of incremental change to the goal of overlapping pixel transmission (DMA) with pixel calculation, due to OnDiskBitmap (even if unused). My rather grumpy-sounding commit message about it, in yet another branch without a future: d72d841

@gamblor21
Copy link
Member

I have concluded that there does not appear to be a path of incremental change to the goal of overlapping pixel transmission (DMA) with pixel calculation, due to OnDiskBitmap (even if unused). My rather grumpy-sounding commit message about it, in yet another branch without a future: d72d841

Hmm that is disappointing. My first thought would be a dedicated DisplayBus would be ideal but would obviously have to be optional for cases when it is not possible.

Tests are with https://www.adafruit.com/product/4311 + pi pico w. SPI clock is nominal 12MHz except as noted. The actual update area is 72000 pixels, so send_pixels requires a minimum of 48ms,

In theory shouldn't this be faster (or my morning math may be off). (72000 pixels * 16 bits) / 12 Mhz gets me about 10ms per refresh not 48ms?

@jepler
Copy link
Author

jepler commented Oct 29, 2022

I think our maths are both wrong.

        { MP_QSTR_baudrate, MP_ARG_INT | MP_ARG_KW_ONLY, {.u_int = 24000000} },

The default is 24MHz, I initially wrote in my comment that I was at 12MHz. However, I'm pretty sure about the 48ms, or that I'm making a bits vs bytes error.

You have: 72000*16/24MHz
You want: ms
	* 48

@tannewt
Copy link
Member

tannewt commented Nov 15, 2022

Are you still wanting to work on this or can we close it? (Closed PRs can still be accessed.) I think its all tricky stuff.

@jepler
Copy link
Author

jepler commented Nov 28, 2022

I'm not likely to finish this in the near future. Please feel free to pick up any part(s) of this if you have an interest.

@jepler jepler closed this Nov 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants