TileGrid divmod optimization #7147

jepler · 2022-10-28T20:48:31Z

I speculated that at least on M0 devices like the RP2040, the use of division and modulo operators inside the TileGrid's main loop.
I removed it to the greatest extent possible from the inner loop, then piled on more optimizations.

I ultimately raised the frame rate of scrolling terminal content from 2.7fps to 3.90fps on a raspberry pi pico w, as measured by this program below. Since much of the screen was blank space I'm not sure these are actually full-screen refreshes, vs just refreshing the left 1/3 of the screen.

I feel like 3.9fps is still not great but I think this is the end of the line for the low-level optimization ideas I had.

My test program:

import board
import busio
import displayio
from adafruit_st7789 import ST7789
import time

displayio.release_displays()

bus = busio.SPI(clock=board.GP18, MOSI=board.GP19)
tft_cs = board.GP20
tft_rst = board.GP21
tft_dc = board.GP16
tft_bl = board.GP15

display_bus = displayio.FourWire(bus, command=tft_dc, chip_select=tft_cs, reset=tft_rst)
display = ST7789(display_bus, width=240, height=320, rowstart=0, backlight_pin=tft_bl)

display.auto_refresh = False
for i in range(30):
    print(i)
t0 = time.monotonic_ns()
for i in range(30):
    display.refresh()
    t1 = time.monotonic_ns()
    fps = 1e9 / (t1-t0)
    print(f"{fps=}")
    t0 = t1

This should be marginally faster than the conditional, but I didn't measure a difference.

This raises the speed on a benchmark on raspberry pi pico from 2.7fps to 3.1fps (+15%), because divisions are particularly expensive in M0 micros.

.. by tracking subpixels and subgrids directly This increases fps of my test to 3.47

This increases the framerate of my test to 3.58fps

.. so that a new pixel is computed less frequently. This increases the refresh rate on a 2x scaled test from 3.7fps to 4.8fps, without affecting the framerate of the 1x scaled test. Note that this commit reindents a large block of code, so viewing it with whitespace changes suppressed may be helpful for review.

.. this raises the rate for my test from 3.58 to 3.72. It uses the 1bpp path.

This increases the performance to 3.80fps It makes things slightly slower for other display types, but 16-bit TFTs need to be faster than tricolor e-paper displays to update.

This increases speed to 3.85fps.

.. and move the common cases first. 3.91fps.

jepler · 2022-10-29T00:01:31Z

Unsurprisingly some boards became over-full due to code growth. Can probably fix it by putting particularly all the cases of manually specialized get_pixel behind a define.

jepler · 2022-10-29T02:19:18Z

Other notes:

_refresh_area sends many many too many commands. For SPI TFTs it should only be necessary to send one update rectangle and then a stream of only pixels
but for other devices like i2c, having the data split across multiple 'send's may not be OK
this requires the transaction be enlarged to cover the whole update...
and is the only plausible way for background writes to work
as proof that this has a payoff, reducing the number of send calls by enlarging the minimum buffer by 4x gave a speed bump. I didn't write down how much

On rp2040, anyway, adding background SPI writes looks "not hard". 😉 arches that didn't support background writes would transparently fall back to foreground writes. but it didn't help. Trying to keep the code in this actually improving on the status quo, here are the dubious commits that didn't help: https://github.com/adafruit/circuitpython/compare/main...jepler:circuitpython:displayio-spi-background?expand=1

jepler · 2022-10-29T02:40:19Z

Another note on speeds: We can try several levels of "no-op"'ing _refresh_area.

Test	Time (ms)	saved vs baseline
Baseline	252	N/A
Skip just displayio_display_core_fill_area	84	168
Skip just _send_pixels	185	67
Skip both	13	239
1 transaction simulation	3	249

The last test does a 'break' at the end of the first loop, trying to demonstrate the overhead that I think is mostly in start/stop transaction, set refresh region, etc.

Tests are with https://www.adafruit.com/product/4311 + pi pico w. SPI clock is nominal ~~12MHz~~ 24MHz except as noted. The actual update area is 72000 pixels, so send_pixels requires a minimum of 48ms, so 19ms or 28% of the total time of that step is lost to overhead. And if background writing worked, the SPI transaction would be entirely 'covered' by the calculation of the next block. Other per-subrectangle overheads account for 10ms more. But the #1 and #2 points for work are inside core_fill_area or in overapping send & calculate phases.

.. a third will be added soon. Passing 4 arguments in registers is more efficient in ARM than having 5 arguments anyway. (saves ~80 bytes flash) and is either tied or a hair faster at redraw than the previous ref

jepler · 2022-10-29T13:47:22Z

I realized that one of the reasons that the locking is so diffuse is to enable sharing the bus with an SD card so that OnDiskBitmaps work. OnDiskBitmaps are one of the bits of displayio least likely to ever be performant, but they also have a marked negative impact on the speed of the rest of the system even when not in use.

jepler · 2022-10-29T14:09:57Z

I have concluded that there does not appear to be a path of incremental change to the goal of overlapping pixel transmission (DMA) with pixel calculation, due to OnDiskBitmap (even if unused). My rather grumpy-sounding commit message about it, in yet another branch without a future: d72d841

gamblor21 · 2022-10-29T14:33:25Z

I have concluded that there does not appear to be a path of incremental change to the goal of overlapping pixel transmission (DMA) with pixel calculation, due to OnDiskBitmap (even if unused). My rather grumpy-sounding commit message about it, in yet another branch without a future: d72d841

Hmm that is disappointing. My first thought would be a dedicated DisplayBus would be ideal but would obviously have to be optional for cases when it is not possible.

Tests are with https://www.adafruit.com/product/4311 + pi pico w. SPI clock is nominal 12MHz except as noted. The actual update area is 72000 pixels, so send_pixels requires a minimum of 48ms,

In theory shouldn't this be faster (or my morning math may be off). (72000 pixels * 16 bits) / 12 Mhz gets me about 10ms per refresh not 48ms?

jepler · 2022-10-29T16:08:55Z

I think our maths are both wrong.

        { MP_QSTR_baudrate, MP_ARG_INT | MP_ARG_KW_ONLY, {.u_int = 24000000} },

The default is 24MHz, I initially wrote in my comment that I was at 12MHz. However, I'm pretty sure about the 48ms, or that I'm making a bits vs bytes error.

You have: 72000*16/24MHz
You want: ms
	* 48

tannewt · 2022-11-15T23:00:10Z

Are you still wanting to work on this or can we close it? (Closed PRs can still be accessed.) I think its all tricky stuff.

jepler · 2022-11-28T19:02:33Z

I'm not likely to finish this in the near future. Please feel free to pick up any part(s) of this if you have an interest.

jepler added 9 commits October 28, 2022 15:23

TileGrid: Use a function pointer to call the get-pixel function

9ff3ceb

This should be marginally faster than the conditional, but I didn't measure a difference.

TileGrid: Move some divisions out of tilegrid_fill_area inner loop

81ad1ee

This raises the speed on a benchmark on raspberry pi pico from 2.7fps to 3.1fps (+15%), because divisions are particularly expensive in M0 micros.

TileGrid: Eliminate more %, / in inner loop

c0dc3ec

.. by tracking subpixels and subgrids directly This increases fps of my test to 3.47

TileGrid: Avoid divmod when the tile number didn't change

fd4de7c

This increases the framerate of my test to 3.58fps

TileGrid: Optimize common pixel depths

8316433

.. this raises the rate for my test from 3.58 to 3.72. It uses the 1bpp path.

Palette: Move the depth-16 case earlier

8abe518

This increases the performance to 3.80fps It makes things slightly slower for other display types, but 16-bit TFTs need to be faster than tricolor e-paper displays to update.

TileGrid: Apply aggressive optimization to this important function

f3f2af6

This increases speed to 3.85fps.

TileGrid: avoid obj_is_type calls in loop

e51e02a

.. and move the common cases first. 3.91fps.

jepler force-pushed the tilegrid-divmod-optimization branch 2 times, most recently from 09e99f0 to 6e80d85 Compare October 29, 2022 02:18

jepler marked this pull request as draft October 29, 2022 02:40

Combine two flags into one argument

8adfab2

.. a third will be added soon. Passing 4 arguments in registers is more efficient in ARM than having 5 arguments anyway. (saves ~80 bytes flash) and is either tied or a hair faster at redraw than the previous ref

jepler force-pushed the tilegrid-divmod-optimization branch from 6e80d85 to 8adfab2 Compare October 29, 2022 13:16

jepler closed this Nov 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TileGrid divmod optimization #7147

TileGrid divmod optimization #7147

jepler commented Oct 28, 2022

jepler commented Oct 29, 2022

jepler commented Oct 29, 2022

jepler commented Oct 29, 2022 •

edited

Loading

jepler commented Oct 29, 2022

jepler commented Oct 29, 2022

gamblor21 commented Oct 29, 2022

jepler commented Oct 29, 2022

tannewt commented Nov 15, 2022

jepler commented Nov 28, 2022

TileGrid divmod optimization #7147

TileGrid divmod optimization #7147

Conversation

jepler commented Oct 28, 2022

jepler commented Oct 29, 2022

jepler commented Oct 29, 2022

jepler commented Oct 29, 2022 • edited Loading

jepler commented Oct 29, 2022

jepler commented Oct 29, 2022

gamblor21 commented Oct 29, 2022

jepler commented Oct 29, 2022

tannewt commented Nov 15, 2022

jepler commented Nov 28, 2022

jepler commented Oct 29, 2022 •

edited

Loading