Futhark's performance for block-wide scans could be improved in certain cases if they had a sequentialization factor, so that a single thread works on a given number of elements. Use cases that come to mind are blocked radix sort and blocked partition. I do not believe it should be the user's responsibility to write code that gives block-level scans a sequentialization factor, just as the user does not have to do so for the device-wide scan, which already has one. This problem may also apply to other block-level kernels.
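
To make the request concrete, here is a rough sketch of what I mean by a sequentialization factor. It is a plain Python simulation, not Futhark compiler code or generated kernel code. The function name `blocked_scan` and the parameter `chunk` (the number of elements per simulated thread) are made up for illustration, and the operator `op` is assumed to be associative with neutral element `ne`.

```python
def blocked_scan(xs, op, ne, chunk):
    """Inclusive scan of one block; each simulated thread owns `chunk` elements.

    Hypothetical illustration only: `op` must be associative with neutral
    element `ne`, and `chunk` plays the role of the sequentialization factor.
    """
    # Split the block into per-thread chunks (the last one may be shorter).
    chunks = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]

    # Phase 1: every thread sequentially reduces its own chunk.
    totals = []
    for c in chunks:
        acc = ne
        for x in c:
            acc = op(acc, x)
        totals.append(acc)

    # Phase 2: exclusive scan over the per-thread totals. On a GPU this is
    # the only step that needs the cooperative block-wide scan, and it now
    # runs over len(chunks) values instead of len(xs).
    offsets = []
    acc = ne
    for t in totals:
        offsets.append(acc)
        acc = op(acc, t)

    # Phase 3: every thread rescans its chunk sequentially, seeded with
    # the offset produced for it in phase 2.
    out = []
    for off, c in zip(offsets, chunks):
        acc = off
        for x in c:
            acc = op(acc, x)
            out.append(acc)
    return out


print(blocked_scan([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b, 0, chunk=4))
# prints [1, 3, 6, 10, 15, 21, 28, 36]
```

With `chunk=1` this degenerates to an ordinary scan with one element per thread. Larger values trade parallelism for less synchronization and fewer shared-memory round trips, which is the trade-off I would like the compiler to be able to make for block-wide scans.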