
gh-101178: refactor base64.b85encode to be memory friendly #112248


Open

wants to merge 7 commits into main from gh-101178-b58encode-memuse

Conversation

romuald
Contributor

romuald commented Nov 18, 2023

Current description

Rewrote the base64._85encode method logic in C, by plugging into the binascii module (which already takes care of the base64 methods).

By using C and a single buffer, the memory use is reduced to a minimum, addressing the initial issue.
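At a high level, the Python side then becomes a thin wrapper around the C helper. A minimal sketch of the idea, assuming the clinic-generated function ends up exposed to base64 as a private binascii._b2a_base85 with keyword options mirroring the old _85encode parameters (the name and exact signature here are illustrative, not necessarily what the PR uses):

import binascii

def _85encode(b, chars, chars2, pad=False, foldnuls=False, foldspaces=False):
    # Hypothetical wrapper: the whole encoding happens in C inside binascii,
    # writing into a single result buffer instead of building two large lists.
    if not isinstance(b, bytes_types):
        b = memoryview(b).tobytes()
    return binascii._b2a_base85(b, pad=pad, foldnuls=foldnuls,
                                foldspaces=foldspaces)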

It also greatly improves performance as a bonus:

main

SMALL (11 bytes): 1575427 iterations (1.27 µs per call, 115.41 ns per byte)
MEDIUM (200 bytes): 204909 iterations (9.76 µs per call, 48.80 ns per byte)
BIG (5000 bytes): 8623 iterations (231.94 µs per call, 46.39 ns per byte)
VERYBIG (500000 bytes): 81 iterations (24.69 ms per call, 49.38 ns per byte)

branch

SMALL (11 bytes): 11230718 iterations (178.08 ns per call, 16.19 ns per byte)
MEDIUM (200 bytes): 6004721 iterations (333.07 ns per call, 1.67 ns per byte)
BIG (5000 bytes): 458005 iterations (4.37 µs per call, 873.35 ps per byte)
VERYBIG (500000 bytes): 4772 iterations (419.11 µs per call, 838.22 ps per byte)

Script used to test: https://gist.github.com/romuald/7aeba5f40693bb351da4abe62ad7321d

Previous description (Python refactor)

Not up to date with the current PR.

Refactor the code to use generators instead of allocating two potentially huge lists for large datasets.
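A minimal sketch of that generator idea (not the PR's exact diff), assuming the input has already been padded to a multiple of 4 bytes:

import struct

def _iter_words(b):
    # Yield 32-bit big-endian words lazily instead of materializing one big
    # list; bulk-unpack 512-byte slices first, then the 4-byte remainder.
    unpack512 = struct.Struct("!128I").unpack
    unpack4 = struct.Struct("!I").unpack
    bulk_end = len(b) - len(b) % 512
    for offset in range(0, bulk_end, 512):
        yield from unpack512(b[offset:offset + 512])
    for offset in range(bulk_end, len(b), 4):
        yield from unpack4(b[offset:offset + 4])

The encoder can then consume the words one by one and append the 5-character chunks to the output as it goes, which is where the peak-memory saving comes from.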

Memory gain was only measured on macOS with a 5 MB input.

Using main:

Before encoding
Physical footprint:         16.3M
Physical footprint (peak):  21.3M

After encoding
Physical footprint:         45.0M
Physical footprint (peak):  244.1M

With refactor:

Before encoding
Physical footprint:         14.6M
Physical footprint (peak):  19.6M

After encoding
Physical footprint:         28.5M
Physical footprint (peak):  34.4M

The execution time is more than doubled, which may not be acceptable. However, the memory used is reduced by more than 90%.
edit: changed the algorithm to be more efficient; the performance decrease now seems to be negligible

I also have no idea how (and if) I should test this

Here is the script I used to measure the execution time; the memdebug function can probably be adapted to read /proc/{pid} on Linux.
edit: updated to work on Linux too

import os
import sys
import random
import hashlib
import platform
import subprocess
from time import time

from base64 import b85encode

def memdebug():
    if platform.system() == "Darwin":
        if not os.environ.get("MallocStackLogging"):
            return

        res = subprocess.check_output(["malloc_history", str(os.getpid()), "-highWaterMark", "-allBySize"])

        for line in res.splitlines():
            if line.startswith(b"Physical"):
                print(line.decode())
    elif platform.system() == "Linux":
        with open(f"/proc/{os.getpid()}/status") as reader:
            for line in reader:
                if line.startswith("VmPeak:"):
                    print(line, end="")


def main():
    # use a stable input
    rnd = random.Random()
    rnd.seed(42)
    data = rnd.randbytes(5_000_000)

    memdebug()

    start = time()
    import pdb
    try:
        res = b85encode(data)
    except Exception:
        # pdb.post_mortem()
        raise
    end = time()

    memdebug()

    print("Data length:", len(data))
    print("Output length:", len(res))
    print(f"Encode time:  {end-start:.3f}s")

    h = hashlib.md5(res).hexdigest()
    print("Hashed result", h)
    assert h == "ad97e45ba085865e70f7aa05c9a31388"

    

if __name__ == '__main__':
    main()

@romuald
Contributor Author

romuald commented Nov 19, 2023

I've changed the algorithm to use a dedicated generator; using unpack("!512I") instead of 128 unpack("!I") calls seems to be far more efficient. The issue is now that this may not be as readable due to the additional generator.
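For reference, the gap is easy to see with a quick micro-benchmark (a sketch, assuming 512-byte slices unpacked as 128 big-endian uint32s, as in the generator quoted further down):

import struct
import timeit

data = bytes(range(256)) * 2  # 512 bytes of sample input
unpack512 = struct.Struct("!128I").unpack
unpack4 = struct.Struct("!I").unpack

bulk = timeit.timeit(lambda: unpack512(data), number=100_000)
per_word = timeit.timeit(
    lambda: [unpack4(data[i:i + 4]) for i in range(0, 512, 4)],
    number=100_000,
)
print(f"bulk unpack:     {bulk:.3f}s")
print(f"per-word unpack: {per_word:.3f}s")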

@romuald force-pushed the gh-101178-b58encode-memuse branch from 54793e7 to 39b1d7e on November 19, 2023
@arhadthedev added the performance (Performance or resource usage) label on Nov 19, 2023
@arhadthedev
Member

cc @sobolevn as a commenter in the issue, @pitrou as the author of the original _85encode.

@romuald
Contributor Author

romuald commented Nov 19, 2023

Tested on a Linux VM, the new code is actually faster with the 5 MB dataset 🤔

Memory gain is as follows:

branch: VmPeak: 31532 kB -> 45344 kB
main: VmPeak: 33608 kB -> 253752 kB

@romuald
Contributor Author

romuald commented Mar 1, 2024

@pitrou any chance you could spare a little time to review this?

@pitrou
Member

pitrou commented Mar 1, 2024

Could you post timeit numbers for both small and large inputs? (please be sure to compile in non-debug mode)

Lib/base64.py Outdated
Comment on lines 315 to 316
for c in unpack512(b[offset:offset+512]):
    yield c
Member

This can be simplified to:

Suggested change:
- for c in unpack512(b[offset:offset+512]):
-     yield c
+ yield from unpack512(b[offset:offset+512])

Contributor Author

Changed (I'm not yet used to yield from ^^)

# since the result is too large to fit inside a test,
# use a hash method to validate the test
self.assertEqual(len(result), 784)
self.assertEqual(hashlib.md5(result).hexdigest(),
Member

I'm not sure md5 is always available. WDYT @gpshead ?

Contributor Author

Good catch, I forgot about this.
According to https://docs.python.org/3/library/hashlib.html#hashlib.algorithms_guaranteed md5 may not be present, so I'll switch to sha1
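For the record, the adjusted assertion could look something like this (a sketch only; the digest string is a placeholder for the known-good value, not a real hash):

# sha1 is a safer choice than md5 here, since md5 may be missing or blocked
# on FIPS-restricted builds of Python.
self.assertEqual(len(result), 784)
self.assertEqual(hashlib.sha1(result).hexdigest(),
                 "<sha1 hex digest of the known-good output>")  # placeholder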

@romuald force-pushed the gh-101178-b58encode-memuse branch from 899c88d to d0e7691 on March 3, 2024
@romuald
Contributor Author

romuald commented Mar 3, 2024

@pitrou here is a timeit script and its results on a MacBook M1 (I don't have access to a Linux machine right now)

import timeit
import random

from base64 import b85encode

REPEAT = 5
COUNT = 10_000

SMALL_INPUT: bytes = b"hello world"
MEDIUM_INPUT: bytes  # 200 bytes
BIG_INPUT: bytes  # 5000 bytes

SMALL_COUNT = 500_000
MEDIUM_COUNT = 100_000
BIG_COUNT = 20_000

def init():
    global MEDIUM_INPUT, BIG_INPUT

    rnd = random.Random()
    rnd.seed(42)

    MEDIUM_INPUT = rnd.randbytes(200)
    BIG_INPUT = rnd.randbytes(5000)

def main():
    init()

    for name in "SMALL", "MEDIUM", "BIG":
        timer = timeit.Timer(f"b85encode({name}_INPUT)", globals=globals())
        count = globals()[f"{name}_COUNT"]
        values = timer.repeat(REPEAT, count)
        values = ", ".join("%.3fs" % x for x in values)

        print(f"Timeit {name} ({count} iterations): {values}")

if __name__ == '__main__':
    main()

Results:

main branch
Timeit SMALL (500000 iterations): 0.617s, 0.607s, 0.605s, 0.605s, 0.604s
Timeit MEDIUM (100000 iterations): 1.000s, 0.999s, 0.999s, 0.999s, 0.999s
Timeit BIG (20000 iterations): 4.789s, 4.788s, 4.794s, 4.800s, 4.782s

gh-101178-b58encode-memuse branch
Timeit SMALL (500000 iterations): 1.193s, 1.186s, 1.184s, 1.190s, 1.174s
Timeit MEDIUM (100000 iterations): 1.748s, 1.701s, 1.701s, 1.701s, 1.700s
Timeit BIG (20000 iterations): 5.705s, 5.675s, 5.673s, 5.668s, 5.672s

@pitrou
Member

pitrou commented Mar 3, 2024

The performance decrease is a bit unfortunate. Instead of defining a separate _85buffer_iter_words generator, perhaps we can look for a hybrid approach, something like (just a sketch):

def _85encode(b, chars, chars2, pad=False, foldnuls=False, foldspaces=False):
    # Helper function for a85encode and b85encode
    if not isinstance(b, bytes_types):
        # TODO can this be `memoryview(b).cast('B')` instead?
        b = memoryview(b).tobytes()

    def encode_words(words):
        # Encode a sequence of 32-bit words, excluding padding
        chunks = [b'z' if foldnuls and not word else
                  b'y' if foldspaces and word == 0x20202020 else
                  (chars2[word // 614125] +
                   chars2[word // 85 % 7225] +
                   chars[word % 85])
                  for word in words]
        return b''.join(chunks)

    n1 = len(b) // 512  # number of 512-byte unpacks
    n2 = (len(b) - n1 * 512) // 4  # number of 4-byte unpacks
    padding = (-len(b)) % 4

    unpack512 = struct.Struct("!128I").unpack
    unpack4 = struct.Struct("!I").unpack

    offset = 0
    blocks = []
    for _ in range(n1):
        blocks.append(encode_words(unpack512(b[offset:offset+512])))
        offset += 512

    for _ in range(n2):
        blocks.append(encode_words(unpack4(b[offset:offset+4])))
        offset += 4

    if padding:
        # TODO deal with last bytes and padding...
        ...

    return b''.join(blocks)
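For the TODO above, one way to handle the tail (a sketch in the spirit of the current pure-Python _85encode, not part of the original comment) is to pad the last bytes with NULs to a full word, encode it, then drop the characters that correspond to the padding unless pad=True:

if padding:
    # Pad the tail to a full 4-byte word before unpacking.
    tail = b[offset:] + b'\0' * padding
    last = encode_words(unpack4(tail))
    if not pad:
        if last == b'z':
            # A folded all-zero chunk must be expanded before trimming.
            last = chars[0] * 5
        last = last[:-padding]
    blocks.append(last)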

@romuald
Contributor Author

romuald commented Mar 4, 2024

Performance regression also on Linux amd64:

main branch
Timeit SMALL (500000 iterations): 0.874s, 0.880s, 0.913s, 0.868s, 0.869s
Timeit MEDIUM (100000 iterations): 1.208s, 1.201s, 1.211s, 1.211s, 1.212s
Timeit BIG (20000 iterations): 5.753s, 5.737s, 5.736s, 5.773s, 5.954s

gh-101178-b58encode-memuse branch
Timeit SMALL (500000 iterations): 1.581s, 1.600s, 1.596s, 1.583s, 1.542s
Timeit MEDIUM (100000 iterations): 2.200s, 2.243s, 2.293s, 2.303s, 2.340s
Timeit BIG (20000 iterations): 7.443s, 7.478s, 7.678s, 7.502s, 7.457s

I find it a bit strange, because in my initial tests a few months back the execution was slower on macOS but slightly faster on Linux. I attributed that to the memory allocation being costly enough to be a factor.

@romuald
Contributor Author

romuald commented Feb 14, 2025

@pitrou sorry for the huge lag

I gave up after spending a lot of time trying to find a way to be both CPU- and memory-friendly for small and large datasets.

Until last week, when I realized that I could simply rewrite the function in C (which I have done) to get both performance and memory improvements.

My question is, should I create a new PR or push -f on this one?

@sobolevn
Member

sobolevn commented Feb 14, 2025

You surely can do both :)
But I prefer to re-use PRs.

Initially done to reduce the huge memory consumption of the previous
implementation for large inputs, since no memory-friendly Python way was
found that did not include a performance regression.

This implementation also greatly improves performance in all cases.

Signed-off-by: Romuald Brunet <[email protected]>
Regression was found while testing the new C implementation, when foldspaces
was used with b85encode (since a chunk could end in z without having been
folded)
@romuald force-pushed the gh-101178-b58encode-memuse branch from d0e7691 to 74fc245 on February 16, 2025
@romuald
Contributor Author

romuald commented Feb 16, 2025

@pitrou / @sobolevn I rewrote this PR from scratch to use a C implementation instead of a Python one

Note that I do not consider myself a seasoned C developer so the implementation may be lacking.

I've tried to maximize compatibility with the previous implementation even though the _85encode method is private; we could drop chars2, for example, and possibly change the type of XXchars to bytes.

I've also added a test to check for a regression I found while testing the new implementation with random data

@@ -1239,13 +1239,101 @@ binascii_b2a_qp_impl(PyObject *module, Py_buffer *data, int quotetabs,
return rv;
}

/*[clinic input]
binascii.b2a_base85
Member

It would feel weird not to have binascii.a2b_base85, so I would suggest keeping it private for now. Ideally, base64 should have its own C accelerator module, but maybe it's overkill.

Contributor Author

Renamed to private

That was what delayed me initially, because I had no idea how to add a module dedicated to base64; I only found out last week that it already uses the binascii one.
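For context, base64 already leans on binascii for the classic encodings, which is the hook this PR reuses; b64encode, for instance, is essentially a thin wrapper over the C function binascii.b2a_base64:

import binascii

def b64encode(s, altchars=None):
    # The actual byte crunching is done in C by binascii.b2a_base64;
    # only the optional alphabet substitution stays in Python.
    encoded = binascii.b2a_base64(s, newline=False)
    if altchars is not None:
        assert len(altchars) == 2, repr(altchars)
        return encoded.translate(bytes.maketrans(b'+/', altchars))
    return encoded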

@picnixz
Member

picnixz commented Feb 16, 2025

By the way, I didn't look at the implementation in detail yet, but if you want to compare your implementation with a popular one, you can have a look at https://github.com/git/git/blob/master/base85.c#L40.

Apply suggestions

Co-authored-by: Bénédikt Tran <[email protected]>
@romuald
Contributor Author

romuald commented Feb 16, 2025

By the way, I didn't look at the implementation in detail yet, but if you want to compare your implementation with a popular one, you can have a look at https://github.com/git/git/blob/master/base85.c#L40.

Thanks, I didn't know where to look when I started (the base algorithm originally came from an LLM :/)

Inspired by git's code, this could be used instead:

    size_t i = 0;
    size_t underflow = 0;  // was overflow, but underflow may be a better name since `i` will not go over bin_len
    while (i < bin_len) {
        // translate each 4 byte chunk to 32bit integer
        uint32_t value = 0;
        for (int cnt = 24; cnt >= 0; cnt -= 8) {
            value |= bin_data[i] << cnt;
            if (++i == bin_len) {
                // Number of bytes under the 4 bytes rounded value
                underflow = cnt / 8;
                break;
            }
        }
        // ...
    }

There is a 20% performance gain for large inputs (starting at 5 kB), since the potential padding does not need to be computed for each chunk.

Shall I use this method instead?
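To make the underflow bookkeeping concrete, here is the same accounting modelled in Python (illustrative only, not the PR's C code): for a short final chunk, underflow counts the missing input bytes, and with pad=False the base85 output keeps 5 - underflow characters for that chunk.

from base64 import b85encode

def b85_output_length(bin_data):
    # Mirror the loop above: walk the input in 4-byte big-endian words and
    # track how many bytes are missing from the last (possibly short) chunk.
    i = 0
    out_chars = 0
    while i < len(bin_data):
        underflow = 0
        for shift in (24, 16, 8, 0):
            # value |= bin_data[i] << shift  (the value itself is omitted here)
            i += 1
            if i == len(bin_data):
                underflow = shift // 8
                break
        # A full chunk yields 5 characters; the short final chunk yields fewer.
        out_chars += 5 - underflow
    return out_chars

assert b85_output_length(b"hello world") == len(b85encode(b"hello world")) == 14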

@picnixz
Member

picnixz commented Feb 16, 2025

There is a 20% performance gain for large inputs [...] Shall I use this method instead?

Yes, since base85 can be used for encoding large inputs it's worth it I think.

Inspired by git source https://github.com/git/git/blob/03944513488db4a81fdb4c21c3b515e4cb260b05/base85.c#L79

This avoids checking the chunk size on every iteration and thus improves performance.
Since j is not unsigned anymore, we can reverse the table lookup loop.
size_t i = 0;
int padding = 0;

while (i < bin_len) {
Member

I would also credit the git implementation for this one as it's heavily based on it.

Contributor Author

Credit added. I don't know if I should phrase it differently?

I've also added some other comments to try to explain the logic more
