Conversation

@a10y
Contributor

@a10y a10y commented Mar 14, 2025

It's useful to be able to rebuild a Compressor from an existing symbol table, for example to perform incremental compression, or to compress a value so that comparisons can be pushed down onto the encoded data.

Going through CompressorBuilder, calling insert for each symbol and then building, may seem appropriate, but you're not guaranteed to get the same code order afterward! This is because of the finalize step, in which many of the 2-byte codes actually get reordered. This is an optimization we ported from the C++ code, and it allows us to skip a hashtable lookup during compression for many inputs.

To avoid the ambiguity, we provide a new Compressor::rebuild_from that rebuilds a compressor from a slice of symbols and a slice of their lengths. This repopulates the codes_two_byte table and the hash table that are needed at compress time, and does so without the finalize step that occurs when going through the CompressorBuilder.
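To illustrate why code order matters, here is a minimal toy sketch. It is not the fsst-rs implementation: `ToyCompressor`, `build_with_finalize`, and the longest-first sort standing in for the finalize step are all hypothetical, chosen only to show that a symbol's code depends on its position in the table.

```rust
/// Toy model (NOT the fsst-rs implementation): a symbol's code is simply
/// its index in the table, so table order determines the encoded bytes.
struct ToyCompressor {
    symbols: Vec<Vec<u8>>,
}

impl ToyCompressor {
    /// Hypothetical stand-in for the builder path: symbols are inserted and
    /// then reordered by a finalize step (assumed here: stable sort, longest first).
    fn build_with_finalize(inserted: &[&[u8]]) -> Self {
        let mut symbols: Vec<Vec<u8>> = inserted.iter().map(|s| s.to_vec()).collect();
        symbols.sort_by_key(|s| std::cmp::Reverse(s.len()));
        ToyCompressor { symbols }
    }

    /// Hypothetical stand-in for the rebuild path: the given order is kept as-is.
    fn rebuild_from(symbols: &[&[u8]]) -> Self {
        ToyCompressor {
            symbols: symbols.iter().map(|s| s.to_vec()).collect(),
        }
    }

    /// Greedy longest-match encoding. Bytes that match no symbol are emitted
    /// as an escape marker (255) followed by the literal byte.
    fn compress(&self, input: &[u8]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut pos = 0;
        while pos < input.len() {
            let rest = &input[pos..];
            // Pick the longest non-empty symbol that prefixes the remaining input.
            let best = self
                .symbols
                .iter()
                .enumerate()
                .filter(|(_, s)| !s.is_empty() && rest.starts_with(s.as_slice()))
                .max_by_key(|(_, s)| s.len());
            match best {
                Some((code, sym)) => {
                    out.push(code as u8);
                    pos += sym.len();
                }
                None => {
                    out.extend_from_slice(&[255, rest[0]]);
                    pos += 1;
                }
            }
        }
        out
    }
}
```

With the stored table `[a, bc, abc]`, the order-preserving path encodes `abcbc` as `[2, 1]`, while the reordering path produces the table `[abc, bc, a]` and encodes the same input as `[0, 1]`. Data encoded against the stored table could then no longer be compared byte-for-byte with the output of the second compressor.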

I've also updated the fuzzer to check both a freshly built and a rebuilt compressor. It caught several bugs while I was cleaning up this PR 😅. I let the fuzzer run for 30 minutes on my laptop and it didn't find anything.

@codspeed-hq

codspeed-hq bot commented Mar 14, 2025

CodSpeed Performance Report

Merging #84 will not alter performance

Comparing aduffy/build-from-existing (187c4b5) with develop (44d9d78)

Summary

✅ 16 untouched benchmarks

@a10y a10y marked this pull request as draft March 14, 2025 22:27
@a10y a10y force-pushed the aduffy/build-from-existing branch from 1c9b08d to 70865a4 Compare March 17, 2025 15:54
@a10y
Contributor Author

a10y commented Mar 17, 2025

Just to tickle a fancy, I wanted to see if I could remove all of the unsafe without impacting performance. Unfortunately it looks like, at least on microbenchmarks, removing unsafe has a substantial penalty.

Would be good future work though to see if this is possible.

image

@a10y a10y marked this pull request as ready for review March 17, 2025 16:40
@a10y a10y force-pushed the aduffy/build-from-existing branch from 034256c to bb04f95 Compare March 17, 2025 16:41
@a10y a10y force-pushed the aduffy/build-from-existing branch from bb04f95 to 187c4b5 Compare March 17, 2025 16:45
@a10y a10y enabled auto-merge (squash) March 17, 2025 16:46
@a10y a10y merged commit 7dd4852 into develop Mar 17, 2025
5 checks passed
@a10y a10y deleted the aduffy/build-from-existing branch March 17, 2025 17:05
@github-actions github-actions bot mentioned this pull request Mar 17, 2025
a10y pushed a commit that referenced this pull request Mar 17, 2025
## 🤖 New release

* `fsst-rs`: 0.5.1 -> 0.5.2 (✓ API compatible changes)

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

## [0.5.2](v0.5.1...v0.5.2) - 2025-03-17

### Added

- add rebuild from existing function ([#84](#84))
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
a10y added a commit to vortex-data/vortex that referenced this pull request Mar 17, 2025
The previous implementation of FSST comparison pushdown relied on
rebuilding a compressor by reinserting symbols one-by-one into the
CompressorBuilder, and then building it.

That doesn't work, for reasons described in the description at
spiraldb/fsst#84.

We now use the new `rebuild_from` API on the fsst compressor to build a
new compressor that is guaranteed to preserve symbol table ordering, and
thus to produce equal compression outputs.
a1412744807 added a commit to a1412744807/rs-fsst that referenced this pull request Oct 27, 2025
Goodbai-1206 added a commit to Goodbai-1206/fsst-rs that referenced this pull request Oct 29, 2025
akkjs887 added a commit to akkjs887/rt-rsst that referenced this pull request Oct 29, 2025
yelirekhmon added a commit to yelirekhmon/fstrs that referenced this pull request Oct 30, 2025