Skip to content

Replace 4-byte MD5 abi_hash with PostgreSQL SHA-256 generated column#266

Open
Uxio0 wants to merge 3 commits into
mainfrom
feature-improve-abi-hash
Open

Replace 4-byte MD5 abi_hash with PostgreSQL SHA-256 generated column#266
Uxio0 wants to merge 3 commits into
mainfrom
feature-improve-abi-hash

Conversation

@Uxio0
Copy link
Copy Markdown
Member

@Uxio0 Uxio0 commented Apr 16, 2026

The old get_md5_abi_hash truncated MD5 to 4 bytes, giving ~50% collision probability at 65k rows. The new abi_hash column is GENERATED ALWAYS AS (sha256(abi_json::jsonb::text::bytea)) STORED so PostgreSQL owns both the computation and the uniqueness guarantee. JSONB normalisation makes the hash stable regardless of key insertion order.

Python no longer computes or stores the hash; get_or_create_abi uses the IntegrityError try/catch pattern for concurrent-safe upserts, and get_abi uses a JSONB equality query for key-order-independent lookup.

Fixes PLA-1301

@Uxio0 Uxio0 force-pushed the feature-improve-abi-hash branch 2 times, most recently from 58d5b10 to 1a73376 Compare April 23, 2026 11:29
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 23, 2026

Coverage Report for CI Build 24841246694

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage decreased (-0.1%) to 91.476%

Details

  • Coverage decreased (-0.1%) from the base build.
  • Patch coverage: 2 uncovered changes across 2 files (19 of 21 lines covered, 90.48%).
  • No coverage regressions found.

Uncovered Changes

File Changed Covered %
app/datasources/db/models.py 15 14 93.33%
app/routers/models.py 6 5 83.33%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 1443
Covered Lines: 1320
Line Coverage: 91.48%
Coverage Strength: 0.91 hits per line

💛 - Coveralls

Comment thread app/routers/models.py
Comment thread app/datasources/db/models.py Outdated
"""
abi_hash = get_md5_abi_hash(abi_json)
query = select(cls).where(cls.abi_hash == abi_hash).limit(1)
query = (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you verify that this will use the index in database queries?

Since abi_hash is a stored generated column, we can mirror the same expression on the query side and let PostgreSQL use the index:

from sqlalchemy import func                                                                                                                                                                                                                                                                                           
computed_hash = func.sha256(
    sa_cast(sa_cast(literal(json.dumps(abi_json)), JSONB), LargeBinary)
 )                                                                                                                                                                                                                                                                                                                     
query = select(cls).where(cls.abi_hash == computed_hash)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree @falvaradorodriguez , it was not using the index. Fixed in c162981



def upgrade() -> None:
op.execute("DROP INDEX ix_abi_abi_hash")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to think of a mechanism to apply this migration. This could cause the service to be down for an long time and cause problems for our customers.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, first we need to properly test it in staging

Uxio0 and others added 3 commits May 13, 2026 14:51
The old get_md5_abi_hash truncated MD5 to 4 bytes, giving ~50% collision
probability at 65k rows. The new abi_hash column is GENERATED ALWAYS AS
(sha256(abi_json::jsonb::text::bytea)) STORED — PostgreSQL owns both the
computation and the uniqueness guarantee. JSONB normalisation makes the
hash stable regardless of key insertion order.

Python no longer computes or stores the hash; get_or_create_abi uses the
IntegrityError try/catch pattern for concurrent-safe upserts, and get_abi
uses a JSONB equality query for key-order-independent lookup. The field is
removed from the public API response entirely.
The previous JSONB-equality query forced a sequential scan with a
per-row cast of abi_json to jsonb. Compute sha256 of the same
JSONB-normalised text form on the server side and match against the
indexed abi_hash generated column, turning the lookup into an
index probe while preserving key-order independence.
@Uxio0 Uxio0 force-pushed the feature-improve-abi-hash branch from 9c1bba3 to c162981 Compare May 13, 2026 12:52
@Uxio0 Uxio0 requested a review from a team as a code owner May 13, 2026 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants