Replace 4-byte MD5 abi_hash with PostgreSQL SHA-256 generated column#266
Uxio0 wants to merge 3 commits into
58d5b10 to 1a73376
Coverage Report for CI Build 24841246694
Warning: Build has drifted. This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Coverage decreased (-0.1%) to 91.476%.
No coverage regressions found.
Coveralls
"""
abi_hash = get_md5_abi_hash(abi_json)
query = select(cls).where(cls.abi_hash == abi_hash).limit(1)
query = (
Can you verify that this will use the index in database queries?
Since abi_hash is a stored generated column, we can mirror the same expression on the query side and let PostgreSQL use the index:
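import json
from sqlalchemy import LargeBinary, Text, cast as sa_cast, func, literal, select
from sqlalchemy.dialects.postgresql import JSONB
# Mirror the stored column expression: sha256(abi_json::jsonb::text::bytea)
computed_hash = func.sha256(
    sa_cast(sa_cast(sa_cast(literal(json.dumps(abi_json)), JSONB), Text), LargeBinary)
)
query = select(cls).where(cls.abi_hash == computed_hash)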
Agreed, @falvaradorodriguez, it was not using the index. Fixed in c162981.
def upgrade() -> None:
    op.execute("DROP INDEX ix_abi_abi_hash")
We need to think of a mechanism to apply this migration. It could take the service down for a long time and cause problems for our customers.
Yeah, first we need to test it properly in staging.
The old get_md5_abi_hash truncated MD5 to 4 bytes (32 bits), which by the birthday bound gives roughly a 39% chance of at least one collision at 65k rows and 50% at about 77k. The new abi_hash column is GENERATED ALWAYS AS (sha256(abi_json::jsonb::text::bytea)) STORED, so PostgreSQL owns both the computation and the uniqueness guarantee. JSONB normalisation makes the hash stable regardless of key insertion order. Python no longer computes or stores the hash; get_or_create_abi uses the IntegrityError try/except pattern for concurrent-safe upserts, and get_abi uses a JSONB equality query for key-order-independent lookup. The field is removed from the public API response entirely.
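For reference, the collision risk of a 4-byte hash can be estimated with the standard birthday-bound approximation. This is only a sketch: the function names are hypothetical and it assumes hash values are uniformly distributed. Under that approximation a 32-bit hash reaches roughly 39% at 65,000 rows and 50% near 77,000, while a 256-bit hash stays negligible.

```python
import math


def collision_probability(rows: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) among `rows` uniformly random hashes."""
    space = 2 ** hash_bits
    exponent = rows * (rows - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny (SHA-256 case).
    return -math.expm1(-exponent)


def rows_for_probability(p: float, hash_bits: int) -> float:
    """Approximate row count at which the collision probability reaches `p`."""
    space = 2 ** hash_bits
    return math.sqrt(2 * space * math.log(1.0 / (1.0 - p)))


p_md5_4_bytes = collision_probability(65_000, 32)   # truncated MD5, 32 bits
rows_at_half = rows_for_probability(0.5, 32)        # rows needed for a 50% chance
p_sha256 = collision_probability(65_000, 256)       # full SHA-256 digest
```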
The previous JSONB-equality query forced a sequential scan with a per-row cast of abi_json to jsonb. Compute sha256 of the same JSONB-normalised text form on the server side and match against the indexed abi_hash generated column, turning the lookup into an index probe while preserving key-order independence.
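To illustrate why hashing a normalised form gives key-order independence, here is a small Python analogue. It is an assumption-laden sketch: `canonical_abi_hash` is a hypothetical helper, and `json.dumps(..., sort_keys=True)` only stands in for PostgreSQL's jsonb normalisation. The digest it produces is not expected to match the value PostgreSQL stores in abi_hash, because jsonb's text output uses its own key ordering and spacing.

```python
import hashlib
import json


def canonical_abi_hash(abi_json) -> bytes:
    """Hash a canonical text form so that key insertion order does not matter."""
    # Assumption: sorted keys and fixed separators play the role of jsonb normalisation.
    canonical = json.dumps(abi_json, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).digest()


# The same ABI entry written with two different key orders.
abi_a = [{"type": "function", "name": "transfer", "inputs": []}]
abi_b = [{"inputs": [], "name": "transfer", "type": "function"}]
# A genuinely different ABI entry.
abi_c = [{"type": "function", "name": "approve", "inputs": []}]

hash_a = canonical_abi_hash(abi_a)
hash_b = canonical_abi_hash(abi_b)
hash_c = canonical_abi_hash(abi_c)
```

Because the lookup hashes the normalised form rather than the raw request body, callers do not need to agree on key order for the index probe to find the row.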
9c1bba3 to c162981
The old get_md5_abi_hash truncated MD5 to 4 bytes (32 bits), which by the birthday bound gives roughly a 39% chance of at least one collision at 65k rows and 50% at about 77k. The new abi_hash column is GENERATED ALWAYS AS (sha256(abi_json::jsonb::text::bytea)) STORED, so PostgreSQL owns both the computation and the uniqueness guarantee. JSONB normalisation makes the hash stable regardless of key insertion order.
Python no longer computes or stores the hash; get_or_create_abi uses the IntegrityError try/except pattern for concurrent-safe upserts, and get_abi uses a JSONB equality query for key-order-independent lookup.
Fixes PLA-1301
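The insert-then-catch pattern mentioned in the description can be sketched with the standard-library sqlite3 module. This is not the PR's implementation: it swaps SQLite in for PostgreSQL and SQLAlchemy so the example runs without a server, and the table layout, the `abi_key` column (a stand-in for the unique abi_hash column), and this `get_or_create_abi` signature are all assumptions made for illustration.

```python
import json
import sqlite3


def get_or_create_abi(conn: sqlite3.Connection, abi_json) -> tuple[int, bool]:
    """Return (row id, created). The unique constraint decides who wins a race."""
    # Assumption: a sorted-key dump stands in for the database-side normalisation.
    canonical = json.dumps(abi_json, sort_keys=True, separators=(",", ":"))
    try:
        with conn:  # commits on success, rolls back if the insert fails
            cur = conn.execute("INSERT INTO abi (abi_key) VALUES (?)", (canonical,))
            return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The row already exists (possibly inserted concurrently), so read it back.
        row = conn.execute(
            "SELECT id FROM abi WHERE abi_key = ?", (canonical,)
        ).fetchone()
        return row[0], False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abi (id INTEGER PRIMARY KEY, abi_key TEXT NOT NULL UNIQUE)")

first_id, created = get_or_create_abi(conn, [{"name": "transfer", "type": "function"}])
same_id, created_again = get_or_create_abi(conn, [{"type": "function", "name": "transfer"}])
```

Attempting the insert first avoids the check-then-insert race: two writers cannot both succeed, and the loser falls back to a read.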