convert: add tensor hash general.hash.sha256 to kv store #8645

Conversation
I'm unconvinced this is desirable. One side effect of introducing this change is prolonging the conversion process. Say someone wants to convert Llama 3 405B that's stored on an HDD. How much longer is it going to take?
It's also quite easy to overwrite the hash and other metadata. A small change in a tensor will change the hash, so anyone who wants to keep a model from being trackable can do so without much effort.
I think conversion as bf16 and f16 will also generate different hashes.
Another thing: doesn't hashlib need to keep all the tensors in memory to calculate the hash? According to the docs, .update concatenates the subsequent calls, which would suggest that at some point all the tensors would be stored in memory, which would be an absolute deal-breaker when converting bigger models.
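For reference, hashlib's incremental API can be sketched as below. This is an illustrative snippet only (the chunk size and stand-in data are made up, not taken from the converter): `update()` folds each chunk into a fixed-size internal state, so the question of whether the chunks themselves are retained can be checked directly.

```python
import hashlib

# Stand-in for tensor data; in the converter this would be the
# model's tensor bytes.
data = b"tensor-bytes-" * 100_000

# hashlib keeps only a fixed-size internal state; update() folds each
# chunk into that state rather than retaining the chunk.
streamed = hashlib.sha256()
for start in range(0, len(data), 65536):  # feed 64 KiB at a time
    streamed.update(data[start:start + 65536])

# Hashing the same bytes in a single call yields the identical digest.
assert streamed.hexdigest() == hashlib.sha256(data).hexdigest()
print(streamed.hexdigest()[:16])
```

Because the two digests match, streaming chunk-by-chunk is equivalent to hashing everything at once, without holding all the data in memory.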
I agree with @Galunid regarding the overhead (both CPU-wise and memory-wise). This also has the exact same problems as the UUID autogeneration, because the hash for an […]. If you truly want this to work as an integrity check, then […]. But the way […]
@Galunid No, […]. But in this case, reading the tensor data from a […]. An eventual solution would be to put metadata at the end of GGUF model files, which would also help with editing metadata without rewriting all of the data (good for tokenizer fixes too). But this requires deeper format changes, although it might be possible to fit this backward-compatibly into the existing GGUF v3 (if you have ideas for this, feel free to explore them). But as this PR stands, I think it has the following problems: […]
@Galunid the sha256 sum by itself will not consume memory; as @compilade said, it keeps a running hash as bytes come into it. However, I see your point regarding the impact on lazy loading, and I don't see any way around it, so I am inclined to close this PR. Maybe if someone really needs this, they could just run llama-gguf-hash on load and feed the result to their database.
@compilade regarding the idea of extending the end of the gguf file as an extension: GG is heavily against the idea unless it is truly unavoidable, as he would prefer ensuring backwards compatibility via the kv store. That's not to say it won't happen in the future, but if we do it then we had better have a good reason... or spin off a new file format standard not encumbered by the past. (If so, then I'll suggest using CBOR over inventing our own structured format for metadata... and of course sticking the metadata at the end like you suggest.)
That's a bit strange; does llama-gguf-hash also show a difference? Is this a comparison between 'safetensors to GGUF Q8' vs 'GGUF F32 to GGUF Q8'? If so then maybe it's valid to have a difference, since there might be a slight difference in behavior when converting to Q8 from two different float formats.
No, the difference is only in the metadata, because of the hash introduced in this PR, which differs because it depends on the output when converting and it's not updated by […]. Another solution (instead of updating the hash in […])
There is no difference in behavior. See #7234. Internally, […]
While autogeneration of a UUID is a bit controversial, I decided to adapt the logic a bit into a straight-up sha256 tensor hash in the kv store as general.hash.sha256.
While there are already other choices like xxhash and sha1, in this context I think we get better value from a known strong cryptographic hash method like sha256. I think this would also pave the way for self-signed gguf files, so you can be sure a file came from a known entity.
I also thought about a 'per tensor layer' hash, but I'm not sure how useful it would be right now, as per-layer tensor hashing seems to be more of a 'developer debugging tool' at this stage. So it's best to keep to whole-model tensor hashing instead.
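As a sketch of the idea (not the actual conversion code; the tensor names and values below are made up), the whole-model hash amounts to a single running sha256 digest over each tensor's raw bytes in writing order:

```python
import hashlib
import struct

# Stand-in tensors as (name, float32 values); in the converter these
# would be the model's tensors in the order they are written.
tensors = [
    ("blk.0.attn_q.weight", [0.0, 1.0, 2.0, 3.0]),
    ("blk.0.attn_k.weight", [1.0, 1.0]),
]

h = hashlib.sha256()
for name, values in tensors:
    raw = struct.pack(f"<{len(values)}f", *values)  # little-endian f32 bytes
    h.update(raw)  # fold each tensor's bytes into the single running digest

whole_model_hash = h.hexdigest()
print(whole_model_hash)
```

Note that the digest is over the raw bytes, so converting the same weights to a different dtype (e.g. bf16 vs f16) produces different bytes and therefore a different hash, as pointed out earlier in this thread.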
For model repo maintainers like huggingface, this has immediate use in being able to track models even when KV metadata has been updated (e.g. fixing authorship metadata).
For anyone who may be interested: you might want to add logic to llama-gguf-hash to self-check gguf tensor data when this hash is present in the kv store. I opted against doing it, as I wasn't sure of the utility yet and it would be more work than this current PR.
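Such a self-check could look roughly like the following. This is a hypothetical helper, not part of llama-gguf-hash (which is a C++ tool); reading the stored value of general.hash.sha256 out of the file is elided, and the byte chunks here are toy data:

```python
import hashlib

def self_check(tensor_chunks, stored_hex):
    """Recompute the running sha256 over the tensor byte chunks and
    compare it to the hash stored in the kv store (stored_hex is
    assumed to be the value of general.hash.sha256)."""
    h = hashlib.sha256()
    for chunk in tensor_chunks:
        h.update(chunk)
    return h.hexdigest() == stored_hex

# Toy example: the "model" is two chunks of bytes, and the stored
# hash is the digest over their concatenation.
chunks = [b"tensor-a", b"tensor-b"]
stored = hashlib.sha256(b"tensor-atensor-b").hexdigest()
print(self_check(chunks, stored))  # expected: True
```

A mismatch would indicate either tensor-data corruption or a hash that was not regenerated after the tensors changed.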
Testing process I did
During conversion you would get this new printout:
Checked with gguf-dump --markdown that I can see the new entry:
Checked that the sha256 is consistent with llama-gguf-hash:
So at least the sha256 process appears to be consistent. The logic is similar to my earlier attempt at an autogenerated UUID, which was also consistent, so an error is less likely to have crept in in this context.