Encryption in IPLD

We’ve been talking about encryption for a long time but haven’t done much to actually accommodate it.

In the meantime, people have built encrypted applications on IPLD using application-specific encryption schemes. There’s a lot we can learn from these approaches, but one thing they all have in common, and that we need to be concerned with, is that they can’t make use of the higher-level tools we’ve been creating for generic IPLD usage. Specifically, they only get limited utility from Schemas and Selectors.

Below I’ve tried to brain dump my current state on how to tackle this problem and the different considerations that are bouncing around.

## “Fat Pointers”

The loose consensus among the IPLD team has been that encryption is best handled by some use of a “fat pointer.” However, the *meaning* of fat pointer varies, so it’s worth discussing the different meanings before we get into the different approaches.

### Fat Pointer as a Node (using a Schema)

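```ipldsch
# A link plus everything needed to decrypt and decode its target.
type EncryptedLink struct {
  # link to the encrypted block
  cid Link
  # presumably the public key of the party that encrypted the data
  fromPublicKey Bytes
  # presumably the public key of the intended reader
  toPublicKey Bytes
  # identifier of the encryption algorithm
  algo String
  # codec of the data once it has been decrypted
  codec Int
}
```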

**Pros**
* We can create pointers anywhere we can put a node, which means anywhere inside of a block.
* We would not need to update or improve most of our existing primitives.

**Cons**
* Expensive. This adds a lot of overhead to every link, and almost all of it is repeated across links in a way that cannot be de-duplicated.
* This more or less ditches the CID as the link mechanism. We’d have to use `raw` blocks everywhere and write the codec information into this link instead. Because the information about how to decode the block now lives outside the CID, we’d still end up having to update the read portions of our stack to accommodate it.
* Doesn’t easily interop with existing data structures. Every data structure spec we have written in IPLD Schema would need to be updated to put Unions in every place we *might* want encrypted data.
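
To make the last two points concrete, here is a minimal Python sketch of what a reader has to do at every link site once a link can be either a plain CID or an `EncryptedLink`. The `PlainLink` type, the `decode_plan` helper, the string CIDs and the `"example-algo"` name are all hypothetical and only illustrate the branching; they are not an existing IPLD API. The codec numbers are the multicodec codes for `raw` (0x55) and `dag-cbor` (0x71).

```python
from dataclasses import dataclass
from typing import Optional, Union

RAW = 0x55       # multicodec code for raw
DAG_CBOR = 0x71  # multicodec code for dag-cbor


@dataclass(frozen=True)
class PlainLink:
    """Stand-in for an ordinary CID: the codec travels inside the link."""
    cid: str
    codec: int


@dataclass(frozen=True)
class EncryptedLink:
    """Mirrors the schema above: the codec now lives outside the CID."""
    cid: str
    from_public_key: bytes
    to_public_key: bytes
    algo: str
    codec: int


# The Union every existing schema would need wherever a link may be encrypted.
AnyLink = Union[PlainLink, EncryptedLink]


def decode_plan(link: AnyLink) -> dict:
    """Describe how a reader must fetch and decode the target of a link."""
    if isinstance(link, EncryptedLink):
        # The block is opaque bytes until decrypted, so it is fetched as raw
        # and the real codec is read from the link instead of the CID.
        return {"fetch": link.cid, "fetch_as": RAW,
                "decrypt_with": link.algo, "then_decode_as": link.codec}
    decrypt_with: Optional[str] = None
    return {"fetch": link.cid, "fetch_as": link.codec,
            "decrypt_with": decrypt_with, "then_decode_as": link.codec}
```

Every code path that follows links needs this kind of branch, which is the read-side change the second con describes.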

### Fat Pointer as a Block (using a Multiformat Codec Identifier)

This approach is what something like `COSE` would lean towards.

```
dag-cbor block -> dag-cose block -> raw
```

**Pros**
* This extends a standard incorporating the needs of several parties across the industry.
* We would not need to update or improve most of our existing primitives.

**Cons**
* This has all the same drawbacks as the method above, but much worse, since *every* encrypted Node would need a unique block to use as a link.

### Fat Pointer as a CID (presumably CIDv2)

We could extend CID to include information about how to decrypt the linked block.

There are a few different ways to accomplish this, but the one that makes the most sense to me would be to make `CIDv2` the “fat pointer” version of a CID. Basically, it’s a CID with two multihashes. One multihash identifies the block containing the “fat pointer” and the second identifies the actual data. Unlike CIDv1, the codec identifier only tells you how to decode the “fat pointer,” and the pointer includes the information needed to decode the block data for the second multihash.

The reason this needs to be two multihashes in a single CID, rather than including the second CID in the “fat pointer,” is so that you can de-duplicate common fat pointers. This would allow us to de-duplicate the public key information for all encrypted data in a way that we can’t with the prior approaches.
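
As a rough illustration, a two-multihash CID could be laid out like this. The byte layout (version varint, codec varint, pointer multihash, data multihash) and the `cid_v2_sketch` name are assumptions made for this sketch, since no CIDv2 format has been specified here. The varint and multihash encodings follow the existing multiformats conventions, with sha2-256 (0x12) as the hash.

```python
import hashlib

SHA2_256 = 0x12  # multihash code for sha2-256
DAG_CBOR = 0x71  # multicodec code for dag-cbor


def varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def multihash_sha256(data: bytes) -> bytes:
    """Build a multihash: hash code, digest length, digest."""
    digest = hashlib.sha256(data).digest()
    return varint(SHA2_256) + varint(len(digest)) + digest


def cid_v2_sketch(pointer_codec: int, pointer_block: bytes,
                  encrypted_block: bytes) -> bytes:
    """Hypothetical layout: <version=2><codec><mh(pointer)><mh(data)>.

    The codec describes only the fat pointer block; how to decode the
    data block is stated inside the fat pointer itself.
    """
    return (varint(2) + varint(pointer_codec)
            + multihash_sha256(pointer_block)
            + multihash_sha256(encrypted_block))


# Two encrypted blocks that share one fat pointer block.
pointer = b"placeholder bytes for an encoded fat pointer"
cid_a = cid_v2_sketch(DAG_CBOR, pointer, b"ciphertext A")
cid_b = cid_v2_sketch(DAG_CBOR, pointer, b"ciphertext B")
```

Both CIDs carry the same pointer multihash, so the fat pointer block itself is stored once no matter how many data blocks reference it.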

**Pros**
* Cheapest option. The multihash is smaller than the encryption information, so there’s a savings as soon as you have two blocks encrypted with the same information.
* Deduplication. You can encode decryption information into a single block and use that for as many encrypted blocks as you want.
* Extensible. We can leverage this for other “fat pointers” in the future.
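
The cost claim can be checked with some back-of-the-envelope arithmetic. The sketch below assumes sha2-256 multihashes (34 bytes each) and roughly 80 bytes of encryption information per link (two 32-byte keys plus an algorithm name and codec); both numbers and both function names are assumptions for illustration, not measurements.

```python
MULTIHASH_LEN = 34  # sha2-256 multihash: 1 byte code + 1 byte length + 32 byte digest


def inline_cost(n_links: int, info_len: int) -> int:
    """Bytes spent when every link repeats the encryption information."""
    return n_links * info_len


def shared_cost(n_links: int, info_len: int) -> int:
    """Bytes spent when the information is stored once in a fat pointer
    block and each link only adds one extra multihash."""
    return info_len + n_links * MULTIHASH_LEN


INFO_LEN = 80  # assumed size of the encryption information

for n in (1, 2, 10, 1000):
    print(n, inline_cost(n, INFO_LEN), shared_cost(n, INFO_LEN))
```

With these assumptions the shared layout costs more for a single link and less from two links onward. More generally, the break-even lands at two links whenever the encryption information is larger than two multihashes.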

**Cons**
* This changes a lot more in our stack than the other approaches. While it makes some things a bit easier, I haven’t even thought through all the implications across the stack.
* The extensibility is actually a little concerning. Application developers could use this to embed all kinds of new information without having thought through all the considerations. For instance, Schemas only validate that a link is a CID, which means that these fat pointers MUST resolve to a single node. If you did something like embed a selector as a fat pointer that resolved to multiple nodes, you’d cause considerable breakage elsewhere in the stack.

## Signing vs Encryption

In reading @johnnycrunch’s notes on COSE I realized something: most of what we’ve been thinking of as signing use cases are really encryption use cases.

When we talk about signing we tend to use it to describe “ownership.” If you sign something you are saying “this is mine” or “I made this.” But it’s entirely possible, even quite easy given how links work, to sign the work of other people. This is always a concern when you structure data to be signed, but it’s particularly problematic for us because you’re signing the *link*, and it’s very easy to reuse that hash elsewhere with some other ownership attached to it, which is neither accurate nor secure.

What you probably want most of the time is not signing but encryption against a publicly decryptable secret or key. This way, the link hash is unique to data encrypted with your publicKey and nobody else can “sign” that hash. Sure, they can decrypt the data and produce their own but they’ll have a different hash.
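
The difference shows up directly in the link hashes. The sketch below uses a toy XOR keystream built from SHA-256 purely to show the effect on the hash; it is NOT a secure cipher, and the function names and key strings are made up for this example. A real scheme would use a vetted algorithm.

```python
import hashlib


def toy_xor_cipher(key: bytes, data: bytes) -> bytes:
    """Toy, insecure stream cipher for illustration only.

    XORs the data with a SHA-256 based keystream, so applying it twice
    with the same key returns the original bytes.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))


def link_hash(block: bytes) -> str:
    """Stand-in for the multihash a link to this block would carry."""
    return hashlib.sha256(block).hexdigest()


data = b"some shared content"

# Signing leaves the block untouched, so every signer points at the same
# hash and anyone can attach their own signature to it.
signed_hash = link_hash(data)

# Encrypting against a published key gives each publisher a distinct hash.
alice_block = toy_xor_cipher(b"alice published key", data)
bob_block = toy_xor_cipher(b"bob published key", data)
```

Anyone holding the published key can still recover `data` from `alice_block`, but re-encrypting it under another key yields a different block and therefore a different link.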

## Replication Keys

Several content-addressed encryption schemes have employed a two-tier system for encryption so that data can be replicated using a “replication key” that exposes the full graph of links but cannot decrypt the other data.

This is a good system, but it can also easily double the cost of encryption if “fat pointers” can’t be de-duplicated.

Using the CIDv2 proposal, this could be done with only two “fat pointer” blocks being created for an entire graph of encrypted data, whereas the other methods will end up doubling the overhead.
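
Here is a small model of the two-tier idea. It contains no real cryptography: the hypothetical `Sealed` class only records which key opens a value and refuses any other key, which is enough to show what a holder of the replication key can and cannot do. All names and the string CIDs are invented for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sealed:
    """Stand-in for ciphertext: only the matching key id can open it."""
    key_id: str
    value: object

    def open(self, key_id: str) -> object:
        if key_id != self.key_id:
            raise PermissionError("wrong key")
        return self.value


@dataclass(frozen=True)
class Block:
    links: Sealed    # list of child CIDs, opened with the replication key
    payload: Sealed  # application data, opened with the content key


def replicate(store: Dict[str, Block], root: str, replication_key: str) -> List[str]:
    """Walk the whole graph using only the replication key."""
    seen: List[str] = []
    stack = [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.append(cid)
        stack.extend(store[cid].links.open(replication_key))
    return seen


store = {
    "root": Block(Sealed("repl", ["a", "b"]), Sealed("content", "root data")),
    "a": Block(Sealed("repl", []), Sealed("content", "a data")),
    "b": Block(Sealed("repl", ["a"]), Sealed("content", "b data")),
}
replicated = replicate(store, "root", "repl")
```

In this model there are only two distinct sets of key information in the whole graph, one per tier, which is why two de-duplicated fat pointer blocks would be enough under the CIDv2 approach.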

## Conclusions

Having thought about this for a while now, I think that encryption is best understood as part of a larger and more generic problem set that we have yet to tackle in IPLD.

There are use cases that require more information than just the codec identifier to decode the block data. I think that, historically, we’ve pushed back on these because we were trying to carve out exactly what the Block layer is meant to provide and what should be pushed up the stack. I can point to many times that others and I have tried to push application-specific considerations into this layer, only to realize months later how problematic that is.

Now that we have a better idea of where these boundaries are, I think that we can start tackling this problem set.
