
Using IPFS in the OSM infrastructure #388

@RubenKelevra

Description


Hey guys, hope you're all doing fine in the current situation & as a long-time mapper I'd like to thank you for all the work you put into this project! :)

Interplanetary Filesystem

IPFS is a network protocol that allows exchanging data efficiently in a worldwide mesh network. Content is addressed by a Content-ID (CID) - by default a SHA-256 hash - which ensures that the content wasn't altered.

All interconnections are established and terminated dynamically, based on your requests to the daemon and the queries in the global Distributed Hash Table (DHT), which is used to resolve Content-IDs to peers, and peers to IP addresses and ports.
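
To make the addressing model concrete, here is a minimal sketch, assuming a local IPFS daemon with its HTTP RPC API on the default port 5001 and Python with the requests library (the file content is just an illustration):

```python
import requests

API = "http://127.0.0.1:5001/api/v0"  # default local daemon RPC endpoint

# Add some bytes; the daemon returns the Content-ID (CID) derived from the data itself.
resp = requests.post(f"{API}/add", files={"file": ("hello.txt", b"hello OSM\n")})
cid = resp.json()["Hash"]
print("CID:", cid)

# Read the data back by its CID; every block is checked against its hash on the
# way in, so tampered data would never be returned.
data = requests.post(f"{API}/cat", params={"arg": cid}).content
assert data == b"hello OSM\n"
```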

Storage concept

There are multiple data types, but the most interesting for you is UnixFS (files and folders). A Content-ID of a folder is immutable and thus ensures that all data inside a folder can be verified after receiving it.

IPFS has a built-in 'name system' that lets you assign a static ID (the public key of an RSA or ed25519 key) and point it to changing content. This way you can switch a link from one folder version to a different folder version atomically. The static IDs are accessed through /ipns/ and the content IDs through /ipfs/ on a web gateway.
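
As a rough sketch of that atomic switch, assuming a local daemon on the default RPC port and a folder that has already been added (the CID below is a hypothetical placeholder):

```python
import requests

API = "http://127.0.0.1:5001/api/v0"
new_folder_cid = "bafy..."  # hypothetical: CID of the new folder version

# Point the node's static name (its public key) at the new folder version.
pub = requests.post(f"{API}/name/publish",
                    params={"arg": f"/ipfs/{new_folder_cid}"}).json()
print("static name:", f"/ipns/{pub['Name']}", "now points to", pub["Value"])

# Anyone can resolve the static name back to the currently published CID.
res = requests.post(f"{API}/name/resolve", params={"arg": pub["Name"]}).json()
print("resolves to:", res["Path"])
```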

An example page via a CID-Link on a Gateway

Software for end-users

But you don't need a gateway to access such URLs: there are also browser plugins (for Firefox and Chrome) which can resolve and access them directly, desktop clients (for Windows / Linux / macOS), and a wget replacement which uses IPFS directly to fetch the URL.

Backwards compatibility

You can offer a webpage which is accessible via HTTP(S) and IPFS at the same time. The browser plugins automatically detect whether a webpage has a DNSLink entry and will switch to IPFS. All IPFS-project pages, for example, are stored on an IPFS cluster, served by a regular web server, and can also be fetched by the browser plugins.
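
A small sketch of how that detection works under the hood, assuming a local daemon on the default RPC port; the domain is only an illustration, and the DNS side is a TXT record on _dnslink.<domain> of the form dnslink=/ipfs/<cid>:

```python
import requests

API = "http://127.0.0.1:5001/api/v0"
domain = "example.org"  # hypothetical domain carrying a DNSLink TXT record

# The daemon looks up the _dnslink TXT record and returns the current /ipfs/ path,
# which is what the browser plugins switch to.
res = requests.post(f"{API}/resolve", params={"arg": f"/ipns/{domain}"}).json()
print(f"/ipns/{domain} ->", res["Path"])
```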

On the website itself, you can link a URL to one of the web gateways, so that users with regular browsers can access the data without having to install anything.

If the link points to a folder, it looks like this dataset.

Cluster

IPFS alone does not guarantee data replication; everything is just stored locally for other clients to access. To achieve data replication, you need the cluster daemon. It maintains a set of elements and lets you add or remove them. Each element can be tagged with an expiry time (after which it will be removed automatically) and with a minimum and maximum replication factor.

The maximum sets the number of copies created on add, while a drop below the minimum automatically triggers additional replication of the data.
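
A minimal sketch of such a pin, assuming an ipfs-cluster peer with its REST API on the default port 9094; the CID is a placeholder and the exact query-parameter names should be checked against the ipfs-cluster API documentation for the release in use:

```python
import requests

CLUSTER_API = "http://127.0.0.1:9094"  # default ipfs-cluster REST API
cid = "bafy..."                        # hypothetical CID of a planet dump or tile set

resp = requests.post(
    f"{CLUSTER_API}/pins/{cid}",
    params={
        "name": "planet-latest",  # human-readable label (illustrative)
        "replication-min": 2,     # below this, the cluster re-replicates automatically
        "replication-max": 5,     # number of copies created when the pin is added
    },
)
print(resp.status_code, resp.json())
```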

Altering the cluster-configuration

A cluster can grow or shrink dynamically without any configuration needed, and new data is preferentially allocated to the peers with the most free space. This way every new peer in the cluster extends the available storage of the cluster.

Write access to the cluster is defined in the cluster configuration file (a JSON file), which lists the public keys that are allowed to alter the set of elements.
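
For illustration, a sketch of what the relevant fragment of such a configuration could look like; the field names follow the ipfs-cluster CRDT configuration as far as I understand it, and the cluster name and peer IDs are placeholders:

```python
import json

# Illustrative fragment of the cluster configuration file (service.json).
service_json_fragment = {
    "consensus": {
        "crdt": {
            "cluster_name": "osm-mirror",  # hypothetical cluster name
            # Only these peer IDs (public keys) may alter the pinset;
            # everyone else can follow the cluster read-only.
            "trusted_peers": [
                "12D3KooW...ops-team-peer-1",
                "12D3KooW...ops-team-peer-2",
            ],
        }
    }
}
print(json.dumps(service_json_fragment, indent=2))
```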

Adding cluster members

Following a cluster is very simple: everyone with a locally running IPFS daemon can start a cluster-follower, which reads the cluster configuration file and communicates with the local IPFS daemon to do the necessary replications.

Such public collaboration clusters have been available since the last release of IPFS-Cluster, and some of them are listed here:

https://collab.ipfscluster.io/
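
As a sketch, joining such a cluster could look roughly like this, assuming the ipfs-cluster-follow binary is installed and a local IPFS daemon is running; the cluster name and configuration URL are placeholders, the real ones are listed on the page above:

```python
import subprocess

cluster_name = "osm-mirror"                                 # illustrative name
config_url = "https://example.org/osm-mirror/service.json"  # hypothetical URL

# Fetch the cluster's configuration once ...
subprocess.run(["ipfs-cluster-follow", cluster_name, "init", config_url], check=True)
# ... then run the follower; it talks to the local IPFS daemon and replicates
# whatever the trusted peers have pinned.
subprocess.run(["ipfs-cluster-follow", cluster_name, "run"], check=True)
```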

Server Outages

Server outages are not an issue. The cluster has no 'master' that is required for operation. Nodes with write access can go completely offline while the data remains available.

Server outages of third parties might trigger additional copies of the data, if necessary, to guarantee availability inside the cluster.

If a server of the cluster comes back online, it will receive the full delta of the cluster metadata, catch up, and continue operation automatically.

Data integrity

All data is checked for integrity block by block (default block size: max. 256 KiB) via SHA-256 sums, according to the CID (and its metadata).

Tamper resistance

The data held on the mirrors cannot be tampered with, since IPFS would simply reject such data because of the wrong checksum. Nobody without your keys can write to the cluster, and nobody without your keys can alter the IPFS name-system entry.

Community aspect

IPFS allows easy read access to the files on the mirrors, but it also allows everyone in the community to set up a cluster follower without having to list an additional URL on a wiki page that needs to be cleaned up when some of the servers are no longer available, etc.

Disaster recovery

Private key for Cluster-Write-Access lost

If the write key of a cluster is lost, a new cluster has to be created. This requires a daemon restart with a new configuration file and a refetch of the cluster metadata by all cluster-followers. Data integrity is unaffected, since the data stays online and stays the same on a reimport.

This can be mitigated by an alternative write key which is securely stored in a backup location.

Complete data loss on all (project) servers

Since there are third-party servers, data integrity won't be affected. Regarding write access, see above.

Data-integrity issues on a cluster server

On the affected server, the data store needs to be verified. All data with errors will be removed and refetched by the cluster-follower.
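
A minimal sketch of that verification step, assuming the ipfs binary is installed ('ipfs repo verify' re-hashes all local blocks; depending on the version it may need to run while the daemon is stopped):

```python
import subprocess

# Re-hash every locally stored block and report any that fail their checksum.
result = subprocess.run(["ipfs", "repo", "verify"], capture_output=True, text=True)
print(result.stdout)

if result.returncode != 0:
    # Corrupt blocks can then be removed; the cluster-follower will fetch the
    # affected data again from other peers in the cluster.
    print("verification found problems:", result.stderr)
```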

If the databases are affected too, IPFS can be wiped, and the ipfs-cluster-follower can be wiped as well (in both cases the private key doesn't need to be maintained).

The follower and IPFS can then be restarted; they will pull the full metadata history again and then receive any newly written data.

If the follower identity is maintained (the private key isn't wiped), the cluster-follower will fetch its part of the replication again.

Data loss on the whole cluster

If some data is completely lost on the cluster, it can be restored by adding the same data again on any IPFS node. An offline backup, for example, can thus restore the data on the cluster.
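
A short sketch of that restore property, assuming a local daemon on the default RPC port and that the backup is re-added with the same chunking settings as the original import (the filename is a placeholder):

```python
import requests

API = "http://127.0.0.1:5001/api/v0"

# Re-adding identical bytes produces the identical CID, so the existing cluster
# pin is satisfied again once this node announces the data.
with open("planet-backup.osm.pbf", "rb") as backup:  # hypothetical backup file
    resp = requests.post(f"{API}/add", files={"file": backup})

print("restored CID:", resp.json()["Hash"])
```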

Data transfer speed

Netflix did a great job improving Bitswap, the IPFS component which organizes the data transfers. There's a blog post about that. It will be part of the next major release, due within the next month.

Archiving via IPFS-cluster

Since IPFS allows everyone to replicate the data easily and thereby offer redundancy, it might be an interesting solution for your backups as well - in a second cluster installation.

A third party outside of the main team could hold the write access to this archiving/backup cluster. The main team adds all backup files to IPFS, and the third party adds the CID of the backup folder to the cluster pinset.

If files from the mirror cluster should be archived, they can simply be added to the backup cluster via their Content-ID and are transferred automatically.
