This repository was archived by the owner on Apr 16, 2020. It is now read-only.

Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and then the World) #104

Open
flyingzumwalt opened this issue Jan 16, 2017 · 1 comment

Comments


flyingzumwalt commented Jan 16, 2017

The Main Epic: Replicate 350 TB of Data Between 3 Peers (and the World)

People (hypothetical):

  • Jack (Stanford)
  • Michelle (U Toronto/EDGI)
  • Amy (a university in the Midwest)
  • IPFS team
  • Anyone out there following along

Technical Considerations:

If we can roll out filestore in time (see #95 and #91), we can update this plan to have Jack tell ipfs to "track" the data rather than "add" it to ipfs. This would allow him to serve his original copy of the dataset directly to the network without creating a second copy on his local machines. In the meantime, we can start the experiment using ipfs add with smaller volumes of data (i.e. 5-10 TB). This will allow us to start surfacing and addressing issues in the following areas (a command-line sketch of both workflows follows the list):

  • Providers UX
  • Blockstore Performance
  • Delegated Content Routing
  • Memory Usage
  • Deployment/Ops Experience
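
For reference, a minimal sketch of the two workflows from the command line. The filestore flags are illustrative only, since #95 and #91 are still in flight, and the data path is a placeholder:

```sh
# Current workflow: ipfs add copies the data into the local repo,
# so Jack ends up with a second copy on disk.
ipfs add -r /data/datagov/batch-001

# Proposed filestore workflow (see #95, #91): reference the files in place
# and serve them without duplicating them. Exact flag names may differ
# from what eventually ships.
ipfs add -r --nocopy /data/datagov/batch-001
```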

Advance Prep: Downloading the Data & Setting up the Network

  1. Download all of data.gov #113 Jack downloads all of data.gov (~350 TB) to storage devices on Stanford's network
  2. Institutional Collaborators Install and Configure IPFS #114 Jack, Michelle and Amy install and configure ipfs
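
A minimal sketch of what per-node setup might look like, assuming the go-ipfs CLI and a private testbed in which the three institutions bootstrap only to each other. The storage cap, address, and peer ID are placeholders:

```sh
# one-time repo initialization on each collaborator's node
ipfs init

# raise the repo storage cap above the size of the planned test data
ipfs config Datastore.StorageMax 12TB

# keep the test traffic between the institutions: drop the default
# bootstrap peers and add the collaborators' nodes instead
ipfs bootstrap rm --all
ipfs bootstrap add /ip4/203.0.113.10/tcp/4001/ipfs/QmExamplePeerID   # placeholder address

# run the daemon (under a process supervisor in practice)
ipfs daemon
```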

Test-run: 5TB

  1. [awaiting instructions] Everyone sets up the monitoring tools so they can report on performance and provide information in case of errors
  2. Add the first 5 TB to IPFS and Publish the content to the DHT #117 Jack adds the first 5 TB to IPFS. The hashes get published to the testbed network's DHT (steps 2-5 are sketched below this list)
  3. Jack gives the root hash for the dataset to Michelle and Amy
  4. Replicate the first 5 TB to peers #118 Michelle and Amy pin the root hash on their ipfs nodes. The nodes replicate all of the data.
  5. Run tests to confirm that first 5TB were Replicated Properly #119 Michelle and Amy run tests to confirm that the data were successfully replicated
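
A sketch of steps 2-5 from the command line; the root hash and data path are placeholders:

```sh
# On Jack's node: add the first batch and note the root hash it prints
ipfs add -r /data/datagov/batch-001
# added QmExampleRootHash batch-001   (placeholder hash)

# On Michelle's and Amy's nodes: pinning the root recursively fetches every block
ipfs pin add QmExampleRootHash

# Spot checks that replication completed
ipfs pin ls --type=recursive | grep QmExampleRootHash   # root is pinned locally
ipfs refs -r QmExampleRootHash | wc -l                  # block count matches Jack's node
ipfs repo stat                                          # repo size is in the expected range
```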

Test-runs: 50 TB, 100 TB, 300 TB

Jack gradually adds more of the dataset to ipfs, giving the new root hashes to Michelle and Amy. They replicate the data.
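
One way Jack might keep a single, growing root as batches land is the mutable files (MFS) API; this is only a sketch, and the batch hashes are placeholders:

```sh
# collect each new batch under one directory in the mutable filesystem (MFS)
ipfs files mkdir /datagov
ipfs files cp /ipfs/QmBatch001Hash /datagov/batch-001   # placeholder hashes
ipfs files cp /ipfs/QmBatch002Hash /datagov/batch-002

# the directory hash changes as batches are added; this is the new root
# hash Jack hands to Michelle and Amy after each run
ipfs files stat --hash /datagov
```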

Move to the Public Network

After testing is complete, switch the nodes to the public/default IPFS network. Provide the blocks on the DHT and publish the root hashes for people in the general public to pin.
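
Assuming the testbed was isolated by swapping out the bootstrap list (as sketched above), rejoining the public network might look roughly like this on each node:

```sh
# restore the default bootstrap peers and restart the daemon so the node
# reconnects to the public network and re-provides its blocks on the DHT
ipfs bootstrap add --default
ipfs daemon
```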

Follow-up

At the end of the sprint, we will need to follow up on a lot of things. See #103.

@flyingzumwalt flyingzumwalt changed the title Main data.gov Epic: Replicate 300 TB of Data Between 3 Peers (and the World) Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and the World) Jan 16, 2017
@flyingzumwalt flyingzumwalt changed the title Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and the World) Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and then the World) Jan 16, 2017

flyingzumwalt commented Jan 20, 2017

UPDATE: Based on initial crawls of the first 3000 datasets, @mejackreed has revised his estimate of the total size of data.gov. The entire corpus of data.gov might only be between 1 TB and 10 TB. We have identified at least one other large climate dataset that we will try to download in addition to data.gov.

How this impacts the experiment

If it does turn out that the entire data.gov corpus is under 10 TB, it will impact this experiment in a couple of ways:

  1. More people will be able to participate in the network, pinning the entire corpus on their IPFS nodes
  2. The additional datasets, like this 30TB NOAA dataset, will be included in the experiment and replicated to institutional collaborators in order to test the system and back those datasets up temporarily, but it will be easy to pin or skip them independently of the main data.gov corpus (see the sketch after this list). At the very least, it will be much easier to find new homes for those datasets and to move them there over IPFS.
  3. The IPFS team will have to find an even bigger dataset to test our systems at loads over 100TB. 😄
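
Keeping a separate root hash per dataset is what makes the pin-or-skip choice in point 2 straightforward; a sketch with placeholder hashes:

```sh
# each institution decides per root hash
ipfs pin add QmDataGovRoot        # everyone keeps the core data.gov corpus
ipfs pin add QmNoaaDatasetRoot    # only nodes with spare capacity take the 30TB NOAA data

# once a dataset finds a permanent home elsewhere, it can be dropped locally
ipfs pin rm QmNoaaDatasetRoot
ipfs repo gc
```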
