This repository was archived by the owner on Apr 16, 2020. It is now read-only.

Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and then the World) #104

Open
flyingzumwalt opened this issue Jan 16, 2017 · 1 comment

Comments


flyingzumwalt commented Jan 16, 2017

The Main Epic: Replicate 350 TB of Data Between 3 Peers (and the World)

People (hypothetical):

  • Jack (Stanford)
  • Michelle (U Toronto/EDGI)
  • Amy (a university in the Midwest)
  • IPFS team
  • Anyone out there following along

Technical Considerations:

If we can roll out filestore in time (see #95 and #91), we can update this plan to have Jack tell ipfs to "track" the data rather than "add" it to ipfs. This would allow him to serve his original copy of the dataset directly to the network without creating a second copy on his local machines. In the meantime, we can start the experiment using ipfs add with smaller volumes of data (i.e. 5-10 TB). This will allow us to start surfacing and addressing issues in the following areas (a command-line sketch of both workflows follows the list):

  • Providers UX
  • Blockstore Performance
  • Delegated Content Routing
  • Memory Usage
  • Deployment/Ops Experience
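
For reference, a minimal sketch of the two workflows from the command line. The filestore flags are illustrative only, since #95 and #91 are still in flight, and the data path is a placeholder:

```sh
# Current workflow: ipfs add copies the data into the local repo,
# so Jack ends up with a second copy on disk.
ipfs add -r /data/datagov/batch-001

# Proposed filestore workflow (see #95, #91): reference the files in place
# and serve them without duplicating them. Exact flag names may differ
# from what eventually ships.
ipfs add -r --nocopy /data/datagov/batch-001
```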

Advance Prep: Downloading the Data & Setting up the Network

  1. Download all of data.gov #113 Jack downloads all of data.gov (~350 TB) to storage devices on Stanford's network
  2. Institutional Collaborators Install and Configure IPFS #114 Jack, Michelle and Amy install and configure ipfs
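
A minimal sketch of what per-node setup might look like, assuming the go-ipfs CLI and a private testbed in which the three institutions bootstrap only to each other. The storage cap, address, and peer ID are placeholders:

```sh
# one-time repo initialization on each collaborator's node
ipfs init

# raise the repo storage cap above the size of the planned test data
ipfs config Datastore.StorageMax 12TB

# keep the test traffic between the institutions: drop the default
# bootstrap peers and add the collaborators' nodes instead
ipfs bootstrap rm --all
ipfs bootstrap add /ip4/203.0.113.10/tcp/4001/ipfs/QmExamplePeerID   # placeholder address

# run the daemon (under a process supervisor in practice)
ipfs daemon
```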

Test-run: 5TB

  1. [awaiting instructions] Everyone sets up the monitoring tools so they can report on performance and provide information in case of errors
  2. Add the first 5 TB to IPFS and Publish the content to the DHT #117 Jack adds the first 5 TB to IPFS. The hashes get published to the testbed network's DHT (steps 2-5 are sketched below this list)
  3. Jack gives the root hash for the dataset to Michelle and Amy
  4. Replicate the first 5 TB to peers #118 Michelle and Amy pin the root hash on their ipfs nodes. The nodes replicate all of the data.
  5. Run tests to confirm that first 5TB were Replicated Properly #119 Michelle and Amy run tests to confirm that the data were successfully replicated
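
A sketch of steps 2-5 from the command line; the root hash and data path are placeholders:

```sh
# On Jack's node: add the first batch and note the root hash it prints
ipfs add -r /data/datagov/batch-001
# added QmExampleRootHash batch-001   (placeholder hash)

# On Michelle's and Amy's nodes: pinning the root recursively fetches every block
ipfs pin add QmExampleRootHash

# Spot checks that replication completed
ipfs pin ls --type=recursive | grep QmExampleRootHash   # root is pinned locally
ipfs refs -r QmExampleRootHash | wc -l                  # block count matches Jack's node
ipfs repo stat                                          # repo size is in the expected range
```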

Test-runs: 50 TB, 100 TB, 300 TB

Jack gradually adds more of the dataset to ipfs, giving the new root hashes to Michelle and Amy. They replicate the data.
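
One way Jack might keep a single, growing root as batches land is the mutable files (MFS) API; this is only a sketch, and the batch hashes are placeholders:

```sh
# collect each new batch under one directory in the mutable filesystem (MFS)
ipfs files mkdir /datagov
ipfs files cp /ipfs/QmBatch001Hash /datagov/batch-001   # placeholder hashes
ipfs files cp /ipfs/QmBatch002Hash /datagov/batch-002

# the directory hash changes as batches are added; this is the new root
# hash Jack hands to Michelle and Amy after each run
ipfs files stat --hash /datagov
```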

Move to the Public Network

After testing is complete, switch the nodes to the public/default IPFS network. Provide the blocks on the DHT and publish the root hashes for people in the general public to pin.
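
Assuming the testbed was isolated by swapping out the bootstrap list (as sketched above), rejoining the public network might look roughly like this on each node:

```sh
# restore the default bootstrap peers and restart the daemon so the node
# reconnects to the public network and re-provides its blocks on the DHT
ipfs bootstrap add --default
ipfs daemon
```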

Follow-up

At the end of the sprint, we will need to follow up on a lot of things. See #103.

@flyingzumwalt flyingzumwalt changed the title Main data.gov Epic: Replicate 300 TB of Data Between 3 Peers (and the World) Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and the World) Jan 16, 2017
@flyingzumwalt flyingzumwalt changed the title Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and the World) Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and then the World) Jan 16, 2017

flyingzumwalt commented Jan 20, 2017

UPDATE: Based on initial crawls of the first 3000 datasets, @mejackreed has revised his estimate of the total size of data.gov. The entire corpus of data.gov might only be between 1 TB and 10 TB. We have identified at least one other large climate dataset that we will try to download in addition to data.gov.

How this impacts the experiment

If it does turn out that the entire data.gov corpus is under 10 TB, it will impact this experiment in a couple of ways:

  1. More people will be able to participate in the network, pinning the entire corpus on their IPFS nodes
  2. The additional datasets, like this 30TB NOAA dataset, will be included in the experiment and replicated to institutional collaborators in order to test the system and back those datasets up temporarily, but it will be easy to pin or skip them independently of the main data.gov corpus (see the sketch after this list). At the very least, it will be much easier to find new homes for those datasets and to move them there over IPFS.
  3. The IPFS team will have to find an even bigger dataset to test our systems at loads over 100TB. 😄
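
Keeping a separate root hash per dataset is what makes the pin-or-skip choice in point 2 straightforward; a sketch with placeholder hashes:

```sh
# each institution decides per root hash
ipfs pin add QmDataGovRoot        # everyone keeps the core data.gov corpus
ipfs pin add QmNoaaDatasetRoot    # only nodes with spare capacity take the 30TB NOAA data

# once a dataset finds a permanent home elsewhere, it can be dropped locally
ipfs pin rm QmNoaaDatasetRoot
ipfs repo gc
```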
