IPFS (April, 2024)
==================

Consistent hashing
    Say you want to cache web pages on a number of machines
    Hash URL, and choose machine based on hash: e.g., machine = hash % nservers
    What happens when you add and remove machines?
        If URL -> server mapping changes completely, most cache data invalid
        What happens if a client hasn't yet heard of a new machine?
            Could use entirely wrong hash function
    Consistent hashing: only invalidate data in proportion to reconfiguration
        Add one new server, should invalidate only 1/N of data
How to implement consistent hashing?
    Give each node a number of randomly chosen IDs (e.g., 160-bit numbers)
    Compute key = Hash(URL), which is also in the same ID space
    Approach 1: map IDs onto circle [Chord]
        Store URL at node whose ID follows key on circle (successor)
        [Go over what happens when you add or remove a node]
    Approach 2: arrange node IDs in a tree (e.g., height 160) [Kademlia]
        Start with most significant bit of URL hash (key)
        Go down tree choosing left or right branch based on next bit of key
            If only one branch, just choose that one
        In other words, find node whose ID minimizes (key XOR ID)
        [Go over what happens when you add or remove a node]
    Problem: neighbor of failed node may bear brunt of load spike
    How to distribute failed node's load more evenly?
        Give each node multiple "virtual IDs"
        Keys close to failed node's multiple virtual IDs will be spread out
        Note: in heterogeneous environment, more virtual IDs for beefier machines
    Another approach: CARP [https://tools.ietf.org/html/draft-vinod-carp-v1-03]
            (a simplified sketch appears below, after the Kademlia discussion)
        For each server ID id_1, id_2, ..., compute h_i := H(id_i, URL)
        Sort the h_i, and take the server i with the highest "score" h_i
        In heterogeneous environments, can also scale h_i by capacity
        Advantages of CARP:
            No longer need multiple virtual IDs
            Servers can decline requests under high load, next nodes spread out
        Disadvantage? Lookup is always O(N), vs. O(log N) for, e.g., Kademlia
Now, say you want to build a gigantic data cache
    So big each machine only knows about a small fraction of other machines
    So big that nodes come and go all the time--nodes owned by many people
    Napster approach--centralized server coordinates, can be shut down
    Gnutella approach--ad-hoc configuration, broadcast queries very expensive
    Better approach: use consistent hashing for *routing* as well
        Store data as before on "closest" node to key in ID space
            Chord: closest node is closest clockwise around circle
            Kademlia: closest node is closest treating ID XOR as distance
        Build routing tables to allow ID-space navigation
            Each node knows about ID-space neighbors
            Perhaps each node knows a few farther-away nodes
                To move long distances quickly
        For n nodes, can do this with log n RPCs and O(log n) state at each node
    These data structures are called distributed hash tables (DHTs)
How would you do this in the Kademlia tree?
    If a node's (base 2) ID is b_0 b_1 b_2 ..., the node must know:
        - A node with first bit ~b_0 (other half of tree)
        - A node with prefix b_0 ~b_1 (other quarter in this half of tree)
        - A node with prefix b_0 b_1 ~b_2
        - etc.
    Then you can get one bit closer to the key with each RPC
    (Note: to handle unbalanced trees, each node actually must know all nodes
     under some point in the tree with two children.)
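To make these two ideas concrete, here is a minimal Python sketch (my own
illustration, not IPFS's, Kademlia's, or the CARP draft's actual code): a
Kademlia-style lookup that treats XOR as distance and derives the prefixes a
node's routing table must cover, plus a simplified CARP/rendezvous-hash server
selection. The 8-bit ID space, the use of SHA-256, and all of the names are
illustrative assumptions.

import hashlib

ID_BITS = 8  # illustrative only; real Kademlia uses 160-bit IDs

def make_id(name):
    """Hash an arbitrary name into the ID space (stand-in for the real hash)."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big") % (1 << ID_BITS)

def closest_nodes(key, nodes, k=3):
    """Kademlia stores a key on the k nodes minimizing (key XOR node ID)."""
    return sorted(nodes, key=lambda n: key ^ n)[:k]

def routing_prefixes(my_id):
    """Prefixes this node's routing table must cover: for each bit position,
    keep the bits above it and flip that bit (~b_0, b_0 ~b_1, b_0 b_1 ~b_2, ...).
    Returns (target, prefix_length) pairs, most significant bit first."""
    out = []
    for i in range(ID_BITS - 1, -1, -1):
        above = (my_id >> (i + 1)) << (i + 1)              # bits above position i, unchanged
        target = above | ((((my_id >> i) & 1) ^ 1) << i)   # bit i flipped, lower bits zero
        out.append((target, ID_BITS - i))
    return out

def carp_server(url, server_ids, capacity=None):
    """Simplified CARP/rendezvous hashing: highest score h_i = H(id_i, URL) wins,
    optionally scaled by per-server capacity (the real draft uses its own hash)."""
    def score(sid):
        h = int.from_bytes(hashlib.sha256((sid + url).encode()).digest()[:8], "big")
        return h * (capacity.get(sid, 1.0) if capacity else 1.0)
    return max(server_ids, key=score)

nodes = [make_id("node-%d" % i) for i in range(20)]
key = make_id("http://example.com/index.html")
print("key stored on", closest_nodes(key, nodes))
nodes.append(make_id("node-new"))   # a join only moves keys the new node is now closest to
print("after join, stored on", closest_nodes(key, nodes))
print("first routing prefixes of 0b10110010:", routing_prefixes(0b10110010)[:3])
print("CARP picks", carp_server("http://example.com/", ["s1", "s2", "s3"]))

Note the consistent-hashing property in the output: after "node-new" joins,
the key's closest set changes only if the new ID lands nearer to the key than
an existing member's.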
In a very large system, machines will be failing all the time
    If there are N machines and a constant fraction fail, then subsets of size
        < O(log N) stand a good chance of having all of their nodes fail
    Pick "replication factor" k such that a k-sized set is unlikely to fail simultaneously
    In Kademlia, should know up to k nodes with each prefix ~b_0..., b_0 ~b_1..., b_0 b_1 ~b_2...
        I.e., k nodes in other half of tree, other quarter of this half, etc.
        [Show tree, then say node 14 wants to look up ID 4]
    Must do *maintenance* to ensure you learn of dead nodes
    Note: advantage of Kademlia is symmetry
        You receive RPCs from nodes in the same distribution as you send them to
        So will often already know that nodes in routing table are alive
        Chord goes clockwise around circle, so incoming RPCs less useful
    Optimization
        Unlike Chord, Kademlia lets you contact *any* node in appropriate subtree
        So for better performance, cache nodes with low round-trip time (RTT)
        Note: failure causes RPC timeouts, which can seriously hurt latency
            Can trade off bandwidth for latency w. concurrent requests
Digression: bit shifting vs. bit correcting
    Using bit *shifting*, can actually have a constant-size routing table
        E.g., node with ID x knows nodes with IDs 2x and 2x+1
    But makes failure recovery and optimization harder--see Koorde DHT
Kademlia popularized by trackerless BitTorrent, now used by IPFS
What is the "decentralized web", and why do people want this?
    Maybe any infrastructure built on P2P systems like DHTs
    Serve content without relying on hyperscale companies
        But can't anyone create a web site--why is web centralized?
        Bandwidth limits and high charges [collaborative caching--CoralCDN]
        Could run into censorship or other regulatory issues
    Serve content related to blockchain (e.g., NFT images)
Now say you want to build a giant file store on a DHT
    Break files into blocks, name blocks with inode-like metadata
    Idea: index all data+metadata by collision-resistant hash (e.g., SHA-256)
        Just need root hash to traverse entire data collection
        Result is self-authenticating, assures integrity of data from random nodes
What does IPFS store in the Kademlia DHT?
    Provider records: content identifier (CID) -> provider PeerID
        where PeerID is a multihash of the provider's public key
        (a multihash is a collision-resistant hash from which the hash function is deducible)
    Peer records: PeerID -> multiaddress
    Actual data not stored in DHT; providers must decide to host it
What happens in IPFS to look up a file with root CID?
    Opportunistically ask all currently connected peers if they have the CID
    Use Kademlia to look up CID -> PeerID
    Is PeerID among the 900 cached values? If not, DHT lookup PeerID -> multiaddress
    Connect to multiaddress [peer routing], run Bitswap--what's this?
        Send "want-have" messages with CID(s) you want to peers
        Peers reply with "have" or "dont-have"
        Send "want-block" to one peer that has the block to download data
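To make the content-addressing and Bitswap steps concrete, here is a toy
in-memory sketch (my own illustration; the class, the message names as method
calls, and the "CID = hex SHA-256" shortcut are assumptions, not
libp2p/Bitswap's real API or wire format). Blocks are named by their hash, so
a fetched block can be verified against the CID it was requested under, and
the want-have / have / want-block exchange is simulated between two local
"peers".

import hashlib

def cid_of(block):
    """Stand-in CID: the hex SHA-256 of the block (real CIDs are multihash-based)."""
    return hashlib.sha256(block).hexdigest()

class Peer:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # CID -> block bytes this peer hosts

    def add(self, block):
        cid = cid_of(block)
        self.blocks[cid] = block
        return cid

    # Bitswap-style messages, as plain method calls instead of network RPCs
    def want_have(self, cids):
        return {cid: ("have" if cid in self.blocks else "dont-have") for cid in cids}

    def want_block(self, cid):
        return self.blocks.get(cid)

def fetch(client, connected_peers, cid):
    """Ask every connected peer whether it has the CID, then download from one."""
    for p in connected_peers:
        if p.want_have([cid])[cid] == "have":
            block = p.want_block(cid)
            assert cid_of(block) == cid   # self-authenticating: hash must match CID
            client.blocks[cid] = block
            return block
    raise KeyError("no connected peer has " + cid)

provider, client = Peer("provider"), Peer("client")
cid = provider.add(b"hello decentralized web")
print(fetch(client, [provider], cid))

In the real system, a peer that answers "dont-have" for every CID is exactly
the case where the client falls back to the DHT walk described above.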
What's wrong with original IPFS (w. vanilla DHT-based indexing)?
    Publishing data is super slow: 66.73-second 95th-percentile latency (p. 443)
        Why so expensive? Records must be replicated on 20 nodes
            Delay is dominated by the DHT walk to find the 20 closest nodes
        How much should we care about this?
            Data must be re-published every 24 hours--sounds bad
            Latency would be unacceptable for a live video/audio feed
            Latency probably okay for static data (e.g., NFT contents)
                But would need to overlap publication in parallel if many records
                Maybe could be smart by sorting records?
    Retrieving data is somewhat slow (2.72-second median latency)
        How much do we care?
            More than for publishing--don't want to wait for web pages to load
            Mixed HTTP and IPFS latencies make web pages act weird
    Hard to cache hot content when spread across many machines
        If caching happens after the DHT lookup, will still be slow
        [CoralCDN had caching along the DHT route]
    Most browsers don't support IPFS (big problem for mobile devices)
What are the proposed solutions to these problems?
    1. Interplanetary indexers--give up on DHT for storing records
        Publishers structure advertisement deltas into a hash chain
            Makes reconciliation easy if an indexer misses an update
        What are the downsides of this approach?
            Storage for chains grows without bound
            Might not scale to huge numbers of publishers
    2. Hydra boosters
        Hydra head: node with 1000s of Kademlia IDs
        Idea: all nodes one hop from at least one head in a booster
        Hydra database: DynamoDB shared by 10-15 Hydra heads
            Any head in booster can reply with a provider record stored by any other
    3. HTTP gateways--just serve IPFS over HTTP, with nginx caching
How do Hydra heads ensure roughly uniform depth of IDs? "Power of two" choices
    Background: say you are load-balancing requests to N symmetric servers
        Option 1: pick a random server for each request
        Option 2: pick 2 random servers, route to the one with the shorter queue
        Option 3: query 3 servers for load, pick the one with the lowest load
        Option 2 has exponentially better (in N) wait times than option 1
        Option 3 is just a constant-factor improvement over option 2
    For Hydra head node IDs: generate two random IDs, pick the one that keeps
        the trie depth more regular (a small simulation sketch follows below)
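Here is a quick simulation of the load-balancing background above (my own
sketch; the 50 servers, 5,000 requests, and seed are arbitrary, and queues
only ever grow, so this shows imbalance rather than true waiting times--but
the flavor of the result is the same).

import random

def simulate(num_servers=50, num_requests=5000, choices=1, seed=0):
    """Send requests to num_servers queues, routing each to the least-loaded of
    `choices` randomly sampled servers; return the longest resulting queue."""
    rng = random.Random(seed)
    queues = [0] * num_servers
    for _ in range(num_requests):
        candidates = rng.sample(range(num_servers), choices)
        target = min(candidates, key=lambda s: queues[s])
        queues[target] += 1
    return max(queues)

for c in (1, 2, 3):
    print("choices =", c, "-> longest queue:", simulate(choices=c))

In a typical run, one random choice leaves the longest queue noticeably above
the average of 100, two choices keep it close to 100, and a third choice adds
only a small further improvement. The Hydra heads apply the same trick to ID
placement instead of queue length.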
    How well does this ID balancing work? (Fig. 7, appendix)
        What would perfect look like? A straight line
        But not all Hydra boosters are the same size (10-15 nodes)
        Also, non-Hydra nodes not evenly distributed (could clump)
        Only 20% of nodes not 1 hop from a Hydra booster--pretty good
Are these measures effective?
    1. Yes (Figure 4): store way more records, retrieve way faster
        (Recall plots are log scale, so these are huge improvements)
        What does 4c tell us? It's all routing/connection-establishment time
            Latency almost "agnostic" to number of records advertised
    2. Hydra boosters are a bit disappointing
        Do shave DHT lookup hops (Figure 5b)
        But a -3.2% to +36.5% improvement is not revolutionary
    3. Makes IPFS accessible from phones and vanilla browsers
        Cached lookups much faster (Figure 5c)
            50.3% nginx cache hit rate; hits too fast to plot
            Node-cached (by IPFS) also significantly faster than DHT
        Non-cached faster than DHT at median, slower at tail--why?
            Paper says extra overhead from retrieving and forwarding, which is vague
            My hypothesis: DHT always offers a routing choice (20 nodes at each hop)
                But if the network path to the gateway is slow, you just have to keep waiting
How much do these improvements undermine the decentralization goal?
    1: A lot--many records only available in the indexer
    2: Not much--anyone could deploy Hydras unilaterally anyway
    3: Some--but no one has to use them, so maybe not so bad
    Too bad #1 seems like the most crucial for system operation at scale
Is IPFS censorship resistant, even w/o interplanetary indexers?
    Probably most nodes running at a handful of cloud providers
    DHTs themselves not robust to malicious participants--hydra hindrances?
    Can create a bogus provider that advertises but won't serve documents
    Can also go after providers--only the index is decentralized
        But more participants, so harder to coerce
        People operating nodes may have less liability for content
What high-level lessons can we draw from this paper?
    There's centralization in P2P systems
    There's demand for P2P systems, even when not perfectly decentralized
    Kademlia is slow--must architect around high latency
    Decentralized systems aren't that big
        A 15-node Hydra booster can be one hop from 80% of nodes
        Maybe we don't need (log n)-sized routing tables / maintenance traffic?
    Maybe try to benefit from locality?
        Maybe store related CIDs together
        Or don't chop collections up over as many CIDs?