IPFS (April, 2024)
==================

Consistent hashing
    Say you want to cache web pages on a number of machines
    Hash URL, and choose machine based on hash: e.g., machine = hash % nservers
    What happens when you add and remove machines?
        If URL -> server mapping changes completely, most cache data invalid
        What happens if a client hasn't yet heard of a new machine?
            Could use entirely wrong hash function
    Consistent hashing: only invalidate data in proportion to reconfiguration
        Add one new server, should invalidate only 1/N of data
How to implement consistent hashing?
    Give each node a number of randomly chosen IDs (e.g., 160-bit numbers)
    Compute key = Hash(URL), which is also in the same ID space
    Approach 1: map IDs onto circle [Chord]
        Store URL at node whose ID follows key on circle (successor)
        [Go over what happens when you add or remove a node]
    Approach 2: arrange node IDs in a tree (e.g., height 160) [Kademlia]
        Start with most significant bit of URL hash (key)
        Go down tree choosing left or right branch based on next bit of key
            If only one branch, just choose that one
        In other words, find node whose ID minimizes (key XOR ID)
        [Go over what happens when you add or remove a node]
    Problem: neighbor of failed node may bear brunt of load spike
    How to distribute failed node's load more evenly?
        Give each node multiple "virtual IDs"
        Keys close to failed node's multiple virtual IDs will be spread out
        Note: in heterogeneous environment, more virtual IDs for beefier machines
    Another approach: CARP [https://tools.ietf.org/html/draft-vinod-carp-v1-03]
            (a simplified sketch appears below, after the Kademlia discussion)
        For each server ID id_1, id_2, ..., compute h_i := H(id_i, URL)
        Sort the h_i, and take the server i with the highest "score" h_i
        In heterogeneous environments, can also scale h_i by capacity
        Advantages of CARP:
            No longer need multiple virtual IDs
            Servers can decline requests under high load, next nodes spread out
        Disadvantage? Lookup is always O(N), vs. O(log N) for, e.g., Kademlia
Now, say you want to build a gigantic data cache
    So big each machine only knows about a small fraction of other machines
    So big that nodes come and go all the time--nodes owned by many people
    Napster approach--centralized server coordinates, can be shut down
    Gnutella approach--ad-hoc configuration, broadcast queries very expensive
    Better approach: use consistent hashing for *routing* as well
        Store data as before on "closest" node to key in ID space
            Chord: closest node is closest clockwise around circle
            Kademlia: closest node is closest treating ID XOR as distance
        Build routing tables to allow ID-space navigation
            Each node knows about ID-space neighbors
            Perhaps each node knows a few farther-away nodes
                To move long distances quickly
        For n nodes, can do this with log n RPCs and O(log n) state at each node
    These data structures are called distributed hash tables (DHTs)
How would you do this in the Kademlia tree?
    If a node's (base 2) ID is b_0 b_1 b_2 ..., the node must know:
        - A node with first bit ~b_0 (other half of tree)
        - A node with prefix b_0 ~b_1 (other quarter in this half of tree)
        - A node with prefix b_0 b_1 ~b_2
        - etc.
    Then you can get one bit closer to the key with each RPC
    (Note: to handle unbalanced trees, each node actually must know all nodes
     under some point in the tree with two children.)
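To make these two ideas concrete, here is a minimal Python sketch (my own
illustration, not IPFS's, Kademlia's, or the CARP draft's actual code): a
Kademlia-style lookup that treats XOR as distance and derives the prefixes a
node's routing table must cover, plus a simplified CARP/rendezvous-hash server
selection. The 8-bit ID space, the use of SHA-256, and all of the names are
illustrative assumptions.

import hashlib

ID_BITS = 8  # illustrative only; real Kademlia uses 160-bit IDs

def make_id(name):
    """Hash an arbitrary name into the ID space (stand-in for the real hash)."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big") % (1 << ID_BITS)

def closest_nodes(key, nodes, k=3):
    """Kademlia stores a key on the k nodes minimizing (key XOR node ID)."""
    return sorted(nodes, key=lambda n: key ^ n)[:k]

def routing_prefixes(my_id):
    """Prefixes this node's routing table must cover: for each bit position,
    keep the bits above it and flip that bit (~b_0, b_0 ~b_1, b_0 b_1 ~b_2, ...).
    Returns (target, prefix_length) pairs, most significant bit first."""
    out = []
    for i in range(ID_BITS - 1, -1, -1):
        above = (my_id >> (i + 1)) << (i + 1)              # bits above position i, unchanged
        target = above | ((((my_id >> i) & 1) ^ 1) << i)   # bit i flipped, lower bits zero
        out.append((target, ID_BITS - i))
    return out

def carp_server(url, server_ids, capacity=None):
    """Simplified CARP/rendezvous hashing: highest score h_i = H(id_i, URL) wins,
    optionally scaled by per-server capacity (the real draft uses its own hash)."""
    def score(sid):
        h = int.from_bytes(hashlib.sha256((sid + url).encode()).digest()[:8], "big")
        return h * (capacity.get(sid, 1.0) if capacity else 1.0)
    return max(server_ids, key=score)

nodes = [make_id("node-%d" % i) for i in range(20)]
key = make_id("http://example.com/index.html")
print("key stored on", closest_nodes(key, nodes))
nodes.append(make_id("node-new"))   # a join only moves keys the new node is now closest to
print("after join, stored on", closest_nodes(key, nodes))
print("first routing prefixes of 0b10110010:", routing_prefixes(0b10110010)[:3])
print("CARP picks", carp_server("http://example.com/", ["s1", "s2", "s3"]))

Note the consistent-hashing property in the output: after "node-new" joins,
the key's closest set changes only if the new ID lands nearer to the key than
an existing member's.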
In a very large system, machines will be failing all the time
    If there are N machines and a constant fraction fail, then subsets of size
        < O(log N) stand a good chance of having all of their nodes fail
    Pick "replication factor" k such that a k-sized set is unlikely to fail simultaneously
    In Kademlia, should know up to k nodes with each prefix ~b_0..., b_0 ~b_1..., b_0 b_1 ~b_2...
        I.e., k nodes in other half of tree, other quarter of this half, etc.
        [Show tree, then say node 14 wants to look up ID 4]
    Must do *maintenance* to ensure you learn of dead nodes
    Note: advantage of Kademlia is symmetry
        You receive RPCs from nodes in the same distribution as you send them to
        So will often already know that nodes in routing table are alive
        Chord goes clockwise around circle, so incoming RPCs less useful
    Optimization
        Unlike Chord, Kademlia lets you contact *any* node in appropriate subtree
        So for better performance, cache nodes with low round-trip time (RTT)
        Note: failure causes RPC timeouts, which can seriously hurt latency
            Can trade off bandwidth for latency w. concurrent requests
Digression: bit shifting vs. bit correcting
    Using bit *shifting*, can actually have a constant-size routing table
        E.g., node with ID x knows nodes with IDs 2x and 2x+1
    But makes failure recovery and optimization harder--see Koorde DHT
Kademlia popularized by trackerless BitTorrent, now used by IPFS
What is the "decentralized web", and why do people want this?
    Maybe any infrastructure built on P2P systems like DHTs
    Serve content without relying on hyperscale companies
        But can't anyone create a web site--why is web centralized?
        Bandwidth limits and high charges [collaborative caching--CoralCDN]
        Could run into censorship or other regulatory issues
    Serve content related to blockchain (e.g., NFT images)
Now say you want to build a giant file store on a DHT
    Break files into blocks, name blocks with inode-like metadata
    Idea: index all data+metadata by collision-resistant hash (e.g., SHA-256)
        Just need root hash to traverse entire data collection
        Result is self-authenticating, assures integrity of data from random nodes
What does IPFS store in the Kademlia DHT?
    Provider records: content identifier (CID) -> provider PeerID
        where PeerID is a multihash of the provider's public key
        (a multihash is a collision-resistant hash from which the hash function is deducible)
    Peer records: PeerID -> multiaddress
    Actual data not stored in DHT; providers must decide to host it
What happens in IPFS to look up a file with root CID?
    Opportunistically ask all currently connected peers if they have the CID
    Use Kademlia to look up CID -> PeerID
    Is PeerID among the 900 cached values? If not, DHT lookup PeerID -> multiaddress
    Connect to multiaddress [peer routing], run Bitswap--what's this?
        Send "want-have" messages with CID(s) you want to peers
        Peers reply with "have" or "dont-have"
        Send "want-block" to one peer that has the block to download data
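To make the content-addressing and Bitswap steps concrete, here is a toy
in-memory sketch (my own illustration; the class, the message names as method
calls, and the "CID = hex SHA-256" shortcut are assumptions, not
libp2p/Bitswap's real API or wire format). Blocks are named by their hash, so
a fetched block can be verified against the CID it was requested under, and
the want-have / have / want-block exchange is simulated between two local
"peers".

import hashlib

def cid_of(block):
    """Stand-in CID: the hex SHA-256 of the block (real CIDs are multihash-based)."""
    return hashlib.sha256(block).hexdigest()

class Peer:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # CID -> block bytes this peer hosts

    def add(self, block):
        cid = cid_of(block)
        self.blocks[cid] = block
        return cid

    # Bitswap-style messages, as plain method calls instead of network RPCs
    def want_have(self, cids):
        return {cid: ("have" if cid in self.blocks else "dont-have") for cid in cids}

    def want_block(self, cid):
        return self.blocks.get(cid)

def fetch(client, connected_peers, cid):
    """Ask every connected peer whether it has the CID, then download from one."""
    for p in connected_peers:
        if p.want_have([cid])[cid] == "have":
            block = p.want_block(cid)
            assert cid_of(block) == cid   # self-authenticating: hash must match CID
            client.blocks[cid] = block
            return block
    raise KeyError("no connected peer has " + cid)

provider, client = Peer("provider"), Peer("client")
cid = provider.add(b"hello decentralized web")
print(fetch(client, [provider], cid))

In the real system, a peer that answers "dont-have" for every CID is exactly
the case where the client falls back to the DHT walk described above.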
What's wrong with original IPFS (w. vanilla DHT-based indexing)?
    Publishing data is super slow: 66.73-second 95th-percentile latency (p. 443)
        Why so expensive? Records must be replicated on 20 nodes
            Delay is dominated by the DHT walk to find the 20 closest nodes
        How much should we care about this?
            Data must be re-published every 24 hours--sounds bad
            Latency would be unacceptable for a live video/audio feed
            Latency probably okay for static data (e.g., NFT contents)
                But would need to overlap publication in parallel if many records
                Maybe could be smart by sorting records?
    Retrieving data is somewhat slow (2.72-second median latency)
        How much do we care?
            More than for publishing--don't want to wait for web pages to load
            Mixed HTTP and IPFS latencies make web pages act weird
    Hard to cache hot content when spread across many machines
        If caching happens after the DHT lookup, will still be slow
        [CoralCDN had caching along the DHT route]
    Most browsers don't support IPFS (big problem for mobile devices)
What are the proposed solutions to these problems?
    1. Interplanetary indexers--give up on DHT for storing records
        Publishers structure advertisement deltas into a hash chain
            Makes reconciliation easy if an indexer misses an update
        What are the downsides of this approach?
            Storage for chains grows without bound
            Might not scale to huge numbers of publishers
    2. Hydra boosters
        Hydra head: node with 1000s of Kademlia IDs
        Idea: all nodes one hop from at least one head in a booster
        Hydra database: DynamoDB shared by 10-15 Hydra heads
            Any head in booster can reply with a provider record stored by any other
    3. HTTP gateways--just serve IPFS over HTTP, with nginx caching
How do Hydra heads ensure roughly uniform depth of IDs? "Power of two" choices
    Background: say you are load-balancing requests to N symmetric servers
        Option 1: pick a random server for each request
        Option 2: pick 2 random servers, route to the one with the shorter queue
        Option 3: query 3 servers for load, pick the one with the lowest load
        Option 2 has exponentially better (in N) wait times than option 1
        Option 3 is just a constant-factor improvement over option 2
    For Hydra head node IDs: generate two random IDs, pick the one that keeps
        the trie depth more regular (a small simulation sketch follows below)
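Here is a quick simulation of the load-balancing background above (my own
sketch; the 50 servers, 5,000 requests, and seed are arbitrary, and queues
only ever grow, so this shows imbalance rather than true waiting times--but
the flavor of the result is the same).

import random

def simulate(num_servers=50, num_requests=5000, choices=1, seed=0):
    """Send requests to num_servers queues, routing each to the least-loaded of
    `choices` randomly sampled servers; return the longest resulting queue."""
    rng = random.Random(seed)
    queues = [0] * num_servers
    for _ in range(num_requests):
        candidates = rng.sample(range(num_servers), choices)
        target = min(candidates, key=lambda s: queues[s])
        queues[target] += 1
    return max(queues)

for c in (1, 2, 3):
    print("choices =", c, "-> longest queue:", simulate(choices=c))

In a typical run, one random choice leaves the longest queue noticeably above
the average of 100, two choices keep it close to 100, and a third choice adds
only a small further improvement. The Hydra heads apply the same trick to ID
placement instead of queue length.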
    How well does this ID balancing work? (Fig. 7, appendix)
        What would perfect look like? A straight line
        But not all Hydra boosters are the same size (10-15 nodes)
        Also, non-Hydra nodes not evenly distributed (could clump)
        Only 20% of nodes not 1 hop from a Hydra booster--pretty good
Are these measures effective?
    1. Yes (Figure 4): store way more records, retrieve way faster
        (Recall plots are log scale, so these are huge improvements)
        What does 4c tell us? It's all routing/connection-establishment time
            Latency almost "agnostic" to number of records advertised
    2. Hydra boosters are a bit disappointing
        Do shave DHT lookup hops (Figure 5b)
        But a -3.2% to +36.5% improvement is not revolutionary
    3. Makes IPFS accessible from phones and vanilla browsers
        Cached lookups much faster (Figure 5c)
            50.3% nginx cache hit rate; hits too fast to plot
            Node-cached (by IPFS) also significantly faster than DHT
        Non-cached faster than DHT at median, slower at tail--why?
            Paper says extra overhead from retrieving and forwarding, which is vague
            My hypothesis: DHT always offers a routing choice (20 nodes at each hop)
                But if the network path to the gateway is slow, you just have to keep waiting
How much do these improvements undermine the decentralization goal?
    1: A lot--many records only available in the indexer
    2: Not much--anyone could deploy Hydras unilaterally anyway
    3: Some--but no one has to use them, so maybe not so bad
    Too bad #1 seems like the most crucial for system operation at scale
Is IPFS censorship resistant, even w/o interplanetary indexers?
    Probably most nodes running at a handful of cloud providers
    DHTs themselves not robust to malicious participants--hydra hindrances?
    Can create a bogus provider that advertises but won't serve documents
    Can also go after providers--only the index is decentralized
        But more participants, so harder to coerce
        People operating nodes may have less liability for content
What high-level lessons can we draw from this paper?
    There's centralization in P2P systems
    There's demand for P2P systems, even when not perfectly decentralized
    Kademlia is slow--must architect around high latency
    Decentralized systems aren't that big
        A 15-node Hydra booster can be one hop from 80% of nodes
        Maybe we don't need (log n)-sized routing tables / maintenance traffic?
    Maybe try to benefit from locality?
        Maybe store related CIDs together
        Or don't chop collections up over as many CIDs?