3.13.2. Disjoint Set (Union-Find)

The Disjoint Set data structure, also known as Union-Find, is a data structure that tracks a set of elements partitioned into multiple disjoint (non-overlapping) subsets. It’s particularly useful in network connectivity(eg. Section 3.8.5.3), least common ancestors, image processing, and more. The primary operations of this data structure are:

Union(x,y): Merge two subsets, represented by x and y, into a single subset.
Find(x): Determine which subset a particular element x is in. This can be used for determining if two elements are in the same subset.

3.13.2.1. Optimization Technique

The efficiency of these operations is crucial for many algorithms, and the Union-Find data structure provides a way to perform them efficiently using three optimizations.

Path Compression: In a naive Union-Find implementation, the Find operation can become inefficient if the trees become very deep. In the worst case, you might have to traverse a long chain of elements just to find the root. With Path Compression, it flatten the structure of the tree whenever Find() is used, by making every node point directly to the root. This dramatically reduces the time complexity of subsequent find operations.

Path Havling: Flatten the structure of the tree whenever Find() is used, by making every node point directly to the root. This dramatically reduces the time complexity of subsequent find operations.
Union by Size: In Union(), attach the root of the tree with fewer elements to the root of the tree with more elements to avoid increasing the tree height unnecessarily.

The implementation could be the following:

struct union_find {
  vector<int> bo, sz;
  union_find(int z) {
    bo.resize(z), sz.resize(z);
    iota(bo.begin(), bo.end(), 0); // 💣
    fill(sz.begin(), sz.end(), 1);
  }
  int find(int i) { // find final big boss' index
    int j=i;
    while (bo[i] != i){
      bo[i]=bo[bo[i]], i = bo[i]; // path halving
    }
    return bo[j]=i; // path compression
  }
  bool union_(int x, int y) {
    int bx = find(x), by = find(y);
    if (bx == by) return false;
    if (sz[bx] < sz[by])  // merge by rank, winner takes all
      bo[bx] = by, sz[by] += sz[bx], sz[bx] = 0;
    else
      bo[by] = bx, sz[bx] += sz[by], sz[by] = 0;
    return true;
  }
  // retrieve all the peers of `idx`
  vector<int> get_set(int idx) {
    vector<int> r;
    int bi = find(idx);
    for (int i = 0; i < bo.size(); i++) {
      if (find(i) == bi) {
        r.push_back(i);
      }
    }
    return r;
  }
  // number of groups
  int number_of_groups() {
    // 1. count when bo[i]==i
    // int res=0;for(int i=0;i<bo.size();i++) res+=int(bo[i]==i); return res;
    // 2. count sz>0
    return count_if(sz.begin(), sz.end(), [](int i) { return i > 0; });
  }
  bool connected(int x, int y) { return find(x) == find(y); }
};

In addition to Union(x,y) and Find(x), there are three more functions: get_set(x), number_of_groups(), and connected().

Related Questions:

Groups of Strings(Section 3.13.8.3)
Longest Consecutive Sequence(Section 3.13.8.1)
Multi-Language Translator(Section 3.13.8.6)
Maze Generation(Section 3.13.8.5)

Reference:

Time complexity: https://en.wikipedia.org/wiki/Disjoint-set_data_structure#Time_complexity
Path compression: https://www.youtube.com/watch?v=VHRhJWacxis

Song:

Title: Union Find
Lyrics: Henry Fuheng Wu

(Verse 1)
Disjoint Set
a.k.a Union-Find
It helps us track sets
with all its might.(all its might)
They are just bunch of trees,
For efficiency,
we put them in some arrays(use two arrays)

(Chorus)
wowo
Union-Find, a super star,
Find your boss
No matter who you are.
Mazes and graphs, it conquers them all,
Union-Find answers the programmer's call.

(Verse 2)
As the name implies,
core functions are union and find,
Yes we can,
optimize it with union by size (winner takes all)
In the find method,
compress and halve the path,
It flatten the trees
so code is fast.

(Outro)
wowo
Union-Find, a super star,
Connecting the elements
No matter how far.
Kruskal minimizes spanning tree
You conquers problems and set us free(set us free)