The Linux Page

Redis & the infamous "Waiting for the cluster to join....." message

Cluster Issues

Problem

For my work, I need to have a simple Redis cluster to test that everything works. Unfortunately, when trying to set up the cluster, I was getting this message:

Waiting for the cluster to join................

with the dots (.) going forever.

Creation of Cluster

To create the cluster I decided to create VPS computers with Ubuntu 18.04 since that's what we're still using. These have Redis version 4.0.9.

So... I created a clone from my clean Ubuntu 18.04.

I started that first system, changed the IP address to a group I could use on my system, and installed Redis.

Then I cloned that first Redis node 5 more times to get a cluster of 6 nodes (which is the minimum with one replica per master, although you do not need 6 computers because one computer can host a master and also the slave of a master running on another computer).

All nodes started and worked as expected.

To make sure, I tested and I could connect between all the nodes as I wanted. I do not have any firewall in those nodes (they only have local addresses anyway).

Solution Search

Settings

I found many answers, such as How to solve redis cluster “Waiting for the cluster to join” issue?, but none of the solutions I found on Stack Overflow helped resolve my problem.

First, about the setup that I changed in /etc/redis/redis.conf:

cluster-enabled yes
bind 172.16.1.121 192.168.2.121
cluster-announce-ip 172.16.1.121

Note: some of the answers I've found say to swap the IPs or remove the unused IPs, but that did not help at all.

Source for the cluster-announce-ip setting (although it may not be required).

My IP addresses go from 121 to 126.

Actually Create the Cluster

Finally, I ran the following command to create the cluster:

redis-4.0.9/src/redis-trib.rb create --replicas 1 \
    172.16.1.121:6379 172.16.1.122:6379 172.16.1.123:6379 \
    172.16.1.124:6379 172.16.1.125:6379 172.16.1.126:6379

Note: I got the script from the source package on Ubuntu 18.04 so I know it corresponds one to one to the running Redis server. The redis-trib.rb script is only available in the source. It doesn't get installed.

apt-get source redis

I thought that the output of the command looked as expected, but then it got stuck saying it was trying to join the cluster nodes:

>>> Creating cluster
>>> Performing hash slots allocation on 6 nodes...
Using 3 masters:
172.16.1.121:6379
172.16.1.122:6379
172.16.1.123:6379
Adding replica 172.16.1.125:6379 to 172.16.1.121:6379
Adding replica 172.16.1.126:6379 to 172.16.1.122:6379
Adding replica 172.16.1.124:6379 to 172.16.1.123:6379
M: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379
   slots:0-5460 (5461 slots) master
M: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.122:6379
   slots:5461-10922 (5462 slots) master
M: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.123:6379
   slots:10923-16383 (5461 slots) master
S: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.124:6379
   replicates 4261275145911ebe4844c4bd6885c4b2670b0caa
S: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.125:6379
   replicates 4261275145911ebe4844c4bd6885c4b2670b0caa
S: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.126:6379
   replicates 4261275145911ebe4844c4bd6885c4b2670b0caa
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join..... <going forever> ......Ctrl-C
Traceback (most recent call last):
    3: from redis-4.0.9/src/redis-trib.rb:1830:in `<main>'
    2: from redis-4.0.9/src/redis-trib.rb:1436:in `create_cluster_cmd'
    1: from redis-4.0.9/src/redis-trib.rb:653:in `wait_cluster_join'
redis-4.0.9/src/redis-trib.rb:653:in `sleep': Interrupt

The commands I used are as described in the Redis cluster tutorial.

I also tried to see whether the following command would work, but since Ubuntu 18.04 is still on version 4.x, that's not yet available:

redis-cli --cluster ...

If you have Redis 5.x+ then it should work for you instead of having to get the source.
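For readers on Redis 5 or later, here is a minimal sketch of what the equivalent of my redis-trib.rb call could look like. I could not run it on my 4.x nodes, so take it as an assumption based on the redis-cli documentation. The cluster_nodes helper is a name I made up; it only builds the address list for my 121 to 126 range. The command is printed rather than executed.

```shell
#!/bin/sh
# cluster_nodes [PORT]: print the "ip:port" list for 172.16.1.121 to
# 172.16.1.126. Hypothetical helper, not part of Redis.
cluster_nodes() {
    port="${1:-6379}"
    list=""
    for n in 1 2 3 4 5 6
    do
        list="$list 172.16.1.12$n:$port"
    done
    # Strip the leading space before printing.
    echo "${list# }"
}

# Print the command instead of running it; remove "echo" to execute it.
# --cluster-replicas plays the role of --replicas in redis-trib.rb.
echo redis-cli --cluster create $(cluster_nodes) --cluster-replicas 1
```

Printing first lets you verify the list of addresses before anything gets sent to the servers.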

Network Ready?

To make sure that the network was working as expected, I checked each computer with netstat to see that Redis was indeed listening on two ports as expected:

$ netstat -a64n
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.168.2.126:6379      0.0.0.0:*               LISTEN
tcp        0      0 172.16.1.126:6379       0.0.0.0:*               LISTEN
tcp        0      0 192.168.2.126:16379     0.0.0.0:*               LISTEN
tcp        0      0 172.16.1.126:16379      0.0.0.0:*               LISTEN
  ...

As we can see, Redis does listen on two ports for each IP address as expected for a cluster (here it was for 126, I had the same results for 121 to 126).
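Checking this by eye on six machines gets old quickly, so here is a small sketch that does the same test on the netstat output. The function names has_listener and check_redis_ports are mine, and I assume the column layout shown above (local address in column 4, state in column 6).

```shell
#!/bin/sh
# has_listener IP PORT: succeed when stdin (netstat -tln style output)
# has a LISTEN line whose local address is IP:PORT.
has_listener() {
    awk -v addr="$1:$2" \
        '$4 == addr && $6 == "LISTEN" { found = 1 } END { exit !found }'
}

# check_redis_ports IP [PORT]: report on the client port and on the
# cluster bus port (client port + 10000) using netstat output on stdin.
check_redis_ports() {
    ip="$1"
    port="${2:-6379}"
    bus=$((port + 10000))
    # Keep a copy of stdin because it is checked twice.
    out=$(cat)
    for p in "$port" "$bus"
    do
        if printf '%s\n' "$out" | has_listener "$ip" "$p"
        then
            echo "$ip:$p listening"
        else
            echo "$ip:$p MISSING"
        fi
    done
}

# On a node: netstat -tln | check_redis_ports 172.16.1.126
```

A MISSING line for the second port would usually mean that cluster-enabled is not set on that node.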

Trying Creating Again

In case something had gone wrong the first time, I tried again... However, on the second attempt I got this error unless I first cleaned up all the Redis nodes:

/var/lib/gems/2.5.0/gems/redis-4.2.1/lib/redis/client.rb:127:in `call': ERR Slot 0 is already busy (Redis::CommandError)

To do the cleanup, I used:

for n in 1 2 3 4 5 6
do
    redis-cli -h 172.16.1.12$n FLUSHALL
    redis-cli -h 172.16.1.12$n CLUSTER RESET SOFT
done

Source: ERR Slot xxx is already busy (Redis::CommandError)

Note: This script works from any one of the nodes, which proves that any of the clients can connect to any other Redis as expected. So there is no connection issue. Note that my VPSes do not currently run a firewall. It's 100% open.

Then I can re-run the command above, although it still gets stuck while trying to join...

Your Firewall?

If you have a firewall, see that you have both ports open per Redis instance.

Redis uses port 6379 by default and it also opens 16379 (first port + 10,000). Both of these ports can be modified in the settings if you don't like the defaults.

The firewall must let all the Redis servers in your cluster connect to either port. Your clients only need the first one (6379 by default).
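I do not run a firewall on these nodes, so the following is only a sketch of what the rules might look like, assuming ufw (the default firewall front end on Ubuntu). The print_ufw_rules function is a hypothetical helper; it prints the commands so you can review them instead of applying them blindly.

```shell
#!/bin/sh
# print_ufw_rules PORT PEER...: print (not run) the ufw commands letting
# each peer reach the client port and the cluster bus port (PORT + 10000).
print_ufw_rules() {
    port="$1"
    shift
    bus=$((port + 10000))
    for peer in "$@"
    do
        echo "sudo ufw allow from $peer to any port $port proto tcp"
        echo "sudo ufw allow from $peer to any port $bus proto tcp"
    done
}

# Example for two of the other nodes of my cluster.
print_ufw_rules 6379 172.16.1.122 172.16.1.123
```

Clients that are not cluster nodes would only need the first of the two rules.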

The nodes.conf File

I also checked the `nodes.conf` file on the various VPSes and here is an example:

$ sudo cat /var/lib/redis/nodes.conf
4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379@16379 myself,master - 0 0 6 connected
vars currentEpoch 6 lastVoteEpoch 0

which shows that 172.16.1.126 received some commands from 172.16.1.121 where I ran the command above. As far as I'm concerned, this looks good.

Debug to the Rescue?

I turned on debug logs:

loglevel debug

But I don't really see anything useful, except that the nodes that are expected to be slaves seem to not be slaves...

1773:M 31 Aug 20:15:21.683 . Unrecognized RDB AUX field: 'redis-ver'
1773:M 31 Aug 20:15:21.683 . Unrecognized RDB AUX field: 'redis-bits'
1773:M 31 Aug 20:15:21.683 . Unrecognized RDB AUX field: 'ctime'
1773:M 31 Aug 20:15:21.683 . Unrecognized RDB AUX field: 'used-mem'
1773:M 31 Aug 20:15:21.683 . Unrecognized RDB AUX field: 'aof-preamble'
1773:M 31 Aug 20:15:21.683 * DB loaded from disk: 0.000 seconds
1773:M 31 Aug 20:15:21.683 * Ready to accept connections
1773:M 31 Aug 20:15:21.683 - 0 clients connected (0 slaves), 1313864 bytes in use
1773:M 31 Aug 20:15:24.100 - Accepted 172.16.1.121:50314
1773:M 31 Aug 20:15:24.151 * DB saved on disk
1773:M 31 Aug 20:15:24.151 - Client closed connection
1773:M 31 Aug 20:15:24.155 - Accepted 172.16.1.121:50316
1773:M 31 Aug 20:15:24.207 - Client closed connection
1773:M 31 Aug 20:15:28.137 - 0 clients connected (0 slaves), 1314056 bytes in use
1773:M 31 Aug 20:15:31.409 - Accepted 172.16.1.121:50338
1773:M 31 Aug 20:15:33.475 - Accepted cluster node 172.16.1.124:43383
1773:M 31 Aug 20:15:33.475 . --- Processing packet of type 2, 2256 bytes
1773:M 31 Aug 20:15:33.475 . Ping packet received: (nil)
1773:M 31 Aug 20:15:33.475 . pong packet received: (nil)
1773:M 31 Aug 20:15:33.475 # Discarding UPDATE message about myself.
1773:M 31 Aug 20:15:33.476 . I/O error reading from node link: connection closed
1773:M 31 Aug 20:15:33.478 - Accepted cluster node 172.16.1.125:41003
1773:M 31 Aug 20:15:33.478 . --- Processing packet of type 2, 2256 bytes

CLUSTER NODES

I tried the CLUSTER NODES Redis command, but each node says it is a master and none of the nodes seem to know anything about the others:

$ redis-cli -h 172.16.1.121 cluster nodes
4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379@16379 myself,master - 0 0 6 connected 0-5460
$ redis-cli -h 172.16.1.122 cluster nodes
4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379@16379 myself,master - 0 0 6 connected 5461-10922
$ redis-cli -h 172.16.1.123 cluster nodes
4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379@16379 myself,master - 0 0 6 connected 10923-16383
$ redis-cli -h 172.16.1.124 cluster nodes
4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379@16379 myself,master - 0 0 6 connected
$ redis-cli -h 172.16.1.125 cluster nodes
4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379@16379 myself,master - 0 0 6 connected
$ redis-cli -h 172.16.1.126 cluster nodes
4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379@16379 myself,master - 0 0 6 connected

Also we can notice a difference between the slaves and the others: there is no slot range for the slaves... But that didn't fix anything yet.

Problem Resolved!

While perusing the docs, I recalled reading something about a unique node ID.

There is even a CLUSTER MYID command and it is repeated there: each node has a unique ID.

Now, looking back at my output above, I see that all nodes have exactly the same ID. Hmmm.... Why is that? Well... when I created my VPSes, remember, I created the first one from my clean 18.04 install and then cloned that one 5 times... Yep! That means all 5 had the same node ID as the first one. Oops!
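Had I thought of it earlier, a check along these lines would have pointed at the problem right away. It is a minimal sketch: collect_ids and find_duplicate_ids are names I made up, and the address range is the one from my setup. Only CLUSTER MYID is an actual Redis command.

```shell
#!/bin/sh
# find_duplicate_ids: read "host node-id" lines on stdin and print each
# node ID that appears for more than one host.
find_duplicate_ids() {
    awk '{ print $2 }' | sort | uniq -d
}

# collect_ids: ask each of my six nodes for its ID and print one
# "host node-id" line per node.
collect_ids() {
    for n in 1 2 3 4 5 6
    do
        host="172.16.1.12$n"
        echo "$host $(redis-cli -h "$host" CLUSTER MYID)"
    done
}

# On a live setup: collect_ids | find_duplicate_ids
# Any line printed means at least two nodes share that ID.
```

With my cloned VPSes this would have printed the one ID shared by all six nodes. On a healthy setup it prints nothing.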

To resolve the issue, I connected to each node from 122 to 126 and ran the following commands:

sudo systemctl stop redis
sudo rm -i /var/lib/redis/nodes.conf
sudo systemctl start redis

Note: Why does that work? On startup, the Redis server checks for the nodes.conf file. If present, it uses that data. If not present, it creates a new one from scratch.

Then the create command above worked as expected and although we still can see a "Waiting for the cluster to join....." message, that message only lasted for about 6 seconds and then the cluster was formed.
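Instead of watching the dots, the state of the cluster can also be read from the CLUSTER INFO command. Here is a small sketch to extract one field from its output; cluster_field is a hypothetical helper name, while cluster_state and cluster_known_nodes are fields that CLUSTER INFO reports.

```shell
#!/bin/sh
# cluster_field NAME: read CLUSTER INFO output on stdin and print the
# value of one "name:value" field. The reply uses CRLF line endings,
# so the "\r" characters are dropped first.
cluster_field() {
    tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# On a live node:
#   redis-cli -h 172.16.1.121 CLUSTER INFO | cluster_field cluster_state
#   redis-cli -h 172.16.1.121 CLUSTER INFO | cluster_field cluster_known_nodes
```

Once the cluster is formed, I would expect the first call to print ok and the second to print the number of nodes.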

Now the CLUSTER NODES command would display all the nodes as expected:

a25825fde1b893453470ffe4e3a771a37e301df7 172.16.1.123:6379@16379 master - 0 1598908897000 3 connected 10923-16383
4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379@16379 myself,master - 0 1598908895000 6 connected 0-5460
85af30b16851df066d9988ddb6e05d98110418e1 172.16.1.124:6379@16379 slave a25825fde1b893453470ffe4e3a771a37e301df7 0 1598908899000 4 connected
9cf46e7773a2bc353d62b24a14508292bd57b6f4 172.16.1.122:6379@16379 master - 0 1598908897244 2 connected 5461-10922
39d0e7ceb3a5ce65e0598f595b5a1f4d26065ced 172.16.1.126:6379@16379 slave 9cf46e7773a2bc353d62b24a14508292bd57b6f4 0 1598908900099 7 connected
875e2b073e9615f84b4d94cf22b431bc1e335c69 172.16.1.125:6379@16379 slave 4261275145911ebe4844c4bd6885c4b2670b0caa 0 1598908896000 6 connected

As we can see, the first string is the node identifier and now they are all different. Also the data on the right side is much more complete (although as I was searching for the solution, I had no clue... this is my first time with a Redis Cluster).

Note that from the output above, we had the following:

M: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379
   slots:0-5460 (5461 slots) master
M: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.122:6379
   slots:5461-10922 (5462 slots) master
M: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.123:6379
   slots:10923-16383 (5461 slots) master
S: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.124:6379
   replicates 4261275145911ebe4844c4bd6885c4b2670b0caa
S: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.125:6379
   replicates 4261275145911ebe4844c4bd6885c4b2670b0caa
S: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.126:6379
   replicates 4261275145911ebe4844c4bd6885c4b2670b0caa

And as we can see, all 6 nodes had the exact same ID:

4261275145911ebe4844c4bd6885c4b2670b0caa

Once I fixed those IDs, the output correctly showed different IDs like so:

M: 4261275145911ebe4844c4bd6885c4b2670b0caa 172.16.1.121:6379
   slots:0-5460 (5461 slots) master
M: 9cf46e7773a2bc353d62b24a14508292bd57b6f4 172.16.1.122:6379
   slots:5461-10922 (5462 slots) master
M: a25825fde1b893453470ffe4e3a771a37e301df7 172.16.1.123:6379
   slots:10923-16383 (5461 slots) master
S: 85af30b16851df066d9988ddb6e05d98110418e1 172.16.1.124:6379
   replicates a25825fde1b893453470ffe4e3a771a37e301df7
S: 875e2b073e9615f84b4d94cf22b431bc1e335c69 172.16.1.125:6379
   replicates 4261275145911ebe4844c4bd6885c4b2670b0caa
S: 39d0e7ceb3a5ce65e0598f595b5a1f4d26065ced 172.16.1.126:6379
   replicates 9cf46e7773a2bc353d62b24a14508292bd57b6f4

We see a repeat because the slaves reference the master that they duplicate, but the first IDs all look different now.
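To verify this without comparing long hexadecimal strings by eye, the CLUSTER NODES output of a single node can be summarized. This is only a sketch; summarize_nodes is a name I made up, and it assumes the layout shown earlier (node ID in column 1, flags in column 3).

```shell
#!/bin/sh
# summarize_nodes: read CLUSTER NODES output on stdin and print the
# number of distinct node IDs, masters and slaves.
summarize_nodes() {
    awk '
    NF == 0 { next }                 # skip empty lines
    {
        ids[$1]++                    # column 1 is the node ID
        if ($3 ~ /master/) masters++ # column 3 holds the flags
        if ($3 ~ /slave/) slaves++
    }
    END {
        unique = 0
        for (id in ids) unique++
        printf "%d unique IDs, %d masters, %d slaves\n", unique, masters, slaves
    }'
}

# On a live node: redis-cli -h 172.16.1.121 cluster nodes | summarize_nodes
```

With my broken setup, each node only knew about itself, so the count of unique IDs would have been 1 on every node.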

Re: Redis & the infamous "Waiting for the cluster to ...

Thank you for your hard work on this solution.
I had exactly the same situation.
Now I'm happy thanks to your kind documentation.

Thanks again.