Problem with Worker connecting to Leader node

Hi Guys,

I have set up a test environment on my laptop using VirtualBox, with a CentOS 7 edge leader node and a worker, using port 9000 (UI) and port 4200 for internal comms.

I’ve installed the worker node using curl. There seems to be an issue with the worker node connecting to the leader, as shown below. From the worker node, telnet connections to the leader’s ports 9000 and 4200 succeed fine, but I’m not sure what else to check based on the messages below:
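In case it's useful to anyone, the telnet checks described above can also be scripted. A minimal Python sketch (leader address and ports taken from this setup; this only tests TCP reachability, not Cribl auth):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused, unreachable, and timed-out connections.
        return False

# Leader ports from this environment: 9000 (UI) and 4200 (internal comms).
for port in (9000, 4200):
    print(f"10.0.2.4:{port} reachable: {port_open('10.0.2.4', port)}")
```

Note that a successful TCP connect (like telnet succeeding here) only proves the port is open; as it turned out below, the failure was at the application layer, not the network layer.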

{"time":"2022-05-23T00:50:14.256Z","cid":"api","channel":"output:DistWorker","level":"info","message":"attempting to connect","host":"10.0.2.4","port":4200,"tls":false}
{"time":"2022-05-23T00:50:14.256Z","cid":"api","channel":"output:DistWorker","level":"debug","message":"will retry to connect","nextConnectTime":1653267018346}
{"time":"2022-05-23T00:50:14.256Z","cid":"api","channel":"output:DistWorker","level":"debug","message":"connecting","host":"10.0.2.4","port":4200,"tls":false}
{"time":"2022-05-23T00:50:14.264Z","cid":"api","channel":"input:DistMaster","level":"debug","message":"opened connection","src":"10.0.2.4:4200"}
{"time":"2022-05-23T00:50:14.264Z","cid":"api","channel":"output:DistWorker","level":"info","message":"connected","host":"10.0.2.4","port":4200,"tls":false}
{"time":"2022-05-23T00:50:14.264Z","cid":"api","channel":"output:DistWorker","level":"info","message":"flushing buffer backlog","count":1,"totalSize":302}
{"time":"2022-05-23T00:50:14.267Z","cid":"api","channel":"output:DistWorker","level":"info","message":"sending unblocked","since":1653267014,"endpoint":{"host":"10.0.2.4","port":4200,"tls":false}}
2 UpGoats

Sorry, I pasted the wrong snippet — the messages are actually these:

{"time":"2022-05-23T01:35:09.958Z","cid":"api","channel":"output:DistWorker","level":"info","message":"attempting to connect","host":"10.0.2.4","port":4200,"tls":false}
{"time":"2022-05-23T01:35:09.959Z","cid":"api","channel":"output:DistWorker","level":"debug","message":"will retry to connect","nextConnectTime":1653269713683}
{"time":"2022-05-23T01:35:09.959Z","cid":"api","channel":"output:DistWorker","level":"debug","message":"connecting","host":"10.0.2.4","port":4200,"tls":false}
{"time":"2022-05-23T01:35:09.962Z","cid":"api","channel":"input:DistMaster","level":"debug","message":"opened connection","src":"10.0.2.4:4200"}
{"time":"2022-05-23T01:35:09.962Z","cid":"api","channel":"output:DistWorker","level":"info","message":"connected","host":"10.0.2.4","port":4200,"tls":false}
{"time":"2022-05-23T01:35:09.962Z","cid":"api","channel":"output:DistWorker","level":"info","message":"flushing buffer backlog","count":1,"totalSize":240}
{"time":"2022-05-23T01:35:09.964Z","cid":"api","channel":"output:DistWorker","level":"debug","message":"will retry to connect","nextConnectTime":1653269711826}
{"time":"2022-05-23T01:35:09.965Z","cid":"api","channel":"input:DistMaster","level":"debug","message":"closed connection","src":"10.0.2.4:4200","error":{"message":"write EPIPE","stack":"Error: write EPIPE\n    at afterWriteDispatched (internal/stream_base_commons.js:156:25)\n    at writeGeneric (internal/stream_base_commons.js:147:3)\n    at Socket._writeGeneric (net.js:798:11)\n    at Socket._write (net.js:810:8)\n    at writeOrBuffer (internal/streams/writable.js:358:12)\n    at Socket.Writable.write (internal/streams/writable.js:303:10)\n    at y.writeAndFlush (/opt/cribl/bin/cribl.js:14:12771977)\n    at y.sendNextBuffer (/opt/cribl/bin/cribl.js:14:12772984)\n    at Immediate._onImmediate (/opt/cribl/bin/cribl.js:14:12772665)\n    at processImmediate (internal/timers.js:464:21)"},"r":0,"b":0}
{"time":"2022-05-23T01:35:10.972Z","cid":"api","channel":"output:DistWorker","level":"warn","message":"sending is blocked","since":1653269709,"elapsed":1,"endpoint":{"host":"10.0.2.4","port":4200,"tls":false}}

Another variation is this…

{"time":"2022-05-23T01:35:08.073Z","cid":"api","channel":"output:DistWorker","level":"info","message":"attempting to connect","host":"10.0.2.4","port":4200,"tls":false}
{"time":"2022-05-23T01:35:08.073Z","cid":"api","channel":"output:DistWorker","level":"debug","message":"will retry to connect","nextConnectTime":1653269711797}
{"time":"2022-05-23T01:35:08.073Z","cid":"api","channel":"output:DistWorker","level":"debug","message":"connecting","host":"10.0.2.4","port":4200,"tls":false}
{"time":"2022-05-23T01:35:08.075Z","cid":"api","channel":"input:DistMaster","level":"debug","message":"opened connection","src":"10.0.2.4:4200"}
{"time":"2022-05-23T01:35:08.076Z","cid":"api","channel":"output:DistWorker","level":"info","message":"connected","host":"10.0.2.4","port":4200,"tls":false}
{"time":"2022-05-23T01:35:08.076Z","cid":"api","channel":"output:DistWorker","level":"info","message":"flushing buffer backlog","count":1,"totalSize":240}
{"time":"2022-05-23T01:35:08.095Z","cid":"api","channel":"output:DistWorker","level":"debug","message":"will retry to connect","nextConnectTime":1653269709957}
{"time":"2022-05-23T01:35:08.176Z","cid":"api","channel":"output:DistWorker","level":"error","message":"connection error","error":"This socket has been ended by the other party"}
1 UpGoat

OK, figured it out. The default auth token “criblmaster” needs to be replaced with a proper one. Clicking the Generate button on the Distributed Settings > Leader Settings page, then putting the new token in the /opt/cribl/local/_system/instance.yml file on the worker node, did the trick!
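For anyone else hitting this: the worker-side change goes in `/opt/cribl/local/_system/instance.yml`. A rough sketch of the relevant section (field names per Cribl's distributed-deployment docs; double-check against your version, and the token value is a placeholder for the one generated on the leader):

```yaml
distributed:
  mode: worker
  master:
    host: 10.0.2.4    # leader node from this thread
    port: 4200        # internal comms port
    authToken: REPLACE_WITH_GENERATED_TOKEN
    tls:
      disabled: true  # matches "tls":false in the logs above
```

Restart the worker after editing so it reconnects with the new token.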

2 UpGoats

Glad you were able to resolve this. Of the messages you posted from the worker, the one below is what points to the issue. There are other scenarios that can produce this message as well; validating the ports and connectivity, and reviewing tcpdump output, will help narrow it down.

{"time":"2022-05-23T01:35:08.176Z","cid":"api","channel":"output:DistWorker","level":"error","message":"connection error","error":"This socket has been ended by the other party"}

1 UpGoat