Problem deploying config from Leader to Workers

Hi, I have a problem deploying config from the Leader to the Workers. I have a distributed deployment with 2 Workers and 1 Worker Group (testing, free license). I configured secure connections between the Leader and the Workers according to https://docs.cribl.io/stream/securing-communications. At first glance everything is OK: both Workers are shown as alive under Manage Worker Nodes on the Leader, I can connect to the Workers' GUI (Remote UI Access is on), the config version is OK, etc.
The problem appears when I want to deploy config changes to the Workers. Under Manage Worker Nodes I see a yellow exclamation mark and a never-ending spinning circle in the Config version column, and the config changes are not propagated to the Workers.
In the log on the Leader I see this:

{"time":"2022-06-17T13:01:21.350Z","cid":"api","channel":"CriblMaster","level":"info","message":"sending config update request","group":"default","version":"8627dcf","logStreamEnv":"master","worker":"ff5ea4c4-51eb-4aa4-a650-2e6c8251bb21"}
{"time":"2022-06-17T13:01:21.363Z","cid":"api","channel":"CriblMaster","level":"warn","message":"failed config update","group":"default","version":"8627dcf","logStreamEnv":"master","worker":"ff5ea4c4-51eb-4aa4-a650-2e6c8251bb21","elapsed":13,"error":{"__criblEventType":"event","__ctrlFields":[],"__final":false,"__cloneCount":0,"type":"resp","status":500,"message":"error running handler for req=configure","error":{"message":"Received non-OK status code=403","stack":"Error: Received non-OK status code=403\n    at ClientRequest.<anonymous> (/srv/app/int/secmon/cribl/bin/cribl.js:14:11928157)\n    at Object.onceWrapper (events.js:520:26)\n    at ClientRequest.emit (events.js:400:28)\n    at ClientRequest.emit (domain.js:475:12)\n    at HTTPParser.parserOnIncomingClient (_http_client.js:647:27)\n    at HTTPParser.parserOnHeadersComplete (_http_common.js:127:17)\n    at Socket.socketOnData (_http_client.js:515:22)\n    at Socket.emit (events.js:400:28)\n    at Socket.emit (domain.js:475:12)\n    at Socket.Readable.read (internal/streams/readable.js:504:10)"},"req":"configure","reqId":8,"rpc":false,"__raw":"{\"type\":\"resp\",\"status\":500,\"message\":\"error running handler for req=configure\",\"error\":{\"message\":\"Received non-OK status code=403\",\"stack\":\"Error: Received non-OK status code=403\\n    at ClientRequest.<anonymous> (/srv/app/int/secmon/cribl/bin/cribl.js:14:11928157)\\n    at Object.onceWrapper (events.js:520:26)\\n    at ClientRequest.emit (events.js:400:28)\\n    at ClientRequest.emit (domain.js:475:12)\\n    at HTTPParser.parserOnIncomingClient (_http_client.js:647:27)\\n    at HTTPParser.parserOnHeadersComplete (_http_common.js:127:17)\\n    at Socket.socketOnData (_http_client.js:515:22)\\n    at Socket.emit (events.js:400:28)\\n    at Socket.emit (domain.js:475:12)\\n    at Socket.Readable.read (internal/streams/readable.js:504:10)\"},\"req\":\"configure\",\"reqId\":8,\"rpc\":false}","__socketAddr":"10.88.29.42:16337","__srcIpPort":"10.88.29.42:16337"}}
{"time":"2022-06-17T13:01:21.363Z","cid":"api","channel":"CriblMaster","level":"warn","message":"failed to get worker configs up-to-date","group":"default","version":"8627dcf","guid":"ff5ea4c4-51eb-4aa4-a650-2e6c8251bb21","error":{"__criblEventType":"event","__ctrlFields":[],"__final":false,"__cloneCount":0,"type":"resp","status":500,"message":"error running handler for req=configure","error":{"message":"Received non-OK status code=403","stack":"Error: Received non-OK status code=403\n    at ClientRequest.<anonymous> (/srv/app/int/secmon/cribl/bin/cribl.js:14:11928157)\n    at Object.onceWrapper (events.js:520:26)\n    at ClientRequest.emit (events.js:400:28)\n    at ClientRequest.emit (domain.js:475:12)\n    at HTTPParser.parserOnIncomingClient (_http_client.js:647:27)\n    at HTTPParser.parserOnHeadersComplete (_http_common.js:127:17)\n    at Socket.socketOnData (_http_client.js:515:22)\n    at Socket.emit (events.js:400:28)\n    at Socket.emit (domain.js:475:12)\n    at Socket.Readable.read (internal/streams/readable.js:504:10)"},"req":"configure","reqId":8,"rpc":false,"__raw":"{\"type\":\"resp\",\"status\":500,\"message\":\"error running handler for req=configure\",\"error\":{\"message\":\"Received non-OK status code=403\",\"stack\":\"Error: Received non-OK status code=403\\n    at ClientRequest.<anonymous> (/srv/app/int/secmon/cribl/bin/cribl.js:14:11928157)\\n    at Object.onceWrapper (events.js:520:26)\\n    at ClientRequest.emit (events.js:400:28)\\n    at ClientRequest.emit (domain.js:475:12)\\n    at HTTPParser.parserOnIncomingClient (_http_client.js:647:27)\\n    at HTTPParser.parserOnHeadersComplete (_http_common.js:127:17)\\n    at Socket.socketOnData (_http_client.js:515:22)\\n    at Socket.emit (events.js:400:28)\\n    at Socket.emit (domain.js:475:12)\\n    at Socket.Readable.read (internal/streams/readable.js:504:10)\"},\"req\":\"configure\",\"reqId\":8,\"rpc\":false}","__socketAddr":"10.88.29.42:16337","__srcIpPort":"10.88.29.42:16337"}}

and on the Worker:

{
	"time": "2022-06-17T13:01:21.351Z",
	"cid": "api",
	"channel": "CriblWorker",
	"level": "info",
	"message": "leader triggered configure",
	"version": "8627dcf",
	"group": "default",
	"checksum": "sha1:27d31c969b154e7addc0af8afd8133ae4510b6c8",
	"url": "https://10.88.29.12:4200/api/v1/master/bundles/default/8627dcf?guid=ff5ea4c4-51eb-4aa4-a650-2e6c8251bb21",
	"source": "cribl.log"
}

I tried different kinds of certificates (self-signed certificates generated on the Cribl servers, certificates issued by a CA that I configured on the Leader node, and certificates from an external third-party CA), always with the same result.
Without TLS everything works well.
Could you please help me find out what is wrong?
Many thanks in advance for your help.
Best regards
Lukas Mecir


It seems like I’ve seen this before but can’t quite remember what the solution was. One question I have though is whether your workers have a proxy env variable set for getting to the Internet?

It seems the notification from the leader to the worker is getting through, since the worker logs an event indicating the leader told it that a new config is available, even though the leader's logs show both a 500 and a 403 error. I think when we saw this in the past, it was because the workers were trying to connect to the leader via a proxy and never got through: the proxy was directing them to the Internet instead of to the leader.

The reason the proxy would direct the workers away from the leader is that workers open a separate connection to the leader when they download their config deployments. That connection uses the same port (4200/tcp) as regular cluster communications, but because the request is HTTP-based it can be hijacked by OS proxy env variables. Regular cluster comms aren’t HTTP-based.

And maybe the reason it works without TLS but fails with TLS is that you only have a proxy env var set for HTTPS but not for HTTP. Just a guess.
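
If it helps to reason about that guess, here is a rough Python sketch of the usual proxy env var convention (`http_proxy` / `https_proxy` / `no_proxy`). It is only an illustration of the common rules, not Cribl's actual lookup logic, and the proxy address `proxy.example:3128` and the helper name `proxy_for` are made up for the example. The leader URL is taken from the `url` field in your Worker log.

```python
from urllib.parse import urlsplit


def proxy_for(url, env):
    """Return the proxy that conventional env var rules would pick for url, or None.

    Simplified on purpose: no_proxy entries are matched as exact hosts or
    domain suffixes only (no ports, no CIDR ranges).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # no_proxy: comma-separated hosts/suffixes that must bypass the proxy.
    no_proxy = env.get("no_proxy") or env.get("NO_PROXY") or ""
    for entry in no_proxy.split(","):
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if entry == "*" or host == entry or host.endswith("." + entry):
            return None

    # The variable consulted depends on the URL scheme: http_proxy or https_proxy.
    name = parts.scheme + "_proxy"
    return env.get(name) or env.get(name.upper()) or None


# Hypothetical Worker environment with only an HTTPS proxy configured.
env = {"https_proxy": "http://proxy.example:3128"}
tls_url = "https://10.88.29.12:4200/api/v1/master/bundles/default/8627dcf"
plain_url = "http://10.88.29.12:4200/api/v1/master/bundles/default/8627dcf"

print(proxy_for(tls_url, env))    # http://proxy.example:3128 (bundle download goes to the proxy)
print(proxy_for(plain_url, env))  # None (goes straight to the leader)
print(proxy_for(tls_url, {**env, "no_proxy": "10.88.29.12"}))  # None (leader excluded)
```

Under these assumptions, the TLS bundle download would be sent to the proxy while the plain HTTP one would not, which would match what you are seeing. If the Workers do have such a variable set, adding the leader's address to `no_proxy` (or unsetting the proxy variable for the Cribl process) would be the thing to try.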

Hope that helps.