Amazon S3 Source sometimes hangs on files

Hi, we have configured an Amazon S3 source on our Cribl Stream deployment. One thing we are observing is that, after the first couple of minutes, the workers hang onto certain files and take a long time to process them. It stays like that until we make a configuration change and commit it, and we see logs as shown in the following screenshot. If I check the live data, I'm not seeing anything, and CPU and memory utilization is below 50%.

Screenshot attached:

Anything unique about those files? Extra large overall? Extra large events inside? Tricky content inside?

The diag bundle shows the two recurring errors below when Stream tries to send events to the Splunk HEC output. These will cause blocking backpressure, because a Stream admin configured this output to block on backpressure. So when Stream tries to collect more data from S3, the worker process pauses the event stream because it is receiving block signals from the output. (A small script for surfacing these errors from the bundle follows the log entries.)
{"time":"2022-12-09T23:16:32.349Z","cid":"w0","channel":"output:SplunkHECEndpoint","level":"error","message":"error while flushing","error":{"message":"connect ETIMEDOUT xxx.xxx.xxx.xxx:8088","stack":"Error: connect ETIMEDOUT xxx.xxx.xxx.xxx:8088\n at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1159:16)"}}

{"time":"2022-12-09T23:16:39.996Z","cid":"w0","channel":"output:SplunkHECEndpoint","level":"error","message":"hit request concurrency limit","stack":"BackpressureError: hit request concurrency limit\n at /opt/cribl/bin/cribl.js:14:12504414\n at new Promise (<anonymous>)\n at l.flushBuffer (/opt/cribl/bin/cribl.js:14:12504332)\n at Immediate._onImmediate (/opt/cribl/bin/cribl.js:14:12504020)\n at processImmediate (internal/timers.js:464:21)","reason":"output is experiencing increased load","name":"BackpressureError","conflictingFields":{"message":"hit request concurrency limit"}}

Below is the per-minute stats event that each worker process logs. This one was logged 45 seconds after the events above and shows 3 blocked EPs. An EP (event processor) is blocked when destinations are backpressuring, which stops event flow from input to output. (A small parser for tracking blockedEP over time follows the stats line.)

{"time":"2022-12-09T23:17:05.089Z","cid":"w0","channel":"server","level":"info","message":"_raw stats","inEvents":263994,"outEvents":263994,"inBytes":30632878,"outBytes":30632878,"starttime":1670627760,"endtime":1670627820,"activeCxn":0,"openCxn":0,"closeCxn":0,"rejectCxn":0,"abortCxn":0,"pqInEvents":0,"pqOutEvents":0,"pqInBytes":0,"pqOutBytes":0,"pqTotalBytes":0,"droppedEvents":0,"tasksStarted":1,"tasksCompleted":1,"activeEP":4,"blockedEP":3,"cpuPerc":7.78,"mem":{"heap":105,"ext":116,"rss":332}}

The recommendation to resolve the stalling of the S3 file downloads is to address the issues with the destination. There are multiple reasons why the requests can fail, so I can't give you an exact resolution, but generally speaking the destination appears unable to accept all the HTTP requests coming from Stream. It may be overwhelmed, or there could be issues with the intervening load balancer.
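
One quick check is whether the worker nodes can even reach the HEC endpoint (or the load balancer in front of it) with reasonable latency. Here is a rough sketch; the URL is a placeholder, and the `/services/collector/health` endpoint is assumed to be available on your Splunk version. A plain TCP connect test to port 8088 works too if it isn't.

```python
#!/usr/bin/env python3
"""Repeated reachability check against the Splunk HEC endpoint.

Minimal sketch: HEC_URL is a placeholder -- point it at your load balancer
or indexer.
"""
import time
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/health"  # placeholder

for attempt in range(1, 6):
    start = time.monotonic()
    try:
        # verify=False only to keep this connectivity test simple
        resp = requests.get(HEC_URL, timeout=10, verify=False)
        print(f"attempt {attempt}: HTTP {resp.status_code} in {time.monotonic() - start:.2f}s")
    except requests.exceptions.RequestException as exc:
        print(f"attempt {attempt}: failed after {time.monotonic() - start:.2f}s -> {exc}")
    time.sleep(2)
```

Running this from a worker node while the S3 collection is stalled helps show whether the ETIMEDOUT errors are network or load-balancer related, or specific to the volume of traffic Stream is sending.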

I see you have max Request Concurrency in the config set to 32. This is quite high and might be the issue. This setting is per worker process, so each worker process can have up to 32 HTTP requests in flight to the destination at once, and it flushes its buffers every 1 second (based on the Flush Period setting value). That may be more connections than the destination can handle. But again, we can't provide a definitive value for what this needs to be; you'll have to experiment to see what the destination can accommodate.
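
For a rough sense of scale, multiply Request Concurrency by the number of worker processes across the group; that is the worst-case number of simultaneous connections the HEC side (or the load balancer) has to accept. The deployment sizes below are made-up placeholders:

```python
# Back-of-the-envelope ceiling on simultaneous HEC connections.
# The worker counts are hypothetical; substitute your own deployment's numbers.
worker_nodes = 4             # hypothetical
processes_per_node = 8       # hypothetical
request_concurrency = 32     # current Destination setting, applied per worker process

max_inflight = worker_nodes * processes_per_node * request_concurrency
print(f"Worst case: {max_inflight} concurrent HEC requests")  # 1024 with these numbers
```

Lowering Request Concurrency, or raising the Flush Period, shrinks that ceiling.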

The destination has been disabled for now, and S3 ingestion has resumed.