I am seeing some irregularities with collecting large files from a filesystem.
We batch process files anywhere from 100MB to 100GB in size. I am currently noticing an issue with larger files.
To troubleshoot, I created a collector that reads the data in and writes it directly back to disk. No other ETL is applied to the data.
Result of one collection (screenshot):
Result of collecting the same file from the same location, using the same event breakers and going to the same destination (screenshot):
My current environment is 1 leader, 1 worker, and the file is being picked up from an NFS mount.
As you can see from the screenshots, only 3-4 million of the ~24 million events in the file are being collected. The destination writes about 5-6 GB to disk, out of the original ~38 GB.
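One way to quantify the gap independently of the UI is to compare the source file against what landed on disk. The figures above work out to roughly 13-17% by event count (3-4 of ~24 million) and 13-16% by size (5-6 of ~38 GB), so both measures point to the same shortfall. Below is a minimal Python sketch, not part of Stream. It assumes one event per newline-terminated line and that the destination writes events unmodified and uncompressed, which may not hold for your event breakers or output format. The function names are hypothetical.

```python
def count_bytes_and_lines(path, chunk_size=1024 * 1024):
    """Return (total_bytes, newline_count) for a file, reading it in chunks
    so that very large files do not have to fit in memory."""
    total_bytes = 0
    newlines = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
            newlines += chunk.count(b"\n")
    return total_bytes, newlines


def coverage(source_path, output_path):
    """Compare an output file against its source.

    Assumption: one event per line in both files. If the destination
    reformats events, only treat the ratios as a rough indication.
    """
    src_bytes, src_lines = count_bytes_and_lines(source_path)
    out_bytes, out_lines = count_bytes_and_lines(output_path)
    return {
        "byte_ratio": out_bytes / src_bytes if src_bytes else 0.0,
        "line_ratio": out_lines / src_lines if src_lines else 0.0,
    }
```

Running `coverage(source, output)` after each collection of the same file would show whether the collected fraction really changes from run to run, and by how much.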
There are no errors that I see in the job log, and I can’t find any setting regarding worker process limits or job limits that would affect this.
What version of Stream are you on?
Can you show your Collector settings?
Turning on debug logging on the collector might provide more information.
Stream version: 3.4.1
Most of the collector settings are default.
I have added my event breakers.
I have also set a custom field for routing the data back out to the filesystem.
I am going to run a collection with debug on now.
After doing that you can look at the logs for the job by going to Monitoring → Job Inspector.
The majority of the debug logs are…
“message: failed to pop task reason: no task in the queue”
“message: skipping metrics flush on pristine metrics”
Nothing sticks out to me as bad or breaking. No tasks in queue because it’s one file… so only 1 task.
Also, this third run again captured a different number of events.
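Since the two debug messages quoted above make up most of the log, it can help to filter them out and count whatever remains. The sketch below is only an illustration and rests on assumptions: that the job log has been exported as newline-delimited JSON, and that each record carries a `message` field. Neither the export format nor the helper name comes from this thread.

```python
import json
from collections import Counter

# Message prefixes to ignore, taken from the two debug lines quoted above.
NOISY = (
    "failed to pop task",
    "skipping metrics flush on pristine metrics",
)


def summarize_messages(lines, noisy=NOISY):
    """Count log messages, skipping the known-noisy ones.

    Assumption: each line is a JSON object with a "message" field.
    Lines that are not valid JSON are counted under a placeholder key.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable line>"] += 1
            continue
        message = str(record.get("message", "")) if isinstance(record, dict) else ""
        if any(message.startswith(prefix) for prefix in noisy):
            continue
        counts[message] += 1
    return counts
```

Passing an open file handle for the exported log to `summarize_messages` and printing `counts.most_common()` would surface any less frequent message that is easy to miss when scrolling.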
It might be worth opening a ticket. I think we will need to take a closer look at your logs and configuration through a diag.
Sounds good. I’ll open a case with a link to this thread.