File size upper limit when collecting from File System?

amiller · March 2023

I am seeing some irregularities with collecting large files from a filesystem.

We batch process files anywhere from 100MB to 100GB in size. I am currently noticing an issue with larger files.

To troubleshoot I created a collector that reads the data in, and directly writes it back to disk. No other ETL is done on the data.

[Screenshot: result of the first collection]

[Screenshot: result of collecting the same file from the same location, using the same event breakers and going to the same destination]

My current environment is 1 leader, 1 worker, and the file is being picked up from an NFS mount.

As you can see from the screenshots, only 3-4 million of the ~24 million events in the file are being collected. The destination writes about 5-6 GB to disk, out of the original ~38 GB.

There are no errors that I see in the job log, and I can't find any setting regarding worker process limits or job limits that would affect this.
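For anyone wanting to measure the gap outside of Stream, here is a minimal sketch that compares the source file against whatever the destination wrote. It assumes events are newline-delimited and that the output is uncompressed plain text, neither of which is confirmed above; the function names and paths are made up for illustration. If the destination adds fields or changes the format, the byte ratio will only be a rough guide.

```python
def count_lines_and_bytes(path, chunk_size=1024 * 1024):
    """Stream a file in fixed-size chunks and return (newline_count, byte_count).

    Reading in chunks keeps memory flat even for files in the tens of GB.
    """
    lines = 0
    size = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
            size += len(chunk)
    return lines, size


def coverage(source_path, output_paths):
    """Compare one source file against the files written by the destination.

    Assumption: one event per line in both source and output.
    """
    src_lines, src_bytes = count_lines_and_bytes(source_path)
    out_lines = 0
    out_bytes = 0
    for path in output_paths:
        n_lines, n_bytes = count_lines_and_bytes(path)
        out_lines += n_lines
        out_bytes += n_bytes
    return {
        "source_lines": src_lines,
        "output_lines": out_lines,
        # Ratios are None for an empty source to avoid dividing by zero.
        "line_ratio": out_lines / src_lines if src_lines else None,
        "byte_ratio": out_bytes / src_bytes if src_bytes else None,
    }
```

Calling `coverage("/mnt/nfs/input.log", ["/tmp/out/part-0.log"])` (hypothetical paths) after each run would show whether the fraction that makes it through changes from run to run.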

Kyle McCririe · March 2023

What version of Stream are you on?

Can you show your Collector settings?

Turning on debug on the collector could possibly provide more information.

Kyle McCririe · March 2023

Might be worth opening a ticket. I think we will need to take a closer look at your Logs/configurations through a diag.

amiller · March 2023

Sounds good. I'll open a case with a link to this thread.

Kyle McCririe · March 2023

After doing that you can look at the logs for the job by going to Monitoring → Job Inspector.

amiller · March 2023

Stream version: 3.4.1

Most of the collector settings are default.
I have added my event breakers.
I also set a custom field for routing the data back out to the filesystem.

I am going to run a collection with debug on now.

amiller · March 2023

The majority of the debug logs are…

"message: failed to pop task reason: no task in the queue"
and
"message: skipping metrics flush on pristine metrics"

Nothing sticks out to me as bad or breaking. There are no tasks in the queue because it's one file, so only 1 task.

Also, this third run captured a different number of events again.
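Since those two messages drown out everything else, a small filter can help surface the remaining debug lines. This is only a sketch: it assumes the job log has been exported as plain text with one entry per line, which may not match how Stream actually stores it, and `split_noise` is a made-up helper name.

```python
from collections import Counter

# Substrings of the two debug messages quoted above.
NOISE = (
    "failed to pop task",
    "skipping metrics flush on pristine metrics",
)


def split_noise(lines, noise=NOISE):
    """Separate known-noisy log lines from everything else.

    Returns (remaining_lines, counts), where counts maps each noise
    pattern to the number of lines that matched it.
    """
    counts = Counter()
    remaining = []
    for line in lines:
        for pattern in noise:
            if pattern in line:
                counts[pattern] += 1
                break
        else:
            # No noise pattern matched, so keep the line for review.
            remaining.append(line.rstrip("\n"))
    return remaining, counts
```

Passing an open file handle works because the function only iterates over lines, for example `split_noise(open("job.log"))` with a hypothetical exported log file.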
