File size upper limit when collecting from File System?

amiller
amiller Posts: 21
edited September 2023 in Stream

I am seeing some irregularities with collecting large files from a filesystem.

We batch process files anywhere from 100MB to 100GB in size. I am currently noticing an issue with larger files.

To troubleshoot, I created a collector that reads the data in and writes it directly back to disk. No other ETL is done on the data.

[Screenshot: result of one collection]

[Screenshot: result of collecting the same file from the same location, using the same event breakers and going to the same destination]

My current environment is 1 leader, 1 worker, and the file is being picked up from an NFS mount.

As you can see from the screenshots, only 3-4 million of the ~24 million events in the file are being collected. The destination writes about 5-6 GB to disk, out of the original ~38 GB.
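As a rough sanity check on those figures, the fraction of events collected can be compared with the fraction of bytes written. The sketch below uses only the approximate numbers quoted above; the variable and function names are illustrative, not anything from Stream.

```python
# Approximate figures quoted in the post (ranges, not exact measurements).
total_events = 24_000_000
collected_events = (3_000_000, 4_000_000)
total_gb = 38
written_gb = (5, 6)


def fraction_range(part_range, total):
    """Return the (low, high) fractions of `total` for a (low, high) range."""
    low, high = part_range
    return low / total, high / total


event_frac = fraction_range(collected_events, total_events)
byte_frac = fraction_range(written_gb, total_gb)

print(f"events collected: {event_frac[0]:.1%} to {event_frac[1]:.1%}")
print(f"bytes written:    {byte_frac[0]:.1%} to {byte_frac[1]:.1%}")
```

Both ranges land at roughly 13-17%, so the event shortfall and the byte shortfall are in proportion. That would be consistent with the read stopping partway through the file rather than events being dropped evenly throughout, although these numbers alone cannot confirm it.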

There are no errors that I see in the job log, and I can't find any setting regarding worker process limits or job limits that would affect this.
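One way to cross-check the counts outside of Stream is to count records and bytes in the source file and in the written output directly. The following is a minimal sketch that assumes events are newline-delimited, which may not match the custom event breakers used here; the function name is hypothetical.

```python
def count_lines_and_bytes(path, chunk_size=1024 * 1024):
    """Stream a file in fixed-size chunks and return (newline_count, byte_count).

    Reading in chunks keeps memory use flat even for files of tens of GB.
    """
    lines = 0
    size = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            lines += chunk.count(b"\n")
            size += len(chunk)
    return lines, size
```

Running this against the file on the NFS mount and against the destination output after each collection would show whether the totals differ from run to run, independently of the job statistics.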

Answers

  • Kyle McCririe
    Kyle McCririe Posts: 29 ✭✭

    What version of Stream are you on?

    Can you show your Collector settings?

    Turning on debug on the collector could possibly provide more information.

  • Kyle McCririe
    Kyle McCririe Posts: 29 ✭✭

    It might be worth opening a ticket. I think we will need to take a closer look at your logs and configurations through a diag.

  • amiller
    amiller Posts: 21

    Sounds good. I'll open a case with a link to this thread.

  • Kyle McCririe
    Kyle McCririe Posts: 29 ✭✭
    edited July 2023

    After doing that you can look at the logs for the job by going to Monitoring → Job Inspector.

  • amiller
    amiller Posts: 21
    edited July 2023

    Stream version: 3.4.1

    Most of the collector settings are default.
    I have added my event breakers and set a custom field for routing specifically back out to the filesystem.

    I am going to run a collection with debug on now.

  • amiller
    amiller Posts: 21
    edited July 2023

    The majority of the debug logs are…

    "message: failed to pop task reason: no task in the queue"
    and
    "message: skipping metrics flush on pristine metrics"

    Nothing sticks out to me as bad or breaking. There are no tasks in the queue because it's one file… so only 1 task.

    Also, this third run again captured a different number of events.