How does the Google Cloud Storage source keep track of files already consumed?

josh.hart Posts: 8

I’m curious how the GCS source keeps track of files that it has already consumed and what prevents Stream from consuming the same file more than once? Our use case involves a BigQuery export of Gmail activity logs to a GCS bucket. We’d like to transfer those events to an S3 data lake (I know, moving from one cloud storage to another; we have reasons, but I digress). Ideally, we’d like to consume the events from each file in the GCS bucket and upon success, delete the file from GCS, but at the least, we want to make sure we don’t ingest the same file(s) over and over again.

Thanks!

Answers

  • Chris
    Chris Posts: 13 mod

    I believe that, as a Source, you bring in logs from GCP much like you do with S3. With S3, you use SQS so that Cribl can identify which logs are new and where they are located. Cribl then picks them up based on what SQS provides. Google Pub/Sub is similar in that it’s the messaging system for GCP that tells Cribl which logs are new and what Cribl needs to pick up.

  • josh.hart
    josh.hart Posts: 8

    Looking at the GCS collector, there is no setting for Google Pub/Sub, so I’m still curious how that’s working. The only things we would configure are the bucket name and a path.

  • Chris
    Chris Posts: 13 mod

    Sorry, I misunderstood. I didn’t realize you were talking about the Collector; I thought you meant the Google Cloud Pub/Sub Source. For the Collector, it would be very similar to how you would pick up logs from an S3 bucket using a Collector.

    I usually format my paths with dates (e.g., archive/2022/04/12/13/).

    This formatting allows you to use time-based tokens in your Collector:

    archive/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}/

    When you run your Collector, you can set up the cron schedule and time range so that Cribl doesn’t necessarily need to know what’s already been read; you just make sure each run picks up where the last one ended.

    I hope that makes sense.

    Further reference documentation here: Google Cloud Storage | Cribl Docs
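
To illustrate the date-partitioned approach described above, here is a minimal Python sketch. It is not Cribl code: the template uses strftime codes in place of the ${_time:...} tokens, and the function names (hourly_prefixes, consecutive_windows) are hypothetical. It only shows that back-to-back time windows map to disjoint path prefixes, so no run revisits a folder covered by an earlier run.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for archive/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}/
PATH_TEMPLATE = "archive/%Y/%m/%d/%H/"


def hourly_prefixes(earliest, latest):
    """Return one path prefix per hour in the half-open window [earliest, latest)."""
    current = earliest.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < latest:
        prefixes.append(current.strftime(PATH_TEMPLATE))
        current += timedelta(hours=1)
    return prefixes


def consecutive_windows(start, runs, length=timedelta(hours=1)):
    """Build back-to-back windows so each run starts where the previous one ended."""
    return [(start + i * length, start + (i + 1) * length) for i in range(runs)]


# Three scheduled runs, each covering exactly one hour.
start = datetime(2022, 4, 12, 13, tzinfo=timezone.utc)
windows = consecutive_windows(start, runs=3)
run_prefixes = [hourly_prefixes(earliest, latest) for earliest, latest in windows]

for (earliest, latest), prefixes in zip(windows, run_prefixes):
    print(earliest.isoformat(), "->", latest.isoformat(), prefixes)
```

Because each window is half-open and starts exactly where the previous one ended, the schedule itself does the bookkeeping. This only works when the object keys actually encode the time.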

  • josh.hart
    josh.hart Posts: 8

    It does; however, I’m planning to pull from a location where I don’t have the ability to change the output path format for the keys in the storage location. So, while I can appreciate how such a method would help going forward, it still doesn’t answer how Cribl internally keeps track of files it’s already downloaded to prevent downloading the same files again.
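
Nothing in this thread states how Cribl tracks consumed files internally, so the following is only a generic pattern, not a description of Cribl's behavior. When the key format cannot be changed, one common way to avoid re-ingesting objects is to keep a ledger of names already processed and, optionally, delete each object once it has been handled successfully. All names here (unprocessed, run_once, the process and delete callbacks) are hypothetical, and the storage calls are deliberately left as plain callables rather than real GCS API calls.

```python
def unprocessed(listing, ledger):
    """Return object names from the listing that are not in the ledger, in listing order."""
    return [name for name in listing if name not in ledger]


def run_once(listing, ledger, process, delete=None):
    """Process each new object once; record it (and optionally delete it) only on success."""
    handled = []
    for name in unprocessed(listing, ledger):
        try:
            process(name)
        except Exception:
            # Leave the name out of the ledger so a later run retries it.
            continue
        ledger.add(name)
        if delete is not None:
            delete(name)
        handled.append(name)
    return handled


# Two runs against a bucket listing that grows by one object in between.
ledger = set()
deleted = []
first = run_once(["a.json", "b.json"], ledger, process=lambda n: None, delete=deleted.append)
second = run_once(["a.json", "b.json", "c.json"], ledger, process=lambda n: None, delete=deleted.append)
print(first, second, deleted)
```

In practice the ledger would need to live somewhere durable (a file or a database) so that it survives restarts. If deletion on success is reliable, the ledger mostly serves as a safety net.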