How does the Google Cloud Storage source keep track of files already consumed?

I’m curious how the GCS source keeps track of files that it has already consumed and what prevents Stream from consuming the same file more than once? Our use case involves a BigQuery export of Gmail activity logs to a GCS bucket. We’d like to transfer those events to an S3 data lake (I know, moving from one cloud storage to another; we have reasons, but I digress). Ideally, we’d like to consume the events from each file in the GCS bucket and upon success, delete the file from GCS, but at the least, we want to make sure we don’t ingest the same file(s) over and over again.

Thanks!


I believe that, as a Source, you bring in logs from GCP much like you do with S3. With S3, you use SQS so Cribl can identify which logs are new and where they're located; Cribl then picks them up based on what SQS provides. Google Pub/Sub is similar in that it's the messaging system for GCP that tells Cribl which logs are new and what it needs to pick up.

Looking at the GCS collector, there is no setting for Google Pub/Sub, so I’m still curious how that’s working. The only thing we would configure would be the bucket name and a path.

Sorry, I misunderstood. I didn't realize you were talking about the Collector; I thought you meant the Google Cloud Pub/Sub Source. For the Collector, it works very much like picking up logs from an S3 bucket with a collector.

For me, I usually format my path with dates (e.g., archive/2022/04/12/13/).

This formatting allows you to use time-based tokens in your Collector:

archive/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}/
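As a rough Python sketch of how those strftime-style time tokens expand into a concrete path (the archive/ prefix and the run time here are just the example values from above, not anything Cribl-specific):

```python
from datetime import datetime, timezone

def resolve_path(run_time: datetime) -> str:
    # Illustrative only: ${_time:%Y} -> 4-digit year, ${_time:%m} -> month,
    # ${_time:%d} -> day, ${_time:%H} -> hour, mirroring strftime codes.
    return run_time.strftime("archive/%Y/%m/%d/%H/")

run_time = datetime(2022, 4, 12, 13, tzinfo=timezone.utc)
print(resolve_path(run_time))  # archive/2022/04/12/13/
```

Because each hour of data lands under its own prefix, a collection run scoped to one hour only ever lists that hour's objects.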

When you run your Collector on a cron schedule with a matching timeframe, Cribl doesn't necessarily need to track what's been read; you just ensure each run picks up where the last one ended.
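To make the scheduling idea concrete, here's a minimal sketch (my own illustration, not Cribl internals) of how a job that fires once an hour can always cover exactly the previous full hour, so consecutive runs never overlap and never re-read a file:

```python
from datetime import datetime, timedelta, timezone

def hourly_window(now: datetime) -> tuple[datetime, datetime]:
    # For a cron job that fires shortly after the top of each hour,
    # collect only the previous complete hour: [start, end).
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)
    return start, end

# A run triggered at 13:05 collects the 12:00-13:00 hour,
# i.e. the archive/2022/04/12/12/ prefix in the example layout above.
now = datetime(2022, 4, 12, 13, 5, tzinfo=timezone.utc)
start, end = hourly_window(now)
print(start.isoformat(), end.isoformat())
```

The next run at 14:05 gets [13:00, 14:00), and so on; the windows tile the timeline with no gaps or overlap.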

I hope that makes sense.

Further reference documentation here: Google Cloud Storage | Cribl Docs


It does; however, I'm pulling from a location where I don't have the ability to change the output path format for the keys in the storage location. So while I can appreciate how that method would help going forward, it still doesn't answer how Cribl internally keeps track of files it has already downloaded, to prevent it from downloading the same files again.