Strategies for Reducing Log Volumes: Deduplication/Aggregation

See the community Slack thread for more context:

We occasionally identify log sets that look like good candidates for deduplication, but the events usually differ by one or two attributes that make each one important in its own right, for example a Correlation ID.

Is there a way for Cribl to “dedup” those kinds of logs but then aggregate all the unique Correlation IDs into one field, or something like that?

Thanks!
Ryan

3 UpGoats

I think what you’re asking for (event aggregation) requires a temporary data store of some kind, or a caching mechanism shared across workers. It’s really a workflow problem, not just a data-stream problem. There are a number of approaches you can take, but they’re not necessarily straightforward. We’re doing something similar for building transactions (based on event sequences): we’re porting our Transaction Analytics engine from Splunk, and it’s been an interesting exercise.
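
To make the shape of that concrete, here’s a rough TypeScript sketch of what such a shared store would need to support (the names are made up; this is not a Cribl API, and the choice of backing store is a separate decision):

```typescript
// Rough sketch only: the operations a cross-worker dedup/aggregation store
// would need to support. Names are illustrative, not anything Cribl ships,
// and the backing store (Redis, a database, etc.) is left open.
interface SharedDedupStore {
  // Record one occurrence of a dedup key and stash the attribute that made
  // the event unique (e.g. its Correlation ID). Returns the running count.
  record(dedupKey: string, correlationId: string): Promise<number>;

  // Drain everything accumulated for a key so a single summary event can be
  // emitted with all of the Correlation IDs attached.
  flush(dedupKey: string): Promise<{ count: number; correlationIds: string[] }>;

  // List keys whose window has expired and are ready to be flushed.
  expiredKeys(maxAgeMs: number): Promise<string[]>;
}
```
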
Ping me if you want to discuss more in depth.

Mike

4 UpGoats

Is there a solution if I’m content to settle for worker-level aggregation?

2 UpGoats

Worker-level is easier because you can use an in-memory cache across pipelines, but you still need a way to flush it, etc. It’s more of a programming exercise.
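
Per worker, the shape is roughly something like this. Generic TypeScript, purely illustrative: it isn’t any Cribl API, and the event shape, dedup key, and flush interval are all assumptions:

```typescript
// Minimal sketch of worker-level dedup with Correlation ID accumulation.
// The dedup key, event shape, and flush interval are all assumptions.
type LogEvent = { message: string; correlation_id: string; [key: string]: unknown };

const cache = new Map<string, { first: LogEvent; correlationIds: string[] }>();
const FLUSH_MS = 30_000; // assumption: flush summaries every 30 seconds

function ingest(event: LogEvent): void {
  const key = event.message; // dedup key: the parts of the event that repeat
  const entry = cache.get(key);
  if (entry) {
    entry.correlationIds.push(event.correlation_id); // drop the duplicate, keep its ID
  } else {
    cache.set(key, { first: event, correlationIds: [event.correlation_id] });
  }
}

function flush(emit: (e: LogEvent) => void): void {
  for (const { first, correlationIds } of cache.values()) {
    // Emit one representative event carrying every unique Correlation ID.
    emit({ ...first, correlation_ids: correlationIds, dedup_count: correlationIds.length });
  }
  cache.clear();
}

setInterval(() => flush(console.log), FLUSH_MS);
```
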

2 UpGoats

At the worker level, the built-in Aggregations function should do what you need, no?

3 UpGoats

I’ve only played with the stats agg functions (and it’s been a while), so there may be a function that supports queueing strings/events…

2 UpGoats

I haven’t tested, but I think values(correlation_id) is likely to provide the necessary aggregation.
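
Untested, but the intent would be roughly this (illustrative TypeScript data; the field names other than correlation_id are made up, and the exact output field naming will depend on how the aggregation is configured):

```typescript
// Illustrative input: near-duplicate events that differ only by Correlation ID.
const input = [
  { message: "payment timeout", service: "checkout", correlation_id: "a1" },
  { message: "payment timeout", service: "checkout", correlation_id: "b2" },
  { message: "payment timeout", service: "checkout", correlation_id: "c3" },
];

// Rough shape of one aggregated event after grouping on message and service
// over a time window, with count() and values(correlation_id). The output
// field names here are a guess; check what your aggregation actually emits.
const aggregated = {
  message: "payment timeout",
  service: "checkout",
  count: 3,
  correlation_ids: ["a1", "b2", "c3"],
};
```
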

3 UpGoats

Ah, yes. I was thinking of full event aggregation rather than just field aggregation.

2 UpGoats

Thanks guys, we’ll play with this to see if it works for us…

2 UpGoats