Authored by Quadrant CTO, Champ Clark III
I recently spent the week in picturesque Greece with my friends at Suricata, taking in the annual Suricon user event while delivering a talk centered on “Network Data Points” (NDPs). I gave the presentation on November 9th, 2022 in Athens, as one of several talks about examining security data and storage differently.
My concept grew out of a consistent trend I’ve identified when security incidents arise, particularly around locating certain pieces of data quickly. These pieces of data are typically referred to as IoCs, or “Indicators of Compromise”. Examples of IoCs include IP addresses, file hashes, TLS JA3 hashes, file names, and much more.
The typical “accepted” solution is to stuff all your security data into a SIEM and search the datasets within. The issue for most organizations arises in the retention of that data. Your average company doesn’t have clusters of machines with petabytes of storage and terabytes of RAM to keep the data intact. What most organizations do, instead, is store everything they can within their SIEM and archive the rest of the data to “cold” storage (raw, compressed data on disk or in the cloud).
Assuming the IoC is recent enough to search through said SIEM, you’ll likely find what you’re looking for. All good, right?
Not so fast. It’s become increasingly common for our team to be asked for IoC searches that extend 6 months or longer. This is where things get tricky – because you’re having to decompress data within cold storage to execute the search and locate the data. These types of lookups can take hours or even days.
My talk, titled “Let’s Try Something New with Suricata Storage”, addresses this issue in depth, and you can view it in its entirety below. It’s really a simple concept at its core.
How Quadrant Handles Data Distillation
We take the data that is going into our SIEM and “distill” it down to only what we need. Much of the data we collect, be it from network traffic or log analysis, consists of repeating events.
As an example, when making an HTTP connection to a web site, our sensor records the connection within the “flow” data. That record includes the timestamp and “flow ID” of the connection, the TCP flags, the source IP address, destination IP address, source port, destination port, how much data the client and server sent, the age of the session, etc.
While this is happening, we’re also recording data at the protocol level. What was the hostname and URI the client went to? What was the server’s HTTP result code? I could go on.
At this point, we’ve collected a substantial amount of information – but how much, if any, is actually useful? Especially if the question is “have we seen this IP address in the last six months?”
If we take this type of dataset and distill it down, we can make a new record with only the external IP address of the web site, the “flow ID”, the timestamp, and the decoded protocol. We also only store the very last recorded timestamp for an IP address, creating a single “last seen” record.
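To make the distillation step concrete, here is a minimal Python sketch. It is not how Meer does it internally; it only illustrates the idea of keeping one “last seen” record per external IP. The event field names (`timestamp`, `flow_id`, `src_ip`, `dest_ip`, `app_proto`) follow Suricata’s EVE JSON output, while `HOME_NET`, `external_ip`, `distill`, and the sample events are assumptions made up for this illustration.

```python
import ipaddress
from datetime import datetime

# Assumption: the monitored network. Anything outside it counts as "external".
HOME_NET = ipaddress.ip_network("10.0.0.0/8")
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"  # layout of Suricata EVE timestamps


def external_ip(event):
    """Return the first address in the event that lies outside HOME_NET, or None."""
    for key in ("dest_ip", "src_ip"):
        ip = event.get(key)
        if ip and ipaddress.ip_address(ip) not in HOME_NET:
            return ip
    return None


def distill(events):
    """Reduce full events to one record per external IP, keeping the newest one."""
    ndps = {}
    for event in events:
        ip = external_ip(event)
        if ip is None:
            continue  # purely internal traffic, nothing to record in this sketch
        seen = datetime.strptime(event["timestamp"], TS_FORMAT)
        current = ndps.get(ip)
        if current is None or seen > current["last_seen"]:
            ndps[ip] = {
                "ip": ip,
                "flow_id": event["flow_id"],
                "last_seen": seen,
                "protocol": event.get("app_proto", "unknown"),
            }
    return ndps


# Hypothetical sample data: two connections to the same web site.
SAMPLE_EVENTS = [
    {"timestamp": "2022-11-01T08:00:00.000000+0000", "flow_id": 1001,
     "event_type": "flow", "src_ip": "10.0.0.5", "dest_ip": "203.0.113.10",
     "app_proto": "http"},
    {"timestamp": "2022-11-09T10:15:00.000000+0000", "flow_id": 2002,
     "event_type": "flow", "src_ip": "10.0.0.5", "dest_ip": "203.0.113.10",
     "app_proto": "http"},
]

ndps = distill(SAMPLE_EVENTS)
print(ndps["203.0.113.10"]["flow_id"])  # 2002, the most recent connection
```

However many times the same site is visited, the distilled output holds a single record for its IP address, which is where the storage savings come from.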
The “flow ID” is an identifier that ties records together. In our example, the HTTP connection creates flow, http, and file information records. The Suricata backend assigns each connection a “flow ID” that is unique to that connection, and every record produced for the session carries it. This makes it much easier to pivot from one data source (flow data) to another (http, for example).
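The pivot can be sketched in a few lines of Python. The `flow_id` and `event_type` fields are part of Suricata’s EVE JSON output; the `group_by_flow` helper and the sample records are hypothetical and only show how a shared flow ID links different record types.

```python
from collections import defaultdict


def group_by_flow(events):
    """Index events by flow_id, then by event_type, so related records sit together."""
    flows = defaultdict(lambda: defaultdict(list))
    for event in events:
        flows[event["flow_id"]][event["event_type"]].append(event)
    return flows


# Hypothetical records produced by one HTTP session, plus one unrelated DNS event.
RECORDS = [
    {"flow_id": 2002, "event_type": "flow", "dest_ip": "203.0.113.10"},
    {"flow_id": 2002, "event_type": "http",
     "http": {"hostname": "example.com", "url": "/index.html", "status": 200}},
    {"flow_id": 2002, "event_type": "fileinfo",
     "fileinfo": {"filename": "/index.html"}},
    {"flow_id": 3003, "event_type": "dns", "dest_ip": "10.0.0.53"},
]

flows = group_by_flow(RECORDS)
# Pivot from the flow record to the HTTP detail of the same session.
http_record = flows[2002]["http"][0]
print(http_record["http"]["hostname"])  # example.com
```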
Proving the Concept
When we tested this concept for our 2022 Suricon Athens talk, we used a 128GB network data log (flows, dns, file information, http sessions, etc.) and dumped it into our SIEM. Once stored, the non-distilled data consumed approximately 550GB of disk space. Yet, when storing only the NDPs, that figure dropped to 500MB (megabytes)!
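For a sense of scale, the arithmetic behind those figures can be checked with a short calculation. The numbers are the approximate ones quoted above; the assumption that 1GB equals 1000MB is mine, and using 1024 shifts the result only slightly.

```python
def reduction_factor(full_gb, ndp_mb, mb_per_gb=1000):
    """How many times smaller the NDP index is than the full SIEM data."""
    return full_gb * mb_per_gb / ndp_mb


factor = reduction_factor(550, 500)  # 550GB in the SIEM vs 500MB of NDPs
ndp_share_percent = 100 / factor     # NDP size as a percentage of the full data
expansion = 550 / 128                # growth of the raw 128GB log once indexed

print(factor)                        # 1100.0
print(round(ndp_share_percent, 2))   # 0.09
print(round(expansion, 1))           # 4.3
```

In other words, the NDP index is roughly a thousand times smaller than the same data stored in full.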
The NDP index within our SIEM operates just like any other index. It just happens to be a very stripped-down representation of what is currently in our SIEM, while also providing a pivot point for cold storage.
Our hypothetical SIEM query becomes “Have we seen this IP address in the last year?” We can now go to our NDP index and run a quick search for that IP. If it is not found, we instantly have our answer. This effectively eliminates the need to decompress thousands of logs in cold storage, or even to search current “hot” storage!
Alternatively, let’s say that we have seen the IP address in question, occurring roughly 8 months ago. The SIEM not only returns that timestamp, which acts as the “last seen” record, but also returns the unique flow ID. This means we can essentially “skip” over 8 months of data, go right to the date and time the event took place, and start our search there. Even with this data in “cold” storage, we’ve drastically reduced our search times.
We can also instantly answer “Yes! We’ve observed this IP address within our environment”.
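Both outcomes of the lookup can be sketched as follows. This is an illustration only: the in-memory dictionary stands in for the NDP index, and the `plan_search` helper and the daily archive naming scheme (`eve-YYYY-MM-DD.json.gz`) are hypothetical, not a description of how our cold storage is laid out.

```python
from datetime import datetime, timezone

# Hypothetical NDP index: one "last seen" record per external IP.
NDP_INDEX = {
    "203.0.113.10": {
        "flow_id": 2002,
        "last_seen": datetime(2022, 3, 9, 10, 15, tzinfo=timezone.utc),
    },
}


def plan_search(ndp_index, ip, archive_pattern="eve-%Y-%m-%d.json.gz"):
    """Return None if the IP was never seen, otherwise where to start looking."""
    record = ndp_index.get(ip)
    if record is None:
        return None  # instant "no": nothing to decompress, nothing to search
    return {
        "last_seen": record["last_seen"],
        "flow_id": record["flow_id"],
        # Assumed naming scheme: one compressed archive per day.
        "archive": record["last_seen"].strftime(archive_pattern),
    }


print(plan_search(NDP_INDEX, "198.51.100.7"))             # None
print(plan_search(NDP_INDEX, "203.0.113.10")["archive"])  # eve-2022-03-09.json.gz
```

With the archive and flow ID in hand, only that one day of cold storage needs to be decompressed, and the flow ID narrows the search to the records of that session.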
My 2022 Suricon presentation goes into deeper detail around how we store the data with our new indexing engine, named “Zinc”, and how our “data broker” software “Meer” distills the data. While this presentation only covers network data, we’re applying the same concepts to “log” data, as well. We intend to roll this out to our customers by Q1 2023. Stay tuned!