"Big Data" has been around for a while, and many organisations are forging ahead with Hadoop deployments or looking at NoSQL database models such as the open-source MongoDB to allow for the processing of vast logistical, marketing or consumer lead data sources. Infosec is no stranger to a big approach to data gathering and analytics. SIEM (security information and event management) solutions have long been focused on centralizing vast amounts of application and network device log data in order to provide a fast repository where known signatures can be applied.
Big & Fast
SIEM vendors have often differentiated their products on capacity and speed. Nitro (McAfee's SIEM product) prides itself on its supremely fast database, written in Ada. HP's ArcSight product is all about device and platform integration and scalability. The use of SIEM is symptomatic of the use of IT in general - the focus is on automating existing problems via integration and centralization. The drivers behind these are pretty simple - automation has a cost benefit and a tangible return on investment in the long term (staff can swap out to more complex, value-driven projects, and existing problems are turned around faster), whereas centralization often provides simpler infrastructures to support, maintain and optimize.
The Knowns, Unknowns and Known Unknowns of Security
I don't want to take too much inspiration from Donald Rumsfeld's confusing path of known unknowns, but there is a valid point: when it comes to protection in any aspect of life, knowing what you're protecting and, more importantly, who or what you are protecting it from, is incredibly important. SIEM products are incredibly useful at helping to find known issues, for example a login attempt that fails 3 times on a particular application, or traffic going to a blacklisted IP address. All of these characteristics have a known set of values, which help to build up a query. This can develop into a catalog of known queries (aka signatures) which can be applied to your dataset. The larger the dataset, the more bad stuff you hope to capture. This is where the three S's of SIEM come in - the sphere, scope and speed of analysis. Deployments want huge datasets, connected to numerous differing sources of information, with the ability to very quickly run a known signature against the data in order to find a match. The focus is on real-time (or near-real-time) analysis using a helicopter-in approach. Can this approach be extended further? A pure big-data style approach for security? How can we start to use that vast data set to look for the unknowns?
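To make the idea of a known signature concrete, here is a minimal Python sketch of the two examples above: repeated login failures and traffic to a blacklisted IP address. The event fields, the sample data and the function names are all assumptions made for illustration; they do not reflect any particular SIEM product's schema or query language.

```python
from collections import Counter

# Hypothetical, heavily simplified log events; real SIEM records carry far more fields.
EVENTS = [
    {"type": "login_failed", "user": "alice", "app": "crm"},
    {"type": "login_failed", "user": "alice", "app": "crm"},
    {"type": "login_failed", "user": "alice", "app": "crm"},
    {"type": "login_failed", "user": "bob", "app": "crm"},
    {"type": "connection", "src_ip": "10.0.0.5", "dst_ip": "203.0.113.9"},
    {"type": "connection", "src_ip": "10.0.0.7", "dst_ip": "198.51.100.20"},
]

BLACKLISTED_IPS = {"203.0.113.9"}  # assumed blacklist (documentation-range address)
MAX_FAILED_LOGINS = 3              # threshold taken from the example in the text


def repeated_login_failures(events, threshold=MAX_FAILED_LOGINS):
    """Return sorted (user, app) pairs with at least `threshold` failed logins."""
    counts = Counter(
        (e["user"], e["app"]) for e in events if e["type"] == "login_failed"
    )
    return sorted(pair for pair, n in counts.items() if n >= threshold)


def blacklisted_connections(events, blacklist=BLACKLISTED_IPS):
    """Return connection events whose destination address is on the blacklist."""
    return [
        e for e in events
        if e["type"] == "connection" and e["dst_ip"] in blacklist
    ]


if __name__ == "__main__":
    print(repeated_login_failures(EVENTS))
    print(blacklisted_connections(EVENTS))
```

Each signature is just a fixed query over known field values, which is why this style of analysis scales well with speed and volume but says nothing about behaviour nobody has thought to describe yet.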
Benefits to Security
The first area which seems to be gaining popularity is the marrying of SIEM activity data to identity and access management (IAM) data. IAM knows about an individual's identity (who, where and possibly why) as well as that identity's capabilities (who has access to what?), but IAM doesn't know what that user has actually been doing with their access. SIEM, on the other hand, knows exactly what has been going on (even without any signature analytics) but doesn't necessarily know by whom. Start to map activity user IDs or IP addresses to real identities stored in an IAM solution and you suddenly have a much wider scope of analysis, and also a lot more context around what you're analyzing. This can help with attempting to map out the 'unknowns' such as fraud and internal and external malicious attacks.
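As a rough illustration of that mapping, the sketch below joins activity records to identity records and compares what an account did against what it is entitled to do. The account names, field names and the `enrich` function are hypothetical; a real deployment would pull these from the SIEM and IAM products' own exports or APIs.

```python
# Hypothetical IAM export: account id -> identity details and entitlements.
IAM_IDENTITIES = {
    "jsmith01": {"name": "J. Smith", "role": "Sales Analyst", "entitlements": {"crm"}},
    "akhan02": {"name": "A. Khan", "role": "DBA", "entitlements": {"crm", "payroll_db"}},
}

# Hypothetical SIEM activity records keyed by account id.
ACTIVITY = [
    {"account": "jsmith01", "resource": "crm", "action": "read"},
    {"account": "jsmith01", "resource": "payroll_db", "action": "read"},
    {"account": "svc_unknown", "resource": "crm", "action": "write"},
]


def enrich(event, identities=IAM_IDENTITIES):
    """Attach identity context to an activity event and note any mismatch."""
    identity = identities.get(event["account"])
    enriched = dict(event)  # copy so the original event is left untouched
    enriched["identity"] = identity["name"] if identity else None
    enriched["role"] = identity["role"] if identity else None
    if identity is None:
        enriched["finding"] = "no matching identity"
    elif event["resource"] not in identity["entitlements"]:
        enriched["finding"] = "activity outside recorded entitlements"
    else:
        enriched["finding"] = None
    return enriched


if __name__ == "__main__":
    for event in ACTIVITY:
        print(enrich(event))
```

Neither data set could produce these findings on its own: the activity log has no idea what an account should be able to do, and the identity store has no idea what it actually did.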
Managing the known attacks is probably the easier place to start. This would involve understanding what metrics or signatures an organisation wants to focus on. Again, this would be driven by a basic asset classification and risk management process. What do I need to protect, and what scenarios would result in those assets being threatened? The approach from a security-analytics perspective is to not be focused on technical silos. Try to see security originating and terminating across a range of business and technical objects. If a malicious destination IP address is found in a TCP packet picked up via the firewall logs in the SIEM environment, that packet has originated somewhere. What internal host device maps to the source IP address? What operating system is the host device running? What common vulnerabilities does that device have? Who is using that device? What is their employee ID, job title or role? Are they a contractor or a permanent member of staff? Which systems are they using? Within those systems, what access do they have, was that access approved, what data are they exposed to, and so on? Suddenly the picture can be more complex, but also more insightful, especially when attempting to identify the true root cause.
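The chain of questions above can be sketched as a series of lookups, starting from a single firewall event and walking outwards. Everything here is assumed for illustration: the asset inventory, HR and access tables, their field names, the placeholder vulnerability label and the `trace` function are all hypothetical stand-ins for whatever systems actually hold that data.

```python
# Hypothetical asset inventory keyed by internal IP address.
ASSETS = {
    "10.0.0.5": {
        "host": "LAPTOP-042",
        "os": "Windows 10",
        "known_vulns": ["EXAMPLE-VULN-1"],  # placeholder label, not a real advisory
        "primary_user": "jsmith01",
    },
}

# Hypothetical HR records keyed by account id.
PEOPLE = {
    "jsmith01": {"employee_id": "E1001", "job_title": "Sales Analyst",
                 "employment": "contractor"},
}

# Hypothetical access records: which systems, at what level, and whether approved.
ACCESS = {
    "jsmith01": [
        {"system": "crm", "level": "read", "approved": True},
        {"system": "payroll_db", "level": "read", "approved": False},
    ],
}


def trace(firewall_event, assets=ASSETS, people=PEOPLE, access=ACCESS):
    """Walk from a firewall event to host, user and access context.

    Missing links in the chain yield None or empty lists rather than errors.
    """
    asset = assets.get(firewall_event["src_ip"], {})
    account = asset.get("primary_user")
    person = people.get(account, {})
    grants = access.get(account, [])
    return {
        "src_ip": firewall_event["src_ip"],
        "dst_ip": firewall_event["dst_ip"],
        "host": asset.get("host"),
        "os": asset.get("os"),
        "known_vulns": asset.get("known_vulns", []),
        "account": account,
        "employee_id": person.get("employee_id"),
        "job_title": person.get("job_title"),
        "employment": person.get("employment"),
        "systems": [g["system"] for g in grants],
        "unapproved_access": [g["system"] for g in grants if not g["approved"]],
    }


if __name__ == "__main__":
    print(trace({"src_ip": "10.0.0.5", "dst_ip": "203.0.113.9"}))
```

The interesting design point is that no single lookup is difficult; the value comes from having all of these sources reachable from one place so the chain can be followed without hopping between silos.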
This complex chain of correlated "security big data" can be used in a number of ways, from post-incident analysis and trend analytics to the mapping of internal data to external threat intelligence.
Big data is here to stay and security analytics just needs to figure out the best way to use it.