Since its introduction by WheelGroup in 1995, signature-based detection has been a staple of antivirus software. Now, over twenty years later, it seems to have reached the limits of its usefulness. In 2016, the Webroot Threat Report found that, thanks to a large spike in the use of polymorphic (self-altering) code, 94% of the malware observed that year was unique, encountered only once. That trend has continued into 2018, and, like shaking an Etch A Sketch, every shift in a malicious file’s form renders all the work done on defining its characteristics obsolete.

At InQuest, we have found a robust answer to this problem in generalized, heuristic signatures that work together to estimate the overall likelihood that a file is malicious, but such signatures can be difficult and unintuitive to create. Enter machine learning, the adaptive solution to an amorphous problem. By building statistical models from past observations of the commonalities and differences among malicious and benign files, a properly tuned ML algorithm makes a great partner for a signature author, providing insights that a human expert might not have thought of. It can even work alongside the author, constructing signatures of its own, and as an added bonus it can catch malware directly, without the use of signatures. We at InQuest pride ourselves on being at the forefront of malware detection and prevention, so it should come as no surprise that we have our own machine learning stack in development.
Built from four separate ML classifiers (random forest, support vector machine, logistic regression, and gradient boosting) and trained on more than 90 features extracted from a dataset of millions of samples, InQuest Machina is our own proprietary way of applying machine learning in our quest for airtight cybersecurity. Over the course of the next month, in a new blog series we’ve decided to call Ex Machina, we’ll go over our classifiers, the features they analyze, and the effectiveness of our methods.
Of course, we should first go over the numerous ways machine learning has already contributed to computer security, both to put our own work in context and to explain in depth what exactly ML does and how versatile it is. First, it’s important to establish what machine learning is not: namely, while it’s a critical component of AI systems, it isn’t AI in and of itself, much as our memories are an important part of our consciousness but aren’t all that makes it up. Where AI is focused on replicating human thought as a whole, machine learning is specifically geared toward emulating the mind’s ability to learn from experience; in other words, it applies mathematical techniques to learn patterns from previous data and uses those patterns to predict future results. With this in mind, ML algorithms can be sorted into two broad types, supervised and unsupervised, based on how they analyze and utilize this past data. Both have their uses in the security pipeline.
Supervised learning is the more straightforward of the two: given a set of examples previously labeled by humans, the algorithm learns which combinations of features in the examples are associated with which labels. In computer security, these examples could be a collection of emails labeled either spam or not spam, and their features could be the frequency of certain words (e.g. ‘winner’ or ‘special’) per document. It is then up to the algorithm to construct a model that, given these frequencies, can determine the category of an email with a high degree of confidence. The methods used in pursuit of this goal range from assigning each feature a weight based on how strongly changes to it affect the likelihood of each label, to building hundreds of decision trees (collections of 20-questions-style “if-then” statements). Where it really gets cool is that, after such a model is trained, the importance it has placed on various features can be used to gain insight into the creation of new signatures, ones that might have been too obtuse for a human being to have considered.
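To make the weighted-feature idea concrete, here is a minimal, self-contained sketch of the spam example above. The training data, word choices, and hand-rolled logistic regression are illustrative assumptions only, not our production pipeline or its features:

```python
# Illustrative sketch (toy data, not a real corpus): logistic regression
# trained by plain gradient descent on two made-up word-frequency
# features -- how often "winner" and "special" appear per document.

import math

# Toy labeled training set: (feature vector, label), label 1 = spam.
train = [
    ([0.9, 0.7], 1), ([0.8, 0.5], 1), ([0.7, 0.9], 1),
    ([0.1, 0.0], 0), ([0.0, 0.2], 0), ([0.2, 0.1], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, lr=0.5, epochs=2000):
    """Fit per-feature weights and a bias with gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - y  # gradient of the log-loss w.r.t. the logit
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
            b -= lr * err
    return w, b

w, b = train_logistic(train)

def predict(x):
    """Probability that a document with features x is spam."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

# The learned weights double as feature importances: a large positive
# weight on a word's frequency marks it as a candidate for a signature.
print(predict([0.85, 0.6]))  # high "winner"/"special" usage -> near 1
print(predict([0.05, 0.1]))  # low usage -> near 0
```

Note how the trained weights themselves are the insight: a signature author can read them off directly to see which features the model found most discriminative.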
Unsupervised learning, on the other hand, works with unlabeled data. The task set before an unsupervised algorithm is to group data by its features, forming “clusters” or “families” of similar items. One example of how this is currently used in the security realm is the discovery of common origins for various kinds of malware. Because malware with similar features, such as a narrow range of IP addresses or an unusual distribution of specific characters, is likely to come from similar sources, unsupervised learning can help provide signature authors with the means to construct specific checks for these sources on incoming files. The reverse can also be true: by finding patterns that benign files share with one another, code can be written to reject files that deviate significantly from such clusters, a technique known as anomaly detection.
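As a concrete illustration of both clustering and anomaly detection, here is a minimal sketch using a hand-rolled k-means on made-up two-dimensional feature vectors. The data, the deterministic initialization, and the distance threshold are all assumptions chosen for clarity, not a real malware feature space:

```python
# Illustrative sketch (toy data): k-means clustering of feature vectors
# into "families", plus a distance-based anomaly check that flags any
# point far from every learned cluster center.

import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a simple deterministic initialization."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

# Two loose "families" of samples in a toy 2-D feature space.
samples = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.18],
           [0.9, 0.8], [0.85, 0.9], [0.95, 0.85]]
centroids = kmeans(samples, k=2)

def is_anomaly(point, centroids, threshold=0.3):
    """True if the point sits far from every known cluster center."""
    return min(dist(point, c) for c in centroids) > threshold

print(is_anomaly([0.12, 0.15], centroids))  # close to a family -> False
print(is_anomaly([0.5, 0.5], centroids))    # far from both -> True
```

The same two artifacts serve both uses described above: the cluster centers summarize each family for signature work, while the distance threshold implements the benign-baseline anomaly check.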
At InQuest, we hope to share how we’ve put both of these methods to work in our journey to construct the optimal security apparatus, starting with the fruits of our own experiments with supervised learning and how they’ve shaped our signature construction. Stay tuned as we discuss the interesting insights our random forest, SVM, logistic regression, and gradient boosting classifiers have produced, as well as the science behind how each one obtains its results.
If you'd like to hear more about how we apply machine learning to augment our human intelligence, schedule a briefing.