Stringless YARA Rules

By: Rob King

September 30, 2018

Here at InQuest, YARA is among the many tools we use to perform Deep File Inspection® (DFI), with a fairly extensive rule set. InQuest operates at line speed in very high-traffic networks, so these rules need to be fast.

This blog post is the first in a series discussing YARA performance notes, tips, and hacks.

YARA bills itself as the “Pattern Matching Swiss Knife”. (It used to be the “Pattern Matching Swiss Army Knife” but apparently “Swiss Army Knife” is a trademark of Victorinox AG. Who knew?)

It’s used to determine if a given input (often a file, but it can attach to a running process and analyze its memory too) matches any defined rules. YARA rules consist of three parts, two of them optional:

Some rule metadata, which is just a mapping of strings to values. These name-value pairs have no effect on the rule itself, but are useful for conveying additional information when a rule matches. This section is optional.
Some “strings”. I put it in quotes because they’re not limited to static strings; regular expressions are also permitted. This section is likewise optional.
A condition. This is exactly one boolean expression of arbitrary complexity. If it evaluates to true for a given file, the rule fires. This section isn’t optional.

There are a few other things that can happen in rules, like tags, but they are beyond the scope of this blog post.

The condition is where the magic is all tied together. It can check for matches of any strings/regular expressions defined in the rule, check against the values of some provided external variables, call external functions, and even run loops.

The Example Rule

To use a simple example, let’s write a rule that detects if the file we’re looking at is an Adobe Flash file.

Adobe Flash files begin with one of three magic strings: “FWS”, “CWS”, and “ZWS”. The three magic strings differentiate Flash files based on the compression mechanism used for the data they contain.
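For reference, the check that every rule in this post is trying to express can be sketched in a few lines of Python. This is only an illustration of the goal, not YARA and not anything InQuest ships; the names `FLASH_MAGICS` and `looks_like_flash` are made up for this example.

```python
# Hypothetical helper: the three Flash magic strings named in the text.
FLASH_MAGICS = (b"FWS", b"CWS", b"ZWS")


def looks_like_flash(data: bytes) -> bool:
    """Return True if the buffer starts with one of the Flash magic strings."""
    # bytes.startswith accepts a tuple of prefixes and only inspects the
    # start of the buffer, however large the buffer is.
    return data.startswith(FLASH_MAGICS)


if __name__ == "__main__":
    print(looks_like_flash(b"CWS\x0a rest of file"))
    print(looks_like_flash(b"not flash, but contains CWS later"))
```

The point to notice is that only the first three bytes ever matter; the rest of the post is about getting YARA to do the same small amount of work.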

A First Attempt

This is the “obvious” way to write this in YARA:

rule Flash
{
    strings:
        $flash_magic = /[FCZ]WS/

    condition:
        $flash_magic at 0
}

This defines a single regular expression (named $flash_magic), case sensitive, that will match the three strings we defined above. We then say the rule Flash will match if $flash_magic has a match in the file at offset 0 (that is, the beginning of the file).

Let’s see how fast this is. I’ll run it on a Flash file in the InQuest testing corpus:

yara -f test1.rule testfile : 0.006s

0.006s is really pretty good, right? But let’s try running it on a much larger (1GB), more malicious file:

yara -f test1.rule maliciousfile : 0.838s
error scanning maliciousfile: string "$flash_magic" in rule "Flash" caused too many matches

Hm. That’s no good. We couldn’t run the rule on the file at all.

Too Many Matches

What’s going on here is that YARA first finds all the matches for a regular expression in the file and then checks the rule conditions to see if they are true. The malicious file has a huge number of strings that end up containing the characters “CWS”, “FWS”, and/or “ZWS”.
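The mismatch between "find every match" and "only offset 0 matters" can be shown with a small Python sketch. It uses Python's `re` module on a synthetic buffer, so it models only the order of operations described above, not YARA's engine or its actual match limit. The function names and the test buffer are assumptions made for illustration.

```python
import re

# Same pattern as the rule's $flash_magic, as a Python bytes regex.
FLASH_RE = re.compile(rb"[FCZ]WS")


def count_all_matches(data: bytes) -> int:
    """Count every non-overlapping occurrence anywhere in the buffer."""
    return sum(1 for _ in FLASH_RE.finditer(data))


def matches_at_zero(data: bytes) -> bool:
    """Check offset 0 only, which is all the rule's condition asks about."""
    # Pattern.match only tries to match at the start of the buffer.
    return FLASH_RE.match(data) is not None


if __name__ == "__main__":
    # Synthetic stand-in for the large file: one real header followed by
    # many incidental occurrences of the magic bytes.
    data = b"CWS" + b"\x00FWS\x00ZWS" * 100_000
    print(count_all_matches(data))
    print(matches_at_zero(data))
```

Every one of the incidental matches has to be found and recorded before the condition gets a chance to discard all but the first.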

Obviously, we don’t want to fail to analyze the file, so we need to try a few alternatives.

Don’t Match So Much

Let’s try modifying the regular expression so that it won’t match so much. Since the magic bytes are always at the beginning of the file, let’s anchor the regular expression:

rule Flash
{
    strings:
        $flash_magic = /^[FCZ]WS/

    condition:
        $flash_magic
}

How well does this run?

yara -f test2.rule maliciousfile : 29.821s

Success! YARA completed the analysis of the file. This makes sense, because the regular expression now doesn’t have too many matches: it can only match at the beginning of the file.

However, it took about thirty seconds to run. Can we get things faster?

Would Static Strings Be Better?

Let’s try changing the regular expression to three string matches and use YARA’s condition to tie them together:

rule Flash
{
    strings:
        $flash_magic1 = "FWS"
        $flash_magic2 = "CWS"
        $flash_magic3 = "ZWS"

    condition:
        $flash_magic1 at 0 or $flash_magic2 at 0 or $flash_magic3 at 0
}

Let’s try running this rule on the malicious file now:

yara -f test3.rule maliciousfile : 23.726s

Pretty good. We sped up by around six seconds.

Now, the question is, can we go faster? It turns out, we can…but why it works will take some explaining.

Strings and Atoms

YARA works hard to make string and regular expression matching fast. It operates under the assumption that a file is going to have hundreds or even thousands of rules run against it, and that each of those rules will usually contain at least one string, and often many more.

To speed up this process, YARA tries to avoid running regular expressions over the whole file. Instead, at rule compilation time, it looks at all of the defined strings and regular expressions and extracts out of them a collection of atoms.

An atom is a short string. For example, for the expression

    $flash_magic = /[FCZ]WS/

YARA might extract out the atoms “FWS”, “CWS”, and “ZWS”. The set of all atoms for a given rule set is then fed into the Aho-Corasick algorithm, a fast string-finding algorithm.

(The “Aho” in “Aho-Corasick” refers to Alfred Aho who is also the “A” in the AWK programming language.)

The Aho-Corasick algorithm is run and the offsets of each atom are recorded. Thanks to this, the various regular expressions don’t need to be run over the whole file: any spot where the atoms couldn’t possibly line up with the expression is eliminated.

This can result in enormous speedups, but it has the major downside that it requires a lot of preprocessing to find all of the atoms in the file. For a large file with a lot of atom matches (indicating a lot of potential regular expression or string matches), this preprocessing time can be large.
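To make the idea of atoms concrete, here is a toy Python sketch of the two steps described above: expanding a small pattern into its atoms, then recording every offset at which an atom occurs. It is not YARA's atom extraction, and the scan is a naive loop standing in for Aho-Corasick. The pattern representation (a list of byte alternatives per position) and both function names are assumptions made for this example.

```python
from itertools import product


def expand_atoms(parts):
    """Expand a pattern given as a list of byte alternatives per position.

    [b"FCZ", b"W", b"S"] stands for the regular expression /[FCZ]WS/.
    """
    # Iterating a bytes object yields ints, so each combination is a
    # tuple of ints that bytes() turns back into a short byte string.
    return [bytes(combo) for combo in product(*parts)]


def atom_offsets(data, atoms):
    """Record every (offset, atom) pair found in data, sorted by offset.

    A real implementation would use Aho-Corasick to find all atoms in a
    single pass; this per-atom scan only produces the same kind of result.
    """
    hits = []
    for atom in atoms:
        start = data.find(atom)
        while start != -1:
            hits.append((start, atom))
            start = data.find(atom, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    atoms = expand_atoms([b"FCZ", b"W", b"S"])
    print(atoms)
    print(atom_offsets(b"CWS..FWS.ZWS", atoms))
```

The list of hits grows with the number of atom occurrences in the file, which is where the preprocessing cost on the large file comes from.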

Eliminate the Strings

To speed things up, we could try eliminating the strings/regular expression. The problem is, how do you match the regular expression ^[FCZ]WS when you can’t look for strings?

YARA has a collection of built-in “integer functions” that can read integers of various sizes and orderings from a given offset in the file. For example:

    uint16be(0x72) == 0x3829

would read a 16-bit big-endian integer from offset 0x72 in the file and see if it’s equal to 0x3829. Similar functions exist for single bytes and 16- and 32-bit integers in both big- and little-endian formats.
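For readers more used to general-purpose languages, these integer functions behave much like fixed-width reads with Python's `struct` module. The sketch below mirrors a few of them on an in-memory buffer; the Python function names are chosen to match the YARA ones but are otherwise just local helpers. One difference to keep in mind: this sketch raises `struct.error` on an out-of-range offset, and how YARA treats such reads is not covered here.

```python
import struct


def uint8(data: bytes, offset: int) -> int:
    """Read one unsigned byte at offset."""
    return struct.unpack_from("B", data, offset)[0]


def uint16be(data: bytes, offset: int) -> int:
    """Read an unsigned 16-bit big-endian integer at offset."""
    return struct.unpack_from(">H", data, offset)[0]


def uint16le(data: bytes, offset: int) -> int:
    """Read an unsigned 16-bit little-endian integer at offset."""
    return struct.unpack_from("<H", data, offset)[0]


def uint32be(data: bytes, offset: int) -> int:
    """Read an unsigned 32-bit big-endian integer at offset."""
    return struct.unpack_from(">I", data, offset)[0]


if __name__ == "__main__":
    header = b"CWS\x0a"
    # Big-endian keeps the bytes in file order: 'C' 'W' -> 0x4357.
    print(hex(uint16be(header, 0)))
    # Little-endian reverses them: 'C' 'W' -> 0x5743.
    print(hex(uint16le(header, 0)))
    print(hex(uint8(header, 2)))
```

Big-endian reads are the convenient choice for magic strings because the hex constant reads in the same order as the bytes in the file.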

Given this, we can transform our regular expression into a sequence of these calls. To not keep you all in suspense, here’s what that would look like:

rule Flash
{
    condition:
        /* 'CWS' = '43 57 53' */
        (uint16be(0x0) == 0x4357 and uint8(0x2) == 0x53)
        or
        /* 'FWS' = '46 57 53' */
        (uint16be(0x0) == 0x4657 and uint8(0x2) == 0x53)
        or
        /* 'ZWS' = '5a 57 53' */
        (uint16be(0x0) == 0x5a57 and uint8(0x2) == 0x53)
}

Notice that this rule has no strings at all. Let’s try running it:

yara -f test4.rule maliciousfile : 15.665s

About 34% faster than the static-string version (23.726s down to 15.665s)! Not bad at all.

Notice how we used a combination of 16-bit and 8-bit calls above, to handle the fact that our strings were three bytes long. We use that sort of trick often to match strings of arbitrary length.
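Writing these comparisons by hand gets tedious for longer magic strings, so the splitting can be automated. The following Python sketch is one possible way to do it, under the assumption that greedily taking the widest read first (32-bit, then 16-bit, then 8-bit) is acceptable; the function name `magic_to_condition` and the example magic `MAGIC!!` are hypothetical.

```python
def magic_to_condition(magic: bytes, offset: int = 0) -> str:
    """Build a YARA condition fragment that matches magic at offset."""
    if not magic:
        raise ValueError("magic must not be empty")
    # (width in bytes, YARA integer function), widest first.
    readers = ((4, "uint32be"), (2, "uint16be"), (1, "uint8"))
    clauses = []
    pos = 0
    while pos < len(magic):
        for width, func in readers:
            if len(magic) - pos >= width:
                value = int.from_bytes(magic[pos:pos + width], "big")
                digits = 2 * width  # two hex digits per byte
                clauses.append(
                    f"{func}({hex(offset + pos)}) == 0x{value:0{digits}x}"
                )
                pos += width
                break
    return " and ".join(clauses)


if __name__ == "__main__":
    # Three bytes split into a 16-bit read plus an 8-bit read.
    print(magic_to_condition(b"CWS"))
    # Seven bytes split into 32-bit + 16-bit + 8-bit reads.
    print(magic_to_condition(b"MAGIC!!"))
```

For `b"CWS"` this produces the same clause as the hand-written rule above; the output is meant to be pasted into a rule's condition, possibly joined with `or` for several alternative magics.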

Can We Go Faster?

We’ve gone from failing to process the file at all to about 30 seconds, to about 24 seconds, to under 16 seconds.

Before we see if we can go any further, it might be useful to see what the lower limit actually is. To do that, we can run a rule that does nothing against the file:

rule NullRule
{
    condition:
        false
}

Let’s run it against the file and see how long it takes:

yara -f test5.rule maliciousfile : 14.476s

Running a rule that does nothing against our file takes just under 1.2 seconds less time than our fastest rule. That fourteen seconds is the time it takes to launch YARA, parse and compile the rule, load the target file, and do whatever preprocessing is necessary.
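Subtracting that baseline from each earlier measurement gives a rough idea of how much time each rule itself cost. The Python sketch below does the arithmetic using only the timings reported in this post; these are single runs rather than a rigorous benchmark, so the results are indicative at best. The names `TIMINGS`, `BASELINE`, and `cost_above_baseline` are made up for this example.

```python
# Wall-clock timings reported in this post for the 1GB file, in seconds.
TIMINGS = {
    "anchored regex": 29.821,
    "static strings": 23.726,
    "stringless": 15.665,
}
BASELINE = 14.476  # the do-nothing NullRule


def cost_above_baseline(total: float, baseline: float = BASELINE) -> float:
    """Seconds attributable to the rule rather than to fixed overhead."""
    return round(total - baseline, 3)


if __name__ == "__main__":
    for name, total in TIMINGS.items():
        print(f"{name}: {cost_above_baseline(total)}s above baseline")
```

Seen this way, the stringless rule adds a little over a second on top of the fixed cost, while the two string-based rules add roughly nine and fifteen seconds.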

We could probably shave off a few milliseconds by reordering the clauses in our condition, but I don’t think it would buy us much. I think we’ve truly gotten this rule as fast as it will go.

Conclusion

YARA is pretty fast already, especially given how extensive its abilities are. However, how you write your rules can have a real bearing on how fast they run, and sometimes doing things in the less-obvious way can result in some real speedups.

