Skip to main content

Short-Circuiting Boolean Operators in YARA

Posted on 2018-12-18 by Rob King

Here at InQuest, YARA is among the many tools we use to perform deep-file inspection, with a fairly extensive rule set. InQuest operates at line speed in very high-traffic networks, so these rules need to be fast.

This blog post is the second in a series discussing YARA performance notes, tips, and hacks.

YARA is designed to match files and blocks of memory against arbitrary patterns. These patterns are expressed via regular experssions and literal strings, along with various functions provided by modules.

Each of these tests (regular expressions, string matches, function calls, and so on) can succeed or fail for a given input. Whether a rule fires is determined by a single condition that is a combination of all of the individual terms that are being tested.

This post talks about how changing the order of your terms in a rule's condition can have a significant performance impact.

The Example Rule

To use a simple example, let's write a rule that checks to see if the file in question is a blacklisted executable. Our blacklist is expressed as a list of SHA-256 hashes and when we say "executable" in this example, we mean an ELF binary.

A First Attempt

This is the "obvious" way to write this in Yara, given our super-huge two-member blacklist and YARA's hash module:

import "hash"

rule BlacklistedExecutable
{
  condition:
    hash.sha256(0, filesize) ==
        "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c"
    or
    hash.sha256(0, filesize) ==
        "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23"
}

This will calculate the SHA256 hash of the file and fire the rule if the hash matches one of the blacklisted hashes.

Let's run this on the InQuest "good" corpus:

$ time yara -r blacklist.yara /data/corpus/good
BlacklistedExecutable /data/corpus/good/aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23
real   0m12.192s
user   0m33.474s
sys    0m4.081s

Okay, so around twelve seconds. Not bad, not bad.

Calculating Hashes is Time-Consuming

How much of that time was taken calculating the hash of the file? We can see by creating an empty rule:

rule Empty
{
    condition:
        false
}

And running against the same set of files:

$ time yara -r empty.yara /data/corpus/good
real    0m6.118s
user    0m15.209s
sys     0m3.568s

So we can see that about half of our time is spent calculating the hashes, while the other half is spent on unavoidable work like opening and reading file contents; we'll never be faster than six seconds.

Don't Calculate the Hash Unnecessarily

It would be great if we didn't have to calculate the hash of the file if we knew it couldn't possibly be the file in question.

That "good" corpus has gigs and gigs of files that cannot possibly match because they aren't executables: they're documents, images, text files, and so on. It would be nice if we could avoid calculating the hash on those files if at all possible.

Logical Connectives and Short-Circuiting

YARA supports three logical connectives: and, or, and not. These have the same meaning that they do in logic:

- *a and b* is true only if both *a* and *b* are true
- *a or b* is true if either *a* and *b* or true, or both of them are
- *not a* is true if *a* is false

YARA's version of these operators implements them using a technique known as "short-circuiting". This technique is common in many programming languages (including universally popular ones like Python, C, and Java) that performs a kind of "conditional evaluation". The arguments to and and or are evaluated from left to right and if the outcome is certain after evaluting the left hand side the right hand side is skipped.

For example, if I write

false and check_database(1)

I know after evaluating the false that there's no way that statement can be true. Conversely, if I write

true or check_database(1)

I know after evaluating the true that there's no way that the statement can be false.

Many programming languages, YARA included, take advantage of this bit of logical truth and simply don't evaluate the right hand side of a connective if the left hand side is sufficient to determine truth or falsity.

(The famous, brilliant, and...let's say punctillious...computer scientist Edsger Dijkstra, who taught here in Austin for many years, was famously opposed to short-circuit evaluation. "Because the conditional connectives thus complicate the formal reasoning about programs, they are better avoided," he wrote in his famous handwriting.

Because of this impediment to formal reasoning about programs, some languages omit the short-circuiting operators entirely and others, like Erlang and Eiffel, provide them but with different names so that you know what you're getting into.)

Only Calculating Hashes on Binaries

Armed with the knowledge of short-circuiting logical connectives in YARA, we can now write a rule that only calculates the hash of a file that we've confirmed is an ELF binary. If the binary check is faster than the hash calculation and we know most of the files we're going to inspect aren't ELF binaries, this should result in a pretty significant speedup.

First, let's confirm that the short-circuiting works as we expect it does. Let's write a rule that would calculate the hash if YARA didn't do short-circuiting:

import "hash"

rule BlacklistedExecutable
{
  condition:
   false
   and
   (hash.sha256(0, filesize) ==
        "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c"
   or
   hash.sha256(0, filesize) ==
        "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23")
}

And run it:

$ time yara -r short-circuit.yara /data/corpus/good
real    0m6.118s
user    0m15.209s
sys     0m3.568s

Yep. This rule is essentially running at the same speed as our empty rule above. So, now we just need to replace that false with something that's only true for ELF binaries.

Magic and More Magic

Let's try writing a rule using the magic module that only calculates hashes on ELF binaries:

import "magic"
import "hash"
rule BlacklistedExecutable
{
  condition:
   magic.mime_type() == "application/x-sharedlib"
   and
   (hash.sha256(0, filesize) ==
        "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c"
   or
   hash.sha256(0, filesize) ==
        "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23")
}

(ELF binaries are identified as "application/x-sharedlib" on my system, which is not surprising since ELF is also the shared library format.)

$ time yara -r blacklist.yara /data/corpus/good
BlacklistedExecutable /data/corpus/good/aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23
real    1m9.621s
user    4m0.515s
sys     0m8.023s

Oh no! We're now taking over a minute, where our first rule took only twelve seconds.

(NB. This section's title comes from the story of Magic and More Magic, an old hacker legend. Note that the magic module doesn't take its name from this story, of course, but rather from the concept of "magic numbers" at the beginning of files that identify what they are.)

The Integer Trick

But wait! We can use the trick from my last blog post to write a quick test for ELF binaries.

The magic string for ELF binaries is "\x7fELF"; that is, an ASCII DEL character followed by the letters 'E', 'L', and 'F'. We can convert that to a single 32-bit integer and use YARA's uint32be function to inspect the first four bytes of the file very quickly using the trick described in the previous blog post.

Let's write a rule this way:

import "hash"

rule BlacklistedExecutable
{
  condition:
   uint32be(0) == 0x7f454c46
   and
   (hash.sha256(0, filesize) ==
        "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c"
   or
   hash.sha256(0, filesize) ==
        "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23")
}

Note the short-circuiting evaluation again. Let's see how it performs:

$ time yara -r blacklist.yara /data/corpus/good
BlacklistedExecutable /data/corpus/good/aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23
real    0m8.575s
user    0m23.309s
sys     0m3.650s

Success! It's around 50% faster than our first test, and only 33% slower than the empty test!

Conclusion

We increased the speed of our YARA rule by 50% simply by taking advantage of our knowledge of how YARA implements logical connectives. In any rule with lots of different connected terms, it makes sense to move the cheap tests to the left side of their connective.

Coming Soon

Tune in next time for a discussion (and release!) of some of the tools we use to generate YARA rules automatically, including a tool that makes the "integer checks" discussed above easy to write.

yara