Blog

Ex Machina: A Frolic through the Forests

Here at InQuest, YARA is among the many tools we use to perform deep-file
inspection, with a fairly extensive rule set. InQuest operates at line
speed in very high-traffic networks, so these rules need to be fast.

This blog post is the second in a series discussing YARA performance notes,
tips, and hacks.

YARA is designed to match files and
blocks of memory against arbitrary patterns. These patterns are expressed via
regular experssions and
literal strings, along with various
functions provided by modules.

Each of these tests (regular expressions, string matches, function calls, and so
on) can succeed or fail for a given input. Whether a rule fires is determined by
a single condition that is a combination of all of the individual terms that
are being tested.

This post talks about how changing the order of your terms in a rule’s condition
can have a significant performance impact.

The Example Rule

To use a simple example, let’s write a rule that checks to see if the file
in question is a blacklisted executable. Our blacklist is expressed as a list
of SHA-256 hashes and when we say
“executable” in this example, we mean an
ELF binary.

A First Attempt

This is the “obvious” way to write this in Yara, given our super-huge
two-member blacklist and YARA’s hash module:

import "hash"

rule BlacklistedExecutable
{
  condition:
    hash.sha256(0, filesize) ==
        "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c"
    or
    hash.sha256(0, filesize) ==
        "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23"
}

This will calculate the SHA256 hash of the file and fire the rule if the hash
matches one of the blacklisted hashes.

Let’s run this on the InQuest “good” corpus:

$ time yara -r blacklist.yara /data/corpus/good
BlacklistedExecutable /data/corpus/good/aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23
real   0m12.192s
user   0m33.474s
sys    0m4.081s

Okay, so around twelve seconds. Not bad, not bad.

Calculating Hashes is Time-Consuming

How much of that time was taken calculating the hash of the file?
We can see by creating an empty rule:

rule Empty
{
    condition:
        false
}

And running against the same set of files:

$ time yara -r empty.yara /data/corpus/good
real    0m6.118s
user    0m15.209s
sys     0m3.568s

So we can see that about half of our time is spent calculating the hashes,
while the other half is spent on unavoidable work like opening and reading
file contents; we’ll never be faster than six seconds.

Don’t Calculate the Hash Unnecessarily

It would be great if we didn’t have to calculate the hash of the file if
we knew it couldn’t possibly be the file in question.

That “good” corpus has gigs and gigs of files that cannot possibly match
because they aren’t executables: they’re documents, images, text files, and
so on. It would be nice if we could avoid calculating the hash on those files
if at all possible.

Logical Connectives and Short-Circuiting

YARA supports three logical connectives: andor, and not.
These have the same meaning that they do in logic:

- *a and b* is true only if both *a* and *b* are true
- *a or b* is true if either *a* and *b* or true, or both of them are
- *not a* is true if *a* is false

YARA’s version of these operators implements them using a technique
known as “short-circuiting”. This technique is common in many programming
languages (including universally popular ones like Python, C, and Java) that
performs a kind of “conditional evaluation”. The arguments to and and or
are evaluated from left to right and if the outcome is certain after evaluting
the left hand side the right hand side is skipped.

For example, if I write

false and check_database(1)

I know after evaluating the false that there’s no way that statement can be true.
Conversely, if I write

true or check_database(1)

I know after evaluating the true that there’s no way that the statement can be
false.

Many programming languages, YARA included, take advantage of this bit of logical
truth and simply don’t evaluate the right hand side of a connective if the left
hand side is sufficient to determine truth or falsity.

(The famous, brilliant, and…let’s say punctillious…computer
scientist Edsger Dijkstra, who taught here in Austin for many years,
was famously opposed to short-circuit evaluation. “Because the
conditional connectives thus complicate the formal reasoning
about programs, they are better avoided,” he wrote in his famous
handwriting
.

Because of this impediment to formal reasoning about programs, some
languages omit the short-circuiting operators entirely and others, like
Erlang and Eiffel, provide them but with different names so that you know
what you’re getting into.)

Only Calculating Hashes on Binaries

Armed with the knowledge of short-circuiting logical connectives in YARA, we can now
write a rule that only calculates the hash of a file that we’ve confirmed is an ELF
binary. If the binary check is faster than the hash calculation and we know most of the
files we’re going to inspect aren’t ELF binaries, this should result in a pretty significant speedup.

First, let’s confirm that the short-circuiting works as we expect it does.
Let’s write a rule that would calculate the hash if YARA didn’t do short-circuiting:

import "hash"

rule BlacklistedExecutable
{
  condition:
   false
   and
   (hash.sha256(0, filesize) ==
        "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c"
   or
   hash.sha256(0, filesize) ==
        "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23")
}

And run it:

$ time yara -r short-circuit.yara /data/corpus/good
real    0m6.118s
user    0m15.209s
sys     0m3.568s

Yep. This rule is essentially running at the same speed as our empty rule above.
So, now we just need to replace that false with something that’s only true for ELF binaries.

Magic and More Magic

Let’s try writing a rule using the magic module that only calculates hashes on ELF binaries:

import "magic"
import "hash"
rule BlacklistedExecutable
{
  condition:
   magic.mime_type() == "application/x-sharedlib"
   and
   (hash.sha256(0, filesize) ==
        "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c"
   or
   hash.sha256(0, filesize) ==
        "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23")
}

(ELF binaries are identified as “application/x-sharedlib” on my system, which is not surprising since
ELF is also the shared library format.)

$ time yara -r blacklist.yara /data/corpus/good
BlacklistedExecutable /data/corpus/good/aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23
real    1m9.621s
user    4m0.515s
sys     0m8.023s

Oh no! We’re now taking over a minute, where our first rule took only twelve seconds.

(NB. This section’s title comes from the story of Magic and More
Magic
, an old hacker
legend. Note that the magic module doesn’t take its name from this story,
of course, but rather from the concept of “magic numbers” at the beginning
of files that identify what they are.)

The Integer Trick

But wait! We can use the trick from my last blog post
to write a quick test for ELF binaries.

The magic string for ELF binaries is “x7fELF”; that is, an ASCII DEL character followed by
the letters ‘E’, ‘L’, and ‘F’. We can convert that to a single 32-bit integer and use YARA’s
uint32be function to inspect the first four bytes of the file very quickly using the trick
described in the previous blog post.

Let’s write a rule this way:

import "hash"

rule BlacklistedExecutable
{
  condition:
   uint32be(0) == 0x7f454c46
   and
   (hash.sha256(0, filesize) ==
        "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c"
   or
   hash.sha256(0, filesize) ==
        "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23")
}

Note the short-circuiting evaluation again. Let’s see how it performs:

$ time yara -r blacklist.yara /data/corpus/good
BlacklistedExecutable /data/corpus/good/aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23
real    0m8.575s
user    0m23.309s
sys     0m3.650s

Success! It’s around 50% faster than our first test, and only 33% slower than the
empty test!

Conclusion

We increased the speed of our YARA rule by 50% simply by taking advantage of our knowledge of
how YARA implements logical connectives. In any rule with lots of different connected terms,
it makes sense to move the cheap tests to the left side of their connective.

Coming Soon

Tune in next time for a discussion (and release!) of some of the tools we use to
generate YARA rules automatically, including a tool that makes the “integer checks”
discussed above easy to write.