Short-Circuiting Boolean Operators in YARA

Posted on 2018-12-18 by Rob King
Here at InQuest, YARA is among the many tools we use to perform deep-file inspection, with a fairly extensive rule set. InQuest operates at line speed in very high-traffic networks, so these rules need to be fast. This blog post is the second in a series discussing YARA performance notes, tips, and hacks. [YARA]( is designed to match files and blocks of memory against arbitrary patterns. These patterns are expressed via [regular experssions]( and literal strings, along with various [functions provided by modules]( Each of these tests (regular expressions, string matches, function calls, and so on) can succeed or fail for a given input. Whether a rule fires is determined by a single *condition* that is a combination of all of the individual terms that are being tested. This post talks about how changing the order of your terms in a rule's condition can have a significant performance impact. ## The Example Rule To use a simple example, let's write a rule that checks to see if the file in question is a blacklisted executable. Our blacklist is expressed as a list of [SHA-256]( hashes and when we say "executable" in this example, we mean an [ELF]( binary. ## A First Attempt This is the "obvious" way to write this in Yara, given our super-huge two-member blacklist and YARA's *hash* module: ``` import "hash" rule BlacklistedExecutable { condition: hash.sha256(0, filesize) == "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c" or hash.sha256(0, filesize) == "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23" } ``` This will calculate the SHA256 hash of the file and fire the rule if the hash matches one of the blacklisted hashes. Let's run this on the InQuest "good" corpus: ``` $ time yara -r blacklist.yara /data/corpus/good BlacklistedExecutable /data/corpus/good/aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23 real 0m12.192s user 0m33.474s sys 0m4.081s ``` Okay, so around twelve seconds. Not bad, not bad. ## Calculating Hashes is Time-Consuming How much of that time was taken calculating the hash of the file? We can see by creating an empty rule: ``` rule Empty { condition: false } ``` And running against the same set of files: ``` $ time yara -r empty.yara /data/corpus/good real 0m6.118s user 0m15.209s sys 0m3.568s ``` So we can see that about half of our time is spent calculating the hashes, while the other half is spent on unavoidable work like opening and reading file contents; we'll never be faster than six seconds. ## Don't Calculate the Hash Unnecessarily It would be great if we didn't have to calculate the hash of the file if we knew it couldn't possibly be the file in question. That "good" corpus has gigs and gigs of files that cannot possibly match because they aren't executables: they're documents, images, text files, and so on. It would be nice if we could avoid calculating the hash on those files if at all possible. ## Logical Connectives and Short-Circuiting YARA supports three logical connectives: *and*, *or*, and *not*. These have the same meaning that they do in logic: - *a and b* is true only if both *a* and *b* are true - *a or b* is true if either *a* and *b* or true, or both of them are - *not a* is true if *a* is false YARA's version of these operators implements them using a technique known as "short-circuiting". This technique is common in many programming languages (including universally popular ones like Python, C, and Java) that performs a kind of "conditional evaluation". The arguments to *and* and *or* are evaluated from left to right and if the outcome is certain after evaluting the left hand side the right hand side is skipped. For example, if I write ``` false and check_database(1) ``` I know after evaluating the *false* that there's no way that statement can be true. Conversely, if I write ``` true or check_database(1) ``` I know after evaluating the *true* that there's no way that the statement can be false. Many programming languages, YARA included, take advantage of this bit of logical truth and simply don't evaluate the right hand side of a connective if the left hand side is sufficient to determine truth or falsity. (The famous, brilliant, and...let's say scientist Edsger Dijkstra, who taught here in Austin for many years, was famously opposed to short-circuit evaluation. "Because the conditional connectives thus complicate the formal reasoning about programs, they are better avoided," he [wrote in his famous handwriting]( Because of this impediment to formal reasoning about programs, some languages omit the short-circuiting operators entirely and others, like Erlang and Eiffel, provide them but with different names so that you know what you're getting into.) ### Only Calculating Hashes on Binaries Armed with the knowledge of short-circuiting logical connectives in YARA, we can now write a rule that only calculates the hash of a file that we've confirmed is an ELF binary. If the binary check is faster than the hash calculation and we know most of the files we're going to inspect aren't ELF binaries, this should result in a pretty significant speedup. First, let's confirm that the short-circuiting works as we expect it does. Let's write a rule that would calculate the hash if YARA didn't do short-circuiting: ``` import "hash" rule BlacklistedExecutable { condition: false and (hash.sha256(0, filesize) == "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c" or hash.sha256(0, filesize) == "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23") } ``` And run it: ``` $ time yara -r short-circuit.yara /data/corpus/good real 0m6.118s user 0m15.209s sys 0m3.568s ``` Yep. This rule is essentially running at the same speed as our empty rule above. So, now we just need to replace that *false* with something that's only true for ELF binaries. ## Magic and More Magic Let's try writing a rule using the *magic* module that only calculates hashes on ELF binaries: ``` import "magic" import "hash" rule BlacklistedExecutable { condition: magic.mime_type() == "application/x-sharedlib" and (hash.sha256(0, filesize) == "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c" or hash.sha256(0, filesize) == "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23") } ``` (ELF binaries are identified as "application/x-sharedlib" on my system, which is not surprising since ELF is also the shared library format.) ``` $ time yara -r blacklist.yara /data/corpus/good BlacklistedExecutable /data/corpus/good/aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23 real 1m9.621s user 4m0.515s sys 0m8.023s ``` Oh no! We're now taking over a minute, where our first rule took only twelve seconds. (NB. This section's title comes from the story of [Magic and More Magic](, an old hacker legend. Note that the *magic* module doesn't take its name from this story, of course, but rather from the concept of "magic numbers" at the beginning of files that identify what they are.) ## The Integer Trick But wait! We can [use the trick from my last blog post]( to write a quick test for ELF binaries. The magic string for ELF binaries is "\x7fELF"; that is, an ASCII DEL character followed by the letters 'E', 'L', and 'F'. We can convert that to a single 32-bit integer and use YARA's *uint32be* function to inspect the first four bytes of the file very quickly using the trick described in the previous blog post. Let's write a rule this way: ``` import "hash" rule BlacklistedExecutable { condition: uint32be(0) == 0x7f454c46 and (hash.sha256(0, filesize) == "0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c" or hash.sha256(0, filesize) == "aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23") } ``` Note the short-circuiting evaluation again. Let's see how it performs: ``` $ time yara -r blacklist.yara /data/corpus/good BlacklistedExecutable /data/corpus/good/aec283b16bc0916fb028d5cac5b409c24f879c5940ca04fc01ba5f7acb81bc23 real 0m8.575s user 0m23.309s sys 0m3.650s ``` Success! It's around 50% faster than our first test, and only 33% slower than the empty test! ## Conclusion We increased the speed of our YARA rule by 50% simply by taking advantage of our knowledge of how YARA implements logical connectives. In any rule with lots of different connected terms, it makes sense to move the cheap tests to the left side of their connective. ## Coming Soon Tune in next time for a discussion (and release!) of some of the tools we use to generate YARA rules automatically, including a tool that makes the "integer checks" discussed above easy to write.


Get The InQuest Insider

Find us on Twitter for frequent updates, follow our Blog for bi-weekly technical write-ups, or subscribe here to receive our monthly newsletter, The InQuest Insider. We curate and provide you with the latest news stories, field notes about innovative malware, novel research / analysis / threat hunting tools, security tips and more.