Here at InQuest, YARA is among the many tools we use to perform deep-file
inspection, with a fairly extensive rule set. InQuest operates at line
speed in very high-traffic networks, so these rules need to be fast.
This blog post is the first in a series discussing YARA performance notes,
tips, and hacks.
[YARA](https://virustotal.github.io/yara/) bills itself as the "Pattern Matching Swiss Knife". (It used to be the
"Pattern Matching Swiss Army Knife" but apparently "Swiss Army Knife" is a trademark of Victorinox AG. Who knew?)
It's used to determine if a given input (often a file, but it can attach
to a running process and analyze its memory too) matches any defined rules.
YARA rules consist of three parts, two of them optional:
- Some rule metadata, which is just a mapping of strings to values.
These name-value pairs have no effect on the rule itself, but are useful
for conveying additional information when a rule matches. This section
is optional.
- Some "strings". I put it in quotes because they're not limited to static
strings; regular expressions are also permitted. This section is likewise
optional.
- A condition. This is exactly one boolean expression of arbitrary complexity.
If it evaluates to true for a given file, the rule fires. This section
is required.
There are a few other things that can happen in rules, like tags, but they are beyond the scope of this blog post.
The condition is where the magic is all tied together. It can check for
matches of any strings/regular expressions defined in the rule, check
against the values of some provided external variables, call external
functions, and even run loops.
## The Example Rule
To use a simple example, let's write a rule that detects if the file we're
looking at is an Adobe Flash file.
Adobe Flash files begin with one of three magic strings: "FWS", "CWS",
and "ZWS". The three magic strings differentiate Flash files based on
the compression mechanism used for the data they contain.
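
Before writing any YARA, it may help to see the check itself in plain Python. This is only a sketch of what the rule has to decide, not part of YARA. The names `FLASH_MAGICS`, `flash_compression`, and `is_flash` are hypothetical, and the compression labels are my reading of the SWF format rather than something stated above.

```python
# Magic string -> compression scheme for the data that follows
# (labels are an assumption based on the SWF format, not on this post).
FLASH_MAGICS = {
    b"FWS": "uncompressed",
    b"CWS": "zlib",
    b"ZWS": "LZMA",
}


def flash_compression(data: bytes):
    """Return the compression label if data starts with a Flash magic, else None."""
    return FLASH_MAGICS.get(data[:3])


def is_flash(data: bytes) -> bool:
    """True only when one of the three magic strings sits at offset 0."""
    return flash_compression(data) is not None
```

Note that the magic only counts at offset 0: `is_flash(b"..FWS")` is false even though the bytes "FWS" appear in the data.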
## A First Attempt
This is the "obvious" way to write this in YARA:

    rule Flash {
        strings: $flash_magic = /[FCZ]WS/
        condition: $flash_magic at 0
    }

This defines a single regular expression (named `$flash_magic`), case
sensitive, that will match the three strings we defined above. We then
say the rule `Flash` will match if `$flash_magic` has a match in the
file at offset 0 (that is, the beginning of the file).
Let's see how fast this is. I'll run it on a Flash file in the InQuest
yara -f test1.rule testfile : 0.006s
0.006s is really pretty good, right? But let's try running it on a much
larger (1GB), more malicious file:
yara -f test1.rule maliciousfile : 0.838s
error scanning maliciousfile: string "$flash_magic" in rule "Flash" caused too many matches
Hm. That's no good. We couldn't run the rule on the file at all.
## Too Many Matches
What's going on here is that YARA first finds all the matches for a regular
expression in the file and *then* checks the rule conditions to see if
they are true. The malicious file has a huge number of strings that end
up containing the characters "CWS", "FWS", and/or "ZWS".
Obviously, we don't want to fail to analyze the file, so we need to try
a few alternatives.
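
To make the "find everything first, filter later" behaviour concrete, here is a small Python sketch using the standard `re` module. It is not how YARA is implemented; the function name `find_all_then_filter` and the synthetic buffer are made up for illustration. The point is only that the number of recorded matches grows with the file, even though the condition cares about a single offset.

```python
import re

FLASH_RE = re.compile(rb"[FCZ]WS")


def find_all_then_filter(data: bytes):
    """Collect every match offset first, then evaluate the 'at 0' condition."""
    offsets = [m.start() for m in FLASH_RE.finditer(data)]
    return len(offsets), 0 in offsets


# Synthetic buffer: one real header followed by many incidental hits.
data = b"CWS" + b"..FWS..ZWS" * 1000
total, fires = find_all_then_filter(data)
print(total, fires)  # prints: 2001 True
```

Only one of those 2001 matches matters to the condition. Scale the buffer up to a gigabyte of hostile data and the bookkeeping for the other matches is what gets in the way.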
## Don't Match So Much
Let's try modifying the regular expression so that it won't match so much.
Since the magic bytes are always at the beginning of the file, let's anchor
the regular expression:
$flash_magic = /^[FCZ]WS/
How well does this run?
yara -f test2.rule maliciousfile : 29.821s
Success! YARA completed the analysis of the file. This makes sense,
because the regular expression now doesn't have too many matches: it can
only match at the beginning of the file.
However, it took about thirty seconds to run. Can we get things faster?
## Would Static Strings Be Better?
Let's try changing the regular expression to three string matches and use
YARA's condition to tie them together:

    rule Flash {
        strings:
            $flash_magic1 = "FWS"
            $flash_magic2 = "CWS"
            $flash_magic3 = "ZWS"
        condition:
            $flash_magic1 at 0 or $flash_magic2 at 0 or $flash_magic3 at 0
    }

Let's try running this rule on the malicious file now:
yara -f test3.rule maliciousfile : 23.726s
Pretty good. We sped up by around six seconds.
Now, the question is, can we go faster? It turns out, we can...but why
it works will take some explaining.
## Strings and Atoms
YARA tries very hard to make string and regular expression matching
very fast. It operates under the assumption that a file is going to have
hundreds or even thousands of rules run against it, and each of those
rules is usually going to contain one if not many more strings.
To speed up this process, YARA tries to avoid running regular expressions
over the whole file. Instead, at rule compilation time, it looks at all
of the defined strings and regular expressions and extracts out of them
a collection of *atoms*.
An atom is a short string. For example, for the expression
$flash_magic = /[FCZ]WS/
YARA might extract out the atoms "FWS", "CWS", and "ZWS". The set
of all atoms for a given rule set is then fed into the Aho-Corasick
algorithm, a fast string-finding algorithm.
(The "Aho" in "Aho-Corasick" refers to Alfred Aho who is also the "A"
in the AWK programming language.)
The Aho-Corasick algorithm is run and the offsets of each atom are recorded.
Thanks to this, the various regular expressions don't need to be run over
the whole file: any spot where the atoms couldn't possibly line up with
the expression is eliminated.
This can result in enormous speedups, but it has the major downside that
it requires a lot of preprocessing to find all of the atoms in the file.
For a large file with a lot of atom matches (indicating a lot of potential
regular expression or string matches), this preprocessing time can be large.
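
The two-phase idea (find atoms, then run the expensive expression only where an atom was seen) can be sketched in a few lines of Python. This is an illustration under loud assumptions: a naive `bytes.find` loop stands in for Aho-Corasick, the atom set is written by hand rather than extracted, and `PATTERN` is a hypothetical longer expression invented for the example.

```python
import re


def atom_offsets(data: bytes, atoms):
    """Record every offset where any atom occurs (naive scan, NOT Aho-Corasick)."""
    hits = []
    for atom in atoms:
        pos = data.find(atom)
        while pos != -1:
            hits.append(pos)
            pos = data.find(atom, pos + 1)
    return sorted(hits)


def prefiltered_matches(data: bytes, atoms, pattern):
    """Run the full regex only at offsets where an atom was seen."""
    return [off for off in atom_offsets(data, atoms) if pattern.match(data, off)]


ATOMS = (b"FWS", b"CWS", b"ZWS")
# Hypothetical expression: a magic string followed by a byte below 0x30.
PATTERN = re.compile(rb"[FCZ]WS[\x00-\x2f]")

data = b"CWS\x0a....FWSx....ZWS\x20"
print(atom_offsets(data, ATOMS))                  # prints: [0, 8, 16]
print(prefiltered_matches(data, ATOMS, PATTERN))  # prints: [0, 16]
```

The regex is tried at three offsets instead of at every byte. The flip side is visible too: every atom hit has to be found and recorded, so a file stuffed with atoms makes the first phase expensive.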
## Eliminate the Strings
To speed things up, we could try eliminating the strings/regular expression.
The problem is, how do you match the regular expression `^[FCZ]WS` when
you can't look for strings?
YARA has a collection of built-in "integer functions" that can read integers
of various sizes and orderings from a given offset in the file. For example:
uint16be(0x72) == 0x3829
would read a 16-bit big-endian integer from offset `0x72` in the file and
see if it's equal to `0x3829`. Similar functions exist for single bytes
and 16- and 32-bit integers in both big- and little-endian formats.
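
If the integer functions are unfamiliar, this Python sketch mimics a few of them with `int.from_bytes`. It is an emulation for illustration, not YARA's code. The helper `_read` and the `data` argument are my additions (in a rule, the functions read from the scanned file implicitly), and returning `None` for an out-of-range read is a stand-in for YARA's undefined value.

```python
def _read(data: bytes, offset: int, size: int, byteorder: str):
    """Read an unsigned integer, or None if the read falls outside the data."""
    if offset < 0 or offset + size > len(data):
        return None  # stand-in for YARA's "undefined"
    return int.from_bytes(data[offset:offset + size], byteorder)


def uint8(data, offset):
    return _read(data, offset, 1, "big")


def uint16(data, offset):
    # In YARA, the plain uintXX functions are little-endian.
    return _read(data, offset, 2, "little")


def uint16be(data, offset):
    return _read(data, offset, 2, "big")


def uint32be(data, offset):
    return _read(data, offset, 4, "big")


header = b"CWS\x0a"
print(hex(uint16be(header, 0)))  # prints: 0x4357
print(hex(uint16(header, 0)))    # prints: 0x5743
print(hex(uint8(header, 2)))     # prints: 0x53
```

The byte order matters: the same two bytes read as `0x4357` big-endian and `0x5743` little-endian, so picking the `be` variant keeps the constant in the same order as the bytes in the file.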
Given this, we can transform our regular expression into a sequence of these
calls. To not keep you all in suspense, here's what that would look like:

    rule Flash {
        condition:
            /* 'CWS' = '43 57 53' */
            (uint16be(0x0) == 0x4357 and uint8(0x2) == 0x53) or
            /* 'FWS' = '46 57 53' */
            (uint16be(0x0) == 0x4657 and uint8(0x2) == 0x53) or
            /* 'ZWS' = '5a 57 53' */
            (uint16be(0x0) == 0x5a57 and uint8(0x2) == 0x53)
    }

Notice that this rule has no strings at all. Let's try running it:
yara -f test4.rule maliciousfile : 15.665s
That cuts about a third off the run time of the static-string version. Not bad at all.
Notice how we used a combination of 16-bit and 8-bit calls above, to
handle the fact that our strings were three bytes long. We use that sort
of trick often to match strings of arbitrary length.
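
Writing these clauses by hand gets tedious for longer magic strings, so here is a sketch of a small Python helper that generates them. It is hypothetical tooling, not something shipped with YARA: the name `magic_to_condition` and the greedy 4/2/1-byte split are my own choices, and other splits would work just as well.

```python
def magic_to_condition(magic: bytes, base: int = 0) -> str:
    """Turn a byte string into a YARA condition clause built from integer reads."""
    if not magic:
        raise ValueError("magic must not be empty")
    clauses = []
    pos = 0
    while pos < len(magic):
        remaining = len(magic) - pos
        # Greedily take the widest big-endian read that still fits.
        if remaining >= 4:
            size, func = 4, "uint32be"
        elif remaining >= 2:
            size, func = 2, "uint16be"
        else:
            size, func = 1, "uint8"
        value = int.from_bytes(magic[pos:pos + size], "big")
        clauses.append(f"{func}(0x{base + pos:x}) == 0x{value:0{size * 2}x}")
        pos += size
    return "(" + " and ".join(clauses) + ")"


print(magic_to_condition(b"CWS"))
# prints: (uint16be(0x0) == 0x4357 and uint8(0x2) == 0x53)
```

Joining the output for `b"CWS"`, `b"FWS"`, and `b"ZWS"` with `or` reproduces the clauses used in the rule above.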
## Can We Go Faster?
We've gone from failing to process the file at all, to about 30 seconds,
to about 24 seconds, to under 16 seconds.
Before we see if we can go any further, it might be useful to see what the
lower limit actually is. To do that, we can run a rule that does nothing
against the file:
Let's run it against the file and see how long it takes:
yara -f test5.rule maliciousfile : 14.476s
Running a rule that does nothing against our file takes just under 1.2
seconds less time than our fastest rule. Those fourteen and a half seconds
are the time it takes to launch YARA, parse and compile the rule, load the
target file, and do whatever preprocessing is necessary.
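
Subtracting that baseline from each run gives a rough idea of what each rule itself costs. The arithmetic below uses only the timings quoted in this post; the variable names are mine, and treating the do-nothing run as a fixed overhead is a simplifying assumption.

```python
# Wall-clock seconds on the 1GB malicious file, as measured above.
timings = {
    "test2.rule (anchored regex)": 29.821,
    "test3.rule (static strings)": 23.726,
    "test4.rule (integer functions)": 15.665,
}
baseline = 14.476  # test5.rule, the rule that does nothing


def rule_cost(total: float, overhead: float) -> float:
    """Seconds attributable to the rule, assuming the overhead is constant."""
    return round(total - overhead, 3)


costs = {name: rule_cost(seconds, baseline) for name, seconds in timings.items()}
for name, cost in costs.items():
    print(f"{name}: {cost:.3f}s above baseline")
```

Seen this way, the rule's own share drops from roughly fifteen seconds to a little over one.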
We could probably shave off a few milliseconds by reordering the clauses
in our condition, but I don't think it would buy us much. I think we've
truly gotten this rule as fast as it will go.
YARA is pretty fast already, especially given how extensive its abilities are.
However, how you write your rules can have a real bearing on how fast they run,
and sometimes doing things in the less-obvious way can result in some real