Blog

An Introduction to Deep File Inspection® (DFI)

infographic charts analysis of corporate laptop antivirus or phishing malware attach detection and cleanup software or finding and searching database files on screen with magnifier lens as wide banner

What is Deep File Inspection?

Deep File Inspection®, or DFI®, is the reassembly of packets captured off of the wire into application-level content that is then reconstructed, unraveled, and dissected (decompressed, decoded, decrypted, deobfuscated) in an automated fashion. This enables our applied analytics to better determine the intent by examining the file contents (containers, objects, etc.) as an artifact.

Deep File Inspection is the in-depth analysis of files, decomposing them to their most basic components for individual inspection, and analyzing them with a variety of analytical techniques. DFI is capable of scanning files from multiple sources, most commonly to monitor email and web pathways for files being delivered to endpoints and user inboxes. When monitoring network streams, application-level content is extracted and then reconstructed, unraveled, and dissected (decompressed, decoded, decrypted, deobfuscated) in an automated fashion. This allows heuristic analysis to better determine the intent by analysis of the file contents (containers, objects, etc.) as an artifact.

Why is Deep File Inspection Necessary?

Adversaries, whether strategically focused nation-state groups or financially motivated criminal gangs, understand an important fact about the modern enterprise target; the use of complex, evasive file formats in attack chains are capable of bypassing defenses and achieving an initial foothold. Complex, file-based tradecraft is a common element in countless attacker playbooks.

Given a document, executable binary, or other file types, there are a variety of approaches an analyst can take to reverse engineer it. On a real-world network, the volume of traffic and the sheer quantity and complexity of malicious content requires a solution that is both robust and scalable. InQuest’s Deep Packet Inspection (DPI) is designed for this challenge, evaluating, classifying, and scoring individual network sessions and derived files in a way that is both in-depth as well as highly performant. Many network monitoring solutions may identify or extract files from email or network streams, but InQuest’s DFI performs the heavy lifting of file analysis. Sophisticated threat actors frequently disguise the origination of malicious content in files by packing binaries, embedding payloads inside of documents, or utilizing obscure file types in unexpected ways, abusing them as malware carriers. A robust file analyzer must break these files down and peer inside them to identify and counter attacker techniques to evade detection. DFI recurses through the inner layers of complex file formats, enabling analyzers to find and expose hidden attack payloads.

Sandbox analysis or emulation can provide a behavioral picture of a sample’s effect on a system, and attempt to identify and classify malware this way. The disadvantage, in this case, is that it works slowly, and relies on its implementation to be susceptible to the attempted attack. Modern malware possesses a wide array of methods for anti-sandbox and anti-virtualization to ultimately evade these types of dynamic analysis systems.

Deep File Inspection as a technique is higher level than Deep Packet Inspection, where individual packets or connections are judged to be benign or malicious, but faster than dynamic analysis approaches that attempt to classify a sample by its effect on a system. This speed advantage allows DFI to be deployed in scenarios where network volume demands quick triage of traffic. A file can be evaluated as benign, suspicious, or malicious in seconds rather than minutes. Additionally, file evaluation by way of DFI can be done with a high degree of parallelism. Sparing cycles of expensive integrations is achieved through the application of a waterfall method ordered from most to least expedient analysis technique. By characterizing the files themselves, we can index which features of the specification, and thus the client, will be used. This can help isolate unusual functionality that may be suspicious, or attempts to activate vulnerable sections of a program’s code in the case of software vulnerabilities.

Example: DFI Unravel

To illustrate this, we’ll examine a piece of malware seen in the wild relating to a vulnerability in Microsoft Office’s Equation Editor. Further technical write-ups about samples from this campaign are linked below. First, we use rtfdump to get the list of OLE objects within the file.

rtfdump -d cb3429e608144909ef25df2605c24ec253b10b6e99cbb6657afa6b92e9f32fb5 | grep -i "object"

Next, we can use rtfobj to dump them all for closer inspection. The first OLE object contains a certain OLE CLSID:

\objemb{\*\oleclsid \'7bD5DE8D20-5BB8-11D1-A1E3-00A0C90F2731\'7d}

Looking this up via the Windows registry shows msvbvm60.dll. This DLL, among others, is notable in that in many instances it will be loaded at a fixed memory address. This is desirable to an attacker writing an exploit wishing to bypass ASLR, as it provides reliable addresses on which to base an ROP chain.

Moving on to dump the other two OLE objects, we find compressed Word documents. Once we extract these, we find the heart of the malicious document in the document.xml: it attempts to exploit CVE-2017-11826, a vulnerability in the OOXML parser. The malformed tags and font name are designed to return to a known address in the above DLL, and pivot to the attacker’s stage 2.

<w:body >
        <w:shapeDefaults >
                <o:OLEObject >
                        <w:font w:name="LincerCharChar裬࢈font:batang"><o:idmap/>
                </o:OLEObject>
        </w:shapeDefaults>
</w:body>

What can we discern from these documents to detect and prevent them? First, the documents all contain several layers, as is increasingly common in malware campaigns. In an attempt to avoid AV detection or sandbox analysis, attackers will embed exploits within archives, documents, OLE objects, or combinations of each. The key points here are that the outer layers of container file formats need to be recursively peeled away, and the constructs that comprise the exploit itself exposed: the DLL that indicates an attempt at ASLR bypass, and the malformed XML tags that present a telltale indicator for CVE-2017-11826. Linked below are public YARA signatures that run on the output of our proprietary post-processing stack (or on the manual efforts of an analyst) to detect documents containing this exploit.

Example: DFI Unravel 2.0

To illustrate this, we’ll examine a malware sample seen in the wild relating to a vulnerability in recent versions of Microsoft Office. Further technical write-ups about samples from this campaign will be linked at the bottom of this page. First, we seek to understand the structure of the outermost file. Originally named Overview_of_UWCs_UkraineInNATO_campaign.docx, we can see that it is an Office Open XML document file. These document types are packaged as ZIP archives:

a61b2eafcf39715031357df6b01e85e0d1ea2e8ee1dfec241b114e18f7a1163f: Microsoft Word 2007+

Archive:  a61b2eafcf39715031357df6b01e85e0d1ea2e8ee1dfec241b114e18f7a1163f
Zip file size: 120614 bytes, number of entries: 14
-rw---- 	4.5 fat  	590 b- defS 80-Jan-01 00:00 _rels/.rels
-rw---- 	4.5 fat	22924 b- defS 80-Jan-01 00:00 word/document.xml
-rw---- 	4.5 fat 	1216 b- defS 80-Jan-01 00:00 word/_rels/document.xml.rels
-rw---- 	4.5 fat	96252 b- stor 80-Jan-01 00:00 word/media/image1.png
-rw---- 	4.5 fat 	7643 b- defS 80-Jan-01 00:00 word/theme/theme1.xml
-rw---- 	4.5 fat 	3108 b- defS 80-Jan-01 00:00 word/settings.xml
-rw---- 	4.5 fat 	5344 b- defS 80-Jan-01 00:00 word/numbering.xml
-rw---- 	4.5 fat	32213 b- defS 80-Jan-01 00:00 word/styles.xml
-rw---- 	4.5 fat  	843 b- defS 80-Jan-01 00:00 word/webSettings.xml
-rw---- 	4.5 fat 	2329 b- defS 80-Jan-01 00:00 word/fontTable.xml
-rw---- 	4.5 fat  	604 b- defS 80-Jan-01 00:00 docProps/core.xml
-rw---- 	4.5 fat  	988 b- defS 80-Jan-01 00:00 docProps/app.xml
-rw---- 	2.0 fat	44146 b- defN 23-Jun-30 05:21 word/afchunk.rtf
-rw---- 	2.0 fat 	1548 b- defN 23-Jun-30 05:21 [Content_Types].xml

Within the contents of the archive, we note the file /word/_rels/document.xml.rels, which contains document relation metadata. In this metadata file, a reference to a file stands out:

<?xml version="1.0" encoding="utf-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml" Id="rId3"/>
  <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml" Id="rId7"/>
  <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml" Id="rId2"/>
  <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml" Id="rId1"/>
  <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml" Id="rId6"/>
  <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png" Id="rId5"/>
  <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml" Id="rId4"/>
  <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Target="/word/afchunk.rtf" Id="AltChunkId5"/>
</Relationships>

This altChunk reference is documented in the specification, indicating that the afchunk.rtf file is external content included using the relationship type defined in Alternative Format Import Part; this specifies inclusion of the RTF file in the OOXML document, as the standard states that Word supports the application/rtf content type. We can see now that the DOCX file contains an embedded payload in the form of a Rich Text Format (RTF) document. We can use rtfdump to get the list of objects within the file, revealing multiple objects that are of interest (truncated for brevity):

$ rtfdump -d afchunk.rtf
...
147	Level  4 	... 0 \*\objdata
    Name: b'Word.Document.8\x00' Size: 0
    md5: d41d8cd98f00b204e9800998ecf8427e
152	Level  4 	... 0 \*\objdata
    Name: b'OLE2LINK\x00' Size: 3584
    md5: ed315c3b36a83206dfd1bba013b91575
    magic: d0cf11e0
161    Level  2   	... 0 \*\datastore
    Name: b'Msxml2.SAXXMLReader.5.0\x00' Size: 1536
    md5: ae2afc3652ddaffe79bc53bb63ea9ccf
    magic: d0cf11e0

Looking at object 147, suspicious content stands out:

$ rtfdump.py -s 147 -H afchunk.rtf | head
00000000: 01 05 00 00 01 00 00 00  10 00 00 00 57 6F 72 64  ............Word
00000010: 2E 44 6F 63 75 6D 65 6E  74 2E 38 00 2F 00 00 00  .Document.8./...
00000020: 5C 5C 31 30 34 2E 32 33  34 2E 32 33 39 2E 32 36  \\104.234.239.26
00000030: 5C 73 68 61 72 65 31 5C  4D 53 48 54 4D 4C 5F 43  \share1\MSHTML_C
00000040: 37 5C 66 69 6C 65 30 30  31 2E 75 72 6C 00 00 00  7\file001.url...
00000050: 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 01  ................
00000060: 05 00 00 05 00 00 00 10  00 00 00 57 6F 72 64 2E  ...........Word.
00000070: 44 6F 63 75 6D 65 6E 74  2E 38 00 0E 00 00 00 D2  Document.8......
00000080: 0A 00 00 01 00 09 00 00  03 69 05 00 00 03 00 66  .........i.....f
00000090: 04 00 00 00 00 66 04 00  00 26 06 0F 00 C2 08 57  .....f...&.....W

This object reveals a UNC share path referencing a remote internet protocol address, indicating that content may be pulled from an untrusted host:

\\104.234.239[.]26\share1\MSHTML_C7\file001.url

Looking at object 152, additional breadcrumbs appear:

00000870: 4B A9 0B 5A 01 00 00 68  00 74 00 74 00 70 00 3A  K..Z...h.t.t.p.:
00000880: 00 2F 00 2F 00 37 00 34  00 2E 00 35 00 30 00 2E  ././.7.4...5.0..
00000890: 00 39 00 34 00 2E 00 31  00 35 00 36 00 2F 00 4D  .9.4...1.5.6./.M
000008A0: 00 53 00 48 00 54 00 4D  00 4C 00 5F 00 43 00 37  .S.H.T.M.L._.C.7
000008B0: 00 2F 00 73 00 74 00 61  00 72 00 74 00 2E 00 78  ./.s.t.a.r.t...x
000008C0: 00 6D 00 6C 00 00 00 00  00 00 00 00 00 00 00 00  .m.l............

In this file, an HTTP URL is observed, referencing additional remote content:

hXXp://74.50.94[.]156/MSHTML_C7/start.xml

We also note the presence in the document of CLSID 00000300-0000-0000-C000-000000000046 – StdOleLink),  an embedded OLE object known to be abused in CVE-2017-0199, CVE-2017-8570, CVE-2017-8759 or CVE-2018-8174, and apparently now CVE-2023-36884. This likely explains why several early antivirus detections on the files suggested relation to older CVEs.

It should be emphasized here as well that the content in the embedded objects is stored in a hexadecimal encoded format; identification of the inner contents requires hexadecimal decoding of the object data. This is shown in the image of the raw file content:

Figure 1 – embedded RTF content is itself encoded for native obfuscation

What can we discern from these documents in order to detect and prevent them? First, the documents all contain several layers, which remains a common trait of evasion in malware campaigns. In an attempt to avoid AV detection or sandbox analysis, attackers embed exploits within archives, documents, OLE objects, or combinations of each. The key points here are that the outer layers of container file formats need to be recursively peeled away, and the constructs that comprise the exploit itself exposed: embedded RTF objects, nested malformed objects and OLE objects that present telltale indicators for CVE-2023-36884.

Further Examples

There are many examples we can provide that illustrate the benefits of Deep File Inspection from a variety of angles. We’ll continue exploring some in future blog posts. An excellent example from recent history can be found in our post “Adobe Flash MediaPlayer DRM Use-After-Free Vulnerability” which details the background, dissection, and detection of a 0day Adobe Flash exploit caught in the wild and later assigned CVE-2018-4878. In this specific case DFI carves a SWF file out of a document carrier then extracts and decompiles the embedded ActionScript content. Our signatures run on the original file, extracted SWF file, and the output of the decompiler. In cases such as Flash, we’re able to generically detect exploitation techniques at the ActionScript layer. Anchoring on TTPs allows us to catch 0day exploits that despite leveraging a capability unknown to the community, still rely on known techniques for memory manipulation and eventual code execution.

Looking further back, we invite you to additionally read our series on Microsoft DDE, starting with “DDE, Macro-less Command Execution Vulnerability”.

Summary

In real-world scenarios, identifying malicious content from a massive volume of traffic at multi-gigabit speeds requires an approach that can scale, but still discern the intent of a file based on what it attempts to do. Deep File Inspection is a technique that provides analysts with a quick way of extracting malicious embedded content and alerting on it, as well as performing threat hunting to triage suspicious content for more detailed analysis. This can cover everything from the presence of LNK objects in a document, to 0-day exploits, or any combination of similar attributes.

References