SOC analysts typically have access to a mix of proprietary, commercial, open source, and personal reputation sources for various indicators of compromise (IOCs). IOCs include file hashes, IP addresses, domain names, SSL certificate fingerprints, and more. Aggregating these feeds into a single source is a prudent first step for manual search and programmatic accessibility. In this article we outline a number of publicly available resources and describe a simple method for aggregating them into a single reputation database. The final product, while not containing the highest fidelity data, can provide a valuable reference for threat hunters. Commercially, we supply InQuest users with a proprietary reputation API, sourced from both manual and automated threat hunting efforts. Over 80% of these artifacts do not overlap with what we're seeing in the public domain.
Talk to any of our security engineers and you'll be sure to hear at some point that we "throw everything and the kitchen sink" at the problem of identifying malicious content. We consider a variety of artifacts such as IPs, URLs, HTTP and SMTP headers, and more. Primarily, our focus is file-centric. Files are carved off the wire, fed through our Deep File Inspection (DFI) stack, and then analyzed for signs of malignant code. The DFI process is recursive in nature and allows us to peel away the compression, obfuscation, encoding, and layering techniques used by attackers to mask their payloads and ultimately evade detection. Our system functions in parallel in a map-reduce-like pattern that results in the output of a single threat score per session, ranging from 0 to 10. The threat scoring algorithm is primarily driven by file analysis, but additional factors can move a score up or down. These include inputs from Multi-AV providers (OPSWAT, VirusTotal), detonation solutions (Cuckoo, Joe Sandbox, VMRay, FireEye, Falcon aka HybridAnalysis), and a multitude of reputation sources. The variety of sources is highlighted in a "threat receipt":
Stacking sources under a single pane of glass is valuable to a SOC analyst for determining if any given alert is a true positive or not. Consensus across multiple sources is a solid indication that a threat is legitimate. Conversely, there's value for threat hunting teams in filtering their alert stream based on these same sources. Consider for example the hunt for targeted attacks. In this case the hunter will want to search for InQuest or internally sourced alerts that do not overlap with known malware (AV hits) or commonly shared endpoints (public reputation). Seeing as how multi-sourcing threat intelligence is valuable in a variety of ways, the rest of the article provides a high level walk-through of the key requirements for creating your own public domain IOC aggregator. First, we'll begin with available open-source solutions.
OSS or Scrape & Parse?
There are many different tools that can be used to aggregate data from public intelligence feeds, and we took a few publicly available ones for a test drive. We looked at CIF, its successor Bearded Avenger (CIF v3), and YETI, and we have our own horse in the race: The OSINT Omnibus. CIF and Bearded Avenger are commonly used tools for gathering threat intelligence that allow you to combine data from numerous public feeds with your own data. YETI is another aggregation tool, with a feature-rich UI, that stores its data in a NoSQL format. Any of these tools could be great depending on your specific use case. If nothing quite fits the bill, however, it's trivial to build your own programmatic scraper or even lean on staple tools such as wget and curl. The trickier portion is extracting the indicators you're interested in.
The majority of the parsing is done via regular expressions (regex). The regexes for extracting IPs, hashes, and ASNs are fairly trivial. URLs, on the other hand, can be quite a handful to get right. Let's lean on a well-maintained and ever-evolving community effort, the Gruber Liberal Regex Pattern for Web URLs:
Alternatively, you can outsource IOC extraction to python-iocextract (readme), an open-source and open-licensed library we wrote and maintain. With our regular expressions in hand, let's take a look at the sources we'll be scraping.
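For the simpler indicator types, a minimal Python sketch might look like the following. The patterns and function names are illustrative assumptions, not the expressions InQuest uses, and they favor readability over coverage (no IPv6 and no defanged indicators, for example).

```python
import ipaddress
import re

# Illustrative patterns only (assumptions for this sketch).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HASH_RE = re.compile(r"\b(?:[A-Fa-f0-9]{64}|[A-Fa-f0-9]{40}|[A-Fa-f0-9]{32})\b")
ASN_RE = re.compile(r"\bAS(\d{1,10})\b", re.IGNORECASE)

# Hash type is inferred purely from the length of the hex string.
HASH_TYPES = {32: "md5", 40: "sha1", 64: "sha256"}


def extract_ipv4(text):
    """Return the set of syntactically valid IPv4 addresses found in text."""
    found = set()
    for candidate in IPV4_RE.findall(text):
        try:
            # The regex is deliberately loose; ipaddress rejects octets > 255.
            found.add(str(ipaddress.IPv4Address(candidate)))
        except ValueError:
            continue
    return found


def extract_hashes(text):
    """Return a set of (hash_type, lowercase_hex) tuples found in text."""
    return {(HASH_TYPES[len(h)], h.lower()) for h in HASH_RE.findall(text)}


def extract_asns(text):
    """Return the set of AS numbers written as 'AS<digits>' in text."""
    return {int(number) for number in ASN_RE.findall(text)}
```

Validating candidates with the standard library's `ipaddress` module keeps the regex short while still discarding strings such as `999.1.1.1`. A loose pattern like this will still pick up look-alikes (dotted version strings, for instance), so some source-specific filtering is usually needed.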
As of the time of writing, we've enumerated 44 publicly available feeds across 22 unique sources. If we've missed any, be sure to let us know via e-mail or Twitter. Reputation data is time sensitive. While some indicators can last for years, others can cycle from malicious to benign far more quickly. As a recommended default, consider expiring any scraped artifacts after 30 days. On an anecdotal side note, we have some proprietary entries in our reputation feed that go back to 2014 ... and we still see them in use periodically. Some malicious actors will cycle through their available infrastructure.
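The 30-day default can be sketched as a small in-memory store. The class and method names below are hypothetical, and the sketch assumes the expiry clock restarts whenever any feed lists the indicator again, a design choice the article does not specify.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(days=30)  # recommended default from the text


class ReputationStore:
    """Hypothetical in-memory store keyed by (kind, value)."""

    def __init__(self, ttl=DEFAULT_TTL):
        self.ttl = ttl
        self._entries = {}

    def observe(self, kind, value, source, seen_at=None):
        """Record a sighting; the most recent sighting drives expiry (assumption)."""
        seen_at = seen_at or datetime.now(timezone.utc)
        entry = self._entries.setdefault(
            (kind, value), {"sources": set(), "last_seen": seen_at}
        )
        entry["sources"].add(source)
        entry["last_seen"] = max(entry["last_seen"], seen_at)

    def expire(self, now=None):
        """Drop entries not seen within the TTL; return how many were removed."""
        now = now or datetime.now(timezone.utc)
        stale = [
            key for key, entry in self._entries.items()
            if now - entry["last_seen"] > self.ttl
        ]
        for key in stale:
            del self._entries[key]
        return len(stale)

    def lookup(self, kind, value):
        """Return the sorted list of sources that reported the indicator."""
        entry = self._entries.get((kind, value))
        return sorted(entry["sources"]) if entry else []
```

Tracking the set of reporting sources alongside the timestamp also makes it easy to see how many feeds agree on a given indicator.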
Consider the following table:
|Source||Data Type||Update Rate||Description|
|[Alienvault]||IP||6hr||AlienVault IP reputation database|
|[Bambenek]||IP||1hr||Bambenek Consulting feed of C2 IP addresses|
|[BinaryDefence]*||IP||12hr||Binary Defence Systems IP addresses from threat intelligence and banlist feeds|
|[Blocklist]||IP||3hr||Blocklist IP addresses reported for running attacks by users primarily using fail2ban|
|[Botscout]||IP||3hr||Botscout IP addresses of bots|
|[Bruteforceblocker]||IP||12hr||Bruteforceblocker IP addresses from users reporting failed authentication attempts|
|[Ciarmy]||IP||24hr||CIArmy IP addresses that score poorly but are not yet widely identified as malicious|
|[Csirtg]||IP||1hr||CSIRTG IPv4 Addresses seen delivering unsolicited commercial emails|
|[Cybercrime]||IP||3hr||Cybercrime IP addresses tracking C2|
|[Dataplane]||IP||3hr||Dataplane IP addresses that have been seen performing various attacks|
|[Feodo]||IP||3hr||Feodo IP addresses used as C2 communication channels by the Feodo Trojan|
|[Isc.sans]*||Domain||3hr||Isc.sans domains collected from dshield.org|
|[Malc0de]||IP||12hr||Malc0de IP addresses recently caught distributing malware|
|[Malwaredomainlist]||IP, Domain||12hr||Malwaredomainlist IP addresses and domains that are actively malicious|
|[Openphish]**||URL||12hr||OpenPhish URLs found using AI algorithms to find zero-day phishing sites|
|[Packetmail]*||IP||12hr||Packetmail IP addresses found performing TCP SYN to non-listening service|
|[Phishtank]||URL||24hr||Phishtank URLs used in phishing campaigns|
|[Spamhaus]||ASN||8hr||Spamhaus list of ASNs known to be run by spammers, ASNs run by cyber criminals, and hijacked ASNs|
|[Sslbl]**||IP||3hr||SSLBL Hashes associated with malicious SSL certificates|
|[Talos Intelligence]||IP||12hr||Talos Intelligence IP addresses of known network threats flagged on all Cisco Security Products|
|[Threatweb]||IP, Domain, URL||3hr||Threatweb lists of malicious domains, malicious IPs, malicious URLs, and botnet IPs|
|[VxVault]||IP, URL||12hr||VxVault list of malicious URLs and their IPs|
|[Zeus]||IP, Domain||3hr||Zeus IP addresses and domains used by the ZeuS trojan|
* denotes not licensed for commercial use
** denotes user must contact owner of list for commercial use
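The Update Rate column above lends itself to a simple polling check. The following is a rough sketch, not a description of how the OSINT Omnibus schedules its scrapes: the function name is hypothetical, and only a handful of sources from the table are included for brevity.

```python
from datetime import datetime, timedelta, timezone

# Hours to wait between scrapes, copied from a subset of the table above.
UPDATE_RATE_HOURS = {
    "Bambenek": 1,
    "Feodo": 3,
    "Alienvault": 6,
    "Phishtank": 24,
}


def due_feeds(last_scraped, now, rates=UPDATE_RATE_HOURS):
    """Return the sorted names of feeds that should be scraped at time `now`.

    `last_scraped` maps a feed name to the datetime of its last scrape;
    feeds missing from the mapping are treated as never scraped.
    """
    due = []
    for name, hours in rates.items():
        last = last_scraped.get(name)
        if last is None or now - last >= timedelta(hours=hours):
            due.append(name)
    return sorted(due)
```

A cron job or long-running loop could call `due_feeds` every few minutes and fetch only what it returns, which avoids hitting any source more often than its recommended rate.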
The Source, Data Type, and Description columns in the table are self-explanatory. The Update Rate represents the recommended amount of time to wait between scrapes of that specific source. This recommendation is based on the rate at which the source updates its data, but it is not exactly the same. Scraping these sources over the course of a few days will result in an artifact extraction rate like so:
Artifacts such as IPs, domains, hashes, URLs, and ASNs are all directly available from the sources above. You can take it a step further, though, and extract derived information alongside the primary data type. Consider, for example, resolving the ASN for every IP and building a reputation score per ASN based on the number of malicious IPs it contains. Using the ASN data from MaxMind, one can calculate the ratio of known malicious IPs to the total number of IPs under the given ASN. Depending on your tolerance, if a threshold is exceeded, you may consider blocking or alerting on communications with any IP address under the ASN.
As an exercise, we implemented the above at a 5% threshold, producing the following table as of the time of writing (April 2018):
Count = number of unique suspicious IPs
Total = total number of possible IPs in the ASN
Percentage = count / total * 100
|ASN Number||ASN Name||Count||Total||Percentage|
|35828||Teratrade Hungary Kft||517||1280||40.39|
|48131||EAN NUM LOG Kft.||374||1024||36.52|
|52635||TECNOLOGIA E EQUIPAMENTOS||286||1024||27.93|
|199264||Estro Web Services Private Limited||138||768||17.96|
|64484||Jupiter 25 Limited||36||256||14.06|
|58222||Solar Invest UK LTD.||56||512||10.93|
|57509||L&L Investment Ltd.||26||256||10.15|
|205092||Outsource Grid Limited||24||256||9.37|
|206791||Slobozhenyuk B.Y. PE||21||256||8.20|
|39517||Hostmaze Inc Srl-d||21||256||8.20|
|263368||CASA DA INFORMATICA LTDA - ME||161||2048||7.86|
|205280||United Protection (UK) Security LIMITED||19||256||7.42|
|60307||FOP HORBAN VITALII Anatoliyovich||17||256||6.64|
|262812||K.H.D. SILVESTRI E CIA LTDA||314||5120||6.13|
|263880||WANTEL TECNOLOGIA LTDA. EPP||245||4096||5.98|
|262711||TEK TURBO PROVEDOR DE INTERNET LTDA||426||8192||5.20|
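The calculation behind the table above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `ip_to_asn` and `asn_sizes` are hypothetical lookup tables that would in practice be built from the MaxMind ASN data, and the function name and row layout are ours, not the code used to produce the table.

```python
from collections import Counter


def asn_reputation(malicious_ips, ip_to_asn, asn_sizes, threshold_pct=5.0):
    """Return (asn, count, total, percentage) rows for ASNs above the threshold.

    malicious_ips: iterable of IP strings from the aggregated feeds
    ip_to_asn:     hypothetical mapping of IP string -> AS number
    asn_sizes:     hypothetical mapping of AS number -> total IPs announced
    """
    counts = Counter()
    for ip in set(malicious_ips):  # count each suspicious IP once
        asn = ip_to_asn.get(ip)
        if asn is not None:
            counts[asn] += 1

    rows = []
    for asn, count in counts.items():
        total = asn_sizes.get(asn)
        if not total:
            continue  # skip ASNs with unknown or zero size
        percentage = count / total * 100
        if percentage > threshold_pct:
            rows.append((asn, count, total, round(percentage, 2)))

    # Highest ratio first, matching the ordering of the table.
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Note that small ASNs cross a percentage threshold with only a handful of bad addresses, so it may be worth pairing the ratio with a minimum count before blocking on it.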
When determining whether or not an artifact is malicious, it is critical to have as much supporting information as possible to make that determination. An aggregated reputation database containing data from reliable intelligence sources provides a great analytical layer of scrutiny that can be used to identify suspicious and/or malicious content within your environment.
- OSINT Omnibus
- FireHOL: Good source of different IP feeds, and a large amount of interesting data for each feed. (Website, Source list)
- YETI: Premade reputation aggregator and database (Website, Repository)
- CIF: Premade reputation aggregator and database (Website, Repository)