Aggregating Public Domain Reputation Feeds

SOC analysts typically have access to a mix of proprietary, commercial, open source, and personal reputation sources for various indicator of compromise (IOCs). IOCs include file hashes, IP addresses, domain names, SSL certificate fingerprints and more. Aggregating the variety of feeds into a single source is a prudent first-step for manual search and programmatic accessibility. In this article we outline a number of publicly available resources and describe a simple method for aggregating them into a single reputation database. The final product, while not containing the highest fidelity data, can provide a valuable reference for threat hunters. Commercially, we supply InQuest users with a propriety reputation API, sourced from both manual and automated threat hunting efforts. Over 80% of these artifacts do not overlap with what we’re seeing in the public domain.

Talk to any of our security engineers and you’ll be sure to hear at some point that we “throw everything and the kitchen sink” at the problem of identifying malicious content. We consider a variety of artifacts such as IPs, URLs, HTTP and SMTP headers, and more. Primarily our focus is file centric. Files are carved off the wire, fed through our Deep File Inspection ® (DFI) stack, and then analyzed for signs of malignant code. The DFI process is recursive in nature and allows us to peel away compression, obfuscation, encoding, and layering techniques used by attackers to mask their payloads and ultimately evade detection. Our system functions in parallel in a map-reduce like pattern that results in the output of a single threat score per session, ranging from 0 to 10. The threat scoring algorithm is primarily driven by file-analysis but additional factors can move a score up or down. This includes inputs from Multi-AV providers (OPSWAT, VirusTotal), detonation solutions (Cuckoo, Joe Sandbox, VMRay, FireEye, Falcon aka HybridAnalysis), and a multitude of reputation sources. The variety of sources are highlighted in a “threat receipt”:

threat score contributors

Stacking sources under a single pane of glass is valuable to a SOC analyst for determining if any given alert is a true positive or not. Consensus across multiple sources is a solid indication that a threat is legitimate. Conversely, there’s value for threat hunting teams in filtering their alert stream based on these same sources. Consider for example the hunt for targeted attacks. In this case the hunter will want to search for InQuest or internally sourced alerts that do not overlap with known malware (AV hits) or commonly shared endpoints (public reputation). Seeing as how multi-sourcing threat intelligence is valuable in a variety of ways, the rest of the article provides a high level walk-through of the key requirements for creating your own public domain IOC aggregator. First, we’ll begin with available open-source solutions.

OSS or Scrape & Parse?

There are many different tools that can be used to aggregate data from public intelligence feeds and we took a few publicly available ones for a test drive. We looked at CIF, its successor Bearded Avenger (CIF v3), YETI, and have our own horse in the race: The OSINT Omnibus. CIF and Bearded Avenger are commonly used tools for gathering threat intelligence that allow you to combine data from numerous public feeds, and your own data. YETI is another aggregation tool with a feature rich UI and storing the data in a NoSQL format. Any of these tools could be be great depending on your specific use-case. If nothing quite fits the bill however, it’s trivial to build your own programmatic scraper or even lean on staple tools such as wget and curl. The trickier portion is extracting the indicators you’re interested in.

The majority of the parsing is done via regular expressions (regex). The regex for extracting an IP, hash, and ASN are fairly trivial. URLs on the other hand can be quite a handful to get right. Let’s lean on a well maintained and ever evolving community effort, Gruber Liberal Regex Pattern for Web URLs:

# not so easy.
URL_RE = (?i)b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^s()<>{}[]]+|([^s()]*?([^s()]+)[^s()]*?)|([^s]+?))+(?:([^s()]*?([^s()]+)[^s()]*?)|([^s]+?)|[^s`!()[]{};:'".,<>?])|(?:(?<!@)[a-z0-9]+(?:[.-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)b/?(?!@)))

# easy.
IP_RE      = (d{1,3}.d{1,3}.d{1,3}.d{1,3}(?:/d{1,3})?)
DOMAIN_RE  = [w-.]+[^.s/d]{2,}
ASN_NUM_RE = b(AS[d]{1,10})(?=.;.)b

Alternatively, you can outsource IOC extraction to python-iocextract (readme), an open-source and open-licensed library we wrote and maintain. With our regular expressions in hand, let’s take a look at the sources we’ll be scraping.


As of the time of writing, we’ve enumerated 44 publicly available feeds across 22 unique sources. If we’ve missed any, be sure to let us know via e-mail or Twitter. Reputation data is time sensitive. While some indicators can last for years, others can cycle from malicious to benign in the same. As a recommended default, consider expiring any scraped artifacts after 30 days. On an anecdotal side note, we have some proprietary entries in our reputation feed that go back to 2014 … and we still see them in use periodically. Some malicious actors will cycle through their available infrastructure.

Consider the following table:

SourceData TypeUpdate RateDescription
AlienvaultIP6hrAlienVault IP reputation database
BambenekIP1hrBambenek Consulting feed of C2 IP addresses
BinaryDefence*IP12hrBinary Defence Systems IP addresses from threat intelligence and banlist feeds
BlocklistIP3hrBlocklist IP addresses reported for running attacks by users primarily using fail2ban
BotscoutIP3hrBotscout IP addresses of bots
BruteforceblockerIP12hrBruteforceblocker IP addresses from users reporting failed authentication attempts
CiarmyIP24hrCIArmy IP addresses that they score poorly and are not yet widely identified as being malicious
CsirtgIP1hrCSIRTG IPv4 Addresses seen delivering unsolicited commercial emails
CybercrimeIP3hrCybercrime IP addresses tracking C2
DataplaneIP3hrDataplane IP addesses that have been seen doing various attacks
FeodoIP3hrFeodo IP addresses used as C2 communitcation channel by the Fendo Trojan
Isc.sans*Domain3hrIsc.sans domains collected from
Malc0deIP12hrMalc0de IP addresses recently caught distributing malware
MalwaredomainlistIP, Domain12hrMalwaredomainlist IP addresses and domains that are actively malicious
Openphish**URL12hrOpenPhish URLs found using AI algorithms to find zero-day phishing sites
Packetmail*IP12hrPacketmail IP addresses found performing TCP SYN to non-listening service
PhishtankURL24hrPhishtank URLs used in phishing campaigns
SpamhausASN8hrSpamhaus list of ASNs known to be run by spammers, ASNs run by cyber criminals, and highjacked ASNs
Sslbl**IP3hrSSLBL Hashes associated with malicious SSL certificates
Talos InteligenceIP12hrTalos Intelligence IP addresses of known network threats flagged on all Cisco Security Products
ThreatwebIP, Domain, URL3hrThreatweb lists of malicious domains, malicious IPs, malicious URLs, and botnet IPs
VxVaultIP. URL12hrVxVault list of malicious URLs and their IPs
ZeusIP, Domain3hrZeus IP addresses and domains used by the ZeuS trojan
*  denotes not licensed for commercial use
** denotes user must contact owner of list for commercial use

The SourceData Type, and Description columns in the table are self explanatory. The Update Rate represents the recommended amount of time to wait between scrapes of that specific source. This recommendation is based on the rate at which the source updates their data, but it is not exactly the same. Scraping these sources over the course of a few days will result in an artifact extraction rate like so:

Crawl stats from ~4/2/2018 - ~4/5/2018
Crawl stats from ~4/2/2018 – ~4/5/2018

Artifacts such IP, domain, hash, URL, and ASN are all directly available from the sources above. You can take a step further though and alongside the primary data type, extrude derived information as well. Consider for example, resolving the ASN for every IP and building a reputation score per ASN sourced on the number of malicious IPs it contains. Using the ASN data
from MaxMind, one can calculate the ratio of known malicious IPs out of the total number of IPs under the given ASN. Depending on your tolerance, if a threshold is exceeded, you may consider blocking or alerting on any communications with any IP address under the ASN.

As an exercise, we implemented the above at a 5% threshold, producing the following table as of the time of writing (April 2018):

Count = amount of unique suspicious IPs
Total = total amount of possible IPs in the ASN
Percentage = count / total * 100
ASN NumberASN NameCountTotalPercentage
42570Nagravision SA28951256.44
35828Teratrade Hungary Kft517128040.39
63128Strong Technology517128040.39
48131EAN NUM LOG Kft.374102436.52
199264Estro Web Services Private Limited13876817.96
64484Jupiter 25 Limited3625614.06
58222Solar Invest UK LTD.5651210.93
57509L&L Investment Ltd.2625610.15
205092Outsource Grid Limited242569.37
264643Enredes S.A.8710248.49
206791Slobozhenyuk B.Y. PE212568.20
39517Hostmaze Inc Srl-d212568.20
263368CASA DA INFORMATICA LTDA – ME16120487.86
205280United Protection (UK) Security LIMITED192567.42
43765Pp Sks-lugan7010246.83
60307FOP HORBAN VITALII Anatoliyovich172566.64
202963Master-Integration Ltd162566.25
262812K.H.D. SILVESTRI E CIA LTDA31451206.13
263880WANTEL TECNOLOGIA LTDA. ­ EPP24540965.98
394380Leaseweb USA777140805.51

When determining whether or not an artifact is malicious, it is critical to have as much supporting information as possible to make that determination. An aggregated reputation database containing data from reliable intelligence sources provides a great analytical layer of scrutiny that can be used to identify suspicious and/or malicious content within your environment.



  26. *Discontinued 2019

How Effective Is Your Email Security Stack?

Did you know, 80% of malware is delivered via email? How well do your defenses stand up to today’s emerging malware? Discover how effectively your email provider’s security performs with our Email Attack Simulation. You’ll receive daily reports on threats that bypassed your defenses as well as recommendations for closing the gap. Free of charge for 30 days.

Get My Email Attack Simulation