Written by: Carlos Hernandez Ganan, Siôn Lloyd, and Samaneh Tajalizadehkhoob
Reputation Block Lists (RBLs) serve as a common defense against harmful and unwanted Internet content. These lists contain the Internet Protocol (IP) addresses, domain names, or full uniform resource locators (URLs) of known spam sources, phishing pages, malicious sites, or other unwanted content. We use these data sources to produce metrics in systems like ICANN’s Domain Abuse Activity Reporting (DAAR) system, for research, and as training sets for machine learning models, among other uses. Organizations may use RBLs to block incoming or outgoing access to domain names, filter spam, and warn about phishing or other harmful activities, all of which protect users from online threats.
There are several providers of these lists, many of which have distinct and specific focuses, collection methods, delivery mechanisms, and so on. As a result, RBL providers have different strengths. Additionally, RBL data has a wide range of use cases, and not every RBL will be ideally suited to every user. Both the recent Office of the Chief Technology Officer (OCTO) publication and our new paper, presented and published at the Anti-Phishing Working Group technical summit 2023, explore this thoroughly.
To help evaluate whether the characteristics of an RBL are suitable for our proposed use, we first have to harmonize all incoming data and metadata, and then store it in a single, consistent format. Only then can we begin to measure and compare the volume of data and the types of threat addressed, identify duplicate data in reports, and so on. We can also begin to analyze less tangible metrics like potential false positives. Over time, this process of harmonizing data has allowed us to create sets of metrics that can be used to help evaluate RBLs, both in isolation and in combination with each other.
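To make the harmonization step concrete, here is a minimal Python sketch of mapping raw list entries onto one common record. The `Entry` fields, the `classify` and `harmonize` helpers, and the normalization rules are all assumptions made for illustration; they are not the schema or pipeline described in the paper.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Entry:
    """Hypothetical common record; not the schema used by the authors."""
    indicator: str  # normalized IP address, domain name, or URL
    kind: str       # "ip", "domain", or "url"
    threat: str     # for example "phishing", "spam", or "malware"
    source: str     # name of the RBL that listed the indicator


def classify(raw: str) -> tuple[str, str]:
    """Return (normalized indicator, kind) for one raw list entry."""
    value = raw.strip()
    if "://" in value:
        parts = urlsplit(value)
        # Lowercase the scheme and network location only;
        # URL paths can be case-sensitive, so they are left untouched.
        normalized = parts._replace(
            scheme=parts.scheme.lower(), netloc=parts.netloc.lower()
        ).geturl()
        return normalized, "url"
    try:
        # ip_address() accepts IPv4 and IPv6 and yields a canonical string.
        return str(ip_address(value)), "ip"
    except ValueError:
        # Assumption: anything that is neither a URL nor an IP is a domain.
        return value.lower().rstrip("."), "domain"


def harmonize(raw: str, threat: str, source: str) -> Entry:
    """Build one common-format record from a provider-specific entry."""
    indicator, kind = classify(raw)
    return Entry(indicator, kind, threat.strip().lower(), source)
```

Because `Entry` is frozen and hashable, records from different providers can be placed in sets or used as dictionary keys, which makes later steps such as counting volume or spotting duplicates straightforward.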
As a result, we see that no single RBL meets the needs of all use cases. Combining RBLs into a single source of data turns out to be exceptionally important, as the strengths of one will often compensate for the weaknesses of another. In fact, we have consistently seen that the overlap between RBLs is small (less than five percent in most cases). This means that each additional RBL contributes mostly entries that the others lack, so the benefit of combining them is often significant.
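One way to quantify overlap between lists is sketched below in Python. It uses the Jaccard index (shared entries divided by all entries on either list), which is an assumption chosen for illustration; the paper may define overlap differently. The list names and the `example` domains are made up.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of all indicators on either list that appear on both."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(lists: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard overlap for every unordered pair of lists, keyed by name pair."""
    return {
        (x, y): jaccard(lists[x], lists[y])
        for x, y in combinations(sorted(lists), 2)
    }


def marginal_gain(lists: dict[str, set[str]], baseline: str, candidate: str) -> int:
    """Number of indicators the candidate list adds that the baseline lacks."""
    return len(lists[candidate] - lists[baseline])


# Illustrative data only; these are not real RBL contents.
sample = {
    "list-a": {"a.example", "b.example", "c.example"},
    "list-b": {"c.example", "d.example"},
    "list-c": {"e.example"},
}

for pair, score in pairwise_overlap(sample).items():
    print(pair, f"{score:.2f}")
```

A low score for a pair, together with a high `marginal_gain`, suggests the two lists complement each other rather than duplicate each other.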
In our new paper, we argue that understanding the strengths and weaknesses of any one RBL, or combination of multiple RBLs, is key to getting a good fit for a particular use case. To maximize the benefits of RBLs, we suggest combining two or more. Doing so provides a fuller, contextually richer overview of what is occurring than relying on any single RBL. Here is a summary of our contributions in this publication:
- We proposed a systematic methodology for evaluating RBLs, providing a structured approach for assessing their effectiveness and suitability.
- We conducted a nuanced examination of RBL characteristics, covering aspects such as volume, overlap, timeliness, churn, liveliness, purity, and accuracy.
- We demonstrated the practical application of our proposed metrics through visualizations, offering a tangible and accessible guide for researchers, practitioners, and decision-makers in the cybersecurity domain.
For more detail on our methods and findings, see the OCTO Publication 037.