

A probabilistic framework for semi-supervised clustering. Sugato Basu, Mikhail Bilenko, and Raymond J.

Sugato Basu, Arindam Banerjee, and Raymond J.We also discuss how ATOL has been deployed and externally evaluated, as part of the LIGHTS system.

On the LIGHTS dataset, ATOLClassify gives a 12% performance gain over an analyst-provided baseline, while ATOLCluster gives a 7% improvement over state-of-the-art semi-supervised clustering algorithms.

The paper presents empirical results of ATOL on onion datasets derived from the LIGHTS repository, and additionally benchmarks ATOL's algorithms on the publicly available 20 Newsgroups dataset to demonstrate the reproducibility of its results. ATOL has three core components - (a) a novel keyword discovery mechanism (ATOLKeyword) which extends analyst-provided keywords for different categories by suggesting new descriptive and discriminative keywords that are relevant for the categories (b) a classification framework (ATOLClassify) that uses the discovered keywords to map onion site content to a set of categories when sufficient labeled data is available (c) a clustering framework (ATOLCluster) that can leverage information from multiple external heterogeneous knowledge sources, ranging from domain expertise to Bitcoin transaction data, to categorize onion content in the absence of sufficient supervised data. In this paper we describe Automated Tool for Onion Labeling (ATOL), a novel scalable analysis service developed to conduct a thematic assessment of the content of onion sites in the LIGHTS repository. We have developed an automated infrastructure that crawls and indexes content from onion sites into a large-scale data repository, called LIGHTS, with over 100M pages. Identifying and monitoring such illicit sites in the darkweb is of high relevance to the Computer Security and Law Enforcement communities. Onion sites on the darkweb operate using the Tor Hidden Service (HS) protocol to shield their locations on the Internet, which (among other features) enables these sites to host malicious and illegal content while being resistant to legal action and seizure.
