Text Categorization and Clustering

Classify electronic documents based on content, even without a predefined taxonomy.

Technology Description

CategoriX is a software solution that selects the categories to which a document belongs. It makes it easier for users to browse, search, or filter information in large document collections.

Built on Xerox-patented linguistic analysis technologies and machine learning algorithms, CategoriX consists of:
  • A training tool, which learns probabilistic models from a collection of already categorized documents
  • A categorization tool, which compares each new document with the models to infer the probable categories to which it should be assigned
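
The two tools above can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model, which is only one possible probabilistic model; this text does not say which model CategoriX actually uses. The function names `train` and `categorize` and the sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Learn per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in labeled_docs:
        tokens = text.lower().split()
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def categorize(model, text):
    """Return {category: probability} for a new document."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # category prior
        for token in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        log_scores[category] = score
    # Convert log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

# Hypothetical training collection of already categorized documents.
model = train([
    ("printer toner cartridge refill", "hardware"),
    ("laser printer paper jam", "hardware"),
    ("machine learning text model", "research"),
    ("probabilistic model training data", "research"),
])
scores = categorize(model, "printer paper tray")
best_category = max(scores, key=scores.get)  # "hardware" for this toy data
```

Training and categorization are kept as separate functions here to mirror the split between the training tool and the categorization tool.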


Key features:
  • Multi-label (a document can be assigned to several classes with a confidence level for each class)
  • Hierarchical (or flat)
  • Optional linguistic pre-processing (for 15+ languages) for some specific types of document collections
  • Semi-supervised learning: use of unlabelled data to reduce the amount of required human annotation
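
As an illustration of multi-label assignment with per-class confidence, the sketch below keeps every class whose confidence reaches a cutoff. The function name `assign_labels`, the 0.3 threshold, and the sample confidence values are assumptions made for the example, not CategoriX settings.

```python
def assign_labels(confidences, threshold=0.3):
    """Keep every class whose confidence reaches the threshold, best first."""
    kept = [(label, conf) for label, conf in confidences.items() if conf >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical per-class confidence levels for one document.
confidences = {"finance": 0.82, "legal": 0.41, "sports": 0.03}
labels = assign_labels(confidences)
# labels == [("finance", 0.82), ("legal", 0.41)]
```

A single document ends up in two classes here, each with its own confidence, while the low-confidence class is dropped.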

Large Scale Categorization

By using clusters of computers for both training and categorization, large-scale categorization scales with the number of categories while maintaining accuracy and speed, so that document categorization remains interactive.

Operating Environment

The software is written in Java and can be deployed on multiple platforms, including UNIX, Linux, and Windows. Java runtime 1.4.2 or later is required. Documents can be in XML, HTML, or plain-text format.

ClusteriX Technology Description

When no taxonomy exists, or the existing one does not meaningfully represent the documents to be classified, ClusteriX will:
  • Identify groups of similar documents
  • Characterize (name) the content of identified groups
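
The two steps above can be sketched as follows. The example assumes a simple single-pass clustering over bag-of-words vectors with cosine similarity, and names each group by its most frequent terms. It is a generic technique chosen for illustration, not the ClusteriX algorithm, and all names and thresholds are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, min_similarity=0.3):
    """Assign each document to the most similar group, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc index]}
    for i, text in enumerate(docs):
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= min_similarity:
            best["members"].append(i)
            best["centroid"].update(vec)  # centroid = summed term counts
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters

def name_cluster(c, n_terms=2):
    """Characterize a group by its most frequent terms."""
    return [term for term, _ in c["centroid"].most_common(n_terms)]

docs = [
    "printer toner printer cartridge",
    "printer cartridge refill",
    "contract law court ruling",
    "court ruling appeal",
]
groups = cluster(docs)
names = [name_cluster(g) for g in groups]
```

On this toy collection the first two documents form one group and the last two form another, each named by its dominant terms.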

ClusteriX and CategoriX Coupling Scenario

  • Organize a collection using ClusteriX (taxonomy induction)
  • Categorize new documents with the induced taxonomy
  • When a category grows too large, reorganize locally using ClusteriX
  • Categorize directly with the updated taxonomy
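
The "reorganize locally" step of this scenario can be sketched as below. The size limit, the `maintain_taxonomy` function, and the toy `split_by_first_word` stand-in for a clustering call are all hypothetical; a real deployment would plug in an actual clustering step.

```python
def maintain_taxonomy(assignments, max_size, recluster):
    """Split any category holding more than max_size documents.

    `recluster(docs)` must return {subcategory_name: docs}; other categories
    are left untouched, so the reorganization stays local.
    """
    updated = {}
    for category, docs in assignments.items():
        if len(docs) > max_size:
            for sub_name, sub_docs in recluster(docs).items():
                updated[f"{category}/{sub_name}"] = sub_docs
        else:
            updated[category] = docs
    return updated

def split_by_first_word(docs):
    """Toy stand-in for a clustering step: group documents by first token."""
    groups = {}
    for doc in docs:
        groups.setdefault(doc.split()[0], []).append(doc)
    return groups

assignments = {
    "hardware": ["laser a", "laser b", "inkjet c"],
    "legal": ["contract d"],
}
updated = maintain_taxonomy(assignments, max_size=2, recluster=split_by_first_word)
```

Only the oversized category is subdivided; new documents can then be categorized against the updated set of category names.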

State-of-the-art Hierarchical Performance

CategoriX and ClusteriX employ a hierarchical model that relates categories to each other. Exploiting these relationships results in more accurate categorization than treating each category in isolation. Documents may be assigned to categories at different levels of the hierarchy. The categorization training tool can generate flat models as well as hierarchical ones.
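
A minimal sketch of assignment at different levels of a hierarchy, assuming a top-down descent that stops when no child category matches well enough. The keyword-overlap scorer, the `min_score` cutoff, and the sample taxonomy are hypothetical and stand in for the probabilistic model.

```python
def score(node, tokens):
    """Toy scorer: fraction of document tokens found in the node's keywords."""
    return len(tokens & node["keywords"]) / len(tokens) if tokens else 0.0

def categorize_top_down(node, tokens, min_score=0.2, path=()):
    """Descend the hierarchy while some child scores at least min_score."""
    path = path + (node["name"],)
    children = node.get("children", [])
    if not children:
        return path
    best = max(children, key=lambda child: score(child, tokens))
    if score(best, tokens) < min_score:
        return path  # assign to the internal category and stop here
    return categorize_top_down(best, tokens, min_score, path)

taxonomy = {
    "name": "root", "keywords": set(),
    "children": [
        {"name": "hardware",
         "keywords": {"printer", "scanner", "toner", "laser", "inkjet"},
         "children": [
             {"name": "laser", "keywords": {"laser", "toner"}},
             {"name": "inkjet", "keywords": {"inkjet", "ink"}},
         ]},
        {"name": "legal", "keywords": {"contract", "court"}},
    ],
}

deep = categorize_top_down(taxonomy, set("laser printer toner".split()))
shallow = categorize_top_down(taxonomy, set("printer scanner".split()))
# deep    == ("root", "hardware", "laser")
# shallow == ("root", "hardware")
```

The second document stops at the intermediate "hardware" category because neither subcategory matches it.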

Adaptive Learning

High-confidence documents can be added directly to the training data set to update the probabilistic models. Low-confidence documents can be checked by a human. This review may lead to the discovery of novel topics, in which case the category system can be retrained locally (on a subset of the data) and/or incrementally (using only the additional material).
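
The routing described above can be sketched as follows. The single cutoff `accept_at=0.9`, the function name, and the sample predictions are assumptions for illustration; the text does not specify how confidence thresholds are chosen.

```python
def route_by_confidence(predictions, accept_at=0.9):
    """Split (doc_id, category, confidence) tuples into two queues.

    Predictions at or above accept_at become new training examples;
    the rest go to a human review queue, least confident first.
    """
    new_training, review_queue = [], []
    for doc_id, category, confidence in predictions:
        if confidence >= accept_at:
            new_training.append((doc_id, category))
        else:
            review_queue.append((doc_id, category, confidence))
    review_queue.sort(key=lambda item: item[2])
    return new_training, review_queue

# Hypothetical categorization results.
predictions = [
    ("doc-1", "hardware", 0.97),
    ("doc-2", "legal", 0.55),
    ("doc-3", "hardware", 0.12),
]
new_training, review_queue = route_by_confidence(predictions)
```

Sorting the review queue by ascending confidence puts the documents most likely to reveal a novel topic in front of the reviewer first.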

Intellectual Property Summary

Xerox Intellectual Property includes patents, patent applications, and know-how.

For Licensing Information

Contact Xerox to learn more about licensing the Text Categorization and Clustering technology.