Xerox
Xerox Technology and Brand Licensing

Text Categorization and Clustering

Classify electronic documents based on content, even without a predefined taxonomy

CategoriX Technology Description

CategoriX is a software solution that selects the categories to which a document belongs. It improves the way users browse, search, or filter information in large document collections.

CategoriX is built on Xerox's patented linguistic analysis technologies and machine learning algorithms. It consists of:

A training tool, which learns probabilistic models from a collection of already categorized documents;
A categorization tool, which compares each new document with the models to infer the probable categories to which it should be assigned.
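
The train-then-categorize split described above can be illustrated with a small sketch. The page does not say which probabilistic model CategoriX uses, so this example assumes a multinomial naive Bayes classifier with add-one smoothing purely for illustration. The function names (tokenize, train, categorize) and the sample data are hypothetical and are not part of the CategoriX API.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; real linguistic analysis would do far more.
    return text.lower().split()


def train(labeled_docs):
    """Learn per-category word counts from (text, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocab.add(word)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def categorize(model, text):
    """Return a dict mapping each category to its estimated probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # category prior
        for word in tokenize(text):
            if word in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        log_scores[category] = score
    # Convert log scores to normalized probabilities.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: e / norm for c, e in exp_scores.items()}


# Hypothetical training collection of already categorized documents.
TRAINING = [
    ("printer toner cartridge", "hardware"),
    ("laser printer paper jam", "hardware"),
    ("invoice payment due", "finance"),
    ("quarterly payment report", "finance"),
]
```

Calling categorize(train(TRAINING), "printer toner") returns a probability for each category, from which the most probable categories can be selected.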

Categorization process


CategoriX Functionality


Large Scale Categorization

By using clusters of computers for both training and categorization, our large-scale categorization scales with the number of categories while keeping the same level of accuracy and speed, so that document categorization remains interactive.

Operating Environment

The software is written in Java and can be deployed on multiple platforms, including UNIX, Linux, and Windows. Java runtime 1.4.2 or later is required. Supported document formats are XML, HTML, and plain text.

ClusteriX Technology Description

When no taxonomy exists, or the existing taxonomy does not meaningfully describe the documents to be classified, ClusteriX will:

Identify groups of similar documents
Characterize (name) the content of identified groups
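
The two steps above, grouping and naming, can be sketched as follows. This is not the ClusteriX algorithm, which the page does not describe. It assumes a simple single-pass clustering based on Jaccard similarity of word sets, and names each group by its most frequent terms. All names, the stopword list, and the threshold are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "for", "is"}


def terms(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]


def jaccard(a, b):
    # Overlap of two term collections, from 0.0 (disjoint) to 1.0 (identical).
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.2):
    """Assign each document to the most similar existing group, or start a new one."""
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(docs):
        best, best_sim = None, threshold
        for members in clusters:
            pooled = [t for j in members for t in terms(docs[j])]
            sim = jaccard(terms(doc), pooled)
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def name_cluster(docs, members, k=2):
    """Characterize a group by its k most frequent terms."""
    counts = Counter(t for j in members for t in terms(docs[j]))
    return [t for t, _ in counts.most_common(k)]


DOCS = [
    "printer paper jam",
    "printer paper tray",
    "invoice payment overdue",
    "payment invoice received",
]
```

With the sample DOCS, cluster(DOCS) separates the printer documents from the invoice documents, and name_cluster yields the shared terms of each group as its label.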

ClusteriX and CategoriX Coupling Scenario


Clustering Process


State-of-the-art Hierarchical Performance

CategoriX and ClusteriX employ a hierarchical model that relates categories to each other. The relationships between categories add an extra dimension to the model, resulting in more accurate categorization. Documents may be assigned to categories at different levels in the hierarchy. The categorization training tool can generate flat models as well as hierarchical ones.
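
One way to picture assignment at different levels of a hierarchy is a top-down descent that stops as soon as no child category is confident enough. The sketch below assumes this strategy for illustration only; the page does not state how CategoriX traverses its hierarchy. The taxonomy, the probabilities, and the min_conf threshold are all hypothetical.

```python
# Hypothetical taxonomy: each category maps to its list of child categories.
TAXONOMY = {
    "root": ["hardware", "finance"],
    "hardware": ["printers", "scanners"],
    "finance": [],
    "printers": [],
    "scanners": [],
}


def assign(doc_probs, taxonomy, root="root", min_conf=0.6):
    """Follow the most probable child at each level while it stays confident.

    doc_probs maps a category to the probability of that category given its
    parent and the document, as produced by some per-node model (assumed).
    Returns the path of assigned categories, which may end at an inner node.
    """
    path = []
    node = root
    while taxonomy[node]:
        child = max(taxonomy[node], key=lambda c: doc_probs.get(c, 0.0))
        if doc_probs.get(child, 0.0) < min_conf:
            break  # not confident enough to go deeper
        path.append(child)
        node = child
    return path
```

A document that is clearly about hardware but ambiguous between printers and scanners would be assigned to the intermediate category rather than forced into a leaf.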

Adaptive Learning

High-confidence documents can be added directly to the training data set to update the probabilistic models. Low-confidence documents can be checked by a human. This review may lead to the discovery of novel topics, in which case the category system can be re-trained locally (on a subset of the data) and/or incrementally (using only the additional material).
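
The routing and incremental update described above might look like the following sketch. The confidence threshold, the function names, and the use of per-category word counts as the model state are assumptions made for illustration; the page does not specify how CategoriX stores or updates its models.

```python
from collections import Counter, defaultdict


def route(predictions, threshold=0.9):
    """Split (doc_id, category, confidence) predictions into two queues.

    Documents at or above the (hypothetical) threshold are accepted
    automatically; the rest are queued for human review.
    """
    auto_add, review = [], []
    for doc_id, category, confidence in predictions:
        if confidence >= threshold:
            auto_add.append((doc_id, category))
        else:
            review.append(doc_id)
    return auto_add, review


def update_counts(word_counts, new_docs):
    """Fold newly labeled (text, category) pairs into existing word counts.

    Only the additional material is processed, so earlier training data
    does not need to be re-read.
    """
    for text, category in new_docs:
        word_counts[category].update(text.lower().split())
    return word_counts
```

For example, route([("d1", "hardware", 0.97), ("d2", "finance", 0.55)]) accepts d1 automatically and sends d2 to review.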

For Licensing Information

CategoriX and ClusteriX are available for commercial technology licensing (including OEM). Maintenance, technical support, customization, and integration services can be provided on a contracted basis.

To learn more about licensing the CategoriX and ClusteriX technologies, contact Xerox.



© 2001 - 2010 XEROX CORPORATION. All rights reserved.