Text Categorization and Clustering
Classify electronic documents based on content, even without a predefined
taxonomy.
Technology Description
CategoriX is a software solution that assigns each document to the categories
to which it belongs. It makes large document collections easier to browse,
search, and filter.
CategoriX is built on Xerox's patented linguistic analysis technologies and
machine learning algorithms. It consists of:
- A training tool, which learns probabilistic models from a collection of
already categorized documents
- A categorization tool, which compares each new document with the models to
infer the probable categories to which it should be assigned
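The two tools above can be illustrated with a minimal sketch. The source does not say which probabilistic model CategoriX uses, so this example substitutes a plain multinomial Naive Bayes classifier; the function names train, categorize and tokenize, and the whitespace tokenizer, are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase whitespace split, no linguistic analysis.
    return text.lower().split()


def train(labeled_docs):
    """Training tool: learn per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def categorize(model, text):
    """Categorization tool: return {category: probability} for a new document."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    tokens = [t for t in tokenize(text) if t in model["vocab"]]  # skip unseen words
    log_scores = {}
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # category prior
        for token in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        log_scores[category] = score
    # Normalize log scores into probabilities; subtracting the max keeps exp() stable.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}
```

Calling train on a small set of already categorized documents and then categorize on a new text yields one probability per category, from which the most probable categories can be read off.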
Functionality
- Multi-label (a document can be assigned to several classes with a
confidence level for each class)
- Hierarchical (or flat)
- Optional linguistic pre-processing (for 15+ languages) for some specific
types of document collections
- Semi-supervised learning: use of unlabelled data to reduce the amount of
required human annotation
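As an illustration of multi-label output, the sketch below keeps every category whose confidence reaches a cut-off. It assumes each category has an independent confidence score between 0 and 1; the function name assign_labels, the threshold value and the fallback rule are hypothetical and not taken from CategoriX.

```python
def assign_labels(confidences, threshold=0.5, fallback_to_best=True):
    """Return (category, confidence) pairs at or above the threshold, best first."""
    selected = sorted(
        ((c, p) for c, p in confidences.items() if p >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Assumption: a document with no confident label still gets its best guess.
    if not selected and fallback_to_best and confidences:
        best = max(confidences, key=confidences.get)
        selected = [(best, confidences[best])]
    return selected
```

Lowering the threshold assigns more categories per document at the cost of precision, so the value would normally be tuned on held-out data.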
Large Scale Categorization
Large scale categorization distributes both training and categorization across
clusters of computers. It scales with the number of categories while keeping
the same level of accuracy and speed, so that document categorization remains
interactive.
Operating Environment
The software is written in Java and can be deployed on multiple platforms
including UNIX, Linux, and Windows. Java runtime 1.4.2 or later is required.
Supported document formats are XML, HTML, and plain text.
ClusteriX Technology Description
When no taxonomy exists, or the existing one does not describe the documents to
be classified in a meaningful way, ClusteriX will:
- Identify groups of similar documents
- Characterize (name) the content of identified groups
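The two steps above can be sketched as follows. The source does not describe the ClusteriX algorithm, so this example uses a simple greedy single-pass clustering over bag-of-words vectors with cosine similarity, and names each group by its most frequent terms. The function names, the similarity threshold and the naming rule are all assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    # Hypothetical representation: raw term counts, no linguistic pre-processing.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, min_similarity=0.3):
    """Greedy single pass: join the most similar group or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc index]}
    for index, text in enumerate(docs):
        vector = vectorize(text)
        best, best_sim = None, min_similarity
        for candidate in clusters:
            sim = cosine(vector, candidate["centroid"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append({"centroid": Counter(vector), "members": [index]})
        else:
            best["centroid"].update(vector)  # centroid is the summed term counts
            best["members"].append(index)
    return clusters


def name_cluster(group, n_terms=2):
    """Characterize a group by its most frequent terms (ties broken alphabetically)."""
    terms = sorted(group["centroid"].items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(term for term, _ in terms[:n_terms])
```

A single-pass scheme depends on document order, which keeps the sketch short; a production system would more likely use an order-independent method.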
ClusteriX and CategoriX Coupling Scenario
- Organize a collection using ClusteriX (taxonomy induction)
- Categorize new documents with the induced taxonomy
- When a category grows too large, reorganize locally using ClusteriX
- Categorize directly with the updated taxonomy
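The coupling scenario above can be sketched as a small maintenance loop. The categorizer and the clustering step are passed in as callables because their real interfaces are not described in the source; add_document, max_size and the "parent/child" naming of split categories are hypothetical.

```python
def add_document(taxonomy, doc, categorize, recluster, max_size=1000):
    """Categorize one document, then split its category locally if it grew too large.

    taxonomy:   dict mapping category name -> list of documents
    categorize: callable (doc, taxonomy) -> category name (stand-in for CategoriX)
    recluster:  callable (docs) -> {subgroup name: docs} (stand-in for ClusteriX)
    """
    category = categorize(doc, taxonomy)
    taxonomy[category].append(doc)
    if len(taxonomy[category]) > max_size:
        # Local reorganization: only the oversized category is re-clustered.
        subgroups = recluster(taxonomy.pop(category))
        for name, members in subgroups.items():
            taxonomy[f"{category}/{name}"] = members
    return taxonomy
```

After a split, later documents are categorized directly against the updated taxonomy, so the rest of the collection does not need to be reorganized.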
State-of-the-art Hierarchical Performance
CategoriX and ClusteriX employ a hierarchical model that relates categories to
each other. Modeling these relationships results in more accurate
categorization. Documents may be assigned to categories at different levels in
the hierarchy. The categorization training tool can generate flat models as
well as hierarchical ones.
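One common way to assign documents at different levels of a hierarchy is top-down descent that stops when no child category is confident enough. The sketch below shows that idea only; it is not the CategoriX model. The score function is a keyword-overlap stand-in for a learned probabilistic model, and the node layout (name, keywords, children) is an assumption.

```python
def score(node, tokens):
    # Hypothetical stand-in for a learned model: share of tokens matching the
    # node's keyword set.
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in node["keywords"]) / len(tokens)


def classify_top_down(node, tokens, min_confidence=0.5, path=()):
    """Descend the hierarchy while some child scores at least min_confidence.

    Returns the path of category names from the root to the assigned node,
    which may be an internal node rather than a leaf.
    """
    path = path + (node["name"],)
    children = node.get("children", [])
    if not children:
        return path  # reached a leaf category
    scored = [(score(child, tokens), child) for child in children]
    best_score, best_child = max(scored, key=lambda pair: pair[0])
    if best_score < min_confidence:
        return path  # no child is confident enough: stay at this level
    return classify_top_down(best_child, tokens, min_confidence, path)
```

A flat model corresponds to a root whose children are all leaves, so the same routine covers both cases.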
Adaptive Learning
High-confidence documents can be directly added to the training data set to
update the probabilistic models. Low-confidence documents can be checked by a
human. This review may uncover novel topics, in which case the category
system can be re-trained locally (on a subset of the data) and/or incrementally
(using only the additional material).
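The routing step described above might look like the following sketch. The single cut-off value and the function name route are assumptions; the source only distinguishes high-confidence from low-confidence documents.

```python
def route(predictions, high=0.9):
    """Split predictions into auto-accepted training examples and a review queue.

    predictions: iterable of (document, {category: confidence}) pairs
    Returns (auto_accept, review), where auto_accept holds (document, category)
    pairs and review holds the original pairs for a human to check.
    """
    auto_accept, review = [], []
    for doc, confidences in predictions:
        category = max(confidences, key=confidences.get)
        if confidences[category] >= high:
            auto_accept.append((doc, category))  # add to the training data set
        else:
            review.append((doc, confidences))  # low confidence: human check
    return auto_accept, review
```

Documents in the review queue that fit no existing category are the natural candidates for a new topic, which is where local or incremental re-training would come in.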
Intellectual Property Summary
Xerox Intellectual Property includes patents, patent applications, and know-how.
For Licensing Information
To learn more about licensing the Text Categorization and Clustering
technology.