Text Categorization and Clustering

Classify electronic documents based on content, even without a predefined taxonomy

CategoriX is a software solution that selects the categories to which a document belongs. It improves the way users can browse, search, or filter information in large document collections. Using Xerox-patented linguistic analysis technologies and machine learning algorithms, CategoriX consists of:

By taking advantage of clusters of computers for both training and categorization, our large-scale categorization can scale with the number of categories while keeping the same level of accuracy and speed, so that document categorization remains interactive.

The software is written in Java and can be deployed on multiple platforms, including UNIX, Linux, and Windows. Java runtime 1.4.2 or later is required. Documents can be in any of the following formats: XML, HTML, or plain text.

When a taxonomy does not exist, or does not meaningfully reflect the documents that need to be classified, ClusteriX will:

CategoriX and ClusteriX employ a hierarchical model that relates categories to each other. This adds an extra dimension, resulting in more accurate categorization. Documents may be assigned to categories at different levels in the hierarchy. The categorization training tool can generate flat models as well as hierarchical ones.

High-confidence documents can be added directly to the training data set to update the probabilistic models. Low-confidence documents can be checked by a human. This review may lead to the discovery of novel topics, in which case the category system can be retrained locally (on a subset of the data) and/or incrementally (using only the additional material).

CategoriX and ClusteriX are available for commercial technology licensing (including OEM). Maintenance, technical support, customization, and integration services can be provided on a contracted basis. To learn more about licensing the CategoriX and ClusteriX technologies, contact Xerox.

An easy-to-print one-page PDF version of the Xerox text categorization and clustering technology description is also available.
© 2001-2010 XEROX CORPORATION. All rights reserved.