Text Categorization and Clustering
Classify electronic documents based on content, even without a predefined taxonomy.
Technology Description
CategoriX is a software solution that selects categories to which a document belongs. It improves the way users can browse, search or filter information in large document collections.
Using Xerox patented linguistic analysis technologies and machine learning algorithms, CategoriX consists of:
Large Scale Categorization
By using clusters of computers for both training and categorization, our large-scale categorization scales with the number of categories while keeping the same level of accuracy and speed, so that document categorization remains interactive.
Operating Environment
The software is written in Java and can be deployed on multiple platforms, including UNIX, Linux, and Windows. Java runtime 1.4.2 or later is required. Documents can be in any of the following formats: XML, HTML, or plain text.
ClusteriX Technology Description
When a taxonomy does not exist, or fails to represent the current world in a way that is meaningful for the documents one needs to classify, ClusteriX will:
ClusteriX and CategoriX Coupling Scenario
State-of-the-art Hierarchical Performance
CategoriX and ClusteriX employ a hierarchical model that relates categories to each other. This adds an extra dimension resulting in more accurate categorization. Documents may be assigned to categories at different levels in the hierarchy. The categorization training tool can generate flat models as well as hierarchical ones.
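To illustrate how a document can end up at different levels of a category hierarchy, here is a minimal Python sketch of top-down categorization. It is not the CategoriX model: the `Category` class, the keyword-overlap `score` function, the `min_score` cut-off, and the sample taxonomy are all assumptions made for illustration, standing in for the probabilistic models described above.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """Hypothetical taxonomy node: a name, some keywords, and child categories."""
    name: str
    keywords: set
    children: list = field(default_factory=list)


def score(category, tokens):
    """Toy score (an assumption): fraction of document tokens found in the keywords."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in category.keywords) / len(tokens)


def categorize(root, text, min_score=0.2):
    """Walk down the hierarchy, following the best-scoring child at each level.

    The walk stops as soon as no child scores at least `min_score`, so a
    document can be assigned to an internal category rather than a leaf.
    Returns the path of category names from the root to the assigned node.
    """
    tokens = text.lower().split()
    path = [root.name]
    node = root
    while node.children:
        best = max(node.children, key=lambda c: score(c, tokens))
        if score(best, tokens) < min_score:
            break
        path.append(best.name)
        node = best
    return path


# Illustrative taxonomy; the category names and keywords are made up.
taxonomy = Category("all", set(), [
    Category("sports", {"match", "team", "goal", "racket", "serve"}, [
        Category("football", {"goal", "penalty"}),
        Category("tennis", {"racket", "serve"}),
    ]),
    Category("finance", {"stock", "market", "bond"}),
])

print(categorize(taxonomy, "penalty goal"))
print(categorize(taxonomy, "the team won the match with a late goal"))
print(categorize(taxonomy, "stock market rally"))
```

In this toy setup the first document reaches the leaf category "football", while the second stops at the intermediate category "sports" because neither sub-category scores highly enough. A flat model would correspond to a root whose children have no children of their own.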
Adaptive Learning
High-confidence documents can be added directly to the training data set to update the probabilistic models. Low-confidence documents can be checked by a human. This may lead to the discovery of novel topics, in which case the category system can be re-trained locally (on a subset of the data) and/or incrementally (using only the additional material).
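The routing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's implementation: the `Prediction` and `RoutingResult` types, the `route_predictions` function, and the 0.9 threshold are hypothetical, and confidence is assumed to be a probability between 0 and 1.

```python
from dataclasses import dataclass, field

# Hypothetical cut-off; the source does not state how confidence is measured
# or where the boundary between high and low confidence lies.
CONFIDENCE_THRESHOLD = 0.9


@dataclass
class Prediction:
    doc_id: str
    category: str
    confidence: float  # assumed to be a probability in [0, 1]


@dataclass
class RoutingResult:
    auto_accepted: list = field(default_factory=list)  # candidates for the training set
    needs_review: list = field(default_factory=list)   # to be checked by a human


def route_predictions(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Split predictions into auto-accepted and human-review queues."""
    result = RoutingResult()
    for p in predictions:
        if p.confidence >= threshold:
            result.auto_accepted.append(p)
        else:
            result.needs_review.append(p)
    return result


# Made-up predictions for illustration only.
predictions = [
    Prediction("doc-1", "finance", 0.97),
    Prediction("doc-2", "sports", 0.55),
    Prediction("doc-3", "finance", 0.91),
]
routed = route_predictions(predictions)
print([p.doc_id for p in routed.auto_accepted])
print([p.doc_id for p in routed.needs_review])
```

The auto-accepted queue would feed an incremental re-training step, while the review queue is where a human might spot documents that fit no existing category.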
Intellectual Property Summary
Xerox Intellectual Property includes patents, patent applications, and know-how.
For Licensing Information
To learn more about licensing the Text Categorization and Clustering technology.