Generic Visual Categorizer

Automatically classify images based on visual content.

Technology Description

Generic visual categorization (GVC) assigns one or multiple labels to an image based on its semantic content. "Generic" highlights the goal of classifying a wide range of objects and scenes.

Xerox has developed a technique that is sufficiently generic to work with several object types simultaneously and which can be readily extended to new object types. It can also handle variations in view, imaging, lighting and occlusion (partial visibility), typical of the real world, as well as the intra-class variations typical of semantic classes of everyday objects (e.g. size, shape, form, color).

Xerox’s breakthrough GVC technology is the result of combining Xerox scientists' expertise in image processing, computer vision and machine learning.

The Xerox visual categorization method first extracts and describes patches found in an image. It then maps the patch descriptors to "visual vocabularies", which are sets of predetermined clusters of patches called "visual words".

Visual vocabularies are learned automatically from training sets and provide an intermediate representation (hidden layer) bridging the semantic gap between the low-level features extracted from an image and the high-level concepts to be categorized.
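As a rough illustration of this intermediate representation, the sketch below assigns each patch descriptor to its nearest visual word and counts the occurrences. The function names, the 2-D toy descriptors and the nearest-word (hard) assignment are assumptions made for the example; the description above does not specify how Xerox extracts patches or assigns them to words.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to a patch descriptor."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(vocabulary):
        # Squared Euclidean distance between descriptor and word centre.
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def visual_word_histogram(descriptors, vocabulary):
    """Count visual-word occurrences and normalize the counts to sum to 1."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)

# Toy data (hypothetical): three 2-D "visual words", four patch descriptors.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.1), (0.9, 0.2), (1.1, -0.1), (0.2, 0.8)]
histogram = visual_word_histogram(descriptors, vocabulary)
print(histogram)
```

The resulting histogram is the fixed-length vector a classifier would see, whatever the number of patches in the image.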

Since one universal vocabulary made of the most frequent visual words across all the considered classes is not sufficient, Xerox borrowed a technique from speech recognition known as "vocabulary adaptation" to derive class-specific vocabularies from the universal vocabulary.

For each class, an image is characterized by a histogram of visual word occurrences, which determines whether the image content is best modeled by the universal vocabulary or the corresponding class vocabulary.
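One way to picture the per-class histogram is to merge the class vocabulary with the universal vocabulary and see where the descriptors land. The sketch below is only an illustration under stated assumptions: the adaptation step itself is not implemented (the "adapted" class words are made-up toy values), descriptors are assigned to their nearest word, and `bipartite_histogram` and `class_mass` are hypothetical names.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bipartite_histogram(descriptors, class_vocabulary, universal_vocabulary):
    """Histogram over the class words followed by the universal words."""
    merged = list(class_vocabulary) + list(universal_vocabulary)
    counts = [0] * len(merged)
    for descriptor in descriptors:
        nearest = min(range(len(merged)),
                      key=lambda i: squared_distance(descriptor, merged[i]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(merged)

def class_mass(histogram, class_vocabulary_size):
    """Fraction of descriptors better explained by the class vocabulary."""
    return sum(histogram[:class_vocabulary_size])

# Toy vocabularies (hypothetical values, not learned from data).
universal = [(0.0, 0.0), (4.0, 4.0)]
cat_vocabulary = [(0.5, 0.5), (3.0, 4.0)]  # stands in for adapted words
descriptors = [(0.6, 0.4), (3.1, 3.9), (4.2, 4.1), (0.4, 0.6)]

histogram = bipartite_histogram(descriptors, cat_vocabulary, universal)
print(histogram, class_mass(histogram, len(cat_vocabulary)))
```

A large share of the mass on the class half suggests the image is better modeled by the class vocabulary; a classifier trained on such histograms would make the final decision.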


Advantages of the Xerox method include simplicity, computational efficiency, scalability, robustness to variations, and applicability to all types of classes and objects. Presently, GVC has been trained for about 100 categories. Rigorous tests involving more than 30 simultaneous categories have demonstrated state-of-the-art categorization performance:
  • Classification run-time of 0.2 to 0.5 sec per image (depending on processor performance)
  • ~0.1 msec computational increment per added class
  • Equal error rate ranging from 2% to 10% (depending on class, independent of number of classes)
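
The equal error rate quoted above is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be approximated from per-image scores follows; the scores and labels are invented for illustration and the function name is an assumption, not part of GVC.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumes at least one positive and one negative example.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # False rejection: positives scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        # False acceptance: negatives scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Invented classifier scores for one class (1 = image belongs to the class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)
print(eer)
```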


Applications

  • Automatic tagging of images (e.g., images in documents, photographic archives, consumer photo albums, online shopping catalogues)
  • Content-based image retrieval

Generic Visual Categorizer Training Tool

Train your own Visual Categorizer from a collection of tagged images

Presently, GVC has been trained for about 100 categories. To meet customers' needs beyond the current coverage, Xerox has developed a beta version of the GVC Training Tool. Its graphical user interface is simple enough to be used by any holder of tagged images, yet sophisticated enough to give performance feedback and support an iterative training process.

Tagging of the Training Material

The training material is a collection of tagged images. There is no need to associate tags with corresponding regions in the image (no segmentation). The collection should be diverse and representative of what the visual categorizer is expected to recognize at run time. An image in the training set can be labelled with multiple tags. Training iterations are likely to reveal wrong and missing labels, thus giving the user a chance to improve performance.

Training Iteration Settings

Starting from such a tagged collection, the user can define the following settings:
  • Subset: models can be trained for a subset of all categories
  • Class aggregation: classes can be aggregated at training time. For example, if the training dataset includes distinct labels for "cats" and "dogs", the user may prefer to combine them as "pets"
  • Class-to-class neutrality: when two categories are semantically close to each other (e.g. Forest / Trees), the user can instruct the Training Tool to neutralize some tags during training. By default, images tagged with anything other than "Forest" are negative examples of the "Forest" class. It is wise to treat "Trees" images as neutral examples vis-à-vis the "Forest" category, so that they count neither for nor against it
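
The aggregation and neutrality settings above can be pictured as a rule that turns an image's tags into a training target for one class. The sketch below is an assumption about how such a rule might look, not the Training Tool's actual logic; `training_target` and its parameters are hypothetical names.

```python
def training_target(image_tags, target_class, aggregation=None, neutral=None):
    """Return +1 (positive), 0 (neutral, left out of training) or -1 (negative)."""
    aggregation = aggregation or {}
    neutral = neutral or {}
    # Class aggregation: map raw tags to their aggregated class names.
    classes = {aggregation.get(tag, tag) for tag in image_tags}
    if target_class in classes:
        return 1
    # Class-to-class neutrality: some tags are neither positive nor negative.
    if classes & neutral.get(target_class, set()):
        return 0
    return -1

aggregation = {"cats": "pets", "dogs": "pets"}
neutral = {"Forest": {"Trees"}}

targets = [
    training_target({"cats"}, "pets", aggregation, neutral),
    training_target({"Forest"}, "Forest", aggregation, neutral),
    training_target({"Trees"}, "Forest", aggregation, neutral),
    training_target({"dogs"}, "Forest", aggregation, neutral),
]
print(targets)
```

Training on a subset of categories then amounts to calling this rule only for the selected target classes.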

Training Process

A progress bar helps visualize the process steps. Some intermediate results are saved for later reuse. Training time depends on the computer hardware (processor, memory, hard disk). Training 40 categories from 30,000 labelled pictures is completed in a few hours on a Pentium® 4 or Athlon™ 64 processor computer.

Interactive Performance Feedback

The Training Tool can measure the performance of the models just trained using N-fold cross-validation, where N is set by the user. The training set is randomly split into N subsets of equal size; each subset in turn is held out of the training data and used as test data for models trained on the remaining subsets. Performance can be visualized in several ways, including precision, recall, F1, accuracy and a confusion matrix.
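
The procedure can be sketched in a few lines. The code below is a generic illustration of N-fold splitting and of precision, recall and F1, with assumed function names and toy predictions; it does not reproduce the Training Tool's implementation.

```python
import random

def n_fold_splits(num_items, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    # Folds are equal in size when num_items is divisible by n_folds,
    # and differ by at most one item otherwise.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out

def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 for binary predictions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

splits = list(n_fold_splits(num_items=10, n_folds=5))

# Toy predictions for one held-out fold (1 = class present).
predicted = [1, 1, 0, 1, 0]
actual = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(predicted, actual)
print(len(splits), precision, recall, f1)
```

Every image is used as test data exactly once, so the measures are obtained on images the corresponding models did not see during training.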

Performance Measure (precision & recall)

The user can easily correct gaps in the tagging. Color coding indicates how each image was tagged, while the scores come from the models just trained. Ranking the images by score and looking at the highest and lowest ones helps to identify mismatches between tags and scores, which can then be corrected.
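
The ranking idea can be illustrated as follows: untagged images with the highest scores may be missing a tag, and tagged images with the lowest scores may carry a wrong one. This is a minimal sketch with invented file names and scores; `suspicious_tags` is a hypothetical helper, not a Training Tool function.

```python
def suspicious_tags(scored_images, top_k=2):
    """scored_images: list of (image_id, score, is_tagged) for one class.

    Returns (possibly_missing, possibly_wrong) lists of image ids.
    """
    ranked = sorted(scored_images, key=lambda item: item[1], reverse=True)
    # High score but no tag: the tag may be missing.
    possibly_missing = [img for img, _, tagged in ranked if not tagged][:top_k]
    # Low score despite a tag: the tag may be wrong.
    possibly_wrong = [img for img, _, tagged in reversed(ranked) if tagged][:top_k]
    return possibly_missing, possibly_wrong

# Invented scores for a single class.
scored = [
    ("a.jpg", 0.95, True),
    ("b.jpg", 0.90, False),
    ("c.jpg", 0.60, True),
    ("d.jpg", 0.15, True),
    ("e.jpg", 0.05, False),
]
missing, wrong = suspicious_tags(scored, top_k=1)
print(missing, wrong)
```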

Assistance in Tagging

Based on the performance measures, the user can decide on appropriate next steps, which may include:
  • Remove, aggregate or redefine certain categories
  • Add, remove or edit some tags or images
  • Neutralize certain tags vis-à-vis certain categories
  • Launch a new training iteration
  • Import the trained models into the GVC run-time

Intellectual Property Summary

Xerox Intellectual Property includes patents, patent applications, and know-how.

For Licensing Information

To learn more about licensing the Generic Visual Categorizer technology.