About the Project

This service provides a similarity search over the image collection. The sole criterion is the similarity of motifs in terms of their external features (colors, textures, shapes and striking contrasts). The image-based similarity search currently offers access to a growing stock of more than 3 million digitized works of the Bavarian State Library (manuscripts, rare books, maps). These works are among the most valuable components of Bavaria's cultural heritage and of the national heritage as well. In total, 57 million images are available.

Bayerische Staatsbibliothek

Digitale Bibliothek / Münchener Digitalisierungszentrum
Ludwigstraße 16, 80539 München
E-Mail: digitale.bibliothek[AT]bsb-muenchen.de

www.bsb-muenchen.de
www.digitale-sammlungen.de

Fraunhofer-Institut für Nachrichtentechnik

Abteilung Videokodierung und Maschinelles Lernen
Einsteinufer 37, 10587 Berlin
Tel: +49 30 31002-0

www.hhi.fraunhofer.de

The Data Basis

With its historical stock of more than 2.4 million digitized books, the Bavarian State Library is one of the most important cultural institutions in the world. This large collection consists mainly of copyright-free works from the 8th to the 20th century with a large variety of content, from handwritten medieval Bibles to 1920s tabloids. The high pace of mass digitization in recent years came at a price: indexing of the content lags behind, especially for works that have not been machine-processed and made accessible by means of Optical Character Recognition (OCR). This applies in particular to medieval manuscripts, early prints and other special collections. As a result, most of the images remained largely hidden from the user and could only be found by manually browsing the pages on screen. For this reason, the Bavarian State Library, together with the Fraunhofer Heinrich Hertz Institute in Berlin, developed this service for similarity-based image search. It automatically identifies the image content in all 2.4 million digitized works.

The Method Used

To enable similarity searches, the stock of digitized books has to be prepared and indexed accordingly. The software automatically identifies and extracts all images from all digitized book pages, using morphological methods. The images are then classified according to their specific color and edge characteristics. Images "without any information value" can be discarded during this process, using methods from the field of machine learning. With this method, more than 43 million individual images have been identified in the BSB's digitized books. They are available directly to the user via this web application. Thanks to the variety and richness of the indexed collections, this service appeals not only to historians and book scholars, but also to anyone interested in a wide range of other disciplines. The similarity search reveals previously unknown, unusual and often surprising connections between different works.
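The text does not say which morphological operations the software applies. As a rough illustration only, the following sketch assumes a binarized page (1 = ink, 0 = background), merges nearby ink pixels with a single 3x3 dilation, and reports the bounding boxes of the large connected regions as candidate images. The function names, the structuring element and the `min_area` threshold are hypothetical and are not taken from the BSB software.

```python
from typing import List, Tuple

Grid = List[List[int]]            # 1 = ink pixel, 0 = background
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), inclusive


def dilate(grid: Grid) -> Grid:
    """Binary dilation with a 3x3 square structuring element."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x]:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out


def bounding_boxes(grid: Grid, min_area: int) -> List[Box]:
    """Find 4-connected regions and keep those with at least min_area pixels."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes: List[Box] = []
    for y in range(h):
        for x in range(w):
            if not grid[y][x] or seen[y][x]:
                continue
            # Flood fill from (y, x), tracking the region's extent and area.
            stack = [(y, x)]
            seen[y][x] = True
            top, left, bottom, right, area = y, x, y, x, 0
            while stack:
                cy, cx = stack.pop()
                area += 1
                top, left = min(top, cy), min(left, cx)
                bottom, right = max(bottom, cy), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if area >= min_area:
                boxes.append((top, left, bottom, right))
    return boxes


def candidate_images(page: Grid, min_area: int = 20) -> List[Box]:
    """Dilate the page, then return boxes of regions large enough to be images."""
    return bounding_boxes(dilate(page), min_area)
```

Because the boxes are computed on the dilated page, each one is up to one pixel larger on every side than the original ink region. A real pipeline would also need to separate illustrations from blocks of text, which this sketch does not attempt.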

The search does not operate on the images themselves, which would not be possible in real time with a stock of about 43 million. Instead, individual image descriptors are used. Descriptors are records that contain the visual information of an image in a highly compressed form. In this case, the descriptor associated with an image is only 96 bytes in size. In addition, a distance function is required that gives the distance between two descriptors. This function is meant to represent the visual difference between two images as faithfully as possible via their descriptors. The similarity function, which returns a value between 0.0 and 1.0, is calculated from the distance function. A value of 0.0 means maximum dissimilarity, a value of 1.0 maximum similarity. The visual descriptor contains information on the color and on the distribution of edge orientations in the image. Because each of the more than 43 million identified images has its own descriptor, more than 43 million comparisons must be performed for every single search query.
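The relationship between descriptor, distance function, similarity function and the exhaustive scan can be sketched as follows. The actual layout of the 96 bytes and the actual distance function are not given in the text, so this sketch assumes a simple mean absolute byte difference as a stand-in. All function names here are hypothetical.

```python
import heapq
from typing import List, Tuple

DESCRIPTOR_SIZE = 96  # bytes per descriptor, as stated in the text


def distance(a: bytes, b: bytes) -> float:
    """Assumed stand-in distance: mean absolute difference per byte (0..255)."""
    if len(a) != DESCRIPTOR_SIZE or len(b) != DESCRIPTOR_SIZE:
        raise ValueError("descriptors must be exactly 96 bytes long")
    return sum(abs(x - y) for x, y in zip(a, b)) / DESCRIPTOR_SIZE


def similarity(a: bytes, b: bytes) -> float:
    """Derive a similarity in [0.0, 1.0] from the distance; 1.0 = most similar."""
    return 1.0 - distance(a, b) / 255.0


def search(query: bytes, index: List[Tuple[str, bytes]], k: int = 3) -> List[Tuple[str, float]]:
    """Exhaustive scan: one comparison per stored descriptor, best k returned."""
    scored = ((image_id, similarity(query, desc)) for image_id, desc in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def index_size_bytes(num_images: int) -> int:
    """Raw size of all descriptors, ignoring identifiers and other overhead."""
    return num_images * DESCRIPTOR_SIZE
```

Under these assumptions, `index_size_bytes(43_000_000)` gives 4,128,000,000 bytes, so the raw descriptors for 43 million images take roughly 4.1 GB. The compact descriptor size is what makes comparing the query against every stored image feasible for each search.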

The Project History

Work on this project started in 2011 with a first prototype implementation covering 250 digitized works of the BSB. After many months of intensive development work, the first service for image similarity search went online in April 2013. At that time, around 4 million individual image segments from 60,000 books were available. Over the following years, this number was increased to 6 million image segments from 80,000 volumes. The current version was developed in 2016 and comprises all digital images from all digitized books of the BSB. It provides more than 43 million images for searching.