About the Project
This service provides a similarity-based image search that matches images by their external characteristics (colors, texture, shapes and pronounced contrasts). The search gives access to a collection of 3 million works from 12 centuries digitized by the Bavarian State Library (manuscripts, rare books, maps). These works rank among the outstanding items of Bavaria's cultural heritage and of the national heritage as well. In total, 57 million images are available.
Fraunhofer-Institut für Nachrichtentechnik
Abteilung Videokodierung und Maschinelles Lernen
Einsteinufer 37, 10587 Berlin
Tel: +49 30 31002-0
www.hhi.fraunhofer.de
The Data Basis
With its historical stock of more than 2.4 million digitized books, the Bavarian State Library is one of the most important cultural institutions in the world. This large collection consists mainly of copyright-free works from the 8th to the 20th century with a wide variety of content, ranging from handwritten medieval Bibles to 1920s tabloids. The high pace of mass digitization in recent years has come at a price: indexing of the content lags behind, especially for works that have not been processed by Optical Character Recognition (OCR) and made accessible that way. This applies in particular to medieval manuscripts, early prints and other special collections. As a result, most of the images remained largely hidden from users and could only be found by manually skimming pages on screen. For this reason, the Bavarian State Library, together with the Fraunhofer Heinrich Hertz Institute in Berlin, developed this service for similarity-based image search. It automatically identifies the image content of all 2.4 million digitized works.
The Method Used
To enable similarity searches, the stock of digitized books must first be processed and indexed accordingly. The software automatically identifies and extracts all images from every digitized book page using morphological methods. The images are then classified according to their specific color and edge characteristics. Images "without any information value" can be discarded during this process using methods from the field of machine learning.
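The text does not say which morphological operations are used. As a rough illustration only, the sketch below assumes a binarized page (1 = ink, 0 = background), applies a 3x3 dilation so that nearby ink pixels merge into solid blobs, and then keeps the bounding boxes of connected regions above a minimum area. The function names, the grid representation and the `min_area` threshold are hypothetical and not taken from the actual software.

```python
from collections import deque
from typing import List, Tuple

Grid = List[List[int]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def dilate(mask: Grid) -> Grid:
    """Binary dilation with a 3x3 structuring element (assumed choice)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:
                            out[rr][cc] = 1
    return out


def find_regions(mask: Grid, min_area: int) -> List[Box]:
    """Return bounding boxes of 4-connected regions with at least min_area pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes: List[Box] = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one region.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((top, left, bottom, right))
    return boxes


# Made-up 10x14 "page": a dotted illustration plus one isolated speck.
page = [[0] * 14 for _ in range(10)]
for r in range(2, 7):
    for c in range(2, 8):
        if (r + c) % 2 == 0:
            page[r][c] = 1
page[8][12] = 1

boxes = find_regions(dilate(page), min_area=20)
print(boxes)
```

In this toy page the dotted pixels are isolated from one another, so without the dilation step no region would reach the area threshold. After dilation they form one blob, while the single speck stays too small and is dropped.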
Using this method, more than 43 million individual images have been identified in the BSB's digitized books. They are directly available to users via this web application. Thanks to the variety and richness of the indexed collections, the service appeals not only to historians and book scholars but also to people interested in a wide range of disciplines. The similarity search reveals previously unknown, unusual and often surprising connections between different works.
The search does not operate on the images themselves. That would not be possible in real time with a stock of about 43 million images. Instead, individual image descriptors are used. Descriptors are records that contain the visual information of an image in a highly compressed form. In this case, the descriptor associated with an image has a size of only 96 bytes and contains information on the color and on the distribution of edge orientations in the image. In addition, a distance function is required that gives the distance between two descriptors. This function should represent the visual difference between two images as faithfully as possible based on their descriptors. The similarity function, which returns a value between 0.0 and 1.0, is calculated from the distance function. The value 0.0 means maximum dissimilarity, the value 1.0 maximum similarity. Because each of the more than 43 million identified images has its own descriptor, more than 43 million comparisons must be performed for every single search query.
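The relationship between descriptor, distance function and similarity function can be sketched as follows. Only the 96-byte descriptor size, the 0.0 to 1.0 similarity range and the one-comparison-per-image scan come from the description above. The L1 (sum of absolute differences) distance, the linear mapping to a similarity value, and all names and sample data are assumptions made for illustration; the actual distance function is not specified here.

```python
from typing import List, Tuple

DESCRIPTOR_SIZE = 96                   # bytes per descriptor, as stated in the text
MAX_DISTANCE = 255 * DESCRIPTOR_SIZE   # largest possible L1 distance between two byte vectors


def distance(a: bytes, b: bytes) -> int:
    """Assumed distance: sum of absolute byte differences (L1)."""
    if len(a) != DESCRIPTOR_SIZE or len(b) != DESCRIPTOR_SIZE:
        raise ValueError("descriptors must be exactly 96 bytes long")
    return sum(abs(x - y) for x, y in zip(a, b))


def similarity(a: bytes, b: bytes) -> float:
    """Map the distance to [0.0, 1.0]: 1.0 = identical, 0.0 = maximally different."""
    return 1.0 - distance(a, b) / MAX_DISTANCE


def search(query: bytes, index: List[Tuple[str, bytes]], top_k: int = 3) -> List[Tuple[str, float]]:
    """Linear scan: compare the query with every stored descriptor exactly once."""
    scored = [(image_id, similarity(query, desc)) for image_id, desc in index]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Tiny hypothetical index with made-up identifiers and descriptors.
index = [
    ("img-a", bytes([10] * DESCRIPTOR_SIZE)),
    ("img-b", bytes([200] * DESCRIPTOR_SIZE)),
    ("img-c", bytes([13] * DESCRIPTOR_SIZE)),
]
query = bytes([11] * DESCRIPTOR_SIZE)

results = search(query, index, top_k=2)
print(results)
```

The compact descriptor is what keeps such a full scan feasible: at 96 bytes each, 43 million descriptors occupy roughly 4.1 GB, which fits in the main memory of a single server.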
The Project History
Work on this project started in 2011 with a first prototype implementation covering 250 digitized works of the BSB. After many months of intensive development work, the first image similarity search service went online in April 2013. At that time, around 4 million individual image segments from 60,000 books were available. Over the following years, this number was increased to 6 million image segments from 80,000 volumes. The current version was developed in 2016 and comprises all digital images from all digitized books of the BSB. It provides more than 43 million images for searching.