
Historical Newspaper Archive published as Linked Open Data

09/05/2017

On www.hetarchief.be, an initiative of VIAA, you can find more than 50,000 digitised newspapers from the First World War. VIAA is continuously working to make this digital archive more accessible. This blog post describes how we use Linked Data to let researchers carry out large-scale, semi-automatic searches in the archive.

Linked Open Data

More and more organisations are turning to Linked Open Data (LOD) to publish their data. Linked Data makes it possible to link several sources, e.g. a location, a newspaper or a person, to each other, which lets new or less obvious relations surface. Open Data means that these sources are published under an open license. On Het Archief, the metadata are released under a CC0 license and are considered open. The content of the papers and the OCR text are not covered by this license (see footnote). For our newspaper archive, the main challenge is to recognise important entities, such as a well-known person, and link them to an external source such as Wikipedia. With a SPARQL query, researchers can then use information from Wikipedia to compose a collection of newspapers.

Every source, such as a paper, a location or a person, is identified by a URI so that other sources can refer to it. Linked Data uses triples to represent links between pieces of information. A triple is a statement with a fixed pattern: a subject, a predicate and an object. The subject is the source being described, and the predicate is the relation that this subject has with the object. The object can be another source or a plain value, e.g. a first name. Take the following fact as an example: “The newspaper with URI <http://data.viaa.be/noid/6d5p844s42_19151214_0002> contains a tag representing Alfred Bastien”. This can be reduced to the following triple:
Subject:   <http://data.viaa.be/noid/6d5p844s42_19151214_0002>   → the paper in question
Predicate: <http://www.bbc.co.uk/ontologies/creativework#tag>    → has a tag of
Object:    <http://fr.dbpedia.org/resource/Alfred_Bastien>       → Alfred Bastien
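To make the subject/predicate/object pattern concrete, here is a minimal Python sketch that holds the triple above in a small data structure and writes it out as one N-Triples line. The `Triple` class and the `to_ntriples` helper are illustrative names, not part of VIAA's code, and the helper assumes all three terms are URIs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object (all URIs in this sketch)."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialise a URI-only triple as a single N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# The example fact from the text: this newspaper page is tagged with Alfred Bastien.
tag_triple = Triple(
    subject="http://data.viaa.be/noid/6d5p844s42_19151214_0002",
    predicate="http://www.bbc.co.uk/ontologies/creativework#tag",
    obj="http://fr.dbpedia.org/resource/Alfred_Bastien",
)

print(to_ntriples(tag_triple))
```

An object that is a plain value, such as a first name, would be written as a quoted literal instead of a URI in angle brackets. That case is left out of this sketch.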

How we published Belgian war papers as Linked Open Data


Step 1: convert existing information

VIAA’s media asset management (MAM) system contains valuable metadata such as title, organisation and date. This raw data was collected and converted to the Turtle format. The code of the conversion script is available as open source on GitHub. For the semantic descriptions we used, among others, the BBC Creative Work Ontology. Its Creative Work class has properties such as title, tag and date of creation, which also apply to the hetarchief database. The Optical Character Recognition (OCR) text and the scan of each page are, however, not available as Open Data, as they may still be under copyright. All other metadata are freely available as Open Data under CC0.
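As an illustration of this kind of conversion, the sketch below turns one metadata record into a Turtle snippet. It is not the actual VIAA script. The record fields (`uri`, `title`, `date`), the sample title and the property names `cwork:title` and `cwork:dateCreated` are assumptions made for the example. The namespace is taken from the predicate URI shown earlier in this post.

```python
# Namespace as it appears in the tag predicate quoted in this post.
CWORK = "http://www.bbc.co.uk/ontologies/creativework#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for a Turtle string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def record_to_turtle(record: dict) -> str:
    """Describe one record as a Creative Work with a title and a creation date."""
    lines = [
        f"@prefix cwork: <{CWORK}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        f"<{record['uri']}> a cwork:CreativeWork ;",
        f'    cwork:title "{escape_literal(record["title"])}" ;',
        f'    cwork:dateCreated "{record["date"]}"^^xsd:date .',
    ]
    return "\n".join(lines)


# Hypothetical record: the title is made up, the date is read off the URI.
sample_record = {
    "uri": "http://data.viaa.be/noid/6d5p844s42_19151214_0002",
    "title": 'Example "newspaper" title',
    "date": "1915-12-14",
}

print(record_to_turtle(sample_record))
```

Building the text by hand keeps the sketch free of dependencies. A real conversion would more likely use an RDF library to handle escaping and serialisation.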

Step 2: link with an external source

In a second step, we added metadata to hetarchief that link it to information from an external source. For this we chose DBpedia, which extracts information from Wikipedia and publishes it as Linked Data. We extracted well-known persons and place names from the OCR text of each page with DBpedia Spotlight. This service recognises and classifies entities such as names of places, persons and organisations by using Stanford’s Named Entity Recognition (NER) software and returns their DBpedia URIs. With this tool, we found 33 entities per page on average.
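The step from annotation to tag can be sketched as follows. The example assumes that the annotation service returns JSON with a `Resources` list whose items carry the entity URI under an `@URI` key. The sample response, the second entity URI and the helper name `tags_from_annotation` are made up for illustration. No network call is made.

```python
CWORK_TAG = "http://www.bbc.co.uk/ontologies/creativework#tag"


def tags_from_annotation(page_uri: str, annotation: dict) -> list[tuple[str, str, str]]:
    """Turn the entities recognised on one page into (page, tag, entity) triples.

    An entity mentioned several times on the page yields a single tag.
    """
    entity_uris = {resource["@URI"] for resource in annotation.get("Resources", [])}
    return [(page_uri, CWORK_TAG, uri) for uri in sorted(entity_uris)]


# Hypothetical response for one page (assumed shape, not real service output).
sample_annotation = {
    "Resources": [
        {"@URI": "http://fr.dbpedia.org/resource/Alfred_Bastien", "@surfaceForm": "Alfred Bastien"},
        {"@URI": "http://example.org/resource/Some_Place", "@surfaceForm": "Some Place"},
        {"@URI": "http://fr.dbpedia.org/resource/Alfred_Bastien", "@surfaceForm": "Bastien"},
    ]
}

page = "http://data.viaa.be/noid/6d5p844s42_19151214_0002"
for triple in tags_from_annotation(page, sample_annotation):
    print(triple)
```

Deduplicating per page is a design choice of this sketch. Whether the figure of 33 entities per page counts repeated mentions is not stated here.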

Step 3: Triple Pattern Fragments

An important challenge with Linked Open Data is actually using it. One object, in this case a newspaper, will contain dozens of links. These, in turn, link to dozens of other objects. Queries therefore quickly become complex and demand a lot of server resources. To tackle this, the data can be published in a specific way. We published our triples through a Triple Pattern Fragments (TPF) interface. The idea is that the server only answers HTTP GET requests for a single triple pattern. The matching triples are called a fragment and can be cached, so they require few resources from the infrastructure. The simple interface makes it possible to build intelligent clients, such as a web application. Other organisations, like DBpedia and Wikidata, also publish through such a TPF interface, so a client can answer questions across all of these data sources.
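The sketch below shows the two sides of this idea in Python. `fragment_url` builds the GET request a client would send for one triple pattern, and `match_fragment` shows what the server has to do to answer it. The endpoint URL is a placeholder, and the query parameter names `subject`, `predicate` and `object` are an assumption about how the interface is exposed. A real client would discover them from the server's response.

```python
from urllib.parse import urlencode


def fragment_url(endpoint, subject=None, predicate=None, obj=None):
    """Build the GET URL for one triple pattern; None means 'any value'."""
    pattern = (("subject", subject), ("predicate", predicate), ("object", obj))
    params = {name: value for name, value in pattern if value is not None}
    return f"{endpoint}?{urlencode(params)}" if params else endpoint


def match_fragment(triples, subject=None, predicate=None, obj=None):
    """Server side: return every triple that fits the pattern."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


CWORK_TAG = "http://www.bbc.co.uk/ontologies/creativework#tag"
store = [
    ("http://data.viaa.be/noid/6d5p844s42_19151214_0002", CWORK_TAG,
     "http://fr.dbpedia.org/resource/Alfred_Bastien"),
    ("http://example.org/paper/other", CWORK_TAG,
     "http://example.org/resource/Some_Place"),
]

# "Which papers are tagged with Alfred Bastien?" as a single triple pattern.
print(fragment_url("https://example.org/fragments", predicate=CWORK_TAG,
                   obj="http://fr.dbpedia.org/resource/Alfred_Bastien"))
print(match_fragment(store, predicate=CWORK_TAG,
                     obj="http://fr.dbpedia.org/resource/Alfred_Bastien"))
```

Because every request is a plain URL for one pattern, identical requests can be served from an ordinary HTTP cache, which is where the low server load comes from.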

Application: build a newspaper collection

To build a collection semi-automatically, we use the DBpedia tags generated in step 2. Figure 1 (below) shows an example question a TPF client can answer, namely: “Give me all newspapers containing a reference to war painters”. War painters are not part of our metadata, but through Linked Open Data sources we can collect a list of all war painters and, in the same query, check whether they appear in newspapers on hetarchief. You can carry out this query yourself right here.

Figure 1: Example of a TPF client requesting all newspapers referring to war painters (screenshot on the left, explanation on the right).

As you can see in the example above, linking data creates new opportunities. We are no longer limited to the metadata stored in our own system: we can use existing metadata sources, such as DBpedia, to run a query on our data set that used to require a lot of manual labour or scripting to answer.
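The war painter question comes down to joining two triple patterns that live in different sources. The following Python sketch imitates that join on two small in-memory lists. The category URI, the use of `dcterms:subject` to mark someone as a war painter, and all data except the Alfred Bastien tag are assumptions for the sake of the example. A real client would fetch these fragments over HTTP.

```python
DCT_SUBJECT = "http://purl.org/dc/terms/subject"
CWORK_TAG = "http://www.bbc.co.uk/ontologies/creativework#tag"
WAR_PAINTERS = "http://example.org/category/War_painters"  # hypothetical category

# Stand-in for the external source (DBpedia in the text).
external_source = [
    ("http://fr.dbpedia.org/resource/Alfred_Bastien", DCT_SUBJECT, WAR_PAINTERS),
    ("http://example.org/resource/Some_Politician", DCT_SUBJECT,
     "http://example.org/category/Politicians"),
]

# Stand-in for the newspaper archive with its generated tags.
archive = [
    ("http://data.viaa.be/noid/6d5p844s42_19151214_0002", CWORK_TAG,
     "http://fr.dbpedia.org/resource/Alfred_Bastien"),
    ("http://example.org/paper/other", CWORK_TAG,
     "http://example.org/resource/Some_Politician"),
]


def select(triples, s=None, p=None, o=None):
    """Return the triples matching one pattern; None matches anything."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def newspapers_about(category):
    """Pattern 1: who is in the category? Pattern 2: which papers tag them?"""
    people = {s for s, _, _ in select(external_source, p=DCT_SUBJECT, o=category)}
    papers = set()
    for person in people:
        papers.update(s for s, _, _ in select(archive, p=CWORK_TAG, o=person))
    return sorted(papers)


print(newspapers_about(WAR_PAINTERS))
```

The archive itself never needs to know who is a war painter. It only has to answer the second pattern for each person found in the first.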

NER software like Stanford’s only makes a best effort to recognise entities. The remaining errors are often due to poor OCR quality and to the difference in time period between the newspapers and present-day sources. A next step would be to set up a crowdsourcing tool that enables others to enter their own corrections.

Want to know more about VIAA’s Linked Data projects? Take a look at this project.

The linked dataset can be queried here.


FOOTNOTE: Metadata, such as the name of a paper or its publication date, can be considered open data. The situation is different for the content of the papers and the OCR text. The bulk of the newspaper articles and illustrations are unsigned. In Belgium, anonymous works are under copyright until 70 years after publication. That term has long since passed, so the bulk of the collection belongs to the public domain. In exceptional cases, individual copyright might still apply. We cannot guarantee that all sources on News of the Great War are in fact free of copyright and open.

Questions about this project?
Please contact Matthias Priem.
