On the website www.hetarchief.be, an initiative of VIAA, you can find more than 50,000 digitised newspapers from the First World War. VIAA is continuously working on the accessibility of this digital archive. This blog post describes how we enable researchers to carry out large-scale, semi-automatic searches in this archive by applying Linked Data.
Linked Open Data
More and more organisations are turning to Linked Open Data (LOD) to publish their data. This method makes it possible to link several sources, e.g. a location, a newspaper or a person, to each other, which lets new or less obvious relations surface. Open Data refers to the fact that these sources are published under an open license. On Het Archief, the metadata are released under a CC0 license and are considered open. The content of the papers and the OCR text are not covered by this license (see footnote). In the case of our newspaper archive, the main challenge is to recognise the important entities, such as a well-known person, and to link them to, for example, Wikipedia. With a SPARQL query, researchers can then use information from Wikipedia to compose a collection of newspapers.
Every source, such as a paper, a location or a person, is identified with a URI so that other sources can refer to it. Linked Data uses triples to represent links between information. A triple can be seen as an expression that contains a fixed pattern, consisting of a subject, a predicate and an object. The subject is the respective source, and that subject has a relation, or a predicate, with an object. This object can be another source or have a certain value, e.g. a first name. Take the following fact as an example: “The newspaper with URI <http://data.viaa.be/noid/6d5p844s42_19151214_0002> contains a tag representing Alfred Bastien”. This can be reduced to the following triple:
Subject: <http://data.viaa.be/noid/6d5p844s42_19151214_0002> → the paper in question
Predicate: <http://www.bbc.co.uk/ontologies/creativework#tag> → has a tag of
Object: <http://fr.dbpedia.org/resource/Alfred_Bastien> → Alfred Bastien
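As a minimal sketch, the same triple can be written down in code and serialised as one line of N-Triples, a plain-text notation for RDF. The `Triple` class and the `to_ntriples` helper are names made up for this illustration; they are not part of any VIAA tooling, and the helper only handles triples in which all three parts are URIs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One Linked Data statement; all three parts are URIs in this sketch."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialise a URI-only triple as a single N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# The example from the text: a newspaper page tagged with Alfred Bastien.
tag_triple = Triple(
    subject="http://data.viaa.be/noid/6d5p844s42_19151214_0002",
    predicate="http://www.bbc.co.uk/ontologies/creativework#tag",
    obj="http://fr.dbpedia.org/resource/Alfred_Bastien",
)

print(to_ntriples(tag_triple))
```

An object that holds a plain value, such as a first name, would be written as a quoted literal instead of a URI in angle brackets; that case is left out here.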
In a second step, we would like to add metadata to hetarchief that link to information from an external source. For this we chose DBpedia, which extracts information from Wikipedia and publishes it as Linked Data. To do so, we have extracted well-known persons and place names from the OCR text of each page with DBpedia Spotlight. This service recognises and classifies entities such as names of places, persons and organisations by using Stanford’s Named Entity Recognition (NER) software and returns their DBpedia URIs. Thanks to this tool, we found 33 entities per page on average.
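The step from recognised entities to tag triples could look roughly like the sketch below. It assumes that the annotation result is a JSON object with a `Resources` list whose items carry `@URI` and `@similarityScore` keys, which is how DBpedia Spotlight's JSON output is commonly shaped; treat that shape as an assumption and check it against the service you use. The sample response, the function name and the score threshold are all hypothetical, and this is not a description of VIAA's actual pipeline.

```python
TAG_PREDICATE = "http://www.bbc.co.uk/ontologies/creativework#tag"
PAGE_URI = "http://data.viaa.be/noid/6d5p844s42_19151214_0002"


def tags_from_annotations(page_uri, annotations, min_score=0.5):
    """Turn Spotlight-style annotations into (subject, predicate, object) triples.

    Entities scoring below min_score are skipped, and each entity URI is
    kept only once per page.
    """
    triples = []
    seen = set()
    for resource in annotations.get("Resources", []):
        uri = resource["@URI"]
        score = float(resource.get("@similarityScore", 0.0))
        if score < min_score or uri in seen:
            continue
        seen.add(uri)
        triples.append((page_uri, TAG_PREDICATE, uri))
    return triples


# Hand-written sample in the assumed response shape (not real service output).
sample_response = {
    "Resources": [
        {"@URI": "http://fr.dbpedia.org/resource/Alfred_Bastien",
         "@surfaceForm": "Alfred Bastien", "@similarityScore": "0.99"},
        {"@URI": "http://fr.dbpedia.org/resource/Alfred_Bastien",
         "@surfaceForm": "Bastien", "@similarityScore": "0.91"},
        {"@URI": "http://example.org/resource/Low_Confidence_Match",
         "@surfaceForm": "Match", "@similarityScore": "0.30"},
    ]
}

for triple in tags_from_annotations(PAGE_URI, sample_response):
    print(triple)
```

Filtering on a score and removing duplicates are design choices of this sketch; the text does not say whether or how the 33 entities per page were filtered.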
To build a collection semi-automatically, we use the DBpedia tags generated in step 2. Figure 1 (below) shows an example question that a Triple Pattern Fragments (TPF) client can answer, namely: “Give me all newspapers containing a reference to war painters”. War painters are not part of our metadata, but through Linked Open Data sources we can collect a list of all war painters and, in the same query, check whether they appear in newspapers on hetarchief. You can carry out this query yourself right here.
Example of a TPF client requesting all newspapers referring to war painters. On the left you can see a screenshot, on the right some explanation:
As you can see in the example above, linking data creates new opportunities. We are no longer limited to the metadata stored in our own system: we can use existing metadata sources, such as DBpedia, to run queries against our data set that used to require a lot of manual labour or scripting to answer.
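The idea behind the war painters question can be sketched with two small in-memory lists of triples, one standing in for DBpedia and one for the tags on hetarchief. The `match` function looks up a single triple pattern in which `None` acts as a variable, which is roughly the kind of request a TPF client sends to a server, and the client combines the results itself. The occupation predicate, the `example.org` URIs and the toy data are hypothetical; a real query would use DBpedia's own vocabulary and live data.

```python
TAG = "http://www.bbc.co.uk/ontologies/creativework#tag"
OCCUPATION = "http://example.org/occupation"    # hypothetical predicate
WAR_PAINTER = "http://example.org/WarPainter"   # hypothetical class URI

# Toy stand-in for the external source (DBpedia).
external_source = [
    ("http://fr.dbpedia.org/resource/Alfred_Bastien", OCCUPATION, WAR_PAINTER),
    ("http://example.org/resource/Some_Politician", OCCUPATION,
     "http://example.org/Politician"),
]

# Toy stand-in for the tag triples published for the newspaper archive.
archive = [
    ("http://data.viaa.be/noid/6d5p844s42_19151214_0002", TAG,
     "http://fr.dbpedia.org/resource/Alfred_Bastien"),
    ("http://example.org/noid/other_page", TAG,
     "http://example.org/resource/Some_Politician"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching one pattern; None is a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def newspapers_tagged_with_occupation(occupation):
    """Join the two sources: find the people first, then the pages tagged with them."""
    people = {s for s, _, _ in match(external_source, p=OCCUPATION, o=occupation)}
    pages = set()
    for person in people:
        for page, _, _ in match(archive, p=TAG, o=person):
            pages.add(page)
    return sorted(pages)


print(newspapers_tagged_with_occupation(WAR_PAINTER))
```

The first lookup uses only the external source and the second only the archive; the shared DBpedia URI is what connects the two.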
FOOTNOTE: Metadata, such as the name of a paper or its publication date, can be considered open data. The situation is different for the content of the papers and the OCR text. The bulk of the newspaper articles and illustrations are unsigned. In Belgium, anonymous works remain under copyright until 70 years after publication. That term has long since passed, so the bulk of the collection belongs to the public domain. In exceptional cases, individual copyright might still apply. We cannot guarantee that all sources on News of the Great War are in fact free of copyright and open.