Class TutorialDocumentProcessor

java.lang.Object
org.wikidata.wdtk.examples.TutorialDocumentProcessor
All Implemented Interfaces:
EntityDocumentProcessor

public class TutorialDocumentProcessor extends Object implements EntityDocumentProcessor
This is a simple template for an EntityDocumentProcessor that can be modified to try your own code.

Exercise 1: Just run the code as it is and have a look at the output. It will print a lot of data about item documents to the console, so you can see roughly what the data looks like. Find the data for one item and look up the item on wikidata.org. Find the data that you can see on the Web page in the printout (note that some details might have changed since your local data is based on a dump).

Exercise 2: The code below already counts how many items and properties it processes. Add additional counters to count: (1) the number of labels, (2) the number of aliases, (3) the number of statements, (4) the number of site links. Print this data at the end or write it to a file if you like.

Exercise 3: Extend your code from Exercise 2 to count how many items have a link to English Wikipedia (or another Wikipedia of your choice). The site identifier used in the data for English Wikipedia is "enwiki".

Exercise 4: Building on the code of Exercise 3, count the number of site links for all sites that are linked. Use, for example, a hashmap to store integer counters for each site id you encounter. Print the results to a CSV file and load the file into a spreadsheet application (this can also be an online application such as Google Drive). You can order the data by count and create a diagram. The number of site links should be close to the number of articles in the project.
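The per-site counting described in Exercise 4 can be sketched as follows. This is a minimal, library-free sketch: the class SiteLinkCounter and its methods are hypothetical names, not part of Wikidata Toolkit. In your processor you would call count(...) once per site id found on an item; the site ids are assumed to be the keys of the map returned by ItemDocument.getSiteLinks(), which you should check against your toolkit version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Wikidata Toolkit): counts site links per site id. */
class SiteLinkCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Call once for every site id found on an item, e.g. "enwiki". */
    void count(String siteId) {
        counts.merge(siteId, 1, Integer::sum);
    }

    /** Returns a header line plus one "site,count" line per site, largest count first. */
    List<String> toCsvLines() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Break ties by site id so the output order is deterministic.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> lines = new ArrayList<>();
        lines.add("site,count");
        for (Map.Entry<String, Integer> entry : entries) {
            lines.add(entry.getKey() + "," + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        SiteLinkCounter counter = new SiteLinkCounter();
        // Stand-in data; real site ids would come from the processed item documents.
        for (String siteId : new String[] {"enwiki", "dewiki", "enwiki"}) {
            counter.count(siteId);
        }
        counter.toCsvLines().forEach(System.out::println);
    }
}
```

Sorting in code is optional, since a spreadsheet can order the rows as well; it is done here only to make the output stable.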

Exercise 5: Compute the average life expectancy of people on Wikidata. To do this, consider items with a birth date (P569) and death date (P570). Whenever both dates are found, compute the difference in years between the two dates. Store the sum of these lifespans (in years) and the number of people for whom you recorded a lifespan to compute the average. Some hints:

  • There can be more than one statement for any property, even for date of birth/death (if there are different opinions). For simplicity, just use the first.
  • Dates can be uncertain. This is expressed by their precision, TimeValue.getPrecision(). You should only consider values with a precision greater than or equal to TimeValue.PREC_DAY.
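The bookkeeping for Exercise 5 might look like the sketch below. It is library-free so that it runs on its own: LifespanAverager is a hypothetical helper, and its precision constants are assumed to mirror TimeValue.PREC_DAY and TimeValue.PREC_YEAR (in real code, use the toolkit constants directly). Years and precisions would come from the first P569 and P570 statement values of each item.

```java
/** Hypothetical helper (not part of Wikidata Toolkit): averages lifespans in years. */
class LifespanAverager {
    // Assumption: these values mirror TimeValue.PREC_DAY and TimeValue.PREC_YEAR.
    static final byte PREC_DAY = 11;
    static final byte PREC_YEAR = 9;

    private long sumOfYears = 0;
    private long people = 0;

    /**
     * Records one person if both dates are at least day-precise.
     * Returns true if the lifespan was counted.
     */
    boolean record(long birthYear, byte birthPrecision, long deathYear, byte deathPrecision) {
        if (birthPrecision < PREC_DAY || deathPrecision < PREC_DAY) {
            return false; // too uncertain, skip as the hint suggests
        }
        sumOfYears += deathYear - birthYear;
        people++;
        return true;
    }

    /** Average lifespan in years, or NaN if nobody was recorded. */
    double average() {
        return people == 0 ? Double.NaN : (double) sumOfYears / people;
    }

    public static void main(String[] args) {
        LifespanAverager averager = new LifespanAverager();
        // Made-up sample values, for illustration only.
        averager.record(1900, PREC_DAY, 1976, PREC_DAY);  // 76 years
        averager.record(1800, PREC_DAY, 1884, PREC_DAY);  // 84 years
        averager.record(1500, PREC_YEAR, 1560, PREC_DAY); // skipped: birth only year-precise
        System.out.println(averager.average());           // prints 80.0
    }
}
```

Subtracting the years ignores whether the person died before or after their birthday in the final year; for an average over many people this simplification is usually acceptable.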

Exercise 6: Compute the average lifespan as in Exercise 5, but now grouped by year of birth. This will show you how life expectancy changed over time (at least for people with Wikipedia articles). For this, create arrays or maps to store the sum of the lifespans and the number of people for each year of birth. Finally, compute all the averages and store them in a CSV file that gives the average life expectancy for each year of birth. Load this file into a spreadsheet as well to create a diagram. What do you notice? Some hints:

  • An easy way to store the numbers you need for each year of birth is to use an array where the year is the index. This is possible here since you know that years should be in a certain range. You could also use a HashMap, of course, but sorting by key is more work in this case.
  • The data can contain errors. If you see strange effects in the results, maybe you need to filter some unlikely cases.
  • To get a smooth trend for life expectancy, you need to have at least a few people for every year of birth. It might be a good idea to consider only people born after the year 1200 to make sure that you have enough precise data.
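The hints above can be combined into a sketch like the following. It is a library-free illustration, not the tutorial's solution: LifespanByBirthYear is a hypothetical class, and the bounds LAST_YEAR and MAX_LIFESPAN are assumptions chosen only to keep the array small and to filter implausible data. Precision checks from Exercise 5 are assumed to have been applied before record(...) is called.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Wikidata Toolkit): average lifespan per year of birth. */
class LifespanByBirthYear {
    static final int FIRST_YEAR = 1200;  // lower bound suggested by the hint
    static final int LAST_YEAR = 2100;   // assumption: generous upper bound
    static final int MAX_LIFESPAN = 130; // assumption: filter for erroneous data

    // Index 0 corresponds to FIRST_YEAR, so the birth year (minus offset) is the index.
    private final long[] sums = new long[LAST_YEAR - FIRST_YEAR + 1];
    private final int[] counts = new int[LAST_YEAR - FIRST_YEAR + 1];

    /** Records one lifespan; returns false if the values were filtered out. */
    boolean record(long birthYear, long deathYear) {
        long lifespan = deathYear - birthYear;
        if (birthYear < FIRST_YEAR || birthYear > LAST_YEAR
                || lifespan < 0 || lifespan > MAX_LIFESPAN) {
            return false;
        }
        int index = (int) (birthYear - FIRST_YEAR);
        sums[index] += lifespan;
        counts[index]++;
        return true;
    }

    /** Returns CSV lines in ascending order of birth year, skipping years without data. */
    List<String> toCsvLines() {
        List<String> lines = new ArrayList<>();
        lines.add("birth_year,people,average_lifespan");
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                double average = (double) sums[i] / counts[i];
                lines.add((FIRST_YEAR + i) + "," + counts[i] + "," + average);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        LifespanByBirthYear stats = new LifespanByBirthYear();
        // Made-up sample values, for illustration only.
        stats.record(1900, 1980);
        stats.record(1900, 1990);
        stats.record(1950, 2020);
        stats.toCsvLines().forEach(System.out::println);
    }
}
```

Including the number of people per year in the CSV makes it easy to spot years where the average rests on very few data points.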
Author:
Markus Kroetzsch
  • Constructor Details

    • TutorialDocumentProcessor

      public TutorialDocumentProcessor()
  • Method Details

    • processItemDocument

      public void processItemDocument(ItemDocument itemDocument)
      Processes one item document. This is often the main workhorse that gathers the data you are interested in. You can modify this code as you wish.
      Specified by:
      processItemDocument in interface EntityDocumentProcessor
      Parameters:
      itemDocument - the ItemDocument
    • processPropertyDocument

      public void processPropertyDocument(PropertyDocument propertyDocument)
      Processes one property document. Property documents mainly tell you the name and datatype of properties. It can be useful to process all properties first to store data about them that is useful when processing items. There are not very many properties (about 1100 as of August 2014), so it is safe to store all their data for later use.
      Specified by:
      processPropertyDocument in interface EntityDocumentProcessor
      Parameters:
      propertyDocument - the PropertyDocument
    • storeResults

      public void storeResults()
      Stores the processing results in a file. CSV (comma separated values) is a simple format that makes sense for such tasks. It can be imported easily into spreadsheet tools to generate diagrams from the data.
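Writing such a file needs nothing beyond the Java standard library. The sketch below is one possible approach, not the actual body of storeResults(): CsvResultWriter, toRow and write are hypothetical names, and the quoting rule shown is the common convention of wrapping a field in double quotes when it contains a comma, quote or line break.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Wikidata Toolkit): writes result rows as CSV. */
class CsvResultWriter {

    /** Quotes a field if it contains a comma, a double quote or a line break. */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Joins the escaped fields into one CSV row. */
    static String toRow(String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                row.append(',');
            }
            row.append(escape(fields[i]));
        }
        return row.toString();
    }

    /** Writes one row per line in UTF-8, replacing any existing file. */
    static void write(Path file, List<String> rows) throws IOException {
        Files.write(file, rows, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Made-up rows and file name, for illustration only.
        List<String> rows = Arrays.asList(
                toRow("label", "count"),
                toRow("Paris, Texas", "1"));
        write(Paths.get("tutorial-results.csv"), rows);
    }
}
```

Escaping matters mostly when labels are written to the file, since labels can contain commas; purely numeric columns never need quoting.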