Class DumpProcessingController

java.lang.Object
org.wikidata.wdtk.dumpfiles.DumpProcessingController

public class DumpProcessingController extends Object
A class for controlling the processing of dump files through a unified interface. The settings of the controller specify how dump files should be fetched and processed.

The methods for registering listeners to process dump files that contain revisions are registerMwRevisionProcessor(MwRevisionProcessor, String, boolean) and registerEntityDocumentProcessor(EntityDocumentProcessor, String, boolean).

For processing the content of wiki pages, there are two modes of operation: revision-based and entity-document-based. The former is used when processing dump files that contain revisions. These hold detailed information about each revision (revision number, author, time, etc.) that could be used by revision processors.

The entity-document-based operation is used when processing simplified dumps that contain only the content of the current (entity) pages of a wiki. In this case, no additional information is available and only the entity document processors are called (since we have no revisions). Both modes use the same entity document processors. In revision-based runs, it is possible to restrict some entity document processors to certain content models only (e.g., to process only properties). In entity-document-based runs, this is ignored and all entity document processors get to see all the data.
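
The interplay between the two modes and the content-model restriction can be pictured with a small toy model. This is only a sketch of the behaviour described above, not Wikidata Toolkit code: the class ModeSketch, the nested Registration type, and the content-model strings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the two processing modes; not Wikidata Toolkit code. */
class ModeSketch {

    /** Hypothetical record of one registered entity document processor. */
    static final class Registration {
        final String name;   // label used only in this sketch
        final String model;  // content model restriction, or null for "any"

        Registration(String name, String model) {
            this.name = name;
            this.model = model;
        }
    }

    /**
     * Returns the names of the processors that would see a document.
     *
     * @param revisionBased true for a revision-based run, false for an
     *        entity-document-based run
     * @param documentModel content model of the page; only known in
     *        revision-based runs
     */
    static List<String> notified(List<Registration> registrations,
            boolean revisionBased, String documentModel) {
        List<String> result = new ArrayList<>();
        for (Registration r : registrations) {
            // The restriction only applies when revision data is available.
            boolean restricted = revisionBased && r.model != null;
            if (!restricted || r.model.equals(documentModel)) {
                result.add(r.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Registration> regs = new ArrayList<>();
        regs.add(new Registration("all", null));
        regs.add(new Registration("propertiesOnly", "wikibase-property"));

        // Revision-based run over an item page: only "all" is notified.
        System.out.println(notified(regs, true, "wikibase-item"));
        // Entity-document-based run: the restriction is ignored.
        System.out.println(notified(regs, false, null));
    }
}
```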

The methods for revision-based processing of selected dump files (which first determine the relevant files and download them) are processAllRecentRevisionDumps() and processMostRecentMainDump(). JSON dumps, which contain no revisions, are processed with processMostRecentJsonDump().

To extract the most recent sitelinks information, the method getSitesInformation() can be used. To get information about the revision dump files that the main methods will process, one can use getWmfDumpFileManager() to get access to the underlying dump file manager, which can be used to get access to dump file data.

The controller will also catch exceptions that may occur when trying to download and read dump files. They will be turned into logged errors.

Author:
Markus Kroetzsch
  • Constructor Details

    • DumpProcessingController

      public DumpProcessingController(String projectName)
      Creates a new DumpProcessingController for the project of the given name. By default, the dump file directory will be assumed to be in the current directory and the object will access the Web to fetch the most recent files.
      Parameters:
      projectName - Wikimedia project name, e.g., "wikidatawiki" or "enwiki"
  • Method Details

    • setDownloadDirectory

      public void setDownloadDirectory(String downloadDirectory) throws IOException
      Sets the directory where dump files are stored locally. If it does not exist yet, this directory will be created. Dump files will later be stored in a subdirectory "dumpfiles", but this will only be created when needed.
      Parameters:
      downloadDirectory - the download base directory
      Throws:
      IOException - if the existence of the directory could not be checked or if it did not exist and could not be created either
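
      The directory handling described here (create the base directory eagerly, leave the "dumpfiles" subdirectory for later) can be sketched with the standard java.nio.file API. The class and method names below are hypothetical, and the sketch does not claim to show how the controller is implemented.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch of the documented directory behaviour; not Wikidata Toolkit code. */
class DownloadDirectorySketch {

    /**
     * Ensures that the base directory exists and returns the path where dump
     * files would later be stored. The subdirectory itself is not created.
     *
     * @throws IOException if the base directory cannot be created, or if the
     *         path exists but is not a directory
     */
    static Path prepareBaseDirectory(Path base) throws IOException {
        // Creates missing parent directories too; does nothing if the
        // directory already exists.
        Files.createDirectories(base);
        // Only the path is computed here; creation is deferred until needed.
        return base.resolve("dumpfiles");
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempDirectory("wdtk-sketch");
        Path dumpfiles = prepareBaseDirectory(temp.resolve("data"));
        System.out.println("Dump files would go to: " + dumpfiles);
        System.out.println("Subdirectory exists yet? " + Files.exists(dumpfiles));
    }
}
```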
    • setOfflineMode

      public void setOfflineMode(boolean offlineModeEnabled)
      Disables or enables Web access.
      Parameters:
      offlineModeEnabled - if true, all Web access is disabled and only local files will be processed
    • setPropertyFilter

      public void setPropertyFilter(Set<PropertyIdValue> propertyFilter)
      Sets a property filter. If given, all data will be preprocessed to contain only statements for the given (main) properties.
      Parameters:
      propertyFilter - set of properties that should be retained (can be empty)
    • setSiteLinkFilter

      public void setSiteLinkFilter(Set<String> siteLinkFilter)
      Sets a site link filter. If given, all data will be preprocessed to contain only data for the given site keys.
      Parameters:
      siteLinkFilter - set of siteLinks that should be retained (can be empty)
    • setLanguageFilter

      public void setLanguageFilter(Set<String> languageFilter)
      Sets a language filter. If given, all data will be preprocessed to contain only data for the given languages.
      Parameters:
      languageFilter - set of language codes that should be retained (can be empty)
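
      The three filters share the same idea: data whose key (property, site key, or language code) is not in the given set is dropped before processors see it. The following toy sketch shows that idea on a plain map keyed by language code. It is not the library's filtering code; the name FilterSketch is hypothetical, and two points are assumptions rather than documented facts: that an empty set retains nothing, and that the absence of a filter (modelled as null here) leaves the data untouched.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy illustration of set-based filtering; not Wikidata Toolkit code. */
class FilterSketch {

    /**
     * Returns the entries whose key is contained in the filter.
     *
     * @param dataByKey data indexed by the key that the filter refers to
     * @param filter keys to retain; null is treated as "no filter" (assumption)
     */
    static <V> Map<String, V> retain(Map<String, V> dataByKey, Set<String> filter) {
        if (filter == null) {
            return dataByKey; // assumption: no filter set, keep everything
        }
        Map<String, V> result = new LinkedHashMap<>();
        for (Map.Entry<String, V> entry : dataByKey.entrySet()) {
            if (filter.contains(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("en", "Douglas Adams");
        labels.put("de", "Douglas Adams");
        labels.put("ru", "Дуглас Адамс");

        // Keeps only the English label.
        System.out.println(retain(labels, Collections.singleton("en")));
        // An empty filter keeps nothing in this sketch.
        System.out.println(retain(labels, Collections.<String>emptySet()));
    }
}
```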
    • registerMwRevisionProcessor

      public void registerMwRevisionProcessor(MwRevisionProcessor mwRevisionProcessor, String model, boolean onlyCurrentRevisions)
      Registers an MwRevisionProcessor, which will henceforth be notified of all revisions that are encountered in the dump.

      This is only used when processing dumps that contain revisions. In particular, plain JSON dumps contain no revision information.

      Importantly, the MwRevision that the registered processors will receive is valid only during the execution of MwRevisionProcessor.processRevision(MwRevision), but it will not be permanent. If the data is to be retained permanently, the revision processor needs to make its own copy.

      Parameters:
      mwRevisionProcessor - the revision processor to register
      model - the content model that the processor is registered for; it will only be notified of revisions in that model; if null is given, all revisions will be processed whatever their model
      onlyCurrentRevisions - if true, then the subscriber is only notified of the most current revisions; if false, then it will receive all revisions, current or not
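
      The warning about the limited lifetime of the revision object is easiest to see in a toy example. The sketch below assumes, purely for illustration, that the caller reuses one mutable object for every revision; the documentation only states that the object is not valid after the call returns. ReusedRevision and RevisionSnapshot are hypothetical stand-ins and are not Wikidata Toolkit types.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of why a callback must copy data it wants to keep. */
class RevisionCopySketch {

    /** Hypothetical stand-in for a revision object that the caller reuses. */
    static final class ReusedRevision {
        long revisionId;
        String text;
    }

    /** Immutable snapshot owned by the processor. */
    static final class RevisionSnapshot {
        final long revisionId;
        final String text;

        RevisionSnapshot(ReusedRevision source) {
            this.revisionId = source.revisionId;
            this.text = source.text;
        }
    }

    /** Simulates a parser that overwrites one object for each of three revisions. */
    static void run(List<ReusedRevision> references, List<RevisionSnapshot> copies) {
        ReusedRevision current = new ReusedRevision();
        for (long id = 1; id <= 3; id++) {
            current.revisionId = id;
            current.text = "text of revision " + id;
            // What a processor might do inside its callback:
            references.add(current);                   // unsafe: same object every time
            copies.add(new RevisionSnapshot(current)); // safe: independent copy
        }
    }

    public static void main(String[] args) {
        List<ReusedRevision> references = new ArrayList<>();
        List<RevisionSnapshot> copies = new ArrayList<>();
        run(references, copies);
        // The stored reference now shows the data of the last revision.
        System.out.println("first kept reference: " + references.get(0).revisionId);
        // The snapshot still holds the data of the first revision.
        System.out.println("first kept copy: " + copies.get(0).revisionId);
    }
}
```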
    • registerEntityDocumentProcessor

      public void registerEntityDocumentProcessor(EntityDocumentProcessor entityDocumentProcessor, String model, boolean onlyCurrentRevisions)
      Registers an EntityDocumentProcessor, which will henceforth be notified of all entity documents that are encountered in the dump.

      It is possible to register processors for specific content types and to use either all revisions or only the most current ones. This functionality is only available when processing dumps that contain this information. In particular, plain JSON dumps do not specify content models at all and have only one (current) revision of each entity.

      Parameters:
      entityDocumentProcessor - the entity document processor to register
      model - the content model that the processor is registered for; it will only be notified of revisions in that model; if null is given, all revisions will be processed whatever their model
      onlyCurrentRevisions - if true, then the subscriber is only notified of the most current revisions; if false, then it will receive all revisions, current or not
    • getSitesInformation

      public Sites getSitesInformation() throws IOException
      Processes the most recent dump of the sites table to extract information about registered sites.
      Returns:
      a Sites object that contains the extracted information, or null if no sites dump was available (typically in offline mode without having any previously downloaded sites dumps)
      Throws:
      IOException - if there was a problem accessing the sites table dump or the dump download directory
    • processAllRecentRevisionDumps

      public void processAllRecentRevisionDumps()
      Processes all relevant page revision dumps in order. The registered listeners (MwRevisionProcessor or EntityDocumentProcessor objects) will be notified of all data they registered for.

      Note that this method may not always provide reliable results since single incremental dump files are sometimes missing, even if earlier and later incremental dumps are available. In such a case, processing all recent dumps will miss some (random) revisions, thus reflecting a state that the wiki has never really been in. It might thus be preferable to process only a single (main) dump file without any incremental dumps.
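
      One way to guard against the problem described in this note is to check the available incremental dumps for gaps before relying on the result. The sketch below works on a plain list of dates and assumes one incremental dump per day; how the dates are obtained is left open, and the name DumpGapSketch is hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of a gap check over dump dates; not Wikidata Toolkit code. */
class DumpGapSketch {

    /**
     * Returns the days for which no dump is present, assuming one dump per
     * day between the first and the last date of the list.
     *
     * @param sortedDumpDates dump dates in ascending order, without duplicates
     */
    static List<LocalDate> missingDays(List<LocalDate> sortedDumpDates) {
        List<LocalDate> missing = new ArrayList<>();
        for (int i = 1; i < sortedDumpDates.size(); i++) {
            LocalDate expected = sortedDumpDates.get(i - 1).plusDays(1);
            // Collect every day skipped before the next available dump.
            while (expected.isBefore(sortedDumpDates.get(i))) {
                missing.add(expected);
                expected = expected.plusDays(1);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = Arrays.asList(
                LocalDate.of(2015, 1, 1),
                LocalDate.of(2015, 1, 2),
                LocalDate.of(2015, 1, 4));
        List<LocalDate> missing = missingDays(dates);
        if (missing.isEmpty()) {
            System.out.println("No gaps found.");
        } else {
            System.out.println("Missing incremental dumps: " + missing);
        }
    }
}
```

      If such a check reports gaps, processing only the most recent main dump avoids the inconsistent state.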

    • processMostRecentMainDump

      public void processMostRecentMainDump()
      Processes the most recent main (complete) dump that is available. Convenience method: same as retrieving a dump with getMostRecentDump(DumpContentType) with DumpContentType.CURRENT or DumpContentType.FULL, and processing it with processDump(MwDumpFile). The individual methods should be used for better control and error handling.
    • processMostRecentJsonDump

      public void processMostRecentJsonDump()
      Processes the most recent main (complete) dump in JSON form that is available. Convenience method: same as retrieving a dump with getMostRecentDump(DumpContentType) with DumpContentType.JSON, and processing it with processDump(MwDumpFile). The individual methods should be used for better control and error handling.
    • processDump

      public void processDump(MwDumpFile dumpFile)
      Processes the contents of the given dump file. All registered processor objects will be notified of all data. Note that JSON dumps do not contain any revision information, so registered MwRevisionProcessor objects will not be notified in this case. Dumps of type DumpContentType.SITES cannot be processed with this method; use getSitesInformation() to process these dumps.
      Parameters:
      dumpFile - the dump to process
    • getMostRecentDump

      public MwDumpFile getMostRecentDump(DumpContentType dumpContentType)
      Returns a handler for the most recent dump file of the given type that is available (under the current settings), or null if no dump file of this type could be retrieved.
      Parameters:
      dumpContentType - the type of the dump, e.g., DumpContentType.JSON
      Returns:
      the most recent dump, or null if none was found
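
      The convenience methods combine getMostRecentDump and processDump; calling the two steps separately makes it possible to react when no dump is found. The sketch below shows that pattern against small stand-in interfaces so that it compiles without Wikidata Toolkit on the classpath. Controller and DumpFile are hypothetical and only mirror the two methods documented above; in particular, the dump type is a plain String here rather than a DumpContentType.

```java
/** Sketch of the two-step pattern with a null check; uses stand-in types. */
class ManualDumpSketch {

    /** Hypothetical stand-in for a dump file handle. */
    interface DumpFile {
    }

    /** Hypothetical stand-in mirroring the two documented controller methods. */
    interface Controller {
        /** Returns the most recent dump of the given type, or null if none was found. */
        DumpFile getMostRecentDump(String dumpContentType);

        /** Processes the given dump with all registered processors. */
        void processDump(DumpFile dumpFile);
    }

    /**
     * Processes the most recent dump of the given type if there is one.
     *
     * @return true if a dump was found and handed to processDump
     */
    static boolean processIfAvailable(Controller controller, String dumpContentType) {
        DumpFile dump = controller.getMostRecentDump(dumpContentType);
        if (dump == null) {
            // For example: offline mode without any previously downloaded dump.
            return false;
        }
        controller.processDump(dump);
        return true;
    }

    public static void main(String[] args) {
        // A fake controller that never finds a dump, as in an empty offline setup.
        Controller offline = new Controller() {
            @Override
            public DumpFile getMostRecentDump(String dumpContentType) {
                return null;
            }

            @Override
            public void processDump(DumpFile dumpFile) {
                throw new IllegalStateException("not reached in this sketch");
            }
        };
        if (!processIfAvailable(offline, "JSON")) {
            System.out.println("No dump available; nothing was processed.");
        }
    }
}
```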
    • getWmfDumpFileManager

      public WmfDumpFileManager getWmfDumpFileManager()
      Returns a WmfDumpFileManager based on the current settings. This object can be used to get direct access to dump files, e.g., to gather more information. Most basic operations can also be performed using the interface of the DumpProcessingController and this is often preferable.

      This dump file manager will not be updated if the settings change later.

      Returns:
      a WmfDumpFileManager for the current settings or null if there was a problem (e.g., since the current dump file directory could not be accessed)