Class MwRevisionDumpFileProcessor

java.lang.Object
org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor
All Implemented Interfaces:
MwDumpFileProcessor

public class MwRevisionDumpFileProcessor extends Object implements MwDumpFileProcessor
This class processes MediaWiki dump files that contain lists of page revisions in the XML format MediaWiki uses for exporting pages. It extracts all revisions and forwards them to any registered revision processor.

The class also keeps track of which articles and which revisions it has already encountered. As a result, no revision is processed twice, and the registered revision processors can be told whether a revision is the first one seen for its article. The first revision of an article encountered in a MediaWiki dump file is usually the most recent one. If multiple dump files are processed in reverse chronological order, the first revision encountered is therefore also the most recent one overall.
Author:
Markus Kroetzsch
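
The bookkeeping described above can be illustrated with a minimal, self-contained sketch. It is not the Wikidata Toolkit implementation: the class RevisionTrackingSketch, the RevisionListener callback, and the page and revision ids are all hypothetical stand-ins, chosen only to show how two sets are enough to skip duplicate revisions and to flag the first revision seen for each page.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; not the WDTK implementation. */
public class RevisionTrackingSketch {

    /** Hypothetical callback standing in for a registered revision processor. */
    interface RevisionListener {
        void onRevision(int pageId, long revisionId, boolean firstSeenForPage);
    }

    private final Set<Integer> seenPageIds = new HashSet<>();
    private final Set<Long> seenRevisionIds = new HashSet<>();
    private final RevisionListener listener;

    RevisionTrackingSketch(RevisionListener listener) {
        this.listener = listener;
    }

    /** Forwards a revision to the listener unless it was seen before. */
    void offer(int pageId, long revisionId) {
        // Set.add returns false if the element was already present.
        if (!seenRevisionIds.add(revisionId)) {
            return; // duplicate revision: never reported twice
        }
        boolean firstSeenForPage = seenPageIds.add(pageId);
        listener.onRevision(pageId, revisionId, firstSeenForPage);
    }

    /** Feeds a few made-up revisions through the tracker and returns what was reported. */
    static List<String> demo() {
        List<String> reported = new ArrayList<>();
        RevisionTrackingSketch tracker = new RevisionTrackingSketch(
                (pageId, revisionId, first) ->
                        reported.add(pageId + "/" + revisionId + (first ? " first" : " later")));
        tracker.offer(1, 300L); // newest revision of page 1
        tracker.offer(1, 200L); // older revision of page 1
        tracker.offer(2, 250L); // newest revision of page 2
        tracker.offer(1, 300L); // duplicate, e.g. from an overlapping dump
        return reported;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this sketch the duplicate offer of revision 300 is dropped silently, and only the first revision offered for each page is flagged as first. This matches the described behavior when revisions arrive newest first.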
  • Constructor Details

    • MwRevisionDumpFileProcessor

      public MwRevisionDumpFileProcessor(MwRevisionProcessor mwRevisionProcessor)
      Creates a dump file processor that reports all revisions to the given revision processor.
      Parameters:
      mwRevisionProcessor - the revision processor to which all revisions will be reported
  • Method Details

    • reset

      public void reset()
      Resets the internal state of the object. All information gathered from previously processed dumps and all related statistics will be forgotten. If this method is not called, then consecutive invocations of processDumpFileContents(InputStream, MwDumpFile) will continue to add to the internal state. This is useful for processing dumps that are split into several parts.

      This will not unregister any MwRevisionProcessors.
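
The difference between carrying state across several dump parts and starting over after reset() can be sketched as follows. This is an illustration under assumptions, not the library's code: ResetSketch, CountingListener, and processPart are hypothetical names, and revisions are reduced to bare ids.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; not the WDTK implementation. */
public class ResetSketch {

    /** Hypothetical listener that counts the revisions reported to it. */
    static final class CountingListener {
        int reported = 0;
    }

    private final Set<Long> seenRevisionIds = new HashSet<>();
    private final CountingListener listener;

    ResetSketch(CountingListener listener) {
        this.listener = listener;
    }

    /** Processes one part of a dump; the seen-set carries over between calls. */
    void processPart(long... revisionIds) {
        for (long id : revisionIds) {
            if (seenRevisionIds.add(id)) {
                listener.reported++;
            }
        }
    }

    /** Forgets the gathered state but keeps the listener registered. */
    void reset() {
        seenRevisionIds.clear();
    }

    /** Returns the report counts before and after a reset. */
    static int[] demo() {
        CountingListener listener = new CountingListener();
        ResetSketch processor = new ResetSketch(listener);
        processor.processPart(1L, 2L);
        processor.processPart(2L, 3L); // revision 2 is skipped as already seen
        int beforeReset = listener.reported;
        processor.reset();
        processor.processPart(3L); // reported again because the state was cleared
        int afterReset = listener.reported;
        return new int[] {beforeReset, afterReset};
    }
}
```

Without the reset, the second offer of revision 3 would have been skipped as well. The listener keeps receiving reports after the reset because only the gathered state is cleared.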

    • processDumpFileContents

      public void processDumpFileContents(InputStream inputStream, MwDumpFile dumpFile)
      Description copied from interface: MwDumpFileProcessor
      Process dump file data from the given input stream.

      The input stream is obtained from the given dump file via MwDumpFile.getDumpFileStream(). It will be closed by the caller.

      Specified by:
      processDumpFileContents in interface MwDumpFileProcessor
      Parameters:
      inputStream - to access the contents of the dump
      dumpFile - to access further information about this dump
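
Since the input stream is closed by the caller rather than by the processor, a caller would typically own the stream in a try-with-resources block. The sketch below shows that division of responsibility with hypothetical stand-ins only: ContentProcessor replaces the real processor interface, and TrackingStream is a test stream that records whether close() was called. It makes no claim about how Wikidata Toolkit itself manages streams internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; all types here are hypothetical stand-ins. */
public class StreamOwnershipSketch {

    /** Stand-in for a dump file processor: it reads the stream but never closes it. */
    interface ContentProcessor {
        void process(InputStream in) throws IOException;
    }

    /** In-memory stream that records whether close() was called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(String content) {
            super(content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Returns { closed while processing?, closed after the caller's block? }. */
    static boolean[] demo() {
        TrackingStream stream = new TrackingStream("<mediawiki></mediawiki>");
        boolean[] closedWhileProcessing = new boolean[1];
        ContentProcessor processor = in -> {
            while (in.read() != -1) {
                // consume the content; a real processor would parse XML here
            }
            closedWhileProcessing[0] = stream.closed;
        };
        // The caller owns the stream: try-with-resources closes it afterwards.
        try (InputStream in = stream) {
            processor.process(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new boolean[] {closedWhileProcessing[0], stream.closed};
    }
}
```

The stream is still open while the processor reads from it and is closed only once the caller's block ends.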