Class DumpProcessingController
The methods for registering listeners to process dump files that contain revisions are registerMwRevisionProcessor(MwRevisionProcessor, String, boolean) and registerEntityDocumentProcessor(EntityDocumentProcessor, String, boolean).
For processing the content of wiki pages, there are two modes of operation: revision-based and entity-document-based. The former is used when processing dump files that contain revisions. These hold detailed information about each revision (revision number, author, time, etc.) that could be used by revision processors.
The entity-document-based operation is used when processing simplified dumps that contain only the content of the current (entity) pages of a wiki. In this case, no additional information is available and only the entity document processors are called (since we have no revisions). Both modes use the same entity document processors. In revision-based runs, it is possible to restrict some entity document processors to certain content models only (e.g., to process only properties). In entity-document-based runs, this is ignored and all entity document processors get to see all the data.
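The selection rule described above can be modeled in a few lines. The following is a minimal, self-contained sketch and not Wikidata Toolkit code: `Mode`, `Registration`, and `recipients` are hypothetical names introduced for illustration, and the content model strings are illustrative values only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of processor selection in the two modes; not Wikidata Toolkit code. */
final class ModeDispatchSketch {

    enum Mode { REVISION_BASED, ENTITY_DOCUMENT_BASED }

    /** Hypothetical record of one registered entity document processor. */
    static final class Registration {
        final String name;
        final String model; // null means "any content model"

        Registration(String name, String model) {
            this.name = name;
            this.model = model;
        }
    }

    /** Returns the names of the processors that would see a page. */
    static List<String> recipients(List<Registration> registrations, Mode mode, String pageModel) {
        List<String> names = new ArrayList<>();
        for (Registration r : registrations) {
            boolean modelMatches = r.model == null || r.model.equals(pageModel);
            // In entity-document-based runs the model restriction is ignored.
            if (mode == Mode.ENTITY_DOCUMENT_BASED || modelMatches) {
                names.add(r.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<Registration> regs = new ArrayList<>();
        regs.add(new Registration("everything", null));
        regs.add(new Registration("propertiesOnly", "wikibase-property"));

        // Revision-based run: the restricted processor only sees property pages.
        System.out.println(recipients(regs, Mode.REVISION_BASED, "wikibase-item"));
        System.out.println(recipients(regs, Mode.REVISION_BASED, "wikibase-property"));
        // Entity-document-based run: no model information, everyone sees everything.
        System.out.println(recipients(regs, Mode.ENTITY_DOCUMENT_BASED, null));
    }
}
```

In this model, a processor restricted to one content model is skipped for other pages in a revision-based run but receives every document in an entity-document-based run, so such a processor should be prepared to check the type of the entities it is given.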
The methods for revision-based processing of selected dump files (which first determine the relevant files and download them) are processAllRecentRevisionDumps() and processMostRecentMainDump(). The most recent JSON dump, which contains no revision information, is processed with processMostRecentJsonDump().
To extract the most recent site link information, use getSitesInformation(). To get information about the revision dump files that the main methods will process, use getWmfDumpFileManager() to access the underlying dump file manager, which provides access to dump file data.
The controller will also catch exceptions that may occur when trying to download and read dump files. They will be turned into logged errors.
- Author: Markus Kroetzsch
Constructor Summary
Constructor and Description:
- DumpProcessingController(String projectName): Creates a new DumpProcessingController for the project of the given name.
Method Summary
Modifier and Type, Method, and Description:
- MwDumpFile getMostRecentDump(DumpContentType dumpContentType): Returns a handler for the most recent dump file of the given type that is available (under the current settings), or null if no dump file of this type could be retrieved.
- Sites getSitesInformation(): Processes the most recent dump of the sites table to extract information about registered sites.
- WmfDumpFileManager getWmfDumpFileManager(): Returns a WmfDumpFileManager based on the current settings.
- void processAllRecentRevisionDumps(): Processes all relevant page revision dumps in order.
- void processDump(MwDumpFile dumpFile): Processes the contents of the given dump file.
- void processMostRecentJsonDump(): Processes the most recent main (complete) dump in JSON form that is available.
- void processMostRecentMainDump(): Processes the most recent main (complete) dump that is available.
- void registerEntityDocumentProcessor(EntityDocumentProcessor entityDocumentProcessor, String model, boolean onlyCurrentRevisions): Registers an EntityDocumentProcessor, which will henceforth be notified of all entity documents that are encountered in the dump.
- void registerMwRevisionProcessor(MwRevisionProcessor mwRevisionProcessor, String model, boolean onlyCurrentRevisions): Registers an MwRevisionProcessor, which will henceforth be notified of all revisions that are encountered in the dump.
- void setDownloadDirectory(String downloadDirectory): Sets the directory where dump files are stored locally.
- void setLanguageFilter(Set<String> languageFilter): Sets a language filter.
- void setOfflineMode(boolean offlineModeEnabled): Disables or enables Web access.
- void setPropertyFilter(Set<PropertyIdValue> propertyFilter): Sets a property filter.
- void setSiteLinkFilter(Set<String> siteLinkFilter): Sets a site link filter.
-
Constructor Details
-
DumpProcessingController
Creates a new DumpProcessingController for the project of the given name. By default, the dump file directory is assumed to be in the current directory, and the object will access the Web to fetch the most recent files.
- Parameters:
projectName - the Wikimedia project name, e.g., "wikidatawiki" or "enwiki"
-
-
Method Details
-
setDownloadDirectory
Sets the directory where dump files are stored locally. If it does not exist yet, this directory will be created. Dump files will later be stored in a subdirectory "dumpfiles", but this will only be created when needed.
- Parameters:
downloadDirectory - the download base directory
- Throws:
IOException - if the existence of the directory could not be checked, or if it did not exist and could not be created
-
setOfflineMode
public void setOfflineMode(boolean offlineModeEnabled)
Disables or enables Web access.
- Parameters:
offlineModeEnabled - if true, all Web access is disabled and only local files will be processed
-
setPropertyFilter
Sets a property filter. If given, all data will be preprocessed to contain only statements for the given (main) properties.
- Parameters:
propertyFilter - set of properties that should be retained (can be empty)
-
setSiteLinkFilter
Sets a site link filter. If given, all data will be preprocessed to contain only data for the given site keys.
- Parameters:
siteLinkFilter - set of site keys whose site links should be retained (can be empty)
-
setLanguageFilter
Sets a language filter. If given, all data will be preprocessed to contain only data for the given languages.
- Parameters:
languageFilter - set of language codes that should be retained (can be empty)
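The three filters above share a retain-only semantics: whatever is not named in the set is dropped before processors see the data. The following self-contained sketch illustrates that reading on a plain map of language codes to labels. It is not the library's implementation: `retainKeys` is a hypothetical helper, and treating a null filter as "no filtering" is an assumption that this page does not state.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of a retain-only filter; not Wikidata Toolkit code. */
final class RetainFilterSketch {

    /** Keeps only the entries whose key is contained in the filter set. */
    static <V> Map<String, V> retainKeys(Map<String, V> data, Set<String> filter) {
        if (filter == null) {
            // Assumption: no filter given means nothing is removed.
            return new LinkedHashMap<>(data);
        }
        Map<String, V> kept = new LinkedHashMap<>();
        for (Map.Entry<String, V> entry : data.entrySet()) {
            if (filter.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative data only.
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("en", "label in English");
        labels.put("de", "label in German");
        labels.put("fr", "label in French");

        // Only the listed languages survive.
        System.out.println(retainKeys(labels, Set.of("en", "de")).keySet());
        // Under the retain-only reading, an empty set retains nothing.
        System.out.println(retainKeys(labels, Set.<String>of()).keySet());
    }
}
```

If this reading is right, an empty set is a way to strip a whole category of data (for example, all site links) when a processor does not need it.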
-
registerMwRevisionProcessor
public void registerMwRevisionProcessor(MwRevisionProcessor mwRevisionProcessor, String model, boolean onlyCurrentRevisions)
Registers an MwRevisionProcessor, which will henceforth be notified of all revisions that are encountered in the dump.
This is only used when processing dumps that contain revisions. In particular, plain JSON dumps contain no revision information.
Importantly, the MwRevision that the registered processors receive is valid only during the execution of MwRevisionProcessor.processRevision(MwRevision); it is not permanent. If the data is to be retained permanently, the revision processor needs to make its own copy.
- Parameters:
mwRevisionProcessor - the revision processor to register
model - the content model that the processor is registered for; it will only be notified of revisions in that model; if null is given, all revisions will be processed whatever their model
onlyCurrentRevisions - if true, then the subscriber is only notified of the most current revisions; if false, then it will receive all revisions, current or not
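The warning about the lifetime of the revision object is easiest to see with a reader that reuses a single mutable object for every revision. The sketch below is self-contained and uses hypothetical stand-in types (`Revision`, `RevisionProcessor`, `readDump`), not the real MwRevision and MwRevisionProcessor; it only illustrates why a processor that wants to keep data should copy it inside the callback.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of "valid only during the callback"; not Wikidata Toolkit code. */
final class RevisionCopySketch {

    /** Hypothetical mutable revision object that a dump reader reuses. */
    static final class Revision {
        long revisionId;
        String title;

        Revision() {
        }

        /** Copy constructor used by processors that retain data. */
        Revision(Revision other) {
            this.revisionId = other.revisionId;
            this.title = other.title;
        }
    }

    /** Hypothetical stand-in for a revision processor callback. */
    interface RevisionProcessor {
        void processRevision(Revision revision);
    }

    /** Retains revisions safely by copying them during the callback. */
    static final class RetainingProcessor implements RevisionProcessor {
        final List<Revision> retained = new ArrayList<>();

        @Override
        public void processRevision(Revision revision) {
            // Keep a copy, never the reference: the reader overwrites the original.
            retained.add(new Revision(revision));
        }
    }

    /** Simulates a reader that overwrites one shared object per revision. */
    static void readDump(RevisionProcessor processor) {
        Revision shared = new Revision();
        String[] titles = {"Q1", "Q2", "Q3"};
        for (int i = 0; i < titles.length; i++) {
            shared.revisionId = 100 + i;
            shared.title = titles[i];
            processor.processRevision(shared);
        }
    }

    public static void main(String[] args) {
        RetainingProcessor processor = new RetainingProcessor();
        readDump(processor);
        for (Revision r : processor.retained) {
            System.out.println(r.revisionId + " " + r.title);
        }
    }
}
```

Had the processor stored the reference instead of a copy, all three retained entries would point at the same object and show only the last revision's values.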
-
registerEntityDocumentProcessor
public void registerEntityDocumentProcessor(EntityDocumentProcessor entityDocumentProcessor, String model, boolean onlyCurrentRevisions)
Registers an EntityDocumentProcessor, which will henceforth be notified of all entity documents that are encountered in the dump.
It is possible to register processors for specific content types and to use either all revisions or only the most current ones. This functionality is only available when processing dumps that contain this information. In particular, plain JSON dumps do not specify content models at all and have only one (current) revision of each entity.
- Parameters:
entityDocumentProcessor - the entity document processor to register
model - the content model that the processor is registered for; it will only be notified of revisions in that model; if null is given, all revisions will be processed whatever their model
onlyCurrentRevisions - if true, then the subscriber is only notified of the most current revisions; if false, then it will receive all revisions, current or not
-
getSitesInformation
Processes the most recent dump of the sites table to extract information about registered sites.
- Returns:
a Sites object that contains the extracted information, or null if no sites dump was available (typically in offline mode without any previously downloaded sites dumps)
- Throws:
IOException - if there was a problem accessing the sites table dump or the dump download directory
-
processAllRecentRevisionDumps
public void processAllRecentRevisionDumps()
Processes all relevant page revision dumps in order. The registered listeners (MwRevisionProcessor or EntityDocumentProcessor objects) will be notified of all data they registered for.
Note that this method may not always provide reliable results, since individual incremental dump files are sometimes missing even if earlier and later incremental dumps are available. In such a case, processing all recent dumps will miss some (random) revisions and thus reflect a state that the wiki has never really been in. It might therefore be preferable to process only a single (main) dump file without any incremental dumps.
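One way to decide between the two approaches is to check for gaps before processing. The sketch below is a self-contained illustration under loud assumptions: it assumes that incremental dumps are published once per day and are identified by their date, and that the first incremental dump is expected on the day after the main dump. `missingDays` is a hypothetical helper, the dates are made up, and obtaining the actual list of available dumps is left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy gap check for a sequence of dated incremental dumps; not Wikidata Toolkit code. */
final class IncrementalGapSketch {

    /**
     * Returns the days between the main dump and the newest incremental dump
     * for which no incremental dump is available (assumes a daily cadence).
     */
    static List<LocalDate> missingDays(LocalDate mainDumpDate, List<LocalDate> incrementalDates) {
        List<LocalDate> missing = new ArrayList<>();
        if (incrementalDates.isEmpty()) {
            return missing; // nothing to combine with the main dump
        }
        LocalDate newest = Collections.max(incrementalDates);
        for (LocalDate day = mainDumpDate.plusDays(1); !day.isAfter(newest); day = day.plusDays(1)) {
            if (!incrementalDates.contains(day)) {
                missing.add(day);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up dates for illustration.
        LocalDate mainDump = LocalDate.of(2024, 1, 1);
        List<LocalDate> incrementals = List.of(
                LocalDate.of(2024, 1, 2),
                LocalDate.of(2024, 1, 3),
                LocalDate.of(2024, 1, 5));

        List<LocalDate> missing = missingDays(mainDump, incrementals);
        if (missing.isEmpty()) {
            System.out.println("Incremental dumps are contiguous.");
        } else {
            System.out.println("Missing incremental dumps: " + missing
                    + "; processing only the main dump may be safer.");
        }
    }
}
```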
-
processMostRecentMainDump
public void processMostRecentMainDump()
Processes the most recent main (complete) dump that is available. Convenience method: same as retrieving a dump with getMostRecentDump(DumpContentType) with DumpContentType.CURRENT or DumpContentType.FULL, and processing it with processDump(MwDumpFile). The individual methods should be used for better control and error handling.
-
processMostRecentJsonDump
public void processMostRecentJsonDump()
Processes the most recent main (complete) dump in JSON form that is available. Convenience method: same as retrieving a dump with getMostRecentDump(DumpContentType) with DumpContentType.JSON, and processing it with processDump(MwDumpFile). The individual methods should be used for better control and error handling.
-
processDump
Processes the contents of the given dump file. All registered processor objects will be notified of all data. Note that JSON dumps do not contain any revision information, so registered MwRevisionProcessor objects will not be notified in this case. Dumps of type DumpContentType.SITES cannot be processed with this method; use getSitesInformation() to process these dumps.
- Parameters:
dumpFile - the dump to process
-
getMostRecentDump
Returns a handler for the most recent dump file of the given type that is available (under the current settings), or null if no dump file of this type could be retrieved.
- Parameters:
dumpContentType - the type of the dump, e.g., DumpContentType.JSON
- Returns:
the most recent dump, or null if none was found
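Because the result can be null, callers that retrieve and process dumps themselves need an explicit check, and that check is also where custom error handling fits. The sketch below shows the pattern with self-contained stand-ins: `ContentType`, `Controller`, and `FakeController` are hypothetical types that only mirror the two method names documented on this page, and a String plays the role of the dump handle. The fallback order in `processFirstAvailable` is one possible policy chosen for illustration, not what the convenience methods do.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Toy model of the retrieve-check-process pattern; not Wikidata Toolkit code. */
final class ManualDumpSketch {

    /** Stand-in for some of the dump content types named on this page. */
    enum ContentType { CURRENT, FULL, JSON }

    /** Stand-in for the two controller methods used in this pattern. */
    interface Controller {
        String getMostRecentDump(ContentType type); // may return null

        void processDump(String dump);
    }

    /** Processes the first dump that can be retrieved; returns false if none could. */
    static boolean processFirstAvailable(Controller controller, ContentType... preferred) {
        for (ContentType type : preferred) {
            String dump = controller.getMostRecentDump(type);
            if (dump != null) {
                controller.processDump(dump);
                return true;
            }
        }
        return false; // nothing retrieved: the caller decides how to react
    }

    /** In-memory fake used only to make the sketch runnable. */
    static final class FakeController implements Controller {
        final Map<ContentType, String> available = new EnumMap<>(ContentType.class);
        final List<String> processed = new ArrayList<>();

        @Override
        public String getMostRecentDump(ContentType type) {
            return available.get(type);
        }

        @Override
        public void processDump(String dump) {
            processed.add(dump);
        }
    }

    public static void main(String[] args) {
        FakeController fake = new FakeController();
        fake.available.put(ContentType.FULL, "full-dump");

        boolean ok = processFirstAvailable(fake, ContentType.CURRENT, ContentType.FULL);
        if (!ok) {
            System.err.println("No dump could be retrieved.");
        }
        System.out.println(ok + " " + fake.processed);
    }
}
```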
-
getWmfDumpFileManager
Returns a WmfDumpFileManager based on the current settings. This object can be used to get direct access to dump files, e.g., to gather more information. Most basic operations can also be performed using the interface of the DumpProcessingController, and this is often preferable.
This dump file manager will not be updated if the settings change later.
- Returns:
a WmfDumpFileManager for the current settings, or null if there was a problem (e.g., because the current dump file directory could not be accessed)
-