public class DumpProcessingController extends Object
The methods for registering listeners to process dump files that contain revisions are registerMwRevisionProcessor(MwRevisionProcessor, String, boolean) and registerEntityDocumentProcessor(EntityDocumentProcessor, String, boolean).
For processing the content of wiki pages, there are two modes of operation: revision-based and entity-document-based. The former is used when processing dump files that contain revisions. These hold detailed information about each revision (revision number, author, time, etc.) that could be used by revision processors.
The entity-document-based operation is used when processing simplified dumps that contain only the content of the current (entity) pages of a wiki. In this case, no additional information is available and only the entity document processors are called (since we have no revisions). Both modes use the same entity document processors. In revision-based runs, it is possible to restrict some entity document processors to certain content models only (e.g., to process only properties). In entity-document-based runs, this is ignored and all entity document processors get to see all the data.
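The effect of content-model restrictions in the two modes can be sketched with stand-in types. This is a minimal illustration, not the toolkit's implementation: the Registration class, both dispatch methods, and the model strings are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in sketch (not Wikidata Toolkit code) of how content-model restrictions apply. */
final class ModeDispatchSketch {

    /** Hypothetical registration: a processor name plus an optional content model (null = all models). */
    static final class Registration {
        final String processorName;
        final String model;

        Registration(String processorName, String model) {
            this.processorName = processorName;
            this.model = model;
        }
    }

    /** Revision-based run: only registrations whose model matches (or is null) are notified. */
    static List<String> notifiedInRevisionRun(List<Registration> registrations, String pageModel) {
        List<String> notified = new ArrayList<>();
        for (Registration r : registrations) {
            if (r.model == null || r.model.equals(pageModel)) {
                notified.add(r.processorName);
            }
        }
        return notified;
    }

    /** Entity-document-based run: the model restriction is ignored and every processor is notified. */
    static List<String> notifiedInEntityDocumentRun(List<Registration> registrations) {
        List<String> notified = new ArrayList<>();
        for (Registration r : registrations) {
            notified.add(r.processorName);
        }
        return notified;
    }

    public static void main(String[] args) {
        List<Registration> registrations = Arrays.asList(
                new Registration("propertiesOnly", "property-model"), // placeholder model name
                new Registration("everything", null));
        System.out.println(notifiedInRevisionRun(registrations, "item-model"));
        System.out.println(notifiedInEntityDocumentRun(registrations));
    }
}
```

In the revision-based call, the processor restricted to the placeholder property model is skipped for a page of another model, whereas the entity-document-based call notifies both processors.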
The methods for revision-based processing of selected dump files (which first determine the relevant files and download them) are processAllRecentRevisionDumps() and processMostRecentMainDump().
To extract the most recent sitelinks information, the method getSitesInformation() can be used. To get information about the revision dump files that the main methods will process, one can use getWmfDumpFileManager() to get access to the underlying dump file manager, which can be used to get access to dump file data.
The controller will also catch exceptions that may occur when trying to download and read dump files. They will be turned into logged errors.
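The "catch and log" behaviour described above can be pictured as follows. This is only a sketch of the general pattern under stated assumptions: the DumpStep interface, the runAndLog helper, and its boolean return value are hypothetical, and java.util.logging is used here solely to keep the example self-contained; the source does not say which logging framework the controller uses.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Stand-in sketch (not Wikidata Toolkit code) of turning I/O exceptions into logged errors. */
final class LoggedErrorSketch {

    private static final Logger LOGGER = Logger.getLogger(LoggedErrorSketch.class.getName());

    /** Hypothetical stand-in for a step that downloads or reads a dump file. */
    interface DumpStep {
        void run() throws IOException;
    }

    /** Runs the step; an IOException is logged as an error instead of being rethrown. */
    static boolean runAndLog(String description, DumpStep step) {
        try {
            step.run();
            return true;
        } catch (IOException e) {
            LOGGER.log(Level.SEVERE, "Could not process " + description + ": " + e.getMessage(), e);
            return false;
        }
    }

    public static void main(String[] args) {
        boolean succeeded = runAndLog("example dump", () -> {
            throw new IOException("simulated download failure");
        });
        System.out.println("succeeded: " + succeeded);
    }
}
```

A practical consequence of this design is that a failed run does not surface as an exception in the calling code, so the log output is the place to look when a processing method returns without having delivered any data.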
Constructor and Description |
---|
DumpProcessingController(String projectName): Creates a new DumpProcessingController for the project of the given name. |
Modifier and Type | Method and Description |
---|---|
MwDumpFile | getMostRecentDump(DumpContentType dumpContentType): Returns a handler for the most recent dump file of the given type that is available (under the current settings), or null if no dump file of this type could be retrieved. |
Sites | getSitesInformation(): Processes the most recent dump of the sites table to extract information about registered sites. |
WmfDumpFileManager | getWmfDumpFileManager(): Returns a WmfDumpFileManager based on the current settings. |
void | processAllRecentRevisionDumps(): Processes all relevant page revision dumps in order. |
void | processDump(MwDumpFile dumpFile): Processes the contents of the given dump file. |
void | processMostRecentJsonDump(): Processes the most recent main (complete) dump in JSON form that is available. |
void | processMostRecentMainDump(): Processes the most recent main (complete) dump that is available. |
void | registerEntityDocumentProcessor(EntityDocumentProcessor entityDocumentProcessor, String model, boolean onlyCurrentRevisions): Registers an EntityDocumentProcessor, which will henceforth be notified of all entity documents that are encountered in the dump. |
void | registerMwRevisionProcessor(MwRevisionProcessor mwRevisionProcessor, String model, boolean onlyCurrentRevisions): Registers an MwRevisionProcessor, which will henceforth be notified of all revisions that are encountered in the dump. |
void | setDownloadDirectory(String downloadDirectory): Sets the directory where dump files are stored locally. |
void | setLanguageFilter(Set<String> languageFilter): Sets a language filter. |
void | setOfflineMode(boolean offlineModeEnabled): Disables or enables Web access. |
void | setPropertyFilter(Set<PropertyIdValue> propertyFilter): Sets a property filter. |
void | setSiteLinkFilter(Set<String> siteLinkFilter): Sets a site link filter. |
public DumpProcessingController(String projectName)
projectName
- Wikimedia project name, e.g., "wikidatawiki" or "enwiki"
public void setDownloadDirectory(String downloadDirectory) throws IOException
downloadDirectory
- the download base directory
IOException
- if the existence of the directory could not be checked, or if it did not exist and could not be created
public void setOfflineMode(boolean offlineModeEnabled)
offlineModeEnabled
- if true, all Web access is disabled and only local files will be processed
public void setPropertyFilter(Set<PropertyIdValue> propertyFilter)
propertyFilter
- set of properties that should be retained (can be empty)
See also: DocumentDataFilter.setPropertyFilter(Set)
public void setSiteLinkFilter(Set<String> siteLinkFilter)
siteLinkFilter
- set of siteLinks that should be retained (can be empty)
See also: DocumentDataFilter.setSiteLinkFilter(Set)
public void setLanguageFilter(Set<String> languageFilter)
languageFilter
- set of language codes that should be retained (can be empty)
See also: DocumentDataFilter.setLanguageFilter(Set)
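The "set of values that should be retained" idea shared by the three filter setters can be illustrated on plain maps. The sketch below is an assumption-laden stand-in, not the behaviour of DocumentDataFilter itself: the retain helper and the label map are hypothetical, and the treatment of an empty set (nothing is retained) is simply the most literal reading of "can be empty".

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Stand-in sketch (not Wikidata Toolkit code) of a "retain only these languages" filter. */
final class LanguageFilterSketch {

    /** Keeps only the labels whose language code is contained in the filter set. */
    static Map<String, String> retain(Map<String, String> labelsByLanguage, Set<String> languageFilter) {
        Map<String, String> retained = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : labelsByLanguage.entrySet()) {
            if (languageFilter.contains(entry.getKey())) {
                retained.put(entry.getKey(), entry.getValue());
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("en", "Douglas Adams");
        labels.put("de", "Douglas Adams");
        labels.put("fr", "Douglas Adams");

        Set<String> filter = new HashSet<>();
        filter.add("en");
        filter.add("de");

        System.out.println(retain(labels, filter).keySet());
        // Assumption of this sketch: an empty filter set retains nothing.
        System.out.println(retain(labels, Collections.<String>emptySet()).keySet());
    }
}
```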
public void registerMwRevisionProcessor(MwRevisionProcessor mwRevisionProcessor, String model, boolean onlyCurrentRevisions)
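Registers an MwRevisionProcessor, which will henceforth be notified of all revisions that are encountered in the dump. This only applies when processing dumps that contain revisions. In particular, plain JSON dumps contain no revision information.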
Importantly, the MwRevision that the registered processors will receive is valid only during the execution of MwRevisionProcessor.processRevision(MwRevision), but it will not be permanent. If the data is to be retained permanently, the revision processor needs to make its own copy.
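The need for a copy can be illustrated with stand-in types. The sketch below assumes, purely for illustration, that the caller reuses one mutable object for every revision; the source only states that the object is not valid after the callback returns. ReusedRevision, RevisionSnapshot, and CollectingProcessor are hypothetical names and are not part of the toolkit.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in sketch (not Wikidata Toolkit code) of copying revision data inside the callback. */
final class RevisionCopySketch {

    /** Hypothetical stand-in for a revision object that the caller overwrites for each revision. */
    static final class ReusedRevision {
        long revisionId;
        String title;
    }

    /** Immutable snapshot owned by the processor. */
    static final class RevisionSnapshot {
        final long revisionId;
        final String title;

        RevisionSnapshot(ReusedRevision source) {
            this.revisionId = source.revisionId;
            this.title = source.title;
        }
    }

    /** Processor that copies what it needs while the callback is running. */
    static final class CollectingProcessor {
        final List<RevisionSnapshot> snapshots = new ArrayList<>();
        // Kept only to demonstrate the pitfall of storing the received object itself.
        final List<ReusedRevision> references = new ArrayList<>();

        void processRevision(ReusedRevision revision) {
            snapshots.add(new RevisionSnapshot(revision));
            references.add(revision);
        }
    }

    public static void main(String[] args) {
        CollectingProcessor processor = new CollectingProcessor();
        ReusedRevision shared = new ReusedRevision();
        for (long id = 1; id <= 3; id++) {
            shared.revisionId = id;
            shared.title = "Q" + id;
            processor.processRevision(shared);
        }
        // The snapshot preserves the first revision; the stored reference shows the last state.
        System.out.println(processor.snapshots.get(0).revisionId);
        System.out.println(processor.references.get(0).revisionId);
    }
}
```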
mwRevisionProcessor
- the revision processor to register
model
- the content model that the processor is registered for; it will only be notified of revisions in that model; if null is given, all revisions will be processed whatever their model
onlyCurrentRevisions
- if true, then the subscriber is only notified of the most current revisions; if false, then it will receive all revisions, current or not
public void registerEntityDocumentProcessor(EntityDocumentProcessor entityDocumentProcessor, String model, boolean onlyCurrentRevisions)
It is possible to register processors for specific content types and to use either all revisions or only the most current ones. This functionality is only available when processing dumps that contain this information. In particular, plain JSON dumps do not specify content models at all and have only one (current) revision of each entity.
entityDocumentProcessor
- the entity document processor to register
model
- the content model that the processor is registered for; it will only be notified of revisions in that model; if null is given, all revisions will be processed whatever their model
onlyCurrentRevisions
- if true, then the subscriber is only notified of the most current revisions; if false, then it will receive all revisions, current or not
public Sites getSitesInformation() throws IOException
IOException
- if there was a problem accessing the sites table dump or the dump download directory
public void processAllRecentRevisionDumps()
Note that this method may not always provide reliable results since single incremental dump files are sometimes missing, even if earlier and later incremental dumps are available. In such a case, processing all recent dumps will miss some (random) revisions, thus reflecting a state that the wiki has never really been in. It might thus be preferable to process only a single (main) dump file without any incremental dumps.
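One way to guard against this is to check the available incremental dumps for gaps before processing them together. The sketch below assumes that incremental dumps are published once per calendar day and that the caller already has the list of available dump dates; both are assumptions of this example, and missingDates is a hypothetical helper, not a toolkit method.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Stand-in sketch (not Wikidata Toolkit code) that detects missing days in a series of dump dates. */
final class DumpGapSketch {

    /** Returns every date between the earliest and latest given date for which no dump is listed. */
    static List<LocalDate> missingDates(List<LocalDate> availableDumpDates) {
        List<LocalDate> missing = new ArrayList<>();
        if (availableDumpDates.isEmpty()) {
            return missing;
        }
        TreeSet<LocalDate> sorted = new TreeSet<>(availableDumpDates);
        LocalDate last = sorted.last();
        for (LocalDate day = sorted.first(); day.isBefore(last); day = day.plusDays(1)) {
            if (!sorted.contains(day)) {
                missing.add(day);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<LocalDate> available = Arrays.asList(
                LocalDate.of(2024, 1, 1),
                LocalDate.of(2024, 1, 2),
                LocalDate.of(2024, 1, 4));
        System.out.println(missingDates(available));
    }
}
```

If the returned list is not empty, falling back to a single main dump, as suggested above, avoids reconstructing a state the wiki was never in.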
public void processMostRecentMainDump()
Processes the most recent main (complete) dump that is available. This is equivalent to retrieving a dump using getMostRecentDump(DumpContentType) with DumpContentType.CURRENT or DumpContentType.FULL, and processing it with processDump(MwDumpFile). The individual methods should be used for better control and error handling.
See also: processAllRecentRevisionDumps()
public void processMostRecentJsonDump()
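Processes the most recent main (complete) dump in JSON form that is available. This is equivalent to retrieving a dump using getMostRecentDump(DumpContentType) with DumpContentType.JSON, and processing it with processDump(MwDumpFile). The individual methods should be used for better control and error handling.
See also: processAllRecentRevisionDumps()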
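The two-step alternative mentioned for both convenience methods can be sketched with stand-in interfaces. Only the method names getMostRecentDump and processDump, and the fact that the former may return null, come from this page; the DumpFile and Controller interfaces, the String-typed content type, and the processMostRecent helper are hypothetical simplifications that keep the example self-contained.

```java
/** Stand-in sketch (not Wikidata Toolkit code) of the "retrieve, check, then process" pattern. */
final class TwoStepProcessingSketch {

    /** Hypothetical stand-in for MwDumpFile. */
    interface DumpFile {
    }

    /** Hypothetical stand-in exposing only the two controller methods discussed here. */
    interface Controller {
        DumpFile getMostRecentDump(String dumpContentType); // may return null

        void processDump(DumpFile dumpFile);
    }

    /** Retrieves the dump first so that the caller can react when none is available. */
    static boolean processMostRecent(Controller controller, String dumpContentType) {
        DumpFile dumpFile = controller.getMostRecentDump(dumpContentType);
        if (dumpFile == null) {
            return false; // nothing to process; the caller decides how to handle this
        }
        controller.processDump(dumpFile);
        return true;
    }

    public static void main(String[] args) {
        Controller nothingAvailable = new Controller() {
            @Override
            public DumpFile getMostRecentDump(String dumpContentType) {
                return null;
            }

            @Override
            public void processDump(DumpFile dumpFile) {
                throw new IllegalStateException("must not be called without a dump file");
            }
        };
        System.out.println(processMostRecent(nothingAvailable, "JSON"));
    }
}
```

Splitting the call this way gives the caller an explicit point at which to handle a missing dump, which the single convenience call does not expose.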
public void processDump(MwDumpFile dumpFile)
Processes the contents of the given dump file. JSON dumps contain no revision information, so registered MwRevisionProcessor objects will not be notified in this case.
Dumps of type DumpContentType.SITES cannot be processed with this method; use getSitesInformation() to process these dumps.
dumpFile
- the dump to process
public MwDumpFile getMostRecentDump(DumpContentType dumpContentType)
dumpContentType
- the type of the dump, e.g., DumpContentType.JSON
public WmfDumpFileManager getWmfDumpFileManager()
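Returns a WmfDumpFileManager based on the current settings. Many operations can also be performed through the interface of the DumpProcessingController, and this is often preferable.
This dump file manager will not be updated if the settings change later.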
Copyright © 2014–2024 Wikidata Toolkit Developers. Generated from source code published under the Apache License 2.0. For more information, see the Wikidata Toolkit homepage