Class WmfDumpFileManager
The preferred way to process dumps is DumpProcessingController.processAllRecentRevisionDumps(), since this
method takes care of freeing resources and might also provide parallelized
downloading/processing in the future.
Typically, the Web will be accessed to find information about dumps available
online. This Web access is mediated by a WebResourceFetcher
object, provided upon construction. If null is given instead, the class will
operate in offline mode, using only previously downloaded files.
The location of the Wikimedia download site is currently hardwired, since the extraction methods used to get the data are highly specific to the format of files on this site. Other sites (if any) would most likely need different methods.
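The offline mode described above can be pictured with a minimal sketch. It does not use the toolkit's classes: `OnlineLister` and `OfflineModeSketch` are hypothetical stand-ins, and the only behaviour taken from the description is that a null fetcher means no Web access and only local files are considered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a component that lists dump dates found online.
interface OnlineLister {
    List<String> listOnlineDumpDates();
}

class OfflineModeSketch {
    // A null lister means offline mode: no Web access is attempted.
    private final OnlineLister onlineLister;

    OfflineModeSketch(OnlineLister onlineLister) {
        this.onlineLister = onlineLister;
    }

    boolean isOffline() {
        return onlineLister == null;
    }

    /** Returns local dump dates, extended by online ones unless offline. */
    List<String> knownDumpDates(List<String> localDumpDates) {
        List<String> result = new ArrayList<>(localDumpDates);
        if (onlineLister != null) {
            for (String date : onlineLister.listOnlineDumpDates()) {
                if (!result.contains(date)) {
                    result.add(date);
                }
            }
        }
        return result;
    }
}
```

Passing null to the constructor of this sketch yields only the local dates; passing a lister adds any online dates not already present locally.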
- Author:
- Markus Kroetzsch
-
Field Summary
static final String DOWNLOAD_DIRECTORY_NAME
- The name of the directory where downloaded dump files are stored.
-
Constructor Summary
WmfDumpFileManager(String projectName, DirectoryManager downloadDirectoryManager, WebResourceFetcher webResourceFetcher)
- Constructor.
-
Method Summary
findAllDumps(DumpContentType dumpContentType)
- Returns a list of all dump files of the given type available either online or locally.
findAllRelevantRevisionDumps(boolean preferCurrent)
- Finds all page revision dump files, online or locally, that are relevant to obtain the most current state of the data.
findMostRecentDump(DumpContentType dumpContentType)
- Finds the most recent dump of the given type that is actually available.
-
Field Details
-
DOWNLOAD_DIRECTORY_NAME
The name of the directory where downloaded dump files are stored.
-
Constructor Details
-
WmfDumpFileManager
public WmfDumpFileManager(String projectName, DirectoryManager downloadDirectoryManager, WebResourceFetcher webResourceFetcher) throws IOException
Constructor.
- Parameters:
projectName
- name of the project to obtain dumps for, as used in the folder structure of the dump site, e.g., "wikidatawiki"
downloadDirectoryManager
- the directory manager for the directory where the download directory for dump files should be; it will be created if needed
webResourceFetcher
- the web resource fetcher to access web resources, or null if no web access should happen
- Throws:
IOException
- if it was not possible to access the directory for managing dump files
-
-
Method Details
-
findAllRelevantRevisionDumps
Finds all page revision dump files, online or locally, that are relevant to obtain the most current state of the data. Revision dump files are dumps that contain page revisions in MediaWiki's XML format.
If the parameter preferCurrent is true, then dumps that contain only the current versions of all pages will be preferred if available anywhere, even over previously downloaded dump files that contain all versions. However, dump files may still contain non-current revisions, and when processing multiple dumps there might even be overlaps (one revision occurring in multiple dumps).
The result is ordered with the most recent dump first. If a dump file A contains revisions of a page P, and Rmax is the maximal revision of P in A, then every dump file that comes after A should contain only revisions of P that are smaller than or equal to Rmax. In other words, the maximal revision found in the first file that contains P at all should also be the maximal revision overall.
- Parameters:
preferCurrent
- should dumps with current revisions be preferred?
- Returns:
- an ordered list of all dump files that match the given criteria
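The ordering guarantee stated for findAllRelevantRevisionDumps means that a consumer reading the dumps in list order can treat the maximal revision of a page in the first dump that mentions it as that page's latest revision. The following is a minimal sketch of such a consumer. `Revision`, `Dump`, and `LatestRevisionSketch` are hypothetical stand-in types for illustration, not toolkit classes, and revisions are assumed to be compared by a numeric ID.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types for illustration; not Wikidata Toolkit classes.
record Revision(String pageTitle, long revisionId) {}
record Dump(String dateStamp, List<Revision> revisions) {}

class LatestRevisionSketch {
    /**
     * Walks the dumps in the given order (most recent first) and records,
     * for each page, the maximal revision ID found in the first dump
     * that mentions the page.
     */
    static Map<String, Long> latestRevisions(List<Dump> orderedDumps) {
        Map<String, Long> latest = new LinkedHashMap<>();
        for (Dump dump : orderedDumps) {
            // Maximal revision per page within this single dump.
            Map<String, Long> maxInDump = new LinkedHashMap<>();
            for (Revision r : dump.revisions()) {
                maxInDump.merge(r.pageTitle(), r.revisionId(), Long::max);
            }
            // Pages already seen in an earlier (more recent) dump are skipped.
            for (Map.Entry<String, Long> e : maxInDump.entrySet()) {
                latest.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return latest;
    }
}
```

Because later dumps never override an entry, overlapping revisions across dumps do not affect the result, provided the input list respects the documented order.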
-
findMostRecentDump
Finds the most recent dump of the given type that is actually available.
- Parameters:
dumpContentType
- the type of the dump to look for
- Returns:
- most recent dump of the given type, or null if no such dump exists
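One way to read "actually available" is as a scan over a most-recent-first list that skips entries whose file cannot be used yet. The sketch below illustrates that reading only. `CandidateDump` and `MostRecentDumpSketch` are hypothetical stand-ins, and the availability flag is an assumption about how such a check might be exposed.

```java
import java.util.List;

// Hypothetical stand-in for a dump file; not the toolkit's own type.
record CandidateDump(String dateStamp, boolean available) {}

class MostRecentDumpSketch {
    /**
     * Returns the first available dump from a list ordered most recent
     * first, or null if none of the candidates is available.
     */
    static CandidateDump findMostRecentAvailable(List<CandidateDump> orderedDumps) {
        for (CandidateDump dump : orderedDumps) {
            if (dump.available()) {
                return dump;
            }
        }
        return null;
    }
}
```

Callers of a method with this contract need a null check before using the result.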
-
findAllDumps
Returns a list of all dump files of the given type available either online or locally. For dumps available both online and locally, the local version is included. The list is ordered with most recent dump date first. Online dumps found by this method might not be available yet (if their directory has been created online but the file was not uploaded or completely written yet).
- Returns:
- a list of dump files of the given type
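The merge rule described for findAllDumps (local copy wins over the online one, result ordered with the most recent date first) can be sketched as follows. `DumpEntry` and `MergeDumpsSketch` are hypothetical stand-ins, and the sketch assumes that dumps are identified by a date stamp in YYYYMMDD form, so that plain string order matches chronological order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a dump file entry; not the toolkit's own type.
record DumpEntry(String dateStamp, boolean local) {}

class MergeDumpsSketch {
    /**
     * Merges local and online dump entries. When the same date stamp occurs
     * in both lists, the local entry wins. The result is sorted with the
     * most recent date stamp first.
     */
    static List<DumpEntry> mergeDumps(List<DumpEntry> localDumps, List<DumpEntry> onlineDumps) {
        Map<String, DumpEntry> byDate = new HashMap<>();
        for (DumpEntry d : onlineDumps) {
            byDate.put(d.dateStamp(), d);
        }
        // Local entries are inserted last so they overwrite online duplicates.
        for (DumpEntry d : localDumps) {
            byDate.put(d.dateStamp(), d);
        }
        List<DumpEntry> result = new ArrayList<>(byDate.values());
        // Assumed YYYYMMDD stamps sort chronologically as plain strings.
        result.sort(Comparator.comparing(DumpEntry::dateStamp).reversed());
        return result;
    }
}
```

Entries that exist only online remain in the result even though their files might not be usable yet, which matches the caveat in the method description.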
-