public class WmfDumpFileManager extends Object
DumpProcessingController.processAllRecentRevisionDumps(), since this
method takes care of freeing resources and might also provide parallelized
downloading/processing in the future.
Typically, the Web is accessed to find information about dumps available
online. This Web access is mediated by a WebResourceFetcher object (for
example, a WebResourceFetcherImpl) provided at construction. If null is given
instead, the class operates in offline mode, using only previously downloaded files.
The location of the Wikimedia download site is currently hardwired, since the extraction methods used to get the data are highly specific to the format of files on this site. Other sites (if any) would most likely need different methods.
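The online/offline behavior described above can be pictured with a minimal, self-contained sketch. It does not use the Wikidata Toolkit classes: `OfflineModeSketch`, `DumpSource`, and `listDumps` are hypothetical stand-ins introduced only for illustration, and the real class may combine local and online results differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in types for illustration; NOT the Wikidata Toolkit API.
class OfflineModeSketch {

    /** Hypothetical source of dump date stamps such as "20240101". */
    interface DumpSource {
        List<String> availableDumpDates();
    }

    /**
     * Lists known dump dates, newest first. A null online source means
     * offline mode: only locally available dumps are reported.
     */
    static List<String> listDumps(DumpSource local, DumpSource online) {
        TreeSet<String> dates = new TreeSet<>(local.availableDumpDates());
        if (online != null) {
            dates.addAll(online.availableDumpDates());
        }
        // Date stamps in yyyyMMdd form sort chronologically as strings.
        return new ArrayList<>(dates.descendingSet());
    }

    public static void main(String[] args) {
        DumpSource local = () -> List.of("20240101", "20240201");
        DumpSource online = () -> List.of("20240201", "20240301");

        System.out.println(listDumps(local, null));   // offline: local dumps only
        System.out.println(listDumps(local, online)); // online: merged, no duplicates
    }
}
```

Treating null as "no web access" mirrors the constructor contract documented on this page; the merging by date stamp is an assumption of the sketch.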
Modifier and Type | Field and Description |
---|---|
static String | DOWNLOAD_DIRECTORY_NAME: The name of the directory where downloaded dump files are stored. |
Constructor and Description |
---|
WmfDumpFileManager(String projectName, DirectoryManager downloadDirectoryManager, WebResourceFetcher webResourceFetcher): Constructor. |
Modifier and Type | Method and Description |
---|---|
List<MwDumpFile> | findAllDumps(DumpContentType dumpContentType): Returns a list of all dump files of the given type available either online or locally. |
List<MwDumpFile> | findAllRelevantRevisionDumps(boolean preferCurrent): Finds all page revision dump files, online or locally, that are relevant to obtain the most current state of the data. |
MwDumpFile | findMostRecentDump(DumpContentType dumpContentType): Finds the most recent dump of the given type that is actually available. |
public static final String DOWNLOAD_DIRECTORY_NAME
public WmfDumpFileManager(String projectName, DirectoryManager downloadDirectoryManager, WebResourceFetcher webResourceFetcher) throws IOException
projectName - name of the project to obtain dumps for as used in the folder structure of the dump site, e.g., "wikidatawiki"
downloadDirectoryManager - the directory manager for the directory where the download directory for dump files should be; it will be created if needed
webResourceFetcher - the web resource fetcher to access web resources or null if no web access should happen
IOException - if it was not possible to access the directory for managing dump files
public List<MwDumpFile> findAllRelevantRevisionDumps(boolean preferCurrent)
If the parameter preferCurrent is true, then dumps that contain only the current versions of all pages are preferred if available anywhere, even over previously downloaded dump files that contain all versions. However, dump files may still contain non-current revisions, and when processing multiple dumps there might even be overlaps (one revision occurring in multiple dumps).
The result is ordered with the most recent dump first. If a dump file A contains revisions of a page P, and Rmax is the maximal revision of P in A, then every dump file that comes after A should contain only revisions of P that are smaller than or equal to Rmax. In other words, the maximal revision found in the first file that contains P at all should also be the maximal revision overall.
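The ordering guarantee above can be checked mechanically. The following sketch is an illustration only: it models each dump as a map from page id to the revision ids it contains, which is a hypothetical simplification and not how MwDumpFile exposes its content. `DumpOrderSketch` and `respectsOrdering` are names invented for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; NOT the Wikidata Toolkit API.
// Each dump is a map from page id to the revision ids it contains.
class DumpOrderSketch {

    /**
     * Returns true if, for every page, no later dump in the list contains a
     * revision larger than the maximal revision (Rmax) found in the first
     * dump that contains the page at all.
     */
    static boolean respectsOrdering(List<Map<Integer, List<Long>>> dumps) {
        Map<Integer, Long> firstMax = new HashMap<>();
        for (Map<Integer, List<Long>> dump : dumps) {
            for (Map.Entry<Integer, List<Long>> page : dump.entrySet()) {
                if (page.getValue().isEmpty()) {
                    continue; // a page without revisions imposes no constraint
                }
                long max = Collections.max(page.getValue());
                // putIfAbsent returns null when this is the first dump with the page.
                Long bound = firstMax.putIfAbsent(page.getKey(), max);
                if (bound != null && max > bound) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, List<Long>> newer = Map.of(1, List.of(30L, 20L), 2, List.of(15L));
        Map<Integer, List<Long>> older = Map.of(1, List.of(20L, 10L), 3, List.of(5L));

        System.out.println(respectsOrdering(List.of(newer, older))); // true
        System.out.println(respectsOrdering(List.of(older, newer))); // false
    }
}
```

Note that overlaps are allowed: revision 20 of page 1 appears in both dumps, and the most-recent-first order still satisfies the condition.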
preferCurrent - should dumps with current revisions be preferred?
public MwDumpFile findMostRecentDump(DumpContentType dumpContentType)
dumpContentType - the type of the dump to look for
public List<MwDumpFile> findAllDumps(DumpContentType dumpContentType)
Copyright © 2014–2024 Wikidata Toolkit Developers. Generated from source code published under the Apache License 2.0. For more information, see the Wikidata Toolkit homepage