Class WmfDumpFileManager

java.lang.Object
org.wikidata.wdtk.dumpfiles.wmf.WmfDumpFileManager

public class WmfDumpFileManager extends Object
Provides access to the dump files published by the Wikimedia Foundation. The preferred access point is DumpProcessingController.processAllRecentRevisionDumps(), since that method takes care of freeing resources and might also provide parallelized downloading and processing in the future.

Typically, the Web is accessed to find out which dumps are available online. This Web access is mediated by a WebResourceFetcher object supplied at construction. If null is given instead, the class operates in offline mode and uses only previously downloaded files.

The location of the Wikimedia download site is currently hardwired, since the extraction methods used to get the data are highly specific to the format of files on this site. Other sites (if any) would most likely need different methods.

Author:
Markus Kroetzsch
  • Field Details

    • DOWNLOAD_DIRECTORY_NAME

      public static final String DOWNLOAD_DIRECTORY_NAME
      The name of the directory where downloaded dump files are stored.
  • Constructor Details

    • WmfDumpFileManager

      public WmfDumpFileManager(String projectName, DirectoryManager downloadDirectoryManager, WebResourceFetcher webResourceFetcher) throws IOException
      Constructor.
      Parameters:
      projectName - name of the project to obtain dumps for as used in the folder structure of the dump site, e.g., "wikidatawiki"
      downloadDirectoryManager - the directory manager for the directory where the download directory for dump files should be; it will be created if needed
      webResourceFetcher - the web resource fetcher to access web resources or null if no web access should happen
      Throws:
      IOException - if it was not possible to access the directory for managing dump files
  • Method Details

    • findAllRelevantRevisionDumps

      public List<MwDumpFile> findAllRelevantRevisionDumps(boolean preferCurrent)
      Finds all page revision dump files, online or locally, that are relevant to obtain the most current state of the data. Revision dump files are dumps that contain page revisions in MediaWiki's XML format.

      If the parameter preferCurrent is true, then dumps that contain only the current versions of all pages are preferred if available anywhere, even over previously downloaded dump files that contain all versions. However, dump files may still contain non-current revisions, and when processing multiple dumps there may even be overlaps (one revision occurring in multiple dumps).

      The result is ordered with the most recent dump first. If a dump file A contains revisions of a page P, and Rmax is the maximal revision of P in A, then every dump file that comes after A should contain only revisions of P that are smaller than or equal to Rmax. In other words, the maximal revision found in the first file that contains P at all should also be the maximal revision overall.

      Parameters:
      preferCurrent - should dumps with current revisions be preferred?
      Returns:
      an ordered list of all dump files that match the given criteria
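
      The ordering guarantee described above can be illustrated with a minimal, self-contained sketch. It does not use Wikidata Toolkit classes: DumpOrderSketch, Revision, and latestRevisions are hypothetical names, and each dump is modelled as a plain list of (page, revision) pairs. Under the stated guarantee, scanning the dumps most recent first and keeping the maximal revision from the first dump that mentions a page should yield the latest revision of every page.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; none of these types belong to Wikidata Toolkit. */
final class DumpOrderSketch {

    /** One (page, revision) pair as it might appear in a revision dump. */
    static final class Revision {
        final String page;
        final long revisionId;

        Revision(String page, long revisionId) {
            this.page = page;
            this.revisionId = revisionId;
        }
    }

    /**
     * Scans dumps in most-recent-first order. For each page, the maximal
     * revision of the first dump that contains the page is kept; later
     * (older) dumps are assumed to hold only revisions <= that value.
     */
    static Map<String, Long> latestRevisions(List<List<Revision>> dumpsMostRecentFirst) {
        Map<String, Long> latest = new LinkedHashMap<>();
        for (List<Revision> dump : dumpsMostRecentFirst) {
            // A single dump may contain several revisions of the same page.
            Map<String, Long> maxInDump = new LinkedHashMap<>();
            for (Revision r : dump) {
                maxInDump.merge(r.page, r.revisionId, Long::max);
            }
            // Pages already seen in a more recent dump are not overwritten.
            for (Map.Entry<String, Long> e : maxInDump.entrySet()) {
                latest.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Revision> recent = Arrays.asList(new Revision("Q1", 30), new Revision("Q1", 28));
        List<Revision> older = Arrays.asList(new Revision("Q1", 28), new Revision("Q2", 12));
        // Revision 28 of Q1 occurs in both dumps (an overlap); 30 still wins.
        System.out.println(latestRevisions(Arrays.asList(recent, older))); // {Q1=30, Q2=12}
    }
}
```

      A consumer that relies on this property can skip any revision of a page it has already seen in an earlier list entry, which also handles the overlaps mentioned above.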
    • findMostRecentDump

      public MwDumpFile findMostRecentDump(DumpContentType dumpContentType)
      Finds the most recent dump of the given type that is actually available.
      Parameters:
      dumpContentType - the type of the dump to look for
      Returns:
      the most recent dump of the given type, or null if no such dump exists
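
      The selection rule can be sketched as follows, assuming a list that is already ordered with the most recent dump first. MostRecentDumpSketch, Dump, and findMostRecentAvailable are hypothetical stand-ins rather than library types, and the yyyyMMdd date stamp is an assumption made only for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; none of these types belong to Wikidata Toolkit. */
final class MostRecentDumpSketch {

    /** Minimal stand-in for a dump file: a date stamp and an availability flag. */
    static final class Dump {
        final String dateStamp;   // assumed format: yyyyMMdd
        final boolean available;  // false if, e.g., the file is not fully written yet

        Dump(String dateStamp, boolean available) {
            this.dateStamp = dateStamp;
            this.available = available;
        }
    }

    /** Returns the first available dump of a most-recent-first list, or null if none is available. */
    static Dump findMostRecentAvailable(List<Dump> dumpsMostRecentFirst) {
        for (Dump dump : dumpsMostRecentFirst) {
            if (dump.available) {
                return dump;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Dump> dumps = Arrays.asList(new Dump("20240301", false), new Dump("20240201", true));
        Dump result = findMostRecentAvailable(dumps);
        // The newest entry is skipped because it is not available yet.
        System.out.println(result == null ? "none" : result.dateStamp); // 20240201
    }
}
```

      Because the method may return null, callers are expected to check the result before using it.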
    • findAllDumps

      public List<MwDumpFile> findAllDumps(DumpContentType dumpContentType)
      Returns a list of all dump files of the given type that are available either online or locally. For dumps available both online and locally, the local version is included. The list is ordered with the most recent dump date first. Online dumps found by this method might not be available yet (if their directory has been created online but the file has not yet been uploaded or completely written).
      Parameters:
      dumpContentType - the type of the dump to look for
      Returns:
      a list of dump files of the given type
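
      The merge behaviour described for this method (local copy preferred, most recent date first) can be sketched in a self-contained way. MergeDumpsSketch, Dump, and merge are hypothetical names, and the sketch assumes that dumps are identified by a yyyyMMdd date stamp whose string order matches chronological order; the actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration; none of these types belong to Wikidata Toolkit. */
final class MergeDumpsSketch {

    /** Minimal stand-in for a dump file: a date stamp and where it was found. */
    static final class Dump {
        final String dateStamp; // assumed format: yyyyMMdd
        final boolean local;    // true if already downloaded

        Dump(String dateStamp, boolean local) {
            this.dateStamp = dateStamp;
            this.local = local;
        }
    }

    /** Merges online and local dumps by date, preferring local copies, newest first. */
    static List<Dump> merge(List<Dump> online, List<Dump> local) {
        // Reverse natural order puts the latest yyyyMMdd stamp first.
        Map<String, Dump> byDate = new TreeMap<String, Dump>(Comparator.<String>reverseOrder());
        for (Dump dump : online) {
            byDate.put(dump.dateStamp, dump);
        }
        // Inserted second, so a local dump replaces an online dump of the same date.
        for (Dump dump : local) {
            byDate.put(dump.dateStamp, dump);
        }
        return new ArrayList<>(byDate.values());
    }

    public static void main(String[] args) {
        List<Dump> online = Arrays.asList(new Dump("20240301", false), new Dump("20240201", false));
        List<Dump> local = Arrays.asList(new Dump("20240201", true), new Dump("20240101", true));
        for (Dump dump : merge(online, local)) {
            System.out.println(dump.dateStamp + (dump.local ? " (local)" : " (online)"));
        }
    }
}
```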