Class WmfDumpFile

java.lang.Object
org.wikidata.wdtk.dumpfiles.wmf.WmfDumpFile
All Implemented Interfaces:
MwDumpFile
Direct Known Subclasses:
JsonOnlineDumpFile, WmfLocalDumpFile, WmfOnlineStandardDumpFile

public abstract class WmfDumpFile extends Object implements MwDumpFile
Abstract base class for dump files provided by the Wikimedia Foundation.
Author:
Markus Kroetzsch
  • Field Details

    • DUMP_SITE_BASE_URL

      protected static final String DUMP_SITE_BASE_URL
      The default URL of the website to obtain the dump files from.
    • dateStamp

      protected final String dateStamp
    • projectName

      protected final String projectName
  • Constructor Details

    • WmfDumpFile

      public WmfDumpFile(String dateStamp, String projectName)
  • Method Details

    • getProjectName

      public String getProjectName()
      Description copied from interface: MwDumpFile
      Returns the project name for this dump. Together with the dump content type and date stamp, this identifies the dump, and it is therefore always available.
      Specified by:
      getProjectName in interface MwDumpFile
      Returns:
      a project name string
    • getDateStamp

      public String getDateStamp()
      Description copied from interface: MwDumpFile
      Returns the date stamp for this dump. Together with the project name and dump content type, this identifies the dump, and it is therefore always available.
      Specified by:
      getDateStamp in interface MwDumpFile
      Returns:
      a string that represents a date in format YYYYMMDD
    • isAvailable

      public boolean isAvailable()
      Description copied from interface: MwDumpFile
      Checks if the dump is actually available. Should be called before MwDumpFile.getDumpFileReader(). Depending on the type of dump file, this triggers one or more checks to make sure that all relevant data can be accessed. This does not guarantee that the download will succeed, since IO errors can always occur, but it helps to detect cases where the dump is clearly not in a usable state.
      Specified by:
      isAvailable in interface MwDumpFile
      Returns:
      true if the dump file is likely to be available
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • getDumpFileReader

      public BufferedReader getDumpFileReader() throws IOException
      Description copied from interface: MwDumpFile
      Returns a buffered reader that provides access to the (uncompressed) text content of the dump file.

      It is important to close the reader after use.

      Specified by:
      getDumpFileReader in interface MwDumpFile
      Returns:
      a buffered reader to read the dump file
      Throws:
      IOException - if the dump file contents could not be accessed
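
      The calling pattern described for isAvailable() and getDumpFileReader() can be sketched as follows. This is only an illustration: DumpSource is a hypothetical stand-in that mirrors the two documented methods, and DumpReadingSketch, countLines and inMemoryDump are names invented for the example, not part of the library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical stand-in for the two MwDumpFile methods used here;
// the real interface declares more methods.
interface DumpSource {
    boolean isAvailable();

    BufferedReader getDumpFileReader() throws IOException;
}

class DumpReadingSketch {

    /** Counts the lines of a dump, or returns -1 if it is not available. */
    static long countLines(DumpSource dump) throws IOException {
        if (!dump.isAvailable()) {
            return -1;
        }
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = dump.getDumpFileReader()) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }

    /** Builds an in-memory dump for demonstration purposes only. */
    static DumpSource inMemoryDump(String content, boolean available) {
        return new DumpSource() {
            @Override
            public boolean isAvailable() {
                return available;
            }

            @Override
            public BufferedReader getDumpFileReader() {
                return new BufferedReader(new StringReader(content));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines(inMemoryDump("a\nb\nc\n", true)));
        System.out.println(countLines(inMemoryDump("", false)));
    }
}
```

      Because isAvailable() is only a likelihood check, the caller still has to handle an IOException from the reader.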
    • fetchIsDone

      protected abstract boolean fetchIsDone()
      Finds out if the dump is ready. For online dumps, this should return true if the file can be fetched from the Web. For local dumps, this should return true if the file is complete and not corrupted. For some types of dumps, there are ways of checking this easily (i.e., without reading the full file). If this is not possible, then the method should just return true.
      Returns:
      true if the dump is done
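
      A subclass might implement this hook with a cheap check that avoids reading the whole file. The sketch below is an assumption, not how the WDTK subclasses actually do it: DumpSketch is a hypothetical stand-in for the abstract base class, and LocalDumpSketch treats any existing, non-empty regular file as done.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the abstract hook; not the WDTK class itself.
abstract class DumpSketch {
    protected abstract boolean fetchIsDone();
}

// A local dump that treats an existing, non-empty regular file as "done".
class LocalDumpSketch extends DumpSketch {
    private final Path file;

    LocalDumpSketch(Path file) {
        this.file = file;
    }

    @Override
    protected boolean fetchIsDone() {
        try {
            return Files.isRegularFile(file) && Files.size(file) > 0;
        } catch (IOException e) {
            // The size could not be determined, so treat the dump as not ready.
            return false;
        }
    }
}
```

      A check like this does not detect a truncated or corrupted file; it only rules out the obvious cases of a missing or empty one.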
    • getDumpFilePostfix

      public static String getDumpFilePostfix(DumpContentType dumpContentType)
      Returns the ending used by the Wikimedia-provided dumpfile names of the given type.
      Parameters:
      dumpContentType - the type of dump
      Returns:
      postfix of the dumpfile name
      Throws:
      IllegalArgumentException - if the given dump file type is not known
    • getDumpFileWebDirectory

      public static String getDumpFileWebDirectory(DumpContentType dumpContentType, String projectName)
      Returns the absolute directory on the Web site where dumpfiles of the given type can be found.
      Parameters:
      dumpContentType - the type of dump
      projectName - the project name, e.g. "wikidatawiki"
      Returns:
      absolute web directory for the current dumpfiles
      Throws:
      IllegalArgumentException - if the given dump file type is not known
    • getDumpFileCompressionType

      public static CompressionType getDumpFileCompressionType(String fileName)
      Returns the compression type of a dump file, determined from the suffix of its file name.
      Parameters:
      fileName - the name of the file
      Returns:
      compression type
      Throws:
      IllegalArgumentException - if the given dump file type is not known
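
      Suffix-based detection of this kind can be sketched as follows. The enum CompressionKind, its constants, and the suffixes ".gz" and ".bz2" are assumptions made for the example; this page does not list the constants of CompressionType. The sketch also treats every other suffix as uncompressed, which may differ from the documented exception behaviour.

```java
// Hypothetical enum; the constants of the real CompressionType are not listed on this page.
enum CompressionKind { NONE, GZIP, BZ2 }

class CompressionSketch {

    /** Maps a file name to a compression kind by looking only at its suffix. */
    static CompressionKind fromFileName(String fileName) {
        if (fileName.endsWith(".gz")) {
            return CompressionKind.GZIP;
        }
        if (fileName.endsWith(".bz2")) {
            return CompressionKind.BZ2;
        }
        // Assumption: anything else is treated as uncompressed.
        return CompressionKind.NONE;
    }
}
```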
    • getDumpFileDirectoryName

      public static String getDumpFileDirectoryName(DumpContentType dumpContentType, String dateStamp)
      Returns the name of the directory where the dumpfile of the given type and date should be stored.
      Parameters:
      dumpContentType - the type of the dump
      dateStamp - the date of the dump in format YYYYMMDD
      Returns:
      the local directory name for the dumpfile
    • getDateStampFromDumpFileDirectoryName

      public static String getDateStampFromDumpFileDirectoryName(DumpContentType dumpContentType, String directoryName)
      Extracts the date stamp from a dumpfile directory name in the form that is created by getDumpFileDirectoryName(DumpContentType, String). It is not checked that the given directory name has the right format; if it does not, the result will not be a date stamp but some other string.
      Parameters:
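      dumpContentType - the type of the dump
      directoryName - the directory name, in the form created by getDumpFileDirectoryName(DumpContentType, String)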
      Returns:
      the date stamp
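
      The round trip between the two directory-name methods, and the effect of skipping format validation, can be illustrated with a small sketch. The layout "<type>-<dateStamp>" and the names DirectoryNameSketch, directoryName and dateStampFrom are assumptions for the example; the actual directory naming scheme is not given on this page.

```java
class DirectoryNameSketch {

    // Assumed layout: "<type>-<dateStamp>", e.g. "json-20240101".
    static String directoryName(String typeName, String dateStamp) {
        return typeName + "-" + dateStamp;
    }

    // Strips the assumed "<type>-" prefix without validating the input.
    static String dateStampFrom(String typeName, String directoryName) {
        return directoryName.substring(typeName.length() + 1);
    }

    public static void main(String[] args) {
        String dir = directoryName("json", "20240101");
        // Round trip on a well-formed name recovers the date stamp.
        System.out.println(dateStampFrom("json", dir));
        // A name in another format yields some other string, not a date stamp.
        System.out.println(dateStampFrom("json", "backup-old"));
    }
}
```

      In this sketch, a caller that cannot trust the directory name would have to validate the result itself, for example by checking that it consists of eight digits.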
    • getDumpFileName

      public static String getDumpFileName(DumpContentType dumpContentType, String projectName, String dateStamp)
      Returns the file name of the dump file with the given type, project, and date. This is the name used online and also locally when downloading the file.
      Parameters:
      dumpContentType - the type of the dump
      projectName - the project name, e.g. "wikidatawiki"
      dateStamp - the date of the dump in format YYYYMMDD
      Returns:
      file name string
    • isRevisionDumpFile

      public static boolean isRevisionDumpFile(DumpContentType dumpContentType)
      Returns true if the given dump file type contains page revisions and false if it does not. Dumps that do not contain page revisions provide auxiliary information such as linked sites.
      Parameters:
      dumpContentType - the type of dump
      Returns:
      true if the dumpfile contains revisions
      Throws:
      IllegalArgumentException - if the given dump file type is not known