A quick little tool for extracting sets of pages from a MediaWiki dump file.

Can read MediaWiki XML export dumps (version 0.3, minus uploads),
perform optional filtering, and output back to XML or to SQL statements
for adding pages directly to a database in the 1.4 or 1.5 schema.

Still very much under construction.

MIT-style license like our other Java/C# tools; boilerplate to be added.

Contains code from the Apache Commons Compress project for
cross-platform bzip2 input/output support (Apache License 2.0).

If strange XML errors are encountered under Java 1.4, try 1.5:
* http://java.sun.com/j2se/1.5.0/download.jsp
* http://www.apple.com/downloads/macosx/apple/java2se50release1.html

USAGE:

Sample command line for a direct database import:

  java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 |
    mysql -u <username> -p <databasename>

You can also do complex filtering to produce multiple output files:

  java -jar mwdumper.jar \
    --output=bzip2:pages_public.xml.bz2 \
    --format=xml \
    --filter=notalk \
    --filter=namespace:\!NS_USER \
    --filter=latest \
    --output=bzip2:pages_current.xml.bz2 \
    --format=xml \
    --filter=latest \
    --output=gzip:pages_full_1.5.sql.gz \
    --format=sql:1.5 \
    --output=gzip:pages_full_1.4.sql.gz \
    --format=sql:1.4 \
    pages_full.xml.gz

A bare parameter will be interpreted as a file to read XML input from;
if none is given, or "-" is given, input will be read from stdin.
Input files with ".gz" or ".bz2" extensions will be decompressed as
gzip and bzip2 streams, respectively.

Internal decompression of 7-zip .7z files is not yet supported; you can
pipe such files through p7zip's 7za:

  7za e -so pages_full.xml.7z |
    java -jar mwdumper.jar --format=sql:1.5 |
    mysql -u <username> -p <databasename>

Defaults if no parameters are given:
* read uncompressed XML from stdin
* write uncompressed XML to stdout
* no filtering

Output sinks:
  --output=stdout
      Send uncompressed XML or SQL output to stdout for piping.
      (May have charset issues.) This is the default if no output is
      specified.
  --output=file:<filename>
      Write uncompressed output to a file.
  --output=gzip:<filename>
      Write gzip-compressed output to a file.
  --output=bzip2:<filename>
      Write bzip2-compressed output to a file.
  --output=mysql:<url>
      Valid only for SQL format output; opens a connection to the MySQL
      server and sends commands to it directly. The URL will look
      something like:
        mysql://localhost/databasename?user=<username>&password=<password>

Output formats:
  --format=xml
      Output back to MediaWiki's XML export format; use this for
      filtering dumps for limited import. Output should be idempotent.
  --format=sql:1.4
      SQL statements formatted for bulk import in MediaWiki 1.4's schema.
  --format=sql:1.5
      SQL statements formatted for bulk import in MediaWiki 1.5's schema.

      Both SQL schema versions currently require that the table
      structure already be set up in an empty database; use
      maintenance/tables.sql from the MediaWiki distribution.

Filter actions:
  --filter=latest
      Skips all but the last revision listed for each page. FIXME:
      currently this pays no attention to the timestamp or revision
      number, but simply to the order of items in the dump. This may
      or may not be strictly correct.
  --filter=list:<list-filename>
      Excludes all pages whose titles do not appear in the given file.
      Use one title per line; blank lines and lines starting with # are
      ignored. Talk and subject pages of given titles are both matched.
      (A sample list file is shown after this section.)
  --filter=exactlist:<list-filename>
      As above, but does not try to match associated talk/subject pages.
  --filter=namespace:[!]<key,key,...>
      Includes only pages in (or, with "!", not in) the given
      namespaces. You can use the NS_* constant names or the raw
      numeric keys.
  --filter=notalk
      Excludes all talk pages from output (including custom namespaces).
  --filter=titlematch:<regex>
      Excludes all pages whose titles do not match the regex.
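For example, a list file for --filter=list or --filter=exactlist might
look like the following (the titles here are purely illustrative):

  # pages to extract
  Main Page
  Help:Contents

  Portal:Current events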
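Putting several of these options together, a hypothetical run (the
filenames are placeholders) that extracts only the latest revisions of
exactly the pages listed in titles.txt into a compressed XML dump:

  java -jar mwdumper.jar \
    --output=bzip2:pages_subset.xml.bz2 \
    --format=xml \
    --filter=exactlist:titles.txt \
    --filter=latest \
    pages_full.xml.bz2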
Misc options:
  --progress=<n>
      Change the progress reporting interval from the default of 1000
      revisions.
  --quiet
      Don't send any progress output to stderr.

PERFORMANCE TIPS:

To speed up importing into a database, you might try:
* Java's -server option may significantly increase performance on some
  versions of Sun's JVM for large files. (Not all installations will
  have this available.)
* Increase MySQL's innodb_log_file_size. The default is as little as
  5MB, but you can improve performance dramatically by increasing this
  to reduce the number of disk writes. (See the my-huge.cnf sample
  config.)
* If you don't need it, disable the binary log (log-bin option) during
  the import. On a standalone machine this is just wasteful, writing a
  second copy of every query that you'll never use.
* Various other wacky tips in the MySQL reference manual at
  http://dev.mysql.com/mysql/en/innodb-tuning.html

TODO:
* Add some more junit tests
* Include table initialization in SQL output
* Allow use of table prefixes in SQL output
* Ensure that titles and other bits are validated correctly
* Test XML input for robustness
* Provide filter to strip ID numbers
* <siteinfo> is technically optional; live without it and use default
  namespaces
* GUI frontend(s)
* Port to Python? ;)

Change history (abbreviated):

2005-10-25: Switched SqlWriter.sqlEscape back to less memory-hungry
            StringBuffer
2005-10-24: Fixed SQL output in non-UTF-8 locales
2005-10-21: Applied more speedup patches from Folke
2005-10-11: SQL direct connection, GUI work begins
2005-10-10: Applied speedup patches from Folke Behrens
2005-10-05: Use bulk inserts in SQL mode
2005-09-29: Converted from C# to Java
2005-08-27: Initial extraction code