internetarchive/RevisionChest

RevisionChest

RevisionChest is a high-performance Rust utility designed to extract and compress revision history from Wikipedia XML dumps. It supports .bz2 and .7z compressed XML dumps and outputs the extracted data in a custom .mwrev.zst format using Zstandard compression.

Features

  • Format Support: Processes Wikipedia XML dumps compressed with bzip2 (.bz2) or 7-Zip (.7z). Prefer .bz2 input where available: it performs better because it supports streaming decompression.
  • Parallel Processing: Efficiently processes multiple dump files in parallel using the rayon library.
  • Zstandard Compression: Writes revisions to space-efficient .mwrev.zst files compressed with Zstandard.
  • Metadata Extraction: Captures key revision metadata including page ID, namespace, revision ID, parent ID, timestamp, and contributor ID.
  • Flexible Metadata Storage: Store metadata in a SQLite database, a PostgreSQL database, or export it to a high-performance Parquet file.

Installation

Ensure you have Rust and Cargo installed. Clone the repository and build the project:

cargo build --release

The binary will be available at target/release/RevisionChest.

Usage

RevisionChest can be used to process a single file or an entire directory of dump files.

Processing a Single File

To process a single Wikipedia dump file and output to standard output:

./target/release/RevisionChest build <path_to_dump.xml.bz2_or_7z>

To process a single file and save the output to a specific directory:

./target/release/RevisionChest build <path_to_dump.xml.bz2> -o <output_directory>

Processing an Entire Directory

To process all .bz2 and .7z files in a directory in parallel:

./target/release/RevisionChest build -d <input_directory> -o <output_directory>

When using the -d (directory) flag, the -o (output directory) flag is required. Each input file will generate a corresponding .mwrev.zst file in the output directory.

Metadata Storage Options

RevisionChest provides several ways to store or export metadata.

SQLite (Default)

By default, metadata is stored in a SQLite database named index.db in your output directory:

./target/release/RevisionChest build -d <input_dir> -o <output_dir> --db index.db

Parquet Export

For high-performance processing without the overhead of a database, you can export metadata to a single Parquet file:

./target/release/RevisionChest build -d <input_dir> -o <output_dir> --parquet metadata.parquet

The --parquet flag automatically prevents the creation of the default SQLite file (implies --no-db).

PostgreSQL

To use PostgreSQL, set the following environment variables in your .env file:

DATABASE=postgres
DB_HOST=localhost
DB_PORT=5432
DB_NAME=revision_chest
DB_USER=your_user
DB_PASS=your_password
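
As an illustration of how these variables could map onto a connection, the sketch below assembles a libpq-style key=value connection string of the kind the postgres crate accepts. This is not taken from the RevisionChest source: the function names pg_connection_string and quote_value are hypothetical, and the variable-to-keyword mapping is an assumption based on the names above.

```rust
use std::collections::HashMap;

/// Quote a value for a key=value connection string: wrap it in single
/// quotes and backslash-escape any backslash or single quote inside it.
fn quote_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('\'');
    for ch in value.chars() {
        if ch == '\\' || ch == '\'' {
            out.push('\\');
        }
        out.push(ch);
    }
    out.push('\'');
    out
}

/// Build a connection string from the DB_* variables (hypothetical helper).
/// Returns an error naming the first variable that is missing.
fn pg_connection_string(vars: &HashMap<String, String>) -> Result<String, String> {
    // (environment variable, connection-string keyword); the mapping is assumed.
    let mapping = [
        ("DB_HOST", "host"),
        ("DB_PORT", "port"),
        ("DB_NAME", "dbname"),
        ("DB_USER", "user"),
        ("DB_PASS", "password"),
    ];
    let mut parts = Vec::new();
    for &(var, key) in mapping.iter() {
        let value = vars
            .get(var)
            .ok_or_else(|| format!("missing {}", var))?;
        parts.push(format!("{}={}", key, quote_value(value)));
    }
    Ok(parts.join(" "))
}
```

Quoting every value keeps passwords that contain spaces or quotes intact. With the sample values above, the result would be host='localhost' port='5432' dbname='revision_chest' user='your_user' password='your_password'.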

Disabling Metadata Storage

If you only need the .mwrev.zst files and don't want to save metadata:

./target/release/RevisionChest build -d <input_dir> -o <output_dir> --no-db

Synchronizing Recent Changes

The sync command allows you to fetch the most recent Wikipedia edits and update your local index and data files. It bridges the gap between static XML dumps and live data.

Configuration

To comply with the Wikimedia User-Agent policy, you must define a User-Agent in a .env file in the project root. You can use example.env as a template:

UA_APP_NAME=RevisionChest
UA_APP_VERSION=0.1.0
UA_CONTACT_INFO=https://github.com/yourusername/RevisionChest
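
For orientation, the Wikimedia User-Agent policy recommends a header of the form name/version (contact). The sketch below shows one way the three values could be combined into that shape. How RevisionChest actually assembles the header is not documented here, so the function name build_user_agent and the exact output format are assumptions.

```rust
/// Combine the three UA_* values into a `name/version (contact)` string.
/// Hypothetical helper; rejects empty values so a blank .env entry is
/// caught before any request is sent.
fn build_user_agent(name: &str, version: &str, contact: &str) -> Result<String, String> {
    let fields = [
        ("UA_APP_NAME", name),
        ("UA_APP_VERSION", version),
        ("UA_CONTACT_INFO", contact),
    ];
    for &(label, value) in fields.iter() {
        if value.trim().is_empty() {
            return Err(format!("{} must not be empty", label));
        }
    }
    Ok(format!("{}/{} ({})", name.trim(), version.trim(), contact.trim()))
}
```

With the template values above, this would yield RevisionChest/0.1.0 (https://github.com/yourusername/RevisionChest).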

Usage

To sync recent changes for a specific domain:

./target/release/RevisionChest sync --domain en.wikipedia.org -o <output_directory>

Parquet-based Sync

You can use a Parquet file to determine the starting timestamp and store metadata for new revisions. This is useful for large-scale processing without a database:

./target/release/RevisionChest sync --domain en.wikipedia.org -o <output_dir> --parquet metadata.parquet

If metadata.parquet exists, RevisionChest will read the latest timestamp from it to resume the sync. New metadata will be written to the same Parquet file at the end of the sync cycle.

Key Features & Constraints:

  • Automatic Resume: The command looks up the most recent timestamp in your database (or Parquet file if --parquet is used) and fetches edits starting from that point (minus 24 hours to ensure overlap).
  • 30-Day Limit: If the most recent revision in your database is more than 30 days old, the sync fails, because Wikipedia's Recent Changes API retains only 30 days of data.
  • Daily Rotation: New revisions are stored in a recentchanges/ subdirectory within your output folder, organized by date (e.g., recentchanges/2026-05-26.mwrev.zst).
  • Incremental Appends: If a file for a specific day already exists, the tool appends new data as a new Zstandard frame, preserving the validity of the archive.
  • Namespace Filtering: You can restrict the sync to specific namespaces:
    ./target/release/RevisionChest sync --domain en.wikipedia.org -o <output_directory> --namespace 0,118
  • Ongoing Sync: Keep the tool running and poll for new changes periodically:
    ./target/release/RevisionChest sync --domain en.wikipedia.org -o <output_directory> --ongoing --interval 5
    The interval defaults to 10 minutes and can also be set via SYNC_INTERVAL in .env.
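
The resume, 30-day, and daily-rotation rules above can be sketched with plain integer arithmetic. This is a minimal illustration, not the tool's implementation. It assumes timestamps are Unix seconds in UTC and that a revision's own timestamp selects the daily file. The names sync_start, civil_from_unix, and daily_file are hypothetical.

```rust
use std::path::{Path, PathBuf};

const DAY_SECS: i64 = 86_400;

/// Decide where a sync should start, given the newest stored timestamp.
/// Fails if that timestamp is more than 30 days old; otherwise backs up
/// 24 hours so the new fetch overlaps the data already stored.
fn sync_start(latest: i64, now: i64) -> Result<i64, String> {
    let age = now - latest;
    if age > 30 * DAY_SECS {
        return Err(format!(
            "latest revision is {} days old; recent changes only cover 30 days",
            age / DAY_SECS
        ));
    }
    Ok(latest - DAY_SECS)
}

/// Convert Unix seconds to a (year, month, day) UTC calendar date using
/// the well-known days-to-civil algorithm for the Gregorian calendar.
fn civil_from_unix(secs: i64) -> (i64, i64, i64) {
    let z = secs.div_euclid(DAY_SECS) + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z - era * 146_097; // day of 400-year era
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Path of the daily file for a timestamp, following the
/// recentchanges/YYYY-MM-DD.mwrev.zst layout described above.
fn daily_file(output_dir: &Path, secs: i64) -> PathBuf {
    let (y, m, d) = civil_from_unix(secs);
    output_dir
        .join("recentchanges")
        .join(format!("{:04}-{:02}-{:02}.mwrev.zst", y, m, d))
}
```

Refetching the last 24 hours means some revisions may be seen twice, so a consumer of this logic would likely need to deduplicate by revision ID. Whether RevisionChest does so is not stated here.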

Output Format

The output .mwrev.zst files contain revisions in the following format:

# page_id=... ns=... rev_id=... parent_rev_id=... timestamp=... user_id=...
 <line of text>
 <line of text>
...

  • Each revision starts with a header line beginning with #.
  • The actual revision text follows, with each line prefixed by a single space.
  • The output is compressed using Zstandard.
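
Because every text line carries a one-space prefix, a line that starts with # is always a header, which makes the decompressed stream easy to split. The sketch below parses already-decompressed text under that reading. The Revision struct and parse_revisions function are hypothetical, and the sketch assumes header values contain no whitespace.

```rust
use std::collections::HashMap;

/// One revision: header fields kept as strings, plus its text lines.
#[derive(Debug, Default, PartialEq)]
struct Revision {
    /// key=value pairs from the header line (page_id, ns, rev_id, ...).
    fields: HashMap<String, String>,
    /// Text lines with the one-space prefix removed.
    lines: Vec<String>,
}

/// Split decompressed .mwrev text into revisions.
/// Lines that are neither a header nor space-prefixed are ignored,
/// as are text lines that appear before the first header.
fn parse_revisions(input: &str) -> Vec<Revision> {
    let mut revisions: Vec<Revision> = Vec::new();
    for line in input.lines() {
        if let Some(header) = line.strip_prefix('#') {
            // A new header starts a new revision.
            let mut rev = Revision::default();
            for pair in header.split_whitespace() {
                if let Some((key, value)) = pair.split_once('=') {
                    rev.fields.insert(key.to_string(), value.to_string());
                }
            }
            revisions.push(rev);
        } else if let Some(text) = line.strip_prefix(' ') {
            // Text belongs to the most recent header.
            if let Some(rev) = revisions.last_mut() {
                rev.lines.push(text.to_string());
            }
        }
    }
    revisions
}
```

Only the first space is stripped, so wikitext lines that themselves begin with # (numbered lists, for example) or with spaces come back unchanged.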

Dependencies

  • bzip2: For decompressing .bz2 files.
  • sevenz-rust: For decompressing .7z files.
  • quick-xml: For fast XML parsing.
  • zstd: For Zstandard compression.
  • rayon: For parallel data processing.
  • clap: For command-line argument parsing.
  • chrono: For logging timestamps.
  • parquet: For high-performance metadata export.
  • arrow: For columnar data representation.
  • rusqlite: For SQLite database support.
  • postgres: For PostgreSQL database support.

About

Transforms Wikipedia XML dumps into a more compact, stream-friendly format
