RevisionChest is a high-performance Rust utility designed to extract and compress revision history from Wikipedia XML dumps. It supports .bz2 and .7z compressed XML dumps and outputs the extracted data in a custom .mwrev.zst format using Zstandard compression.
- Format Support: Processes Wikipedia XML dumps compressed with bzip2 (.bz2) or 7-Zip (.7z). Note that bz2 has better performance due to support for streaming decompression.
- Parallel Processing: Efficiently processes multiple dump files in parallel using the rayon library.
- Zstandard Compression: Outputs revisions in a space-efficient .zst format.
- Metadata Extraction: Captures key revision metadata including page ID, namespace, revision ID, parent ID, timestamp, and contributor ID.
- Flexible Metadata Storage: Store metadata in a SQLite database, a PostgreSQL database, or export it to a high-performance Parquet file.
Ensure you have Rust and Cargo installed. Clone the repository and build the project:
cargo build --release
The binary will be available at target/release/RevisionChest.
RevisionChest can be used to process a single file or an entire directory of dump files.
To process a single Wikipedia dump file and output to standard output:
./target/release/RevisionChest build <path_to_dump.xml.bz2_or_7z>
To process a single file and save the output to a specific directory:
./target/release/RevisionChest build <path_to_dump.xml.bz2> -o <output_directory>
To process all .bz2 and .7z files in a directory in parallel:
./target/release/RevisionChest build -d <input_directory> -o <output_directory>
When using the -d (directory) flag, the -o (output directory) flag is required. Each input file will generate a corresponding .mwrev.zst file in the output directory.
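The one-to-one mapping from input dumps to output archives can be sketched as follows. This is only an illustration: the README does not state the exact naming rule, so the helper `output_path` and its behavior (dropping the `.bz2` or `.7z` suffix, then an optional `.xml`, and appending `.mwrev.zst`) are assumptions, as is the sample file name.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: derives the archive path for one input dump.
/// The naming rule here is an assumption, not RevisionChest's documented behavior.
fn output_path(input: &Path, output_dir: &Path) -> Option<PathBuf> {
    let name = input.file_name()?.to_str()?;
    // Only .bz2 and .7z inputs are supported; anything else yields None.
    let stem = name
        .strip_suffix(".bz2")
        .or_else(|| name.strip_suffix(".7z"))?;
    // Drop a trailing .xml if present so the result is not "x.xml.mwrev.zst".
    let stem = stem.strip_suffix(".xml").unwrap_or(stem);
    Some(output_dir.join(format!("{}.mwrev.zst", stem)))
}

fn main() {
    let out = output_path(Path::new("dumps/enwiki-history1.xml.bz2"), Path::new("out"));
    println!("{:?}", out);
}
```

Returning `Option` lets a directory scan skip unsupported files without treating them as errors.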
RevisionChest provides several ways to store or export metadata.
By default, metadata is stored in a SQLite database named index.db in your output directory:
./target/release/RevisionChest build -d <input_dir> -o <output_dir> --db index.db
For high-performance processing without the overhead of a database, you can export metadata to a single Parquet file:
./target/release/RevisionChest build -d <input_dir> -o <output_dir> --parquet metadata.parquet
The --parquet flag automatically prevents the creation of the default SQLite file (implies --no-db).
To use PostgreSQL, set the following environment variables in your .env file:
DATABASE=postgres
DB_HOST=localhost
DB_PORT=5432
DB_NAME=revision_chest
DB_USER=your_user
DB_PASS=your_password
If you only need the .mwrev.zst files and don't want to save metadata:
./target/release/RevisionChest build -d <input_dir> -o <output_dir> --no-db
The sync command allows you to fetch the most recent Wikipedia edits and update your local index and data files. It bridges the gap between static XML dumps and live data.
To comply with Wikipedia's API policy, you must define a User-Agent in a .env file in the project root. You can use example.env as a template:
UA_APP_NAME=RevisionChest
UA_APP_VERSION=0.1.0
UA_CONTACT_INFO=https://github.com/yourusername/RevisionChest
To sync recent changes for a specific domain:
./target/release/RevisionChest sync --domain en.wikipedia.org -o <output_directory>
You can use a Parquet file to determine the starting timestamp and store metadata for new revisions. This is useful for large-scale processing without a database:
./target/release/RevisionChest sync --domain en.wikipedia.org -o <output_dir> --parquet metadata.parquet
If metadata.parquet exists, RevisionChest will read the latest timestamp from it to resume the sync. New metadata will be written to the same Parquet file at the end of the sync cycle.
Key Features & Constraints:
- Automatic Resume: The command looks up the most recent timestamp in your database (or Parquet file if --parquet is used) and fetches edits starting from that point (minus 24 hours to ensure overlap).
- 30-Day Limit: If the most recent revision in your database is older than 30 days, the sync will fail, as Wikipedia's Recent Changes API only retains data for 30 days.
- Daily Rotation: New revisions are stored in a recentchanges/ subdirectory within your output folder, organized by date (e.g., recentchanges/2026-05-26.mwrev.zst).
- Incremental Appends: If a file for a specific day already exists, the tool appends new data as a new Zstandard frame, preserving the validity of the archive.
- Namespace Filtering: You can restrict the sync to specific namespaces:
./target/release/RevisionChest sync --domain en.wikipedia.org -o <output_directory> --namespace 0,118
- Ongoing Sync: Keep the tool running and poll for new changes periodically:
./target/release/RevisionChest sync --domain en.wikipedia.org -o <output_directory> --ongoing --interval 5
The interval defaults to 10 minutes and can also be set via SYNC_INTERVAL in .env.
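The resume, retention, rotation, and namespace rules listed above can be sketched in a few lines of Rust. This is a minimal illustration, not RevisionChest's actual code: the function names (`sync_start`, `daily_file`, `parse_namespaces`), the use of Unix seconds, and the assumption that timestamps are ISO 8601 strings beginning with `YYYY-MM-DD` are all hypothetical.

```rust
use std::path::{Path, PathBuf};

const DAY_SECS: i64 = 24 * 60 * 60;
const OVERLAP_SECS: i64 = DAY_SECS; // re-fetch the last 24 hours
const RETENTION_SECS: i64 = 30 * DAY_SECS; // Recent Changes retention window

/// Returns the Unix time to start fetching from, or an error when the
/// newest stored revision is older than the 30-day retention window.
fn sync_start(latest: i64, now: i64) -> Result<i64, String> {
    if now - latest > RETENTION_SECS {
        return Err("latest revision is older than 30 days".to_string());
    }
    Ok(latest - OVERLAP_SECS)
}

/// Builds the daily archive path, assuming the timestamp starts with YYYY-MM-DD.
fn daily_file(output_dir: &Path, timestamp: &str) -> Option<PathBuf> {
    let date = timestamp.get(..10)?;
    Some(output_dir.join("recentchanges").join(format!("{}.mwrev.zst", date)))
}

/// Parses a comma-separated namespace list such as "0,118".
fn parse_namespaces(arg: &str) -> Result<Vec<i32>, std::num::ParseIntError> {
    arg.split(',').map(|s| s.trim().parse::<i32>()).collect()
}

fn main() {
    let now = 1_800_000_000; // hypothetical current Unix time
    let latest = now - 2 * DAY_SECS;
    println!("{:?}", sync_start(latest, now));
    println!("{:?}", daily_file(Path::new("out"), "2026-05-26T12:34:56Z"));
    println!("{:?}", parse_namespaces("0,118"));
}
```

Because the start point is moved back by 24 hours, some revisions will be fetched twice, so a real implementation would presumably also need to skip revision IDs it has already stored.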
The output .mwrev.zst files contain revisions in the following format:
# page_id=... ns=... rev_id=... parent_rev_id=... timestamp=... user_id=...
 <line of text>
 <line of text>
...
- Each revision starts with a header line beginning with #.
- The actual revision text follows, with each line prefixed by a single space.
- The output is compressed using Zstandard.
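A reader for this format, applied to text that has already been decompressed, could look like the sketch below. It uses only the standard library and assumes the header is a whitespace-separated list of `key=value` pairs, as the template above suggests. The `Revision` struct, the `parse_revisions` function, and the sample values in `main` are illustrative and not part of RevisionChest. The single-space prefix means a text line that itself begins with `#` is never mistaken for a header.

```rust
/// One revision parsed from decompressed .mwrev.zst text (illustrative type).
#[derive(Debug, PartialEq)]
struct Revision {
    /// key=value pairs from the header line, in order of appearance.
    header: Vec<(String, String)>,
    /// Text lines with the one-space prefix removed.
    lines: Vec<String>,
}

/// Splits decompressed archive text into revisions.
fn parse_revisions(input: &str) -> Vec<Revision> {
    let mut revisions: Vec<Revision> = Vec::new();
    for line in input.lines() {
        if let Some(rest) = line.strip_prefix('#') {
            // A header line opens a new revision.
            let header = rest
                .split_whitespace()
                .filter_map(|pair| pair.split_once('='))
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect();
            revisions.push(Revision { header, lines: Vec::new() });
        } else if let Some(current) = revisions.last_mut() {
            // Strip exactly one leading space; keep any further indentation.
            let body = line.strip_prefix(' ').unwrap_or(line);
            current.lines.push(body.to_string());
        }
        // Lines before the first header are ignored.
    }
    revisions
}

fn main() {
    // Made-up sample values; real archives come from the build or sync commands.
    let sample = "# page_id=1 ns=0 rev_id=10 parent_rev_id=9 timestamp=2026-05-26T00:00:00Z user_id=7\n Hello\n # not a header\n";
    for rev in parse_revisions(sample) {
        println!("{:?}", rev);
    }
}
```

To read a real archive, the decompressed text would first have to be obtained from the file, for example with the zstd crate listed under the dependencies.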
- bzip2: For decompressing .bz2 files.
- sevenz-rust: For decompressing .7z files.
- quick-xml: For fast XML parsing.
- zstd: For Zstandard compression.
- rayon: For parallel data processing.
- clap: For command-line argument parsing.
- chrono: For logging timestamps.
- parquet: For high-performance metadata export.
- arrow: For columnar data representation.
- rusqlite: For SQLite database support.
- postgres: For PostgreSQL database support.