feat(exporter): add content-app download log exporter #1258

Merged
decko merged 2 commits into pulp:main from decko:feat/content-download-exporter-wt
Jun 16, 2026

@decko decko commented Jun 10, 2026

Member

Summary

  • New export-content-logs CLI command that exports Python (.whl/.whl.metadata) and RPM (.rpm) download logs from content-app (pulp-content) CloudWatch streams to Parquet
  • Parses package metadata from filenames: wheel regex (from pip/pulp_python) for Python, NEVRA parser (from pulp_rpm) for RPM
  • Type-specific Parquet schemas with common fields (timestamp, domain, distribution, package_name, package_version, architecture, artifact_size, cache_hit, org_id) plus Python-specific (build_tag, pyver, abi) or RPM-specific (epoch, release)
  • Fixes timezone bug in existing parse_time() (utcnow → timezone-aware) and switches to logGroupNames (plural) for JSON-structured log support
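
To illustrate the filename parsing mentioned in the summary, here is a minimal sketch that splits a wheel filename into the Python-specific fields. The regex and the `parse_wheel_filename` name are assumptions based on the standard wheel naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag); they are not the pattern this PR borrows from pip/pulp_python.

```python
import re
from typing import Optional

# Hypothetical pattern modeled on the wheel naming convention:
#   {name}-{version}[-{build tag}]-{python tag}-{abi tag}-{platform tag}.whl
# A build tag, when present, must start with a digit.
WHEEL_RE = re.compile(
    r"^(?P<name>[^\s-]+)-(?P<version>[^\s-]+)"
    r"(?:-(?P<build_tag>\d[^\s-]*))?"
    r"-(?P<pyver>[^\s-]+)-(?P<abi>[^\s-]+)-(?P<platform>[^\s-]+)"
    r"\.whl(?:\.metadata)?$"
)


def parse_wheel_filename(filename: str) -> Optional[dict]:
    """Return the wheel fields as a dict, or None if the name is malformed."""
    match = WHEEL_RE.match(filename)
    if match is None:
        return None
    return match.groupdict()
```

Returning `None` instead of raising lets the caller count the record as a malformed filename and keep going.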

Test plan

  • 36 unit tests passing (31 new + 5 existing, no regressions)
  • Live-tested against prod CloudWatch: 4,138 Python downloads and 3,742 RPM downloads exported in a 10-minute window
  • Deploy CronJob config (follow-up: PULP-1831)

Jira: PULP-1811
Epic: PULP-1809

🤖 Generated with Claude Code

Summary by Sourcery

Add a new CLI command for exporting pulp-content download logs from CloudWatch to Parquet, with typed schemas for Python wheels and RPMs, and improve time handling and CloudWatch querying.

New Features:

  • Introduce an export-content-logs CLI entrypoint to export pulp-content download logs for Python wheels and RPMs to Parquet with S3 support.
  • Add parsing of content-app access log lines and content paths to extract package metadata and request details for Python and RPM artifacts.
  • Define dedicated PyArrow schemas and conversion logic for Python and RPM download records.
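
For the RPM side, a NEVRA-style split can be sketched as follows. This is a simplified illustration that splits from the right, assuming filenames of the form `name-[epoch:]version-release.arch.rpm`; the `parse_rpm_filename` name is hypothetical and this is not the pulp_rpm parser the PR uses.

```python
from typing import Optional


def parse_rpm_filename(filename: str) -> Optional[dict]:
    """Split an RPM filename into name, epoch, version, release, and arch."""
    if not filename.endswith(".rpm"):
        return None
    stem = filename[: -len(".rpm")]

    # Split from the right because the package name may itself contain "-".
    nvr, dot, arch = stem.rpartition(".")
    if not dot:
        return None
    name_version, dash, release = nvr.rpartition("-")
    if not dash:
        return None
    name, dash, version = name_version.rpartition("-")
    if not dash or not name:
        return None

    # The epoch is optional and, when present, prefixes the version with ":".
    epoch = None
    if ":" in version:
        epoch, version = version.split(":", 1)
    if not (version and release and arch):
        return None
    return {
        "name": name,
        "epoch": epoch,
        "version": version,
        "release": release,
        "arch": arch,
    }
```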

Bug Fixes:

  • Make parse_time return timezone-aware UTC datetimes instead of naive utcnow values.
  • Switch CloudWatch Logs querying to use logGroupNames to support structured log groups.
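
The timezone fix matters because calling `.timestamp()` on a naive `datetime` interprets it as local time, so a value produced by `utcnow()` yields a shifted epoch on any non-UTC host. The sketch below shows the timezone-aware pattern; `relative_time_utc` and `to_epoch_seconds` are hypothetical helper names, and the real `parse_time` signature is not shown in this PR page.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def relative_time_utc(minutes_ago: int, now: Optional[datetime] = None) -> datetime:
    """Return an aware UTC datetime `minutes_ago` minutes before `now`."""
    # datetime.now(timezone.utc) is aware; datetime.utcnow() would be naive.
    now = now or datetime.now(timezone.utc)
    return now - timedelta(minutes=minutes_ago)


def to_epoch_seconds(dt: datetime) -> int:
    """Convert an aware datetime to epoch seconds, rejecting naive values."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: epoch conversion would assume local time")
    return int(dt.timestamp())
```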

Tests:

  • Add unit tests covering content log parsing, filename parsing for wheels and RPMs, content-type filtering, schema-based Parquet conversion, and CloudWatch query building.
  • Extend test fixtures with sample content CloudWatch results and Parquet paths for the new exporter.

sourcery-ai Bot commented Jun 10, 2026

Contributor

Reviewer's Guide

Adds a new export-content-logs CLI entrypoint that queries CloudWatch Logs Insights for pulp-content download logs, parses Python wheel and RPM filenames into typed PyArrow schemas, and writes them to Parquet, while also fixing time parsing and CloudWatch query configuration issues.

Sequence diagram for export-content-logs CLI content export flow

sequenceDiagram
    actor User
    participant content_main
    participant parse_time
    participant build_content_query
    participant fetch_cloudwatch_logs
    participant convert_content_to_arrow_table
    participant write_parquet

    User->>content_main: export-content-logs
    content_main->>parse_time: parse_time(start_time)
    content_main->>parse_time: parse_time(end_time)
    content_main->>build_content_query: build_content_query()
    build_content_query-->>content_main: query
    content_main->>fetch_cloudwatch_logs: fetch_cloudwatch_logs(log_group, query, start_time, end_time, region)
    fetch_cloudwatch_logs-->>content_main: results
    alt results empty
        content_main-->>User: "No logs found"
    else results present
        content_main->>convert_content_to_arrow_table: convert_content_to_arrow_table(results, content_type)
        convert_content_to_arrow_table-->>content_main: table
        alt table length is 0
            content_main-->>User: "No downloads found"
        else table has rows
            content_main->>write_parquet: write_parquet(table, output_path, s3_credentials)
            write_parquet-->>content_main: success
            content_main-->>User: "Export completed successfully"
        end
    end
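
The branching in the diagram above can be sketched in a few lines. This is only an outline of the control flow: `run_export` is a hypothetical name, the fetch, convert, and write steps are injected as callables, and the real `content_main` also handles argument parsing and S3 credentials.

```python
from typing import Callable, Sequence


def run_export(
    fetch: Callable[[], Sequence[dict]],
    convert: Callable[[Sequence[dict]], Sequence],
    write: Callable[[Sequence], None],
) -> str:
    """Run fetch -> convert -> write and return the message shown to the user."""
    results = fetch()
    if not results:
        # CloudWatch returned nothing for the requested window.
        return "No logs found"

    table = convert(results)
    if len(table) == 0:
        # Logs exist, but none were downloads of the requested content type.
        return "No downloads found"

    write(table)
    return "Export completed successfully"
```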

File-Level Changes

Change: Introduce a new content-app download logs exporter CLI command and wire it into the package entry points.
Details:
  • Add parse_content_args and content_main functions to handle content log export arguments, fetch logs, convert them to Arrow tables, and write Parquet output (optionally to S3).
  • Register the export-content-logs console script in the project configuration so it is installable as a CLI tool.
Files:
  • management_tools/pulp-access-logs-exporter/src/pulp_access_logs_exporter/cli.py
  • management_tools/pulp-access-logs-exporter/pyproject.toml

Change: Implement parsing and transformation of pulp-content access log lines into typed records for Python wheels and RPMs and conversion to PyArrow tables.
Details:
  • Add regex-based parsing for content-app log lines, content paths, and filename-based metadata extraction for Python wheels and RPM NEVRA names.
  • Define shared and type-specific PyArrow schemas for Python and RPM content downloads and map parsed records into these schemas.
  • Implement filtering and transformation logic to read CloudWatch query results, skip non-matching/malformed entries, and build type-specific Arrow tables with derived fields like cache_hit, artifact_size, and org_id.
Files:
  • management_tools/pulp-access-logs-exporter/src/pulp_access_logs_exporter/content_parser.py
  • management_tools/pulp-access-logs-exporter/src/pulp_access_logs_exporter/content_cloudwatch.py
  • management_tools/pulp-access-logs-exporter/src/pulp_access_logs_exporter/content_schemas.py

Change: Fix CloudWatch and time handling in the existing exporter to be timezone-aware and compatible with Logs Insights JSON results.
Details:
  • Update parse_time to return timezone-aware UTC datetimes instead of naive utcnow-based values, including for relative times.
  • Change CloudWatch Logs start_query usage to logGroupNames list form to support JSON-structured logs.
Files:
  • management_tools/pulp-access-logs-exporter/src/pulp_access_logs_exporter/cli.py
  • management_tools/pulp-access-logs-exporter/src/pulp_access_logs_exporter/cloudwatch.py

Change: Add tests and fixtures to validate the new content exporter parsing, schema mapping, and Parquet writing behavior.
Details:
  • Provide pytest fixtures with representative CloudWatch content-app log records and temp Parquet paths.
  • Add unit tests for parsing content log lines, paths, and Python/RPM filenames; filtering by content type; building the CloudWatch query; converting results to Arrow tables for both Python and RPM; and writing/reading Parquet files.
Files:
  • management_tools/pulp-access-logs-exporter/tests/conftest.py
  • management_tools/pulp-access-logs-exporter/tests/test_content_export.py
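
As a rough illustration of the `logGroupNames` change listed above, the sketch below builds the keyword arguments for the CloudWatch Logs `start_query` call with the list form. `build_start_query_kwargs` is a hypothetical helper and the log group name in the usage note is made up; the actual code in cloudwatch.py may be structured differently.

```python
from datetime import datetime


def build_start_query_kwargs(
    log_group: str, query: str, start_time: datetime, end_time: datetime
) -> dict:
    """Build start_query arguments using the plural logGroupNames list form."""
    return {
        # List form instead of the singular logGroupName string.
        "logGroupNames": [log_group],
        # start_query expects epoch seconds; pass timezone-aware datetimes here.
        "startTime": int(start_time.timestamp()),
        "endTime": int(end_time.timestamp()),
        "queryString": query,
    }
```

With boto3 installed, the result would be passed along the lines of `boto3.client("logs", region_name=region).start_query(**kwargs)`, for example with a placeholder log group such as `"/example/pulp-content"`.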

@sourcery-ai sourcery-ai Bot left a comment

Contributor

Hey - I've found 4 issues, and left some high-level feedback:

  • The new content_main/parse_content_args flow duplicates a lot of the CLI wiring patterns from the existing commands (time parsing, S3 options, CloudWatch args); consider factoring the shared argument and S3-credential handling into reusable helpers to keep the CLI surface consistent and easier to maintain.
  • In convert_content_to_arrow_table, most reasons for skipping a record (non-matching log line, non-content path, unsupported filename extension) are silent while malformed filenames emit warnings; if these skipped cases are significant operationally, you may want to add optional debug logging or counters so operators can distinguish “no data” from “heavily filtered data.”
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="management_tools/pulp-access-logs-exporter/src/pulp_access_logs_exporter/content_cloudwatch.py" line_range="50-53" />
<code_context>
+    return org_value
+
+
+def _parse_timestamp(timestamp_str):
+    if timestamp_str.endswith("Z"):
+        timestamp_str = timestamp_str[:-1]
+    return datetime.fromisoformat(timestamp_str)
+
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Handle invalid or missing @timestamp values more defensively.

`datetime.fromisoformat` will raise `ValueError` for missing, empty, or non-ISO `@timestamp` values, which will abort the export since `timestamp_str` comes from `result.get("@timestamp", "")`. Please guard this with try/except and either skip such records or fall back to another timestamp source so a single bad record doesn’t fail the run.
</issue_to_address>

### Comment 2
<location path="management_tools/pulp-access-logs-exporter/src/pulp_access_logs_exporter/content_cloudwatch.py" line_range="68-72" />
<code_context>
+    records = []
+    skipped_count = 0
+
+    for result in results:
+        message = result.get("message", result.get("@message", ""))
+        timestamp_str = result.get("@timestamp", "")
+
+        parsed_line = parse_content_log_line(message)
+        if parsed_line is None:
+            continue
</code_context>
<issue_to_address>
**suggestion:** Consider tracking or logging counts for records skipped due to log-line/path parse failures.

Right now only malformed filenames increment `skipped_count`; failures in `parse_content_log_line` / `parse_content_path` are ignored. Adding separate counters for these cases and including them in the final summary (like the filename warning) would make format drifts or unexpected volume drops easier to spot and debug.

Suggested implementation:

```python
    schema = SCHEMAS[content_type]
    filename_parser = FILENAME_PARSERS[content_type]
    records = []
    skipped_count = 0
    log_line_parse_skipped_count = 0
    path_parse_skipped_count = 0

```

```python
        parsed_line = parse_content_log_line(message)
        if parsed_line is None:
            log_line_parse_skipped_count += 1
            continue

```

```python
        parsed_path = parse_content_path(parsed_line["path"])
        if parsed_path is None:
            path_parse_skipped_count += 1
            continue

```

To fully implement the suggestion, you should:
1. Include `log_line_parse_skipped_count` and `path_parse_skipped_count` in whatever summary or warning is logged/emitted at the end of `convert_content_to_arrow_table`, similar to how `skipped_count` is currently surfaced.
2. Consider renaming `skipped_count` to something more specific (e.g. `filename_skipped_count`) in that summary to clearly distinguish between filename, log-line, and path parse failures.
3. If there are unit tests asserting the summary or logged message contents, update them to expect the new counters and their wording.
</issue_to_address>

### Comment 3
<location path="management_tools/pulp-access-logs-exporter/tests/test_content_export.py" line_range="186-195" />
<code_context>
+class TestContentToParquetPython:
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests for malformed or non-download records to verify they are skipped and warnings are emitted

Currently `TestContentToParquetPython` (and RPM) only covers happy-path and empty results. Since `convert_content_to_arrow_table` is supposed to skip bad records and warn on malformed filenames, please add tests that: feed in an invalid `message` (no `CONTENT_LOG_REGEX` match), a valid log line with a non-content path, and a `.whl`/`.rpm` filename that makes the filename parser return `None`; then assert that these rows are excluded from the table, the row count matches only valid downloads, and optionally that the malformed-filename warning is emitted to `stderr` (via `capsys`/`capfd`).

Suggested implementation:

```python
class TestContentToParquetPython:
    def test_converts_python_downloads(self, sample_content_cloudwatch_results, sample_content_parquet_path):
        table = convert_content_to_arrow_table(
            sample_content_cloudwatch_results, "python"
        )
        assert table.num_rows == 2
        assert table.schema == PYTHON_SCHEMA

        write_parquet(table, sample_content_parquet_path)
        read_table = pq.read_table(sample_content_parquet_path)
        assert read_table.num_rows == 2

    def test_skips_invalid_message_records(self, sample_content_cloudwatch_results):
        # Add a record that does not match CONTENT_LOG_REGEX at all
        invalid_record = {
            "message": "this is not a valid access log line and should be ignored"
        }
        mixed_results = sample_content_cloudwatch_results + [invalid_record]

        table = convert_content_to_arrow_table(mixed_results, "python")

        # Only the original valid python download records should be present
        assert table.num_rows == 2

    def test_skips_non_content_path_records(self, sample_content_cloudwatch_results):
        # Start from a valid record and replace the path with a non-content path
        # so the regex still matches but parse_content_path returns None
        base_message = sample_content_cloudwatch_results[0]["message"]
        non_content_path = "/api/pypi/default/dist/simple/pkg/"
        non_content_message = base_message.replace("/pulp/content/", non_content_path)

        non_content_record = {"message": non_content_message}
        mixed_results = sample_content_cloudwatch_results + [non_content_record]

        table = convert_content_to_arrow_table(mixed_results, "python")

        # Non-content paths should be skipped
        assert table.num_rows == 2

    def test_skips_malformed_filename_and_warns(
        self, sample_content_cloudwatch_results, capsys
    ):
        # Use a valid record but make the filename malformed so the filename parser returns None
        base_message = sample_content_cloudwatch_results[0]["message"]
        # Intentionally break the filename while keeping the .whl extension so the
        # content-type filter still considers it a python download
        malformed_message = base_message.replace(".whl", "-malformed-file-name.whl")

        malformed_record = {"message": malformed_message}
        mixed_results = sample_content_cloudwatch_results + [malformed_record]

        table = convert_content_to_arrow_table(mixed_results, "python")

        # Only well-formed python downloads should be present
        assert table.num_rows == 2

        captured = capsys.readouterr()
        # Ensure a warning about the malformed filename was emitted
        assert "malformed" in captured.err.lower()

```

These tests assume:
1. `sample_content_cloudwatch_results[0]["message"]` contains the substring `/pulp/content/` which can be replaced with a non-content path for the non-content test. If the actual content path prefix differs, adjust the `.replace("/pulp/content/", non_content_path)` call to match the real prefix used in your logs.
2. The warning for malformed filenames contains the word `"malformed"` in `stderr`. If the real warning message is different, update the final assertion in `test_skips_malformed_filename_and_warns` to match the actual warning text (for example, assert on the exact message or a more specific substring).
3. If warnings are logged via `logging` instead of `stderr`, replace the `capsys` usage with `caplog` and assert on `caplog.records` or `caplog.text` accordingly.
</issue_to_address>

### Comment 4
<location path="management_tools/pulp-access-logs-exporter/tests/test_content_export.py" line_range="211-220" />
<code_context>
+class TestContentToParquetRpm:
</code_context>
<issue_to_address>
**suggestion (testing):** Cover edge cases for cache/size/org_id parsing (e.g. '-', empty, or unexpected values)

Existing Parquet tests cover only `HIT`/`MISS` and valid org IDs. In `content_cloudwatch`, the normalization helpers (`_parse_cache_hit`, `_parse_artifact_size`, `_parse_org_id`) also handle:
- `artifact_size` of `"-"` or empty → `None`
- `rh_org_id` of `"-"` or empty → `None`
- `cache` values other than `HIT`/`MISS` → `None`

Add one or two tests that feed CloudWatch records with these edge values and assert the resulting table has `None` in the corresponding columns to capture this behavior and guard against future format changes.

Suggested implementation:

```python
class TestContentToParquetRpm:
    def test_converts_rpm_downloads(self, sample_content_cloudwatch_results, sample_content_parquet_path):
        table = convert_content_to_arrow_table(
            sample_content_cloudwatch_results, "rpm"
        )
        assert table.num_rows == 2
        assert table.schema == RPM_SCHEMA

        write_parquet(table, sample_content_parquet_path)
        read_table = pq.read_table(sample_content_parquet_path)
        assert read_table.num_rows == 2

    def test_parses_edge_values_for_cache_size_and_org_id(
        self,
        sample_content_cloudwatch_results,
    ):
        # Create a record with edge values for cache/size/org_id
        edge_record = copy.deepcopy(sample_content_cloudwatch_results[0])
        edge_record["cache"] = "UNKNOWN"        # unexpected cache value -> None
        edge_record["artifact_size"] = "-"      # "-" size -> None
        edge_record["rh_org_id"] = "-"          # "-" org_id -> None

        records = list(sample_content_cloudwatch_results) + [edge_record]

        table = convert_content_to_arrow_table(records, "rpm")
        rows = table.to_pydict()

        # Original rows are still parsed as before
        assert table.num_rows == 3
        assert rows["cache_hit"][0] is True
        assert rows["cache_hit"][1] is False

        # Edge row normalized to None
        assert rows["cache_hit"][2] is None
        assert rows["artifact_size"][2] is None
        assert rows["org_id"][2] is None

    def test_parses_empty_values_for_size_and_org_id(
        self,
        sample_content_cloudwatch_results,
    ):
        # Create a record with empty values for size/org_id and "-" cache
        edge_record = copy.deepcopy(sample_content_cloudwatch_results[0])
        edge_record["cache"] = "-"              # "-" cache -> None
        edge_record["artifact_size"] = ""       # empty size -> None
        edge_record["rh_org_id"] = ""           # empty org_id -> None

        records = list(sample_content_cloudwatch_results) + [edge_record]

        table = convert_content_to_arrow_table(records, "rpm")
        rows = table.to_pydict()

        assert table.num_rows == 3

        assert rows["cache_hit"][2] is None
        assert rows["artifact_size"][2] is None
        assert rows["org_id"][2] is None

```

1. Ensure `import copy` is added at the top of `test_content_export.py`, e.g.:
   `import copy`
2. If the fields in `sample_content_cloudwatch_results` are nested (e.g. under a `"fields"` or similar key), update the `edge_record[...]` assignments to match the actual structure, so that the `cache`, `artifact_size`, and `rh_org_id` values used by `convert_content_to_arrow_table` are correctly overridden.
</issue_to_address>


decko commented Jun 10, 2026

Member Author

Addressed Sourcery review feedback in 5b3a6e2:

Comment 1 (bug_risk — defensive timestamp parsing): Fixed. _parse_timestamp now returns None for empty/malformed values; the record is skipped.

Comment 2 (skip counters): Fixed. Added 5 per-category counters (log parse, path parse, extension, malformed filename, bad timestamp) printed to stderr for operational visibility.
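
One way to keep per-category skip counts is a `collections.Counter` keyed by reason, summarized once at the end of the run. The reason keys and the message wording below are assumptions made for illustration; the PR's actual counter names and stderr output are not shown on this page.

```python
import sys
from collections import Counter

# Hypothetical reason keys mirroring the five categories listed above.
SKIP_REASONS = (
    "log_parse",
    "path_parse",
    "extension",
    "malformed_filename",
    "bad_timestamp",
)


def format_skip_summary(skipped: Counter) -> str:
    """Render one summary line with a count for every skip category."""
    total = sum(skipped[reason] for reason in SKIP_REASONS)
    details = ", ".join(f"{reason}={skipped[reason]}" for reason in SKIP_REASONS)
    return f"Skipped {total} records ({details})"


def report_skips(skipped: Counter) -> None:
    """Print the summary to stderr so it stays out of piped stdout output."""
    print(format_skip_summary(skipped), file=sys.stderr)
```

Inside the conversion loop, each early `continue` would be preceded by something like `skipped["log_parse"] += 1`.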

Comment 3 (tests for malformed records): Partially addressed. Added 3 tests: invalid/non-matching records, malformed filename warning (capsys), and bad timestamps. Skipped the suggested non-content-path test — the suggested implementation had a bug (.replace("/pulp/content/") but our prefix is /api/pulp-content/).

Comment 4 (edge case tests for helpers): Not addressing. The suggested implementation tries to copy.deepcopy a fixture and modify top-level keys like edge_record["cache"], but these values are embedded inside the message string and parsed by regex — the test wouldn't work as written. The helper functions are 3 lines each and already covered by integration tests.

decko commented Jun 15, 2026

Member Author

/retest

decko added 2 commits June 15, 2026 18:07
New `export-content-logs` CLI that exports Python (.whl) and RPM (.rpm)
download logs from the content-app (pulp-content) to Parquet, with parsed
package metadata (name, version, architecture) from filenames.

Separate from the existing PyPI API exporter — this captures actual
downloads with artifact_size, cache hit/miss, and version details that
API logs lack.

Fixes:
- Use logGroupNames (plural) for JSON-structured log support
- Use timezone-aware datetime to fix epoch offset in non-UTC systems

PULP-1811

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Address Sourcery review feedback:
- Add per-category skip counters (log parse, path parse, extension,
  malformed filename, bad timestamp) printed to stderr for operational
  visibility
- Handle empty or malformed @timestamp values gracefully instead of
  crashing the entire export
- Add tests for malformed filenames, invalid records, and bad timestamps

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@decko decko force-pushed the feat/content-download-exporter-wt branch from 5b3a6e2 to 893b120 Compare June 15, 2026 21:07
@decko decko merged commit eb2bf52 into pulp:main Jun 16, 2026
4 checks passed