Skip to content

feat: Stream backups#526

Draft
regdocs wants to merge 1 commit into
developfrom
stream-backups
Draft

feat: Stream backups#526
regdocs wants to merge 1 commit into
developfrom
stream-backups

Conversation

@regdocs
Copy link
Copy Markdown
Member

@regdocs regdocs commented May 22, 2026

No description provided.

@regdocs regdocs self-assigned this May 22, 2026
@regdocs regdocs marked this pull request as draft May 22, 2026 06:42
@regdocs regdocs force-pushed the stream-backups branch 16 times, most recently from ea7c08f to 4418339 Compare May 25, 2026 03:03
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR adds a streaming backup path that pipes bench backup output directly into rclone via named FIFOs, avoiding the need to land full backup files on disk before uploading to S3. The approach is creative but the implementation is not yet production-ready.

  • backup() now unconditionally returns a hard-coded dict with a specific test site's paths instead of the previous fetch_latest_backup() call, so all callers — including the non-streaming offsite path — receive stale, wrong data.
  • The streaming branch uses a hardcoded /home/frappe/frappe-bench/ root for --backup-path-* arguments while the FIFO nodes are created under self.bench.directory; on any bench with a different directory these paths diverge and the job hangs indefinitely.
  • The AWS region is hardcoded to ap-south-1, rclone exit codes are never checked, FIFOs are not cleaned up on failure, synchronisation relies on a 3-second sleep, and AWS credentials are passed as CLI arguments (visible in /proc).

Confidence Score: 1/5

Not safe to merge — backup() returns hardcoded test data in production, a NameError crashes every non-offsite backup, and tab/space mixing prevents the entire module from importing.

The streaming implementation leaves backup() returning a hard-coded dict tied to a single test site, making every backup in every environment return wrong paths. The module also fails to import at all due to mixed tabs and spaces, and the non-offsite code path raises NameError before returning. These are not isolated edge cases — they affect all backup jobs across all sites.

agent/site.py — the backup() method, the backup_job() streaming branch, and the indentation throughout the new block all need attention before this is mergeable.

Important Files Changed

Filename Overview
agent/site.py Adds streaming backup via FIFOs and rclone, but the implementation has critical issues: hardcoded test data returned from backup(), hardcoded AWS region/path, race-condition sync via sleep, FIFO cleanup missing, credentials exposed in process args, NameError on non-offsite backups, tab/space mixing causing TabError, and FIFO path mismatch between bench.directory and hardcoded /home/frappe/frappe-bench/.

Sequence Diagram

sequenceDiagram
    participant Job as backup_job()
    participant FIFO as Named FIFO (mkfifo)
    participant Thread as rclone Thread
    participant Rclone as rclone rcat
    participant Backup as bench backup
    participant S3 as AWS S3

    Job->>FIFO: docker_execute mkfifo (each file)
    Job->>Thread: threading.Thread(start_rclone).start()
    Thread->>FIFO: open(fifo_path, "rb") [BLOCKS until writer]
    Job->>Job: time.sleep(3)
    Job->>Backup: bench execute backup --backup-path-db ...
    Note over Job,Backup: Uses hardcoded /home/frappe/frappe-bench/<br/>but FIFOs are at self.bench.directory
    Backup->>FIFO: open for writing (unblocks Thread)
    Thread->>Rclone: "Popen(rclone rcat, stdin=fd)"
    Backup->>FIFO: write compressed data
    FIFO->>Rclone: stream data
    Rclone->>S3: upload stream
    Backup->>FIFO: close (EOF)
    Rclone->>Thread: exit (code ignored)
    Job->>Job: t.join() for all threads
    Job->>Job: build uploaded_files dict (always succeeds)
Loading

Reviews (4): Last reviewed commit: "feat: Stream backups" | Re-trigger Greptile

Comment thread agent/site.py
Comment on lines +492 to +517
return {
'database': {
'path': '/home/frappe/benches/bench-40545-000001-u10-ksa/sites/test-s3-streaming-uploads.frappe.cloud/private/backups/database.sql.gz',
'file': '20260522_134200-test-s3-streaming-uploads_frappe_cloud-database.sql.gz',
'size': 10000,
'url':'https://test-s3-streaming-uploads.frappe.cloud/backups/database.sql.gz'
},
'site_config': {
'path': '/home/frappe/benches/bench-40545-000001-u10-ksa/sites/test-s3-streaming-uploads.frappe.cloud/private/backups/site_config_backup.json',
'file': 'site_config_backup.json',
'size': 328,
'url': 'https://test-s3-streaming-uploads.frappe.cloud/backups/site_config_backup.json'
},
'private': {
'path': '/home/frappe/benches/bench-40545-000001-u10-ksa/sites/test-s3-streaming-uploads.frappe.cloud/private/backups/private-files.tar',
'file': 'private-files.tar',
'size': 10240,
'url': 'https://test-s3-streaming-uploads.frappe.cloud/backups/private-files.tar'
},
'public': {
'path': '/home/frappe/benches/bench-40545-000001-u10-ksa/sites/test-s3-streaming-uploads.frappe.cloud/private/backups/files.tar',
'file': 'files.tar',
'size': 10240,
'url': 'https://test-s3-streaming-uploads.frappe.cloud/backups/files.tar'
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Hardcoded test data replaces real backup result

backup() now unconditionally returns a hard-coded dict containing a specific site name (test-s3-streaming-uploads.frappe.cloud), fixed file paths, and fixed sizes instead of calling self.fetch_latest_backup() as before. Every backup in production will return this stale, wrong data — callers that rely on real paths (e.g., upload_offsite_backup which opens backup_file["path"]) will either silently upload to the wrong location or raise a FileNotFoundError when the hard-coded path does not exist.

Comment thread agent/site.py
Comment on lines +915 to +933
subproc = subprocess.Popen(
[
"rclone", "rcat",
"--s3-provider", "AWS",
"--s3-access-key-id", auth["ACCESS_KEY"],
"--s3-secret-access-key", auth["SECRET_KEY"],
"--s3-region", "ap-south-1",
"--s3-endpoint", f"s3.ap-south-1.amazonaws.com",
f":s3:{bucket}/{prefix}/{file}",
],
stderr=subprocess.PIPE,
stdin=fd,
close_fds=True,
)
fd.close()
subprocs.append(subproc)
ret = subproc.wait()
err = subproc.stderr.read().decode()
print(f"rclone exit: {ret}, error: {err}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 AWS region is hardcoded to ap-south-1

The rclone invocation hard-codes --s3-region ap-south-1 and the matching endpoint. upload_offsite_backup correctly reads auth.get("REGION") and uses it, but the streaming path ignores that field entirely. Any customer whose bucket is in a different region will get silent upload failures or connection errors, and the job will report success because rclone exit codes are only logged with print, not checked.

Comment thread agent/site.py
Comment on lines +929 to +933
fd.close()
subprocs.append(subproc)
ret = subproc.wait()
err = subproc.stderr.read().decode()
print(f"rclone exit: {ret}, error: {err}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 rclone non-zero exit code is ignored

start_rclone calls subproc.wait(), reads ret, and prints it — but never raises if ret != 0. A failed upload (wrong credentials, bucket policy, network error) is silently treated as success. uploaded_files is then populated for every file in files_to_stream regardless of whether any upload actually succeeded, so the job completes without error while no data was transferred.

Comment thread agent/site.py
Comment on lines +891 to +892
for file in files_to_stream:
self.bench.docker_execute(f"mkfifo {file}", subdir=relative_path_to_backup_directory)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 FIFOs are never cleaned up on failure

The named pipes created by mkfifo are left behind if any subsequent step raises an exception (e.g., bench execute fails, rclone threads crash). On the next backup run the mkfifo call will fail because the paths already exist, permanently breaking streaming backups for that site until someone manually removes the stale FIFOs.

Comment thread agent/site.py
Comment on lines +939 to +940
import time
time.sleep(3)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Race condition: time.sleep(3) is not a reliable synchronization primitive

The 3-second sleep assumes all rclone reader threads will have opened their FIFO end and started listening before bench backup writes to them. Under high system load, slow process startup, or container scheduling delays this window may be insufficient, causing bench backup to block indefinitely because no reader has opened the pipe yet. The correct approach is to wait for each FIFO to have a reader (e.g., poll os.open with O_NONBLOCK on the write side, or use a threading Event per file).

Comment thread agent/site.py
Comment on lines +872 to +882
# setup fifo
# todays_dt = datetime.now().strftime("%Y%m%d_%H%M%S")
# public_file = todays_dt + '-files.tar'
# db_file = todays_dt + '-database.sql.gz'
# private_file = todays_dt + '-private-files.tar'
# config_file = todays_dt + '-site_config_backup.json'

public_file = 'files.tar'
db_file = 'database.sql.gz'
private_file = 'private-files.tar'
config_file = 'site_config_backup.json'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Static, timestamp-less filenames risk collisions and residual data

The commented-out block above shows the original intent was to use datetime-prefixed names (e.g., 20260522_134200-database.sql.gz). The active code uses bare names (database.sql.gz, files.tar, etc.). If the backup directory already contains a prior run's files with these names, Frappe's bench backup may overwrite them mid-stream while a concurrent process is reading, or the wrong data may be uploaded. fetch_latest_backup relied on max(databases, key=os.path.getmtime) to find the newest file — that logic is now bypassed.

Comment thread agent/site.py
Comment on lines +867 to +870
self.upload_offsite_backup(backup_files, offsite, keep_files_locally_after_offsite_backup)
if (offsite and backup_files)
else {}
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Mixed tabs and spaces cause TabError at import time

Lines 867–870, 894–899, and 901 use hard-tab indentation inside blocks that are indented with spaces. Python 3 forbids mixing tabs and spaces for indentation and raises TabError: inconsistent use of tabs and spaces in indentation at parse time — before any code runs. This means the entire agent/site.py module will fail to import, taking down every job that uses it. The same pattern repeats at the # get s3 config block (line 894) and # make rclone rcat start listening (line 901).

Comment thread agent/site.py
Comment on lines +919 to +920
"--s3-access-key-id", auth["ACCESS_KEY"],
"--s3-secret-access-key", auth["SECRET_KEY"],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 security AWS credentials exposed in process listing

--s3-access-key-id and --s3-secret-access-key are passed as plain command-line arguments. On Linux, /proc/<pid>/cmdline is world-readable by default, and ps aux shows the full argument list to any user on the host while rclone is running. The existing upload_offsite_backup avoids this entirely by using the boto3 library with in-process credential handling. Prefer writing a temporary rclone config file (with restricted permissions) or supplying credentials via environment variables (RCLONE_S3_ACCESS_KEY_ID / RCLONE_S3_SECRET_ACCESS_KEY) passed through subprocess.Popen(env=...) rather than as CLI flags.

Comment thread agent/site.py
for file in files_to_stream
}

return {"backups": backup_files, "offsite": uploaded_files}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 NameError on every non-offsite backup

The new if offsite: block only assigns backup_files and uploaded_files when offsite is truthy. When a plain backup is requested (offsite=None, which is the default), neither variable is ever initialised, so return {"backups": backup_files, "offsite": uploaded_files} raises NameError: name 'backup_files' is not defined. Every standard backup job (no offsite target) will now crash at this line without performing or recording any backup.

Comment thread agent/site.py
Comment on lines +943 to +949
backup_files = self.backup(
with_files,
backup_path_db=os.path.join("/home/frappe/frappe-bench/", relative_path_to_backup_directory, db_file),
backup_path_conf=os.path.join("/home/frappe/frappe-bench/", relative_path_to_backup_directory, config_file),
backup_path_files=os.path.join("/home/frappe/frappe-bench/", relative_path_to_backup_directory, public_file),
backup_path_private_files=os.path.join("/home/frappe/frappe-bench/", relative_path_to_backup_directory, private_file),
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Hardcoded bench root diverges from FIFO location

The FIFOs are created under backup_directory_from_root = os.path.join(self.bench.directory, ...) (line 889), but the --backup-path-* arguments passed to bench backup are constructed from the hardcoded string /home/frappe/frappe-bench/ instead. On any bench whose directory differs from that literal (e.g., a versioned bench like /home/frappe/benches/bench-40545-000001-u10-ksa), bench backup will try to open a path that has no FIFO at the other end — the write side never connects, bench backup blocks or fails, and the rclone reader threads hang indefinitely in open(fifo_path, "rb"). The job would never complete. The fix is to replace /home/frappe/frappe-bench/ with self.bench.directory to match line 889.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant