feat: Stream backups#526
Conversation
ea7c08f to
4418339
Compare
|
| Filename | Overview |
|---|---|
| agent/site.py | Adds streaming backup via FIFOs and rclone, but the implementation has critical issues: hardcoded test data returned from backup(), hardcoded AWS region/path, race-condition sync via sleep, FIFO cleanup missing, credentials exposed in process args, NameError on non-offsite backups, tab/space mixing causing TabError, and FIFO path mismatch between bench.directory and hardcoded /home/frappe/frappe-bench/. |
Sequence Diagram
sequenceDiagram
participant Job as backup_job()
participant FIFO as Named FIFO (mkfifo)
participant Thread as rclone Thread
participant Rclone as rclone rcat
participant Backup as bench backup
participant S3 as AWS S3
Job->>FIFO: docker_execute mkfifo (each file)
Job->>Thread: threading.Thread(start_rclone).start()
Thread->>FIFO: open(fifo_path, "rb") [BLOCKS until writer]
Job->>Job: time.sleep(3)
Job->>Backup: bench execute backup --backup-path-db ...
Note over Job,Backup: Uses hardcoded /home/frappe/frappe-bench/<br/>but FIFOs are at self.bench.directory
Backup->>FIFO: open for writing (unblocks Thread)
Thread->>Rclone: "Popen(rclone rcat, stdin=fd)"
Backup->>FIFO: write compressed data
FIFO->>Rclone: stream data
Rclone->>S3: upload stream
Backup->>FIFO: close (EOF)
Rclone->>Thread: exit (code ignored)
Job->>Job: t.join() for all threads
Job->>Job: build uploaded_files dict (always succeeds)
Reviews (4): Last reviewed commit: "feat: Stream backups" | Re-trigger Greptile
| return { | ||
| 'database': { | ||
| 'path': '/home/frappe/benches/bench-40545-000001-u10-ksa/sites/test-s3-streaming-uploads.frappe.cloud/private/backups/database.sql.gz', | ||
| 'file': '20260522_134200-test-s3-streaming-uploads_frappe_cloud-database.sql.gz', | ||
| 'size': 10000, | ||
| 'url':'https://test-s3-streaming-uploads.frappe.cloud/backups/database.sql.gz' | ||
| }, | ||
| 'site_config': { | ||
| 'path': '/home/frappe/benches/bench-40545-000001-u10-ksa/sites/test-s3-streaming-uploads.frappe.cloud/private/backups/site_config_backup.json', | ||
| 'file': 'site_config_backup.json', | ||
| 'size': 328, | ||
| 'url': 'https://test-s3-streaming-uploads.frappe.cloud/backups/site_config_backup.json' | ||
| }, | ||
| 'private': { | ||
| 'path': '/home/frappe/benches/bench-40545-000001-u10-ksa/sites/test-s3-streaming-uploads.frappe.cloud/private/backups/private-files.tar', | ||
| 'file': 'private-files.tar', | ||
| 'size': 10240, | ||
| 'url': 'https://test-s3-streaming-uploads.frappe.cloud/backups/private-files.tar' | ||
| }, | ||
| 'public': { | ||
| 'path': '/home/frappe/benches/bench-40545-000001-u10-ksa/sites/test-s3-streaming-uploads.frappe.cloud/private/backups/files.tar', | ||
| 'file': 'files.tar', | ||
| 'size': 10240, | ||
| 'url': 'https://test-s3-streaming-uploads.frappe.cloud/backups/files.tar' | ||
| } | ||
| } |
There was a problem hiding this comment.
Hardcoded test data replaces real backup result
backup() now unconditionally returns a hard-coded dict containing a specific site name (test-s3-streaming-uploads.frappe.cloud), fixed file paths, and fixed sizes instead of calling self.fetch_latest_backup() as before. Every backup in production will return this stale, wrong data — callers that rely on real paths (e.g., upload_offsite_backup which opens backup_file["path"]) will either silently upload to the wrong location or raise a FileNotFoundError when the hard-coded path does not exist.
| subproc = subprocess.Popen( | ||
| [ | ||
| "rclone", "rcat", | ||
| "--s3-provider", "AWS", | ||
| "--s3-access-key-id", auth["ACCESS_KEY"], | ||
| "--s3-secret-access-key", auth["SECRET_KEY"], | ||
| "--s3-region", "ap-south-1", | ||
| "--s3-endpoint", f"s3.ap-south-1.amazonaws.com", | ||
| f":s3:{bucket}/{prefix}/{file}", | ||
| ], | ||
| stderr=subprocess.PIPE, | ||
| stdin=fd, | ||
| close_fds=True, | ||
| ) | ||
| fd.close() | ||
| subprocs.append(subproc) | ||
| ret = subproc.wait() | ||
| err = subproc.stderr.read().decode() | ||
| print(f"rclone exit: {ret}, error: {err}") |
There was a problem hiding this comment.
AWS region is hardcoded to
ap-south-1
The rclone invocation hard-codes --s3-region ap-south-1 and the matching endpoint. upload_offsite_backup correctly reads auth.get("REGION") and uses it, but the streaming path ignores that field entirely. Any customer whose bucket is in a different region will get silent upload failures or connection errors, and the job will report success because rclone exit codes are only logged with print, not checked.
| fd.close() | ||
| subprocs.append(subproc) | ||
| ret = subproc.wait() | ||
| err = subproc.stderr.read().decode() | ||
| print(f"rclone exit: {ret}, error: {err}") |
There was a problem hiding this comment.
rclone non-zero exit code is ignored
start_rclone calls subproc.wait(), reads ret, and prints it — but never raises if ret != 0. A failed upload (wrong credentials, bucket policy, network error) is silently treated as success. uploaded_files is then populated for every file in files_to_stream regardless of whether any upload actually succeeded, so the job completes without error while no data was transferred.
| for file in files_to_stream: | ||
| self.bench.docker_execute(f"mkfifo {file}", subdir=relative_path_to_backup_directory) |
There was a problem hiding this comment.
FIFOs are never cleaned up on failure
The named pipes created by mkfifo are left behind if any subsequent step raises an exception (e.g., bench execute fails, rclone threads crash). On the next backup run the mkfifo call will fail because the paths already exist, permanently breaking streaming backups for that site until someone manually removes the stale FIFOs.
| import time | ||
| time.sleep(3) |
There was a problem hiding this comment.
Race condition:
time.sleep(3) is not a reliable synchronization primitive
The 3-second sleep assumes all rclone reader threads will have opened their FIFO end and started listening before bench backup writes to them. Under high system load, slow process startup, or container scheduling delays this window may be insufficient, causing bench backup to block indefinitely because no reader has opened the pipe yet. The correct approach is to wait for each FIFO to have a reader (e.g., poll os.open with O_NONBLOCK on the write side, or use a threading Event per file).
| # setup fifo | ||
| # todays_dt = datetime.now().strftime("%Y%m%d_%H%M%S") | ||
| # public_file = todays_dt + '-files.tar' | ||
| # db_file = todays_dt + '-database.sql.gz' | ||
| # private_file = todays_dt + '-private-files.tar' | ||
| # config_file = todays_dt + '-site_config_backup.json' | ||
|
|
||
| public_file = 'files.tar' | ||
| db_file = 'database.sql.gz' | ||
| private_file = 'private-files.tar' | ||
| config_file = 'site_config_backup.json' |
There was a problem hiding this comment.
Static, timestamp-less filenames risk collisions and residual data
The commented-out block above shows the original intent was to use datetime-prefixed names (e.g., 20260522_134200-database.sql.gz). The active code uses bare names (database.sql.gz, files.tar, etc.). If the backup directory already contains a prior run's files with these names, Frappe's bench backup may overwrite them mid-stream while a concurrent process is reading, or the wrong data may be uploaded. fetch_latest_backup relied on max(databases, key=os.path.getmtime) to find the newest file — that logic is now bypassed.
| self.upload_offsite_backup(backup_files, offsite, keep_files_locally_after_offsite_backup) | ||
| if (offsite and backup_files) | ||
| else {} | ||
| ) |
There was a problem hiding this comment.
Mixed tabs and spaces cause
TabError at import time
Lines 867–870, 894–899, and 901 use hard-tab indentation inside blocks that are indented with spaces. Python 3 forbids mixing tabs and spaces for indentation and raises TabError: inconsistent use of tabs and spaces in indentation at parse time — before any code runs. This means the entire agent/site.py module will fail to import, taking down every job that uses it. The same pattern repeats at the # get s3 config block (line 894) and # make rclone rcat start listening (line 901).
| "--s3-access-key-id", auth["ACCESS_KEY"], | ||
| "--s3-secret-access-key", auth["SECRET_KEY"], |
There was a problem hiding this comment.
AWS credentials exposed in process listing
--s3-access-key-id and --s3-secret-access-key are passed as plain command-line arguments. On Linux, /proc/<pid>/cmdline is world-readable by default, and ps aux shows the full argument list to any user on the host while rclone is running. The existing upload_offsite_backup avoids this entirely by using the boto3 library with in-process credential handling. Prefer writing a temporary rclone config file (with restricted permissions) or supplying credentials via environment variables (RCLONE_S3_ACCESS_KEY_ID / RCLONE_S3_SECRET_ACCESS_KEY) passed through subprocess.Popen(env=...) rather than as CLI flags.
| for file in files_to_stream | ||
| } | ||
|
|
||
| return {"backups": backup_files, "offsite": uploaded_files} |
There was a problem hiding this comment.
NameError on every non-offsite backup
The new if offsite: block only assigns backup_files and uploaded_files when offsite is truthy. When a plain backup is requested (offsite=None, which is the default), neither variable is ever initialised, so return {"backups": backup_files, "offsite": uploaded_files} raises NameError: name 'backup_files' is not defined. Every standard backup job (no offsite target) will now crash at this line without performing or recording any backup.
| backup_files = self.backup( | ||
| with_files, | ||
| backup_path_db=os.path.join("/home/frappe/frappe-bench/", relative_path_to_backup_directory, db_file), | ||
| backup_path_conf=os.path.join("/home/frappe/frappe-bench/", relative_path_to_backup_directory, config_file), | ||
| backup_path_files=os.path.join("/home/frappe/frappe-bench/", relative_path_to_backup_directory, public_file), | ||
| backup_path_private_files=os.path.join("/home/frappe/frappe-bench/", relative_path_to_backup_directory, private_file), | ||
| ) |
There was a problem hiding this comment.
Hardcoded bench root diverges from FIFO location
The FIFOs are created under backup_directory_from_root = os.path.join(self.bench.directory, ...) (line 889), but the --backup-path-* arguments passed to bench backup are constructed from the hardcoded string /home/frappe/frappe-bench/ instead. On any bench whose directory differs from that literal (e.g., a versioned bench like /home/frappe/benches/bench-40545-000001-u10-ksa), bench backup will try to open a path that has no FIFO at the other end — the write side never connects, bench backup blocks or fails, and the rclone reader threads hang indefinitely in open(fifo_path, "rb"). The job would never complete. The fix is to replace /home/frappe/frappe-bench/ with self.bench.directory to match line 889.
No description provided.