91 changes: 86 additions & 5 deletions pipeline/outputs/s3.md
@@ -46,12 +46,12 @@ The [Prometheus success/retry/error metrics values](../../administration/monitor
| `blob_database_file` | Absolute path to a database file to be used to store blob files contexts. | _none_ |
| `bucket` | S3 bucket name. | _none_ |
| `canned_acl` | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | _none_ |
-| `compression` | Compression type for S3 objects. Supported values: `gzip`, `zstd`, `snappy`. `arrow` and `parquet` are also available if Apache Arrow was enabled at compile time. See [Compression](#compression). | _none_ |
+| `compression` | Compression type for S3 objects. Supported values: `gzip`, `zstd`, `snappy`, `arrow`. When `format` is set to `parquet`, this controls the page-level `codec` inside the Parquet file (supported: `snappy`, `zstd`, `gzip`). `compression=parquet` is deprecated; use `format parquet` instead. See [Compression](#compression). | _none_ |
| `content_type` | A standard MIME type for the S3 object, set as the Content-Type HTTP header. | _none_ |
| `endpoint` | Custom endpoint for the S3 API. Endpoints can contain scheme and port. | _none_ |
| `external_id` | Specify an external ID for the STS API. Can be used with the `role_arn` parameter if your role requires an external ID. | _none_ |
| `file_delivery_attempt_limit` | File delivery attempt limit. | `1` |
-| `format` | Set the record output format. Supported values: `json_lines`, `otlp_json`. When set to `otlp_json`, the `log_key` option isn't supported and only `logs` event chunks are converted. | `json_lines` |
+| `format` | Set the output format. Supported values: `json_lines`, `otlp_json`, `parquet`. When set to `parquet`, records are converted to Apache Parquet columnar format (requires Apache Arrow Parquet support at compile time). The `compression` option controls the page-level `codec` inside the Parquet file. When set to `otlp_json`, the `log_key` option isn't supported and only `logs` event chunks are converted. | `json_lines` |
| `host` | IP address or hostname of the target HTTP server. | `127.0.0.1` |
| `json_date_format` | Specify the format of the date. Accepted values: `double`, `epoch`, `epoch_ms`, `iso8601` (2018-05-30T09:39:52.000681Z), `_java_sql_timestamp_` (2018-05-30 09:39:52.000681). | _none_ |
| `json_date_key` | Specify the name of the date key in the output record. To disable the time key, set the value to `false`. | `date` |
@@ -128,6 +128,85 @@ Fluent Bit compresses data before uploading to S3. Consumers must decompress the

{% endhint %}

## Parquet format

Setting `format` to `parquet` converts log records to Apache Parquet columnar format before uploading to S3. Parquet files are directly queryable by Athena, Spark, and Presto without additional transformation.

The `compression` option controls the page-level `codec` applied inside the Parquet file:

| `compression` value | Parquet page `codec` | Notes |
|---------------------|-------------------|-------|
| `snappy` | Snappy | Fast, moderate compression ratio. Default codec in many Parquet writers, such as Spark and PyArrow. |
| `zstd` | Zstandard | Better ratio than Snappy, slightly slower. |
| `gzip` | Gzip | High ratio, typically the slowest of the three. |
| _(unset)_ | Uncompressed | No page-level compression. |

{% hint style="info" %}

`format parquet` requires `use_put_object On`. Multipart uploads aren't supported with Parquet format.

{% endhint %}

### Example: Parquet with Snappy compression

```yaml
pipeline:
  outputs:
    - name: s3
      match: '*'
      bucket: my-bucket
      region: us-east-1
      format: parquet
      compression: snappy
      use_put_object: on
      upload_timeout: 60s
      total_file_size: 50M
      s3_key_format: '/logs/dt=%Y-%m-%d/h=%H/$UUID.parquet'
```

### Example: Parquet without page-level compression

```yaml
pipeline:
  outputs:
    - name: s3
      match: '*'
      bucket: my-bucket
      region: us-east-1
      format: parquet
      use_put_object: on
      upload_timeout: 60s
      s3_key_format: '/logs/dt=%Y-%m-%d/h=%H/$UUID.parquet'
```
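
The other codecs in the table follow the same pattern. A minimal sketch with Zstandard, assuming the same placeholder bucket, region, and key layout as the two examples above (none of these values are prescribed by the plugin):

```yaml
pipeline:
  outputs:
    - name: s3
      match: '*'
      bucket: my-bucket            # placeholder bucket name
      region: us-east-1            # placeholder region
      format: parquet
      compression: zstd            # page-level codec inside the Parquet file
      use_put_object: on           # required when format is parquet
      upload_timeout: 60s
      s3_key_format: '/logs/dt=%Y-%m-%d/h=%H/$UUID.parquet'
```

Only the `compression` value differs from the Snappy example. The object key keeps the `.parquet` suffix because the codec is applied to pages inside the file, not to the file as a whole.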

### Migrating from `compression=parquet`

The `compression=parquet` syntax is deprecated. To migrate:

**Before (deprecated):**

```yaml
compression: parquet
```

**After (recommended):**

```yaml
format: parquet
compression: snappy
```

The deprecated syntax continues to work but produces Parquet files with uncompressed pages and emits a warning at startup.
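
Because the deprecated syntax writes uncompressed pages, adding `compression: snappy` during migration changes the files that are produced. If the goal is to move to the new syntax without changing the output, one option is to leave `compression` unset, which the codec table above maps to uncompressed pages. A sketch under that assumption, with a placeholder bucket and region:

```yaml
pipeline:
  outputs:
    - name: s3
      match: '*'
      bucket: my-bucket            # placeholder bucket name
      region: us-east-1            # placeholder region
      use_put_object: on           # required when format is parquet
      format: parquet              # replaces the deprecated compression: parquet
      # compression is intentionally unset: pages stay uncompressed,
      # matching what the deprecated syntax produced
```

A codec can then be enabled in a separate change once downstream readers have been checked.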

### Build requirements

Parquet format requires Apache Arrow Parquet support at compile time:

- CMake flag: `-DFLB_ARROW=On`
- System packages: `arrow-glib-devel` and `parquet-glib-devel`

The `AWS for Fluent Bit` version 3 container image includes these dependencies by default.

## Permissions

The plugin requires the following AWS IAM permissions:
@@ -694,7 +773,7 @@ pipeline:
{% endtab %}
{% endtabs %}

-Setting `Compression` to `arrow` makes Fluent Bit convert payload into Apache Arrow format.
+Setting `compression` to `arrow` converts the payload to Apache Arrow (Feather) format. For Parquet output, use `format parquet` instead.

Load, analyze, and process stored data using popular data processing tools such as Python pandas, Apache Spark, and TensorFlow.

@@ -766,7 +845,8 @@ pipeline:
region: us-east-2
bucket: <your_testing_bucket>
use_put_object: On
-compression: parquet
+format: parquet
+compression: snappy
# other parameters
```

@@ -791,7 +871,8 @@ pipeline:
Region us-east-2
Bucket <your_testing_bucket>
Use_Put_Object On
-Compression parquet
+Format parquet
+Compression snappy
# other parameters
```
