Amazon S3

These docs are for Cribl Stream 4.5, a product version we no longer actively maintain.

See the latest version (4.7).

Cribl Stream supports receiving data from Amazon S3 buckets, using event notifications through SQS.

Type: Pull | TLS Support: YES (secure API) | Event Breaker Support: YES

Cribl Stream running on Linux (only) can use this Source to read Parquet files, identified by a .parquet, .parq, or .pqt filename extension.

See our Amazon S3 Better Practices and Using S3 Storage and Replay guides.

S3 Setup Strategy

The source S3 bucket must be configured to send s3:ObjectCreated:* events to an SQS queue, either directly (easiest) or via SNS (Amazon Simple Notification Service). See the event notification configuration guidelines below.

SQS messages will be deleted after they’re read, unless an error occurs, in which case Cribl Stream will retry. This means that although Cribl Stream will ignore files not matching the Filename Filter, their SQS notifications will still be read, and then deleted from the queue (along with those from files that match).

These ignored files will no longer be available to other S3 Sources targeting the same SQS queue. If you still need to process these files, we suggest one of these alternatives:

  • Using a different, dedicated SQS queue. (Preferred.)

  • Applying a broad filter on a single Source, and then using pre-processing Pipelines and/or Route filters for further processing.

Compression

Cribl Stream can ingest compressed S3 files if they meet one of the following conditions:

  • Compressed with the x-gzip MIME type.
  • Ends with the .gz, .tgz, .tar.gz, or .tar extension.
  • Can be uncompressed using the zlib.gunzip algorithm.
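
If you want to verify locally that a file meets the last condition, a minimal Node.js sketch (the file name is a hypothetical example) is:

    // check-gunzip.js – test whether a file can be decompressed with
    // zlib.gunzip, the same algorithm Cribl Stream uses at ingest time.
    const fs = require('fs');
    const zlib = require('zlib');

    zlib.gunzip(fs.readFileSync('sample-log.gz'), (err, decompressed) => {
      if (err) {
        console.error('Not gunzip-compatible:', err.message);
      } else {
        console.log(`Decompressed ${decompressed.length} bytes`);
      }
    });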

Storage Class Compatibility

Cribl Stream does not support data preview, collection, or replay from the S3 Glacier or S3 Glacier Deep Archive storage classes, whose stated retrieval lags (variously, minutes to 48 hours) cannot guarantee data availability when the Collector needs it.

Cribl Stream does support data preview, collection, and replay from S3 Glacier Instant Retrieval when you’re using the S3 Intelligent-Tiering storage class.

Configuring Cribl Stream to Receive Data from Amazon S3

From the top nav, click Manage, then select a Worker Group to configure. Next, you have two options:

To configure via the graphical QuickConnect UI, click Routing > QuickConnect (Stream) or Collect (Edge). Next, click Add Source at left. From the resulting drawer’s tiles, select [Pull >] Amazon > S3. Next, click either Add Destination or (if displayed) Select Existing. The resulting drawer will provide the options below.

Or, to configure via the Routing UI, click Data > Sources (Stream) or More > Sources (Edge). From the resulting page’s tiles or left nav, select [Pull >] Amazon > S3. Next, click New Source to open a New Source modal that provides the options below.

General Settings

Input ID: Enter a unique name to identify this S3 Source definition.

Queue: The name, URL, or ARN of the SQS queue to read events from. When specifying a non-AWS URL, you must use the format: {url}/<queueName>. (E.g., https://host:port/<queueName>.) This value must be a JavaScript expression (which can evaluate to a constant), enclosed in single quotes, double quotes, or backticks.
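
For illustration, any of the following would be a valid Queue entry; the queue name, account ID, and Region are hypothetical examples:

    'myQueue'
    'https://sqs.us-east-1.amazonaws.com/123456789012/myQueue'
    'arn:aws:sqs:us-east-1:123456789012:myQueue'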

Optional Settings

Filename filter: Regex matching file names to download and process. Defaults to .*, to match all characters. This regex will be evaluated against the S3 key’s full path.
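
For example, a hypothetical filter that ingests only gzipped JSON files under a logs/ prefix could be:

    ^logs/.*\.json\.gz$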

Region: AWS Region where the S3 bucket and SQS queue are located. Required, unless the Queue entry is a URL or ARN that includes a Region.

Tags: Optionally, add tags that you can use to filter and group Sources in Cribl Stream’s Manage Sources page. These tags aren’t added to processed events. Use a tab or hard return between (arbitrary) tag names.

Authentication

Use the Authentication method drop-down to select an AWS authentication method.

Auto: This default option uses the AWS instance’s metadata service to automatically obtain short-lived credentials from the IAM role attached to an EC2 instance, local credentials, sidecar, or other source. The attached IAM role grants Cribl Stream Workers access to authorized AWS resources. Can also use the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Works only when running on AWS.

Manual: If not running on AWS, you can select this option to enter a static set of user-associated IAM credentials (your access key and secret key) directly or by reference. This is useful for Workers not in an AWS VPC, e.g., those running in a private cloud. The Manual option exposes these corresponding additional fields:

  • Access key: Enter your AWS access key. If not present, will fall back to the env.AWS_ACCESS_KEY_ID environment variable, or to the metadata endpoint for IAM role credentials.

  • Secret key: Enter your AWS secret key. If not present, will fall back to the env.AWS_SECRET_ACCESS_KEY environment variable, or to the metadata endpoint for IAM credentials.

Secret: If not running on AWS, you can select this option to supply a stored secret that references an AWS access key and secret key. The Secret option exposes this additional field:

  • Secret key pair: Use the drop-down to select a secret key pair that you’ve configured in Cribl Stream’s internal secrets manager or (if enabled) an external KMS. Follow the Create link if you need to configure a key pair.

Assume Role

Enable for S3: Whether to use Assume Role credentials to access S3. Defaults to Yes.

Enable for SQS: Whether to use Assume Role credentials when accessing SQS (Amazon Simple Queue Service). Defaults to No.

AWS account ID: SQS queue owner’s AWS account ID. Leave empty if the SQS queue is in the same AWS account.

AssumeRole ARN: Enter the Amazon Resource Name (ARN) of the role to assume.

External ID: Enter the External ID to use when assuming the role.

Duration (seconds): Duration of the Assumed Role’s session, in seconds. Minimum is 900 (15 minutes). Maximum is 43200 (12 hours). Defaults to 3600 (1 hour).
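
For reference, the role being assumed must trust the caller, and (if an External ID is set) its trust policy must match that ID. A minimal sketch of such a trust policy, with a hypothetical account ID and External ID, might look like:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": { "sts:ExternalId": "example-external-id" }
          }
        }
      ]
    }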

Processing Settings

Custom Command

In this section, you can pass the data from this input to an external command for processing, before the data continues downstream.

Enabled: Defaults to No. Toggle to Yes to enable the custom command.

Command: Enter the command that will consume the data (via stdin) and emit the processed data (via stdout).

Arguments: Click Add Argument to add each argument to the command. You can drag arguments vertically to resequence them.
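
As a hypothetical illustration, a Command of sed with a single Argument like the following would mask passwords in each event before it continues downstream:

    Command:   sed
    Argument:  s/password=[^ ]*/password=REDACTED/g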

Event Breakers

This section defines event breaking rulesets that will be applied, in order.

Event Breaker Rulesets: A list of event breaking rulesets that will be applied to the input data stream before the data is sent through the Routes. Defaults to System Default Rule.

Event Breaker buffer timeout: How long (in milliseconds) the Event Breaker will wait for new data to be sent to a specific channel, before flushing out the data stream, as-is, to the Routes. Minimum 10 ms, default 10000 (10 sec), maximum 43200000 (12 hours).

Fields

In this section, you can add Fields to each event, using Eval-like functionality.

Name: Field name.

Value: JavaScript expression to compute field’s value, enclosed in quotes or backticks. (Can evaluate to a constant.)
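
For example, a hypothetical constant field that tags every event from this Source with a data-center label:

    Name:  dc
    Value: 'us-east-1'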

Pre-Processing

In this section’s Pipeline drop-down list, you can select a single existing Pipeline to process data from this input before the data is sent through the Routes.

Advanced Settings

Endpoint: S3 service endpoint. If empty, defaults to AWS’s region-specific endpoint. Otherwise, used to point to an S3-compatible endpoint. To access the AWS S3 endpoints, use the path-style URL format. You don’t need to specify the bucket name in the URL, because Cribl Stream will automatically add it to the URL path. For details, see AWS’s Path-Style Requests topic.
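
For instance, a hypothetical entry pointing at a self-hosted S3-compatible object store might be:

    https://s3.example-on-prem.local:9000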

Signature version: Signature version to use for signing SQS requests. Defaults to v4.

Max messages: The maximum number of messages that SQS should return in a poll request. Amazon SQS never returns more messages than this value. (However, fewer messages might be returned.) Acceptable values: 1 to 10. Defaults to 1.

Visibility timeout seconds: The duration (in seconds) that the received messages are hidden from subsequent retrieve requests, after being retrieved by a ReceiveMessage request. Defaults to 600.

Cribl Stream will automatically extend this timeout until the initial request’s files have been processed, notably in the case of large files that require additional processing time.

Num receivers: The number of receiver processes to run. The higher the number, the better the throughput, at the expense of CPU overhead. Defaults to 1.

Poll timeout (secs): How long to wait for events before polling again. Minimum 1 second; default 10; maximum 20. Short durations increase the number, and thus the cost, of requests sent to AWS. (The UI will show a warning for intervals shorter than 5 seconds.) Long durations increase the time the Source takes to react to configuration changes and system restarts.

Socket timeout: Socket inactivity timeout (in seconds). Increase this value if retrievals time out during backpressure. Defaults to 300 seconds.

Max Parquet chunk size (MB): Maximum size for each Parquet chunk. Defaults to 5 MB. Valid range is 1 to 100 MB. Cribl Stream stores chunks in the location specified by the CRIBL_TMP_DIR environment variable. It removes the chunks immediately after reading them. See Environment Variables.

Parquet chunk download timeout (seconds): The maximum time to wait for a Parquet file’s chunk to be downloaded. If a required chunk cannot be downloaded within this time limit, processing will end. Defaults to 600 seconds. Valid range is 1 second to 3600 seconds (1 hour).

Skip file on error: Toggle to Yes to skip files that trigger a processing error. (E.g., corrupted files.) Defaults to No, which enables retries after a processing error.

Reuse connections: Whether to reuse connections between requests. The default setting (Yes) can improve performance.

Reject unauthorized certificates: Whether to reject certificates that cannot be verified against a valid Certificate Authority (e.g., self-signed certificates). Defaults to Yes.

Environment: If you’re using GitOps, optionally use this field to specify a single Git branch on which to enable this configuration. If empty, the config will be enabled everywhere.

Connected Destinations

Select Send to Routes to enable conditional routing, filtering, and cloning of this Source’s data via the Routing table.

Select QuickConnect to send this Source’s data to one or more Destinations via independent, direct connections.

Internal Fields

Cribl Stream uses a set of internal fields to assist in handling of data. These “meta” fields are not part of an event, but they are accessible, and Functions can use them to make processing decisions.

Fields for this Source:

  • __final
  • __inputId
  • __source
  • _time

How to Configure S3 to Send Event Notifications to SQS

  1. Create a Standard SQS Queue. Note its ARN.

  2. Replace its access policy with one similar to the examples below. To do so, select the queue; then, in the Permissions tab, click Edit Policy Document (Advanced). (The two examples differ only in the Principal element: the first grants public access to the SQS queue, while the second grants S3-only access.)

  3. In the Amazon S3 console, add a notification configuration to publish events of the s3:ObjectCreated:* type to the SQS queue.

Permissive SQS access policy:

    {
      "Version": "2012-10-17",
      "Id": "example-ID",
      "Statement": [
        {
          "Sid": "<SID name>",
          "Effect": "Allow",
          "Principal": { "AWS": "*" },
          "Action": ["SQS:SendMessage"],
          "Resource": "example-SQS-queue-ARN",
          "Condition": {
            "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:example-bucket-name" }
          }
        }
      ]
    }

Restrictive SQS access policy:

    {
      "Version": "2012-10-17",
      "Id": "example-ID",
      "Statement": [
        {
          "Sid": "<SID name>",
          "Effect": "Allow",
          "Principal": { "Service": "s3.amazonaws.com" },
          "Action": ["SQS:SendMessage"],
          "Resource": "example-SQS-queue-ARN",
          "Condition": {
            "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:example-bucket-name" }
          }
        }
      ]
    }

S3 and SQS Permissions

The following permissions are required on the S3 bucket:

  • s3:GetObject
  • s3:ListBucket

The following permissions are required on the SQS queue:

  • sqs:ReceiveMessage
  • sqs:DeleteMessage
  • sqs:ChangeMessageVisibility
  • sqs:GetQueueAttributes
  • sqs:GetQueueUrl
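
Taken together, a minimal IAM policy sketch granting these permissions (with hypothetical bucket and queue ARNs) might look like:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::example-bucket-name/*"
        },
        {
          "Effect": "Allow",
          "Action": ["s3:ListBucket"],
          "Resource": "arn:aws:s3:::example-bucket-name"
        },
        {
          "Effect": "Allow",
          "Action": [
            "sqs:ReceiveMessage",
            "sqs:DeleteMessage",
            "sqs:ChangeMessageVisibility",
            "sqs:GetQueueAttributes",
            "sqs:GetQueueUrl"
          ],
          "Resource": "arn:aws:sqs:us-east-1:123456789012:example-SQS-queue"
        }
      ]
    }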

Temporary Access via SSO Provider

You can use Okta or SAML to provide access to S3 buckets using temporary security credentials.

Proxying Requests

If you need to proxy HTTP/S requests, see System Proxy Configuration.

How Cribl Stream Pulls Data

Workers poll messages from SQS. The call will return messages if they are available, or will time out after 1 second if no messages are available.

Each Worker gets its share of the load from S3. By default, SQS returns a maximum of 1 message in a single poll request. You can change this default in Max messages.

Best Practices

Beyond these basics, also see our Amazon S3 Better Practices and Using S3 Storage and Replay guides:

  • When Cribl Stream instances are deployed on AWS, use IAM Roles whenever possible.

    • Not only is this safer, but it also makes the configuration simpler to maintain.
  • Although optional, we highly recommend that you use a Filename Filter.

    • This will ensure that Cribl Stream ingests only files of interest.
    • Ingesting only what’s strictly needed reduces latency and processing load, and improves data quality.
  • If higher throughput is needed, increase Advanced Settings > Num receivers and/or Max messages. However, do note:

    • Both of these default to 1, which means each Worker Process, on each Cribl Stream Worker Node, will run 1 receiver consuming 1 message (i.e., one S3 file) at a time.
    • Total S3 objects processed at a time per Worker Node = Worker Processes × Num receivers × Max messages. (See the worked example after this list.)
    • Increased throughput implies additional CPU utilization.
  • When ingesting large files, tune up the Visibility timeout, or consider using smaller objects.

    • The default value of 600s works well in most cases, and while you certainly can increase it, we suggest that you also consider using smaller S3 objects.
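
As a worked example of the formula above, a hypothetical Worker Node running 4 Worker Processes, with Num receivers set to 2 and Max messages set to 5, can have up to 4 × 2 × 5 = 40 S3 objects in flight at a time.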

Troubleshooting Notes

  • VPC endpoints for SQS and for S3 might need to be set up in your account. Check with your administrator for details.

  • If you’re having connectivity issues, but no problems with the AWS CLI, check whether the AWS CLI is using a proxy. Check with your administrator for details.

Troubleshooting Resources

Cribl University offers an Advanced Troubleshooting > Source Integrations: S3 short course. To follow the direct course link, first log into your Cribl University account. (To create an account, click the Sign up link. You’ll need to click through a short Terms & Conditions presentation, with chill music, before proceeding to courses, but Cribl’s training is always free of charge.) Once logged in, check out other useful Advanced Troubleshooting short courses and Troubleshooting Criblets.

