Configure Databricks S3 commit service-related settings

Databricks runs a commit service that coordinates writes to Amazon S3 from multiple clusters. This service runs in the Databricks control plane. For additional security, you can disable the service’s direct upload optimization as described in Disable the direct upload optimization. To further restrict access to your S3 buckets, see Restrict access to specific IP addresses.

If you receive AWS GuardDuty alerts related to the S3 commit service, see AWS GuardDuty alerts related to S3 commit service.

About the commit service

The S3 commit service helps guarantee consistency of writes across multiple clusters on a single table in specific cases. For example, the commit service helps Delta Lake implement ACID transactions.

In the default configuration, Databricks sends temporary AWS credentials from the compute plane to the control plane in the commit service API call. Instance profile credentials are valid for six hours.

The compute plane writes data directly to S3, and then the S3 commit service in the control plane provides concurrency control by finalizing the commit log upload (completing the multipart upload described below). The commit service does not read any data from S3; it writes a new file to S3 only if the file does not already exist.

The most common data that the Databricks commit service writes to S3 is the Delta log, which contains statistical aggregates from your data, such as each column's minimum and maximum values. Most Delta log data is sent to S3 from the cluster using an Amazon S3 multipart upload.

After the cluster stages the multipart data to write the Delta log to S3, the S3 commit service in the Databricks control plane finishes the S3 multipart upload by letting S3 know that it is complete. As a performance optimization for very small updates, by default the commit service sometimes pushes small updates directly from the control plane to S3. This direct upload optimization can be disabled. See Disable the direct upload optimization.
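To make this hand-off concrete, the following sketch shows the three phases of an S3 multipart upload using boto3. The bucket and key names are placeholders and the sketch is only an illustration of the S3 API sequence, not Databricks code: conceptually, the compute plane performs the create and upload-part steps, and the commit service performs the final completion step.

  # Illustration of the S3 multipart upload sequence (boto3).
  # Bucket and key names are placeholders; this is not Databricks code.
  import boto3

  s3 = boto3.client("s3")
  bucket = "my-delta-bucket"
  key = "table/_delta_log/00000000000000000010.json"

  # 1. Start the upload (conceptually, done from the compute plane).
  upload = s3.create_multipart_upload(Bucket=bucket, Key=key)

  # 2. Stage the data as parts (also from the compute plane).
  part = s3.upload_part(
      Bucket=bucket,
      Key=key,
      UploadId=upload["UploadId"],
      PartNumber=1,
      Body=b'{"commitInfo": {}}',
  )

  # 3. Finalize the upload. In Databricks, this is the step the S3 commit
  #    service in the control plane performs to complete the commit.
  s3.complete_multipart_upload(
      Bucket=bucket,
      Key=key,
      UploadId=upload["UploadId"],
      MultipartUpload={"Parts": [{"PartNumber": 1, "ETag": part["ETag"]}]},
  )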

In addition to Delta Lake, the following Databricks features use the same S3 commit service:

  • Structured Streaming

  • Auto Loader

  • The SQL command COPY INTO

The commit service is necessary because Amazon S3 does not provide an operation that writes an object only if it does not already exist. Amazon S3 is a distributed system: if it receives multiple simultaneous write requests for the same object, only the last write is kept and the earlier writes are overwritten. Without the ability to centrally verify commits, simultaneous commits from different clusters would corrupt tables.
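To see why a central service is needed, the sketch below shows a naive client-side "create only if absent" check with boto3 (the helper name is hypothetical). Two clusters can both pass the existence check before either has written the object, so the later put silently overwrites the earlier commit; the commit service avoids this by serializing the final step in one place.

  # Sketch of a naive check-then-put, which is NOT atomic.
  # The helper name is hypothetical; this is not Databricks code.
  import boto3
  from botocore.exceptions import ClientError

  s3 = boto3.client("s3")

  def naive_put_if_absent(bucket: str, key: str, body: bytes) -> bool:
      try:
          s3.head_object(Bucket=bucket, Key=key)  # check whether the object exists
          return False  # it does; leave it alone
      except ClientError as e:
          if e.response["Error"]["Code"] != "404":
              raise
      # Race window: another cluster can write the same key right here,
      # and the put below silently overwrites its commit.
      s3.put_object(Bucket=bucket, Key=key, Body=body)
      return True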

AWS GuardDuty alerts related to S3 commit service

Important

Commits to tables managed by Unity Catalog do not trigger GuardDuty alerts.

If you use AWS GuardDuty and you access data using AWS IAM instance profiles, GuardDuty may create alerts for default Databricks behavior related to Delta Lake, Structured Streaming, Auto Loader, or COPY INTO. These alerts are related to instance credential exfiltration detection, which is enabled by default. These alerts include the title UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.InsideAWS.

You can configure your Databricks deployment to address GuardDuty alerts related to the S3 commit service by creating an AWS instance profile that assumes the role of your original S3 data access IAM role.

Instead of passing instance profile credentials, this new instance profile lets clusters assume the data access role with short-duration tokens. This capability already exists in all recent Databricks Runtime versions and can be enforced globally with cluster policies (a sketch of such a policy follows the steps below).

  1. If you have not already done so, create a normal instance profile to access the S3 data. This instance profile uses instance profile credentials to directly access the S3 data.

    This section refers to the role ARN in this instance profile as the <data-role-arn>.

  2. Create a new instance profile that uses tokens and references the instance profile that directly accesses the data. Your cluster will reference this new token-based instance profile. See Tutorial: Configure S3 access with an instance profile.

    This instance profile does not need any direct S3 access. Instead it needs only the permissions to assume the IAM role that you use for data access. This section refers to the role ARN in this instance profile as the <cluster-role-arn>.

    1. Attach an IAM policy to the new cluster instance profile's IAM role (<cluster-role-arn>). Add the following policy statement to that role and replace <data-role-arn> with the ARN of the original instance profile that accesses your bucket.

      {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "<data-role-arn>"
      }
      
    2. Add a trust policy statement to your existing data access IAM role and replace <cluster-role-arn> with the ARN of the new cluster instance profile that your clusters use.

      {
        "Effect": "Allow",
        "Principal": {
            "AWS": "<cluster-role-arn>"
        },
        "Action": "sts:AssumeRole"
      }
      
  3. To use notebook code that makes a direct connection to S3 without using DBFS, configure your clusters to use the new token-based instance profile and to assume the data access role.

    • Configure a cluster for S3 access to all buckets. Add the following to the cluster’s Spark configuration:

      fs.s3a.credentialsType AssumeRole
      fs.s3a.stsAssumeRole.arn <data-role-arn>
      
    • Alternatively, you can configure this for a specific bucket:

      fs.s3a.bucket.<bucket-name>.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
      fs.s3a.bucket.<bucket-name>.assumed.role.arn <data-role-arn>
      
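As noted earlier, the assume-role configuration from step 3 can be enforced globally with a cluster policy. The following is a minimal sketch of such a policy definition; the ARN is a placeholder for your data access role.

  {
    "spark_conf.fs.s3a.credentialsType": {
      "type": "fixed",
      "value": "AssumeRole"
    },
    "spark_conf.fs.s3a.stsAssumeRole.arn": {
      "type": "fixed",
      "value": "<data-role-arn>"
    }
  }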

Disable the direct upload optimization

As a performance optimization for very small updates, by default the commit service sometimes pushes small updates directly from the control plane to S3. To disable this optimization, set the Spark parameter spark.hadoop.fs.s3a.databricks.s3commit.directPutFileSizeThreshold to 0. You can apply this setting in the cluster’s Spark config or set it using cluster policies.
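For example, the cluster's Spark config would include the following line; to enforce it for all clusters, pin the same key with a "fixed" cluster policy entry, following the policy pattern shown earlier.

  spark.hadoop.fs.s3a.databricks.s3commit.directPutFileSizeThreshold 0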

Disabling this feature may result in a small performance impact for near real-time Structured Streaming queries with constant small updates. Consider testing the performance impact with your data before disabling this feature in production.

Restrict access to specific IP addresses

You can limit specific S3 buckets to be accessible only from specific IP addresses. For example, you can restrict access to only your own environment and the IP addresses for the Databricks control plane, including the S3 commit service. This reduces the risk that credentials are used from other locations. See (Optional) Restrict access to S3 buckets.
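The following is a minimal sketch of a bucket policy that denies requests from outside an allow list of IP ranges. The bucket name and CIDR ranges are placeholders, and a production policy typically also needs exceptions (for example, for access through VPC endpoints), as described in the linked documentation.

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "DenyRequestsFromUntrustedNetworks",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
          "arn:aws:s3:::<bucket-name>",
          "arn:aws:s3:::<bucket-name>/*"
        ],
        "Condition": {
          "NotIpAddress": {
            "aws:SourceIp": [
              "<corporate-network-cidr>",
              "<databricks-control-plane-cidr>"
            ]
          }
        }
      }
    ]
  }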

