Two Buckets and a Lambda: A Pattern for File Processing

Triggering a Lambda by uploading a file to S3 is one of the introductory examples of the service. As a tutorial, it can be implemented in under 15 minutes with canned code, and is something that a lot of people find useful in real life. But the tutorials that I've seen only look at the "happy path": they don't explore what happens (and how to recover) when things go wrong. Nor do they look at how the files get into S3 in the first place, which is a key part of any application design.

This post is a "deep dive" on the standard tutorial, looking at architectural decisions and operational concerns in addition to the simple mechanics of triggering a Lambda from an S3 upload.

Architecture

As the title says, the architecture uses two buckets and a Lambda function. The client uploads a file to the first ("staging") bucket, which triggers the Lambda; after processing the file, the Lambda moves it into the second ("archive") bucket.

Two Buckets and a Lambda: Architecture

Why two buckets?

From a strictly technical perspective, there's no need to have two buckets. You can configure an S3 trigger to fire when a file is uploaded with a specific prefix, and could move the file to a different prefix after processing, so you could keep everything inside a single bucket. You might also question the point of the archive bucket entirely: once the file has been processed, why keep it? I think there are several answers to this question, each from a different perspective.

First, as always, security: two buckets minimize blast radius. Clients require privileges to upload files; if you accidentally grant too wide a scope, the files in the staging bucket might be compromised. However, since files are removed from the staging bucket after they're processed, at any point in time that bucket should have few or no files in it. This assumes, of course, that those too-wide privileges don't also allow access to the archive bucket. One way to protect against that is to adopt the habit of narrowly-scoped policies that grant permissions on a single bucket.

Configuration management also becomes easier: with a shared bucket, everything that touches that bucket — from IAM policies, to S3 life-cycle policies, to application code — has to be configured with both a bucket name and a prefix. By going to two buckets, you can eliminate the prefix (although the application might still use prefixes to separate files, for example by client ID).

Failure recovery and bulk uploads are also easier when you separate new files from those that have been processed. In many cases it's a simple matter of moving the files back into the upload bucket to trigger reprocessing.

How do files get uploaded?

All of the examples that I've seen assume that a file magically arrives in S3; how it gets there is "left as an exercise for the reader." However, this can be quite challenging for real-world applications, especially those running in a user's browser. For this post I'm going to focus on two approaches: direct PUT using a presigned URL, and multi-part upload using the JavaScript SDK.

Pre-signed URLs

Amazon Web Services are, in fact, web services: every operation is implemented as an HTTPS request. For many services you don't think about that, and instead interact with the service via an Amazon-provided software development kit (SDK). For S3, however, the web-service nature is closer to the surface: you can skip the SDK and download files with GET, just as when interacting with a website, or upload files with a PUT or POST.

The one caveat to interacting with S3, assuming that you haven't just exposed your bucket to the world, is that these GETs and PUTs must be signed, using the credentials belonging to a user or role. The actual signing process is rather complex, and requires access credentials (which you don't want to provide to an arbitrary client, lest they be copied and used for nefarious purposes). As an alternative, S3 allows you to generate a pre-signed URL, using the credentials of the application generating the URL.

Using the S3 SDK, generating a presigned URL is easy: here's some Python code (which might be run in a web-service Lambda) that will create a pre-signed URL for a PUT request. Note that you have to provide the expected content type: a URL signed for text/plain can't be used to upload a file with type image/jpeg.

import boto3

s3_client = boto3.client('s3')

params = {
    'Bucket':      bucket,
    'Key':         key,
    'ContentType': content_type
}
url = s3_client.generate_presigned_url('put_object', Params=params)

If you run this code you'll get a long URL that contains all of the information needed to upload the file:

https://example.s3.amazonaws.com/example?AWSAccessKeyId=AKIA3XXXXXXXXXXXXXXX&Signature=M9SbH6zl9LpmM6%2F2POBk202dWjI%3D&content-type=text%2Fplain&Expires=1585003481              

Important caveat: just because you can generate a presigned URL doesn't mean the URL will be valid. For this example, I used bogus access credentials and referred to a bucket and key that (probably) doesn't exist (certainly not one that I control). If you paste it into a browser, you'll get an "Access Denied" response (albeit due to expiration, not invalid credentials).

To upload a file, your client first requests the presigned URL from the server, then uses that URL to upload the file (in this example, running in a browser, selectedFile was populated from a file input field and content is the result of using a FileReader to load that file).

async function uploadFile(selectedFile, content, url) {
    console.log("uploading " + selectedFile.name);
    const request = {
        method: 'PUT',
        mode: 'cors',
        cache: 'no-cache',
        headers: {
            'Content-Type': selectedFile.type
        },
        body: content
    };
    let response = await fetch(url, request);
    console.log("upload status: " + response.status);
}

Multi-part uploads

While presigned URLs are convenient, they have some limitations. The first is that objects uploaded by a single PUT are limited to 5 GB in size. While this may be larger than anything you expect to upload, there are some use cases that will exceed that limit. And even if you are under that limit, large files can still be a problem to upload: with a fast, 100 Mbit/sec network dedicated to one user, it will take about two minutes to upload a 1 GB file — two minutes in which your user's browser sits, apparently unresponsive. And if there's a network hiccup in the middle, you have to start the upload over again.
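The back-of-the-envelope arithmetic behind that estimate, assuming a dedicated 100 Mbit/sec link:

```python
# Time to push a 1 GB file over a dedicated 100 Mbit/sec link, ignoring
# TCP/TLS overhead (which pushes the real-world figure toward two minutes).
file_size_bits = 1_000_000_000 * 8   # 1 GB expressed in bits
bandwidth_bps = 100_000_000          # 100 Mbit/sec
seconds = file_size_bits / bandwidth_bps
print(seconds)                       # 80.0
```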

A better alternative, even if your files aren't that big, is to use multi-part uploads with a client SDK. Under the covers, a multi-part upload starts by retrieving a token from S3, then uses that token to upload chunks of the file (typically around 10 MB each), and finally marks the upload as complete. I say "under the covers" because all of the SDKs have a high-level interface that handles the details for you, including resending any failed chunks.
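To make those steps concrete, here's a sketch of the low-level Boto3 calls that the high-level interfaces wrap. It's illustrative only: there's no retry logic, and note that S3 requires every part except the last to be at least 5 MB.

```python
def multipart_upload(s3_client, bucket, key, fileobj, part_size=10 * 1024 * 1024):
    """Upload a file-like object in chunks; S3 requires each part except
    the last to be at least 5 MB. `s3_client` is a boto3 S3 client."""
    upload = s3_client.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = upload['UploadId']
    parts = []
    part_number = 1
    try:
        while True:
            chunk = fileobj.read(part_size)
            if not chunk:
                break
            response = s3_client.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=chunk)
            parts.append({'PartNumber': part_number, 'ETag': response['ETag']})
            part_number += 1
        s3_client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={'Parts': parts})
    except Exception:
        # abandon the upload so S3 doesn't keep (and bill for) orphaned parts
        s3_client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

In practice you'd let the SDK's high-level interface (such as Boto3's upload_fileobj()) do this for you; the sketch just shows the shape of the protocol.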

However, this has to be done with a client-side SDK: you can't pre-sign a multi-part upload. Which means you must provide credentials to that client. Which in turn means that you want to limit the scope of those credentials. And while you can use Amazon Cognito to provide limited-time credentials, you can't use it to provide limited-scope credentials: all authenticated Cognito users share the same role.

To provide limited-scope credentials, you need to assume a role that has general privileges to access the bucket while applying a "session" policy that restricts that access. This can be implemented using a Lambda as an API endpoint:

sts_client = boto3.client('sts')

role_arn = os.environ['ASSUMED_ROLE_ARN']
session_name = f"{context.function_name}-{context.aws_request_id}"

response = sts_client.assume_role(
    RoleArn=role_arn,
    RoleSessionName=session_name,
    Policy=json.dumps(session_policy)
)
creds = response['Credentials']

return {
    'statusCode': 200,
    'headers': {
        'Content-Type': 'application/json'
    },
    'body': json.dumps({
        'access_key':     creds['AccessKeyId'],
        'secret_key':     creds['SecretAccessKey'],
        'session_token':  creds['SessionToken'],
        'region':         os.environ['AWS_REGION'],
        'bucket':         bucket
    })
}

Even if you're not familiar with the Python SDK, this should be fairly easy to follow: STS (the Security Token Service) provides an assume_role method that returns credentials. The specific role isn't particularly important, as long as it allows s3:PutObject on the staging bucket. However, to restrict that role to allow uploading only a single file, you must apply a session policy:

session_policy = {
    'Version': '2012-10-17',
    'Statement': [
        {
            'Effect': 'Allow',
            'Action': 's3:PutObject',
            'Resource': f"arn:aws:s3:::{bucket}/{key}"
        }
    ]
}

On the client side, you would use these credentials to construct a ManagedUpload object, then use it to perform the upload. As with the prior example, selectedFile is set using an input field. Unlike the prior example, there's no need to explicitly read the file's contents into a buffer; the SDK does that for you.

async function uploadFile(selectedFile, accessKeyId, secretAccessKey, sessionToken, region, bucket) {
    AWS.config.region = region;
    AWS.config.credentials = new AWS.Credentials(accessKeyId, secretAccessKey, sessionToken);

    console.log("uploading " + selectedFile.name);
    const params = {
      Bucket:       bucket,
      Key:          selectedFile.name,
      ContentType:  selectedFile.type,
      Body:         selectedFile
    };
    let upload = new AWS.S3.ManagedUpload({ params: params });
    upload.on('httpUploadProgress', function(evt) {
        console.log("uploaded " + evt.loaded + " of " + evt.total + " bytes for " + selectedFile.name);
    });
    return upload.promise();
}

If you use multi-part uploads, create a bucket life-cycle rule that deletes incomplete uploads. If you don't do this, you might find an ever-increasing S3 storage bill for your staging bucket that makes no sense given the small number of objects in the bucket listing. The cause is interrupted multi-part uploads: the user closed their browser window, or lost network connectivity, or did something else to prevent the SDK from marking the upload complete. Unless you have a life-cycle rule, S3 will keep (and bill for) the parts of those uploads, in the hope that someday a client will come back and either complete them or explicitly abort them.
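Here's a hypothetical sketch of creating such a rule with Boto3; the rule ID and the seven-day window are arbitrary choices, and s3_client is assumed to be a boto3 S3 client. Beware that this call replaces the bucket's entire life-cycle configuration, so merge in any existing rules first.

```python
def add_abort_rule(s3_client, bucket, days=7):
    """Delete the parts of any multi-part upload that hasn't completed
    within `days` days of being started."""
    rule = {
        'ID': 'AbortIncompleteUploads',
        'Status': 'Enabled',
        'Filter': {},  # no filter: apply to every object in the bucket
        'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': days}
    }
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={'Rules': [rule]})
    return rule
```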

You'll also need a CORS configuration on your bucket that (1) allows both PUT and POST requests, and (2) exposes the ETag header.
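A minimal sketch of that configuration, applied with Boto3 (the origin is a placeholder, and s3_client is assumed to be a boto3 S3 client); the ETag header is exposed because the multi-part SDK reads each part's ETag from the upload response:

```python
def configure_upload_cors(s3_client, bucket, origin):
    """Allow browser-based uploads from `origin` and expose the ETag header."""
    cors_config = {
        'CORSRules': [{
            'AllowedOrigins': [origin],   # e.g. "https://app.example.com"
            'AllowedMethods': ['PUT', 'POST'],
            'AllowedHeaders': ['*'],
            'ExposeHeaders':  ['ETag']    # needed by multi-part uploads
        }]
    }
    s3_client.put_bucket_cors(Bucket=bucket, CORSConfiguration=cors_config)
    return cors_config
```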


A prototypical transformation Lambda

In this section I'm going to call out what I consider "best practices" when writing a Lambda. My implementation language of choice is Python, but the same ideas apply to any other language.

The Lambda handler

I like Lambda handlers that don't do a lot of work inside the handler function, so my prototypical Lambda looks like this:

import boto3
import json
import logging
import os
import urllib.parse

archive_bucket = os.environ['ARCHIVE_BUCKET']

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

s3_client = boto3.client('s3')


def lambda_handler(event, context):
    logger.debug(json.dumps(event))
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        raw_key = record['s3']['object']['key']
        key = urllib.parse.unquote_plus(raw_key)
        try:
            logger.info(f"processing s3://{bucket}/{key}")
            process(bucket, key)
            logger.info(f"moving s3://{bucket}/{key} to s3://{archive_bucket}/{key}")
            archive(bucket, key)
        except Exception as ex:
            logger.exception(f"unhandled exception processing s3://{bucket}/{key}")


def process(bucket, key):
    # do something here
    pass


def archive(bucket, key):
    s3_client.copy(
        CopySource={'Bucket': bucket, 'Key': key},
        Bucket=archive_bucket,
        Key=key)
    s3_client.delete_object(Bucket=bucket, Key=key)

Breaking it down:

  • I get the name of the archive bucket from an environment variable (the name of the upload bucket is part of the invocation event).
  • I'm using the Python logging module for all output. Although I don't do it here, this lets me write JSON log messages, which are easier to use with CloudWatch Logs Insights or import into Elasticsearch.
  • I create the S3 client outside the Lambda handler. In general, you want to create long-lived clients outside the handler so that they can be reused across invocations. At the same time, you don't want to establish network connections when loading a Python module, because that makes it hard to unit test. In the case of the Boto3 library, however, I know that it creates connections lazily, so there's no harm in creating the client as part of module initialization.
  • The handler function loops over the records in the event. One of the common mistakes that I see with event-handler Lambdas is that they assume there will only be a single record in the event. That may be right 99% of the time, but you still need to write a loop.
  • The S3 keys reported in the event are URL-encoded; you need to decode the key before passing it to the SDK. I use urllib.parse.unquote_plus(), which in addition to handling "percent-encoded" characters, will translate a + into a space.
  • For each input file, I call the process() function followed by the archive() function. This pair of calls is wrapped in a try/except block, meaning that an individual failure won't affect the other files in the event. It also means that the Lambda runtime won't retry the event (which would almost certainly hit the same failure, and which would mean that the "early" files would be processed multiple times).
  • In this example process() doesn't do anything; in the real world this is where you'd put most of your code.
  • The archive() function moves the file as a combined copy and delete; there is no built-in "move" operation (again, S3 is a web service, and so is limited to the HTTP "verbs").
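To see what that decoding step does, here's unquote_plus() applied to a key the way S3 reports it in an event (the key itself is invented):

```python
import urllib.parse

# S3 reports keys with "+" for spaces and percent-escapes for other characters
raw_key = "reports/2020/my+file%20name%20%28final%29.txt"
key = urllib.parse.unquote_plus(raw_key)
print(key)   # reports/2020/my file name (final).txt
```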

Don't store files on attached disk

Lambda provides 512 MB of temporary disk space. It is tempting to use that space to buffer your files during processing, but doing so has two potential problems. First, it may not be enough space for your file. Second, and more important, you have to be careful to keep it clean, deleting your temporary files even if the function throws an exception. If you don't, you may run out of space due to repeated invocations of the same Lambda environment.

There are alternatives. The first, and easiest, is to download the entire file into RAM and work with it there. Python makes this particularly easy with its BytesIO class, which behaves identically to an on-disk file. You may still have an issue with very large files, and will have to configure your Lambda with enough memory to hold the entire file (which may increase your per-invocation cost), but I believe that the simplified coding is worth it.
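A sketch of that in-memory approach; the line-counting transform is a stand-in for real processing, and s3_client is assumed to be a boto3 S3 client:

```python
import io

def process_in_memory(s3_client, bucket, key):
    """Download the object into RAM and hand a file-like object to the real work."""
    buffer = io.BytesIO()
    s3_client.download_fileobj(bucket, key, buffer)
    buffer.seek(0)                   # rewind so reads start at the beginning
    return transform(buffer)

def transform(fileobj):
    # stand-in for real processing: count the lines in the file
    return sum(1 for _ in fileobj)
```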

You can also work with the response from an S3 GET request as a stream of bytes. The various SDK docs caution against doing this: to quote the Java SDK, "the object contents […] stream directly from Amazon S3." I suspect that this warning is more relevant to multi-threaded environments such as a Java application server, where a long-running request might block access to the SDK connection pool. For a single-threaded Lambda, I don't see it as an issue.

Of more concern, the SDK for your language might not expose this stream in a way that's consistent with standard file-based IO. Boto3, the Python SDK, is one of the offenders: its get_object() function returns a StreamingBody, which has its own methods for retrieving data and does not follow Python's io library conventions.

A final alternative is to use byte-range retrievals from S3, with a relatively small buffer size. I don't think this is a particularly good alternative, as you will have to handle the case where your data records span retrieved byte ranges.
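For completeness, a sketch of that approach using get_object()'s Range parameter (standard HTTP byte-range syntax); note that the caller receives raw chunks and must itself reassemble any record that spans two of them:

```python
def iter_byte_ranges(s3_client, bucket, key, total_size, chunk_size=1024 * 1024):
    """Yield the object's contents as fixed-size chunks; `total_size` would
    typically come from a head_object() call."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        response = s3_client.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        yield response['Body'].read()
```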

IAM Permissions

The principle of least privilege says that this Lambda should only be allowed to read and delete objects in the upload bucket, and write objects in the archive bucket (in addition to whatever permissions are needed to process the file). I like to manage these permissions as separate inline policies in the Lambda's execution role (shown here as a fragment from the CloudFormation resource definitions):

Policies:
  -
    PolicyName:                   "ReadFromSource"
    PolicyDocument:
      Version:                    "2012-10-17"
      Statement:
        Effect:                   "Allow"
        Action:
          -                       "s3:DeleteObject"
          -                       "s3:GetObject"
        Resource:                 [ !Sub "arn:${AWS::Partition}:s3:::${UploadBucketName}/*" ]
  -
    PolicyName:                   "WriteToDestination"
    PolicyDocument:
      Version:                    "2012-10-17"
      Statement:
        Effect:                   "Allow"
        Action:
          -                       "s3:PutObject"
        Resource:                 [ !Sub "arn:${AWS::Partition}:s3:::${ArchiveBucketName}/*" ]

I personally prefer inline role policies, rather than managed policies, because I like to tailor my roles' permissions to the applications that use them. However, a real-world Lambda will require additional privileges in order to do its work, and you may find yourself bumping into IAM's 10kb limit for inline policies. If so, managed policies might be your best solution, but I would still target them at a single application.

Handling duplicate and out-of-order invocations

In any distributed system you have to be prepared for messages to be resent or sent out of order; this one is no different. The standard approach to dealing with this problem is to make your handlers idempotent: writing them in such a way that you can call them multiple times with the same input and get the same outcome. This can be either very easy or incredibly difficult.

On the easy side, if you know that you'll only get one version of a source file and the processing step will always produce the same output, just run it again. You may pay a little more for the excess Lambda invocations, but that's almost certainly less than you'll pay for developer time to ensure that the process only happens once.

Where things get difficult is when you have to deal with concurrently processing different versions of the same file: version 1 of a file is uploaded, and while the Lambda is processing it, version 2 is uploaded. Since Lambdas are spun up as needed, you're likely to have a race condition with two Lambdas processing the same file, and whichever finishes last wins. To deal with this you have to implement some way to keep track of in-process requests, perhaps using a transactional database, and delay or abort the second Lambda (and note that a delay will turn into an abort if the Lambda times out).
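One hypothetical way to track in-process requests is a DynamoDB table whose conditional writes act as a lock. The table and attribute names here are invented, and a real implementation would also need to release (or expire) the lock when processing ends:

```python
def try_acquire(dynamodb, key, request_id):
    """Record that `key` is being processed; returns False if another
    invocation already holds it. `dynamodb` is a boto3 DynamoDB client."""
    try:
        dynamodb.put_item(
            TableName='in-process-files',
            Item={'filekey': {'S': key}, 'owner': {'S': request_id}},
            # the write fails if some other invocation already wrote this key
            ConditionExpression='attribute_not_exists(filekey)')
        return True
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False
```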

Another alternative is to abandon the simple "two buckets" pattern, and turn to an approach that uses a queue. The challenge here is providing concurrency: if you use a queue to single-thread file processing, you can easily find yourself backed up. One solution is multiple queues, with uploaded files distributed to queues based on a name hash; a multi-shard Kinesis stream gives you this hashing by design.

In a future post I'll dive into these scenarios; for now I simply want to make you aware that they exist and should be considered in your architecture.

What if the file's too big to be transformed by a Lambda?

Lambda is a great solution for asynchronously processing uploads, but it's not appropriate for all situations. It falls down with large files, long-running transformations, and many tasks that require native libraries. One place that I personally ran into those limitations was video transformation, with files that could be up to 2 GB and a native library that was not available in the standard Lambda execution environment.

There are, to be sure, ways to work around all of these limitations. But rather than shoehorn your large, long-running task into an environment designed for short tasks, I recommend looking at other AWS services.

The first place that I would turn is AWS Batch: a service that runs Docker images on a cluster of EC2 instances, which can be managed by the service to meet your performance and throughput requirements. You can create a Docker image that packages your application with any third-party libraries that it needs, and use a bucket-triggered Lambda to invoke that image with arguments that identify the file to be processed.

When shifting the file processing out of Lambda, you have to give some thought to how the overall pipeline looks. For example, do you use Lambda just as a trigger, and rely on the batch job to move the file from the staging bucket to the archive bucket? Or do you have a second Lambda that's triggered by the CloudWatch Event indicating that the batch job is done? Or do you use a Step Function? These are topics for a future post.

Wrapping up: a skeleton application

I've made an example project available on GitHub. This example is rather more complex than just a Lambda to process files:

two buckets, a lambda, and a simple webapp

The core of the project is a CloudFormation script to build out the infrastructure; the project's README file gives instructions on how to run it. This example doesn't use any AWS services that have a per-hour charge, but you will be charged for the content stored in S3, for API Gateway requests, and for Lambda invocations.


Source: https://chariotsolutions.com/blog/post/two-buckets-and-a-lambda-a-pattern-for-file-processing/
