Time Estimate: 10 - 15 minutes
In this section you will test the asynchronous document processing pipeline that you deployed in the previous steps. You will simulate the processing of multiple documents that already exist in an Amazon S3 bucket, using S3 Batch Operations.
What is S3 Batch Operations?
Amazon S3 Batch Operations performs large-scale batch operations on Amazon S3 objects.
For more information, see https://docs.aws.amazon.com/AmazonS3/latest/user-guide/batch-ops.html.
Download the sample documents below (jpg/jpeg, png, pdf).
Go to the Amazon S3 bucket textractpipeline-existingdocumentsbucketxxxx created by the CDK commands and upload the PDF document and images that you downloaded. The xxxx at the end of textractpipeline-existingdocumentsbucketxxxx will have been replaced with a random string in your AWS account.
Download and open the inventory-test.csv file, then replace the textractpipeline-existingdocumentsbucketxxxx value with the actual Amazon S3 bucket name from Step 2 above where you uploaded the 3 documents (employmentapp.png, pdfdoc.pdf, twocolumn.jpg).
Go to the Amazon S3 bucket textractpipeline-inventoryandlogsxxxxx and upload the inventory-test.csv file containing the list of documents you just uploaded to the bucket textractpipeline-existingdocumentsbucketxxxx. The CSV file should have two columns: bucketName and objectName.
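For reference, the finished manifest might look like the sketch below. The bucket name is a placeholder; use your actual existingdocumentsbucket name. This assumes no header row, since S3 Batch Operations CSV manifests list one bucket,key pair per line:

```csv
textractpipeline-existingdocumentsbucketxxxx,employmentapp.png
textractpipeline-existingdocumentsbucketxxxx,pdfdoc.pdf
textractpipeline-existingdocumentsbucketxxxx,twocolumn.jpg
```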
Note: You can also use Amazon S3 Inventory to automatically generate a list of documents in your Amazon S3 bucket.
Go to Amazon S3 and choose Batch Operations, as shown below:
Select CSV under Manifest format, and under Path to manifest object navigate to your inventory-test.csv file, as in the screenshot below. Then choose Next.
Under Operation choose Invoke AWS Lambda function, and under Lambda function choose TextractPipeline-S3BatchProcessorxxxx, as in the screenshot below. Then choose Next.
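Behind this console choice, S3 Batch Operations invokes the selected Lambda function once per row of your manifest, passing the object's bucket ARN and key in the event. The real TextractPipeline-S3BatchProcessorxxxx function is deployed for you by the CDK stack; the sketch below only illustrates the event/response contract such a handler must follow:

```python
import urllib.parse


def handler(event, context):
    """Sketch of an S3 Batch Operations Lambda handler.

    S3 Batch Operations sends one or more tasks per invocation; the handler
    must echo back the invocationId and each taskId with a result code.
    """
    results = []
    for task in event["tasks"]:
        # Keys in the manifest arrive URL-encoded; decode before use.
        key = urllib.parse.unquote_plus(task["s3Key"])
        # The bucket ARN looks like arn:aws:s3:::bucket-name.
        bucket = task["s3BucketArn"].split(":::")[-1]
        # A real handler would kick off document processing here (e.g. queue
        # the object for Amazon Textract). This sketch just reports success.
        results.append({
            "taskId": task["taskId"],
            "resultCode": "Succeeded",
            "resultString": f"Queued s3://{bucket}/{key}",
        })
    return {
        "invocationSchemaVersion": "1.0",
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }
```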
Untick the Generate completion report box, under Permissions to access the specified resources choose Choose from existing IAM roles, and for IAM role choose TextractPipeline-S3BatchOperationRolexxxx, as in the screenshot below. Then choose Next.
Review your settings and choose Create job.
You should now see a screen like the one below:
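The console steps above can also be scripted with the S3 Control CreateJob API. The sketch below only builds the request parameters that mirror those console choices; the account ID, ARNs, and manifest ETag are placeholders you would fill in from your own account:

```python
def build_create_job_params(account_id, function_arn, role_arn,
                            manifest_arn, manifest_etag):
    """Build kwargs for s3control.create_job matching the console choices.

    All placeholder values (account ID, ARNs, ETag) must come from your
    own AWS account and uploaded manifest object.
    """
    return {
        "AccountId": account_id,
        "ConfirmationRequired": True,  # matches "Confirm and run" in console
        "Priority": 10,
        "RoleArn": role_arn,
        "Operation": {"LambdaInvoke": {"FunctionArn": function_arn}},
        "Report": {"Enabled": False},  # completion report was unticked above
        "Manifest": {
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
    }

# Usage (requires boto3 and AWS credentials):
# import boto3
# s3control = boto3.client("s3control")
# resp = s3control.create_job(**build_create_job_params(...))
```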
From the Amazon S3 Batch Operations page, click the Job ID link for the job you just created.
Click Confirm and run and then Run job.
From the S3 Batch Operations page, click refresh to see the updated job status.
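If you prefer to watch the job from code rather than refreshing the console, the status can be polled with the S3 Control DescribeJob API. A minimal sketch (the function name and polling interval are my own; pass a real boto3 s3control client, account ID, and job ID):

```python
import time


def wait_for_job(s3control, account_id, job_id, poll_seconds=10):
    """Poll an S3 Batch Operations job until it reaches a terminal status.

    `s3control` is expected to expose describe_job(AccountId=..., JobId=...)
    returning {"Job": {"Status": ...}}, as the boto3 S3Control client does.
    """
    terminal = {"Complete", "Failed", "Cancelled"}
    while True:
        status = s3control.describe_job(
            AccountId=account_id, JobId=job_id
        )["Job"]["Status"]
        if status in terminal:
            return status
        time.sleep(poll_seconds)
```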
Go to the Amazon S3 bucket textractpipeline-existingdocumentsbucketxxxx and you should see output generated for the documents in your list, as in the screenshot below:
Congratulations! You ran the asynchronous document pipeline for 3 existing documents in your S3 bucket using S3 Batch Operations!