Test the Asynchronous Pipeline

Time Estimate: 10 - 15 minutes

In this section you will test the asynchronous document processing pipeline that you deployed in the previous steps. You will simulate processing multiple documents that already exist in an Amazon S3 bucket, using S3 Batch Operations.

What is S3 Batch Operations?
Amazon S3 Batch Operations performs large-scale batch operations on Amazon S3 objects, such as invoking an AWS Lambda function for every object listed in a manifest.
For more information, see https://docs.aws.amazon.com/AmazonS3/latest/user-guide/batch-ops.html.

  1. Download the sample documents below (PNG, PDF, JPG):

    employmentapp.png
    pdfdoc.pdf
    twocolumn.jpg

  2. Go to the Amazon S3 bucket textractpipeline-existingdocumentsbucketxxxx created by the CDK commands and upload the PDF document and images that you downloaded. The xxxx at the end of textractpipeline-existingdocumentsbucketxxxx is a random suffix unique to your AWS account.

  3. Download and open the inventory-test.csv file, then replace the textractpipeline-existingdocumentsbucketxxxx value with the actual name of the Amazon S3 bucket from Step 2 where you uploaded the three documents (employmentapp.png, pdfdoc.pdf, twocolumn.jpg).

    inventory-test.csv
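
If you prefer to generate the manifest rather than edit it by hand, a short script like the following can write it. This is an illustrative sketch, not part of the workshop's deployed code; the bucket name is the placeholder from Step 2 and must be replaced with your actual bucket name.

```python
# Generate inventory-test.csv locally instead of editing it by hand.
# The bucket name below is the placeholder from Step 2 -- replace the
# xxxx suffix with the random string from your own account.
import csv

def write_inventory(path, bucket, object_keys):
    """Write a two-column (bucketName, objectName) manifest CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for key in object_keys:
            writer.writerow([bucket, key])

write_inventory(
    "inventory-test.csv",
    "textractpipeline-existingdocumentsbucketxxxx",
    ["employmentapp.png", "pdfdoc.pdf", "twocolumn.jpg"],
)
```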

  4. Go to the Amazon S3 bucket textractpipeline-inventoryandlogsxxxxx and upload the inventory-test.csv file containing the list of documents you just uploaded to the bucket textractpipeline-existingdocumentsbucketxxxx. The CSV file should have two columns, bucketName and objectName.

    Note: You can also use Amazon S3 Inventory to automatically generate a list of documents in your Amazon S3 bucket.

  5. Go to Amazon S3 and click Batch Operations, as shown below:

    Batch Operations
  6. Select CSV under Manifest format, and under Path to manifest object navigate to your inventory-test.csv file, as in the screenshot below. Then choose Next.

    Batch Operations 1
  7. Under Operation choose Invoke AWS Lambda function, and under Lambda function choose TextractPipeline-S3BatchProcessorxxxx, as in the screenshot below. Then choose Next.

    Batch Operations 2
  8. Clear the Generate completion report check box. Under Permissions to access the specified resources, choose Choose from existing IAM roles, and for IAM role choose TextractPipeline-S3BatchOperationRolexxxx, as in the screenshot below. Then choose Next.

    Batch Operations 3
  9. Review the job settings and choose Create job.

  10. You should now see a screen like the one below:

    Batch Operations 4
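
For reference, Steps 5 through 9 can also be performed from code. The sketch below assembles the same job definition for boto3's S3 Control `create_job` API; the account ID, ETag, and ARNs are placeholders you would look up in your own account, and this is an illustrative sketch rather than part of the workshop's deployed code.

```python
# Sketch: create the same batch job with the S3 Control API instead of
# the console. All IDs and ARNs below are placeholders for values from
# your own account.

def build_batch_job_request(account_id, manifest_bucket, manifest_key,
                            manifest_etag, lambda_arn, role_arn):
    """Assemble the parameters for s3control.create_job."""
    return {
        "AccountId": account_id,
        "ConfirmationRequired": True,   # job waits for "Confirm and run" (Step 12)
        "Operation": {"LambdaInvoke": {"FunctionArn": lambda_arn}},
        "Report": {"Enabled": False},   # completion report disabled, as in Step 8
        "Manifest": {
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {
                "ObjectArn": f"arn:aws:s3:::{manifest_bucket}/{manifest_key}",
                "ETag": manifest_etag,  # shown on the manifest object's detail page
            },
        },
        "Priority": 10,
        "RoleArn": role_arn,
    }

# Usage (requires boto3 and AWS credentials):
# import boto3
# s3control = boto3.client("s3control")
# response = s3control.create_job(**build_batch_job_request(...))
# print(response["JobId"])
```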
  11. From the Amazon S3 Batch Operations page, click the Job ID link for the job you just created.

  12. Click Confirm and run, and then Run job.

  13. From the S3 Batch Operations page, click the refresh button to see the job status update.
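
Instead of refreshing the console, you can poll the job status from code. Below is a minimal polling sketch, assuming boto3 and AWS credentials are configured; the account ID and job ID are placeholders.

```python
# Poll the batch job until it reaches a terminal state, instead of
# refreshing the console page. Account ID and job ID are placeholders.
import time

TERMINAL_STATES = {"Complete", "Failed", "Cancelled"}

def wait_for_job(s3control, account_id, job_id, delay=10):
    """Poll DescribeJob and return the job's final status."""
    while True:
        job = s3control.describe_job(AccountId=account_id, JobId=job_id)["Job"]
        status = job["Status"]
        print(f"Job {job_id}: {status}")
        if status in TERMINAL_STATES:
            return status
        time.sleep(delay)

# Usage (requires boto3 and AWS credentials):
# import boto3
# s3control = boto3.client("s3control")
# wait_for_job(s3control, "111122223333", "<your job id>")
```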

  14. Go to the Amazon S3 bucket textractpipeline-existingdocumentsbucketxxxx; you should see output generated for the documents in your list, as in the screenshot below:

    Batch Operations 5

Congratulations! You ran the asynchronous document pipeline on three existing documents in your S3 bucket using S3 Batch Operations.