AWS Certified Big Data - Specialty (#44)

A new algorithm has been written in Python to identify spam e-mails. The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3. The algorithm must now be scaled to a production dataset of 5 PB, which also resides in Amazon S3. Which AWS service strategy is best for this use case?

Copy the data into Amazon ElastiCache to perform text analysis on the in-memory data and export the results of the model into Amazon Machine Learning.
Use Amazon EMR to parallelize the text analysis tasks across the cluster using a streaming program step.
Use Amazon Elasticsearch Service to store the text and then use the Python Elasticsearch Client to run analysis against the text index.
Initiate a Python job from AWS Data Pipeline to run directly against the Amazon S3 text files.
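
For context on the Amazon EMR option: a "streaming program step" refers to Hadoop Streaming, in which any executable that reads records from standard input and writes tab-separated key/value pairs to standard output can act as a mapper or reducer. The following is a minimal sketch of what a Python mapper might look like, assuming one e-mail body per input line. The keyword list and the `classify` function are hypothetical placeholders for the real algorithm, which the question does not describe.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop Streaming-style mapper: one e-mail per line in, label<TAB>1 out."""
import sys

# Hypothetical keyword list standing in for the real spam algorithm.
SPAM_HINTS = ("free money", "act now", "winner")


def classify(text):
    """Placeholder classifier (assumption): returns 'spam' or 'ham'."""
    lowered = text.lower()
    return "spam" if any(hint in lowered for hint in SPAM_HINTS) else "ham"


def map_lines(lines):
    """Yield tab-separated key/value records, skipping blank lines."""
    for line in lines:
        text = line.strip()
        if text:
            yield f"{classify(text)}\t1"


if __name__ == "__main__" and not sys.stdin.isatty():
    # Hadoop Streaming pipes each input split to the script on stdin.
    for record in map_lines(sys.stdin):
        print(record)
```

Because each mapper only sees the lines it is handed, the framework can run many copies of the script in parallel over different splits of the S3 data. A reducer written in the same style could then sum the counts per label.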