Assignment 7

Due Friday, December 8, 2017, at 11:59pm via Sakai

Introduction

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a Hadoop cluster. In this assignment, you will learn how to write map-reduce applications to do some simple data analysis tasks on Amazon EMR.

Please take the time to read this assignment thoroughly. The writeup is much, much longer than your code will be. Be sure to follow directions carefully to avoid incurring fees from Amazon.

Project

In this assignment, you are asked to complete four programming tasks, ordered from easy to hard, to learn how to write simple map-reduce applications. Currently, EMR runs Hadoop v2.7.3, which supports two styles of API: the old style (v1) and the new style (v2). Although the new-style API is more powerful, we will use the old-style API in this assignment since it is simpler for beginners. Official documentation for the old-style API can be found at: hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

We will use two sample data sets provided by Amazon EMR as the input to our simple map-reduce applications. Examine them and make sure you understand their format, since we will do further analysis on them.

1. s3://elasticmapreduce/samples/wordcount/input/
The sample data is a set of files containing a lot of words. You can download one of the sample files from the link below to check it.
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0001
2. s3://elasticmapreduce/samples/cloudfront/input/

The sample data is a series of Amazon CloudFront web distribution log files. Each entry in the CloudFront log files provides details about a single user request in the following format:

2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - Mozilla/5.0%20(MacOS;%20U;%20Windows%20NT%205.1;%20en-US;%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9

Detailed instructions on the format of the log file can be found at http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#BasicDistributionFileFormat.
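For the later tasks it helps to know that each log entry is whitespace-delimited, so a mapper can split it into fields by position. Below is a minimal sketch (the class name and the shortened sample line are ours; the field positions follow the entry format shown above):

// Sketch only: splitting a CloudFront log entry into its fields.
// The field positions assume the whitespace-delimited format above.
public class LogEntryDemo {
  public static void main(String[] args) {
    String line = "2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET "
        + "eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - "
        + "Mozilla/5.0%20(...)";
    String[] fields = line.split("\\s+");
    System.out.println("date = " + fields[0]);  // 2014-07-05
    System.out.println("time = " + fields[1]);  // 20:00:00
    System.out.println("edge = " + fields[2]);  // LHR3
    System.out.println("uri  = " + fields[7]);  // /test-image-1.jpeg
  }
}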

Languages and systems

You should use Java for this assignment. If you are new to Java, you can find official Java tutorials offered by Oracle HERE. You can do this assignment on just about any operating system. The instructions in this document are written for Windows 10, but they work on Linux or macOS as well.

Groups

You may do this assignment yourself or in a group of no more than three members. Your work will be held to a higher standard if you work in a group. Working in groups is highly recommended.

Prerequisites

Please make sure that all the following items are on your computer before you start this assignment.

  1. Java SE Development Kit 8 is installed. Download it from HERE. Choose the option that is compatible with your system.

  2. Eclipse IDE for Java EE Developers is installed. Download it from HERE. Choose the option compatible with your system from the right-hand side of the page.

  3. Download the Hadoop pre-compiled library HERE. Download the binary of version 2.7.4 and decompress it into a folder. In the rest of this document, the path of this folder is referred to as <HADOOP_HOME>.

  4. Follow the instructions in the recitation slides from Week 9 to set up your own Hadoop cluster on Amazon EMR.

  5. Download the map-reduce wordcount source files.

Task 1: WordCount

In this section, we will learn how to run a WordCount map-reduce program on EMR. This program counts the number of occurrences of each word in a given input set; the sketch below shows roughly what the provided source files contain. In this task we use the “wordcount” sample data as the input.
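For reference, here is a condensed sketch of the classic old-style (v1) WordCount, essentially the example from the MapReduce tutorial linked in this handout. The provided source files should look similar; the package name follows step 5 below. Treat this as a sketch, not the exact provided code:

package aws.emr.wordcount;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: emits (word, 1) for every whitespace-separated token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // args[0] and args[1] are the two paths given in the EMR
    // "Arguments" field below: input first, then output.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}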

  1. Create a Java project in Eclipse named “WordCount” using JRE 1.8.0.

  2. Modify the JDK compliance. Right-click the project and select “Properties”. In the left panel, choose “Java Compiler” and change the “Compiler compliance level” to 1.5.


  3. Configure the Hadoop libraries. Right-click the project, select “Build Path” and “Add Libraries”. Then choose “User Library”. In the “User Libraries” dialog box, create a new library named “Hadoop”. Then use the “Add External JARs…” option to import <HADOOP_HOME>/share/hadoop/common/hadoop-common-2.7.4.jar and all other jar files under the folder <HADOOP_HOME>/share/hadoop/mapreduce into this library. You should get something like this:


  4. Add this “Hadoop” library to your “WordCount” project.

  5. Create a Java package named “aws.emr.wordcount” in the “src” folder. Then import all three Java source code files into this package.


  6. Export this project into a jar file. Right-click the project, select “Export…” and “JAR file”. Then click “Next”, “Next”, and select “aws.emr.wordcount.WordCount” as the main class of this jar.



Then we need to upload this jar file to Amazon EMR to run this job. Follow these steps:

  1. Log into https://aws.amazon.com/ and go to Amazon S3. Find the bucket of your cluster, which should begin with aws-logs. Choose this bucket and create a folder named “outputs”, where we will save all the outputs of the map-reduce applications. Then upload the “WordCount.jar” that you just created on your computer.


  2. Go to the EMR console and find your cluster. Click it and then select “Steps”.


  3. Then click “Add step”. In the “Add step” dialog box, set “Name” to “WordCount”. “JAR location” is the jar file you uploaded to the S3 bucket, which should be “s3://<aws-bucket>/WordCount.jar” (<aws-bucket> is the name of your bucket). Set the field “Arguments” to:

    “s3://elasticmapreduce/samples/wordcount/input/ s3://<aws-bucket>/outputs/task1”.

    The first path is the path of the input of this task, and the second path is the output path of this map-reduce job.

    IMPORTANT: Always make sure that the output path is empty before you run the job, i.e., there is no “task1” in the “outputs” folder.


  4. Click “Add”. After a few seconds, EMR will start the job and your cluster will begin computing.


  5. Once this job is finished, go to “Amazon S3” and you will find a new “task1” folder under the “outputs” folder. You can download all the files to check the results.

Task 1 Questions

Please answer the following questions:

  1. What is the output of this task? What does each file contain?

  2. Explain how this map-reduce job works.

  3. If the input is just a file containing “hello world hello hadoop”, what are the outputs of the Mapper and Reducer?

Task 2: TimeCount on CloudFront

In this task, we will first run the same “WordCount” program on different input data: the CloudFront logs. Create a new map-reduce job as we did in “Task 1”. The only difference is the “Arguments” field, which should now be set to

"s3://elasticmapreduce/samples/cloudfront/input/ s3://<aws-bucket>/outputs/task2",

which changes the input data to the CloudFront logs and saves the output files to “task2”. Run the job and download all the files to check the results.

Modify the “WordCount” source code to create a new map-reduce job called “TimeCount”, so that “TimeCount” only counts the timestamps (such as “14:00:06”) in the input data. Run your code to check the results. Explain how you implemented this job. One possible starting point is sketched below.
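For example (one approach of several, and only a sketch): keep the WordCount structure, but have the Mapper emit a token only when it matches a timestamp pattern. This assumes the same old-style API classes as in the WordCount sketch, plus an import of java.util.regex.Pattern:

// Sketch only: a Mapper that emits a token only if it looks like a
// timestamp (HH:MM:SS). Drop-in replacement for the Map class above.
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Pattern TIME = Pattern.compile("\\d{2}:\\d{2}:\\d{2}");

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      String token = tokenizer.nextToken();
      if (TIME.matcher(token).matches()) {  // keep only timestamps
        output.collect(new Text(token), one);
      }
    }
  }
}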

Task 3: RequestCount on CloudFront

Modify the “TimeCount” source code to create a new map-reduce job named “RequestCount”. This job counts how many user requests were sent to the server during each hour of the day, across all the days in the logs. Run your code to check the results. Explain how you implemented this job. A sketch of one possible map method follows.
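As a hint, one possible design: each log line represents exactly one request, so the Mapper can emit the hour portion of the time field as its key, and the same summing Reducer then produces per-hour totals. A sketch of such a map method, assuming the whitespace-delimited field layout shown in the introduction:

// Sketch of a RequestCount map method: emit the hour of each request.
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
  String[] fields = value.toString().split("\\s+");
  // Skip header or malformed lines; field 1 should look like HH:MM:SS.
  if (fields.length > 1 && fields[1].matches("\\d{2}:\\d{2}:\\d{2}")) {
    String hour = fields[1].substring(0, 2);  // "20:00:00" -> "20"
    output.collect(new Text(hour), one);      // one request in that hour
  }
}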

Task 4: Data Analysis on CloudFront

This part is open-ended. It is up to you to decide what to do here.

Design another map-reduce job to find out some more information about the CloudFront sample data by yourself. Run your code to check the results. Explain how you implemented this job.

Hint: If your job is very complex, you can split it into two (or more) map-reduce jobs, where the input of the second job is the output of the first job; a sketch of this chaining pattern follows.
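Here is how such chaining might look with the old-style API. FirstJob, SecondJob, and the intermediate path are hypothetical placeholders, and remember that every output path, including the intermediate one, must be empty before the run:

// Sketch only: run the first job to completion, then feed its output
// to the second job. Class names and paths here are placeholders.
JobConf first = new JobConf(FirstJob.class);
// ... set mapper, reducer, and key/value classes for the first job ...
FileInputFormat.setInputPaths(first, new Path(args[0]));
FileOutputFormat.setOutputPath(first, new Path("s3://<aws-bucket>/outputs/intermediate"));
JobClient.runJob(first);   // blocks until the first job finishes

JobConf second = new JobConf(SecondJob.class);
// ... set mapper, reducer, and key/value classes for the second job ...
FileInputFormat.setInputPaths(second, new Path("s3://<aws-bucket>/outputs/intermediate"));
FileOutputFormat.setOutputPath(second, new Path(args[1]));
JobClient.runJob(second);  // reads the first job's output as its input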

Submission

Your submission will contain:

  • Your name and the name of each group member.
  • All the code YOU write for each task.
  • A document (PDF or plain text) that answers the questions in “Task 1” and explains how you implemented the code for each task.
  • All the outputs of each task.

Only one group member should submit the assignment.

Important notes

  1. VERY IMPORTANT: Please DO NOT forget to terminate all your clusters and delete your S3 storage after you finish this assignment. Otherwise, you will be charged once your $100 allotment runs out. Please check the recitation slides from Week 9 for detailed instructions.

  2. If you have any questions about this assignment, please email the TA (Long Zhao): lz311@cs.rutgers.edu

  3. MapReduce Tutorial: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

  4. Hadoop Wiki: https://wiki.apache.org/hadoop