
MapReduce Exercises

Objectives

During this activity, students should be able to:

This activity helps the student develop the following skills, values and attitudes: ability to analyze and synthesize, capacity for identifying and solving problems, and efficient use of computer systems.

Activity Description

Individually, solve the following programming exercises using Erlang and the plists:mapreduce function. Place your functions in a module called mapred.
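
As a rough mental model of the operation described above, the mapping function turns each input into one or more {Key, Value} tuples, and the reduction then groups the values by key. The sketch below is a sequential stand-in written only with the standard library. It is not the plists API, whose exact signature and return type should be checked in that library's documentation. The module name mr_sketch and the function seq_mapreduce/2 are hypothetical, and the map syntax assumes Erlang/OTP 18 or later.

```erlang
-module(mr_sketch).
-export([seq_mapreduce/2]).

%% Sequential stand-in for a MapReduce call (hypothetical helper, NOT part
%% of plists). MapFun takes one input and returns a {Key, Value} tuple or a
%% (possibly nested) list of such tuples. The result is a map from each key
%% to the list of values emitted for it, in input order.
seq_mapreduce(MapFun, Inputs) ->
    %% Map phase: apply MapFun to every input and flatten the emitted pairs.
    Pairs = lists:flatten([MapFun(I) || I <- Inputs]),
    %% Reduce phase: group the values by key. Folding from the right while
    %% prepending keeps each key's values in their original order.
    lists:foldr(fun({K, V}, Acc) ->
                        Acc#{K => [V | maps:get(K, Acc, [])]}
                end, #{}, Pairs).
```

For instance, a mapping function that emits {even, N} or {odd, N} for each N in [1,2,3,4] would yield a map with even => [2,4] and odd => [1,3].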

  1. A number of the form 2^n that contains the consecutive digits 666 (i.e., the beast number) is called an Apocalyptic Number. The number 2^157 is an apocalyptic number, because 2^157 = 182687704666362864775460604089535377456991567872, which contains the beast number starting at the digit in position 10 (counting from the left).

    The function apocalyptic takes two integer parameters, S and E (0 ≤ S ≤ E). It starts by calling the MapReduce operation with a list of all integers between S and E, inclusive. The mapping function receives a number N and determines whether 2^N is an apocalyptic number. If it is, it emits the tuple {true, N}; otherwise it emits the tuple {false, N}. After the reduction, the function returns a list with every N between S and E for which 2^N is apocalyptic, or an empty list if none were found. Examples:

    > mapred:apocalyptic(100, 200).                                                
    [157,192]
    > mapred:apocalyptic(100, 150).
    []
    > mapred:apocalyptic(800, 850).
    [800,807,819,820,822,823,824,826,828,836,838,840,841,842,
     844,846,848,850]
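
The mapping step described for apocalyptic might be sketched as follows. This is one possible approach, not a reference solution: the module name apoc_sketch and both function names are hypothetical, and the MapReduce call itself is left out.

```erlang
-module(apoc_sketch).
-export([is_apocalyptic/1, map_apocalyptic/1]).

%% True when the decimal expansion of 2^N contains "666".
%% 1 bsl N computes 2^N exactly, since Erlang integers have arbitrary size.
is_apocalyptic(N) when is_integer(N), N >= 0 ->
    Digits = integer_to_list(1 bsl N),
    %% With {capture, none}, re:run/3 returns just match or nomatch.
    re:run(Digits, "666", [{capture, none}]) =:= match.

%% Mapping function (hypothetical name): emits {true, N} or {false, N}.
map_apocalyptic(N) ->
    {is_apocalyptic(N), N}.
```

Under this sketch, map_apocalyptic(157) emits {true,157}, so the values grouped under the key true after the reduction would be the list to return.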
  2. The function max_access determines the IP address of the client computer (remote host) that has the greatest number of accesses to a certain web server by inspecting its log files. It receives a string with the name of the directory containing the log files to be inspected.

    Each of these log files is comprised of many lines similar to this one:

    189.191.131.167 [23-Oct-2008:12:03:38 -0500] "GET /apps/s200813/tc2006/noticias/" 200 541
    

    This is the description of each element in the previous line:

    • 189.191.131.167 – IP address of the client (remote host) which made the request to the server.
    • [23-Oct-2008:12:03:38 -0500] – The date and time that the server finished processing the request.
    • "GET /apps/s200813/tc2006/noticias/" – The request line from the client within double quotes.
    • 200 – Status code that the server sent back to the client. Status code 200 means everything is OK.
    • 541 – Size in bytes of the response body sent back to the client. This field may be absent.

    The max_access function calls the MapReduce operation with a list containing the pathnames of all the log files in the specified directory. The mapping function reads a single log file and for each line it emits a {IP_address, 1} tuple. After the reduction, the total count is computed for each unique IP_address, and the IP_address with the largest count is returned as a tuple {IP_address, Largest_count}.

    For example (using the contents of logs.zip):

    > mapred:max_access("logs").
    {"10.48.9.90",613}
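
The per-line emission and the final counting for max_access could look roughly like the sketch below. It assumes that the IP address is always the text before the first space of a line, as in the sample log line. The module name log_sketch and all three function names are hypothetical, the counting is done sequentially here instead of through the MapReduce operation, and the map syntax assumes Erlang/OTP 18 or later.

```erlang
-module(log_sketch).
-export([line_to_pair/1, tally/1, busiest/1]).

%% Emit {IP_address, 1} for one log line. Assumption: the IP address is
%% everything before the first space character.
line_to_pair(Line) ->
    IP = lists:takewhile(fun(C) -> C =/= $\s end, Line),
    {IP, 1}.

%% Sequential stand-in for the reduction: sum the counts per IP address.
tally(Pairs) ->
    lists:foldl(fun({IP, C}, Acc) ->
                        Acc#{IP => C + maps:get(IP, Acc, 0)}
                end, #{}, Pairs).

%% Pick the {IP_address, Count} entry with the largest count. When several
%% addresses tie, which one is returned is unspecified in this sketch.
busiest(Counts) ->
    maps:fold(fun(IP, Count, {_, Best}) when Count > Best -> {IP, Count};
                 (_IP, _Count, Acc) -> Acc
              end, {none, 0}, Counts).
```

With the sample log line shown earlier, line_to_pair would emit {"189.191.131.167",1}.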
    
  3. The function locs computes the number of lines of code (LOCs) in a set of source files. It takes two inputs: a string with the name of the directory from where the search will start, and a wildcard string used to determine which files to include during the search. The search considers all the files in the specified directory and all its subdirectories at any depth.

    The function calls the MapReduce operation with a list containing the pathnames of all the files in the specified directory (and all its subdirectories) that match the given wildcard. The mapping function reads a single file, counts its number of lines, and emits a {File_name, Number_of_lines} tuple. After the reduction, the total number of files (TF) and the total number of lines in all files (TL) are computed. The function returns the tuple {TF, TL}.

    The following example (using the contents of nasm-2.05.zip) demonstrates how the locs function could be used:

    > mapred:locs("nasm-2.05/lib", "*.c").
    {2,74}
    > mapred:locs("nasm-2.05", "*.c").
    {68,75892}
    > mapred:locs("nasm-2.05", "*.pl").   
    {23,5429}
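
Once the mapping function has emitted one {File_name, Number_of_lines} tuple per file, the final totals can be computed with a single fold. The sketch below shows only that last step, done sequentially; the module name locs_sketch and the function totals/1 are hypothetical.

```erlang
-module(locs_sketch).
-export([totals/1]).

%% Given a list of {File_name, Number_of_lines} tuples, return {TF, TL}:
%% the number of files and the total number of lines across all of them.
totals(Pairs) ->
    lists:foldl(fun({_File, Lines}, {TF, TL}) ->
                        {TF + 1, TL + Lines}
                end, {0, 0}, Pairs).
```

For two made-up files with 30 and 44 lines, totals([{"a.c",30},{"b.c",44}]) returns {2,74}.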
    

A Few Tips

Useful modules

In order to solve the previous problems, you will probably also need some or all of the following Erlang modules:

Reading a text file

Assume you have a text file called codemonkey.txt in the current working directory with the following content:

Code Monkey like Fritos.
Code Monkey like Tab and Mountain Dew.
Code Monkey very simple man,
with big warm fuzzy secret heart.
Code Monkey like you.

You can use the following Erlang code to read this file one line at a time:

> {ok, F} = file:open("codemonkey.txt", read).
{ok,<0.33.0>}
> io:get_line(F, '').
"Code Monkey like Fritos.\n"
> io:get_line(F, '').
"Code Monkey like Tab and Mountain Dew.\n"
> io:get_line(F, '').
"Code Monkey very simple man,\n"
> io:get_line(F, '').
"with big warm fuzzy secret heart.\n"
> io:get_line(F, '').
"Code Monkey like you.\n"
> io:get_line(F, '').
eof
> file:close(F).
ok
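
The same sequence of calls can be wrapped in a recursive function that visits every line until eof is reached. The following is a minimal sketch; the module name lines_sketch and the functions fold_lines/3 and count_lines/1 are hypothetical names chosen for illustration.

```erlang
-module(lines_sketch).
-export([fold_lines/3, count_lines/1]).

%% Call Fun(Line, Acc) for every line of the file at Path, threading an
%% accumulator through the calls. The file is closed even if Fun fails.
fold_lines(Fun, Acc0, Path) ->
    {ok, F} = file:open(Path, [read]),
    try loop(F, Fun, Acc0) after file:close(F) end.

loop(F, Fun, Acc) ->
    case io:get_line(F, '') of
        eof -> Acc;
        {error, Reason} -> erlang:error({read_failed, Reason});
        Line -> loop(F, Fun, Fun(Line, Acc))
    end.

%% Example use: count the lines of a file.
count_lines(Path) ->
    fold_lines(fun(_Line, N) -> N + 1 end, 0, Path).
```

For the codemonkey.txt file above, lines_sketch:count_lines("codemonkey.txt") would return 5.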

Examining the contents of a directory

The filelib:wildcard function returns a list of all the files that match a Unix-style wildcard string. The filelib:is_dir function determines whether a name actually refers to a directory. For example:

> filelib:wildcard("*.txt").  
["codemonkey.txt"]
> filelib:is_dir("/home/aortiz").
true
> filelib:is_dir("codemonkey.txt").
false
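
These two functions can be combined to collect matching files from a directory and all of its subdirectories. The sketch below is one possible way to do it, under the assumption that the directory tree contains no symbolic-link cycles. The module name walk_sketch and the function find_files/2 are hypothetical; filelib:wildcard/2, filelib:is_regular/1, file:list_dir/1 and filename:join/2 are standard library calls.

```erlang
-module(walk_sketch).
-export([find_files/2]).

%% Return the pathnames of all regular files under Dir (at any depth)
%% whose names match Wildcard, e.g. find_files("src", "*.erl").
find_files(Dir, Wildcard) ->
    %% filelib:wildcard/2 returns names relative to Dir, so rebuild paths.
    Matches = [filename:join(Dir, Name)
               || Name <- filelib:wildcard(Wildcard, Dir)],
    Files = [P || P <- Matches, filelib:is_regular(P)],
    %% Recurse into every subdirectory, whether or not its name matches.
    {ok, Entries} = file:list_dir(Dir),
    SubDirs = [P || E <- Entries,
                    P <- [filename:join(Dir, E)],
                    filelib:is_dir(P)],
    Files ++ lists:append([find_files(D, Wildcard) || D <- SubDirs]).
```

The order of the returned list depends on the order in which the file system lists directory entries, so sort it if a stable order matters.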

Deliverables

Using the Online Assignment Delivery System (SETA), deliver the file called mapred.erl. No assignments will be accepted through e-mail or any other means.

IMPORTANT: The program source file must include at the top the author's personal information (name and student id) within comments. For example:

        
%% ITESM CEM, April 12, 2010.
%% Erlang Source File
%% Activity: MapReduce Exercises
%% Author: Steve Rogers, 449999

    .
    . (The rest of the program goes here)
    .

Due date: Monday, April 12.

Evaluation

This activity will be evaluated using the following criteria:

-10 The program doesn't contain within comments the author's personal information.
10 The program contains syntax errors.
DA The program was plagiarized.
10-100 Depending on the number of exercises that were solved correctly.
© 1996-2010 by Ariel Ortiz (ariel.ortiz@itesm.mx)
Licensed under Creative Commons.