CSC108H Assignment 3: Six Degrees of Kevin Bacon

Due: Tuesday, December 6 at 10:00 p.m.

Introduction

In this assignment, you will practice working with data from files and with dictionaries. You will be expected to build your own suite of tests to verify that your functions work as expected. You will submit your tests with your code, and we will evaluate both. You may work alone or in teams of two. We strongly encourage working in a team. Pair programming and the ability to get assistance from a partner are significant advantages and will help you better understand the material. Be sure to register your partnership on MarkUs regardless of whether or not you have worked with your partner on a previous assignment.

Background

In assignment 2, you wrote a program that crawled the web, building a dataset (stored in image association lists) based on the webpages that were found. Here, in assignment 3, instead of building up a dataset yourself, you will use data that is freely available from the Internet Movie Database (IMDB). You will use it to create a program for playing Six Degrees of Kevin Bacon, a game about the connection between any actor and Kevin Bacon, based on links between actors in movies. Familiarize yourself with the game by using the website The Oracle of Bacon to find connections between a few pairs of actors. The number of links to get from an actor to Kevin Bacon is called the actor's "Bacon Number". (If an actor cannot be connected to Kevin Bacon, his or her Bacon Number is "Infinity".) Try to find actors with long chains of connections to Bacon. Actors with finite Bacon numbers of 7 and above are very difficult to find.

Like The Oracle of Bacon, your program's job will be to use the IMDB data to find the closest connection (with the fewest "steps") between an actor and Kevin Bacon. But your program will only be set up to look for connections to Kevin Bacon, not other actors, and won't have a fancy graphical user interface. Two important sub-tasks that your program must be able to handle are: (1) reading a list of actors and movies from a file and storing that information in a reasonable data-structure, and (2) given an "Actor A", find the shortest connection from that actor to Kevin Bacon.

Reading and Storing Actor and Movie Data

Your program will read input files that contain movie and actor information in the IMDB format which will be described later in this handout.

Because of the large number of actors who have been in movies, dictionaries are the appropriate Python data structure for storing our data about actors and movies. Dictionaries allow for quick lookup by key, even when the dictionary is very large. We will use two dictionaries to store our data. One dictionary will map each actor name to a list of the movies that actor has been in. Another dictionary will map movie names to a list of all of the actors in the movie. Throughout, we will represent actor and movie names as strings.

It will be helpful to think of the information that our dictionaries store as representing a graph. (This is not the kind of graph that you've seen in math classes, with points plotted on x and y axes, for instance. Computer scientists use the term differently.) A graph is composed of nodes (usually drawn as circles) and edges (connections between nodes). We use graphs to represent many different structures, such as maps (where nodes represent cities and edges represent roads between cities), games (where nodes represent states of the games, and edges represent legal moves), and even social networks (where nodes might represent facebook members and edges represent "friend" relationships). For this assignment, we can think of our actor and movie data as a graph where the nodes are actor names, and an edge between two actors indicates that they were in a movie together. We will store the movie name for each edge, so that in addition to finding an actor's Bacon Number, we can also say what movies make the connections. See the diagram below for an example graph. Below it are the two dictionaries that store the information shown in the graph. Make sure that you understand how those dictionaries represent that graph.

We like to use graphs because they have a mathematical basis. We can derive many useful properties from them, and these properties inform us about the structure being represented. For example, if we make every actor a node in a graph and then draw an edge between two nodes if the two actors have been in a movie together, then we can ask, "Can we connect this node to the node that represents Kevin Bacon?" This is equivalent to the question, "Does this actor have a finite Bacon Number?" Similarly, if we are able to answer the question, "What is the smallest number of edges on any path between this node and the node representing Kevin Bacon?", then we have answered, "What is this actor's Bacon Number?"

Finding the Shortest Connection

The task of finding the smallest number of edges between two nodes in a graph has many different solutions, but breadth-first search is a good procedure (or "algorithm") for solving it. This page at Rutgers provides an animation of breadth-first search and other searches operating on a graph. With breadth-first search, we start with the node representing the actor we want to connect to Kevin Bacon and search everyone he or she has worked with directly. If Kevin Bacon is among them, the actor has Bacon Number 1 and we stop. Next, we search one step further away from the original node: everyone that has worked with someone the actor has worked with. If Kevin Bacon is among them, the actor has Bacon Number 2 and we stop. The algorithm stops when there are no more connections to be made or when Kevin Bacon's node is encountered.

Here is pseudocode that uses breadth-first search to return the number of steps required to connect two actors:

def actor_distance_BFS(actor_A, actor_B, actor_dict, movie_dict):
    '''Return the number of movies required to connect actor_A and actor_B. 
    Return -1 if there is no connection. actor_dict maps an actor to the list  
    of movies she has appeared in. movie_dict maps a movie to the list of 
    actors appearing in the movie.'''

    # We keep 2 lists of actors:
    #   1. The actors that we have already investigated.
    #   2. The actors that need to be investigated because we have found a 
    #      connection beginning at actor_A. This list must be 
    #      ordered, since we want to investigate actors in the order we 
    #      discover them.
    #      -- Each time we put an actor in this list, we also store
    #         her distance from actor_A.

    remember that actor_A's distance from herself is 0
    put actor_A in our list of actors to investigate
    
    # Investigate one actor at a time until we find actor_B or run out of 
    # actors to investigate.
    while there are still actors to investigate
        take the next actor to be investigated
        add this current actor to the list of already-investigated actors

        for each movie this actor is in:
            for each co-star in each of these movies:
                if the co-star is actor_B
                   then the distance to this co-star is 1 more than to the current actor
                   and return the distance to the co-star
                if the co-star is already in either list of actors
                   do nothing
                otherwise 
                   the distance to this co-star is 1 more than to the current actor
                   remember the distance to this co-star
                   put the co-star in the list of actors to be investigated

    we've run out of actors in our to-investigate list, so actor_A and actor_B aren't connected

Before you start to code your own assignment solution, spend time to thoroughly understand the pseudocode above. Notice that it has two different lists of actors. The first is a list of actors that the algorithm has already investigated. It starts out empty. We use it to make sure that our program won't continue looping back over the same actors forever. The second list is the sequence of actors that we have encountered in our search but haven't yet investigated. In other words, for each actor in this list, we have found her distance from actor_A but haven't started looking systematically at each of her movies and co-stars. For breadth-first-search to work properly, this second list must be kept in order, and the actors in this list must be processed in first-come-first-served order. We call this structure a queue (and you will learn more about queues in CSC 148). For now, just use a Python list and make sure that you always add to one end of the list and remove from the other.

An example actor-movie graph

To ensure that you really understand the algorithm, trace it (just like tracing code) on the example graph. The graph shows 9 different actors who acted in 6 total movies. This example isn't realistic since real movies typically involve far more than a handful of actors, but this small example will be simpler to trace and is a good example to use while you're debugging your implementation. The actor and movie dictionaries corresponding to the graph are:

actor_dict = {'ActA': ['m1', 'm2'], 
              'ActB': ['m1', 'm2', 'm3'],  
              'ActC': ['m1'],
              'ActD': ['m3', 'm6'],
              'ActE': ['m6'],
              'ActF': ['m4'],
              'ActG': ['m4'],
              'ActH': ['m5'],
              'Kevin Bacon': ['m2', 'm5']}
    
    movie_dict =  {'m1': ['ActA', 'ActB', 'ActC'],
                   'm2': ['ActA', 'ActB', 'Kevin Bacon'],
                   'm3': ['ActB', 'ActD'],
                   'm4': ['ActF', 'ActG'],
                   'm5': ['ActH', 'Kevin Bacon'],
                   'm6': ['ActD', 'ActE']}

Program Requirements

Your program must play Six Degrees of Kevin Bacon. Given a file of movies and actors, your program should build actor and movie dictionaries and then repeatedly ask the user for an actor name. It should find and display the shortest path between the actor the user has entered and Kevin Bacon. If there is more than one shortest path, then your program displays the first one it finds. Finally, when the user presses return without entering an actor name, your program will print the largest finite Bacon number that was found.

Here are the specific requirements:

  1. Your main program must be in a file named bacon.py.
  2. A second file, bacon_functions.py, must contain implementations of the three top-level functions described below. The three top-level functions should be fully tested as discussed here.
  3. When bacon.py is run, your program should play the game and produce the output as described below. It should import bacon_functions and make use of them.
  4. Your code must be well designed and documented and must follow the style described in the Assignment Style Rules.

In addition to the three required functions, you may define as many helper functions as you'd like in any of your files. Each helper function should have a clear purpose and should not use global variables. You should test those helper functions, but you will not submit your tests for those functions.

Input Format

Working with real data is messy and computer scientists often find themselves in a situation where they need to parse real data files where the specification either isn't provided or isn't correct. For this assignment we want you to use the real IMDB data and we want you to work out how to extract the information you need by looking at the files. We are providing three increasingly sophisticated input files to help with this.

  1. example_data.txt: The data from the example graph (above) in a text file in the IMDB format. This data file is the first one you should use. You may want to create additional small, "clean" (easy to parse) sample data files to test specific scenarios. Here are some important things to notice about this file.
    • It has a header that you can ignore. Just read in the lines and throw them away.
    • It has a footer that you can also ignore.
    • Actors are separated by blank lines.
    • Probably the most important thing to notice is that while this input file matches the example graph from earlier in the handout, the full movie names and actor names are NOT shown in that earlier figure. This was done to make the diagram easier to read. If you really parse this sample IMDB file, you should get this actor dictionary. Notice that the names of the actors have been changed so that each is Firstname Lastname with each token capitalized. Hint: Look at the string methods. The case of the movie titles is unchanged and the years are kept with the movies and become part of the name.
  2. small_actor_data.txt: A very small selection from the full IMDB actor data file. The selection is large enough to run all of the output examples (below) and contains somewhat "dirtier" (harder to parse) input. This data file is a good one to use when you are testing your parsing function. Here are some new things to notice.
    • It also has a header but it isn't the same number of lines as before. If you were previously counting the number of header lines to skip, you'll have to figure out another way to determine when the header ends.
    • Sometimes the lines between actors that appear blank actually contain spaces and tabs.
    • The lines with the movie names contain various other bits of information after the dates. You want to ignore all that extra stuff.
    • Robert De Niro's name has three parts. All of them are capitalized. It appears that all the names in the file are already in title case, but since you need to do the conversion so your program will work on sample_data.txt you will need to leave this case conversion code in your program.
  3. large_actor_data.txt: A larger selection of about 1000 actors from the full IMDB actor data file. The data is a bit "dirtier", again. This data file is useful after your program is running when you want to play the game. It is almost an 11-megabyte file, so it could take some time to load. Here are a few more things to notice:
    • Watch out for Roman numerals like "(I)" after an actor's name. Cut those when you find them.
    • Sometimes the same actor is listed in the same movie with the same title more than once. (Check out "Jensen Ackles".) This is usually for TV shows that have specific episodes. Don't try to keep the episodes separate. Consider all the episodes for a given show in a given year to be the same movie. So even if two actors were not in the same episode, if they both acted in the same show in the same year, they are only one step apart.
    • The header and footers are different sizes again. Do your rules for finding them still work? Make sure they work for all three files.

If you're looking for a big challenge, you can try the full set of IMDB actor data, though when we are testing your submission, we will not try to run your program on anything bigger than the large data set above. The full data set is available directly from IMDB's ftp sites. Pick an FTP site and download actor.list.gz. You'll need to uncompress that file to recover the plain text. Warning: the full IMDB actor file is very big (over 600 megabytes!) and difficult to parse. If you download it into your user account at school, you will fill your "disk quota" and will not be able to save anything else to your account. Accessing the file over the web is not an option, because it would be an abuse of the resources IMDB is generously providing for free. (I.e., Do not try to use urllib to run your program using the remote data!)

Output

A transcript of the game has been copied below. The program's output is in codetext and the user's input is in bold. Your program should copy the format of the transcript exactly, including whitespace. After the text shown here, there are no extra spaces or tabs, only a newline character at the end of each line (including the last.) Your program may not produce exactly the same result for the actors in the examples because the connection your program can find depends on the data provided in the file it reads. The movie names are shown in green. When you read the movies from the IMDB file do not change the case of the titles.

Notice that when the user enters a name, the case is converted to title case and that when an actor name is not found in the database, the resulting bacon number is Infinity. When the user finally presses enter without entering a name, the final output includes the largest bacon number the user has discovered during the game. Be careful to match the output exactly.

Please enter an actor (or press return to exit): Robert De Niro
Robert De Niro has a Bacon Number of 1.
Robert De Niro was in Sleepers (1996) with Kevin Bacon.

Please enter an actor (or press return to exit):
Kevin Bacon
Kevin Bacon has a Bacon Number of 0.

Please enter an actor (or press return to exit):
keVin BaCon
Kevin Bacon has a Bacon Number of 0.

Please enter an actor (or press return to exit):
al gore
Al Gore has a Bacon Number of 2.
Al Gore was in Johnny Cash's America (2008) with Jon Langford.
Jon Langford was in Cavedweller (2004) with Kevin Bacon.

Please enter an actor (or press return to exit):
I am not a real actor
I Am Not A Real Actor has a Bacon Number of Infinity.

Please enter an actor (or press return to exit):
Thank you for playing! The largest Bacon Number you found was 2.

Required Functions

Function Name Description
parse_actor_data(actor_data) Return the actor information in the open reader actor_data as a dictionary. actor_data contains movie and actor information in IMDB's format. The returned dictionary contains the names of actors (string) as keys and lists of movies (string) the actor has been in as values.

Note: Take a look at one of the sample IMDB files to get a sense of the format. IMDB's format includes copyright header and footer information that should be discarded. Actor records are separated by blank lines. Each record starts with a line that contains the actor's name followed by a movie. (Assume that no actor has a "," (comma) in her name, so you can use it as a delimiter.) If the actor has been in more than one movie, the remainder of the record contains one movie per line, and the record ends when a blank line is encountered. Each movie is formatted as a title followed by the year the movie was produced (in parentheses) and additional information (which your function will ignore) in square, angle, and curly brackets. (Assume that no movie title contains parentheses. This is an invalid assumption, but we will expect the movie title your program generates to be incorrect if the title contains a parenthesis.) To match the expected output, the actor and movie title will need to have leading and trailing whitespace removed, and the movie title should include the year of production but no additional information. Note that records in these files can span multiple lines. For this reason, we recommend using readline in combination with a while-loop, rather than using a for-loop to read the lines.
invert_actor_dict(actor_dict) Return a dictionary that is the inverse of actor_dict. The original actor_dict maps actors (string) to lists of movies (string) in which they have appeared. The returned dictionary maps movies (string) to lists of actors (string) appearing in the movie.
find_connection(actor_name, actor_dict, movie_dict) Return a list of (movie, actor) tuples (both elements of type string) that represent a shortest connection between actor_name and Kevin Bacon that can be found in the actor_dict and movie_dict. (These dictionaries were produced by parse_actor_data and invert_actor_dict, respectively.) Each tuple in the returned list has a special property: the actor from the previous tuple and the actor from the current tuple both appeared in the stated movie. For the first tuple, the "actor from the previous tuple" is actor_name, and the last tuple must contain Kevin Bacon. If there is no connection between actor_name and Kevin Bacon, the returned list is empty.

For example, suppose you called find_connection using ActE and the dictionaries from the earlier example in this handout. In this case, the returned list would be [("m6", "ActD"), ("m3", "ActB"), ("m2", "Kevin Bacon")]. Note: You must use breadth-first search to solve this problem. Start with the sample breadth-first search pseudo-code and modify it to fit your needs. Notice that your find_connection function must return the full path from actor_name to Kevin Bacon instead of just the length of the path, so you will need to modify the pseudocode. However, we suggest implementing and testing the pseudocode without modification first, to make sure you understand the algorithm.

Testing

To help you verify that your code is providing the expected output, we are providing a very basic self test. As before, you should not interpret passing the self-test module as indicating that your code is completely correct. You should test each of the three required functions thoroughly yourself. Write (and submit) a nose testfile for each of the required functions. Name these three files test_parse_actor_data.py, test_invert_actor_dict.py, and test_find_connection.py.

We will evaluate the completeness of your testfiles by running them against flawed implementations we have written to see how many errors you catch. Then, the TAs will check the testfiles for redundant tests. The goal is to catch all of our errors without extra, unnecessary tests.

Also be sure to test your main block in bacon.py and be very careful that the output is exactly a match for the description in this handout. This is not tested by the self-test.

Recommended Approach

Start with invert_actor_dict(actor_dict) since it is the simplest function. 'But wait', you say, "Don't I need the output from parse_actor_data before I can run invert_actor_dict?" Yes, in your full program you will call invert_actor_dict on the output from parse_actor_data, but you don't want to do this when you are still developing either function. Design your tests before you start implementing the function, and in your tests, use small actor_dicts that you have created yourself. You can see one example of this in the self-test. Using your nose tests, find and correct all the bugs in your invert_actor_dict function.

Next, work on parse_actor_data. Like before, writing a significant number of the tests first will help you understand what you need to implement. Use small inputs at first and then use the larger sample files we have provided. You will be able to submit multiple files on this assignment, so you can create (and submit) extra input files that are needed for your testing. You'll need to submit your tests for this function, so design and implement them now and use them to make sure your parse_actor_data function is correct.

Before writing any code for find_connection, trace the breadth-first-search pseudocode on the provided example and on other examples you make up yourself. You don't need to do a full memory model trace; just keep track of which actors are in the already-investigated and the still-to-investigate lists of actors.

Next implement this pseudo code in Python in a function find_distance(actor_name, actor_dict, movie_dict) that returns the distance from actor_name to Kevin Bacon. We won't mark this function (and you don't even need to hand it in) unless you don't finish find_connection. Save this version of your program in a file part_marks.py just in case something goes wrong later. We will discuss this below.

Next work out how your find_connection function will solve the problem of returning the full paths and decide what values you need to save along the way. Do this on paper, before writing any code.

When you are confident that you understand how your find_connection function should work, convert the tests you did on paper into nose tests, so you can verify your code once you have written it.

Now make a copy of your part_marks.py file and call it bacon.py. Rename find_distance to find_connection and change its implementation to return the full paths as you planned.

Using your nose tests, find and correct the bugs in your find_connection function.

Finally, implement your main block so that it plays the game according to the script example in this handout.

The Extra Mile: Efficiency

You might have noticed that our program isn't very efficient. It creates the actor dictionary and the movie dictionary every time it is run. If the data has not changed, recreating these dictionaries wastes time and puts an extra load on the host of the IMDB file. It also remembers nothing about the graph from one search to the next. If you are interested in improving your code and learning more, Here are some suggestions. Do not make these changes on the version of your code that you submit for marking. The changes will only cause you to fail the testing and get a lower mark. This is only for fun for those students who are looking for a little more challenge and want to learn more.

Submission and Marking

On MarkUs submit a module called bacon_functions.py that contains the required functions. Also submit a module bacon.py that imports bacon_functions and when run, uses these functions to play Six Degrees of Kevin Bacon, providing exactly the prompts and output described in the requirements section of this handout. You should also submit three separate files named test_parse_actor_data.py, test_invert_actor_dict.py, and test_find_connection.py. Each file should contain nose tests that test the function the file is named after. If any of your tests use additional input files (text files you have created), submit these as well. You don't need to submit the input files we have given you; they will be placed in your directory when we test.

We hope that most students will be able to complete find_connections but we recognize that it is difficult to earn partial marks for this function. So we are allowing you to acknowledge that your find_connections function does not work correctly and to submit a version that correctly calculates the bacon distance but not the path. If you do this, we will mark your find_distance function for partial marks on find_connection.

Marking

Your assignment will be evaluated using these three criteria: