This is the second in a series of projects that will involve sorting a large amount of data. In this second phase, you will write a multi-process C program to sort a list of records of movies from imdb alphabetically by the data in a given column. You will make use of the concepts learned in lecture including file/directory access, forking processes, etc.
File formats for this part of the project are the same as in the first. The CSV file with movie metadata will remain the same. The sorting algorithm will also remain the same. If you properly modularized your code in Project 0, you should be able to reuse almost all of your code.
In this project, you will read in a directory name and walk through the directory structure to find all .csv files. There may be multiple levels of directories that you will need to recurse through. You will then fork child processes to sort each of the files and output the results to a different file. You should NOT use exec for this project. You should write one program that, when copied from the parent to the child process, will continue running. You can use the return value of fork() in a conditional to choose different blocks of code to run within your code. You will want to make sure to prevent zombie and orphan child processes. You will also want to make sure you to not create fork bombs that will bring down machines. In all cases of bad input, you should fail gracefully (e.g. no segfaults).
You will output metadata about your processes to STDOUT. This metadata will show the total number of processes created and the pids of all created processes.
Your code will read in a set of parameters via the command line. Records will be stored in .csv files in the provided directory. As mentioned above, directory structures may have multiple levels and you must find all .csv files. Your code should ignore non .csv files and .csv files that do not have the correct format of the movie_metadata csv (e.g. csv files that have other random data in them).
Remember, the first record (line) is the column headings and should not be sorted as data. Your code must take in a command-line parameter to determine which value type (column) to sort on. If that parameter is not present (?-> throw an error, or default behavior). The first argument to your program will be ‘-c’ to indicate sorting by column and the second will be the column name:
./sorter -c food
Be sure to check the arguments are there and that they correspond to a listed value type (column heading) in the CSV.
For this phase you’ll extend your flags from one to three. The second parameter to your program will be ‘-d’ indicating the directory the program should search for .csv files. This parameter is optional. The default behavior will search the current directory.
./sorter -c food -d thisdir/thatdir
The third parameter to your program will be ‘-o’ indicating the output directory for the sorted versions of the input file. This parameter is optional. The default behavior will be to output in the same directory as the source file.
./sorter -c movie_title -d thisdir -o thatdir
Your code will be reading in and traversing the entire directory. In order to run your code to test it, you will need to open each CSV and read it for processing:
./sorter -c movie_title -d thisdir -o thatdir
Your code’s output will be a series of new CSV files outputted to the file whose name is the name of the CSV file sorted, with “-sorted-<fieldname>” postpended to the name.
e.g: ‘movie_metadata.csv’ sorted on the field ‘movie_title’ will result in a file named “movie_metadata-sorted-movie_title.csv”.
On each new file in a directory you encounter, you should fork() a child process to do the actual sorting.
On each new directory you encounter, you should fork() a child process to process the directory.
To STDOUT, output the following metadata in the shown order:
Initial PID: XXXXX
PIDS of all child processes: AAA,BBB,CCC,DDD,EEE,FFF, etc
Total number of processes: ZZZZZ
You may assume the total number of files and directories will not exceed 255.
Your code should use Mergesort to do the actual sorting of records. It is a powerful algorithm with an excellent average case.
Submit your “sorter.c”, “sorter.h” and “mergesort.c” as well as any other source files your header file references.
Document your design, assumptions, difficulties you had and testing procedure. Include any test CSV files you used in your documentation. Be sure to also include a short description of how to use your code. Look at the man pages for suggestions on format and content. Do not neglect your header file. Be sure to describe the contents of it and why you needed them.
Here are some extra credit options for you. You can choose to do either, both, or neither.
Extra Credit (10 points):
Interpret the -c option as a comma separated list of fields. Cascade your sort over the fields. For example, if the command is:
./sorter -c food,drink,price
The sorter should sort first on the field food, then on drink, then on price. E.g. if there are multiple records where food is “Vegetable”, the order of these records would then be sorted by the drink field. If there are multiple records where food is “Vegetable”, and drink is “Water”, then those records would further be sorted by the drink field.
Document your algorithm and show examples of some runs of your program on test data.
Extra Credit (10 points):
Create a metadata output file in addition to STDOUT. In this file, output the hierarchy of forked process ids and their associated directory/file names. If one views this file, it should be clear which processes forked which processes, and be able to follow the execution of code. Document the format of your file, and the design of the code that produces the output. Consider using the linux utilities like gnuplot to visualize your data. This might require the use of shared memory/pipe constructs to transfer data back and forth between processes.
package (other than Soot) or code from any such library or package- all code you submit must be your own.
partial marks for any partial eorts.
Overview and Requirements
For this assignment, you will implement a series of functions that test and implement sorting in linked lists containing numeric data (not any other data type).
Your implementation is permitted to include, but does not have to include, your own “helper functions” which you may implement and must comment clearly.
Your implementation is not permitted to include:
- input from the user;
- output to the screen (e.g., debugging statements) beyond that in the testing script;
- any usage of Python’s built-in listtype in any way beyond that in the testing script.
Any assignment submission which uses Python’s built-in list type will receive a grade of zero.
A collection ordered such that each element is greater than or equal to the previous element (i.e., sorted with duplicates). For example, [1, 3, 5, 5, 7, 9, 9] is in non-decreasing order but [1, 3, 5, 3] is not.
Download a copy of Assn2Skeleton.py from OnQ and rename it Assignment2.py for your assignment submission. It contains some functions that will be useful for testing and the “stubs” of the functions you have to write for this assignment.
Your functions MAY NOT call these functions directly or use the Python built-in list type in any way. You may write additional helper functions of your own and use them in your required functions if you wish.
isSorted (4 points)
Write a function called isSorted which takes a linked list of numbers as a parameter and returns True if the list is sorted in non-decreasing order and False otherwise.
For example, if you create a linked list that is equivalent to [2,3,5,9], calling isSorted with that list should return True. But for a linked list equivalent to [2,3,9,5], it should return False.
Your function must not change the list or print anything, just return a Boolean result.
noDuplicates (4 points)
Write a function called noDuplicates which takes a linked list of numbers as a parameter. You may assume that this list is sorted in non-decreasing order. Your function must remove any duplicates from the list and return a pointer to the changed version of this list.
For example, if linkedListOfNumbers is a linked list equivalent to [2,2,3,5,5,5,5,7,7] and you call linkedListOfNumbers = noDuplicates(linkedListOfNumbers), linkedListOfNumbers should now contain to a linked list equivalent to [2,3,5,7].
insertInOrder (4 points)
Write a function called insertInOrder which takes two parameters:
1.a linked list of numbers, which you may assume is sorted in non-decreasing order; and,
2.a new value to add to the list (you may assume it’s a number)
Your function must add the value to the list and return a pointer to the changed version of the list. If the list already contains the value, add it as a duplicate.
For example, if linkedListOfNumbers is a linked list equivalent to [1,3,5,7] and you call linkedListOfNumbers = insertInOrder(linkedListOfNumbers, 5), linkedListOfNumbers should now point to a linked list equivalent to [1,3,5,5,7].
sortLinkedList (4 points)
This function must take a linked list of numbers as a parameter and return a pointer to a new list which contains the same values as the original list but is sorted in non-decreasing order. The linked list passed as a parameter must not be changed.
For example, if linkedListOfNumbers is a linked list equivalent to [4,2,6,3,4] and you call orderedLinkedListOfNumbers = sort(linkedListOfNumbers), orderedLinkedListOfNumbers should point to a linked list equivalent to [2,3,4,4,6] but linkedListOfNumbersshould still be [4,2,6,3,4] after the call.
The simplest way to do this sort is to use an algorithm called an “insertion sort”. Start with an empty list and add each element of the parameter list into the new list, putting it in the correct position to keep the new list in order. Once you’ve got insertInOrder working, this sort should be fairly easy.
Please download a copy of Assn2Testing.py from OnQ. This script will help you determine if your solution meets the administrative requirements for this assignment.
If you run it with your solution and it loads and runs without errors, it means that you have got all the right names, numbers of parameters, etc, and that your functions work correctly on one small test case. It is by no means a guarantee that your functions will work correctly with all examples, so you should also try your own test cases as well.
If you’re not able to write all of the functions for this assignment, this testing script can still be useful. You can comment out the parts relating to the functions you were not able to complete.