The proper usage of the script to meet requirements for exercise 4:
python sort.py -i INPUT_FILE
Running the script with the above arguments will cause for results
to be printed to standard console in the order read from the input file
rather than being written to a file and sorted.
Using option -o, the output can be written directly to a file
(python sort.py -h
will also display all the options for running sort.py).
do no use any --sort-fn sorting options with 'ex4' validation to avoid sorting displayed results.
We chose to implement the sorting as follows.
- A character with a lower ASCII value comes before a character with a higher ASCII value. This was chosen because parts of URLs are case sensitive and we felt it was important to respect that.
- If two strings start with the same character, the next character is used to break the tie (repeat as necessary)
- If looking at a character past the end of the string, it takes ASCII value 0 (so if you have two strings with the first 5 characters the same, if the first string ends after character 5 and the second doesn't, the first string will always come before the second in the output).
We have a python script sort.py in the root directory which takes four arguments.
The first argument, -i, provides the input file name
The second argument (optional), -o, provides the output file name to write to. If not specified, the results are printed directly to console.
The third argument (optional), --sort-fn, provides a set of sorting algorithms which can be applied to sort the input file.
mergesort2
(imported)mergesort
selectionsort2
(imported)selectionsort
heapsort
nosort
quicksort
(imported)radixsort2
(imported)radixsort
Each sorting option is associated with the corresponding sort function suggested by the name of the shown option. 'nosort' function (default) is intended to disable sorting and allow ex4 validation to work properly.
The four argument (optional), --validate-fn, enables the functionality of URL validation and transformation of URLs to canonical form. NOTE! validity of URLs is performed after the URLs have been transformed to the canonical form (output list is in canonical form).
ex4
: (default) runs the output requirement for ex4all
: enables Valid = true|false URLs to pass thoroughvalid
: enables only Valid = true URLs to pass throughinvalid
: enables only Valid = false URLs to pass through
The algorithms module in this root directory and contains all of the sorting algorithms in their own file. Loading our module also loads all of the individual sorts and allows easy separation of the files while still providing a simple interface. Url validation is also it own module to maintain functionality separation.
Unit tests are available in the tests/
folder:
python tests/test_algorithms.py
Unit test are included for cover
- URL validator
- URL canonicalizer
- URL comparators (<, >, ==)
Using the files we provide, one working command would be: python sort.py -i test_input.txt -o output.txt --sort-fn mergesort You can then test the correctness of the sorting by calling: diff test_output.txt output.txt
The output file is created at the path OUTPUT_FILE relative to the directory from which the script was run. For example, in our sample command, it would be located in the current working directory. If you run the script from .. relative to the sort.py script and provide -o other/directory/output.txt, it will write the file to ../other/directory/output.txt relative to the sort.py script.
Our choice for the O(n^2) algorithm was selection sort. We felt this was the simplest algorithm to use and was very simple to implement in Python. For the O(n log n) algorithms we included heapsort and mergesort, both of which have a worst case runtime within the specified parameters. Python had built in functions that made heapsort easy to use, and many of us have experience with mergesort and quicksort, but we opted to use the former as quicksort would have been much more complicated. The O(n) algorithm was radix sort, which uses buckets for each possible ASCII value to sort the URLs. Radix sort is a specialty sorting algorithm that takes advantage of the fact that the elements being sorted can be broken up into singular chunks that are limited in number (in this case one of 255 characters). This breaking up into singular chunks results in k passes, where k is the number of chunks in the longest element. If n >> k, the algorithm is essentially O(n). However, this is not always the case so the true worst case of this algorithm is O(kn). For all these algorithms we assume that the URLs use proper ASCII values and do not contain other symbols.
The documentation for our URL validation and canonicalization design may be found at https://github.com/Full-House-UW/section/wiki/URL-Validation-and-Canonicalization.
Group Division:
- Corneliu, Brandon - Merge sort
- Jeff - Script setup and design
- Conor - Radix sort
- Janette - Heapsort
- Nat - Testing
- Shahaf - Selection sort
- Shahaf & Janette - Coordination