Author: cs daixie

data mining

Question 1
1. Please write code to read the data file TrainingData.csv.
   The first row is the header (variable names). Data are stored in
   subsequent rows.
2. Determine the number of variables and the number of records in this
   dataset.
3. Store the variable names in a list.
4. Determine if there are any missing values in the data set. If yes, please
   report the total number of missing values.
5. Find the number of distinct LCID values in the data set.
6. Find the variable with the most missing values.
7. Convert the variable hour_id to datetime format.
8. What is the time duration of the entire data set?
9. Determine the number of records per day.
10. Using the median function in the statistics package (from statistics
    import median) or otherwise, do the following:
    (a) Divide the entire data set by distinct value of LCID.
    (b) For each distinct LCID value, determine the median of each
        variable in the divided data set.
    (c) Package the result in (b) in a dictionary.
11. Determine the number of Complaint cases and Non-complaint cases
    in the entire data set.
12. Determine the top 10 LCIDs with the most complaint cases.
13. Calculate the median value per day for each variable in the entire data
    set.
14. Use the first 5 digits of the LCID values to define a new variable Region.
15. Determine the region with the most complaint cases found in the data
    set.
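
As a rough illustration of steps 1 to 4 and step 10, the following is a minimal sketch using only the Python standard library. It assumes that missing values appear as empty cells and that only cells parseable as numbers should enter a median; the column name kpi in the usage note is hypothetical, while LCID and hour_id come from the question. It is one possible approach, not the expected solution.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def load_rows(handle):
    """Read a CSV whose first row is the header; return (header, rows)."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def count_missing(header, rows):
    """Count missing cells per variable (assumption: missing means empty)."""
    return {
        name: sum(1 for row in rows if row[name] in ("", None))
        for name in header
    }


def median_by_lcid(header, rows, key="LCID"):
    """Steps (a)-(c): split rows by LCID and take per-variable medians."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for name in header:
            if name == key:
                continue
            try:
                value = float(row[name])
            except (TypeError, ValueError):
                continue  # skip missing or non-numeric cells such as hour_id
            groups[row[key]][name].append(value)
    return {
        lcid: {name: median(values) for name, values in columns.items()}
        for lcid, columns in groups.items()
    }
```

With a real file, the call would look like `header, rows = load_rows(open("TrainingData.csv", newline=""))`; the number of variables is then `len(header)`, the number of records is `len(rows)`, and the variable with the most missing values is the key with the largest count in `count_missing(header, rows)`.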

2-3 trees

1 Introduction
2-3 trees are a form of balanced tree that can be used to implement a dictionary, as well as
other types of operations.
Recall that in its most basic form, a dictionary maintains a set of key/value pairs, and
supports the following operations:
• insert: insert a given key/value pair into the data structure; if there is already a pair
  with the same key component, then simply update the value component;
• search: given a key, determine if there is a key/value pair with a matching key
  component, and if so, return the corresponding value component;
• delete: given a key, delete the key/value pair, if any, with matching key component.
There are many data structures that may be used to implement a dictionary, such as
binary search trees, hash tables, etc. 2-3 trees are one such data structure. As we shall
see, if we implement a dictionary using 2-3 trees, then all three of the dictionary operations
take time O(log n), where n is the number of key/value pairs in the dictionary. We will also
see how to use 2-3 trees to implement other types of operations, beyond those of a simple
dictionary.
2 2-3 trees: the basics
Before we start, a word of warning: the exposition here of 2-3 trees is a little bit different
from what one usually sees in the literature. See Section 6 below for more on this.
Also, before we start, recall that the depth of a given node in a tree is the length of the
path (i.e., the number of edges in the path) from the root to that node. Also recall that
the height of a tree is the maximum depth of any node in the tree.
Now, at a high level, our notion of a 2-3 tree is that of a tree with the following properties:
• key/value pairs stored only at leaves (with no duplicate keys);
• all leaves are at the same depth;
• looking at the leaves from left to right, keys appear in sorted order;
• each internal node

C++ raspberry

Purpose of the Assignment

The general purpose of this assignment is to develop some simple C++ utilities for the Raspberry Pi Desktop, given a number of requirements, making use of the principles and techniques discussed throughout the course. This assignment is designed to give you experience in:

  • object-oriented programming using C++, using basic language constructs, classes, and data types
  • looking through Linux manual pages and documentation, as you will likely need to do this in your projects later
  • getting acquainted with the Linux-based Raspberry Pi Desktop system and services, which will help in your use of an actual Raspberry Pi later in the course
    Assignment Task

    Your assignment task is to familiarize yourself with the Raspberry Pi Desktop and develop some simple C++ utilities that help manage files on the system.  In a way, you are building some stripped-down replacements for utilities like mv, cp, ls, cat, rm, diff, and stat.  Because of the way Linux works, this is surprisingly easy and doesn’t take a whole lot of work.  (Please note that manual pages linked for your convenience above and below are for Debian Stretch, the same basic Linux distribution that Raspberry Pi Desktop is based upon.  That said, things may look or function differently under Raspberry Pi Desktop itself, and so you should consult the man pages on the system itself as your primary source of documentation.)

    Familiarizing Yourself with Raspberry Pi Desktop

    To complete this assignment, you will need access to Raspberry Pi Desktop, including its C++ compiler and requisite supporting tools, libraries, and packages.  It is likely easiest to build yourself a virtual machine running this system; details on how to do so can be found under Useful Links in the OWL site side bar.  You should do this as early as possible to make sure you are set up and ready to go for the assignment.

    If you have a computer that completely lacks virtualization support, Science Technology Services has a solution for remotely accessing something that is compatible for this work.  They have created a cloud-based Linux machine running Raspberry Pi Desktop; this can be found at cs3307.gaul.csd.uwo.ca.  To access this machine, you should be able to use ssh to log in from pretty much anywhere, using your Western credentials for access.  You can scp/sftp files to and from this machine as necessary.

    Modeling for this Assignment

    For this assignment, you will be creating a C++ class to help manage files on the Raspberry Pi Desktop system.  (You might also be creating other support classes too, depending on how you do things.)  This class will nicely encapsulate both information pulled from the file system for the file(s) in question, as well as operations that can be performed on the files, with each instance of the class handling a single file.  This will be accomplished using a collection of system calls and file I/O operations.  First we will discuss the data that you need to concern yourselves with and how to access it, and then we will discuss operations that should be supported by your class and how to execute them.

    The file manager class you are to create will include at least the following information for this assignment:

    • Name.   The name of the file, given to the class through its constructor.
    • Type.  Whether the file is a regular file, directory, and so on.  Not only is this useful information to have, but this will also allow you to permit certain operations on certain types of files.  (When we explore design patterns in more detail, we will discuss a better way of doing this.)  This can be retrieved using the stat() function, and can be found in the st_mode field of the structure provided by this function.  You can store the type as a string representation or using the same numeric code used in the st_mode field.
    • Size.  The size of the file.  This can be retrieved using the stat() function, and can be found in the st_size field of the structure provided by this function.
    • Owner.  The user who owns the file.  This can be retrieved using the stat() function, and can be found in the st_uid field of the structure provided by this function.  You must keep the numeric user ID from this field, as well as the string user name obtained using the getpwuid() function.
    • Group.  The group of the file.  This can be retrieved using the stat() function, and can be found in the st_gid field of the structure provided by this function.  You must keep the numeric group ID from this field, as well as the string group name obtained using the getgrgid() function.
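
    The assignment itself must be written in C++.  Purely to illustrate which fields are involved, here is a small Python sketch: Python's os.stat(), pwd.getpwuid() and grp.getgrgid() are thin wrappers over the same system facilities and expose the same st_mode, st_size, st_uid and st_gid fields.  The function name describe and the dictionary layout are hypothetical, and falling back to the numeric ID when no name is registered is an assumption of this sketch, not a stated requirement.

```python
import grp
import os
import pwd
import stat


def describe(path):
    """Collect name, type, size, owner and group for one file (POSIX only)."""
    info = os.stat(path)

    # st_mode encodes the file type; the stat module decodes it.
    if stat.S_ISREG(info.st_mode):
        kind = "regular file"
    elif stat.S_ISDIR(info.st_mode):
        kind = "directory"
    else:
        kind = "other"

    # Map numeric IDs to names; fall back to the number if none is registered.
    try:
        user = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        user = str(info.st_uid)
    try:
        group = grp.getgrgid(info.st_gid).gr_name
    except KeyError:
        group = str(info.st_gid)

    return {
        "name": path,
        "type": kind,
        "size": info.st_size,
        "uid": info.st_uid,
        "user": user,
        "gid": info.st_gid,
        "group": group,
    }
```

    In the C++ class, the same values would come from one stat() call in the constructor, with the numeric and string forms of owner and group both stored as members.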

database sql

AIMS AND OBJECTIVES:
• to convert the ER/EER model into a relational data model;
• to implement a relational database system (using ORACLE12g).
This is an individual Assignment. You are not permitted to work as a group when writing this
assignment.
Copying, Plagiarism: Plagiarism is the submission of somebody else’s work in a manner that gives
the impression that the work is your own. The Department of Computer Science and Computer
Engineering treats plagiarism very seriously. When it is detected, penalties are strictly imposed.
No extensions will be given: Penalties are applied to late assignments (5% of total assignment mark
given is deducted per day, accepted up to 5 days after the due date only). If there are circumstances
that prevent the assignment being submitted on time, an application for special consideration may be
made. See the departmental Student Handbook for details. Note that delays caused by computer
downtime cannot be accepted as a valid reason for a late submission without penalty. Students must
plan their work to allow for both scheduled and unscheduled downtime.
SUBMISSION GUIDELINES:
This assignment is to be submitted in soft-copy (.pdf, .txt or .sql) formats using the
submission link on LMS by 10:00am Monday September 23rd, 2019. The submission
link can be found under “Assignment 1 – Part 2” on the subject’s LMS page.
NOTE: This assignment must be typed and no hand-writing is allowed.
SUBMISSION CHECKLIST:
The transformation steps from the EER Model shown in Appendix A to final tables (Task 1);
Make sure to show each step of the transformation, and the final transformation tables;
The DDL implementation for the ‘Natural Therapy Centre Database System’ tables from Task 1
(create table statements), and the required insert statements (Task 2a and 2b respectively).
Students are referred to the Department of Computer Science and Computer Engineering’s Handbook
and policy documents with regard to plagiarism and assignment return, and also to the document on
‘Academic Misconduct’ in the subject learning guide.

Security checklist

Deliverables:
1. A single Word document (.docx) – containing all parts.
Scenario:
An Australian advertising company, iCreative, has grown concerned by the global rise in
cybercrime and ransomware.
They have asked you to:
• For all branches – identify and analyse application and networking-based threats to their
  company; and
• For the Brisbane branch only – recommend preventative and mitigative technologies
  and strategies for potential intrusion and attacks on the network.
About the company:
iCreative is a growing advertising company consisting of three branches: The Brisbane (main)
branch; the Launceston branch; and the Portland branch. Each branch has five departments and
there are approximately 25 employees per department. The Brisbane branch has 1 mail server,
2 web servers, and 2 database servers. The Launceston and Portland branches are smaller
branches and so they each have only 1 mail server and 1 database server.
All branches have high-speed networks; however, the traffic can be quite heavy on weekdays.
This is especially true for the Brisbane branch.  

Part I. Potential Threats
You have been provided with a list of complaints from employees about the workstations at
iCreative:
Complaint 1 (Derek): My computer takes forever to start up and shut down. It is just
so slow all the time no matter what I am doing or what program is open.
Complaint 2 (Lexie): I feel like my hard drive is rather small. It only has about 30
document files on it but I keep receiving a notification telling me that my hard drive is
nearly full now. I brought this to the IT department and asked for a bigger hard drive
but they told me that my hard drive is already 2 TB in size and so I shouldn’t need a
bigger one. Instead, they told me to just uninstall any programs that I don’t use. The
only programs that are installed, though, are the ones the IT department installed on
there for me and one extra program that I installed and use every day.
Complaint 3 (Meredith): It takes forever to download a file from the database servers.
It doesn’t even matter what the size of the file is.
Complaint 4 (Alex): I keep receiving email notifications for “Undeliverable message”
but when I look at the messages they aren’t actually emails that I have sent. This is so
annoying and such a waste of time.
Complaint 5 (Richard): I don’t seem to be able to download updates for the antivirus
software or the operating system. It’s so frustrating.
Complaint 6 (Mark): I get very annoyed with the fan in my computer. It’s just so loud.
It seems to be spinning really fast and all the time. Even with no programs open the fan
is going crazy.
Complaint 7 (All employees): Difficulty accessing the website, mail and database
servers.

Computing

Question (20 marks)
For this task you will create a class containing a number of methods for processing an array
of marks, which are scores in a test. Each mark is an integer in the range 0 to 100 inclusive.
On the Interact site for this subject, you have been provided with a class Marks (in Project2
code zipped folder), which has a method getMarks that returns an array of marks for you to
use in testing.
The class ProcessMarks that you create will have the methods specified below. Most will
accept an array of marks as an argument; one will accept an array of characters. The return
type should be appropriate for the returned value.
The max, min and range methods will return the maximum mark, the minimum mark
and the difference between the maximum and minimum marks respectively.
The mean and stdDev methods will return the mean and standard deviation of the set
of marks. Your textbook contains a description of how these can be calculated.
The median method will return the median value of the set of marks. The median
value is the middle one when the values are placed in order. To obtain an ordered
version of the marks you may use an appropriate sort method of the Java API’s Arrays
class. Be careful not to destroy the original array of marks. If there is an even number
of marks, the middle value is taken as the average of the two values that are nearest
to the middle.
The mode method will return the mode of the set of marks, which is the most
commonly occurring mark. To find the mode, use an ordered version of the set of
marks, as used for finding the median. If there is more than one value that is most
common, any one of the most common values will do for the mode.
The grades method will return an array of characters, which are the grades
corresponding to the integer marks in the array of marks. The grades are to be
assigned using the following lower boundaries for the corresponding marks: for grade
A, the lower boundary is 85; for grade B, it is 75; for grade C, it is 65; for grade D, it is
50; and E is the grade for all other marks. A best solution for this method would not
have the values for the lower boundaries hardcoded but would use an array for these
values, which would allow the grade boundaries to be altered.
The gradeDistn method will accept an array of characters, which are the grades
assigned for the array of marks, such as returned by the grades method. The
gradeDistn method will return an array of integer values containing the distribution of
grades, which is the number of occurrences of each grade in the assigned grades. The
characters used for grades are fixed. The returned array should provide the
distribution in order from grade A to grade E.
The following points should be taken into account in the design of your program:
None of your code should change the contents of the original array of marks.
You should not make any assumption that the client code that would use your
methods will call them in any particular order. That is, you should not assume that a
client that calls the range method will have previously called the max and min
methods.
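
The task requires a Java class; the following Python sketch only illustrates the logic of the median, mode, grades and gradeDistn methods described above. It assumes a non-empty list of marks, and the function names and the BOUNDARIES table are hypothetical. Sorting a copy keeps the original marks untouched, and the grade boundaries live in a table rather than being hardcoded in the comparison.

```python
# Lower boundaries, highest grade first; anything below the last one is "E".
BOUNDARIES = [("A", 85), ("B", 75), ("C", 65), ("D", 50)]


def median(marks):
    """Middle value of the ordered marks; mean of the two middles if even."""
    ordered = sorted(marks)  # sorted() returns a copy, so marks is unchanged
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(marks):
    """Most common mark, found by scanning runs in the ordered copy."""
    ordered = sorted(marks)
    best, best_run, run = ordered[0], 0, 0
    for i, mark in enumerate(ordered):
        run = run + 1 if i > 0 and mark == ordered[i - 1] else 1
        if run > best_run:
            best, best_run = mark, run
    return best


def grades(marks, boundaries=BOUNDARIES):
    """Map each mark to the first grade whose lower boundary it reaches."""
    result = []
    for mark in marks:
        for letter, lower in boundaries:
            if mark >= lower:
                result.append(letter)
                break
        else:
            result.append("E")
    return result


def grade_distn(letters):
    """Counts of each grade, in order from A to E."""
    return [letters.count(grade) for grade in "ABCDE"]
```

For the marks 90, 50, 75, 75, 10 and 65, this sketch gives a median of 70.0, a mode of 75, the grades A, D, B, B, E, C, and the distribution 1, 2, 1, 1, 1.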

Database Assignment

This is an individual Assignment. You are not permitted to work as a group when writing
this assignment.
Copying, Plagiarism: Plagiarism is the submission of somebody else’s work in a manner that
gives the impression that the work is your own. The Department of Computer Science and
Computer Engineering treats plagiarism very seriously. When it is detected, penalties are strictly
imposed.
No extensions will be given: Penalties are applied to late assignments (5% of total assignment
mark given is deducted per day, accepted up to 5 days after the due date only). If there are
circumstances that prevent the assignment being submitted on time, an application for special
consideration may be made. See Student Handbook for details. Note that delays caused by
computer downtime cannot be accepted as a valid reason for a late submission without penalty.
Students must plan their work to allow for both scheduled and unscheduled downtime.
SUBMISSION GUIDELINES:
This assignment is to be submitted in soft-copy (either PDF or JPEG) format using
the submission link on LMS, by 10:00 am Monday Sep 2nd, 2019. The submission
link can be found under “Assignment 1 – Part 1” on the subject’s LMS page.
SUBMISSION CHECKLIST:
✓ Your (Enhanced) Entity-Relationship Model (EER) for the proposed database.
Students are referred to the Department of Computer Science and Computer Engineering’s
Handbook and policy documents with regard to plagiarism and assignment return, and also to the
document on ‘Academic Misconduct’ in the subject learning guide.

Programming for Scientists and Engineers

Problem A: String Overlap (30 points)
In this problem you will design and implement C++ code that identifies overlap in strings.
Specifically, design and implement a C++ program that does the following:
1. Asks a user to input a filename and then opens that file. If the file open fails, then
print the message “Unable to open file” and terminate the program using exit(1).
2. Reads the file contents, in order, into an array of strings. (See the file format
explanation below.)
3. Computes the string overlapping order described below, and then prints the strings
out, one per line, in that order.
4. Closes the file.
File Format: The data file consists of a number of strings, each on its own line. You do
not know in advance how many strings will be in the file, other than there will be no more
than 30 strings. Assume (i) there is at least one string in the file, (ii) the strings do not have
any whitespace in their interior, and (iii) the strings consist entirely of alphabetic, upper
case characters.
Here is an example of a data file containing four strings:
AGGTGTGGA
AAAATTA
AATTGTCGCTGA
GGAAAA
Overlapping Order: Here is the explanation of the string overlapping your program is
to find in Step (3) above. Suppose we have the strings above read, in order, into an array
of strings. We start with the first string in the array, AGGTGTGGA. What we want to
determine is which of the other strings’ beginning overlaps the most with the end of
this first string. Note this is a nontrivial problem since we do not know without further
analysis what the size of the overlapping substring will be. For example, if we just look
at the last character of AGGTGTGGA, the A, there are two other strings that begin with
A. If we look at the last two characters, GA, there are no strings starting with GA.
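
The problem asks for C++; as a language-neutral illustration of the single step described so far, here is a Python sketch that measures how much of the end of one string matches the beginning of another by trying the longest candidate first. The function names overlap and best_next are hypothetical, and breaking ties in favour of the earliest string is an assumption, since the text above does not say how ties are resolved.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def best_next(strings, current):
    """Index of the other string whose beginning best overlaps strings[current].

    Ties go to the earliest index (an assumption of this sketch).
    """
    best_index, best_len = None, -1
    for i, candidate in enumerate(strings):
        if i == current:
            continue
        length = overlap(strings[current], candidate)
        if length > best_len:
            best_index, best_len = i, length
    return best_index
```

For the four example strings, overlap("AGGTGTGGA", "GGAAAA") is 3 (the shared GGA), while the two strings beginning with A overlap by only 1, so best_next picks GGAAAA as the string following AGGTGTGGA.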

Traverse the Maze

Objective
This project provides experience of implementing recursive methods and using a generic linked
stack data structure for the purpose of an efficient depth-first search for the longest path in a
special tree structure. You will also review how to construct and use Java classes as well as
obtain experience with software design and testing.
ABET Program Learning Outcomes:
• The ability to recognize the need for data structures and choose the appropriate
  data structure (1,2,6)
• The ability to implement and use stacks (1,2,6)
• The ability to implement and use linked list and array based structures (1,2,6)
The Problem
The programming problem in this project is to find a path from the entry cell (upper left corner)
to the exit cell (lower right corner) of a maze. In our model the rectangular maze is represented
by a grid of cells. Each cell is bordered by 4 walls, and some of these walls are passable to the
neighboring cell(s). The outside wall of the whole maze is not passable. There are however (at
least) two very different interpretations of a passable wall. We say that the maze is directed if
for any given pair of cells A and B there are four cases:
– there is no passage between A and B
– each of A or B is accessible from the other
– B is accessible from A, but A is not accessible from B
– A is accessible from B, but B is not accessible from A
On the other hand, in an undirected maze each passage provides a two-way access, that is, there
are only two cases, #1 and #2 as listed above.
In this project your program shall exercise both options of building a maze, moreover in the
directed maze every passage will be selected by a given probability, while the undirected maze
will be built upon a pattern of predetermined passable walls. Your completed program is able to
• build a maze based upon an input wall pattern read from a file
• build a maze with a random selection of all relevant directional passages
• find a path through a maze if such a path exists
• display the maze solution on the console showing the length of the path and the
  locations of the cells along the path from the entry cell to the exit cell
• report the failure of the search in an output message
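
The project calls for Java and a generic linked stack; the following Python sketch only illustrates the idea of a stack-based depth-first search with backtracking, using a plain list as the stack. The passages mapping (cell to the list of cells accessible from it) and the function name find_path are hypothetical representations chosen for this sketch. Because each entry is one-directional, the same code covers both directed and undirected mazes.

```python
def find_path(passages, entry, exit_cell):
    """Depth-first search; returns the list of cells on a path, or None."""
    stack = [entry]      # the stack always holds the current path
    visited = {entry}
    while stack:
        cell = stack[-1]
        if cell == exit_cell:
            return list(stack)
        for neighbour in passages.get(cell, []):
            if neighbour not in visited:
                visited.add(neighbour)
                stack.append(neighbour)  # step forward into a new cell
                break
        else:
            stack.pop()  # dead end: backtrack to the previous cell
    return None
```

A None result corresponds to the failure message required above; otherwise the length of the returned list minus one is the number of steps on the path.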

Commandline Console

Description
Create a site that provides a web, form-based “Linux Shell” supporting basic commands that can be
performed on a remote “fake” in-memory file system. In this homework you’ll be working with:
• Serving static files
• Middleware
• Handling forms, both GET and POST
• Templating
• A JavaScript Object Representation of a Simple File System
You’ll be creating 2 pages:
• home – : a basic form that allows users to select a distribution of Linux Operating System.
• vfs (virtual file system) – /vfs : a page that allows users to manipulate resources of a virtual file
  system by submitting Linux commands through two forms and see system states returned from the
  server (This is an in-memory file system).
Your directory layout should look similar to the following once you’re done with the assignment
(though it can deviate from this example based on your implementation):

In the views directory, you are not required to have the same files as above. If you’d like, you can use
template partials to reduce redundant markup. This was not covered in class, but you can check out the