Table
Learning Objectives
At the end of this sub-unit, students should
- know how to work with table.
- know how to use
map
andfilter
on table.
Working with Table
We already know how to deal with 2-dimensional array and table is just another 2-dimensional array. So what is so special about working with table? Given the signal processing view, we are going to work with tables using signal processing view to simplify many of the operations.
As a preliminary, we will look at how a table is typically represented. Our example table comes from an enrollment data of a certain university separated by year, sex, and course of study. The sample data is shown below.
-
Table
-
Data
The way we represent this data in Python is using a nested list.
Each row becomes a list and the entire table is a list of rows.
If you noticed that all the data is of type str
, that is intentional.
This is because due to the size of the data, it should not be part of the code.
Otherwise, it will be a fixed data.
So what can we do?
The most common way to store something more permanent that can be modified easily is to store the data as a file.
Typically, file stores only text.
So that is why all the data is a file.
To keep the topic simple, we will provide the code to read from a file format called comma separated value (.csv).
There is a module called csv
already available in Python.
More explanation will be given in later unit.
Read CSV
If you see weird character "", you need to use open(filename, 'r', encoding='utf-8-sig')
.
Since we are dealing with table, the first row is usually reserved for the table header.
Map
First, we consider what is the effect of map
on a table.
Recap that map(f, lst)
applies the function f
on each element elem
of the given list lst
.
Our list in this case is a table.
What is the element of a table?
It is the row of the table!
This means, the function f
is going to be applied to every row.
From the example above, we can imagine that f
is applied to the entire first row like the following.
What operations can we do with a row? There are many operations!
For example, add 100 to intake after conversion to integer.
Notice that in all of these operations, we are dealing with columns.
So map
operation on a table is a column operation.
This is because the element of the table is a row.
A typical column operation is what we call as a "cleanup" operation. Because the data comes from a csv file, we only have string type. This can be difficult to work with. To ease our task in the future, we want to make sure that the data will be a more reasonable type. Cleaning up the data consists of type conversion to the more reasonable type.
There is another advantage of performing a cleanup before doing any other work.
It will free up our mind to focus on other aspect of the problem solving.
As long as we always convert all the type into the reasonable type, we do not have to worry about this later on.
Here is an example of a cleanup of the data above.
We will convert the year and intake into integer.
We will also convert the sex into boolean with female being True
.
As for the course of study, we will leave it as string.
Now we can use it as a preprocessing step after reading the table from the csv file.
It is worth nothing that map
can only work with a single row.
it cannot work with two rows at the same time.
So if the operation we needed is to compare two rows or to aggregate some data from multiple rows, then we cannot use map
.
For instance, we cannot use map
on its own to find the total intake for each course of study for a given year.
This requires data from two different rows and we may not be able to assume that the data will be consecutive.
Furthermore, some data may be missing which may complicate the problem.
Summary of Effect of Map
Note that the number of rows is unchanged by map
.
Filter
Next, we want to know the effect of filter
on the data.
filter
will either keep the element or remove the element from a list.
Since the element of a table is a row, when filter
is applied to a table, we will be either keeping or removing rows.
What this mean is that filter
is a row selection operation.
For instance, maybe we want to keep only data for the year up to and including 2007.
Assuming that we have already cleaned up the data such that the year is an integer, we can use filter
to select which rows to keep as follows.
Using filter
, we can only remove rows.
Put it another way, we can choose to keep the rows that we want to use for further processing.
This will help us in the actual problem solving as another preprocessing step.
Summary of Effect of Filter
Other Common Operations
Grouping
Grouping is an operation that partitions the table into rows that share common properties. This is then typically followed by an aggregation operation that is applied on each group. A potential question is the following.
For each course of study, find the average female intake for the year between 2005 and 2010 inclusive of both years. Return the result as a list of pair of (course, average).
We can see that the problem can be decomposed into the following.
- Read data and cleanup (e.g., convert year to integer).
- Filter the data keeping only year between 2005 and 2010 (inclusive).
- Partition the data by course of study.
- For each course of study, calculate the average intake.
Grouping + aggregation operation is gone on Step 3 and Step 4.
We will leave generalization with blanks as an exercise as there can be many ways to solve this.
What we are going to show is a recursive way of solving this by using filter
to help us partition.
The partitioning is done as a binary choice.
We will look at the first row and partition all the rows into two categories.
- All the rows that share the same category as the first row.
- All the rows that do NOT share the same category as the first row.
We then aggregate the first category and we leave the second category to be solved by recursion.
Sorting
Another common operation is sorting. We have already discussed how to do sorting and there are several built-in sorting technique as well. Here, the idea of stable sort will be of great value.
Ranking
Ranking is slightly different from sorting. Sorting generally do not allow for ties. If there are ties during sorting, it means that the problem is not well-defined. We want our sorting to produce a total order in which for every pair of rows \(R_1\) and \(R_2\), we can determine one of the following three possibilities.
- \(R_1 < R_2\) and \(R_1\) must appear before \(R_2\) in the sort order.
- \(R_1 > R_2\) and \(R_2\) must appear before \(R_1\) in the sort order.
- \(R_1 = R_2\) which can only mean both have exactly the same data.
In ranking, some rows may have the same rank in case of ties. So simply sorting is insufficient. There are some common ranking definition. We will illustrate this with the following name and score.
Name | Score |
---|---|
Alice | 100 |
Bob | 90 |
Cathy | 90 |
David | 80 |
In this ranking system, ties receive the same higher rank as illustrated below. There will also be gaps in rank, as we can see from the missing 3rd rank.
Name | Score | Rank |
---|---|---|
Alice | 100 | 1 |
Bob | 90 | 2 |
Cathy | 90 | 2 |
David | 80 | 4 |
In this ranking system, ties receive the same lower rank as illustrated below. There will also be gaps in rank, as we can see from the missing 3rd rank.
Name | Score | Rank |
---|---|---|
Alice | 100 | 1 |
Bob | 90 | 3 |
Cathy | 90 | 3 |
David | 80 | 4 |
In this ranking system, ties receive the same higher rank as illustrated below. However, unlike 1224, there should be no gaps. This is often called dense rank.
Name | Score | Rank |
---|---|---|
Alice | 100 | 1 |
Bob | 90 | 2 |
Cathy | 90 | 2 |
David | 80 | 3 |
Problem Solving with Table
Average Intake
Average Intake
Problem Description
We have a table with the following header.
Year | Sex | Faculty | Intake |
---|---|---|---|
We want to find the average intake for a given faculty based on the year. Note that there can be more than rows for the same year and the same faculty: 1 row for male and 1 row for female. So we need to first find the total intake for any single year.
Task
Write the function average_intake(filename, faculty)
to find the average intake for the given faculty.
The data will be given in the file with the given filename.
Assumptions
filename
exist.- There will be at least one row for the given
faculty
.
Usually, the preprocessing step is the same for almost all problem with the same data. We need to cleanup the data first. Do not forget to also remove the header. Afterwards, we want to filter out the irrelevant rows. We want to keep only the relevant rows. Then we can begin problem solving.
To solve the problem, we will need to do the following.
- Calculate the sum of intake.
- Calculate the number of distinct year.
We can then compute the average by dividing the first with the second. This gives us the following code.
Average Intake
Alternatively, we can also do grouping based on year after we have done the filtering.
Average Intake
Faculty Rank
Faculty Rank
Problem Description
We have a table with the following header.
Year | Sex | Faculty | Intake |
---|---|---|---|
We want to rank the faculty based on female intake. The higher intake should be given higher rank (i.e., smaller number, with 1 being the highest rank). What we want is the top \(n\) ranked faculty in a given year. If there are ties, we use 1224 ranking.
Task
Write the function top_n(filename, year, n)
to return the list of top n
faculties in the given year
.
Resolve ties with 1224 ranking.
Assumptions
filename
exist.year
is an integer.n
is an integer.
Using the usual problem solving technique, we can decompose the problem as follows.
- Read data from file.
- Cleanup the data.
- Filter the data for the given year.
- Filter the data to include only female intake.
- Sort the data based on intake.
- Rank using 1224 ranking.
- Filter the data to include only faculties with rank at most
n
. - Remove rank.
Faculty Rank
This is nice, there is no loop in top_n
but we did use quite a number of helper functions.
However, these helper functions follow the same general structure so we can store these in our pattern bank!