Table

Learning Objectives

At the end of this sub-unit, students should

know how to work with table.
know how to use map and filter on table.

Working with Table

We already know how to deal with 2-dimensional array and table is just another 2-dimensional array. So what is so special about working with table? Given the signal processing view, we are going to work with tables using signal processing view to simplify many of the operations.

As a preliminary, we will look at how a table is typically represented. Our example table comes from an enrollment data of a certain university separated by year, sex, and course of study. The sample data is shown below.

Table

Data

table = [
  ['2005', 'M', 'Accountancy', '876'],
  ['2005', 'F', 'Accountancy', '530'],
  ['2005', 'M', 'Architecture', '299'],
  ['2005', 'F', 'Architecture', '175'],
  ['2005', 'M', 'Business', '1545'],
  ['2005', 'F', 'Business', '901'],
  ['2005', 'M', 'Dentistry', '33'],
  ['2005', 'F', 'Dentistry', '19'],
  ['2005', 'M', 'Education', '0'],
  ['2005', 'F', 'Education', '0'],
  ['2005', 'M', 'Engineering', '4028'],
  ['2005', 'F', 'Engineering', '1170'],
  # other rows omitted
]

The way we represent this data in Python is using a nested list. Each row becomes a list and the entire table is a list of rows. If you noticed that all the data is of type str, that is intentional. This is because due to the size of the data, it should not be part of the code. Otherwise, it will be a fixed data. So what can we do?

The most common way to store something more permanent that can be modified easily is to store the data as a file. Typically, file stores only text. So that is why all the data is a file. To keep the topic simple, we will provide the code to read from a file format called comma separated value (.csv). There is a module called csv already available in Python. More explanation will be given in later unit.

Read CSV

import csv
def read_csv(filename):
  rows = []
  with open(filename, 'r') as csvfile:
    reader = csv.reader(csvfile) # csv reader!
    for row in reader: # all data is read as string
      rows.append(list(row))
  return rows
  # format will be [[str, str, ...], ...]

If you see weird character "ï»¿", you need to use open(filename, 'r', encoding='utf-8-sig').

Since we are dealing with table, the first row is usually reserved for the table header.

Map

First, we consider what is the effect of map on a table. Recap that map(f, lst) applies the function f on each element elem of the given list lst. Our list in this case is a table. What is the element of a table? It is the row of the table!

This means, the function f is going to be applied to every row. From the example above, we can imagine that f is applied to the entire first row like the following.

f(['2005', 'M', 'Accountancy', '876'])

What operations can we do with a row? There are many operations!

Type ConversionRemove ColumnRearrange ColumnUpdate ColumnAdd Column

We can convert the the first column into an integer.

def f(row):
  return [int(row[0])] + row[1:]

For example, remove course of study.

def f(row):
  return row[:2] + row[3:]

For example, move intake to after year.

def f(row):
  return [row[0], row[3], row[1], row[2]]

For example, add 100 to intake after conversion to integer.

def f(row):
  return row[:3] + [int(row[3]) + 100]

For example, add a check if the row is for male or female.

def f(row):
  return row + [row[1] == 'M']

Notice that in all of these operations, we are dealing with columns. So map operation on a table is a column operation. This is because the element of the table is a row.

A typical column operation is what we call as a "cleanup" operation. Because the data comes from a csv file, we only have string type. This can be difficult to work with. To ease our task in the future, we want to make sure that the data will be a more reasonable type. Cleaning up the data consists of type conversion to the more reasonable type.

There is another advantage of performing a cleanup before doing any other work. It will free up our mind to focus on other aspect of the problem solving. As long as we always convert all the type into the reasonable type, we do not have to worry about this later on. Here is an example of a cleanup of the data above. We will convert the year and intake into integer. We will also convert the sex into boolean with female being True. As for the course of study, we will leave it as string.

def cleanup(row):
  return [
    int(row[0]),    # 1st column (year) is integer
    row[1] == 'F',  # 2nd column (sex) is boolean
    row[2],         # 3rd column (course) is string
    int(row[3])     # 4th column (intake) is integer
  ]

Now we can use it as a preprocessing step after reading the table from the csv file.

def task(filename):
  raw  = read_csv(filename)          # read raw data from filename
  head = raw[0]                      # get header
  data = list(map(cleanup, raw[1:])) # remove header and cleanup

  # the rest of the code to solve task

It is worth nothing that map can only work with a single row. it cannot work with two rows at the same time. So if the operation we needed is to compare two rows or to aggregate some data from multiple rows, then we cannot use map.

For instance, we cannot use map on its own to find the total intake for each course of study for a given year. This requires data from two different rows and we may not be able to assume that the data will be consecutive. Furthermore, some data may be missing which may complicate the problem.

Summary of Effect of Map

Change ColumnRemove ColumnAdd Column

Map01

Map02

Map03

Note that the number of rows is unchanged by map.

Filter

Next, we want to know the effect of filter on the data. filter will either keep the element or remove the element from a list. Since the element of a table is a row, when filter is applied to a table, we will be either keeping or removing rows. What this mean is that filter is a row selection operation.

For instance, maybe we want to keep only data for the year up to and including 2007. Assuming that we have already cleaned up the data such that the year is an integer, we can use filter to select which rows to keep as follows.

filter(lambda row: row[0] <= 2007, data)

Using filter, we can only remove rows. Put it another way, we can choose to keep the rows that we want to use for further processing. This will help us in the actual problem solving as another preprocessing step.

Summary of Effect of Filter

Select Rows

Filter01

Other Common Operations

Grouping

Grouping is an operation that partitions the table into rows that share common properties. This is then typically followed by an aggregation operation that is applied on each group. A potential question is the following.

For each course of study, find the average female intake for the year between 2005 and 2010 inclusive of both years. Return the result as a list of pair of (course, average).

We can see that the problem can be decomposed into the following.

Read data and cleanup (e.g., convert year to integer).
Filter the data keeping only year between 2005 and 2010 (inclusive).
Partition the data by course of study.
For each course of study, calculate the average intake.

Grouping + aggregation operation is gone on Step 3 and Step 4. We will leave generalization with blanks as an exercise as there can be many ways to solve this. What we are going to show is a recursive way of solving this by using filter to help us partition. The partitioning is done as a binary choice. We will look at the first row and partition all the rows into two categories.

All the rows that share the same category as the first row.
All the rows that do NOT share the same category as the first row.

We then aggregate the first category and we leave the second category to be solved by recursion.

def grouping(data):  # Assume data is already cleaned and filtered
  if len(data) == 0: # no more data
    return []
  else:
    fst = data[0]
    same = list(filter(lambda row: row[2] == fst[2], data)) # same category
    diff = list(filter(lambda row: row[2] != fst[2], data)) # different category

    # process data from same category
    same = list(map(lambda row: row[3], same))  # get intake
    return [(fst[2], sum(same)/len(same))] + grouping(diff)

Sorting

Another common operation is sorting. We have already discussed how to do sorting and there are several built-in sorting technique as well. Here, the idea of stable sort will be of great value.

Ranking

Ranking is slightly different from sorting. Sorting generally do not allow for ties. If there are ties during sorting, it means that the problem is not well-defined. We want our sorting to produce a total order in which for every pair of rows \(R_1\) and \(R_2\), we can determine one of the following three possibilities.

\(R_1 < R_2\) and \(R_1\) must appear before \(R_2\) in the sort order.
\(R_1 > R_2\) and \(R_2\) must appear before \(R_1\) in the sort order.
\(R_1 = R_2\) which can only mean both have exactly the same data.

In ranking, some rows may have the same rank in case of ties. So simply sorting is insufficient. There are some common ranking definition. We will illustrate this with the following name and score.

Name	Score
Alice	100
Bob	90
Cathy	90
David	80

122413341223

In this ranking system, ties receive the same higher rank as illustrated below. There will also be gaps in rank, as we can see from the missing 3^rd rank.

Name	Score	Rank
Alice	100	1
Bob	90	2
Cathy	90	2
David	80	4

def rank1224(data): # assume already sorted
  rank = 1
  data[0].append(rank)  # rank 1
  for i in range(1, len(data)):            # loop from top to bottom
    if not same_rank(data[i], data[i-1]):  # not the same rank as previous row
      rank = i + 1                         # rank is row number + 1
    data[i].append(rank)
  return data

In this ranking system, ties receive the same lower rank as illustrated below. There will also be gaps in rank, as we can see from the missing 3^rd rank.

Name	Score	Rank
Alice	100	1
Bob	90	3
Cathy	90	3
David	80	4

def rank1334(data): # assume already sorted
  rank = len(data)
  data[-1].append(rank)  # rank 1
  for i in range(len(data) - 2, -1, -1):   # loop from bottom to top
    if not same_rank(data[i], data[i+1]):  # not the same rank as next row
      rank = i + 1                         # rank is row number + 1
    data[i].append(rank)
  return data

In this ranking system, ties receive the same higher rank as illustrated below. However, unlike 1224, there should be no gaps. This is often called dense rank.

Name	Score	Rank
Alice	100	1
Bob	90	2
Cathy	90	2
David	80	3

def rank1223(data): # assume already sorted
  rank = 1
  data[0].append(rank)  # rank 1
  for i in range(1, len(data)):            # loop from top to bottom
    if not same_rank(data[i], data[i-1]):  # not the same rank as previous row
      rank = rank + 1                      # continue numbering
    data[i].append(rank)
  return data

Problem Solving with Table

Average Intake

Problem Description

We have a table with the following header.

Year	Sex	Faculty	Intake

We want to find the average intake for a given faculty based on the year. Note that there can be more than rows for the same year and the same faculty: 1 row for male and 1 row for female. So we need to first find the total intake for any single year.

Task

Write the function average_intake(filename, faculty) to find the average intake for the given faculty. The data will be given in the file with the given filename.

Assumptions

filename exist.
There will be at least one row for the given faculty.

Usually, the preprocessing step is the same for almost all problem with the same data. We need to cleanup the data first. Do not forget to also remove the header. Afterwards, we want to filter out the irrelevant rows. We want to keep only the relevant rows. Then we can begin problem solving.

To solve the problem, we will need to do the following.

Calculate the sum of intake.
Calculate the number of distinct year.

We can then compute the average by dividing the first with the second. This gives us the following code.

Average Intake

def cleanup(row):
  return [int(row[0]), row[1] == 'F', row[2], int(row[3])]

def average_intake(filename, faculty):
  data = list(map(cleanup, read_csv(filename)[1:]))        # remove header and cleanup (left as an exercise)
  data = list(filter(lambda row: row[2] == faculty, data)) # only keep data from the given faculty

  # problem solving
  sums, year = 0, []
  for row in data:
    if row[0] not in year:  # check if not already recorded
      year.append(row[0])   # record
    sums = sums + row[3]
  return sums / len(year)

Alternatively, we can also do grouping based on year after we have done the filtering.

Average Intake

def cleanup(row):
  return [int(row[0]), row[1] == 'F', row[2], int(row[3])]

def grouping(data):
  if len(data) == 0:
    return 0
  else:
    fst = data[0]
    same = list(filter(lambda row: row[0] == fst[0], data)) # same year
    diff = list(filter(lambda row: row[0] != fst[0], data)) # different year

    # count partition
    return 1 + grouping(diff)

def average_intake(filename, faculty):
  data = list(map(cleanup, read_csv(filename)[1:]))        # remove header and cleanup
  data = list(filter(lambda row: row[2] == faculty, data)) # only keep data from the given faculty

  # problem solving
  intk = list(map(lambda row: row[3], data))
  return sum(intk) / grouping(data)

Faculty Rank

Problem Description

We have a table with the following header.

Year	Sex	Faculty	Intake

We want to rank the faculty based on female intake. The higher intake should be given higher rank (i.e., smaller number, with 1 being the highest rank). What we want is the top \(n\) ranked faculty in a given year. If there are ties, we use 1224 ranking.

Task

Write the function top_n(filename, year, n) to return the list of top n faculties in the given year. Resolve ties with 1224 ranking.

Assumptions

filename exist.
year is an integer.
n is an integer.

Using the usual problem solving technique, we can decompose the problem as follows.

Read data from file.
Cleanup the data.
Filter the data for the given year.
Filter the data to include only female intake.
Sort the data based on intake.
Rank using 1224 ranking.
Filter the data to include only faculties with rank at most n.
Remove rank.

Faculty Rank

def cleanup(row):
  return [int(row[0]), row[1] == 'F', row[2], int(row[3])]

def rank1224(data): # assume already sorted
  rank = 1
  data[0].append(rank)  # rank 1
  for i in range(1, len(data)):
    if data[i][3] != data[i-1][3]:
      rank = i + 1
    data[i].append(rank)
  return data

def top_n(filename, year, n):
  data = list(map(cleanup, read_csv(filename)[1:]))     # remove header and cleanup
  data = list(filter(lambda row: row[0] == year, data)) # only keep data from the given year
  data = list(filter(lambda row: row[1], data))         # only keep data for female intake
  data = sorted(data,
                key = lambda row: row[3],
                reverse = True)                         # sorting
  data = rank1224(data)                                 # rank using 1224 ranking
  data = list(filter(lambda row: row[4] <= n, data))    # include only top n
  data = list(map(lambda row: row[:-1], data))          # remove rank
  return data

This is nice, there is no loop in top_n but we did use quite a number of helper functions. However, these helper functions follow the same general structure so we can store these in our pattern bank!