String Operations

Learning Objectives

At the end of this sub-unit, students should

understand substring check operations.
understand string operations involving indexing and slicing.
know how to use indexing and slicing on string.

Ordering

As we have seen earlier, a string is often thought of as a sequence of characters. A sequence is an interesting kind of collection because it has an order. As such, we often call a sequence as an ordered collection.

We say that a collection \(C\) is an ordered collection if we can have a mapping from a natural number (i.e., a sequence of (0, 1, 2, ...)¹) to the elements of \(C\). Since a string is an ordered collection, we can have a mapping from natural number to the elements of the string.

The element of a string is the character in the string itself. However, there is no special data type for character. A character is simply a string of one character. The implication is that an element of a string is a string.

So how are we going to do the mapping? The simplest way is to do the following mapping.

The number 0 is mapped to the first character.
The number 1 is mapped to the second character.
The number 2 is mapped to the third character.
:
The number (n-1) is mapped to the n^th character.

This gives us the following visualization for the string "ABCDEF".

Ordered

Instead of memorizing the visualization, it is better to come up with a procedure to generate the mapping.

Mapping

Given a string \(S\), we can generate most of the mapping as follows.

Start with the leftmost character in \(S\) and map it to the number 0.
Move one character to the right and map it to the previous number + 1.
Repeat until there are no more character.

It is always instructive to question our own assumption. Consider the procedure above. Is there a way that the procedure above can fail?

Unfortunately there is. If we look at the first instruction, we can ask ourselves, what if there is no leftmost character. One possible string that satisfies this condition is the empty string (i.e., ''). In fact, for an empty string, there is no possible mapping that you can make because not even the number 0 can be used.

Substring Check

The first thing that we will often do with a collection is to check if our collection contains an item. Since a string is a collection of characters (which itself is a string of one character), we want to check if a string contains a character. Luckily, we have such operator in Python called the in operator. There is also the opposite operator called not in, but this is not as important as we will see below.

needle in haystack      # returns True if haystack contains needle
needle not in haystack  # equivalent to not (needle in haystack)

Single Character Check

>>> 'D' in 'ABCDEF'
True
>>> 'G' in 'ABCDEF'
False
>>> 'a' in 'ABCDEF'        # case-sensitive
False
>>> 'f' not in 'ABCDEF'
True
>>> not ('D' in 'ABCDEF')  # parentheses to avoid ambiguity
False

The in operator is actually more powerful than that. The needle need not be just a single character but it can be what we call as a contiguous substring. For instance, 'BCD' is contiguous in 'ABCDEF' but 'ACE' is not. And so we expect 'BCD' in 'ABCDEF' to return True and 'ACE' in 'ABCDEF' to return False.

Contiguous Substring Check

>>> 'BCD' in 'ABCDEF'
True
>>> 'ACE' in 'ABCDEF'      # non-contiguous
False
>>> 'ABCDEF' in 'ABCDEF'   # we may use everything
True
>>> '' in 'ABCDEF'         # empty string is a substring of all string
True
>>> '' in ''               #   including empty string
True
>>> 'A' in ''              # but not the converse
False

Common Mistakes

This is a similar common mistake as "natural" reading of code but this is for sequence. Consider the following problem.

Check if the string "0" or "1" is a substring of "2112".

Based on the "natural" reading, we may write the following incorrect code.

>>> "0" or "1" in "2112"
'0'

The problem here is the order of operation. in operator is a relational operator. So it has a higher precedence than the logical operator or. The code above is actually equivalent to "0" or ("1" in "2112"). Based on short-circuit operation, the result will be "0". Maybe adding explicit parentheses will help.

>>> ("0" or "1") in "2112"
False

What is the problem now? Again, there is short-circuit happening. If we evaluate ("0" or "1") by short-circuit, we will get "0". So the code above reduces to "0" in "2112" and we do not check if "1" is in "2112". In the end, do not be lazy and write the two required operations explicitly.

>>> ("0" in "2112") or ("1" in "2112")
True

Indexing

Given the mapping from natural number to the element, the first string operation we will discuss is that of indexing. This is simply a way to retrieve the element of the string given a natural number. Since there is a mapping, we can retrieve the n^th character in the string by supplying the number (n-1). This operation is indexing and the general syntax is string[index] where string is a string and index is an integer.

Indexing

>>> "ABCDEF"[0]  # first character
'A'
>>> "ABCDEF"[2]  # third character
'C'
>>> "ABCDEF"[5]  # sixth character
'F'

Negative Indexing

Besides the mapping from natural number to the element, there is another mapping that we shall call negative indexing². Unlike the mapping procedure that starts from the leftmost character mapped to 0, negative indexing starts from the rightmost character mapped to -1. We will then move one character to the left and subtract 1 from the previous mapping.

Negative Mapping

Given a string \(S\), we can generate most of the mapping as follows.

Start with the rightmost character in \(S\) and map it to the number -1.
Move one character to the left and map it to the previous number - 1.
Repeat until there are no more character.

So given the same string 'ABCDEF', we can now give a more complete picture of all the indexing.

Indexing

To contrast negative indexing with the standard indexing, we may use positive indexing to refer to the standard version of indexing.

Negative Indexing

>>> "ABCDEF"[-2]  # fifth character
'E'
>>> "ABCDEF"[-5]  # second character
'B'
>>> "ABCDEF"[-6]  # first character
'A'

We can give a brief properties of positive and negative indexing for a string \(S\) with \(n\) characters.

Indexing Properties

Let \(S\) be a string with exactly \(n\) characters.

Character	Positive Index	Negative Index
First	`0`	`-n`
Last (n^th)	`(n-1)`	`-1`

So the valid range of indexes are from -n to (n-1). In this range, from -n to -1 are negative indexes and from 0 to (n-1) are positive indexes.

There is a simple formula to connect positive and negative indexes. If we let the positive index be \(i\) (e.g., \(i = 2\)) and the negative index be \(j\) (e.g., \(j = -4\)), then for a string of length \(n\), we can write \(i = n + j\) (e.g., if \(n = 6\), then \(2 = 6 + (-4)\)).

Bad Practices

If \(S\) is a string with exactly \(n\) characters, we know from the indexing properties that the valid range of indexes are from -n to (n-1). If you try to index from outside this range, you will get IndexError. That is like having a collection of 5 marbles and ask for the 6^th marble.

>>> "ABCDEF"[6]
IndexError: string index out of range
>>> "ABCDEF"[-7]
IndexError: string index out of range

String Theory

There is a peculiar property of a string that behaves like an infinite sequence. Consider a simple string 'X'. It is a string of one character, so the positive index is 0 and the negative index is -1. What happen if we evaluate 'X'[0]? We get back 'X'. So now we can see that if we evaluate 'X'[0][0][0][0][0][0], it does not matter how many [0] we add, the result will always be 'X'. It seems we cannot look beyond a string.

Slicing

Now that we know about positive/negative indexing, we can now talk about slicing. There are two variants of slicing shown below.

Slicing

Given a string s and three integers start, stop, and step, there are two variants of slicing.

s[start : stop]         # start and stop only
s[start : stop : step]  # start, stop, and step

Note that both slicing and indexing uses square bracket [...]. To differentiate between slicing and indexing, we need to look at the colon (i.e., :). If there is at least one colon, then it is a slicing. If there is no colon, then it is an indexing.

Slicing is quite involved, so we will first make simplifications. We will then relax these simplifications to arrive at the most general form of slicing.

Simple Slicing

We start with the first variant without step and we further impose the restriction that we will only use positive index. The idea of a slice s[start : stop] is to collect the characters from the string s starting from start up to stop and create a new string. But as a convention, we will always exclude the character at stop³.

As we are collecting the characters and forming a new string, we will be moving to the right (i.e., towards larger index). For illustration, consider the case of 'ABCDEF'[1:4].

Simple Slicing

Slicing 01

To evaluate 'ABCDEF'[1:4], we start from the element at the index determined by start, which is the character at index 1 (i.e., 'B'). We then move to the right as shown by the purple dashed line until we reach the element determined by stop. This is the character at index 4'.Remember that we always exclude the element at the index determined by#!py3 stop`.

Hence, the result is 'BCD'.

>>> 'ABCDEF'[1:4]
'BCD'

Before we give a more thorough procedure to do slicing, there are two interesting cases to consider.

What happen if start >= stop? Since we are always moving to the right then the element at stop is unreachable. And because we are always excluding the element at stop, the result should be an empty string. We will extend the unreachable case when we relax more restrictions.
What happen if stop is to the right of the rightmost character? In the case of indexing we have IndexError. But for slicing, there is no error and we simply include the last character.

These two cases are quite interesting because this means that slicing will never produce IndexError. One way to think about it is like slicing a cake. We make one slice in the middle of the cake and another slice outside of the plate, does not even touch the cake. What do we get in the end? The cake starting from the first cut.

Slicing 02

Interesting Cases for Simple Slicing

>>> "ABCDEF"[5:1]  # unreachable
''
>>> "ABCDEF"[2:7]  # beyond the edge
'CDEF'

For now, we have no possibility for start to be on the left of the leftmost element. But this will occur later and it will be another interesting case.

Negative Index

Let us now relax the requirement that we will only use positive index. How can we handle negative index. Actually, the way we worded the steps above is already indicative of how to handle negative index.

... we start from the element at the index determined by start ... until we reach the element determined by stop.

Notice how in our steps, we have never used algebra like, the next element is start + 1. Although this makes the explanation simple for the simple case, it cannot be easily extended to negative index. In particular, the first interesting case where the stop is unreachable from the start does not depend on the actual value. Instead, it depends on the element itself.

Incorrect Thinking

Consider the case of 'ABCDEF'[1:-2]. If our mental model is a the algebraic model, then we may erroneously believe that this is part of the unreachable case because 1 > -2. Similar erroneous thinking is to consider what happen if we keep on addng +1 to 1. Since we will never reach -2, we erroneously conclude that it is unreachable.

The actual result is explained next.

What this means is that if we consider 'ABCDEF'[1:-2], we use the index only to determine the first and the last element. Now it does not matter if we use positive or negative index because we are not doing an algebraic manipulation. Interestingly, the element at index -2 is the same element at index 4. So the result is exactly the same as our previous case! The result is 'BCD'.

Slicing 03

You may think a simpler explanation will be to simply convert negative index into positive index. But unfortunately, that only works if the index is within the valid range! And as we have seen above, slicing may use index that are beyond the valid range. So we opt to explain this in a more unified way by stating that the index (i.e., start and stop) are only used to determine the element. Afterwards, these information can be ignored.

Consider the case of 'ABCDEF'[-10:-4]. While we can convert -4 to its positive index version of 2, we cannot conver -10. If we simply plug it in into the formula \(i = n + j\), then we get \(i = 6 + (-10)\). \(i = -4\) is definitely not a positive index.

This brings us to the extension of the previous interesting case. What happen if start is on the left of the leftmost element. In this case, we simply move the start to be on the first element. So the result of 'ABCDEF'[-10:-4] is 'AB'.

Steps

Now we can extend this to include step. So far, we have been moving one step to the right. We can then say that by convention, this is equivalent to step = 1. So step = 1 is moving to the right by 1.

What about step = 2. We can say that it is moving to the right by 2. Generalizing this, we can say that if step = k, then we move to the right by k. Now, what if our step is negative? Say, step = -1? Since we are looking at directions, we can say that it is as if we are moving in the opposite direction. So for step = -k, instead of moving to the right by k, we simply move to the left by k. This gives us the ability to move backward!

Slicing with Steps

Consider 'ABCDEF'[-2:1:-2]. What will be the result?

Slicing 04

First, we determine the start and stop element. Secondly, we determine if this is unreachable or not. This is actually reachable because the stop element is to the left of the start element and we are moving to the left. Finally, we start collecting with a step of 2. But keep in mind we are moving to the left as shown in the image above.

The first character we collect is 'E' and the second character is 'C'. We stop here because we will pass the stop element. So the result is 'EC'.

>>> 'ABCDEF'[-2:1:-2]
'EC'

Now the previous extension to the interesting case becomes more interesting. Recap that we mentioned the following.

In this case, we simply move the start to be on the first element.

With the possibility of a negative step, we will also need to consider the case of start on the right of the rightmost element but with negative step. In this case, we start from the last element.

It is always nice to give names to operations so that we can all agree what we meant when we say things. This is similar to how we say that a case is unreachable. Names are powerful and do no be afraid to invent names, especially for concepts that we did not give names. In the spirit of naming things, let's name this process of starting from first/last element as restricting the slice. Of course we will have to give a more proper definition, but we can leave it for later.

Slicing with Restriction

Consider 'ABCDEF'[6:-5:-2]. What will be the result?

Slicing 05

Here we have two possibilities. The first is shown on the left where we imagine that there is an element at index 6 and simply start from there. However, we will only start collecting once we reach valid index. In this case, the result should be 'EC'.

The second possibility is to restrict the start to within the valid index first. So instead of starting at the imagined element, we simply move the start to the nearest valid index. This case is shown on the right and the result should be 'FD'.

We have two hypotheses, so we check.

>>> 'ABCDEF'[6:-5:-2]
'FD'

The answer is the second possibility. This is in line with our explanation that we simply start from first/last element.

It is good to look at it in terms of procedure. We can summarize the behavior of slicing as "Basic Slicing" below assuming that all the values are given. Note that the procedure below works if step is given. Otherwise, assume that step is 1.

Basic Slicing Procedure

Determine the start and stop element.
Determine if the stop element is unreachable.
- Unreachable if
  - stop is to the left of start and step is positive (i.e., move to the right).
  - stop is to the right of start and step is negative (i.e., move to the left).
- If unreachable then we return empty string and go to ...
Restrict the start element.
- If the start element is outside of valid index, we start from the nearest valid index.
Collect character by character and construct a new string.
- The next character is determined by the step.
  - If step is positive +k, we move to the right by k steps.
  - If step is negative -k, we move to the left by k steps.
- We stop when
  - we go outside of valid range, or
  - we reach the stop element, or
  - we pass the stop element.

The procedure may look complicated but that is because we need to explain the two terminologies used: unreachable and restricting. Otherwise, the procedure should hopefully be rather straightforward.

Missing Values

Finally, we need to consider missing values. This final relaxation is actually rather straightforward. We simply fill in the missing values with the default values according to the table below. The default value table is valid for a string with \(n\) character.

Step			Start	Stop
1		Step > 0	0 (i.e., leftmost)	\(n\) (i.e., right of rightmost)
		Step < 0	-1 (i.e., rightmost)	\(-(n + 1)\) (i.e., left of leftmost)

The default value for step is simply 1. On the other hand, the default value for start and stop depends on whether step is positive or negative. An easy way to remember them is shown in the parentheses.

Slicing

>>> "ABCDEF"[::-1]  # 'ABCDEF'[0:-7:-1] --> reverse!
'FEDCBA'
>>> "ABCDEF"[:-1:]  # 'ABCDEF'[0:-1:1] --> remove last!
'ABCDE'

In general, we have the following useful shorthands.

s[:-k:] removes the last k elements from the string s.
s[k1:-k2:] removes the first k1 and last k2 elements from the string s.

If you are unsure, we would recommend using only positive index and fill in all the values.

Functions on String

We have used the placeholder \(n\) for the number of characters in the string. But how do we know the number of characters in a string? There is a function that we can invoke to get the number of characters in the string.

First, a little bit about functions. We will discuss functions in more details in later units. For now, all we need to know is how to invoke functions. If our function name is f and it takes one argument, we can invoke the function by appending parentheses with the arguments inside. So to invoke f with argument arg, we write f(arg). Note that the argument must be supplied by us when we invoke.

Another thing is that even if we do not have any argument, we still need to provide parentheses. So if we have a function g that does not take in any argument, we can invoke it by writing g(). In short, a function can be described by the name, the arguments needed, and the result which we will call the return value.

In the case of the function to get the number of characters of a string, the information is as follows.

Name: len
Argument: Only one argument, which is the string.
Return Value: The number of characters of the string argument.

String Length

>>> len("CS1010S")
6
>>> len("")
0

Bad Practice

Note that the len only works on sequence. We will show more sequence later, but note that it will not work for boolean, integer, or floating point number.

>>> len(1010)
TypeError
>>> len(20.3)
TypeError
>>> len(True)
TypeError

Summary

Summary of String Operations

Operation	Syntax
Substring Check	`needle in haystack` `needle not in haystack`
Indexing	`s[idx]`
Slicing	`s[start : stop]` `s[start : stop : step]`
Length	`len(s)`

Summary of String Index

Given a string with \(n\) characters, we can assign two kinds of index (positive and negative index) as follows.

Positive Index
- Start from leftmost character, assign the value 0.
- Move to the right, and assign the value of the value on the left plus one.
Negative Index
- Start from rightmost character, assign the value -1.
- Move to the left, and assign the value of the value on the right minus one.

+ve	0	1	...	n - 2	n - 1
-ve	-n	-n + 1	...	-2	-1

Valid indexes are between \(-n\) and \(n - 1\) (inclusive) with \(-n\) to \(-1\) belonging to negative index and \(0\) to \(n - 1\) belonging to positive index. We can convert valid negative index \(j\) into valid positive index \(i\) (and vice versa) using the formula

\[i = n + j\]

Summary of String Slicing

For slicing, we have the following default values for missing values.

Step			Start	Stop
1		Step > 0	0 (i.e., leftmost)	\(n\) (i.e., right of rightmost)
		Step < 0	-1 (i.e., rightmost)	\(-(n + 1)\) (i.e., left of leftmost)

We use the definition of natural number starting from 0 instead of 1. Also, 0 is a perfectly good number and we don't want to waste good numbers. ↩
An alternative name we may use is the "backward indexing". In this case, we can also name "positive indexing" as "forward indexing". ↩
This is a useful convention. It has nice properties like s[x:y] + s[y:z] == s[x:z] for x < y < z. ↩