String Operations
Learning Objectives
At the end of this sub-unit, students should
- understand substring check operations.
- understand string operations involving indexing and slicing.
- know how to use indexing and slicing on string.
Ordering
As we have seen earlier, a string is often thought of as a sequence of characters. A sequence is an interesting kind of collection because it has an order. As such, we often call a sequence as an ordered collection.
We say that a collection \(C\) is an ordered collection if we can have a mapping from a natural number (i.e., a sequence of (0, 1, 2, ...)1) to the elements of \(C\). Since a string is an ordered collection, we can have a mapping from natural number to the elements of the string.
The element of a string is the character in the string itself. However, there is no special data type for character. A character is simply a string of one character. The implication is that an element of a string is a string.
So how are we going to do the mapping? The simplest way is to do the following mapping.
- The number 0 is mapped to the first character.
- The number 1 is mapped to the second character.
- The number 2 is mapped to the third character.
- :
- The number (n-1) is mapped to the nth character.
This gives us the following visualization for the string "ABCDEF"
.
Instead of memorizing the visualization, it is better to come up with a procedure to generate the mapping.
Mapping
Given a string \(S\), we can generate most of the mapping as follows.
- Start with the leftmost character in \(S\) and map it to the number 0.
- Move one character to the right and map it to the previous number + 1.
- Repeat until there are no more character.
It is always instructive to question our own assumption. Consider the procedure above. Is there a way that the procedure above can fail?
Unfortunately there is.
If we look at the first instruction, we can ask ourselves, what if there is no leftmost character.
One possible string that satisfies this condition is the empty string (i.e., ''
).
In fact, for an empty string, there is no possible mapping that you can make because not even the number 0 can be used.
Substring Check
The first thing that we will often do with a collection is to check if our collection contains an item.
Since a string is a collection of characters (which itself is a string of one character), we want to check if a string contains a character.
Luckily, we have such operator in Python called the in
operator.
There is also the opposite operator called not in
, but this is not as important as we will see below.
Single Character Check
The in
operator is actually more powerful than that.
The needle
need not be just a single character but it can be what we call as a contiguous substring.
For instance, 'BCD'
is contiguous in 'ABCDEF'
but 'ACE'
is not.
And so we expect 'BCD' in 'ABCDEF'
to return True
and 'ACE' in 'ABCDEF'
to return False
.
Contiguous Substring Check
Common Mistakes
This is a similar common mistake as "natural" reading of code but this is for sequence. Consider the following problem.
Check if the string
"0"
or"1"
is a substring of"2112"
.
Based on the "natural" reading, we may write the following incorrect code.
The problem here is the order of operation.
in
operator is a relational operator.
So it has a higher precedence than the logical operator or
.
The code above is actually equivalent to "0" or ("1" in "2112")
.
Based on short-circuit operation, the result will be "0"
.
Maybe adding explicit parentheses will help.
What is the problem now?
Again, there is short-circuit happening.
If we evaluate ("0" or "1")
by short-circuit, we will get "0"
.
So the code above reduces to "0" in "2112"
and we do not check if "1"
is in "2112"
.
In the end, do not be lazy and write the two required operations explicitly.
Indexing
Given the mapping from natural number to the element, the first string operation we will discuss is that of indexing.
This is simply a way to retrieve the element of the string given a natural number.
Since there is a mapping, we can retrieve the nth character in the string by supplying the number (n-1).
This operation is indexing and the general syntax is string[index]
where string
is a string and index
is an integer.
Indexing
Negative Indexing
Besides the mapping from natural number to the element, there is another mapping that we shall call negative indexing2. Unlike the mapping procedure that starts from the leftmost character mapped to 0, negative indexing starts from the rightmost character mapped to -1. We will then move one character to the left and subtract 1 from the previous mapping.
Negative Mapping
Given a string \(S\), we can generate most of the mapping as follows.
- Start with the rightmost character in \(S\) and map it to the number -1.
- Move one character to the left and map it to the previous number - 1.
- Repeat until there are no more character.
So given the same string 'ABCDEF'
, we can now give a more complete picture of all the indexing.
To contrast negative indexing with the standard indexing, we may use positive indexing to refer to the standard version of indexing.
Negative Indexing
We can give a brief properties of positive and negative indexing for a string \(S\) with \(n\) characters.
Indexing Properties
Let \(S\) be a string with exactly \(n\) characters.
Character | Positive Index | Negative Index |
---|---|---|
First | 0 |
-n |
Last (nth) | (n-1) |
-1 |
So the valid range of indexes are from -n
to (n-1)
.
In this range, from -n
to -1
are negative indexes and from 0
to (n-1)
are positive indexes.
There is a simple formula to connect positive and negative indexes. If we let the positive index be \(i\) (e.g., \(i = 2\)) and the negative index be \(j\) (e.g., \(j = -4\)), then for a string of length \(n\), we can write \(i = n + j\) (e.g., if \(n = 6\), then \(2 = 6 + (-4)\)).
Bad Practices
If \(S\) is a string with exactly \(n\) characters, we know from the indexing properties that the valid range of indexes are from -n
to (n-1)
.
If you try to index from outside this range, you will get IndexError
.
That is like having a collection of 5 marbles and ask for the 6th marble.
String Theory
There is a peculiar property of a string that behaves like an infinite sequence.
Consider a simple string 'X'
.
It is a string of one character, so the positive index is 0 and the negative index is -1.
What happen if we evaluate 'X'[0]
?
We get back 'X'
.
So now we can see that if we evaluate 'X'[0][0][0][0][0][0]
, it does not matter how many [0]
we add, the result will always be 'X'
.
It seems we cannot look beyond a string.
Slicing
Now that we know about positive/negative indexing, we can now talk about slicing. There are two variants of slicing shown below.
Slicing
Given a string s
and three integers start
, stop
, and step
, there are two variants of slicing.
Note that both slicing and indexing uses square bracket [...]
.
To differentiate between slicing and indexing, we need to look at the colon (i.e., :
).
If there is at least one colon, then it is a slicing.
If there is no colon, then it is an indexing.
Slicing is quite involved, so we will first make simplifications. We will then relax these simplifications to arrive at the most general form of slicing.
Simple Slicing
We start with the first variant without step
and we further impose the restriction that we will only use positive index.
The idea of a slice s[start : stop]
is to collect the characters from the string s
starting from start
up to stop
and create a new string.
But as a convention, we will always exclude the character at stop
3.
As we are collecting the characters and forming a new string, we will be moving to the right (i.e., towards larger index).
For illustration, consider the case of 'ABCDEF'[1:4]
.
Simple Slicing
To evaluate 'ABCDEF'[1:4]
, we start from the element at the index determined by start
, which is the character at index 1
(i.e., 'B'
).
We then move to the right as shown by the purple dashed line until we reach the element determined by stop
.
This is the character at index 4'.Remember that we always exclude the element at the index determined by
#!py3 stop`.
Hence, the result is 'BCD'
.
Before we give a more thorough procedure to do slicing, there are two interesting cases to consider.
- What happen if
start >= stop
? Since we are always moving to the right then the element atstop
is unreachable. And because we are always excluding the element atstop
, the result should be an empty string. We will extend the unreachable case when we relax more restrictions. - What happen if
stop
is to the right of the rightmost character? In the case of indexing we haveIndexError
. But for slicing, there is no error and we simply include the last character.
These two cases are quite interesting because this means that slicing will never produce IndexError
.
One way to think about it is like slicing a cake.
We make one slice in the middle of the cake and another slice outside of the plate, does not even touch the cake.
What do we get in the end?
The cake starting from the first cut.
Interesting Cases for Simple Slicing
For now, we have no possibility for start
to be on the left of the leftmost element.
But this will occur later and it will be another interesting case.
Negative Index
Let us now relax the requirement that we will only use positive index. How can we handle negative index. Actually, the way we worded the steps above is already indicative of how to handle negative index.
... we start from the element at the index determined by
start
... until we reach the element determined bystop
.
Notice how in our steps, we have never used algebra like, the next element is start + 1
.
Although this makes the explanation simple for the simple case, it cannot be easily extended to negative index.
In particular, the first interesting case where the stop
is unreachable from the start
does not depend on the actual value.
Instead, it depends on the element itself.
Incorrect Thinking
Consider the case of 'ABCDEF'[1:-2]
.
If our mental model is a the algebraic model, then we may erroneously believe that this is part of the unreachable case because 1 > -2
.
Similar erroneous thinking is to consider what happen if we keep on addng +1
to 1
.
Since we will never reach -2
, we erroneously conclude that it is unreachable.
The actual result is explained next.
What this means is that if we consider 'ABCDEF'[1:-2]
, we use the index only to determine the first and the last element.
Now it does not matter if we use positive or negative index because we are not doing an algebraic manipulation.
Interestingly, the element at index -2
is the same element at index 4
.
So the result is exactly the same as our previous case!
The result is 'BCD'
.
You may think a simpler explanation will be to simply convert negative index into positive index.
But unfortunately, that only works if the index is within the valid range!
And as we have seen above, slicing may use index that are beyond the valid range.
So we opt to explain this in a more unified way by stating that the index (i.e., start
and stop
) are only used to determine the element.
Afterwards, these information can be ignored.
Consider the case of 'ABCDEF'[-10:-4]
.
While we can convert -4
to its positive index version of 2
, we cannot conver -10
.
If we simply plug it in into the formula \(i = n + j\), then we get \(i = 6 + (-10)\).
\(i = -4\) is definitely not a positive index.
This brings us to the extension of the previous interesting case.
What happen if start
is on the left of the leftmost element.
In this case, we simply move the start
to be on the first element.
So the result of 'ABCDEF'[-10:-4]
is 'AB'
.
Steps
Now we can extend this to include step
.
So far, we have been moving one step to the right.
We can then say that by convention, this is equivalent to step = 1
.
So step = 1
is moving to the right by 1.
What about step = 2
.
We can say that it is moving to the right by 2.
Generalizing this, we can say that if step = k
, then we move to the right by k.
Now, what if our step is negative?
Say, step = -1
?
Since we are looking at directions, we can say that it is as if we are moving in the opposite direction.
So for step = -k
, instead of moving to the right by k, we simply move to the left by k.
This gives us the ability to move backward!
Slicing with Steps
Consider 'ABCDEF'[-2:1:-2]
.
What will be the result?
First, we determine the start and stop element. Secondly, we determine if this is unreachable or not. This is actually reachable because the stop element is to the left of the start element and we are moving to the left. Finally, we start collecting with a step of 2. But keep in mind we are moving to the left as shown in the image above.
The first character we collect is 'E'
and the second character is 'C'
.
We stop here because we will pass the stop element.
So the result is 'EC'
.
Now the previous extension to the interesting case becomes more interesting. Recap that we mentioned the following.
In this case, we simply move the start to be on the first element.
With the possibility of a negative step
, we will also need to consider the case of start
on the right of the rightmost element but with negative step
.
In this case, we start from the last element.
It is always nice to give names to operations so that we can all agree what we meant when we say things. This is similar to how we say that a case is unreachable. Names are powerful and do no be afraid to invent names, especially for concepts that we did not give names. In the spirit of naming things, let's name this process of starting from first/last element as restricting the slice. Of course we will have to give a more proper definition, but we can leave it for later.
Slicing with Restriction
Consider 'ABCDEF'[6:-5:-2]
.
What will be the result?
Here we have two possibilities.
The first is shown on the left where we imagine that there is an element at index 6
and simply start from there.
However, we will only start collecting once we reach valid index.
In this case, the result should be 'EC'
.
The second possibility is to restrict the start to within the valid index first.
So instead of starting at the imagined element, we simply move the start to the nearest valid index.
This case is shown on the right and the result should be 'FD'
.
We have two hypotheses, so we check.
The answer is the second possibility. This is in line with our explanation that we simply start from first/last element.
It is good to look at it in terms of procedure.
We can summarize the behavior of slicing as "Basic Slicing" below assuming that all the values are given.
Note that the procedure below works if step
is given.
Otherwise, assume that step
is 1
.
Basic Slicing Procedure
- Determine the start and stop element.
- Determine if the stop element is unreachable.
- Unreachable if
- stop is to the left of start and step is positive (i.e., move to the right).
- stop is to the right of start and step is negative (i.e., move to the left).
- If unreachable then we return empty string and go to ...
- Unreachable if
- Restrict the start element.
- If the start element is outside of valid index, we start from the nearest valid index.
- Collect character by character and construct a new string.
- The next character is determined by the step.
- If step is positive
+k
, we move to the right by k steps. - If step is negative
-k
, we move to the left by k steps.
- If step is positive
- We stop when
- we go outside of valid range, or
- we reach the stop element, or
- we pass the stop element.
- The next character is determined by the step.
The procedure may look complicated but that is because we need to explain the two terminologies used: unreachable and restricting. Otherwise, the procedure should hopefully be rather straightforward.
Missing Values
Finally, we need to consider missing values. This final relaxation is actually rather straightforward. We simply fill in the missing values with the default values according to the table below. The default value table is valid for a string with \(n\) character.
Step | Start | Stop | ||
---|---|---|---|---|
1 | Step > 0 | 0 (i.e., leftmost) | \(n\) (i.e., right of rightmost) | |
Step < 0 | -1 (i.e., rightmost) | \(-(n + 1)\) (i.e., left of leftmost) |
The default value for step
is simply 1.
On the other hand, the default value for start
and stop
depends on whether step
is positive or negative.
An easy way to remember them is shown in the parentheses.
Slicing
In general, we have the following useful shorthands.
s[:-k:]
removes the lastk
elements from the strings
.s[k1:-k2:]
removes the firstk1
and lastk2
elements from the strings
.
If you are unsure, we would recommend using only positive index and fill in all the values.
Functions on String
We have used the placeholder \(n\) for the number of characters in the string. But how do we know the number of characters in a string? There is a function that we can invoke to get the number of characters in the string.
First, a little bit about functions.
We will discuss functions in more details in later units.
For now, all we need to know is how to invoke functions.
If our function name is f
and it takes one argument, we can invoke the function by appending parentheses with the arguments inside.
So to invoke f
with argument arg
, we write f(arg)
.
Note that the argument must be supplied by us when we invoke.
Another thing is that even if we do not have any argument, we still need to provide parentheses.
So if we have a function g
that does not take in any argument, we can invoke it by writing g()
.
In short, a function can be described by the name, the arguments needed, and the result which we will call the return value.
In the case of the function to get the number of characters of a string, the information is as follows.
- Name:
len
- Argument: Only one argument, which is the string.
- Return Value: The number of characters of the string argument.
Bad Practice
Note that the len
only works on sequence.
We will show more sequence later, but note that it will not work for boolean, integer, or floating point number.
Summary
Summary of String Operations
Operation | Syntax |
---|---|
Substring Check | needle in haystack needle not in haystack |
Indexing | s[idx] |
Slicing | s[start : stop] s[start : stop : step] |
Length | len(s) |
Summary of String Index
Given a string with \(n\) characters, we can assign two kinds of index (positive and negative index) as follows.
- Positive Index
- Start from leftmost character, assign the value 0.
- Move to the right, and assign the value of the value on the left plus one.
- Negative Index
- Start from rightmost character, assign the value -1.
- Move to the left, and assign the value of the value on the right minus one.
+ve | 0 | 1 | ... | n - 2 | n - 1 |
---|---|---|---|---|---|
-ve | -n | -n + 1 | ... | -2 | -1 |
Valid indexes are between \(-n\) and \(n - 1\) (inclusive) with \(-n\) to \(-1\) belonging to negative index and \(0\) to \(n - 1\) belonging to positive index. We can convert valid negative index \(j\) into valid positive index \(i\) (and vice versa) using the formula
Summary of String Slicing
For slicing, we have the following default values for missing values.
Step | Start | Stop | ||
---|---|---|---|---|
1 | Step > 0 | 0 (i.e., leftmost) | \(n\) (i.e., right of rightmost) | |
Step < 0 | -1 (i.e., rightmost) | \(-(n + 1)\) (i.e., left of leftmost) |
-
We use the definition of natural number starting from 0 instead of 1. Also, 0 is a perfectly good number and we don't want to waste good numbers. ↩
-
An alternative name we may use is the "backward indexing". In this case, we can also name "positive indexing" as "forward indexing". ↩
-
This is a useful convention. It has nice properties like
s[x:y] + s[y:z] == s[x:z]
forx
<y
<z
. ↩