Using tags to filter lists in Python

| categories: python | tags:

Suppose you have a collection of items in a list, and you want to filter the list based on some properties of the items, and then accumulate some other property on the filtered items. We will look at some strategies for this here.

The particular application is that I have a list of courses that make up a curriculum, and I want to summarize the curriculum in a variety of ways. For example, I might want to know how many Gen Ed courses there are, or how many math, chemistry, biology and physics courses there are. I may want to know how may units overall are required.

A course will be represented by a class, which simply holds the data about the course. Here we consider the course number (which is really a string), the number of units of the course, and what category the course fits into. There will be 7 categories here: chemistry, biology, physics, math, engineering, general education, and free elective.

We will use some binary math to represent the categories. Essentially we define tags as if they are binary numbers, and then we can use binary operators to tell if an item is tagged a particular way. We use & to do a logical AND between a variable and a TAG. If it comes out True, the variable has that tag.

This works basically by defining a TAG like a binary number, e.g. TAG1 = 100, TAG2 = 010, TAG3 = 001. Then, if you have a number like 110, you know it is tagged with TAG1 and TAG2, but not TAG3. We can figure that out with code too.

100 & 110 = 100 = 1
010 & 110 = 010 = 2
print 1 & 3
print 2 & 3
1
2

Let us try out an example. The easiest way to define the tags, is as powers of two.

# define some tags
TAG1 = 2**0  # 100
TAG2 = 2**1  # 010

# Now define a variable that is "tagged"
a = TAG1
print a & TAG1 # remember that 0 = False, everything else is true
print a & TAG2
1
0

We can use multiple tags by adding them together.

# define some tags
TAG1 = 2**0  # 100
TAG2 = 2**1  # 010
TAG3 = 2**2  # 001

# Now define a variable that is "tagged"
a = TAG1 + TAG2  # 1 + 2 = 3 = 110 in binary
print a & TAG1 
print a & TAG2
print a & TAG3
1
2
0

You can see that the variable is not tagged by TAG3, but is tagged with TAG1 and TAG2. We might want to tag an item with more than one tag. We create groups of tags by simply adding them together. We can still check if a variable has a particular tag like we did before.

# define some tags
TAG1 = 2**0  # 100
TAG2 = 2**1  # 010
TAG3 = 2**2  # 001

# Now define a variable that is "tagged"
a = TAG1 + TAG2  # 1 + 2 = 3 = 110 in binary
print a & TAG1
print a & TAG2
print a & TAG3
1
2
0

It is trickier to say if a variable is tagged with a particular set of tags. Let us consider why. The binary representation of TAG1 + TAG2 is 110. The binary representation of TAG2 + TAG3 is 011. If we simply consider (TAG1 + TAG2) & (TAG2 & TAG3) we get 010. That actually tells us that we do not have a match, because 010 is not equal to (TAG2 & TAG3 = 011). In other words, the logical AND of the tag with some sum of tags is equal to the sum of tags when there is a match. So, we can check if that is the case with an equality comparison.

# define some tags
TAG1 = 2**0  # 100
TAG2 = 2**1  # 010
TAG3 = 2**2  # 001

# Now define a variable that is "tagged"
a = TAG1 + TAG2  # 1 + 2 = 3 = 110 in binary
print (a & (TAG1 + TAG2)) == TAG1 + TAG2
print (a & (TAG1 + TAG3)) == TAG1 + TAG3
print (a & (TAG2 + TAG3)) == TAG2 + TAG3
True
False
False

Ok, enough binary math, let us see an application. Below we create a set of tags indicating the category a course falls into, a class definition to store course data in attributes of an object, and a list of courses. Then, we show some examples of list comprehension filtering based on the tags to summarize properties of the list. The logical comparisons are simple below, as the courses are not multiply tagged at this point.

CHEMISTRY = 2**0
BIOLOGY = 2**1
PHYSICS = 2**2
MATH = 2**3
ENGINEERING = 2**4
GENED = 2**5
FREE = 2**6

class Course:
    '''simple container for course information'''
    def __init__(self, number, units, category):
        self.number = number
        self.units = units
        self.category = category
    def __repr__(self):
        return self.number


courses = [Course('09-105', 9, CHEMISTRY),
           Course('09-106', 9, CHEMISTRY),
           Course('33-105', 12, PHYSICS),
           Course('33-106', 12, PHYSICS),
           Course('21-120', 10, MATH),
           Course('21-122', 10, MATH),
           Course('21-259', 10, MATH),
           Course('06-100', 12, ENGINEERING),
           Course('xx-xxx', 9, GENED),     
           Course('xx-xxx', 9, FREE), 
           Course('03-232', 9, BIOLOGY)]

# print the total units
print ' Total units = {0}'.format(sum([x.units for x in courses]))

# get units of math required
math_units = sum([x.units  for x in courses if x.category & MATH])

# get total units of math, chemistry, physics and biology a | b is a
# logical OR. This gives a prescription for tagged with MATH OR
# CHEMISTRY OR PHYSICS OR BIOLOGY
BASIC_MS = MATH | CHEMISTRY | PHYSICS | BIOLOGY

# total units in those categories
basic_math_science = sum([x.units for x in courses if x.category & BASIC_MS])

print 'We require {0} units of math out of {1} units of basic math and science courses.'.format(math_units, basic_math_science)

# We are required to have at least 96 units of Math and Sciences.
print 'We are compliant on number of Math and science: ',basic_math_science >= 96
 Total units = 111
We require 30 units of math out of 81 units of basic math and science courses.
We are compliant on number of Math and science:  False

That is all for this example. With more data for each course, you could see what courses are taken in what semesters, how many units are in each semester, maybe create a prerequisite map, and view the curriculum by categories of courses, etc…

Copyright (C) 2014 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.5g

Discuss on Twitter