My package sysv_ipc celebrates its 11th birthday tomorrow, so I thought I would give it a mailing list as a gift. I didn’t want its sibling posix_ipc to get jealous, so I created one for that too.
July 2018 update: I’ll be giving a talk based on this guide at PyOhio next week. If you’re there, please come say hello!
It’s not always obvious, but migrating from Python 2 to 3 doesn’t have to be an overwhelming effort spike. I’ve done Python 2-to-3 migration assessments with several organizations, and in each case we were able to turn the unknowns into a set of straightforward to-do lists.
I’ve written a Python 2-to-3 migration guide [PDF] to help others who want to make the leap but aren’t sure where to start, or have maybe already begun but would like another perspective. It outlines some high level steps for the migration and also contains some nitty-gritty technical details, so it’s useful for both those who will plan the migration and the technical staff that will actually perform it.
The (very brief) summary is that most of the work can be done in advance without sacrificing Python 2 compatibility. What’s more, you can divide the work into manageable chunks that you can tick off one by one as you have time to work on them. Last but not least, many of the changes are routine and mechanical (for example, changing the print statement to a function), and there are tools that do a lot of the work for you.
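To make "routine and mechanical" concrete, here's the print change in miniature. The old Python 2 statement

print "Hello, world"

becomes a function call which, thanks to the __future__ import (a no-op under Python 3), runs unchanged on both Python 2 and 3 –

from __future__ import print_function

print("Hello, world")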
I recently tripped over my reliance on a simple (and probably obscure) feature in Python’s distutils that setuptools doesn’t support. The result was that I created a tarball for my posix_ipc module that lacked critical files. By chance, I noticed when uploading the new tarball that it was about 75% smaller than the previous version. That’s a red flag!
Fortunately, the bad tarball was only on PyPI for about 3 minutes before I noticed the problem and removed the release.
I made debugging harder on myself by stepping away from the project for a long time and forgetting what changes I’d made since the previous release.
Background
In February 2014, I (finally) made my distribution PyPI-friendly. Prior to that I'd built my distribution tarballs with a custom script that explicitly listed each file to be included in the tarball. The typical, modern, and PyPI-friendly way to build tarballs is by writing a MANIFEST.in file that a distribution tool (like Python's distutils) interprets into a MANIFEST file. A command like `python setup.py sdist` reads the manifest and builds the tarball.
That’s the method to which I switched in February 2014, with one exception—since my custom script already contained an explicit list of files, it was easier to write a MANIFEST file directly and skip the intermediate MANIFEST.in. That works fine with distutils.
I released version 1.0.0 of posix_ipc in March of 2015, and haven’t needed to make any changes to the code until just now (the beginning of 2018). However, in February 2016, I made a small change to setup.py that I thought was harmless. (Ha!)
I added a conditional import of setuptools so that I could build wheels. (Side note: I really like wheels!) The change allows me to build posix_ipc wheels on my laptop where I can ensure setuptools is available, but otherwise falls back on Python’s distutils which works just fine for everything else I need setup.py to do, including installing from a tarball. The code looks like this —
try:
    # Use setuptools when it's available so I can build wheels...
    import setuptools as distutools
except ImportError:
    # ...otherwise fall back on distutils, which handles everything else.
    import distutils.core as distutools
The Problem
Just a few days ago, I released a maintenance release of posix_ipc, and it was then I noticed that the tarballs I built with my usual python setup.py sdist command were 75% smaller and missing several critical files. Because it had been 23 months since I made my “harmless” change to setup.py, the switch from using distutils to setuptools wasn’t exactly fresh in my mind.
However, some examination of my commit log and the realization that this was the first release I'd made since that change gave me a suspicion. Grepping through setuptools' code revealed no references to MANIFEST, only MANIFEST.in, and the setuptools documentation confirmed my suspicion:
[B]e sure to ignore any part of the distutils documentation that deals with MANIFEST or how it’s generated from MANIFEST.in; setuptools shields you from these issues and doesn’t work the same way in any case. Unlike the distutils, setuptools regenerates the source distribution manifest file every time you build a source distribution, and it builds it inside the project’s .egg-info directory, out of the way of your main project directory.
So that was the problem—setuptools doesn’t look for a MANIFEST file, only MANIFEST.in. Since I had the former but not the latter, setuptools used its defaults instead of my list of files in MANIFEST.
The Solution
This part was easy. I converted my MANIFEST file to a MANIFEST.in which works with both setuptools and distutils. That’s probably a more robust solution than the hardcoded list in MANIFEST anyway.
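For anyone making the same conversion: where MANIFEST is a literal list of files, MANIFEST.in is a list of directives that the tooling expands into the file list at sdist time. A minimal sketch (these file names are illustrative, not posix_ipc's actual list) –

include README LICENSE ChangeLog
include *.py *.c *.h
recursive-include demo *.py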
I’m pleased that posix_ipc has been stable and well-behaved for such a long time, but these long breaks between releases mean a certain amount of mental rust has always accumulated when it’s time for the next one.
This is a followup to an earlier post about using Python to measure the “Anglo-Saxonicity” of a text. I’ve used my code to analyze the Baby version of the British National Corpus, and I’ve found some interesting results.
Thanks to a suggestion from Ben Sizer, I decided to analyze the British National Corpus. I started with the ‘baby’ corpus which, as you might imagine, is smaller than the full corpus.
The full corpus is described as a "100 million word snapshot of British English at the end of the twentieth century"; the Baby corpus is a sampling of it. It categorizes text samples into four groups: academic, conversations, fiction, and news. Below are stack plots showing the percentage of Anglo-Saxon, non-Anglo-Saxon, and unknown words for each document in each of the four groups. The Y axis shows the percentage of words in each category. The numbers along the X axis identify individual documents within the group.
I’ve deliberately given the charts non-specific names of Group A, B, C, and D so that we can play a game. :-)
Before we get to the game, here are the averages for each group in table form. (The numbers might not add exactly to 100% due to rounding.)
           Anglo-Saxon (%)   Non-Anglo-Saxon (%)   Unknown (%)
Group A         67.0               17.7               15.3
Group B         56.1               25.8               18.1
Group C         72.9               13.2               13.9
Group D         58.6               22.0               19.3
Keep in mind that “unknown” words represent shortcomings in my database more than anything else.
The Game
The Baby BNC is organized into groups of academic, conversations, fiction, and news. Groups A, B, C, and D each represent one of those groups. Which do you think is which?
The answer to the game, and a discussion of the results, follows below.
Answers
                    Anglo-Saxon (%)   Non-Anglo-Saxon (%)   Unknown (%)
A = Fiction              67.0               17.7               15.3
B = Academic             56.1               25.8               18.1
C = Conversations        72.9               13.2               13.9
D = News                 58.6               22.0               19.3
Discussion
With the hubris that only 20/20 hindsight can provide, I’ll say that I don’t find these numbers terribly surprising. Conversations have the highest proportion of Anglo-Saxon (72.9%) and the lowest of non-Anglo-Saxon (13.2%). Conversations are apt to use common words, and the 100 most common words in English are about 95% Anglo-Saxon. The relatively fast pace of conversation doesn’t encourage speakers to pause to search for those uncommon words lest they bore their listener or lose their chance to speak. I think the key here is not the fact that conversations are spoken, but that they’re impromptu. (Impromptu if you’re feeling French, off-the-cuff if you’re more Middle-English-y, or extemporaneous if you want to go full bore Latin.)
Academic writing is on the opposite end of the statistics, with the lowest portion of Anglo-Saxon words (56.1%) and the highest non-Anglo-Saxon (25.8%). Academic writing tends to be more ambitious and precise. Stylistically, it doesn’t shy away from more esoteric words because its audience is, by definition, well-educated. It doesn’t need to stick to the common core of English to get its point across. In addition, those who shaped academia were the educated members of society, and for many centuries education was tied to the church or limited to the gentry, and both spoke a lot of Latin and French. That has probably influenced even the modern day culture of academic writing.
Two of the academic samples managed to use fewer than half Anglo-Saxon words: a sample from Colliding Plane Waves in General Relativity (a subject Anglo-Saxons spent little time discussing, I'll wager) and a sample from The Lancet, the British medical journal (49% and 47% Anglo-Saxon, respectively). It's worth noting that these samples also displayed the highest and fifth-highest percentages of words of unknown etymology (26% and 21%, respectively) of the 30 samples in this category. A higher proportion of unknowns depresses the results in the other two categories.
Fiction rests in the middle of this small group of 4 categories, and I’m a little surprised that the percentage of Anglo-Saxon is as high as it is. I feel like fiction lends itself to the kind of description that tends to use more non-Anglo-Saxon words, but in this sample it’s not all that different from conversation.
News stands out for having barely more Anglo-Saxon words than academic writing, and also the highest percentage of words of unknown etymological origin. The news samples are drawn principally from The Independent, The Guardian, The Daily Telegraph, The Belfast Telegraph, The Liverpool Daily Post and Echo, The Northern Echo, and The Scotsman. It would be interesting to analyze each of these papers independently to see if they differ significantly.
Future
My hypothesis that conversations have a high percentage of Anglo-Saxon words because they’re off-the-cuff rather than because they’re spoken is something I can challenge with another experiment. Speeches are also spoken, but they’re often written in advance, without the pressure of immediacy, so the author would have time to reach for a thesaurus. I predict speeches will have an Anglo-Saxon/non-Anglo-Saxon profile closer to that of fiction than of either of the extremes in this data. It might vary dramatically based on speaker and audience, so I’ll have to choose a broad sample to smooth out biases.
I would also like to work with the American National Corpus.
Stay tuned, and let me know in the comments if you have observations or suggestions!
I took part in the Raleigh March for Science last Saturday. For the opportunity to learn about it, participate in it, photograph it, share it with you — oh, and also for, you know, being alive today — thanks, science!
This is a first look at a work in progress. I’m using Python to study text from an etymological perspective. Specifically, I’m measuring how many words in a given English language text have Anglo-Saxon origin. Many people (including myself) think that Anglo-Saxon words convey a different sense than their counterparts of French/Latin origin. To demonstrate the point in a small way, I’ve included a Latin and Anglo-Saxon version of each heading in this blog post.
French was introduced to English as the language of conquerors and nobility. French was also the language of some European royalty in the 18th and 19th century, further adding to its reputation as a language associated with high status. Even today, English words with French origins often have higher cultural status than their counterparts with Anglo-Saxon origins (think cuisine versus cooking, illumination versus light, create versus make, and escargot versus snail). By contrast, the Anglo-Saxon words are often considered more visceral (think sea versus ocean, sweat versus perspire, and free versus emancipated — more on that last pair in a moment).
For instance, when taunting someone, you reach for blunt Anglo-Saxon words. “Your mother was a hamster, and your father smelled of elderberries!” is 100% Anglo-Saxon, except for “elderberries” which was coined in Middle English from “elder” and “berry”, both of Anglo-Saxon origin.
Legal documents and government issuances, on the other hand, tend to include more words of Latin/French origin. It’s no coincidence that the Latin/French words “Emancipation Proclamation” describe a legal act, but if you want to stir the heart about emancipation, you say something like “Free at last!”(1) which is all Anglo-Saxon.
Others have written more eloquently than I about how word origin influences tone (Annalisa Quinn at NPR, Gemma Varnom, and M. Birch, to suggest a few), so I won’t belabor the point more than I already have. But I wanted to talk about how it inspired the project I’ve been working on.
The Project (The Work)
I should preface this by saying that I Am Not A Linguist, and I don’t even play one on TV.
I thought it would be interesting to perform lexicographical analysis of text from an etymological perspective. My etymological categorization is necessarily simple. When I look at a text, I put each word into one of three etymological categories: Anglo-Saxon, non-Anglo-Saxon, or unknown. From this rough grouping I generate statistics that allow me to compare one text to another.
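In code, the heart of that grouping is nothing more than a lookup and a tally. Here's a minimal sketch, with a tiny hypothetical ETYMOLOGIES dict standing in for my real database –

import collections

# Hypothetical stand-in for my etymological database.
ETYMOLOGIES = {'sea': 'anglo-saxon', 'ocean': 'non-anglo-saxon'}

def categorize(word):
    """Return 'anglo-saxon', 'non-anglo-saxon', or 'unknown'."""
    return ETYMOLOGIES.get(word.lower(), 'unknown')

def tally(words):
    """Return the percentage of words in each etymological category."""
    counts = collections.Counter(categorize(word) for word in words)
    return {category: 100 * count / len(words)
            for category, count in counts.items()}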
For instance, does one author consistently use more Anglo-Saxon words than other authors? Does an author’s usage of Anglo-Saxon words change from one work to another? Also of interest to me is the etymology of words as the book progresses from front to back. Do the relative frequencies of etymologies change as the book progresses towards its exciting conclusion? For authors writing in English as a second language, is their word selection influenced by their first language?
All of the questions above can be explored with the tool I’ve written. It’s easier to show the tool’s output than describe it, so here’s an analysis of Lewis Carroll’s 1865 work “Alice’s Adventures in Wonderland”.
The graph below shows the relative frequency of the three etymological categories as the book progresses from beginning to end.
This graph shows the relative frequency of the three etymological categories as counting statistics for various part-of-speech categories.
The table below is a more detailed version of the chart immediately above. Some percentages may not add up to 100% due to rounding.
             Total   %age of All Words   Anglo-Saxon   non-Anglo-Saxon   Unknown
All Words    26624         100%          18233 (68%)     3812 (14%)      4579 (17%)
Unique        3528          13%           1354 (38%)      899 (25%)      1275 (36%)
Nouns         8522          32%           4521 (53%)     2354 (27%)      1647 (19%)
Verbs         5479          20%           2994 (54%)      565 (10%)      1920 (35%)
Adjectives    1639           6%            896 (54%)      375 (22%)       368 (22%)
Adverbs       1974           7%           1348 (68%)      420 (21%)       206 (10%)
Other         9010          33%           8474 (94%)       98 (1%)        438 (4%)
Observations (What I See)
There are some minor observations to be made here, but the strength of this tool will be in comparative analysis. It's hard to draw conclusions from one analysis before I have an idea of what's typical.
For instance, at first glance, the ratio of Anglo-Saxon to non-Anglo-Saxon words looks dramatic, but this says more about English than it does about Carroll. The most common words in English are overwhelmingly Anglo-Saxon in origin. (2) For the small sample size of works I’ve processed so far (just 8 in total), I can see that it’s common for roughly three quarters of the words to be Anglo-Saxon. Alice in Wonderland isn’t an outlier by that standard.
We can also see that the frequency of Anglo-Saxon words decreases slightly throughout the book. This is the kind of trend that I find interesting, but in this case it’s due to an increase in the number of words of unknown etymology. Sometimes a word’s etymology is truly unknown. More often, though, the etymology is classified as unknown for other reasons. Most likely, it’s simply not in my etymological database (which isn’t very complete yet). Also, the word could be a proper noun, an invented word (like “woodshadows” from James Joyce’s Ulysses), or a word for which the etymology is ambiguous. An example of this last category is “bank” which is Anglo-Saxon in origin when referring to the side of a river, but French/Italian in origin when referring to a place that handles money.
At present, the quantity of words classified as “unknown” is too large for my tastes, and I plan to reduce it by improving both my database and the tool.
Verbs are overrepresented in the “unknown” category. My guess is that this is an artifact of my stemmer having difficulty stemming verbs. (I’m currently using the Snowball Stemmer from NLTK.)
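For example, a stemmer strips suffixes; it doesn't look words up in a dictionary. Regular inflections reduce to a recognizable stem, but irregular verb forms pass through untouched and then match nothing in my database. A quick illustration –

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

print(stemmer.stem('walking'))  # 'walk' -- can match a headword
print(stemmer.stem('ran'))      # 'ran' -- never maps back to 'run'
print(stemmer.stem('said'))     # 'said' -- never maps back to 'say'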
As you can see, at this point it’s easier to draw conclusions about the representation of the data than it is about the data themselves. That leads me to the next (and final) topic in this post.
Future (What’s to Come)
As I said in the introduction, this is an early look at a work in progress. Here are some of the things I'd like to add –
Better etymological data
Large scale comparisons of text to look for trends (across authors, genres, etc.)
More numeric (rather than visual) descriptions of the data to facilitate automated comparison. One idea is to add the mean and standard deviation of the percentage of Anglo-Saxon words (see the sketch after this list).
Open sourcing
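For those numeric descriptions, Python's standard library statistics module would get me started. Here's a sketch using hypothetical per-text Anglo-Saxon percentages –

import statistics

# Hypothetical Anglo-Saxon percentages for a handful of texts.
anglo_saxon_pcts = [68.0, 74.0, 71.5, 76.2]

print(statistics.mean(anglo_saxon_pcts))   # central tendency
print(statistics.stdev(anglo_saxon_pcts))  # spread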
If you have any suggestions on how to use this tool or make it more interesting, I'd love to hear them in the comments below. I moderate all comments to filter spam, which is yet another Viking influence on England.
Endnotes
Like English itself, “Endnote” is an etymological hybrid. “End” is of Anglo-Saxon origin, while “note” comes from Old French/Latin.
1. Martin Luther King, Jr. isn’t the only person to have said “Free at last!”, but his use of it is perhaps the most famous. His “I Have a Dream” speech makes brilliant use of etymological contrasts. Many of his memorable phrases in that speech (“I have a dream today”, “Let freedom ring”, “Free at last”) are Anglo-Saxon.
2. In 2014 I pulled from Wikipedia a list of the 100 most common English words. At the time, it contained just four non-Anglo-Saxon words. They were “just” (ME < Latin), “people” (ME < Anglo-French < Latin), “use” (ME < Old French, replaced OE brucan, cognate w/modern Swedish bruk-), and “because” (ME < Fr ‘par cause’). There are lots of ways to count the 100 most common words, and doubtless the list would have been different in Carroll’s day. But my guess is that the presence of Anglo-Saxon hasn’t changed dramatically from that 96% regardless of when and how one counts.
Last night I gave a slightly expanded version of the talk I first gave at PyData Carolinas 2016 about connecting Python to compiled languages like C, C++, and Fortran. (Slides from that talk are here.) I appreciate the time and attention of everyone who attended, especially Francois Dion for organizing and reminding us of some of the interesting new things in Python 3.5 and 3.6.
Last night’s talk wasn’t recorded, but you can see the version of the talk I gave at PyData at https://www.youtube.com/watch?v=aUSokzzsEko , or you can watch the embedded version below.
Part of learning how to use any tool is exploring its strengths and weaknesses. I’m just starting to use the Python library Pandas, and my naïve use of it exposed a weakness that surprised me.
Background
I have a long list of objects, each with the properties “color” and “shape”. I want to count the frequency of each color/shape combination. A sample of what I’m trying to achieve could be represented in a grid like this –
        circle  square  star
blue         8      41    18
orange       5      33    25
red         53      64    58
At first I implemented this with a dictionary of collections.Counter instances where the top level dictionary is keyed by shape, like so –
import collections
SHAPES = ('square', 'circle', 'star', )
frequencies = {shape: collections.Counter() for shape in SHAPES}
Then I counted my frequencies using the code below. (For simplicity, assume that my objects are simple 2-tuples of (shape, color)).
for shape, color in all_my_objects:
    frequencies[shape][color] += 1
So far, so good.
Enter the Pandas
This looked to me like a perfect opportunity to use a Pandas DataFrame which would nicely support the operations I wanted to do after tallying the frequencies, like adding a column to represent the total number (sum) of instances of each color.
It was especially easy to try out a DataFrame because my counting loop (for shape, color in all_my_objects) wouldn't change, only the definition of frequencies. (Note that the code below requires I know in advance all the possible colors I can expect to see, which the Dict + Counter version does not. This isn't a problem for me in my real-world application.)
import pandas as pd

COLORS = ('red', 'blue', 'orange')

frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0,
                           dtype='int')

for shape, color in all_my_objects:
    frequencies[shape][color] += 1
It Works, But…
Both versions of the code get the job done, but using the DataFrame as a frequency counter turned out to be astonishingly slow. A DataFrame is simply not optimized for repeatedly accessing individual cells as I do above.
How Slow is it?
To isolate the effect pandas was having on performance, I used Python’s timeit module to benchmark some simpler variations on this code. In the version of Python I’m using (3.6), the default number of iterations for each timeit test is 1 million.
First, I timed how long it takes to increment a simple variable, just to get a baseline.
Second, I timed how long it takes to increment a variable stored inside a collections.Counter inside a dict. This mimics the first version of my code (above) for a frequency counter. It’s more complex than the simple variable version because Python has to resolve two hash table references (one inside the dict, and one inside the Counter). I expected this to be slower, and it was.
Third, I timed how long it takes to increment one cell inside a 2×2 NumPy array. Since Pandas is built atop NumPy, this gives an idea of how the DataFrame’s backing store performs without Pandas involved.
Fourth, I timed how long it takes to increment one cell inside a 2×2 Pandas DataFrame. This is what I had used in my real code.
Raw Benchmark Results
Here’s what timeit showed me. Sorry for the cramped formatting.
$ python
Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit('data += 1', setup='data=0')
0.09242476700455882
>>> timeit.timeit('data[0][0]+=1',setup='from collections import Counter;data={0:Counter()}')
0.6838196019816678
>>> timeit.timeit('data[0][0]+=1',setup='import numpy as np;data=np.zeros((2,2))')
0.8909121589967981
>>> timeit.timeit('data[0][0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
157.56428507200326
>>>
Benchmark Results Summary
Here’s a summary of the results from above (decimals truncated at 3 digits). The rightmost column shows the results normalized so the fastest method (incrementing a simple variable) equals 1.
                    Actual (seconds)   Normalized (simple variable = 1)
Simple variable          0.092                    1
Dict + Counter           0.683                    7.398
Numpy 2D array           0.890                    9.639
Pandas DataFrame       157.564                 1704.784
As you can see, resolving the index references in the middle two cases (Dict + Counter in one case, NumPy array indices in the other) slows things down, which should come as no surprise. The NumPy array is a little slower than the Dict + Counter.
The DataFrame, however, is roughly 175 to 230 times slower than either of those two methods. Ouch!
I can’t really even give you a graph of all four of these methods together because the time consumed by the DataFrame throws the chart scale out of whack.
Here’s a bar chart of the first three methods –
Here’s a bar chart of all four –
Why Is My DataFrame Access So Slow?
One of the nice features of DataFrames is that they support dictionary-like labels for rows and columns. For instance, if I define my frequencies to look like this –
>>> SHAPES = ('square', 'circle', 'star', )
>>> COLORS = ('red', 'blue', 'orange')
>>> pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')
square circle star
red 0 0 0
blue 0 0 0
orange 0 0 0
>>>
Then frequencies['square']['orange'] is a valid reference.
The square brackets accept more than a single label, though: they also accept a list of labels, a label slice, a boolean array, and a callable. Here are those techniques applied in order to the frequencies DataFrame so you can see how they work –
>>> frequencies['star']
red 0
blue 0
orange 0
Name: star, dtype: int64
>>> frequencies[['square', 'star']]
square star
red 0 0
blue 0 0
orange 0 0
>>> frequencies['red':'blue']
square circle star
red 0 0 0
blue 0 0 0
>>> frequencies[[True, False, True]]
square circle star
red 0 0 0
orange 0 0 0
>>> frequencies[lambda x: 'star']
red 0
blue 0
orange 0
Name: star, dtype: int64
This flexibility has a price. Slicing (which is what is invoked by the square brackets) calls an object's __getitem__() method. The parameter to __getitem__() is whatever was inside the square brackets. A DataFrame's __getitem__() has to figure out what the passed parameter represents. Determining whether the parameter is a label reference, a callable, a boolean array, or something else takes time.
If you look at the DataFrame’s __getitem__() implementation, you can see all the code that has to execute to resolve a reference. (I linked to the version of the code that was current when I wrote this in February of 2017. By the time you read this, the actual implementation may differ.) Not only does __getitem__() have a lot to do, but because I’m accessing a cell (rather than a whole row or column), there’s two slice operations, so __getitem__() gets invoked twice each time I increment my counter.
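You can watch the double invocation happen with a toy class that logs its __getitem__() calls (a sketch, obviously nothing like Pandas internals) –

class Chatty:
    def __init__(self, value):
        self.value = value

    def __getitem__(self, key):
        print('__getitem__ called with {!r}'.format(key))
        return self.value

counters = Chatty(Chatty(0))
counters['square']['orange']  # logs two calls, one per pair of brackets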
This explains why the DataFrame is so much slower than the other methods. The dictionary and Counter both only support key lookup in a hash table, and a NumPy array has far fewer slicing options than a DataFrame, so its __getitem__() implementation can be much simpler.
Better DataFrame Indexing?
DataFrames support a few accessors that exist explicitly to support "fast" getting and setting of scalars. Those accessors are .at[] (for label lookups) and .iat[] (for integer-based index lookups). Pandas also provides get_value() and set_value(), but those methods are deprecated in the version I have (0.19.2).
"Fast" is how the Pandas documentation describes these accessors. Let's use timeit to get some hard data. I'll try .at[] and .iat[]; I'll also try get_value()/set_value() even though they're deprecated.
>>> timeit.timeit("data.at['red','square']+=1",setup="import pandas as pd;data=pd.DataFrame(columns=('square','circle','star'),index=('red','blue','orange'),data=0,dtype='int')")
36.33179204000044
>>> timeit.timeit('data.iat[0,0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
42.01523362501757
>>> timeit.timeit('data.set_value(0,0,data.get_value(0,0)+1)',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
15.050199927005451
>>>
These methods are better, but they’re still pretty bad. Let’s put those numbers in context by comparing them to other techniques. This time, for normalized results, I’m going to use my Dict + Counter method as the baseline of 1 and compare all other methods to that. The row “DataFrame (naïve)” refers to naïve slicing, like frequencies[0][0].
                      Actual (seconds)   Normalized (Dict + Counter = 1)
Dict + Counter             0.683                   1
Numpy 2D array             0.890                   1.302
DataFrame (get/set)       15.050                  22.009
DataFrame (at)            36.331                  53.130
DataFrame (iat)           42.015                  61.441
DataFrame (naïve)        157.564                 230.417
The best I can do with a DataFrame uses deprecated methods, and is still over 20 times slower than the Dict + Counter. If I use non-deprecated methods, it’s over 50 times slower.
Workaround
I like label-based access to my frequency counters, I like the way I can manipulate data in a DataFrame (not shown here, but it’s useful in my real-world code), and I like speed. I don’t necessarily need blazing fast speed, I just don’t want slow.
I can have my cake and eat it too by combining methods. I do my counting with the Dict + Counter method, and use the result as initialization data to a DataFrame constructor.
import collections

import pandas as pd

SHAPES = ('square', 'circle', 'star', )
frequencies = {shape: collections.Counter() for shape in SHAPES}

for shape, color in all_my_objects:
    frequencies[shape][color] += 1

frequencies = pd.DataFrame(data=frequencies)
The frequencies DataFrame now looks something like this –
        circle  square  star
blue         8      41    18
orange       5      33    25
red         53      64    58
The rows and columns appear in essentially random order; they’re ordered by whatever order Python returns the dict keys during DataFrame initialization. Getting them in a specific order is left as an exercise for the reader.
There's one more detail to be aware of. If a particular (shape, color) combination doesn't appear in my data, it will be represented by NaN in the DataFrame. They're easy to set to 0 with frequencies = frequencies.fillna(0) (note that fillna() returns a new DataFrame unless you pass inplace=True).
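As a small taste of the DataFrame manipulation I mentioned liking, here's the totals column from the start of this post in one line ('total' is just my choice of column name) –

frequencies['total'] = frequencies.sum(axis=1)  # sum the shapes for each color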
Conclusion
What I was trying to do with Pandas – unfortunately, the very first thing I ever tried to do with it – didn’t play to its strengths. It didn’t break my code, but it slowed it down by a factor of ~1700. Since I had thousands of items to process, the difference was hard to overlook!
Pandas looks great for some things, and I expect I’ll continue using it. This was just a bump in the road, albeit an interesting one.