George V. Reilly

Patching Airflow Wheels

In Patching and Splitting Python Wheels, I wrote about some occasions when I had to take a Python wheel and patch it. Now I want to tell you about a very different approach that I used recently to patch Airflow wheels.

With the other wheels, we just needed to apply some tactical patches. With Airflow, we are making substantive changes.

We've been using Airflow for years at work. We built up a lot of infrastructure around Airflow 1 and we are gradually migrating to Airflow 2.

Several years ago, we forked the airflow package and made a large number of changes to it for internal consumption. Unfortunately, this made it increasingly hard …

Patching and Splitting Python Wheels

I gave lightning talks at Python Ireland in May 2024 and Puget Sound Programming Python (PuPPy) in August 2024 about patching and splitting wheels: slides.

Patching Wheels

Last year, I wrote about manually Patching a Python Wheel. There was a cyclic dependency between the Torch 2.0.1 wheel and the Triton 2.0.0 wheel. While pip had no problem with this, Bazel certainly did. My workaround was to unzip the Torch wheel, edit the metadata to remove the dependency on Triton, and zip the wheel up again with a modified name.
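The unzip, edit, and re-zip steps can be sketched with the standard library alone. This is a minimal illustration, not the script used for the Torch wheel: the function names are hypothetical, and it assumes the dependency appears as a Requires-Dist line in the wheel's .dist-info/METADATA file.

```python
import re
import zipfile


def strip_requirement(metadata: str, package: str) -> str:
    """Return METADATA text without the Requires-Dist lines that name `package`."""
    # Match "Requires-Dist: <package>" followed by a version specifier,
    # a marker, or end of line, but not a longer name like "<package>-extra".
    pattern = re.compile(
        rf"^Requires-Dist:\s*{re.escape(package)}(?![A-Za-z0-9._-])",
        re.IGNORECASE,
    )
    lines = metadata.splitlines(keepends=True)
    return "".join(line for line in lines if not pattern.match(line))


def patch_wheel(src, dst, package: str) -> None:
    """Copy the wheel at `src` to `dst`, dropping the dependency on `package`.

    Assumption: the hash recorded for METADATA in the RECORD file is left
    untouched here; a more careful tool would recompute it.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename.endswith(".dist-info/METADATA"):
                text = strip_requirement(data.decode("utf-8"), package)
                data = text.encode("utf-8")
            # Reuse the original ZipInfo so names and timestamps carry over.
            zout.writestr(info, data)
```

Both src and dst can be paths or file-like objects; choosing a modified file name for dst is left to the caller.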

At the beginning of this year, I had to patch Torch 2.1 for different reasons. Again, I needed to patch the Torch wheel because of …

Exploring Wordle

Unless YOUVE LIVED UNDER ROCKS, you've heard of Wordle, the online word game that has become wildly popular since late 2021. You've probably seen people posting their Wordle games as grids of little green, yellow, and black (or white) emojis on social media.

Wordle 797 4/6

⬛ ⬛ ⬛ ⬛ 🟨
🟨 ⬛ 🟩 ⬛ ⬛
⬛ ⬛ 🟩 🟨 ⬛
🟩 🟩 🟩 🟩 🟩

The problem that I want to address in this post is:

Given some GUESS=SCORE pairs for Wordle and a word list, programmatically find all the words from the list that are eligible as answers.

Let's look at this four-round game for Wordle 797:

J U D …

Python Enums with Attributes

Python enumerations are useful for grouping related constants in a namespace. You can add additional behaviors to an enum class, but there isn't an easy and obvious way to add attributes to enum members.

from enum import Enum

class TileState(Enum):
    CORRECT = 1
    PRESENT = 2
    ABSENT  = 3

    def color(self):
        if self is self.CORRECT:
            return "Green"
        elif self is self.PRESENT:
            ...  # excerpt truncated here
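One way to hang attributes on enum members is to give each member a tuple value and unpack it in __init__. This is a minimal sketch and may not be the approach the full post settles on; the "Yellow" and "Black" colors are assumptions based on the emoji colors described in the Wordle post.

```python
from enum import Enum


class TileState(Enum):
    # Each member's value is a (code, color) tuple. Enum passes the tuple's
    # items to __init__, which stores them as per-member attributes.
    CORRECT = (1, "Green")
    PRESENT = (2, "Yellow")  # assumed color
    ABSENT = (3, "Black")    # assumed color

    def __init__(self, code, color):
        self.code = code
        self.color = color


state = TileState.CORRECT
label = f"{state.name}: {state.color}"
```

The trade-off is that each member's value is now the whole tuple rather than the bare integer, so lookups by value need the tuple too.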

Patching a Python Wheel

Recently, I had to create a new Python wheel for PyTorch. There is a cyclic dependency between PyTorch 2.0.1 and Triton 2.0.0: Torch depends upon Triton, but Triton also depends on Torch. Pip is okay with installing packages where there's a cyclic dependency. Bazel, however, does not handle cyclic dependencies between packages. We use Bazel extensively at Stripe and this cyclic dependency prevented us from using the latest version of Torch.

I spent a few days trying to build the PyTorch wheel from source. It was a nightmare! I ran out of disk space on the root partition on my EC2 devbox trying to install system packages, so I had to …

Backwards Ranges in Python

In Python, if you want to specify a sequence of numbers from a up to (but excluding) b, you can write range(a, b). This generates the sequence a, a+1, a+2, ..., b-1. You start at a and keep going until the next number would be b.

In Python 3, range is lazy and the values in the sequence do not materialize until you consume the range.

>>> range(3,12)
range(3, 12)
>>> list(range(3,12))
[3, 4, 5, 6, 7, 8, 9, 10, 11]

Trey Hunner makes the point that range is a lazy iterable rather than an iterator.
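The distinction shows up in a few lines of plain Python. This small sketch uses the same range(3, 12) as above:

```python
r = range(3, 12)

# A range is not an iterator: calling next() on it directly fails.
try:
    next(r)
    is_iterator = True
except TypeError:
    is_iterator = False

# It is a reusable iterable: every pass starts from the beginning.
first_pass = list(r)
second_pass = list(r)

# It supports len(), indexing, and membership tests without building a list.
size = len(r)
last = r[-1]
has_seven = 7 in r

# An iterator created from it, by contrast, is used up as it is consumed.
it = iter(r)
head = next(it)
rest = list(it)
leftover = list(it)  # nothing remains
```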

You can also step by an increment other than one: range(a, b, s). This generates a, a+s, a+2*s, ..., b-s (assuming that …

Accidentally Quadratic: Python List Membership

We had a performance regression in a test suite recently when the median test time jumped by two minutes.

We tracked it down to this (simplified) code fragment:

task_inclusions = list(some_collection_of_tasks())
invalid_tasks = [t.task_id() for t in airflow_tasks
                 if t.task_id() not in task_inclusions]

This looks fairly innocuous, and it was, until the size of the result returned from some_collection_of_tasks() jumped from a few hundred to a few thousand.

The in comparison operator conveniently works with all of Python's standard sequences and collections, but its efficiency varies. For a list and other sequences, in must search …
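A common remedy for the fragment above is to build a set once and test membership against it, since a set lookup takes constant time on average while a list lookup scans the elements one by one. This is a minimal sketch with a hypothetical helper name and plain string IDs, not necessarily the fix that was applied:

```python
def find_invalid_tasks(task_ids, task_inclusions):
    """Return the IDs in task_ids that are absent from task_inclusions."""
    # Build the set once, outside the loop, so each "not in" test below is
    # a hash lookup rather than a scan of the whole list.
    allowed = set(task_inclusions)
    return [task_id for task_id in task_ids if task_id not in allowed]


# Hypothetical data standing in for the Airflow task IDs.
inclusions = [f"task_{n}" for n in range(5000)]
candidates = ["task_1", "task_4999", "task_5000", "unknown"]
invalid = find_invalid_tasks(candidates, inclusions)
```

This requires the IDs to be hashable, which holds for strings.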

OrderedDict Initialization

An OrderedDict is a Python dict which remembers insertion order. When iterating over an OrderedDict, items are returned in that order. Before Python 3.7, ordinary dicts returned their items in an unspecified order.

Ironically, most of the ways of constructing an initialized OrderedDict end up breaking the ordering in Python 2.x and in Python 3.5 and below. Specifically, using keyword arguments or passing a dict (mapping) will not retain the insertion order of the source code.

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
>>> from collections import OrderedDict

>>> odict = OrderedDict()
>>> odict['one'] = 1
>>> odict['two'] = 2
>>> odict['three'] = 3
>>> odict['four'] = 4
>>> odict['five'] = 5
>>> odict.items()
[('one', 1), ('two', 2), ('three', …
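One construction that keeps the source order on every Python version is to pass a sequence of key/value pairs, because a list has a defined order of its own. A minimal sketch:

```python
from collections import OrderedDict

# A list of (key, value) tuples is consumed front to back, so the
# insertion order matches the order written in the source.
pairs = [("one", 1), ("two", 2), ("three", 3), ("four", 4), ("five", 5)]
odict = OrderedDict(pairs)

keys_in_order = list(odict.keys())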

Alembic: Data Migrations

We use Alembic to perform schema migrations whenever we add (or drop) tables or columns from our databases. It's less well known that Alembic can also perform data migrations, updating existing data in tables.

Here's an example adapted from a migration I put together this afternoon. I added a non-NULL Boolean stooge column to the old_timers table, with a default value of FALSE. I wanted to update certain rows to have stooge=TRUE as part of the migration. The following works with PostgreSQL.

Note the server_default=sa.false() in the declaration of the stooge column, which is needed to initially set all instances of stooge=FALSE. I then declare a table which has only the two …

Rounding

I recently learned from a StackOverflow question that the rounding behavior in Python 3.x is different from Python 2.x:

The round() function rounding strategy and return type have changed. Exact halfway cases are now rounded to the nearest even result instead of away from zero. (For example, round(2.5) now returns 2 rather than 3.)

The “away from zero” rounding strategy is the one that most of us learned at school. The “nearest even” strategy is also known as “banker’s rounding”.

There are actually five rounding strategies defined in IEEE 754:

Mode / Example Value               +11.5   +12.5   −11.5   −12.5
to nearest, ties to even           +12.0   +12.0   −12.0   −12.0
to nearest, ties away from zero    +12.0   +13.0   −12.0   −13.0
toward 0 (truncation)              +11.0   +12.0   −11.0   −12.0
toward +∞ (ceiling)                +12.0   +13.0   −11.0   −12.0
toward −∞ (floor)                  +11.0   +12.0   −12.0   −13.0
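The table can be reproduced with Python's decimal module, whose rounding constants correspond to these five strategies. This is a small sketch: the pairing of each row label with a decimal constant is my own mapping, and the helper name is hypothetical.

```python
from decimal import (
    Decimal,
    ROUND_CEILING,
    ROUND_DOWN,
    ROUND_FLOOR,
    ROUND_HALF_EVEN,
    ROUND_HALF_UP,
)

# Row label from the table -> decimal rounding constant.
MODES = {
    "to nearest, ties to even": ROUND_HALF_EVEN,
    "to nearest, ties away from zero": ROUND_HALF_UP,
    "toward 0 (truncation)": ROUND_DOWN,
    "toward +inf (ceiling)": ROUND_CEILING,
    "toward -inf (floor)": ROUND_FLOOR,
}
VALUES = ["11.5", "12.5", "-11.5", "-12.5"]


def round_to_int(value: str, mode: str) -> int:
    """Round a decimal string to a whole number using the given mode."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=mode))


table = {
    label: [round_to_int(value, mode) for value in VALUES]
    for label, mode in MODES.items()
}

# The built-in round() in Python 3 uses the first row's strategy.
builtin_examples = [round(2.5), round(3.5)]
```

Decimal is used here instead of float so that the halfway cases are represented exactly.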

Further …
