<h2>Patching Airflow Wheels</h2> <p>George V. Reilly, https://www.georgevreilly.com/, george@reilly.org, 2024-12-20</p> <p>In <a class="reference external" href="https://www.georgevreilly.com/blog/2024/12/16/PatchingAndSplittingPythonWheels.html">Patching and Splitting Python Wheels</a>, I wrote about some occasions when I had to take a <a class="reference external" href="https://realpython.com/python-wheels/">Python wheel</a> and patch it. Now I want to tell you about a very different approach that I used recently to patch Airflow wheels.</p> <p>With the other wheels, we just needed to apply some tactical patches. With Airflow, we are making substantive changes.</p> <a class="reference external image-reference" href="https://airflow.apache.org/"><img alt="Apache Airflow" src="https://www.georgevreilly.com/content/binary/ApacheAirflowLogo.png"/></a> <p>We've been using <a class="reference external" href="https://airflow.apache.org/">Airflow</a> for years at work. We built up a lot of infrastructure around Airflow 1 and we are gradually migrating to <a class="reference external" href="https://www.astronomer.io/blog/introducing-airflow-2-0/">Airflow 2</a>.</p> <p>Several years ago, we forked the <a class="reference external" href="https://pypi.org/project/apache-airflow/">airflow package</a> and made a large number of changes to it for internal consumption. 
Unfortunately, this made it increasingly hard for us to merge changes from the <a class="reference external" href="https://github.com/apache/airflow">upstream repo</a> into our internal Git repository, as the repos continued to diverge.</p> <p>Airflow's <a class="reference external" href="https://github.com/apache/airflow/blob/2.10.2/dev/README_RELEASE_AIRFLOW.md">current release workflow</a>:</p> <ul class="simple"> <li>Create a release branch from <tt class="docutils literal">main</tt>.</li> <li>Create release candidates.</li> <li>Fix any problems, including cherry-picking from <tt class="docutils literal">main</tt>.</li> <li>Publish the final release, which is <a class="reference external" href="https://git-scm.com/book/en/v2/Git-Basics-Tagging">tagged</a>. The package is uploaded to PyPI.</li> </ul> <p>Note that this tagged branch is never merged back to <tt class="docutils literal">main</tt>, so you cannot check out an official release from the <tt class="docutils literal">main</tt> branch. You must check out the tag instead. (I don't know if this was also the release workflow for Airflow 1.)</p> <p>Our internal workflow is different. Engineers work on feature branches and create pull requests. These pull requests get merged into <tt class="docutils literal">master</tt>. Production deployments are built from <tt class="docutils literal">master</tt> only. We don't use tagged releases. This <tt class="docutils literal">master</tt>-centric assumption is baked deeply into our build and continuous integration systems. Since the upstream <tt class="docutils literal">main</tt> doesn't have release code, it's not suitable for merging into our <tt class="docutils literal">master</tt>.</p> <div class="section" id="git-clone-workflow"> <h3>Git Clone Workflow</h3> <p>To avoid the difficulties that we caused ourselves with Airflow 1, we created a fresh repository for Airflow 2, which does <em>not</em> have a copy of the upstream repo's code. 
We now maintain a set of patches for each upstream release that we care about. This new repo has build scripts and patches only.</p> <p>When I first set this up, I had the CI build script create a shallow clone of the upstream repo, then check out each tag, and apply our patches.</p> <pre class="code bash literal-block">
<span class="c1"># NOT SHOWN: create a virtualenv with Hatch and other build dependencies
# from Airflow's pyproject.toml</span>

git<span class="w"> </span>clone<span class="w"> </span>--depth<span class="o">=</span><span class="m">1</span><span class="w"> </span>https://github.com/apache/airflow.git<span class="w"> </span>worktree
<span class="nb">cd</span><span class="w"> </span>worktree
<span class="k">for</span><span class="w"> </span>tag<span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="s2">&quot;2.10.2&quot;</span><span class="w"> </span><span class="s2">&quot;2.10.4&quot;</span><span class="p">;</span><span class="w"> </span><span class="k">do</span>
<span class="w">    </span>git<span class="w"> </span>reset<span class="w"> </span>--hard<span class="w"> </span>HEAD
<span class="w">    </span>rm<span class="w"> </span>-rf<span class="w"> </span>dist
<span class="w">    </span>git<span class="w"> </span>fetch<span class="w"> </span>--depth<span class="w"> </span><span class="m">1</span><span class="w"> </span>origin<span class="w"> </span><span class="s2">&quot;</span><span class="nv">$tag</span><span class="s2">&quot;</span>
<span class="w">    </span>git<span class="w"> </span>checkout<span class="w"> </span>--quiet<span class="w"> </span>FETCH_HEAD
<span class="w">    </span><span class="k">for</span><span class="w"> </span>p<span class="w"> </span><span class="k">in</span><span class="w"> </span>../patches/<span class="s2">&quot;</span><span class="nv">$tag</span><span class="s2">&quot;</span>/*.patch<span class="p">;</span><span class="w"> </span><span class="k">do</span>
<span class="w">        </span>git<span class="w"> </span>am<span class="w"> </span>&lt;<span class="w"> </span><span class="s2">&quot;</span><span class="nv">$p</span><span class="s2">&quot;</span>
<span class="w">    </span><span class="k">done</span>
<span class="w">    </span>python3<span class="w"> </span>-m<span class="w"> </span>build<span class="w"> </span>--wheel
<span class="w">    </span>cp<span class="w"> </span>dist/*<span class="w"> </span>../build
<span class="k">done</span>
</pre> <p>The first patch for each tag changes the version information so that our wheel won't conflict with the official wheel from upstream. It updates <tt class="docutils literal">tool.hatch.version</tt> in <tt class="docutils literal">pyproject.toml</tt> to read:</p> <pre class="code toml literal-block">
<span class="k">[tool.hatch.version]</span>
<span class="n">source</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">&quot;code&quot;</span>
<span class="n">expression</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">&quot;stripe_airflow_version()&quot;</span>
<span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">&quot;stripe_version.py&quot;</span>
</pre> <p>instead of extracting the version information from <tt class="docutils literal">airflow/__init__.py</tt>.</p> <p>The <tt class="docutils literal">stripe_version.py</tt> script uses <a class="reference external" href="https://git-scm.com/docs/git-describe">git describe</a> to get the number of additional commits in our branch and the abbreviated SHA of the most recent commit, then prefixes these items with <tt class="docutils literal"><span class="pre">+stripe.${MAJOR}.</span></tt> All of this is suffixed to the actual version number 
from upstream, so we build a wheel that is named something like <tt class="docutils literal"><span class="pre">apache_airflow-${TAG}+stripe.1.${COUNT}.g${SHA}-py3-none-any.whl</span></tt>.</p> <p>While this system produced a working wheel, there was one critical omission. The official upstream wheel contained an extra 37MB of UI code in <tt class="docutils literal">www/static</tt>, which is used by the various Airflow website UIs.</p> <p>I spent quite a bit of effort to make our build generate this extra payload, but it turned out to be very difficult. <tt class="docutils literal">python3 <span class="pre">-m</span> hatch build <span class="pre">-t</span> custom</tt> requires Node.js and does a lot of extra steps that didn't interact well with the locked down egress rules of our CI.</p> </div> <div class="section" id="source-distribution-workflow"> <h3>Source Distribution Workflow</h3> <p>I realized that all of the <tt class="docutils literal">www/static</tt> tree could be extracted from the official release, and that we didn't have to generate it in CI.</p> <p>Instead of checking out a tag, our CI downloads the official <a class="reference external" href="https://packaging.python.org/en/latest/specifications/source-distribution-format/">source distribution</a> tarball, <tt class="docutils literal"><span class="pre">apache_airflow-${RELEASE}.tar.gz</span></tt>, untars the tarball, applies our patches, and builds a new wheel.</p> <p>It took me a while to figure out why our custom versioning wasn't working. Because the sdist contains a file called <tt class="docutils literal"><span class="pre">PKG-INFO</span></tt> at the root, Hatch takes the version from that. 
I had to update the <tt class="docutils literal">stripe_version.py</tt> script to modify the <tt class="docutils literal">Version:</tt> line in <tt class="docutils literal"><span class="pre">PKG-INFO</span></tt>.</p> </div> <div class="section" id="format-patch-workflow"> <h3>Format-Patch Workflow</h3> <p>So far, I've covered how the patched wheel is built in CI, but not how you would create new patches.</p> <p>For local development, you can check out the upstream tag (see <tt class="docutils literal">FETCH_HEAD</tt> above), then apply any existing patches that are relevant. Make other changes, commit them locally, and build the wheel by hand. When you have tested and have something that you're happy with, you can use <a class="reference external" href="https://git-scm.com/book/en/v2/Distributed-Git-Maintaining-a-Project">git format-patch</a> to create a series of patches. These patches can then be committed to the repo that we use to build the wheels.</p> <p>This workflow is less convenient than making changes directly in the forked code, as we did with Airflow 1. But now we only have a moderate amount of friction to upgrade to a newer release from upstream, instead of ever-increasing difficulty.</p> </div> <h2>Patching and Splitting Python Wheels</h2> <p>2024-12-16</p> <img alt="Patching a bicycle tube" src="https://www.georgevreilly.com/content/binary/patching-bike-tube.jpg" style="width: 700px;"/> <p>I gave lightning talks at <a class="reference external" href="https://www.meetup.com/pythonireland/events/300802991/">Python Ireland</a> in May 2024 and <a class="reference external" href="https://www.meetup.com/psppython/events/302211630/">Puget Sound Programming Python (PuPPy)</a> in August 2024 about patching and splitting wheels: <a class="reference external" href="https://docs.google.com/presentation/d/1YXfI7U1oVgHLSgX8uYieIGdooQK-jTqXWMhpkUrAo4U/edit?usp=sharing">slides</a>.</p> <div class="section" id="patching-wheels"> <h3>Patching Wheels</h3> <p>Last year, I wrote about manually <a class="reference external" href="https://www.georgevreilly.com/blog/2023/08/10/PatchingAPythonWheel.html">Patching a Python Wheel</a>. There was a cyclic dependency between the Torch 2.0.1 wheel and the Triton 2.0.0 wheel. While <tt class="docutils literal">pip</tt> had no problem with this, <a class="reference external" href="https://bazel.build/">Bazel</a> certainly did. My <a class="reference external" href="https://www.georgevreilly.com/blog/2023/08/10/PatchingAPythonWheel.html">workaround</a> was to unzip the Torch wheel, edit the metadata to remove the dependency on Triton, and zip the wheel up again with a modified name.</p> <p>At the beginning of this year, I had to patch Torch 2.1 for <a class="reference external" href="https://github.com/georgevreilly/torch21">different reasons</a>. Again, I needed to patch the Torch wheel because of Bazel problems. 
Due to the way that Bazel installs each package in a different location, instead of one common <tt class="docutils literal"><span class="pre">site-packages</span></tt>, I had to <a class="reference external" href="https://github.com/georgevreilly/torch21">ensure</a> that Torch preloaded a series of <tt class="docutils literal"><span class="pre">lib*.so</span></tt> extensions in <em>topologically sorted</em> order. This time, I wrote a <a class="reference external" href="https://github.com/georgevreilly/torch21/blob/main/scripts/patcher">patcher script</a> to apply patches to a wheel.</p> <p>The <a class="reference external" href="https://github.com/georgevreilly/torch21/blob/main/scripts/patcher">patcher script</a> uses the official <a class="reference external" href="https://wheel.readthedocs.io/en/stable/">wheel</a> package to do most of the work of extracting the contents and packing the new wheel. My <a class="reference external" href="https://github.com/georgevreilly/torch21">torch21 repo</a> gives two examples of how to use <tt class="docutils literal">patcher</tt>.</p> </div> <div class="section" id="splitting-wheels"> <h3>Splitting Wheels</h3> <p>There are multiple variants of the Torch wheel. The Torch 2.1 wheel with CUDA 11.8, <tt class="docutils literal"><span class="pre">torch==2.1.2+cu118</span></tt>, is 2.5GB, and expands to 4GB! 
Almost all of that is in shared object libraries (<tt class="docutils literal"><span class="pre">lib/lib*.so</span></tt>), some 3.9GB.</p> <pre class="literal-block">
-rwxr-xr-x 1 georgevreilly stripe  125M Dec 12 18:05 libcudnn_adv_infer.so.8
-rwxr-xr-x 1 georgevreilly stripe  241M Dec 12 18:05 libtorch_cuda_linalg.so
-rwxr-xr-x 1 georgevreilly stripe  451M Dec 12 18:05 libtorch_cpu.so
-rwxr-xr-x 1 georgevreilly stripe  548M Dec 12 18:04 libcublasLt.so.11
-rwxr-xr-x 1 georgevreilly stripe  621M Dec 12 18:05 libcudnn_cnn_infer.so.8
-rwxr-xr-x 1 georgevreilly stripe 1355M Dec 12 18:05 libtorch_cuda.so
</pre> <a class="reference external image-reference" href="https://gist.github.com/georgevreilly/702e9e8783dd5978bd3e4a151fadee1e"><img alt="Library interdependencies" src="https://www.georgevreilly.com/content/binary/torch-topo-deps.png" style="width: 700px;"/></a> <p>An internal system that we use for remotely building Bazel actions has a hard limit of 3GB. This is an internal policy, not an inherent Bazel limitation, but it led to difficulties with building apps that wanted to use the wheel.</p> <p>My solution was to split the wheel into two wheels, a <tt class="docutils literal">cudatorch</tt> wheel, which contained the two largest libraries, <tt class="docutils literal">libtorch_cuda.so</tt> (1355MB) and <tt class="docutils literal">libtorch_cuda_linalg.so</tt> (241MB), and a modified version of the <tt class="docutils literal">torch</tt> wheel, which contained everything else.</p> <p>I had to use <a class="reference external" href="https://manpages.ubuntu.com/manpages/noble/en/man1/patchelf.1.html">patchelf</a> to modify the <a class="reference external" href="https://en.wikipedia.org/wiki/Rpath">rpath</a> of the two libs in the <tt class="docutils literal">cudatorch</tt> wheel to something like <tt class="docutils literal"><span class="pre">$ORIGIN:$ORIGIN/../../torch/lib</span></tt>.</p> <p>In the <tt class="docutils literal">torch</tt> wheel, I had to patch <tt 
class="docutils literal">torch/__init__.py</tt> to preload the <tt class="docutils literal">cudatorch</tt> libs.</p> </div> <h2>Social Media Handles</h2> <p>2024-12-01</p> <p>The migration away from Twitter has really caught fire lately, with Bluesky being the obvious winner of new accounts in recent weeks.</p> <p>I'm available on other platforms, if you want to follow me. These are sorted from most to least active:</p> <ul class="simple"> <li>Bluesky: <a class="reference external" href="https://bsky.app/profile/georgevreilly.bsky.social">&#64;georgevreilly.bsky.social</a></li> <li>Threads: <a class="reference external" href="https://www.threads.net/@georgevreilly">&#64;georgevreilly</a></li> <li>Instagram: <a class="reference external" href="https://www.instagram.com/georgevreilly/">georgevreilly</a></li> <li>Substack: <a class="reference external" href="https://substack.com/@georgevreilly">&#64;georgevreilly</a></li> <li>LinkedIn: <a class="reference external" href="https://www.linkedin.com/in/georgevreilly/">georgevreilly</a></li> <li>GitHub: <a class="reference external" href="https://github.com/georgevreilly">&#64;georgevreilly</a></li> <li>Twitter: <a class="reference external" href="https://x.com/georgevreilly">&#64;georgevreilly</a> (going dormant soon)</li> <li>Mastodon: <a class="reference external" href="https://tech.lgbt/@georgevreilly">&#64;georgevreilly&#64;tech.lgbt</a></li> <li>Discord: <tt class="docutils literal">&#64;georgevreilly</tt></li> <li>Reddit: <a class="reference external" href="https://www.reddit.com/user/george_v_reilly/">u/george_v_reilly</a></li> <li>Tumblr: <a class="reference external" href="https://www.tumblr.com/georgevreilly">&#64;georgevreilly</a></li> <li>TikTok: <a class="reference external" href="https://www.tiktok.com/@georgevreilly">&#64;georgevreilly</a></li> </ul> <p>Basically, I'm <tt 
class="docutils literal">georgevreilly</tt> everywhere, except for a few old accounts from the 2000s where I used <tt class="docutils literal">george_v_reilly</tt>.</p> <p>I have a sweatshirt from attending CascadiaJS in 2012 that has <tt class="docutils literal">&#64;georgevreilly</tt> along the length of the left sleeve. It was either my Twitter or my GitHub handle (which are, of course, identical).</p> <p>I can also be reached via:</p> <ul class="simple"> <li>Linktree: <a class="reference external" href="https://linktr.ee/georgevreilly">georgevreilly</a></li> <li>About.me: <a class="reference external" href="https://about.me/georgevreilly">georgevreilly</a></li> </ul> <h2>Cold Brew Coffee Recipe for French Press</h2> <p>2024-10-19</p> <img alt="French Press" class="right-float" src="https://www.georgevreilly.com/content/binary/french-press.jpg"/> <p>Last year, I posted a recipe for <a class="reference external" href="https://www.georgevreilly.com/blog/2023/07/24/ColdBrewCoffeeRecipe.html">cold brew coffee</a> using an <a class="reference external" href="https://www.oxo.com/cold-brew-coffee-maker.html">Oxo Cold Brew Coffee Maker</a>. Recently, it occurred to me that I could use a French Press instead of the Oxo. I've made several batches, with good results. It's a little more convenient to make cold brew in the Oxo, but it's good to know that it can be made without buying special-use equipment.</p> <div class="section" id="ingredients"> <h3>Ingredients</h3> <ul class="simple"> <li>24 fl oz (700 ml) water</li> <li>6 oz (170 g) fresh <em>coarsely ground</em> coffee. Store-bought pre-ground coffee is too fine.</li> </ul> <p>This will fill a one-quart (one-liter) French Press. 
It <strong>yields about 16 fl oz</strong> (1 pt/500 ml) of cold brew coffee.</p> </div> <div class="section" id="instructions"> <h3>Instructions</h3> <ul class="simple"> <li>Grind the coffee beans coarsely.</li> <li>Place the ground coffee in the French Press.</li> <li>Pour water into the grounds, distributing it as best you can.</li> <li>Stir gently to ensure that all coffee grounds are wet.</li> <li>Cover the French Press with plastic wrap—this prevents absorption of odors in the fridge.</li> <li>Store it in the fridge, for anywhere between 6 and 24 hours.</li> <li>Remove the plastic wrap, insert the plunger, and plunge.</li> <li>Pour the brew through a coffee filter to strain any fine grounds that get past the plunger.</li> <li>Refrigerate the cold brew.</li> </ul> </div> <h2>Exploring Wordle</h2> <p>2023-09-26</p> <p>Unless YOUVE LIVED UNDER ROCKS, you've heard of <a class="reference external" href="https://en.wikipedia.org/wiki/Wordle">Wordle</a>, the online word game that has become wildly popular since late 2021. 
You've probably seen people posting their Wordle games as grids of little green, yellow, and black (or white) emojis on social media.</p> <div class="line-block"> <div class="line">Wordle 797 4/6</div> <div class="line"><br /></div> <div class="line">⬛ ⬛ ⬛ ⬛ 🟨</div> <div class="line">🟨 ⬛ 🟩 ⬛ ⬛</div> <div class="line">⬛ ⬛ 🟩 🟨 ⬛</div> <div class="line">🟩 🟩 🟩 🟩 🟩</div> </div> <p>The problem that I want to address in this post is:</p> <blockquote> Given some <tt class="docutils literal">GUESS=SCORE</tt> pairs for Wordle and a word list, programmatically find all the words from the list that are eligible as answers.</blockquote> <p>Let's look at this four-round game for Wordle 797:</p> <table class="wordle"> <tr><td class="absent">J</td> <td class="absent">U</td> <td class="absent">D</td> <td class="absent">G</td> <td class="present">E</td> <td class="gs">JUDGE=....e</td></tr> <tr><td class="present">C</td> <td class="absent">H</td> <td class="correct">E</td> <td class="absent">S</td> <td class="absent">T</td> <td class="gs">CHEST=c.E..</td></tr> <tr><td class="absent">W</td> <td class="absent">R</td> <td class="correct">E</td> <td class="present">C</td> <td class="absent">K</td> <td class="gs">WRECK=..Ec.</td></tr> <tr><td class="correct">O</td> <td class="correct">C</td> <td class="correct">E</td> <td class="correct">A</td> <td class="correct">N</td> <td class="gs">OCEAN=OCEAN</td></tr> </table><p>The letters of each guess are colored Green, Yellow, or Black (dark-gray).</p> <ul class="simple"> <li>A Green tile 🟩 means that the letter is <strong>correct</strong>: <tt class="docutils literal">E</tt> is the third letter of the answer.</li> <li>A Yellow tile 🟨 means that the letter is <strong>present</strong> <em>elsewhere</em> in the answer. There is a <tt class="docutils literal">C</tt> in the answer; it's not in columns 1 or 4, but it is correct in column 2. 
Likewise, an <tt class="docutils literal">E</tt> is present in the answer; it's not in column 5, but it's correct in column 3.</li> <li>A Black tile ⬛ is <strong>absent</strong> from the answer: <tt class="docutils literal">J</tt>, <tt class="docutils literal">U</tt>, <tt class="docutils literal">D</tt>, <tt class="docutils literal">G</tt>, <tt class="docutils literal">H</tt>, <tt class="docutils literal">S</tt>, <tt class="docutils literal">T</tt>, <tt class="docutils literal">W</tt>, <tt class="docutils literal">R</tt>, and <tt class="docutils literal">K</tt> do not appear anywhere in <tt class="docutils literal">OCEAN</tt>.</li> </ul> <p>(This definition of “absent” turns out to be inadequate, as you will discover later.)</p> <p>The <tt class="docutils literal">GUESS=SCORE</tt> notation is intended to be clear to read and also easier to write than Greens and Yellows. For example:</p> <div style="text-align: center; font-family: &#x27;Source Code Pro&#x27;, monospace; font-size: 48px;"> <div><i>GUESS=SCORE</i></div> <div>CHEST=c.E..</div> </div> <table class="wordle"> <tr><td class="present">C</td> <td class="absent">H</td> <td class="correct">E</td> <td class="absent">S</td> <td class="absent">T</td></tr> </table><ul class="simple"> <li>the <em>uppercase</em> <tt class="docutils literal">E</tt> at position 3 in the score denotes that <tt class="docutils literal">E</tt> is in the <strong>correct</strong> position (i.e., green 🟩);</li> <li>the <em>lowercase</em> <tt class="docutils literal">c</tt> at position 1 in the score denotes that <tt class="docutils literal">C</tt> is <strong>present</strong> somewhere in the answer, but it is in the wrong position (yellow 🟨);</li> <li>the <tt class="docutils literal">.</tt>s in the score at positions 2, 4, and 5 denote that the corresponding letters in the guess (<tt class="docutils literal">H</tt>, <tt class="docutils literal">S</tt>, and <tt class="docutils literal">T</tt>, respectively) are <strong>absent</strong> from 
the answer (black ⬛).</li> </ul> <div class="section" id="deducing-constraints"> <h3>Deducing Constraints</h3> <p>What can we deduce from the first three rows of guesses, <tt class="docutils literal"><span class="pre">JUDGE=....e</span> CHEST=c.E.. <span class="pre">WRECK=..Ec.</span></tt>?</p> <p>There is a set of <em>valid</em> letters, <tt class="docutils literal">C</tt> and <tt class="docutils literal">E</tt>, that are either <em>present</em> (yellow 🟨) or <em>correct</em> (green 🟩). Both <tt class="docutils literal">E</tt> and <tt class="docutils literal">C</tt> start out as present, but <tt class="docutils literal">E</tt> later finds its correct position, while <tt class="docutils literal">C</tt> does not.</p> <p>There is a set of <em>invalid</em> letters that are known to be <em>absent</em> from the answer (black ⬛): <tt class="docutils literal">J</tt>, <tt class="docutils literal">U</tt>, <tt class="docutils literal">D</tt>, <tt class="docutils literal">G</tt>, <tt class="docutils literal">H</tt>, <tt class="docutils literal">S</tt>, <tt class="docutils literal">T</tt>, <tt class="docutils literal">W</tt>, <tt class="docutils literal">R</tt>, and <tt class="docutils literal">K</tt>.</p> <p>The remaining letters of the alphabet are currently <em>unknown</em>. When they are played, they will turn into <em>valid</em> or <em>invalid</em> letters. Unless we already have all five correct letters, we will draw candidate letters from the unknown pool.</p> <p>Furthermore, we know something about <em>letter positions</em>. 
The <em>correct</em> letters are in the correct positions, while the <em>present</em> letters are in the wrong positions.</p> <p>A candidate word <em>must</em>:</p> <ol class="arabic simple"> <li>include all valid letters — <tt class="docutils literal">C</tt> and <tt class="docutils literal">E</tt></li> <li>exclude all invalid letters — <tt class="docutils literal">JUDGHSTWRK</tt></li> <li>match all “correct” positions — <tt class="docutils literal">3:E</tt></li> <li>not match any “present” positions — <tt class="docutils literal">1:C</tt>, <tt class="docutils literal">4:C</tt>, or <tt class="docutils literal">5:E</tt></li> </ol> <p>These constraints narrow the possible choices from the word list.</p> <p>The obvious way to solve this with a computer is to codify the constraints provided by previous guess–score pairs and run through the entire list of words to find eligible words. But no human solves Wordle by methodically examining thousands of words. Instead, you rack your brain for “what ends in <tt class="docutils literal">SE</tt> and has an <tt class="docutils literal">M</tt>?” or “I've tried <tt class="docutils literal">A</tt>, <tt class="docutils literal">E</tt>, and <tt class="docutils literal">I</tt>; will <tt class="docutils literal">O</tt> or <tt class="docutils literal">U</tt> work?” or “What are the most likely letters left on the keyboard at the bottom?”</p> <p>This article will show you how to solve Wordle programmatically. 
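</p> <p>The four numbered constraints above can be sketched as a single Python predicate. This is only an illustrative sketch under the simple definition of “absent” used so far: the name <tt class="docutils literal">is_candidate</tt>, its parameters, and the five-word sample list are assumptions made up for this example, not code from the Wordle repo.</p>

```python
def is_candidate(word, valid, invalid, correct, present):
    """Return True if `word` meets all four constraints (hypothetical helper).

    valid:   letters known to be in the answer, e.g. {"C", "E"}
    invalid: letters known to be absent, e.g. set("JUDGHSTWRK")
    correct: {position: letter} for green tiles, 1-based, e.g. {3: "E"}
    present: set of (position, letter) for yellow tiles, e.g. {(1, "C")}
    """
    letters = set(word)
    return (
        valid <= letters                                             # 1. include all valid letters
        and not (invalid & letters)                                  # 2. exclude all invalid letters
        and all(word[pos - 1] == ch for pos, ch in correct.items())  # 3. match "correct" positions
        and all(word[pos - 1] != ch for pos, ch in present)          # 4. avoid "present" positions
    )


# Constraints deduced from JUDGE=....e CHEST=c.E.. WRECK=..Ec.
VALID = {"C", "E"}
INVALID = set("JUDGHSTWRK")
CORRECT = {3: "E"}
PRESENT = {(1, "C"), (4, "C"), (5, "E")}

# A tiny made-up word list, purely for illustration.
words = ["OCEAN", "CHEST", "OLEIC", "PIECE", "FENCE"]
eligible = [w for w in words if is_candidate(w, VALID, INVALID, CORRECT, PRESENT)]
print(eligible)
```

<p>With this made-up list, only <tt class="docutils literal">OCEAN</tt> and <tt class="docutils literal">OLEIC</tt> survive. Keeping the four checks as separate clauses mirrors the numbered list, at the cost of scanning each word several times.</p> <p>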
It won't help you much in playing Wordle by hand, though you may understand more about the game when you're finished reading.</p> </div> <div class="section" id="prototyping-with-pipes"> <h3>Prototyping with Pipes</h3> <p>Let's prototype the above constraints with a series of <a class="reference external" href="https://www.cyberciti.biz/faq/howto-use-grep-command-in-linux-unix/">grep's</a> in a <a class="reference external" href="https://en.wikipedia.org/wiki/Pipeline_(Unix)">Unix pipeline</a> tailored to this <tt class="docutils literal">OCEAN</tt> example:</p> <pre class="code bash literal-block">
<span class="c1"># JUDGE=....e CHEST=c.E.. WRECK=..Ec.</span>

grep<span class="w"> </span><span class="s1">'^.....$'</span><span class="w"> </span>/usr/share/dict/words<span class="w"> </span><span class="p">|</span><span class="w">  </span><span class="c1"># Extract five-letter words</span>
<span class="w">    </span>tr<span class="w"> </span><span class="s1">'a-z'</span><span class="w"> </span><span class="s1">'A-Z'</span><span class="w"> </span><span class="p">|</span><span class="w">  </span><span class="c1"># Translate each word to uppercase</span>
<span class="w">    </span>grep<span class="w"> </span><span class="s1">'^..E..$'</span><span class="w"> </span><span class="p">|</span><span class="w">  </span><span class="c1"># Match CORRECT positions</span>
<span class="w">    </span>awk<span class="w"> </span><span class="s1">'/C/ &amp;&amp; /E/'</span><span class="w"> </span><span class="p">|</span><span class="w">  </span><span class="c1"># Match ALL of VALID set, CORRECT|PRESENT</span>
<span class="w">    </span>grep<span class="w"> </span>-v<span class="w"> </span><span class="s1">'[JUDGHSTWRK]'</span><span class="w"> </span><span class="p">|</span><span class="w">  </span><span class="c1"># Exclude INVALID set</span>
<span class="w">    </span>grep<span class="w"> </span><span class="s1">'^[^C]..[^C][^E]$'</span><span class="w">  </span><span class="c1"># Exclude PRESENT positions</span>
</pre> <p>gives:</p> <pre class="literal-block">
ICENI
ILEAC
OCEAN
OLEIC
</pre> <p>(This was in Bash, on macOS 13.6. Zsh doesn't like the comments in the middle of the multi-line pipeline, so you may have to omit them. Other operating systems will have different versions of <tt class="docutils literal">/usr/share/dict/words</tt> that may not have all of these obscure words.)</p> <p>We can accomplish this with only the simplest features of regular expressions: the <a class="reference external" href="https://www.regular-expressions.info/dot.html">dot metacharacter</a> (<tt class="docutils literal">.</tt>), <a class="reference external" href="https://www.regular-expressions.info/charclass.html">character classes</a> (<tt class="docutils literal"><span class="pre">[JUD...]</span></tt>) and negated character classes (<tt class="docutils literal">[^E]</tt>), and the <tt class="docutils literal">^</tt> and <tt class="docutils literal">$</tt> <a class="reference external" href="https://www.regular-expressions.info/anchors.html">anchors</a>. Awk gives us <a class="reference external" href="https://www.georgevreilly.com/blog/2023/09/05/RegexConjunctions.html">regex conjunctions</a>, allowing us to match <em>all</em> of the chars.</p> <p>The above regular expressions are a simple mechanical transformation of the guess–score pairs. They could be simplified. For example, after <tt class="docutils literal">grep <span class="pre">'^..E..$'</span></tt>, the <tt class="docutils literal">E</tt> in <tt class="docutils literal">awk '/C/ &amp;&amp; /E/'</tt> is redundant. We're not going to optimize the regexes, however.</p> <p>Three of the four answers—<tt class="docutils literal">ICENI</tt>, <tt class="docutils literal">ILEAC</tt>, and <tt class="docutils literal">OLEIC</tt>—are far too obscure to be Wordle answers. 
Actual Wordle answers also exclude simple plurals (<tt class="docutils literal">YARDS</tt>) and simple past tense (<tt class="docutils literal">LIKED</tt>), but allow more complex plurals (<tt class="docutils literal">BOXES</tt>) and irregular past tense (<tt class="docutils literal">DWELT</tt>, <tt class="docutils literal">BROKE</tt>). We make no attempt to judge if an eligible word is <em>likely</em> as a Wordle answer; merely that it fits.</p> <p>Let's make a pipeline for Wordle 787 (<tt class="docutils literal">INDEX</tt>):</p> <pre class="code bash literal-block">
<span class="c1"># VOUCH=..... GRIPE=..i.e DENIM=deni. WIDEN=.iDEn</span>

grep<span class="w"> </span><span class="s1">'^.....$'</span><span class="w"> </span>/usr/share/dict/words<span class="w"> </span><span class="p">|</span>
<span class="w">    </span>tr<span class="w"> </span><span class="s1">'a-z'</span><span class="w"> </span><span class="s1">'A-Z'</span><span class="w"> </span><span class="p">|</span>
<span class="w">    </span>grep<span class="w"> </span><span class="s1">'^..DE.$'</span><span class="w"> </span><span class="p">|</span><span class="w">  </span><span class="c1"># CORRECT pos</span>
<span class="w">    </span>awk<span class="w"> </span><span class="s1">'/D/ &amp;&amp; /E/ &amp;&amp; /I/ &amp;&amp; /N/'</span><span class="w"> </span><span class="p">|</span><span class="w">  </span><span class="c1"># VALID set</span>
<span class="w">    </span>grep<span class="w"> </span>-v<span class="w"> </span><span class="s1">'[VOUCHGRPMW]'</span><span class="w"> </span><span class="p">|</span><span class="w">  </span><span class="c1"># INVALID set</span>
<span class="w">    </span>grep<span class="w"> </span><span class="s1">'^[^D][^EI][^IN][^I][^EN]$'</span><span class="w">  </span><span class="c1"># PRESENT pos</span>
</pre> <p>yields:</p> <pre class="literal-block">
INDEX
</pre> <p>This approach is promising, but constructing those regexes by hand is not maintainable.</p> </div> <div 
class="section" id="word-lists"> <h3>Word Lists</h3> <p>There are several sources of five-letter words.</p> <ul class="simple"> <li>Filtering <tt class="docutils literal">/usr/share/dict/words</tt> or similar lists.</li> <li><a class="reference external" href="https://github.com/georgevreilly/wordle/blob/main/wordle.txt">wordle.txt</a>: The nearly 15,000 words that Wordle accepts as entries. Many of these words are obscure.</li> <li><a class="reference external" href="https://github.com/georgevreilly/wordle/blob/main/answers.txt">answers.txt</a>: The 2,309 words that Wordle uses as answers. These words are fairly recognizable. They are a subset of the other list.</li> </ul> <p>The latter two lists were extracted from the source code of the game. In the various examples below, I use the larger 15,000-word list.</p> </div> <div class="section" id="initial-python-solution"> <h3>Initial Python Solution</h3> <p>Let's attempt to solve this in Python. The first piece is to parse a list of <tt class="docutils literal">GUESS=SCORE</tt> pairs.</p> <!-- wordle1 --> <pre class="code python literal-block"> <span class="k">def</span> <span class="nf">parse_guesses</span><span class="p">(</span><span class="n">guess_scores</span><span class="p">):</span><span class="w"> </span> <span class="n">invalid</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span> <span class="c1"># Black/Absent</span><span class="w"> </span> <span class="n">valid</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span> <span class="c1"># Green/Correct or Yellow/Present</span><span class="w"> </span> <span class="n">mask</span> <span class="o">=</span> <span class="p">[</span><span class="kc">None</span><span class="p">]</span> <span class="o">*</span> <span class="mi">5</span> <span class="c1"># Exact match for pos (Green/Correct)</span><span class="w"> </span> <span class="n">wrong_spot</span> <span class="o">=</span> <span 
class="p">[</span><span class="nb">set</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">)]</span> <span class="c1"># Wrong spot (Yellow/Present)</span><span class="w"> </span> <span class="k">for</span> <span class="n">gs</span> <span class="ow">in</span> <span class="n">guess_scores</span><span class="p">:</span><span class="w"> </span> <span class="n">guess</span><span class="p">,</span> <span class="n">score</span> <span class="o">=</span> <span class="n">gs</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&quot;=&quot;</span><span class="p">)</span><span class="w"> </span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">guess</span><span class="p">,</span> <span class="n">score</span><span class="p">)):</span><span class="w"> </span> <span class="k">assert</span> <span class="s2">&quot;A&quot;</span> <span class="o">&lt;=</span> <span class="n">g</span> <span class="o">&lt;=</span> <span class="s2">&quot;Z&quot;</span><span class="p">,</span> <span class="s2">&quot;GUESS should be uppercase&quot;</span><span class="w"> </span> <span class="k">if</span> <span class="s2">&quot;A&quot;</span> <span class="o">&lt;=</span> <span class="n">s</span> <span class="o">&lt;=</span> <span class="s2">&quot;Z&quot;</span><span class="p">:</span><span class="w"> </span> <span class="k">assert</span> <span class="n">g</span> <span class="o">==</span> <span class="n">s</span><span class="w"> </span> <span class="n">valid</span><span class="o">.</span><span 
class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w"> </span> <span class="n">mask</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">g</span><span class="w"> </span> <span class="k">elif</span> <span class="s2">&quot;a&quot;</span> <span class="o">&lt;=</span> <span class="n">s</span> <span class="o">&lt;=</span> <span class="s2">&quot;z&quot;</span><span class="p">:</span><span class="w"> </span> <span class="k">assert</span> <span class="n">g</span> <span class="o">==</span> <span class="n">s</span><span class="o">.</span><span class="n">upper</span><span class="p">()</span><span class="w"> </span> <span class="n">valid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w"> </span> <span class="n">wrong_spot</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w"> </span> <span class="k">elif</span> <span class="n">s</span> <span class="o">==</span> <span class="s2">&quot;.&quot;</span><span class="p">:</span><span class="w"> </span> <span class="n">invalid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w"> </span> <span class="k">else</span><span class="p">:</span><span class="w"> </span> <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Unexpected </span><span class="si">{</span><span class="n">s</span><span class="si">}</span><span class="s2"> for </span><span class="si">{</span><span class="n">g</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span><span class="w"> </span> <span 
class="k">return</span> <span class="p">(</span><span class="n">invalid</span><span class="p">,</span> <span class="n">valid</span><span class="p">,</span> <span class="n">mask</span><span class="p">,</span> <span class="n">wrong_spot</span><span class="p">)</span> </pre> <p>Let's try it for the <tt class="docutils literal">OCEAN</tt> guesses:</p> <pre class="code pycon literal-block"> <span class="gp">&gt;&gt;&gt; </span><span class="n">invalid</span><span class="p">,</span> <span class="n">valid</span><span class="p">,</span> <span class="n">mask</span><span class="p">,</span> <span class="n">wrong_spot</span> <span class="o">=</span> <span class="n">parse_guesses</span><span class="p">(</span><span class="w"> </span><span class="gp">... </span> <span class="p">[</span><span class="s2">&quot;JUDGE=....e&quot;</span><span class="p">,</span> <span class="s2">&quot;CHEST=c.E..&quot;</span><span class="p">,</span> <span class="s2">&quot;WRECK=..Ec.&quot;</span><span class="p">])</span><span class="w"> </span><span class="go"> </span><span class="gp">&gt;&gt;&gt; </span><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">invalid</span><span class="si">=}</span><span class="se">\n</span><span class="si">{</span><span class="n">valid</span><span class="si">=}</span><span class="se">\n</span><span class="si">{</span><span class="n">mask</span><span class="si">=}</span><span class="se">\n</span><span class="si">{</span><span class="n">wrong_spot</span><span class="si">=}</span><span class="s2">&quot;</span><span class="p">)</span><span class="w"> </span><span class="go">invalid={'H', 'K', 'D', 'G', 'T', 'R', 'U', 'W', 'J', 'S'} valid={'E', 'C'} mask=[None, None, 'E', None, None] wrong_spot=[{'C'}, set(), set(), {'C'}, {'E'}] </span><span class="gp">&gt;&gt;&gt; </span><span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span 
class="n">vocab</span><span class="p">:</span><span class="w"> </span><span class="gp">... </span> <span class="k">if</span> <span class="n">is_eligible</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">invalid</span><span class="p">,</span> <span class="n">valid</span><span class="p">,</span> <span class="n">mask</span><span class="p">,</span> <span class="n">wrong_spot</span><span class="p">):</span><span class="w"> </span><span class="gp">... </span> <span class="nb">print</span><span class="p">(</span><span class="n">w</span><span class="p">)</span><span class="w"> </span><span class="gp">...</span><span class="w"> </span><span class="go">ICENI ILEAC OCEAN OLEIC</span> </pre> <p>Here's the <tt class="docutils literal">is_eligible</tt> function. We <a class="reference external" href="https://www.geeksforgeeks.org/short-circuiting-techniques-python/#">short-circuit the evaluation</a> and return as soon as any condition is <tt class="docutils literal">False</tt>.</p> <!-- wordle1 --> <pre class="code python literal-block"> <span class="k">def</span> <span class="nf">is_eligible</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">invalid</span><span class="p">,</span> <span class="n">valid</span><span class="p">,</span> <span class="n">mask</span><span class="p">,</span> <span class="n">wrong_spot</span><span class="p">):</span><span class="w"> </span> <span class="n">letters</span> <span class="o">=</span> <span class="p">{</span><span class="n">c</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">word</span><span class="p">}</span><span class="w"> </span> <span class="k">if</span> <span class="n">letters</span> <span class="o">&amp;</span> <span class="n">valid</span> <span class="o">!=</span> <span class="n">valid</span><span class="p">:</span><span class="w"> </span> <span class="c1"># Missing some 'valid' letters from 
the word;</span><span class="w"> </span> <span class="c1"># all Green/Correct and Yellow/Present letters are required</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">&quot;!Valid: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">False</span><span class="w"> </span> <span class="k">elif</span> <span class="nb">any</span><span class="p">(</span><span class="n">m</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">c</span> <span class="o">!=</span> <span class="n">m</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">mask</span><span class="p">)):</span><span class="w"> </span> <span class="c1"># Some of the Green/Correct letters are not at their positions</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">&quot;!Mask: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">False</span><span class="w"> </span> <span class="k">elif</span> <span class="n">letters</span> <span class="o">&amp;</span> <span class="n">invalid</span><span class="p">:</span><span class="w"> </span> <span class="c1"># Some invalid (Black/Absent) letters are in the word</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span 
class="p">(</span><span class="s2">&quot;Invalid: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">False</span><span class="w"> </span> <span class="k">elif</span> <span class="nb">any</span><span class="p">(</span><span class="n">c</span> <span class="ow">in</span> <span class="n">ws</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">ws</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">wrong_spot</span><span class="p">)):</span><span class="w"> </span> <span class="c1"># We have valid letters in the wrong position (Yellow/Present)</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">&quot;WrongSpot: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">False</span><span class="w"> </span> <span class="k">else</span><span class="p">:</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">&quot;Got: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">True</span> </pre> </div> <div class="section" id="converting-to-classes"> <h3>Converting to Classes</h3> <p>Returning four parallel collections from a function is a <a class="reference external" href="https://pragmaticways.com/31-code-smells-you-must-know/">code smell</a>. 
Let's refactor these functions into a <tt class="docutils literal">WordleGuesses</tt> class.</p> <p>First, we'll need some helper classes:</p> <ul class="simple"> <li><tt class="docutils literal">WordleError</tt>: an exception class;</li> <li><tt class="docutils literal">TileState</tt>: a <a class="reference external" href="https://www.georgevreilly.com/blog/2023/09/02/PythonEnumsWithAttributes.html">multi-attribute enumeration</a>;</li> <li><tt class="docutils literal">GuessScore</tt>: a <a class="reference external" href="https://realpython.com/python-data-classes/">dataclass</a> that manages a guess–score pair and the associated <tt class="docutils literal">TileState</tt>s.</li> <li>We'll also use <a class="reference external" href="https://bernat.tech/posts/the-state-of-type-hints-in-python/">type annotations</a> because it's 2023.</li> </ul> <!-- wordle2 --> <pre class="code python literal-block"> <span class="n">WORDLE_LEN</span> <span class="o">=</span> <span class="mi">5</span><span class="w"> </span><span class="k">class</span> <span class="nc">WordleError</span><span class="p">(</span><span class="ne">Exception</span><span class="p">):</span><span class="w"> </span><span class="sd">&quot;&quot;&quot;Base exception class&quot;&quot;&quot;</span><span class="w"> </span><span class="k">class</span> <span class="nc">TileState</span><span class="p">(</span><span class="n">namedtuple</span><span class="p">(</span><span class="s2">&quot;TileState&quot;</span><span class="p">,</span> <span class="s2">&quot;value emoji color css_color&quot;</span><span class="p">),</span> <span class="n">Enum</span><span class="p">):</span><span class="w"> </span> <span class="n">CORRECT</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="s2">&quot;</span><span class="se">\U0001F7E9</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s2">&quot;Green&quot;</span><span class="p">,</span> <span 
class="s2">&quot;#6aaa64&quot;</span><span class="w"> </span> <span class="n">PRESENT</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span> <span class="s2">&quot;</span><span class="se">\U0001F7E8</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s2">&quot;Yellow&quot;</span><span class="p">,</span> <span class="s2">&quot;#c9b458&quot;</span><span class="w"> </span> <span class="n">ABSENT</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="s2">&quot;</span><span class="se">\U00002B1B</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s2">&quot;Black&quot;</span><span class="p">,</span> <span class="s2">&quot;#838184&quot;</span><span class="w"> </span><span class="nd">&#64;dataclass</span><span class="w"> </span><span class="k">class</span> <span class="nc">GuessScore</span><span class="p">:</span><span class="w"> </span> <span class="n">guess</span><span class="p">:</span> <span class="nb">str</span><span class="w"> </span> <span class="n">score</span><span class="p">:</span> <span class="nb">str</span><span class="w"> </span> <span class="n">tiles</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">TileState</span><span class="p">]</span><span class="w"> </span> <span class="nd">&#64;classmethod</span><span class="w"> </span> <span class="k">def</span> <span class="nf">make</span><span class="p">(</span><span class="bp">cls</span><span class="p">,</span> <span class="n">guess_score</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s2">&quot;GuessScore&quot;</span><span class="p">:</span><span class="w"> </span> <span class="n">guess</span><span class="p">,</span> <span class="n">score</span> <span class="o">=</span> <span class="n">guess_score</span><span class="o">.</span><span class="n">split</span><span 
class="p">(</span><span class="s2">&quot;=&quot;</span><span class="p">)</span><span class="w"> </span> <span class="n">tiles</span> <span class="o">=</span> <span class="p">[</span><span class="bp">cls</span><span class="o">.</span><span class="n">tile_state</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">score</span><span class="p">]</span><span class="w"> </span> <span class="k">return</span> <span class="bp">cls</span><span class="p">(</span><span class="n">guess</span><span class="p">,</span> <span class="n">score</span><span class="p">,</span> <span class="n">tiles</span><span class="p">)</span><span class="w"> </span> <span class="nd">&#64;classmethod</span><span class="w"> </span> <span class="k">def</span> <span class="nf">tile_state</span><span class="p">(</span><span class="bp">cls</span><span class="p">,</span> <span class="n">score_tile</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">TileState</span><span class="p">:</span><span class="w"> </span> <span class="k">if</span> <span class="s2">&quot;A&quot;</span> <span class="o">&lt;=</span> <span class="n">score_tile</span> <span class="o">&lt;=</span> <span class="s2">&quot;Z&quot;</span><span class="p">:</span><span class="w"> </span> <span class="k">return</span> <span class="n">TileState</span><span class="o">.</span><span class="n">CORRECT</span><span class="w"> </span> <span class="k">elif</span> <span class="s2">&quot;a&quot;</span> <span class="o">&lt;=</span> <span class="n">score_tile</span> <span class="o">&lt;=</span> <span class="s2">&quot;z&quot;</span><span class="p">:</span><span class="w"> </span> <span class="k">return</span> <span class="n">TileState</span><span class="o">.</span><span class="n">PRESENT</span><span class="w"> </span> <span class="k">elif</span> <span 
class="n">score_tile</span> <span class="o">==</span> <span class="s2">&quot;.&quot;</span><span class="p">:</span><span class="w"> </span> <span class="k">return</span> <span class="n">TileState</span><span class="o">.</span><span class="n">ABSENT</span><span class="w"> </span> <span class="k">else</span><span class="p">:</span><span class="w"> </span> <span class="k">raise</span> <span class="n">WordleError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Invalid score: </span><span class="si">{</span><span class="n">score_tile</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span><span class="w"> </span> <span class="k">def</span> <span class="fm">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span><span class="w"> </span> <span class="k">return</span> <span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">guess</span><span class="si">}</span><span class="s2">=</span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">score</span><span class="si">}</span><span class="s2">&quot;</span><span class="w"> </span> <span class="k">def</span> <span class="nf">emojis</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">separator</span><span class="o">=</span><span class="s2">&quot;&quot;</span><span class="p">):</span><span class="w"> </span> <span class="k">return</span> <span class="n">separator</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">t</span><span class="o">.</span><span class="n">emoji</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">tiles</span><span class="p">)</span> </pre> <p>For brevity, I presented a minimal version of <tt 
class="docutils literal">GuessScore.make</tt> above. The version in my <a class="reference external" href="https://github.com/georgevreilly/wordle">Wordle repository</a> has robust validation.</p> <p>Let's add the main class, <tt class="docutils literal">WordleGuesses</tt>:</p> <!-- wordle2 --> <pre class="code python literal-block"> <span class="nd">&#64;dataclass</span><span class="w"> </span><span class="k">class</span> <span class="nc">WordleGuesses</span><span class="p">:</span><span class="w"> </span> <span class="n">mask</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]</span> <span class="c1"># Exact match for position (Green/Correct)</span><span class="w"> </span> <span class="n">valid</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="c1"># Green/Correct or Yellow/Present</span><span class="w"> </span> <span class="n">invalid</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="c1"># Black/Absent</span><span class="w"> </span> <span class="n">wrong_spot</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="c1"># Wrong spot (Yellow/Present)</span><span class="w"> </span> <span class="n">guess_scores</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">GuessScore</span><span class="p">]</span><span class="w"> </span> <span class="nd">&#64;classmethod</span><span class="w"> </span> <span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">cls</span><span class="p">,</span> <span class="n">guess_scores</span><span 
class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">GuessScore</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="s2">&quot;WordleGuesses&quot;</span><span class="p">:</span><span class="w"> </span> <span class="n">mask</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="kc">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">WORDLE_LEN</span><span class="w"> </span> <span class="n">valid</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span><span class="w"> </span> <span class="n">invalid</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span><span class="w"> </span> <span class="n">wrong_spot</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[</span><span class="nb">set</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">WORDLE_LEN</span><span class="p">)]</span><span class="w"> </span> <span class="k">for</span> <span class="n">gs</span> <span class="ow">in</span> <span class="n">guess_scores</span><span class="p">:</span><span class="w"> </span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">t</span><span 
class="p">,</span> <span class="n">g</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">gs</span><span class="o">.</span><span class="n">tiles</span><span class="p">,</span> <span class="n">gs</span><span class="o">.</span><span class="n">guess</span><span class="p">)):</span><span class="w"> </span> <span class="k">if</span> <span class="n">t</span> <span class="ow">is</span> <span class="n">TileState</span><span class="o">.</span><span class="n">CORRECT</span><span class="p">:</span><span class="w"> </span> <span class="n">mask</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">g</span><span class="w"> </span> <span class="n">valid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w"> </span> <span class="k">elif</span> <span class="n">t</span> <span class="ow">is</span> <span class="n">TileState</span><span class="o">.</span><span class="n">PRESENT</span><span class="p">:</span><span class="w"> </span> <span class="n">wrong_spot</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w"> </span> <span class="n">valid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w"> </span> <span class="k">elif</span> <span class="n">t</span> <span class="ow">is</span> <span class="n">TileState</span><span class="o">.</span><span class="n">ABSENT</span><span class="p">:</span><span class="w"> </span> <span class="n">invalid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span 
class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="bp">cls</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">valid</span><span class="p">,</span> <span class="n">invalid</span><span class="p">,</span> <span class="n">wrong_spot</span><span class="p">,</span> <span class="n">guess_scores</span><span class="p">)</span> </pre> <p><tt class="docutils literal">WordleGuesses.parse</tt> is a bit shorter and clearer than <tt class="docutils literal">parse_guesses</tt>. It uses <tt class="docutils literal">TileState</tt> at each position to classify the current tile and accumulate state in the four member collections. Since <tt class="docutils literal">GuessScore.make</tt> has validated the input, <tt class="docutils literal">parse</tt> doesn't need to do any further validation.</p> <p>The <tt class="docutils literal">is_eligible</tt> method is essentially the same as its predecessor:</p> <!-- wordle2 --> <pre class="code python literal-block"> <span class="k">class</span> <span class="nc">WordleGuesses</span><span class="p">:</span><span class="w"> </span> <span class="k">def</span> <span class="nf">is_eligible</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">word</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span><span class="w"> </span> <span class="n">letters</span> <span class="o">=</span> <span class="p">{</span><span class="n">c</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">word</span><span class="p">}</span><span class="w"> </span> <span class="k">if</span> <span class="n">letters</span> <span class="o">&amp;</span> <span class="bp">self</span><span class="o">.</span><span class="n">valid</span> <span class="o">!=</span> <span class="bp">self</span><span 
class="o">.</span><span class="n">valid</span><span class="p">:</span><span class="w"> </span> <span class="c1"># Did not have the full set of green+yellow letters known to be valid</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">&quot;!Valid: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">False</span><span class="w"> </span> <span class="k">elif</span> <span class="nb">any</span><span class="p">(</span><span class="n">m</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">c</span> <span class="o">!=</span> <span class="n">m</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">mask</span><span class="p">)):</span><span class="w"> </span> <span class="c1"># Couldn't find all the green/correct letters</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">&quot;!Mask: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">False</span><span class="w"> </span> <span class="k">elif</span> <span class="n">letters</span> <span class="o">&amp;</span> <span class="bp">self</span><span class="o">.</span><span class="n">invalid</span><span class="p">:</span><span class="w"> </span> <span class="c1"># Invalid (black) letters are in 
the word</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">&quot;Invalid: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">False</span><span class="w"> </span> <span class="k">elif</span> <span class="nb">any</span><span class="p">(</span><span class="n">c</span> <span class="ow">in</span> <span class="n">ws</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">ws</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">wrong_spot</span><span class="p">)):</span><span class="w"> </span> <span class="c1"># Found some yellow letters: valid letters in wrong position</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">&quot;WrongSpot: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">False</span><span class="w"> </span> <span class="k">else</span><span class="p">:</span><span class="w"> </span> <span class="c1"># Potentially valid</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">&quot;Got: </span><span class="si">%s</span><span class="s2">&quot;</span><span class="p">,</span> <span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="kc">True</span><span class="w"> </span> <span class="k">def</span> 
<span class="nf">find_eligible</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">vocabulary</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span><span class="w"> </span> <span class="k">return</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">vocabulary</span> <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">is_eligible</span><span class="p">(</span><span class="n">w</span><span class="p">)]</span> </pre> <p>There's a <a class="reference external" href="https://www.spinellis.gr/blog/20200225/">famous story</a> where Donald Knuth was asked by Jon Bentley to demonstrate <a class="reference external" href="http://www.literateprogramming.com/">literate programming</a> by finding the <em>K</em> most common words from a text file. Knuth turned in an eight-page gem of WEB, which was reviewed by Doug McIlroy, who demonstrated that the task could also be accomplished in a six-line pipeline.</p> <p>Wordle can also be solved with a six-line pipeline, but the regexes are quite difficult to type correctly and they have to be carefully hand tailored for each set of guess–score pairs. There is no one general six-line pipeline.</p> <p>I know that I'd much rather work with these Python classes. 
As we'll see below, they are a solid foundation that can be built upon in many ways.</p> </div> <div class="section" id="does-it-work"> <h3>Does it Work?</h3> <p>Let's try it:</p> <pre class="code bash literal-block"> <span class="c1"># answer: ARBOR </span>$<span class="w"> </span>./wordle.py<span class="w"> </span><span class="nv">HARES</span><span class="o">=</span>.ar..<span class="w"> </span><span class="nv">GUILT</span><span class="o">=</span>.....<span class="w"> </span><span class="nv">CROAK</span><span class="o">=</span>.Roa.<span class="w"> </span><span class="nv">BRAVO</span><span class="o">=</span>bRa.o<span class="w"> </span>ARBOR<span class="w"> </span><span class="c1"># answer: CACHE </span>$<span class="w"> </span>./wordle.py<span class="w"> </span><span class="nv">CHAIR</span><span class="o">=</span>Cha..<span class="w"> </span><span class="nv">CLASH</span><span class="o">=</span>C.a.h<span class="w"> </span><span class="nv">CATCH</span><span class="o">=</span>CA.ch<span class="w"> </span>CACHE<span class="w"> </span>CAHOW<span class="w"> </span><span class="c1"># answer: TOXIC </span>$<span class="w"> </span>./wordle.py<span class="w"> </span><span class="nv">LEAKS</span><span class="o">=</span>.....<span class="w"> </span><span class="nv">MIGHT</span><span class="o">=</span>.i..t<span class="w"> </span><span class="nv">BLITZ</span><span class="o">=</span>..it.<span class="w"> </span><span class="nv">OPTIC</span><span class="o">=</span>o.tIC<span class="w"> </span><span class="nv">TONIC</span><span class="o">=</span>TO.IC<span class="w"> </span>TORIC<span class="w"> </span>TOXIC </pre> <p>This looks right, but there are some subtle bugs in the code.</p> </div> <div class="section" id="fifty-is-the-new-witty"> <h3>Fifty is the new Witty</h3> <p>Here we expect to find <tt class="docutils literal">FIFTY</tt>, but no words match:</p> <pre class="code bash literal-block"> <span class="c1"># answer: FIFTY </span>$<span class="w"> </span>./wordle.py<span
class="w"> </span><span class="nv">HARES</span><span class="o">=</span>.....<span class="w"> </span><span class="nv">BUILT</span><span class="o">=</span>..i.t<span class="w"> </span><span class="nv">TIMID</span><span class="o">=</span>tI...<span class="w"> </span><span class="nv">PINTO</span><span class="o">=</span>.I.T.<span class="w"> </span><span class="nv">WITTY</span><span class="o">=</span>.I.TY<span class="w"> </span>--None-- </pre> <p>Let's take a look at the state of the <tt class="docutils literal">WordleGuesses</tt> instance:</p> <pre class="code pycon literal-block"> <span class="gp">&gt;&gt;&gt; </span><span class="n">guess_scores</span> <span class="o">=</span> <span class="p">[</span><span class="n">GuessScore</span><span class="o">.</span><span class="n">make</span><span class="p">(</span><span class="n">gs</span><span class="p">)</span> <span class="k">for</span> <span class="n">gs</span> <span class="ow">in</span><span class="w"> </span><span class="go"> &quot;HARES=..... BUILT=..i.t TIMID=tI... PINTO=.I.T. WITTY=.I.TY&quot;.split()] </span><span class="gp">&gt;&gt;&gt; </span><span class="n">wg</span> <span class="o">=</span> <span class="n">WordleGuesses</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">guess_scores</span><span class="p">)</span><span class="w"> </span><span class="gp">&gt;&gt;&gt; </span><span class="n">wg</span><span class="w"> </span><span class="go">WordleGuesses(mask=[None, 'I', None, 'T', 'Y'], valid={'T', 'I', 'Y'}, invalid={ 'A', 'E', 'D', 'M', 'U', 'H', 'I', 'B', 'L', 'T', 'P', 'O', 'R', 'W', 'N', 'S'}, wrong_spot=[{'T'}, set(), {'I'}, set(), {'T'}], guess_scores=[GuessScore(guess='HARES', score='.....', tiles=[&lt;TileState.ABSENT: TileState(value=3, emoji='⬛', color='Black', css_color='#838184')&gt;, &lt;TileState.ABSENT: TileState(value=3, emoji='⬛', color='Black', css_color='#838184')&gt;, ... 
much snipped ...</span> </pre> <p>That's ugly.</p> </div> <div class="section" id="better-string-representation"> <h3>Better String Representation</h3> <p>Let's write a few helper functions to improve the <tt class="docutils literal">__repr__</tt>:</p> <!-- wordle3 --> <pre class="code python literal-block">
<span class="k">def</span> <span class="nf">letter_set</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span><span class="w">
</span>    <span class="k">return</span> <span class="s2">&quot;&quot;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">s</span><span class="p">))</span><span class="w">

</span><span class="k">def</span> <span class="nf">letter_sets</span><span class="p">(</span><span class="n">ls</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]])</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span><span class="w">
</span>    <span class="k">return</span> <span class="s2">&quot;[&quot;</span> <span class="o">+</span> <span class="s2">&quot;,&quot;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">letter_set</span><span class="p">(</span><span class="n">e</span><span class="p">)</span> <span class="ow">or</span> <span class="s2">&quot;-&quot;</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">ls</span><span class="p">)</span> <span class="o">+</span> <span class="s2">&quot;]&quot;</span><span class="w">

</span><span class="k">def</span> <span class="nf">dash_mask</span><span class="p">(</span><span class="n">mask</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]):</span><span class="w">
</span>    <span class="k">return</span> <span class="s2">&quot;&quot;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">m</span> <span class="ow">or</span> <span class="s2">&quot;-&quot;</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">mask</span><span class="p">)</span><span class="w">

</span><span class="k">class</span> <span class="nc">WordleGuesses</span><span class="p">:</span><span class="w">
</span>    <span class="k">def</span> <span class="fm">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span><span class="w">
</span>        <span class="n">mask</span> <span class="o">=</span> <span class="n">dash_mask</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mask</span><span class="p">)</span><span class="w">
</span>        <span class="n">valid</span> <span class="o">=</span> <span class="n">letter_set</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">valid</span><span class="p">)</span><span class="w">
</span>        <span class="n">invalid</span> <span class="o">=</span> <span class="n">letter_set</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">invalid</span><span class="p">)</span><span class="w">
</span>        <span class="n">wrong_spot</span> <span class="o">=</span> <span class="n">letter_sets</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">wrong_spot</span><span class="p">)</span><span class="w">
</span>        <span class="n">unused</span> <span class="o">=</span> <span class="n">letter_set</span><span class="p">(</span><span class="w">
</span>            <span class="nb">set</span><span class="p">(</span><span class="n">string</span><span class="o">.</span><span class="n">ascii_uppercase</span><span class="p">)</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">valid</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">invalid</span><span class="p">)</span><span class="w">
</span>        <span class="n">_guess_scores</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&quot;, &quot;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">gs</span><span class="si">}</span><span class="s2">|</span><span class="si">{</span><span class="n">gs</span><span class="o">.</span><span class="n">emojis</span><span class="p">()</span><span class="si">}</span><span class="s2">&quot;</span><span class="w">
</span>            <span class="k">for</span> <span class="n">gs</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">guess_scores</span><span class="p">)]</span><span class="w">
</span>        <span class="k">return</span> <span class="p">(</span><span class="w">
</span>            <span class="sa">f</span><span class="s2">&quot;WordleGuesses(</span><span class="si">{</span><span class="n">mask</span><span class="si">=}</span><span class="s2">, </span><span class="si">{</span><span class="n">valid</span><span class="si">=}</span><span class="s2">, </span><span class="si">{</span><span class="n">invalid</span><span class="si">=}</span><span class="s2">,</span><span class="se">\n</span><span class="s2">&quot;</span><span class="w">
</span>            <span class="sa">f</span><span class="s2">&quot; </span><span class="si">{</span><span class="n">wrong_spot</span><span class="si">=}</span><span class="s2">, </span><span class="si">{</span><span class="n">unused</span><span class="si">=}</span><span class="s2">)&quot;</span><span class="w">
</span>        <span class="p">)</span>
</pre> <p>Let's run it again, printing out the instance:</p> <pre class="code bash literal-block"> <span class="c1"># answer: FIFTY </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">HARES</span><span class="o">=</span>.....<span class="w"> </span><span class="nv">BUILT</span><span class="o">=</span>..i.t<span class="w"> </span><span class="nv">TIMID</span><span class="o">=</span>tI...<span class="w"> </span><span class="nv">PINTO</span><span class="o">=</span>.I.T.<span class="w"> </span><span class="nv">WITTY</span><span class="o">=</span>.I.TY<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'-I-TY'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'ITY'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'ABDEHILMNOPRSTUW'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[T,-,I,-,T]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'CFGJKQVXZ'</span><span class="o">)</span><span class="w"> </span><span class="nv">guess_scores</span><span class="o">=</span><span class="w"> </span><span class="o">[</span><span class="s1">'HARES=.....|⬛⬛⬛⬛⬛, BUILT=..i.t|⬛⬛🟨⬛🟨, TIMID=tI...|🟨🟩⬛⬛⬛, PINTO=.I.T.|⬛🟩⬛🟩⬛, WITTY=.I.TY|⬛🟩⬛🟩🟩'</span><span class="o">]</span><span class="w"> </span>--None-- </pre> <p>That's a huge improvement in legibility over the default string representation!</p> <p>There's a <tt class="docutils literal">T</tt> in both <tt class="docutils literal">valid</tt> and <tt class="docutils literal">invalid</tt>—two sets that should be mutually exclusive. 
The first “absent” <tt class="docutils literal">T</tt> at position 3 in <tt class="docutils literal">WITTY</tt> has poisoned the second <tt class="docutils literal">T</tt> at position 4, which is “correct”. The <tt class="docutils literal">T</tt> at position 1 in <tt class="docutils literal">TIMID</tt> and the <tt class="docutils literal">T</tt> at position 5 in <tt class="docutils literal">BUILT</tt> are “present” because they are the only <tt class="docutils literal">T</tt> in those guesses. The same thing happens with <tt class="docutils literal">I</tt>: the “absent” <tt class="docutils literal">I</tt> at position 4 in <tt class="docutils literal">TIMID</tt> has put <tt class="docutils literal">I</tt> into <tt class="docutils literal">invalid</tt>, even though the <tt class="docutils literal">I</tt> at position 2 is “correct”.</p> <p>When there are two <tt class="docutils literal">T</tt>s in a guess, but only one <tt class="docutils literal">T</tt> in the answer, one of the <tt class="docutils literal">T</tt>s will be either “correct” or “present”. The second, superfluous <tt class="docutils literal">T</tt> will be “absent”.</p> </div> <div class="section" id="first-attempt-at-fixing-the-bug"> <h3>First Attempt at Fixing the Bug</h3> <p>Let's modify <tt class="docutils literal">WordleGuesses.parse</tt> slightly to address that. 
When we get an <tt class="docutils literal">ABSENT</tt> tile, we should add that letter to <tt class="docutils literal">invalid</tt> only if it's not already in <tt class="docutils literal">valid</tt>.</p> <!-- wordle4 --> <pre class="code python literal-block">
<span class="k">class</span> <span class="nc">WordleGuesses</span><span class="p">:</span><span class="w">
</span>    <span class="nd">&#64;classmethod</span><span class="w">
</span>    <span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">cls</span><span class="p">,</span> <span class="n">guess_scores</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">GuessScore</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="s2">&quot;WordleGuesses&quot;</span><span class="p">:</span><span class="w">
</span>        <span class="n">mask</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="kc">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">WORDLE_LEN</span><span class="w">
</span>        <span class="n">valid</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span><span class="w">
</span>        <span class="n">invalid</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span><span class="w">
</span>        <span class="n">wrong_spot</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[</span><span class="nb">set</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">WORDLE_LEN</span><span class="p">)]</span><span class="w">
</span>        <span class="k">for</span> <span class="n">gs</span> <span class="ow">in</span> <span class="n">guess_scores</span><span class="p">:</span><span class="w">
</span>            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">g</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">gs</span><span class="o">.</span><span class="n">tiles</span><span class="p">,</span> <span class="n">gs</span><span class="o">.</span><span class="n">guess</span><span class="p">)):</span><span class="w">
</span>                <span class="k">if</span> <span class="n">t</span> <span class="ow">is</span> <span class="n">TileState</span><span class="o">.</span><span class="n">CORRECT</span><span class="p">:</span><span class="w">
</span>                    <span class="n">mask</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">g</span><span class="w">
</span>                    <span class="n">valid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span>                <span class="k">elif</span> <span class="n">t</span> <span class="ow">is</span> <span class="n">TileState</span><span class="o">.</span><span class="n">PRESENT</span><span class="p">:</span><span class="w">
</span>                    <span class="n">wrong_spot</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span>                    <span class="n">valid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span>                <span class="k">elif</span> <span class="n">t</span> <span class="ow">is</span> <span class="n">TileState</span><span class="o">.</span><span class="n">ABSENT</span><span class="p">:</span><span class="w">
</span>                    <span class="k">if</span> <span class="n">g</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">valid</span><span class="p">:</span>  <span class="c1"># &lt;&lt;&lt; new</span><span class="w">
</span>                        <span class="n">invalid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span>        <span class="k">return</span> <span class="bp">cls</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">valid</span><span class="p">,</span> <span class="n">invalid</span><span class="p">,</span> <span class="n">wrong_spot</span><span class="p">,</span> <span class="n">guess_scores</span><span class="p">)</span>
</pre> <p>Does it work? Yes! 
Now we have <tt class="docutils literal">FIFTY</tt>.</p> <pre class="code bash literal-block"> <span class="c1"># answer: FIFTY </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">HARES</span><span class="o">=</span>.....<span class="w"> </span><span class="nv">BUILT</span><span class="o">=</span>..i.t<span class="w"> </span><span class="nv">TIMID</span><span class="o">=</span>tI...<span class="w"> </span><span class="nv">PINTO</span><span class="o">=</span>.I.T.<span class="w"> </span><span class="nv">WITTY</span><span class="o">=</span>.I.TY<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'-I-TY'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'ITY'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'ABDEHLMNOPRSUW'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[T,-,I,-,T]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'CFGJKQVXZ'</span><span class="o">)</span><span class="w"> </span>FIFTY<span class="w"> </span>JITTY<span class="w"> </span>KITTY<span class="w"> </span>ZITTY </pre> <p>But we also have <tt class="docutils literal">JITTY</tt>, <tt class="docutils literal">KITTY</tt>, and <tt class="docutils literal">ZITTY</tt>, which should not have been considered eligible, since <tt class="docutils literal">WITTY</tt> was eliminated for the <tt class="docutils literal">T</tt> at position 3. We'll come back to this soon.</p> </div> <div class="section" id="the-problem-of-repeated-letters"> <h3>The Problem of Repeated Letters</h3> <p>There's a problem that we haven't grappled with properly yet: <em>repeated letters</em> in a guess or in an answer. 
We've made an implicit assumption that there are five distinct letters in each guess and in the answer.</p> <p>Here's an example that fails with the original <tt class="docutils literal">parse</tt>:</p> <pre class="code bash literal-block"> <span class="c1"># answer: EMPTY </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">LODGE</span><span class="o">=</span>....e<span class="w"> </span><span class="nv">WIPER</span><span class="o">=</span>..Pe.<span class="w"> </span><span class="nv">TEPEE</span><span class="o">=</span>teP..<span class="w"> </span><span class="nv">EXPAT</span><span class="o">=</span>E.P.t<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'E-P--'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'EPT'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'ADEGILORWX'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[T,E,-,E,ET]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'BCFHJKMNQSUVYZ'</span><span class="o">)</span><span class="w"> </span>--None-- </pre> <p>but works with the current <tt class="docutils literal">parse</tt>:</p> <pre class="code bash literal-block"> <span class="c1"># answer: EMPTY </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">LODGE</span><span class="o">=</span>....e<span class="w"> </span><span class="nv">WIPER</span><span class="o">=</span>..Pe.<span class="w"> </span><span class="nv">TEPEE</span><span class="o">=</span>teP..<span class="w"> </span><span class="nv">EXPAT</span><span class="o">=</span>E.P.t<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span 
class="o">=</span><span class="s1">'E-P--'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'EPT'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'ADGILORWX'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[T,E,-,E,ET]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'BCFHJKMNQSUVYZ'</span><span class="o">)</span><span class="w"> </span>EMPTS<span class="w"> </span>EMPTY </pre> <p>Note that there is no longer an <tt class="docutils literal">E</tt> in <tt class="docutils literal">invalid</tt>. In <tt class="docutils literal">TEPEE=teP..</tt>, the <tt class="docutils literal">E</tt> in position 2 is considered “present”, while the two <tt class="docutils literal">E</tt>s in positions 4 and 5 are marked “absent”. This tells us that there is only one <tt class="docutils literal">E</tt> in the answer. That <tt class="docutils literal">E</tt> cannot be in position 2, 4, or 5, and <tt class="docutils literal">P</tt> is correct in position 3 of <tt class="docutils literal">TEPEE</tt>, so the <tt class="docutils literal">E</tt> must be in position 1. This is confirmed by the subsequent <tt class="docutils literal">EXPAT=E.P.t</tt>, where the initial <tt class="docutils literal">E</tt> is marked “correct”.</p> <p>Our previous understanding of “absent” was too simple. An “absent” tile can mean one of two things:</p> <ol class="arabic simple"> <li>This letter is not in the answer at all—the usual case.</li> <li>If another copy of this letter is “correct” or “present” elsewhere in the same guess (i.e., <em>valid</em>), the letter is superfluous at this position. 
The guess has more instances of this letter than the answer does.</li> </ol> <p>Consider the results here:</p> <pre class="code bash literal-block"> <span class="c1"># answer: STYLE </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">GROAN</span><span class="o">=</span>.....<span class="w"> </span><span class="nv">WHILE</span><span class="o">=</span>...LE<span class="w"> </span><span class="nv">BELLE</span><span class="o">=</span>...LE<span class="w"> </span><span class="nv">TUPLE</span><span class="o">=</span>t..LE<span class="w"> </span><span class="nv">STELE</span><span class="o">=</span>ST.LE<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'ST-LE'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'ELST'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'ABGHINOPRUW'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[T,-,-,-,-]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'CDFJKMQVXYZ'</span><span class="o">)</span><span class="w"> </span>STELE<span class="w"> </span>STYLE </pre> <p><tt class="docutils literal">STELE</tt> was an incorrect guess, so it should not have been offered as an eligible word. 
<tt class="docutils literal">E</tt>&nbsp;is valid in position 5, but wrong in position 3.</p> <p>Another example:</p> <pre class="code bash literal-block"> <span class="c1"># answer: WRITE </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">SABER</span><span class="o">=</span>...er<span class="w"> </span><span class="nv">REFIT</span><span class="o">=</span>re.it<span class="w"> </span><span class="nv">TRITE</span><span class="o">=</span>.RITE<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'-RITE'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'EIRT'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'ABFS'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[R,E,-,EI,RT]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'CDGHJKLMNOPQUVWXYZ'</span><span class="o">)</span><span class="w"> </span>TRITE<span class="w"> </span>URITE<span class="w"> </span>WRITE </pre> <p><tt class="docutils literal">TRITE</tt> was an incorrect guess, so it should not have been offered. 
<tt class="docutils literal">4:T</tt> is valid, <tt class="docutils literal">1:T</tt> is wrong.</p> </div> <div class="section" id="fixing-repeated-absent-letters"> <h3>Fixing Repeated Absent Letters</h3> <p>We can fix this by making two passes through the tiles for each guess–score pair.</p> <ol class="arabic simple"> <li>Handle “correct” and “present” tiles as before.</li> <li>Add “absent” tiles to either <tt class="docutils literal">invalid</tt> or <tt class="docutils literal">wrong_spot</tt>.</li> </ol> <p>We need the second pass to handle a case like <tt class="docutils literal"><span class="pre">WITTY=.I.TY</span></tt>, where the “absent” <tt class="docutils literal">3:T</tt> precedes the “correct” <tt class="docutils literal">4:T</tt>: the <tt class="docutils literal">valid</tt> set must be fully updated before we process “absent” tiles.</p> <!-- wordle5 --> <pre class="code python literal-block">
<span class="k">class</span> <span class="nc">WordleGuesses</span><span class="p">:</span><span class="w">
</span>    <span class="nd">&#64;classmethod</span><span class="w">
</span>    <span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">cls</span><span class="p">,</span> <span class="n">guess_scores</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">GuessScore</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="s2">&quot;WordleGuesses&quot;</span><span class="p">:</span><span class="w">
</span>        <span class="n">mask</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="kc">None</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">WORDLE_LEN</span><span class="p">)]</span><span class="w">
</span>        <span class="n">valid</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span><span class="w">
</span>        <span class="n">invalid</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span><span class="w">
</span>        <span class="n">wrong_spot</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[</span><span class="nb">set</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">WORDLE_LEN</span><span class="p">)]</span><span class="w">
</span>        <span class="k">for</span> <span class="n">gs</span> <span class="ow">in</span> <span class="n">guess_scores</span><span class="p">:</span><span class="w">
</span>            <span class="c1"># First pass for correct and present</span><span class="w">
</span>            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">g</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">gs</span><span class="o">.</span><span class="n">tiles</span><span class="p">,</span> <span class="n">gs</span><span class="o">.</span><span class="n">guess</span><span class="p">)):</span><span class="w">
</span>                <span class="k">if</span> <span class="n">t</span> <span class="ow">is</span> <span class="n">TileState</span><span class="o">.</span><span class="n">CORRECT</span><span class="p">:</span><span class="w">
</span>                    <span class="n">mask</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">g</span><span class="w">
</span>                    <span class="n">valid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span>                <span class="k">elif</span> <span class="n">t</span> <span class="ow">is</span> <span class="n">TileState</span><span class="o">.</span><span class="n">PRESENT</span><span class="p">:</span><span class="w">
</span>                    <span class="n">wrong_spot</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span>                    <span class="n">valid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span>            <span class="c1"># Second pass for absent letters</span><span class="w">
</span>            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">g</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">gs</span><span class="o">.</span><span class="n">tiles</span><span class="p">,</span> <span class="n">gs</span><span class="o">.</span><span class="n">guess</span><span class="p">)):</span><span class="w">
</span>                <span class="k">if</span> <span class="n">t</span> <span class="ow">is</span> <span class="n">TileState</span><span class="o">.</span><span class="n">ABSENT</span><span class="p">:</span><span class="w">
</span>                    <span class="k">if</span> <span class="n">g</span> <span class="ow">in</span> <span class="n">valid</span><span class="p">:</span><span class="w">
</span>                        <span class="c1"># There are more instances of `g` in `gs.guess`</span><span class="w">
</span>                        <span class="c1"># than in the answer</span><span class="w">
</span>                        <span class="n">wrong_spot</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span>                    <span class="k">else</span><span class="p">:</span><span class="w">
</span>                        <span class="n">invalid</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span>        <span class="k">return</span> <span class="bp">cls</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">valid</span><span class="p">,</span> <span class="n">invalid</span><span class="p">,</span> <span class="n">wrong_spot</span><span class="p">,</span> <span class="n">guess_scores</span><span class="p">)</span>
</pre> <p>We can see that <tt class="docutils literal">valid</tt> and <tt class="docutils literal">invalid</tt> are disjoint. 
The <tt class="docutils literal">is_eligible</tt> method needs no changes.</p> <p>Let's try the <tt class="docutils literal">WRITE</tt> example again:</p> <pre class="code bash literal-block"> <span class="c1"># answer: WRITE </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">SABER</span><span class="o">=</span>...er<span class="w"> </span><span class="nv">REFIT</span><span class="o">=</span>re.it<span class="w"> </span><span class="nv">TRITE</span><span class="o">=</span>.RITE<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'-RITE'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'EIRT'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'ABFS'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[RT,E,-,EI,RT]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'CDGHJKLMNOPQUVWXYZ'</span><span class="o">)</span><span class="w"> </span>URITE<span class="w"> </span>WRITE </pre> <p>There is now a <tt class="docutils literal">T</tt> in the first <tt class="docutils literal">wrong_spot</tt> entry.</p> <p>And <tt class="docutils literal">STYLE</tt>?</p> <pre class="code bash literal-block"> <span class="c1"># answer: STYLE </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">GROAN</span><span class="o">=</span>.....<span class="w"> </span><span class="nv">WHILE</span><span class="o">=</span>...LE<span class="w"> </span><span class="nv">BELLE</span><span class="o">=</span>...LE<span class="w"> </span><span class="nv">TUPLE</span><span class="o">=</span>t..LE<span class="w"> </span><span class="nv">STELE</span><span class="o">=</span>ST.LE<span class="w"> 
</span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'ST-LE'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'ELST'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'ABGHINOPRUW'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[T,E,EL,-,-]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'CDFJKMQVXYZ'</span><span class="o">)</span><span class="w"> </span>STYLE </pre> <p>Both the second and third <tt class="docutils literal">wrong_spot</tt> entries now contain an <tt class="docutils literal">E</tt>. The “absent” <tt class="docutils literal">3:L</tt> from <tt class="docutils literal">BELLE</tt> is also in the third <tt class="docutils literal">wrong_spot</tt> entry.</p> <p>What about some other examples?</p> <p>In our previous attempt at fixing the bug, neither <tt class="docutils literal">QUICK</tt> nor <tt class="docutils literal">SPICK</tt> was found because the first <tt class="docutils literal">C</tt> in <tt class="docutils literal">CHICK</tt> was “absent” and thus marked invalid. 
Now, the <tt class="docutils literal">valid</tt> and <tt class="docutils literal">invalid</tt> sets are disjoint, there's a <tt class="docutils literal">C</tt> in the first element of <tt class="docutils literal">wrong_spot</tt>, and both words are found:</p> <pre class="code bash literal-block"> <span class="c1"># answer: QUICK </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">MORAL</span><span class="o">=</span>.....<span class="w"> </span><span class="nv">TWINE</span><span class="o">=</span>..I..<span class="w"> </span><span class="nv">CHICK</span><span class="o">=</span>..ICK<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'--ICK'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'CIK'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'AEHLMNORTW'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[C,-,-,-,-]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'BDFGJPQSUVXYZ'</span><span class="o">)</span><span class="w"> </span>QUICK<span class="w"> </span>SPICK </pre> <p>As expected, we find only one answer for <tt class="docutils literal">FIFTY</tt> now:</p> <pre class="code bash literal-block"> <span class="c1"># answer: FIFTY </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">HARES</span><span class="o">=</span>.....<span class="w"> </span><span class="nv">BUILT</span><span class="o">=</span>..i.t<span class="w"> </span><span class="nv">TIMID</span><span class="o">=</span>tI...<span class="w"> </span><span class="nv">PINTO</span><span class="o">=</span>.I.T.<span class="w"> </span><span class="nv">WITTY</span><span class="o">=</span>.I.TY<span 
class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'-I-TY'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'ITY'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'ABDEHLMNOPRSUW'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[T,-,IT,I,T]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'CFGJKQVXZ'</span><span class="o">)</span><span class="w"> </span>FIFTY </pre> <p>The new <tt class="docutils literal">T</tt> in the third element of <tt class="docutils literal">wrong_spot</tt> blocks the rhymes for <tt class="docutils literal">WITTY</tt>.</p> </div> <div class="section" id="further-optimization-of-the-mask"> <h3>Further Optimization of the Mask</h3> <p>There's still room for improvement. If you guess <tt class="docutils literal">ANGLE=ANGle</tt>, it's immediately obvious (to a human player) that you should swap the <tt class="docutils literal">L</tt> and <tt class="docutils literal">E</tt> to guess <tt class="docutils literal">ANGEL</tt> on your next turn. 
Or swap the <tt class="docutils literal">P</tt> and <tt class="docutils literal">T</tt> in <tt class="docutils literal">SPRAT=SpRAt</tt> to guess <tt class="docutils literal">STRAP</tt>.</p> <p>Similarly, <tt class="docutils literal">TENET=TEN.t</tt> tells you that the fourth letter of the answer must be <tt class="docutils literal">T</tt>, while <tt class="docutils literal">CHORE=C.OrE</tt> must have <tt class="docutils literal">2:R</tt>.</p> <p>A more complex example:</p> <pre class="code bash literal-block"> <span class="c1"># answer: BURLY </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-v<span class="w"> </span><span class="nv">LOWER</span><span class="o">=</span>l...r<span class="w"> </span><span class="nv">FRAIL</span><span class="o">=</span>.r..l<span class="w"> </span><span class="nv">BLURT</span><span class="o">=</span>Blur.<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span><span class="s1">'B----'</span>,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span><span class="s1">'BLRU'</span>,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span><span class="s1">'AEFIOTW'</span>,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=</span><span class="s1">'[L,LR,U,R,LR]'</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span><span class="s1">'CDGHJKMNPQSVXYZ'</span><span class="o">)</span> </pre> <p>The <tt class="docutils literal">R</tt> is in the wrong spot in positions 5 (<tt class="docutils literal"><span class="pre">l...r</span></tt>), 2 (<tt class="docutils literal"><span class="pre">.r..l</span></tt>), and 4 (<tt class="docutils literal">Blur.</tt>). The <tt class="docutils literal">B</tt> is correct in position 1, so <tt class="docutils literal">R</tt> must be in position 3.</p> <p>The <tt class="docutils literal">L</tt> is in the wrong spot in positions 1, 5, and 2. 
<tt class="docutils literal">B</tt> is in position 1, <tt class="docutils literal">R</tt> is now in 3, so that leaves only position 4 for <tt class="docutils literal">L</tt>.</p> <p>There remain two possibilities for <tt class="docutils literal">U</tt>—positions 2 and 5; the information contained in <tt class="docutils literal">mask</tt> and <tt class="docutils literal">wrong_spot</tt> is not enough to determine where <tt class="docutils literal">U</tt> should go.</p> <p>The original mask, <tt class="docutils literal"><span class="pre">B----</span></tt>, was due to having only one “correct” letter. Using the cumulative information in the guesses and scores, we can infer a mask of <tt class="docutils literal"><span class="pre">B-RL-</span></tt>.</p> <p>In all of these cases, we can find exactly one remaining position where a “present” letter can be placed. In the <tt class="docutils literal">BURLY</tt> example, it takes two passes: we couldn't uniquely determine a place for <tt class="docutils literal">L</tt> until we had already placed <tt class="docutils literal">R</tt>.</p> <p>Up to now, we've been treating each tile in almost complete isolation. Let's optimize the mask programmatically.</p> <p>To account for repeated letters, such as the two <tt class="docutils literal">T</tt>s in <tt class="docutils literal">TENET=TEN.t</tt>, we use Python's <tt class="docutils literal">collections.Counter</tt> as a <a class="reference external" href="https://dbader.org/blog/sets-and-multiset-in-python">multiset</a>. <tt class="docutils literal">Counter</tt>'s union operation, <tt class="docutils literal">|=</tt>, computes the maximum of corresponding counts.</p> <p>First, we loop through <em>all</em> the guess–score pairs, building a <tt class="docutils literal">valid</tt> multiset of the “correct” and “present” letters. 
Then we subtract a multiset of the “correct” letters, yielding a multiset of the “present” letters.</p> <p>Second, we loop over <tt class="docutils literal">present</tt>, trying for each letter to find a single empty position where it can be placed in the mask. If there is such a position, we update <tt class="docutils literal">mask2</tt>, remove the letter from <tt class="docutils literal">present</tt>, and break out of the inner loop. If there isn't (as in the two possibilities for <tt class="docutils literal">U</tt> in <tt class="docutils literal">BURLY</tt>), then we use the little-known <a class="reference external" href="https://python-notes.curiousefficiency.org/en/latest/python_concepts/break_else.html">break-else</a> construct to exit from the outer loop.</p> <p>Finally, we merge <tt class="docutils literal">mask2</tt> into <tt class="docutils literal">self.mask</tt>. This <tt class="docutils literal">optimize</tt> method is called from the end of <tt class="docutils literal">WordleGuesses.parse</tt>.</p> <!-- wordle --> <pre class="code python literal-block"> <span class="k">class</span> <span class="nc">WordleGuesses</span><span class="p">:</span><span class="w"> </span> <span class="k">def</span> <span class="nf">optimize</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]:</span><span class="w"> </span><span class="sd">&quot;&quot;&quot;Use PRESENT tiles to improve `mask`.&quot;&quot;&quot;</span><span class="w"> </span> <span class="n">mask1</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">mask</span><span 
class="w"> </span> <span class="n">mask2</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="kc">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">WORDLE_LEN</span><span class="w"> </span> <span class="c1"># Compute `valid`, a multiset of the correct and present letters in all guesses</span><span class="w"> </span> <span class="n">valid</span><span class="p">:</span> <span class="n">Counter</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span><span class="w"> </span> <span class="k">for</span> <span class="n">gs</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">guess_scores</span><span class="p">:</span><span class="w"> </span> <span class="n">valid</span> <span class="o">|=</span> <span class="n">Counter</span><span class="p">(</span><span class="w"> </span> <span class="n">g</span> <span class="k">for</span> <span class="n">g</span><span class="p">,</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">gs</span><span class="o">.</span><span class="n">guess</span><span class="p">,</span> <span class="n">gs</span><span class="o">.</span><span class="n">tiles</span><span class="p">)</span> <span class="k">if</span> <span class="n">t</span> <span class="ow">is</span> <span class="ow">not</span> <span class="n">TileState</span><span class="o">.</span><span class="n">ABSENT</span><span class="w"> </span> <span class="p">)</span><span class="w"> </span> <span class="n">correct</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">c</span> <span class="k">for</span> <span 
class="n">c</span> <span class="ow">in</span> <span class="n">mask1</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">)</span><span class="w"> </span> <span class="c1"># Compute `present`, a multiset of the valid letters</span><span class="w"> </span> <span class="c1"># whose correct position is not yet known; i.e., PRESENT in any row.</span><span class="w"> </span> <span class="n">present</span> <span class="o">=</span> <span class="n">valid</span> <span class="o">-</span> <span class="n">correct</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">valid</span><span class="si">=}</span><span class="s2"> </span><span class="si">{</span><span class="n">correct</span><span class="si">=}</span><span class="s2"> </span><span class="si">{</span><span class="n">present</span><span class="si">=}</span><span class="s2">&quot;</span><span class="p">)</span><span class="w"> </span> <span class="k">def</span> <span class="nf">available</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">i</span><span class="p">):</span><span class="w"> </span> <span class="s2">&quot;Can `c` be placed in slot `i` of `mask2`?&quot;</span><span class="w"> </span> <span class="k">return</span> <span class="n">mask1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="ow">is</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">mask2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="ow">is</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">c</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span 
class="o">.</span><span class="n">wrong_spot</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span> <span class="k">while</span> <span class="n">present</span><span class="p">:</span><span class="w"> </span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">present</span><span class="p">:</span><span class="w"> </span> <span class="n">positions</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">WORDLE_LEN</span><span class="p">)</span> <span class="k">if</span> <span class="n">available</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">i</span><span class="p">)]</span><span class="w"> </span> <span class="c1"># Is there only one position where `c` can be placed?</span><span class="w"> </span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">positions</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span><span class="w"> </span> <span class="n">i</span> <span class="o">=</span> <span class="n">positions</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="w"> </span> <span class="n">mask2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="w"> </span> <span class="n">present</span> <span class="o">-=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">c</span><span class="p">)</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span 
class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s2"> -&gt; </span><span class="si">{</span><span class="n">c</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span><span class="w"> </span> <span class="k">break</span><span class="w"> </span> <span class="k">else</span><span class="p">:</span><span class="w"> </span> <span class="c1"># We reach this for-else only if there was no `break` in the for-loop;</span><span class="w"> </span> <span class="c1"># i.e., no one-element `positions` was found in `present`.</span><span class="w"> </span> <span class="c1"># We must abandon the outer loop, even though `present` is not empty.</span><span class="w"> </span> <span class="k">break</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">present</span><span class="si">=}</span><span class="s2"> </span><span class="si">{</span><span class="n">mask2</span><span class="si">=}</span><span class="s2">&quot;</span><span class="p">)</span><span class="w"> </span> <span class="bp">self</span><span class="o">.</span><span class="n">mask</span> <span class="o">=</span> <span class="p">[</span><span class="n">m1</span> <span class="ow">or</span> <span class="n">m2</span> <span class="k">for</span> <span class="n">m1</span><span class="p">,</span> <span class="n">m2</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">mask1</span><span class="p">,</span> <span class="n">mask2</span><span class="p">)]</span><span class="w"> </span> <span class="n">logging</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="w"> </span> <span class="sa">f</span><span class="s2">&quot;</span><span class="se">\t</span><span class="s2">optimize: </span><span 
class="si">{</span><span class="n">dash_mask</span><span class="p">(</span><span class="n">mask1</span><span class="p">)</span><span class="si">}</span><span class="s2"> | </span><span class="si">{</span><span class="n">dash_mask</span><span class="p">(</span><span class="n">mask2</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span><span class="w"> </span> <span class="sa">f</span><span class="s2">&quot; =&gt; </span><span class="si">{</span><span class="n">dash_mask</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mask</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span><span class="w"> </span> <span class="p">)</span><span class="w"> </span> <span class="k">return</span> <span class="n">mask2</span> </pre> <p>Here are some examples of it in action. Going from <tt class="docutils literal"><span class="pre">---ET</span></tt> to <tt class="docutils literal"><span class="pre">-ESET</span></tt>:</p> <pre class="code bash literal-block"> <span class="c1"># answer: BESET </span>$<span class="w"> </span>./wordle.py<span class="w"> </span>-vv<span class="w"> </span><span class="nv">CIVET</span><span class="o">=</span>...ET<span class="w"> </span><span class="nv">EGRET</span><span class="o">=</span>e..ET<span class="w"> </span><span class="nv">SLEET</span><span class="o">=</span>s.eET<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span>---ET,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span>EST,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span>CGILRV,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=[</span>ES,-,E,-,-<span class="o">]</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span>ABDFHJKMNOPQUWXYZ<span class="o">)</span><span class="w"> </span><span class="nv">valid</span><span 
class="o">=</span>Counter<span class="o">({</span><span class="s1">'E'</span>:<span class="w"> </span><span class="m">2</span>,<span class="w"> </span><span class="s1">'T'</span>:<span class="w"> </span><span class="m">1</span>,<span class="w"> </span><span class="s1">'S'</span>:<span class="w"> </span><span class="m">1</span><span class="o">})</span><span class="w"> </span><span class="nv">correct</span><span class="o">=</span>Counter<span class="o">({</span><span class="s1">'E'</span>:<span class="w"> </span><span class="m">1</span>,<span class="w"> </span><span class="s1">'T'</span>:<span class="w"> </span><span class="m">1</span><span class="o">})</span><span class="w"> </span><span class="nv">present</span><span class="o">=</span>Counter<span class="o">({</span><span class="s1">'E'</span>:<span class="w"> </span><span class="m">1</span>,<span class="w"> </span><span class="s1">'S'</span>:<span class="w"> </span><span class="m">1</span><span class="o">})</span><span class="w"> </span><span class="m">2</span><span class="w"> </span>-&gt;<span class="w"> </span>E<span class="w"> </span><span class="m">3</span><span class="w"> </span>-&gt;<span class="w"> </span>S<span class="w"> </span><span class="nv">present</span><span class="o">=</span>Counter<span class="o">()</span><span class="w"> </span><span class="nv">mask2</span><span class="o">=[</span>None,<span class="w"> </span><span class="s1">'E'</span>,<span class="w"> </span><span class="s1">'S'</span>,<span class="w"> </span>None,<span class="w"> </span>None<span class="o">]</span><span class="w"> </span>optimize:<span class="w"> </span>---ET<span class="w"> </span><span class="p">|</span><span class="w"> </span>-ES--<span class="w"> </span><span class="o">=</span>&gt;<span class="w"> </span>-ESET </pre> <p>And from <tt class="docutils literal"><span class="pre">C----</span></tt> to <tt class="docutils literal">CLER-</tt>:</p> <pre class="code bash literal-block"> <span class="c1"># answer: CLERK </span>$<span 
class="w"> </span>./wordle.py<span class="w"> </span>-vv<span class="w"> </span><span class="nv">SINCE</span><span class="o">=</span>...ce<span class="w"> </span><span class="nv">CEDAR</span><span class="o">=</span>Ce..r<span class="w"> </span><span class="nv">CRUEL</span><span class="o">=</span>Cr.el<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span>C----,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span>CELR,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span>ADINSU,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=[</span>-,ER,-,CE,ELR<span class="o">]</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span>BFGHJKMOPQTVWXYZ<span class="o">)</span><span class="w"> </span><span class="nv">valid</span><span class="o">=</span>Counter<span class="o">({</span><span class="s1">'C'</span>:<span class="w"> </span><span class="m">1</span>,<span class="w"> </span><span class="s1">'E'</span>:<span class="w"> </span><span class="m">1</span>,<span class="w"> </span><span class="s1">'R'</span>:<span class="w"> </span><span class="m">1</span>,<span class="w"> </span><span class="s1">'L'</span>:<span class="w"> </span><span class="m">1</span><span class="o">})</span><span class="w"> </span><span class="nv">correct</span><span class="o">=</span>Counter<span class="o">({</span><span class="s1">'C'</span>:<span class="w"> </span><span class="m">1</span><span class="o">})</span><span class="w"> </span><span class="nv">present</span><span class="o">=</span>Counter<span class="o">({</span><span class="s1">'E'</span>:<span class="w"> </span><span class="m">1</span>,<span class="w"> </span><span class="s1">'R'</span>:<span class="w"> </span><span class="m">1</span>,<span class="w"> </span><span class="s1">'L'</span>:<span class="w"> </span><span class="m">1</span><span class="o">})</span><span class="w"> </span><span 
class="m">3</span><span class="w"> </span>-&gt;<span class="w"> </span>E<span class="w"> </span><span class="m">4</span><span class="w"> </span>-&gt;<span class="w"> </span>R<span class="w"> </span><span class="m">2</span><span class="w"> </span>-&gt;<span class="w"> </span>L<span class="w"> </span><span class="nv">present</span><span class="o">=</span>Counter<span class="o">()</span><span class="w"> </span><span class="nv">mask2</span><span class="o">=[</span>None,<span class="w"> </span><span class="s1">'L'</span>,<span class="w"> </span><span class="s1">'E'</span>,<span class="w"> </span><span class="s1">'R'</span>,<span class="w"> </span>None<span class="o">]</span><span class="w"> </span>optimize:<span class="w"> </span>C----<span class="w"> </span><span class="p">|</span><span class="w"> </span>-LER-<span class="w"> </span><span class="o">=</span>&gt;<span class="w"> </span>CLER- </pre> </div> <div class="section" id="demanding-an-explanation"> <h3>Demanding an Explanation</h3> <p>Would you like to know <em>why</em> a guess is ineligible? 
We can do that too.</p> <pre class="code bash literal-block"> <span class="c1"># answer: ROUSE </span>$<span class="w"> </span>./wordle.py<span class="w"> </span><span class="nv">THIEF</span><span class="o">=</span>...e.<span class="w"> </span><span class="nv">BLADE</span><span class="o">=</span>....E<span class="w"> </span><span class="nv">GROVE</span><span class="o">=</span>.ro.E<span class="w"> </span><span class="se">\ </span><span class="w"> </span>--words<span class="w"> </span>ROMEO<span class="w"> </span>PROSE<span class="w"> </span>STORE<span class="w"> </span>MURAL<span class="w"> </span>ROUSE<span class="w"> </span>--explain<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span>----E,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span>EOR,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span>ABDFGHILTV,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=[</span>-,R,O,E,-<span class="o">]</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span>CJKMNPQSUWXYZ<span class="o">)</span><span class="w"> </span>guess_scores:<span class="w"> </span><span class="o">[</span><span class="s1">'THIEF=...e.|⬛⬛⬛🟨⬛, BLADE=....E|⬛⬛⬛⬛🟩, GROVE=.ro.E|⬛🟨🟨⬛🟩'</span><span class="o">]</span><span class="w"> </span>ROMEO<span class="w"> </span>Mask:<span class="w"> </span>needs<span class="w"> </span>----E<span class="p">;</span><span class="w"> </span>WrongSpot:<span class="w"> </span>has<span class="w"> </span>---E-<span class="w"> </span>PROSE<span class="w"> </span>WrongSpot:<span class="w"> </span>has<span class="w"> </span>-RO--<span class="w"> </span>STORE<span class="w"> </span>Invalid:<span class="w"> </span>has<span class="w"> </span>-T---<span class="p">;</span><span class="w"> </span>WrongSpot:<span class="w"> </span>has<span class="w"> </span>--O--<span class="w"> </span>MURAL<span class="w"> </span>Valid:<span 
class="w"> </span>missing<span class="w"> </span>EO<span class="p">;</span><span class="w"> </span>Mask:<span class="w"> </span>needs<span class="w"> </span>----E<span class="p">;</span><span class="w"> </span>Invalid:<span class="w"> </span>has<span class="w"> </span>---AL<span class="w"> </span>ROUSE<span class="w"> </span>Eligible </pre> <pre class="code bash literal-block"> <span class="c1"># answer: BIRCH </span>$<span class="w"> </span>./wordle.py<span class="w"> </span><span class="nv">CLAIM</span><span class="o">=</span>c..i.<span class="w"> </span><span class="nv">TRICE</span><span class="o">=</span>.riC.<span class="w"> </span><span class="se">\ </span><span class="w"> </span>--words<span class="w"> </span>INCUR<span class="w"> </span>TAXIS<span class="w"> </span>PRICY<span class="w"> </span>ERICA<span class="w"> </span>BIRCH<span class="w"> </span>--explain<span class="w"> </span>WordleGuesses<span class="o">(</span><span class="nv">mask</span><span class="o">=</span>---C-,<span class="w"> </span><span class="nv">valid</span><span class="o">=</span>CIR,<span class="w"> </span><span class="nv">invalid</span><span class="o">=</span>AELMT,<span class="w"> </span><span class="nv">wrong_spot</span><span class="o">=[</span>C,R,I,I,-<span class="o">]</span>,<span class="w"> </span><span class="nv">unused</span><span class="o">=</span>BDFGHJKNOPQSUVWXYZ<span class="o">)</span><span class="w"> </span>guess_scores:<span class="w"> </span><span class="o">[</span><span class="s1">'CLAIM=c..i.|🟨⬛⬛🟨⬛, TRICE=.riC.|⬛🟨🟨🟩⬛'</span><span class="o">]</span><span class="w"> </span>INCUR<span class="w"> </span>Mask:<span class="w"> </span>needs<span class="w"> </span>---C-<span class="w"> </span>TAXIS<span class="w"> </span>Valid:<span class="w"> </span>missing<span class="w"> </span>CR<span class="p">;</span><span class="w"> </span>Mask:<span class="w"> </span>needs<span class="w"> </span>---C-<span class="p">;</span><span class="w"> </span>Invalid:<span class="w"> 
</span>has<span class="w"> </span>TA---<span class="p">;</span><span class="w"> </span>WrongSpot:<span class="w"> </span>has<span class="w"> </span>---I-<span class="w"> </span>PRICY<span class="w"> </span>WrongSpot:<span class="w"> </span>has<span class="w"> </span>-RI--<span class="w"> </span>ERICA<span class="w"> </span>Invalid:<span class="w"> </span>has<span class="w"> </span>E---A<span class="p">;</span><span class="w"> </span>WrongSpot:<span class="w"> </span>has<span class="w"> </span>-RI--<span class="w"> </span>BIRCH<span class="w"> </span>Eligible </pre> <p>Here's how those explanations were computed, using a variation on <tt class="docutils literal">is_eligible</tt>:</p> <!-- wordle --> <pre class="code python literal-block"> <span class="k">class</span> <span class="nc">WordleGuesses</span><span class="p">:</span><span class="w"> </span> <span class="k">def</span> <span class="nf">is_ineligible</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">word</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]:</span><span class="w"> </span> <span class="n">reasons</span> <span class="o">=</span> <span class="p">{}</span><span class="w"> </span> <span class="n">letters</span> <span class="o">=</span> <span class="p">{</span><span class="n">c</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">word</span><span class="p">}</span><span class="w"> </span> <span class="k">if</span> <span class="n">missing</span> <span class="o">:=</span> <span class="bp">self</span><span class="o">.</span><span class="n">valid</span> <span class="o">-</span> <span class="p">(</span><span class="n">letters</span> <span class="o">&amp;</span> <span class="bp">self</span><span 
class="o">.</span><span class="n">valid</span><span class="p">):</span><span class="w"> </span> <span class="c1"># Did not have the full set of green+yellow letters known to be valid</span><span class="w"> </span> <span class="n">reasons</span><span class="p">[</span><span class="s2">&quot;Valid&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">&quot;missing </span><span class="si">{</span><span class="n">letter_set</span><span class="p">(</span><span class="n">missing</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span><span class="w"> </span> <span class="n">mask</span> <span class="o">=</span> <span class="p">[(</span><span class="n">m</span> <span class="k">if</span> <span class="n">c</span> <span class="o">!=</span> <span class="n">m</span> <span class="k">else</span> <span class="kc">None</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">mask</span><span class="p">)]</span><span class="w"> </span> <span class="k">if</span> <span class="nb">any</span><span class="p">(</span><span class="n">mask</span><span class="p">):</span><span class="w"> </span> <span class="c1"># Couldn't find all the green/correct letters</span><span class="w"> </span> <span class="n">reasons</span><span class="p">[</span><span class="s2">&quot;Mask&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">&quot;needs </span><span class="si">{</span><span class="n">dash_mask</span><span class="p">(</span><span class="n">mask</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span><span class="w"> </span> <span class="n">invalid</span> <span 
class="o">=</span> <span class="p">[(</span><span class="n">c</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">invalid</span> <span class="k">else</span> <span class="kc">None</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">word</span><span class="p">]</span><span class="w"> </span> <span class="k">if</span> <span class="nb">any</span><span class="p">(</span><span class="n">invalid</span><span class="p">):</span><span class="w"> </span> <span class="c1"># Invalid (black) letters present at specific positions</span><span class="w"> </span> <span class="n">reasons</span><span class="p">[</span><span class="s2">&quot;Invalid&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">&quot;has </span><span class="si">{</span><span class="n">dash_mask</span><span class="p">(</span><span class="n">invalid</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span><span class="w"> </span> <span class="n">wrong</span> <span class="o">=</span> <span class="p">[(</span><span class="n">c</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">ws</span> <span class="k">else</span> <span class="kc">None</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">ws</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">wrong_spot</span><span class="p">)]</span><span class="w"> </span> <span class="k">if</span> <span class="nb">any</span><span class="p">(</span><span class="n">wrong</span><span class="p">):</span><span class="w"> </span> <span class="c1"># Found some yellow 
letters: valid letters in wrong position</span><span class="w"> </span> <span class="n">reasons</span><span class="p">[</span><span class="s2">&quot;WrongSpot&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">&quot;has </span><span class="si">{</span><span class="n">dash_mask</span><span class="p">(</span><span class="n">wrong</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span><span class="w"> </span> <span class="k">return</span> <span class="n">reasons</span><span class="w"> </span> <span class="k">def</span> <span class="nf">find_explanations_</span><span class="p">(</span><span class="w"> </span> <span class="bp">self</span><span class="p">,</span> <span class="n">vocabulary</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span><span class="w"> </span> <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]]:</span><span class="w"> </span> <span class="n">explanations</span> <span class="o">=</span> <span class="p">[]</span><span class="w"> </span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">vocabulary</span><span class="p">:</span><span class="w"> </span> <span class="n">reasons</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">is_ineligible</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span> <span class="n">why</span> <span class="o">=</span> <span class="kc">None</span><span class="w"> </span> <span class="k">if</span> <span class="n">reasons</span><span class="p">:</span><span class="w"> 
</span> <span class="n">why</span> <span class="o">=</span> <span class="s2">&quot;; &quot;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="w"> </span> <span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">k</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">v</span><span class="si">}</span><span class="s2">&quot;</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">is_ineligible</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="o">.</span><span class="n">items</span><span class="p">())</span><span class="w"> </span> <span class="n">explanations</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">word</span><span class="p">,</span> <span class="n">why</span><span class="p">))</span><span class="w"> </span> <span class="k">return</span> <span class="n">explanations</span> </pre> <p>This approach is slower than <tt class="docutils literal">is_eligible</tt>, though it's not noticeable when running <tt class="docutils literal">wordle.py</tt> for one set of guess–scores. I have a test tool (<tt class="docutils literal">score.py</tt>) that runs through the 200+ games that I've recorded. Using <tt class="docutils literal">find_explanations</tt>, it took about 10 seconds to run. Switching to <tt class="docutils literal">find_eligible</tt>, it dropped to 2 seconds (5x improvement). 
By prefiltering the word list with a regex made from the mask, the time dropped to half a second (a further 4x improvement).</p>
<pre class="code python literal-block">
<span class="n">pattern</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">&quot;&quot;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">m</span> <span class="ow">or</span> <span class="s2">&quot;.&quot;</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">parsed_guesses</span><span class="o">.</span><span class="n">mask</span><span class="p">))</span><span class="w">
</span><span class="n">word_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">vocabulary</span> <span class="k">if</span> <span class="n">pattern</span><span class="o">.</span><span class="n">fullmatch</span><span class="p">(</span><span class="n">w</span><span class="p">)]</span><span class="w">
</span><span class="n">eligible</span> <span class="o">=</span> <span class="n">parsed_guesses</span><span class="o">.</span><span class="n">find_eligible</span><span class="p">(</span><span class="n">word_list</span><span class="p">)</span>
</pre>
</div>
<div class="section" id="finally">
<h3>Finally</h3>
<p>I thought I knew a lot about solving Wordle programmatically when I started this long post a month ago.
As I wrote this, I realized that I could use a few ugly greps to accomplish the same thing as my Python code; wrote a tool to render games as HTML and emojis; spun off a couple of blog posts on <a class="reference external" href="https://www.georgevreilly.com/blog/2023/09/02/PythonEnumsWithAttributes.html">multi-attribute enumeration</a> and <a class="reference external" href="https://www.georgevreilly.com/blog/2023/09/05/RegexConjunctions.html">regex conjunctions</a>; found and fixed several bugs with repeated letters, greatly refining my understanding of the nuances; rewrote the sections on repeated letters repeatedly; added a means to explain ineligibility; and had a minor epiphany about optimizing the mask programmatically.</p> <p>The full code can be found in my <a class="reference external" href="https://github.com/georgevreilly/wordle">Wordle repository</a>.</p> </div> <div class="section" id="other-work"> <h3>Other Work</h3> <p>I found these articles after I completed the final draft of this post.</p> <ul class="simple"> <li>Bertsimas and Paskov used <a class="reference external" href="https://mitsloan.mit.edu/ideas-made-to-matter/how-algorithm-solves-wordle">Exact Dynamic Programming</a> to find <a class="reference external" href="http://wordle-page.s3-website-us-east-1.amazonaws.com/assets/Wordle_Paper_Final.pdf">An Exact and Interpretable Solution to Wordle</a>.</li> <li><a class="reference external" href="https://yannlandry.photography/blog/wordle-intelligent-solver">Yann Landry's Solver</a> is a little JavaScript and HTML tool that tries to pick the best next word using a scoring system.</li> <li><a class="reference external" href="https://www.inspiredpython.com/article/solving-wordle-puzzles-with-basic-python">Solving with Basic Python</a> makes suggestions for each round based on word commonality.</li> <li>Some <a class="reference external" href="https://mashable.com/article/wordle-tips-tricks">Tips and Tricks</a> for playing the game.</li> </ul> 
<!-- Sticking the Wordle stylesheet at the end out of the way -->
<link rel="stylesheet" href="/wordle.css"></div>
Regex Conjunctions tag:www.georgevreilly.com,2023-09-05:/blog/2023/09/05/RegexConjunctions.html 2023-09-05T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org
<p>Most regular expression engines make it easy to match <a class="reference external" href="https://www.regular-expressions.info/alternation.html">alternations</a> (or disjunctions) with the <tt class="docutils literal">|</tt> operator: to match <em>either</em> <tt class="docutils literal">foo</tt> <em>or</em> <tt class="docutils literal">bar</tt>, use <tt class="docutils literal">foo|bar</tt>.</p>
<p>Few regex engines have any provisions for <a class="reference external" href="https://unix.stackexchange.com/a/55391/4060">conjunctions</a>, and the syntax is often horrible. Awk makes it easy to match <tt class="docutils literal">/pat1/ &amp;&amp; /pat2/ &amp;&amp; /pat3/</tt>.</p>
<pre class="code bash literal-block">
$ cat <span class="s">&lt;&lt;EOF | awk '/bar/ &amp;&amp; /foo/'
&gt; foo bar
&gt; bar
&gt; barfy food
&gt; barfly
&gt; EOF</span>
foo bar
barfy food
</pre>
<p>In the case of a Unix pipeline, the conjunction could also be expressed as a series of pipes: <tt class="docutils literal">... | grep pat1 | grep pat2 | grep pat3 | ...</tt>.</p>
<p>The <a class="reference external" href="https://www.georgevreilly.com/blog/2020/04/23/regex-32-problems.html">longest regex</a> that I ever encountered was an enormous alternation, a true horror that shouldn't have been a regex at all.</p>
Python Enums with Attributes tag:www.georgevreilly.com,2023-09-02:/blog/2023/09/02/PythonEnumsWithAttributes.html 2023-09-02T08:00:00Z George V.
Reilly https://www.georgevreilly.com/ george@reilly.org
<p>Python <a class="reference external" href="https://realpython.com/python-enum/">enumerations</a> are useful for grouping related constants in a namespace. You can add additional behaviors to an enum class, but there isn't an easy and obvious way to add attributes to enum members.</p>
<pre class="code python literal-block">
<span class="kn">from</span> <span class="nn">enum</span> <span class="kn">import</span> <span class="n">Enum</span>


<span class="k">class</span> <span class="nc">TileState</span><span class="p">(</span><span class="n">Enum</span><span class="p">):</span>
    <span class="n">CORRECT</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="n">PRESENT</span> <span class="o">=</span> <span class="mi">2</span>
    <span class="n">ABSENT</span> <span class="o">=</span> <span class="mi">3</span>

    <span class="k">def</span> <span class="nf">color</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span> <span class="ow">is</span> <span class="bp">self</span><span class="o">.</span><span class="n">CORRECT</span><span class="p">:</span>
            <span class="k">return</span> <span class="s2">&quot;Green&quot;</span>
        <span class="k">elif</span> <span class="bp">self</span> <span class="ow">is</span> <span class="bp">self</span><span class="o">.</span><span class="n">PRESENT</span><span class="p">:</span>
            <span class="k">return</span> <span class="s2">&quot;Yellow&quot;</span>
        <span class="k">elif</span> <span class="bp">self</span> <span class="ow">is</span> <span class="bp">self</span><span class="o">.</span><span class="n">ABSENT</span><span class="p">:</span>
            <span class="k">return</span> <span class="s2">&quot;Black&quot;</span>

    <span class="k">def</span> <span class="nf">emoji</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">CORRECT</span><span class="p">:</span> <span class="s2">&quot;</span><span class="se">\U0001F7E9</span><span class="s2">&quot;</span><span class="p">,</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">PRESENT</span><span class="p">:</span> <span class="s2">&quot;</span><span class="se">\U0001F7E8</span><span class="s2">&quot;</span><span class="p">,</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">ABSENT</span><span class="p">:</span> <span class="s2">&quot;</span><span class="se">\U00002B1B</span><span class="s2">&quot;</span><span class="p">,</span>
        <span class="p">}[</span><span class="bp">self</span><span class="p">]</span>
</pre>
<p>Accessing the members and the methods:</p>
<pre class="code pycon literal-block">
<span class="gp">&gt;&gt;&gt; </span><span class="k">for</span> <span class="n">ts</span> <span class="ow">in</span> <span class="n">TileState</span><span class="p">:</span>
<span class="gp">... </span>    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">ts</span><span class="o">.</span><span class="n">name</span><span class="si">:</span><span class="s2">&lt;7</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">ts</span><span class="o">.</span><span class="n">value</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">ts</span><span class="o">.</span><span class="n">color</span><span class="p">()</span><span class="si">:</span><span class="s2">&lt;6</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">ts</span><span class="o">.</span><span class="n">emoji</span><span class="p">()</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
<span class="gp">...</span>
<span class="go">CORRECT: 1 Green  🟩
PRESENT: 2 Yellow 🟨
ABSENT : 3 Black  ⬛</span>
</pre>
<p>You can add methods like <tt
class="docutils literal">color()</tt> and <tt class="docutils literal">emoji()</tt> above (you can even decorate them with <tt class="docutils literal">&#64;property</tt> so that you don't need parentheses), but you have to remember to update <em>every</em> method when you add or remove members from the enumeration.</p>
<div class="section" id="namedtuples-to-the-rescue">
<h3>Namedtuples to the rescue</h3>
<p>It <a class="reference external" href="https://stackoverflow.com/a/62601113/6364">turns out</a> that you can build a <a class="reference external" href="https://www.georgevreilly.com/blog/2016/01/14/PythonBaseClassOrder.html">mixin</a> enumeration from <a class="reference external" href="https://realpython.com/python-namedtuple/">namedtuple</a> and <tt class="docutils literal">Enum</tt> that gives terse construction syntax:</p>
<pre class="code python literal-block">
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">namedtuple</span>
<span class="kn">from</span> <span class="nn">enum</span> <span class="kn">import</span> <span class="n">Enum</span>


<span class="k">class</span> <span class="nc">TileState</span><span class="p">(</span><span class="n">namedtuple</span><span class="p">(</span><span class="s2">&quot;TileState&quot;</span><span class="p">,</span> <span class="s2">&quot;value emoji color css_color&quot;</span><span class="p">),</span> <span class="n">Enum</span><span class="p">):</span>
    <span class="n">CORRECT</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="s2">&quot;</span><span class="se">\U0001F7E9</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s2">&quot;Green&quot;</span><span class="p">,</span> <span class="s2">&quot;#6aaa64&quot;</span>
    <span class="n">PRESENT</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span> <span class="s2">&quot;</span><span class="se">\U0001F7E8</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s2">&quot;Yellow&quot;</span><span class="p">,</span> <span class="s2">&quot;#c9b458&quot;</span>
    <span class="n">ABSENT</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="s2">&quot;</span><span class="se">\U00002B1B</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s2">&quot;Black&quot;</span><span class="p">,</span> <span class="s2">&quot;#838184&quot;</span>
</pre>
<p>Each member now has multiple read-only attributes, like <tt class="docutils literal">emoji</tt> and <tt class="docutils literal">css_color</tt>:</p>
<pre class="code pycon literal-block">
<span class="gp">&gt;&gt;&gt; </span><span class="k">for</span> <span class="n">ts</span> <span class="ow">in</span> <span class="n">TileState</span><span class="p">:</span>
<span class="gp">... </span>    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">ts</span><span class="o">.</span><span class="n">name</span><span class="si">:</span><span class="s2">&lt;7</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">ts</span><span class="o">.</span><span class="n">value</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">ts</span><span class="o">.</span><span class="n">emoji</span><span class="si">}</span><span class="s2"> U+</span><span class="si">{</span><span class="nb">ord</span><span class="p">(</span><span class="n">ts</span><span class="o">.</span><span class="n">emoji</span><span class="p">)</span><span class="si">:</span><span class="s2">05x</span><span class="si">}</span><span class="s2"> &quot;</span>
<span class="gp">... </span>          <span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">ts</span><span class="o">.</span><span class="n">color</span><span class="si">:</span><span class="s2">&lt;6</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">ts</span><span class="o">.</span><span class="n">css_color</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
<span class="gp">...</span>
<span class="go">CORRECT: 1 🟩 U+1f7e9 Green  #6aaa64
PRESENT: 2 🟨 U+1f7e8 Yellow #c9b458
ABSENT : 3 ⬛ U+02b1b Black  #838184</span>
</pre>
</div>
Patching a Python Wheel tag:www.georgevreilly.com,2023-08-10:/blog/2023/08/10/PatchingAPythonWheel.html 2023-08-10T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org
<p>Recently, I had to create a new <a class="reference external" href="https://realpython.com/python-wheels/">Python wheel</a> for <a class="reference external" href="https://pytorch.org/">PyTorch</a>. There is a <a class="reference external" href="https://github.com/pytorch/pytorch/issues/99622">cyclic dependency</a> between PyTorch 2.0.1 and Triton 2.0.0: Torch depends upon Triton, but Triton also depends on Torch. <a class="reference external" href="https://pip.pypa.io/en/latest/">Pip</a> is okay with installing packages where there's a cyclic dependency. <a class="reference external" href="https://bazel.build/">Bazel</a>, however, <a class="reference external" href="https://github.com/bazelbuild/rules_python/issues/1076">does not handle</a> cyclic dependencies between packages. We use Bazel extensively at Stripe and this cyclic dependency prevented us from using the latest version of Torch.</p>
<p>I spent a few days trying to build the PyTorch wheel from source. It was a <em>nightmare!</em> I ran out of disk space on the root partition on my EC2 devbox trying to install system packages, so I had to bring up a custom instance.
Then I ran out of space on the main partition, trying to compile, so I had to bring up another custom instance. Then I realized I had installed CUDA 12.1 and couldn't install CUDA 11.8 over it, so yet another instance. Then a long list of other problems. I was eventually able to get <tt class="docutils literal">python setup.py develop</tt> to execute, but it took three hours! And I had little confidence that I was building the same thing that was in the official wheels.</p> <p>Then I had a brainwave: what if I <a class="reference external" href="https://en.wikipedia.org/wiki/Patch_(computing)">patch</a> the official Torch wheel and simply remove the requirement on Triton? All the officially built code would remain untouched. That worked!</p> <p>This post is adapted from my <a class="reference external" href="https://github.com/pytorch/pytorch/issues/99622#issuecomment-1604812054">writeup on the issue</a>.</p> <div class="section" id="what-is-a-wheel"> <h3>What is a Wheel?</h3> <p>A Python <a class="reference external" href="https://packaging.python.org/en/latest/specifications/binary-distribution-format/">wheel</a> is a ready-to-install Python package that requires no compilation at installation time. Unlike older formats such as source distributions or eggs, <tt class="docutils literal">setup.py</tt> is not run during installation from a wheel. The older formats conflated build and install and required arbitrary code to run.</p> <p>A wheel is a <a class="reference external" href="https://en.wikipedia.org/wiki/ZIP_(file_format)">Zip archive</a> with a specially formatted filename and a <tt class="docutils literal">.whl</tt> extension. The wheel contains a <tt class="docutils literal"><span class="pre">dist-info</span></tt> metadata directory and the installable payload. 
A wheel is either pure Python, which can install on any platform, or a platform (binary) wheel, which usually contains compiled Python extension code.</p>
<p>Java JARs, Android APKs, Mozilla XPIs, and many other file types are also structured Zip archives.</p>
</div>
<div class="section" id="manual-patching">
<h3>Manual Patching</h3>
<p>The wheel file's <a class="reference external" href="https://packaging.python.org/en/latest/specifications/binary-distribution-format/#file-contents">contents</a> include the <tt class="docutils literal"><span class="pre">{distribution}-{version}.dist-info/</span></tt> directory, which contains metadata about the wheel.</p>
<p>In the case of PyTorch 2.0.1, I had <tt class="docutils literal"><span class="pre">torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl</span></tt>, a Linux <tt class="docutils literal">x86_64</tt> wheel for Python 3.8.</p>
<p>I used <tt class="docutils literal">unzip</tt> to extract the wheel's contents into a directory, <tt class="docutils literal">torch201.2</tt>. (The <tt class="docutils literal">.2</tt> denoted my second attempt.) In the <tt class="docutils literal">torch201.2</tt> directory was the entire content of the wheel, including the <tt class="docutils literal"><span class="pre">torch-2.0.1.dist-info/</span></tt> subdirectory.</p>
<pre class="code bash literal-block">
unzip -d torch201.2 torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl
<span class="nb">cd</span> torch201.2
<span class="c1"># Rename the `dist-info` directory to include '+stripe.2' as a suffix for `2.0.1`
</span>mv torch-2.0.1<span class="o">{</span>,+stripe.2<span class="o">}</span>.dist-info/
<span class="nb">cd</span> torch-2.0.1+stripe.2.dist-info/
</pre>
<p>Normally, when we build wheels for forked versions of Python packages at Stripe, we append <tt class="docutils literal"><span class="pre">+stripe.{major}.{commits}.{revision}</span></tt> to the version number.
Both <tt class="docutils literal">commits</tt> and <tt class="docutils literal">revision</tt> come from the output of <tt class="docutils literal">git describe <span class="pre">--tags</span> HEAD</tt>, which <a class="reference external" href="https://git-scm.com/docs/git-describe#_examples">looks like</a> <tt class="docutils literal"><span class="pre">{tag}-{commits}-g{revision}</span></tt>; <tt class="docutils literal">major</tt> is currently hardcoded to <tt class="docutils literal">1</tt>. This suffix helps distinguish a forked wheel's version from the upstream version number.</p> <p>Since I wasn't forking, I used a simplified scheme, <tt class="docutils literal"><span class="pre">+stripe.{attempt}</span></tt>.</p> <p>Then I updated some <a class="reference external" href="https://packaging.python.org/en/latest/specifications/core-metadata/">fields</a> in <tt class="docutils literal"><span class="pre">torch-2.0.1+stripe.2.dist-info/METADATA</span></tt>:</p> <ul class="simple"> <li>Updated <tt class="docutils literal">Version</tt> to include <tt class="docutils literal">+stripe.2</tt></li> <li>Removed the <tt class="docutils literal"><span class="pre">Requires-Dist</span></tt> line for <tt class="docutils literal">triton</tt>. This is the crucial step to fix the cyclic dependency problem.</li> </ul> <p>Now I had to update <tt class="docutils literal"><span class="pre">torch-2.0.1+stripe.2.dist-info/RECORD</span></tt>, which contains signatures for all the files in the wheel, in the form <tt class="docutils literal"><span class="pre">{filename},sha256={safe_hash},{filesize}</span></tt>. 
Of course, <tt class="docutils literal">RECORD</tt> does not have an entry for itself.</p>
<p>The paths to all the <tt class="docutils literal"><span class="pre">dist-info</span></tt> files needed to be updated in <tt class="docutils literal">RECORD</tt> to include the <tt class="docutils literal">+stripe.2</tt> suffix.</p>
<p>In Vim terms:</p>
<pre class="code vim literal-block">
<span class="p">:</span>%s<span class="sr">/^\(torch-2.0.1\)\(\.dist-info\)/</span>\<span class="m">1</span><span class="p">+</span>stripe.<span class="m">2</span>\<span class="m">2</span>/
</pre>
<p>You can use this <tt class="docutils literal">record_hash.py</tt> script to compute the entry for a file:</p>
<pre class="code python literal-block">
<span class="ch">#!/usr/bin/env python3</span>

<span class="kn">import</span> <span class="nn">base64</span>
<span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">sys</span>

<span class="n">filename</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s2">&quot;rb&quot;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">digest</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
<span class="n">safe_hash</span> <span class="o">=</span> <span class="n">base64</span><span class="o">.</span><span class="n">urlsafe_b64encode</span><span class="p">(</span><span class="n">digest</span><span class="o">.</span><span class="n">digest</span><span class="p">())</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s2">&quot;us-ascii&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s2">&quot;=&quot;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">filename</span><span class="si">}</span><span class="s2">,sha256=</span><span class="si">{</span><span class="n">safe_hash</span><span class="si">}</span><span class="s2">,</span><span class="si">{</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">getsize</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
</pre>
<p>The output will look like this:</p>
<pre class="code bash literal-block">
$ ../record_hash.py torch-2.0.1+stripe.2.dist-info/METADATA
torch-2.0.1+stripe.2.dist-info/METADATA,sha256<span class="o">=</span>StmZkVzCWlHIxaIGVJocXv7JsDnlrSaNXwtuIlE_PKc,24703
</pre>
<p>Replace the <tt class="docutils literal">METADATA</tt> entry in <tt class="docutils literal">RECORD</tt> with the output from <tt class="docutils literal">record_hash.py</tt>.</p>
<p>Finally, you can <tt class="docutils literal">zip</tt> up everything into a new wheel. Note the <tt class="docutils literal">+stripe.2</tt> in the new wheel's filename:</p>
<pre class="literal-block">
zip ../torch-2.0.1+stripe.2-cp38-cp38-manylinux1_x86_64.whl -r .
</pre>
<p>At this point, you can upload the wheel to a private repository.</p>
<p>To install the wheel:</p>
<pre class="literal-block">
pip install torch==2.0.1+stripe.2
</pre>
<p>You will not see <tt class="docutils literal">triton</tt> being installed, unlike before.
However, if you do install <tt class="docutils literal">triton</tt>, it will be satisfied by this patched version of <tt class="docutils literal">torch</tt>.</p> </div> <div class="section" id="summary"> <h3>Summary</h3> <p>If you have to manually patch a Python wheel:</p> <ul class="simple"> <li>Decide upon a suffix, such as <tt class="docutils literal">+stripe.2</tt>.</li> <li>Unzip the wheel.</li> <li>Rename the <tt class="docutils literal"><span class="pre">dist-info</span></tt> directory to include the suffix.</li> <li>Update <tt class="docutils literal">Version</tt> in <tt class="docutils literal">METADATA</tt> to include the suffix.</li> <li><strong>Make other modifications.</strong></li> <li>Append the suffix to the <tt class="docutils literal"><span class="pre">dist-info</span></tt> entries in <tt class="docutils literal">RECORD</tt>.</li> <li>Use <tt class="docutils literal">record_hash.py</tt> to compute new entries for all modified files. Update <tt class="docutils literal">RECORD</tt> accordingly.</li> <li>Zip up the new wheel. Include the suffix in the filename.</li> <li><tt class="docutils literal">pip install</tt> the new wheel.</li> </ul> </div> Bram Moolenaar RIP tag:www.georgevreilly.com,2023-08-07:/blog/2023/08/07/BramMoolenaarRIP.html 2023-08-07T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <a class="reference external image-reference" href="https://www.vim.org/"><img alt="Vim" src="https://www.georgevreilly.com/content/binary/vim-logo-png-transparent.png" style="width: 250px;"/></a> <p>I woke up on Saturday to read on Bram Moolenaar's Facebook page an <a class="reference external" href="https://www.facebook.com/bram.moolenaar/posts/pfbid0d7rBdoVZu7Ww2yvmpEjmjJ1B3WYVFf86nFrFXczmRcYzjUxChq3xcjH84zURsZYjl">announcement</a> of his death. I knew Bram online for nearly 30 years and I was one of his relatively small number of Facebook friends, but we never met in real life. 
I knew that he had retired from Google Zurich to Tenerife, but I hadn't been aware that he had been ill.</p> <p>Bram was known to the world for his signature creation, the <a class="reference external" href="https://www.vim.org">Vim text editor</a>, used by millions of developers on Linux, macOS, and Windows. Vim stands for Vi IMproved, but it outgrew the original <tt class="docutils literal">vi</tt> long ago.</p> <p>I was an <a class="reference external" href="https://www.georgevreilly.com/blog/2005/12/30/20YearsOfVi.html">active contributor</a> to Vim in the 1990s: I wrote a lot of the Win32 console mode code as well as the alpha version of Windows gVim; my name is at the top of the <a class="reference external" href="https://vimhelp.org/os_win32.txt.html">page</a> if you do <tt class="docutils literal">:help win32</tt>. In the 00s, I ported Vim to Win64. I drifted away from active participation more than a decade ago, but I still lurk on the <a class="reference external" href="https://groups.google.com/g/vim_dev/">vim_dev</a> mailing list.</p> <p>Vim has been the thoroughly dominant flavor of <tt class="docutils literal">vi</tt> for a number of years, but that wasn’t the case in the 90s. There were Elvis, vile, xvi, and other things I no longer recall. Bram built a better <tt class="docutils literal">vi</tt> and he built a solid community of developers and users. I never saw the toxic behavior that’s prevalent in some tech communities. Bram was always a patient and reasonable leader. He poured countless hours into making Vim an ever better editor and he answered so many questions on the various mailing lists. Vim would not have succeeded half so well without the community that he built. 
I didn’t always agree with Bram's technical decisions (and neither did the NeoVim people), but I have enormous respect for what he accomplished, technically and socially.</p> <p>The other remarkable thing about Vim is that it’s <a class="reference external" href="https://vimdoc.sourceforge.net/htmldoc/uganda.html#license">charityware</a>. Vim users were strongly encouraged to donate to <a class="reference external" href="https://iccf-holland.org/">ICCF Holland</a>, which supports children in Kibaale, Uganda. Bram was the treasurer of ICCF and was involved with the work for many years. When I was at Microsoft, I got a bunch of Vim-loving engineers to donate; Microsoft matched our donations. I made another donation to ICCF today in his memory.</p> <p>It’s clear that work on Vim will continue. Although Bram was the benevolent dictator for life of Vim, a <a class="reference external" href="https://github.com/orgs/vim/people">handful of others</a> have commit rights and are planning the <a class="reference external" href="https://groups.google.com/g/vim_dev/c/dq9Wu5jqVTw/m/puYIETTwAAAJ">future of the Vim project</a>. They have big shoes to fill. I don’t know enough about ICCF to say how severely this will affect them.</p> <p><strong>ETA</strong>: The upcoming Vim 9.1 release will be <a class="reference external" href="https://github.com/vim/vim/pull/12749">dedicated to Bram</a>, just as the <a class="reference external" href="https://groups.google.com/g/vim_announce/c/MJBKVd-xrEE/m/joVNaDgAAgAJ">9.0 release was dedicated to Sven Guckes</a>, who died last year. Sven was one of Vim's greatest ambassadors, endlessly helpful to users in the newsgroups.
We stayed for a week with Sven in August 2014 in his Berlin apartment, and he was the most wonderful host, spending many hours showing us around his beloved city.</p> <p>The best articles that I’ve seen about Bram so far:</p> <ul class="simple"> <li><a class="reference external" href="https://www.theregister.com/2023/08/07/bram_moolenaar_obituary/">The Reg's obituary</a></li> <li><a class="reference external" href="https://j11g.com/2023/08/07/the-legacy-of-bram-moolenaar/">The Legacy of Bram Moolenaar</a>, Jan van den Berg</li> <li><a class="reference external" href="https://neovim.io/news/2023/08">Vim Boss</a>, Justin M. Keyes</li> </ul> <p><tt class="docutils literal">:wq!</tt></p> Cold Brew Coffee Recipe tag:www.georgevreilly.com,2023-07-24:/blog/2023/07/24/ColdBrewCoffeeRecipe.html 2023-07-24T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <a class="reference external image-reference" href="https://www.amazon.com/dp/B00JVSVM36/?tag=georgvreill-20"><img alt="Oxo Cold Brew Coffee Maker" class="right-float" src="https://images-na.ssl-images-amazon.com/images/P/B00JVSVM36.01.LZZZZZZZ.jpg" style="width: 300px;"/></a> <p>I often enjoy cold brew coffee in summer. I bought an <a class="reference external" href="https://www.oxo.com/cold-brew-coffee-maker.html">Oxo Cold Brew Coffee Maker</a> one winter when it was on sale at Bed, Bath &amp; Beyond. Before that, I used a <a class="reference external" href="https://www.organiccottonmart.com/blogs/sustainable-lifestyle/nut-milk-bag-vs-cheesecloth">nut milk bag</a> in a jar. I like the Oxo and it gets high marks in many reviews, such as <a class="reference external" href="https://www.homegrounds.co/oxo-cold-brew-coffee-maker-review/">HomeGrounds</a> or <a class="reference external" href="https://www.nytimes.com/wirecutter/reviews/best-cold-brew-coffee-maker/">Wirecutter</a>. 
It's easy to use, easy to clean, and makes a good brew.</p> <p>The only downside to making your own cold brew coffee is that you must plan ahead. You can make hot coffee in a few minutes, but cold brew takes hours.</p> <p>I have used this recipe for a number of years. It makes a smooth, less acidic coffee.</p> <p>ETA: See also <a class="reference external" href="https://www.georgevreilly.com/blog/2024/10/19/ColdBrewCoffeeFrenchPressRecipe.html">Cold Brew Coffee Recipe for French Press</a>.</p> <div class="section" id="ingredients"> <h3>Ingredients</h3> <ul class="simple"> <li>24 fl oz water</li> <li>6 oz fresh <em>coarsely ground</em> coffee. Store-bought pre-ground coffee is too fine.</li> </ul> <p>This will half-fill the Oxo jar. It <strong>yields about 16 fl oz</strong> (1 pt) of cold brew coffee. You can double the quantities in the Oxo, if you like.</p> </div> <div class="section" id="instructions"> <h3>Instructions</h3> <ul class="simple"> <li>Grind the coffee beans coarsely.</li> <li>Place the ground coffee in the Oxo jar.</li> <li>Pour water through the rain sprinkler top.</li> <li>Swirl gently to ensure that all coffee grounds are wet. (I once followed a recipe that called for vigorous stirring. Never again! The grounds absorbed so much more water that I only got half the yield.)</li> <li>Some people put the jar in the fridge at this point. I don't bother.</li> <li>Wait for the coffee to brew! Some instructions say 12 to 24 hours. I usually wait 6–8 hours.</li> <li>Put the jar on the stand. First, make sure that the switch is <em>closed</em>.</li> <li>Place the flask under the stand. Push the switch down to release the brew.</li> <li>Let the cold brew coffee drain. 
This will take 10–15 minutes.</li> <li>Refrigerate the cold brew.</li> <li>Use the coffee grounds to <a class="reference external" href="https://www.southernliving.com/garden/coffee-grounds-for-hydrangeas">turn hydrangeas blue</a> or <a class="reference external" href="https://www.healthline.com/nutrition/uses-for-coffee-grounds">exfoliate your skin</a>.</li> </ul> </div> <div class="section" id="adjustments"> <h3>Adjustments</h3> <p>You can vary the ratio of coffee to water and how long you steep the mixture.</p> </div> <div class="section" id="drinks"> <h3>Drinks</h3> <ul class="simple"> <li>2 fl oz of cold brew</li> <li>6 fl oz milk</li> </ul> <p><a class="reference external" href="https://www.forkinthekitchen.com/how-to-make-cold-brew-coffee/">Fork in the Kitchen</a> has some suggestions if you don't have a coffee grinder or an Oxo maker.</p> </div> Compressing Tar Files in Parallel tag:www.georgevreilly.com,2023-02-21:/blog/2023/02/21/CompressingTarFilesInParallel.html 2023-02-21T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <p>TL;DR: use <tt class="docutils literal">tar <span class="pre">-I</span> pigz</tt> or <tt class="docutils literal">tar <span class="pre">-I</span> lbzip2</tt> to compress large tar files much more quickly.</p> <p>I investigated various ways of compressing a 7GiB tar file.</p> <p>The built-in <tt class="docutils literal"><span class="pre">--gzip</span></tt> and <tt class="docutils literal"><span class="pre">--bzip2</span></tt> compression methods in GNU <tt class="docutils literal">tar</tt> are single-threaded. 
If you invoke an external compressor with <tt class="docutils literal"><span class="pre">--use-compress-program</span></tt>, you can get some huge reductions in compression time, with slightly worse compression ratios.</p>
<p>You can use <a class="reference external" href="https://zlib.net/pigz/">pigz</a> as a parallel replacement for <tt class="docutils literal">gzip</tt> and <a class="reference external" href="https://linux.die.net/man/1/lbzip2">lbzip2</a> as a parallel version of <tt class="docutils literal">bzip2</tt>. Both of them will make heavy use of all the cores in your system, greatly reducing the <em>real</em> time relative to the <em>user</em> time.</p>
<p>Single-threaded compression timing: <tt class="docutils literal">gzip</tt> is a lot faster than <tt class="docutils literal">bzip2</tt>:</p>
<pre class="literal-block">
$ time tar --bzip2 -cf huge-bzip2.tar.bz2 hugedir
real    13m15.352s
user    12m53.972s
sys     0m16.029s

$ time tar --gzip -cf huge-gzip.tar.gz hugedir
real    5m56.489s
user    5m30.271s
sys     0m14.633s
</pre>
<p><tt class="docutils literal">fast</tt> parallel compression timing: <tt class="docutils literal">pigz</tt> is the clear winner:</p>
<pre class="literal-block">
$ time tar --use-compress-program='lbzip2 --fast' \
    -cf huge-lbzip2-fast.tar.bz2 hugedir
real    2m35.967s
user    11m38.865s
sys     0m26.981s

$ time tar --use-compress-program='pigz --fast' \
    -cf huge-pigz-fast.tar.gz hugedir
real    0m58.222s
user    3m22.134s
sys     0m17.357s
</pre>
<p><tt class="docutils literal">best</tt> parallel compression timing: <tt class="docutils literal">lbzip2</tt> is much quicker than <tt class="docutils literal">pigz</tt>:</p>
<pre class="literal-block">
$ time tar --use-compress-program='lbzip2 --best' \
    -cf huge-lbzip2-best.tar.bz2 hugedir
real    1m44.365s
user    11m38.277s
sys     0m13.551s

$ time tar --use-compress-program='pigz --best' \
    -cf huge-pigz-best.tar.gz hugedir
real    2m27.694s
user    16m20.441s
sys     0m16.092s
</pre>
<p>Compressed file sizes: the <tt class="docutils literal">bzip2</tt> family compresses better than the <tt class="docutils literal">gzip</tt> family; <tt class="docutils literal">best</tt> is smaller than the default compression level, which is smaller than <tt class="docutils literal">fast</tt>:</p>
<pre class="literal-block">
$ ls -lSr
-rw-r--r-- 1 user group 2460438578 Feb 22 03:03 huge-lbzip2-best.tar.bz2
-rw-r--r-- 1 user group 2461172874 Feb 22 03:19 huge-bzip2.tar.bz2
-rw-r--r-- 1 user group 2689784220 Feb 22 03:00 huge-lbzip2-fast.tar.bz2
-rw-r--r-- 1 user group 2691286852 Feb 22 03:06 huge-pigz-best.tar.gz
-rw-r--r-- 1 user group 2704591997 Feb 22 03:25 huge-gzip.tar.gz
-rw-r--r-- 1 user group 2950547862 Feb 22 03:01 huge-pigz-fast.tar.gz
-rw-r--r-- 1 user group 7365222400 Feb 22 03:00 huge.tar
</pre>
Implementing the Tree command in Rust, part 2: Printing Trees tag:www.georgevreilly.com,2023-01-24:/blog/2023/01/24/TreeInRust2PrintingTrees.html 2023-01-24T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org
<p>In <a class="reference external" href="https://www.georgevreilly.com/blog/2023/01/23/TreeInRust1WalkDirectories.html">Part 1</a>, we saw how to walk directory trees, recursively using <tt class="docutils literal"><span class="pre">fs::read_dir</span></tt> to construct an in-memory tree of <tt class="docutils literal">FileNode</tt>s. In Part 2, we'll implement the rest of the core of the <a class="reference external" href="https://en.wikipedia.org/wiki/Tree_(command)">tree command</a>: printing the directory tree with <a class="reference external" href="https://www.compart.com/en/unicode/block/U+2500">Box Drawing</a> characters.</p>
<p>Let's take a look at some output from <tt class="docutils literal">tree</tt>:</p>
<pre class="literal-block">
.
├── alloc.rs
├── ascii.rs
├── os
│&nbsp;&nbsp; ├── wasi
│&nbsp;&nbsp; │&nbsp;&nbsp; ├── ffi.rs
│&nbsp;&nbsp; │&nbsp;&nbsp; ├── mod.rs ➊
│&nbsp;&nbsp; │&nbsp;&nbsp; └── net ➋
│&nbsp;&nbsp; │&nbsp;&nbsp;     └── mod.rs
│&nbsp;&nbsp; └── windows
│&nbsp;&nbsp;     ├── ffi.rs ➌
│&nbsp;&nbsp;     ├── fs.rs
│&nbsp;&nbsp;     ├── io
│&nbsp;&nbsp;     │&nbsp;&nbsp; └── tests.rs
│&nbsp;&nbsp;     ├── mod.rs
│&nbsp;&nbsp;     └── thread.rs
├── personality
│&nbsp;&nbsp; ├── dwarf
│&nbsp;&nbsp; │&nbsp;&nbsp; ├── eh.rs
│&nbsp;&nbsp; │&nbsp;&nbsp; ├── mod.rs
│&nbsp;&nbsp; │&nbsp;&nbsp; └── tests.rs
│&nbsp;&nbsp; ├── emcc.rs
│&nbsp;&nbsp; └── gcc.rs
└── personality.rs
</pre>
<p>The first thing that we notice is that most entries at any level, such as ➊, are preceded by <tt class="docutils literal">├──</tt>, while the last entry, ➋, is preceded by <tt class="docutils literal">└──</tt>. This <a class="reference external" href="https://realpython.com/directory-tree-generator-python/">article</a> about building a directory tree generator in Python calls them the <em>tee</em> and <em>elbow</em> connectors, and I'm going to use that terminology.</p>
<p>The second thing we notice is that there are multiple <em>prefixes</em> before the connectors, either <tt class="docutils literal">│&nbsp;&nbsp;</tt>&nbsp;(<em>pipe</em>) or <tt class="docutils literal">&nbsp;&nbsp; </tt>&nbsp;(<em>space</em>), one prefix for each level.
The rule is that children of a last entry, such as <tt class="docutils literal">os/windows</tt> ➌, get the space prefix, while children of other entries, such as <tt class="docutils literal">os/wasi</tt> or <tt class="docutils literal">personality</tt>, get the pipe prefix.</p>
<p>For both connectors and prefixes, the last entry at a particular level gets special treatment.</p>
<div class="section" id="the-print-tree-function">
<h3>The <tt class="docutils literal">print_tree</tt> function</h3>
<p>A classic technique with recursion is to create a pair of functions: an outer public function that calls a private helper function with the initial set of parameters to visit recursively.</p>
<p>Our <tt class="docutils literal">print_tree</tt> function uses an inner <tt class="docutils literal">visit</tt> function to recursively do almost all of the work.</p>
<pre class="code rust literal-block">
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span> <span class="nf">print_tree</span><span class="p">(</span><span class="n">root</span>: <span class="kp">&amp;</span><span class="kt">str</span><span class="p">,</span><span class="w"> </span><span class="n">dir</span>: <span class="kp">&amp;</span><span class="nc">Directory</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="k">const</span><span class="w"> </span><span class="n">OTHER_CHILD</span>: <span class="kp">&amp;</span><span class="kt">str</span> <span class="o">=</span><span class="w"> </span><span class="s">&quot;│&nbsp;&nbsp; &quot;</span><span class="p">;</span><span class="w"> </span><span class="c1">// prefix: pipe
</span><span class="w">    </span><span class="k">const</span><span class="w"> </span><span class="n">OTHER_ENTRY</span>: <span class="kp">&amp;</span><span class="kt">str</span> <span class="o">=</span><span class="w"> </span><span class="s">&quot;├── &quot;</span><span class="p">;</span><span class="w"> </span><span class="c1">// connector: tee
</span><span class="w">    </span><span class="k">const</span><span class="w"> </span><span class="n">FINAL_CHILD</span>: <span class="kp">&amp;</span><span class="kt">str</span> <span class="o">=</span><span class="w"> </span><span class="s">&quot;    &quot;</span><span class="p">;</span><span class="w"> </span><span class="c1">// prefix: no more siblings
</span><span class="w">    </span><span class="k">const</span><span class="w"> </span><span class="n">FINAL_ENTRY</span>: <span class="kp">&amp;</span><span class="kt">str</span> <span class="o">=</span><span class="w"> </span><span class="s">&quot;└── &quot;</span><span class="p">;</span><span class="w"> </span><span class="c1">// connector: elbow
</span><span class="w">
    </span><span class="fm">println!</span><span class="p">(</span><span class="s">&quot;{}&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">root</span><span class="p">);</span><span class="w"> </span><span class="err">➊</span><span class="w">
    </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">f</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">visit</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;&quot;</span><span class="p">);</span><span class="w">
    </span><span class="fm">println!</span><span class="p">(</span><span class="s">&quot;</span><span class="se">\n</span><span class="s">{} directories, {} files&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">f</span><span class="p">);</span><span class="w">

    </span><span class="k">fn</span> <span class="nf">visit</span><span class="p">(</span><span class="n">node</span>: <span class="kp">&amp;</span><span class="nc">Directory</span><span class="p">,</span><span class="w"> </span><span class="n">prefix</span>: <span class="kp">&amp;</span><span class="kt">str</span><span class="p">)</span><span class="w"> </span>-&gt; <span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">➋</span><span class="w">
        </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">dirs</span>: <span class="kt">usize</span> <span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="c1">// counting this directory ➌
</span><span class="w">        </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">files</span>: <span class="kt">usize</span> <span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w">
        </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">count</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">node</span><span class="p">.</span><span class="n">entries</span><span class="p">.</span><span class="n">len</span><span class="p">();</span><span class="w"> </span><span class="err">➍</span><span class="w">
        </span><span class="k">for</span><span class="w"> </span><span class="n">entry</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="o">&amp;</span><span class="n">node</span><span class="p">.</span><span class="n">entries</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="n">count</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
            </span><span class="kd">let</span><span class="w"> </span><span class="n">connector</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">count</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">FINAL_ENTRY</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">OTHER_ENTRY</span><span class="w"> </span><span class="p">};</span><span class="w"> </span><span class="err">➎</span><span class="w">
            </span><span class="k">match</span><span class="w"> </span><span class="n">entry</span><span class="w"> </span><span class="p">{</span><span class="w">
                </span><span class="n">FileTree</span>::<span class="n">DirNode</span><span class="p">(</span><span class="n">sub_dir</span><span class="p">)</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">➏</span><span class="w">
                    </span><span class="fm">println!</span><span class="p">(</span><span class="s">&quot;{}{}{}&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">prefix</span><span class="p">,</span><span class="w"> </span><span class="n">connector</span><span class="p">,</span><span class="w"> </span><span class="n">sub_dir</span><span class="p">.</span><span class="n">name</span><span class="p">);</span><span class="w">
                    </span><span class="kd">let</span><span class="w"> </span><span class="n">new_prefix</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="fm">format!</span><span class="p">(</span><span class="w"> </span><span class="err">➐</span><span class="w">
                        </span><span class="s">&quot;{}{}&quot;</span><span class="p">,</span><span class="w">
                        </span><span class="n">prefix</span><span class="p">,</span><span class="w">
                        </span><span class="k">if</span><span class="w"> </span><span class="n">count</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">FINAL_CHILD</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">OTHER_CHILD</span><span class="w"> </span><span class="p">}</span><span class="w">
                    </span><span class="p">);</span><span class="w">
                    </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">f</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">visit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">sub_dir</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">new_prefix</span><span class="p">);</span><span class="w"> </span><span class="err">➑</span><span class="w">
                    </span><span class="n">dirs</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">d</span><span class="p">;</span><span class="w">
                    </span><span class="n">files</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">f</span><span class="p">;</span><span class="w">
                </span><span class="p">}</span><span class="w">
                </span><span class="n">FileTree</span>::<span class="n">LinkNode</span><span class="p">(</span><span class="n">symlink</span><span class="p">)</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="p">{</span><span class="w">
                    </span><span class="fm">println!</span><span class="p">(</span><span class="w">
                        </span><span class="s">&quot;{}{}{} -&gt; {}&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">prefix</span><span class="p">,</span><span class="w"> </span><span class="n">connector</span><span class="p">,</span><span class="w"> </span><span class="n">symlink</span><span class="p">.</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">symlink</span><span class="p">.</span><span class="n">target</span><span class="p">);</span><span class="w">
                    </span><span class="n">files</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
                </span><span class="p">}</span><span class="w">
                </span><span class="n">FileTree</span>::<span class="n">FileNode</span><span class="p">(</span><span class="n">file</span><span class="p">)</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="p">{</span><span class="w">
                    </span><span class="fm">println!</span><span class="p">(</span><span class="s">&quot;{}{}{}&quot;</span><span class="p">,</span><span class="w"> </span><span class="n">prefix</span><span class="p">,</span><span class="w"> </span><span class="n">connector</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="p">.</span><span class="n">name</span><span class="p">);</span><span class="w">
                    </span><span class="n">files</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
                </span><span class="p">}</span><span class="w">
            </span><span class="p">}</span><span class="w">
        </span><span class="p">}</span><span class="w">
        </span><span class="p">(</span><span class="n">dirs</span><span class="p">,</span><span class="w"> </span><span class="n">files</span><span class="p">)</span><span class="w"> </span><span class="err">➒</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">}</span>
</pre>
<ol class="arabic simple">
<li>The outer function, <tt class="docutils literal">print_tree</tt>, simply prints the name of the root node on a line by itself; calls the inner <tt class="docutils literal">visit</tt> function with the <tt class="docutils literal">dir</tt> node and an empty prefix; and finally prints the number of directories and files visited. This is for compatibility with the output of <tt class="docutils literal">tree</tt>.</li>
<li>The inner <tt class="docutils literal">visit</tt> function takes two parameters: <tt class="docutils literal">node</tt>, a <tt class="docutils literal">Directory</tt>, and <tt class="docutils literal">prefix</tt>, a string which is initially empty.</li>
<li>Keep track of the number of <tt class="docutils literal">dirs</tt> and <tt class="docutils literal">files</tt> seen at this level and in sub-directories.</li>
<li>We count downwards from the number of entries in this directory to zero. When <tt class="docutils literal">count</tt> is zero, we are on the last entry, which gets special treatment.</li>
<li>Compute the connector, <tt class="docutils literal">└──</tt> (<em>elbow</em>) for the last entry; <tt class="docutils literal">├──</tt> (<em>tee</em>) otherwise.</li>
<li>Match the <tt class="docutils literal"><span class="pre">FileTree::DirNode</span></tt> variant and <a class="reference external" href="https://doc.rust-lang.org/reference/patterns.html#destructuring">destructure</a> the value into <tt class="docutils literal">sub_dir</tt>, a <tt class="docutils literal">&amp;Directory</tt>.</li>
<li>Before recursively visiting a sub-directory, we compute a new prefix, by appending the appropriate sub-prefix to the current prefix.
If there are further entries (<tt class="docutils literal">count &gt; 0</tt>), the sub-prefix for the current level is <tt class="docutils literal">│&nbsp;&nbsp;</tt>&nbsp;(<em>pipe</em>); otherwise, it's <tt class="docutils literal">&nbsp;&nbsp; </tt>&nbsp;(<em>spaces</em>).</li>
<li>Call <tt class="docutils literal">visit</tt> recursively, then add to the running totals of <tt class="docutils literal">dirs</tt> and <tt class="docutils literal">files</tt>.</li>
<li><tt class="docutils literal">visit</tt> returns a tuple of the counts of directories and files that were recursively visited.</li>
</ol>
<p>One subtlety that is not obvious from the above is that <tt class="docutils literal">OTHER_CHILD</tt> actually contains two <a class="reference external" href="https://en.wikipedia.org/wiki/Non-breaking_space">non-breaking spaces</a>:</p>
<pre class="code rust literal-block">
<span class="k">const</span><span class="w"> </span><span class="n">OTHER_CHILD</span>: <span class="kp">&amp;</span><span class="kt">str</span> <span class="o">=</span><span class="w"> </span><span class="s">&quot;│</span><span class="se">\u{00A0}\u{00A0}</span><span class="s"> &quot;</span><span class="p">;</span><span class="w"> </span><span class="c1">// prefix: pipe</span>
</pre>
<p>This is for compatibility with the output of <tt class="docutils literal">tree</tt>:</p>
<pre class="code bash literal-block">
$ diff &lt;<span class="o">(</span>cargo run -q -- ./tests<span class="o">)</span> &lt;<span class="o">(</span>tree ./tests<span class="o">)</span> <span class="o">&amp;&amp;</span> <span class="nb">echo</span> <span class="s2">&quot;no difference&quot;</span>
no difference
</pre>
<p>This uses <a class="reference external" href="https://www.georgevreilly.com/blog/2022/01/31/DiffFileFragment.html">process substitution</a> to generate two different inputs for <tt class="docutils literal">diff</tt>.</p>
</div>
<div class="section" id="the-main-function">
<h3>The <tt class="docutils
literal">main</tt> function</h3> <p>Let's tie it all together.</p> <pre class="code rust literal-block"> <span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span>-&gt; <span class="nc">io</span>::<span class="nb">Result</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">root</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span>::<span class="n">args</span><span class="p">().</span><span class="n">nth</span><span class="p">(</span><span class="mi">1</span><span class="p">).</span><span class="n">unwrap_or</span><span class="p">(</span><span class="s">&quot;.&quot;</span><span class="p">.</span><span class="n">to_string</span><span class="p">());</span><span class="w"> </span><span class="err">➊</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">dir</span>: <span class="nc">Directory</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dir_walk</span><span class="p">(</span><span class="w"> </span><span class="err">➋</span><span class="w"> </span><span class="o">&amp;</span><span class="n">PathBuf</span>::<span class="n">from</span><span class="p">(</span><span class="n">root</span><span class="p">.</span><span class="n">clone</span><span class="p">()),</span><span class="w"> </span><span class="err">➌</span><span class="w"> </span><span class="n">is_not_hidden</span><span class="p">,</span><span class="w"> </span><span class="n">sort_by_name</span><span class="p">)</span><span class="o">?</span><span class="p">;</span><span class="w"> </span><span class="err">➍</span><span class="w"> </span><span class="n">print_tree</span><span class="p">(</span><span class="o">&amp;</span><span class="n">root</span><span 
class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">dir</span><span class="p">);</span><span class="w"> </span><span class="err">➎</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(())</span><span class="w"> </span><span class="err">➏</span><span class="w"> </span><span class="p">}</span> </pre> <ol class="arabic simple"> <li>The simplest possible way to get a single, optional command-line argument. If omitted, we default to <tt class="docutils literal">.</tt>, the current directory. For more sophisticated argument parsing, we could use <a class="reference external" href="https://docs.rs/clap/latest/clap/">Clap</a>.</li> <li>Use <tt class="docutils literal">dir_walk</tt> from <a class="reference external" href="https://www.georgevreilly.com/blog/2023/01/23/TreeInRust1WalkDirectories.html">Part 1</a> to recursively build a directory of <tt class="docutils literal">FileTree</tt> nodes.</li> <li>Create a <tt class="docutils literal">PathBuf</tt> from <tt class="docutils literal">root</tt>, a string; <tt class="docutils literal">clone</tt> is needed because <tt class="docutils literal"><span class="pre">PathBuf::from</span></tt> takes ownership of the string buffer, and <tt class="docutils literal">root</tt> is borrowed again later by <tt class="docutils literal">print_tree</tt>. 
Use the <tt class="docutils literal">is_not_hidden</tt> filter and the <tt class="docutils literal">sort_by_name</tt> comparator from <a class="reference external" href="https://www.georgevreilly.com/blog/2023/01/23/TreeInRust1WalkDirectories.html">Part 1</a>.</li> <li>The <a class="reference external" href="https://doc.rust-lang.org/reference/expressions/operator-expr.html#the-question-mark-operator">postfix question mark operator</a>, <tt class="docutils literal">?</tt>, is used to propagate errors.</li> <li>Let <tt class="docutils literal">print_tree</tt> draw the diagram.</li> <li>Return the <tt class="docutils literal">Ok</tt> <a class="reference external" href="https://doc.rust-lang.org/std/primitive.unit.html">unit</a> result to indicate success.</li> </ol> </div> <div class="section" id="baum"> <h3>Baum</h3> <p>You can find the <a class="reference external" href="https://github.com/georgevreilly/baum">Baum</a> source code on GitHub.</p> <p>In Part 3, we'll discuss testing.</p> </div> <div class="section" id="resources"> <h3>Resources</h3> <ul class="simple"> <li><a class="reference external" href="https://github.com/Old-Man-Programmer/tree/">Official tree source</a>: The actual source for <tt class="docutils literal">tree</tt>, written in old-school C.</li> <li><a class="reference external" href="https://two-wrongs.com/draw-a-tree-structure-with-only-css.html">Draw a Tree Structure With Only CSS</a>: Use CSS to draw links in nested, unordered lists.</li> <li><a class="reference external" href="https://realpython.com/directory-tree-generator-python/">Build a Python Directory Tree Generator for the Command Line</a>.</li> <li>Kevin Newton has implemented <a class="reference external" href="https://github.com/kddnewton/tree">Tree in Multiple Languages</a>.</li> <li><a class="reference external" href="https://github.com/dduan/tre">Tre</a> is a modern alternative to <tt class="docutils literal">tree</tt> in Rust.</li> </ul> </div> Implementing the Tree command in 
Rust, part 1: Walking Directories tag:www.georgevreilly.com,2023-01-23:/blog/2023/01/23/TreeInRust1WalkDirectories.html 2023-01-23T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <img alt="tree tree core/src/num for Rust" class="right-float" src="https://www.georgevreilly.com/content/binary/rust-core-src-num-tree.png" style="width: 160px;"/> <p>I've been learning Rust lately. I started by reading several books, including <a class="reference external" href="https://www.manning.com/books/rust-in-action">Rust in Action</a>, <a class="reference external" href="https://www.manning.com/books/code-like-a-pro-in-rust">Code Like a Pro in Rust</a>, and most of <a class="reference external" href="https://learning.oreilly.com/library/view/programming-rust-2nd/9781492052586/">Programming Rust</a>. Now, I'm starting to actually write code.</p> <p>I read the <a class="reference external" href="https://www.goodreads.com/review/show/5183138397">Command-Line Rust</a> book last month, which challenged readers to write their own implementations of the <a class="reference external" href="https://en.wikipedia.org/wiki/Tree_(command)">tree command</a>.</p> <p>I decided to accept the challenge.</p> <p>At its simplest, <tt class="docutils literal">tree</tt> prints a directory tree, using some of the Unicode <a class="reference external" href="https://www.compart.com/en/unicode/block/U+2500">Box Drawing</a> characters to show the hierarchical relationship, as in the image at right.</p> <p>I've split the code into two phases, which will be covered in two blog posts.</p> <ol class="arabic simple"> <li>Walking the directory tree on disk to build an in-memory tree.</li> <li>Pretty-printing the in-memory tree.</li> </ol> <p>While it's certainly possible to print a subtree as it's being read, separating the two phases yields code that is cleaner, simpler, and more testable.</p> <p>In future, I will insert a third phase, <em>processing</em>, between the reading and 
writing phases, by a weak analogy with Extract-Transform-Load (<a class="reference external" href="https://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a>).</p> <div class="section" id="walking-the-directory-tree"> <h3>Walking the Directory Tree</h3> <p>There are three kinds of file tree node that I care about: <tt class="docutils literal">File</tt>, <tt class="docutils literal">Directory</tt>, and <tt class="docutils literal">Symlink</tt>. These are the three kinds that Rust's <a class="reference external" href="https://doc.rust-lang.org/std/fs/struct.FileType.html">FileType</a> can distinguish, through its <tt class="docutils literal">is_file</tt>, <tt class="docutils literal">is_dir</tt>, and <tt class="docutils literal">is_symlink</tt> methods.</p> <ul class="simple"> <li><tt class="docutils literal">File</tt> has a name and file system metadata;</li> <li><tt class="docutils literal">Symlink</tt> has a name, a target, and metadata;</li> <li><tt class="docutils literal">Directory</tt> has a name and a list of child file tree nodes.</li> </ul> <p>Here, <em>name</em> refers to the last component of a path; e.g., the <tt class="docutils literal">gamma</tt> in <tt class="docutils literal">alpha/beta/gamma</tt>. 
The file system metadata is not currently used, but will be in future.</p> <pre class="code rust literal-block"> <span class="cp">#[derive(Debug)]</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">File</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">name</span>: <span class="nb">String</span><span class="p">,</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">metadata</span>: <span class="nc">fs</span>::<span class="n">Metadata</span><span class="p">,</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="cp">#[derive(Debug)]</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Symlink</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">name</span>: <span class="nb">String</span><span class="p">,</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">target</span>: <span class="nb">String</span><span class="p">,</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">metadata</span>: <span class="nc">fs</span>::<span class="n">Metadata</span><span class="p">,</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="cp">#[derive(Debug)]</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Directory</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">name</span>: <span class="nb">String</span><span class="p">,</span><span class="w"> </span><span class="k">pub</span><span 
class="w"> </span><span class="n">entries</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">FileTree</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="p">}</span> </pre> <p>File and directory paths are not guaranteed to be UTF-8. Indeed, Unix file paths are an arbitrary sequence of bytes, while Windows file paths are an opaque sequence of 16-bit integers. You might think that I should be using <tt class="docutils literal">OsString</tt> here, since it holds a <a class="reference external" href="https://doc.rust-lang.org/std/ffi/struct.OsString.html">platform-native string</a>. <tt class="docutils literal">String</tt> has to be valid UTF-8; <tt class="docutils literal">OsString</tt> doesn't. Unfortunately, it's not easy to look at the actual data in an <tt class="docutils literal">OsString</tt>, unless you convert it (possibly lossily) to a <tt class="docutils literal">String</tt>. See <a class="reference external" href="https://docs.rs/bstr/0.2.8/bstr/#file-paths-and-os-strings">File paths and OS strings</a> for more.</p> <p>The obvious way to represent a file tree node in Rust is as an <a class="reference external" href="https://hashrust.com/blog/why-rust-enums-are-so-cool/">enum</a> with three tuple-like variants.</p> <pre class="code rust literal-block"> <span class="cp">#[derive(Debug)]</span><span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">enum</span> <span class="nc">FileTree</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">DirNode</span><span class="p">(</span><span class="n">Directory</span><span class="p">),</span><span class="w"> </span><span class="n">FileNode</span><span class="p">(</span><span class="n">File</span><span class="p">),</span><span class="w"> </span><span class="n">LinkNode</span><span class="p">(</span><span class="n">Symlink</span><span class="p">),</span><span class="w"> 
</span><span class="p">}</span> </pre> <p>Here, each variant in the enum holds a struct of similar name. We will be able to take advantage of Rust's <a class="reference external" href="https://doc.rust-lang.org/book/ch18-03-pattern-syntax.html#destructuring-enums">pattern matching</a> to handle each variant.</p> <p>We'll use <tt class="docutils literal"><span class="pre">fs::read_dir</span></tt> to read each directory in the hierarchy. The <a class="reference external" href="https://doc.rust-lang.org/std/fs/struct.ReadDir.html">read_dir</a> function returns an iterator that yields instances of <tt class="docutils literal"><span class="pre">io::Result&lt;DirEntry&gt;</span></tt>. If a <tt class="docutils literal">DirEntry</tt> is a directory, we can recursively invoke our <tt class="docutils literal">dir_walk</tt> function to read the child directory and add its contents to our in-memory tree.</p> <p>The <a class="reference external" href="https://docs.rs/walkdir/latest/walkdir/">walkdir</a> crate also walks through a directory tree, but it hides the recursion from you. It's an excellent choice otherwise.</p> <div class="section" id="skipping-and-sorting"> <h4>Skipping and Sorting</h4> <p>In each directory that we read, we need to consider two factors.</p> <ol class="arabic simple"> <li>Which entries to skip, such as hidden files.</li> <li>How to sort the entries.</li> </ol> <p>We almost always want to skip <a class="reference external" href="https://en.wikipedia.org/wiki/Hidden_file_and_hidden_directory">hidden files and directories</a>—on Unix, those entries whose names start with the <tt class="docutils literal">.</tt> character. 
Every directory includes entries for <tt class="docutils literal">.</tt> (itself) and <tt class="docutils literal">..</tt> (parent directory), which <tt class="docutils literal"><span class="pre">fs::read_dir</span></tt> skips automatically; it may also include other hidden files or directories, such as <tt class="docutils literal">.vimrc</tt> or <tt class="docutils literal">.git</tt>.</p> <p>On Windows, hidden files are controlled by an <a class="reference external" href="https://www.raymond.cc/blog/reset-system-and-hidden-attributes-for-files-or-folders-caused-by-virus/">attribute</a>, not by their name.</p> <p>For more complicated usage, we might want to skip <a class="reference external" href="https://git-scm.com/docs/gitignore">ignored files</a>, as specified in <tt class="docutils literal">.gitignore</tt>.</p> <p>The simplest useful filter for entry names is one that rejects hidden files and directories.</p> <pre class="code rust literal-block"> <span class="k">pub</span><span class="w"> </span><span class="k">fn</span> <span class="nf">is_not_hidden</span><span class="p">(</span><span class="n">name</span>: <span class="kp">&amp;</span><span class="kt">str</span><span class="p">)</span><span class="w"> </span>-&gt; <span class="kt">bool</span> <span class="p">{</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="o">!</span><span class="n">name</span><span class="p">.</span><span class="n">starts_with</span><span class="p">(</span><span class="sc">'.'</span><span class="p">);</span><span class="w"> </span><span class="p">}</span> </pre> <p>Disk I/O is <a class="reference external" href="https://louwrentius.com/understanding-storage-performance-iops-and-latency.html">costly and slow</a>, compared to memory access. It's far more efficient to not read a directory at all than it is to eliminate a subtree at a later stage. 
Even if the OS has cached the relevant directory contents, there's still a <a class="reference external" href="https://gms.tf/on-the-costs-of-syscalls.html">cost to the syscall</a> to retrieve that data from the kernel.</p> <p>There is <a class="reference external" href="https://stackoverflow.com/a/8977490/6364">no specific order</a> to entries in a directory or to the results returned by low-level APIs like <tt class="docutils literal"><span class="pre">fs::read_dir</span></tt>. By default, <tt class="docutils literal">ls</tt> sorts entries alphabetically, but it can also sort by creation time, modification time, or size, in ascending or descending order.</p> <p>Unix filesystems are case-sensitive, but Mac filesystems (APFS and HFS+) are case-insensitive by default, although they preserve the case of the original filename. Windows' filesystems (NTFS, exFAT, and FAT32) are <a class="reference external" href="https://learn.microsoft.com/en-us/windows/win32/fileio/filesystem-functionality-comparison">likewise</a> case-preserving and case-insensitive.</p> <p>Here is a case-sensitive <a class="reference external" href="https://doc.rust-lang.org/std/vec/struct.Vec.html#method.sort_by">comparator</a> for use with <tt class="docutils literal">sort_by</tt>:</p> <pre class="code rust literal-block"> <span class="k">pub</span><span class="w"> </span><span class="k">fn</span> <span class="nf">sort_by_name</span><span class="p">(</span><span class="n">a</span>: <span class="kp">&amp;</span><span class="nc">fs</span>::<span class="n">DirEntry</span><span class="p">,</span><span class="w"> </span><span class="n">b</span>: <span class="kp">&amp;</span><span class="nc">fs</span>::<span class="n">DirEntry</span><span class="p">)</span><span class="w"> </span>-&gt; <span class="nc">Ordering</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">a_name</span>: <span class="nb">String</span> 
<span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">.</span><span class="n">path</span><span class="p">().</span><span class="n">file_name</span><span class="p">().</span><span class="n">unwrap</span><span class="p">().</span><span class="n">to_str</span><span class="p">().</span><span class="n">unwrap</span><span class="p">().</span><span class="n">into</span><span class="p">();</span><span class="w"> </span><span class="err">➊</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">b_name</span>: <span class="nb">String</span> <span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">.</span><span class="n">path</span><span class="p">().</span><span class="n">file_name</span><span class="p">().</span><span class="n">unwrap</span><span class="p">().</span><span class="n">to_str</span><span class="p">().</span><span class="n">unwrap</span><span class="p">().</span><span class="n">into</span><span class="p">();</span><span class="w"> </span><span class="n">a_name</span><span class="p">.</span><span class="n">cmp</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b_name</span><span class="p">)</span><span class="w"> </span><span class="err">➋</span><span class="w"> </span><span class="p">}</span> </pre> <ol class="arabic simple"> <li>This messy expression is necessary to get the <em>name</em> as a <tt class="docutils literal">String</tt>.</li> <li><tt class="docutils literal">cmp</tt> returns <tt class="docutils literal">Less</tt>, <tt class="docutils literal">Equal</tt>, or <tt class="docutils literal">Greater</tt> from the <tt class="docutils literal">Ordering</tt> enum.</li> </ol> <p>More on <tt class="docutils literal">Ordering</tt> <a class="reference external" href="https://www.philipdaniels.com/blog/2019/rust-equality-and-ordering/">here</a>.</p> </div> </div> <div class="section" id="the-dir-walk-function"> <h3>The <tt 
class="docutils literal">dir_walk</tt> function</h3> <p>Finally, the recursive <tt class="docutils literal">dir_walk</tt> function that creates the tree of <tt class="docutils literal">FileTree</tt> nodes.</p> <pre class="code rust literal-block"> <span class="k">pub</span><span class="w"> </span><span class="k">fn</span> <span class="nf">dir_walk</span><span class="p">(</span><span class="w"> </span><span class="n">root</span>: <span class="kp">&amp;</span><span class="nc">PathBuf</span><span class="p">,</span><span class="w"> </span><span class="n">filter</span>: <span class="nc">fn</span><span class="p">(</span><span class="n">name</span>: <span class="kp">&amp;</span><span class="kt">str</span><span class="p">)</span><span class="w"> </span>-&gt; <span class="kt">bool</span><span class="p">,</span><span class="w"> </span><span class="err">➊</span><span class="w"> </span><span class="n">compare</span>: <span class="nc">fn</span><span class="p">(</span><span class="n">a</span>: <span class="kp">&amp;</span><span class="nc">fs</span>::<span class="n">DirEntry</span><span class="p">,</span><span class="w"> </span><span class="n">b</span>: <span class="kp">&amp;</span><span class="nc">fs</span>::<span class="n">DirEntry</span><span class="p">)</span><span class="w"> </span>-&gt; <span class="nc">Ordering</span><span class="p">,</span><span class="w"> </span><span class="p">)</span><span class="w"> </span>-&gt; <span class="nc">io</span>::<span class="nb">Result</span><span class="o">&lt;</span><span class="n">Directory</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">entries</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">fs</span>::<span class="n">DirEntry</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span 
class="n">fs</span>::<span class="n">read_dir</span><span class="p">(</span><span class="n">root</span><span class="p">)</span><span class="o">?</span><span class="w"> </span><span class="p">.</span><span class="n">filter_map</span><span class="p">(</span><span class="o">|</span><span class="n">result</span><span class="o">|</span><span class="w"> </span><span class="n">result</span><span class="p">.</span><span class="n">ok</span><span class="p">())</span><span class="w"> </span><span class="p">.</span><span class="n">collect</span><span class="p">();</span><span class="w"> </span><span class="err">➋</span><span class="w"> </span><span class="n">entries</span><span class="p">.</span><span class="n">sort_by</span><span class="p">(</span><span class="n">compare</span><span class="p">);</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">directory</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">FileTree</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Vec</span>::<span class="n">with_capacity</span><span class="p">(</span><span class="n">entries</span><span class="p">.</span><span class="n">len</span><span class="p">());</span><span class="w"> </span><span class="err">➌</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">entries</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">path</span><span class="p">();</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span 
class="n">name</span>: <span class="nb">String</span> <span class="o">=</span><span class="w"> </span><span class="n">path</span><span class="p">.</span><span class="n">file_name</span><span class="p">().</span><span class="n">unwrap</span><span class="p">().</span><span class="n">to_str</span><span class="p">().</span><span class="n">unwrap</span><span class="p">().</span><span class="n">into</span><span class="p">();</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="o">!</span><span class="n">filter</span><span class="p">(</span><span class="o">&amp;</span><span class="n">name</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">➍</span><span class="w"> </span><span class="k">continue</span><span class="p">;</span><span class="w"> </span><span class="p">};</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fs</span>::<span class="n">metadata</span><span class="p">(</span><span class="o">&amp;</span><span class="n">path</span><span class="p">).</span><span class="n">unwrap</span><span class="p">();</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">node</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">match</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">➎</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">path</span><span class="p">.</span><span class="n">is_dir</span><span class="p">()</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span 
class="n">FileTree</span>::<span class="n">DirNode</span><span class="p">(</span><span class="w"> </span><span class="err">➏</span><span class="w"> </span><span class="n">dir_walk</span><span class="p">(</span><span class="o">&amp;</span><span class="n">root</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">name</span><span class="p">),</span><span class="w"> </span><span class="n">filter</span><span class="p">,</span><span class="w"> </span><span class="n">compare</span><span class="p">)</span><span class="o">?</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">path</span><span class="p">.</span><span class="n">is_symlink</span><span class="p">()</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="n">FileTree</span>::<span class="n">LinkNode</span><span class="p">(</span><span class="n">Symlink</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">name</span>: <span class="nc">name</span><span class="p">.</span><span class="n">into</span><span class="p">(),</span><span class="w"> </span><span class="n">target</span>: <span class="nc">fs</span>::<span class="n">read_link</span><span class="p">(</span><span class="n">path</span><span class="p">).</span><span class="n">unwrap</span><span class="p">().</span><span class="n">to_string_lossy</span><span class="p">().</span><span class="n">to_string</span><span class="p">(),</span><span class="w"> </span><span class="n">metadata</span>: <span class="nc">metadata</span><span class="p">,</span><span class="w"> </span><span class="p">}),</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">path</span><span class="p">.</span><span 
class="n">is_file</span><span class="p">()</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="n">FileTree</span>::<span class="n">FileNode</span><span class="p">(</span><span class="n">File</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">name</span>: <span class="nc">name</span><span class="p">.</span><span class="n">into</span><span class="p">(),</span><span class="w"> </span><span class="n">metadata</span>: <span class="nc">metadata</span><span class="p">,</span><span class="w"> </span><span class="p">}),</span><span class="w"> </span><span class="n">_</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="fm">unreachable!</span><span class="p">(),</span><span class="w"> </span><span class="p">};</span><span class="w"> </span><span class="n">directory</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">node</span><span class="p">);</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">root</span><span class="w"> </span><span class="p">.</span><span class="n">file_name</span><span class="p">()</span><span class="w"> </span><span class="p">.</span><span class="n">unwrap_or</span><span class="p">(</span><span class="n">OsStr</span>::<span class="n">new</span><span class="p">(</span><span class="s">&quot;.&quot;</span><span class="p">))</span><span class="w"> </span><span class="err">➐</span><span class="w"> </span><span class="p">.</span><span class="n">to_str</span><span class="p">()</span><span class="w"> </span><span class="p">.</span><span class="n">unwrap</span><span class="p">()</span><span class="w"> </span><span class="p">.</span><span class="n">into</span><span class="p">();</span><span class="w"> 
</span><span class="nb">Ok</span><span class="p">(</span><span class="n">Directory</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">➑</span><span class="w"> </span><span class="n">name</span>: <span class="nc">name</span><span class="p">,</span><span class="w"> </span><span class="n">entries</span>: <span class="nc">directory</span><span class="p">,</span><span class="w"> </span><span class="p">})</span><span class="w"> </span><span class="p">}</span> </pre> <ol class="arabic simple"> <li>Currently, the <tt class="docutils literal">filter</tt> and <tt class="docutils literal">compare</tt> parameters are <tt class="docutils literal">fn</tt>s. They could probably be <tt class="docutils literal">FnMut</tt> traits.</li> <li>Read directory. Discard any <tt class="docutils literal">Error</tt> results. Collect into a <tt class="docutils literal">Vec</tt>.</li> <li>We'll need at most this many entries.</li> <li>Use <tt class="docutils literal">filter</tt> to discard names that won't be visited.</li> <li>Match the path as a <tt class="docutils literal">DirNode</tt>, <tt class="docutils literal">LinkNode</tt>, or <tt class="docutils literal">FileNode</tt>, by using <a class="reference external" href="https://doc.rust-lang.org/book/ch18-03-pattern-syntax.html#extra-conditionals-with-match-guards">match guards</a>.</li> <li>Visit the subdirectory recursively.</li> <li>If <tt class="docutils literal">root</tt> was <tt class="docutils literal">&quot;.&quot;</tt>, the <tt class="docutils literal">file_name()</tt> will be <tt class="docutils literal">None</tt>.</li> <li>Return a <tt class="docutils literal">Directory</tt> for this directory, wrapped in an <tt class="docutils literal"><span class="pre">io::Result</span></tt>.</li> </ol> <p>In <a class="reference external" href="https://www.georgevreilly.com/blog/2023/01/24/TreeInRust2PrintingTrees.html">Part 2</a>, we'll print the directory tree.</p> </div> fsymbols for Unicode 
weirdness tag:www.georgevreilly.com,2022-12-31:/blog/2022/12/31/FSymbolsForUnicodeWeirdness.html 2023-01-01T04:35:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <p>My display name on Twitter currently looks like @ɢᴇᴏʀɢᴇᴠʀᴇɪʟʟʏ@ᴛᴇᴄʜ.ʟɢʙᴛ, an attempt to route around Twitter's apparent censorship of Mastodon information.</p> <p>I used the <a href="https://fsymbols.com/generators/">FSymbols Generators</a> to produce several variants.</p> <div class="codehilite"><pre><span></span>@𝕘𝕖𝕠𝕣𝕘𝕖𝕧𝕣𝕖𝕚𝕝𝕝𝕪@𝕥𝕖𝕔𝕙.𝕝𝕘𝕓𝕥 ʇqƃʅ.ɥɔǝʇ@ʎʅʅᴉǝɹʌǝƃɹoǝƃ@ @𝗀𝖾𝗈𝗋𝗀𝖾𝗏𝗋𝖾𝗂𝗅𝗅𝗒@𝗍𝖾𝖼𝗁.𝗅𝗀𝖻𝗍 @𝘨𝘦𝘰𝘳𝘨𝘦𝘷𝘳𝘦𝘪𝘭𝘭𝘺@𝘵𝘦𝘤𝘩.𝘭𝘨𝘣𝘵 @𝑔𝑒𝑜𝑟𝑔𝑒𝑣𝑟𝑒𝑖𝑙𝑙𝑦@𝑡𝑒𝑐ℎ.𝑙𝑔𝑏𝑡 @𝙜𝙚𝙤𝙧𝙜𝙚𝙫𝙧𝙚𝙞𝙡𝙡𝙮@𝙩𝙚𝙘𝙝.𝙡𝙜𝙗𝙩 @𝚐𝚎𝚘𝚛𝚐𝚎𝚟𝚛𝚎𝚒𝚕𝚕𝚢@𝚝𝚎𝚌𝚑.𝚕𝚐𝚋𝚝 @𝔤𝔢𝔬𝔯𝔤𝔢𝔳𝔯𝔢𝔦𝔩𝔩𝔶@𝔱𝔢𝔠𝔥.𝔩𝔤𝔟𝔱 </pre></div> <p>Many of these variants come from <a href="https://www.compart.com/en/unicode/block/U+1D400">Unicode Block "Mathematical Alphanumeric Symbols"</a>.</p> <p>There are a lot more things you can do with Unicode than just <a href="https://www.georgevreilly.com/blog/2016/02/12/UnicodeUpsideDownMappingPart2.html">upside-down text</a>.</p> Backwards Ranges in Python tag:www.georgevreilly.com,2022-12-19:/blog/2022/12/19/BackwardsRangesInPython.html 2022-12-19T23:20:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <p>In Python, if you want to specify a sequence of numbers from <code>a</code> up to (but excluding) <code>b</code>, you can write <code>range(a, b)</code>. This generates the sequence <code>a, a+1, a+2, ..., b-1</code>. 
You start at <code>a</code> and keep going until the next number would be <code>b</code>.</p> <p>In Python 3, <code>range</code> is <em>lazy</em> and the values in the sequence do not materialize until you consume the range.</p> <div class="codehilite"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">12</span><span class="p">)</span> <span class="go">range(3, 12)</span> <span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">12</span><span class="p">))</span> <span class="go">[3, 4, 5, 6, 7, 8, 9, 10, 11]</span> </pre></div> <p>Trey Hunner makes the point that <a href="https://treyhunner.com/2018/02/python-range-is-not-an-iterator/">range is a lazy iterable</a> rather than an iterator.</p> <p>You can also <em>step</em> by an increment other than one: <code>range(a, b, s)</code>. This generates <code>a, a+s, a+2*s, ..., b-s</code> (assuming that <code>(b - a) % s == 0</code>; i.e., <code>a</code> and <code>b</code> are separated by an exact multiple of <code>s</code>.)</p> <div class="codehilite"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span> <span class="go">[3, 6, 9]</span> </pre></div> <p>What if you want to count down? 
<code>range(b, a, -s)</code> won't do what you want.</p> <div class="codehilite"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">3</span><span class="p">))</span> <span class="go">[12, 9, 6]</span> </pre></div> <p>Why? Because you're starting at <code>b</code>, a value that doesn't appear in the forward range, and you're ending before you reach <code>a</code>, a value that is certainly in the forward range. You have to subtract <code>s</code> from both <code>b</code> and <code>a</code>:</p> <p>When you use <code>range(b-s, a-s, -s)</code>, you get <code>b-s, b-2*s, ..., a+s, a</code>.</p> <div class="codehilite"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">12</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">3</span><span class="p">))</span> <span class="go">[9, 6, 3]</span> <span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">12</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">3</span><span class="p">)),</span> <span class="nb">list</span><span class="p">(</span><span class="nb">reversed</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">12</span><span 
class="p">,</span> <span class="mi">3</span><span class="p">)))</span> <span class="go">([9, 6, 3], [9, 6, 3])</span> </pre></div> Ulysses at 100 tag:www.georgevreilly.com,2022-02-02:/blog/2022/02/02/UlyssesAt100.html 2022-02-02T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <a class="reference external image-reference" href="https://www.irishtimes.com/news/ireland/irish-news/an-post-launches-new-stamps-to-celebrate-centenary-of-ulysses-1.4787040"><img alt="new stamps celebrating the centenary of Ulysses" src="https://www.irishtimes.com/polopoly_fs/1.4787039.1643276940!/image/image.jpg_gen/derivatives/box_620_330/image.jpg"/></a> <p>On 2nd February 1882, in the Dublin suburb of Rathgar, a son was given unto John and May Joyce. James Joyce celebrated his 40th birthday in Paris on 2nd February 1922 by receiving the first printed copy of his novel <em>Ulysses</em>. Parts of it had already been published in literary magazines and the book was eagerly received by the cognoscenti. It took more than a decade for <em>Ulysses</em> to be published in Britain and the United States. Censors had considered the book obscene, but the courts established that it had legitimate literary merit.</p> <p>For decades, <em>Ulysses</em> was poorly received in Ireland. The book was considered <a class="reference external" href="https://www.irishtimes.com/news/ireland/irish-news/the-year-of-ulysses-2022-marks-centenary-of-joyce-s-experimental-masterpiece-1.4766055">blasphemous</a> and obscene by many. Worse, Joyce had freely borrowed from life, populating the pages of <em>Ulysses</em> with people he had known in Dublin.</p> <p>By the time of the centenary of Joyce's birth in 1982, attitudes had changed in Ireland. <em>Ulysses</em> was now celebrated, if not widely read. 
RTÉ Radio broadcast a <a class="reference external" href="https://www.rte.ie/archives/exhibitions/681-history-of-rte/706-rte-1980s/327476-ulysses-broadcast/">25-hour reading</a> of the entire book.</p> <p>I was a schoolboy of almost seventeen in Dublin in February 1982. I did not, alas, listen to the <a class="reference external" href="https://archive.org/details/Ulysses-Audiobook-Merged">RTÉ recording</a> at the time, but at some point that year, I started reading <em>Ulysses</em> for myself. And like so many would-be readers before and since, I hit Episode 3, &quot;Proteus&quot;, which opens with &quot;Ineluctable modality of the visible&quot; and promptly dives into Stephen Dedalus's impenetrable thoughts. If I could give some advice to myself 40 years ago, it would be to &quot;skip over the hard bits&quot;. I'm in good company on that recommendation. Daniel Mulhall, Ireland's Ambassador to the US and author of the recent <a class="reference external" href="https://www.amazon.com/dp/1848408293/?tag=georgvreill-20">Ulysses: A Reader's Odyssey</a>, gives the same advice. Episode 4, &quot;Calypso&quot;, introduces us to Leopold Bloom and is far more enjoyable.</p> <p>Unfortunately, I did not have the benefit of that advice then, and I had little to do with <em>Ulysses</em> for the next two decades. In <a class="reference external" href="https://www.georgevreilly.com/blog/2003/06/11/Bloomsday.html">2003</a>, I took part in the <a class="reference external" href="https://www.wildgeeseseattle.org/">Wild Geese Players of Seattle</a>'s staged reading of Episodes 8 and 9, &quot;Lestrygonians&quot; and &quot;Scylla and Charybdis&quot;. In 2004, I helped adapt the next episode, &quot;Wandering Rocks&quot;, for that year's staged reading. When Kieran O'Malley, the group's original founder, moved back to Ireland in 2005 or 2006, I took over as dramaturg. 
I've led the Geese for many years now, and I've <a class="reference external" href="https://github.com/WildGeeseSeattle/Ulysses">adapted scripts</a> for the entire book.</p> <p>Why? The connection with Dublin that I share with Joyce is certainly part of it. I've come to love the book. (Most of it; there are certainly parts that I find tedious.) It's a book in which very little happens, and yet it encompasses everything. We get an extremely rounded picture of Bloom and his inner life. It's funny and sad and erudite and annoying and wise. Joyce has distilled the human condition into one summer's day in Dublin.</p> <p>And now it is the centenary of the publication of <em>Ulysses</em>. I posted some <a class="reference external" href="https://www.wildgeeseseattle.org/ulysses-at-100.html">centenary material</a> at the Wild Geese website.</p> <p><em>Ulysses</em> has become an ineluctable part of my life.</p> Diffing a fragment of a file tag:www.georgevreilly.com,2022-01-31:/blog/2022/01/31/DiffFileFragment.html 2022-01-31T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <p>A while back, I had extracted some code out of a large file into a separate file and made some modifications. I wanted to check that the differences were minimal. Let's say that the extracted code had been between lines 123 and 456 of <tt class="docutils literal">large_old_file</tt>.</p> <pre class="code bash literal-block"> diff -u &lt;<span class="o">(</span>sed -n <span class="s1">'123,456p;457q'</span> large_old_file<span class="o">)</span> new_file </pre> <p>What's happening here?</p> <ul class="simple"> <li><tt class="docutils literal">sed <span class="pre">-n</span> '123,456p'</tt> is printing lines 123–456 of <tt class="docutils literal">large_old_file</tt>.</li> <li>The <tt class="docutils literal">457q</tt> tells sed to abandon the file at line 457. 
Otherwise, it will keep reading all the way to the end.</li> <li>The <tt class="docutils literal">&lt;(sed <span class="pre">...)</span></tt> is an example of <a class="reference external" href="https://tldp.org/LDP/abs/html/process-sub.html">process substitution</a>. The <em>output</em> of the <tt class="docutils literal">sed</tt> invocation becomes the first <em>input</em> of the <tt class="docutils literal">diff</tt> command.</li> </ul> <p>A similar example: <a class="reference external" href="https://www.georgevreilly.com/blog/2017/01/11/DiffTransformedFile.html">Diff a Transformed File</a>.</p> <p>BTW, these days, I usually use <a class="reference external" href="https://github.com/dandavison/delta">delta</a> for diffing at the command line, especially with Git.</p> 40 Years of Programming tag:www.georgevreilly.com,2022-01-31:/blog/2022/01/31/40YearsOfProgramming.html 2022-01-31T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <p>40 years ago this month, I sat down at a computer and wrote a program. (Or &quot;programme&quot;, as I spelled it then.) It was the first time I had ever used a computer. Very few people had used computers in 1982, in Ireland or elsewhere.</p> <p>What was the program? No idea. Just a few lines of AppleSoft Basic. But it was enough to get me hooked and change my life.</p> <p>I still get a hit when a little bit of code unlocks in my brain. It's quite addictive. There's always more to learn and to see.</p> <p>I wrote more about this in 2012: <a class="reference external" href="https://www.georgevreilly.com/blog/2012/01/26/30YearsOfProgramming.html">30 Years of Programming</a>.</p> On Circumnavigating the Aubreyiad Again tag:www.georgevreilly.com,2021-12-30:/blog/2021/12/30/CircumnavigatingAubreyiad.html 2021-12-30T08:00:00Z George V. 
Reilly https://www.georgevreilly.com/ george@reilly.org <p>At the beginning of 2021, prompted by Russell Crowe's defense of <em>Master and Commander</em>, I began yet another re-read of the twenty Aubrey-Maturin novels. Or, as the fandom would have it, another circumnavigation. It's probably my fifth or sixth circumnavigation, since I bought the complete boxed set as a Christmas present to myself in the early aughts.</p> <p>I completed the twentieth book, <em>Blue at the Mizzen</em>, yesterday, and also the few pages of the final, unfinished novel, <em>21</em>. (I also read about <a class="reference external" href="https://www.goodreads.com/user/year_in_books/2021/3723742">120 other books</a> in 2021, down from a stupendous <a class="reference external" href="https://www.goodreads.com/user/year_in_books/2020/3723742">200 books in 2020</a>, but that's neither here nor there.)</p> <blockquote class="twitter-tweet"> <p lang="en" dir="ltr">I think I&#39;m due for another re-read of Patrick O&#39;Brian&#39;s Aubrey/Maturin novels (all 6,500 pages) and a rewatch of Master and Commander. <a href="https://t.co/gVf9IBan7e">pic.twitter.com/gVf9IBan7e</a></p>&mdash; George V. Reilly (@georgevreilly) <a href="https://twitter.com/georgevreilly/status/1350913122345783297?ref_src=twsrc%5Etfw"> January 17, 2021</a> </blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script><p>Why did I put myself through re-reading 6,500 pages of a dense <em>roman-fleuve</em> yet again? For the sheer pleasure of joining up once more with my old friends, Captain Jack Aubrey and Dr Stephen Maturin, in their 15-year fight against Napoleon.</p> <p>They are an unlikely pair of friends. Jack Aubrey, a big, hearty English naval officer, is utterly competent in his domain, magnificent at sea but naïve and easily duped on land. 
Stephen Maturin, the illegitimate son of an Irish officer and a Catalan lady, is a renowned physician and naturalist, a former <a class="reference external" href="https://en.wikipedia.org/wiki/Society_of_United_Irishmen">United Irishman</a> turned British intelligence agent, a Catholic in a Protestant service, and a perpetual landlubber and sloven. They have little in common, save a shared love of music and of natural philosophy. Both are Fellows of the <a class="reference external" href="https://royalsociety.org/about-us/history/">Royal Society</a>; Jack, to the surprise of many, is a mathematician and astronomer.</p> <p>And yet, they are fast friends and Stephen follows Jack from ship to ship. A captain must hold himself aloof from his crew and his officers. He is the sole authority, often months of sailing away from his superiors. He dines alone, save when invited to the officers' wardroom or when he invites them to join him. Stephen, as Jack's particular friend, is exempt from the normal strictures, allowing Jack to retain his humanity on the long voyages.</p> <p>It is the friendship and the two main characters that hold me, along with the adventure and the travel. O'Brian immersed himself in the eighteenth and early nineteenth centuries, and his encyclopaedic knowledge helped him bring the era to life with incredible verisimilitude. O'Brian was an accomplished storyteller and often <a class="reference external" href="https://quotingobrian.tumblr.com/">very funny</a>.</p> <p>The characters sound and act like people of the time, not like transplanted twentieth-century Americans. 
Jack, Stephen, and the other characters would be at home in the pages of Jane Austen (sister to two Royal Navy officers).</p> <p><a class="reference external" href="https://www.tor.com/series/re-reading-patrick-obrians-aubrey-maturin-series/">Jo Walton's re-read</a> will give you a taste of the books.</p> Review: Crafting Interpreters tag:www.georgevreilly.com,2021-12-28:/blog/2021/12/28/ReviewCraftingInterpreters.html 2021-12-28T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <a class="reference external image-reference" href="https://www.amazon.com/dp/0990582930/?tag=georgvreill-20"><img alt="Crafting Interpreters" class="right-float" src="https://images-na.ssl-images-amazon.com/images/I/41-7uSeOyCL._SX398_BO1,204,203,200_.jpg"/></a> <div class="line-block"> <div class="line">Title: <a class="reference external" href="https://craftinginterpreters.com/">Crafting Interpreters</a></div> <div class="line">Author: Robert Nystrom</div> <div class="line">Rating: ★ ★ ★ ★ ★</div> <div class="line">Publisher: Genever Benning</div> <div class="line">Copyright: 2021</div> <div class="line">ISBN: <a class="reference external" href="https://www.amazon.com/dp/0990582930/?tag=georgvreill-20">978-0990582939</a></div> <div class="line">Pages: 640</div> <div class="line">Keywords: programming, interpreters</div> <div class="line">Reading period: 10–28 December, 2021</div> </div> <p>I've read hundreds of technical books over the last 40 years. <em>Crafting Interpreters</em> is an instant classic, and far more readable and fun than many of the classics.</p> <p>Nystrom covers a lot of ground in this book, building two very different interpreters for Lox, a small dynamic language of his own design. 
He takes us through <em>every line</em> of jlox, a Java-based tree-walk interpreter, and of clox, a bytecode virtual machine written in C.</p> <p>For the first implementation, jlox, he covers such topics as scanning, parsing expressions with recursive descent, evaluating expressions, control flow, functions and closures, classes, and inheritance.</p> <p>Starting with an empty slate, Nystrom adds just enough code to implement the topic of each chapter, having a working albeit incomplete implementation of the interpreter by the end of the chapter. He adds new code as he goes, inserting an extra <tt class="docutils literal">case</tt> into a <tt class="docutils literal">switch</tt> here or writing a new function there, or replacing a few lines of an earlier implementation with something that's just been explained. Knuth's <a class="reference external" href="https://en.wikipedia.org/wiki/Literate_programming">Literate Programming</a> explains a finished implementation, broken into separate pieces for exposition. Nystrom's continual, ever-evolving exposition is slower to get to the point, but it's excellent pedagogy. I would be remiss if I didn't mention the hundreds of hand-drawn illustrations, which add a quirky flavor to the tone of the book. He has a blog post on how he <a class="reference external" href="http://journal.stuffwithstuff.com/2020/04/05/crafting-crafting-interpreters/">pulled this organization off</a> and another on how he created a <a class="reference external" href="http://journal.stuffwithstuff.com/2021/07/29/640-pages-in-15-months/">physical book</a> from the text.</p> <p>clox is a very different second implementation of a Lox interpreter. Instead of a slow interpreter walking an abstract syntax tree, he develops a stack-based virtual machine, compiles Lox into bytecode, and interprets the bytecode. 
He covers theory and practical considerations for creating a bytecode virtual machine, makes use of Pratt’s “top-down operator precedence parsing”, and implements closures and classes in C. In jlox, he used Java's <tt class="docutils literal">HashMap</tt> to manage identifiers and relied on Java's garbage collection for memory management. For clox, he implements a hash table and a mark-and-sweep garbage collector. Although he has to cover similar topics (parsing, local variables, closures) each time, he finds a fresh perspective for the second implementation.</p> <p>I read the entire book for free at <a class="reference external" href="https://craftinginterpreters.com/">https://craftinginterpreters.com/</a>, but I liked it so much that I've ordered a physical copy. In fact, I actually read much of the book on the website in 2020, but life intervened and I didn't finish it, so this month, I read it again from the start.</p> <p>This book is not a textbook and you don't get an exhaustive introduction to building interpreters, much less compilers. In the final year of my Computer Science degree at Trinity College Dublin in 1986–87, I studied the <a class="reference external" href="https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools">Dragon Book</a> when the first edition was brand new. <em>Crafting Interpreters</em> is a lot more fun than the Dragon Book.</p> <p>Highly recommended!</p> Path Traversal Attacks tag:www.georgevreilly.com,2021-10-05:/blog/2021/10/05/PathTraversalAttacks.html 2021-10-05T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <p>I was surprised to read this evening that the Apache Web Server just fixed an actively exploited path traversal flaw.</p> <blockquote class="twitter-tweet"> <p lang="en" dir="ltr"> 🚨 Apache has disclosed an *actively exploited* Path traversal flaw in the <a href="https://twitter.com/hashtag/opensource?src=hash&amp;ref_src=twsrc%5Etfw">#opensource</a> &quot;httpd&quot; server. 
Over 112,000 exposed Apache servers run version 2.4.49, and should be upgraded now!<br> New fix checks for encoded path traversal characters e.g. /../.%2E/<a href="https://t.co/1tLNc3LAul">https://t.co/1tLNc3LAul</a> <a href="https://t.co/mDHLEU3k9N">pic.twitter.com/mDHLEU3k9N</a> </p>&mdash; Ax Sharma (@Ax_Sharma) <a href="https://twitter.com/Ax_Sharma/status/1445391350053183500?ref_src=twsrc%5Etfw">October 5, 2021</a> </blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script><p>Apparently, it was <a class="reference external" href="https://github.com/apache/httpd/commit/4c79fd280dfa3eede5a6f3baebc7ef2e55b3eb6a">introduced over a year ago</a>.</p> <p>I'm gobsmacked that Apache didn't have a robust suite of tests for this.</p> <p>Directory Traversal attacks have been a problem for web servers since the beginning. <a class="reference external" href="https://owasp.org/www-community/attacks/Path_Traversal">OWASP</a>, <a class="reference external" href="https://portswigger.net/web-security/file-path-traversal">PortSwigger</a>, and <a class="reference external" href="https://spanning.com/blog/directory-traversal-web-based-application-security-part-8/">Spanning</a> all have explanations that you can read. The essence is that you make a request to a URL that looks like <tt class="docutils literal"><span class="pre">http://example.com/cgi-bin/../../../../etc/passwd</span></tt> and, voilà, you get access to something that you shouldn't. Each of the <tt class="docutils literal">..</tt> path segments climbs up a level of the file system. 
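</p> <p>A minimal Python sketch of that climb, and of one common defense, using only the standard library's <tt class="docutils literal">posixpath</tt> module. The document root <tt class="docutils literal">/var/www/cgi-bin</tt> and the helper names <tt class="docutils literal">resolve</tt> and <tt class="docutils literal">is_inside_root</tt> are assumptions made up for illustration; this is not how Apache or IIS implement the check.</p>

```python
import posixpath

# Hypothetical document root, chosen only for illustration.
DOC_ROOT = "/var/www/cgi-bin"


def resolve(request_path: str) -> str:
    """Join the request path onto the root and collapse '.' and '..' segments."""
    return posixpath.normpath(posixpath.join(DOC_ROOT, request_path.lstrip("/")))


def is_inside_root(request_path: str) -> bool:
    """Return True only if the resolved path is still under DOC_ROOT."""
    resolved = resolve(request_path)
    return resolved == DOC_ROOT or resolved.startswith(DOC_ROOT + "/")


# Four '..' segments are enough to climb out of the three-level root.
print(resolve("../../../../etc/passwd"))
print(is_inside_root("../../../../etc/passwd"))
print(is_inside_root("scripts/hello.py"))
```

<p>The check runs on the path <em>after</em> normalization, so it only helps if every encoded form of <tt class="docutils literal">..</tt> and <tt class="docutils literal">/</tt> has already been decoded by that point.</p> <p>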
Even the simplest web server knows better than to blindly allow a sequence of <tt class="docutils literal">..</tt> path segments, so you have to be a little clever about how you express them.</p> <div class="section" id="iis-unicode-exploit"> <h3>IIS Unicode Exploit</h3> <p>I remember when I worked on the IIS development team at Microsoft in 1997–2004, we got hit by <a class="reference external" href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2000-0884">CVE-2000-0884</a> in 2000, which made use of an <a class="reference external" href="https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings">overlong UTF-8 encoding</a>.</p> <p>URLs allow <a class="reference external" href="https://developer.mozilla.org/en-US/docs/Glossary/percent-encoding">percent encoding</a> for characters that can't be sent literally. For example, <tt class="docutils literal">%3D</tt> encodes an <tt class="docutils literal">=</tt> as the two-digit hexadecimal value of <tt class="docutils literal">=</tt>’s ASCII code. UTF-8 characters beyond U+007F require two or more bytes of storage, each of which can be percent encoded; e.g., U+00C1 (<tt class="docutils literal">Á</tt>) is encoded as the <tt class="docutils literal">C3 81</tt> byte pair in UTF-8, and as <tt class="docutils literal">%C3%81</tt> in percent encoding.</p> <p>The slash character, <tt class="docutils literal">/</tt> or U+002F, can be percent encoded as <tt class="docutils literal">%2F</tt>. IIS 4 and 5 were smart enough to treat <tt class="docutils literal">%2F</tt> as a slash and to defend against sequences like <tt class="docutils literal"><span class="pre">..%2F..%2F</span></tt>. However, the attackers encoded a slash as <tt class="docutils literal">%C0%AF</tt>—a sequence that is burned into my brain. 
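</p> <p>To make the trick concrete, here is a small Python sketch. <tt class="docutils literal">naive_decode_two_byte</tt> is a hypothetical, deliberately lax decoder written for illustration (it is not IIS's code); it skips the check for overlong forms, so it turns <tt class="docutils literal">C0 AF</tt> into a slash. Python's own strict UTF-8 decoder refuses the same bytes.</p>

```python
from urllib.parse import unquote_to_bytes


def naive_decode_two_byte(data: bytes) -> str:
    """Decode a 2-byte UTF-8 sequence WITHOUT rejecting overlong forms."""
    lead, cont = data
    if lead & 0b1110_0000 != 0b1100_0000 or cont & 0b1100_0000 != 0b1000_0000:
        raise ValueError("not a two-byte UTF-8 sequence")
    # 5 payload bits from the leading byte, 6 from the continuation byte.
    return chr(((lead & 0b0001_1111) << 6) | (cont & 0b0011_1111))


raw = unquote_to_bytes("%C0%AF")   # percent-decoding gives the bytes C0 AF
print(naive_decode_two_byte(raw))  # the lax decoder produces a slash

try:
    raw.decode("utf-8")            # a strict decoder rejects the sequence
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

<p>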
This two-byte UTF-8 sequence can be decoded as U+002F, though it should not be treated as valid because it is <a class="reference external" href="https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings">overlong</a>: the five payload bits in the leading byte are all zero.</p> <p>The <a class="reference external" href="https://www.giac.org/paper/gcih/115/iis-unicode-exploit/101163">GIAC paper</a> explains in some detail how this could be exploited.</p> </div> <div class="section" id="windows-security-push"> <h3>Windows Security Push</h3> <p>Windows XP went on sale in late 2001, touted as the most secure version of Windows ever. (It was, at that time.)</p> <p>Right around Christmas 2001, the <a class="reference external" href="https://www.giac.org/paper/gcih/274/windows-xp-upnp-exploits/102906">UPnP vulnerability</a> was disclosed. Brian Valentine, the Senior VP who ran Windows, threw a shitfit. It was announced that <em>all</em> of Windows would spend the month of February 2002 undergoing security training, so that we could <a class="reference external" href="https://owasp.org/www-community/Threat_Modeling">threat model</a> and review our code.</p> <p>We had fundamentally rearchitected IIS 6, which would be released in Windows Server 2003, around a new worker process model (inspired by Apache's), and we had rewritten much of it. There was a new kernel-mode driver, http.sys, that terminated all requests and routed them to the appropriate handler in kernel or user mode. I was part of the http.sys dev team at that point.</p> <p>IIS had already gotten serious about security by then. We had to, after <a class="reference external" href="https://en.wikipedia.org/wiki/Code_Red_(computer_worm)">Code Red</a>, <a class="reference external" href="https://en.wikipedia.org/wiki/Nimda">Nimda</a>, the Unicode exploit, and others. 
<a class="reference external" href="https://www.linkedin.com/in/mikehow/">Mike Howard</a> had been the IIS Security Program Manager before he went on to bigger responsibilities. A lot of the first edition of his <a class="reference external" href="https://www.amazon.com/Writing-Secure-Second-Developer-Practices/dp/0735617228">Writing Secure Code</a> book was based on his experience with securing IIS, and a lot of the second edition benefited from the Security Push experience.</p> <p>Since http.sys was new and an obvious target, our team actually spent two months carefully reviewing everything. It turned out that we had done a good job over the previous couple of years and we didn't find much to worry about.</p> <p>We did identify that the URL canonicalization in http.sys was overly complicated. I rewrote that component and I created a ton of unit tests for it. Developers writing unit tests was not common at Microsoft back in 2002: we had a separate caste of testers to write tests.</p> <p>I've been out of the loop since I left IIS in 2004, but to my knowledge, there were no further vulnerabilities in URL handling.</p> <p>I'm surprised and disappointed that Apache would mess up path traversal in the 2020s.</p> </div> Accidentally Quadratic: Python List Membership tag:www.georgevreilly.com,2021-10-04:/blog/2021/10/04/AccidentallyQuadraticPythonListMembership.html 2021-10-04T08:00:00Z George V. 
Reilly https://www.georgevreilly.com/ george@reilly.org <p>We had a performance regression in a test suite recently when the median test time jumped by two minutes.</p> <a class="reference external image-reference" href="https://www.bigocheatsheet.com/"><img alt="Big O Cheat Sheet" src="https://www.georgevreilly.com/content/binary/bigochart.gif"/></a> <p>We tracked it down to this (simplified) code fragment:</p> <pre class="code python literal-block">
<span class="n">task_inclusions</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">some_collection_of_tasks</span><span class="p">())</span>
<span class="n">invalid_tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">t</span><span class="o">.</span><span class="n">task_id</span><span class="p">()</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">airflow_tasks</span> <span class="k">if</span> <span class="n">t</span><span class="o">.</span><span class="n">task_id</span><span class="p">()</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">task_inclusions</span><span class="p">]</span>
</pre> <p>This looks fairly innocuous, and it was, until the size of the result returned from <tt class="docutils literal">some_collection_of_tasks()</tt> jumped from a few hundred to a few thousand.</p> <p>The <a class="reference external" href="https://docs.python.org/3/reference/expressions.html#membership-test-operations">in comparison operator</a> conveniently works with all of Python's standard sequences and collections, but its efficiency varies. For a <tt class="docutils literal">list</tt> and other sequences, <tt class="docutils literal">in</tt> must search linearly through all the elements until it finds a matching element <em>or</em> the list is exhausted. In other words, <tt class="docutils literal">x in some_list</tt> takes <span class="formula"><i>O</i>(<i>n</i>)</span> time. 
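</p> <p>One way to see the linear search is to count equality comparisons. The <tt class="docutils literal">Counted</tt> wrapper below is a hypothetical helper written only for this illustration; it relies on nothing beyond the standard behavior of <tt class="docutils literal">in</tt> on a list.</p>

```python
class Counted:
    """Wraps a value and counts how often instances are compared for equality."""

    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.comparisons += 1
        return isinstance(other, Counted) and self.value == other.value


items = [Counted(i) for i in range(1000)]

# Hit on the last element: the search walks the whole list to reach it.
Counted.comparisons = 0
found_last = Counted(999) in items
hit_comparisons = Counted.comparisons

# Miss: every element is compared before the search gives up.
Counted.comparisons = 0
found_missing = Counted(-1) in items
miss_comparisons = Counted.comparisons

print(found_last, hit_comparisons)
print(found_missing, miss_comparisons)
```

<p>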
For a <tt class="docutils literal">set</tt> or a <tt class="docutils literal">dict</tt>, however, <tt class="docutils literal">x in container</tt> takes, on average, only <span class="formula"><i>O</i>(1)</span> time. See <a class="reference external" href="https://wiki.python.org/moin/TimeComplexity">Time Complexity</a> for more.</p> <p>The <tt class="docutils literal">invalid_tasks</tt> list comprehension was explicitly looping through one list, <tt class="docutils literal">airflow_tasks</tt>, and implicitly doing a linear search through <tt class="docutils literal">task_inclusions</tt> for each value of <tt class="docutils literal">t</tt>. The nested loop was hidden and its effect only became apparent when <tt class="docutils literal">task_inclusions</tt> grew large.</p> <p>The list comprehension was actually taking <span class="formula"><i>O</i>(<i>n</i><sup>2</sup>)</span> time. When <span class="formula"><i>n</i></span> was comparatively small (a few hundred), this wasn't a problem. When <span class="formula"><i>n</i></span> grew to several thousand, it became a big problem.</p> <p>This is a classic example of an <a class="reference external" href="https://accidentallyquadratic.tumblr.com/">accidentally quadratic</a> algorithm. Indeed, Nelson describes a very similar problem with <a class="reference external" href="https://accidentallyquadratic.tumblr.com/post/161243900944/mercurial-changegroup-application">Mercurial changegroups</a>.</p> <p>This performance regression was compounded because this fragment of code was being called thousands of times—I believe once for each task— making the overall cost cubic, <span class="formula"><i>O</i>(<i>n</i><sup>3</sup>)</span>.</p> <p>The fix here is similar: Use a <tt class="docutils literal">set</tt> instead of a <tt class="docutils literal">list</tt> and get <span class="formula"><i>O</i>(1)</span> membership testing. 
The <tt class="docutils literal">invalid_tasks</tt> list comprehension now takes the expected <span class="formula"><i>O</i>(<i>n</i>)</span> time.</p> <pre class="code python literal-block">
<span class="n">task_inclusions</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">some_collection_of_tasks</span><span class="p">())</span>
<span class="n">invalid_tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">t</span><span class="o">.</span><span class="n">task_id</span><span class="p">()</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">airflow_tasks</span> <span class="k">if</span> <span class="n">t</span><span class="o">.</span><span class="n">task_id</span><span class="p">()</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">task_inclusions</span><span class="p">]</span>
</pre> <p>More at <a class="reference external" href="https://www.coengoedegebure.com/understanding-big-o-notation/">Understanding Big-O Notation</a> and the <a class="reference external" href="https://www.bigocheatsheet.com/">Big-O Cheat Sheet</a>.</p> Passphrase Generators tag:www.georgevreilly.com,2021-05-10:/blog/2021/05/10/PassphraseGenerators.html 2021-05-10T08:00:00Z George V. Reilly https://www.georgevreilly.com/ george@reilly.org <a class="reference external image-reference" href="https://xkcd.com/936/"><img alt="Password Strength" src="https://imgs.xkcd.com/comics/password_strength.png"/></a> <p>I've been using <a class="reference external" href="https://en.wikipedia.org/wiki/Password_manager">password managers</a> for at least 15 years to keep track of all my passwords. 
I have separate, distinct, strong passwords for hundreds of sites, and I've only memorized the handful that I need to actually type regularly.</p> <p>I started out with the <a class="reference external" href="https://www.georgevreilly.com/blog/2006/02/06/200KeePassEntries.html">KeePass</a> desktop app, but I switched to the online <a class="reference external" href="https://www.georgevreilly.com/blog/2016/01/07/DicewareAndLastpass.html">LastPass</a> app about a decade ago. At work, we use <a class="reference external" href="https://1password.com/">1Password</a>.</p> <p>When I register for a site, LastPass generates a random password for me, such as:</p> <pre class="literal-block">
tV%5joS$U6^uY5xU
T2oEUY!g70Iv1b&amp;I
8kNHg9*A5GMR9%8D
</pre> <p>LastPass securely syncs my passwords between machines and devices. Its browser integration and its Android and iPhone apps mean that I rarely have to type any of those ugly messes in.</p> <p>But when I do have to type in such a password, it's unpleasant in a browser. It doesn't help that LastPass in some cases displays passwords in a sans-serif font that makes it easy to <a class="reference external" href="https://typography.guru/journal/letters-symbols-misrecognition/">misrecognize</a> characters such as <tt class="docutils literal">Il</tt>, <tt class="docutils literal">0O</tt>, <tt class="docutils literal">5S</tt>, or <tt class="docutils literal">8B</tt>. It's far more painful in an Android app, where you have to switch the keyboard in and out of symbol mode. It's usually even worse in iPhone apps, which rarely offer you an option to see your password in the clear as you're laboriously typing it, so it's easy to make a mistake. 
When I tried to use a remote control to enter my Netflix and Amazon Prime passwords into a new set-top box, I got so annoyed that I brought down a real keyboard and plugged it into the USB port.</p> <p><a class="reference external" href="https://theintercept.com/2015/03/26/passphrases-can-memorize-attackers-cant-guess/">Passphrases</a> have nice properties compared to random passwords: they're human readable, they're much easier—if longer—to type, and you can actually remember them if you have to. A passphrase of at least five words (chosen by a secure random generator) is computationally infeasible to crack.</p> <p>The ur-example of random passphrase generators is <a class="reference external" href="https://en.wikipedia.org/wiki/Diceware">Diceware</a> from 1995. There are various problems with the Diceware wordlist, which are rectified by more modern lists, such as the <a class="reference external" href="https://www.eff.org/deeplinks/2016/07/new-wordlists-random-passphrases">EFF Wordlists</a>.</p> <p>Which would you rather type? The <a class="reference external" href="http://www.catb.org/jargon/html/L/line-noise.html">line noise</a> above or one of these passphrases?:</p> <pre class="literal-block"> confident starfish aftermost elsewhere jasmine shun baggage chaps reward cuddle avenue rut pardon skating earlobe latter blissful snippet jolt corroding upstage-divinely-ninth-unfilled-skeleton SkimmingMachinistBlessHesitancyKissableRink </pre> <p>When I want to generate a random passphrase, I tend to use either the <a class="reference external" href="https://github.com/ulif/diceware">Python diceware</a> command-line tool or Glenn Rempe's JavaScript-based <a class="reference external" href="https://www.rempe.us/diceware/#eff">Diceware website</a>. 
Both use cryptographic random number generators to generate excellent passphrases.</p> <p>The <a class="reference external" href="https://1password.com/password-generator/">1Password Online Generator</a> (in Memorable Password mode) also generates passphrases, as do the desktop and browser versions of 1Password.</p> <p>My master password for LastPass is a passphrase, as is my laptop password. I'm also using <a class="reference external" href="https://authy.com/">Authy</a> for 2FA, but that's a post for another time.</p> <div class="admonition tip"> <p class="first admonition-title">Tip</p> <p>If you have to supply answers for one of those misbegotten <a class="reference external" href="https://www.okta.com/blog/2021/03/security-questions/">security questions</a>, such as your favorite movie or your first car, <em>do not answer truthfully</em>. Truthful answers increase your risk of identity theft. The answers are often guessable, can frequently be learned easily about you, and may be obtained through a password breach on another site.</p> <p>Instead, generate a passphrase as the &quot;answer&quot; <em>and store it and the question in the Notes field of your password manager</em>. If you have to supply the answer to a security question over the phone to a customer service rep, you'll be thankful that you chose something that you can clearly say aloud.</p> <p class="last">Also <a class="reference external" href="https://www.mentalfloss.com/article/522136/taking-facebook-quizzes-could-put-you-risk-identity-theft">Facebook quizzes</a> and memes like &quot;Your porn name is your middle name and the first car you had&quot; are trying to obtain your answers to common security questions. Don't answer them.</p> </div> Punctuating James Joyce tag:www.georgevreilly.com,2021-05-08:/blog/2021/05/08/PunctuatingJamesJoyce.html 2021-05-08T08:00:00Z George V. 
Reilly https://www.georgevreilly.com/ george@reilly.org <a class="reference external image-reference" href="https://www.writermag.com/improve-your-writing/revision-grammar/punctuation-bootcamp/"><img alt="Punctuation Boot Camp: Our ultimate grammar guide" src="https://cdn.writermag.com/2018/07/punctuationbootcamp_news-e1540567976133.jpg"/></a> <p>In <a class="reference external" href="https://lithub.com/the-punctuation-marks-loved-and-hated-by-famous-writers/">The Punctuation Marks Loved (and Hated) by Famous Writers</a>, Emily Temple relays a range of opinions from writers such as Tom Wolfe, Elmore Leonard, and Ursula K. Le Guin on periods, semicolons, hyphens and more.</p> <p>On commas:</p> <blockquote> <p>Listens to the sound of the sentence, and is always right, Bob: Toni Morrison</p> <blockquote> [On her editor, Bob Gottlieb, who famously “was always inserting commas into Morrison’s sentences and she was always taking them out”] We read the same way. We think the same way. He is overwhelmingly aggressive about commas and all sorts of things. He does not understand that commas are for pauses and breath. He thinks commas are for grammatical things. We have come to an understanding, but it is still a fight.</blockquote> </blockquote> <p>On periods:</p> <blockquote> <p>Tolerates it, if he must: Cormac McCarthy</p> <blockquote> <p>I believe in periods, in capitals, in the occasional comma, and that’s it.</p> <p>James Joyce is a good model for punctuation. He keeps it to an absolute minimum. There’s no reason to blot the page up with weird little marks. 
I mean, if you write properly you shouldn’t have to punctuate.</p> </blockquote> </blockquote> <p>My own prose tends towards longer sentences, often sprinkled with dashes, parentheses, and semicolons.</p> <p>Since 2004, I've adapted all of James Joyce's <em>Ulysses</em> for staged readings by the <a class="reference external" href="https://www.wildgeeseseattle.org/">Wild Geese Players of Seattle</a>, and I'm in the Morrison camp, not the McCarthy–Joyce one.</p> <p>Paragraphs like these work on the printed page. (More or less.)</p> <blockquote> <p>The tear is bloody near your eye. Talking through his bloody hat. Fitter for him go home to the little sleepwalking bitch he married, Mooney, the bumbailiff's daughter, mother kept a kip in Hardwicke street, that used to be stravaging about the landings Bantam Lyons told me that was stopping there at two in the morning without a stitch on her, exposing her person, open to all comers, fair field and no favour.</p> <p class="attribution">&mdash;Anonymous narrator, Episode 12, “Cyclops”, L400</p> </blockquote> <p></p> <blockquote> <p>Martin Cunningham forgot to give us his <a class="reference external" href="http://www.jjon.org/joyce-s-allusions/spellingbee-conundrum">spellingbee conundrum</a> this morning. It is amusing to view the unpar one ar alleled embarra two ars is it? double ess ment of a harassed pedlar while gauging au the symmetry with a y of a peeled pear under a cemetery wall. Silly, isn't it? Cemetery put in of course on account of the symmetry.</p> <p class="attribution">&mdash;Mr Bloom, Episode 7, “Aeolus”, L170</p> </blockquote> <p>But imagine trying to read those sentences <em>aloud</em> during a performance and bring the sense of the text to the audience.</p> <p>As an aid to my performers, I've introduced “cadence bars” (denoted by ‘≀’) to the scripts to augment Joyce's sparse punctuation and to bring out the individual fragments.</p> <blockquote> The tear is bloody near your eye. 
Talking through his bloody hat. Fitter for him go home ≀ to the little sleepwalking bitch he married, Mooney, the bum·bailiff's daughter, mother kept a kip in Hardwicke street, that used to be stravaging about the landings ≀ Bantam Lyons told me ≀ that was stopping there at two in the morning ≀ without a stitch on her, exposing her person, open to all comers, fair field and no favour.</blockquote> <p></p> <blockquote> Martin Cunningham forgot to give us his spelling·bee conundrum this morning. It is amusing to view the ≀ unpar ≀ one ar ≀ alleled ≀ embarra ≀ two ars is it? ≀ double ess ≀ ment ≀ of a harassed pedlar ≀ while gauging ≀ au ≀ the symmetry ≀ with a y ≀ of a peeled pear ≀ under a cemetery wall. Silly, isn't it? Cemetery put in of course ≀ on account of the symmetry.</blockquote> <p>I've also added some pseudo-hyphens (bum·bailiff, spelling·bee, what·do·you·call·him) to counteract Joyce's Germanic habit of stringing several words into one.</p> <p>This seems to help, though some of our readers have to fight a tendency to pause too much when they encounter a ‘≀’ symbol.</p> Now You Have 32 Problems tag:www.georgevreilly.com,2020-04-23:/blog/2020/04/23/regex-32-problems.html 2020-04-23T08:00:00Z George V. 
Reilly https://www.georgevreilly.com/ george@reilly.org <p></p> <blockquote> <p>Some people, when confronted with a problem, think “I know, I'll use regular expressions.” <a class="reference external" href="http://regex.info/blog/2006-09-15/247">Now they have two problems</a>.</p> <blockquote> — Jamie Zawinski</blockquote> </blockquote> <p>A Twitter thread about <a class="reference external" href="https://twitter.com/nbashaw/status/1253186961482715136">very long regexes</a> reminded me of the <a class="reference external" href="https://www.georgevreilly.com/blog/2009/07/11/64bitWindows7.html">longest regex</a> that I ever ran afoul of, a particularly horrible multilevel mess that had worked acceptably on the 32-bit .NET CLR, but brought the 64-bit CLR to its knees.</p> <blockquote> <p>Whenever I ran our ASP.NET web application [on Win64], it would go berserk, eat up all 4GB of my physical RAM, push the working set of IIS's w3wp.exe to <em>12GB</em>, and max out one of my 4&nbsp;cores! The only way to maintain any sanity was to run <tt class="docutils literal">iisreset</tt> every 20&nbsp;minutes to gently kill the process.</p> <p>WinDbg and Process Explorer showed that the rogue thread was stuck in a loop in <tt class="docutils literal">mscorjit!LifetimesListInteriorBlocksHelperIterative&lt;GCInfoLiveRecordManipulator&gt;</tt>. I passed a minidump on to my former colleagues in IIS, who sent it to the CLR team. They said:</p> <blockquote> The only thing I can tell is that it is Regex, and some regex expression compiled down to a method with 456KB of IL. That is <em>huge</em>, and yes 12GB of RAM consumed for something like that is expected.</blockquote> <p>With that clue, I was able to track down the problem, a particularly foul regex, built from a 10KB string, with 32&nbsp;alternating expressions, each of which contains dozens of alternated subexpressions. 
The string is built from many smaller strings, so it's not obvious in the source just how ugly it is.</p> </blockquote> <p>I never wrote a followup post explaining how I dealt with this beast.</p> <p>The regex was used on the <a class="reference external" href="https://www.cozi.com/calendar/">Cozi calendar</a> to parse appointments in everyday language, such as “Ann/John Dinner out Friday at 8pm” or “John's birthday every Dec. 7”. These would get translated into (possibly recurring) <a class="reference external" href="https://tools.ietf.org/html/rfc5545">iCalendar</a> appointments.</p> <p>Some of the subexpressions mentioned above looked like:</p> <ul class="simple"> <li><tt class="docutils literal">ordinals = <span class="pre">&quot;1st|2nd|...|31st&quot;</span></tt></li> <li><tt class="docutils literal">short_days = <span class="pre">&quot;Sun|Mon|...|Sat&quot;</span></tt></li> <li><tt class="docutils literal">full_days = <span class="pre">&quot;Sunday|Monday|...|Saturday&quot;</span></tt></li> <li><tt class="docutils literal">short_months = <span class="pre">&quot;Jan|Feb|...|Dec&quot;</span></tt></li> <li><tt class="docutils literal">full_months = <span class="pre">&quot;January|February|...|December&quot;</span></tt></li> <li><tt class="docutils literal">recurrence = <span class="pre">&quot;((every|each)?</span> (first|second|third|fourth|fifth|last)? &quot; + &quot;(&quot; + short_days + &quot;|&quot; + full_days + &quot;)&quot; + ...</tt></li> </ul> <p>I've elided the intermediate values but they were spelled out in the original. 
Some of the simpler subexpressions were repeated several times, nested inside others.</p> <p>This all screamed that a <em>grammar</em> and a <em>real parser</em> were needed, but the test suite also screamed <em>here be dragons!</em></p> <p>I resisted the temptation to rewrite the appointment parser from scratch with a proper grammar, or to experiment with a real natural language parser, though it remained on my personal todo list for the rest of my time at Cozi. We were migrating from C# to Python at that point, and the legacy appointment parser was one of the few remaining pieces that prevented us from shutting down the .NET servers.</p> <p>Instead, I changed the appointment parser code so that it didn't attempt to match the entire 10KB monster in one go. I looped through each of the 32 top-level disjunctions, manually performing the alternation. If any one of those matched, then I had what I needed. Reducing the regexes to a few hundred characters each tamed the combinatorial explosion of backtracking state.</p> <p>Regexes definitely have a place, but do not try to implement a full grammar as a single regular expression.</p>
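<p>A minimal sketch of that loop, in Python rather than the original C#. The two patterns and the names <tt class="docutils literal">ALTERNATIVES</tt> and <tt class="docutils literal">first_match</tt> are hypothetical stand-ins for illustration; the real parser had 32 much larger alternatives, and this is not its actual code:</p>

```python
import re

# Hypothetical building blocks, far smaller than the originals.
DAYS = r"(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)[a-z]*"
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"

# Each entry plays the role of one top-level alternative.
ALTERNATIVES = [
    # e.g. "every Dec. 7"
    rf"\b(?:every|each)\s+(?P<month>{MONTHS})\s+(?P<day>\d{{1,2}})\b",
    # e.g. "Friday at 8pm"
    rf"\b(?P<weekday>{DAYS})\s+at\s+(?P<hour>\d{{1,2}})\s*(?P<ampm>am|pm)\b",
]

# Compile each alternative on its own instead of joining them with "|"
# into one enormous pattern.
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in ALTERNATIVES]


def first_match(text):
    """Try each alternative in turn; return the first match, or None."""
    for regex in COMPILED:
        match = regex.search(text)
        if match:
            return match
    return None


birthday = first_match("John's birthday every Dec. 7")
dinner = first_match("Ann/John Dinner out Friday at 8pm")
```

<p>One difference from a single alternation: this loop returns a match for the first alternative in the list that matches anywhere in the text, not necessarily the leftmost match in the string, so the order of the list matters when more than one alternative could apply.</p>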