GHC 2019-11-13

2019-11-13 17:24:07, https://git.io/JerCH in jarun/googler
Improve textwrap in presence of zero-width sequences
====================================================

Fixes #287 which I promised to fix long ago.

Three commits: 90f4ea4 is some cleanup work around indentation I committed back in June... I re-examined it and it should be good. See commit message for details. Next is some py34 legacy removal paving the way for today's work. Then there's the meat of this PR. I'll repeat the commit message here.

---
    
Example (with keyword highlighting on, but the highlight effect is
obviously lost in textual form):

Before:

```console
$ googler -n3 --np linux                                                       |
                                                                               |
 1.  Linux.org                                                                 |
     https://www.linux.org/                                                    |
     5 days ago ... Friendly Linux Forum. ... This is a video from my          |
     series of chapters in my book "Essential Linux Command Line"...           |
     Continue… Load more…                                                      |
                                                                               |
 2.  Linux - Wikipedia                                                         |
     https://en.wikipedia.org/wiki/Linux                                       |
     Linux is a family of open source Unix-like operating systems              |
     based on the Linux kernel, an operating system kernel first               |
     released on September 17, 1991, by ...                                    |
                                                                               |
 3.  The Linux Foundation – Supporting Open Source Ecosystems                  |
     https://www.linuxfoundation.org/                                          |
     The Linux Foundation supports the creation of sustainable open            |
     source projects and ecosystems in blockchain, deep learning, networking,  |
     and more.                                                                 |
                                                                               |
```

After:

```console
$ googler -n3 --np linux                                                       |
                                                                               |
 1.  Linux.org                                                                 |
     https://www.linux.org/                                                    |
     5 days ago ... Friendly Linux Forum. ... This is a video from my series   |
     of chapters in my book "Essential Linux Command Line"... Continue… Load   |
     more…                                                                     |
                                                                               |
 2.  Linux - Wikipedia                                                         |
     https://en.wikipedia.org/wiki/Linux                                       |
     Linux is a family of open source Unix-like operating systems based on the |
     Linux kernel, an operating system kernel first released on September 17,  |
     1991, by ...                                                              |
                                                                               |
 3.  The Linux Foundation – Supporting Open Source Ecosystems                  |
     https://www.linuxfoundation.org/                                          |
     The Linux Foundation supports the creation of sustainable open source     |
     projects and ecosystems in blockchain, deep learning, networking, and     |
     more.                                                                     |
                                                                               |
```
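
For context, here is a rough illustration of why highlighting can throw the wrapping off. It assumes the highlight is applied as ANSI escape sequences before the text reaches `textwrap`; the sample sentence, the width of 40 and the `visible` helper are made up for this illustration and are not taken from googler. `textwrap` measures chunks with `len()`, so each escape sequence eats into the line budget even though it occupies no columns on screen:

```python
import re
import textwrap

BOLD_ON, BOLD_OFF = "\x1b[1m", "\x1b[22m"  # zero-width on a terminal
plain = "Linux is a family of open source operating systems based on the Linux kernel"
highlighted = plain.replace("Linux", BOLD_ON + "Linux" + BOLD_OFF)


def visible(line):
    """Strip SGR escape sequences to recover what the terminal displays."""
    return re.sub(r"\x1b\[[0-9;]*m", "", line)


plain_lines = textwrap.wrap(plain, 40)
# Wrapping the highlighted text counts 9 invisible characters per keyword.
naive_lines = [visible(line) for line in textwrap.wrap(highlighted, 40)]

print(plain_lines[0])  # Linux is a family of open source
print(naive_lines[0])  # Linux is a family of open
```

The second line is wrapped early in the same way as the "Before" output: the invisible characters are charged against the width.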

The idea is to use a text wrapper that keeps track of the position of
each source character, so that zero-width sequences can be inserted at
known offsets afterwards.
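
To make the idea concrete, here is a minimal sketch of a position-tracking wrapper. It is not the `TrackedTextwrap` from this commit: the class name `OffsetTrackingWrap`, its attributes and its method names are hypothetical, and it assumes the text contains at least one visible character. It relies on two documented `textwrap` behaviors: each whitespace character is replaced by a single space (with `expand_tabs=False`), and whitespace is dropped only at line breaks.

```python
import re
import textwrap

_WHITESPACE = re.compile(r"[\t\n\x0b\x0c\r ]")


class OffsetTrackingWrap:
    """Hypothetical sketch of a position-tracking wrapper."""

    def __init__(self, text, width):
        # expand_tabs=False keeps a 1:1 mapping between source and wrapped characters.
        self.plain_lines = textwrap.wrap(text, width, expand_tabs=False)
        # textwrap turns every whitespace character into a single space before wrapping.
        munged = _WHITESPACE.sub(" ", text)
        self.coords = []  # coords[offset] == (row, col) in plain_lines
        last = (0, 0)  # position just past the last emitted character
        for row, line in enumerate(self.plain_lines):
            start = munged.index(line, len(self.coords))  # ValueError if assumptions break
            # Anything skipped must be whitespace dropped at a line break.
            dropped = munged[len(self.coords):start]
            assert dropped.strip(" ") == "", "non-whitespace was dropped"
            self.coords.extend([last] * len(dropped))
            self.coords.extend((row, col) for col in range(len(line)))
            last = (row, len(line))
        assert munged[len(self.coords):].strip(" ") == "", "unmapped trailing text"
        # Trailing dropped whitespace, plus one sentinel for offset == len(text).
        self.coords.extend([last] * (len(text) - len(self.coords) + 1))
        self._inserts = [[] for _ in self.plain_lines]

    def insert_zero_width(self, seq, offset):
        """Record a zero-width sequence to appear before source offset `offset`."""
        row, col = self.coords[offset]
        self._inserts[row].append((col, seq))

    @property
    def lines(self):
        rendered = []
        for line, inserts in zip(self.plain_lines, self._inserts):
            pieces, prev = [], 0
            # sorted() is stable, so sequences at the same column keep insertion order.
            for col, seq in sorted(inserts, key=lambda item: item[0]):
                pieces += [line[prev:col], seq]
                prev = col
            pieces.append(line[prev:])
            rendered.append("".join(pieces))
        return rendered


text = "the quick brown fox jumps over the lazy dog"
wrapped = OffsetTrackingWrap(text, 16)
start = text.index("fox")
# Insertion order does not matter; offsets always refer to the source text.
wrapped.insert_zero_width("\x1b[22m", start + len("fox"))  # bold off
wrapped.insert_zero_width("\x1b[1m", start)  # bold on
print(wrapped.lines)
```

Because the sequences are recorded against source offsets and only spliced in at render time, they never take part in the width calculation.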

So, now we have two hacks on top of the Python standard library (PSL)
textwrap: a CJK monkey patch, and a position-tracking wrapper. Naturally
one would question whether it's cleaner to just implement a
variable-width capable wrapper (variable-width *sequences* capable, not
just characters) from scratch. The answer is no. Just look at the
non-variable-width-capable implementation in PSL[1] and one would
conclude that piling on hacks is still cleaner.

[1] https://github.com/python/cpython/blob/3.8/Lib/textwrap.py

Admittedly the TrackedTextwrap implementation is ever so slightly
involved, so it would be nice to set up unit tests for it. I actually
have one written but can't really be bothered to set up the whole
unittest environment for it... So here I include it in the commit
message for posterity:

```py
import random
import re

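import pytest

# NOTE: TrackedTextwrap must be in scope; it is the class this commit
# adds to the googler script.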


@pytest.mark.parametrize("iteration", range(50))
def test_tracked_textwrap(iteration):
    whitespace = "\t\n\v\f\r "
    s = """This module provides runtime support for type hints as specified by PEP 484, PEP 526, PEP 544,
PEP 586, PEP 589, and PEP 591. The most fundamental support consists of the types Any, Union, Tuple,
Callable, TypeVar, and Generic. For full specification please see PEP 484. For a simplified
introduction to type hints see PEP 483."""
    wrapped = TrackedTextwrap(s, 80)
    lines = wrapped.lines
    # ['This module provides runtime support for type hints as specified by PEP 484, PEP',
    # '526, PEP 544, PEP 586, PEP 589, and PEP 591. The most fundamental support',
    # 'consists of the types Any, Union, Tuple, Callable, TypeVar, and Generic. For',
    # 'full specification please see PEP 484. For a simplified introduction to type',
    # 'hints see PEP 483.']

    # Test all coordinates point to expected characters.
    for offset, ch in enumerate(s):
        row, col = wrapped.get_coordinate(offset)
        assert col <= len(lines[row])
        if col == len(lines[row]):
            # Dropped whitespace
            assert ch in whitespace
        else:
            assert lines[row][col] == ch or (
                ch in whitespace and lines[row][col] == " "
            )

    # Test insertion.
    # Make the entire paragraph blue.
    insertions = [("\x1b[34m", 0), ("\x1b[0m", len(s))]
    for m in re.finditer(r"PEP\s+\d+", s):
        # Mark all "PEP *" as bold.
        insertions.extend([("\x1b[1m", m.start()), ("\x1b[22m", m.end())])
    # Insert in random order.
    random.shuffle(insertions)
    for seq, offset in insertions:
        wrapped.insert_zero_width_sequence(seq, offset)
    assert wrapped.lines == [
        "\x1b[34mThis module provides runtime support for type hints as specified by \x1b[1mPEP 484\x1b[22m, \x1b[1mPEP",
        "526\x1b[22m, \x1b[1mPEP 544\x1b[22m, \x1b[1mPEP 586\x1b[22m, \x1b[1mPEP 589\x1b[22m, and \x1b[1mPEP 591\x1b[22m. The most fundamental support",
        "consists of the types Any, Union, Tuple, Callable, TypeVar, and Generic. For",
        "full specification please see \x1b[1mPEP 484\x1b[22m. For a simplified introduction to type",
        "hints see \x1b[1mPEP 483\x1b[22m.\x1b[0m",
    ]
```

Note that I did program very defensively here: the underlying
assumptions about the PSL textwrap algorithm should be sound (I read the
documentation carefully in full, and grokked the implementation), but
I'm still checking my assumptions and failing noisily in case any of
them fails.

Final note on minor changes in behavior: LFs in the abstract are no
longer dropped when rendering; they are now handled. I honestly don't
even think LFs would survive our parser, where we actively drop them
when constructing the abstract; the `abstract.replace('\n', '')` is
probably an artifact of the past (didn't bother to check). Anyway, now a
remaining LF (if any) is handled like any other whitespace when passed
through textwrap, which means it's replaced by a space and possibly
dropped.
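
A quick check of that LF behavior against plain PSL `textwrap` with default settings (the sample strings and widths are made up for the illustration):

```python
import textwrap

# An embedded LF is replaced by a space, like any other whitespace.
mid_line = textwrap.wrap("first line\nsecond line", 80)
print(mid_line)  # ['first line second line']

# An LF that coincides with a line break is dropped along with the break.
at_break = textwrap.wrap("aaaa\nbbbb", 4)
print(at_break)  # ['aaaa', 'bbbb']
```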