GHC 2018-10-16

2018-10-16 19:50:35, https://git.io/fxgeJ in jarun/googler

So, what's the solution?

In fact, in the wake of https://github.com/jarun/googler/issues/249#issuecomment-428043444, I came to the painful realization that #76, the previous parser rewrite, was doomed from the beginning. I was rewriting undocumented spaghetti code (essentially gotos) into documented and programmatically guarded spaghetti code (still, essentially gotos). The rationale at that point was to improve the existing code, making it understandable and extendable without having to painfully go through the whole parser, under the constraint of not using lxml, bs4, or any other higher level parser library, so without too much thought I just largely went along with the original design. The mistake there was "without too much thought". It was a bad design in the first place. **Mixing parser logic and parser plumbing makes everything hard to understand and debug.** It's also very hard to add new plumbing to the single pass parser, as we can see here.

The obvious solution is to build a data structure, just like using lxml, bs4, or any other higher level parser library. But how can we do it without introducing a dependency? Well, the key realization is embarrassingly simple: it's in fact very easy to write such a library...

In fact, I just wrote a DOM with CSS selector support in ~1150 loc: https://github.com/zmwangx/dim/blob/master/dim.py

The module is a generic implementation and supports most of the CSS Selectors Level 3 syntax. What's missing: pseudo-classes (like `:nth-child()`), pseudo-elements (like `:before` and `:after`, which don't make sense outside rendering contexts anyway), and namespace prefixes (useless for pure HTML). Much of that ~1150 loc is devoted to docstrings, `__repr__` and `__str__`, optional methods that enrich the API, etc., so the essential loc is well under 1k. The best part? Outside the DOM builder / HTML parser (only ~50 loc) and the CSS selector parser, almost all the code is trivial: pick any method, and its implementation is short and sweet and hardly requires any context to understand. You can see the code for yourself.
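
To give a feel for why a DOM builder can stay this small, here is a minimal sketch built on the standard library's `html.parser`. It is not dim's actual implementation: the `Element` and `TreeBuilder` classes and the `build_tree` helper are names made up for this illustration, and the sketch handles only tags, attributes, and text (no selectors, comments, or doctype).

```python
from html.parser import HTMLParser
from typing import Dict, Iterator, List, Optional, Union

# Void elements never get a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Element:
    """Hypothetical minimal element node (illustration only, not dim's API)."""

    def __init__(self, tag: str, attrs: Dict[str, Optional[str]],
                 parent: Optional["Element"] = None) -> None:
        self.tag = tag
        self.attrs = attrs
        self.parent = parent
        self.children: List[Union["Element", str]] = []

    def descendants(self) -> Iterator["Element"]:
        """Yield all descendant elements in document order."""
        for child in self.children:
            if isinstance(child, Element):
                yield child
                yield from child.descendants()

    @property
    def text(self) -> str:
        """Concatenated text of this element and everything below it."""
        return "".join(c if isinstance(c, str) else c.text for c in self.children)


class TreeBuilder(HTMLParser):
    """Turn the parser's event stream into a tree using a stack of open elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.root = Element("#root", {})
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        el = Element(tag, dict(attrs), parent=self._stack[-1])
        self._stack[-1].children.append(el)
        if tag not in VOID_TAGS:
            self._stack.append(el)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].children.append(data)


def build_tree(html: str) -> Element:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


tree = build_tree('<div><h3 class="r"><a href="/url?q=x">Hello <b>World</b></a></h3><br><p>tail</p></div>')
h3 = next(el for el in tree.descendants()
          if el.tag == "h3" and el.attrs.get("class") == "r")
print(h3.text)  # Hello World
```

All the plumbing lives in the builder; everything after `build_tree()` is plain tree traversal, which is the separation argued for above.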

The module is also strictly typed ("strictly" in the sense of `mypy --strict`), tested (99% coverage at this point), and fully documented (documentation at https://docs.tcl.sh/py/dim/), so I can guarantee that the code is of pretty good quality. It can be safely embedded.

Adding ~1150 loc (compared to ~700 loc at the moment for the parser: https://github.com/jarun/googler/blob/ea5ca238cc2a3aad9ae2a6a3e541424c5f130332/googler#L873-L1550) might give you pause, but as I said, the code is conceptually a lot simpler; it's just that the structural approach, the rich documentation, and the plethora of methods (many of them optional) take up space.

And with this higher level API, we can write code like this (a PoC) — concise, readable parser logic:

```py
import textwrap
from urllib.parse import urlparse, parse_qs

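from dim import parse_html

# Debug HTML dumped by googler (the path is a placeholder).
with open("/path/to/googler/debug/html/output", encoding="utf-8") as f:
    html = f.read()
tree = parse_html(html)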
index = 0
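for h3 in tree.select_all("h3.r"):
    # Skip headings nested inside a div.hp-xpdbox container.
    if any(ancestor.matched_by("div.hp-xpdbox") for ancestor in h3.ancestors()):
        continue
    try:
        a = h3.select("a")
        title = a.text
        # The href carries the target URL in its `q` query parameter.
        url = parse_qs(urlparse(a.attr("href")).query)["q"][0]
        # The abstract is the span.st under the element that follows the h3.
        abstract = h3.next_element_sibling().select("span.st").text.replace("\n", "")
    except AttributeError:
        # An expected node was missing, so this h3 is not a regular result.
        continue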
    index += 1
    print(
        textwrap.dedent(
            f"""\
            {index} {title}  [{urlparse(url).netloc}]
            {abstract}
            """
        )
    )

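spell_orig = tree.select("span.spell_orig")
if spell_orig:
    # Walk back from span.spell_orig to the nearest preceding <a>,
    # which holds the corrected query.
    corrected = next(
        filter(lambda el: el.tag == "a", spell_orig.previous_siblings()), None
    )
    # Guard against the <a> being absent instead of crashing on None.text.
    if corrected is not None:
        print(f"(Showing results for '{corrected.text}')")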
```

When you feed in the HTML from `googler -d helol world`, you get

```
1 "Hello, World!" program - Wikipedia  [en.wikipedia.org]
A "Hello, World!" program is a computer program that outputs or displays the message "Hello, World!". Being a very simple program in most programming ...

2 HelloWorld, A Merkle Company  [www.helloworld.com]
A powerful combination of native technology and marketing strategy allowing brands to create unforgettable interactions, drive consumer demand and ...

3 The Hello World Collection  [helloworldcollection.de]
The largest collection of Hello World programs on the Internet.

4 Hello, World! - Learn Python - Free Interactive Python Tutorial  [www.learnpython.org]
Hello, World! Python is a very simple language, and has a very straightforward syntax. It encourages programmers to program without boilerplate (prepared) ...

5 Hello, World! - Learn Java - Free Interactive Java Tutorial  [www.learnjavaonline.org]
Hello, World! Java is an object oriented language (OOP). Objects in Java are called "classes". Let's go over the Hello world program, which simply prints "Hello, ...

6 Computer Programming/Hello world - Wikibooks, open books for an ...  [en.wikibooks.org]
Hello, world! programs make the text "Hello, world!" appear on a computer screen. It is usually the first program encountered when learning a programming ...

7 The Hello World Program: Hands-on Computer Science  [thehelloworldprogram.com]
Online videos and tutorials combining technology and art. Learn computer science, programming, and web development with us, your educational and ...

8 Hello World: the first multi-artist album composed by artists with an ...  [www.helloworldalbum.net]
Hello World” is the first multi-artist music album composed with Artificial Intelligence. Its goal is to show that AI can be used to create new, compelling music, and ...

9 Hello World - Rust By Example - Rust Documentation  [doc.rust-lang.org]
Hello World. This is the source code of the traditional Hello World program. // This is a comment, and will be ignored by the compiler // You can test this code by ...

(Showing results for 'hello world')
```

You can run the code from your browser here: https://repl.it/@zmwangx/googler-via-dim

---

I think this is the way to go for googler v4.0. Of course, to actually adopt this, I'll be very conservative — I'm going to gather a lot of Google responses and make sure the new parser gives the exact same results as the old parser (except when the old parser is wrong, e.g. at autocorrect detection).
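
The comparison could be driven by something as small as the sketch below. Everything in it is an assumption made for illustration: the `diff_parsers` helper, the `(title, url, abstract)` result shape, and the toy stand-in parsers are hypothetical, and the real old and new parsers would be plugged in instead.

```python
from typing import Callable, Dict, List, Mapping, Tuple

Result = Tuple[str, str, str]  # assumed shape: (title, url, abstract)
Parser = Callable[[str], List[Result]]


def diff_parsers(
    responses: Mapping[str, str], old_parse: Parser, new_parse: Parser
) -> Dict[str, Tuple[List[Result], List[Result]]]:
    """Map each saved response name to (old, new) output where the parsers disagree."""
    mismatches = {}
    for name, html in responses.items():
        old, new = old_parse(html), new_parse(html)
        if old != new:
            mismatches[name] = (old, new)
    return mismatches


# Toy stand-ins so the sketch runs on its own; they are not the real parsers.
def toy_old(html: str) -> List[Result]:
    return [("t", "u", html.strip())]


def toy_new(html: str) -> List[Result]:
    return [("t", "u", html.strip().replace("\n", ""))]


responses = {"plain.html": "abstract", "multiline.html": "abs\ntract"}
report = diff_parsers(responses, toy_old, toy_new)
print(sorted(report))  # ['multiline.html']
```

Returning both outputs for each mismatch, rather than just failing, leaves room to review the cases by hand, since a disagreement may mean the old parser is the one that is wrong.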

2018-10-16 19:18:04, https://git.io/fxgeU in jarun/googler

Autocorrect detection is broken (and bigger problems)
=====================================================

Autocorrect detection is broken. Sample query: `helol world`.

Old documentation says:

https://github.com/jarun/googler/blob/ea5ca238cc2a3aad9ae2a6a3e541424c5f130332/googler#L1094-L1117

However, there's no `a.spell` anymore. Now we have some stupid class `gL9Hy` that probably should not be hardcoded:

```html
<div>
  <span class="gL9Hy">Showing results for</span>
  <a class="gL9Hy" href="/search?q=hello+world&amp;oe=UTF-8&amp;hl=en&amp;sa=X&amp;as_q=&amp;nfpr=&amp;spell=1&amp;ved=0ahUKEwjL756goYreAhWNVN8KHR2gADIQvwUIEQ"><b><i>hello</i></b> world</a>
  <br/>
  <span class="spell_orig">Search instead for <a href="/search?q=helol+world&amp;oe=UTF-8&amp;hl=en&amp;sa=X&amp;as_q&amp;nfpr=1&amp;spell"><b><i>helol</i></b> world</a>
  </span>
</div>
```

We have to find `span.spell_orig`, then backtrack to extract the corrected phrase. This can hardly be handled in our current framework: we use a single pass parser carrying only the minimum amount of state, with no lookbehind at all. Bolting on backtracking means another ad hoc register and more random-looking code littered through multiple methods — it gets real ugly real fast.
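
For contrast, once the markup is held in a tree, the backtracking step itself is only a few lines. The sketch below is an illustration, not a proposed fix: it uses the standard library's `xml.etree.ElementTree`, which works here solely because the snippet above (with the hrefs abridged) happens to be well-formed XML, and `find_autocorrection` is a hypothetical helper name. Real Google responses would need a forgiving HTML parser.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The snippet from above, with the href query strings abridged.
snippet = """\
<div>
  <span class="gL9Hy">Showing results for</span>
  <a class="gL9Hy" href="/search?q=hello+world&amp;spell=1"><b><i>hello</i></b> world</a>
  <br/>
  <span class="spell_orig">Search instead for <a href="/search?q=helol+world&amp;nfpr=1"><b><i>helol</i></b> world</a>
  </span>
</div>
"""


def find_autocorrection(root: ET.Element) -> Optional[str]:
    """Return the corrected query, or None if there is no span.spell_orig."""
    # ElementTree nodes carry no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for span in root.iter("span"):
        if span.get("class") != "spell_orig":
            continue
        parent = parent_of.get(span)
        if parent is None:
            continue
        siblings = list(parent)
        position = siblings.index(span)
        # Walk backwards from span.spell_orig to the nearest preceding <a>.
        for el in reversed(siblings[:position]):
            if el.tag == "a":
                return "".join(el.itertext())
    return None


print(find_autocorrection(ET.fromstring(snippet)))  # hello world
```

Note that nothing here depends on the `gL9Hy` class: only `span.spell_orig` and the document structure are used.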

(To be continued.)