GHC 2020-07-18

9 comments.

, https://git.io/JJngo in jarun/googler
Btw we've racked up a sizable list of fixes and improvements so it might be time to consider a release.

, https://git.io/JJngK in jarun/googler
Do not print an additional blank line when abstract is empty
============================================================

Not actually sure why we never changed this... But I happened to be developing against a request that resulted in an empty abstract today, and the double blank line sure looked pointless (especially when it's repeated due to the result duplication bug).

Before (yeah, this is without the duplicate suppression):

```console
$ ./googler python3.5 eol

 1.  Is there official guide for Python 3.x release lifecycle? - Stack ...
     https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle


 2.  Is there official guide for Python 3.x release lifecycle? - Stack ...
     https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle


 3.  17. Development Cycle — Python Developer's Guide
     https://devguide.python.org/devcycle/
     A branch less than 5 years old but no longer in maintenance mode is a ... For reference, here are the Python versions that most recently reached their
     end-of-life: ...
```

After:

```console
$ ./googler python3.5 eol

 1.  Is there official guide for Python 3.x release lifecycle? - Stack ...
     https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle

 2.  Is there official guide for Python 3.x release lifecycle? - Stack ...
     https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle

 3.  17. Development Cycle — Python Developer's Guide
     https://devguide.python.org/devcycle/
     A branch less than 5 years old but no longer in maintenance mode is a ... For reference, here are the Python versions that most recently reached their
     end-of-life: ...
```
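For illustration, the change boils down to something like the sketch below. `format_result` and its parameters are made up for this example and are not googler's actual code; the indentation just mirrors the output above.

```python
def format_result(index, title, url, abstract=""):
    """Return the text printed for one result.

    Hypothetical helper: the abstract line is only emitted when the
    abstract is non-empty, so an empty abstract no longer leaves an
    extra blank line behind.
    """
    lines = [
        " {}.  {}".format(index, title),
        "     {}".format(url),
    ]
    if abstract:
        lines.append("     {}".format(abstract))
    return "\n".join(lines)


if __name__ == "__main__":
    # One blank line separates results, whether or not there is an abstract.
    print(format_result(1, "A result without abstract", "https://example.com/a"), end="\n\n")
    print(format_result(2, "A result with abstract", "https://example.com/b", "Some text."), end="\n\n")
```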

(As an aside, it would be nice if we could actually display the text in a featured snippet, but those snippets are a little bit too liberal in format to reliably parse, so I guess we're stuck with a bare title and URL for now.)

, https://git.io/JJng6 in jarun/googler
Done.

, https://git.io/JJnEH in jarun/googler
Deduplicate results
===================

Previously, results could be duplicated; e.g., for the response https://git.io/JJn05, the top result (from a featured snippet) was shown twice in googler's output.

This happened because a featured snippet can contain a `div.g` inside another `div.g`, so when we selected results based on `div.g` we picked up the same result twice: the second container is a child of the first.

Instead of tracking node ancestry, which is rather annoying, we introduce `__eq__` on `Result` and use it to make sure no duplicate is recorded. Also introduced `__hash__`; it's not actually in use, but why not.
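Roughly, the idea looks like the sketch below. The `Result` here is a trimmed-down stand-in, not the real class; which fields take part in the comparison (`title`, `url`, `abstract`) and the `dedupe` helper are assumptions for illustration.

```python
class Result:
    """Hypothetical, simplified stand-in for googler's Result class."""

    def __init__(self, title, url, abstract=""):
        self.title = title
        self.url = url
        self.abstract = abstract

    def _key(self):
        # Assumption: two results are "the same" if these fields match.
        return (self.title, self.url, self.abstract)

    def __eq__(self, other):
        if not isinstance(other, Result):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Kept consistent with __eq__ so results also work in sets and dicts.
        return hash(self._key())


def dedupe(results):
    """Keep the first occurrence of each result, preserving order."""
    unique = []
    for result in results:
        if result not in unique:  # list membership relies on __eq__
            unique.append(result)
    return unique
```

A list scan is quadratic, but with a page of results that hardly matters; a `set` of seen results (which is where `__hash__` would come in) is the obvious alternative.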

---

Before:

```
$ ./googler --debug --parse /tmp/googler-response-44zgwc5a.html
[DEBUG] googler version 4.1
[DEBUG] Python version 3.8.2

 1.  Is there official guide for Python 3.x release lifecycle? - Stack ...
     https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle


 2.  Is there official guide for Python 3.x release lifecycle? - Stack ...
     https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle


 3.  17. Development Cycle — Python Developer's Guide
     https://devguide.python.org/devcycle/
     A branch less than 5 years old but no longer in maintenance mode is a ... For reference, here are the Python versions that most recently reached their
     end-of-life: ...

...
```

After:

```
$ ./googler --debug --parse /tmp/googler-response-44zgwc5a.html
[DEBUG] googler version 4.1
[DEBUG] Python version 3.8.2

 1.  Is there official guide for Python 3.x release lifecycle? - Stack ...
     https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle


 2.  17. Development Cycle — Python Developer's Guide
     https://devguide.python.org/devcycle/
     A branch less than 5 years old but no longer in maintenance mode is a ... For reference, here are the Python versions that most recently reached their
     end-of-life: ...

...
```

, https://git.io/JJnEQ in jarun/googler
Fix sitelinks parsing
=====================

The sitelink abstract now has the following structure

```html
<div class="s">
  <div class="st" style="overflow:hidden;width:220px">Sign in. Use your Google Account ... Use Guest mode to sign in ...<br></div>
</div>
```

at least for me. I think the parser has been broken for quite a while in this regard.
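To make the structure concrete, here is a minimal sketch that pulls the abstract text out of that markup. It uses the standard library's `html.parser` rather than googler's own parser, and the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class SitelinkAbstractParser(HTMLParser):
    """Collect the text of every <div class="st"> nested in a <div class="s">."""

    def __init__(self):
        super().__init__()
        self._open_divs = []        # class lists of the <div>s currently open
        self._capture_depth = None  # stack depth of the div.st being captured
        self.abstracts = []         # one string per captured div.st

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        inside_s = any("s" in open_classes for open_classes in self._open_divs)
        if "st" in classes and inside_s and self._capture_depth is None:
            self.abstracts.append("")
            self._capture_depth = len(self._open_divs)
        self._open_divs.append(classes)

    def handle_endtag(self, tag):
        if tag != "div" or not self._open_divs:
            return
        self._open_divs.pop()
        if self._capture_depth == len(self._open_divs):
            self._capture_depth = None  # the captured div.st just closed

    def handle_data(self, data):
        if self._capture_depth is not None:
            self.abstracts[-1] += data


def extract_sitelink_abstracts(html):
    """Return the stripped text of each div.s > div.st in `html`."""
    parser = SitelinkAbstractParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.abstracts]
```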

, https://git.io/JJnE7 in jarun/googler
Introduce hidden debug option --parse to parse dumped HTML
==========================================================

There used to be a separate directory called `devbin` or something with a script to load dumped HTML and call the parser, and the parser only.

This option makes it easy to load dumped HTML directly with googler and trace through the entire parsing and rendering cycle.

I introduced this to help me fix two other bugs (PRs to follow).
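For anyone curious how a hidden option works in general, here is a minimal `argparse` sketch. The option name `--parse` is from this PR; everything else (`build_parser`, `load_html`, the `html_file` destination) is an assumption for illustration and not necessarily how googler wires it up.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="googler")
    parser.add_argument("keywords", nargs="*", help="search keywords")
    parser.add_argument("--debug", action="store_true", help="enable debugging")
    # help=argparse.SUPPRESS keeps the option out of the --help output.
    parser.add_argument("--parse", dest="html_file", metavar="FILE",
                        help=argparse.SUPPRESS)
    return parser


def load_html(opts):
    """Return the dumped HTML if --parse was given, else None.

    None means "no dump supplied", i.e. the caller should fetch the
    response from the network as usual.
    """
    if opts.html_file is None:
        return None
    with open(opts.html_file, encoding="utf-8") as fp:
        return fp.read()
```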

, https://git.io/JJnE5 in jarun/googler
Make debug mode even more informative
=====================================

- Check new version (cached for 24 hours);
- Print `platform.platform()`.
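A rough sketch of the two pieces, for illustration only. The function names, the cache representation (a plain timestamp), and the exact wording of the `Platform` line are assumptions; only the 24-hour window and the `platform.platform()` call come from the list above.

```python
import platform
import time

CACHE_TTL = 24 * 60 * 60  # seconds; the 24-hour window mentioned above


def cache_is_fresh(last_checked, now=None, ttl=CACHE_TTL):
    """Return True if a check made at `last_checked` (epoch seconds) is still valid.

    `last_checked` is None when no check has been recorded yet.
    """
    if last_checked is None:
        return False
    if now is None:
        now = time.time()
    return now - last_checked < ttl


def debug_lines(version):
    """Build environment lines in the style of googler's debug output."""
    return [
        "[DEBUG] googler version {}".format(version),
        "[DEBUG] Python version {}".format(platform.python_version()),
        # The label and layout of this line are assumptions.
        "[DEBUG] Platform: {}".format(platform.platform()),
    ]
```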

, https://git.io/JJnkS in jarun/googler
Ctrl-C doesn't interrupt program immediately on Windows when connection is slow/stuck
=====================================================================================

So I had the pleasure of using googler on Windows. There was a connection problem to Google, and I further had the pleasure of experiencing the confusion of Ctrl-C seemingly doing nothing for more than ten seconds before a `KeyboardInterrupt` was eventually processed.

Of course it may not be easy for people to reproduce a connection problem to Google, but I wrote this simple TCP honeypot one can play with (all the code can be found in this gist, too: https://gist.github.com/zmwangx/9d4a730d881df4080cb7b1ee3046ee70):

```python
# A TCP honeypot listening on 127.0.0.1:8080 that stalls connections.

import socketserver
import time


class HoneypotServer(socketserver.TCPServer):
    # Stall each connection for 10 seconds.
    def get_request(self):
        time.sleep(10)
        return self.socket.accept()


def main():
    server = HoneypotServer(("127.0.0.1", 8080), socketserver.BaseRequestHandler)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass


if __name__ == "__main__":
    main()
```

Then, a very stripped-down client doing a simple `HTTPSConnection.connect` exposes the problem:

```python
# A naive HTTPS client trying to connect to 127.0.0.1:8080.

import http.client


def main():
    conn = http.client.HTTPSConnection("127.0.0.1", 8080)
    try:
        conn.connect()
    except (OSError, KeyboardInterrupt):
        pass


if __name__ == "__main__":
    main()
```

Regardless of when you Ctrl-C on Windows, this would hang for 10 seconds.

**The fundamental problem is Python seemingly not being able to process an interrupt signal when it's blocked on socket operations like recv.**

It is said that Ctrl-Break would work better and interrupt immediately. I use an Apple Extended Keyboard, which doesn't have these Windows keys, and remapping a function key to Break doesn't seem to have the desired effect; I can't tell whether it's a problem with the remapping or Ctrl-Break simply doesn't work as advertised here.

I don't actually have a solution here. I know for one that actual async programming with `asyncio` could solve the problem:

```python
# An async HTTPS client trying to connect to 127.0.0.1:8080.

import asyncio

import aiohttp


async def main():
    async with aiohttp.ClientSession() as client:
        try:
            async with client.get("https://127.0.0.1:8080/"):
                pass
        except aiohttp.ClientError:
            pass


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        pass
```

Ctrl-C cleanly interrupts this on Windows. But this is apparently not suitable:

- Dependency on `aiohttp`, since `asyncio` in the standard library only has [TCP primitives](https://docs.python.org/3/library/asyncio-stream.html);
- Probably will lose some low level flexibility like our `socket.create_connection` override for `-4`, `-6`;
- Got a whole other API, and blue/red functions to boot.
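For context on the second bullet, forcing an address family with plain sockets can look roughly like the sketch below. This is not googler's actual override; the function signature and the `family` parameter are assumptions, loosely modeled on what `socket.create_connection` does.

```python
import socket


def create_connection(address, family=socket.AF_UNSPEC, timeout=None):
    """Connect to (host, port), trying only addresses of the given family.

    Pass socket.AF_INET to force IPv4 (think -4) or socket.AF_INET6 to
    force IPv6 (think -6); AF_UNSPEC accepts whatever getaddrinfo returns.
    """
    host, port = address
    last_error = None
    for af, socktype, proto, _canonname, sockaddr in socket.getaddrinfo(
        host, port, family, socket.SOCK_STREAM
    ):
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)
            if timeout is not None:
                sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as e:
            # Remember the failure and try the next candidate address.
            last_error = e
            if sock is not None:
                sock.close()
    if last_error is not None:
        raise last_error
    raise OSError("getaddrinfo returned no addresses")
```

With a high-level async client, this kind of control typically has to go through whatever knobs the library exposes instead.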

I assume other async solutions like gevent could work, too, but again it's dependencies, dependencies, dependencies.

Anyway, just putting this out there in case some Windows expert could fix the problem without external batteries.

, https://git.io/JJnk9 in jarun/googler
README: add note on GUI browser integration in WSL
==================================================

I was using googler in WSL myself and had to Google this one, so might as well document it.