Do not print an additional blank line when abstract is empty
============================================================
Not actually sure why we never changed this... But I happened to be developing against a request that resulted in an empty abstract today, and the double blank line sure looked pointless (especially when it's repeated due to the result duplication bug).
Before (yeah, this is without the duplicate suppression):
```console
$ ./googler python3.5 eol
1. Is there official guide for Python 3.x release lifecycle? - Stack ...
https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle
2. Is there official guide for Python 3.x release lifecycle? - Stack ...
https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle
3. 17. Development Cycle — Python Developer's Guide
https://devguide.python.org/devcycle/
A branch less than 5 years old but no longer in maintenance mode is a ... For reference, here are the Python versions that most recently reached their
end-of-life: ...
```
After:
```console
$ ./googler python3.5 eol
1. Is there official guide for Python 3.x release lifecycle? - Stack ...
https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle
2. Is there official guide for Python 3.x release lifecycle? - Stack ...
https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle
3. 17. Development Cycle — Python Developer's Guide
https://devguide.python.org/devcycle/
A branch less than 5 years old but no longer in maintenance mode is a ... For reference, here are the Python versions that most recently reached their
end-of-life: ...
```
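For illustration only, the change boils down to something like the sketch below. `format_result` and its plain, unindented layout are made up for this example and are not googler's actual printing code; the point is just that the abstract line is emitted only when there is an abstract, so an empty one no longer contributes its own blank line.

```python
def format_result(index, title, url, abstract):
    """Render one result as text (hypothetical helper, not googler's real code)."""
    lines = ["%d. %s" % (index, title), url]
    if abstract:
        # Only emit the abstract line when there is something to show.
        lines.append(abstract)
    # A single trailing empty string yields exactly one separator newline.
    lines.append("")
    return "\n".join(lines)
```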
(As an aside, it would be nice if we could actually display the text in a featured snippet, but those snippets are too loosely formatted to parse reliably, so I guess we're stuck with a bare title and URL for now.)
Deduplicate results
===================
Previously, results could be duplicated; e.g., for the response https://git.io/JJn05, the top result (from a featured snippet) was shown in googler output twice.
This happened because a featured snippet can contain a `div.g` inside another `div.g`, so when we selected results based on `div.g` we picked up the same result twice: the second container is a child of the first.
Instead of tracking node ancestry, which is rather annoying, we introduce `__eq__` on `Result` and use it to make sure no duplicate is recorded. Also introduced `__hash__`; not actually in use, but why not.
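A minimal sketch of the idea, assuming a simplified `Result` with only `title`, `url` and `abstract` (the real class carries more fields, and `deduplicate` is a hypothetical helper named here only for illustration):

```python
class Result:
    """Simplified stand-in for googler's Result (fields are assumptions)."""

    def __init__(self, title, url, abstract=""):
        self.title = title
        self.url = url
        self.abstract = abstract

    def _key(self):
        # Two results are "the same" if all displayed fields match.
        return (self.title, self.url, self.abstract)

    def __eq__(self, other):
        if not isinstance(other, Result):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Kept consistent with __eq__ so results can live in sets/dicts.
        return hash(self._key())


def deduplicate(results):
    """Return results in original order, dropping any later duplicate."""
    unique = []
    for result in results:
        if result not in unique:  # relies on Result.__eq__
            unique.append(result)
    return unique
```

The list membership test is quadratic, which should not matter for a page of search results; with `__hash__` in place a `set` of seen results would work too.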
---
Before:
```console
$ ./googler --debug --parse /tmp/googler-response-44zgwc5a.html
[DEBUG] googler version 4.1
[DEBUG] Python version 3.8.2
1. Is there official guide for Python 3.x release lifecycle? - Stack ...
https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle
2. Is there official guide for Python 3.x release lifecycle? - Stack ...
https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle
3. 17. Development Cycle — Python Developer's Guide
https://devguide.python.org/devcycle/
A branch less than 5 years old but no longer in maintenance mode is a ... For reference, here are the Python versions that most recently reached their
end-of-life: ...
...
```
After:
```console
$ ./googler --debug --parse /tmp/googler-response-44zgwc5a.html
[DEBUG] googler version 4.1
[DEBUG] Python version 3.8.2
1. Is there official guide for Python 3.x release lifecycle? - Stack ...
https://stackoverflow.com/questions/40655195/is-there-official-guide-for-python-3-x-release-lifecycle
2. 17. Development Cycle — Python Developer's Guide
https://devguide.python.org/devcycle/
A branch less than 5 years old but no longer in maintenance mode is a ... For reference, here are the Python versions that most recently reached their
end-of-life: ...
...
```
Fix sitelinks parsing
=====================
Sitelink abstracts now have the following structure
```html
<div class="s">
<div class="st" style="overflow:hidden;width:220px">Sign in. Use your Google Account ... Use Guest mode to sign in ...<br></div>
</div>
```
at least for me. I think the parser has been broken for quite a while in this regard.
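To make the structure concrete, here is a rough sketch that pulls the abstract text out of a `div.st` nested directly inside a `div.s`. It uses the standard library's `html.parser` rather than googler's own DOM code, and `SitelinkAbstractParser` is a name invented for this example, so treat it as an illustration of the markup, not as the actual fix.

```python
from html.parser import HTMLParser


class SitelinkAbstractParser(HTMLParser):
    """Collect text found inside <div class="s"><div class="st">...</div></div>."""

    def __init__(self):
        super().__init__()
        self._div_classes = []  # class attribute of each currently open <div>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._div_classes.append(dict(attrs).get("class") or "")

    def handle_endtag(self, tag):
        if tag == "div" and self._div_classes:
            self._div_classes.pop()

    def handle_data(self, data):
        # Only keep text whose two innermost open divs are .s > .st
        if self._div_classes[-2:] == ["s", "st"]:
            self._chunks.append(data)

    @property
    def abstract(self):
        return "".join(self._chunks).strip()
```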
Introduce hidden debug option --parse to parse dumped HTML
==========================================================
There used to be a separate directory called `devbin` or something with a script to load dumped HTML and call the parser, and the parser only.
This option makes it easy to load dumped HTML directly with googler and trace through the entire parsing and rendering cycle.
I introduced this to help me fix two other bugs (PRs to follow).
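For anyone curious how a hidden option can be wired up, a sketch with `argparse` might look like the following. The `html_file` destination and the `build_parser` / `load_html` helpers are assumptions for illustration and need not match googler's actual argument handling; the only real trick is `help=argparse.SUPPRESS`, which keeps the option out of `--help`.

```python
import argparse


def build_parser():
    """Build a tiny parser with a hidden --parse option (illustrative only)."""
    parser = argparse.ArgumentParser(prog="googler")
    parser.add_argument("keywords", nargs="*", help="search keywords")
    parser.add_argument("--debug", action="store_true", help="enable debugging")
    # SUPPRESS hides the option from usage and help output; it still parses.
    parser.add_argument("--parse", dest="html_file", metavar="FILE",
                        help=argparse.SUPPRESS)
    return parser


def load_html(path):
    """Read a previously dumped response from disk instead of fetching it."""
    with open(path, encoding="utf-8") as fp:
        return fp.read()
```

With something like this in place, the main loop can check `args.html_file` and feed `load_html(args.html_file)` to the parser in place of a network response.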
Ctrl-C doesn't interrupt program immediately on Windows when connection is slow/stuck
=====================================================================================
So I had the pleasure of using googler on Windows. There was a connection problem to Google, and I further had the pleasure of experiencing the confusion of Ctrl-C seemingly doing nothing for more than ten seconds before a `KeyboardInterrupt` was eventually processed.
Of course it may not be easy for people to reproduce a connection problem to Google, so I wrote this simple TCP honeypot one can play with (all the code can be found in this gist, too: https://gist.github.com/zmwangx/9d4a730d881df4080cb7b1ee3046ee70):
```python
# A TCP honeypot listening on 127.0.0.1:8080 that stalls connections.
import socketserver
import time

class HoneypotServer(socketserver.TCPServer):
    # Stall each connection for 10 seconds.
    def get_request(self):
        time.sleep(10)
        return self.socket.accept()

def main():
    server = HoneypotServer(("127.0.0.1", 8080), socketserver.BaseRequestHandler)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass

if __name__ == "__main__":
    main()
```
Then a very stripped-down client doing a simple `HTTPSConnection.connect` exposes the problem:
```python
# A naive HTTPS client trying to connect to 127.0.0.1:8080.
import http.client

def main():
    conn = http.client.HTTPSConnection("127.0.0.1", 8080)
    try:
        conn.connect()
    except (OSError, KeyboardInterrupt):
        pass

if __name__ == "__main__":
    main()
```
Regardless of when you press Ctrl-C on Windows, this hangs for 10 seconds.
**The fundamental problem is that Python seemingly cannot process an interrupt signal while it's blocked on socket operations like `recv`.**
It is said that Ctrl-Break would work better and interrupt immediately. I use an Apple Extended Keyboard, which doesn't have these Windows keys, and remapping a function key to Break doesn't seem to have the desired effect. I can't tell whether that's a problem with the remapping or whether Ctrl-Break simply doesn't work as advertised here.
I don't actually have a solution here. I do know that actual async programming with `asyncio` could solve the problem:
```python
# An async HTTPS client trying to connect to 127.0.0.1:8080.
import asyncio
import aiohttp

async def main():
    async with aiohttp.ClientSession() as client:
        try:
            async with client.get("https://127.0.0.1:8080/"):
                pass
        except aiohttp.ClientError:
            pass

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        pass
```
Ctrl-C cleanly interrupts this on Windows. But this is clearly not suitable for googler:
- It adds a dependency on `aiohttp`, since the standard library's `asyncio` only has [TCP primitives](https://docs.python.org/3/library/asyncio-stream.html);
- We would probably lose some low-level flexibility, like our `socket.create_connection` override for `-4` and `-6`;
- It's a whole other API, with blue/red (sync/async) functions to boot.
I assume other async solutions like gevent could work, too, but again it's dependencies, dependencies, dependencies.
Anyway, just putting this out there in case some Windows expert could fix the problem without external batteries.
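One stdlib-only direction that might be worth trying, and which I have not verified on Windows, is to push the blocking call into a daemon thread and have the main thread poll it with short `join` timeouts, so the interpreter gets regular chances to raise `KeyboardInterrupt`. `run_interruptibly` below is a hypothetical helper, not existing googler code:

```python
import threading


def run_interruptibly(func, *args, poll_interval=0.1):
    """Run func(*args) in a daemon thread while the main thread stays responsive.

    Assumption: the main thread only ever blocks for poll_interval seconds,
    so a pending Ctrl-C should surface within roughly that time.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args)
        except Exception as exc:
            # Hand the exception back to the caller's thread.
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while thread.is_alive():
        # A short timed join instead of one unbounded blocking wait.
        thread.join(poll_interval)
    if "error" in outcome:
        raise outcome["error"]
    return outcome.get("value")
```

In the naive client above this would be `run_interruptibly(conn.connect)`. The obvious downside is that on interrupt the worker thread is simply abandoned along with its half-open socket, which is probably tolerable only if the program is about to exit anyway.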
README: add note on GUI browser integration in WSL
==================================================
I was using googler in WSL myself and had to Google this one, so might as well document it.