GHC 2020-10-10

14 comments.

, https://git.io/JTJrn in python/python-docs-theme
Yeah, sphinx-doc/sphinx#7720 led to sphinx-doc/sphinx#7721, which led to sphinx-doc/sphinx@b345acc.

, https://git.io/JTJrc in jarun/googler
Add time_it decorator to profile the most time-consuming methods in debug mode
==============================================================================

Could be useful for getting a sense of the rough performance of things without pulling out a full blown profiler.

Timing example on a really crappy Linux box:

<img width="491" alt="Screen Shot 2020-10-10 at 10 06 00 PM" src="https://user-images.githubusercontent.com/4149852/95657174-8e1f5880-0b45-11eb-9147-e01d0dbfd279.png">

Related: #382.
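
For readers who can't see the diff, here is a minimal sketch of what a decorator like this could look like. It is not the code from this PR: the logger name, the optional `description` argument, and the stand-in `parse` function are all assumptions made for illustration.

```python
import functools
import logging
import time

# Hypothetical logger name; a real tool would reuse its own logger.
logger = logging.getLogger("timing_sketch")


def time_it(description=None):
    """Decorator factory: log the wall-clock time of each call at DEBUG level."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Skip the timing overhead entirely unless debug logging is on.
            if not logger.isEnabledFor(logging.DEBUG):
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs even if func raises, so failed calls are timed too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.debug("%s: %.3f ms", description or func.__name__, elapsed_ms)

        return wrapper

    return decorator


@time_it("parse")
def parse(text):
    # Stand-in for an expensive method, not the real parser.
    return text.split()
```

With the logger left at its default level the wrapper is a plain pass-through, so the decorator should cost next to nothing outside debug mode.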


, https://git.io/JTJrC in jarun/googler
Interesting. Some minimal profiling can be added to debug mode directly to print the most noteworthy timing stats.

, https://git.io/JTJEI in python/python-docs-theme
Wait, was this change intentional? Just noticed that the [Python 3.10 docs](https://docs.python.org/3.10/library/functions.html), rendered with Sphinx 3.2.1, feature code blocks with no background color.

In any case I certainly prefer the old style.

, https://git.io/JTJ3A in jarun/googler
Performance
===========

@jarun mentioned googler feeling slow, so I did some profiling.

Since the networking part differs by environment and is mostly outside our control (unless we want to get into the HTTP client implementation business), I simply prefetched responses using `--debug`, then profiled `googler --parse <file>` (recall that `--parse` is the hidden developer tool to reproducibly parse a response HTML file).

Here's an overview of some queries I performed and the most important time measurements (`parse` is the total amount of time spent parsing, `parse_html` is the DOM-building part of that — using my DOM building library [dim.py](https://github.com/zmwangx/dim)):

| query            | size     | #nodes | #elements | parse  | parse_html |
| ---------------- | -------- | ------ | --------- | ------ | ---------- |
| apple            | 694461B  | 2058   | 1617      | 0.122s | 0.062s     |
| covid            | 1459581B | 6376   | 4690      | 0.344s | 0.190s     |
| forbidden palace | 582147B  | 1955   | 1542      | 0.118s | 0.060s     |
| googler          | 323438B  | 1240   | 956       | 0.079s | 0.038s     |
| linux            | 490538B  | 1839   | 1439      | 0.108s | 0.056s     |
| microsoft        | 458318B  | 1709   | 1338      | 0.106s | 0.052s     |
| white house      | 632724B  | 2501   | 1967      | 0.153s | 0.076s     |

("size" is the size of the decompressed HTML response; "#nodes" is the number of HTML nodes in the response, and "#elements" is the number of element nodes among those.)

Times are as measured by `cProfile`'s default config (wall clock) on a middle-of-the-road Core i7-8700B @ 3.20GHz CPU.
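
For anyone who wants to take similar measurements without the bundle, here is a rough sketch of collecting a `cProfile` report sorted by cumulative time. The helper `profile_call` and the toy workload `build_tree` are made up for this example; this is not the script used to produce the numbers in this comment.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=20, **kwargs):
    """Run func under cProfile; return (result, report of the top entries by cumtime)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "cumulative" sorts by time spent in a function including its callees.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def build_tree(depth):
    # Stand-in workload: a small recursive structure, not googler's parser.
    if depth == 0:
        return []
    return [build_tree(depth - 1) for _ in range(3)]


if __name__ == "__main__":
    _, report = profile_call(build_tree, 8)
    print(report)
```

The report has the same `ncalls / tottime / percall / cumtime` columns as the listings further down, with recursive functions shown as `total/primitive` call counts.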

As we can see, the parser takes ~100ms (give or take 50%) total on a typical query on my so-so CPU, with about 50% of the time spent building the DOM and the other 50% extracting info, which I deem pretty reasonable. It must be orders of magnitude slower than, say, Blink/WebKit/Gecko/html5ever (oh, and our parser is not really an HTML5 parser and craps its pants on tag soup), but this is a pure Python parser optimized for modularity and readability (hopefully), so we can't expect too much from it. However, there are pathological cases like "covid", which takes ~350ms; here we're inching into embarrassingly slow territory. (Do note "covid" is a truly pathological case: there's so much more stuff returned, including some unique stuff, as evidenced by the sheer number of nodes/elements, which is at least 2x that of other queries.)

Overall I think I'm mostly satisfied at the moment. Can we further optimize? Maybe. I'll tag this as "help wanted". Of course, actionable discussions about performance of the HTTP part are also welcome.

---

Here's a bundle of all the source files, scripts, and generated data:

[googler-profiling.zip](https://github.com/jarun/googler/files/5358880/googler-profiling.zip)

Here's a typical call graph (query "linux"):

![googler-linux](https://user-images.githubusercontent.com/4149852/95653460-818e0680-0b2b-11eb-99a3-036705321fbb.png)

Below are the details (including 20 most expensive calls, by cumulative time):

```
query: apple
response size (bytes): 694461
number of HTML nodes: 2058
number of HTML elements: 1617

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     52/1    0.000    0.000    0.178    0.178 {built-in method builtins.exec}
        1    0.000    0.000    0.178    0.178 googler:19(<module>)
        1    0.000    0.000    0.127    0.127 googler:3696(main)
        3    0.000    0.000    0.124    0.041 googler:2695(enforced_method)
        1    0.000    0.000    0.123    0.123 googler:2791(fetch)
        1    0.000    0.000    0.122    0.122 googler:2297(__init__)
        1    0.000    0.000    0.122    0.122 googler:2306(parse)
        1    0.000    0.000    0.062    0.062 googler:763(parse_html)
        1    0.000    0.000    0.062    0.062 parser.py:103(feed)
        2    0.007    0.003    0.062    0.031 parser.py:133(goahead)
      128    0.004    0.000    0.053    0.000 googler:353(_select_all)
    76/15    0.000    0.000    0.049    0.003 <frozen importlib._bootstrap>:1002(_find_and_load)
    76/15    0.000    0.000    0.049    0.003 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
    73/15    0.000    0.000    0.047    0.003 <frozen importlib._bootstrap>:659(_load_unlocked)
    51/12    0.000    0.000    0.045    0.004 <frozen importlib._bootstrap_external>:784(exec_module)
   102/18    0.000    0.000    0.045    0.002 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
       54    0.000    0.000    0.038    0.001 googler:312(select)
     1618    0.011    0.000    0.038    0.000 parser.py:300(parse_starttag)
160802/10310    0.026    0.000    0.026    0.000 googler:464(descendants)
        1    0.000    0.000    0.024    0.024 client.py:1(<module>)

cumtime of parse: 0.122s
cumtime of parse_html: 0.062s
```
---
```
query: covid
response size (bytes): 1459581
number of HTML nodes: 6376
number of HTML elements: 4690

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     52/1    0.000    0.000    0.406    0.406 {built-in method builtins.exec}
        1    0.000    0.000    0.406    0.406 googler:19(<module>)
        1    0.000    0.000    0.354    0.354 googler:3696(main)
        3    0.000    0.000    0.350    0.117 googler:2695(enforced_method)
        1    0.000    0.000    0.348    0.348 googler:2791(fetch)
        1    0.000    0.000    0.344    0.344 googler:2297(__init__)
        1    0.000    0.000    0.344    0.344 googler:2306(parse)
        1    0.000    0.000    0.190    0.190 googler:763(parse_html)
        1    0.000    0.000    0.190    0.190 parser.py:103(feed)
        2    0.018    0.009    0.190    0.095 parser.py:133(goahead)
      157    0.009    0.000    0.147    0.001 googler:353(_select_all)
     4691    0.035    0.000    0.118    0.000 parser.py:300(parse_starttag)
       67    0.000    0.000    0.099    0.001 googler:312(select)
614326/21857    0.093    0.000    0.093    0.000 googler:464(descendants)
       26    0.000    0.000    0.055    0.002 googler:323(select_all)
    76/15    0.000    0.000    0.050    0.003 <frozen importlib._bootstrap>:1002(_find_and_load)
    76/15    0.000    0.000    0.050    0.003 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
    73/15    0.000    0.000    0.048    0.003 <frozen importlib._bootstrap>:659(_load_unlocked)
    51/12    0.000    0.000    0.047    0.004 <frozen importlib._bootstrap_external>:784(exec_module)
   102/18    0.000    0.000    0.046    0.003 <frozen importlib._bootstrap>:220(_call_with_frames_removed)

cumtime of parse: 0.344s
cumtime of parse_html: 0.190s
```
---
```
query: forbidden palace
response size (bytes): 582147
number of HTML nodes: 1955
number of HTML elements: 1542

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     52/1    0.000    0.000    0.175    0.175 {built-in method builtins.exec}
        1    0.000    0.000    0.175    0.175 googler:19(<module>)
        1    0.000    0.000    0.123    0.123 googler:3696(main)
        3    0.000    0.000    0.120    0.040 googler:2695(enforced_method)
        1    0.000    0.000    0.119    0.119 googler:2791(fetch)
        1    0.000    0.000    0.118    0.118 googler:2297(__init__)
        1    0.000    0.000    0.118    0.118 googler:2306(parse)
        1    0.000    0.000    0.060    0.060 googler:763(parse_html)
        1    0.000    0.000    0.060    0.060 parser.py:103(feed)
        2    0.006    0.003    0.060    0.030 parser.py:133(goahead)
      153    0.004    0.000    0.051    0.000 googler:353(_select_all)
    76/15    0.000    0.000    0.050    0.003 <frozen importlib._bootstrap>:1002(_find_and_load)
    76/15    0.000    0.000    0.050    0.003 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
    73/15    0.000    0.000    0.048    0.003 <frozen importlib._bootstrap>:659(_load_unlocked)
    51/12    0.000    0.000    0.046    0.004 <frozen importlib._bootstrap_external>:784(exec_module)
   102/18    0.000    0.000    0.046    0.003 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
     1543    0.010    0.000    0.036    0.000 parser.py:300(parse_starttag)
       64    0.000    0.000    0.036    0.001 googler:312(select)
        1    0.000    0.000    0.024    0.024 client.py:1(<module>)
145993/10061    0.024    0.000    0.024    0.000 googler:464(descendants)

cumtime of parse: 0.118s
cumtime of parse_html: 0.060s
```
---
```
query: googler
response size (bytes): 323438
number of HTML nodes: 1240
number of HTML elements: 956

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     52/1    0.000    0.000    0.135    0.135 {built-in method builtins.exec}
        1    0.000    0.000    0.135    0.135 googler:19(<module>)
        1    0.000    0.000    0.085    0.085 googler:3696(main)
        3    0.000    0.000    0.082    0.027 googler:2695(enforced_method)
        1    0.000    0.000    0.080    0.080 googler:2791(fetch)
        1    0.000    0.000    0.079    0.079 googler:2297(__init__)
        1    0.000    0.000    0.079    0.079 googler:2306(parse)
    76/15    0.000    0.000    0.049    0.003 <frozen importlib._bootstrap>:1002(_find_and_load)
    76/15    0.000    0.000    0.049    0.003 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
    73/15    0.000    0.000    0.047    0.003 <frozen importlib._bootstrap>:659(_load_unlocked)
    51/12    0.000    0.000    0.045    0.004 <frozen importlib._bootstrap_external>:784(exec_module)
   102/18    0.000    0.000    0.045    0.002 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
        1    0.000    0.000    0.038    0.038 googler:763(parse_html)
        1    0.000    0.000    0.038    0.038 parser.py:103(feed)
        2    0.004    0.002    0.038    0.019 parser.py:133(goahead)
      173    0.003    0.000    0.034    0.000 googler:353(_select_all)
       71    0.000    0.000    0.025    0.000 googler:312(select)
        1    0.000    0.000    0.024    0.024 client.py:1(<module>)
      957    0.006    0.000    0.023    0.000 parser.py:300(parse_starttag)
     7020    0.004    0.000    0.017    0.000 googler:866(matches)

cumtime of parse: 0.079s
cumtime of parse_html: 0.038s
```
---
```
query: linux
response size (bytes): 490538
number of HTML nodes: 1839
number of HTML elements: 1439

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     52/1    0.000    0.000    0.164    0.164 {built-in method builtins.exec}
        1    0.000    0.000    0.164    0.164 googler:19(<module>)
        1    0.000    0.000    0.113    0.113 googler:3696(main)
        3    0.000    0.000    0.111    0.037 googler:2695(enforced_method)
        1    0.000    0.000    0.109    0.109 googler:2791(fetch)
        1    0.000    0.000    0.108    0.108 googler:2297(__init__)
        1    0.000    0.000    0.108    0.108 googler:2306(parse)
        1    0.000    0.000    0.056    0.056 googler:763(parse_html)
        1    0.000    0.000    0.056    0.056 parser.py:103(feed)
        2    0.006    0.003    0.056    0.028 parser.py:133(goahead)
    76/15    0.000    0.000    0.049    0.003 <frozen importlib._bootstrap>:1002(_find_and_load)
    76/15    0.000    0.000    0.048    0.003 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
    73/15    0.000    0.000    0.046    0.003 <frozen importlib._bootstrap>:659(_load_unlocked)
      155    0.004    0.000    0.046    0.000 googler:353(_select_all)
    51/12    0.000    0.000    0.045    0.004 <frozen importlib._bootstrap_external>:784(exec_module)
   102/18    0.000    0.000    0.044    0.002 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
     1440    0.010    0.000    0.034    0.000 parser.py:300(parse_starttag)
       61    0.000    0.000    0.033    0.001 googler:312(select)
        1    0.000    0.000    0.024    0.024 client.py:1(<module>)
134823/9270    0.021    0.000    0.021    0.000 googler:464(descendants)

cumtime of parse: 0.108s
cumtime of parse_html: 0.056s
```
---
```
query: microsoft
response size (bytes): 458318
number of HTML nodes: 1709
number of HTML elements: 1338

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     52/1    0.000    0.000    0.162    0.162 {built-in method builtins.exec}
        1    0.000    0.000    0.162    0.162 googler:19(<module>)
        1    0.000    0.000    0.111    0.111 googler:3696(main)
        3    0.000    0.000    0.108    0.036 googler:2695(enforced_method)
        1    0.000    0.000    0.107    0.107 googler:2791(fetch)
        1    0.000    0.000    0.106    0.106 googler:2297(__init__)
        1    0.000    0.000    0.106    0.106 googler:2306(parse)
        1    0.000    0.000    0.052    0.052 googler:763(parse_html)
        1    0.000    0.000    0.052    0.052 parser.py:103(feed)
        2    0.005    0.003    0.052    0.026 parser.py:133(goahead)
    76/15    0.000    0.000    0.049    0.003 <frozen importlib._bootstrap>:1002(_find_and_load)
    76/15    0.000    0.000    0.049    0.003 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
      126    0.004    0.000    0.048    0.000 googler:353(_select_all)
    73/15    0.000    0.000    0.047    0.003 <frozen importlib._bootstrap>:659(_load_unlocked)
    51/12    0.000    0.000    0.045    0.004 <frozen importlib._bootstrap_external>:784(exec_module)
   102/18    0.000    0.000    0.045    0.002 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
       53    0.000    0.000    0.034    0.001 googler:312(select)
     1339    0.009    0.000    0.032    0.000 parser.py:300(parse_starttag)
        1    0.000    0.000    0.024    0.024 client.py:1(<module>)
139828/9324    0.022    0.000    0.022    0.000 googler:464(descendants)

cumtime of parse: 0.106s
cumtime of parse_html: 0.052s
```
---
```
query: white house
response size (bytes): 632724
number of HTML nodes: 2501
number of HTML elements: 1967

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     52/1    0.000    0.000    0.209    0.209 {built-in method builtins.exec}
        1    0.000    0.000    0.209    0.209 googler:19(<module>)
        1    0.000    0.000    0.158    0.158 googler:3696(main)
        3    0.000    0.000    0.155    0.052 googler:2695(enforced_method)
        1    0.000    0.000    0.154    0.154 googler:2791(fetch)
        1    0.000    0.000    0.153    0.153 googler:2297(__init__)
        1    0.000    0.000    0.153    0.153 googler:2306(parse)
        1    0.000    0.000    0.076    0.076 googler:763(parse_html)
        1    0.000    0.000    0.076    0.076 parser.py:103(feed)
        2    0.008    0.004    0.076    0.038 parser.py:133(goahead)
      137    0.005    0.000    0.071    0.001 googler:353(_select_all)
    76/15    0.000    0.000    0.049    0.003 <frozen importlib._bootstrap>:1002(_find_and_load)
    76/15    0.000    0.000    0.049    0.003 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
       54    0.000    0.000    0.048    0.001 googler:312(select)
    73/15    0.000    0.000    0.047    0.003 <frozen importlib._bootstrap>:659(_load_unlocked)
     1968    0.013    0.000    0.046    0.000 parser.py:300(parse_starttag)
    51/12    0.000    0.000    0.045    0.004 <frozen importlib._bootstrap_external>:784(exec_module)
   102/18    0.000    0.000    0.045    0.002 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
203313/13682    0.033    0.000    0.033    0.000 googler:464(descendants)
    13588    0.007    0.000    0.032    0.000 googler:866(matches)

cumtime of parse: 0.153s
cumtime of parse_html: 0.076s
```

, https://git.io/JTJJF in jarun/googler
@jarun Updated info.json following protocol. Pushed directly since the PR process wasn't gonna validate anything here.

, https://git.io/JTf95 in jarun/googler
Strange. You might want to use cProfile to see what part is the bottleneck. I haven’t noticed anything.

, https://git.io/JTfXA in jarun/googler
Yeah, it's time.

Fixed conflict. This should be the last thing that has to be in a release.

, https://git.io/JTfXx in jarun/googler
Fix metadata parsing
====================

Metadata parsing is so damn unreliable now, that the fields detection already
broke after a few days. I've no choice but to use a magic class now, which
could break any second.

Also, turns out this flaky metadata node detector also has the tendency to pick
up non-metadata crap like sublinks and in certain cases, entire accordions of
stuff (like announcements under certain covid results). Slap some bandages on
top to exclude those.

, https://git.io/JTfXp in jarun/googler
Replace em dash in metadata
===========================

Google is now using U+2014 EM DASH instead of U+002D HYPHEN-MINUS as the separator, at least sometimes.
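
As a rough illustration of the idea (not the actual patch, which may handle this differently), the substitution could be as simple as the following. The function name `normalize_separator` and the sample string are hypothetical.

```python
EM_DASH = "\u2014"       # U+2014 EM DASH
HYPHEN_MINUS = "\u002d"  # U+002D HYPHEN-MINUS


def normalize_separator(metadata):
    """Replace every em dash in a metadata string with a hyphen-minus."""
    return metadata.replace(EM_DASH, HYPHEN_MINUS)


if __name__ == "__main__":
    # Hypothetical metadata string using the new separator.
    print(normalize_separator("Jul 7, 2019 \u2014 snippet text"))
```

Normalizing up front means any later code that splits on the hyphen keeps working whichever separator Google sends.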

, https://git.io/JTfKG in jarun/googler
Sorry about the delay. This is fixed in #379. With the fix applied, your attached HTML is parsed correctly:

```console
$ ./googler --debug --parse /tmp/googler-response-_jgskry4.html
[DEBUG] googler version 4.2
[DEBUG] Python version 3.9.0
[DEBUG] Platform: macOS-10.15.7-x86_64-i386-64bit

 1.  The Varsity: What'll ya Have!
     https://www.thevarsity.com/
     EffectiveTuesday June 16th , in accordance with Governor Kemp's latest executive order, The Varsity Atlanta will be reopening our downstairs dining
     rooms.

 2.  The Varsity Atlanta - The Varsity
     https://www.thevarsity.com/locations/detail/1/The_Varsity_Atlanta
     The Varsity in downtown Atlanta is our original, world famous location. This enormous restaurant sits on 2 city blocks and can accommodate 800 diners
     inside.

 3.  Our Food - The Varsity
     https://www.thevarsity.com/food
     ... with Governor Kemp's latest executive order, The Varsity Atlanta will be reopening ... And can you really say you went to The Varsity if you didn't
     get a Frosted ... One burger with chili and cheese, one hot dog with mustard, chili, and cheese.

 4.  The Varsity - Wikipedia
     https://en.wikipedia.org/wiki/The_Varsity
     The Varsity is a restaurant chain, iconic in the modern culture of Atlanta, Georgia. The main ... Mad artist Jack Davis has done advertising for The
     Varsity. The Varsity was featured in the PBS documentary A Hot Dog Program by Rick Sebak.

 5.  The Varsity - Takeout & Delivery - 1454 Photos & 2070 ... - Yelp
     https://www.yelp.com/biz/the-varsity-atlanta-2
     Rating: 3, 2,070 reviews, Price range: Under $10
     $Inexpensive• Burgers, Hot Dogs, Fast Food. Open • 10:30 am - 11:00 ... Chili Cheese Dog, Chili Slaw Dog, Chili Burger, Fried Apple Pie, Chili Cheese
     Slaw Dog, Chili Cheese Burger, Naked Dog ... The Varsity is simply one of Atlanta's icons.

 6.  The Varsity - Takeout & Delivery - 140 Photos & 157 Reviews ...
     https://www.yelp.com/biz/the-varsity-atlanta-5
     Rating: 2.5, 157 reviews, Price range: Under $10
     ... of The Varsity "Tasty and less expensive than I'd expect for airport fast food. The food is similar to another hot dog place in Atlanta, Zesto.
     Friendly employees.

 7.  The Varsity- Atlanta's Favorite Hotdogs and Hamburgers
     https://www.atlanta.net/partner/the-varsity/296/
     Try the chili dogs, onion rings, Frosted Orange milkshake and homemade fried pies. The Varsity has been serving Atlanta's favorite hotdogs and
     hamburgers ...

 8.  91 years of chili dogs: How Atlanta's The Varsity lasts and lasts
     https://thetakeout.com/atlanta-the-varsity-chili-dogs-1835785500
     Jul 7, 2019
     Alongside their famous chili dogs, The Varsity's menu is a model of classic American drive-in fare. It's all hot dogs, hamburgers, fries, and onion ...

 9.  THE VARSITY, Atlanta - 61 North Ave NW, Downtown - Menu ...
     https://www.tripadvisor.com/Restaurant_Review-g60898-d492279-Reviews-The_Varsity-Atlanta_Georgia.html
     Rating: 4, 5,457 reviews, Price range: $
     The Varsity, Atlanta: See 5457 unbiased reviews of The Varsity, rated 4 of 5 on Tripadvisor and ranked ... Yes 100% all beef hot dogs, and yes they do
     have chili!

 10. The Varsity - Home | Facebook
     https://www.facebook.com/thevarsity/
     Rating: 4.3, 26,269 votes
     Come be a part of an Atlanta tradition! The Varsity is a 92-year-old family-owned and operated company. We treat our team members like family. We are
     looking to ...

```

, https://git.io/JTfVZ in jarun/googler
DOM builder: fix parsing of foreign elements
============================================

Update dim. Most importantly:

[3d5533bb](https://github.com/zmwangx/dim/commit/3d5533bbf551d16e73b38512e3a1db4f19631911) Fix parsing of foreign (e.g. svg namespace) elements

This fixes #366.

, https://git.io/JTfVn in jarun/googler
Oops, there's a bug fix I need to get into this release... Maybe a quick 4.3.1 should do.

, https://git.io/JTfVc in python/python-docs-theme
Fix codebgcolor and codetextcolor for Sphinx 3.1.0+
===================================================

Sphinx 3.1.0+ dropped the custom `codebgcolor` and `codetextcolor` from the classic theme (see commit https://github.com/sphinx-doc/sphinx/commit/b345acc2840a792162845e3a1a3456c347fac08e), leaving `pre` elements uncolored.

Without the fix:

<img width="815" alt="Screen Shot 2020-10-10 at 11 14 05 AM" src="https://user-images.githubusercontent.com/4149852/95644512-b8dbc380-0ae9-11eb-8dcd-05ccacf91914.png">

Should be:

<img width="811" alt="Screen Shot 2020-10-10 at 11 15 19 AM" src="https://user-images.githubusercontent.com/4149852/95644537-e45eae00-0ae9-11eb-874f-feeb15332eb5.png">