No problem, this is not at all obvious for people who are only familiar with the Latin/Cyrillic/etc. writing systems. Consider this simple Chinese phrase, "中文书面语", which means "Chinese written language"; it can be broken down into two or three words: "中文" — Chinese, "书面语" — written language (which could be further broken into "书面" — written, "语" — language), but there's no word boundary character inside it:
```py
>>> import re
>>> re.findall(r'\b.', '中文书面语')
['中']
```
(As a result, tokenization of Chinese/Japanese text is nontrivial and requires specialized tokenizers; a simple example is ICU's BreakIterator.)
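For what it's worth, here's the common PyICU word-segmentation recipe (an illustration only; it assumes PyICU is installed and is not something googler would depend on):
```py
# Segment text at ICU's dictionary-derived word boundaries instead of
# whitespace; requires the PyICU package.
from icu import BreakIterator, Locale

def icu_words(text, locale='zh'):
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    start = bi.first()
    for end in bi:  # iterating yields successive boundary offsets
        yield text[start:end]
        start = end

print(list(icu_words('中文书面语')))
```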
As for implementation, iterating over the children of `div_g.select('.st')` and manually building up the text content (instead of using the `text` property), recording positions as needed, should do the trick. The HTML parser's API is documented at https://docs.tcl.sh/py/dim/ in case you need it.
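Here's a rough, self-contained sketch of that idea, using the standard library's `html.parser` rather than the project's own parser, and assuming the highlighted phrases arrive wrapped in `<em>`/`<b>` tags inside the abstract:
```py
# Sketch only: accumulate the abstract text while recording the offset of
# each highlighted (<em>/<b>) phrase, instead of regex-matching afterwards.
from html.parser import HTMLParser

class MatchRecorder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = ''        # abstract text built up so far
        self.matches = []     # [{'phrase': ..., 'offset': ...}, ...]
        self._start = None    # offset where the current highlight began

    def handle_starttag(self, tag, attrs):
        if tag in ('em', 'b'):
            self._start = len(self.text)

    def handle_endtag(self, tag):
        if tag in ('em', 'b') and self._start is not None:
            self.matches.append({'phrase': self.text[self._start:],
                                 'offset': self._start})
            self._start = None

    def handle_data(self, data):
        self.text += data

recorder = MatchRecorder()
recorder.feed('The quick <em>brown</em> fox jumps over <b>中文</b> text')
print(recorder.text)
print(recorder.matches)
```
Because the offsets are recorded while the text is being assembled, this works the same for Chinese or Japanese as it does for English.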
In case I wasn't clear enough, what I have in mind is something like:
```json
{
"url": "...",
"title": "...",
"abstract": "...",
"matches": [
{"phrase": "...", "offset": ...},
{"phrase": "...", "offset": ...},
...
]
}
```
Actually, I was already convinced matches could/should be added to structured output when I had [this realization](https://github.com/jarun/googler/pull/283#issuecomment-490321282). But as I said, match positions also have to be included, otherwise we can only support a subset of languages. Since schema changes should be done with extreme care, it has to be all or nothing.
Just came up with an important class of use cases where the `re`-based matching strategy fails: any written language that doesn't have ~~visible~~ syntactic word boundaries, e.g., Chinese or Japanese. Try, for instance, `googler 中文`; nothing is highlighted. So, while regex matching is clever, it has crippling limitations after all. My initial response turns out to be fitting:
> position of a match is almost as important as the text of the match
And I guess the information has to be recorded in the parser, not afterwards with clever tricks.
(Not trying to diminish this feature in any way.)
An update to my last comment: sorry, I hadn't read the code in detail and misunderstood what's being matched. You're matching the phrases highlighted in Google's response, not phrases supplied by the user (which is what I thought). In that case, the parser addition is definitely warranted. I'm still not that convinced about the addition to structured output, though.
Use `None` for the default value. Don't use a mutable default value for a keyword argument. See for instance https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments.
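For illustration, with made-up names rather than the actual signature in the PR:
```py
# Default to None and create the list inside the function; a mutable
# default like matches=[] would be created once at definition time and
# shared across every call.
def add_match(phrase, offset, matches=None):
    if matches is None:
        matches = []
    matches.append({'phrase': phrase, 'offset': offset})
    return matches
```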
Interactive part: LGTM. Don't bother with `COLORMAP`; it's there for mapping user-defined color themes to actual escape sequences. Fixed sequences can be hard-coded, no problem.
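For instance (illustrative values, not the ones in the PR), a fixed bold highlight is just:
```py
# Hard-coded ANSI sequences: bold on, attributes off.
BOLD, RESET = '\x1b[1m', '\x1b[0m'
print('results for ' + BOLD + '中文' + RESET)
```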
Parser and structured output changes: not so sure. Does anyone actually need that? And if one does, just the matched text seems to be incomplete information anyway: the position of a match is almost as important as its text (think of the information captured by a `re.Match`). Lastly, if someone needs this, they can simply run a regex query themselves, like you've shown, and get all the information.
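For instance, a quick REPL check shows what a `re.Match` carries beyond the matched text:
```py
>>> import re
>>> m = re.search('written', 'Chinese written language')
>>> m.group(), m.span()
('written', (8, 15))
```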
I'd say let's just not bother with the addition to structured output.
@EvanDotPro You misunderstood my proposal. Each screenful of results is piped into less, not everything, and not the already interactive part. You read the results, manually exit less (or it exits automatically when you reach the end), then do whatever with googler like you're used to.
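A minimal sketch of what I mean (a hypothetical helper, not code from the PR):
```py
# Pipe one screenful of already-rendered results into less; -R passes the
# ANSI color sequences through. Control returns to googler when less exits.
import subprocess

def page_results(rendered_text):
    subprocess.run(['less', '-R'], input=rendered_text.encode())
```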
Now, reversing the printing order of results is one solution (with the benefit of being trivial to implement, even). I definitely support its inclusion, but I would personally find it pretty weird.