GHC 2019-02-05

5 comments.

, https://git.io/fh9W3 in coleifer/peewee
> In py27, `json.dumps` on a `str` gives you a `str`, and `json.dumps` on a `unicode` gives you a `unicode`.

Actually that's also false... It seems you only get a `unicode` when `ensure_ascii=False`. Anyway, very confusing. Glad I went all in on Python 3 years ago...

, https://git.io/fh9Ws in coleifer/peewee
Cool, I'll give it a spin shortly. Regarding this:

> Here is the behavior with Python 2. The unicode version results in the unicode code-point escapes _regardless_ of `ensure_ascii`. It is only by serializing the UTF8-encoded `bytes` value with `ensure_ascii=False` that we get the equivalent of the "raw" characters:

Actually no, you seem to be tricked by py27's confusing return types and the repr representations that look similar...

> ```py
> In [9]: json.dumps(ustr, ensure_ascii=False)
> Out[9]: u'"\u4e2d\u6587"'
> ```

These are the unescaped Unicode code points U+4E2D and U+6587 (equivalent to the Python 3 `ensure_ascii=False` version), escaped for `repr`. In py27, `json.dumps` on a `str` gives you a `str`, and `json.dumps` on a `unicode` gives you a `unicode`.

> Anyways...the end-result is potentially 2 or 3 extra bytes for these code-points. Is this the crux of the issue? The extra bytes?

That's one problem. The other problem is when you look at JSON values directly without `json_extract` (e.g. in a command line shell, or in a GUI like sqlitebrowser), escaped codepoints are not readable.

, https://git.io/fh9C8 in coleifer/peewee
By the way, while SQLite's `json` compresses whitespaces, it does preserve Unicode escape sequences, so `ensure_ascii=True` is still a problem.

, https://git.io/fh9C4 in coleifer/peewee
Sorry, either I wasn't clear or I'm doing something wrong. Adapting your example a little bit:

```py
#!/usr/bin/env python3

import logging

from playhouse.sqlite_ext import *

logger = logging.getLogger("peewee")
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.DEBUG)

db = SqliteDatabase("/tmp/json1.db")


class KeyData(Model):
    key = TextField()
    data = JSONField()

    class Meta:
        database = db


KeyData.create_table()
unicode_str = "中文"
data = {"foo": unicode_str}
KeyData.create(key="k1", data=data)
```

Execution output:

```sql
('CREATE TABLE IF NOT EXISTS "keydata" ("id" INTEGER NOT NULL PRIMARY KEY, "key" TEXT NOT NULL, "data" JSON NOT NULL)', [])
('INSERT INTO "keydata" ("key", "data") VALUES (?, ?)', ['k1', '{"foo": "\\u4e2d\\u6587"}'])
```

Note that the value of `data` does NOT go through the `json` function. 

Now, dumping the database:

```sql
$ sqlite3 /tmp/json1.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE IF NOT EXISTS "keydata" ("id" INTEGER NOT NULL PRIMARY KEY, "key" TEXT NOT NULL, "data" JSON NOT NULL);
INSERT INTO keydata VALUES(1,'k1','{"foo": "\u4e2d\u6587"}');
COMMIT;
```

Your raw SQL example is exactly what I have in mind (as what should be done)

```sql
insert into foo (data) values (json('{ "foo": "bar" }'));
```

emphasis on `json(...)`.

> the library is doing the right thing:

Definitely, there's no bug here, it's just that JSON is not efficiently stored.

, https://git.io/fh9IC in coleifer/peewee
Expose serialization options in sqlite_ext.JSONField (or use better defaults)
=============================================================================

Currently, `sqlite_ext.py` seems to use `json.dumps` without any options, which means the defaults `separators=(', ', ': ')` and `ensure_ascii=True` are assumed. These are a waste of space, and `ensure_ascii` also reduces human readability of non-ASCII text without real world benefit (since SQLite is Unicode-safe).

Compare the result of SQLite JSON1 extension's builtin `json` function:

```sql
INSERT INTO entry(doc) VALUES(json('{ "lang":  "中文" }'));
```

=> 

```sql
INSERT INTO entry VALUES(1,'{"lang":"中文"}');
```

and peewee:

```py
Entry.insert(doc=dict(lang='中文'))
```
=>

```sql
INSERT INTO entry VALUES(1,'{"lang": "\u4e2d\u6587"}');
```

It would be nice if serialization options are exposed, or just use `separators=(',', ':')` and `ensure_ascii=False`.