GHC 2020-10-13

, https://git.io/JTtAz in indygreg/PyOxidizer
Mangled sys.argv when command line arguments are non-ASCII
==========================================================

CPython on *nix systems decodes command line arguments

> with filesystem encoding and “surrogateescape” error handler

according to [docs](https://docs.python.org/3/library/sys.html#sys.argv).
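To make the expected behavior concrete, here is a minimal sketch of what that decoding amounts to, assuming a UTF-8 filesystem encoding. `decode_arg` is a hypothetical helper name used only for illustration; it is not a CPython or PyOxidizer API.

```python
def decode_arg(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode one raw argv entry the way the docs describe (illustrative helper)."""
    # With "surrogateescape", bytes that are invalid in `encoding` become lone
    # surrogates U+DC80..U+DCFF instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="surrogateescape")


raw = "中文".encode("utf-8")  # b'\xe4\xb8\xad\xe6\x96\x87'

# Valid UTF-8 decodes to the original text.
assert decode_arg(raw) == "中文"

# An invalid byte survives as a lone surrogate and round-trips losslessly.
assert decode_arg(b"\xff") == "\udcff"
assert decode_arg(b"\xff").encode("utf-8", errors="surrogateescape") == b"\xff"
```

So with a UTF-8 filesystem encoding, I would expect `sys.argv[1]` to be `'中文'`.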

However, PyOxidizer seems to try to populate `sys.argv` with the raw bytes, whereas `sys.argv` is ultimately a list of `str`, not `bytes`, so anything non-ASCII is mangled.

Consider this essentially default PyOxidizer config, which generates a Python REPL:

<details>
<summary><code>pyoxidizer.bzl</code></summary>

```bzl
def make_dist():
    return default_python_distribution()


def make_exe(dist):
    policy = dist.make_python_packaging_policy()

    python_config = dist.make_python_interpreter_config()
    python_config.run_mode = "repl"

    exe = dist.to_python_executable(
        name = "python",
        packaging_policy = policy,
        config = python_config,
    )

    return exe

def make_embedded_resources(exe):
    return exe.to_embedded_resources()

def make_install(exe):
    files = FileManifest()
    files.add_python_resource(".", exe)
    return files

register_target("dist", make_dist)
register_target("exe", make_exe, depends = ["dist"])
register_target("resources", make_embedded_resources, depends = ["exe"], default_build_script = True)
register_target("install", make_install, depends = ["exe"], default = True)

resolve_targets()

# END OF COMMON USER-ADJUSTED SETTINGS.


PYOXIDIZER_VERSION = "0.8.0"
PYOXIDIZER_COMMIT = "UNKNOWN"
```
</details>

Now let's pass a non-ASCII argument. On macOS:

```console
$ ./build/x86_64-apple-darwin/release/install/python 中文  # 中文 is e4b8 ade6 9687 in UTF-8 encoding
Python 3.8.6 (default, Oct  3 2020, 13:58:55)
[Clang 10.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding(), sys.getdefaultencoding()
('utf-8', 'utf-8')
>>> sys.argv
['./build/x86_64-apple-darwin/release/install/python', '\xe4\xb8\xad\xe6\x96\x87']
```
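The macOS result looks as if each UTF-8 byte was turned into a code point of the same value, which is what a Latin-1 decode would produce. That is only my reading of the output, not something confirmed against the PyOxidizer source. A minimal sketch reproducing the same string in plain Python:

```python
raw = "中文".encode("utf-8")  # the bytes the shell passes to the process

# Assumption: the mangled value is equivalent to a Latin-1 decode of those bytes.
mangled = raw.decode("latin-1")
assert mangled == "\xe4\xb8\xad\xe6\x96\x87"  # same code points as the argv entry above

# Under that assumption the original text can be recovered, but only as a hack:
recovered = mangled.encode("latin-1").decode("utf-8")
assert recovered == "中文"
```

This is useful for diagnosis only. An application cannot tell a mangled argument apart from a legitimate string that happens to contain those Latin-1 characters.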

On Linux it's even worse: every byte comes back with an extra `0xdc` in front of it, i.e. as a lone surrogate `\udcXX`:

```console
$ build/x86_64-unknown-linux-gnu/release/install/googler 中文
Python 3.8.6 (default, Oct  3 2020, 20:48:20)
[Clang 10.0.1 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding(), sys.getdefaultencoding()
('ascii', 'utf-8')
>>> sys.argv
['build/x86_64-unknown-linux-gnu/release/install/googler', '\udce4\udcb8\udcad\udce6\udc96\udc87']
```
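For comparison, the Linux output matches what plain Python produces when the same bytes are decoded as ASCII with the `surrogateescape` handler, which fits the `'ascii'` filesystem encoding reported above. A minimal sketch, assuming nothing beyond the standard `bytes.decode` and `str.encode` methods:

```python
raw = "中文".encode("utf-8")  # b'\xe4\xb8\xad\xe6\x96\x87'

# Every byte >= 0x80 is invalid ASCII, so each one is mapped to U+DC00 + byte.
escaped = raw.decode("ascii", errors="surrogateescape")
assert escaped == "\udce4\udcb8\udcad\udce6\udc96\udc87"

# The mapping is reversible: encode back to the raw bytes, then decode as UTF-8.
recovered = escaped.encode("ascii", errors="surrogateescape").decode("utf-8")
assert recovered == "中文"
```

The round trip only works if the program already knows the arguments were UTF-8, so it is a diagnostic aid rather than a fix.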

(I can turn on <s>`configure_locale`</s> or `utf8_mode` to coerce `sys.getfilesystemencoding()` to `utf-8`, but `sys.argv` ends up being the same. **Edit:** `configure_locale` actually does work.)

Ultimately, though, whether an extra byte is inserted is irrelevant. As I said, `sys.argv` is a `str`-based API, so there is no clean way for me to code around this, which makes PyOxidizer rather useless for anything that might involve non-ASCII command line arguments.

Reading #10 and https://github.com/indygreg/PyOxidizer/blob/7a222ac6fe12bd667869e2d47e75606f4717ebbc/pyembed/src/interpreter.rs#L429-L436 (I haven't read the actual implementation) suggests to me that valid Unicode arguments are supposed to be supported. Am I missing something obvious?