Mangled sys.argv when command line arguments are non-ASCII
==========================================================
CPython on *nix systems decodes command line arguments
> with filesystem encoding and “surrogateescape” error handler
according to [docs](https://docs.python.org/3/library/sys.html#sys.argv).
However, PyOxidizer seems to attempt to set `sys.argv` from the raw bytes, while `sys.argv` is ultimately a list of `str`, not `bytes`, so anything non-ASCII is mangled.
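
For reference, the decoding described in the docs can be sketched in plain Python. This is only an illustration that assumes a UTF-8 filesystem encoding; `decode_arg` is a hypothetical helper, not a CPython or PyOxidizer function.

```python
def decode_arg(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode one argv entry the way the docs describe."""
    return raw.decode(encoding, errors="surrogateescape")


raw = "中文".encode("utf-8")  # the bytes e4 b8 ad e6 96 87

# Valid UTF-8 decodes to the original text.
expected = decode_arg(raw)

# A byte that is invalid in the encoding becomes a lone surrogate (U+DC80..U+DCFF).
undecodable = decode_arg(b"\xff")

# Encoding with the same error handler restores the original byte.
roundtrip = undecodable.encode("utf-8", errors="surrogateescape")
```

With a UTF-8 filesystem encoding, I would therefore expect `sys.argv[1]` to be `'中文'`.
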
Consider this essentially default PyOxidizer config, which generates a Python REPL:
<details>
<summary><code>pyoxidizer.bzl</code></summary>
```bzl
def make_dist():
    return default_python_distribution()

def make_exe(dist):
    policy = dist.make_python_packaging_policy()
    python_config = dist.make_python_interpreter_config()
    python_config.run_mode = "repl"
    exe = dist.to_python_executable(
        name = "python",
        packaging_policy = policy,
        config = python_config,
    )
    return exe

def make_embedded_resources(exe):
    return exe.to_embedded_resources()

def make_install(exe):
    files = FileManifest()
    files.add_python_resource(".", exe)
    return files

register_target("dist", make_dist)
register_target("exe", make_exe, depends = ["dist"])
register_target("resources", make_embedded_resources, depends = ["exe"], default_build_script = True)
register_target("install", make_install, depends = ["exe"], default = True)
resolve_targets()
# END OF COMMON USER-ADJUSTED SETTINGS.
PYOXIDIZER_VERSION = "0.8.0"
PYOXIDIZER_COMMIT = "UNKNOWN"
```
</details>
Now let's pass a Unicode argument. macOS:
```console
$ ./build/x86_64-apple-darwin/release/install/python 中文 # 中文 is e4b8 ade6 9687 in UTF-8 encoding
Python 3.8.6 (default, Oct 3 2020, 13:58:55)
[Clang 10.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding(), sys.getdefaultencoding()
('utf-8', 'utf-8')
>>> sys.argv
['./build/x86_64-apple-darwin/release/install/python', '\xe4\xb8\xad\xe6\x96\x87']
```
On Linux it's even worse; I get an extra `0xdc` inserted before each byte:
```console
$ build/x86_64-unknown-linux-gnu/release/install/googler 中文
Python 3.8.6 (default, Oct 3 2020, 20:48:20)
[Clang 10.0.1 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding(), sys.getdefaultencoding()
('ascii', 'utf-8')
>>> sys.argv
['build/x86_64-unknown-linux-gnu/release/install/googler', '\udce4\udcb8\udcad\udce6\udc96\udc87']
```
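
The two results can be related back to the original UTF-8 bytes with a short sketch. This is an interpretation of the output above, not something confirmed against the PyOxidizer source: the macOS value matches each byte being mapped to the code point of the same value (as a Latin-1 decode would do), and the Linux value matches the bytes being decoded as ASCII with `surrogateescape`.

```python
raw = "中文".encode("utf-8")  # the bytes e4 b8 ad e6 96 87

# macOS output: every byte became the code point with the same value.
macos_argv = raw.decode("latin-1")

# Linux output: ASCII filesystem encoding, so each non-ASCII byte 0xNN
# is escaped as the lone surrogate U+DCNN.
linux_argv = raw.decode("ascii", errors="surrogateescape")

# Reversing each step recovers the intended string.
macos_fixed = macos_argv.encode("latin-1").decode("utf-8")
linux_fixed = linux_argv.encode("ascii", errors="surrogateescape").decode("utf-8")
```

If this reading is right, the `\udc` prefix on Linux is the surrogate escape range rather than a literal inserted byte. Reversing the mangling like this only works when you already know which of the two forms you were handed, so it is more of a diagnostic than a fix.
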
(I can turn on <s>`configure_locale`</s> or `utf8_mode` to coerce `sys.getfilesystemencoding()` to `utf-8`, but `sys.argv` ends up being the same. **Edit:** `configure_locale` actually does work.)
Ultimately, though, whether an extra byte is inserted is irrelevant. As I said, `sys.argv` is a `str`-based API, so I can hardly even code around this limitation, which makes PyOxidizer rather useless for anything that might involve non-ASCII command line arguments.
Reading #10 and https://github.com/indygreg/PyOxidizer/blob/7a222ac6fe12bd667869e2d47e75606f4717ebbc/pyembed/src/interpreter.rs#L429-L436 suggests to me that valid Unicode arguments are supposed to be supported (though I haven't read the actual implementation). Am I missing something obvious?