Comment 2 for bug 1868232

Dan Watkins (oddbloke) wrote :

I've just done some research (by stepping through with a debugger), and socket.getaddrinfo _does_ perform the encoding of non-ASCII characters:

  In [7]: socket.getaddrinfo('www.\u2603.com', None)[0][4][0]
  Out[7]: '185.53.178.7'

It does so using the 'idna' encoding:

  In [2]: "www.☃.com".encode('idna')
  Out[2]: b'www.xn--n3h.com'

which (unsurprisingly, given this bug) doesn't do anything to underscores:

  In [4]: "www_foo.☃.com".encode('idna')
  Out[4]: b'www_foo.xn--n3h.com'

So I believe the correct implementation of (a) would be to encode the URL ourselves, and then strip out any characters that are still invalid. (We should check whether any stdlib/requests functionality already does this.)
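
A minimal sketch of what that could look like, applied to the hostname portion only. The helper name sanitize_hostname and the choice of "invalid" (anything other than ASCII letters, digits, hyphens and dots) are assumptions for illustration, not existing cloud-init, stdlib or requests behaviour:

```python
import re

# Assumption: after IDNA encoding, only ASCII letters, digits, hyphens and
# dots are treated as valid; everything else (e.g. underscores) is dropped.
_INVALID_HOST_CHARS = re.compile(r"[^A-Za-z0-9.-]")


def sanitize_hostname(hostname):
    """Hypothetical helper: IDNA-encode a hostname, then strip invalid characters."""
    # Same codec that socket.getaddrinfo was observed to use above.
    encoded = hostname.encode("idna").decode("ascii")
    return _INVALID_HOST_CHARS.sub("", encoded)


print(sanitize_hostname("www_foo.\u2603.com"))
```

For the example above this should give 'wwwfoo.xn--n3h.com'. Note that the 'idna' codec raises UnicodeError for some inputs (empty or over-long labels, for instance), so a real implementation would need to decide how to handle that.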