requests
là thư viện HTTP client phổ biến nhất của Python, nó phổ biến vì dễ dùng, chỉ cần import requests; requests.get
là xong.
Trông dễ như vậy bởi requests
đã cấu hình mặc định hết rất nhiều thứ như: connection, headers, adapter, session, encoding, ... mà mặc định thì nhiều khi đúng, đôi khi sai.
Encoding
Vì Python mặc định encoding trên Linux là utf-8
, người dùng dễ mặc định là requests
cũng vậy, nhưng thực tế thì:
Tài liệu viết:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:
r.encoding 'utf-8'
r.encoding = 'ISO-8859-1'
If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text. You might want to do this in any situation where you can apply special logic to work out what the encoding of the content will be. For example, HTML and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.
requests
thực hiện "đoán một cách có học" (educated guesses) encoding của response dựa trên HTTP headers https://github.com/psf/requests/blob/v2.32.5/src/requests/adapters.py#L355.
response.encoding = get_encoding_from_headers(response.headers)
Xem code thấy nó chỉ dựa trên header content-type
:
def get_encoding_from_headers(headers):
"""Returns encodings from given HTTP Header Dict.
:param headers: dictionary to extract encoding from.
:rtype: str
"""
content_type = headers.get("content-type")
if not content_type:
return None
content_type, params = _parse_content_type_header(content_type)
if "charset" in params:
return params["charset"].strip("'\"")
if "text" in content_type:
return "ISO-8859-1"
if "application/json" in content_type:
# Assume UTF-8 based on RFC 4627: https://www.ietf.org/rfc/rfc4627.txt since the charset was unset
return "utf-8"
# https://github.com/psf/requests/blob/v2.32.5/src/requests/utils.py#L529-L551
nếu có charset
trong content-type
requests
sẽ dùng giá trị của charset. Ví dụ:
Content-Type: text/html; charset=utf-8
Khi không có charset
, nếu content-type
chứa text
dùng ISO-8859-1
, còn nếu là json
dùng utf-8
.
Vậy khi dùng với các JSON API, requests
sẽ dùng utf-8
nên kết quả luôn như mong đợi, nhưng nếu lấy trang HTML, khi header không set, encoding có thể bị sai.
resp.content vs resp.text
content
và text
là 2 property của response object, content
chứa byte và người dùng có thể tự decode
với encoding tùy ý.
text
đọc .encoding
rồi decode:
In [48]: url = 'https://podcasts.apple.com/jp/podcast/%E3%81%AA%E3%81%8C%E3%82%89%E6%97%A5%E7%B5%8C/id1627014612'
In [49]: resp = requests.get(url)
In [50]: print(resp.encoding)
ISO-8859-1
In [51]: print(resp.headers.get('Content-Type'))
text/html
do header Content-Type
không có charset
, lại chứa text
, nên requests
set encoding = 'ISO-8859-1'
In [54]: resp.text.split("<title>")[1].split("</title>")[0]
Out[54]: 'ã\x81ªã\x81\x8cã\x82\x89æ\x97¥çµ\x8c - ã\x83\x9dã\x83\x83ã\x83\x89ã\x82\xadã\x83£ã\x82¹ã\x83\x88 - Apple Podcast'
In [55]: resp.content.split(b"<title>")[1].split(b"</title>")[0].decode("utf-8")
Out[55]: 'ながら日経 - ポッドキャスト - Apple Podcast'
tự set encoding = 'utf-8'
để text hiển thị đúng:
In [56]: resp.encoding = 'utf-8'
In [57]: resp.text.split("<title>")[1].split("</title>")[0]
Out[57]: 'ながら日経 - ポッドキャスト - Apple Podcast'
In [62]: resp.connection
Out[62]: <requests.adapters.HTTPAdapter at 0x772dba31ccd0>
PS: curl cũng encode đúng với utf-8:
$ curl -s https://podcasts.apple.com/jp/podcast/%E3%81%AA%E3%81%8C%E3%82%89%E6%97%A5%E7%B5%8C/id1627014612 | grep -o '<title>.*</title>'
<title>ながら日経 - ポッドキャスト - Apple Podcast</title>
Xem code method text
:
@property
def text(self):
"""Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using
``charset_normalizer`` or ``chardet``.
The encoding of the response content is determined based solely on HTTP
headers, following RFC 2616 to the letter. If you can take advantage of
non-HTTP knowledge to make a better guess at the encoding, you should
set ``r.encoding`` appropriately before accessing this property.
"""
# Try charset from content-type
content = None
encoding = self.encoding
if not self.content:
return ""
# Fallback to auto-detected encoding.
if self.encoding is None:
encoding = self.apparent_encoding
# Decode unicode from given encoding.
try:
content = str(self.content, encoding, errors="replace")
except (LookupError, TypeError):
# A LookupError is raised if the encoding was not found which could
# indicate a misspelling or similar mistake.
#
# A TypeError can be raised if encoding is None
#
# So we try blindly encoding.
content = str(self.content, errors="replace")
return content
https://github.com/psf/requests/blob/v2.32.5/src/requests/models.py#L909
Kết luận
requests
dễ nhưng không đơn giản.
Hết.
HVN at https://pymi.vn and https://www.familug.org.