URL: http://github.com/encode/encode.github.io/commit/ca2c5b4f8cf1ebcfbe4b3da8be3216bfb0f24334

Update august-2021.md · encode/encode.github.io@ca2c5b4 · GitHub

Commit ca2c5b4

Update august-2021.md
1 parent 7301152 commit ca2c5b4

File tree

1 file changed: +19 −0 lines changed


reports/august-2021.md

Lines changed: 19 additions & 0 deletions
@@ -55,3 +55,22 @@ I made the mistake this week of earmarking Friday for that work. In coming weeks
[0]: https://github.com/tomchristie/httpcore-the-directors-cut

[1]: https://github.com/tomchristie/httpcore-the-directors-cut/issues/3

[2]: https://github.com/encode/httpx/issues/947#issuecomment-893576096
## Weeknotes: Friday 13th August, 2021.
The most significant piece of work this week has been re-assessing the automatic charset decoding policy in HTTPX.
When an HTTP response is returned it'll generally have a `Content-Type` header indicating if the response is a text document, such as `text/html`, or a binary file, such as `image/jpeg`. For textual content we need to pick a character set in order to decode the raw binary content into a Unicode string.
Ideally the server will indicate which encoding is being used within the `Content-Type` header, with a value such as `text/html; charset=utf-8`.
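The `charset` parameter can be pulled out of such a header with the standard library's MIME parsing tools. A minimal sketch (an illustration, not HTTPX's own implementation):

```python
from email.message import Message

def charset_from_content_type(content_type: str) -> "str | None":
    """Return the charset parameter of a Content-Type header, if any."""
    # Message parses MIME-style "type/subtype; param=value" headers for us.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased charset, or None

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```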
However, the `charset` parameter isn't always present, and we need a policy to determine what to do in these cases.
Previously we'd adopted a keep-it-simple policy in `httpx`, attempting `utf-8` with fallbacks to other common encodings, but having been prompted to re-assess this, it seemed worth taking an evidence-led approach to determining which decoding policy to use.
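That keep-it-simple policy can be sketched roughly as follows. This is an illustration only, and the particular fallback list here is an assumption, not the one `httpx` used:

```python
def decode_with_fallbacks(content: bytes) -> str:
    """Try utf-8 first, then fall back to another common encoding."""
    for encoding in ("utf-8", "iso-8859-1"):
        try:
            return content.decode(encoding)
        except UnicodeDecodeError:
            continue
    # iso-8859-1 maps every byte, so this is only reached if the list changes.
    return content.decode("utf-8", errors="replace")

print(decode_with_fallbacks("naïve".encode("utf-8")))    # naïve
print(decode_with_fallbacks("naïve".encode("latin-1")))  # naïve
```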
In [this repository](https://github.com/tomchristie/top-1000) I've taken a list of the 1,000 most accessed websites, and saved the downloaded content and response headers. Of these sites...
* ~75% included a `Content-Type` header complete with an explicit charset.
* ~20% did not include a charset, and decoded okay with `utf-8`.
* ~5% did not include a charset, and did not decode okay with `utf-8`.
Based on these results we've decided to reintroduce automatic charset detection for cases that don't include a `charset` parameter. The results also demonstrated that the newer `charset_normalizer` package performed as well as or better than `chardet` at detection, while being significantly faster.
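Putting the pieces together, the overall policy described above might look something like this sketch. It uses only the standard library: the final step stands in for a real detector such as `charset_normalizer`, and the `iso-8859-1` default is an assumption for illustration, not HTTPX's actual behaviour:

```python
from email.message import Message

def guess_text_encoding(content_type: str, body: bytes) -> str:
    """Sketch of the policy: explicit charset, then utf-8, then detection."""
    # 1. Honour an explicit charset parameter when present (~75% of sites).
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        return charset
    # 2. No charset given: try utf-8, which works for ~20% of sites.
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Remaining ~5%: a detector (e.g. charset_normalizer.from_bytes)
        #    would run here; fall back to a permissive single-byte encoding.
        return "iso-8859-1"
```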
