Releases: jawah/charset_normalizer
Version 3.4.0
charset-normalizer is raising awareness around HTTP/2 and HTTP/3!
Did you know that Internet Explorer 11 shipped with optional HTTP/2 support back in 2013? libcurl also shipped it in 2014[...]
All of this while our community is still struggling to make firm progress in HTTP clients. Many of you use Requests
as the de facto HTTP client, yet Requests has been frozen for many years. Left in a vegetative state and not evolving,
it has kept millions of developers from using more advanced features.
We invite Python developers to look at the drop-in replacement for Requests, namely Niquests.
It leverages charset-normalizer in a better way! Check it out, you will be positively surprised! Don't wait another decade.
We are thankful to @microsoft and involved parties for funding our work through the Microsoft FOSS Fund program.
3.4.0 (2024-10-08)
Added
- Argument --no-preemptive in the CLI to prevent the detector from searching for hints.
- Support for Python 3.13 (#512)
Fixed
- Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything other than a CharsetMatch.
- Improved the general reliability of the detector based on user feedback. (#520) (#509) (#498) (#407) (#537)
- Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. (#381)
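The "hints" that --no-preemptive turns off are charset declarations found inside the content itself (the preemptive detection mentioned above). The following is only a minimal sketch of that idea using the standard library; the pattern, the function name declared_encoding and the search_zone size are assumptions made for illustration, not the library's own implementation.

```python
import codecs
import re
from typing import Optional

# Simplified pattern (an assumption, not the library's regex): matches
# declarations such as charset="utf-8", encoding='latin-1' or "coding: utf-8".
DECLARATION = re.compile(
    rb"(?:charset|encoding|coding)[:=\s]{1,10}[\"']?([a-zA-Z0-9_\-]+)",
    re.IGNORECASE,
)


def declared_encoding(payload: bytes, search_zone: int = 4096) -> Optional[str]:
    """Return the codec name declared near the start of payload, if any."""
    match = DECLARATION.search(payload[:search_zone])
    if match is None:
        return None
    try:
        # Only keep the hint when Python knows a codec by that name.
        return codecs.lookup(match.group(1).decode("ascii")).name
    except LookupError:
        return None


print(declared_encoding(b'<meta charset="utf-8">'))  # utf-8
print(declared_encoding(b"plain text, no declaration"))  # None
```

A detector that trusts such a hint can test the declared encoding first; skipping the hint forces it to rely on the content alone.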
Version 3.3.2
3.3.2 (2023-10-31)
Fixed
- Unintentional memory usage regression when using large payloads that match several encodings (#376)
- Regression on some detection cases showcased in the documentation (#371)
Added
- Noise (md) probe that identifies malformed Arabic representation due to the presence of letters in isolated form (credit to my wife, thanks!)
Version 3.3.1
3.3.1 (2023-10-22)
Changed
- Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
- Improved the general detection reliability based on reports from the community
Release 3.3.0
3.3.0 (2023-09-30)
Added
- Allow executing the CLI (e.g. normalizer) through python -m charset_normalizer.cli or python -m charset_normalizer
- Support for 9 forgotten encodings that are supported by Python but unlisted in encodings.aliases as they have no alias (#323)
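To see why some encodings can be missed when only the alias table is consulted, the sketch below lists codec modules shipped in Python's encodings package that never appear as a target in encodings.aliases. The helper name codecs_without_alias is hypothetical, the result depends on the Python version in use, and the list is broader than the 9 encodings mentioned above since it is not filtered the way the library filters its candidates.

```python
import codecs
import encodings
import pkgutil
from encodings.aliases import aliases


def codecs_without_alias() -> list:
    """List text codecs in the encodings package that no alias points to."""
    aliased = set(aliases.values())
    found = []
    for module in pkgutil.iter_modules(encodings.__path__):
        name = module.name
        if name.startswith("_") or name == "aliases" or name in aliased:
            continue
        try:
            info = codecs.lookup(name)
        except LookupError:
            # Not a codec module, or not available on this platform.
            continue
        # Skip bytes-to-bytes codecs; keep text encodings only.
        if getattr(info, "_is_text_encoding", True):
            found.append(name)
    return sorted(found)


print(codecs_without_alias())
```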
Removed
- (internal) Redundant utils.is_ascii function and unused function is_private_use_only
- (internal) charset_normalizer.assets is moved inside charset_normalizer.constant
Changed
- (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
- Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8
Fixed
- Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in __lt__ (#350)
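As a rough illustration of ordering results by chaos first and coherence second, here is a minimal sketch. The Candidate class and the 0.01 threshold are assumptions for the example; this is not the library's CharsetMatch nor its actual comparison logic.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a detection result."""

    encoding: str
    chaos: float  # mess ratio, lower is better
    coherence: float  # language coherence ratio, higher is better

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, Candidate):
            return NotImplemented
        # Illustrative threshold: when chaos is nearly identical,
        # let the higher coherence win.
        if abs(self.chaos - other.chaos) < 0.01:
            return self.coherence > other.coherence
        return self.chaos < other.chaos


candidates = [
    Candidate("cp1252", chaos=0.021, coherence=0.40),
    Candidate("utf_8", chaos=0.020, coherence=0.85),
    Candidate("cp037", chaos=0.300, coherence=0.00),
]
ranked = [c.encoding for c in sorted(candidates)]
print(ranked)  # ['utf_8', 'cp1252', 'cp037']
```

Every branch of the comparison must be reachable for close results to be ordered consistently, which is the kind of defect the fix above addresses.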
Version 3.2.0
3.2.0 (2023-06-07)
Changed
- Typehint for function from_path no longer enforces PathLike as its first argument
- Minor improvement over the global detection reliability
Added
- Introduce function is_binary that relies on main capabilities, and is optimized to detect binaries
- Propagate enable_fallback argument throughout from_bytes, from_path, and from_fp that allow a deeper control over the detection (default True)
- Explicit support for Python 3.12
Fixed
- Edge case detection failure where a file would contain 'very-long' camel-cased word (Issue #289)
Version 3.1.0
Version 3.0.1
Version 3.0.0
3.0.0 (2022-10-20)
Added
- Extend the capability of explain=True: when cp_isolation contains at most two entries (min one), the Mess-detector results are logged in detail
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
- Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio
- normalizer --version now specifies if the current version provides extra speedup (meaning mypyc compilation whl)
Changed
- Build with static metadata (not pyproject.toml yet)
- Make language detection stricter
- Optional: Module md.py can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
Fixed
- CLI with opt --normalize fails when using full path for files
- TooManyAccentuatedPlugin induces false positives on the mess detection when too few alpha characters have been fed to it
- Sphinx warnings when generating the documentation
Removed
- Coherence detector no longer returns 'Simple English' instead returns 'English'
- Coherence detector no longer returns 'Classical Chinese' instead returns 'Chinese'
- Breaking: Method first() and best() from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflicts with ASCII)
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function normalize
- Breaking: Properties chaos_secondary_pass, coherence_non_latin and w_counter from CharsetMatch
- Support for the backport unicodedata2
This is the last version (3.0.x) to support Python 3.6. We plan to drop it for 3.1.x.
Version 3.0.0rc1
This is the last pre-release. If everything goes well, I will publish the stable tag.
3.0.0rc1 (2022-10-18)
Added
- Extend the capability of explain=True: when cp_isolation contains at most two entries (min one), the Mess-detector results are logged in detail
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
- Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio
Changed
- Build with static metadata using 'build' frontend
- Make language detection stricter
Fixed
- CLI with opt --normalize fails when using full path for files
- TooManyAccentuatedPlugin induces false positives on the mess detection when too few alpha characters have been fed to it
Removed
- Coherence detector no longer returns 'Simple English' instead returns 'English'
- Coherence detector no longer returns 'Classical Chinese' instead returns 'Chinese'
Version 3.0.0b2
3.0.0b2 (2022-08-21)
Added
- normalizer --version now specifies if the current version provides extra speedup (meaning mypyc compilation whl)
Removed
- Breaking: Method first() and best() from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflicts with ASCII)
Fixed
- Sphinx warnings when generating the documentation