- fix release CI attempting to upload a pyodide wheel
- add riscv64 wheels
- add support for taskflow 4.0.0
- upgrade to `Cython==3.2.4`
- fix type hints for extractOne when no score_cutoff is provided
- add missing pypy and freethreaded linux wheels
- drop s390x and ppc64le wheels, since they are virtually unused and take extremely long to build under emulation
- upgrade to `Cython==3.1.6`
- enable free threading
- Fully disable line tracing in release builds
- upgrade to `Cython==3.1.3`. This enables compilation with free threaded Python.
- upgrade to `rapidfuzz-cpp==3.3.3`
- add support for freethreaded Python
- add python 3.14 wheels
- dropped support for Python 3.9
- drop 32 bit linux wheels
- remove unused hook-dirs from pyinstaller config to fix a warning
- fixed WRatio for a length ratio of exactly 8.0
- add support for arrays of type 'w'
- add support for any DTypeLike as dtype in `cdist` and `cpdist`
- upgrade to `rapidfuzz-cpp==3.3.2`
- added wheels for pypy 3.11
- upgrade to `Cython==3.0.12`
- fix version number
- generate code for fallback imports that is easier to parse for tools bundling Python applications into a single binary (examples are cx-freeze and pyinstaller)
- added support for taskflow 3.9.0
- improve calculation of min score inside partial_ratio so it can skip more alignments
- added build support for emscripten
- fix compilation on clang-19
- fix incorrect results in simd optimized implementation of Levenshtein and OSA on 32bit targets
- added support for taskflow 3.8.0
- drop support for Python 3.8
- switch build system to `scikit-build-core`
- fix crash in `cdist` due to Visual Studio upgrade
- upgrade to `Cython==3.0.11`
- add Python 3.13 wheels
- include simd binaries in pyinstaller builds
- fix builds with setuptools 72 by upgrading `scikit-build`
- fix bug in `Levenshtein.editops` and `Levenshtein.opcodes` which could lead to incorrect results and crashes for some inputs
- fix `None` handling for queries in `process.cdist` for scorers not supporting SIMD
- fix supported versions of taskflow in cmake to be in the range v3.3 - v3.7
- disable AVX2 on macOS since it led to illegal instructions being generated
- significantly improve type hints for the library
- fix cmake version parsing
- use the correct version of `rapidfuzz-cpp` when building against a system installed version
- added `process.cpdist`, which allows pairwise comparison of two collections of inputs
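The pairwise behaviour can be illustrated with a small pure-Python sketch (the helper names below are illustrative, not the real API): unlike `cdist`, which scores every query against every choice, `cpdist` scores `queries[i]` against `choices[i]` only.

```python
# Hypothetical stand-in for the pairwise comparison idea behind process.cpdist.
def cpdist_sketch(queries, choices, scorer):
    # score element-wise pairs rather than the full cross product
    if len(queries) != len(choices):
        raise ValueError("queries and choices must have the same length")
    return [scorer(q, c) for q, c in zip(queries, choices)]

def exact_match(a, b):
    # toy scorer: 100 for equal strings, 0 otherwise
    return 100 if a == b else 0

scores = cpdist_sketch(["ab", "cd"], ["ab", "ce"], exact_match)
```

The real function additionally supports the usual scorer, processor and dtype machinery of the `process` module.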
- fix some minor errors in the type hints
- fix potentially incorrect results of JaroWinkler when using high prefix weights
- reduce import time
- upgrade to `Cython==3.0.9`
- upgrade `rapidfuzz-cpp`, which includes a fix for build issues on some compilers
- fix some issues with the sphinx config
- fix overflow error on systems with `sizeof(size_t) < 8`
- fix pure python fallback implementation of `fuzz.token_set_ratio`
- properly link with `-latomic` if `std::atomic<uint64_t>` is not natively supported
- add banded implementation of LCS / Indel. This improves the runtime from `O((|s1|/64) * |s2|)` to `O((score_cutoff/64) * |s2|)`
- upgrade to `Cython==3.0.7`
- `cdist` for many metrics now returns a matrix of `uint32` instead of `int32` by default
- use _mm_malloc/_mm_free on macOS if aligned_alloc is unsupported
- fix compilation failure on macOS
- skip pandas `pd.NA` similar to `None`
- add `score_multiplier` argument to `process.cdist`, which allows multiplying the end result scores with a constant factor
- drop support for Python 3.7
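As a rough illustration of the `score_multiplier` idea, here is a pure-Python stand-in (not the actual implementation, where the scaling happens inside `process.cdist` itself): each raw score, normally in the 0-100 range for fuzz scorers, is scaled by a constant factor.

```python
# Hypothetical helper mirroring the described score_multiplier behaviour.
def apply_score_multiplier(scores, score_multiplier=1.0):
    # scale every score in the result matrix by a constant factor
    return [[s * score_multiplier for s in row] for row in scores]

# e.g. rescale 0-100 scores toward the 0-1 range
scaled = apply_score_multiplier([[100, 50], [25, 0]], score_multiplier=0.01)
```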
- improve performance of the simd implementation for `LCS`/`Indel`/`Jaro`/`JaroWinkler`
- improve performance of Jaro and Jaro Winkler for long sequences
- implement `process.extract` with `limit=1` using `process.extractOne`, which can be faster
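The `limit=1` optimization can be sketched as follows (helper names are illustrative, not the real implementations): a single best-match scan avoids scoring and sorting the full result list.

```python
# Hypothetical sketch of answering extract(..., limit=1) via a best-match scan.
def extract_one_sketch(query, choices, scorer):
    # single pass keeping only the best (choice, score, index) triple
    best = None
    for idx, choice in enumerate(choices):
        score = scorer(query, choice)
        if best is None or score > best[1]:
            best = (choice, score, idx)
    return best

def extract_sketch(query, choices, scorer, limit=5):
    if limit == 1:
        # no need to score-and-sort everything for a single result
        best = extract_one_sketch(query, choices, scorer)
        return [best] if best is not None else []
    scored = [(c, scorer(query, c), i) for i, c in enumerate(choices)]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:limit]
```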
- the preprocessing function was always called through Python due to a broken C-API version check
- fix wraparound issue in simd implementation of Jaro and Jaro Winkler
- upgrade to `Cython==3.0.3`
- add simd implementation for Jaro and Jaro Winkler
- add missing tag for python 3.12 support
- upgrade to `Cython==3.0.2`
- implement the remaining missing features from the C++ implementation in the pure Python implementation
- added support for Python 3.12
- build x86 with sse2/avx2 runtime detection
- upgrade to `Cython==3.0.0`
- upgrade to `taskflow==3.6`
- replace usage of `isnan` with `std::isnan`, which fixes the build on NetBSD
- added keyword argument `pad` to the Hamming distance. This controls whether sequences of different length should be padded or lead to a `ValueError`
- improve consistency of exception messages between the C++ and pure Python implementation
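The `pad` semantics described above can be sketched in pure Python (the helper name is illustrative, not the library's function): with padding, every character beyond the shorter sequence counts as one edit; without it, unequal lengths are an error.

```python
# Hypothetical sketch of a Hamming distance with the described pad argument.
def hamming_sketch(s1, s2, pad=True):
    if len(s1) != len(s2) and not pad:
        raise ValueError("sequences have different lengths")
    # positions that differ within the common prefix length
    dist = sum(a != b for a, b in zip(s1, s2))
    # padded tail: each extra character is one mismatch
    return dist + abs(len(s1) - len(s2))
```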
- upgrade required Cython version to `Cython==3.0.0b3`
- fix missing GIL restore when an exception is thrown inside `process.cdist`
- fix incorrect type hints for the `process` module
- allow the usage of `Hamming` for different string lengths. Length differences are handled as insertions / deletions
- remove support for boolean preprocessor functions in `rapidfuzz.fuzz` and `rapidfuzz.process`. The processor argument is now always a callable or `None`
- update defaults of the processor argument to be `None` everywhere. For affected functions this can change results, since strings are no longer preprocessed. To get back the old behaviour pass `processor=utils.default_process` to these functions. The following functions are affected by this: `process.extract`, `process.extract_iter`, `process.extractOne`, `fuzz.token_sort_ratio`, `fuzz.token_set_ratio`, `fuzz.token_ratio`, `fuzz.partial_token_sort_ratio`, `fuzz.partial_token_set_ratio`, `fuzz.partial_token_ratio`, `fuzz.WRatio`, `fuzz.QRatio`
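The default-processor change described above can alter results, as a toy example shows. `default_process_sketch` below only approximates `utils.default_process` (lowercase, strip non-alphanumeric characters) and `ratio_sketch` is a toy scorer; both are illustrative, not the library's implementations.

```python
import re

# rough approximation of utils.default_process (details may differ)
def default_process_sketch(s):
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

def ratio_sketch(a, b, processor=None):
    # toy scorer: exact match after optional preprocessing
    if processor is not None:
        a, b = processor(a), processor(b)
    return 100 if a == b else 0

# old behaviour (preprocessing on by default) vs new default processor=None
with_proc = ratio_sketch("Hello, World!", "hello world", processor=default_process_sketch)
without = ratio_sketch("Hello, World!", "hello world")
```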
- `rapidfuzz.process` no longer calls scorers with `processor=None`. For this reason user provided scorers no longer require this argument.
- remove option to pass keyword arguments to the scorer via `**kwargs` in `rapidfuzz.process`. They can be passed via a `scorer_kwargs` argument now. This ensures this does not break when extending function parameters and prevents naming clashes.
- remove the `rapidfuzz.string_metric` module. Replacements for all functions are available in `rapidfuzz.distance`
- added support for arbitrary hashable sequences in the pure Python fallback implementation of all functions in `rapidfuzz.distance`
- added support for `None` and `float("nan")` in `process.cdist` as long as the underlying scorer supports it. This is the case for all scorers returning normalized results.
- fix division by zero in simd implementation of normalized metrics leading to incorrect results
- fix incorrect tag dispatching implementation leading to AVX2 instructions in the SSE2 code path
- add wheels for windows arm64
- allow the usage of finite generators as choices in `process.extract`
- upgrade required Cython version to `Cython==3.0.0b2`
- fix handling of non symmetric scorers in the pure python version of `process.cdist`
- fix default dtype handling when using `process.cdist` with pure python scorers
- fix function signature of `get_requires_for_build_wheel`
- reformat changelog as restructured text to get rid of the `m2r2` dependency
- added docs to sdist
- fix two cases of undefined behavior in `process.cdist`
- handle `float("nan")` similar to `None` for query / choice, since this is common for non-existent data in tools like numpy
- fix handling of `None`/`float("nan")` in `process.distance`
- use absolute imports inside tests
- improve handling of functions wrapped using `functools.wraps`
- fix broken fallback to the Python implementation when an `ImportError` occurs on import. This can e.g. occur when the binary has a dependency on libatomic, but it is unavailable on the system
- define `CMAKE_C_COMPILER_AR`/`CMAKE_CXX_COMPILER_AR`/`CMAKE_C_COMPILER_RANLIB`/`CMAKE_CXX_COMPILER_RANLIB` if they are not defined yet
- fix incorrect results in `Hamming.normalized_similarity`
- fix incorrect score_cutoff handling in the pure python implementation of `Postfix.normalized_distance` and `Prefix.normalized_distance`
- fix `Levenshtein.normalized_similarity` and `Levenshtein.normalized_distance` when used in combination with the process module
- `fuzz.partial_ratio` was not always symmetric when `len(s1) == len(s2)`
- fix bug in `normalized_similarity` of most scorers, leading to incorrect results when used in combination with the process module
- fix sse2 support
- fix bug in `JaroWinkler` and `Jaro` when used in the pure python process module
- forward kwargs in the pure Python implementation of `process.extract`
- fix bug in `Levenshtein.editops` leading to crashes when used with `score_hint`
- moved capi from `rapidfuzz_capi` into `rapidfuzz`, since the installation will always succeed now that there is a pure Python mode
- add `score_hint` argument to the process module
- add `score_hint` argument to the Levenshtein module
- drop support for Python 3.6
- added `Prefix`/`Postfix` similarity
- fixed packaging with pyinstaller
- Fix segmentation fault in `process.cdist` when used with an empty query sequence
- move jarowinkler dependency into rapidfuzz to simplify maintenance
- add SIMD implementation for `fuzz.ratio`/`fuzz.QRatio`/`Levenshtein`/`Indel`/`LCSseq`/`OSA` to improve performance for short strings in cdist
- use `scikit-build==0.14.1` on Linux, since `scikit-build==0.15.0` fails to find the Python interpreter
- work around a gcc bug in template type deduction
- fix support for cmake versions below 3.17
- modernize cmake build to fix most conda-forge builds
- add editops to hamming distance
- strip common affix in osa distance
- ignore missing pandas in Python 3.11 tests
- add optimal string alignment (OSA)
- `fuzz.partial_ratio` did not find the optimal alignment in some edge cases (#219)
- improve performance of `fuzz.partial_ratio`
- increased minimum C++ version to C++17 (see #255)
- improve performance of `Levenshtein.distance`/`Levenshtein.editops` for long sequences
- add `score_hint` parameter to `Levenshtein.editops`, which allows the use of a faster implementation
- all functions in the `string_metric` module now raise a deprecation warning. They are only wrappers for their replacement functions, which makes them slower when used with the process module
- fix incorrect results of partial_ratio for long needles (#257)
- fix hashing for custom classes
- add support for slicing in `Editops.__getitem__`/`Editops.__delitem__`
- add `DamerauLevenshtein` module
- added support for KeyboardInterrupt in the process module. It might still take a moment until the KeyboardInterrupt is registered, but it no longer runs all text comparisons after pressing Ctrl + C
- fix default scorer used by cdist to use C++ implementation if possible
- Added support for Python 3.11
- fix value range of `jaro_similarity`/`jaro_winkler_similarity` in the pure Python mode of the string_metric module
- fix missing atomic symbol on arm 32 bit
- add missing symbol to the pure Python mode, which made usage impossible
- fix version number
- fix banded Levenshtein implementation
- improve performance and memory usage of `Levenshtein.editops`:
  - memory usage is reduced from O(NM) to O(N)
  - performance is improved for long sequences
- add `as_matching_blocks` to `Editops`/`Opcodes`
- add support for deletions from `Editops`
- add `Editops.apply`/`Opcodes.apply`
- add `Editops.remove_subsequence`
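For intuition, applying edit operations can be sketched in pure Python. The sketch below works on `(tag, source_index, destination_index)` triples in the spirit of the described `Editops.apply`; the helper name and exact semantics are illustrative, not the library's implementation.

```python
# Hypothetical sketch of applying editops to transform src into dest.
def apply_editops_sketch(ops, src, dest):
    out = []
    src_pos = 0
    for tag, i, j in ops:
        out.append(src[src_pos:i])  # copy the unchanged run before this op
        if tag == "replace":
            out.append(dest[j])     # substitute one character
            src_pos = i + 1
        elif tag == "insert":
            out.append(dest[j])     # insert without consuming src
            src_pos = i
        elif tag == "delete":
            src_pos = i + 1         # skip one src character
    out.append(src[src_pos:])       # copy the remaining tail
    return "".join(out)

# turn "spam" into "park": delete 's', replace 'm' -> 'r', append 'k'
ops = [("delete", 0, 0), ("replace", 3, 2), ("insert", 4, 3)]
```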
- merge adjacent similar blocks in `Opcodes`
- fix usage of `eval(repr(Editop))`, `eval(repr(Editops))`, `eval(repr(Opcode))` and `eval(repr(Opcodes))`
- fix opcode conversion for an empty source sequence
- fix validation for an empty Opcode list passed into `Opcodes.__init__`
- added in-tree build backend to install cmake and ninja only when they are not installed yet and only when wheels are available
- changed internal implementation of cdist to remove build dependency to numpy
- added wheels for musllinux and manylinux ppc64le, s390x
- fix missing type stubs
- change src layout to make package import from root directory possible
- allow installation without the C++ extension if it fails to compile
- allow selection of the implementation via the environment variable `RAPIDFUZZ_IMPLEMENTATION`, which can be set to "cpp" or "python"
- added pure python fallback for all implementations with the following exceptions:
  - no support for sequences of hashables. Only strings are supported so far
  - `*.editops`/`*.opcodes` functions are not implemented yet
  - `process.cdist` does not support multithreading
- fuzz.partial_ratio_alignment ignored the score_cutoff
- fix implementation of Hamming.normalized_similarity
- fix default score_cutoff of Hamming.similarity
- fix implementation of LCSseq.distance when used in the process module
- treat hash for -1 and -2 as different
- fix integer wraparound in partial_ratio/partial_ratio_alignment
- fix unlimited recursion in LCSseq when used in combination with the process module
- add fallback implementations of `taskflow`, `rapidfuzz-cpp` and `jarowinkler-cpp` back to the wheel, since some package building systems like piwheels can't clone sources
- use system version of cmake on arm platforms, since the cmake package fails to compile
- add tests to sdist
- remove cython dependency for sdist
- relax version requirements of dependencies to simplify packaging
- Do not include installations of jaro_winkler in wheels (regression from 2.0.7)
- Allow installation from system installed versions of `rapidfuzz-cpp`, `jarowinkler-cpp` and `taskflow`
- Added PyPy3.9 wheels on Linux
- Add missing Cython code in sdist
- consider float imprecision in score_cutoff (see #210)
- fix incorrect score_cutoff handling in token_set_ratio and token_ratio
- add longest common subsequence
- Do not include installations of jaro_winkler and taskflow in wheels
- fix incorrect population of sys.modules which led to submodules overshadowing other imports
- moved JaroWinkler and Jaro into a separate package
- fix signed integer overflow inside hashmap implementation
- fix binary size increase due to debug symbols
- fix segmentation fault in `Levenshtein.editops`
- Added fuzz.partial_ratio_alignment, which returns the result of fuzz.partial_ratio combined with the alignment this result stems from
- Fix Indel distance returning incorrect result when using score_cutoff=1, when the strings are not equal. This affected other scorers like fuzz.WRatio, which use the Indel distance as well.
- fix type hints
- Add back transpiled cython files to the sdist to simplify builds in package builders like FreeBSD port build or conda-forge
- fix type hints
- Indel.normalized_similarity mistakenly used the implementation of Indel.normalized_distance
- added C-Api which can be used to extend RapidFuzz from different Python modules using any programming language which allows the usage of C-Apis (C/C++/Rust)
- added new scorers in `rapidfuzz.distance.*`
- port existing distances to this new api
- add Indel distance along with the corresponding editops function
- when the result of `string_metric.levenshtein` or `string_metric.hamming` is above max they now return `max + 1` instead of -1
- Build system moved from setuptools to scikit-build
- Stop including all modules in __init__.py, since they significantly slowed down import time
- remove the `rapidfuzz.levenshtein` module, which was deprecated in v1.0.0 and scheduled for removal in v2.0.0
- dropped support for Python 2.7 and Python 3.5
- deprecate support to specify processor in form of a boolean (will be removed in v3.0.0)
- new functions will not get support for this in the first place
- deprecate `rapidfuzz.string_metric` (will be removed in v3.0.0). Similar scorers are available in `rapidfuzz.distance.*`
- `process.cdist` raised an exception when used with a pure python scorer
- improve performance and memory usage of `rapidfuzz.string_metric.levenshtein_editops`:
  - memory usage is reduced by 33%
  - performance is improved by around 10%-20%
- significantly improve performance of `rapidfuzz.string_metric.levenshtein` for `max <= 31` using a banded implementation
- fix bug in new editops implementation, causing it to SegFault on some inputs (see qurator-spk/dinglehopper#64)
- Fix some issues in the type annotations (see #163)
- improve performance and memory usage of `rapidfuzz.string_metric.levenshtein_editops`:
  - memory usage is reduced by 10x
  - performance is improved from `O(N * M)` to `O([N / 64] * M)`
- Added missing wheels for Python 3.6 on macOS and Windows (see #159)
- Add wheels for Python 3.10 on macOS
- Fix incorrect editops results (see #148)
- Add wheels for Python 3.10 on all platforms except macOS (see #141)
- Improve performance of `string_metric.jaro_similarity` and `string_metric.jaro_winkler_similarity` for strings with a length <= 64
- fixed incorrect results of fuzz.partial_ratio for long needles (see #138)
- Added typing for process.cdist
- Added multithreading support to `process.cdist`
- Add dtype argument to `process.cdist` to set the dtype of the result numpy array (see #132)
- Use a better hash collision strategy in the internal hashmap, which improves the worst case performance
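What `cdist` computes can be shown with a pure-Python stand-in (illustrative only): the full pairwise score matrix, with results converted to a requested dtype. The real function builds a numpy array and can distribute the work across threads.

```python
# Hypothetical stand-in for the pairwise matrix computed by process.cdist.
def cdist_sketch(queries, choices, scorer, dtype=int):
    # one row per query, one column per choice
    return [[dtype(scorer(q, c)) for c in choices] for q in queries]

matrix = cdist_sketch(["a", "b"], ["a", "b", "c"],
                      lambda x, y: 100 if x == y else 0)
```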
- improved performance of fuzz.ratio
- only import process.cdist when numpy is available
- Add back wheels for Python2.7
- fuzz.partial_ratio uses a new implementation for short needles (<= 64). This implementation is:
  - more accurate than the previous implementation (it is guaranteed to find the optimal alignment)
  - significantly faster
- Add process.cdist to compare all elements of two lists (see #51)
- Fix out of bounds access in levenshtein_editops
- all scorers now support similarity/distance calculations between any sequence of hashables. So it is possible to calculate e.g. the word error rate (WER) as:

      >>> string_metric.levenshtein(["word1", "word2"], ["word1", "word3"])
      1
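For illustration, here is a minimal pure-Python Levenshtein distance over arbitrary sequences of hashables, matching the word-error-rate example above (a sketch, not the optimized implementation shipped by the library):

```python
# Textbook two-row dynamic programming Levenshtein over any sequences.
def levenshtein_sketch(a, b):
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (x != y))) # substitution / match
        prev = cur
    return prev[-1]

dist = levenshtein_sketch(["word1", "word2"], ["word1", "word3"])
```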
- Added type stub files for all functions
- added jaro similarity in `string_metric.jaro_similarity`
- added jaro winkler similarity in `string_metric.jaro_winkler_similarity`
- added Levenshtein editops in `string_metric.levenshtein_editops`
- Fixed support for set objects in `process.extract`
- Fixed inconsistent handling of empty strings
- improved performance of result creation in process.extract
- Cython ABI stability issue (#95)
- fix missing decref in case of exceptions in process.extract
- added processor support to `levenshtein` and `hamming`
- added distance support to extract/extractOne/extract_iter
- incorrect results of `normalized_hamming` and `normalized_levenshtein` when used with `utils.default_process` as processor
- Fix a bug in the mbleven implementation of the uniform Levenshtein distance and cover it with fuzz tests
- some of the newly activated warnings caused build failures in the conda-forge build
- Fixed issue in LCS calculation for partial_ratio (see #90)
- Fixed incorrect results for normalized_hamming and normalized_levenshtein when the processor `utils.default_process` is used
- Fix many compiler warnings
- add wheels for a lot of new platforms
- drop support for Python 2.7
- use `is` instead of `==` to compare functions directly by address
- Fix another ref counting issue
- Fix some issues in the Levenshtein distance algorithm (see #92)
- further improve bitparallel implementation of uniform Levenshtein distance for strings with a length > 64 (in many cases more than 50% faster)
- add more benchmarks to documentation
- add bitparallel implementation to InDel Distance (Levenshtein with the weights 1,1,2) for strings with a length > 64
- improve bitparallel implementation of uniform Levenshtein distance for strings with a length > 64
- use the InDel Distance and uniform Levenshtein distance in more cases instead of the generic implementation
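The InDel distance mentioned above (Levenshtein with weights 1,1,2) relates to the longest common subsequence by the standard identity `indel(a, b) = len(a) + len(b) - 2 * lcs(a, b)`. A minimal, non-bitparallel sketch (illustrative, not the library's implementation):

```python
# Two-row dynamic programming LCS length over any sequences.
def lcs_length(a, b):
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            # extend a match, or carry the best seen so far
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def indel_distance(a, b):
    # every symbol outside the LCS costs one insertion or deletion
    return len(a) + len(b) - 2 * lcs_length(a, b)
```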
- Directly use the Levenshtein implementation in C++ instead of using it through Python in process.*
- Fix reference counting in process.extract (see #81)
- Fix result conversion in process.extract (see #79)
- string_metric.normalized_levenshtein now supports all weights
- when different weights are used for Insertion and Deletion the strings are not swapped inside the Levenshtein implementation anymore. So different weights for Insertion and Deletion are now supported.
- replace C++ implementation with a Cython implementation. This has the following advantages:
- The implementation is less error prone, since a lot of the complex things are done by Cython
- slightly faster than the current implementation (up to 10% for some parts)
- about 33% smaller binary size
- reduced compile time
- Added **kwargs argument to process.extract/extractOne/extract_iter that is passed to the scorer
- Add max argument to hamming distance
- Add support for whole Unicode range to utils.default_process
- replaced Wagner Fischer usage in the normal Levenshtein distance with a bitparallel implementation
- The bitparallel LCS algorithm in fuzz.partial_ratio did not find the longest common substring properly in some cases. The old algorithm is used again until this bug is fixed.
- string_metric.normalized_levenshtein now supports the weights (1, 1, N) with N >= 1
- The Levenshtein distance with the weights (1, 1, >2) now uses the same implementation as the weights (1, 1, 2), since `Substitution > Insertion + Deletion` has no effect
- fix uninitialized variable in bitparallel Levenshtein distance with the weight (1, 1, 1)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, which already helped to fix some bugs and helps to prevent regressions in the future
- improved docstrings of functions
- Added bit-parallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bit-parallel implementation.
- Improved performance of `fuzz.partial_ratio`. Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0. These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- added `process.extract_iter` as a generator that yields the similarity of all elements that have a similarity >= score_cutoff
- fixed multiple bugs in extractOne when used with a scorer that's not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalization causing zero division
- utf8 usage in the copyright header caused problems with python2.7 on some platforms (see #70)
- when a custom processor like `lambda s: s` was used with any of the methods inside fuzz.* it always returned a score of 100. This release fixes this and adds better test coverage to prevent this bug in the future.
- added hamming distance metric in the levenshtein module
- improved performance of default_process by using lookup table
- Add missing virtual destructor that caused a segmentation fault on Mac Os
- C++11 Support
- manylinux wheels
- Levenshtein was not imported from __init__
- The reference count of a Python object inside process.extractOne was decremented too early
- process.extractOne exits early when a score of 100 is found. This way the other strings do not have to be preprocessed anymore.
- string objects passed to scorers had to be strings even before preprocessing them. This was changed, so they only have to be strings after preprocessing similar to process.extract/process.extractOne
- process.extractOne is now implemented in C++ making it a lot faster
- When token_sort_ratio or partial_token_sort_ratio is used in process.extractOne the words in the query are only sorted once to improve the runtime
- process.extractOne/process.extract do now return the index of the match, when the choices are a list.
- process.extractIndices got removed, since the indices are now already returned by process.extractOne/process.extract
- fix documentation of process.extractOne (see #48)
- Added wheels for:
  - CPython 2.7 on Windows 64 bit
  - CPython 2.7 on Windows 32 bit
  - PyPy 2.7 on Windows 32 bit
- fix bug in partial_ratio (see #43)
- fix inconsistency with fuzzywuzzy in partial_ratio when using strings of equal length
- MSVC has a bug and therefore crashed on some of the templates used. This Release simplifies the templates so compiling on msvc works again
- partial_ratio is using the Levenshtein distance now, which is a lot faster. Since many of the other algorithms use partial_ratio, this helps to improve the overall performance
- fix partial_token_set_ratio returning 100 all the time
- added rapidfuzz.__author__, rapidfuzz.__license__ and rapidfuzz.__version__
- do not use auto junk when searching the optimal alignment for partial_ratio
- support for Python 2.7 added (#40)
- add wheels for python2.7 (both pypy and cpython) on MacOS and Linux
- added wheels for Python3.9
- tuple scores in process.extractOne are now supported #39