okascore.c misuses Python C API, leading to odd failures #18

tseaver · 2024-05-15T16:44:19Z

E.g., see the following test:

    def test__search_wids_non_empty_wids_multiple_docs(self):

        NUM_DOCS = 23   # get past the DICT_CUTOFF size
        TEXT = 'one two three'
        family = self._getBTreesFamily()
        index = self._makeOne()

        for i in range(NUM_DOCS):
            texts = [TEXT] * ((i % 5) + 1)
            index.index_doc(i, " ".join(texts) )

        d2f = index._wordinfo[1]
        assert isinstance(d2f, family.IF.BTree)

        wids = [index._lexicon._wids[x] for x in TEXT.split()]

        relevances = index._search_wids(wids)

        self.assertEqual(len(relevances), len(wids))
        for relevance in relevances:
            self.assertTrue(isinstance(relevance[0], index.family.IF.Bucket))
            self.assertEqual(len(relevance[0]), NUM_DOCS)
            self.assertTrue(isinstance(relevance[0][1], float))
            self.assertTrue(isinstance(relevance[1], int))

When run with PURE_PYTHON set in the environment, this test passes, but without it, I see the following errors:

____ OkapiIndexTest32Impure.test__search_wids_non_empty_wids_multiple_docs _____

self = <hypatia.text.tests.test_okapiindex.OkapiIndexTest32Impure testMethod=test__search_wids_non_empty_wids_multiple_docs>

    def test__search_wids_non_empty_wids_multiple_docs(self):
    
        NUM_DOCS = 23   # get past the DICT_CUTOFF size
        TEXT = 'one two three'
        family = self._getBTreesFamily()
        index = self._makeOne()
    
        for i in range(NUM_DOCS):
            texts = [TEXT] * ((i % 5) + 1)
            index.index_doc(i, " ".join(texts) )
    
        d2f = index._wordinfo[1]
        assert isinstance(d2f, family.IF.BTree)
    
        wids = [index._lexicon._wids[x] for x in TEXT.split()]
    
>       relevances = index._search_wids(wids)

hypatia/text/tests/test_okapiindex.py:142: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <hypatia.text.okapiindex.OkapiIndex object at 0x7fe3cc5301a0>
wids = [1, 2, 3]

    def _search_wids(self, wids):
        if not wids:
            return []
        N = float(self.indexed_count())  # total # of docs
        try:
            doclen = self._totaldoclen()
        except TypeError:
            # _totaldoclen has not yet been upgraded
            doclen = self._totaldoclen
        meandoclen = doclen / N
        #K1 = self.K1
        #B = self.B
        #K1_plus1 = K1 + 1.0
        #B_from1 = 1.0 - B
    
        #                           f(D, t) * (k1 + 1)
        #   TF(D, t) =  -------------------------------------------
        #               f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D)))
    
        L = []
        docid2len = self._docweight
        for t in wids:
            d2f = self._wordinfo[t] # map {docid -> f(docid, t)}
            idf = inverse_doc_frequency(len(d2f), N)  # an unscaled float
            result = self.family.IF.Bucket()
>           score(result, _items(d2f), docid2len, idf, meandoclen)
E           KeyError: 0

hypatia/text/okapiindex.py:341: KeyError
____ OkapiIndexTest64Impure.test__search_wids_non_empty_wids_multiple_docs _____
TypeError: 'float' object cannot be interpreted as an integer

The above exception was the direct cause of the following exception:

self = <hypatia.text.tests.test_okapiindex.OkapiIndexTest64Impure testMethod=test__search_wids_non_empty_wids_multiple_docs>

    def test__search_wids_non_empty_wids_multiple_docs(self):
    
        NUM_DOCS = 23   # get past the DICT_CUTOFF size
        TEXT = 'one two three'
        family = self._getBTreesFamily()
        index = self._makeOne()
    
        for i in range(NUM_DOCS):
            texts = [TEXT] * ((i % 5) + 1)
            index.index_doc(i, " ".join(texts) )
    
        d2f = index._wordinfo[1]
        assert isinstance(d2f, family.IF.BTree)
    
        wids = [index._lexicon._wids[x] for x in TEXT.split()]
    
>       relevances = index._search_wids(wids)

hypatia/text/tests/test_okapiindex.py:142: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <hypatia.text.okapiindex.OkapiIndex object at 0x7fe3cc54e430>
wids = [1, 2, 3]

    def _search_wids(self, wids):
        if not wids:
            return []
        N = float(self.indexed_count())  # total # of docs
        try:
            doclen = self._totaldoclen()
        except TypeError:
            # _totaldoclen has not yet been upgraded
            doclen = self._totaldoclen
        meandoclen = doclen / N
        #K1 = self.K1
        #B = self.B
        #K1_plus1 = K1 + 1.0
        #B_from1 = 1.0 - B
    
        #                           f(D, t) * (k1 + 1)
        #   TF(D, t) =  -------------------------------------------
        #               f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D)))
    
        L = []
        docid2len = self._docweight
        for t in wids:
            d2f = self._wordinfo[t] # map {docid -> f(docid, t)}
            idf = inverse_doc_frequency(len(d2f), N)  # an unscaled float
            result = self.family.IF.Bucket()
>           score(result, _items(d2f), docid2len, idf, meandoclen)
E           SystemError: <built-in function score> returned a result with an exception set

hypatia/text/okapiindex.py:341: SystemError

The text was updated successfully, but these errors were encountered:

tseaver · 2024-05-15T17:07:28Z

Note that the new test passes on Python3 < 3.10.

Fix 'setup.py' to prevent building 'okascore' module if 'PURE_PYTHON' is defined. Add 'py310-pure' environment to build / run tests with that env var set. Note that the new test fails under 'py310', but passes under 'py310-pure'

* tests: demonstrate #18 Fix 'setup.py' to prevent building 'okascore' module if 'PURE_PYTHON' is defined. Add 'py310-pure' environment to build / run tests with that env var set. Note that the new test fails under 'py310', but passes under 'py310-pure' * fix: use 'PyFloat_AsDouble' for upcasts - We know that the second element of each 'd2fitems' *should* be a float already (we store it, after all, in an IF BTree). - The upcast in the second change is safe, and more regular. * ci: bump checkout, setup-python action versions - Use released Python 3.12 * ci: work around GHA macos-latest messes: - actions/setup-python#850 - actions/setup-python#860 --------- Co-authored-by: Peter Wilkinson <pw@thirdfloor.com.au>

tseaver mentioned this issue May 15, 2024

fix: okascore C api usage #19

Merged

tseaver closed this as completed in #19 May 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

okascore.c misuses Python C API, leading to odd failures #18

okascore.c misuses Python C API, leading to odd failures #18

tseaver commented May 15, 2024

tseaver commented May 15, 2024

okascore.c misuses Python C API, leading to odd failures #18

okascore.c misuses Python C API, leading to odd failures #18

Comments

tseaver commented May 15, 2024

tseaver commented May 15, 2024