Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

okascore.c misuses Python C API, leading to odd failures #18

Closed
tseaver opened this issue May 15, 2024 · 1 comment · Fixed by #19
Closed

okascore.c misuses Python C API, leading to odd failures #18

tseaver opened this issue May 15, 2024 · 1 comment · Fixed by #19

Comments

@tseaver
Copy link
Member

tseaver commented May 15, 2024

E.g., see the following test:

    def test__search_wids_non_empty_wids_multiple_docs(self):

        NUM_DOCS = 23   # get past the DICT_CUTOFF size
        TEXT = 'one two three'
        family = self._getBTreesFamily()
        index = self._makeOne()

        for i in range(NUM_DOCS):
            texts = [TEXT] * ((i % 5) + 1)
            index.index_doc(i, " ".join(texts) )

        d2f = index._wordinfo[1]
        assert isinstance(d2f, family.IF.BTree)

        wids = [index._lexicon._wids[x] for x in TEXT.split()]

        relevances = index._search_wids(wids)

        self.assertEqual(len(relevances), len(wids))
        for relevance in relevances:
            self.assertTrue(isinstance(relevance[0], index.family.IF.Bucket))
            self.assertEqual(len(relevance[0]), NUM_DOCS)
            self.assertTrue(isinstance(relevance[0][1], float))
            self.assertTrue(isinstance(relevance[1], int))

When run with PURE_PYTHON set in the environment, this test passes, but without it, I see the following errors:

____ OkapiIndexTest32Impure.test__search_wids_non_empty_wids_multiple_docs _____

self = <hypatia.text.tests.test_okapiindex.OkapiIndexTest32Impure testMethod=test__search_wids_non_empty_wids_multiple_docs>

    def test__search_wids_non_empty_wids_multiple_docs(self):
    
        NUM_DOCS = 23   # get past the DICT_CUTOFF size
        TEXT = 'one two three'
        family = self._getBTreesFamily()
        index = self._makeOne()
    
        for i in range(NUM_DOCS):
            texts = [TEXT] * ((i % 5) + 1)
            index.index_doc(i, " ".join(texts) )
    
        d2f = index._wordinfo[1]
        assert isinstance(d2f, family.IF.BTree)
    
        wids = [index._lexicon._wids[x] for x in TEXT.split()]
    
>       relevances = index._search_wids(wids)

hypatia/text/tests/test_okapiindex.py:142: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <hypatia.text.okapiindex.OkapiIndex object at 0x7fe3cc5301a0>
wids = [1, 2, 3]

    def _search_wids(self, wids):
        if not wids:
            return []
        N = float(self.indexed_count())  # total # of docs
        try:
            doclen = self._totaldoclen()
        except TypeError:
            # _totaldoclen has not yet been upgraded
            doclen = self._totaldoclen
        meandoclen = doclen / N
        #K1 = self.K1
        #B = self.B
        #K1_plus1 = K1 + 1.0
        #B_from1 = 1.0 - B
    
        #                           f(D, t) * (k1 + 1)
        #   TF(D, t) =  -------------------------------------------
        #               f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D)))
    
        L = []
        docid2len = self._docweight
        for t in wids:
            d2f = self._wordinfo[t] # map {docid -> f(docid, t)}
            idf = inverse_doc_frequency(len(d2f), N)  # an unscaled float
            result = self.family.IF.Bucket()
>           score(result, _items(d2f), docid2len, idf, meandoclen)
E           KeyError: 0

hypatia/text/okapiindex.py:341: KeyError
____ OkapiIndexTest64Impure.test__search_wids_non_empty_wids_multiple_docs _____
TypeError: 'float' object cannot be interpreted as an integer

The above exception was the direct cause of the following exception:

self = <hypatia.text.tests.test_okapiindex.OkapiIndexTest64Impure testMethod=test__search_wids_non_empty_wids_multiple_docs>

    def test__search_wids_non_empty_wids_multiple_docs(self):
    
        NUM_DOCS = 23   # get past the DICT_CUTOFF size
        TEXT = 'one two three'
        family = self._getBTreesFamily()
        index = self._makeOne()
    
        for i in range(NUM_DOCS):
            texts = [TEXT] * ((i % 5) + 1)
            index.index_doc(i, " ".join(texts) )
    
        d2f = index._wordinfo[1]
        assert isinstance(d2f, family.IF.BTree)
    
        wids = [index._lexicon._wids[x] for x in TEXT.split()]
    
>       relevances = index._search_wids(wids)

hypatia/text/tests/test_okapiindex.py:142: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <hypatia.text.okapiindex.OkapiIndex object at 0x7fe3cc54e430>
wids = [1, 2, 3]

    def _search_wids(self, wids):
        if not wids:
            return []
        N = float(self.indexed_count())  # total # of docs
        try:
            doclen = self._totaldoclen()
        except TypeError:
            # _totaldoclen has not yet been upgraded
            doclen = self._totaldoclen
        meandoclen = doclen / N
        #K1 = self.K1
        #B = self.B
        #K1_plus1 = K1 + 1.0
        #B_from1 = 1.0 - B
    
        #                           f(D, t) * (k1 + 1)
        #   TF(D, t) =  -------------------------------------------
        #               f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D)))
    
        L = []
        docid2len = self._docweight
        for t in wids:
            d2f = self._wordinfo[t] # map {docid -> f(docid, t)}
            idf = inverse_doc_frequency(len(d2f), N)  # an unscaled float
            result = self.family.IF.Bucket()
>           score(result, _items(d2f), docid2len, idf, meandoclen)
E           SystemError: <built-in function score> returned a result with an exception set

hypatia/text/okapiindex.py:341: SystemError
@tseaver
Copy link
Member Author

tseaver commented May 15, 2024

Note that the new test passes on Python3 < 3.10.

tseaver added a commit that referenced this issue May 15, 2024
Fix 'setup.py' to prevent building 'okascore' module if 'PURE_PYTHON'
is defined.

Add 'py310-pure' environment to build / run tests with that env var set.

Note that the new test fails under 'py310', but passes under 'py310-pure'
tseaver added a commit that referenced this issue May 16, 2024
Fix 'setup.py' to prevent building 'okascore' module if 'PURE_PYTHON'
is defined.

Add 'py310-pure' environment to build / run tests with that env var set.

Note that the new test fails under 'py310', but passes under 'py310-pure'
tseaver added a commit that referenced this issue May 16, 2024
* tests: demonstrate #18

Fix 'setup.py' to prevent building 'okascore' module if 'PURE_PYTHON'
is defined.

Add 'py310-pure' environment to build / run tests with that env var set.

Note that the new test fails under 'py310', but passes under 'py310-pure'

* fix: use 'PyFloat_AsDouble' for upcasts

- We know that the second element of each 'd2fitems' *should* be a float
  already (we store it, after all, in an IF BTree).

- The upcast in the second change is safe, and more regular.

* ci: bump checkout, setup-python action versions

- Use released Python 3.12

* ci: work around GHA macos-latest messes:

- actions/setup-python#850
- actions/setup-python#860

---------

Co-authored-by: Peter Wilkinson <pw@thirdfloor.com.au>
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant