Pandas UDFs that use `BatchQueue` to combine the input batches with the Python output batches are exposed to potential data corruption.
Currently `BatchProducer` and `BatchQueue` use different locks: one protects pulling a batch from the input iterator, and the other protects appending a batch to the queue. The two steps are therefore not atomic as a pair.
So in a two-threaded Python runner, there is a race when the reader thread and the writer thread both append batches to the batch queue.
One possible case is:
1. The writer thread pulls batch A from the input iterator, but is paused before appending it.
2. The reader thread then pulls the next batch, B, and appends it to the queue.
3. The writer thread resumes and appends batch A to the queue.
Batches A and B are therefore in reversed order in the queue, which leads to data
corruption when the queued input batches are combined with the Python output batches.
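
The interleaving above can be reproduced with a small Python sketch. This is only an illustration, not the project's code: `BatchSource`, its methods, and the `pause` hook are hypothetical names, and the events exist only to force the problematic schedule deterministically. The single-lock variant shows one possible remedy (making pull and append one critical section); it is an assumption, not necessarily the fix that was adopted.

```python
import threading
from collections import deque


class BatchSource:
    """Hypothetical stand-in for an input iterator feeding a batch queue."""

    def __init__(self, batches):
        self._it = iter(batches)
        self.queue = deque()
        # Two separate locks, mirroring the situation described in the issue.
        self._pull_lock = threading.Lock()
        self._append_lock = threading.Lock()
        # One lock covering both steps, used by the atomic variant.
        self._combined_lock = threading.Lock()

    def pull_then_append_racy(self, pause=None):
        with self._pull_lock:
            batch = next(self._it)
        # Window between the two critical sections: another thread can
        # pull AND append a later batch here.
        if pause is not None:
            pause()
        with self._append_lock:
            self.queue.append(batch)

    def pull_and_append_atomic(self):
        # Pull and append inside one critical section, so queue order
        # always matches pull order.
        with self._combined_lock:
            self.queue.append(next(self._it))


def demo_racy():
    src = BatchSource(["A", "B"])
    writer_has_batch = threading.Event()
    reader_done = threading.Event()

    def writer_pause():
        writer_has_batch.set()  # writer holds batch A, not yet appended
        reader_done.wait()      # "paused" until the reader has appended B

    def writer():
        src.pull_then_append_racy(pause=writer_pause)

    def reader():
        writer_has_batch.wait()
        src.pull_then_append_racy()
        reader_done.set()

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(src.queue)


def demo_atomic():
    src = BatchSource(["A", "B"])
    threads = [threading.Thread(target=src.pull_and_append_atomic) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(src.queue)


if __name__ == "__main__":
    print(demo_racy())    # queue order differs from pull order
    print(demo_atomic())  # queue order matches pull order
```

With the forced schedule, the racy variant ends up with B ahead of A in the queue, while the single-lock variant keeps the queue in pull order regardless of which thread runs first.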
Thanks to @GaryShen2008 for finding this.