Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

BUG: read_csv with date_parser lock file open on failure #15302

Closed
rsheftel opened this issue Feb 3, 2017 · 18 comments
Closed

BUG: read_csv with date_parser lock file open on failure #15302

rsheftel opened this issue Feb 3, 2017 · 18 comments
Labels
good first issue IO CSV read_csv, to_csv Needs Tests Unit test(s) needed to prevent regressions

Comments

@rsheftel
Copy link

rsheftel commented Feb 3, 2017

Problem description

When using the date_parser functionality of read_csv() if the file read fails then the file is left locked open. My use case is that I am trying to enforce a strict datetime format that must include the time zone offset. Another thread stated that the way to accomplish this is with date_parser. My issue is that I would like to move a file that fails being loaded to another directory, but I cannot because the failure in the date_parser keeps the file open until the python session is terminated.

Code Sample, a copy-pastable example if possible

import pandas as pd
import datetime

def strict_parser(dates):
    datetimes = [pd.Timestamp(datetime.datetime.strptime(date, '%Y-%m-%d %H:%M:%S%z'), tz='UTC') for date in dates]
    return pd.DatetimeIndex(datetimes)

filename = 'c:/temp/data.csv'
data = pd.read_csv(filename, parse_dates=['datetime'], index_col=['datetime'], date_parser=strict_parser)

The file data.csv is:

datetime,data
2010-05-05 09:30:00-0500,10
2010-05-05 09:35:00-0500,20
2010-05-05 09:40:00,30

Output of pd.show_versions()

pandas: 0.19.1

@jreback
Copy link
Contributor

jreback commented Feb 3, 2017

We had a very similar issue, can't seem to find this now. But on windows, this is fixed I think in 0.19.2, but maybe 0.20.0. (still in dev)

can you try?

@rsheftel
Copy link
Author

rsheftel commented Feb 5, 2017

I tried with 0.19.2 and the problem remains. When 0.20.0 becomes available on one of my systems I will give that a try.

@jreback
Copy link
Contributor

jreback commented Feb 6, 2017

I thought this was solved, but was able to repro on windows.

So it seems that we are not closing on calling a date_parser function (we raise, but the top-level needs a try/finally around this to close the handle). We do this for other errors, this must have been missed.

@rsheftel would you like to do a pull-request to fix?

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions Difficulty Novice IO CSV read_csv, to_csv labels Feb 6, 2017
@jreback jreback added this to the 0.20.0 milestone Feb 6, 2017
@jreback jreback changed the title read_csv with date_parser lock file open on failure BUG: read_csv with date_parser lock file open on failure Feb 6, 2017
@rsheftel
Copy link
Author

rsheftel commented Feb 6, 2017

I don't think I am enough of an expert in the pandas code to submit a change that fixes this. Do you want me to just create a blank pull-request? (Sorry I'm new to the GitHub / collaborative world and how exactly it works)

@jreback
Copy link
Contributor

jreback commented Feb 6, 2017

no this would be a regular pull-request that has tests and makes the change.

http://pandas.pydata.org/pandas-docs/stable/contributing.html

@rsheftel
Copy link
Author

If I gain expertise in the code base and feel I can confidently contribute I will. Thanks.

@cpcloud
Copy link
Member

cpcloud commented Feb 27, 2017

I looked into this with the help of @faizanv and finally got to the bottom of what's happening in pandas, but I'm still trying to figure out what's happening in the C API.

The proximate cause of this issue is that when we're reading a standard CSV file from a user provided path, we open a file handle in C (using fopen) and do not close it if there's an exception.

If there's no exception (or we catch the exception) the file handle is closed by a call to parser_free() (which calls del_file_source -- this function contains a call to fclose) in TextReader.__dealloc__. __dealloc__ is a special Cython method that more or less sits in the tp_dealloc slot of PyTypeObjects.

It's not clear to me why it gets called when we catch the exception versus when we don't.

@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

hmm, I think we wrapped most/all of the calls for the low-level .read in try/excepts to do exactly this (in fact you get ResourceWarnings in py3 if you don't close things).

All that said @cpcloud something fishy is going on. maybe @gfyoung has an idea.

@cpcloud
Copy link
Member

cpcloud commented Feb 27, 2017

@jreback The issue is that Python knows nothing about any calls to fopen or fclose.

@cpcloud
Copy link
Member

cpcloud commented Feb 27, 2017

We do something like this:

try:
    rows = parser.read()
finally:
    parser.close()

but we don't track the file handle of files opened by the parser_t* related code so we only close it when __dealloc__ is called. Since __dealloc__ isn't called when there's an exception, the handle never closes.

@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

you could try explicitly closing IN the actual parser when the .close method is called (ios do cleanup), but would prob have to set a flag so that you don't do it again in dealloc

@cpcloud
Copy link
Member

cpcloud commented Feb 27, 2017

:) Funny you mention that. That is my current working solution

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@mroeschke mroeschke added Bug and removed Compat pandas objects compatability with Numpy or Python functions labels Apr 10, 2020
@twoertwein
Copy link
Member

I'm quite sure this should work with 1.3.2, but I'm not sure whether the following is a sufficient test for this:

PYTHONWARNINGS="default" python test.py

test.py:

from io import StringIO

import pandas as pd


def strict_parser(dates):
    assert False


with StringIO("a,b,c,datetime") as file:
    pd.read_csv(
        file,
        parse_dates=["datetime"],
        index_col=["datetime"],
        date_parser=strict_parser,
    )

This should print a ResourceWarning (even on unix) if we do not close the file handle.

@twoertwein twoertwein added the Needs Tests Unit test(s) needed to prevent regressions label Sep 12, 2021
@roberthdevries
Copy link
Contributor

Cannot reproduce anymore with v1.5.0.dev0 on Linux.
I modified the test script on line 6 slightly to adapt to the current pandas version and added a manual test to determine open file descriptors.
Attached the test script.

import pandas as pd
import datetime
import psutil

def strict_parser(dates):
    datetimes = [pd.to_datetime(datetime.datetime.strptime(date, '%Y-%m-%d %H:%M:%S%z'), utc=True) for date in dates]
    return pd.DatetimeIndex(datetimes)

filename = 'bug-15302.csv'

proc = psutil.Process()
print("open files", proc.open_files())

try:
    data = pd.read_csv(filename, parse_dates=['datetime'], index_col=['datetime'], date_parser=strict_parser)
    print(data)
except Exception as e:
    print("------------------------------------------------------------------")
    print(e)
    print("------------------------------------------------------------------")

print("==============================================")
print("open files", proc.open_files())

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@LeilaShahmoradi
Copy link

take

@jasonmokk
Copy link
Contributor

take

@aabhinavg
Copy link

Issue Description:

Here what I observed is that when we are using the date_parser functionality of read_csv(), if the file read fails due to timezone which has a missing offset in the datetime strings, the file remains locked open, preventing it from being moved to another directory. This occurs because the failure in the date_parser keeps the file open until the Python session is terminated.

Proposed Solution:

To handle this issue gracefully, we can modify the strict_parser function to attempt parsing with timezone offset and, if that fails, try parsing without timezone offset. This approach ensures that files with datetime strings lacking timezone offset are parsed correctly.

def strict_parser(dates):
    try:
        # Attempt to parse with timezone offset
        datetimes = [pd.Timestamp(datetime.datetime.strptime(date, '%Y-%m-%d %H:%M:%S%z'), tz='UTC') for date in dates]
        return pd.DatetimeIndex(datetimes)
    except ValueError:
        # If parsing with timezone offset fails, try parsing without timezone offset
        datetimes = [pd.Timestamp(datetime.datetime.strptime(date, '%Y-%m-%d %H:%M:%S'), tz='UTC') for date in dates]
        return pd.DatetimeIndex(datetimes)

@jasonmokk jasonmokk removed their assignment May 5, 2024
@mroeschke
Copy link
Member

date_parser will be removed in pandas 3.0 so closing as a won't fix #51019

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
good first issue IO CSV read_csv, to_csv Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

No branches or pull requests