Extremely slow line iteration on Windows/ProactorEventLoop #47

Open
lych77 opened this issue Aug 31, 2018 · 6 comments
lych77 commented Aug 31, 2018

Windows 7 64-bit, Python 3.7.0 64-bit, aiofiles 0.4.0

Test program:

import asyncio, aiofiles, timeit

with open('sample.txt', 'w') as f:
	for i in range(10000):
		print('A'*100, file=f)

loop = asyncio.ProactorEventLoop()
asyncio.set_event_loop(loop)

def traditional_read():
	with open('sample.txt', encoding='utf-8') as f:
		for ln in f:
			pass

async def aiofiles_read():
	async with aiofiles.open('sample.txt', encoding='utf-8') as f:
		async for ln in f:
			pass

print('traditional:', timeit.timeit('traditional_read()', number=1, globals=globals()))
print('aiofiles:', timeit.timeit('loop.run_until_complete(aiofiles_read())', number=1, globals=globals()))

Output:

traditional: 0.005477579999999982
aiofiles: 1.563328274

The sample file is not very big (less than 1 MB), but it contains quite a lot of lines. Is the slowdown caused by excessive thread context switches? Can it be avoided somehow?
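For what it's worth, one possible workaround (a minimal stdlib-only sketch, not part of the aiofiles API; the function names here are made up): since the cost is the per-line executor dispatch, doing a single blocking read in the executor and splitting lines in-process replaces N round trips with one, assuming the file fits in memory.

```python
import asyncio

def _read_all(path):
    # Plain blocking read; runs on the executor thread.
    with open(path, encoding='utf-8') as f:
        return f.read()

async def fast_line_iter(path):
    # One executor round trip for the whole file instead of one per line.
    loop = asyncio.get_running_loop()
    data = await loop.run_in_executor(None, _read_all, path)
    return data.splitlines()
```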

@Jitsusama
I'm noticing the same thing. I have a 240 MB file that I'm trying to read in, and I've tried three different approaches: first, reading lines with the builtin open function and inserting an asyncio.sleep call after each line; second, using aiofiles; third, reading synchronously with the builtin open function. Here's the code:

import asyncio
import logging
import aiofiles

logging.basicConfig(format='%(asctime)s %(message)s')
logging.getLogger().setLevel(level=logging.INFO)

async def async_read_with_aiofiles(file_name: str):
    line_count = 0
    logging.info('starting async aiofiles')
    async with aiofiles.open(file=file_name, mode='r', newline='\r\n') as file:
        async for _ in file:
            line_count += 1
    logging.info(f'read {line_count} lines with async aiofiles')

async def async_read_with_stdio(file_name: str):
    line_count = 0
    logging.info('starting async stdio')
    with open(file=file_name, mode='r', newline='\r\n') as file:
        for _ in file:
            line_count += 1
            await asyncio.sleep(0)
    logging.info(f'read {line_count} lines with async stdio')

def sync_read_with_stdio(file_name: str):
    line_count = 0
    logging.info('starting sync stdio')
    with open(file=file_name, mode='r', newline='\r\n') as file:
        for _ in file:
            line_count += 1
    logging.info(f'read {line_count} lines with sync stdio')

loop = asyncio.get_event_loop()

Here's the results of running it on a ~240MB file through each of the 3 options:

>>> sync_read_with_stdio(file_name='large_file')
2018-11-28 11:04:21,546 starting sync stdio
2018-11-28 11:04:22,935 read 2769652 lines with sync stdio
>>> loop.run_until_complete(async_read_with_stdio(file_name='large_file'))
2018-11-28 11:06:05,393 starting async stdio
2018-11-28 11:06:57,863 read 2769652 lines with async stdio
>>> loop.run_until_complete(async_read_with_aiofiles(file_name='large_file'))
2018-11-28 11:07:14,362 starting async aiofiles
2018-11-28 11:14:26,550 read 2769652 lines with async aiofiles

BTW, I'm on the exact same platform and versions as @lych77.

ccosby commented Apr 30, 2019

Both SelectorEventLoop and ProactorEventLoop show the same behavior. It's just plain slow. I used the test code from @lych77.

MacOS / Python-3.7.3 / aiofiles-0.4.0 / SelectorEventLoop

traditional: 0.001899950000000039
aiofiles: 1.288835948

RHEL-7.6 / Python-3.7.3 / aiofiles-0.4.0 / SelectorEventLoop

traditional: 0.0033165980130434036
aiofiles: 2.984602615237236

espdev commented Jul 14, 2019

I have the same problem with aiofiles.

Here is my simple benchmark and the results on Linux with uvloop:

import time
import asyncio
import aiofiles
import uvloop

uvloop.install()


async def _async_file_reader(file_name):
    async with aiofiles.open(file_name, 'r') as fp:
        async for line in fp:
            yield line

async def _async_write_file(file_name, lines):
    async with aiofiles.open(file_name, 'w') as fp:
        for line in lines:
            await fp.write(f"{line}\n")
        await fp.flush()

def _file_reader(file_name):
    with open(file_name, 'r') as fp:
        for line in fp:
            yield line

def _write_file(file_name, lines):
    with open(file_name, 'w') as fp:
        for line in lines:
            fp.write(f"{line}\n")
        fp.flush()

async def async_main(file_name):
    ts = time.time()
    lines = []

    async for line in _async_file_reader(file_name):
        lines.append(line)

    te = time.time() - ts
    print(f'ASYNC FILE READING: {te:.2f} sec')

    ts = time.time()
    await _async_write_file('/tmp/bbb.txt', lines)
    te = time.time() - ts
    print(f'ASYNC FILE WRITING: {te:.2f} sec')


def main(file_name):
    ts = time.time()
    lines = []

    for line in _file_reader(file_name):
        lines.append(line)

    te = time.time() - ts
    print(f'FILE READING: {te:.2f} sec')

    ts = time.time()
    _write_file('/tmp/aaa.txt', lines)
    te = time.time() - ts
    print(f'FILE WRITING: {te:.2f} sec')


if __name__ == '__main__':
    file_name = 'lines.txt'

    main(file_name)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(async_main(file_name))
    loop.close()

Output for a file with 10K lines:

FILE READING: 0.01 sec
FILE WRITING: 0.06 sec
ASYNC FILE READING: 2.90 sec
ASYNC FILE WRITING: 2.68 sec

In my real code, aiofiles is ~14000x slower than non-async I/O when I use aiofiles with async for in a reader -> chunker -> concurrent_workers pipeline.

The problem may be in this code, which is called for every line when async for is used:

def _make_delegate_method(attr_name):
    @asyncio.coroutine
    def method(self, *args, **kwargs):
        cb = functools.partial(getattr(self._file, attr_name),
                               *args, **kwargs)
        return (yield from self._loop.run_in_executor(self._executor, cb))
    return method

@asyncio.coroutine
def __anext__(self):
    """Simulate normal file iteration."""
    line = yield from self.readline()
    if line:
        return line
    else:
        raise StopAsyncIteration

Therefore, I think async for is not practical here; read() should be used instead.
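For files too large to read whole, the same idea works per chunk (a hedged stdlib sketch, not part of the aiofiles API; the helper names and default chunk_size are made up): read large chunks through the executor and split lines in-process, so the per-call dispatch cost is amortized over thousands of lines.

```python
import asyncio

def _read_chunk(f, size):
    # Blocking read of one chunk; runs on the executor thread.
    return f.read(size)

async def iter_lines_chunked(path, chunk_size=1 << 20):
    # One executor round trip per chunk instead of one per line.
    loop = asyncio.get_running_loop()
    with open(path, encoding='utf-8') as f:
        tail = ''
        while True:
            chunk = await loop.run_in_executor(None, _read_chunk, f, chunk_size)
            if not chunk:
                break
            lines = (tail + chunk).split('\n')
            tail = lines.pop()  # possibly incomplete last line
            for ln in lines:
                yield ln
        if tail:
            yield tail
```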

Bobronium commented Jul 14, 2019

The reason for this issue is that for each line aiofiles starts a new task in the executor, which is a huge overhead just to read one line:

loop.run_in_executor(executor, file.readline)

The code above is just for demonstration; the original code is:

@delegate_to_executor('close', 'flush', 'isatty', 'read', 'readable',
                      'readline', 'readlines', 'seek', 'seekable', 'tell',

def delegate_to_executor(*attrs):
    def cls_builder(cls):
        for attr_name in attrs:
            setattr(cls, attr_name, _make_delegate_method(attr_name))
        return cls
    return cls_builder

def _make_delegate_method(attr_name):
    @asyncio.coroutine
    def method(self, *args, **kwargs):
        cb = functools.partial(getattr(self._file, attr_name),
                               *args, **kwargs)
        return (yield from self._loop.run_in_executor(self._executor, cb))
    return method
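That per-call dispatch cost is easy to measure directly (a rough sketch; absolute numbers are machine-dependent, but even tens of microseconds per dispatch add up over millions of lines):

```python
import asyncio
import time

async def executor_dispatch_overhead(n=1000):
    # Time n no-op round trips through the default executor.
    # This per-call cost is paid for every single readline().
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    for _ in range(n):
        await loop.run_in_executor(None, lambda: None)
    return (time.perf_counter() - start) / n
```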

espdev commented Jul 14, 2019

Thanks @MrMrRobat for investigating.

tomchristie pushed a commit to encode/starlette that referenced this issue May 6, 2020
* Tinche/aiofiles#47
Thread pool is no longer used for files in memory

* fix tests

* fix import sorted

* little change
@denravonska

Reading binary files also seems to be super slow. I'm streaming files from a VM to a server on the host OS and I only get ~1-1.6 Mbit/s.
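For binary streaming the same amortization applies (a stdlib-only sketch, not the aiofiles API; the helper name and chunk_size default are made up): reading larger blocks means fewer executor round trips per megabyte streamed.

```python
import asyncio

def _read_block(f, size):
    # Blocking read of one block; runs on the executor thread.
    return f.read(size)

async def stream_file(path, chunk_size=1 << 16):
    # Fewer, larger executor round trips per megabyte streamed.
    loop = asyncio.get_running_loop()
    with open(path, 'rb') as f:
        while True:
            block = await loop.run_in_executor(None, _read_block, f, chunk_size)
            if not block:
                break
            yield block
```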
