Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Remove SerializebleLock in to_snowflake #39

Merged
merged 11 commits into from
Jul 17, 2023
Merged

Remove SerializebleLock in to_snowflake #39

merged 11 commits into from
Jul 17, 2023

Conversation

phobson
Copy link
Contributor

@phobson phobson commented Nov 23, 2022

Fixes #29

Just testing things out for now

@phobson
Copy link
Contributor Author

phobson commented Nov 23, 2022

With the larger dataframe, the partition sizes when we read back from snowflake are coming back a bit erratically sized. This is without any changes to core.py

Locally, I get:

(Pdb) from dask.utils import format_bytes
(Pdb) partition_sizes.map(format_bytes).to_frame("result").assign(expected="2 MiB")
        result expected
0     1.60 MiB    2 MiB
1     1.71 MiB    2 MiB
2     2.18 MiB    2 MiB
3     3.51 MiB    2 MiB
4     1.60 MiB    2 MiB
5     1.71 MiB    2 MiB
6     2.18 MiB    2 MiB
7     4.36 MiB    2 MiB
8     1.39 MiB    2 MiB
9   875.77 kiB    2 MiB
10    1.71 MiB    2 MiB
11    2.18 MiB    2 MiB
12    3.72 MiB    2 MiB
13    1.60 MiB    2 MiB
14    1.70 MiB    2 MiB
15    2.18 MiB    2 MiB
16    1.69 MiB    2 MiB
17    1.28 MiB    2 MiB
18    1.70 MiB    2 MiB
19    2.18 MiB    2 MiB
20    4.37 MiB    2 MiB
21    1.82 MiB    2 MiB
22    1.70 MiB    2 MiB
23    2.18 MiB    2 MiB
24    3.30 MiB    2 MiB
25    1.60 MiB    2 MiB
26    1.71 MiB    2 MiB
27    2.18 MiB    2 MiB
28    4.37 MiB    2 MiB
29    2.79 MiB    2 MiB
30    1.60 MiB    2 MiB
31    1.71 MiB    2 MiB
32    2.18 MiB    2 MiB
33    4.37 MiB    2 MiB
34    1.29 MiB    2 MiB

@phobson
Copy link
Contributor Author

phobson commented Nov 23, 2022

Push a commit to remove the SerializableLock. The difference in test durations is substantial:

Without 399da14 (status quo):

============================================= slowest 10 durations =============================================
708.74s call     dask_snowflake/tests/test_core.py::test_result_batching_nparitions_and_equality[twelve months]
664.69s call     dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[twelve months]
76.58s call     dask_snowflake/tests/test_core.py::test_result_batching_nparitions_and_equality[one month]
63.95s call     dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[one month]
9.74s call     dask_snowflake/tests/test_core.py::test_application_id_config
8.94s call     dask_snowflake/tests/test_core.py::test_application_id_default
8.27s call     dask_snowflake/tests/test_core.py::test_application_id_explicit
7.54s call     dask_snowflake/tests/test_core.py::test_write_read_roundtrip
7.01s call     dask_snowflake/tests/test_core.py::test_execute_params
6.94s call     dask_snowflake/tests/test_core.py::test_read_empty_result
=========================================== short test summary info ============================================
XFAIL dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[twelve months] - inconsistent partition sizing
=========================== 11 passed, 1 xfailed, 46 warnings in 1601.74s (0:26:41) ============================

With 399da14 (this PR):

============================================= slowest 10 durations =============================================
175.20s call     dask_snowflake/tests/test_core.py::test_result_batching_nparitions_and_equality[twelve months]
115.40s call     dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[twelve months]
27.99s call     dask_snowflake/tests/test_core.py::test_result_batching_nparitions_and_equality[one month]
26.80s call     dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[one month]
7.11s call     dask_snowflake/tests/test_core.py::test_application_id_explicit
7.11s call     dask_snowflake/tests/test_core.py::test_write_read_roundtrip
7.11s call     dask_snowflake/tests/test_core.py::test_application_id_config_on_cluster
6.97s call     dask_snowflake/tests/test_core.py::test_execute_params
6.93s call     dask_snowflake/tests/test_core.py::test_arrow_options
6.71s call     dask_snowflake/tests/test_core.py::test_application_id_default
=========================================== short test summary info ============================================
XFAIL dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[twelve months] - inconsistent partition sizing
============================ 11 passed, 1 xfailed, 46 warnings in 425.89s (0:07:05) ============================

@phobson phobson marked this pull request as ready for review November 23, 2022 22:30
@phobson
Copy link
Contributor Author

phobson commented Nov 23, 2022

@dchudz
Copy link

dchudz commented May 15, 2023

@jrbourbeau is this something we should revisit?

@fjetter
Copy link
Member

fjetter commented May 16, 2023

If the threading problem is resolved, I'd recommend removing the parallel=1 kwarg as well. Since everything is IO bound we should be able to insert with many more threads than we have CPUs.

@jrbourbeau jrbourbeau mentioned this pull request May 17, 2023
@jrbourbeau jrbourbeau changed the title Seeing if we can remove the SerializebleLock to speed up writes Remove SerializebleLock in to_snowflake May 26, 2023
@fjetter
Copy link
Member

fjetter commented Jul 10, 2023

@jrbourbeau are you still on this?

@phobson
Copy link
Contributor Author

phobson commented Jul 27, 2023

Heyyy look that it. Finally merged! 🎉

@mrocklin
Copy link
Member

You just gotta plant a lotta seeds 🙂

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Questioning lock necessity in write_snowflake
5 participants