Pre-calculate total geo area and time max limit for project groups #780
Conversation
Force-pushed from 4e135bc to b4f335a.
Force-pushed from 9806151 to 5f8c32f.
    required_count,
    progress,
    project_type_specifics
)
I would have expected that we add the total group area and max time here as well? Or is this done in another step?
We are using Django to calculate this part; I have also added a comment mentioning that.
Calculating it here can take some time, and it also doesn't work for existing projects.
So for now, doing this in Django made more sense, since it is calculated and maintained by the aggregated logic.
Mhh okay, I'm not sure I prefer splitting it up this way. Could you investigate how long this would take for a single project during project creation?
I think it's "just" a matter of adjusting the SQL statement slightly and summing up the task area per group with PostGIS. @ElJocho might be able to advise here as well.
Ideally, the group information could be set directly when setting up the project; then we would not need to rely on Django to perform additional steps at some point in time.
For backtrack calculation of existing projects your approach is good. I'm just concerned about new projects.
It is already difficult to understand which information is set for which project, e.g. considering the various project types. @ElJocho has worked towards refactoring this part of the code so that it will be easier to learn which attributes are expected.
That makes sense; I wanted to add this to the group as well. Considering the ongoing refactor and the complex structure right now, I wanted to make minimal changes to the mapswipe_workers core part. Currently, the calculation/test parts also fit well in the aggregated module, since they are tightly coupled with it and only used there.
Maybe @ElJocho can move this aggregated logic to the core part during the refactor. The SQL query is already defined; it would just need to be added to the core logic, perhaps with further tests, which fits nicely with the refactoring work.
python-mapswipe-workers/django/apps/aggregated/management/commands/update_aggregated_data.py, lines 22 to 74 in 5f8c32f:
UPDATE_PROJECT_GROUP_DATA = f"""
    WITH to_calculate_groups AS (
        SELECT
            project_id,
            group_id
        FROM groups
        WHERE
            (project_id, group_id) in (
                SELECT
                    MS.project_id,
                    MS.group_id
                FROM mapping_sessions MS
                WHERE
                    MS.start_time >= %(from_date)s
                    AND MS.start_time < %(until_date)s
                GROUP BY MS.project_id, MS.group_id
            ) AND (
                total_area is NULL OR time_spent_max_allowed is NULL
            )
    ),
    groups_data AS (
        SELECT
            T.project_id,
            T.group_id,
            SUM(  -- sqkm
                ST_Area(T.geom::geography(GEOMETRY,4326)) / 1000000
            ) as total_task_group_area,
            (
                CASE
                    -- Using 95_percent value of existing data for each project_type
                    WHEN P.project_type = {Project.Type.BUILD_AREA.value} THEN 1.4
                    WHEN P.project_type = {Project.Type.COMPLETENESS.value} THEN 1.4
                    WHEN P.project_type = {Project.Type.CHANGE_DETECTION.value} THEN 11.2
                    -- FOOTPRINT: Not calculated right now
                    WHEN P.project_type = {Project.Type.FOOTPRINT.value} THEN 6.1
                    ELSE 1
                END
            ) * COUNT(*) as time_spent_max_allowed
        FROM tasks T
            INNER JOIN to_calculate_groups G USING (project_id, group_id)
            INNER JOIN projects P USING (project_id)
        GROUP BY project_id, P.project_type, group_id
    )
    UPDATE groups G
    SET
        total_area = GD.total_task_group_area,
        time_spent_max_allowed = GD.time_spent_max_allowed
    FROM groups_data GD
    WHERE
        G.project_id = GD.project_id AND
        G.group_id = GD.group_id;
"""
> For backtrack calculation of existing projects your approach is good. I'm just concerned about new projects.
The aggregated module will calculate this for both old and new projects. Basically, if a project has any swipe data, the aggregated module will take care of it.
Perfect, sounds good to me then.
Okay. I have merged this PR and created a new ticket (and assigned @ElJocho there).
Force-pushed from 5f8c32f to 1798493.
Changes
Pre-calculate the total geo area and max allowed time per group and save them in the groups table for use in aggregated data calculation. We are currently calculating this on the fly for each calculation run.

Why
Caching the geo areas decreases stat calculation time significantly.
For example, for the whole of the 2022 data (calculating one month at a time), it took more than 5 days. With the current implementation, it took around 1-2 hours, and the majority of that time was spent on the initial cache calculation.
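As an illustration of that month-at-a-time backfill, a hypothetical loop over 2022 using the illustrative update_project_group_data helper sketched earlier (the half-open date ranges match the query's start_time filter):

from datetime import date

# Hypothetical backfill sketch: process 2022 one month at a time.
for month in range(1, 13):
    from_date = date(2022, month, 1)
    until_date = date(2023, 1, 1) if month == 12 else date(2022, month + 1, 1)
    update_project_group_data(from_date, until_date)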
Deployment steps