Pre-calculate total geo area and time max limit for project groups #780
Conversation
Force-pushed from 4e135bc to b4f335a.
Force-pushed from 9806151 to 5f8c32f.
    required_count,
    progress,
    project_type_specifics
)
I would have expected that we add the total group area and max time here as well? Or is this done in another step?
We are using Django to calculate this part; I have also added a comment mentioning that.
Calculating it here can take some time, and it also doesn't work for existing projects.
So for now, doing this in Django made more sense, since it is calculated and maintained by the aggregated logic.
Mhh okay, I'm not sure I prefer splitting it up this way. Could you investigate how long this would take for a single project during project creation?
I think it's "just" a matter of adjusting the SQL statement slightly and summing up the task area per group with PostGIS. @ElJocho might be able to advise here as well.
Ideally, the group information could be set directly when setting up the project; then we would not need to rely on Django to perform additional steps at some point in time.
For backtrack calculation of existing projects your approach is good. I'm just concerned about new projects.
It is already difficult to understand which information is set for which project, e.g. considering the various project types. @ElJocho has worked towards refactoring this part of the code so that it will be easier to learn which attributes are expected.
That makes sense; I wanted to add this to the group as well. Considering the ongoing refactor and the complex structure right now, I wanted to make minimal changes to the mapswipe_workers core part. Currently, the calculation/test parts also fit well in the aggregated module, since they are tightly coupled with it and only used there.
Maybe @ElJocho can move this aggregated logic to the core part during the refactor. The SQL query is already defined; it would just need to be added to the core logic, perhaps with further tests, which fits nicely with the refactoring work.
python-mapswipe-workers/django/apps/aggregated/management/commands/update_aggregated_data.py, lines 22 to 74 in 5f8c32f:
UPDATE_PROJECT_GROUP_DATA = f"""
    WITH to_calculate_groups AS (
        SELECT
            project_id,
            group_id
        FROM groups
        WHERE
            (project_id, group_id) in (
                SELECT
                    MS.project_id,
                    MS.group_id
                FROM mapping_sessions MS
                WHERE
                    MS.start_time >= %(from_date)s
                    AND MS.start_time < %(until_date)s
                GROUP BY MS.project_id, MS.group_id
            ) AND (
                total_area is NULL OR time_spent_max_allowed is NULL
            )
    ),
    groups_data AS (
        SELECT
            T.project_id,
            T.group_id,
            SUM(  -- sqkm
                ST_Area(T.geom::geography(GEOMETRY,4326)) / 1000000
            ) as total_task_group_area,
            (
                CASE
                    -- Using 95_percent value of existing data for each project_type
                    WHEN P.project_type = {Project.Type.BUILD_AREA.value} THEN 1.4
                    WHEN P.project_type = {Project.Type.COMPLETENESS.value} THEN 1.4
                    WHEN P.project_type = {Project.Type.CHANGE_DETECTION.value} THEN 11.2
                    -- FOOTPRINT: Not calculated right now
                    WHEN P.project_type = {Project.Type.FOOTPRINT.value} THEN 6.1
                    ELSE 1
                END
            ) * COUNT(*) as time_spent_max_allowed
        FROM tasks T
            INNER JOIN to_calculate_groups G USING (project_id, group_id)
            INNER JOIN projects P USING (project_id)
        GROUP BY project_id, P.project_type, group_id
    )
    UPDATE groups G
    SET
        total_area = GD.total_task_group_area,
        time_spent_max_allowed = GD.time_spent_max_allowed
    FROM groups_data GD
    WHERE
        G.project_id = GD.project_id AND
        G.group_id = GD.group_id;
"""
> For backtrack calculation of existing projects your approach is good. I'm just concerned about new projects.
The aggregated module will calculate this for both old and new projects. Basically, if a project has any swipe data, the aggregated module will take care of it.
Perfect, sounds good to me then.
Okay. I have merged this PR and created a new ticket (and assigned @ElJocho there).
Force-pushed from 5f8c32f to 1798493.
Changes
Pre-calculate the total geo area and max allowed time per group and save them in the groups table for use in aggregated data calculation. We are currently calculating this on the fly for each calculation run.

Why
Caching the geo areas decreases stat calculation time significantly.
For example, for the whole of the 2022 data (calculating one month at a time), it took more than 5 days. With the current implementation, it took around 1-2 hours, and the majority of that time was spent on the initial cache calculation.
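As an illustration of that month-at-a-time backfill, a hypothetical loop over 2022 using the illustrative update_project_group_data helper sketched earlier (the half-open date ranges match the query's start_time filter):

from datetime import date

# Hypothetical backfill sketch: process 2022 one month at a time.
for month in range(1, 13):
    from_date = date(2022, month, 1)
    until_date = date(2023, 1, 1) if month == 12 else date(2022, month + 1, 1)
    update_project_group_data(from_date, until_date)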
Deployment steps