New plan for trace storage work #10

Open
41 of 60 tasks
StachuDotNet opened this issue Mar 1, 2023 · 2 comments

StachuDotNet commented Mar 1, 2023

DB Clone

  • prototype the DB clone (go through the steps, record "down times")
    • events table (not v2)
      • verify that nothing is querying the events (not v2) table (see the query sketch after this list)
      • look at the events table and see if it's using any foreign keys
        • if any, remove the foreign keys
      • delete the events table from dark-west
    • do the DB clone
    • update the DB clone
      • set zoning to single-zone (applies to both servers and storage)
      • update postgres version to 14
        conclusion: probably brings unnecessary risk
      • turn down the CPUs by ~1/3
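
A rough sketch of the pre-clone checks above, assuming a standard Postgres setup; the table name comes from this issue, everything else is built-in catalog views:

```sql
-- Has anything touched the old events (not v2) table recently?
select seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
from pg_stat_user_tables
where relname = 'events';

-- Which foreign-key constraints involve the events table (either direction)?
select conname,
       conrelid::regclass  as on_table,
       confrelid::regclass as references_table
from pg_constraint
where contype = 'f'
  and (conrelid = 'events'::regclass or confrelid = 'events'::regclass);
```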

Drop Events table

  • drop events table in the codebase
    • review usages of "events" in codebase - see if we're missing anything
    • investigate connection to worker_stats_v1
    • write migration script (drop if still exists) to drop events
    • update tests if they were somehow referencing events
    • update clear-canvas script to not reference events table
      (note: apparently we weren't clearing events_v2!)
    • do we need to merge any changes before we drop the events table in prod?
      • Yes.
  • drop events table in production (sketched as SQL after this list)
    • set lock_timeout = '1s'
    • set statement_timeout = '1s'
    • alter table events drop constraint events_canvas_id_fkey
    • alter table events drop constraint events_account_id_fkey
    • drop index concurrently if exists idx_events_for_dequeue
    • drop index concurrently if exists idx_events_for_dequeue2
    • truncate events table
    • drop events table
  • merge the migration
  • copy all of the above from stable-dark to dark
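
A sketch of the production drop as one script, assuming the steps above are run via psql against prod; the names and ordering are taken directly from this issue:

```sql
-- short timeouts so we never sit on a lock
set lock_timeout = '1s';
set statement_timeout = '1s';

alter table events drop constraint if exists events_canvas_id_fkey;
alter table events drop constraint if exists events_account_id_fkey;

-- DROP INDEX CONCURRENTLY cannot run inside a transaction block,
-- so run these as standalone statements
drop index concurrently if exists idx_events_for_dequeue;
drop index concurrently if exists idx_events_for_dequeue2;

truncate table events;
drop table if exists events;
```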

Get Google to shrink a clone

Goal: determine the amount of downtime

  • make a clone of our DB
  • set lock_timeout = '1s'
  • set statement_timeout = '1s'
  • drop FK on account_id
    alter table events drop constraint events_canvas_id_fkey
  • drop FK on canvas_id
    alter table events drop constraint events_account_id_fkey
  • drop index idx_events_for_dequeue
    drop index concurrently if exists idx_events_for_dequeue
  • drop index idx_events_for_dequeue2
    drop index concurrently if exists idx_events_for_dequeue2
  • drop index index_events_for_stats
  • truncate events table
  • drop events table
  • ask Google to shrink it (see the size check after this list)
    (they'll do this in real-time synchronously during a workday/call)
  • record the downtime for reference: [downtime]
  • lower availability to single-zone
  • lower CPU from 16 vCPUs to 12 vCPUs
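
One way to confirm how much space the drop actually freed before asking Google to shrink the instance (a sketch using a standard Postgres size function):

```sql
-- overall logical database size; compare before and after dropping events
select pg_size_pretty(pg_database_size(current_database()));
```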

Make a plan for doing this against the prod DB

  • plan how to alert customers
    • of expected downtime, etc
  • ...

Another day (pull these into another issue):

Cloud storage

  • delete trace-related tests
  • check that 404s continue to work
  • ensure we overwrite cloud-storage traces for the execute_handler button
  • check if execute_function traces are appropriately merged with a cloud-storage-based trace
  • garbage collection - set object lifecycle for bucket or for traces
  • ensure pusher is supported
  • do walkthrough and check it all works

Monitoring

  • schedule a weekly call/meeting to review usage, for 4 weeks; at the end, decide what to do next
    • check table sizes (see the query below)
    • check costs
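
For the table-size check, a query along these lines should work (a sketch; standard Postgres statistics views):

```sql
-- largest tables by total on-disk size (table + indexes + toast)
select relname,
       pg_size_pretty(pg_total_relation_size(relid)) as total_size
from pg_statio_user_tables
order by pg_total_relation_size(relid) desc
limit 20;
```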

Migrate existing canvases

  • upload to both simultaneously
  • fetch and upload existing trace data for existing canvases/handlers
  • possibly automatically switch LD flag once this is done
  • switch all users to only use uploaded storage data

Maybe later?

  • turn on private IPs (requires DB downtime)
StachuDotNet commented

  • toplevel_oplists has PK
  • stored_events_v2 no PK
  • function_results_v3 no PK
  • function_arguments no PK
  • traces_v0 has PK
  • user_data has PK
  • system_migrations has PK
  • static_asset_deploys has PK
  • secrets has PK
  • scheduling_rules has PK
  • packages_v0 has PK
  • op_ctrs no PK (though it does have a unique constraint)
  • events_v0 has PK
  • events has PK
  • custom_domains has PK
  • cron_records has PK
  • canvases has PK
  • accounts has PK
  • access no PK

From an earlier call: notes on which tables have / don't have PKs.
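
The list above can be reproduced with a catalog query along these lines (a sketch, assuming everything lives in the public schema):

```sql
-- user tables that have no primary-key constraint
select c.relname as table_name
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.relkind = 'r'
  and n.nspname = 'public'
  and not exists (
        select 1
        from pg_constraint p
        where p.conrelid = c.oid
          and p.contype = 'p'
      )
order by c.relname;
```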

StachuDotNet commented Mar 3, 2023

Edit: the contents of this comment have been moved to #11, and have been executed
