This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

CHANGELOG.md

Changes

21.09.9 (2022-03-29)

Features

  • Add the POST_AUTHORIZE notify hook to support plugins that need to perform operations right after the authorization step. (#552)

Fixes

  • Fix a long-standing critical bug that leaked kernel creation coroutine tasks and made some sessions stuck at PREPARING (#558)

21.09.8 (2022-03-07)

Features

  • Make an explicit error message upon IntegrityError due to missing scaling groups when handling agent heartbeats. (#443)
  • Check the allowed session types per scaling group as part of scheduler predicate checks and structurize the scheduler_opts column by introducing StructuredJSONBColumn which applies trafarets on raw/Python value conversion (#523)
  • Make shielded async functions spawn inside aiotools.PersistentTaskGroup to ensure proper cancellation on shutdown (#533)

Fixes

  • Fix get_wsproxy_version API raising GenericNotFound when user's current domain isn't associated with target scaling group (#517)
  • Allow admins to take actions on behalf of "inactivated" keypairs, such as terminating a RUNNING compute session created by the inactive keypair. (#530)
  • Upgrade Callosum to resolve installation error on Ubuntu-20.04/aarch64 (#534)
  • Fix get_wsproxy_version() returning 404 when target scaling group is allowed to individual keypair/user group rather than user's associated domain (#538)
  • Prevent potential blocking of mutating database transactions when vfolder clone operations take too long, by making the clone operation asynchronous (a background task) with its own transactions (#539)
  • Fix a wrong registry name in the sample etcd config (#540)
  • Handle multi-architecture image manifest properly. (#543)
  • Correctly skip the legacy kernel images in the Docker Hub registry, prefixed with kernel-, while updating the image metadata. (#545)
  • Fix a bug that retried transactions even for non-serialization failures, causing excessive database overheads (#547)
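
The fix in #547 can be illustrated with a minimal sketch of a retry loop that retries only serialization failures (SQLSTATE 40001 in PostgreSQL) and lets every other error propagate immediately. The `SerializationFailure` class and the `run_with_retries()` helper are hypothetical names used for illustration; they are not the manager's actual API.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""


def run_with_retries(txn_func, max_retries=5, base_delay=0.01):
    """Run txn_func, retrying only when it raises SerializationFailure."""
    for attempt in range(1, max_retries + 1):
        try:
            return txn_func()
        except SerializationFailure:
            if attempt == max_retries:
                raise
            # Back off with jitter before retrying the whole transaction.
            time.sleep(base_delay * attempt * random.random())
        # Any other exception (e.g., an integrity error) is not caught
        # here, so it propagates at once without consuming retries.
```

Retrying non-serialization errors would only repeat a transaction that is going to fail again, which is the kind of database overhead the fix describes.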

21.09.7 (2022-02-14)

Features

  • Allow mounting vfolders at arbitrary paths, excluding pre-existing directories under '/' such as '/bin'. (#516)

Fixes

  • Prevent redis pool depletion in proxying streaming app requests by executing the entire connection tracker script in a non-bursty manner. (#526)
  • Apply "read-only" attribute to a broader range of database transactions to improve overall performance (#529)
  • Force resource usage recalculation when session creation fails, so that the resource slots of the failed session are returned. (#531)

21.09.6 (2022-01-26)

Fixes

  • Reduce the possibility of sessions getting stuck at PREPARING by reordering the spawning of postprocessing tasks and RPC calls (#518)
  • Prevent AttributeError in proxying app requests by declaring down_task in TCPProxy class explicitly. (#519)
  • Reduce possibility of new sessions to get stuck in the PREPARING status by improving synchronization of session/kernel creation trackers (#522)
  • Migrate aiodataloader to aiodataloader-ng, which is managed by us (#525)

Miscellaneous

  • Change the default values of max concurrent sessions (30 -> 5) and idle timeout (600 -> 3600) in keypair resource policy fixture to conform to the preferable defaults. (#512)

21.09.5 (2022-01-13)

Fixes

  • Improve kernel creation stability by applying improved transaction retries to more database queries including predicate checks (#514)

21.09.4 (2022-01-11)

Fixes

  • Use a fixed value as the node ID in EventDispatcher instances, either auto-generated from the hostname or manually configured manager.id value of manager.toml.
    • IMPORTANT: An explicit admin/developer action is required to fix up the corrupted Redis database and configuration. Check out the description of lablup/backend.ai-manager#513 for details. (#513)
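
As a rough sketch of the idea behind #513, a node ID can be resolved so that it stays the same across restarts: prefer an explicitly configured value and otherwise derive one from the hostname. The `resolve_node_id()` function and the `manager-` prefix are assumptions made for illustration; the actual derivation used by the manager may differ.

```python
import socket


def resolve_node_id(configured_id=None):
    """Return a node ID that is stable across process restarts."""
    if configured_id:
        # An explicitly configured value (e.g., manager.id) takes priority.
        return configured_id
    # Hypothetical fallback: derive the ID from the hostname, which does
    # not change when the process restarts on the same host.
    return f"manager-{socket.gethostname()}"
```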

21.09.3 (2022-01-10)

Features

  • Add session.start_service API to support wsproxy v2 (#479)
  • Move 'max_containers_per_session' policy check from predicates to registry.enqueue_session (#504)
  • Add support for session renaming (#505)

Fixes

  • Fix "too many sessions matched" error when the given session name has an exact match with additional prefix matches (#506)
  • Update mypy to 0.930 and fix newly discovered type errors (#508)
  • Update type annotations and correct typing errors additionally found by pyright and latest mypy (#509)
  • Fix a potential cause of hang-ups while shutting down the manager service, by handling cancellations in global timers more explicitly (#510)
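
The matching behavior described for #506 can be sketched as follows: an exact name match wins even when other session names share the same prefix, and only a prefix with several candidates is ambiguous. The `match_session()` function is a hypothetical illustration, not the manager's actual query logic.

```python
def match_session(name_or_prefix, session_names):
    """Return the single session name selected by the given string."""
    # An exact match always wins, even if other names share the prefix.
    if name_or_prefix in session_names:
        return name_or_prefix
    candidates = [n for n in session_names if n.startswith(name_or_prefix)]
    if not candidates:
        raise LookupError("no session matched")
    if len(candidates) > 1:
        raise LookupError("too many sessions matched")
    return candidates[0]
```

With sessions named `train` and `train-2`, looking up `train` resolves to the exact match instead of failing as ambiguous.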

21.09.2 (2021-12-15)

Features

  • Update CRUD of session template and correct typo of example-session-templates.json (#480)
  • Add mgr clear-history cli command to delete old records from the kernels table and clear up the actual disk space. (#498)
  • Add a new GQL mutation to modify the schedulable attribute of agents (#500)

Fixes

  • Remove the premature optimization that cached Callosum RPC Peer connections to reduce ZeroMQ handshake latencies, because long-running idle connections may be silently expired by network middleboxes and cause unexpected hang-ups (#497)
  • Fix a regression of the usage stats aggregation API caused by differences between aiopg and asyncpg in the rowcount of SELECT query results, by replacing .rowcount with len() (#502)
  • Revert the busy-wait polling loop for advisory locks introduced in #483 and roll back to the blocking advisory locks of #482, while preserving the refactoring work in #483. (#503)
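
The reasoning behind #502 is that the row count reported by a driver for a SELECT query is driver-specific, while the length of the fetched result is not. The sketch below uses the standard-library `sqlite3` module purely as a stand-in (it is neither aiopg nor asyncpg), and the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kernels (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO kernels (status) VALUES (?)",
    [("RUNNING",), ("TERMINATED",), ("RUNNING",)],
)

cursor = conn.execute("SELECT id FROM kernels WHERE status = 'RUNNING'")
# What rowcount means for SELECT depends on the driver; sqlite3 reports -1.
driver_rowcount = cursor.rowcount
rows = cursor.fetchall()
# The length of the fetched rows is the number of rows actually returned.
fetched_count = len(rows)
conn.close()
```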

21.09.1 (2021-11-11)

Fixes

  • Upgrade aiohttp from 3.7 to 3.8 series (#496)

21.09.0 (2021-11-08)

Features

  • Add optional manually-assigned agent list to session creation API (#469)
  • Upgrade to aioredis v2 (#478)
  • Allow configuration of the TCP keepalive timeout for the manager-to-agent RPC layer via etcd (#485)
  • Limit the maximum configurable value of per-vfolder quota when creating a vfolder to the size quota specified in the keypair resource policy (max_vfolder_size) (#488)
  • Properly implement vfolder's max_size property with storage proxy (21.03.1+ required) and change the unit of the field from KBytes to MBytes. (#489)
  • Return manager version information in status API. This enables clients to display the current version of the manager. (#491)
  • Allow (non-superadmin) users to query/update per-vfolder quotas on their own. To help a client determine availability of per-vfolder quota option, now the response of the vfolder host list API includes volume information such as capabilities from storage proxy. (#492)
  • update_quota API returns the quota value actually set for client's reference. (#493)
  • Add a new get_usage API for superadmins to query the usage of an arbitrary vfolder, while users can query their vfolder usage with get_info API (#494)

Fixes

  • More realistic resource preset fixture. (#481)
  • Replace aioredlock with pg_advisory_lock because aioredlock is no longer actively maintained and causes lots of synchronization issues (#482)
  • A follow-up fix for #482 to silence bogus DB API error upon service shutdown (#483)
  • Apply TCP keepalive options to ZeroMQ sockets for RPC channels (#484)
  • Filter out images with malformed tags from the response of the image list API (#486)
  • Remove the deferrable=True option from the DB transaction that reads session usage statistics, because the manager now keeps repeatedly creating implicitly started DB transactions to acquire advisory locks (#482) and deferrable transactions can barely be started. (#487)
  • Fix an error in creating a virtual folder when quota is not delivered. (#490)
  • Improve stability of session/kernel event notification APIs (#495)

Miscellaneous

  • Update missing licenses and add project links in DEPENDENCIES.md (#471)

21.09.0a2 (2021-09-28)

Features

  • Add the get/set APIs for size-based quota of vfolder hosts via storage proxy (#474)
  • Add an Etcd option to set MTU in creating an overlay network for a cluster session to support improved performance for multi-node cluster training. (#475)

Fixes

  • Always set the scaling group when creating sessions to prevent use of non-allowed scaling groups (#472)
  • Rearrange the checks on vfolder mount aliases so that the emptiness and null checks come in the right order (#473)
  • Fix a regression of Agent.batch_load() GraphQL resolver due to internal argument name changes (#476)

21.09.0a1 (2021-08-25)

Breaking Changes

  • Removed never-used order_key and order_asc arguments in GraphQL pagination queries in favor of the new generic order argument (#449)

Features

  • Rewrite the session scheduler to avoid HoL blocking (#415)
    • Skip over sessions in the queue if they fail to satisfy predicates for multiple retries -> 1st case of HoL blocking: a rogue pending session blocks everything in the same scaling group
    • You may configure the maximum number of retries in the config/plugins/scheduler/fifo/num_retries_to_skip etcd key.
    • Split the scheduler into two async loops for scheduling decision and session spawning by inserting "SCHEDULED" status between "PENDING" and "PREPARING" statuses -> 2nd case of HoL blocking: failure isolation with each task
  • Add an API endpoint to share/unshare a group virtual folder directly with specific users. This allows specified users (usually teachers with user accounts) to upload data/materials to a virtual folder while it is shared as read-only with other group users. (#419)
  • Add PRE_AUTH_MIDDLEWARE hook for cookie-based SSO plugins (#420)
  • Add update_full_name API to rename user's full_name regardless of role. (#424)
  • Modify the loading process so that the scheduler can be loaded reflecting scheduler_opts. (#428)
  • A new idle timeout checker to support utilization-based garbage collection of sessions. (#432)
  • Add a common session environment variable BACKENDAI_SESSION_NAME for improved prompts and to help users tell which container they are using (#433)
  • Add internal warning logs for an excessive number of concurrent DB transactions (#435)
  • Add an environment variable BACKENDAI_ACCESS_KEY for identifying the session owner inside the session containers (#437)
  • Make an explicit error message upon IntegrityError due to missing scaling groups when handling agent heartbeats. (#443)
  • Now all paginated list GraphQL queries have optional filter and order arguments where the client may specify the filtering/ordering conditions using a simple mini-language expression (#449)
  • Add aiomonitor module for manager (#450)
  • Add groups_by_name GraphQL query to directly get group(s) from the given name (#452)
  • Add lock-related DB connection settings for better DB stability when the Manager does not release a lock and/or stays idle for a long time after acquiring a lock. (#454)
  • Make ilike operator (equivalent to SQL's ILIKE) available in the queryfilter to allow case-insensitive string matching (#458)
  • Add queryfilter/queryorder support for keypairs' (full_name, num_queries), users' (uuid), and kernels' (id, agent(s)) column. (#464)
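
The first head-of-line (HoL) blocking case addressed by the scheduler rewrite (#415) can be sketched as a FIFO pick that skips a pending session once it has failed the predicate checks a configured number of times. The `PendingSession` class, the `pick_next()` function, and the single integer "slots" predicate are simplifying assumptions for illustration; this is not the actual scheduler code.

```python
from dataclasses import dataclass


@dataclass
class PendingSession:
    name: str
    requested_slots: int
    num_retries: int = 0


def pick_next(queue, free_slots, num_retries_to_skip=3):
    """Pick the first schedulable session in FIFO order.

    A session that has already failed the check num_retries_to_skip
    times is skipped so that it cannot block the sessions behind it.
    """
    for sess in queue:
        if sess.requested_slots <= free_slots:
            return sess
        if sess.num_retries < num_retries_to_skip:
            # Still within its retry budget: count the failure and keep
            # strict FIFO order by not looking past this session.
            sess.num_retries += 1
            return None
        # Retry budget exhausted: skip over it and try the next session.
    return None
```

In this sketch, a rogue session requesting more resources than will ever be free delays the queue for only `num_retries_to_skip` scheduling rounds instead of indefinitely.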

Fixes

  • Fix a missing reference fix for renaming of gateway to manager.api (#409)
  • Refactor the manager CLI initialization steps and promote generate-keypair as a regular mgr subcommand (#411)
  • Fix an internal API mismatch for our SQLAlchemy custom enum types (#412)
  • Fix a regression in session cancellation and kernel status updates after SQLAlchemy v1.4 upgrade (#413)
  • Fix a regression of spawning multi-node cluster sessions due to DB API changes related to setting transaction isolation levels (#416)
  • Adjust the firing rate of DoPrepareEvent to follow and alternate with the scheduler execution (#418)
  • Change the KeyPair.num_queries GQL field to use Redis instead of the keypairs.num_queries DB column to avoid excessive DB writes (#421)
  • Improve stability and synchronization of container-database states (#425)
    • Now all DB transactions use the "SERIALIZABLE" isolation level with explicit retries.
    • Now DB transactions that include only SELECT queries are marked as "read-only" so that the PostgreSQL engine can optimize concurrent access with the new isolation level. All future code should use the begin_readonly() method from our own subclassed SQLAlchemy engine instance, replacing all existing db context variables.
    • Remove excessive database updates due to keypair API query counts and kernel API query counts. The keypair API query count is re-written to use Redis with one month retention. (#421) Now just calling an API does not trigger updates in the PostgreSQL database.
    • Fix unnecessary database updates for agent heartbeats.
    • Split many update-only DB transactions into smaller units, such as resource recalculation.
    • Use PostgreSQL advisory locks to make the scheduling decision process a critical section.
    • Fix some variable binding issues with nested functions inside loops.
    • Apply event message coalescing to prevent event bursts (e.g., DoScheduleEvent fired after enqueueing new session requests), which hurt database performance and can potentially break the transaction isolation guarantees.
  • Further refine the stability update with improved database transaction retries and the latest SQLAlchemy 1.4.x updates within the last month (#429)
  • Fix a regression that destroying a cluster session generates duplicate session termination events (#430)
  • Optimize read-only GraphQL queries to use read-only transaction isolation level, which greatly reduces the database loads when using GUI (#431)
  • Fix owner_access_key related issues in creating and terminating the session (#434)
    • Remove the automatic removal of owner_access_key in check_api_params() since all API handlers supporting it have explicit trafaret definitions for it
    • Add owner_access_key in checking API parameter during session termination
  • Rewrite internal database connection and transaction management for GraphQL query and mutation processing, which improves overall stability and performance (#436)
  • Handle missing root context gracefully with explicit warning during initialization of the intrinsic error monitor plugin (#439)
  • Do not collect the data for utilization idle checker when the current time does not exceed the sum of the last collect time and the interval of the checker. (#441)
  • Handle failure of acquiring postgres advisory locks in the scheduler gracefully, by translating them as logged cancellations (#444)
  • Handle missing kernel log gracefully by adding a message about unavailability instead of panicking (#445)
  • Fix the regression of batch-type sessions by moving startup_command invocation to agents (#447)
  • Add a new GQL endpoint /admin/gql (in addition to existing /admin/graphql) which uses the standard-compliant response format (#448)
  • Apply missing batching of database queries for the Group.scaling_groups GraphQL field resolver. (#451)
  • Apply batching to user group resolution in GraphQL queries (#452)
  • Fix missing order_key, order_asc -> order changes in paginated list GQL queries (#453)
  • Fix a regression of session concurrency limit checks due to transaction retry refactoring in #429 (#455)
  • Partially revert and fix #454, which introduced default connection settings of deadlock/lock/transaction-idle timeouts for PostgreSQL, to make it work with AWS RDS (#456)
  • Fix a critical bug due to a missing column from the select targets of join SQL queries in the new batched GQL object resolvers for Group.by_user and ScalingGroup.by_group, which only happens with non-admin user accounts (#457)
  • Fix handling of value transforms with array values in queryfilter binary expressions (#459)
  • Let the idle timeout checker skip batch-type sessions as they do not have any interaction whose absence is translated to idleness (#460)
  • Remove duplicate code for the mount_map check and add an alias name check for mount_map. (#461)
  • Exclude un-allocated resources from the criteria of the utilization-based idle checker. (#463)
  • Fix missing timestamp updates when terminating sessions (#465)
  • Include the exact list of missing/invalid vfolders when returning VFolderNotFound error (#466)
  • Return the correct ID of the manager for the request to get the manager status. (#468)
  • Add the missing extra requirements tag of SQLAlchemy to install greenlet and asyncpg correctly (#470)
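
The "variable binding issues with nested functions inside loops" mentioned under #425 refer to a common Python pitfall: a nested function looks up loop variables when it is called, not when it is defined. The sketch below shows the general pitfall and one usual fix; the function names are hypothetical and the code is not taken from the manager.

```python
def make_callbacks_buggy(names):
    callbacks = []
    for name in names:
        # Bug: the lambda looks up `name` when called, after the loop ended,
        # so every callback sees the last value.
        callbacks.append(lambda: name)
    return callbacks


def make_callbacks_fixed(names):
    callbacks = []
    for name in names:
        # Fix: bind the current value as a default argument at definition time.
        callbacks.append(lambda name=name: name)
    return callbacks


buggy = [cb() for cb in make_callbacks_buggy(["a", "b", "c"])]
fixed = [cb() for cb in make_callbacks_fixed(["a", "b", "c"])]
```

Here `buggy` evaluates to `["c", "c", "c"]` while `fixed` evaluates to `["a", "b", "c"]`.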

Miscellaneous

  • Fix the examples for the storage proxy URL configurations in the manager.config module (#410)
  • Update sample configurations for etcd (#414)
  • Temporarily pin pytest-asyncio to 0.14.0 due to regression of handling event loops for fixtures (#423)
  • Update package dependencies (#462)

Older changelogs