diff --git a/doc/ADDB_Monitoring.md b/doc/ADDB_Monitoring.md index 417a81dcb26..8f014da00e1 100644 --- a/doc/ADDB_Monitoring.md +++ b/doc/ADDB_Monitoring.md @@ -3,27 +3,27 @@ ADDB records are posted by Motr software to record the occurrence of specific si ADDB monitors serve two purposes: -+ To support online statistics reporting, like similar to df, vmstat, top. This is needed both globally (summarized over all nodes) and locally. -+ Inform external world (HA & cluster management tools) about exceptional conditions like failures, overload, etc. ++ To support online statistics reporting, similar to df, vmstat, top. This is needed both globally (summarized over all nodes) and locally. ++ To inform the external world (HA & cluster management tools) about exceptional conditions like failures, overload, etc. ## Definitions -+ **ADDB** Analysis and Diagnotic Data Base. See [0] for aditional details. -+ **ADDB base record type** The architecture defines “event”, “data point” and “counter” type ADDB record categories. See [0] for aditional details. In this HLD we will use the word “exception” instead of “event”. We’ll use the term “event” to describe any derivation of the base record types. -+ **Summary ADDB records** These are summary records that are generated by ADDB monitor for a particular system metric. -+ **ADDB monitors** These are objects present on every node in the cluster (client or server), they generate the summary records. ++ **ADDB** Analysis and Diagnostic Data Base. See [0] for additional details. ++ **ADDB base record type** The architecture defines “event”, “data point” and “counter” type ADDB record categories. See [0] for additional details. In this HLD we will use the word “exception” instead of “event”. We’ll use the term “event” to describe any derivation of the base record types. ++ **Summary ADDB records** These are summary records that are generated by an ADDB monitor for a particular system metric. ++ **ADDB monitors** These are objects present on every node in the cluster (client or server); they generate the summary records. ## Requirements -+ [r.addb.monitor.add.runtime] Addition of ADDB monitors. -+ [r.addb.monitor.remove.runtime] Deletion of ADDB monitors. -+ [r.addb.monitor.summary.generate-addb-records] ADDB monitors are able to generate summary addb records. -+ [r.addb.monitor.summary.deliver-addb-record-to-stats-service] ADDB monitors send summary addb records to stats service. -+ [r.addb.monitor.summary.deliver-addb-exception-records-to-local-HA] ADDB monitors send exception addb records to local HA (High Availability) component as a fop. -+ [r.addb.monitor.summary.deliver-addb-record-to-addb-service] ADDB monitors send summary addb records to addb service similar to normal addb records. -+ [r.addb.monitor.nesting] There can be ADDB monitors for ADDB records that are generated by some other monitors. -+ [r.addb.monitor.stats-service] Stats service to maintain statistics information received from all the nodes. -+ [r.addb.monitor.stats-service.state] ADDB stats service maintains a state, this state is built from all the addb summary records that this service receives through all the nodes present in the system (cluster). This state basically comprises of all the statistics summary information sent from nodes. -+ [r.addb.monitor.stats-service.query] Client (for eg. m0stats, m0t1fs, etc.) can query to this stats service for getting state information.
-+ [r.addb.monitor.stats-service.single-instance] There would be only one instance of the stats service in the cluster/system. ++ [r.addb.monitor.add.runtime] Addition of ADDB monitors. ++ [r.addb.monitor.remove.runtime] Deletion of ADDB monitors. ++ [r.addb.monitor.summary.generate-addb-records] ADDB monitors are able to generate summary addb records. ++ [r.addb.monitor.summary.deliver-addb-record-to-stats-service] ADDB monitors send summary addb records to the stats service. ++ [r.addb.monitor.summary.deliver-addb-exception-records-to-local-HA] ADDB monitors send exception addb records to the local HA (High Availability) component as a fop. ++ [r.addb.monitor.summary.deliver-addb-record-to-addb-service] ADDB monitors send summary addb records to the addb service similar to normal addb records. ++ [r.addb.monitor.nesting] There can be ADDB monitors for ADDB records that are generated by some other monitors. ++ [r.addb.monitor.stats-service] A stats service maintains statistics information received from all the nodes. ++ [r.addb.monitor.stats-service.state] The ADDB stats service maintains a state; this state is built from all the addb summary records that this service receives from all the nodes present in the system (cluster). This state comprises all the statistics summary information sent by the nodes. ++ [r.addb.monitor.stats-service.query] Clients (e.g. m0stats, m0t1fs, etc.) can query this stats service for state information. ++ [r.addb.monitor.stats-service.single-instance] There is only one instance of the stats service in the cluster/system. ## Functional Requirements There are two APIs that are globally visible, they are to add/delete monitors. @@ -31,16 +31,16 @@ There are two APIs that are globally visible, they are to add/delete monitors. ``void m0_addb_monitor_del(struct m0_addb_monitor *mon);`` ADDB monitors do two main work / things: -+ Report statistics online -+ Report exceptional conditions like failure, etc. ++ Report statistics online ++ Report exceptional conditions like failures, etc. ### Statistics Reporting Reporting of statistics is required, which is similar to df, vmstat, top, etc. These statistics are generated by monitors that generate summaries of various system metric periodically. Statistics belong to two categories: -1. Stats which are readily available, eg. balloc will generate addb records about free space in a container periodically. -1. Stats which are not readily available. +1. Stats which are readily available, e.g. balloc will generate addb records about free space in a container periodically. +1. Stats which are not readily available. These stats summary ADDB records can be produced on any node, this could be client or server. If produced on client they are sent to endpoint where addb service is running (using the current mechanism) and also to the endpoint where stats service is running, while if produced on server they are written to addb stob and also sent to this endpoint where stats service is running. @@ -68,9 +68,9 @@ ADDB monitors are represented as follows: ``` Structure field descriptions: -+ am_watch(), a monitor specific function.Actual monitoring logic is to be written in this function. It does the processing of all the addb records of its interests and can post the summary statistics obtained directly or computed as addb records that gets delivered to endpoint where addb service is running and to the endpoint where stats service is running as addb records.
Also, it can post the exceptional conditions to a special service & a local HA component. -+ am_datum, provides for some private information that be kept per monitor. -+ am_linkage, links monitor to the global monitor list. ++ am_watch(), a monitor-specific function. The actual monitoring logic is to be written in this function. It processes all the addb records of interest and can post the summary statistics, obtained directly or computed, as addb records that get delivered to the endpoint where the addb service is running and to the endpoint where the stats service is running. Also, it can post exceptional conditions to a special service & a local HA component. ++ am_datum, provides for some private information that may be kept per monitor. ++ am_linkage, links the monitor to the global monitor list. There is a global list of all the monitors, add() would just add the monitor to this global list while del () would just remove this particular monitor from this global list. Monitors are added during addb sub-system initialization and deleted during the addb sub-system finalization. @@ -90,73 +90,74 @@ There is a periodic posting of these addb summary records and this is done by th The bottom half i.e. AST part would be run by a dedicated thread & would be synchronized among the various others threads that would run monitors with a sm (state machine) group lock. ## Conformance -+ [i.addb.monitor.add] An API is made available for this. -+ [i.addb.monitor.remove] An API is made available for this. -+ [i.addb.monitor.generate-summary-addb-records] Monitor’s am_watch() function will do this. -+ [r.addb.monitor.deliver-addb-record-to-stats-service] Addition to current ADDB mechanism is to be done to differentiate between summary stats records generated by monitors and other addb records & send these summary records to stats service. -+ [r.addb.monitor.deliver-addb-exception-records-to-local-HA] Monitor’s am_watch() function will do this. -+ [r.addb.monitor.deliver-addb-record-to-addb-service] This makes use of current implementation. -+ [r.addb.monitor.nesting] Monitors generate addb records which themselves can be monitored. -+ [r.addb.stats-service.state] Implementation of stats service handles this. -+ [r.addb.stats-service.query] Implementation of stats service handles this. -+ [r.addb.stats-service.single-instance] Implementation of stats service handles this. ++ [i.addb.monitor.add] An API is made available for this. ++ [i.addb.monitor.remove] An API is made available for this. ++ [i.addb.monitor.generate-summary-addb-records] Monitor’s am_watch() function will do this. ++ [r.addb.monitor.deliver-addb-record-to-stats-service] An addition to the current ADDB mechanism is to be done to differentiate between summary stats records generated by monitors and other addb records & to send these summary records to the stats service. + ++ [r.addb.monitor.deliver-addb-exception-records-to-local-HA] Monitor’s am_watch() function will do this. ++ [r.addb.monitor.deliver-addb-record-to-addb-service] This makes use of current implementation. ++ [r.addb.monitor.nesting] Monitors generate addb records which themselves can be monitored. ++ [r.addb.stats-service.state] Implementation of stats service handles this. ++ [r.addb.stats-service.query] Implementation of stats service handles this. ++ [r.addb.stats-service.single-instance] Implementation of stats service handles this. ## Dependencies -+ [r.addb.retention] ADDB monitor generates addb records.
-+ [r.addb.retention.storage] ADDB monitor generates addb records. -+ [r.addb.timings] ADDB monitor may need to calculate processing rate statistics. -+ [r.addb.filtering] ADDB monitor needs information from addb records. -+ [r.addb.record.type.datapoint] ADDB monitor can generate datapoint addb records. -+ [r.addb.record.type.counter] ADDB monitor can generate counter addb records. -+ [r.addb.record.type.event] ADDB monitor can generate event addb record. -+ [r.addb.record.type.counter.statistics] ADDB monitor needs to do statistics reporting. -+ [r.addb.record.definition] ADDB monitor can define new addb record. -+ [r.addb.record.definition.extensible]. -+ [r.addb.post] ADDB monitor can post addb records. -+ [r.addb.post.non-blocking] Decrease performance impact of ADDB monitoring. ++ [r.addb.retention] ADDB monitor generates addb records. ++ [r.addb.retention.storage] ADDB monitor generates addb records. ++ [r.addb.timings] ADDB monitor may need to calculate processing rate statistics. ++ [r.addb.filtering] ADDB monitor needs information from addb records. ++ [r.addb.record.type.datapoint] ADDB monitor can generate datapoint addb records. ++ [r.addb.record.type.counter] ADDB monitor can generate counter addb records. ++ [r.addb.record.type.event] ADDB monitor can generate event addb record. ++ [r.addb.record.type.counter.statistics] ADDB monitor needs to do statistics reporting. ++ [r.addb.record.definition] ADDB monitor can define new addb record. ++ [r.addb.record.definition.extensible]. ++ [r.addb.post] ADDB monitor can post addb records. ++ [r.addb.post.non-blocking] Decrease performance impact of ADDB monitoring. ## Use Cases **Statistical monitoring of addb records that already have statistical information in them.** Following steps show how an addb monitor collects statistical information on a particular node (client/server) from addb records and send it to stats service as addb records: -1. Create ADDB monitor, add it to the global list of monitors. -2. Define the type of addb record that it will generate. -3. Get the statistics information from these addb records periodically. -4. Send this statistical information to the endpoint where stats service is running as addb records & to the endpoint where addb service is running if the node is a client or to the addb stob if the node is server periodically. +1. Create ADDB monitor, add it to the global list of monitors. +2. Define the type of addb record that it will generate. +3. Get the statistics information from these addb records periodically. +4. Send this statistical information to the endpoint where stats service is running as addb records & to the endpoint where addb service is running if the node is a client or to the addb stob if the node is server periodically. **Statistical monitoring of addb records that do not contain statistical information in them** Following steps show how an addb monitor collects statistical information on a particular node(client/server) from addb records and send it to stats service as addb records: -1. Create ADDB monitor, add it to the global list of monitors. -2. Define the type of addb record that it will generate. -1. Continuously compute statistics from the monitored addb records. -1. Send this statistical information to the endpoint where stats service is running as addb records & to the endpoint where addb service is running if the node is a client or to the addb stob if the node is server periodically. +1. Create ADDB monitor, add it to the global list of monitors. +2. 
Define the type of addb record that it will generate. +3. Continuously compute statistics from the monitored addb records. +4. Send this statistical information to the endpoint where stats service is running as addb records & to the endpoint where addb service is running if the node is a client or to the addb stob if the node is server periodically. **Exceptional conditions monitoring** Exceptional conditions such as failures, overflows, etc. could be generated inside monitoring(exceptions occurred as a result of interpreting the statistical information generated after monitoring addb records) or outside monitoring (other sub-system failures). Following steps are to be taken: -1. Generate the exception description fop. -2. Post this fop to a local HA component. +1. Generate the exception description fop. +2. Post this fop to a local HA component. **Building a cluster wide global & local state in memory on a node where stats service is running** -1. Create in-memory state structure of the cluster on this node. -1. Receive statistical summary addb records from all the node. -1. Update the state with the information in these latest addb records. +1. Create the in-memory state structure of the cluster on this node. +2. Receive statistical summary addb records from all the nodes. +3. Update the state with the information in these latest addb records. **Query for some state information to the stats service** -1. Construct & send a request fop for specific or complete state information to the stats service & wait for reply. -2. Stats service checks for requesting information, gathers it in reply fop & sends it back to the node from where request was initiated. +1. Construct & send a request fop for specific or complete state information to the stats service & wait for the reply. +2. The stats service checks the requested information, gathers it in a reply fop & sends it back to the node from which the request was initiated. ## Failures Following failure cases are listed along with their handling mechanism: -+ A failure to construct new state on the node where the stats service runs would return the previous state to the node that requested this state information during this duration. -+ Exceptional conditions are reported to local HA component using a fop, a failure of receiving a fop by local HA component can happen, this would mean that some exceptional conditions can go unnoticed by local HA component. This type of failure is ignored. ++ A failure to construct the new state on the node where the stats service runs means that the previous state is returned to any node that requests state information during this period. ++ Exceptional conditions are reported to the local HA component using a fop; the local HA component may fail to receive such a fop, which means that some exceptional conditions can go unnoticed by it. This type of failure is ignored. ### Rationale The existing ADDB implementation and the newly developed tracing subsystem contributed greatly to the requirement to use C macro interfaces with compile time validation. @@ -174,4 +175,4 @@ ADDB repositories are stored in Motr storage objects. ADDB summary records are s The ADDB monitoring component can be added/deleted by modified the configuration related to it. ## References -[0] HLD of ADDB collection mechanism. +[0] HLD of ADDB collection mechanism.
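To illustrate the add/delete API and the `am_watch`/`am_datum`/`am_linkage` fields described in the ADDB_Monitoring.md changes above, here is a minimal sketch of how a monitor might be defined and registered. The header path, the `am_watch()` callback signature, the `m0_addb_rec` record type and treating `am_datum` as a pointer are assumptions made for this example, not the actual Motr declarations:

```c
#include <stdint.h>
#include "addb/addb.h"   /* assumed location of the monitor declarations */

/* Hypothetical per-monitor private data, kept via am_datum. */
struct io_rate_stats {
        uint64_t ir_records_seen;
};

/*
 * Assumed shape of the am_watch() callback: it is invoked for addb
 * records of interest and accumulates a summary that the periodic
 * posting mechanism later turns into summary addb records.
 */
static void io_rate_watch(struct m0_addb_monitor *mon,
                          const struct m0_addb_rec *rec)
{
        struct io_rate_stats *stats = mon->am_datum;

        /* ... filter the record and update the running summary ... */
        stats->ir_records_seen++;
}

static struct io_rate_stats   io_rate_data;
static struct m0_addb_monitor io_rate_monitor = {
        .am_watch = io_rate_watch,
        .am_datum = &io_rate_data,
};

/* Registration and removal through the globally visible add/del API. */
void io_rate_monitor_start(void)
{
        m0_addb_monitor_add(&io_rate_monitor);
}

void io_rate_monitor_stop(void)
{
        m0_addb_monitor_del(&io_rate_monitor);
}
```

As the document describes, such a monitor would typically be added during addb sub-system initialization and deleted during its finalization.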
diff --git a/doc/CORTX-MOTR-ARCHITECTURE.md b/doc/CORTX-MOTR-ARCHITECTURE.md index 7342b58e520..6410ba30085 100644 --- a/doc/CORTX-MOTR-ARCHITECTURE.md +++ b/doc/CORTX-MOTR-ARCHITECTURE.md @@ -46,7 +46,7 @@ + extension interface + flexible transactions + open source -+ Portable: runs in user space on any version of Linux ++ Portable: runs in user space on any version of Linux # Data flow S3 + cortx rpc: uses RDMA when available (requires kernel module) @@ -86,14 +86,14 @@ + Fast scalable repairs of device failure. + There are other layouts: composite. -# Index Layout +# Index Layout + An index is a container of key-value pairs: + GET(key) -> val, PUT(key, val), DEL(key), NEXT(key) -> (key, val) + used to store meta-data: (key: "/etc/passwd:length", value: 8192) + Uses network raid with parity de-clustering (same as objects), but only N = 1, in N + K + S + X-way replication (N = 1, K = X - 1), each key is replicated independently + takes hardware topology into account (for free!) -+ fast scalable repair (for free!) ++ fast scalable repair (for free!) ![image](./Images/7_Index_Layout.png) @@ -129,7 +129,7 @@ + One of the most complex CORTX components + Scalable efficient transactions are hard + fortunately not everything is needed at once -+ staged implementation: DTM0 first ++ staged implementation: DTM0 first ![image](./Images/10_DTM.png) @@ -255,5 +255,4 @@ + combine workloads ![image](./Images/15_ADDB_Advanced_Use_Case.png) -# Questions - +# Questions diff --git a/doc/End-to-end-Data-Integrity.md b/doc/End-to-end-Data-Integrity.md index a6244d23e8e..bfa99706c41 100644 --- a/doc/End-to-end-Data-Integrity.md +++ b/doc/End-to-end-Data-Integrity.md @@ -1,10 +1,10 @@ # Motr end-to-end Data Integrity ## Design Highlights -+ Data of each target is divided into blocks of 4096 bytes. -+ Checksum and tags of 64-bit each for these blocks are computed at m0t1fs and sent over wire. -+ Checksum for data blocks is computed based on checksum algorithm selected from configuration. -+ Data integrity type and operations are initialized in m0_file. -+ Using do_sum(), checksum values are computed for each block of data and using do_chk(), checksum values are verified. ++ Data of each target is divided into blocks of 4096 bytes. ++ Checksum and tags of 64-bit each for these blocks are computed at m0t1fs and sent over wire. ++ Checksum for data blocks is computed based on checksum algorithm selected from configuration. ++ Data integrity type and operations are initialized in m0_file. ++ Using do_sum(), checksum values are computed for each block of data and using do_chk(), checksum values are verified. ![image](./Images/Write.PNG) @@ -12,11 +12,11 @@ ## Current Status ### Completed -+ Di is computed at m0t1fs and sent over wire. -+ After receiving write fop, checksum is recomputed and verified at the IO service. ++ Di is computed at m0t1fs and sent over wire. ++ After receiving write fop, checksum is recomputed and verified at the IO service. ### In progress -+ In be segment block attributes m0_be_emap_seg:ee_battr is added. The m0_be_emap_seg:ee_val and ee_battr (When b_nob > 0) are stored in btree. -+ Emap split for di data. -+ Write di data to extents while storing the data in disks (uses be_emap_split and in place btree insert api’s). -+ Read di data from extents while reading data from disks and verify checksum. -+ In sns while reading data units, verify checksum and while writing, store di data. ++ In be segment block attributes m0_be_emap_seg:ee_battr is added. 
The m0_be_emap_seg:ee_val and ee_battr (When b_nob > 0) are stored in btree. ++ Emap split for di data. ++ Write di data to extents while storing the data in disks (uses be_emap_split and in place btree insert api’s). ++ Read di data from extents while reading data from disks and verify checksum. ++ In sns while reading data units, verify checksum and while writing, store di data. diff --git a/doc/Seagate-FDMI-Design-Notes.md b/doc/Seagate-FDMI-Design-Notes.md index 5bfb1bb70cf..a4cb9a8fa73 100644 --- a/doc/Seagate-FDMI-Design-Notes.md +++ b/doc/Seagate-FDMI-Design-Notes.md @@ -15,8 +15,8 @@ The processing is to be done on an incoming FOL record against the Filter set. T Filter Store is to be filled in by means of Subscription API. Filter Index is updated internally on adding Filter. Currently 2 options for Plug-in architecture are anticipated: -1. Option 1: FDMI-Plug-in. Each plug-in is linked with FDMI making use of internal FDMI API only (some callback, for instance). See Fig.1 -1. Option 2: FDMI Plug-in transforms to Mero Core Plug-in. Mero core in this case most likely provides limited features for RPC only. Mero RPC is used to collect notifications from all Mero instances. +1. Option 1: FDMI-Plug-in. Each plug-in is linked with FDMI making use of internal FDMI API only (some callback, for instance). See Fig.1 +1. Option 2: FDMI Plug-in transforms to Mero Core Plug-in. Mero core in this case most likely provides limited features for RPC only. Mero RPC is used to collect notifications from all Mero instances. ## Plug-in API @@ -25,8 +25,8 @@ The API is to allow the plug-ins making use of FDMI to register with the latter Besides, plug-in may additionally inform FDMI about its specifics, like FOL type it is to process, optimization/tweak requirements (e.g. no batching), etc. Public Entries: -* Initialize plug-in -* <..> +* Initialize plug-in +* <..> **N.B.** Plug-in API design details depend on selected architecture option 1 or 2. @@ -49,8 +49,8 @@ Failure results in dropping current FOL copy For both options FOLs are fed to Filer Processor the same way: locally. Public Entries: -* <..> -* <..> +* <..> +* <..> ## Subscription API @@ -58,9 +58,9 @@ Public Entries: The API is to provide a way for adding Filter rules to FDMI instances, identifying FOL processing traits as well as associating Filter with Consumer Public Entries: -* Register Filter -* Unregister Filter -* <..> +* Register Filter +* Unregister Filter +* <..> ## Notification API @@ -74,17 +74,17 @@ Option 2, Filter Processor always sends a Notification to Consumer using Mero RP Public Entries: -* <..> +* <..> -* <..> +* <..> ## Assumptions ### Filter Filter identifies: -* Consumer to be notified -* Conditions FOL to meet +* Consumer to be notified +* Conditions FOL to meet The index format: TBD Filter once registered will be eventually spread across the whole number of Cluster nodes running FDMI services. @@ -94,8 +94,8 @@ The number of Filters registered in Mero Cluster is expected to be of no explici ### Filter semantics Filter syntax need to remain flexible to adopt any future improvements/enhancements, but fully covering initial requirements of multi-type support and human readability. Possible options are: -* Native-language-like syntax (SQL-like) -* Symbolic object notation, easily parsed (some standard adherent notation preferred), easily extensible, including object nesting, e.g. 
JSON (current DSR’s choice) +* Native-language-like syntax (SQL-like) +* Symbolic object notation, easily parsed (some standard adherent notation preferred), easily extensible, including object nesting, e.g. JSON (current DSR’s choice) NB: Filter being parsed on ingestion may be transformed to a combination of elementary rules in case the transformation does not change Filter semantics but potentially improves Processor performance (decomposition is being performed) diff --git a/doc/outdated/Containers.md b/doc/outdated/Containers.md index aa95a691987..4ff86bc004a 100644 --- a/doc/outdated/Containers.md +++ b/doc/outdated/Containers.md @@ -4,20 +4,29 @@ This document summarizes container discussions in the Motr architecture team. Th Container is a low level abstraction that insulates higher Motr layers of the knowledge of storage device addresses (block numbers). The bulk of data and meta-data is stored in containers. # Items -+ A container is a storage for data or meta-data. There are several types of container: data container, meta-data container, possibly others. - -+ A container provides a very simple interface: it has an internal namespace consisting of keys and provides methods to fetch and update records stored at a given key. Nomenclature of keys and constraints on record structure depend on container type. For data container, keys are logical page offsets and records are simply pages full of data without any internal structure. For meta-data container keys are opaque identifiers of meta-data records. -+ A container stores its contents in a backing store---a storage device. A container allocates space from its backing store and returns no longer used space back as necessary. -+ Local RAID is implemented by using a collection of containers to stripe data across. Note that local RAID is not implemented in the containers layer to avoid cyclic dependency between layers: the data structures (specifically layouts) that local RAID implementation is required to share with SNS are interpreted by Motr back-end that itself uses containers to store its data. -+ Snapshots. Containers are used for data and meta-data snap shotting. When a local snapshot is made as a part of object or system level snap shotting, a container involved into the snapshot is COW-ed, so that all updates are re-directed to a new container. -+ After a container is COW-ed no update should ever touch the blocks of the read-only primary container. This is a necessary prerequisite of a scalable fsck implementation, that will achieve reasonable confidence in system consistency by background scan and check of periodically taken global snapshots. - -+ Migration. A container (either read-only or read-write) can migrate to another node. Typical scenarios of such migration are bulk re-integration of updates from a proxy-server to a set of primary servers and moving snapshots. - -+ To make migration efficient, a container must be able to do a fast IO of its contents to another node. Specifically, it should be possible to send container contents over network without iterating through individual container records. This condition also implies that a container allocates space from its backing store in a relatively large contiguous chunks. -Self-identification. A container has an identity that is stored in the container and migrated together with the container. 
Higher layers address container records by (container-id, key) pairs and such addresses remain valid across container migrations, including "migration" where a hard drive is pulled out of one server and inserted into another. In the latter case the system must be able to determine what containers were moved between nodes and update configuration respectively, also it should able determine whether any storage or layout invariants were violated, e.g., whether multiple units of the same parity group are now located on the same server. -+ Layouts. Higher layers address and locate data and meta-data through container identifiers and container keys. On the other hand, a layout produces a location of a given piece of data or meta-data. Together this means that lowest level layouts produce locations in the form of (container-id, key) pairs. -+ A container can be merged into another container. Inclusion of one container into another can be done administrative reasons, to efficiently migrate a large number of smaller containers or for some other purpose. On inclusion, a container retains its identity (does it? Losing identity would require updating (and, hence, tracking) all references). -+ Fids. A fid (file identifiers) is an immutable globally and temporally unique file identifier. As file meta-data record ("inode") is stored in some container, it's logical to use (container-id, key) address of this record as fid. Note that this is formally quite similar to the structure of Lustre fid, composed of a sequence identifier and offset within the sequence. -+ CLDB. A method is required to resolve container addresses to node identifiers. This method should work while containers migrate between nodes and merge with each other. Container Location Data-Base (CLDB) is a distributed data-base tracking containers in the cluster. This database is updated transactionally on container migrations and merges (and splits? There must be splits for symmetry.). Note that CLDB supersedes Fid Location Data-Base (FLDB), see above on fids. -+ Data integrity. A container is a possible place to deal with data integrity issues. Alternatively this can be relegated to lower levels (self-checking device pairs) or higher levels (DMU-like file system with check-sums in meta-data). ++ A container is a storage for data or meta-data. There are several types of container: data container, meta-data container, possibly others. + ++ A container provides a very simple interface: it has an internal namespace consisting of keys and provides methods to fetch and update records stored at a given key. Nomenclature of keys and constraints on record structure depend on container type. For data container, keys are logical page offsets and records are simply pages full of data without any internal structure. For meta-data container keys are opaque identifiers of meta-data records. + ++ A container stores its contents in a backing store---a storage device. A container allocates space from its backing store and returns no longer used space back as necessary. + ++ Local RAID is implemented by using a collection of containers to stripe data across. Note that local RAID is not implemented in the containers layer to avoid cyclic dependency between layers: the data structures (specifically layouts) that local RAID implementation is required to share with SNS are interpreted by Motr back-end that itself uses containers to store its data. + ++ Snapshots. Containers are used for data and meta-data snap shotting. 
When a local snapshot is made as a part of object or system level snap shotting, a container involved in the snapshot is COW-ed, so that all updates are re-directed to a new container. + ++ After a container is COW-ed no update should ever touch the blocks of the read-only primary container. This is a necessary prerequisite of a scalable fsck implementation, which will achieve reasonable confidence in system consistency by background scan and check of periodically taken global snapshots. + ++ Migration. A container (either read-only or read-write) can migrate to another node. Typical scenarios of such migration are bulk re-integration of updates from a proxy-server to a set of primary servers and moving snapshots. + ++ To make migration efficient, a container must be able to do a fast IO of its contents to another node. Specifically, it should be possible to send container contents over network without iterating through individual container records. This condition also implies that a container allocates space from its backing store in relatively large contiguous chunks. ++ Self-identification. A container has an identity that is stored in the container and migrated together with the container. Higher layers address container records by (container-id, key) pairs and such addresses remain valid across container migrations, including "migration" where a hard drive is pulled out of one server and inserted into another. In the latter case the system must be able to determine what containers were moved between nodes and update the configuration accordingly; it should also be able to determine whether any storage or layout invariants were violated, e.g., whether multiple units of the same parity group are now located on the same server. + ++ Layouts. Higher layers address and locate data and meta-data through container identifiers and container keys. On the other hand, a layout produces a location of a given piece of data or meta-data. Together this means that lowest level layouts produce locations in the form of (container-id, key) pairs. + ++ A container can be merged into another container. Inclusion of one container into another can be done for administrative reasons, to efficiently migrate a large number of smaller containers or for some other purpose. On inclusion, a container retains its identity (does it? Losing identity would require updating (and, hence, tracking) all references). + ++ Fids. A fid (file identifier) is an immutable globally and temporally unique file identifier. As a file meta-data record ("inode") is stored in some container, it's logical to use the (container-id, key) address of this record as the fid. Note that this is formally quite similar to the structure of a Lustre fid, composed of a sequence identifier and offset within the sequence. + ++ CLDB. A method is required to resolve container addresses to node identifiers. This method should work while containers migrate between nodes and merge with each other. The Container Location Data-Base (CLDB) is a distributed database tracking containers in the cluster. This database is updated transactionally on container migrations and merges (and splits? There must be splits for symmetry.). Note that CLDB supersedes the Fid Location Data-Base (FLDB), see above on fids. + ++ Data integrity. A container is a possible place to deal with data integrity issues. Alternatively this can be relegated to lower levels (self-checking device pairs) or higher levels (DMU-like file system with check-sums in meta-data).
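To make the (container-id, key) addressing described in the Containers.md items above concrete, here is a small illustrative sketch. Every type, field and function name below is invented for the example (none of it is actual Motr code), and the striping rule is a toy placeholder:

```c
#include <stdint.h>

/* Invented for illustration: the (container-id, key) pair that higher
 * layers use to address a record, insulated from block numbers. */
struct container_addr {
        uint64_t ca_container_id;  /* which container holds the record   */
        uint64_t ca_key;           /* key within the container namespace */
};

/* A fid can simply reuse the address of the file's meta-data record. */
struct example_fid {
        struct container_addr f_addr;
};

/*
 * A lowest-level layout maps (object, offset) to such an address; the
 * CLDB would then be consulted to resolve the container to a node.
 */
static struct container_addr example_layout_locate(uint64_t object_id,
                                                   uint64_t offset)
{
        struct container_addr addr = {
                .ca_container_id = object_id % 8,  /* toy striping rule    */
                .ca_key          = offset / 4096,  /* page offset as a key */
        };
        return addr;
}
```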
diff --git a/doc/trace.md b/doc/trace.md index 35b0cd4665e..48692b08f13 100644 --- a/doc/trace.md +++ b/doc/trace.md @@ -134,12 +134,12 @@ User-space Subsystem filtering is controlled in two ways: - 1. environment variable: + 1. environment variable: $ export M0_TRACE_IMMEDIATE_MASK='!rpc' $ ./utils/ut.sh - 2. CLI options for utils/ut: + 2. CLI options for utils/ut: -m string: trace mask, either numeric (HEX/DEC) or comma-separated list of subsystem names, use ! at the beginning to invert @@ -153,12 +153,12 @@ Subsystem filtering is controlled in two ways: Trace levels: - 1. environment variable: + 1. environment variable: export M0_TRACE_LEVEL=debug ./utils/ut.sh - 2. CLI options for utils/ut: + 2. CLI options for utils/ut: -e string: trace level: level[+][,level[+]] where level is one of call|debug|info|warn|error|fatal @@ -170,12 +170,12 @@ Trace levels: Trace print context: - 1. environment variable: + 1. environment variable: export M0_TRACE_PRINT_CONTEXT=none ./utils/ut.sh - 2. CLI options for utils/ut: + 2. CLI options for utils/ut: -p string: trace print context, values: none, func, short, full diff --git a/doc/workarounds.md b/doc/workarounds.md index 646564b5dcd..17d3cde7e00 100644 --- a/doc/workarounds.md +++ b/doc/workarounds.md @@ -1,7 +1,7 @@ List of workarounds for third-party libraries and external dependencies ======================================================================= -* `sem_timedwait(3)` from _glibc_ on _Centos_ >= 7.2 +* `sem_timedwait(3)` from _glibc_ on _Centos_ >= 7.2 **Problem**: `sem_timedwait(3)` returns `-ETIMEDOUT` immediately if `tv_sec` is greater than `gettimeofday(2) + INT_MAX`, that makes `m0_semaphore_timeddown(M0_TIME_NEVER)` @@ -17,10 +17,10 @@ List of workarounds for third-party libraries and external dependencies **Source**: `lib/user_space/semaphore.c: m0_semaphore_timeddown()` **References**: - - [CASTOR-1990: Different sem_timedwait() behaviour on real cluster node and EC2 node](https://jts.seagate.com/browse/CASTOR-1990) - - [Bug 1412082 - futex_abstimed_wait() always converts abstime to relative time](https://bugzilla.redhat.com/show_bug.cgi?id=1412082) + - [CASTOR-1990: Different sem_timedwait() behaviour on real cluster node and EC2 node](https://jts.seagate.com/browse/CASTOR-1990) + - [Bug 1412082 - futex_abstimed_wait() always converts abstime to relative time](https://bugzilla.redhat.com/show_bug.cgi?id=1412082) -* `sched_getcpu(3)` on KVM guest +* `sched_getcpu(3)` on KVM guest **Problem**: `sched_getcpu(3)` can return 0 on a KVM guest system regardless of cpu number. 
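As a hedged illustration of the `sched_getcpu(3)` issue just described (this is not the actual `processor_getcpu_init()` code), one way to detect the broken behaviour at runtime is to pin the calling thread to a non-zero CPU and check what `sched_getcpu()` reports:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <unistd.h>

/* Returns false if sched_getcpu() looks unreliable (e.g. broken KVM guest). */
static bool sched_getcpu_is_reliable(void)
{
        cpu_set_t saved;
        cpu_set_t test;
        bool      ok = true;

        if (sysconf(_SC_NPROCESSORS_ONLN) < 2)
                return true;               /* nothing to distinguish */
        if (sched_getaffinity(0, sizeof saved, &saved) != 0)
                return true;               /* cannot test, assume reliable */

        CPU_ZERO(&test);
        CPU_SET(1, &test);                 /* pin the thread to CPU 1 */
        if (sched_setaffinity(0, sizeof test, &test) == 0)
                ok = sched_getcpu() == 1;  /* broken guests keep reporting 0 */

        /* restore the original affinity mask */
        sched_setaffinity(0, sizeof saved, &saved);
        return ok;
}
```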
@@ -31,4 +31,4 @@ List of workarounds for third-party libraries and external dependencies **Source**: `lib/user_space/processor.c processor_getcpu_init()` **References**: - - [MOTR-2500: Motr panic: (locality == m0_locality_here()) at m0_locality_chores_run()](https://jts.seagate.com/browse/MOTR-2500) + - [MOTR-2500: Motr panic: (locality == m0_locality_here()) at m0_locality_chores_run()](https://jts.seagate.com/browse/MOTR-2500) diff --git a/fdmi/plugins/motr-fdmi-app.md index 1c15efade9b..542f89498a0 100644 --- a/fdmi/plugins/motr-fdmi-app.md +++ b/fdmi/plugins/motr-fdmi-app.md @@ -105,8 +105,8 @@ The executable binary file will be compiled as part of the initial Motr compilat The `fdmi_sample_plugin` application can be tested in two forms: -- Running the `fdmi_app` python script -- Running the `fdmi_plugin_st` shell script +- Running the `fdmi_app` python script +- Running the `fdmi_plugin_st` shell script For the first case, `fdmi_sample_plugin` communicates with the [fdmi_app]( https://github.com/Seagate/cortx-motr/blob/main/fdmi/plugins/fdmi_app) python script by printing to standard output all the FDMI records. @@ -129,12 +129,12 @@ In order to do that, we need to run the `fdmi_app` script typing in the console The basic arguments needed are the cluster info which will be picked by default from the `etc/motr/confd.xc` config file if not specified at the time of running. This way the FDMI plugin knows where to connect. Examples of the flags you can provide to the python script are: -- `-pp`: `plugin path` -- `-le`: `Local endpoint` -- `-fi`: `Filter id` -- `-ha`: `HA endpoint` -- `-pf`: `Profile fid` -- `-sf`: `Process fid` +- `-pp`: `plugin path` +- `-le`: `Local endpoint` +- `-fi`: `Filter id` +- `-ha`: `HA endpoint` +- `-pf`: `Profile fid` +- `-sf`: `Process fid` All the flags can be known by running the help:`-h` option. @@ -210,4 +210,4 @@ More details about the FDMI design and settings can be found in this link: ## Tested by -- Dec 7, 2021: Liana Valdes Rodriguez (liana.valdes@seagate.com / lvald108@fiu.edu) tested using CentOS Linus release 7.8.2003 x86_64 +- Dec 7, 2021: Liana Valdes Rodriguez (liana.valdes@seagate.com / lvald108@fiu.edu) tested using CentOS Linux release 7.8.2003 x86_64 diff --git a/hsm/README.md b/hsm/README.md index 0f93287d521..d33e17e5077 100644 --- a/hsm/README.md +++ b/hsm/README.md @@ -5,14 +5,14 @@ HSM stands for Hierarchical Storage Management.
The concept and design are discu For more information, see: -- [D3.1 HSM for SAGE: Concept and Architecture Report](https://github.com/Seagate/cortx-motr/blob/main/doc/PDF/SAGE_WP3_HSM_for_SAGE_Concept_and_Architecture_v1_Submitted_PUBLIC.pdf) -- [D3.5 HSM for SAGE: Validation Readiness Report](https://github.com/Seagate/cortx-motr/blob/main/doc/PDF/SAGE_D35_HSM_validation_readiness_PUBLIC.pdf) -- [D3.9 HSM for SAGE: Final Validation Report](https://github.com/Seagate/cortx-motr/blob/main/doc/PDF/SAGE_D3.9_HSM_final_v1.1_PUBLIC.pdf) +- [D3.1 HSM for SAGE: Concept and Architecture Report](https://github.com/Seagate/cortx-motr/blob/main/doc/PDF/SAGE_WP3_HSM_for_SAGE_Concept_and_Architecture_v1_Submitted_PUBLIC.pdf) +- [D3.5 HSM for SAGE: Validation Readiness Report](https://github.com/Seagate/cortx-motr/blob/main/doc/PDF/SAGE_D35_HSM_validation_readiness_PUBLIC.pdf) +- [D3.9 HSM for SAGE: Final Validation Report](https://github.com/Seagate/cortx-motr/blob/main/doc/PDF/SAGE_D3.9_HSM_final_v1.1_PUBLIC.pdf) The m0hsm tool available in this directory allows to create composite objects in Motr, write/read to/from them and move them between the tiers (pools). Here is how to use the tool: -1. Set the following environment variables: +1. Set the following environment variables: ```bash export CLIENT_PROFILE="<0x7000000000000001:0x480>" # profile id @@ -24,9 +24,9 @@ The m0hsm tool available in this directory allows to create composite objects in Profile id of the cluster and ha-agent address on your client node can be checked with `hctl status` command. As well as all addresses and processes ids configured in the cluster. Consult with the cluster system - administrator about which of them you can use. + administrator about which of them you can use. -2. Initialize the composite layout index: +2. Initialize the composite layout index: ```Text $ m0composite "$CLIENT_LADDR" "$CLIENT_HA_ADDR" "$CLIENT_PROFILE" "$CLIENT_PROC_FID" @@ -34,7 +34,7 @@ The m0hsm tool available in this directory allows to create composite objects in Note: this should be done one time only after the cluster bootstrap. -3. Configure pools ids of the tiers in ~/.hsm/config file: +3. 
Configure pools ids of the tiers in ~/.hsm/config file: ```Text M0_POOL_TIER1 = <0x6f00000000000001:0xc74> # NVME diff --git a/scripts/provisioning/README.md b/scripts/provisioning/README.md index af15254c2b2..6c342e4c8e7 100644 --- a/scripts/provisioning/README.md +++ b/scripts/provisioning/README.md @@ -1,24 +1,24 @@ Build and Test Environment for Motr =================================== -* [Quick Start (MacOS)](#quick-start-macos) -* [Quick Start (Windows)](#quick-start-windows) -* [Overview](#overview) -* [Requirements](#requirements) -* [DevVM provisioning](#devvm-provisioning) -* [Building and running Motr](#building-and-running-motr) -* [Try single-node Motr cluster](#try-single-node-motr-cluster) -* [Vagrant basics](#vagrant-basics) -* [Streamlining VMs creation and provisioning with snapshots](#streamlining-vms-creation-and-provisioning-with-snapshots) -* [Managing multiple VM sets with workspaces](#managing-multiple-vm-sets-with-workspaces) -* [Executing Ansible commands manually](#executing-ansible-commands-manually) -* [VirtualBox / VMware / Libvirt specifics](#virtualbox--vmware--libvirt-specifics) +* [Quick Start (MacOS)](#quick-start-macos) +* [Quick Start (Windows)](#quick-start-windows) +* [Overview](#overview) +* [Requirements](#requirements) +* [DevVM provisioning](#devvm-provisioning) +* [Building and running Motr](#building-and-running-motr) +* [Try single-node Motr cluster](#try-single-node-motr-cluster) +* [Vagrant basics](#vagrant-basics) +* [Streamlining VMs creation and provisioning with snapshots](#streamlining-vms-creation-and-provisioning-with-snapshots) +* [Managing multiple VM sets with workspaces](#managing-multiple-vm-sets-with-workspaces) +* [Executing Ansible commands manually](#executing-ansible-commands-manually) +* [VirtualBox / VMware / Libvirt specifics](#virtualbox--vmware--libvirt-specifics) Quick Start (MacOS) ------------------- -* Install - - [Homebrew](https://brew.sh/) +* Install + - [Homebrew](https://brew.sh/) ```bash /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" @@ -28,27 +28,27 @@ Quick Start (MacOS) brew install bash - - GNU `readlink` + - GNU `readlink` brew install coreutils - - [VMware Fusion](https://www.vmware.com/go/downloadfusion) or + - [VMware Fusion](https://www.vmware.com/go/downloadfusion) or [VirtualBox](https://www.virtualbox.org/wiki/Downloads) (VMware is recommended for better experience) - - [Vagrant](https://www.vagrantup.com/downloads.html) (and + - [Vagrant](https://www.vagrantup.com/downloads.html) (and [Vagrant VMware Utility](https://www.vagrantup.com/vmware/downloads) in case of VMware) - - Vagrant plugins (for VMware the license needs to be [purchased](https://www.vagrantup.com/vmware)) + - Vagrant plugins (for VMware the license needs to be [purchased](https://www.vagrantup.com/vmware)) vagrant plugin install vagrant-{env,hostmanager,scp} vagrant plugin install vagrant-vmware-desktop # for VMware or vagrant plugin install vagrant-vbguest # for VirtualBox - - Ansible + - Ansible brew install ansible # on Linux or macOS hosts -* Configure +* Configure - `m0vg` script (make sure you have `$HOME/bin` in the `$PATH`) ```bash @@ -56,7 +56,7 @@ Quick Start (MacOS) ln -s $MOTR_SRC/scripts/m0vg $HOME/bin/ ``` - - VMs + - VMs ```bash # open virtual cluster configuration file in default editor @@ -106,36 +106,36 @@ Quick Start (MacOS) see `m0vg params` output for the full list of supported configuration parameters -* Run - - check VMs state +* Run + - check VMs state m0vg 
status - - create _cmu_ VM (this can take ~30 minutes depending on the internet + - create _cmu_ VM (this can take ~30 minutes depending on the internet connection, CPU and system disk speed) m0vg up cmu - - restart _cmu_ VM in order to activate shared folder + - restart _cmu_ VM in order to activate shared folder m0vg reload cmu - - logon on _cmu_ and check contents of `/data` dir + - logon on _cmu_ and check contents of `/data` dir m0vg tmux ls /data - - create _ssu_ and _client_ VMs (can take about ~40 minutes depending on the + - create _ssu_ and _client_ VMs (can take about ~40 minutes depending on the number of configured _ssu_ and _client_ nodes) m0vg up /ssu/ /client/ m0vg reload /ssu/ /client/ - - stop all nodes when they're not needed to be running + - stop all nodes when they're not needed to be running m0vg halt - - if a node hangs (e.g. Motr crash in kernel or deadlock) it can be forced + - if a node hangs (e.g. Motr crash in kernel or deadlock) it can be forced to shutdown using `-f` option for `halt` command, for example: m0vg halt -f client1 @@ -143,35 +143,35 @@ Quick Start (MacOS) Quick Start (Windows) --------------------- -* Install - - [VMware Workstation](https://www.vmware.com/go/downloadworkstation) or +* Install + - [VMware Workstation](https://www.vmware.com/go/downloadworkstation) or [VirtualBox](https://www.virtualbox.org/wiki/Downloads) (VMware is recommended for better experience) - - [Vagrant](https://www.vagrantup.com/downloads.html) (and + - [Vagrant](https://www.vagrantup.com/downloads.html) (and [Vagrant VMware Utility](https://www.vagrantup.com/vmware/downloads) in case of VMware) - - Vagrant plugins (for VMware the license needs to be [purchased](https://www.vagrantup.com/vmware)) + - Vagrant plugins (for VMware the license needs to be [purchased](https://www.vagrantup.com/vmware)) vagrant plugin install vagrant-{env,hostmanager,scp} vagrant plugin install vagrant-vmware-desktop # for VMware or vagrant plugin install vagrant-vbguest # for VirtualBox - - [Git for Windows](https://git-scm.com/download/win) + - [Git for Windows](https://git-scm.com/download/win) During installation, when asked, choose the following options (keep other options to their default setting): - - _Use Git and optional Unix tools from the Command Prompt_ - - _Checkout as-is, commit Unix-style line ending_ - - _Enable symbolic links_ + - _Use Git and optional Unix tools from the Command Prompt_ + - _Checkout as-is, commit Unix-style line ending_ + - _Enable symbolic links_ -* Configure +* Configure - - Open _Git Bash_ terminal, add CRLF configuration option to make sure that Motr/Hare scripts can work on VM + - Open _Git Bash_ terminal, add CRLF configuration option to make sure that Motr/Hare scripts can work on VM ```bash git config --global core.autocrlf input ``` - - Clone Motr repository somewhere, just as an example let's say it's in `$HOME/src/motr`: + - Clone Motr repository somewhere, just as an example let's say it's in `$HOME/src/motr`: ```bash mkdir -p src @@ -179,7 +179,7 @@ Quick Start (Windows) git clone --recursive git@github.com:Seagate/cortx-motr.git motr ``` - - Create a persistent alias for `m0vg` script: + - Create a persistent alias for `m0vg` script: ```bash cat <> $HOME/.bash_profile @@ -191,9 +191,9 @@ Quick Start (Windows) Exit and re-launch _Git Bash_ terminal. At this point the setup should be complete. -* Run +* Run - - Follow the steps from _Run_ section under _Quick Start (MacOS)_ above. 
+ - Follow the steps from _Run_ section under _Quick Start (MacOS)_ above. > *NOTE*: during `m0vg up ` command execution you may be asked to enter > your Windows username and password, and then grant permissions for @@ -240,22 +240,22 @@ In order to run these scripts, additional tools have to be installed first. It's assumed that either _macOS_, _Windows_ or _Linux_ is used as a host operating system. -* Minimum Host OS - - 8GB of RAM - - 10GB of free disk space - - 2 CPU cores +* Minimum Host OS + - 8GB of RAM + - 10GB of free disk space + - 2 CPU cores -* Additional Software/Tools: - - [VMware Fusion](https://www.vmware.com/products/fusion.html) (for _macOS_) or +* Additional Software/Tools: + - [VMware Fusion](https://www.vmware.com/products/fusion.html) (for _macOS_) or [VMware Workstation](https://www.vmware.com/products/workstation-pro.html) (for _Windows_) _OR_ [VirtualBox](https://www.virtualbox.org/wiki/Downloads) (VMware is recommended for better experience in terms of memory utilisation) - `libvirt + qemu-kvm` (_Linux_ only) - - [Vagrant](https://www.vagrantup.com/downloads.html) - - [Vagrant VMware plugin](https://www.vagrantup.com/vmware) + - [Vagrant VMware Utility](https://www.vagrantup.com/vmware/downloads) (in case of VMware) - - [Ansible](https://github.com/ansible/ansible) (_macOS_ and _Linux_ only) - - [Git for Windows](https://git-scm.com/download/win) (_Windows_ only) + - [Vagrant](https://www.vagrantup.com/downloads.html) + - [Vagrant VMware plugin](https://www.vagrantup.com/vmware) + + [Vagrant VMware Utility](https://www.vagrantup.com/vmware/downloads) (in case of VMware) + - [Ansible](https://github.com/ansible/ansible) (_macOS_ and _Linux_ only) + - [Git for Windows](https://git-scm.com/download/win) (_Windows_ only) On _Ubuntu Linux_ all of the above prerequisites can be installed with a single command: