This repository has been archived by the owner on May 3, 2024. It is now read-only.

CORTX-33703- [Codacy code cleanup] occurrence: 862 pattern: Warn when the spacing between a list item’s bullet and its content violates. #2035

Merged
merged 8 commits on Aug 4, 2022
129 changes: 65 additions & 64 deletions doc/ADDB_Monitoring.md

Large diffs are not rendered by default.

11 changes: 5 additions & 6 deletions doc/CORTX-MOTR-ARCHITECTURE.md
@@ -46,7 +46,7 @@
+ extension interface
+ flexible transactions
+ open source
+ Portable: runs in user space on any version of Linux

# Data flow S3
+ cortx rpc: uses RDMA when available (requires kernel module)
@@ -86,14 +86,14 @@
+ Fast, scalable repair of device failures.
+ There are other layouts: composite.

# Index Layout
+ An index is a container of key-value pairs:
+ GET(key) -> val, PUT(key, val), DEL(key), NEXT(key) -> (key, val)
+ used to store meta-data: (key: "/etc/passwd:length", value: 8192)
+ Uses network RAID with parity de-clustering (same as objects), but with only N = 1 in N + K + S
+ X-way replication (N = 1, K = X - 1), each key is replicated independently
+ takes hardware topology into account (for free!)
+ fast scalable repair (for free!)

![image](./Images/7_Index_Layout.png)
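
The GET/PUT/DEL/NEXT contract listed above maps naturally onto a small C interface. The sketch below is purely illustrative and is not the actual Motr client API; all names (`kv_buf`, `kv_index`, `kv_index_ops`) are hypothetical.

```c
/* Hypothetical sketch of the index contract described above:
 * GET(key) -> val, PUT(key, val), DEL(key), NEXT(key) -> (key, val).
 * Illustrative names only; this is not the Motr client API. */
#include <stddef.h>

struct kv_buf {
        void   *b_addr;  /* buffer start */
        size_t  b_len;   /* buffer length in bytes */
};

struct kv_index;         /* opaque handle to one distributed index */

struct kv_index_ops {
        /* GET: look up the value stored under the key. */
        int (*get)(struct kv_index *ix, const struct kv_buf *key,
                   struct kv_buf *out_val);
        /* PUT: insert or replace the record (key, val). */
        int (*put)(struct kv_index *ix, const struct kv_buf *key,
                   const struct kv_buf *val);
        /* DEL: remove the record stored under the key. */
        int (*del)(struct kv_index *ix, const struct kv_buf *key);
        /* NEXT: return the first record whose key follows the given
         * key, enabling ordered iteration over the index. */
        int (*next)(struct kv_index *ix, const struct kv_buf *key,
                    struct kv_buf *out_key, struct kv_buf *out_val);
};
```

The meta-data example above (key "/etc/passwd:length", value 8192) would be a single put() on such an index; replication with N = 1, K = X - 1 then applies to each key independently, as noted.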

@@ -129,7 +129,7 @@
+ One of the most complex CORTX components
+ Scalable efficient transactions are hard
+ fortunately not everything is needed at once
+ staged implementation: DTM0 first

![image](./Images/10_DTM.png)

@@ -255,5 +255,4 @@
+ combine workloads
![image](./Images/15_ADDB_Advanced_Use_Case.png)

# Questions
24 changes: 12 additions & 12 deletions doc/End-to-end-Data-Integrity.md
@@ -1,22 +1,22 @@
# Motr end-to-end Data Integrity
## Design Highlights
+ Data of each target is divided into blocks of 4096 bytes.
+ A 64-bit checksum and a 64-bit tag are computed for each of these blocks at m0t1fs and sent over the wire.
+ The checksum for data blocks is computed with a checksum algorithm selected from the configuration.
+ Data integrity type and operations are initialized in m0_file.
+ Checksum values are computed for each block of data using do_sum(), and verified using do_chk().

![image](./Images/Write.PNG)

![image](./Images/Read.PNG)
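
The write/read paths in the figures rely on the two roles assigned above to do_sum() and do_chk(): compute one 64-bit checksum per 4096-byte block on the way out, recompute and compare on the way in. A minimal sketch follows; the checksum function is a toy stand-in (the real algorithm is selected from configuration) and every name here is hypothetical rather than Motr source.

```c
/* Illustrative per-block checksum pass: one 64-bit checksum per
 * 4096-byte block, mirroring the do_sum()/do_chk() roles described
 * above. di_toy_sum64() is a toy additive checksum, a stand-in for
 * the algorithm selected from configuration. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { DI_BLOCK_SIZE = 4096 };

static uint64_t di_toy_sum64(const unsigned char *blk, size_t len)
{
        uint64_t lo = 0, hi = 0;

        for (size_t i = 0; i < len; i++) {
                lo += blk[i];
                hi += lo;
        }
        return hi << 32 | (lo & 0xffffffff);
}

/* do_sum() role: compute one checksum per block (client side). */
static void di_sum(const unsigned char *buf, size_t nr_blocks,
                   uint64_t *cksum_out)
{
        for (size_t b = 0; b < nr_blocks; b++)
                cksum_out[b] = di_toy_sum64(buf + b * DI_BLOCK_SIZE,
                                            DI_BLOCK_SIZE);
}

/* do_chk() role: recompute and verify (e.g. at the IO service). */
static bool di_chk(const unsigned char *buf, size_t nr_blocks,
                   const uint64_t *cksum_in)
{
        for (size_t b = 0; b < nr_blocks; b++)
                if (cksum_in[b] != di_toy_sum64(buf + b * DI_BLOCK_SIZE,
                                                DI_BLOCK_SIZE))
                        return false;
        return true;
}
```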

## Current Status
### Completed
+ DI is computed at m0t1fs and sent over the wire.
+ After receiving a write fop, the checksum is recomputed and verified at the IO service.
### In progress
+ Block attributes m0_be_emap_seg:ee_battr are added to the BE segment. The m0_be_emap_seg:ee_val and ee_battr (when b_nob > 0) are stored in a btree.
+ Emap split for DI data.
+ Write DI data to extents while storing the data on disks (uses be_emap_split and in-place btree insert APIs).
+ Read DI data from extents while reading data from disks, and verify the checksum.
+ In SNS, verify the checksum while reading data units, and store DI data while writing.
30 changes: 15 additions & 15 deletions doc/Seagate-FDMI-Design-Notes.md
@@ -15,8 +15,8 @@ The processing is to be done on an incoming FOL record against the Filter set. T
Filter Store is to be filled in by means of Subscription API. Filter Index is updated internally on adding Filter.

Currently 2 options for Plug-in architecture are anticipated:
1. Option 1: FDMI plug-in. Each plug-in is linked with FDMI, making use of the internal FDMI API only (some callback, for instance). See Fig. 1.
1. Option 2: FDMI plug-in transforms to a Mero Core plug-in. Mero core in this case most likely provides limited features, for RPC only. Mero RPC is used to collect notifications from all Mero instances.

## Plug-in API

@@ -25,8 +25,8 @@ The API is to allow the plug-ins making use of FDMI to register with the latter
Besides, a plug-in may additionally inform FDMI about its specifics, like the FOL type it is to process, optimization/tweak requirements (e.g. no batching), etc.

Public Entries:
* Initialize plug-in
* <..>

**N.B.** Plug-in API design details depend on the selected architecture option (1 or 2).
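
As an illustration only, the "Initialize plug-in" entry could take a shape like the following under option 1 (in-process callback); every name here is hypothetical, pending the actual API design.

```c
/* Hypothetical shape of the plug-in registration entry described
 * above: the plug-in hands FDMI a processing callback plus its
 * specifics (FOL type of interest, optimization/tweak hints).
 * Not the actual FDMI API. */
#include <stdbool.h>
#include <stdint.h>

struct fdmi_rec;                     /* an incoming FOL record      */

struct fdmi_plugin_desc {
        const char *pd_name;         /* human-readable plug-in name */
        uint32_t    pd_fol_type;     /* FOL type it is to process   */
        bool        pd_no_batching;  /* optimization/tweak hint     */
        /* Callback invoked for each record handed to the plug-in
         * (architecture option 1: in-process callback). */
        int       (*pd_process)(const struct fdmi_rec *rec, void *arg);
        void       *pd_arg;          /* opaque context for callback */
};

/* "Initialize plug-in" public entry. */
int fdmi_plugin_register(const struct fdmi_plugin_desc *desc);
```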

@@ -49,18 +49,18 @@ Failure results in dropping current FOL copy
For both options, FOLs are fed to the Filter Processor the same way: locally.

Public Entries:
* <..>
* <..>


## Subscription API

The API is to provide a way for adding Filter rules to FDMI instances, identifying FOL processing traits, as well as associating a Filter with a Consumer.

Public Entries:
* Register Filter
* Unregister Filter
* <..>

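As a sketch only, the two named entries could look like this in C; the types, the names, and the JSON condition string are assumptions, not the actual Subscription API.

```c
/* Hypothetical sketch of the Subscription API entries listed above.
 * A filter couples a matching condition with the consumer to notify. */
#include <stdint.h>

struct fdmi_filter_id {
        uint64_t fi_hi;
        uint64_t fi_lo;
};

/* "Register Filter": store the filter condition (e.g. a JSON string,
 * per the filter-semantics options below), associate it with a
 * consumer endpoint, and return the filter's id. The filter is then
 * eventually spread to all cluster nodes running FDMI services. */
int fdmi_filter_register(const char *condition,
                         const char *consumer_ep,
                         struct fdmi_filter_id *id_out);

/* "Unregister Filter": withdraw the filter cluster-wide. */
int fdmi_filter_unregister(const struct fdmi_filter_id *id);
```
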
## Notification API

@@ -74,17 +74,17 @@ Option 2, Filter Processor always sends a Notification to Consumer using Mero RP

Public Entries:

* <..>
* <..>

## Assumptions

### Filter

Filter identifies:
* Consumer to be notified
* Conditions the FOL must meet
The index format: TBD

A Filter, once registered, will eventually be spread across all Cluster nodes running FDMI services.
@@ -94,8 +94,8 @@ The number of Filters registered in Mero Cluster is expected to be of no explici
### Filter semantics

Filter syntax needs to remain flexible to adopt any future improvements/enhancements, while fully covering the initial requirements of multi-type support and human readability. Possible options are:
* Native-language-like syntax (SQL-like)
* Symbolic object notation, easily parsed (adherence to some standard notation preferred) and easily extensible, including object nesting, e.g. JSON (current DSR’s choice)

NB: a Filter being parsed on ingestion may be transformed into a combination of elementary rules, in case the transformation does not change the Filter semantics but potentially improves Processor performance (decomposition is performed).

43 changes: 26 additions & 17 deletions doc/outdated/Containers.md
@@ -4,20 +4,29 @@ This document summarizes container discussions in the Motr architecture team. Th
Container is a low-level abstraction that insulates higher Motr layers from the knowledge of storage device addresses (block numbers). The bulk of data and meta-data is stored in containers.

# Items
+ A container is a storage for data or meta-data. There are several types of container: data container, meta-data container, possibly others.

+ A container provides a very simple interface: it has an internal namespace consisting of keys and provides methods to fetch and update records stored at a given key. Nomenclature of keys and constraints on record structure depend on container type. For data container, keys are logical page offsets and records are simply pages full of data without any internal structure. For meta-data container keys are opaque identifiers of meta-data records.

+ A container stores its contents in a backing store---a storage device. A container allocates space from its backing store and returns no longer used space back as necessary.

+ Local RAID is implemented by using a collection of containers to stripe data across. Note that local RAID is not implemented in the containers layer to avoid cyclic dependency between layers: the data structures (specifically layouts) that local RAID implementation is required to share with SNS are interpreted by Motr back-end that itself uses containers to store its data.

+ Snapshots. Containers are used for data and meta-data snapshotting. When a local snapshot is made as part of object or system level snapshotting, a container involved in the snapshot is COW-ed, so that all updates are re-directed to a new container.

+ After a container is COW-ed no update should ever touch the blocks of the read-only primary container. This is a necessary prerequisite of a scalable fsck implementation, that will achieve reasonable confidence in system consistency by background scan and check of periodically taken global snapshots.

+ Migration. A container (either read-only or read-write) can migrate to another node. Typical scenarios of such migration are bulk re-integration of updates from a proxy-server to a set of primary servers and moving snapshots.

+ To make migration efficient, a container must be able to do a fast IO of its contents to another node. Specifically, it should be possible to send container contents over network without iterating through individual container records. This condition also implies that a container allocates space from its backing store in relatively large contiguous chunks.

+ Self-identification. A container has an identity that is stored in the container and migrated together with the container. Higher layers address container records by (container-id, key) pairs and such addresses remain valid across container migrations, including "migration" where a hard drive is pulled out of one server and inserted into another. In the latter case the system must be able to determine what containers were moved between nodes and update configuration respectively; it should also be able to determine whether any storage or layout invariants were violated, e.g., whether multiple units of the same parity group are now located on the same server.

+ Layouts. Higher layers address and locate data and meta-data through container identifiers and container keys. On the other hand, a layout produces a location of a given piece of data or meta-data. Together this means that lowest level layouts produce locations in the form of (container-id, key) pairs.

+ A container can be merged into another container. Inclusion of one container into another can be done for administrative reasons, to efficiently migrate a large number of smaller containers, or for some other purpose. On inclusion, a container retains its identity (does it? Losing identity would require updating (and, hence, tracking) all references).

+ Fids. A fid (file identifier) is an immutable, globally and temporally unique file identifier. As a file meta-data record ("inode") is stored in some container, it's logical to use the (container-id, key) address of this record as the fid. Note that this is formally quite similar to the structure of a Lustre fid, composed of a sequence identifier and an offset within the sequence.

+ CLDB. A method is required to resolve container addresses to node identifiers. This method should work while containers migrate between nodes and merge with each other. Container Location Data-Base (CLDB) is a distributed data-base tracking containers in the cluster. This database is updated transactionally on container migrations and merges (and splits? There must be splits for symmetry.). Note that CLDB supersedes Fid Location Data-Base (FLDB), see above on fids.

+ Data integrity. A container is a possible place to deal with data integrity issues. Alternatively this can be relegated to lower levels (self-checking device pairs) or higher levels (DMU-like file system with check-sums in meta-data).
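
Taken together, the addressing scheme above can be pictured with a tiny C sketch; the names are hypothetical and only illustrate the (container-id, key) pair that higher layers, layouts, fids, and the CLDB all traffic in.

```c
/* Illustrative sketch of (container-id, key) addressing; names are
 * hypothetical. Addresses stay valid across container migration, so
 * the same pair can double as an immutable fid. */
#include <stdint.h>

struct cont_addr {
        uint64_t ca_container_id; /* which container holds the record */
        uint64_t ca_key;          /* key within container's namespace */
};

/* A fid is just the (container-id, key) address of the file's
 * meta-data record ("inode"). */
typedef struct cont_addr fid_t;

/* CLDB role: resolve a container to the node currently hosting it;
 * the mapping is updated transactionally on migration and merge. */
int cldb_locate(uint64_t container_id, uint64_t *node_id_out);
```
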
12 changes: 6 additions & 6 deletions doc/trace.md
@@ -134,12 +134,12 @@ User-space

Subsystem filtering is controlled in two ways:

1. environment variable:

$ export M0_TRACE_IMMEDIATE_MASK='!rpc'
$ ./utils/ut.sh

2. CLI options for utils/ut:

-m string: trace mask, either numeric (HEX/DEC) or comma-separated
list of subsystem names, use ! at the beginning to invert
@@ -153,12 +153,12 @@ Subsystem filtering is controlled in two ways:

Trace levels:

1. environment variable:

export M0_TRACE_LEVEL=debug
./utils/ut.sh

2. CLI options for utils/ut:

-e string: trace level: level[+][,level[+]] where level is one of
call|debug|info|warn|error|fatal
@@ -170,12 +170,12 @@ Trace levels:

Trace print context:

1. environment variable:

export M0_TRACE_PRINT_CONTEXT=none
./utils/ut.sh

2. CLI options for utils/ut:

-p string: trace print context, values: none, func, short, full

10 changes: 5 additions & 5 deletions doc/workarounds.md
@@ -1,7 +1,7 @@
List of workarounds for third-party libraries and external dependencies
=======================================================================

* `sem_timedwait(3)` from _glibc_ on _Centos_ >= 7.2

**Problem**: `sem_timedwait(3)` returns `-ETIMEDOUT` immediately if `tv_sec` is
greater than `gettimeofday(2) + INT_MAX`, which makes `m0_semaphore_timeddown(M0_TIME_NEVER)`
@@ -17,10 +17,10 @@ List of workarounds for third-party libraries and external dependencies
**Source**: `lib/user_space/semaphore.c: m0_semaphore_timeddown()`

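The fix itself sits in the collapsed part of this diff; as an assumption about its general shape, a clamping wrapper along the following lines avoids the glibc defect by keeping the absolute timeout within the representable window.

```c
/* Hypothetical clamping workaround for the glibc issue described
 * above: cap the absolute timeout so tv_sec never exceeds
 * gettimeofday() + INT_MAX, avoiding the immediate -ETIMEDOUT.
 * An assumption about the approach, not the actual Motr source. */
#include <limits.h>
#include <semaphore.h>
#include <sys/time.h>
#include <time.h>

static int sem_timedwait_clamped(sem_t *sem, const struct timespec *abs)
{
        struct timeval  now;
        struct timespec ts = *abs;

        gettimeofday(&now, NULL);
        /* Clamp timeouts glibc cannot handle internally. */
        if (ts.tv_sec - now.tv_sec > INT_MAX)
                ts.tv_sec = now.tv_sec + INT_MAX;
        return sem_timedwait(sem, &ts);
}
```
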
**References**:
- [CASTOR-1990: Different sem_timedwait() behaviour on real cluster node and EC2 node](https://jts.seagate.com/browse/CASTOR-1990)
- [Bug 1412082 - futex_abstimed_wait() always converts abstime to relative time](https://bugzilla.redhat.com/show_bug.cgi?id=1412082)

* `sched_getcpu(3)` on KVM guest

**Problem**: `sched_getcpu(3)` can return 0 on a KVM guest system regardless of the actual CPU number.

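The solution is in the collapsed part of the diff; one plausible way to detect such a defect at startup (an illustrative assumption, not necessarily the actual `processor_getcpu_init()` check) is to pin the calling thread to a known CPU and see whether `sched_getcpu(3)` agrees.

```c
/* Illustrative self-test for the sched_getcpu(3) defect described
 * above: pin the calling thread to a known non-zero CPU and verify
 * that sched_getcpu() reports it; if not, the call is untrustworthy
 * and a fallback should be used. Assumption, not the Motr source. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

static bool getcpu_is_trustworthy(void)
{
        cpu_set_t saved, probe;
        int       cpu = -1;
        bool      ok;

        if (sched_getaffinity(0, sizeof saved, &saved) != 0)
                return false;
        /* Pick a non-zero CPU: a buggy sched_getcpu() that always
         * returns 0 would pass a test pinned to CPU 0. */
        for (int i = 1; i < CPU_SETSIZE; i++)
                if (CPU_ISSET(i, &saved)) {
                        cpu = i;
                        break;
                }
        if (cpu == -1)
                return true;   /* single-CPU system: 0 is correct */
        CPU_ZERO(&probe);
        CPU_SET(cpu, &probe);
        if (sched_setaffinity(0, sizeof probe, &probe) != 0)
                return false;
        ok = sched_getcpu() == cpu;
        sched_setaffinity(0, sizeof saved, &saved);  /* restore mask */
        return ok;
}
```
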
@@ -31,4 +31,4 @@
**Source**: `lib/user_space/processor.c processor_getcpu_init()`

**References**:
- [MOTR-2500: Motr panic: (locality == m0_locality_here()) at m0_locality_chores_run()](https://jts.seagate.com/browse/MOTR-2500)