- RFC Name: Datastructure Support
- RFC ID: 0008
- Start Data: 2016-07-15
- Owner: Mark Nunberg
- Current Status: ACCEPTED
This RFC provides higher level "datastructure" support based upon the subdocument API. This includes things like language-based support for integration between Couchbase and "lists", "collections", "maps", and other complex data structures for languages which support them.
Data structures might provide a nicer developer experience over the CRUD APIs, and perhaps bypass document construction per-se entirely for some use cases. The trajectory for this project may eventually lead to:
- Lightweight ODMs based on Couchbase/Subdoc
- New features in the subdoc core itself
- Provide better parity with Redis where it currently has more features than Subdoc
To provide proper datastructure support, several key issues must first be identified with respect to high level functionality. This refers to things like ordering, queuing and more.
Functionality may be divided into two categories:
-
Object-level functionality. This applies for languages which can implement various interfaces such as a generic
List
interface or a genericMap
interface. Examples of such languages are Java and C# -
Discreet functionality. This is for languages which do not have generic interfaces to implement. In such cases, a
List
is a concept which can be manipulated using various discreet functions such aslist_append
,list_prepend
and so on. Examples of such languages are Go and PHP.
The following sections include functionality applicable to one or both of the above categories. See below (in API reference) for required functionality.
Operations when using a list are fairly straightforward. The base representation of a list is a JSON array ([]
). Items are added to the beginning or end (or at a specific index) of the array.
Here are the operations required:
Fetch the entire list. When removing an item (while iterating over it), the CAS should also be used, in case the CAS of the item has changed:
iteration():
Items = mc_get("key");
For item in items: yield item
If a 'List' is not a first class object (i.e. datastructures are implemented using discreet functions), then it may suffice to simply document how to access the underlying JSON array.
Subdoc natively supports element access:
__getitem__( index):
Return subdoc_get("key", "[" + index + "]");
Spock will contain support for GET_COUNT access:
length():
Return subdoc_get_count("key", "");
Until then, SDKs should emulate this on the client side by fetching the list and counting the elements:
length():
Return len(get("key"))
Native subdoc ARRAY_APPEND
append(item Object):
subdoc_push_last("key", item)
Native subdoc ARRAY_PREPEND
prepend(item Object):
subdoc_array_push_first("key", item)
NOTE: This overlaps with the functionality of Sets
Spock will contain support for ARRAY_CONTAINS
, which will return the index of the item
contains(item String):
Rv = subdoc_contains("key", "", item);
If rv < 0: Return false
Return true
NOTE: This overlaps with the functionality of Sets
Use ARRAY_CONTAINS with CAS and then employ the returned index:
remove(value):
While true:
Rv = subdoc_contains("key", "", item)
If rv.index < 0: break # missing:
Cas = rv.cas
Try:
subdoc_remove(key, "[" + rv.index + "]")
return
Except CasMismatch:
continue
The JSON representation of a map is a JSON object ({}
). Items are added or removed to the object based on their keys.
Similar to List iteration
Native subdoc support
Check that a given key exists within the dictionary. This can be done ussing
the EXISTS
subdoc opcode.
Get only the keys of the map. This Should be done by fetching the entire map. Future server versions may offer native support for this
Get only the values of the map. This should be done by fetching the entire map.
Sets are represented as a JSON list ([]
). In essence they are derivatives of arrays with specialized add/remove operations.
Because of current implementation limitations, sets are limited to primitives only because uniqueness at the server level is done via a string comparison. Sets will then be used as an array.
Because subdoc does not have sorting functionality, sets will not be sorted.
Note: SDKs implementing a "Set" API may use value-types that are not primitives with the caveat that they document the fact that any sort of non-primitive comparison operation will take place on the client. This should not be permissible for "discreet"-style APIs, however.
See List::Iteration
Can be done using ARRAY_ADD_UNIQUE
Fetch the entire set (i.e. array)
See List::Contains
See List::Size
Queues are almost identical to lists, but provide a special pop()
method which safely returns and removes the item.
In JSON, queues are represented as JSON arrays.
See List::Iteration
See List::Prepend
See List::Contains
See List::Size
Retrieve array[-1]
and remove array[-1]
using CAS:
pop() -> item:
While true:
Rv = subdoc_get("key", "[-1]")
Value = rv.value
Try:
subdoc_remove("key", "[-1]", cas=rv.cas)
Return value
Except CasMismatch:
continue
The pop operation should not participate in the traditional mutate_in
chain, since it is a compound operation and its execution is not well defined within the context of the execution chain.
Put simply, we can't implement pop with the same atomicity constraints normally expected.
Datstructures supports only a single level of addressable nesting. In a sense, this makes it less powerful than subdocument, since the latter supports accessing up to 32 levels of nesting.
Of course, subdocument and datastructures can be used seamlessly in the same document.
Errors should be manifested and wrapped in the native language's errors for various conditions when possible. For example, an ENOENT in Memcached should result in a KeyError (Python) or OutOfBoundsException (Java). If the error is not a typical data error which cannot be translated natively, then the real Couchbase error should be returned, resulting in an exceptional circumstance. These data structures are ones of convenience, not of altering reality.
If these errors occur in the face of a compound operation they may be relayed to the user as normal. Because all the compound operations do not begin as destructive operations, it is always safe to break the execution chain in situ.
Data errors refer to errors which take place as a result of data not existing or not being within constraints. Within data errors there are fulldoc and subdoc errors.
Handle as normal (i.e. propagate). This would likely result from user intervention (i.e. purposefully deleting the document). This may also be thrown if the document does not exist, and on-demand creation is configurable.
- If the document is modified outside of datastructures (e.g. used to be a list and is now a dictionary),
- If a wrong input type was supplied
This will happen if the document is modified outside of datastructures, or if there is a bug within the datastructure implementation, or if the wrong type is supplied (e.g. non-primitive for set addition).
If a document becomes too big the current behavior should be to propagate the exception. In the future it may be possible to add BigList or BigMap implementations which would span multiple documents, however such implementations may not be entirely "free".
For SDKs which do not support the magic wrapping of native data types, data structures can still be implemented at a conceptual level by renaming sub-document API methods. The methods will follow the naming convention of ObjectVerb. This will let users intuitively know which methods are available on which data types.
The return types for all these methods should be of the normal 'Document' return type, or something that functions similarly to it. It would be nice if we had a specific return type called 'Document Fragment'. However that name is already reserved for a collection of 'document fragments'. The different name would just be for clarification, anyway.
Note that in this situation some primitive methods may need to be duplicated multiple times, one for each datatype. For example, Remove may have a ListRemove and DictRemove variant.
Mutator methods may also accept CAS and durability requirements in situations where doing so will not add to much programmatic difficulty. Because the goal of this feature is convenience, it should not sacrifice itself for flexibility.
<td><code>M(key).add_unique(key, value)</code>. See <code>MapAdd</code> for
how to handle the <code>create</code> argument.</td>
Datastructure API | Subdocument API |
---|---|
MapGet(String key, String mapkey) |
L(key).get(mapkey) |
MapRemove(String key, String mapkey) |
M(key).remove(mapkey) |
MapSize(String key) -> Int |
L(key).get_count . Pre-spock can do len(get(key)) |
MapAdd(String key, String mapkey, Object value, Bool createDoc) |
M(key).dict_upsert(mapkey, value, create_doc=createMap) .
Note that pre-spock does not have support for FLAG_MKDOC , and therefore the implementation
may look something like this:
|
ListGet(String key, Int index) |
L(key).get("", '[' + index + ']') |
ListAppend(String key, Object value, Bool createList) |
M(key).array_push_last("", value) . See notes on MapAdd for how to handle
the create option |
ListPrepend(String key, Object value, Bool createList) |
M(key).array_push_first("", value) . See notes on MapAdd for how to handle
the create option |
ListRemove(String key, Int index) |
M(key).remove('[' + index + ']') |
ListSet(String key, Int index, Object value) |
M(key).replace('[' + index + ']', value) |
ListSize(String key) -> Int |
See MapSize |
SetAdd(String key, String value, boolean createSet |
|
SetContains(String key, String value) -> Bool |
Currently not implemented on the server side. This will need to be
emulated on the client like so:
|
SetSize(String key) -> Int |
See MapSize |
SetRemove(String key, String value) |
|
QueuePush(String key, Object value, Bool createQueue) |
See ListPrepend |
QueuePop(String key) -> Object |
See Queue::Pop implementation above. Note that an empty queue should raise an exception of the type appropriate when native language-level queues are empty |
QueueSize(String key) -> Int |
ListSize(key) |
Note that operations whose server-side commands have not yet been implemented (NYI) are expected to implement this client-side.
Clients may also implement backwards-compatible functionality for features which are dependent on a specific server version.
The datastructure implementations will conform to the JDK's Collection API. Each collection will at least need the document's id and a reference to the Bucket object through which to access the subdoc API and in which to store the underlying document.Note the types are for now restricted to JSON simple types (String, boolean, Number, JsonArray, JsonObject...).
Below are quick examples that showcase this:
//we provide initial values for the list, so any pre-existing doc is overridden
List<String> names = new CouchbaseArrayList<String>("key", bucket, "foo", "bar");
if (names.contains("bar")) {
names.add(0, "baz");
}
for (String name : names) { //the collection has an iterator
System.out.println(name);
}
//prints baz, foo, bar
//this will rebuild the collection from existing document
List<String> namesFromDoc = new CouchbaseArrayList<String>("key", bucket);
for (String name : names) {
System.out.println(name);
}
//prints baz, foo, bar as well
Map<String,String> map = new CouchbaseMap<String>("key", bucket);
map.put("subkey", "subvalue");
Python does have a collections API. Two options are offered below, one uses the 'collections' API and the other uses an explicit function-based API. Unlike Java, python uses duck-typing APIs, so there is no predetermined set of operations which the SDK must
support. Generally however, index/key addressing and iteration support is desired.
dd = CouchbaseMap('a_map', cb)
dd['foo'] = 'bar'
print(cb.get('a_map'))
len(dd)
aa = CouchbaseList('a_list', cb)
aa.append('Hello')
aa.append('World')
aa.extend(['foo', 'bar', 'baz'])
len(aa)
ss = CouchbaseSet('a_set', cb)
ss.add('hello')
ss.add('hello')
len(ss)
ss.remove('hello')
qq = CouchbaseQueue('a_queue', cb)
qq.push(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
print(qq.pop())
print(qq.pop())
cb.map_set('docid', 'email', 'm@n.com', create=True)
cb.list_append('docid', 1, 2, 3, 4, 5, 6, create=True)
cb.queue_push('docid', 'job1234')
job = cb.queue_pop('docid').value
cb.set_remove('docid', value)
NOTE: Currently, only the functional API has been implemented
bucket.MapAdd("a_map", "mapkey", "stuff", false)
bucket.MapGet("a_map", "mapkey", &valOut)
bucket.ListGet("a_list", 0, &valOut)
bucket.SetAdd("a_set", "set_member", true)
bucket.SetRemove("a_list", "set_member");
var dictionary = new CouchbaseDictionary<string, Poco>(_bucket, key);
dictionary.Add("somekey1", new Poco { Name = "poco1" });
dictionary.Add("somekey2", new Poco { Name = "poco2" });
var result = dictionary.Remove("somekey2");
var collection = new CouchbaseList<Poco>(_bucket, key);
collection.Add(new Poco { Key = "poco1", Name = "Poco-pica" });
var item = collection[0];
var item = queue.Peek();
No, the ordering of sets is undefined. We only make a guarantee about membership.
The server-side subdoc implementation can only handle primitive containing arrays
for set operations (currently, only ADD_UNIQUE
). So technically speaking, you can
also add support for sets containing integers, booleans, natural numbers, and null
values.
If you are implementing a generic API (such that you cannot make a distinction at the API level between different set value types), you may implement handling for non-primitive set types, so long as you document that any operations on this set will need to go through a full-document fetch.
This is a situation which shouldn't happen in normal use, but in such a case the standard
E2BIG
error should just be bubbled up.
In the future, we may add more data types which are specifically suited to grow beyond 20MB
These ideas were floated while discussing datastructures, but were deemed out of scope for this feature. They are given a mention here:
-
Bitwise operations The ability to use integers rather than explicit JSON for boolean values. this may save storage and bandwidth. Requires server support.
-
Large Queues Many people require large scalable job queues. The queue currently implemented in datastructures/subdoc is simply an array with pop/push functionality and is not designed to contain more than several items, but rather on the order of tens or hundreds.
-
Large Data Structures As a superset of large queues, sometimes a large map would be nice.
-
Higher level set operations. Currently we only support things like adding and membership checking. A "proper" set API usually contains things like intersection, union, and so on.