Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Serialization format #18

Closed
Geal opened this issue Mar 4, 2019 · 10 comments
Closed

Serialization format #18

Geal opened this issue Mar 4, 2019 · 10 comments

Comments

@Geal
Copy link
Contributor

Geal commented Mar 4, 2019

it will soon be time to define the serialization format for biscuit tokens. To sum up the current ideas:

  • using a binary format to make the token more compact
  • using symbol tables to avoid repeating symbols in the token
  • TAI64 for dates looks like a good idea
  • prefix varints to represent integers (the number of leading zeros indicates the number of bytes encoding the number) since most numbers will be very small
  • UTF8 for strings
  • add a version number to the format
  • the authority block should probably be separated from the other blocks
  • blocks carry an index that should be checked (because their order is important for the symbol table)
  • we might need two versions of the signature part: one that works with asymmetric crypto, one that works with symmetric crypto
  • beware formats that have no guarantee on field order in serialization, since we hash those fields. A workaround for this issue when appending new blocks: start from an already serialized token instead of reserializing everything
  • it might be better to have a format that can support nested messages. We have a wrapping structure that contains the blocks and signature, and if that one checks out, we deserialize the blocks
@tarcieri
Copy link
Collaborator

tarcieri commented Mar 4, 2019

Obligatory mention that Veriform is most of that (namely a binary format based on prefix varints which uses TAI64N for dates):

https://github.com/voucher/veriform
https://github.com/voucher/veriform/blob/master/spec/draft-veriform-spec.md

@Geal
Copy link
Contributor Author

Geal commented Mar 4, 2019

how hard to would it be to plug it into the current PoC?

@tarcieri
Copy link
Collaborator

tarcieri commented Mar 4, 2019

I don't know, you tell me. It's only partially complete and implements something close to a push parser

@Geal
Copy link
Contributor Author

Geal commented Mar 4, 2019

the current code only uses serde: https://github.com/CleverCloud/biscuit/blob/824db991d4b6919ebf0fce6d5c520c9ee7deaf4c/code/biscuit-poc/src/ser.rs#L16-L29
There are currently two separate deserializing stages:

  • deserialize the authority and blocks to raw byte vecs, and deserialize the signature
  • if the signature is correct (using hashes of authority and blocks), deserialize authority and blocks

@tarcieri
Copy link
Collaborator

tarcieri commented Mar 4, 2019

In that case CBOR sounds fine

@Geal
Copy link
Contributor Author

Geal commented Mar 5, 2019

as I've seen with the current tests, CBOR generates weird byte arrays:

00000000        a4 00 98 4c 18 a4 00 00 01 18 a1 00 18 84 18 65         �.�L.�....�..�.e
00000010        18 66 18 69 18 6c 18 65 18 31 18 64 18 72 18 65         .f.i.l.e.1.d.r.e
00000020        18 61 18 64 18 65 18 66 18 69 18 6c 18 65 18 32         .a.d.e.f.i.l.e.2
00000030        18 65 18 77 18 72 18 69 18 74 18 65 02 18 83 18         .e.w.r.i.t.e..�.
00000040        a2 00 04 01 18 83 18 82 00 00 18 82 00 07 18 82         �....�.�...�...�

But the packed representation is small enough anyway. I need at least 5 blocks to reach 1kB (with keys and signature)

@Geal Geal mentioned this issue Mar 18, 2019
@Geal
Copy link
Contributor Author

Geal commented Mar 26, 2019

CBOR generates weird looking data for serialized blocks, so I'm shopping around for other formats. I'd like a format that's known enough to have good quality implementations in various languages, can generate code or parse from a schema (one thing that would be easier to put in the spec), has guaranteed order on field serialization (I can work around this but it's annoying for hashes), and good support for nested messages.

Currently evaluating:

  • CBOR:
    • nesting looks weird
    • packed format does not look well specified
    • no guaranteed order
  • Protobuf:
    • good support for nested types
    • code generation
    • no guaranteed order
  • Avro:
    • provide a schema file and the lib knows how to parse it
    • no guaranteed order
    • nested messages?
  • Thrift:
    • generation from IDL files
    • no guaranteed order
    • nested structs?
  • MessagePack:
    • will carry key names in maps, so not really compact (although we could replace them with numbers)
    • no guaranteed order
  • capnproto:
    • field order corresponds to field declaration order in the schema (it even has a canonical form)
    • uses schemas to generate code
    • has a packed representation with less padding
    • no support for variable size integers

@Geal
Copy link
Contributor Author

Geal commented Mar 26, 2019

so, in my tests, I had for CBOR:

  • 1 block (only authority): 314 bytes
  • 2 blocks: 519 bytes
  • 3 blocks: 708 bytes
  • 10 blocks: 2031 bytes

With capnproto (normal serialization):

  • 1 block: 688 bytes
  • 2 blocks: 1184 bytes
  • 3 blocks: 1680 bytes
  • 10 blocks: 5152 bytes

With capnproto (packed serialization):

  • 1 block: 293 bytes
  • 2 blocks: 457 bytes
  • 3 blocks: 618 bytes
  • 10 blocks: 1765 bytes

so the lack of varints is not an issue with the size. It should be possible to access each block as a byte slice to hash them (I have not tested it yet), and we could make sure it's in canonical form.

Here's the example capnp schema I've built:

@0xa9970031d0caaf6d;

struct Biscuit {
  authority @0 :Block;
  blocks @1 :List(Block);
  # each element must be 32 bytes
  keys @2 :List(Data);
  signature @3 :Signature;
}

struct Block {
  index @0 :UInt32;
  symbols @1 :List(Text);
  facts @2 :List(Fact);
  caveats @3 :List(Rule);
}

struct Fact {
  predicate @0 :Predicate;
}

struct Rule {
  head @0 :Predicate;
  body @1 :List(Predicate);
  constraints @2 :List(Constraint);
}

struct Predicate {
  # symbol value
  name @0 :UInt32;
  ids @1 :List(ID);
}

struct Constraint {
  id @0 :UInt32;
  kind :union {
    int @1 :IntConstraint;
    str @2 :StringConstraint;
    date @3 :DateConstraint;
    symbol @4 :SymbolConstraint;
  }
}

struct ID {
  union {
    symbol @0 :UInt32;
    variable @1 :UInt32;
    integer @2 :Int32;
    str @3 :Text;
    date @4 :UInt64;
  }
}

struct IntConstraint {
  union {
    lower @0 :UInt64;
    larger @1 :UInt64;
    lowerOrEqual @2 :UInt64;
    largeOrEqual @3 :UInt64;
    equal @4 :UInt64;
    # must be a set (no duplicates)
    in @5 :List(UInt64);
    # must be a set (no duplicates)
    notIn @6 :List(UInt64);
  }
}

struct StringConstraint {
  union {
    prefix @0 :Text;
    suffix @1 :Text;
    equal @2 :Text;
    # must be a set (no duplicates)
    in @3 :List(Text);
    # must be a set (no duplicates)
    notIn @4 :List(Text);
  }
}

struct DateConstraint {
  union {
    before @0 :UInt64;
    after @1 :UInt64;
  }
}

struct SymbolConstraint {
  union {
    # must be a set (no duplicates)
    in @0 :List(UInt64);
    # must be a set (no duplicates)
    notIn @1 :List(UInt64);
  }
}

struct Signature {
  # each element must be 32 bytes
  gamma @0 :List(Data);
  # each element must be 32 bytes
  c @1 :List(Data);
  #must be 32 bytes
  w @2 :Data;
  #must be 32 bytes
  s @3 :Data;
}

@Geal
Copy link
Contributor Author

Geal commented Mar 28, 2019

Apparently, it will be difficult to access the raw bytes corresponding to a struct's field to hash it, so the main Biscuit struct will need to hold raw bytes buffers for authority and blocks, that will be parsed once the signature is verified. This puts the serialized size at:

  • 1 block: 291 bytes
  • 2 blocks: 463 bytes
  • 3 blocks: 635 bytes
  • 10 blocks: 1847 bytes

With protobuf, I have the following results:

  • 1 block (only authority): 276 bytes
  • 2 blocks: 457 bytes
  • 3 blocks: 638 bytes
  • 10 blocks: 1905 bytes

so it's roughly equivalent. The code generated by the prost crate is quite nice to use.

syntax = "proto2";

package biscuit.schema;

message Biscuit {
  required bytes authority = 1;
  repeated bytes blocks = 2;
  repeated bytes keys = 3;
  required Signature signature = 4;
}

message Signature {
  repeated bytes gamma = 1;
  repeated bytes c = 2;
  required bytes w = 3;
  required bytes s = 4;
}

message Block {
  required uint32 index = 1;
  repeated string symbols = 2;
  repeated Fact   facts = 3;
  repeated Rule   caveats = 4;
}

message Fact {
  required Predicate predicate = 1;
}

message Rule {
  required Predicate head = 1;
  repeated Predicate body = 2;
  repeated Constraint constraints = 3;
}

message Predicate {
  required uint32 name = 1;
  repeated ID ids = 2;
}


message ID {
  enum Kind {
    SYMBOL = 0;
    VARIABLE = 1;
    INTEGER = 2;
    STR = 3;
    DATE = 4;
  }

  required Kind kind = 1;
  optional uint32 symbol = 2;
  optional uint32 variable = 3;
  optional int32 integer = 4;
  optional string str = 5;
  optional uint64 date = 6;
}

message Constraint {
  required uint32 id = 1;

  enum Kind {
    INT = 0;
    STRING = 1;
    DATE = 2;
    SYMBOL = 3;
  }

  required Kind kind = 2;

  optional IntConstraint int = 3;
  optional StringConstraint str = 4;
  optional DateConstraint date = 5;
  optional SymbolConstraint symbol = 6;
}

message IntConstraint {
  enum Kind {
    LOWER = 0;
    LARGER = 1;
    LOWER_OR_EQUAL = 2;
    LARGER_OR_EQUAL = 3;
    EQUAL = 4;
    IN = 5;
    NOT_IN = 6;
  }

  required Kind kind = 1;

  optional int64 lower = 2;
  optional int64 larger = 3;
  optional int64 lower_or_equal = 4;
  optional int64 larger_or_equal = 5;
  optional int64 equal = 6;
  repeated int64 in_set = 7 [packed=true];
  repeated int64 not_in_set = 8 [packed=true];
}

message StringConstraint {
  enum Kind {
    PREFIX = 0;
    SUFFIX = 1;
    EQUAL = 2;
    IN = 3;
    NOT_IN = 4;
  }

  required Kind kind = 1;

  optional string prefix = 2;
  optional string suffix = 3;
  optional string equal = 4;
  repeated string in_set = 5;
  repeated string not_in_set = 6;
}

message DateConstraint {
  enum Kind {
    BEFORE = 0;
    AFTER = 1;
  }

  required Kind kind = 1;

  optional uint64 before = 2;
  optional uint64 after = 3;
}

message SymbolConstraint {
  enum Kind {
    IN = 0;
    NOT_IN = 1;
  }

  required Kind kind = 1;

  repeated uint32 in_set = 2;
  repeated uint32 not_in_set = 3;
}

@Geal
Copy link
Contributor Author

Geal commented Sep 13, 2019

ended up going with protobuf

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants