Collection of potential requirements and use cases:
Outer join with sparse data structures
How will we implement A + B where A, B are sparse matrices or vectors:
This is an outer join scenario, since we need to emit an output for index i even if only one of A[i], B[i] is defined. Right now we only reliably support inner joins (both A[i] and B[i] ≠ undefined) and perhaps some forms of left outer joins (B[i] may be undefined but not A[i]) via explicit lookups. How will we model this?
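As a point of reference, here is a minimal plain-JS sketch of the behavior we want to compile to (not Rhyme syntax, and treating a missing entry as 0 is an assumption for addition): the union of both key sets drives the loop, so an output is emitted even when only one side is defined.

```js
// Sketch of outer-join semantics for A + B on sparse vectors,
// represented as plain objects keyed by index.
function addSparse(A, B) {
  const out = {}
  // Iterate the union of both key sets: emit an output even if
  // only one of A[i], B[i] is defined.
  for (const i of new Set([...Object.keys(A), ...Object.keys(B)])) {
    out[i] = (A[i] ?? 0) + (B[i] ?? 0)
  }
  return out
}

// addSparse({0: 1, 2: 5}, {2: 3, 7: 4})  =>  {0: 1, 2: 8, 7: 4}
```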
C codegen for SQL-like queries
How should we go about generating fast code for the SQL-like subset of Rhyme, similar to the low-level output of our LMS SQL engine (http://scala-lms.github.io/tutorials/query.html; a simple output example: https://github.com/scala-lms/tutorials/blob/master/src/out/query_t1gram2.check.c)?
It seems like with some basic type propagation and an input abstraction such as loadCSV(filename, schema), we should get close. The key is to avoid representing “rows” of data in materialized form; all values should be scalarized and mapped to local variables. Can we do this, and if not, what is missing?
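For illustration, a hand-written JS sketch of what scalarized code could look like for a simple filtered aggregation over CSV rows (the query, field names, and the trivial line parsing are all made up; a real backend would emit C): each field is read into a local variable and consumed immediately, and no per-row object is ever materialized.

```js
// Hypothetical scalarized output for a query along the lines of
// "sum the `value` column over all rows whose `flag` column equals 'A'".
function sumFlaggedValues(lines) {   // `lines` stands in for the CSV reader
  let total = 0
  for (const line of lines) {
    const comma = line.indexOf(",")
    const value = Number(line.slice(0, comma))   // field 1 -> local variable
    const flag  = line.slice(comma + 1)          // field 2 -> local variable
    if (flag === "A") total += value             // no row object allocated
  }
  return total
}

// sumFlaggedValues(["1,A", "2,B", "3,A"])  =>  4
```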
Early abort of aggregations
How should we support early abort of aggregations?
For example, ‘first’, ‘last’, ‘all’, and ‘exists’ do not need to iterate the full range but can abort once the value is known. How will we do this reliably? A complication is that loops may be fused with other operations that need to run the full traversal.
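A small JS sketch of the issue: a standalone ‘exists’ can break out of its loop as soon as the answer is known, but once it is fused with an aggregation that needs every element, the early exit has to be dropped (or the loops have to be split again).

```js
// Standalone 'exists': abort as soon as the answer is known.
function existsNegative(xs) {
  for (const x of xs) {
    if (x < 0) return true   // early abort
  }
  return false
}

// Fused with a sum that needs the full traversal: no early abort possible
// unless the two loops are de-fused.
function existsNegativeAndSum(xs) {
  let found = false, total = 0
  for (const x of xs) {
    if (x < 0) found = true  // cannot break here
    total += x               // the sum still needs every element
  }
  return { found, total }
}
```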
Strategies for handling missing fields & error propagation with try-catch
Do we want to support a more explicit form of error propagation?
Right now we handle expressions like [data.*.optionalField] as a filter, where ‘optionalField’ may or may not be present and we return a list of all the items which had the field present. But sometimes we want to raise an error or do something different for the entire list if any item had the field missing.
We can encode this as forall(data.*.optionalField) & [data.*.optionalField], but it would perhaps be nicer if we could express this using a mechanism similar to try/catch, e.g. catch [data.*.optionalField || break], where ‘break’ would exit to the nearest enclosing ‘catch’. How would this work?
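To make the two behaviors concrete, here is a plain-JS sketch (function names are hypothetical, and `undefined` stands in for whatever the enclosing ‘catch’ would produce): the first matches the current filter semantics, the second is roughly what the catch/break form would desugar to.

```js
// Current semantics: [data.*.optionalField] filters out items missing the field.
function collectPresent(data) {
  return Object.values(data)
    .filter(d => d.optionalField !== undefined)
    .map(d => d.optionalField)
}

// Proposed semantics: abandon the whole list if any item is missing the field.
function collectOrBreak(data) {
  const out = []
  for (const d of Object.values(data)) {
    if (d.optionalField === undefined) return undefined  // 'break' to the enclosing 'catch'
    out.push(d.optionalField)
  }
  return out
}
```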
Type checking with optional data schema
We want to use a user-provided schema for the input data to perform type inference and better aid code generation, especially for the C backend. Knowing the concrete type and structure of the data could reduce the overhead of tagged values (types known only at runtime) and allow a better representation for objects.
Internal vs external observable behavior: to pick the best internal data structures we can rely to a large degree on program analysis (including type propagation), but user annotations will be needed when interacting with the outside world. Example: is the output of a query expected to be a structure of plain JS objects, or are we OK returning red-black trees? What if we explicitly want a CSR matrix or a dense tensor in a specific format? What should such an interface look like?
Type/schema information could be conveyed in multiple ways. Ultimately we would like to have a general and expressive hints mechanism, but as a first step, we can use introduction forms like loadCSV(...) or loadCSRMatrix(...) to propagate explicit schema information from data loading instructions downwards to elimination forms.
Dealing with undefined values: one challenge is dealing with potentially missing results/undefined values. A simple strategy to start with would be modeling types like MaybeNumber, Maybe[Number], or Number? to indicate that a value could be either a number or undefined. One pitfall to avoid is nested types like Maybe[Maybe[Number]].
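One simple way to rule out Maybe[Maybe[Number]] is to make the Maybe constructor idempotent in whatever type representation we pick; a sketch with a made-up representation:

```js
// Hypothetical type representation: { base, maybe } where `maybe` records
// whether undefined is a possible value.
const number = { base: "number", maybe: false }

// Idempotent constructor: Maybe[Maybe[T]] collapses to Maybe[T].
function maybe(t) {
  return t.maybe ? t : { ...t, maybe: true }
}

// maybe(number)         =>  { base: "number", maybe: true }
// maybe(maybe(number))  =>  { base: "number", maybe: true }
```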
Substructural typing: as much as possible we would like to include substructural properties like "this can be modified in-place" in the typing information. As noted above under "internal vs external behavior", at the query boundary we need the programmer to tell us what is expected (e.g. whether inputs may be mutated or results may use internal representations).
Gradual typing behavior: the schema should be optional, and the query should still compile (with an unknown schema) and execute as usual. If we have distinguished types such as Number and Number?, we can track conversions from one to the other and insert tagging/untagging or boxing/unboxing operations when generating code.
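A sketch of where such conversions would sit, with hypothetical helper names: with a known schema the generated code can use a value directly, while a Number? (or a value with unknown schema) gets a check at the point of use.

```js
// Hypothetical conversion helper the code generator could emit.
// Number? -> Number: check the tag and fail loudly if the value is missing.
function untagNumber(x) {
  if (typeof x !== "number") throw new Error("expected a number, got: " + x)
  return x
}

// With schema:     total += rowValue
// Without schema:  total += untagNumber(rowValue)
// i.e. the same query still compiles and runs, just with extra checks inserted.
```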
When to run typing: one question is whether to run type checking before or after extractAssignments; we can start either way but should be prepared to change later. Some pros and cons: before means we are dealing with tree-shaped code; after, it is graph-shaped. In tree form, an expression may occur many times and we may assign different types to each occurrence. When we do assign different types, the expression will end up being computed multiple times. In a graph, sharing is made explicit: we have to assign a single type, and we have precise information about all uses of the expression. This would allow us to make decisions such as "if this is used only once, use in-place updates" perhaps more easily.
Implement the loadCSV operation so we can write loadCSV(file, [{A: "int32", B: "string", ...}])
The schema expression is evaluated at compile time instead of being evaluated like normal Rhyme expressions.
Infer the fusion of statements.
Support for strings (extracting string values and string comparison for filter expressions)