Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Potential Requirements / Use Cases #43

Open
guo543 opened this issue Oct 2, 2024 · 1 comment
Open

Potential Requirements / Use Cases #43

guo543 opened this issue Oct 2, 2024 · 1 comment
Labels
design Language design discussion planning

Comments

@guo543
Copy link
Contributor

guo543 commented Oct 2, 2024

Collection of potential requirements and use cases:

Outer join with sparse data structures

  • How will we implement A + B where A, B are sparse matrices or vectors:

  • This is an outer join scenario, since we need to emit output for Ai, Bi, even if one of them is undefined. Right now we only reliably support inner joins (both Ai and Bi ≠ undefined) and perhaps some forms of left outer joins (Bi may be undefined but not Ai) via explicit lookups. How will we model this?

C codegen for SQL-like queries

Early abort of aggregations

  • How should we support early abort of aggregations?

  • For example ‘first’, ‘last’, ‘all’, ‘exists’ do not need to iterate the full range but can abort once the value is known. How will we do this reliably? A complication is that loops may be fused with other ops that need to run the full traversal.

Strategies for handling missing fields & error propagation with try-catch

  • Do we want to support a more explicit form of error propagation?

  • Right now we handle expressions like [data.*.optionalField] as a filter, where ‘optionalField’ may or may not be present and we return a list of all the items which had the field present. But sometimes we want to raise an error or do something different for the entire list if any item had the field missing.

  • We can encode this as forall(data..optionalField) & [data..optionalField], but it would perhaps be nicer if we could express this using a mechanism similar to try/catch, e.g. catch [data.*.optionalField || break] where ‘break’ would exit to the nearest enclosing ‘catch’. How would this work?

Type checking with optional data schema

  • We want to use user-provided schema of the input data to perform type inference to better aid code generation. Especially for the C backend. Knowing the concrete type and structure of data could reduce the overhead of having tagged values (types known at runtime) and have better representation for objects.

  • Internal vs external observable behavior: to pick the best internal datastructures we can rely to a large degree on program analysis (including type propagation), but user annotations will be needed when interacting with the outside world. Example: is the output of a query expected to be a structure of plain JS objects, or are we OK returning Red-Black-trees? What if we explicitly want a CSR matrix or a dense tensor in a specific format? What should such an interface look like?

  • Type/schema information could be conveyed in multiple ways. Ultimately we would like to have a general and expressive hints mechanism, but as a first step, we can use introduction forms like loadCSV(...) or loadCSRMatrix(...) to propagate explicit schema information from data loading instructions downwards to elimination forms.

  • Dealing with undefined values: one challenge is dealing with potentially missing results/undefined values. A simple strategy to start would be modeling types like MaybeNumber, Maybe[Number], or Number? to indicate that the value could be a number or undefined. One pitfall is to avoid types like Maybe[Maybe[Number]]

  • Substructural typing: as much as possible we would like to include substructural properties like "this can be modified in-place" in the typing information. As noted above under "internal vs external behavior", at the query boundary we need the programmer to tell us

  • Gradual typing behavior: the schema should be optional and the query should still be able to be compiled (with unknown schema) and executed as usual. If we have distinguished types such as Number and Number?, we can track conversions from one to the other and insert tagging/untagging or boxing/unboxing operations when generating code.

  • When to run typing: one question is whether to run type checking before or after extractAssignments — we can start either way but may need to be prepared to change later. Some pros and cons: before means we're dealing with tree-shaped code, after it's graph shaped. In tree form, an expression may occur many times and we may assign different types to each occurrence. When we do assign different types, this means the expression will end up being computed multiple times. In a graph, sharing is made explicit. We have to assign a single type, and we have precise information about all uses of the expression. This would allow us to make decisions such as "if this is used only once, use in-place updates" perhaps more easily.

@TiarkRompf TiarkRompf added planning design Language design discussion labels Oct 3, 2024
@guo543
Copy link
Contributor Author

guo543 commented Oct 9, 2024

Next steps for C codegen of SQL-like queries

  1. Implement loadCSV operation so we can do loadCSV(file, [{A: “int32”, B: "string"...}])
    The schema expression is evaluated in compile time instead of being evaluated like normal Rhyme expressions.
  2. Infer the fusion of statements.
  3. Support for strings (extracting string values and string comparison for filter expressions)
  4. Stateful csv output (as lines of text)
  5. Hash maps, group-by, join

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
design Language design discussion planning
Projects
None yet
Development

No branches or pull requests

2 participants