Type System RFC#55
Conversation
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
danking
left a comment
There was a problem hiding this comment.
If you're just trying to get something in, I'm happy to approve, but I do think the formalism could use some work.
|
|
||
| ### Logical vs. Physical Types | ||
|
|
||
| A **logical type** (`DType`) describes what the data means, independent of how it is stored (e.g., |
There was a problem hiding this comment.
what data means
This is true, but I think it is not very elucidating.
IMO, the important distinction between logical and physical types is that logical types define a set of operations and a set of laws. E.g. u32 supports arithmetic with the usual bounded arithmetic laws, e.g.
if a + b would not overflow,
a + b - b = a
The operations empower the user to manipulate the values. The laws empower the engine to simply execution.
There was a problem hiding this comment.
It defines the mathematical domain of inhibitable values and mathematical functions over these domain match exactly with vortex function over these dtypes.
Also currently the DType defines a canonical canonical array, so I guess this also defines the interpretation of the values in the array in that domain?
| this separation, implementing `M` operations across `N` encodings requires `N * M` implementations. | ||
| With it, each encoding only needs to decompress itself and each operation only needs to target | ||
| decompressed forms, reducing the cost to `N + M`. See this | ||
| [blog post](https://spiraldb.com/post/logical-vs-physical-data-types) for more information. |
There was a problem hiding this comment.
I don't think this argument is compelling to users/consumers. It appeals to engineers implementing operators or encodings.
IMO, the value of the logical-physical distinction is: Vortex consumers may abstractly specify work in terms of logical operations on logical types. Their code doesn't need to change when someone ships a new physical type (e.g. update Vortex and you take advantage of NewFSST without changing any code).
| The critical property of a quotient type is that operations on it must be **well-defined**: they | ||
| must produce the same result regardless of which member of the class you operate on. Formally, if | ||
| `f : A → B` respects the equivalence relation (`a ~ a'` implies `f(a) = f(a')`), then `f` descends | ||
| to a well-defined function on the quotient `f' : A/~ → B`. |
There was a problem hiding this comment.
I think this section needs to say something about the operations which are defined on each logical type. AFAICT, it's important that the operations respect the equivalence relation. In particular, the sum of two arrays shouldn't be affected by the physical encoding.
I think the theorem we want is something like this. Elements from the same equivalence class must have equivalent results under every operation.
a, a': A/~
f: A/~ -> A/~
if a ~ a' then f(a) ~ f(a')
a, b, c, d: A/~
g: (A/~, A/~) -> A/~
if
a ~ c
b ~ d
then
g(a, b) ~ g(c, d)
| all physical encodings into equivalence classes, where each class corresponds to a single logical | ||
| column of data. | ||
|
|
||
| A Vortex `DType` like `Primitive(I32, NonNullable)` **names** one of these equivalence classes. It |
There was a problem hiding this comment.
This feels to me like an unusual formulation of the logical-physical distinction.
If i32 corresponds to exactly one equivalence class, then the elements of the equivalence class are encodings like "primitive i32" and "run end i32". That means the "well-defined" functions you describe above are functions on logical or physical types themselves: DType -> DType or Primitive -> RunEnd. I don't think that's what you intended because below you discuss functions like filter and take which are Array -> Array.
I think this whole thing might be more naturally formalized in category theory.
A category contains zero or more "objects" and zero or more "arrows" (also called morphisms) between objects. You often see diagrams like this:
a --> b
|
V
c
That's a category with three objects: a, b, and c and five morphisms: the one from a to b, the one from a to c, and an (undrawn) identity morphism for each object.
The point is that a category has not just elements (the "objects") but also functions (the "arrows"). That means I can make statements not just about equivalence classes of elements but also how functions are affected by those equivalence classes.
We have two categories Logical and Physical. Logical has three kinds of objects:
- The logical types (
u32,u32?,i32, ...). - Array objects:
Array<u32>,Array<i32>, ... - N-tuples thereof.
and has arrows for all our usual operations:
- scalar_at (e.g. from
Array<u32>tou32). - take (e.g. from (
Array<utf8>,Array<u32>) toArray<utf8>). - logical operations (e.g. pointwise-sum (
Array<u32>,Array<u32>) ->Array<u32>). - etc.
Physical also has three kinds of objects:
- The physical types (
primitive<u32>,dict<u32>,dict<i32?>,runend<u32>, ...). - Objects representing arrays of each logical type.
- N-tuples thereof.
It has the above operations (though defined on physical types) as well as: compress/decompress which
are arrows between physical types (e.g. run end encode is an arrow from primitive<u32> to
runend<u32>).
Now we can state your guarantee about physical and logical types precisely: there exists a functor
from Physical to Logical which sends each physical type to its logical type and lifts physical
functions to their logical equivalents. For this relationship to be a functor it must be the case
that arrow composition is preserved, so, in particular, this composition of arrows:
functor(encode ∘ scalar_at_0) = functor(encode) ∘ functor(scalar_at_0)
Encode doesn't change the logical type, so it has to map to the identity arrow. Simplifying, we get the statement we want
functor(encode ∘ scalar_at_0) = functor(scalar_at_0)
Which is maybe easier to read as a diagram. The paths on this diagram have to commute because primitive<u32>, dict<u32>, and runend<u32> map to the same object: logical_array<u32>.
Physical:
dict<u32> -----scalar_at_0----+
^ |
|
dict_encode |
| V
primitive<u32> --scalar_at_0-- u32
| ^
runend_encode |
|
V |
runend<u32> ----scalar_at_0----+
Logical:
logical_array<u32> --scalar_at_0--> u32
| Bool(BoolArray), | ||
| Primitive(PrimitiveArray), | ||
| Decimal(DecimalArray), | ||
| VarBinView(VarBinViewArray), // Note that both `Utf8` and `Binary` map to `VarBinView`. |
There was a problem hiding this comment.
we are thinking about changing this maybe.
There was a problem hiding this comment.
It won't change the fact that we can have multiple dtypes map to the same physical encoding
…ction Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
|
ok I'm just going to rewrite a bunch of this since part of it is outdated now and I might as well go deeper with the formalism |
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
I'd like to revive this now as I want to create issues in the main Vortex repository that reference this.