Skip to content

Type System RFC#55

Open
connortsui20 wants to merge 3 commits intodevelopfrom
ct/types
Open

Type System RFC#55
connortsui20 wants to merge 3 commits intodevelopfrom
ct/types

Conversation

@connortsui20
Copy link
Copy Markdown
Contributor

I'd like to revive this now as I want to create issues in the main Vortex repository that reference this.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Copy link
Copy Markdown
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're just trying to get something in, I'm happy to approve, but I do think the formalism could use some work.

Comment thread rfcs/0055-type-system.md Outdated

### Logical vs. Physical Types

A **logical type** (`DType`) describes what the data means, independent of how it is stored (e.g.,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what data means

This is true, but I think it is not very elucidating.

IMO, the important distinction between logical and physical types is that logical types define a set of operations and a set of laws. E.g. u32 supports arithmetic with the usual bounded arithmetic laws, e.g.

if a + b would not overflow,

a + b - b = a

The operations empower the user to manipulate the values. The laws empower the engine to simply execution.

Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It defines the mathematical domain of inhibitable values and mathematical functions over these domain match exactly with vortex function over these dtypes.

Also currently the DType defines a canonical canonical array, so I guess this also defines the interpretation of the values in the array in that domain?

Comment thread rfcs/0055-type-system.md
this separation, implementing `M` operations across `N` encodings requires `N * M` implementations.
With it, each encoding only needs to decompress itself and each operation only needs to target
decompressed forms, reducing the cost to `N + M`. See this
[blog post](https://spiraldb.com/post/logical-vs-physical-data-types) for more information.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this argument is compelling to users/consumers. It appeals to engineers implementing operators or encodings.

IMO, the value of the logical-physical distinction is: Vortex consumers may abstractly specify work in terms of logical operations on logical types. Their code doesn't need to change when someone ships a new physical type (e.g. update Vortex and you take advantage of NewFSST without changing any code).

Comment thread rfcs/0055-type-system.md
The critical property of a quotient type is that operations on it must be **well-defined**: they
must produce the same result regardless of which member of the class you operate on. Formally, if
`f : A → B` respects the equivalence relation (`a ~ a'` implies `f(a) = f(a')`), then `f` descends
to a well-defined function on the quotient `f' : A/~ → B`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section needs to say something about the operations which are defined on each logical type. AFAICT, it's important that the operations respect the equivalence relation. In particular, the sum of two arrays shouldn't be affected by the physical encoding.

I think the theorem we want is something like this. Elements from the same equivalence class must have equivalent results under every operation.

a, a': A/~
f: A/~ -> A/~

if a ~ a' then f(a) ~ f(a')
a, b, c, d: A/~
g: (A/~, A/~) -> A/~
if
    a ~ c
    b ~ d
then
    g(a, b) ~ g(c, d)

Comment thread rfcs/0055-type-system.md Outdated
all physical encodings into equivalence classes, where each class corresponds to a single logical
column of data.

A Vortex `DType` like `Primitive(I32, NonNullable)` **names** one of these equivalence classes. It
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels to me like an unusual formulation of the logical-physical distinction.

If i32 corresponds to exactly one equivalence class, then the elements of the equivalence class are encodings like "primitive i32" and "run end i32". That means the "well-defined" functions you describe above are functions on logical or physical types themselves: DType -> DType or Primitive -> RunEnd. I don't think that's what you intended because below you discuss functions like filter and take which are Array -> Array.


I think this whole thing might be more naturally formalized in category theory.

A category contains zero or more "objects" and zero or more "arrows" (also called morphisms) between objects. You often see diagrams like this:

a --> b
|
V
c

That's a category with three objects: a, b, and c and five morphisms: the one from a to b, the one from a to c, and an (undrawn) identity morphism for each object.


The point is that a category has not just elements (the "objects") but also functions (the "arrows"). That means I can make statements not just about equivalence classes of elements but also how functions are affected by those equivalence classes.

We have two categories Logical and Physical. Logical has three kinds of objects:

  1. The logical types (u32, u32?, i32, ...).
  2. Array objects: Array<u32>, Array<i32>, ...
  3. N-tuples thereof.

and has arrows for all our usual operations:

  1. scalar_at (e.g. from Array<u32> to u32).
  2. take (e.g. from (Array<utf8>, Array<u32>) to Array<utf8>).
  3. logical operations (e.g. pointwise-sum (Array<u32>, Array<u32>) -> Array<u32>).
  4. etc.

Physical also has three kinds of objects:

  1. The physical types (primitive<u32>, dict<u32>, dict<i32?>, runend<u32>, ...).
  2. Objects representing arrays of each logical type.
  3. N-tuples thereof.

It has the above operations (though defined on physical types) as well as: compress/decompress which
are arrows between physical types (e.g. run end encode is an arrow from primitive<u32> to
runend<u32>).

Now we can state your guarantee about physical and logical types precisely: there exists a functor
from Physical to Logical which sends each physical type to its logical type and lifts physical
functions to their logical equivalents. For this relationship to be a functor it must be the case
that arrow composition is preserved, so, in particular, this composition of arrows:

functor(encode ∘ scalar_at_0) = functor(encode) ∘ functor(scalar_at_0)

Encode doesn't change the logical type, so it has to map to the identity arrow. Simplifying, we get the statement we want

functor(encode ∘ scalar_at_0) = functor(scalar_at_0)

Which is maybe easier to read as a diagram. The paths on this diagram have to commute because primitive<u32>, dict<u32>, and runend<u32> map to the same object: logical_array<u32>.

Physical:

  dict<u32> -----scalar_at_0----+
    ^                           |
    |
 dict_encode                    |
    |                           V
primitive<u32> --scalar_at_0-- u32
    |                           ^
 runend_encode                  |
    |
    V                           |
 runend<u32> ----scalar_at_0----+

Logical:

logical_array<u32> --scalar_at_0--> u32

Comment thread rfcs/0055-type-system.md
Bool(BoolArray),
Primitive(PrimitiveArray),
Decimal(DecimalArray),
VarBinView(VarBinViewArray), // Note that both `Utf8` and `Binary` map to `VarBinView`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are thinking about changing this maybe.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't change the fact that we can have multiple dtypes map to the same physical encoding

…ction

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20
Copy link
Copy Markdown
Contributor Author

ok I'm just going to rewrite a bunch of this since part of it is outdated now and I might as well go deeper with the formalism

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants