Type System RFC by connortsui20 · Pull Request #55 · vortex-data/rfcs

connortsui20 · 2026-04-27T19:41:47Z

I'd like to revive this now as I want to create issues in the main Vortex repository that reference this.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

danking

If you're just trying to get something in, I'm happy to approve, but I do think the formalism could use some work.

danking · 2026-04-27T20:17:13Z

+
+### Logical vs. Physical Types
+
+A **logical type** (`DType`) describes what the data means, independent of how it is stored (e.g.,


what data means

This is true, but I think it is not very elucidating.

IMO, the important distinction between logical and physical types is that logical types define a set of operations and a set of laws. E.g. u32 supports arithmetic with the usual bounded arithmetic laws, e.g.

if a + b would not overflow, a + b - b = a

The operations empower the user to manipulate the values. The laws empower the engine to simply execution.

It defines the mathematical domain of inhibitable values and mathematical functions over these domain match exactly with vortex function over these dtypes.

Also currently the DType defines a canonical canonical array, so I guess this also defines the interpretation of the values in the array in that domain?

danking · 2026-04-27T20:23:10Z

+this separation, implementing `M` operations across `N` encodings requires `N * M` implementations.
+With it, each encoding only needs to decompress itself and each operation only needs to target
+decompressed forms, reducing the cost to `N + M`. See this
+[blog post](https://spiraldb.com/post/logical-vs-physical-data-types) for more information.


I don't think this argument is compelling to users/consumers. It appeals to engineers implementing operators or encodings.

IMO, the value of the logical-physical distinction is: Vortex consumers may abstractly specify work in terms of logical operations on logical types. Their code doesn't need to change when someone ships a new physical type (e.g. update Vortex and you take advantage of NewFSST without changing any code).

danking · 2026-04-27T20:38:08Z

+The critical property of a quotient type is that operations on it must be **well-defined**: they
+must produce the same result regardless of which member of the class you operate on. Formally, if
+`f : A → B` respects the equivalence relation (`a ~ a'` implies `f(a) = f(a')`), then `f` descends
+to a well-defined function on the quotient `f' : A/~ → B`.


I think this section needs to say something about the operations which are defined on each logical type. AFAICT, it's important that the operations respect the equivalence relation. In particular, the sum of two arrays shouldn't be affected by the physical encoding.

I think the theorem we want is something like this. Elements from the same equivalence class must have equivalent results under every operation.

a, a': A/~ f: A/~ -> A/~ if a ~ a' then f(a) ~ f(a')

a, b, c, d: A/~ g: (A/~, A/~) -> A/~ if a ~ c b ~ d then g(a, b) ~ g(c, d)

danking · 2026-04-27T21:48:50Z

+all physical encodings into equivalence classes, where each class corresponds to a single logical
+column of data.
+
+A Vortex `DType` like `Primitive(I32, NonNullable)` **names** one of these equivalence classes. It


This feels to me like an unusual formulation of the logical-physical distinction.

If i32 corresponds to exactly one equivalence class, then the elements of the equivalence class are encodings like "primitive i32" and "run end i32". That means the "well-defined" functions you describe above are functions on logical or physical types themselves: DType -> DType or Primitive -> RunEnd. I don't think that's what you intended because below you discuss functions like filter and take which are Array -> Array.

I think this whole thing might be more naturally formalized in category theory.

A category contains zero or more "objects" and zero or more "arrows" (also called morphisms) between objects. You often see diagrams like this:

a --> b | V c

That's a category with three objects: a, b, and c and five morphisms: the one from a to b, the one from a to c, and an (undrawn) identity morphism for each object.

The point is that a category has not just elements (the "objects") but also functions (the "arrows"). That means I can make statements not just about equivalence classes of elements but also how functions are affected by those equivalence classes.

We have two categories Logical and Physical. Logical has three kinds of objects:

The logical types (u32, u32?, i32, ...).

Array objects: Array<u32>, Array<i32>, ...

N-tuples thereof.

and has arrows for all our usual operations:

scalar_at (e.g. from Array<u32> to u32).

take (e.g. from (Array<utf8>, Array<u32>) to Array<utf8>).

logical operations (e.g. pointwise-sum (Array<u32>, Array<u32>) -> Array<u32>).

etc.

Physical also has three kinds of objects:

The physical types (primitive<u32>, dict<u32>, dict<i32?>, runend<u32>, ...).

Objects representing arrays of each logical type.

N-tuples thereof.

It has the above operations (though defined on physical types) as well as: compress/decompress which
are arrows between physical types (e.g. run end encode is an arrow from primitive<u32> to
runend<u32>).

Now we can state your guarantee about physical and logical types precisely: there exists a functor
from Physical to Logical which sends each physical type to its logical type and lifts physical
functions to their logical equivalents. For this relationship to be a functor it must be the case
that arrow composition is preserved, so, in particular, this composition of arrows:

functor(encode ∘ scalar_at_0) = functor(encode) ∘ functor(scalar_at_0)

Encode doesn't change the logical type, so it has to map to the identity arrow. Simplifying, we get the statement we want

functor(encode ∘ scalar_at_0) = functor(scalar_at_0)

Which is maybe easier to read as a diagram. The paths on this diagram have to commute because primitive<u32>, dict<u32>, and runend<u32> map to the same object: logical_array<u32>.

Physical:

dict<u32> -----scalar_at_0----+ ^ | | dict_encode | | V primitive<u32> --scalar_at_0-- u32 | ^ runend_encode | | V | runend<u32> ----scalar_at_0----+

Logical:

logical_array<u32> --scalar_at_0--> u32

joseph-isaacs · 2026-04-28T09:09:20Z

+    Bool(BoolArray),
+    Primitive(PrimitiveArray),
+    Decimal(DecimalArray),
+    VarBinView(VarBinViewArray), // Note that both `Utf8` and `Binary` map to `VarBinView`.


we are thinking about changing this maybe.

It won't change the fact that we can have multiple dtypes map to the same physical encoding

…ction Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 · 2026-04-28T15:21:47Z

ok I'm just going to rewrite a bunch of this since part of it is outdated now and I might as well go deeper with the formalism

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

revive the Type System RFC

2b79836

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 requested review from danking, joseph-isaacs and mprammer April 27, 2026 19:41

connortsui20 deployed to rfc-preview April 27, 2026 19:41 — with GitHub Actions View deployment

connortsui20 deployed to github-pages April 27, 2026 19:42 — with GitHub Actions View deployment

connortsui20 requested a review from gatesn April 27, 2026 21:28

danking requested changes Apr 27, 2026

View reviewed changes

joseph-isaacs reviewed Apr 28, 2026

View reviewed changes

better explanation of equivalence classes and logical-physical distin…

3a2006e

…ction Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 deployed to rfc-preview April 28, 2026 14:54 — with GitHub Actions View deployment

update canonicalization and equivalence explanation

385f750

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 deployed to rfc-preview April 28, 2026 15:28 — with GitHub Actions View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Type System RFC#55

Type System RFC#55
connortsui20 wants to merge 3 commits intodevelopfrom
ct/types

connortsui20 commented Apr 27, 2026

Uh oh!

danking left a comment

Uh oh!

danking Apr 27, 2026

Uh oh!

joseph-isaacs Apr 28, 2026 •

edited

Loading

Uh oh!

danking Apr 27, 2026

Uh oh!

danking Apr 27, 2026

Uh oh!

danking Apr 27, 2026

Uh oh!

joseph-isaacs Apr 28, 2026

Uh oh!

connortsui20 Apr 28, 2026

Uh oh!

connortsui20 commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants


		### Logical vs. Physical Types

		A logical type (`DType`) describes what the data means, independent of how it is stored (e.g.,

Conversation

connortsui20 commented Apr 27, 2026

Uh oh!

danking left a comment

Choose a reason for hiding this comment

Uh oh!

danking Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

joseph-isaacs Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danking Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

danking Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

danking Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

joseph-isaacs Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

connortsui20 Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

connortsui20 commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joseph-isaacs Apr 28, 2026 •

edited

Loading