Jul 19, 2022
Welcome to the tutorial on Gogi (short for egglogish), a made-up language that attempts to generalize both Datalog and egg. This blog stems from a trick I learned from Nicholas Matsakis at PLMW 2022: To write a tutorial for a non-existing language. By doing this, I can get a sense of what I want from this new language as well as early feedbacks from others.
Why Gogi? The motivation behind Gogi is to find a good model for relational e-graphs that can take full advantage of (1) performance of relational e-matching and (2) expressiveness of Datalog, while (3) being compatiable with egg as well as (4) efficient. This is the first approach I described at the beginning of the previous post. I’m actually more excited about this approach, because I believe this is the right way in long term.
Gogi is Datalog, so it supports various reasoning expressible in
Datalog. A rule has the form
head1, ..., headn :- body1, ..., bodyn
. For example, below
is a valid Gogi program:
, string) from "./link.csv".
rel link(string, string).
rel tc(string
, b) :- link(a, b).
tc(a, b) :- link(a, c), tc(c, b). tc(a
TODO: However, Datalog by itself is not that interesting. So for the first part of the post, I will instead focus on the extensions that make Gogi interesting. Next, I’ll give some examples and show why Gogi generalizes egg I will also try to develop the operational and model semantics of Gogi.
In Gogi, every value is either a (semi)lattice value or a sort value.
Lattices in Gogi are algebraic structures with a binary join operator
(\(\lor\)) that is associative,
commutative, and idempotent and a default top \(\top\) where \(\top\lor e=\top\) for all \(e\). For example, standard types like
string
, i64
, and u64
in Gogi are
in fact trivial lattices with \(s_1\lor s_2
=\top\) for all \(s_1\neq s_2\).
In Gogi, \(\top\) means unresolvable
errors. Users can define their own lattices by providing a definition
for lattice join.
Similarly, users can define sorts. Unlike lattices, sorts are uninterpreted. As a result, sort values can only be created implicitly via functional dependency. We will go back to this point later.
Relations can be declared using the rel
keyword.
Moreover, it is possible to specify a functional dependency between
columns in Gogi. For example,
.
sort expr-> expr. rel num(i64)
declares a sort called expr
and a num
relation with two columns (i64, expr)
. In the
num
relation, each i64
uniquely determines the
remaining column (i.e., num(x, e1)
and
num(x, e2)
implies e1 = e2
). The
num
relation can be read as a function from
i64
to values in expr
. Similar declarations
are ubiquitous in Gogi to represent sort constructors.
As another example,
, expr) -> expr. rel add(expr
declares a relation with three columns, and the first two columns
together uniquely determines the third column. This represents a
constructor with two expr
arguments.
Users can introduce new sort values with functional dependencies. Example:
1, c). % equivalently, num(1, _).
num(2, d).
num(, d, e) :- num(1, c), num(2, d). add(c
This program is interesting and its semantics deviates from the one
in standard Datalog. In standard Datalog, this program will not compile
because variable c
in the first rule, d
in the
second rule, and e
in the third rule are not bound.
However, this is a valid program in Gogi. Thanks to functional
dependency, variables in the head do not necessarily have to be bound in
the bodies. Variables can be unbound as long as they can be inferred
from the functional dependency. The above Gogi program is roughly
equivalent to the following Datalog program:
1, c) :- !num(1, _), c = new_expr().
num(2, d) :- !num(2, _), d = new_expr().
num(, d, e) :- num(1, c), num(2, d), !add(c, d, _), e = new_expr() add(c
Negated atoms like !num(1, _)
is necessary here because
otherwise it will inserts more than one atoms matching
num(1, _)
, which violates the functional dependency
associated to the relation.
The above example Gogi program can also be written into one single rule with multiple heads:
, d, e), num(1, c), num(2, d).
add(c% roughly equivalent to
% add(c, d, e), num(1, c), num(2, d) :- !num(1, _), !num(2, _),
% c = new_expr(),
% d = new_expr(),
% !add(c, d, _),
% e = new_expr().
Gogi also supports the bracket syntax, so the last program can be further simplified to:
1], num[2]]. add[num[
The bracket syntax will implicitly fill the omitted column(s) with newly generated variable(s). If the atom is nested within another term, the nested atom will be lifted to the top-level, and the generated variable(s) will take the original position of the atom. Another silly example of the bracket syntax:
:- xor[xor[x]].
ans(x) % expands to
% ans(x) :- xor[y, z], xor(x, y, z)
% which expands to
% ans(x) :- xor(y, z, _), xor(x, y, z)
% this rule can be thought as
% for any expr x, y where `y xor (x xor y)`
% is present in the database, collect x as the result.
Finally, in equational reasoning a la egg, it is common to write
rules like “for every (a + b) + c
, populate
a + (b + c)
on the right and make them equivalent”. This
rule will look like the following:
, add[b, c], id) :- add(add[a, b], c, id). add(a
Gogi further has a syntactic sugar for these equational rules:
head := body if body1 ... bodyn
where both
head
and body
should use the bracket syntax
and omit the same number of columns. The if clause can be omitted. Gogi
will expand this syntactic sugar by unfolding the top-level bracket in
head
and body
with the same variable(s):
.
add[b, a] := add[a, b]% unfolds to add(b, a, id) :- add(a, b, id).
.
add[a, add[b, c]] := add[add[a, b], c]% unfolds to add(a, add[b, c], id) :- add(add[a, b], c, id).
1] := div[a, a] if num(x, a), x != 0.
num[% unfolds to num(1, id) :- div(a, a, id), num(x, a), x != 0.
Note the equational rules may introduce functional dependency
violation; for instance, last rule may cause multiple tuples to match
num(1, _)
, yet the first column should uniquely determines
the tuple. We will discuss more about how we resolve this kind of
violations in the section on Functional Dependency
Repair. The essential idea is that, if two sort values are present
with the same primary key, then the two sort values must be equivalent,
whereas if two lattice values are present with the same primary key, the
new, unique lattice value should generalize the two values, i.e., it
will be the least-upper bound of those lattice values.
The example relations we see so far mostly center around sort values. However, it is also possible and indeed very useful to define relations with lattices:
-> lmax(-2147483648).
rel hi(expr) -> lmin(2147482647). rel lo(expr)
To define a lattice column, a default value needs to be provided in the relation definition. The default value is not a lattice bottom: the bottom means do not exist. Meanwhile, the lattice top means there are conflicts. It is also possible for default value to refer to the determinant columns:
: i64) -> i64(i + 1). rel add1(i
The column initialization syntax should be reminiscent of C++’s member initializer lists.
In the above example, lo
and hi
together
define a range analysis for the expr
sort. This in facts
generalizes the e-class analyses in egg. Here are some rules for
hi
and lo
(stolen from Zach):
, n.into()) :- num(n, x).
hi(x, n.into()) :- num(n, x).
lo(x, n.negated()) :- hi(x, n), neg(x, nx).
lo(nx, n.negated()) :- lo(x, n), neg(x, nx).
hi(nx, 0) :- abs(x, absx).
lo(absx, absx), lo[x] >= 0.
lo[absx] := lo[x] if abs(x, absx), lo[x] >= 0.
hi[absx] := hi[x] if abs(x, lox + loy) :- lo(x, lox), hi(y, loy), add(x, y, xy).
lo(xy% can be further simplified to
% lo[xy] := lo[x] + lo[y] if add(x, y, xy)
Note here instead of
lo(neg[x], n.negated()) :- hi(x, n).
, we put the
neg
atom to the right-hand side and write
lo(nx, n.negated()) :- hi(x, n), neg(x, nx).
There are some
nuanced differences between the two rules. This rule, besides doing what
the second rule does, always populates a neg
tuple for each
hi
tuple even when it does not exist, so the first rule can
be viewed as an “annotation-only” version of the first rule, which is
usually what we want.
The last example shows e-class analyses in Gogi is composable (i.e., each analysis can freely refer to each other). This is one of the reason why we believe Gogi generalizes e-class analyses. Moreover, they can also interact with other non-lattice relations in a meaningful way: 1
, expr).
rel geq(expr
% ... some arithmetic rules ...
% need to convert to int because they are from different lattices
, b) :- lo[a].to_int() < hi[b].to_int().
geq(a% TODO: geq is quadratic in size.
% Gogi should support inlined relations
% to avoid materialize relations like geq
% ... other user-defined knowledge about geq
% x and abs[x] are equivalent when x > 0
, num[0]) x := abs[x] if geq(x
Diverging a little bit, it is even possible to write the above rules without using analyses / relations with lattices:
.
sort booltrue() -> bool.
rel false() -> bool.
rel , expr) -> bool.
rel geq(expr
% for each abs[x] exists, populate geq[x, 0],
% in the hope that later
% it will be "in the same e-class" as true[].
0] :- abs[x]
geq[x,
0] := true[] if num(x, numx), x > 0.
geq[numx, 0] := true[] if mul(x, x, xx).
geq[xx, % if x > 0 and y > 0 are both equivalent to true,
% then x + y > 0 is also equivalent to true.
0] := true[]
geq[xy, , y, xy),
if add(x, 0, true[]),
geq(x, 0, true[])
geq(y0] := true[] if add(mul[x, x], mul[y, y], xxyy).
geq[xxyy,
% ... other reasoning rules...
% if x>0 is equivalent to true,
% every abs[x] in the database should be equivalent to x
, 0, true[]) x := abs[x] if geq(x
This can be seen as implementing a small theorem prover in Gogi.
Whenever it sees abs[x]
, a query about
x >= 0
will be issued to the database. If later
x >= 0
is proven to be equivalent to true, a
distinguished sort value, abs[x]
will be put in the same
e-class as x
.
All these rewrite will be very hard to express in egg.
FDs can be violated: what if the user introduced two values for the
same set of determinant columns? In this case, we need to repair the
FDs. We have seen such examples many times in previous sections. For
example, rules like R[x1, ..., xk] := ...
will add new
values to R
indexed by x1, ..., xk
, and it is
likely that there are already other tuples with the form
R(x1, ..., xk, _)
. These rules may potentially cause
violation of functional dependencies. In general, there are two kinds of
violations:
Case 1. If the dependent column is a sort value, Gogi will unify the two sort values later in the iteration. We can think of a term of a sort in Gogi as a constant in some theories, which refers to some element in the model. But we don’t know which element it refers to. However, by repairing functional dependencies, we can get some clues about what the structure will look like. Consider the following program
, expr) -> expr.
rel add(expr-> expr.
rel num(i64)
% add the fact 2 + 1, where the last column is auto-generated.
2], num[1]].
add[num[
% add the fact 2+1, but the last column is add[num[1], num[2]]
% (add[num[1], num[2]] is created on the fly because
% it occurs at the left hand side.)
2], num[1],
add(num[1], num[2]]). add[num[
Because now (without repairing) add[num[1], num[2]]
will
contain two rows. The functional dependency is violated. If we think of
rewriting under functional dependency as a process of finding a model
for the sort, then what do we learn from this violation? We learned
that, to respect the functional dependency, the two sort values must be
the same thing! Therefore the expr originally referred by
add[num[2], num[1]]
and by add[num[1], num[2]]
will be treated as the same expr and no longer be distinguishable in
Gogi! As we will show later, when a Gogi program reaches the fixpoint,
it produces a valid, minimal model for the relations and the sorts such
that the rewrite rules and the functional dependencies are both
respected.
TODO: mention no global union-find
Case 2. What if the dependent column is a regular type as a Rust struct or an integer? Well, we also need to unify them, but in a different way. The idea here is to describe these values with a algebraic structure, which in this case is a lattice. A lattice has a bottom (means does not exist) and a top (means conflicts). Similar to Flix, lattice values will grow by taking the least upper bound of all the violating tuples. In that sense, Gogi also generalizes Flix (as is described in the PLDI ’16 paper).
This proposed extension takes inspiration from recent work on Ascent, an expressive Datalog engine that has seamless integration with the Rust ecosystem. One interesting feature of Ascent is that it allows first-class introspection of the column values. Ascent use this feature to support features like first-class environment (this and the next example are both from page 4 of the Ascent paper; comments are mine):
, rho2, a, tick(e, t, k)) <--
sigma(v?e@Ref(x), rho, a, t), // the environment rho is enumerated here
sigma(, ?Value(v, rho2)), // rho[x] is used as an index for store
store(rho[x], ?Kont(k)); store(a
One thing though is that Ascent allows enumerating structs as a
relation with the for
keyword. For example:
, rho2, store, a, tick(v, t,k)) <--
sigma(v?Ref(x), rho, store, a, t),
sigma(// enumerating store[&rho[x]]
for xv in store[&rho[x]].iter(), if let Value(v,rho2) = xv,
// enumerating store[a]
for av in store[a].iter(), if let Kont(k) = av;
This makes Ascent have a more macro-y vibe, which makes sense since
the whole Ascent frontend is based on Rust’s procedural macros. However,
I think for
can be easily achieved inside the pure
relational land, so in Gogi, this may not be necessary.
Seamless interop with Rust is in general very powerful. In fact, we
have already used this feature a lot. For example, lattices in Gogi are
structs defined in Rust that implements certain traits. So rules like
hi(x, n.into()) :- num(n, x).
, will call methods in the
corresponding struct (e.g., n.into()
).
User-defined functions introduce functional dependencies from
functions’ domains to their ranges. For example, rule
hi(x, n.into()) :- num(n, x).
can be viewed as
hi(x, n_into) :- num(n, x), into_rel(n, n_into)
with
functional dependency from n
to n_into
.
Advanced join algorithms like worst-case optimal joins can leverage
these functional dependencies to optimize the query.
The evaluation algorithm of Gogi programs consists of two parts. The core of the evaluation is the invariant-maintaining rebuilding algorithm, which is inspired both by the rebuilding algorithm of egg and by the evaluation algorithm of the chase. The second part involves matching and applying Gogi rules. Applying Gogi rules is efficient compared to the standard chase. Moreover, because Gogi programs are monotonic computations over the relational database in nature, they can benefit from the semi-naive evaluation algorithm of Datalog. We call this semi-naive matching, which can be seen as a further improvement over relational e-matching.
The rebuilding algorithm:
= mk_union_find()
todo = mk_set()
domain
def union_sort(s1, s2):
todo.union(s1, s2)
domain.add_all([s1, s2])
def refresh_todo():
= mk_union_find()
todo = mk_set()
domain
def on_insert(R, tup):
# find the tuple by its determinant columns
= R.find_by_determinant(tup.det)
orig_tup if orig_tup is None:
R.insert(tup)else:
# enumerate each dependent column
for c1 in tup.dep:
= c1.col
col = orig_tup[col]
c2 if col.is_sort():
= todo.get_or_create(c1)
s1 = todo.get_or_create(c2)
s2
union_sort(s1, s2)else:
orig_tup.set_col(col, c1.lat_max(c2))
def normalize(tuple, union_find):
return tuple.map(lambda val:
= val))
union_find.get_or_default(val, default
def rebuild(DB):
while not todo.is_empty():
# take todo into the local scope
= todo
union_find
refresh_todo()
= mk_set()
to_remove = mk_set()
to_insert
for val in domain:
for R in DB:
for col in R.cols:
for tup in R.index_by(col = col, val = val):
= normalize(tup, union_find)
new_tup if new_tup != tup:
to_remove.add((R, tup))
to_insert.add((R, new_tup))
DB.remove_all(to_remove)# may trigger on_insert
DB.insert_all(to_insert)
def batch_rewrite(pats, DB):
= mk_set()
to_insert for (lhs, rhs) in pats:
for subst in match(DB, lhs):
= chase(DB, subst, lhs, rhs)
subst for (R, atom) in rhs:
apply(subst)))
to_insert.add((R, atom.
DB.insert_all(to_insert)return to_insert.is_empty()
def chase(DB, subst, lhs, rhs):
= True
shouldContinue while shouldContinue:
= False
shouldContinue
for atom in rhs:
= atom.get_det_vars()
det_vars if det_vars.is_subset_of(subst.get_domain()):
= True
shouldContinue
= DB.get_rel(atom.rel)
R = det_vars.apply(subst)
det = R.find_by_determinant(det)
tup for var in atom.get_dep_vars():
= var.col
col if var.is_sort():
if tup is None: continue
= tup.get_by_col(col)
value
sort_update(subst, var, value)else:
= tup is None ? col.lat_init(det)
value
: tup.get_by_col(col)
lat_update(subst, var, value)
for var in rhs.get_all_vars():
if !subst.contains(var):
assert var.is_sort():
= new_sort_value(var.sort)
subst[var]
def lat_update(subst, var, value):
if subst.contains(var):
= subst[var].lat_max(value)
subst[var] else:
= value
subst[var]
def sort_update(subst, var, value):
if subst.contains(var):
union_sort(subst[var], value)else:
= value subst[var]
def run(pats, DB, max_iter):
for iter in range(max_iter):
if !batch_rewrite(pats, DB):
return
rebuild(DB)
One of the bottleneck in evaluating Gogi programs is matching the
left-hand side. Since we are matching over a relational representation
of the e-graphs, we are already doing is already relational e-matching.
However, we can go one step further: Let DB'
be the
database of tuples that are not touched in the current iteration of
rewrite. DB'
by itself will not produce any interesting new
tuples; it has to join with newly generated tuples (i.e., the delta
database). This is exactly the semi-naive evaluation algorithm of
Datalog. We call this similar optimization in Gogi semi-naive matching.
This optimization will be tricky to do over e-graph’s DAG
representation, yet is fairly obvious in Gogi’s full-fledged relational
representation.
The last rule in this example has a
single variable on the left-hand side, but the above mentioned syntactic
expansion for :=
does not apply to this case. The rule is
indeed equivalent to
abs(x, x) :- abs[x] if geq(x, num[0])
.↩︎