Scaladays talk on Datomisca
Dan James (@dwhjames) and Charles Francis (@agentcoops) took the stage at Scaladays to talk about Datomisca, Pellucid’s open source Scala library for Datomic, which was developed in partnership with Zenexity.
Dustin Whitney: Hey guys, I’m Dustin Whitney. I, along with Nathan Hanlin and Doug Tangren, organize the NY Scala meetup. So, shout out to all the NY Scala people. I’m also the CTO of Pellucid Analytics. We also, like Novus, are hiring. We don’t want to shove the recruitment too far down your throats, so we thought we would let some of our engineers show some of the cool stuff we’re working on. So, without further ado, this is Dan James and Cooper Francis.
Dan James: Can everyone hear me ok, am I too tall for this podium? Is that working out ok? Welcome to part two of interrupting your lunch. I guess all of the blood is flowing down to your stomach at the moment, so hopefully there’s still a bit left up here.
We would like to use our time briefly today to talk about one of the open source projects we’ve been working on at Pellucid. That’s one of the great things about working at Pellucid, in my obviously unbiased opinion – that we are passionate about doing this sort of stuff. One of the projects we’ve been working on is Datomisca.
At Pellucid, we use this database called Datomic. This is a new database system developed by Rich Hickey and Relevance, Inc. I guess Rich Hickey got a name drop in Martin’s keynote. If you haven’t come across Datomic before, I highly recommend that you go look at some of the videos that are out there – it’s well worth your time to have a quick look.
So Datomic is developed in Clojure and it has a Java API but we are a Scala company so we saw an opportunity to make the experience with Datomic a bit more idiomatic, a bit more enjoyable for Scala developers. So this is actually a bit of a collaboration with us, Pellucid, and Zenexity – and those are the guys behind the Play framework. Datomisca is the result of that effort.
So I’m not going to explain everything about Datomic. Rich Hickey does a much better job than I could possibly do. However, for the purpose of this talk, I want to give you the essence of what Datomic is about, and this is an excerpt from the website.
"Datomic is a database that stores a collection of facts. And these facts in the database are immutable. Once they’re stored, they do not change. However, these old facts can be superseded by new facts over time. And the state of your database is a value, defined by the set of facts in effect at a given moment in time."
So, Datomic’s data model enables a different design for databases – something fundamentally different from the other databases you might be used to. It distributes reads, writes, and querying across different components in a distributed architecture. Each instance of your app is what is known as a peer in Datomic’s nomenclature, and it uses Datomic’s peer library to enable that. It really puts the power of the database into your application: you use this peer library to read from the storage service and to communicate your writes to a transactor, which takes care of serializing these transactions to the datastore.
I said it puts the power of your database into your application and this is because all the queries that you do are local to your application. Datomic gives you this value, it’s immutable, you can query on that, nothing’s gonna change out from underneath you so each instance of your application can run those queries locally and that makes it very, very powerful.
As I mentioned, this transactor is the writing part of this distributed architecture, and it makes your database durable by using one of several supported storage services, such as DynamoDB, Riak, or Couchbase.
I’m gonna hand over to Cooper who’s gonna talk a little bit about why Datomic appeals to us.
Charles Francis: So Datomic was really an appealing choice for us. As Dan has already indicated, Datomic has a rather compelling vision, both in its semantics, incorporating time as a first-class value, and the peer architecture that this allows. It really deserves attention for its innovations in database system design.
We also really like that it is so reliably distributed, given that it is able to be backed by pre-existing cloud storage services. Datomic’s support for so many backends – Dynamo, Riak, etc. – gives us quite a bit of flexibility to choose based on our needs. Perhaps most importantly, this means we haven’t had to entrust the durability of our data to an unproven technology. This has, of course, been an issue with some more novel databases in recent years. And since we’re already heavy users of AWS, being able to rely on Dynamo has really made our deployment much easier.
Now, of course, as Scala programmers, we’re already obviously fans of the principles of functional programming, in particular, immutability. And we all know and use the Scala collections library which provides such excellent persistent data structures. Really the Datomic database can be thought of as a durable and persistent data structure.
Finally, Datomic uses Datalog as its query language, which is an old, well-engineered, powerful, and declarative language. It’s strictly more expressive than SQL. This means that we can easily work with that other staple of functional programming, recursive data structures, in ways that would be impossible or rather cumbersome in a relational database. We can, for example, easily express as well as reason about graph structures and linked lists.
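As an illustration of the kind of recursive query Datalog makes easy, here is a hypothetical rule set computing transitive "follows" reachability, written as the plain Datalog string Datomisca would embed (the `:person/follows` attribute is the one introduced later in the talk; the rule name is made up for this sketch):

```scala
// A Datalog rule set, embedded as a Scala string. The first clause is the
// base case (a directly follows b); the second is the recursive case
// (a follows someone who transitively follows b).
val followsRules: String = """
  [[(follows-transitively ?a ?b)
    [?a :person/follows ?b]]
   [(follows-transitively ?a ?b)
    [?a :person/follows ?x]
    (follows-transitively ?x ?b)]]
"""
```

Expressing the same reachability in SQL would require recursive common table expressions at best.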
So Datomisca, as Dan has already suggested, is the open source library that we’ve developed to help us use Datomic’s peer library from within Scala. We’ve tried really hard to make the experience more idiomatic and safe for the statically-inclined. We should note that Datomic is a rapidly evolving system with new features being released all the time. This does mean that Datomisca itself is very much a work-in-progress.
So during our time today, we thought we’d just focus on two key aspects of Datomisca and how we use several more advanced features of Scala in order to make this a nice experience. First, we want to explain our use of implicits and the type class pattern to introduce compile-time type safety when getting data out of Datomic. And the second is our use of Scala’s new macros to infer and validate, again at compile time, Datalog query syntax as well as input and output arity.
Dan: So Datomic requires that you provide a schema. In Datomic’s terms, you need to say what attributes you’re going to talk about, and you need to give the characteristics of those attributes – a name, a type, a cardinality – and there’s some other information that you can attach if you so choose. You don’t need to say which attributes are going to attach to particular entities; that’s completely up to the application to decide. You’re simply giving a set of attributes that you might talk about.
So, I will introduce a very simple schema for the purpose of this talk. I’m going to talk about persons, and I’m going to introduce two attributes: “name” and “follows.” So, this is an example of Datomisca code here. My “name” attribute has this keyword “name” in blue there, and that’s a namespaced name – because Datomic is based on Clojure, some of this is coming through. It has the type “string” and a cardinality of “one.” I also want to talk about how people can follow other people, so I have this “follows” attribute that’s got a “reference” type, because I’m going to refer to other entities, or other people, and a cardinality of “many.”
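The slide’s code is not reproduced in the transcript, but as raw Datomic schema data (edn, abbreviated to the characteristics named above), the two attributes amount to roughly the following, shown here embedded as a Scala string:

```scala
// The "name" and "follows" attributes as abbreviated Datomic schema data:
// a namespaced ident, a value type, and a cardinality for each.
val personSchema: String = """
  [{:db/ident       :person/name
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one}
   {:db/ident       :person/follows
    :db/valueType   :db.type/ref
    :db/cardinality :db.cardinality/many}]
"""
```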
So, what was on the previous slide is just data. And to attach this into the database, to extend our schema, all that we need to do is transact that data and that now becomes part of the schema of the database. Again, this is some Datomisca that is sending that data off to the database and that is now going to become part of our database.
So, let me give an example of getting data into Datomic using these attributes. So it was Alice, Bob, and John – or Joe – in the previous presentation; I’ve gone for Eve here. We’re going to talk about three people, and the first three lines introduce some identifiers for these entities.
Now, in this transaction block, I’ve got a list of facts that I’m going to store. The first two give an example of what we might call a low-level API – it’s not really providing much on top of what the Datomic peer library gives us. We’re going to say something about an entity, attribute, and value. So we’ve got an Alice entity, and we’re going to talk about the name of Alice and give a value for that. And here we’re explicitly giving the identifier as that attribute and giving the value along with it.
Now, I’ve written two versions here. The second line is going to fail at runtime, because only the database knows what the type is supposed to be, and if I try to attach “10” as the value for the name attribute, when it gets to the transactor, the transactor is going to say “uh uh, I can’t do that, you told me earlier that this attribute is supposed to be about strings, not numbers” – and this is a problem we’d like to solve at compile time.
In the next three lines, I’m doing the same thing for Bob and Eve. But instead of giving the keyword, I’m using the definition of the attribute that I wrote earlier. When I defined the attribute, I said what identifier I wanted to talk about, and I also said the type and the cardinality, and I want to reuse that information. Certainly, I don’t want to get caught out by any typos from repeating this keyword string everywhere. What’s going on here is that I’m using the definition from earlier, and I’m going to get a compile-time check, through implicits, that I really am using the right value – the right type of value and the right cardinality. Now I’ll get a compile-time error if I try to use a number as a name. Similarly, follows was about references with a cardinality of many, so it’s checking here that I’m providing a set of things and referring to other people. So Alice is following Bob and Eve.
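The write-side check described here can be sketched in plain Scala. This is a minimal, self-contained illustration of the pattern, not Datomisca’s actual API: an attribute carries its Datomic value type as a phantom type parameter, and a fact can only be built when an implicit instance relates that type to the Scala type of the value.

```scala
// Phantom types standing in for Datomic value types and cardinalities
sealed trait DString; sealed trait DRef
sealed trait One; sealed trait Many

// An attribute's ident plus its type-level characteristics
final case class Attribute[DD, Card](ident: String)

// Type class: which Scala types may be written to an attribute of type DD?
trait ToDatomic[DD, A] { def encode(a: A): Any }
object ToDatomic {
  implicit val stringValue: ToDatomic[DString, String] =
    new ToDatomic[DString, String] { def encode(a: String): Any = a }
  implicit val refValue: ToDatomic[DRef, Long] =
    new ToDatomic[DRef, Long] { def encode(a: Long): Any = a }
}

final case class Fact(entity: Long, attr: String, value: Any)

// Compiles only when evidence ToDatomic[DD, A] exists for this pairing
def addFact[DD, A](entity: Long, attr: Attribute[DD, One], value: A)(
    implicit ev: ToDatomic[DD, A]): Fact =
  Fact(entity, attr.ident, ev.encode(value))

val name = Attribute[DString, One]("person/name")

val fact = addFact(1L, name, "Alice") // ok: String matches DString
// addFact(1L, name, 10)              // compile error: no ToDatomic[DString, Int]
```

The commented-out last line is exactly the “number as a name” mistake from the slides, rejected by the compiler instead of the transactor.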
Ok, so what about getting data out of Datomic? The first line is accessing – so database is my value for the database, I have a consistent view of the database – and, using the entity method, I’m going to look up an entity for a particular identifier; a person id has been provided for me here. Again, I’m going to contrast two ways of getting a value for a particular entity. I can supply a raw keyword here: I want to look up the name of this person, but what I get back is just going to be some Datomic data – whatever the database provides me. I know it’s a string, but I need to cast it to a string here, and that’s not so great. And I’m having to repeat this string type again. What I really want is the last line: I want to supply the attribute that I defined. I already said what the identifier was, and I already said the type, so the appropriate type – the type I expect and know should be there – is going to be inferred for me at compile time. Scala and Datomisca are going to help me out there.
So, just to drive that home – attributes are parameterized by both the data type that they’re talking about and the cardinality, and we’re using this attribute trait and its type parameterization to build up a type class – in fact, a couple of type classes – through implicits. When I’m reading something, I have the attribute in hand that I want to talk about, and that’s going to tell me the type that I’m going to get out. When I’m writing something, I have a Scala value in hand, with a Scala type, and I need to know that it’s convertible into the attribute that I’m pairing it with. The attribute knows the type that it’s expecting, I have some value, and I need to make sure that these match up.
So our type classes are really kind of computing type-level functions and type-level relations. If I have an attribute, that function is going to determine the type I get out. If I’m writing then I need this relation to tie together the attribute that I’m talking about and the value.
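The read side – the “type-level function” from attribute to result type – can be sketched the same way. Again, this is a self-contained illustration of the pattern, not Datomisca’s actual API: the attribute’s type parameters select an implicit reader instance, which fixes the Scala result type at compile time, so no cast is needed at the call site.

```scala
// Phantom types for Datomic value types and cardinalities
sealed trait DString; sealed trait DRef
sealed trait One; sealed trait Many

final case class Attribute[DD, Card](ident: String)

// Type class as a type-level function: (DD, Card) => A
trait AttributeReader[DD, Card, A] { def read(raw: Any): A }
object AttributeReader {
  // a string attribute with cardinality one reads as a String
  implicit val stringOne: AttributeReader[DString, One, String] =
    new AttributeReader[DString, One, String] {
      def read(raw: Any): String = raw.asInstanceOf[String]
    }
  // a ref attribute with cardinality many reads as a set of entity ids
  implicit val refMany: AttributeReader[DRef, Many, Set[Long]] =
    new AttributeReader[DRef, Many, Set[Long]] {
      def read(raw: Any): Set[Long] = raw.asInstanceOf[Set[Long]]
    }
}

// A toy entity: an untyped bag of values, as a raw peer API would return
final case class Entity(values: Map[String, Any]) {
  // the attribute's type parameters select the reader, so A is inferred
  def apply[DD, Card, A](attr: Attribute[DD, Card])(
      implicit reader: AttributeReader[DD, Card, A]): A =
    reader.read(values(attr.ident))
}

val name    = Attribute[DString, One]("person/name")
val follows = Attribute[DRef, Many]("person/follows")

val alice = Entity(Map("person/name" -> "Alice", "person/follows" -> Set(2L, 3L)))

val n: String     = alice(name)    // result type inferred, no cast needed
val fs: Set[Long] = alice(follows)
```

The unsafe casts live in one place, inside the reader instances, which are written once per attribute type rather than at every call site.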
So I’ll leave that there for talking about schemas. I hope that’s a compelling example of how to reuse this information and not repeat ourselves and I’m going to hand back to Cooper who’s going to talk about macros.
Charles: So, as we’ve all learned in the past few days, macros are an ideal tool for deeply embedding DSLs. They’re an abstraction that allows us to elegantly handle the compile-time expansion, or transformation, of abstract syntax trees into other Scala abstract syntax trees, which will then be validated in later compile phases.
Thus, it’s certainly not surprising that many of the libraries which have been developed using Scala macros serve to handle that awful wart of programming, the impedance mismatch between programming language and database semantics. See, for example, libraries like Slick and sqltyped, which similarly deploy macros to reduce boilerplate code and improve type safety when interacting with relational databases.
We made the decision to stick to a purely embedded DSL where, from the programmer’s perspective, queries are handled as strings of well-formed Datalog. While this might appear to be a limitation of the library, it enables us to maintain complete API compatibility and syntax across implementations. Examples from Clojure can more or less be copied and pasted into Scala, which really means there isn’t much of a library-specific layer to learn on top of Datomic. And you don’t have any of the dreadful idiosyncrasies that are typical of more internal DSLs.
So, first note that a Datomic query is essentially a static data structure with a specific number of input and output parameters. Here we want to input a database: given Datomic’s time-sensitive nature, queries must be parameterized on a specific time slice of the database, because we want to be able, for example, to synchronize multiple queries within a transaction at the time that the transaction began. We also have the name of a user, bound to “name,” and we want to return the id “p” of all those users that follow “name.”
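The query described here, written out as the plain Datalog string Datomisca embeds, looks roughly like this (a sketch: the variable and attribute names follow the schema from earlier in the talk):

```scala
// Given a database ($) and a ?name, find the ids ?p of all users who
// follow the user with that name: two inputs, one output.
val findFollowers: String = """
  [:find ?p
   :in $ ?name
   :where
     [?f :person/name ?name]
     [?p :person/follows ?f]]
"""
```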
So right away there were two things that it would be just terrific to have checked at compile time. We would like the syntax of our Datalog strings to be validated. We don’t want to have these typos to be found at runtime. And we’d also like to ensure the correct number of input and output parameters. We want to make sure we’re inputting the right number of things and getting the right number of things out.
So we handle this in two phases. First, we want to parse the query into a value of type “pure query,” making sure that there are no typos. One slight downside to our approach is that we do have to maintain a complete Datalog parser. That isn’t so bad – the core Datalog syntax is relatively straightforward – but we have had some trouble in that Datomic itself is prone to extending the core syntax with new features. Still, we have found in practice that this tradeoff is very much worthwhile, from an end user’s as well as a performance perspective.
One thing to note at this phase is that inputs and outputs are processed into lists of values: “find” and “in” are collections of parameters whose cardinality is equal to the arity of our query. So we must go a step further, because these values can, again, only provide runtime checks; we have to encode them at the type level.
So, again, we’ve already proven that this query string can be converted into a valid AST and we want to subsequently make the type more specific so that at a later compile phase we will get errors if we’re not properly handling the right number of inputs and outputs.
So, in this case we’re working with a much more complicated query. We want to restrict followers to those within a particular age band, and we additionally want to return each follower’s name and age. So we have three inputs – the database, the name, and the age – and three outputs – the id, name, and age.
So one thing you’ll notice on the right-hand side of the slide is that we are using TypedQueryAuto3, which means we are providing manually written cases up to some fixed number of arities. We would like to make this more sophisticated in later releases, but really, if your query has more than 50 inputs, something is probably wrong.
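The slide’s query is not in the transcript; the following is a hypothetical reconstruction. It assumes a `:person/age` attribute (not part of the earlier schema) and approximates the “age band” with a single upper bound, so that the query has the three inputs ($, ?name, ?max-age) and three outputs (?p, ?follower-name, ?age) described above:

```scala
// Followers of the user named ?name, restricted to ?age below ?max-age,
// returning each follower's id, name, and age: three in, three out.
val findFollowersInAgeBand: String = """
  [:find ?p ?follower-name ?age
   :in $ ?name ?max-age
   :where
     [?f :person/name ?name]
     [?p :person/follows ?f]
     [?p :person/name ?follower-name]
     [?p :person/age ?age]
     [(< ?age ?max-age)]]
"""
```

In Datomisca, the macro would give such a query a type like TypedQueryAuto3, fixing both the input and output arities at compile time.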
The other obvious limitation is that the type at this point is only as specific as Datomic data – the superclass of all specific Datomic value types. We do have all of this information at the time of macro expansion, but the macro to correctly validate the types of inputs has not been implemented yet. This is, of course, something that we would like: when you’re inputting parameters to a query, we’d like to use the schema information once more and make sure that we’re not passing an integer where a string is expected.
So, this is an open source project, so we can make a call-out here. If anyone is really itching to get a bit of real-world Scala macro experience, we would love contributions. You can find all the code on our GitHub repository. And if this is particularly exciting, we are hiring, so you could work on this every day. Could be nice. Anyway, yeah, thanks for your time.