I’m pretty sure every developer who has ever worked with a modern database-backed application, particularly a web-app, has a love/hate relationship with their ORM, or object-relational mapper.
On the one hand, ORMs are vastly more pleasant to work with than code that constructs raw SQL, even, generally, from a tool that gives you an object model to construct SQL, instead of requiring (Cthulhu help us all) string concatenation or interpolation.
On the other hand, there tends to be an enormous mismatch between the objects the ORM is modelling on one end, and the relational database on the back end. Basically, the object model encourages dealing with one object at a time, whereas the relational database wants you to push your queries, JOINs, and even your entire transactions back into it, so that it can better optimize them for you. And so, you tend to have a choice between writing shockingly inefficient code, and jumping through weird hoops and strange syntax to give your ORM the tools to generate efficient queries.
The most fundamental problem, perhaps, is the question of how much
data the ORM should request from the database. When I ask for a
collection of objects (using some syntax to generate the SQL
clause automatically), the ORM has to decide:
- Which fields in the object should the ORM
SELECTfrom the database, and
- Which associated models (with foreign keys to or from the object in
question) the ORM should
JOINin and pull out at the same time.
(You can think of the second as a more involved version of the first, but involves more complexity, and is interesting for being potentially recursive).
SELECTing too much data in either case makes the database do too
much work, touch too many blocks on disk, read too many indexes, and
so on, and makes your queries slow.
SELECTing not enough data means
your ORM needs to go back and make additional queries when you ask for
fields or models that were not pulled in initially, killing your
performance even more.
I’d like to muse about two possible solutions, neither of which I’m aware of implementations of, although to be honest I haven’t looked too hard. I’d love it if anyone could point me at work on either sort.
The static solution
Given a language with a sufficiently awesome type system (Haskell
comes to mind :)), you could imagine encoding information about which
fields and models are or are not
SELECTed into the type of a record
object, making it statically an error to write a query that
under-selects. Given type inference, we can probably even go on to
statically infer (probably with a bit of programmer help in the form
of type annotations) which fields and models we need to
based precisely on which fields of the result get accessed.
This is the kind of solution that occurs immediately to a certain type of person (of which I know a lot) when they think about this type of problem enough. It’s fun to think about and there are probably some papers in it (if no one’s beaten you to it), but I suspect it’s the kind of thing that’s unlikely to gain much traction among mainstream developers, if only because it would require you write in Haskell or equivalent.
The dynamic solution
Another solution occurred to me recently, which I don’t think I’ve heard people talking about. I like it because I’m pretty sure it could be grafted onto an existing ORM with little or no API change, which means there’s potential for incrementel adoption and improvement.
The key insight here is that, while the decision of what to
is critical for performance, it’s (mostly)1 irrelevant for correctness whether you
have to make additional queries later, or pull in too much data. A
good ORM, if it notices you ask for a field it doesn’t have cached,
will just go ahead and make the
SELECT for you.
This opens the door for heuristic solutions. A good ORM will already
have hooks for the programmer to specify which fields or models to
include or exclude – Django, to pick a popular example, has
defer methods on
So, my proposal is this: Let’s develop tools to profile an application, and automatically generate those annotations in a profile-driven manner. Since all accesses to your data are through the ORM, we can instrument the ORM to determine which fields are selected after a query, and remember that information for next time it sees the same query.
In a webapp environment, the possibilities for optimization get
potentially even more exciting. A webapp framework like Django knows
an awful lot of information at any given moment, like what user is
logged into your app, and what URL is currently being generated. We
could imagine associating that data with the profile results, so that
the ORM could automatically discover relations between the application
structure and the ORM queries. It could “learn”, for example, that
certain users are administrators, and so will access more fields on
certain models than other users. It could learn that even though
/places/details/ pull up the exact same set of
models (maybe even through the same helper function!), the former
needs only their names and IDs, while the latter needs to pull in full
details, as well as rows from the associated
All without a single programmer annotation. Of course, in cases where the behavior is too complex for the tool’s heuristics to pick up, or where the programmer absolutely needs predictable behavior or needs to micro-optimize, you could turn off the heuristics and fall back the old behavior (with the same old annotations available to the programmer).
Is anyone aware of any work in this space? It sounds like a potentially extremely exciting project to me, but I probably don’t have the time to attempt to throw together an implementation.
1.Ignoring, for example, race conditions where you need to select a model and related models atomically, and don’t want or can’t afford a transaction around the entire lifetime of the objects in the app code.