Category Archives: RethinkDB

getNearest(RethinkDB Upgrade);

With last week’s release of version 1.15.0 (appropriately codenamed “Lawrence of Arabia”), RethinkDB has finally (did I just say finally?) added one of the most-requested features: Geospatial support. Sporting a shiny new set of geo types, functions, and indexes, it’s harder than ever to find excuses not to try RethinkDB.

Several of my current skunkworks projects involve geospatial data in one way or another. RethinkDB’s lack of support was sending me back to Postgres, which, while mature and feature-filled, is also a bit of an overcomplicated mess. Attempts to make PostGIS easier to use, like ORM add-ons, instead tend to make things even more complicated. (The ActiveRecord adapter, for example, wants to store geographies — but for many applications, many of the functions you’ll want operate only on geometries, meaning you have to cast every. single. query.)

RethinkDB’s geo support, on the other hand, is a reimagining of what people actually want in geospatial functions in today’s applications. Want to do a KNN search? No sweat. getNearest has you covered. Need to find some intersecting areas? getIntersecting, duh. Circles and points are easily created with circle and point. It’s so humane it will blow your mind.

Postgres with PostGIS is currently a lot faster for nearest neighbor searches—we’re talking at least an order of magnitude—but I still find RethinkDB more pleasant to use. Not only is ReQL’s getNearest much easier to remember, but the way ReQL is evaluated means I don’t get the SQL Injection Heebie Jeebies every time I have to work with location data from the big bad outside world.

Curiously, getIntersecting does not suffer from getNearest‘s performance issues. It’s instead right on par with PostGIS, and — if the very idea doesn’t make you feel dirty — can even be used to get near-PostGIS KNN performance out of RethinkDB in most applications. (Imagine for a moment you’re making a Yelp competitor. We don’t care about the true K nearest neighbors so much as what’s immediately around the user. If you’re scoping to, say, 5 miles anyway, who cares what the query looks like as long as the results are useful? Call getIntersecting with a 5 mile circle originating at the user’s location. Bam!)

It’s an exciting time to be playing with location-based apps! While I haven’t used every data store out there, RethinkDB 1.15 has made things much simpler than anything I’ve used to this point.

Order

In my discussion of RethinkDB, I mentioned that ReQL is inspired by functional languages.

What I neglected to mention is that this will serve as an initial stumbling block for some developers. Being functional, it makes use of chaining, and results from one function are fed into the next. This means that order matters in ReQL.

This does open up some power you don’t have in SQL (and some non-SQL query languages too), but simultaneously introduces a potential for error.

Let’s take a look at a simple use case: We want the results ordered, but we only want 5 of them.

In standard SQL, this is unambiguous. There’s only one allowable way to write the query. If you try to LIMIT before ORDERing, your DB (PostgreSQL, in this case) will drop the hammer:

SELECT avg_max_f FROM heat_index_data WHERE state = 'Arizona' LIMIT 5 ORDER BY date DESC;

ERROR:  syntax error at or near "ORDER"
LINE 1: ...M heat_index_data WHERE state = 'Arizona' LIMIT 5 ORDER BY d...

Write it correctly, with LIMIT at the end, and we’re good:

SELECT avg_max_f FROM heat_index_data WHERE state = 'Arizona' ORDER BY date DESC LIMIT 5;

weather_dev=# SELECT avg_max_f FROM heat_index_data WHERE state = 'Arizona' ORDER BY date DESC LIMIT 5;
 avg_max_f
-----------
     59.16
     60.57
     58.51
     54.11
      51.8
(5 rows)

But ReQL is functional. It doesn’t care where you stick your limits or orderBys and it doesn’t care how many of them you have. You can orderBy(r.asc(date)).orderBy(r.desc(date)) for all it cares. Data comes out of one function and is fed into the next.

So we can limit before orderBy:

r.getAll('Arizona', {index: 'state'}).limit(5).orderBy('date').pluck('avg_max_f')

[
  {
    "avg_max_f": 48.81
  } ,
  {
    "avg_max_f": 44.35
  } ,
  {
    "avg_max_f": 56.4
  } ,
  {
    "avg_max_f": 66.15
  } ,
  {
    "avg_max_f": 57.04
  }
]

And we can orderBy before limit:

r.getAll('Arizona', {index: 'state'}).orderBy('date').limit(5).pluck('avg_max_f')

[
  {
    "avg_max_f": 48.81
  } ,
  {
    "avg_max_f": 41.85
  } ,
  {
    "avg_max_f": 40.26
  } ,
  {
    "avg_max_f": 45.54
  } ,
  {
    "avg_max_f": 48.79
  }
]

…and as is plainly visible here, you get different results because they’re different queries.

Wanting to limit before ordering is certainly an edge case (which is why SQL doesn’t allow for it), but it — along with many other unconventional queries — is perfectly possible with the limited constraints of ReQL. If you run into unexpected results working with RethinkDB, take a step back and ask yourself: “What am I really asking for here? Does it make sense?”

RethinkDB: Your Next Next Database

So some developers walk into Y Combinator with an alternate MySQL storage engine optimized for SSDs, and… No, wait! This gets better than it starts out!

Along the way, MySQL is cut out of the picture and something new and more interesting springs up. RethinkDB the SSD-optimized storage engine becomes RethinkDB the human-optimized NoSQL database. It attracts some brainy talent, conquers some nifty problems, and uses MongoDB’s list of pain points as a checklist of things to do better. What emerges is a newer, smarter source of document store. One that’s agile while valuing data integrity and stability. (Novel idea, right?)

RethinkDB may not be the first NoSQL database out of the gate, but that’s exactly why it’s so good. It’s the much-overlooked advantage of not being the first mover; Apple’s been succeeding on this basis for decades. Apple built an empire around the iPod, which was an extremely refined product in a market analysts had declared “mature” and “crowded.” They did it again with iPad, kickstarting a market Microsoft had been trying and failing to create for two decades. Now RethinkDB has come into a “mature” and “crowded” niche to build on the shortcomings of the prevalent NoSQL solutions.

But as a developer, what’s the draw of yet another document store? Don’t we have enough? It’s been discussed to death that PostgreSQL is frequently better than MongoDB at being MongoDB. Postgres features a pedigree and modern performance numbers that are the stuff of dreams. It’s taken years to get there. So why in the hell bother with a database that’s a tender 3 years from its 1.0 release?

Because it’s awesome. Not perfect for all applications, and perhaps not perfect yet for any application, but awesome.

Admin Interface

The first feature of RethinkDB that most users will encounter is its beautiful, modern administrative web interface. At a glance, you can see the status of your server or cluster. With a few clicks, you can perform nearly any administrative task you like. There’s even a live query console.

RethinkDB Admin UI
RethinkDB’s included administrative console

I’m not going to replicate effort here, so if you want to see the admin interface in action without installing RethinkDB, give their screencast a spin. Sharding, replication, and multi-datacenter management are truly only a few clicks away. Even many grotesquely expensive commercial products don’t offer cluster management this easy.

For the small team that’s dev and ops rolled into one, this level of control and accessibility is breathtaking. While I have a copy of Panic’s Prompt SSH client on every mobile device I own, there’s power in being able to administer your database from literally any browser.

Query Language

RethinkDB dispenses with SQL-style query languages in favor of its own language, ReQL. While distinct from SQL, it has most of your familiar features — including joins — and can be picked up enough for basic use in all of 5 minutes.

Assuming you have some grounding in set theory and databases, it’s simple and well-documented enough that you can get started even without an SQL background.

ReQL is functional in its construction, building queries through chaining rather than monolithic command strings. This, among other design decisions, also helps minimize (but not eliminate!) injection and similar attack vectors; unless you’re doing something very foolish, like passing user input to RethinkDB’s JavaScript engine, it would be quite difficult for a typical attacker to achieve the kinds of malfeasance easily accomplished with SQL.

By now there has to be a thought lurking in the back of every reader’s mind: What’s the catch? And, in fact, I purposely helped plant that seed with the title.

There are a couple catches and it’s important to be aware of them. For some users they’ll be unimportant; for others they may be a deal-breaker. But by being aware of them, you can make an educated decision and focus your testing when evaluating RethinkDB. You want to choose a data store on its merits and weaknesses, not whether or not it’s webscale.

Performance

While performance is a priority for the RethinkDB team, you may find it’s not quite where you need it right now. Thus the “next next database.” I’ll touch on this in some detail in a later post, but performance ranges from as fast as or faster than Postgres (on very simple queries) to dreadfully slower than Postgres (on more complex queries).

Tolerance for RethinkDB’s performance will vary with your expectations; based on my own testing and experience, many MongoDB or MySQL users will likely find it acceptable or even an improvement. Pragmatic Postgres users with the right usage patterns may even find response times perfectly acceptable (more so if you can take advantage of caching). It’s frequently slower than Postgres to be sure, but there’s a sizable gulf between “slower” and “slow”. And Postgres itself has seen some massive improvements in performance over the past several releases, so certainly it’s not fair to act as if RethinkDB will never see a speed increase.

Query Optimization

RethinkDB doesn’t currently have a query optimizer. If you want to use an index, you need to explicitly request the use of that index in your ReQL. If you’re doing a series of expensive operations, it’s up to you to determine the best order (and identify any redundancies or stupid code). This is a significant difference from modern SQL servers, most of which have extremely intelligent query optimizers, allowing excellent performance from even half-baked queries.

A developer today can adopt a popular SQL ORM — or even raw SQL — with no or very little knowledge of SQL and still get great performance.

A developer today using RethinkDB has to put some thought into his queries and develop an understanding of functions’ performance through experience.

Indexing

There are also some gotchas (for SQL users, anyway) surrounding secondary index use; as RethinkDB functions can only utilize an index when operating on a table (which you generally cease to have after the first function in the query chain), a good understanding of your application’s usage pattern is required for both index creation and querying. Whereas SQL will generally gleefully assemble an optimized query across multiple conditions and multiple single-column indexes, RethinkDB’s functions can only use an index when acting directly on a table. This means that, in many cases, you get to use one index and it has to be in your first function. It thus becomes important to both create appropriate compound indexes and determine what portion of your query will actually benefit from using your one available index-enabled call.

To be fair, ReQL and RethinkDB are well-designed and this isn’t nearly as painful as it might first sound. Coming from SQL, though, it does initially feel a bit clumsy and dated.

Circling back to “a good understanding of your application,” index creation times are — at least currently — quite long. While index creation is not a blocking operation, the expense does mean you want to get it right the first time.

Try it out!

Go on, give it a whirl on a side project. Work through a tutorial. Check out the excellent docs.

The RethinkDB team is also super friendly and super accessible, so if you have any questions or suggestions, link up with the community.

Maybe it’ll work for you, maybe it won’t, but in either case it may start to change your mind about what your database needs are and how a NoSQL solution should work.

orm_adapter + NoBrainer = <3

After trying and falling in love with RethinkDB, I began using the NoBrainer ORM with some of my Rails side projects. And while the combination is fantastic and productive, I’ve hit many of the roadblocks that frequently come with straying from ActiveRecord.

One of those roadblocks has been the lack of orm_adapter support for NoBrainer. Used by a variety of gems (chief among them the popular Devise authentication framework) to provide ORM abstraction, orm_adapter support is a necessity for easy growth and development.

And today, it’s here with orm_adapter-nobrainer. Written to scratch my own itch, it’s available on GitHub and pushed out to RubyGems for wider distribution and use. While not entirely complete — it doesn’t yet address the writable associations NoBrainer doesn’t implement — all other tests are passing and it should be a drop-in implementation for most real-world uses.

Similar work still needs to be done for Devise itself, but this is the foundation for that work (and a whole bunch more).

Hopefully it makes someone’s life a little easier. Pull requests and issues are always welcome.