Searching on string properties in RavenDB

I just got an email from Robert Muehsig asking me for some advice on how to do searching on text fields in RavenDB. This seems to be a question that pops up quite frequently, so here are my two cents.

Given this scenario:

class User
{
    public string Id { get; set; }
    public string Name { get; set; }
}

session.Store(new User { Name = "Daniel Lang" });
session.Store(new User { Name = "Daniel Smith" });
session.SaveChanges();

Let’s see what we get when we fire some basic queries on them:

session.Query<User>().Where(x => x.Name == "Daniel Lang").ToList(); // returns just me
session.Query<User>().Where(x => x.Name.StartsWith("Daniel")).ToList(); // returns "Daniel Lang" and "Daniel Smith"
session.Query<User>().Where(x => x.Name.Contains("Daniel")).ToList(); // ATTENTION - 0 results
session.Query<User>().Where(x => x.Name.Contains("Lang")).ToList(); // also 0 results!

While the first two queries work as expected, the third and the fourth both return 0 results although they should return 2 and 1 users respectively. What has happened here?

The reason lies in the way RavenDB uses Lucene to build the index.

Some background about indexes

When you send a query to RavenDB, the server does not scan all your documents to check whether each one meets the query condition; that would be very slow. Instead, it executes the query against an index that has been built asynchronously in the background. RavenDB uses Lucene.NET to create, maintain and query those indexes. Because Lucene is very fast at answering queries and all indexing happens asynchronously on a background thread on the server, both queries and writes are fast with RavenDB. If you send a query for which no static index has been created, RavenDB automatically creates a temporary one for you. This temporary index extracts (or maps) from your documents all the properties needed to answer the query. Alright, I guess that’s the story you already know if you’re using RavenDB, but what does that mean for us?

The essential thing is: every query is translated into a Lucene query in the end.

So when you write this line of code…

session.Query<User>().Where(x => x.Name.StartsWith("Daniel")).ToList();

it translates directly into the Lucene query Name:Daniel*
The syntax is quite easy to understand. As you might guess, the asterisk is a wildcard for “whatever follows” and is therefore the obvious implementation of “starts with”.

Lucene’s query syntax is very powerful and allows you to do a lot of interesting things. Take a look at the documentation here: Lucene query syntax.

So the obvious (and perfectly working) choice to implement string.Contains would be to use *search-term*, right?
Ok, but…

Why (the hell) does string.Contains() not work for text-search in RavenDB?

The short answer: because of RavenDB’s “safe-by-design” paradigm.

The long answer is that a leading wildcard forces Lucene to do a full scan of the index, which can drastically slow down query performance. Lucene internally stores its indexes (actually the terms of string fields) sorted alphabetically and “reads” them from left to right. That’s why a search with a trailing wildcard is fast and one with a leading wildcard is slow.
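To make the difference concrete, here is a minimal sketch in plain C#. It is a toy model, not Lucene’s actual implementation: a sorted array stands in for the term dictionary, and the names TermLookup, StartsWith and Contains are made up for this illustration. A prefix search can jump to the right spot with a binary search and read only the neighbouring terms, while a leading wildcard has no entry point and must inspect every term.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: a sorted array of terms stands in for Lucene's term dictionary.
static class TermLookup
{
    // Trailing wildcard (prefix*): binary-search to the first term >= prefix,
    // then read forward only while the terms still start with the prefix.
    public static List<string> StartsWith(string[] sortedTerms, string prefix)
    {
        int lo = 0, hi = sortedTerms.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (string.CompareOrdinal(sortedTerms[mid], prefix) < 0) lo = mid + 1;
            else hi = mid;
        }

        var matches = new List<string>();
        for (int i = lo; i < sortedTerms.Length
                         && sortedTerms[i].StartsWith(prefix, StringComparison.Ordinal); i++)
        {
            matches.Add(sortedTerms[i]);
        }
        return matches;
    }

    // Leading wildcard (*fragment*): the sort order does not help,
    // so every single term has to be inspected.
    public static List<string> Contains(string[] sortedTerms, string fragment)
    {
        return sortedTerms.Where(t => t.Contains(fragment)).ToList();
    }
}

static class Program
{
    static void Main()
    {
        // Terms must be sorted ordinally for the binary search to be valid.
        var terms = new[] { "angela merkel", "daniel lang", "daniel smith" };
        Array.Sort(terms, StringComparer.Ordinal);

        // prints "daniel lang, daniel smith"
        Console.WriteLine(string.Join(", ", TermLookup.StartsWith(terms, "daniel")));
        // prints "daniel lang"
        Console.WriteLine(string.Join(", ", TermLookup.Contains(terms, "lang")));
    }
}
```

With three terms the difference is invisible, but the prefix search touches roughly log(n) terms plus the matches, whereas the substring search always touches all n.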

Because of that, string.Contains is implemented the same way as string.Equals in RavenDB’s LINQ provider, for good reasons, as we’ve seen.

How to do it right?

While you certainly can bypass RavenDB’s built-in protection using this line of code

session.Advanced.LuceneQuery<User>().Where("Name: *Daniel*").ToList();
// naive implementation of string.Contains

… I don’t recommend doing so, as there is a much better alternative available.

Instead of just showing you the code, I want you to understand the concept behind it. In the code samples above, we’ve seen that we get the correct result if the whole string matches the name of a user (the match is not case-sensitive). This has to do with the way Lucene stores the strings that come in from the documents. Remember, an index in RavenDB (no matter whether it was created manually or automatically) just extracts fields (properties) from documents in the database and passes them on to Lucene.NET, which stores them for later querying. When a string field arrives in Lucene, it is either stored as it is or analyzed first. Analyzing means that the string may be split up (tokenized) into multiple parts (terms) before they are stored in the index. This is very important to understand!

Think of a Lucene index as a flat table with two columns: the first column contains all the terms (whole string properties from documents, or their tokenized parts), and the second column contains the primary keys (the ids) of the corresponding original documents, which can later be used to load them from RavenDB’s underlying storage engine. This model is actually far from how Lucene.NET really works, but I think it works quite well to explain the concept…

Ok, so you have probably guessed that it’s the analyzing step in the indexing process that makes the difference. You’re right. By default, RavenDB uses a custom analyzer (one that doesn’t ship with Lucene.NET) called LowerCaseKeywordAnalyzer. That means that unless you explicitly specify another analyzer for an indexed property, this one will be used. It lowercases each string before it goes into Lucene (which is why queries are not case-sensitive) and keeps it as a single term. That’s fine in most cases, but for text searching we want an analyzer that tokenizes every word. In our sample above, we don’t want the name “Daniel Lang” to be stored as a whole in Lucene’s “table”; instead, we want two rows, “Daniel” and “Lang”, in the first column.
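The two-column table and the effect of the analyzer can be sketched in a few lines of plain C#. This is only the simplified mental model from above, not how Lucene.NET or RavenDB are implemented: ToyIndex and its methods are hypothetical names, the keyword function merely mimics what LowerCaseKeywordAnalyzer is described as doing, and the word splitter is a crude stand-in for a tokenizing analyzer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model of the "flat table": term -> ids of the documents containing it.
static class ToyIndex
{
    // Keyword-style analysis: lowercase the value and keep it as ONE term.
    public static IEnumerable<string> LowerCaseKeyword(string value)
    {
        yield return value.ToLowerInvariant();
    }

    // Tokenizing analysis (heavily simplified): lowercase, then split on spaces.
    public static IEnumerable<string> LowerCaseWords(string value)
    {
        return value.ToLowerInvariant()
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Runs every document value through the analyzer and fills the table.
    public static SortedDictionary<string, List<string>> Build(
        IDictionary<string, string> namesById,
        Func<string, IEnumerable<string>> analyzer)
    {
        var table = new SortedDictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (var doc in namesById)
        {
            foreach (var term in analyzer(doc.Value))
            {
                List<string> ids;
                if (!table.TryGetValue(term, out ids))
                {
                    ids = new List<string>();
                    table[term] = ids;
                }
                ids.Add(doc.Key);
            }
        }
        return table;
    }

    // "Starts with" is evaluated against the stored terms, not the original strings.
    // (A linear pass keeps the sketch short; it is not meant to be efficient.)
    public static List<string> IdsForPrefix(
        SortedDictionary<string, List<string>> table, string prefix)
    {
        var lowered = prefix.ToLowerInvariant();
        return table.Where(row => row.Key.StartsWith(lowered, StringComparison.Ordinal))
                    .SelectMany(row => row.Value)
                    .Distinct()
                    .ToList();
    }
}

static class Program
{
    static void Main()
    {
        var docs = new Dictionary<string, string>
        {
            { "users/1", "Daniel Lang" },
            { "users/2", "Daniel Smith" }
        };

        var keywordTable = ToyIndex.Build(docs, ToyIndex.LowerCaseKeyword);
        var wordTable = ToyIndex.Build(docs, ToyIndex.LowerCaseWords);

        // Keyword table has the rows "daniel lang" and "daniel smith": no term starts with "lang".
        Console.WriteLine(ToyIndex.IdsForPrefix(keywordTable, "Lang").Count);   // prints 0
        // Word table has the rows "daniel", "lang" and "smith": "lang" points to users/1.
        Console.WriteLine(string.Join(", ", ToyIndex.IdsForPrefix(wordTable, "Lang"))); // prints "users/1"
    }
}
```

The same query gives different answers only because the rows in the first column differ, which is exactly the point of choosing the analyzer per field.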

The good news is that it’s quite simple to do that. Lucene ships with a handful of analyzers, and they are all available through RavenDB as well. Take a look here: How indexes work

The one you want for our Name property is the StandardAnalyzer: it splits text into words at whitespace and punctuation, lowercases them, and ignores common English stop-words.

Finally, here is the code:

documentStore.DatabaseCommands.PutIndex("UsersByName", new IndexDefinitionBuilder<User>
{
    Map = users => from user in users
                   select new { user.Name },
    Indexes =
        {
            { x => x.Name, FieldIndexing.Analyzed}
        }
});

Because we have set the Name property to be Analyzed, RavenDB will automatically choose the StandardAnalyzer for it. If you want to explicitly set another one, the code looks quite similar (but I suggest sticking with the StandardAnalyzer in this case):

documentStore.DatabaseCommands.PutIndex("UsersByName", new IndexDefinitionBuilder<User>
{
    Map = users => from user in users
                   select new { user.Name },
    Analyzers =
        {
            { x => x.Name, "SimpleAnalyzer"}
        }
});

 

Once the index is set up correctly, we can run the following query:

session.Query<User>("UsersByName").Where(x => x.Name.StartsWith("Lang")).ToList();
// will return the user "Daniel Lang", even though this string doesn't start with "Lang"

That’s all there is to it. As soon as you’ve got the indexing thing clear in your mind, you will love the amazing power Lucene and its query syntax give you. Good luck!

Edit: Please note…
…as Steve pointed out in the comment below, this doesn’t answer the question of how to implement string.Contains. I’m sorry that I wasn’t very clear on that. Here’s the deal: you don’t need it! Think about it: in almost every case you actually want string.StartsWith on every word, because the results will be more relevant. If we had a full string.Contains, a search for “La” would return not only “Daniel Lang” but also a user “Angela Merkel”.
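The relevance argument can be checked with a tiny plain-C# comparison of the two matching rules. This runs on in-memory strings only and does not involve RavenDB; NameMatching and its two methods are hypothetical helpers written for this illustration.

```csharp
using System;
using System.Linq;

static class NameMatching
{
    // What a true string.Contains would do: match the fragment anywhere in the name.
    public static bool ContainsFragment(string name, string term)
    {
        return name.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // What a tokenized field plus StartsWith gives you: match at the start of any word.
    public static bool AnyWordStartsWith(string name, string term)
    {
        return name.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                   .Any(word => word.StartsWith(term, StringComparison.OrdinalIgnoreCase));
    }
}

static class Program
{
    static void Main()
    {
        foreach (var name in new[] { "Daniel Lang", "Angela Merkel" })
        {
            Console.WriteLine("{0}: contains={1}, word-prefix={2}",
                name,
                NameMatching.ContainsFragment(name, "La"),
                NameMatching.AnyWordStartsWith(name, "La"));
        }
        // prints:
        // Daniel Lang: contains=True, word-prefix=True
        // Angela Merkel: contains=True, word-prefix=False
    }
}
```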

However, if you really have a case where you want to do string.Contains (I cannot think of an example scenario for that right now) you can certainly do that using the leading wildcard.

One more thing: an implementation of string.EndsWith can be done very easily and efficiently. All you need to do is select a Lucene analyzer that reverses the string. Because analyzers are also applied to the search term, it really is as easy as that.
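Here is a minimal plain-C# sketch of why reversing works. It is again a toy model rather than a real Lucene analyzer, and ReversedTerms with its methods is a made-up name: if every term is stored reversed and the search term is reversed in the same way, “ends with” becomes an ordinary “starts with” on the reversed terms.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ReversedTerms
{
    // Naive reversal; good enough for the sketch, but it ignores
    // surrogate pairs and combining characters.
    public static string Reverse(string value)
    {
        var chars = value.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Index time: lowercase each value and store it reversed, in sorted order.
    public static List<string> BuildIndex(IEnumerable<string> values)
    {
        return values.Select(v => Reverse(v.ToLowerInvariant()))
                     .OrderBy(t => t, StringComparer.Ordinal)
                     .ToList();
    }

    // Query time: analyze the search term the same way, then do a prefix match.
    // Matches are reversed back so the caller sees the original (lowercased) text.
    public static List<string> EndsWith(List<string> reversedIndex, string suffix)
    {
        var prefix = Reverse(suffix.ToLowerInvariant());
        return reversedIndex.Where(t => t.StartsWith(prefix, StringComparison.Ordinal))
                            .Select(t => Reverse(t))
                            .ToList();
    }
}

static class Program
{
    static void Main()
    {
        // Stored terms are "gnal leinad" and "htims leinad".
        var index = ReversedTerms.BuildIndex(new[] { "Daniel Lang", "Daniel Smith" });

        // "lang" becomes the prefix "gnal"; prints "daniel lang"
        Console.WriteLine(string.Join(", ", ReversedTerms.EndsWith(index, "lang")));
    }
}
```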


9 Responses to Searching on string properties in RavenDB

  1. steve January 5, 2012 at 9:54 pm #

    I’m sorry if I missed something but wasn’t the point of the post to talk about how to do the string.contains properly? Maybe your last code paste should be Contains instead of startswith otherwise i don’t think you answered the question.

    • Daniel Lang January 5, 2012 at 10:39 pm #

      Steve, you are right. Thanks for reminding me of that. I’ve extended the post above to address this. Please let me know if that answers your questions.

      • steve January 5, 2012 at 10:44 pm #

        awesome, wasn’t sure your point would be driven home if you didn’t make it super clear.

  2. Peter Seewald February 9, 2012 at 11:23 pm #

    Thanks for a great article on the basics of indexes and searching for fulltext.

  3. Leniel Macaferi February 10, 2012 at 1:51 pm #

    Daniel,

    You have good writing skillsssssss… =] We get focused on the read by the way you write!

    Thanks for posting this. One of the things that gets in the way when we start using RavenDB is the Indexing thingy. The more posts we have about it the better it is for us.

    Keep posting man…

    Leniel

  4. Terry May 15, 2012 at 5:14 pm #

    RavenDB newbie says that the LuceneQuery works fine, but when I add the index and use Contains, I always get NotSupportedException: “Contains is not supported, doing a substring match over a text field is a very slow operation, and is not allowed using the Linq API. The recommended method is to use full text search (mark the field as Analyzed and use the Search() method to query it.”

    Based on your post, I think that I’ve marked the field as Analyzed. Now I’m trying to figure out how to use the Search() method instead of Contains().

    By the way, is there a way to check whether the index is already there and not put it again?

    • Daniel Lang May 16, 2012 at 12:41 pm #

      Terry, yes – there was a breaking change that made Contains throw a NotSupportedException. This post is good evidence that people had lots of problems with it, so that’s the reason for that change. The newer method Search() does the same as the StartsWith() in my blog post above, so this is the recommended way to go now. You still need to mark the field as analyzed as you would have before. Just use Search() instead of StartsWith() and everything works fine (actually you can still continue to use StartsWith() if you want – only the weird Contains() was abandoned).

  5. josh reeter August 2, 2012 at 2:28 pm #

    Thank you for your thorough explanation Daniel!
