Measure your productivity!

Today I had an interesting discussion with one of our team members. He was estimating an upcoming development task and came up with a realistic estimate of 124 hours. Assuming that the estimate itself is accurate, how many days of work will it take to get the job done?

That’s a tough question because it requires us to know how many hours of productive work a full day actually contains. While a typical work day lasts 8 or 9 hours, how much of that do we actually spend doing productive work? 5 hours, 6 hours, or even more?

Let us be honest – we don’t know. Hopefully it is something between 60 and 90 percent, but that really makes a lot of difference! How can we even calculate an hourly rate, agree on project deadlines and plan our vacations without knowing that number?
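To see how much that number matters, here is a minimal sketch of the arithmetic. The class and method names are made up for illustration, and an 8-hour work day is assumed. With 60 percent productive time the 124 hours turn into 26 work days, with 90 percent into 18.

```csharp
using System;

public static class EstimateCalculator
{
    // Converts an effort estimate into whole work days, given the length of a
    // work day and the share of it that is actually spent on productive work.
    public static int WorkDays(double estimatedHours, double hoursPerDay, double productiveShare)
    {
        if (hoursPerDay <= 0)
            throw new ArgumentOutOfRangeException("hoursPerDay");
        if (productiveShare <= 0 || productiveShare > 1)
            throw new ArgumentOutOfRangeException("productiveShare");

        double productiveHoursPerDay = hoursPerDay * productiveShare;

        // A started day still counts as a full day on the calendar.
        return (int)Math.Ceiling(estimatedHours / productiveHoursPerDay);
    }
}
```

Calling EstimateCalculator.WorkDays(124, 8, 0.6) and EstimateCalculator.WorkDays(124, 8, 0.9) gives the two figures above, a gap of more than a week and a half for the very same estimate.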

There is this awesome book called Peopleware (a must-read for every software manager, in my opinion) which tells us what to do in order to improve our own and our team’s development productivity. It turns out that besides things like office space layout, noise level and temperature, the major reason for poor productivity is the number of distractions: all those tiny little things that keep us from actually getting things done. It could be a phone call in the middle of a difficult programming task, the boss stopping by to chat about his bike tour on the weekend, or just the coffee machine that ran out of water.

How can we measure all those tiny distractions throughout a work day, so that we can

  • do meaningful estimation on project deadlines?
  • calculate a reasonable hourly rate?
  • improve on that number?

In our case, we actually have a time tracking application which we already use for assigning blocks of time to projects and tasks. However, it is cumbersome to work with it, mainly because you need to click a lot and you need to select a task or project on which the time should go. That’s just too much friction to be actually useful for logging all those 0.5 – 5 minute distractions we’re worrying about here. We need something else…

FocusMeter – a tool for tracking distractions and measuring productivity

After hacking around for a few hours I came up with a small tool that is so easy and simple that I actually want to use it! Sitting in the system tray and being controlled by the keyboard it really gets out of my way and still provides enough functionality to do the job. Let’s take a look at it!

Here is what it looks like before you start working (after booting your PC, coming to the office, etc.):

[Screenshot: gray tray icon]

Notice that little gray dot? Yep, that’s the application. It is gray because we’re not working right now.

This is what it looks like when we’re doing productive work:

[Screenshot: green tray icon]

The icon has turned green to indicate that everything is good.

Here is what happens when we’re distracted (e.g. the phone rings):

[Screenshot: red tray icon]

Red indicates that we’re not doing productive work right now.

The only keyboard shortcut we need to learn is CTRL + ENTER. Every time we type that combination we’re toggling between being productive and being distracted.

Alternatively, we can use the context menu which pops up when we right-click the icon:

[Screenshot: context menu]

We also need to use this context menu for starting and stopping the overall work time. If we’re going for lunch, this is not a distraction (at least not for me :) ), thus we need to stop the timer during that time.
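The behaviour described so far boils down to a tiny state machine with three states. The following is only a sketch of that idea under my own assumptions, not the actual FocusMeter source: the type and member names are invented, and the current time is passed in so the logic can be tried without waiting for a real clock.

```csharp
using System;

public enum FocusState { Stopped, Productive, Distracted }

// Hypothetical tracker: gray = Stopped, green = Productive, red = Distracted.
public sealed class FocusTracker
{
    private DateTime lastChange;

    public FocusState State { get; private set; }
    public TimeSpan ProductiveTime { get; private set; }
    public TimeSpan DistractedTime { get; private set; }

    // Context menu: start the overall work time (e.g. arriving at the office).
    public void StartWork(DateTime now)
    {
        Accumulate(now);
        State = FocusState.Productive;
    }

    // Context menu: stop the overall work time (e.g. going for lunch).
    public void StopWork(DateTime now)
    {
        Accumulate(now);
        State = FocusState.Stopped;
    }

    // Keyboard shortcut: switch between productive and distracted.
    public void Toggle(DateTime now)
    {
        if (State == FocusState.Stopped) return; // nothing to toggle while stopped
        Accumulate(now);
        State = State == FocusState.Productive ? FocusState.Distracted : FocusState.Productive;
    }

    // Share of the tracked work time that was productive (0 if nothing was tracked).
    public double ProductiveShare
    {
        get
        {
            double total = (ProductiveTime + DistractedTime).TotalSeconds;
            return total > 0 ? ProductiveTime.TotalSeconds / total : 0.0;
        }
    }

    // Books the time since the last state change onto the state we are leaving.
    private void Accumulate(DateTime now)
    {
        TimeSpan elapsed = now - lastChange;
        if (State == FocusState.Productive) ProductiveTime += elapsed;
        else if (State == FocusState.Distracted) DistractedTime += elapsed;
        lastChange = now;
    }
}
```

Time spent in the stopped state is never counted, which is why a lunch break does not show up as a distraction.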

Did you notice the menu item Show RavenDB database? It brings up the RavenDB Management Studio and lets us take a look at the underlying database. Right now, we are not concerned about showing reports on our data in a nice UI or exporting the tabular data to Excel or SQL Server.

We wanted a tool which lets us begin collecting the data immediately. It was important that it would work disconnected and be very easy to deploy. We couldn’t spend days or even weeks for development, because we didn’t have that time nor did we believe that it would have been worth the effort. RavenDB was just perfect for that scenario. It took less than 6 hours in development!

The tool is open source and you are free to do whatever you want to do with it. I would be happy if someone could craft out some code to export the data to Excel, so that we can generate some nice graphs and get some insight after a few weeks of use.

You can easily compile the project from the source on github: https://github.com/dlang/FocusMeter
Alternatively, you can download the binary of the latest version as of writing this post (12th May 2012) here.

I intend to add an export feature to submit the data to a centralized SQL Server, so that we can do some advanced company-wide reports. However, that won’t happen in the next few days. Right now, I’m just happy that we have started collecting the data.

Update 2012-05-12:
As per request, I just added an options dialog that lets you choose your own keyboard shortcut. It looks like this:

 


RavenDB vs MSSQL

I just stumbled upon this question on Stack Overflow. I began typing my answer and when I was finished I realized that it was pretty long and might work better as a blog post. So here is what I answer when I am asked whether to choose RavenDB or MSSQL as the storage technology for a greenfield application.

Disclaimer: before accusing me of being biased towards RavenDB, please note that we’ve just started two projects for which we chose SQL Server as our database. Obviously I love RavenDB, but it’s definitely not a one-size-fits-all thing.

Performance

RavenDB is said to be very fast and you often hear that it’s faster than SQL Server. That’s not true. It is not RavenDB itself that is faster as a database; instead, it’s the application built on top of RavenDB that is faster most of the time. There are a few reasons for that:

  • Fewer database requests
    RavenDB enables you to store whole object graphs and collections inside a single document. That’s why you generally need far fewer requests when using RavenDB compared to a relational database.
  • Fewer joins when querying
    Although RavenDB has some concepts to support joining between documents, a document database embraces the use of denormalization. So instead of gathering all the necessary information for one query from different documents, you probably already have everything inside one single document. RavenDB is very fast at loading single documents, and that’s the reason for the performance boost here.
  • Indexes: precomputed aggregations and calculations
    Map/reduce indexes can be used to do intensive calculations in the background (and also distributed) and store the calculated results on disk, so that querying them is cheap because all the calculations have already been done. The same goes for normal map indexes as well. RavenDB can do calculations and store the results inside indexes for fast querying. You should note, however, that indexes have BASE consistency, which means they can be stale and need to catch up with the latest write operations on the database. While that means that in busy systems you have no guarantee of getting the most up-to-date results from an index, it is also a big selling point, because it means that write operations are not affected by the number of indexes. They simply complete, and RavenDB updates the indexes in the background without slowing down insert/update performance.

Simplicity

Although RavenDB is putting a lot of effort into making your experience as easy as possible, there’s still a steep learning curve related to the shift in mindset that document modeling requires. You will need to think about your data differently than you probably did when designing relational databases. This is something you definitely want to consider before choosing RavenDB as the storage technology for a new application with a tight development deadline.

However, learning document modeling and working with RavenDB is a rewarding experience because you get immediate feedback and see how much easier and faster you can be when you’re finished.

Although we have EF4.1 code-first, FNH with automappings and even micro-ORMs like Massive or PetaPoco that do a very good job of simplifying our development efforts when working with SQL Server, it’s still far easier to work with RavenDB (please note, this is my personal opinion here). You just don’t have to think about mappings and normalization. Just put your objects in and you’re done.

Tooling and ecosystem

This is something you will definitely miss if you’re coming from SQL Server and are used to tons of third-party software that helps you work with and manage your SQL Servers. RavenDB has made a few big improvements in this regard recently and I expect a massive improvement with the 1.2 release; however, it’s still far behind what you get when using SQL Server. If you don’t feel comfortable with configuration files and prefer wizards that do the magic for you, then you probably don’t want to use RavenDB (right now).

Another pain point in working with RavenDB (and NoSQL databases in general) is that it’s nearly impossible to get good ad-hoc reporting support. RavenDB has a bundle that lets you replicate data out to SQL Server for reporting purposes, but obviously that’s much more friction than just querying the datasets in the first place.

Which to choose?

The question of which database to choose obviously depends on your concrete scenario, the skills of your team, your environment (existing licenses), etc., but here is what I think could help you:

We choose RavenDB when…

  • we can think of our data in terms of aggregates with mostly independent chunks of data (e.g. customer, order, product, etc.)
  • we need to have good performance on aggregation and calculation queries
  • we need to have complex searching (full-text, facets, etc.)
  • we need to be able to scale
  • we need high availability at low costs

We choose SQL Server when…

  • we need to support user-generated reports and highly dynamic data analysis
  • we have to deal with mostly relational data (e.g. accounting, statistics)
  • we want to use Windows Azure
  • our customer definitely wants us to choose SQL Server without knowing better

I hope these arguments help you choose one or the other storage technology for your application…

But wait! You don’t have to choose one or the other database. Actually you can mix them and pick the advantages of both databases. So here is the deal: if you feel your application is complex enough or it would benefit from one or the other database, just use them both.


The danger of properties that do too much

I have always been against putting too much code into the getters and setters of C# properties. I just came across a bug that led me to this piece of code:

[Screenshot: the code containing the bug]

Now it’s really hard to see what actually happens here, because someone decided to choose a property over a method:

[Screenshot: the property definition]

In my opinion, a property description should never begin with “Creates a..” as this is a clear indicator that you should go for a method instead.

I refactored it to this:

[Screenshot: the refactored method]

You may say this is nitpicking. Fine, but I say it’s exactly these small things that make the difference between a robust codebase and one that is prone to a lot of bugs.
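To illustrate the point with a hypothetical example (all names below are invented, and this is not the code from the bug above): a getter that creates something looks like a cheap, repeatable read at the call site, while a method makes the work and the side effect visible.

```csharp
using System;

public sealed class Invoice
{
    public Guid Id { get; private set; }

    public Invoice()
    {
        Id = Guid.NewGuid();
    }
}

public sealed class InvoiceFactory
{
    public int CreatedCount { get; private set; }

    // The smell: reading this property twice yields two different invoices
    // and silently changes CreatedCount, although "factory.NewInvoice"
    // reads like a plain field access.
    //
    // public Invoice NewInvoice
    // {
    //     get { CreatedCount++; return new Invoice(); }
    // }

    // The method states what happens: each call creates a new invoice.
    public Invoice CreateInvoice()
    {
        CreatedCount++;
        return new Invoice();
    }
}
```

With the method, nobody is surprised that two calls to CreateInvoice() return two different objects; with the property, code that reads it twice (or a debugger watch window that evaluates it) easily triggers the side effect by accident.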


Deploy Raven.Studio.xap using a powershell script

When you’re running RavenDB in embedded mode you probably want to make sure that the file ‘Raven.Studio.xap’ is inside your binaries folder, so that you can run RavenDB Management Studio on your embedded instance.

Well, copying the file by hand each time after cloning an application’s repository or updating RavenDB’s NuGet package is a pain in the ass. Adding the file to your Visual Studio project and setting it to ‘Copy always’ is also a bad solution, as you will still need to update that file after each package update. Here is how you can leverage the power of the shell (woohoo!) to do that automatically.

First, create a PowerShell script deploy-raven-studio.ps1 and copy it into your project’s folder.

Update May 2012: RavenDB has changed its official NuGet packages. Please note that the script below needs to be changed slightly in order to work with the new structure. Here is the download link: deploy-raven-studio.

[Screenshot: the PowerShell script]

You can also download the file here (please use the updated download-link above the picture).

Next, you need to add a post-build event to your Visual Studio project.

[Screenshot: the post-build event in Visual Studio]

Please note the whitespace character at the end of the parameters. If you don’t include it, Visual Studio will only pass the first parameter (this is a bug of course).

Now that you have the PowerShell script inside your project folder and have set up the post-build event, you’re done. The script will automatically copy the Raven.Studio.xap file from the NuGet package into your bin folder when compiling.

In case you encounter an error when running the script, please make sure that you have set the correct execution policy in PowerShell. To do that, run either the x86 or x64 version of PowerShell (depending on your Visual Studio version!) as administrator and execute this command: Set-ExecutionPolicy RemoteSigned.


5 ReSharper settings for C#4 coding

Sometimes, it is the small things that make you feel good or bad when programming. I personally find it disgusting to see code like this:

[Screenshot: code sample]

Oh my god. I was afraid I’d need two monitors just for Visual Studio to see the full source code when something like this comes along:

[Screenshot: code sample with very long lines]

… but fortunately ReSharper came to the rescue and automatically wrapped the long lines, phew!

Long story short, here are my top 5 ReSharper settings:

[Screenshot: ReSharper settings]

Change them both from “At next line indented 2 (GNU style)” to “At next line (BSD style)”.

Then, we have these two:

[Screenshot: ReSharper settings]

Check them off and you’ll be better off.

And here is the last one:

[Screenshot: ReSharper setting]

This guy forces you to hit backspace twice after each completion in order to remove the two unnecessary parentheses. I’m not 100% sure whether I like this setting turned off, since it affects many other situations as well where I’d actually want to have them inserted.

Do you have any other must-have R# settings that everyone should know?

Update: As Carsten suggested in the comment below, here is the resulting picture. Much cleaner, isn’t it?

[Screenshot: the resulting code]


Document level encryption in RavenDB

You may run into a situation where you want to have the RavenDB documents saved encrypted on disk. A typical scenario for that would be storing credit card information with PCI compliance. Or maybe you have an application that runs RavenDB embedded and stores the data locally on the user’s computer. In that case you may want to have the documents encrypted, so that no one else can take RavenDB’s data folder and open it with another RavenDB instance.

RavenDB has no out-of-the-box implementation of this kind of encryption, but through its extensibility model, it is very easy to implement it on your own. Here we go…

Step 1 – Create a RavenDB plugin

Open up Visual Studio and create a new class library project. Then use NuGet package manager to download the RavenDB package.

Step 2 – Implement AbstractDocumentCodec

RavenDB has a nice extension point for this. It will use the implementations of AbstractDocumentCodec at a very low level, just above loading and saving in its storage engines (either ESENT or Munin). The base class you need to implement is very simple:

public abstract class AbstractDocumentCodec
{
    public abstract Stream Encode(string key, RavenJObject data, RavenJObject metadata, Stream dataStream);

    public abstract Stream Decode(string key, RavenJObject metadata, Stream dataStream);
}

Basically you can use these two methods for whatever encryption you want. This is a very simple example using TripleDES encryption:

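using System.IO;
using System.Security.Cryptography;
using System.Text;
using Raven.Database.Plugins;
using Raven.Json.Linq;

public class DocumentCodec : AbstractDocumentCodec
{
    // Hard-coded to keep the example short.
    private const string YourPassword = "super-secret-password";

    // Wraps the storage stream so that everything written to it gets encrypted.
    public override Stream Encode(string key, RavenJObject data, RavenJObject metadata, Stream dataStream)
    {
        return new CryptoStream(dataStream, GetCryptoProvider(key).CreateEncryptor(), CryptoStreamMode.Write);
    }

    // Wraps the storage stream so that everything read from it gets decrypted.
    public override Stream Decode(string key, RavenJObject metadata, Stream dataStream)
    {
        return new CryptoStream(dataStream, GetCryptoProvider(key).CreateDecryptor(), CryptoStreamMode.Read);
    }

    // Derives key and IV from the password, salted with the document key,
    // so every document ends up with its own key material.
    private static SymmetricAlgorithm GetCryptoProvider(string key)
    {
        var passwordBytes = new Rfc2898DeriveBytes(YourPassword, GetSaltFromDocumentKey(key));
        return new TripleDESCryptoServiceProvider
        {
            Key = passwordBytes.GetBytes(24), // 192-bit TripleDES key
            IV = passwordBytes.GetBytes(8)    // 64-bit block size
        };
    }

    // MD5 only serves to turn the document key into a fixed-length (16 byte) salt.
    private static byte[] GetSaltFromDocumentKey(string key)
    {
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(Encoding.ASCII.GetBytes(key));
        }
    }
}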
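If you want to try the key derivation and the stream wrapping without a running RavenDB instance, the following sketch does a round trip in memory. It is a stand-alone illustration under my own assumptions: the class CodecRoundTrip and its methods are invented for this purpose and are not part of RavenDB or of the plugin; only the System.Security.Cryptography calls are real. TripleDES.Create() is used in place of TripleDESCryptoServiceProvider, which does the same job here.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class CodecRoundTrip
{
    private const string Password = "super-secret-password";

    // Same idea as in the codec: key and IV depend on the password and the document key.
    private static SymmetricAlgorithm CreateAlgorithm(string documentKey)
    {
        byte[] salt;
        using (var md5 = MD5.Create())
        {
            salt = md5.ComputeHash(Encoding.ASCII.GetBytes(documentKey));
        }

        using (var derived = new Rfc2898DeriveBytes(Password, salt))
        {
            var algorithm = TripleDES.Create();
            algorithm.Key = derived.GetBytes(24);
            algorithm.IV = derived.GetBytes(8);
            return algorithm;
        }
    }

    public static byte[] Encrypt(string documentKey, byte[] plain)
    {
        using (var algorithm = CreateAlgorithm(documentKey))
        using (var encryptor = algorithm.CreateEncryptor())
        using (var buffer = new MemoryStream())
        {
            // Disposing the CryptoStream writes the final padded block.
            using (var crypto = new CryptoStream(buffer, encryptor, CryptoStreamMode.Write))
            {
                crypto.Write(plain, 0, plain.Length);
            }
            return buffer.ToArray();
        }
    }

    public static byte[] Decrypt(string documentKey, byte[] cipher)
    {
        using (var algorithm = CreateAlgorithm(documentKey))
        using (var decryptor = algorithm.CreateDecryptor())
        using (var crypto = new CryptoStream(new MemoryStream(cipher), decryptor, CryptoStreamMode.Read))
        using (var output = new MemoryStream())
        {
            crypto.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Encrypting a JSON string for the key "users/1" and decrypting it with the same key returns the original bytes, while the same plaintext stored under "users/2" produces a different ciphertext because the salt differs.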

Step 3 – Compile and deploy

Now you just need to compile and put the DLL into RavenDB’s plugins folder. That’s it.

For the lazy ones, check out this tiny RavenCrypt solution on my GitHub.


Using an index as a materialized view in RavenDB

This post shows how you can use a RavenDB map index as a persistent view of your documents.

There was an interesting question on StackOverflow and also another one on the mailing-list (btw, did I tell you that mailing-lists suck? go and use SO instead! :) ) regarding how to query on a simple map index and use the index itself as the result of the query.

Ok, first we need to answer this question:

What is the difference between a map and a map/reduce index?

The obvious difference is that a map index has only one or more map functions to extract index fields out of your documents, while a map/reduce also has a reduce function that runs over the results of the map functions in a second step. Ayende has written a good explanation of this concept. This happens at indexing time, but there is also a big difference at query time:

When you query a map/reduce index, the index itself is the result of your query, whereas querying a map index will give you the original documents!

Internally, it works like this: at indexing time, RavenDB extracts information from your input documents and uses it to create fields on the Lucene documents that make up your index (leaving tokenization and analysis aside). You need those fields so that you can query on them. Along with the fields that you define as the result of your map function, RavenDB also stores a field __document_id that holds the key of your original document. So when you query, RavenDB uses this field to fetch the original documents and answer your query. This is the way map indexes work – it is very important to understand that concept if you want to do more advanced stuff.

Remember, we are only talking about map indexes here, because for map/reduce indexes we already get what we want. So the real question is:

How can we tell RavenDB not to load the original documents and return them as the result of our query, but instead give us the fields stored in the index as a result?

Here is the answer:

var results = documentSession.Query<Post, AggregationIndex>()
                        .AsProjection<AggregationIndex.ReduceResult>()
                        .ToList();

It’s as simple as that. Don’t be confused by the name of the inner class – there is no reduce function; the naming is just because I call every result of an index a ReduceResult. (I don’t know who came up with this convention anyway, but I’m sure I don’t like it because it produces so much confusion. I would rather call it just Result, but I’d like to be consistent with other people’s code…)

Here is the index itself:

public class AggregationIndex : AbstractIndexCreationTask<Post, AggregationIndex.ReduceResult>
{
    public class ReduceResult
    {
        public string Id { get; set; }
        public string Title { get; set; }
        public string Text { get; set; }
        public int CommentsCount { get; set; }
    }

    public AggregationIndex()
    {
        Map = posts => from post in posts
                       select new
                       {
                           post.Title,
                           post.Text,
                           CommentsCount = post.Comments.Count
                       };

        Store(x => x.Title, FieldStorage.Yes);
        Store(x => x.Text, FieldStorage.Yes);
        Store(x => x.CommentsCount, FieldStorage.Yes);
    }
}

Nothing special here, right?
Stop. Take note of the calls to Store(). Now re-read the question above, where it says…

…give us the fields stored in the index…

By default, RavenDB creates the fields in the Lucene documents in a way that they can be used in queries, but cannot be retrieved from the query result. So, if we want to use the index itself as a query result, we need to enable field storage.

The only field for which we don’t need to enable field storage explicitly is the document key (the Id property), because RavenDB already does this.

In the example above, we chose to store all our indexed fields, which is generally not a good idea if you have large fields (like the text of a blog post), because Lucene isn’t intended to be a database on its own. So instead of storing all fields, it is better to store only the fields we really need and have a projection class for that. Let’s modify our code to take that into account.

public class AggregationIndex : AbstractIndexCreationTask<Post, AggregationIndex.ReduceResult>
{
    public class ReduceResult
    {
        public string Id { get; set; }
        public string Title { get; set; }
        public string Text { get; set; }
        public int CommentsCount { get; set; }
    }

    public class ResultProjection
    {
        public string Id { get; set; }
        public string Title { get; set; }
        public int CommentsCount { get; set; }
    }

    public AggregationIndex()
    {
        Map = posts => from post in posts
                       select new
                       {
                           post.Title,
                           post.Text,
                           CommentsCount = post.Comments.Count
                       };

        Store(x => x.Title, FieldStorage.Yes);
        Store(x => x.CommentsCount, FieldStorage.Yes);
    }
}

And the query…

var results = documentSession.Query<Post, AggregationIndex>()
                        .AsProjection<AggregationIndex.ResultProjection>()
                        .ToList();

This is better, because it keeps the Lucene index smaller and thus makes queries even faster. Note that all the computation happens at indexing time and we only fetch the results at query time.

I don’t know if there is an equivalent concept in SQL Server, but in Oracle there are materialized views that give you similar query characteristics.

As an extra point, I’d like to mention that if you are using this approach, you can count yourself among the cool CQRS kidz. I’ll leave it up to you to find out why.


Using multi-map indexes to search over different document types

In this post I’ll show you how we built a nice search using RavenDB’s multi-map indexes and facets feature. But first, here is the context:

In one of our recent projects using RavenDB, we created a software system for the federal government of Upper Austria that is intended to be a platform for all kinds of musical works and artists, both classical and contemporary. It consists of different parts; one of them is a public web application that will be used to publish news and upcoming events, but also provides a nice search feature for the authors and opuses stored in the database.

It is currently running in a test environment, but if you’re not afraid of the German language, you can already try it out here: www.diemusiksammlung.at/search

This is a screenshot:

[Screenshot: search page with results and facets]

Given a single search term, the results will contain both authors and opuses, sorted by their relevance. In addition, you can drill down into the search results using facets. By clicking on the gray arrow, you can expand each facet block to see more of the available options. In any case, you will only see facets that return at least one document, either an opus or an author.

Ok, so what is interesting here?

Besides this facet thing (that could be the topic of another post), we have used a multi-map index to search over authors and opuses at the same time.

Here is a simplified model:

public class Opus
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string SubName { get; set; }
    public DateTime? PublishingDate { get; set; }
    public List<string> Genres { get; set; }
    public string Description { get; set; }
    public List<AuthorReference> Authors { get; set; }
}

public class Author
{
    public string Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public List<string> SpheresOfAction { get; set; }
    public string Biography { get; set; }
    public DateTime? DateOfBirth { get; set; }
}

public class AuthorReference
{
    public string Id { get; set; }
    public string FullName { get; set; }
}

 

Now, let’s imagine we have thousands of opuses and authors in our database. We want to search those documents by a search term and expect results sorted by their relevance with full paging capabilities, no matter if they are opuses or authors.

To do that, we need an index that covers both document types. This is what multi-map indexes are for. The only difference from a simple map index is that there are at least two map functions that extract data from your documents to populate the Lucene index.

In our case, the most trivial index would probably be this one:

public class AuthorsAndOpuses : AbstractMultiMapIndexCreationTask<AuthorsAndOpuses.ReduceResult>
{
    public class ReduceResult
    {
        public string Name { get; set; }
    }

    public AuthorsAndOpuses()
    {
        AddMap<Opus>(opuses => from opus in opuses
                              select new { Name = opus.Name });

        AddMap<Author>(authors => from author in authors
                                  select new { Name = author.FirstName + " " + author.LastName });

        Index(x => x.Name, FieldIndexing.Analyzed);
    }
}

 

Using this index, we can query on the name property of both document types:

var results = session.Query<AuthorsAndOpuses.ReduceResult, AuthorsAndOpuses>()
                        .Where(x => x.Name.StartsWith("Dan"))
                        .As<dynamic>()
                        .ToList();

 

Each result item is either of type Author or Opus and you can cast them as needed. However, we cannot query on any properties that are specific to authors or opuses. If we want to do that, we need to add those fields to the index:

public class AuthorsAndOpuses : AbstractMultiMapIndexCreationTask<AuthorsAndOpuses.ReduceResult>
{
    public class ReduceResult
    {
        public string Name { get; set; }
        public string AllText { get; set; }

        public DateTime? Opus_PublishingDate { get; set; }
        public List<string> Opus_Genres { get; set; }

        public string Author_Biography { get; set; }
        public DateTime? Author_DateOfBirth { get; set; }
    }

    public AuthorsAndOpuses()
    {
        AddMap<Opus>(opuses => from opus in opuses
                               select new
                               {
                                   Name = opus.Name.Boost(3),
                                   AllText = new[]
                                                 {
                                                     string.Join(" ", opus.Authors.Select(author => author.FullName)),
                                                     opus.SubName,
                                                     opus.Description
                                                 },

                                   Opus_PublishingDate = opus.PublishingDate,
                                   Opus_Genres = opus.Genres,

                                   Author_Biography = (string)null,
                                   Author_DateOfBirth = (object)null
                               });

        AddMap<Author>(authors => from author in authors
                                  select new
                                  {
                                      Name = (author.FirstName + " " + author.LastName).Boost(3),
                                      AllText = new[]
                                                    {
                                                        author.Email,
                                                        author.Biography
                                                    },

                                      Opus_PublishingDate = (object)null,
                                      Opus_Genres = (object)null,

                                      Author_Biography = author.Biography,
                                      Author_DateOfBirth = author.DateOfBirth
                                  });

        Index(x => x.Name, FieldIndexing.Analyzed);
        Index(x => x.AllText, FieldIndexing.Analyzed);
    }
}

 

This index enables us to find all authors and opuses that contain a word beginning with “Dan”, no matter if the word is inside the author’s biography/email or part of the opus’ description.

string searchTerm = "Dan";
var results = session.Query<AuthorsAndOpuses.ReduceResult, AuthorsAndOpuses>()
    .Where(x => x.Name.StartsWith(searchTerm) || x.AllText.StartsWith(searchTerm))
    .As<dynamic>()
    .ToList();

 

Let’s assume we have an author whose name is “Daniel” and that author has 3 opuses. Using the query above, we will get all 4 documents. A very nice thing is that, because we have boosted the Name field in the index, the author will be the first item in the search result, followed by the 3 opuses, which only contain the search term in their AllText field, which was not boosted at all.

Also note that we have included opus and author specific properties in the index above and we had to null them out in the map-functions where we didn’t have the information available. Now we can use the same index to drill down the search result using conditions that are more specific to Opus or Author. In our own application we have done that using a SearchQueryBuilder that dynamically builds the query depending on the input viewmodel that gets posted to the controller.

Using this approach it is very easy to integrate rich search capabilities into your applications. Again, because RavenDB uses Lucene under the hood and has the indexes pre-built, you can expect to get results very fast. Just try it out on our site and let me know what you think.


Random sorting in RavenDB

Think of a shopping cart where you want to display random products in a sidebar. I needed to do something quite similar. Because RavenDB did not have any OOTB feature to randomly sort query results, I asked for some advice on the mailing-list, presenting my very first ideas on how it could be done.

Then it happened – my question was Ayendefied.

His answer was (uncut):

Next version, you have:

var list1 = s.Query<Customer>()
     .Customize(x=>x.RandomOrdering())
     .ToList();

Awesome, isn’t it?

He just implemented a very nice and fast way to get a random order for query results. Its usage is as easy as shown above. It has one overload that allows you to specify a random-seed for consecutive queries.

It’s already in the latest build, so everyone can start using it immediately. That saved me a lot of time!


Searching on string properties in RavenDB

I just got an email from Robert Muehsig asking me for some advice on how to do searching on text fields in RavenDB. This seems to be a question that pops up quite frequently, so here are my two cents.

Given this scenario:

class User
{
    public string Id { get; set; }
    public string Name { get; set; }
}

session.Store(new User { Name = "Daniel Lang" });
session.Store(new User { Name = "Daniel Smith"});
session.SaveChanges();

Let’s see what we get when we fire some basic queries on them:

session.Query<User>().Where(x => x.Name == "Daniel Lang").ToList(); // returns just me
session.Query<User>().Where(x => x.Name.StartsWith("Daniel")).ToList(); // returns "Daniel Lang" and "Daniel Smith"
session.Query<User>().Where(x => x.Name.Contains("Daniel")).ToList(); // ATTENTION - 0 results
session.Query<User>().Where(x => x.Name.Contains("Lang")).ToList(); // also 0 results!

While the first two queries work as expected, the third and the fourth both return 0 results although they should return 2 and 1 users respectively. What has happened here?

The reason for this has to do with the way RavenDB uses Lucene to build the index here.

Some background about indexes

If you send a query to RavenDB, this is what happens on the server: RavenDB will not go and scan all your documents to check whether a document meets your query condition or not! That would in fact be very slow. Instead, it executes the query on an index that has been created asynchronously in the background. RavenDB uses Lucene.NET to create, maintain and query those indexes. Because Lucene is incredibly fast for queries and the whole indexing happens asynchronously in a background thread on the server, both queries and writes are lightning fast with RavenDB. If you send a query for which no static index has been created before, RavenDB will create a temporary one for you automatically. This temporary index extracts (or “maps”) all the properties out of your documents that are needed to answer the query. Alright, I guess that’s the story you already know if you’re using RavenDB, but what does that mean for us?

The essential thing is – every query will be translated into a lucene-query in the end.

So when you write this line of code…

session.Query<User>().Where(x => x.Name.StartsWith("Daniel")).ToList();

it translates directly into this Lucene query: Name: Daniel*
The syntax is actually quite easy to understand. As you might guess, the asterisk is a placeholder/wildcard for “whatever follows” and thus represents the obvious implementation of “starts with”.

Lucene’s query syntax is very powerful and allows you to do a lot of interesting things. Take a look at the documentation here: Lucene query syntax.

So the obvious (and perfectly working) choice to implement string.Contains would be to use *search-term*, right?
Ok, but…

Why (the hell) does string.Contains() not work for text-search in RavenDB?

The short answer: because of RavenDB’s “safe-by-design” paradigm.

The long answer is that a leading wildcard forces lucene to do a full scan on the index and thus can drastically slow down query-performance. Lucene internally stores its indexes (actually the terms of string-fields) sorted alphabetically and “reads” from left to right. That’s the reason why it is fast to do a search for a trailing wildcard and slow for a leading one.
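
To make the difference tangible, here is a minimal sketch of the idea. It is not Lucene’s actual data structure, just a sorted array of terms; the SortedTerms class and its method names are hypothetical. It assumes the terms are distinct and sorted ordinally. A prefix search can jump to the right spot with a binary search and then read only the neighbouring terms, while a “contains” search gets no help from the sort order and has to look at every term.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of a sorted term list; not Lucene's real index format.
public static class SortedTerms
{
    // Trailing wildcard (prefix*): binary-search to the first candidate,
    // then read forward only while the terms still match.
    public static List<string> StartsWith(string[] sortedTerms, string prefix)
    {
        int index = Array.BinarySearch(sortedTerms, prefix, StringComparer.Ordinal);
        if (index < 0) index = ~index;   // not found: the complement is the insertion point

        var result = new List<string>();
        while (index < sortedTerms.Length &&
               sortedTerms[index].StartsWith(prefix, StringComparison.Ordinal))
        {
            result.Add(sortedTerms[index++]);
        }
        return result;
    }

    // Leading wildcard (*part*): the sort order does not help,
    // so every single term has to be inspected.
    public static List<string> Contains(string[] sortedTerms, string part)
    {
        return sortedTerms.Where(t => t.Contains(part)).ToList();
    }
}
```

With a handful of terms you will not notice the difference, but the first method touches only the matching terms plus a logarithmic number of probes, whereas the second one always touches all of them.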

Because of that, string.Contains is implemented the same way as string.Equals in RavenDB’s LINQ provider, for good reasons as we’ve seen.

How to do it right?

While you certainly can bypass RavenDB’s built-in protection using this line of code

session.Advanced.LuceneQuery<User>().Where("Name: *Daniel*").ToList();
// naive implementation of string.Contains

… I don’t recommend doing so, as there is a much better alternative available.

Instead of just showing you the code, I want you to understand the concept behind it. In the code samples above, we’ve seen that we get the correct result if the whole string matches the name of a user (not case-sensitive). This has to do with the way Lucene stores the strings that come in from the documents. Remember, an index in RavenDB (no matter if it was created manually or automatically) just extracts fields (properties) out of documents inside the database and passes them on to Lucene.NET, so that Lucene stores them for later querying. When a string/text field comes into Lucene, it will either be stored as it is or be analyzed first. Analyzing means that the string might be split up (tokenized) into multiple parts (terms) before they get stored in the index. This is very important to understand!

Think of a Lucene index as a flat table with two columns, where the first column contains all the terms (whole string properties out of documents, or tokenized parts) and the second column the primary keys (the ids) of the corresponding original documents, which can later be used to load them out of RavenDB’s underlying storage engine. This model is actually far away from how Lucene.NET really works, but I think it works quite well to explain the concept…

Ok, so you have probably guessed already that it’s the analyzing step in the indexing process that makes the difference. You are right: by default RavenDB uses a custom analyzer (one that doesn’t ship with Lucene.NET) called LowerCaseKeywordAnalyzer. That means, if you don’t explicitly ask for a different one on an indexed property, this guy will be used for it. What it does is lowercase all strings before they come into Lucene (thus queries are not case-sensitive). That’s fine in most cases, but if we want to do some text searching, we will want to use an analyzer that tokenizes every word. In our sample above, we don’t want the name “Daniel Lang” to be stored as a whole inside Lucene’s “table”, but instead want two rows “Daniel” and “Lang” in the first column.
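
Here is a minimal sketch of that “flat table” idea with two toy analyzers. These are hypothetical stand-ins written for this post, not the real LowerCaseKeywordAnalyzer or any other Lucene.NET class, and the AnalyzerSketch name and document ids are made up. The sketch only shows which rows end up in the table depending on whether the string is kept whole or split into words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the two analyzing strategies; not the real Lucene.NET classes.
public static class AnalyzerSketch
{
    // Keeps the whole string as one lowercased term.
    public static IEnumerable<string> KeywordLowerCase(string value)
    {
        yield return value.ToLowerInvariant();
    }

    // Produces one lowercased term per space-separated word.
    // (No stop-word handling here, to keep the sketch short.)
    public static IEnumerable<string> Tokenizing(string value)
    {
        return value.ToLowerInvariant()
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Builds the "flat table": term -> ids of the documents that produced it.
    public static SortedDictionary<string, List<string>> BuildIndex(
        IDictionary<string, string> namesById,
        Func<string, IEnumerable<string>> analyze)
    {
        var index = new SortedDictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (var doc in namesById)
        {
            foreach (var term in analyze(doc.Value))
            {
                if (!index.TryGetValue(term, out var ids))
                    index[term] = ids = new List<string>();
                ids.Add(doc.Key);
            }
        }
        return index;
    }
}
```

For the two users from the sample, the keyword variant yields the rows “daniel lang” and “daniel smith”, so a search for “lang” finds nothing. The tokenizing variant yields “daniel” (pointing at both documents), “lang” and “smith”, and now each word can be found on its own.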

The good news is, it’s quite simple to do that. Lucene ships with a handful of analyzers and they are all available through RavenDB as well. Take a look here: How indexes work

The one you want in case of our Name property is the StandardAnalyzer – it breaks words at whitespace characters and ignores common English stop-words.

Finally, here is the code:

documentStore.DatabaseCommands.PutIndex("UsersByName", new IndexDefinitionBuilder<User>
{
    Map = users => from user in users
                   select new { user.Name },
    Indexes =
        {
            { x => x.Name, FieldIndexing.Analyzed}
        }
});

Because we have set the Name property to be Analyzed, RavenDB will automatically choose the StandardAnalyzer for it. If you want to explicitly set another one, it looks quite similar (but I suggest sticking with the StandardAnalyzer in this case):

documentStore.DatabaseCommands.PutIndex("UsersByName", new IndexDefinitionBuilder<User>
{
    Map = users => from user in users
                   select new { user.Name },
    Analyzers =
        {
            { x => x.Name, "SimpleAnalyzer"}
        }
});

 

Once the index is set up correctly, we can run the following query:

session.Query<User>("UsersByName").Where(x => x.Name.StartsWith("Lang")).ToList();
// will return the user "Daniel Lang", even though this string doesn't start with "Lang"

That’s all there is to it. As soon as you’ve got the indexing thing clear in your mind, you will love the amazing power Lucene and its query syntax give you. Good luck!

Edit: Please note…
…as Steve pointed out in the comment below, this doesn’t answer the question of how to implement string.Contains. I’m sorry that I haven’t been very clear on that. Here’s the deal: you don’t need it! Think about it: in almost every case you actually want string.StartsWith on every word, because the results will be more relevant. If we had a full string.Contains, a search for “La” would not only return “Daniel Lang”, but also a user “Angela Merkel”.
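
The difference between the two behaviours is easy to see on plain strings. The following is a small in-memory sketch with no RavenDB or Lucene involved; the WordSearch class and its methods are hypothetical names for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical in-memory comparison; no RavenDB or Lucene involved.
public static class WordSearch
{
    // True if any space-separated word of the name starts with the term (ignoring case).
    public static bool AnyWordStartsWith(string name, string term)
    {
        return name.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                   .Any(w => w.StartsWith(term, StringComparison.OrdinalIgnoreCase));
    }

    // True if the term occurs anywhere in the name (ignoring case).
    public static bool ContainsAnywhere(string name, string term)
    {
        return name.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

For the term “La”, AnyWordStartsWith matches “Daniel Lang” but not “Angela Merkel”, while ContainsAnywhere matches both.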

However, if you really have a case where you want to do string.Contains (I cannot think of an example scenario for that right now) you can certainly do that using the leading wildcard.

One more thing: an implementation of string.EndsWith can be done very easily and efficiently. All you need to do is select a Lucene analyzer that reverses the string. Because analyzers are also applied to the search term, it really is as easy as that.
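
The trick behind this can be sketched without any Lucene code. The ReversedTerms class below is a hypothetical illustration, not an existing analyzer: terms are stored reversed at index time, the search term is reversed at query time, and “ends with” turns into “starts with”.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the reversing idea; not an actual Lucene analyzer.
public static class ReversedTerms
{
    public static string Reverse(string value)
    {
        var chars = value.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Index time: store every term lowercased and reversed, sorted like a term list.
    public static List<string> BuildIndex(IEnumerable<string> terms)
    {
        return terms.Select(t => Reverse(t.ToLowerInvariant()))
                    .OrderBy(t => t, StringComparer.Ordinal)
                    .ToList();
    }

    // Query time: reverse the search term as well, so "ends with" becomes "starts with".
    // For brevity this filters linearly; on a sorted term list the prefix
    // match could seek directly to the first candidate.
    public static List<string> EndsWith(List<string> reversedIndex, string suffix)
    {
        var reversedSuffix = Reverse(suffix.ToLowerInvariant());
        return reversedIndex
            .Where(t => t.StartsWith(reversedSuffix, StringComparison.Ordinal))
            .Select(t => Reverse(t))   // turn the stored term back into readable form
            .ToList();
    }
}
```

A search for the suffix “th” is reversed to “ht”, which is a prefix of the stored term “htims”, so “smith” is found using nothing but a trailing wildcard.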
