This is a frowned upon (but working) way of allowing you to query and sort by the customer name when searching the index. The problem with this method is that if we have 15,000 orders per customer, we are going to get the same number of results out of the reduce phase as well.
Now, the reason this is frowned upon? Because while this is using map/reduce, it isn't actually… you know… reducing the data. In order to resolve this issue, we are going to make sure that all of the items generated from a single reduce step will always go into the same bucket. This means that we keep pretty much the same behavior as we have now; it is going to be inefficient, but that was always going to be the case.
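Here is a minimal sketch of that bucketing idea, assuming a fan-in of 1,024 per level. The names (`bucket_for`, `parent_bucket`) and the hash-the-document-id scheme are my illustration of the technique, not the actual implementation:

```python
import hashlib

FAN_IN = 1024  # assumed: the most items any single reduce step may see

def bucket_for(doc_id: str, leaf_buckets: int) -> int:
    """Hypothetical: assign a map result to a leaf bucket by hashing its
    source document id, so results from one document always land together."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % leaf_buckets

def parent_bucket(bucket: int) -> int:
    """All reduce outputs of one bucket share a single parent bucket,
    which is what keeps a reduce step's output together."""
    return bucket // FAN_IN
```

The integer division is the important part: everything a given reduce step produces maps to exactly one parent bucket, so the next level up can re-reduce it as a unit.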
We are also going to limit the number of levels to three, which still gives us the ability to handle over a billion results (1,024 × 1,024 × 1,024 = 1,073,741,824) before any single reduce step would need to see more than 1,024 items at once.
Take the California example: we would have 37,691,912 people, each of them generating a map result, so 37,691,912 map results at the first level. Those reduce into 36,809 buckets at the second level, and finally into 36 buckets at the third level, all of which are computed for the final result.
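The arithmetic here is just repeated ceiling division by the fan-in; a quick sketch to check it:

```python
import math

FAN_IN = 1024
results = 37_691_912  # one map result per person in California

level = 0
while results > 1:
    level += 1
    buckets = math.ceil(results / FAN_IN)
    print(f"level {level}: {results:,} results fall into {buckets:,} buckets")
    results = buckets

# level 1: 37,691,912 results fall into 36,809 buckets
# level 2: 36,809 results fall into 36 buckets
# level 3: 36 results fall into 1 bucket (the final reduce)
```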
The next step from here is to actually handle updates, which means that we have to keep track of the bucket ids going forward. So we start with deleting a person, which means that we need to delete their map result. That means we need to re-reduce the bucket they belong to, then its parent bucket at the next level, and so on upward. In total, we would have to compute 1,024 + 1,024 + 36 = 2,084 items, instead of 37,691,912.
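A hedged sketch of that path, assuming we recorded which leaf bucket each map result went into (the function name is mine, and the bucket walk reuses the parent-by-division idea from above):

```python
FAN_IN = 1024

def buckets_to_rereduce(leaf_bucket: int, levels: int = 3) -> list[int]:
    """Return the bucket ids, bottom-up, that must be re-reduced after
    a single map result in `leaf_bucket` is added, updated, or deleted."""
    path, bucket = [], leaf_bucket
    for _ in range(levels):
        path.append(bucket)
        bucket //= FAN_IN  # all outputs of a bucket share one parent
    return path

# Deleting one Californian whose map result sits in leaf bucket 20,000:
print(buckets_to_rereduce(20_000))  # [20000, 19, 0]
# Each re-reduce reads at most 1,024 items, and the final level holds 36,
# so the total work is 1,024 + 1,024 + 36 = 2,084 items, not 37,691,912.
```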
Okay, enough talking, let us see if I have straightened things out enough to actually be able to implement this.