HBase空间索引设计

jieforest · 发表于 2012-11-12 23:11

Consider these three locations: LaGuardia Airport (40.77° N, 73.87° W), JFK International Airport (40.64° N, 73.78° W), and Central Park (40.78° N, 73.97° W). Their coordinates geohash to the values dr5rzjcw2nze , dr5x1n711mhd , and dr5ruzb8wnfr, respectively.

You can look at those points on the map in figure 4 and see that Central Park is closer to LaGuardia than JFK. In absolute terms, Central Park to LaGuardia is about 5 miles, whereas Central Park to JFK is about 14 miles.

Figure 4 Relative distances. When viewed on a map, it's easy to see that the distance between Central Park and JFK is much farther than the distance between Central Park and LaGuardia. This is precisely the relationship you want to reproduce with your hashing algorithm.

jieforest · 发表于 2012-11-12 23:12

Because they're closer to each other spatially, you expect Central Park and LaGuardia to share more common prefix characters than Central Park and JFK. Sure enough:

$ sort <(echo "dr5rzjcw2nze"; echo "dr5x1n711mhd"; echo "dr5ruzb8wnfr")
dr5ruzb8wnfr
[1]
dr5rzjcw2nze
[2]
dr5x1n711mhd
[3]

复制代码

Now that you understand how a geohash can work for you, we'll show you how to calculate one. Don't worry; you won't be hashing all these points by hand. With HBase, it's useful to understand how it works in order to use it effectively. Likewise with the geohash, understanding how it's constructed will help you understand its edge cases.

jieforest · 发表于 2012-11-13 13:14

Understand the geohash

The geohashes you've seen are all represented as character strings in the Base32 encoding alphabet.[3] In reality, the geohash is a sequence of bits representing an increasingly precise subdivision of longitude and latitude.

For instance, 40.78° N is a latitude. It falls in the upper half[4] of the range [-90.0, 90.0], so its first geohash bit is 1. Its second bit is 0 because 40.78 is in the lower half of the range [0.0, 90.0]. The third range is [0.0, 45.0], and 40.78 falls in the upper half, so the third bit is 1.

The contribution provided by each dimension is calculated by halving the value range and determining which half the point resides in. If the point is greater than or equal to the midpoint, it's a 1-bit.

Otherwise, it's a 0-bit. This process is repeated, again cutting the range in half and selecting a 1 or 0 based on where the target point lies. This binary partitioning is performed on both the longitude and latitude values.

Rather than using the bit sequence from each dimension independently, the encoding weaves the bits together to create the hash. The spatial partitioning is why geohashes have the spatial locality property.

The weaving of bits from each dimension is what allows for the prefix-match precision trickery.

jieforest · 发表于 2012-11-13 13:15

Now that you understand how each component is encoded, let's calculate a full value. This process of bisecting the range and selecting a bit is repeated until the desired precision is achieved.

A bit sequence is calculated for both longitude and latitude, and the bits are interwoven, longitude first, then latitude, out to the target precision.

Figure 5 illustrates the process. Once the bit sequence is calculated, it's encoded to produce the final hash value.

Figure 5 Constructing a geohash. The first 3 bits from longitude and latitude are calculated and woven to produce a geohash of 6-bit precision. The example data we discussed previously executed this algorithm out to 7 Base32 characters, or 35-bit precision.

Now that you understand why the geohash is useful to you and how it works, let's plug it in for your rowkey.

jieforest · 发表于 2012-11-13 13:15

Using the geohash as a spatially aware rowkey

The geohash makes a great choice for the rowkey because it's inexpensive to calculate and the prefix gets you a long way toward finding nearest neighbors. Let's apply it to the sample data, sort by geohash, and see how you're doing on prefixes.

We've calculated the geohash for each point using a library[5] and added it to the data. All of the data in the sample is relatively close together, so you expect a good deal of prefix overlap across the points:

GEOHASH X Y ID NAME
1 dr5rugb9rwjj -73.96993203 40.75815170 442 Fedex Kinko's
2 dr5rugbge05m -73.96978387 40.75850573 388 Barnes & Noble
3 dr5rugbvggqe -73.96974759 40.75890919 441 Fedex Kinko's
4 dr5rugckg406 -73.96910155 40.75873061 564 Public Telephone
5 dr5ruu1x1ct8 -73.96880474 40.76048717 472 Juan Valdez NYC
6 dr5ruu29vytq -73.97000655 40.76098703 593 Starbucks
7 dr5ruu2y5vkb -73.96974993 40.76170883 219 Startegy Atrium and Cafe
8 dr5ruu3d7x0b -73.96873588 40.76107453 463 Smilers 707
9 dr5ruu693jhm -73.96746533 40.76089302 525 McDonalds

复制代码

jieforest · 发表于 2012-11-13 13:16

Sure enough, you get five characters of common prefix. That's not bad at all! This means you're a long way toward the distance query and goal #1 with a simple range scan. For context, figure 6 puts this data on a map.

Figure 6 Seeing prefix matches in action. If the target search is in this area, a simple rowkey scan will get the data you need. Not only that, but the order of results makes a lot more sense than the order in figure 2.

jieforest · 发表于 2012-11-13 13:16

This is much better than the compound rowkey approach, but it's by no means perfect. All these points are close together, within a couple blocks of each other. Why are you only matching on 5 of 12 characters?

We would hope data this spatially close would be stored much closer together. Thinking back to figure 3, the difference in size of the spatial area covered by a prefix scan of five versus six versus seven characters is significant—far more than a couple of blocks. You'd make strides toward goal #2 if you could make two scans of prefix six rather than a single scan of prefix five.

Or better still, how about five or six scans of prefix seven? Let's look at figure 7, this time with more perspective.

The geohash boxes for both the six-character and seven-character geohash prefixes are overlaid.

Figure 7 Prefix matches with geohash overlay. Lots of additional, unnecessary area is introduced into the query result by using the 6-character prefix. An ideal implementation would use only 7-character prefixes to minimize the amount of extra data transmitted over the wire.

jieforest · 发表于 2012-11-14 17:47

Compared to the target query area, the six-character prefix match areas are huge. Worse still, the query spans two of those larger prefixes. Seen in this context, those five characters of common prefix include far more data than you need. Relying on prefix match results in scanning a huge amount of extra area.

Of course, there's a trade-off. If your data isn't dense at this precision level, executing fewer, longer scans isn't such a big deal. The scans don't return too much superfluous data, and you can minimize the remote procedure call (RPC) overhead.

If your data is dense, running more, shorter scans will reduce the number of excess points transported over the wire. Plus, if there's one thing that computers are getting good at these days, it's parallelism. Execute each of those shorter scans on its own CPU core, and the query is still as fast as the slowest scan.

jieforest · 发表于 2012-11-14 17:48

Let's scroll the map over to a different part of Manhattan, not far from the space we've explored thus far. Look at figure 8. Notice that the geohash of the center box has six characters (dr5ruz) of prefix in common with the boxes to its east, southeast, and south.

But there are only five characters (dr5ru) in common with the west and southwest boxes. If five characters of common prefix is bad, then the prefix match with the entire northern row is abysmal, with only two characters (dr) in common!

Figure 8 Visualizing the geohash edge case. The encoding isn't perfect; this is one such case. Imagine a nearest-neighbor search falling on the point under the arrow in this illustration. It's possible you'll find a neighbor in a tile with only two characters of common prefix.

jieforest · 发表于 2012-11-14 17:48

This doesn't happen every time, but it does happen with a surprisingly high frequency. As a counterexample, all eight neighbors of the southeast box (dr5ruz9) share a common six-character prefix.

The geohash is effective, but you can't use a simple naive prefix match either. Based on these figures, it looks like the optimal approach for the data is to scan the center tile and its eight neighbors. This approach will guarantee correct results while minimizing the amount of unnecessary network IO.

Luckily, calculating those neighbors is a simple bit-manipulation away. Explaining the details of that manipulation is beyond the scope of our interest, so we'll trust the geohash library to provide that feature.