Introduction to Neo4j

jieforest · posted 2015-2-17 21:19

At each depth, we ran the query 10 times—this was simply to warm up any caches that could help with performance. The fastest execution time for each depth was recorded. No additional database performance tuning was performed, apart from column indexes defined in the SQL script from listing 1.1. Table 1.1 shows the results of the experiment.
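The measurement procedure described above (repeat each query several times so that caches warm up, then keep the fastest run) could be sketched as a small Java harness. This is only an illustration under assumptions: the `BestOfTiming` class and its `fastestNanos` method are hypothetical names, and the `Runnable` stands in for whatever code actually issues the query.

```java
/** Hypothetical helper: runs a task several times and keeps the fastest time. */
final class BestOfTiming {

    /** Runs the task `runs` times and returns the shortest elapsed time in nanoseconds. */
    static long fastestNanos(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed); // earlier runs mostly serve to warm caches
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in workload; in the experiment this would execute the query.
        Runnable query = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };
        long nanos = fastestNanos(query, 10);
        System.out.println("fastest of 10 runs: " + nanos / 1_000_000.0 + " ms");
    }
}
```

Keeping the minimum rather than the average is one reasonable way to reduce the influence of cold caches and background noise on the reported figure.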

Table 1.1: Execution times for multiple join queries using a MySQL database engine on a data set of 1,000 users

jieforest · posted 2015-2-17 21:19

Note

All experiments were executed on an Intel i7–powered commodity laptop with 8 GB of RAM, the same computer that was used to write this book.

Note

With depths 3, 4, and 5, a count of 999 is returned. Due to the small data set, any user in the database is connected to all others.

As you can see, MySQL handles queries to depths 2 and 3 quite well. That’s not unexpected—join operations are common in the relational world, so most database engines are designed and tuned with this in mind. The use of database indexes on the relevant columns also helped the relational database to maximize its performance of these join queries.

jieforest · posted 2015-2-17 21:20

At depths 4 and 5, however, performance degrades significantly: a query involving 4 joins takes over 10 seconds to execute, and at depth 5 execution takes over a minute and a half, even though the count result doesn’t change. This illustrates a limitation of MySQL when modeling graph data: deep graphs require multiple joins, which relational databases typically don’t handle well.

Inefficiency of SQL joins

To find all of a user’s friends at depth 5, a relational database engine needs to generate the Cartesian product of the t_user_friend table five times. With 50,000 records in the table, the resulting set will have 50,000^5 rows (about 3.1 × 10^23), which takes quite a lot of time and computing power to calculate. Then you discard more than 99% to return the just under 1,000 records that you’re interested in!
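As a quick check of the arithmetic in this sidebar, the size of a five-fold Cartesian product over a 50,000-row table can be computed exactly with `BigInteger`. The class and method names below are illustrative only.

```java
import java.math.BigInteger;

/** Illustrative helper for sizing a repeated Cartesian product. */
final class CartesianSize {

    /** Number of rows when a table of `tableRows` rows is crossed with itself `times` times. */
    static BigInteger rows(long tableRows, int times) {
        return BigInteger.valueOf(tableRows).pow(times);
    }

    public static void main(String[] args) {
        BigInteger total = rows(50_000, 5);
        // 50,000^5 = 5^5 * 10^20 = 3125 followed by 20 zeros, roughly 3.1 x 10^23
        System.out.println(total);
    }
}
```

Compared with the just under 1,000 rows that are finally returned, almost all of that intermediate result is thrown away.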

As you can see, relational databases are not so great for modeling many-to-many relationships, especially in large data sets. Neo4j, on the other hand, excels at many-to-many relationships, so let’s take a look at how it performs with the same data set. Instead of tables, columns, and foreign keys, you’re going to model users as nodes, and friendships as relationships between nodes.

jieforest · posted 2015-2-19 20:03

Graph data in Neo4j

Neo4j stores data as vertices and edges, or, in Neo4j terminology, nodes and relationships. Users will be represented as nodes, and friendships will be represented as relationships between user nodes. If you take another look at the social network in figure 1.1, you’ll see that it represents nothing more than a graph, with users as nodes and friendship arrows as relationships.
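To make the nodes-and-relationships model concrete, here is a minimal in-memory sketch in plain Java. It is not Neo4j's storage format or API; the `SocialGraphSketch` class, its methods, and the user names in the usage note are assumptions made up for illustration. Only the relationship type name IS_FRIEND_OF is borrowed from the traversal listing in this chapter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory model only; this is not how Neo4j stores data. */
final class SocialGraphSketch {

    /** A directed, typed relationship between two user nodes. */
    static final class Relationship {
        final String type;
        final String from;
        final String to;

        Relationship(String type, String from, String to) {
            this.type = type;
            this.from = from;
            this.to = to;
        }
    }

    // Each user node keeps the relationships that start at it.
    private final Map<String, List<Relationship>> outgoing = new LinkedHashMap<>();

    void addUser(String name) {
        outgoing.putIfAbsent(name, new ArrayList<>());
    }

    void addFriendship(String from, String to) {
        addUser(from);
        addUser(to);
        outgoing.get(from).add(new Relationship("IS_FRIEND_OF", from, to));
    }

    /** Direct friends are found by following relationships from one node. */
    List<String> friendsOf(String name) {
        List<String> result = new ArrayList<>();
        List<Relationship> rels =
                outgoing.getOrDefault(name, Collections.<Relationship>emptyList());
        for (Relationship r : rels) {
            result.add(r.to);
        }
        return result;
    }

    int userCount() {
        return outgoing.size();
    }
}
```

For example, after `addFriendship("John", "Kate")` and `addFriendship("John", "Alex")`, calling `friendsOf("John")` returns Kate and Alex by looking only at John's own relationships, with no table scan or join.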

There’s one key difference between relational and Neo4j databases, which you’ll come across right away: data querying. There are no tables and columns in Neo4j, nor are there any SQL-based select and join commands. So how do you query a graph database?

The answer is not “write a distributed MapReduce function.” Neo4j, like all graph databases, takes a mathematical concept from graph theory and uses it as an efficient engine for querying data. This concept is graph traversal, and it’s one of the main tools that make Neo4j so powerful for dealing with large-scale graph data.

jieforest · posted 2015-2-19 20:03

Traversing the graph

The traversal is the operation of visiting a set of nodes in the graph by moving between nodes connected with relationships. It’s a fundamental operation for data retrieval in a graph, and as such, it’s unique to the graph model. The key concept of traversals is that they’re localized—querying the data using a traversal only takes into account the data that’s required, without needing to perform expensive grouping operations on the entire data set, like you do with join operations on relational data.
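The locality of a traversal can be illustrated with a small breadth-first sketch in plain Java. This is not Neo4j's implementation; `LocalTraversalSketch` and `nodesAtDepth` are hypothetical names, and the graph is assumed to be a simple adjacency map from node ID to neighbour IDs. The point to notice is that the loop only ever touches nodes reachable from the start node within the requested depth.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative breadth-first walk over an adjacency map. */
final class LocalTraversalSketch {

    /** Returns the nodes first reached exactly `depth` hops from `start`. */
    static Set<Integer> nodesAtDepth(Map<Integer, List<Integer>> adjacency, int start, int depth) {
        Set<Integer> visited = new HashSet<>();
        visited.add(start);
        Set<Integer> frontier = new LinkedHashSet<>();
        frontier.add(start);
        for (int d = 0; d < depth; d++) {
            Set<Integer> next = new LinkedHashSet<>();
            for (int node : frontier) {
                List<Integer> neighbours =
                        adjacency.getOrDefault(node, Collections.<Integer>emptyList());
                for (int neighbour : neighbours) {
                    // Each node is visited at most once, at the first depth it is reached.
                    if (visited.add(neighbour)) {
                        next.add(neighbour);
                    }
                }
            }
            frontier = next;
        }
        return frontier;
    }
}
```

Parts of the graph that are not connected to the start node are never read, which is the contrast with a join that has to consider whole tables.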

Neo4j provides a rich Traversal API, which you can employ to navigate through the graph. In addition, you can use the REST API or Neo4j query languages to traverse your data. We’ll dedicate much of this book to teaching you the principles of and best practices for traversing data with Neo4j.

jieforest · posted 2015-2-19 20:03

To get all the friends of a user’s friends, run the code in the following listing.

Listing 1.2: Neo4j Traversal API code for finding all friends at depth 2

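// Uses classes from org.neo4j.graphdb, org.neo4j.graphdb.traversal and org.neo4j.kernel
TraversalDescription traversalDescription =
    Traversal.description()
        // relationships() expects a RelationshipType rather than a plain String
        .relationships(DynamicRelationshipType.withName("IS_FRIEND_OF"), Direction.OUTGOING)
        .evaluator(Evaluators.atDepth(2))       // include only nodes at depth 2
        .uniqueness(Uniqueness.NODE_GLOBAL);    // visit each node at most once
// nodeById is the starting user node, looked up beforehand
Iterable<Node> nodes = traversalDescription.traverse(nodeById).nodes();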


Don’t worry if you don’t understand the syntax of the code snippet in listing 1.2—everything will be explained slowly and thoroughly in the next few chapters. Figure 1.3 illustrates the traversal of the social network graph, based on the preceding traversal description.

jieforest · posted 2015-2-19 20:04

Figure 1.3: Traversing the social network graph data

jieforest · posted 2015-2-19 20:04

Before the traversal starts, you select the node from which the traversal will start (node X in figure 1.3). Then you follow all the friendship relationships (arrows) and collect the visited nodes as results. The traversal continues its journey from one node to another via the relationships that connect them. The direction of relationships does not affect the traversal—you can go up and down the arrows with the same efficiency. When the rules stop applying, the traversal stops. For example, the rule can be to visit only nodes that are at depth 1 from the starting node, in which case once all nodes at depth 1 are visited, the traversal stops. (The darker arrows in figure 1.3 show the relationships that are followed for this example.)
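The two ideas in this paragraph, following arrows in either direction and stopping when a rule no longer applies, can be sketched in plain Java as follows. This is a simplified illustration and not Neo4j code: `RuleDrivenWalk`, `addArrow`, and `visit` are hypothetical names, and the stop rule is modeled as a predicate on the depth about to be visited.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntPredicate;

/** Illustrative walk that ignores arrow direction and stops when a rule says so. */
final class RuleDrivenWalk {

    private final Map<String, List<String>> neighbours = new HashMap<>();

    /** Records the arrow from -> to at both ends, so it can be followed either way. */
    void addArrow(String from, String to) {
        neighbours.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        neighbours.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    /** Walks outward from start, level by level, while keepGoing accepts the next depth. */
    Set<String> visit(String start, IntPredicate keepGoing) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        List<String> frontier = new ArrayList<>();
        frontier.add(start);
        for (int depth = 1; !frontier.isEmpty() && keepGoing.test(depth); depth++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier) {
                List<String> linked =
                        neighbours.getOrDefault(node, Collections.<String>emptyList());
                for (String other : linked) {
                    if (seen.add(other)) {
                        next.add(other);
                    }
                }
            }
            frontier = next;
        }
        seen.remove(start); // report only the nodes reached, not the starting node
        return seen;
    }
}
```

With arrows X to A, B to X, and A to C (node names other than X are made up here), the rule `depth -> depth <= 1` collects A and B, even though the arrow between B and X points toward X, and then the walk stops.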

The next table shows the performance metrics for running a traversal against a graph containing the same data as the previous MySQL database. The traversal is functionally the same as the queries executed previously on the database: it finds friends of friends up to the defined depth. Again, this is for a data set of 1,000 users with an average of 50 friends per user.

solman9 · posted 2015-2-19 21:48