查看: 4623|回复: 9

开源行为数据库SkyDB介绍

[复制链接]
认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-20 21:26 | 显示全部楼层 |阅读模式
What is Sky?

Sky is an open source database used for flexible, high performance analysis of behavioral data. For certain kinds of data such as clickstream data and log data, it can be several orders of magnitude faster than traditional approaches such as SQL databases or Hadoop.

The performance of Sky comes from optimized data organization, fast query execution, and the embarrassingly parallel nature of behavioral data. On commodity hardware you can expect to query several million events per core per second.


认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-20 21:28 | 显示全部楼层
Overview

This guide is meant to give you an understanding of how Sky works and how you can use it. It covers the basics of how data is stored and how you can use Sky's query engine for advanced analytics.

Partially Transient Hashes

Sky is a database for tracking the state of partially transient hashes over time. These hashes are partially transient because the individual properties of the hash can be set to persist values over time or only hold the value for a particular moment. These data structures are especially useful for tracking the behavior of things.

Let's look at an example. Say you want to know what users are doing on your e-commerce web site. First you want to track properties about these users such as gender and place of residence. These are called permanent properties and their value will persist from the time it is set until it is changed.

Next you want to track properties about actions that these users are doing such as visiting your home page, adding a product to a cart and checking out. These properties are called transient properties and their value will only exist for a single moment in time.

Each time you update the value of one or more properties it is called an event which as an associated timestamp and these events are all combined into a timeline for each hash. Hashes are also referred to as objects in Sky. Because these events are all organized into a single timeline for each user, Sky can query over this data extremely fast.

使用道具 举报

回复
认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-20 21:28 | 显示全部楼层
Internal Representation & Data Types

Sky is composed of tables which are just collections of objects that adhere to an ad hoc schema that is defined by properties. Properties can be defined as transient or permanent and have a data type associated with them. There are five data types currently supported: string, integer, float, boolean, & factor.

Factor types work the same as strings except that they are represented more efficiently when the property has a limited number of possible values. They work great for categorical data such as gender, state or action names.

The Query System

Because Sky stores non-relational data, SQL doesn't make sense as a query language. Instead Sky has a simple query language composed of a few basic primitives: queries, selections & conditions. Sky uses a RESTful JSON over HTTP API so we'll show queries in their JSON format.

使用道具 举报

回复
认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-20 21:29 | 显示全部楼层
Query

The query primitive is the root of all queries and contains a list of steps (either conditions or selections). The query engine works by iterating over each event for each object in chronological order like a cursor in a database. At each event, all the steps in the query are executed.

If you want to break timelines into sessions you can use sessionIdleTime to specify the number of seconds between two contiguous events to delineate a session.

Selections

A selection is simply a sample that is taken whenever the query engine reaches it. The selection can specify an optional name, optional dimensions and one or more fields. The selection fields are named and have an expression using one of the aggregation functions: count(), sum(field), min(field), max(field).

Simple Count Query

Here's an example of a simple query to count every event in a table:

Running this query will output the following result (assuming you have 381,293 events in your table):
  1. {"count":381293}
复制代码

使用道具 举报

回复
认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-20 21:29 | 显示全部楼层
Multi-Dimensional Query

A more complicated query involves breaking out these sample by a dimension of each object. For example, let's assume you want to see the breakdown events that occur by gender and country. We'll also name our selection “myStats”:
  1. {
  2.   "steps":[
  3.     {"type":"selection", "name":"myStats", "dimensions":["gender","country"], "fields":[
  4.       {"name":"count", "expression":"count()"}
  5.     ]}
  6.   ]
  7. }
复制代码
This query would return the following results:
  1. {
  2.   "myStats":{
  3.     "gender":{
  4.       "male":{
  5.         "country":{
  6.           "US":{
  7.             "count":19382
  8.           },
  9.           "Mexico":{
  10.             "count":10302
  11.           }
  12.         }
  13.       },
  14.       "female":{
  15.         "country":{
  16.           "US":{
  17.             "count":20183
  18.           },
  19.           "England":{
  20.             "count":3023
  21.           }
  22.         }
  23.       }
  24.     }
  25.   }
  26. }
复制代码

使用道具 举报

回复
认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-21 21:21 | 显示全部楼层
Conditions

Basic aggregation is not too interesting so that's why we have conditions. Conditions are like mini-queries that will execute only if a given expression evaluates to true. Conditions are also a way to move the query system's cursor forward by using the within property. Another useful feature of conditions is that they can be nested inside each other.

Simple Conditioned Selection

Here's an example query to count the number of events where the “action” property is equal to “checkout” and sum the total purchase price property at each checkout:
  1. {
  2.   "steps":[
  3.     {"type":"condition", "expression":"action == 'checkout'", "steps":[
  4.       {"type":"selection", "fields":[
  5.         {"name":"count", "expression":"count()"},
  6.         {"name":"total", "expression":"sum(purchaseAmount)"}
  7.       ]}
  8.     ]}
  9.   ]
  10. }
复制代码

使用道具 举报

回复
认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-21 21:22 | 显示全部楼层
Result:

  1. {"count":128712, "total":2910232.23}
复制代码
Funnel Analysis

The real power of the Sky query system is when you combine and nest these primitives. In this example, we'll use the within property which is a range that specifies the number of steps that a condition must occur within in order for it to execute. For example, specifying a within of [0,1] means the condition must evaluate to true in the current event or in the next event. A within of [1,1] means that the condition must evaluate true in the next event.

By nesting conditions, each subcondition only executes if its parent executes. We can perform a funnel analysis by combining nested conditions and selections. Let's say we want to track the number of users who visit our home page, go to our pricing page and finally go to our sign up page. We also want to count the number of users who make it to each step.

使用道具 举报

回复
认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-21 21:22 | 显示全部楼层
We'll number each step of the way (“0”, “1”, “2”). Finally we want to breakdown what action users performed immediately after the sign up so we'll dimension our final selection.
  1. {
  2.   "sessionIdleTime":7200,
  3.   "steps":[
  4.     {"type":"condition", "expression":"action == '/index.html'", "steps":[
  5.       {"type":"selection","name":"0","fields":[{"name":"count","expression":"count()"}]},
  6.       {"type":"condition", "expression":"action == '/pricing.html'", "within":[1,1], "steps":[
  7.         {"type":"selection","name":"1","fields":[{"name":"count","expression":"count()"}]},
  8.         {"type":"condition", "expression":"action == '/signup.html'", "within":[1,1], "steps":[
  9.           {"type":"selection","name":"2","fields":[{"name":"count","expression":"count()"}]},
  10.           {"type":"condition", "expression":"true", "within":[1,1], "steps":[
  11.             {"type":"selection","name":"3","dimensions":["action"],"fields":[{"name":"count","expression":"count()"}]}
  12.           ]}
  13.         ]}
  14.       ]}
  15.     ]}
  16.   ]
  17. }
复制代码

使用道具 举报

回复
认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-21 21:22 | 显示全部楼层
This query will result in a nice set of steps so we can see how our funnel breaks down:
  1. {
  2.   "0":{"count":10292},
  3.   "1":{"count":7382},
  4.   "2":{"count":2731},
  5.   "3":{
  6.     "/welcome.html":{"count":726},
  7.     "/contact_us.html":{"count":529},
  8.     "/index.html":{"count":128}
  9.   }
  10. }
复制代码
We read these results like this:

10,292 users made it to the home page.
7,382 of those users then clicked through to the pricing page.
2,731 of those users then went to the sign up page.
Finally, only 726 of those users actually signed up and made it to our welcome page. 529 users went to our contact page and then 128 users went back to our home page.
We can also deduce that any user who didn't make it to a next step must have left the web site. That means 1,348 users left the site (2731-(726+529+128)).

使用道具 举报

回复
认证徽章
论坛徽章:
277
马上加薪
日期:2014-02-19 11:55:14马上有对象
日期:2014-02-19 11:55:14马上有钱
日期:2014-02-19 11:55:14马上有房
日期:2014-02-19 11:55:14马上有车
日期:2014-02-19 11:55:14马上有车
日期:2014-02-18 16:41:112014年新春福章
日期:2014-02-18 16:41:11版主9段
日期:2012-11-25 02:21:03ITPUB年度最佳版主
日期:2014-02-19 10:05:27现任管理团队成员
日期:2011-05-07 01:45:08
发表于 2014-4-21 21:22 | 显示全部楼层
Client Libraries & ToolsClient Libraries
Sky uses a simple RESTful JSON over HTTP API so it's easy to interact with the server from any language. However, there are also several client libraries that are available:
Tools
Below is a list of open source tools built to be used with Sky:
Please let us know if you create any tools for Sky and we'll add them here.

使用道具 举报

回复

您需要登录后才可以回帖 登录 | 注册

本版积分规则 发表回复

DTCC2020中国数据库技术大会 限时8.5折

【架构革新 高效可控】2020年9月21日~23日第十一届中国数据库技术大会将在北京隆重召开。

大会设置2大主会场,20+技术专场,将邀请超百位行业专家,重点围绕数据架构、AI与大数据、传统企业数据库实践和国产开源数据库等内容展开分享和探讨,为广大数据领域从业人士提供一场年度盛会和交流平台。

http://dtcc.it168.com


大会官网>>
TOP技术积分榜 社区积分榜 徽章 团队 统计 知识索引树 积分竞拍 文本模式 帮助
  ITPUB首页 | ITPUB论坛 | 数据库技术 | 企业信息化 | 开发技术 | 微软技术 | 软件工程与项目管理 | IBM技术园地 | 行业纵向讨论 | IT招聘 | IT文档
  ChinaUnix | ChinaUnix博客 | ChinaUnix论坛
CopyRight 1999-2011 itpub.net All Right Reserved. 北京盛拓优讯信息技术有限公司版权所有 联系我们 
京ICP备09055130号-4  北京市公安局海淀分局网监中心备案编号:11010802021510 广播电视节目制作经营许可证:编号(京)字第1149号
  
快速回复 返回顶部 返回列表