12
返回列表 发新帖
楼主: Sky-Tiger

HTML5、CSS3 及相关技术

[复制链接]
论坛徽章:
350
2006年度最佳版主
日期:2007-01-24 12:56:49NBA大富翁
日期:2008-04-21 22:57:29地主之星
日期:2008-11-17 19:37:352008年度最佳版主
日期:2009-03-26 09:33:53股神
日期:2009-04-01 10:05:56NBA季后赛大富翁
日期:2009-06-16 11:48:01NBA季后赛大富翁
日期:2009-06-16 11:48:01ITPUB年度最佳版主
日期:2011-04-08 18:37:09ITPUB年度最佳版主
日期:2011-12-28 15:24:18ITPUB年度最佳技术原创精华奖
日期:2012-03-13 17:12:05
 楼主| 发表于 2013-11-22 21:46 | 显示全部楼层
Run Length Encoding / Bit-Packing Hybrid (RLE = 3)

The second encoding uses a combination of bit-packing and run length encoding to more efficiently store repeated values.

The grammar for this encoding looks like this, given a fixed bit-width known in advance:

rle-bit-packed-hybrid: <length> <encoded-data>
length := length of the <encoded-data> in bytes stored as 4 bytes little endian
encoded-data := <run>*
run := <bit-packed-run> | <rle-run>  
bit-packed-run := <bit-packed-header> <bit-packed-values>  
bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1)  
// we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8  
bit-pack-count := (number of values in this run) / 8  
bit-packed-values := *see 1 below*  
rle-run := <rle-header> <repeated-value>  
rle-header := varint-encode( (number of times repeated) << 1)  
repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)
The bit-packing here is done in a different order than the one in the deprecated encoding above. The values are packed from the least significant bit of each byte to the most significant bit, though the order of the bits in each value remains in the usual order of most significant to least significant. For example, to pack the same values as the example in the deprecated encoding above:

The numbers 1 through 7 using bit width 3:

dec value: 0   1   2   3   4   5   6   7
bit value: 000 001 010 011 100 101 110 111
bit label: ABC DEF GHI JKL MNO PQR STU VWX
would be encoded like this where spaces mark byte boundaries (3 bytes):

bit value: 10001000 11000110 11111010
bit label: HIDEFABC RMNOJKLG VWXSTUPQ
The reason for this packing order is to have fewer word-boundaries on little-endian hardware when deserializing more than one byte at at time. This is because 4 bytes can be read into a 32 bit register (or 8 bytes into a 64 bit register) and values can be unpacked just by shifting and ORing with a mask. (to make this optimization work on a big-endian machine, you would have to use the ordering used in the deprecated bit-packing encoding)

使用道具 举报

回复
论坛徽章:
350
2006年度最佳版主
日期:2007-01-24 12:56:49NBA大富翁
日期:2008-04-21 22:57:29地主之星
日期:2008-11-17 19:37:352008年度最佳版主
日期:2009-03-26 09:33:53股神
日期:2009-04-01 10:05:56NBA季后赛大富翁
日期:2009-06-16 11:48:01NBA季后赛大富翁
日期:2009-06-16 11:48:01ITPUB年度最佳版主
日期:2011-04-08 18:37:09ITPUB年度最佳版主
日期:2011-12-28 15:24:18ITPUB年度最佳技术原创精华奖
日期:2012-03-13 17:12:05
 楼主| 发表于 2013-11-22 21:47 | 显示全部楼层
Data Pages
For data pages, the 3 pieces of information are encoded back to back, after the page header. We have the

definition levels data,
repetition levels data,
encoded values. The size of specified in the header is for all 3 pieces combined.
The data for the data page is always required. The definition and repetition levels are optional, based on the schema definition. If the column is not nested (i.e. the path to the column has length 1), we do not encode the repetition levels (it would always have the value 1). For data that is required, the definition levels are skipped (if encoded, it will always have the value of the max definition level).

For example, in the case where the column is non-nested and required, the data in the page is only the encoded values.

The following encodings are supported:

Plain encoding (PLAIN = 0)

The plain encoding is used whenever a more efficient encoding can not be used. It stores the data in the following format:

BOOLEAN: Bit Packed (see above), LSB first
INT32: 4 bytes little endian
INT64: 8 bytes little endian
INT96: 12 bytes little endian
FLOAT: 4 bytes IEEE little endian
DOUBLE: 8 bytes IEEE little endian
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
Dictionary Encoding (PLAIN_DICTIONARY = 2)

The dictionary encoding builds a dictionary of values encountered in a given column. The dictionary will be stored in a dictionary page per column chunk. The values are stored as integers using the RLE/Bit-Packing Hybrid encoding described above. If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk.

Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding described above.

Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width).

Column chunks
Column chunks are composed of pages written back to back. The pages share a common header and readers can skip over page they are not interested in. The data for the page follows the header and can be compressed and/or encoded. The compression and encoding is specified in the page metadata.

Checksumming
Data pages can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups.

Error recovery
If the file metadata is corrupt, the file is lost. If the column metdata is corrupt, that column chunk is lost (but column chunks for this column in order row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.

Potential extension: With smaller row groups, the biggest issue is lowing the file metadata at the end. If this happens in the write path, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group.
Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for rc or avro files using sync markers, a reader could recovery partially written files.

使用道具 举报

回复
论坛徽章:
6
蛋疼蛋
日期:2013-11-15 16:18:07奥迪
日期:2013-08-02 11:31:11保时捷
日期:2013-08-08 17:23:20ITPUB社区12周年站庆徽章
日期:2013-08-08 10:26:54一汽
日期:2013-10-04 16:35:26现代
日期:2013-11-15 16:18:07
发表于 2013-12-17 14:23 | 显示全部楼层
怎么不提供文档下载?

使用道具 举报

回复
招聘 : HTML页面制作
论坛徽章:
0
发表于 2014-9-25 16:23 | 显示全部楼层
学习中,谢谢

使用道具 举报

回复
论坛徽章:
0
发表于 2014-11-3 10:39 | 显示全部楼层
前面挺好的,后边贴的是什么呀,没看懂。 CSS3的学习资源 http://www.gbtags.com/gb/gbliblist/26.htm

使用道具 举报

回复
论坛徽章:
350
2006年度最佳版主
日期:2007-01-24 12:56:49NBA大富翁
日期:2008-04-21 22:57:29地主之星
日期:2008-11-17 19:37:352008年度最佳版主
日期:2009-03-26 09:33:53股神
日期:2009-04-01 10:05:56NBA季后赛大富翁
日期:2009-06-16 11:48:01NBA季后赛大富翁
日期:2009-06-16 11:48:01ITPUB年度最佳版主
日期:2011-04-08 18:37:09ITPUB年度最佳版主
日期:2011-12-28 15:24:18ITPUB年度最佳技术原创精华奖
日期:2012-03-13 17:12:05
 楼主| 发表于 2014-11-13 23:31 | 显示全部楼层

使用道具 举报

回复

您需要登录后才可以回帖 登录 | 注册

本版积分规则 发表回复

TOP技术积分榜 社区积分榜 徽章 团队 统计 知识索引树 积分竞拍 文本模式 帮助
  ITPUB首页 | ITPUB论坛 | 数据库技术 | 企业信息化 | 开发技术 | 微软技术 | 软件工程与项目管理 | IBM技术园地 | 行业纵向讨论 | IT招聘 | IT文档
  ChinaUnix | ChinaUnix博客 | ChinaUnix论坛
CopyRight 1999-2011 itpub.net All Right Reserved. 北京盛拓优讯信息技术有限公司版权所有 联系我们 
京ICP备09055130号-4  北京市公安局海淀分局网监中心备案编号:11010802021510 广播电视节目制作经营许可证:编号(京)字第1149号
  
快速回复 返回顶部 返回列表