Data Pages
For data pages, the 3 pieces of information are encoded back to back, after the page header. We have the
definition levels data,
repetition levels data,
encoded values. The size specified in the header is for all 3 pieces combined.
The data for the data page is always required. The definition and repetition levels are optional, based on the schema definition. If the column is not nested (i.e. the path to the column has length 1), we do not encode the repetition levels (they would always have the value 1). For data that is required, the definition levels are skipped (if encoded, they will always have the value of the max definition level).
For example, in the case where the column is non-nested and required, the data in the page is only the encoded values.
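As an illustration of the rule above, the sketch below shows how a reader might decide which of the three pieces are present for a given column. The class and field names (ColumnDescriptor, pathLength, required) are hypothetical and not part of the format.

```java
// Hypothetical sketch: deciding which pieces of a data page to read.
// The class and field names are illustrative, not part of the format.
public class DataPageLayout {

    static final class ColumnDescriptor {
        final int pathLength;    // length of the path to the column in the schema
        final boolean required;  // true if every field on the path is required
        ColumnDescriptor(int pathLength, boolean required) {
            this.pathLength = pathLength;
            this.required = required;
        }
    }

    // Repetition levels are only written for nested columns (path length > 1).
    static boolean hasRepetitionLevels(ColumnDescriptor column) {
        return column.pathLength > 1;
    }

    // Definition levels are only written when a value on the path can be missing;
    // for a fully required column they are skipped.
    static boolean hasDefinitionLevels(ColumnDescriptor column) {
        return !column.required;
    }

    public static void main(String[] args) {
        // A non-nested, required column: the page holds only the encoded values.
        ColumnDescriptor flatRequired = new ColumnDescriptor(1, true);
        System.out.println(hasRepetitionLevels(flatRequired)); // false
        System.out.println(hasDefinitionLevels(flatRequired)); // false
    }
}
```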
The following encodings are supported:
Plain encoding (PLAIN = 0)
The plain encoding is used whenever a more efficient encoding cannot be used. It stores the data in the following format (a short illustrative sketch follows the list):
BOOLEAN: Bit Packed (see above), LSB first
INT32: 4 bytes little endian
INT64: 8 bytes little endian
INT96: 12 bytes little endian
FLOAT: 4 bytes IEEE little endian
DOUBLE: 8 bytes IEEE little endian
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
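As a minimal sketch of a few of these layouts, the Java snippet below produces the little-endian byte representations for INT32, DOUBLE, and BYTE_ARRAY values; the helper names are illustrative and not part of any Parquet API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch of the plain encoding for a few types.
// The helper names are not part of any Parquet API.
public class PlainEncodingSketch {

    // INT32: 4 bytes, little endian.
    static byte[] encodeInt32(int value) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
    }

    // DOUBLE: 8 bytes IEEE, little endian.
    static byte[] encodeDouble(double value) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putDouble(value).array();
    }

    // BYTE_ARRAY: 4-byte little-endian length followed by the raw bytes.
    static byte[] encodeByteArray(byte[] bytes) {
        return ByteBuffer.allocate(4 + bytes.length)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(bytes.length)
                .put(bytes)
                .array();
    }

    public static void main(String[] args) {
        System.out.println(encodeInt32(7).length);    // 4
        System.out.println(encodeDouble(1.5).length); // 8
        System.out.println(encodeByteArray("abc".getBytes(StandardCharsets.UTF_8)).length); // 4 + 3 = 7
    }
}
```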
Dictionary Encoding (PLAIN_DICTIONARY = 2)
The dictionary encoding builds a dictionary of values encountered in a given column. The dictionary will be stored in a dictionary page per column chunk. The values are stored as integers using the RLE/Bit-Packing Hybrid encoding described above. If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk.
Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding described above.
Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width).
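The sketch below illustrates only the bookkeeping side of this scheme: collecting distinct values into a dictionary, mapping values to entry ids, and computing the bit width byte written at the start of a dictionary-encoded data page. The RLE/bit-packing of the ids and the fallback thresholds are omitted, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: mapping column values to dictionary entry ids and
// computing the bit width byte for a dictionary-encoded data page.
// The RLE/bit-packing of the ids themselves is not shown.
public class DictionaryEncodingSketch {

    public static void main(String[] args) {
        List<String> columnValues = List.of("us", "fr", "us", "de", "fr", "us");

        // Dictionary of distinct values, in the order they are first seen.
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<Integer> entryIds = new ArrayList<>();
        for (String value : columnValues) {
            entryIds.add(dictionary.computeIfAbsent(value, v -> dictionary.size()));
        }

        // Bit width needed to represent the largest entry id (stored as 1 byte).
        int maxId = dictionary.size() - 1;
        int bitWidth = maxId == 0 ? 0 : 32 - Integer.numberOfLeadingZeros(maxId);

        System.out.println(dictionary.keySet()); // [us, fr, de]
        System.out.println(entryIds);            // [0, 1, 0, 2, 1, 0]
        System.out.println(bitWidth);            // 2
    }
}
```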
Column chunks
Column chunks are composed of pages written back to back. The pages share a common header and readers can skip over pages they are not interested in. The data for the page follows the header and can be compressed and/or encoded. The compression and encoding are specified in the page metadata.
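As a rough sketch of what skipping a page means in practice, the snippet below walks back-to-back pages using the body size recorded in each header. PageHeader and its fields are hypothetical stand-ins for the real Thrift-defined header, and the sizes are made up for the example.

```java
import java.util.List;

// Illustrative sketch of how pages written back to back can be skipped:
// each header records the size of the page body that follows it, so a reader
// can jump straight to the next header. PageHeader is a hypothetical stand-in
// for the real Thrift-defined header.
public class PageSkippingSketch {

    record PageHeader(int headerSize, int compressedPageSize, boolean wanted) {}

    public static void main(String[] args) {
        // Pretend the column chunk contains three pages with these headers.
        List<PageHeader> headers = List.of(
                new PageHeader(20, 1_000, false),
                new PageHeader(20, 2_000, true),
                new PageHeader(20, 1_500, false));

        long offset = 0;  // byte offset from the start of the column chunk
        for (PageHeader header : headers) {
            long bodyStart = offset + header.headerSize();
            if (header.wanted()) {
                System.out.println("read page body at offset " + bodyStart);
            } else {
                System.out.println("skip page body at offset " + bodyStart);
            }
            // Either way, the next header starts right after the page body.
            offset = bodyStart + header.compressedPageSize();
        }
    }
}
```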
Checksumming
Data pages can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups.
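The text above does not pin down a checksum algorithm or exactly which bytes it covers; purely as an assumption-laden sketch, the snippet below computes a per-page CRC32 at write time and verifies it at read time.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Purely illustrative: a per-page checksum using java.util.zip.CRC32.
// The choice of CRC32 and the bytes it covers are assumptions, not taken
// from the text above.
public class PageChecksumSketch {

    static long pageChecksum(byte[] pageBytes) {
        CRC32 crc = new CRC32();
        crc.update(pageBytes);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] page = "example page bytes".getBytes(StandardCharsets.UTF_8);
        long stored = pageChecksum(page);          // computed at write time
        long recomputed = pageChecksum(page);      // recomputed at read time
        System.out.println(stored == recomputed);  // true when the page is intact
    }
}
```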
Error recovery
If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.
Potential extension: With smaller row groups, the biggest issue is losing the file metadata at the end. If this happens in the write path, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group.
Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for RC or Avro files using sync markers, a reader could recover partially written files.