Hypertable HQL Guide

jieforest · posted on 2013-6-28 21:51

HADOOP STREAMING MAPREDUCE

To run this example, Hadoop must be installed, and both HDFS and the MapReduce framework must be up and running. Hypertable builds against Cloudera's CDH3 distribution of Hadoop. See CDH3 Installation for instructions on how to get Hadoop up and running.

In this example, we'll be running a Hadoop Streaming MapReduce job that uses a Bash script as the mapper and a Bash script as the reducer. Like the example in the previous section, the programs operate on a table called wikipedia that has been loaded with a Wikipedia dump.

Setup

First, exit the Hypertable command line interpreter and download the Wikipedia dump, for example:

$ wget http://cdn.hypertable.com/pub/wikipedia.tsv.gz

jieforest · posted on 2013-6-28 21:52

Next, jump back into the Hypertable command line interpreter and create the wikipedia table by executing the HQL commands shown below.

CREATE NAMESPACE test;
USE test;
DROP TABLE IF EXISTS wikipedia;
CREATE TABLE wikipedia (
  title,
  id,
  username,
  article,
  word
);

jieforest · posted on 2013-6-28 21:52

Now load the compressed Wikipedia dump file directly into the wikipedia table by issuing the following HQL command:

hypertable> LOAD DATA INFILE "wikipedia.tsv.gz" INTO TABLE wikipedia;
Loading 638,058,135 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 78.28 s
Avg value size: 1709.59 bytes
Avg key size: 24.39 bytes
Throughput: 25226728.63 bytes/s (8151017.58 bytes/s)
Total cells: 1138847
Throughput: 14548.46 cells/s
Resends: 8328

The mapper script (tokenize-article.sh) and the reducer script (reduce-word-counts.sh) are shown below.

jieforest · posted on 2013-6-28 21:53

Example

The following script, tokenize-article.sh, will be used as the mapper script.

#!/usr/bin/env bash
# Mapper: emits "<row>\tword:<token>\t1" for each word in the article column.
# Input fields (row key, column, value) are separated by tabs.
IFS=$'\t'
read -r name column article
while [ $? == 0 ] ; do
  if [ "$column" == "article" ] ; then
    # Strip punctuation (the POSIX [:punct:] class covers the ASCII punctuation characters)
    stripped_article=$(echo "$article" | awk 'BEGIN { FS="\t" } { print $NF }' | tr '[:punct:]' ' ' | tr -s " ")
    # Split article into words
    echo "$stripped_article" | awk -v name="$name" 'BEGIN { article=name; FS=" "; } { for (i=1; i<=NF; i++) printf "%s\tword:%s\t1\n", article, $i; }'
  fi
  # Read another line
  read -r name column article
done
exit 0

jieforest · posted on 2013-6-28 21:53

The following script, reduce-word-counts.sh, will be used as the reducer script.

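#!/usr/bin/env bash
# Reducer: sums the counts of consecutive lines that share the same row and column.
# Input and output fields (row key, column, count) are separated by tabs.
last_article=
last_word=
total=0
IFS=$'\t'
read -r article word count
while [ $? == 0 ] ; do
  if [ "$article" == "$last_article" ] && [ "$word" == "$last_word" ] ; then
    let total=$count+total
  else
    # A new key has started: emit the total for the previous one
    if [ "$last_word" != "" ]; then
      printf '%s\t%s\t%s\n' "$last_article" "$last_word" "$total"
    fi
    let total=$count
    last_word=$word
    last_article=$article
  fi
  read -r article word count
done
# Emit the total for the final key
if [ $total -gt 0 ] ; then
  printf '%s\t%s\t%s\n' "$last_article" "$last_word" "$total"
fi
exit 0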

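Since Hadoop Streaming pipes tab-separated lines through the mapper, sorts them, and pipes them through the reducer, the data flow can be checked locally before submitting a job. The sketch below is only an approximation under stated assumptions: map_words and reduce_counts are hypothetical stand-ins that mimic the two scripts above, sort stands in for the shuffle phase, and the input cell is made up.

```shell
#!/usr/bin/env bash
# Local approximation of the streaming job: mapper | sort | reducer.
# map_words and reduce_counts are hypothetical stand-ins for
# tokenize-article.sh and reduce-word-counts.sh.

# Emit "<row>\tword:<token>\t1" for each word of an article cell.
map_words() {
  local row column value word
  while IFS=$'\t' read -r row column value; do
    [ "$column" = "article" ] || continue
    # Replace punctuation with spaces, then split on whitespace.
    for word in $(printf '%s\n' "$value" | tr '[:punct:]' ' '); do
      printf '%s\tword:%s\t1\n' "$row" "$word"
    done
  done
}

# Sum the counts of adjacent lines that share the same row and column.
reduce_counts() {
  awk -F'\t' '
    NR > 1 && ($1 != row || $2 != col) { printf "%s\t%s\t%d\n", row, col, total; total = 0 }
    { row = $1; col = $2; total += $3 }
    END { if (NR > 0) printf "%s\t%s\t%d\n", row, col, total }'
}

# Feed one fake cell (row key, column, value) through the whole pipeline.
simulate_job() {
  printf 'Apple\tarticle\tAn apple is a fruit. The apple tree...\n' |
    map_words | LC_ALL=C sort | reduce_counts
}

simulate_job
```

The word "apple" occurs twice in the fake article, so its line should carry a count of 2 while every other word carries 1. The real scripts can be checked the same way by replacing the two functions with ./tokenize-article.sh and ./reduce-word-counts.sh.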

jieforest · posted on 2013-6-29 01:39

To populate the word column of the wikipedia table by tokenizing the article column with the above mapper and reducer scripts, issue the following command:

hypertable> quit
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
-libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
-Dhypertable.mapreduce.namespace=test \
-Dhypertable.mapreduce.input.table=wikipedia \
-Dhypertable.mapreduce.output.table=wikipedia \
-mapper /home/doug/tokenize-article.sh \
-combiner /home/doug/reduce-word-counts.sh \
-reducer /home/doug/reduce-word-counts.sh \
-file /home/doug/tokenize-article.sh \
-file /home/doug/reduce-word-counts.sh \
-inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
-outputformat org.hypertable.hadoop.mapred.TextTableOutputFormat \
-input wikipedia -output wikipedia

jieforest · posted on 2013-6-29 01:39

Input/Output Configuration Properties

The following table lists the job configuration properties that are used to specify, among other things, the input table, output table, and scan specification. These properties can be supplied to a streaming MapReduce job with -Dproperty=value arguments.

Property | Description | Example Value
hypertable.mapreduce.namespace | Namespace for both input and output table | /test
hypertable.mapreduce.input.namespace | Namespace for input table | /test/input
hypertable.mapreduce.input.table | Input table name | wikipedia
hypertable.mapreduce.input.scan_spec.columns | Comma-separated list of input columns | id,title
hypertable.mapreduce.input.scan_spec.options | Input WHERE clause options. These options (e.g. LIMIT, OFFSET) are evaluated for each single job | MAX_VERSIONS 1 KEYS_ONLY
hypertable.mapreduce.input.scan_spec.row_interval | Input row interval | Dog <= ROW < Kitchen
hypertable.mapreduce.input.scan_spec.timestamp_interval | Timestamp filter | TIMESTAMP >= 2011-11-21
hypertable.mapreduce.input.include_timestamps | Emit integer timestamp as the 1st field (nanoseconds since epoch) | true
hypertable.mapreduce.output.namespace | Namespace containing output table | /test/output
hypertable.mapreduce.output.table | Output table name | wikipedia
hypertable.mapreduce.output.mutator_flags | Flags parameter passed to mutator constructor (1 = NO_LOG_SYNC) | 1
hypertable.mapreduce.thriftbroker.framesize | Sets the ThriftClient framesize (in bytes); the default is 16 MB | 20971520

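Several of these values contain spaces, so each -Dproperty=value argument has to reach Hadoop as a single word. One way to keep the quoting manageable is to collect the arguments in a Bash array first. This is only a sketch: the array name props is hypothetical, and the values are the example values from the table (with a shortened options value).

```shell
#!/usr/bin/env bash
# Collect job properties in an array so that each -D argument stays one word,
# even when its value contains spaces. The name "props" is hypothetical.
props=(
  "-Dhypertable.mapreduce.namespace=test"
  "-Dhypertable.mapreduce.input.table=wikipedia"
  "-Dhypertable.mapreduce.input.scan_spec.columns=id,title"
  "-Dhypertable.mapreduce.input.scan_spec.options=MAX_VERSIONS 1"
  "-Dhypertable.mapreduce.input.scan_spec.row_interval=Dog <= ROW < Kitchen"
)

# Print one argument per line to check the quoting before running the job.
printf '%s\n' "${props[@]}"

# The array would then be expanded inside the streaming command, after the
# -libjars argument and before -mapper, as "${props[@]}" (quotes included).
```

Expanding the array with the surrounding double quotes preserves the spaces inside each value; an unquoted expansion would split "Dog <= ROW < Kitchen" into separate arguments.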

jieforest · posted on 2013-6-29 01:40

Column Selection

To run a MapReduce job over a subset of columns from the input table, specify a comma-separated list of columns in the hypertable.mapreduce.input.scan_spec.columns Hadoop configuration property. For example:

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
-libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
-Dhypertable.mapreduce.namespace=test \
-Dhypertable.mapreduce.input.table=wikipedia \
-Dhypertable.mapreduce.input.scan_spec.columns="id,title" \
-mapper /bin/cat -reducer /bin/cat \
-inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
-input wikipedia -output wikipedia2

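A mapper other than /bin/cat can pick the selected cells apart. The sketch below assumes the same tab-separated row, column, value layout that tokenize-article.sh reads; titles_only is a hypothetical function, and the two input cells are made up.

```shell
#!/usr/bin/env bash
# Hypothetical mapper body: keep only "title" cells and emit "<row>\t<title>".
# Assumes input lines of the form "<row>\t<column>\t<value>".
titles_only() {
  local row column value
  while IFS=$'\t' read -r row column value; do
    if [ "$column" = "title" ]; then
      printf '%s\t%s\n' "$row" "$value"
    fi
  done
}

# Two fake cells for one row, in the layout assumed above.
printf 'Apple\tid\t42\nApple\ttitle\tApple (fruit)\n' | titles_only
```

Saved as a script that calls titles_only on standard input, this could be passed to the job with -mapper and -file in the same way as tokenize-article.sh.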

jieforest · posted on 2013-6-29 01:40

Timestamps

To filter the input table with a timestamp predicate, specify the timestamp predicate in the hypertable.mapreduce.input.scan_spec.timestamp_interval Hadoop configuration property. The timestamp predicate is specified using the same format as the timestamp predicate in the WHERE clause of the SELECT statement, as illustrated in the following examples:

TIMESTAMP < 2010-08-03 12:30:00
TIMESTAMP >= 2010-08-03 12:30:00
2010-08-01 <= TIMESTAMP <= 2010-08-09

jieforest · posted on 2013-6-29 01:40

To preserve the timestamps from the input table, set the hypertable.mapreduce.input.include_timestamps Hadoop configuration property to true. This causes the TextTableInputFormat class to produce an additional field (field 0) that holds the timestamp as nanoseconds since the epoch. The following example passes both a timestamp predicate and this property to a Hadoop Streaming MapReduce program.

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
-libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
-Dhypertable.mapreduce.namespace=test \
-Dhypertable.mapreduce.input.table=wikipedia \
-Dhypertable.mapreduce.output.table=wikipedia2 \
-Dhypertable.mapreduce.input.scan_spec.columns="id,title" \
-Dhypertable.mapreduce.input.scan_spec.timestamp_interval="2010-08-01 <= TIMESTAMP <= 2010-08-09" \
-Dhypertable.mapreduce.input.include_timestamps=true \
-mapper /bin/cat -reducer /bin/cat \
-inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
-outputformat org.hypertable.hadoop.mapred.TextTableOutputFormat \
-input wikipedia -output wikipedia2

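Because the timestamp arrives as a plain integer, a mapper may want to turn it into a readable date. The following sketch assumes that field 0 is followed by the usual row, column, and value fields; ns_to_utc and stamp_cells are hypothetical helper names, and the input cell is made up.

```shell
#!/usr/bin/env bash
# Convert a nanoseconds-since-epoch timestamp to a UTC date string.
# Tries the GNU date form first (-d @secs), then the BSD/macOS form (-r secs).
ns_to_utc() {
  local secs=$(( $1 / 1000000000 ))
  date -u -d "@${secs}" '+%Y-%m-%d %H:%M:%S' 2>/dev/null ||
    date -u -r "${secs}" '+%Y-%m-%d %H:%M:%S'
}

# Hypothetical mapper body: read timestamp, row, column, value and
# emit the readable timestamp followed by the row and column.
stamp_cells() {
  local ts row column value
  while IFS=$'\t' read -r ts row column value; do
    printf '%s\t%s\t%s\n' "$(ns_to_utc "$ts")" "$row" "$column"
  done
}

# One fake cell whose timestamp corresponds to 2010-08-03 12:30:00 UTC.
printf '1280838600000000000\tApple\ttitle\tApple\n' | stamp_cells
```

The integer division drops the sub-second part, which is fine for display but would lose precision if the value were written back as a timestamp.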
