OP: Sky-Tiger

MongoDB and Scale Out? No, says MongoHQ

11#  Posted by OP on 2014-3-31 23:17
dfs.name.dir
One of the most critical parameters, dfs.name.dir specifies a comma-separated list
of local directories (with no spaces) in which the namenode should store a copy of
the HDFS filesystem metadata. Given the criticality of the metadata, administrators
are strongly encouraged to specify two internal disks and a low-latency, highly
reliable NFS mount. A complete copy of the metadata is stored in each directory;
in other words, the namenode mirrors the data between directories. For this reason,
the underlying disks need not be part of a RAID group, although some administrators
choose to use RAID and forgo specifying multiple directories in dfs.name.dir
(although an NFS mount should still be used, no matter what). The namenode
metadata is not excessively large (usually far below 1 TB), but running out of disk
space on these directories is still something you want to avoid.
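For concreteness, a minimal hdfs-site.xml stanza along these lines might look like the following sketch; the directory paths and the NFS mount point are placeholders, not values taken from the text:

<property>
  <name>dfs.name.dir</name>
  <!-- Two internal disks plus an NFS mount; comma-separated, with no spaces. -->
  <!-- All paths below are illustrative placeholders. -->
  <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/namenode-nfs/dfs/nn</value>
</property>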

12#  Posted by OP on 2014-3-31 23:19
dfs.data.dir
While dfs.name.dir specifies the location of the namenode metadata, dfs.data.dir
indicates where datanodes should store HDFS block data. It is also a comma-separated
list, but rather than mirroring data to each directory specified, the datanode
round-robins blocks between disks in an attempt to allocate blocks evenly across all
drives. The datanode assumes each directory specifies a separate physical device in a
JBOD group. As described earlier, by JBOD we mean that each disk is individually
addressable by the OS, and formatted and mounted as a separate mount point. Loss of
a physical disk is not critical, since replicas will exist on other machines in the cluster.
Example value: /data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn. Used by: DN.
fs.checkpoint.dir
The fs.checkpoint.dir parameter specifies the comma-separated list of directories
used by the secondary namenode to store filesystem metadata during a checkpoint
operation. If multiple directories are provided, the secondary namenode mirrors the
data in each directory the same way the namenode does. It is rare, however, that
multiple directories are given, because the checkpoint data is transient and, if lost, is
simply copied again during the next checkpoint operation. Some administrators treat
the contents of this directory as a worst-case location from which they can recover the
namenode's metadata. It is, after all, a valid copy of the data required to restore a
completely failed namenode.
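A sketch of the corresponding property stanzas follows; dfs.data.dir reuses the example value from the text, the fs.checkpoint.dir path is a placeholder, and which *-site.xml file each property belongs in depends on the Hadoop version:

<property>
  <name>dfs.data.dir</name>
  <!-- One directory per physical disk (JBOD); blocks are round-robined across them. -->
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <!-- Placeholder path; a single directory is usually sufficient for transient checkpoint data. -->
  <value>/data/1/dfs/snn</value>
</property>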

13#  Posted by OP on 2014-4-2 21:10
Input setup — This is configured through an InputFormat, which is responsible for calculating the
job's input splits and creating the data reader. In this example, TextInputFormat is used.
This InputFormat leverages its base class (FileInputFormat) to calculate splits (by default,
these correspond to HDFS blocks) and creates a LineRecordReader as its reader. Several additional
InputFormats supporting HDFS, HBase, and even databases are provided with Hadoop,
covering the majority of scenarios used by MapReduce jobs. Because an InputFormat based
on an HDFS file is used in this case, it is necessary to specify the location of the input
data. You do this by adding an input path to the TextInputFormat class. It is possible to
add multiple paths to an HDFS-based input format, where every path can specify either a
specific file or a directory. In the latter case, all files in the directory are included as input
to the job.
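A minimal sketch of this input setup, using the org.apache.hadoop.mapreduce API (the class name, job name, and the assumption that the input path arrives as a command-line argument are all illustrative):

// InputSetupSketch.java -- illustrative only
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "input-setup-demo");    // hypothetical job name
        // TextInputFormat: splits are calculated by FileInputFormat (HDFS blocks by default),
        // and records are read with a LineRecordReader.
        job.setInputFormatClass(TextInputFormat.class);
        // Multiple input paths may be added; each can be a specific file or a directory.
        TextInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper, reducer, and output setup would follow here ...
    }
}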

14#  Posted by OP on 2014-4-2 21:18
Reducer setup — This sets up the reducer class used by the job. In addition, you can
set the number of reducers used by the job. (There is a certain asymmetry in
Hadoop setup: the number of mappers depends on the size of the input data and the splits,
whereas the number of reducers is explicitly settable.) If this value is not set, a job uses
a single reducer. For MapReduce applications that specifically do not want to use reducers,
the number of reducers must be set to 0.
Output setup — This sets up the output format, which is responsible for outputting the results of
the execution. The main function of this class is to create a RecordWriter. In this case,
TextOutputFormat (which creates a LineRecordWriter for outputting data) is used.
Several additional OutputFormats supporting HDFS, HBase, and even databases are
provided with Hadoop, covering the majority of scenarios used by MapReduce jobs. In
addition to the output format, it is necessary to specify the data types used for the output key/
value pairs (Text and IntWritable, in this case), and the output directory (used by the
record writer). Hadoop also defines a special output format — NullOutputFormat — which
should be used when a job does not produce output of its own (for example, when it writes its
output to HBase directly from either map or reduce). In this case, you should also use the
NullWritable class for the output key/value pair types.
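A sketch of the reducer and output setup described above (the reducer itself, the job name, and the output path argument are hypothetical):

// OutputSetupSketch.java -- illustrative only
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputSetupSketch {
    // Hypothetical reducer that sums the integer counts emitted for each key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output-setup-demo");
        job.setReducerClass(SumReducer.class);
        job.setNumReduceTasks(2);                           // explicit; 0 would make a map-only job
        job.setOutputFormatClass(TextOutputFormat.class);   // writes via a LineRecordWriter
        job.setOutputKeyClass(Text.class);                  // output key type
        job.setOutputValueClass(IntWritable.class);         // output value type
        TextOutputFormat.setOutputPath(job, new Path(args[0]));  // output directory
        // ... input setup and job submission omitted ...
    }
}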

15#  Posted by OP on 2014-4-2 21:27
To reformulate the initial problem in terms of MapReduce, it is typically necessary to answer the
following questions:
➤ How do you break up a large problem into smaller tasks? More specifically, how do you
decompose the problem so that the smaller tasks can be executed in parallel?
➤ Which key/value pairs can you use as the inputs/outputs of every task?
➤ How do you bring together all the data required for the calculation? More specifically, how
do you organize the processing so that all the data necessary for the calculation is in memory
at the same time?
It is important to realize that many algorithms cannot be easily expressed as a single MapReduce
job. It is often necessary to decompose complex algorithms into a sequence of jobs, where the data
output of one job becomes the input to the next.
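As a minimal illustration of such chaining, a driver sketch (class name, job names, and path arguments are hypothetical) might run two jobs back to back, feeding the output directory of the first job to the second job as input:

// ChainedJobsSketch.java -- illustrative only
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
        Path output = new Path(args[2]);

        Job first = Job.getInstance(conf, "step-1");
        // ... mapper/reducer/format setup for the first job goes here ...
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) System.exit(1);

        Job second = Job.getInstance(conf, "step-2");
        // ... mapper/reducer/format setup for the second job goes here ...
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}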
This section takes a look at several examples of designing MapReduce applications for different
practical problems (from very simple to more complex). All of the examples are described in the
same format:
➤ A short description of the problem
➤ A description of the MapReduce job(s), including the following:
    ➤ Mapper description
    ➤ Reducer description

16#  Posted by OP on 2014-4-2 21:44
A generic join problem can be described as follows. Given multiple data sets (S1 through Sn)
sharing the same key (a join key), you want to build records containing the key and all of the
required data from every data set's records with that key.
Two "standard" implementations exist for joining data in MapReduce: the reduce-side join and the
map-side join.
The most common implementation of a join is a reduce-side join. In this case, all data sets are
processed by a mapper that emits the join key as the intermediate key, and a value that is an
intermediate record capable of holding values from any of the data sets. Because MapReduce
guarantees that all values with the same key are brought together, all intermediate records will be
grouped by the join key, which is exactly what is necessary to perform the join operation. This
works very well in the case of one-to-one joins, where at most one record from every data set has
the same key.
Although theoretically this approach will also work in the case of one-to-many and many-to-many
joins, these cases can have additional complications. When processing each key in the reducer,
there can be an arbitrary number of records with the same join key. The obvious solution is to
buffer all values in memory, but this can create a scalability bottleneck, because there might not be
enough memory to hold all the records with the same join key. This situation typically requires a
secondary sort, which can be achieved with the value-to-key conversion design pattern.
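A deliberately simplified sketch of a reduce-side join of two data sets follows (the file-name-based tagging, field layout, and class names are assumptions, and it buffers values in memory, which is exactly the scalability caveat described above):

// ReduceSideJoinSketch.java -- illustrative only
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoinSketch {

    // Mapper: emit the join key as the intermediate key and tag each record with its
    // source ("A" or "B") so the reducer can tell the data sets apart.
    // Assumes comma-separated input lines whose first field is the join key.
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String tag = fileName.startsWith("orders") ? "A" : "B";   // hypothetical file naming
            String[] fields = line.toString().split(",", 2);
            ctx.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
        }
    }

    // Reducer: all records sharing a join key arrive together; separate them by tag and
    // cross the two sides. Buffering both sides in memory is the bottleneck the text
    // mentions; a secondary sort (value-to-key conversion) avoids it.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> setA = new ArrayList<>();
            List<String> setB = new ArrayList<>();
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                (parts[0].equals("A") ? setA : setB).add(parts[1]);
            }
            for (String a : setA)
                for (String b : setB)
                    ctx.write(key, new Text(a + "\t" + b));
        }
    }
}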

18#  Posted by OP on 2014-4-2 22:10
In the case of iterative MapReduce applications, one or more MapReduce jobs are typically
invoked in a loop. This means that such applications can be implemented either with a driver that
internally implements the iteration logic and invokes the required MapReduce job(s) inside that
loop, or with an external script that runs MapReduce jobs in a loop and checks convergence
criteria. (Another option is to use a workflow engine. Chapters 6 through 8 examine Hadoop's
workflow engine, Apache Oozie.) Using a driver for the execution of iterative logic often provides
a more flexible solution, enabling you to leverage both internal variables and the full power of Java
for implementing both the iterations and the convergence checks.
A typical example of an iterative algorithm is solving a system of linear equations. Next, you look
at how you can use MapReduce to design such an algorithm.
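A minimal driver sketch of this pattern follows; the per-iteration job setup is omitted, and the convergence check via a counter (group and name made up here), the tolerance, and the path arguments are all assumptions:

// IterativeDriverSketch.java -- illustrative only
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path current = new Path(args[0]);        // initial data / current estimate
        double tolerance = 1e-6;                 // hypothetical convergence threshold
        int maxIterations = 50;

        for (int i = 0; i < maxIterations; i++) {
            Path next = new Path(args[1] + "/iter-" + i);   // per-iteration output directory

            Job job = Job.getInstance(conf, "iteration-" + i);
            // ... mapper/reducer/format setup for one iteration goes here ...
            FileInputFormat.addInputPath(job, current);
            FileOutputFormat.setOutputPath(job, next);
            if (!job.waitForCompletion(true)) System.exit(1);

            // Convergence check: read back a counter the reducers update with a scaled
            // residual; the counter group/name and scaling are hypothetical.
            long scaledResidual = job.getCounters()
                    .findCounter("solver", "residual-x1e9").getValue();
            if (scaledResidual / 1e9 < tolerance) break;

            current = next;   // this iteration's output feeds the next one
        }
    }
}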

19#  Posted by OP on 2014-4-2 22:11
As discussed, MapReduce is a technique used to solve a relatively simple problem in situations
where there is a lot of data and it must be processed in parallel (preferably on multiple machines).
The whole idea of the concept is that it makes it possible to do calculations on massive data sets in
a realistic time frame.
Alternatively, MapReduce can be used for parallelizing compute-intensive calculations, where it is
not necessarily about the amount of data, but rather about the overall calculation time (typically
the case for "embarrassingly parallel" computations).
The following must be true in order for MapReduce to be applicable:
➤ The calculations that need to be run must be composable. This means that you should be
able to run the calculation on a subset of the data and merge the partial results.
➤ The data set size is big enough (or the calculations are long enough) that the infrastructure
overhead of splitting it up into independent computations and merging the results will not
hurt overall performance.
➤ The calculation depends mostly on the data set being processed. Additional small data sets
can be added using HBase, the distributed cache, or some other technique.
MapReduce is not applicable, however, in scenarios where the data set must be accessed randomly
to perform the operation (for example, if a given data set record must be combined with additional
records to perform the operation). In such cases, though, it is sometimes possible to run additional
MapReduce jobs to "prepare" the data for calculation.

20#  Posted by OP on 2014-5-18 19:49
rdf:type is a property that provides an elementary typing system in RDF. For example, we can express the relationship between several playwrights using type information, as shown in Table 3.9. The subject of rdf:type in these triples can be any identifier, and the object is understood to be a type. There is no restriction on the usage of rdf:type with types; types can have types ad infinitum, as shown in Table 3.10.
When we read a triple out loud (or just to ourselves), it is understandably tempting to read it (in English, anyway) in subject/predicate/object order so that the first triple in Table 3.9 would read, “Shakespeare type Playwright.” Unfortunately, this is pretty fractured syntax no matter how you inflect it. It would be better to have something like “Shakespeare has type Playwright” or maybe “The type of Shakespeare is Playwright.”
This issue really has to do with the choice of name for the rdf:type resource; if it had been called rdf:isInstanceOf instead, it would have been much easier to read out loud in English. But since we never have control over how other entities (in this case, the W3C) chose their names, we don’t have the luxury of changing these names. When we read out loud, we just have to take some liberties in adding in connecting words. So this triple can be pronounced, “Shakespeare [has] type Playwright,” adding in the “has” (or sometimes, the word “is” works better) to make the sentence into somewhat correct English.
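As a small illustration of these triples (using Apache Jena and a made-up example.org namespace, neither of which comes from the text), the "Shakespeare [has] type Playwright" statement, plus a type for the type itself, could be written like this:

// RdfTypeSketch.java -- illustrative only; the library choice and namespace are assumptions
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class RdfTypeSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/lit#";                    // hypothetical namespace
        model.setNsPrefix("lit", ns);

        Resource playwright = model.createResource(ns + "Playwright");
        Resource shakespeare = model.createResource(ns + "Shakespeare");

        // "Shakespeare [has] type Playwright"
        shakespeare.addProperty(RDF.type, playwright);
        // Types can themselves have types, ad infinitum.
        playwright.addProperty(RDF.type, model.createResource(ns + "Profession"));

        model.write(System.out, "TURTLE");   // print the triples in Turtle syntax
    }
}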
