Bucketized tables do not support INSERT INTO

Ran into an error today:

FAILED: SemanticException [Error 10122]: Bucketized tables do not support INSERT INTO: Table: ecif_customer_relation_mapping

Based on the exception message, my first reaction was that you simply cannot use an INSERT INTO statement against a bucketed table. But another script of mine does exactly that with no problem...

It turned out the cause was that my SQL took the form INSERT INTO TABLE ** SELECT * FROM ….

A bucketed table cannot be loaded directly from a subquery. The subquery first has to be materialized into a table, and only then can you insert from that table into the bucketed table. A sketch of the workaround follows.
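For illustration, here is a minimal HiveQL sketch of the workaround; the intermediate table and the column names are hypothetical, not the ones from my actual job:

    -- Step 1: materialize the subquery into an ordinary (non-bucketed) table.
    -- (tmp_mapping, source_a/source_b and the columns are made up for the example.)
    CREATE TABLE tmp_mapping AS
    SELECT a.cust_id, b.rel_type
    FROM source_a a
    JOIN source_b b ON a.cust_id = b.cust_id;

    -- Step 2: insert into the bucketed table from the materialized table
    -- instead of from the subquery itself.
    INSERT INTO TABLE ecif_customer_relation_mapping
    SELECT cust_id, rel_type FROM tmp_mapping;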

Checking network card status

On CentOS 7, you can check the network card's status, for example whether it is a 10-gigabit or 1-gigabit card, and measure the transfer speed between servers, as follows.

1. Check the NIC's link speed

First, use ifconfig to find the NIC's name.

Then run /sbin/ethtool <NIC name> to check the link speed.
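For example (eth0 below is an illustrative interface name, and the ethtool output is abbreviated to the relevant fields):

    # List interfaces to find the NIC's name.
    ifconfig

    # Query the link settings for that interface.
    /sbin/ethtool eth0

    # Abbreviated output:
    #   Settings for eth0:
    #           ...
    #           Speed: 10000Mb/s
    #           Duplex: Full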

The Speed field shows the NIC's link speed; here it indicates a 10-gigabit card. Duplex should be Full, meaning full-duplex mode is enabled.

2. Measure the transfer speed between servers

Pick two servers, one acting as the server and the other as the client.

Start the server side:
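A minimal example, assuming iperf (version 2) is installed on both machines:

    # Run on the server machine; iperf listens on TCP port 5001 by default.
    iperf -s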

Start the client, which kicks off the bandwidth test:
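For example (192.168.1.100 is a placeholder for the server's IP address):

    # Run on the client machine, pointing at the server's IP.
    iperf -c 192.168.1.100 -f m -d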

The command used is iperf -c <server IP> -f m (report results in Mbits/sec; lowercase m means megabits, use -f M for MBytes) -d (bidirectional test).
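An illustrative result might look like the following (the numbers are made up, and the exact format varies with the iperf version):

    ------------------------------------------------------------
    [ ID] Interval       Transfer     Bandwidth
    [  4]  0.0-10.0 sec  11203 MBytes  8962 Mbits/sec
    [  5]  0.0-10.0 sec  10980 MBytes  8784 Mbits/sec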

In the output, the line tagged [4] is the server-to-client throughput, and the line tagged [5] is the client-to-server throughput.

Snappy is not splittable, so use it with care

I used to believe that for big data you could never go wrong choosing Snappy as the compression codec. Its compression ratio is not great, but it is light on CPU and fast to decompress, which makes it especially well suited as the codec for hot data tables.

But today, while reading Hadoop in Practice (《Hadoop硬实战》), I discovered I had a misconception: Snappy is in fact not splittable. What does that mean? Put simply, never use it to compress plain text files, or performance will degrade sharply.

———————————————————————————————

Now for the longer explanation…

1. First, let's establish one thing: Snappy itself is indeed not splittable.

Reference: https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable

But then why do some people say Snappy is splittable? My reading: to be precise, Snappy itself is not splittable; those who call it splittable mean applying Snappy at the block level inside a container format, in which case the file as a whole can be split.

A. Reference: http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/

This use alone justifies installing Snappy, but there are other places Snappy can be used within Hadoop applications. For example, Snappy can be used for block compression in all the commonly-used Hadoop file formats, including Sequence Files, Avro Data Files, and HBase tables.

One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce. This is different to LZO, where it is possible to index LZO compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.

B. Reference: https://boristyukin.com/is-snappy-compressed-parquet-file-splittable/

For MapReduce, if you need your compressed data to be splittable, BZip2 and LZO formats can be split. Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. Snappy is intended to be used with a container format, like SequenceFiles or Avro data files, rather than being used directly on plain text, for example, since the latter is not splittable and cannot be processed in parallel using MapReduce. Splittability is not relevant to HBase data.

C. Reference: https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable

This means that if a whole text file is compressed with Snappy then the file is NOT splittable. But if each record inside the file is compressed with Snappy then the file could be splittable, for example in Sequence files with block compression.

So Snappy is not splittable, and you should not apply it to large files directly. If a file is larger than the HDFS block size, i.e. the file spans multiple HDFS blocks, a Snappy-compressed copy cannot be split into per-block map tasks: a single map task has to read and decompress the whole file from the beginning, so all block-level parallelism is lost.

2. Parquet files are splittable, so Snappy can be used for block-level compression inside them. Parquet + Snappy is therefore both compressed and splittable (see the sketch after this list).

The consequence of storing the metadata in the footer is that reading a Parquet file requires an initial seek to the end of the file (minus 8 bytes) to read the footer metadata length, then a second seek backward by that length to read the footer metadata. Unlike sequence files and Avro datafiles, where the metadata is stored in the header and sync markers are used to separate blocks, Parquet files don’t need sync markers since the block boundaries are stored in the footer metadata. (This is possible because the metadata is written after all the blocks have been written, so the writer can retain the block boundary positions in memory until the file is closed.) Therefore, Parquet files are splittable, since the blocks can be located after reading the footer and can then be processed in parallel (by MapReduce, for example).

3. Snappy + HBase is also fine.

4. Snappy + plain text files: don't even consider it.
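To make point 2 concrete, here is a minimal HiveQL sketch (the table and column names are made up, and it assumes your Hive version honors the parquet.compression property): the data is stored as Parquet, and Parquet applies Snappy internally at the block level, so the resulting files remain splittable.

    -- Ask the Parquet writer to compress its blocks with Snappy.
    SET parquet.compression=SNAPPY;

    -- A Parquet-backed table; Snappy is applied inside the Parquet blocks,
    -- not to the file as a whole, so the files stay splittable.
    CREATE TABLE events_parquet (id BIGINT, payload STRING)
    STORED AS PARQUET;

    -- Rewrite the text data into Snappy-compressed Parquet.
    INSERT OVERWRITE TABLE events_parquet
    SELECT id, payload FROM events_text;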