Snappy is not splittable, so use it with care

I used to assume that for big data workloads, picking Snappy as the compression codec could never go wrong: the compression ratio is modest, but it is light on CPU and decompresses quickly, which makes it a natural fit for hot tables.

But today, while reading Hadoop in Practice (《Hadoop硬实战》), I realized I had been operating under a misconception: Snappy is not splittable. What does that mean in practice? In short, never use it to compress plain text files directly, or performance will drop sharply.

———————————————————————————————

Let me explain step by step.

1. First, confirm that Snappy itself really is not splittable.

Reference: https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable

So why do some people say Snappy is splittable? The way I read it, Snappy itself has to be described as not splittable; those who call it splittable mean that Snappy is applied at the block level inside a container format, in which case the file as a whole can still be split.

A. Reference: http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/

This use alone justifies installing Snappy, but there are other places Snappy can be used within Hadoop applications. For example, Snappy can be used for block compression in all the commonly-used Hadoop file formats, including Sequence Files, Avro Data Files, and HBase tables.

One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce. This is different to LZO, where it is possible to index LZO compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.

B. Reference: https://boristyukin.com/is-snappy-compressed-parquet-file-splittable/

For MapReduce, if you need your compressed data to be splittable, BZip2 and LZO formats can be split. Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. Snappy is intended to be used with a container format, like SequenceFiles or Avro data files, rather than being used directly on plain text, for example, since the latter is not splittable and cannot be processed in parallel using MapReduce. Splittability is not relevant to HBase data.

C. Reference: https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable

This means that if a whole text file is compressed with Snappy then the file is NOT splittable. But if each record inside the file is compressed with Snappy then the file could be splittable, for example in Sequence files with block compression.

So Snappy is not splittable, and you should not apply it directly to large files. If a Snappy-compressed file exceeds the HDFS block size, i.e. the file spans several HDFS blocks, it cannot be broken into input splits: the whole file becomes a single split handled by one map task, which must decompress it end to end before processing, so the job loses all of its block-level parallelism.
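The usual workaround is to keep Snappy but wrap it in a splittable container. A minimal sketch (the jar name, driver class, and paths are placeholders, and the driver is assumed to parse -D options via ToolRunner) that has a MapReduce job write block-compressed SequenceFile output with Snappy:

# Placeholder jar/class/paths; the output format is assumed to be SequenceFileOutputFormat,
# so compress.type=BLOCK applies Snappy per block instead of to the whole file.
hadoop jar my-job.jar com.example.MyDriver \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
  /input/path /output/path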

2. Parquet files are splittable, and Snappy is applied inside Parquet at the block level, so Parquet + Snappy gives you both compression and splittability; a Hive sketch follows the quote below.

The consequence of storing the metadata in the footer is that reading a Parquet file requires an initial seek to the end of the file (minus 8 bytes) to read the footer metadata length, then a second seek backward by that length to read the footer metadata. Unlike sequence files and Avro datafiles, where the metadata is stored in the header and sync markers are used to separate blocks, Parquet files don’t need sync markers since the block boundaries are stored in the footer metadata. (This is possible because the metadata is written after all the blocks have been written, so the writer can retain the block boundary positions in memory until the file is closed.) Therefore, Parquet files are splittable, since the blocks can be located after reading the footer and can then be processed in parallel (by MapReduce, for example).
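Here is the Hive sketch mentioned above (the table names demo_text and demo_parquet_snappy are made up): rewrite a plain-text table as Parquet with Snappy applied inside the Parquet blocks, so the result is both compressed and splittable.

# Table names are placeholders; parquet.compression controls the codec used
# inside the Parquet row groups/pages.
hive -e "
CREATE TABLE demo_parquet_snappy STORED AS PARQUET TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM demo_text;
"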

3. Snappy + HBase is also fine: HBase applies the codec to its blocks, and splittability is not a concern for HBase data.
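In HBase, Snappy is enabled per column family and compresses HFile blocks. A sketch (table and column family names are made up):

# 'demo_table' and 'cf' are placeholders; COMPRESSION => 'SNAPPY' compresses HFile blocks.
echo "create 'demo_table', {NAME => 'cf', COMPRESSION => 'SNAPPY'}" | hbase shell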

4. Snappy + plain text files: just don't.

Performance comparison: HBase vs. Solr

For a long time I could not see the point: if Solr already supports multi-field search, why use HBase at all? An HBase rowkey only supports lookups on a single key, far less flexible than Solr's multi-field queries, and Solr's performance is not bad either. So we ran a load test; the conclusion is summarized below.

Test conclusion: if a query can be expressed as a rowkey lookup, HBase is the better choice; it is dramatically faster than Solr. At 1,000 concurrent clients Solr could no longer keep up, while HBase still performed well.


Installing a standalone Solr 7.3.1 on CDH 5.13

The Solr bundled with CDH 5.13 is quite old, only 4.10.3, which cannot support our application requirements, so the only option is to install Solr 7.3.1 independently. The one drawback is that it can no longer be managed centrally from CDH.

1. Downloading the Solr tarball is omitted here. The steps below assume it has been downloaded to the current user's home directory, i.e. ~

2. Unpack Solr 7

tar -zxvf solr-7.3.1.tgz

3. Create the solr7 chroot in ZooKeeper

cd ~/solr-7.3.1/bin
./solr zk mkroot /solr7 -z 10.127.60.2,10.127.60.3,10.127.60.4:2181
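To confirm the chroot was created, bin/solr can list the ZooKeeper tree (same ZK connection string as above):

# Should show a /solr7 node at the ZooKeeper root.
./solr zk ls / -z 10.127.60.2,10.127.60.3,10.127.60.4:2181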
4. Check whether the port is already in use, so that you do not pick a port CDH has already taken:
netstat -nl | grep 9983
or, to also get the pid of the process holding the port:
sudo netstat -nltp | grep 9983
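Since step 5 will use both 9983 and 9984, it is convenient to check the two ports in one pass (a sketch):

# Neither port should show up as LISTEN before the install.
sudo netstat -nltp | grep -E '9983|9984'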
5. Run the install script. The commands below create two Solr instances on a single physical machine; it is recommended to put each instance's data directory on a separate disk.

sudo ./install_solr_service.sh ../../solr-7.3.1.tgz -i /srv/BigData/hadoop/solr1 -d /srv/BigData/hadoop/solr1/solr_data -u solrup -s solr1 -p 9983 -n

sudo ./install_solr_service.sh ../../solr-7.3.1.tgz -i /srv/BigData/hadoop/solr2 -d /srv/BigData/hadoop/solr2/solr_data -u solrup -s solr2 -p 9984 -n
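Assuming the stock install script layout, each instance is registered as an init service named after its -s value, with a matching include file under /etc/default. A quick check that both were created:

# The include files are what step 6 edits next.
ls -l /etc/init.d/solr1 /etc/init.d/solr2
ls -l /etc/default/solr1.in.sh /etc/default/solr2.in.sh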

6. Edit /etc/default/solr1.in.sh (and likewise /etc/default/solr2.in.sh for the second instance) and adjust the following settings.

----HDFS version (index stored on HDFS)
SOLR_JAVA_MEM="-Xms16g -Xmx16g -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.hdfs.home=hdfs://10.127.60.1:8020/solr7 -XX:MaxDirectMemorySize=20g -Dsolr.autoSoftCommit.maxTime=-1 -Dsolr.autoCommit.maxTime=-1 -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=100"
ZK_HOST="10.127.60.2,10.127.60.3,10.127.60.4/solr7"

----Local version (16 GB heap)
SOLR_JAVA_MEM="-Xms16g -Xmx16g -Dsolr.autoSoftCommit.maxTime=-1 -Dsolr.autoCommit.maxTime=-1 -XX:+UseLargePages"
ZK_HOST="10.127.60.2,10.127.60.3,10.127.60.4/solr7"

----Local version (32 GB heap)
SOLR_JAVA_MEM="-Xms32g -Xmx32g -Dsolr.autoSoftCommit.maxTime=-1 -Dsolr.autoCommit.maxTime=-1 -XX:+UseLargePages"
ZK_HOST="10.127.60.2,10.127.60.3,10.127.60.4/solr7"
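Because the services were installed with -n (do not start after install), start them now that the include files are configured, and check that they come up:

sudo service solr1 start
sudo service solr2 start
# Quick health check of the first instance.
sudo service solr1 status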
7. Create a collection

---Upload the configset to ZooKeeper
bin/solr zk upconfig -z 10.127.60.2,10.127.60.3,10.127.60.4:2181 -n mynewconfig -d /path/to/configset

Reference: https://lucene.apache.org/solr/guide/7_3/solr-control-script-reference.html#solr-control-script-reference

./solr create -c TEST_bigdata -d /zp_test/solor7/conf -n TEST_bigdata -s 6
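A quick sanity check through the Collections API (the hostname is a placeholder; 9983 is the port chosen for solr1 in step 5):

# Should list TEST_bigdata among the collections.
curl "http://localhost:9983/solr/admin/collections?action=LIST&wt=json"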

8. If the data files are stored on HDFS, the solrup user needs the corresponding permissions: create the /solr7 directory on HDFS and give solrup ownership of it.
hdfs dfs -mkdir /solr7
hdfs dfs -chown solrup /solr7