hive on spark rpc channel is closed

On a freshly installed CDH cluster, running

select count(*) from bigtable

failed as soon as the Spark job started, while small-table queries, i.e. SQL that does not need a Spark job, came back quickly. That pointed to Spark as the problem.

In the job manager I could see that only one executor had started; strangely, the job had already failed, yet some tasks were still stuck in the RUNNING state.

The error message in Hive was so terse that I simply could not find the cause.
In the end I started spark-shell and ran the same SQL from Scala, and the error in the log was glaringly obvious.
The log even stated clearly that the problem was on node17:


Can't create directory
/bigdata/diskb/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied
Can't create directory
/bigdata/diskc/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied
Can't create directory
/bigdata/diskd/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied
Can't create directory
/bigdata/diske/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied
Can't create directory
/bigdata/diskf/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied
Can't create directory
/bigdata/diskg/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied
Can't create directory
/bigdata/diskh/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied
Can't create directory
/bigdata/diski/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied
Can't create directory
/bigdata/diskj/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied
Can't create directory
/bigdata/diskk/yarn/nm/usercache/hive/appcache/application_1543298149182_0044 - Permission denied

So from now on, whenever a Hive SQL failure gives no useful detail, troubleshoot it with spark-shell; its logs are far clearer.
The problem was now clear:
the YARN NodeManager on node17 was broken: its local data directories had the wrong permissions.
Solutions:
1. The simple, brute-force option: just decommission the NodeManager on node17 and accept losing one node.

2. Align the directory permissions with the other nodes: I did chown the directory to hive:yarn, the same as on the other nodes, yet it still failed.
Later I realized my mistake: I had only fixed one disk, while this node has ten YARN data disks, so the change has to be applied to all ten (see the shell sketch after this list).

3. Delete everything under usercache. Reference:

rm -rf /dn/yarn/nm/usercache/*
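
A minimal sketch combining options 2 and 3 for all ten disks on node17, assuming the /bigdata/disk{b..k}/yarn/nm layout shown in the error log above; run it as root on node17 and adjust the owner/group to whatever your healthy nodes actually use:

for disk in /bigdata/disk{b..k}; do
  nmdir="$disk/yarn/nm"
  chown -R yarn:yarn "$nmdir"      # align ownership with the healthy nodes (owner assumed here)
  chmod 755 "$nmdir"               # NodeManager local dirs must be traversable
  rm -rf "$nmdir/usercache/"*      # clear stale per-user caches; YARN recreates them
done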

cdh kerberos issues

1. Export a keytab from Kerberos:

Log in to the host running the Kerberos server (the KDC) and start the admin shell:

kadmin.local

Then run:

xst -k /path/XXX.keytab -norandkey XXX

Example: xst -k /home/test/test.keytab -norandkey test

Note: do not run ktadd -k /test/test.keytab test@PICC.COM (i.e. without -norandkey).

Otherwise the principal's key is re-randomized and its existing password stops working.

2. Passwordless kinit, authenticating with the keytab file:

kinit test -k -t /home/test/test.keytab

3. List the principal and tickets currently held (with encryption types):

klist -e
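
Optionally, before running kinit, you can inspect the exported keytab to confirm the principal and key version it contains (path matches the example above):

klist -kt /home/test/test.keytab   # list principals and kvno entries inside the keytab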

oracle rac resource request

The resources required for Oracle RAC are:

1. At least two physical machines.

2. A shared NAS volume mounted on all machines, because RAC depends on shared storage.

3. One virtual IP (VIP) and one physical (public) IP per machine.

4. One floating IP, also known as the SCAN IP.

 

Minimal configuration example:

Two physical machines.

Each machine gets one physical IP and one VIP, four IPs in total.

Plus one SCAN IP.

Machine 1  physical IP: 10.10.10.1  VIP: 10.10.10.101
Machine 2  physical IP: 10.10.10.2  VIP: 10.10.10.102
SCAN IP: 10.10.10.103

NAS storage: 2 TB
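
A minimal /etc/hosts sketch for the layout above; the hostnames rac1/rac2 and the -vip/-scan names are illustrative assumptions, and in production the SCAN name is normally resolved via DNS with up to three addresses:

10.10.10.1    rac1        # public (physical) IP, node 1
10.10.10.2    rac2        # public (physical) IP, node 2
10.10.10.101  rac1-vip    # virtual IP, node 1
10.10.10.102  rac2-vip    # virtual IP, node 2
10.10.10.103  rac-scan    # SCAN IP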

chrome: disabling cross-origin security checks

Normally Chrome blocks cross-origin JavaScript requests, but Chrome can be started with web security disabled.

On macOS the command is as follows; Chrome has to be launched through this command, so quit Chrome before running it.

open -a Google\ Chrome --args --disable-web-security --user-data-dir=""

Snappy is not splittable, use it with care

I used to assume that for big data you could always pick Snappy as the compression codec and nothing would go wrong: the compression ratio is not great, but CPU usage is low and decompression is fast, which makes it well suited for hot tables.

Today, while reading Hadoop in Practice (《Hadoop硬实战》), I found out I had a misconception: Snappy is not splittable. What does that mean? In short, never use it to compress plain text files directly, or performance drops sharply.

———————————————————————————————

Let me explain step by step.

1. First, confirm that Snappy itself really is not splittable.

Reference: https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable

Why, then, do some people say Snappy is splittable? My reading: Snappy itself is not splittable; those who call it splittable mean that when Snappy is applied at the block level inside a container format, the file as a whole remains splittable.

A. Reference: http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/

This use alone justifies installing Snappy, but there are other places Snappy can be used within Hadoop applications. For example, Snappy can be used for block compression in all the commonly-used Hadoop file formats, including Sequence Files, Avro Data Files, and HBase tables.

One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce. This is different to LZO, where is is possible to index LZO compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.

B. Reference: https://boristyukin.com/is-snappy-compressed-parquet-file-splittable/

For MapReduce, if you need your compressed data to be splittable, BZip2 and LZO formats can be split. Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. Snappy is intended to be used with a container format, like SequenceFiles or Avro data files, rather than being used directly on plain text, for example, since the latter is not splittable and cannot be processed in parallel using MapReduce. Splittability is not relevant to HBase data.

C. Reference: https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable

This means that if a whole text file is compressed with Snappy then the file is NOT splittable. But if each record inside the file is compressed with Snappy then the file could be splittable, for example in Sequence files with block compression.

So Snappy is not splittable and should not be applied directly to large files. If a Snappy-compressed file is larger than the HDFS block size, i.e. it spans multiple HDFS blocks, the input cannot be split: a single map task has to read and decompress the entire file, so there is no parallelism.

2. Parquet files are splittable, so Snappy can be used for block-level compression inside them; Parquet + Snappy is therefore both compressed and splittable (see the HiveQL example after this list).

The consequence of storing the metadata in the footer is that reading a Parquet file requires an initial seek to the end of the file (minus 8 bytes) to read the footer metadata length, then a second seek backward by that length to read the footer metadata. Unlike sequence files and Avro datafiles, where the metadata is stored in the header and sync markers are used to separate blocks, Parquet files don’t need sync markers since the block boundaries are stored in the footer metadata. (This is possible because the metadata is written after all the blocks have been written, so the writer can retain the block boundary positions in memory until the file is closed.) Therefore, Parquet files are splittable, since the blocks can be located after reading the footer and can then be processed in parallel (by MapReduce, for example).

3. Snappy + HBase is also fine.

4. Snappy + plain text files: don't even consider it.
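
A minimal HiveQL sketch of the Parquet + Snappy combination from point 2; the table and column names (events_parquet_snappy, events_text, etc.) are made up for illustration:

-- Parquet container with Snappy-compressed blocks: compressed AND splittable
CREATE TABLE events_parquet_snappy (
  event_id   BIGINT,
  event_time STRING,
  payload    STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- load from a hypothetical plain-text staging table
INSERT OVERWRITE TABLE events_parquet_snappy
SELECT event_id, event_time, payload FROM events_text;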

hive bucketing

1. Hive bucketing can speed up JOINs.
2. A table can be bucketed on its own, or partitioned first and then bucketed.

Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

3. Bucketed tables support sampling queries (see the sampling sketch after this list):

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

4. For other benefits see:

https://www.qubole.com/blog/5-tips-for-efficient-hive-queries/
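
A minimal sampling sketch against the user_info_bucketed table created below; the bucket numbers are arbitrary, the point is that sampling on the bucketing column only reads a subset of the bucket files:

-- read roughly 1/16 of the buckets instead of scanning the whole partition
SELECT *
FROM user_info_bucketed TABLESAMPLE(BUCKET 1 OUT OF 16 ON user_id) u
WHERE u.ds = '2009-02-25';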

 

First create the table. Here the table is partitioned first and then bucketed; you can also bucket without partitioning, depending on your needs:

CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS;

When the insert below runs, the Spark UI shows an extra stage whose task count equals the bucket count, 256; checking the table's directory in HDFS afterwards, the number of data files is also exactly 256.

Then load the table:

set hive.enforce.bucketing = true; -- (Note: Not needed in Hive 2.x onward)
FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT user_id, firstname, lastname WHERE ds='2009-02-25';

After bucketing, you can see in HDFS that the table's partition directory contains multiple bucket files.
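
A quick way to count them (the warehouse path below is an assumption; adjust it to your metastore location):

hdfs dfs -ls /user/hive/warehouse/user_info_bucketed/ds=2009-02-25 | grep -c '^-'   # expect 256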

——————————————————————–

---- Without bucket map join

select *
from ods.p_t_contract_master m
left outer join ods.p_t_customer c on m.applicant_id = c.customer_id
limit 100;

2018-08-28 17:34:06,160 Stage-4_0: 192(+174)/735 Stage-5_0: 0/1099
Status: Finished successfully in 660.92 seconds

select *
from ods.p_t_contract_master m
left outer join t_customer_bucket c on m.applicant_id = c.customer_id
limit 100;

2018-08-28 17:49:00,042 Stage-6_0: 472/472 Finished Stage-7_0: 0(+180)/1099
Time taken: 728.227 seconds, Fetched: 100 row(s)

select *
from t_contract_master_bucket m
left outer join ods.p_t_customer c on m.applicant_id = c.customer_id
limit 100;
2018-08-28 17:55:06,944 Stage-8_0: 0(+192)/808 Stage-9_0: 0/1099
Time taken: 845.256 seconds, Fetched: 100 row(s)

select *
from t_contract_master_bucket m
left outer join t_customer_bucket c on m.applicant_id = c.customer_id
limit 100;

2018-08-28 18:09:32,400 Stage-10_0: 0(+192)/545 Stage-11_0: 0/1099
Time taken: 707.371 seconds, Fetched: 100 row(s)

------- With bucket map join enabled
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

select *
from ods.p_t_contract_master m
left outer join t_customer_bucket c on m.applicant_id = c.customer_id
limit 100;

select *
from t_contract_master_bucket m
left outer join ods.p_t_customer c on m.applicant_id = c.customer_id
limit 100;

select *
from t_contract_master_bucket m
left outer join t_customer_bucket c on m.applicant_id = c.customer_id
limit 100;

2018-08-28 18:21:32,583 Stage-12_0: 0(+8)/472 Stage-13_0: 0/1099
Status: Finished successfully in 661.10 seconds