Solr Optimization, Part 2: Enabling MMapDirectory

By default Solr uses NRTCachingDirectoryFactory, but when near-real-time (NRT) search is not needed, MMapDirectory can be used instead, typically by setting the directoryFactory class in solrconfig.xml to solr.MMapDirectoryFactory.

Why use MMapDirectory?
In short, it is faster than an ordinary file-based directory: MMapDirectory memory-maps the index files into the process address space and lets the OS page cache serve reads, avoiding explicit read system calls and extra buffer copies.

Prerequisite:
1. The operating system should be 64-bit, so the virtual address space is effectively a non-issue; a 32-bit process can only use roughly 3 GB.

Performance comparison: the before/after screenshots from the original post are not reproduced here.

CDH Kerberized HDFS Java Demo

After Kerberos is enabled on CDH, operating on HDFS directly from Java requires some extra connection setup. The code below shows the idea; it assumes the Hadoop client classes (Configuration, FileSystem, Path, FileStatus, UserGroupInformation) are on the classpath and that the krb5.conf, principal, and keytab paths match your own environment:

// Point the JVM at the cluster's Kerberos configuration.
System.setProperty("java.security.krb5.conf", "/Users/zhouxiang/develop/test/krb5.conf");
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "kerberos");
conf.set("fs.defaultFS", "hdfs://test01:8020");
UserGroupInformation.setConfiguration(conf);
// Log in from a keytab; principal and keytab path are environment-specific.
String principal = System.getProperty("kerberosPrincipal", "hdfs@TEST.COM");
String keytabLocation = System.getProperty("kerberosKeytab", "/Users/zhouxiang/develop/test/hdfs.keytab");
UserGroupInformation.loginUserFromKeytab(principal, keytabLocation);
FileSystem fileSystem = FileSystem.get(conf);
System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));
// Copy a local file into /tmp and list the directory to verify access.
Path localPath = new Path("/Users/zhouxiang/Downloads/hadoop-common-2.6.0-cdh5.13.1.pom");
Path remotePath = new Path("/tmp/");
fileSystem.copyFromLocalFile(localPath, remotePath);
FileStatus[] status = fileSystem.listStatus(remotePath);
for (FileStatus stat : status) {
    System.out.println(stat.getPath());
}

CDH Kerberized Hive JDBC Demo

After Kerberos is enabled on CDH, connecting to Hive over JDBC requires some extra connection setup. The code below shows the idea; it assumes the Hive JDBC driver and Hadoop client classes are on the classpath and that the krb5.conf, principal, and keytab paths match your own environment:

// Point the JVM at the cluster's Kerberos configuration and load the Hive JDBC driver.
System.setProperty("java.security.krb5.conf", "/Users/zhouxiang/develop/test/krb5.conf");
Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance();
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(conf);
// Log in from a keytab before opening the JDBC connection.
String principal = System.getProperty("kerberosPrincipal", "hive@TEST.COM");
String keytabLocation = System.getProperty("kerberosKeytab", "/Users/zhouxiang/develop/test/hive.keytab");
UserGroupInformation.loginUserFromKeytab(principal, keytabLocation);
// Note: the principal in the URL is the HiveServer2 service principal, not the client principal.
String url = "jdbc:hive2://10.127.60.4:10000/default;auth=KERBEROS;principal=hive/bdp04@TEST.COM;";
Connection connection = DriverManager.getConnection(url, "", "");
Assert.assertNotNull(connection);
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery("select count(*) from ods.test");
while (resultSet.next()) {
    System.out.println(resultSet.getInt(1));
}

HDFS Overview

What exactly is HDFS?
Personally, I think of HDFS as just one extra layer on top of ordinary single-machine file systems, one that unifies the file systems of many physical machines into a single distributed file system. That extra layer is the NameNode: it stitches the scattered file blocks back together into large files and exposes a command interface consistent with an ordinary OS file system, namely the familiar hdfs dfs commands (ls, put, cat, and so on).
Users who already know file systems can therefore move to HDFS almost seamlessly, without worrying about the distributed details underneath. Which brings us back to that famous saying in computing: there is no problem that cannot be solved by adding another layer of indirection, and if one layer is not enough, add two.

HDFS Router-based Federation


What exactly is a block?
A block is just an ordinary file in the local operating system's file system, and the so-called block size defaults to 128 MB. Every large file is split into multiple pieces of at most 128 MB, which are then spread across different servers.

The figure referenced here (not reproduced) showed the contents of the dfs data directory on one node.
Look first at the path /subdir5/subdir235: the directory tree is split into two levels, each with many subdirectories, which keeps the number of files in any single directory from exceeding the operating system's limit.
Then look at the individual files: each one is exactly 128 MB.
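
To see this block layout from the client side rather than from a DataNode's local directories, here is a small sketch of my own (not from the original post; the class name ListBlocks and the path /tmp/bigfile.dat are made up) that lists a file's blocks and the DataNodes holding each replica:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: print each block of an HDFS file and the DataNodes holding its replicas.
public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args.length > 0 ? args[0] : "/tmp/bigfile.dat"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("blockSize=" + status.getBlockSize() + ", length=" + status.getLen());
        // One BlockLocation per block; getHosts() lists the DataNodes that hold a replica of it.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block.getOffset() + "+" + block.getLength()
                    + " -> " + String.join(",", block.getHosts()));
        }
    }
}

For a file larger than 128 MB this should print one line per block, each listing (by default) three hosts.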

A common misconception about the Secondary NameNode
The Secondary NameNode is not a backup of the NameNode, nor is it a hot standby; what it actually does is periodically checkpoint the namespace by merging the edit log into the fsimage.

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.
Every HDFS file is split into a sequence of blocks that are placed on different machines; the block size and replication factor can be configured per file.

All blocks in a file except the last block are the same size, while users can start a new block without filling out the last block to the configured block size after the support for variable length block was added to append and hsync.
All blocks of a file except the last one are the same size.

An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

Every HDFS file is write-once (except for appends and truncates), and at any given time it has exactly one writer.
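
The quotes above say that the block size and replication factor are set per file and that the replication factor can be changed later. Here is a small sketch of my own (not from the original post; the class name PerFileReplication and the path /tmp/replication-demo.dat are made up) showing both:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: block size and replication factor can be chosen per file.
public class PerFileReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/replication-demo.dat"); // hypothetical path
        // Create the file with replication factor 2 and a 64 MB block size.
        try (FSDataOutputStream out = fs.create(file, true, 4096, (short) 2, 64L * 1024 * 1024)) {
            out.writeBytes("hello hdfs");
        }
        // The replication factor can also be changed after the file has been written.
        fs.setReplication(file, (short) 3);
    }
}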

Safemode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
On startup the NameNode enters Safemode. While in Safemode no block replication takes place; the NameNode only collects Heartbeats and Blockreports from the DataNodes. After it exits Safemode, any data blocks whose replica count is still below the configured value are replicated to other DataNodes until the specified replication factor is reached. (The state can be inspected or toggled with hdfs dfsadmin -safemode.)

Replication Pipelining
When a client is writing data to an HDFS file with a replication factor of three, the NameNode retrieves a list of DataNodes using a replication target choosing algorithm. This list contains the DataNodes that will host a replica of that block. The client then writes to the first DataNode. The first DataNode starts receiving the data in portions, writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
In short: when a client writes a file with a replication factor of three, the NameNode returns a list of DataNodes. The client writes the data to the first DataNode, the first DataNode forwards each portion to the second, and the second forwards it to the third, so the data is pipelined from one DataNode to the next instead of being written three times by the client.

(Two figures photographed from the book Hadoop Operations are not reproduced here.)

Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.
Communication between two nodes on different racks has to go through switches; in most cases the network bandwidth between machines on the same rack is greater than between machines on different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
In the common case, with a replication factor of three: if the writer runs on a DataNode (i.e. inside the cluster), HDFS places the first replica on that same node, otherwise on a random DataNode; the second replica goes to a node on a different (remote) rack, and the third to a different node on that same remote rack. In short, one replica sits on one node, and the other two sit on different nodes of another rack. This cuts inter-rack write traffic, but it reduces the aggregate network bandwidth used when reading, since a block is placed on only two distinct racks rather than three, and the replicas of a file are not evenly distributed across racks.
If the replication factor is greater than 3, the placement of the 4th and following replicas are determined randomly while keeping the number of replicas per rack below the upper limit (which is basically (replicas – 1) / racks + 2).
Because the NameNode does not allow DataNodes to have multiple replicas of the same block, maximum number of replicas created is the total number of DataNodes at that time.
After the support for Storage Types and Storage Policies was added to HDFS, the NameNode takes the policy into account for replica placement in addition to the rack awareness described above. The NameNode chooses nodes based on rack awareness at first, then checks that the candidate node have storage required by the policy associated with the file. If the candidate node does not have the storage type, the NameNode looks for another node. If enough nodes to place replicas can not be found in the first path, the NameNode looks for nodes having fallback storage types in the second path.
The current, default replica placement policy described here is a work in progress.

References:
1. http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
2. Hadoop Operations

Hive Multiple Inserts: One Scan, Multiple Inserts

Hive multiple inserts can scan the source data just once and, based on different conditions, insert it into multiple tables or into different partitions of the same table, but not into the same table more than once!

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;

This syntax reduces the number of scans: written as separate INSERT statements, the source would be scanned once per insert, whereas a multiple-insert statement scans it only once. The official documentation explains:

Multiple insert clauses (also known as Multi Table Insert) can be specified in the same query.

Multi Table Inserts minimize the number of data scans required. Hive can insert data into multiple tables by scanning the input data just once (and applying different query operators) to the input data.
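
As a concrete illustration (my own sketch, not from the original post: the table names src_events, t_click and t_view, the dt partition, and the event_type column are all made up), a single scan of the source table feeds two different targets:

FROM src_events e
INSERT OVERWRITE TABLE t_click PARTITION (dt = '2018-01-01')
  SELECT e.user_id, e.url WHERE e.event_type = 'click'
INSERT OVERWRITE TABLE t_view PARTITION (dt = '2018-01-01')
  SELECT e.user_id, e.url WHERE e.event_type = 'view';

Each INSERT clause has its own SELECT list and WHERE condition, but the FROM clause, and therefore the scan, appears only once.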

Below is the execution plan for such a statement (the screenshot is not reproduced here).


Test result: on roughly 2 billion rows, a single scan inserting into six tables took only six minutes.