A problem I ran into at work recently: RoaringBitmaps are stored as values in a SequenceFile, and because the bitmap objects are very large they trigger a bug in hadoop-common. This post describes the cause and the workaround.
1. Problem:
Reading a SequenceFile that holds a single large bitmap (the SequenceFile value is the bitmap object) fails with java.lang.NegativeArraySizeException.
$ hf -du -h /data/datacenter/bitmap/label/init/part-r-00000/data
688.5 M 2.0 G /data/datacenter/bitmap/label/init/part-r-00000/data
Reading the SequenceFile with the large bitmap throws:
java.lang.NegativeArraySizeException
    at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:144)
    at org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:123)
    at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:179)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2245)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2218)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
    at org.apache.spark.rdd.NewHadoopRDD$anon$1.hasNext(NewHadoopRDD.scala:168)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$anonfun$collect$1$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$anonfun$collect$1$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
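For context, a minimal sketch of the kind of read that hits this code path. The input path, key type (Text) and class name are assumptions for illustration; the important part is that SequenceFile values come back as BytesWritable, and Hadoop resizes that buffer during deserialization via BytesWritable.readFields, which is where it blows up. (The stack trace above came through the newer mapreduce API, but either API deserializes through the same readFields.)

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.roaringbitmap.RoaringBitmap;

public class ReadBitmapSequenceFile {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("read-bitmap"));

    // Values are stored as BytesWritable; deserializing a ~700 MB value is what
    // triggers the NegativeArraySizeException shown in the stack trace above.
    JavaPairRDD<Text, BytesWritable> rdd = sc.sequenceFile(
        "/data/datacenter/bitmap/label/init/part-r-00000", // hypothetical input path
        Text.class, BytesWritable.class);

    rdd.foreach(kv -> {
      RoaringBitmap bitmap = new RoaringBitmap();
      // copyBytes() trims the padded backing array to the record's real length
      bitmap.deserialize(new DataInputStream(new ByteArrayInputStream(kv._2().copyBytes())));
      System.out.println(kv._1() + " -> cardinality=" + bitmap.getCardinality());
    });

    sc.stop();
  }
}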
2. Cause
This is caused by a bug in the BytesWritable class of the hadoop-common package.
There is an int overflow when BytesWritable computes the new capacity in setSize:
public void setSize(int size) {
  if (size > getCapacity()) {
    setCapacity(size * 3 / 2);
  }
  this.size = size;
}
700 MB * 3 is larger than 2 GB (Integer.MAX_VALUE), so the intermediate size * 3 overflows int and the computed capacity goes negative.
As a result you cannot deserialize (though you can still write and serialize) more than about 700 MB into a BytesWritable.
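To make the arithmetic concrete, a tiny sketch; the 700 MB figure matches the file size above:

public class SetSizeOverflowDemo {
  public static void main(String[] args) {
    int size = 700 * 1024 * 1024;            // 734,003,200 bytes, roughly the bitmap above

    int buggy = size * 3 / 2;                // size * 3 = 2,202,009,600 overflows int first
    long fixed = Math.min(Integer.MAX_VALUE, (3L * size) / 2L);

    System.out.println(buggy);               // -1046478848 -> new byte[negative] throws
    System.out.println(fixed);               // 1101004800, still a valid capacity
  }
}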
If you would like to keep using BytesWritable, one option is to set the capacity high enough beforehand, so that you can use the full 2 GB instead of only ~700 MB:
randomValue.setCapacity(numBytesToWrite);
randomValue.setSize(numBytesToWrite); // will not resize now
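As a sketch of that workaround when you read the file yourself with SequenceFile.Reader (the key type and capacity here are assumptions), pre-sizing the value means setSize() never enters the overflowing setCapacity(size * 3 / 2) branch:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PresizedReaderDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("/data/datacenter/bitmap/label/init/part-r-00000/data");

    Text key = new Text();                 // assumed key type
    BytesWritable value = new BytesWritable();
    // Pre-size the buffer: as long as each record fits, size <= capacity and
    // the buggy setCapacity(size * 3 / 2) path is never taken.
    value.setCapacity(1 << 30);            // 1 GB, adjust to the largest record

    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
      while (reader.next(key, value)) {
        System.out.println(key + " -> " + value.getLength() + " bytes");
      }
    }
  }
}

This only helps when you control the BytesWritable instance; in the Spark path from the stack trace, SequenceFileRecordReader creates the value object itself, which is why the classpath fix below is needed.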
This bug has been fixed in Hadoop recently, so newer versions should work even without that workaround:
public void setSize(int size) {
  if (size > getCapacity()) {
    // Avoid overflowing the int too early by casting to a long.
    long newSize = Math.min(Integer.MAX_VALUE, (3L * size) / 2L);
    setCapacity((int) newSize);
  }
  this.size = size;
}
3. Solution:
hadoop-common did not fix this until 2.8, while Spark 1.6 depends on hadoop-common 2.2.
The fix is to change how the job is submitted: make Spark load the user-supplied classpath first and its own bundled classes second. This is clearly documented on the official site; I just had not read it, and it is worth going through the official configuration page again. Following the same thread, it is also worth looking at how Spark sets its YARN parameters.
spark.driver.userClassPathFirst (default: false)
(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only.
Specify:
--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
--jars /usr/local/hadoop/share/hadoop/common/hadoop-common-2.6.0-cdh5.5.2.jar,/home/mcloud/platform3/data-api_2.10-1.0.8.jar,/home/mcloud/platform3/JavaEWAH-1.0.2.jar,/home/mcloud/platform3/bitmap-ext_2.10-1.0.3.jar,/home/mcloud/platform3/commons-pool2-2.0.jar,/home/mcloud/platform3/jedis-2.5.1.jar,/home/mcloud/platform3/commons-dbutils-1.6.jar \
/home/xx/datacenter/jobcenter-job_2.10-1.0.jar \
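Once the job is resubmitted with these options, a quick, hypothetical check makes it obvious whether the user-supplied hadoop-common actually won the classpath race; it can be run in the driver or inside a task:

import org.apache.hadoop.io.BytesWritable;

public class WhichBytesWritable {
  public static void main(String[] args) {
    // Prints the jar that BytesWritable was loaded from. With
    // spark.driver/executor.userClassPathFirst=true it should point at the
    // hadoop-common jar passed via --jars, not the one bundled with Spark.
    System.out.println(BytesWritable.class
        .getProtectionDomain()
        .getCodeSource()
        .getLocation());
  }
}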
Specifying these three parameters solves the problem. I tried many other approaches before this; if I had checked the official Spark documentation first, such a simple issue would never have become this complicated.