You can load the data into a HashSet, as shown below. But if the file is large, larger than the JVM heap, this throws java.lang.OutOfMemoryError: Java heap space.
    HashSet<String> set = new HashSet<String>();
    File file = new File("E:\\aa.txt");
    BufferedReader reader = new BufferedReader(new FileReader(file));
    String tempString = null;
    while ((tempString = reader.readLine()) != null) {
        tempString = tempString.trim();
        // compare content, not references: tempString != "" is always true after trim()
        if (!tempString.isEmpty()) {
            System.out.println(tempString);
            set.add(tempString);
        }
    }
    reader.close();
Instead, you can process the file in batches: use a hash-modulo scheme to split the large file into several small files, deduplicate each small file through a HashSet in turn, and finally merge the results. Because equal strings always have equal hash codes, every copy of a duplicate line is guaranteed to land in the same small file, so the small files can be deduplicated independently.
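To see why this is safe, here is a minimal sketch of the bucketing rule (the bucket count of 10 and the sample string are illustrative assumptions, not part of the original code):

    // Two strings with equal content have equal hashCode() values,
    // so they always map to the same bucket index, i.e. the same small file.
    int splitSize = 10; // assumed bucket count
    String a = "hello";
    String b = new String("hello"); // distinct object, same content
    System.out.println(Math.abs(a.hashCode() % splitSize)); // same index...
    System.out.println(Math.abs(b.hashCode() % splitSize)); // ...as this one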
First, generate the test data file aa.txt:
    // insert test data from multiple threads
    public void set() throws FileNotFoundException {
        File file = new File("E:\\aa.txt");
        PrintWriter pws = new PrintWriter(file);
        CountDownLatch latch = new CountDownLatch(9);
        ExecutorService executorService = Executors.newFixedThreadPool(9);
        for (int i = 0; i < 9; i++) {
            executorService.execute(new SetClass("name+" + UUID.randomUUID().toString(), latch, file, pws));
        }
        try {
            latch.await(); // block until the latch count reaches 0
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        executorService.shutdown(); // shut down the pool
        pws.close(); // flush and close the writer, or trailing data may be lost
    }

    public class SetClass extends Thread {
        private final CountDownLatch countDownLatch;
        private File file;
        private PrintWriter pws;

        public SetClass(String name, CountDownLatch countDownLatch1, File file, PrintWriter pws) {
            super(name);
            this.countDownLatch = countDownLatch1;
            this.file = file;
            this.pws = pws;
        }

        @Override
        public void run() {
            for (int i = 0; i < 100000; i++) {
                // PrintWriter.println is internally synchronized, so concurrent lines do not interleave
                pws.println(UUID.randomUUID().toString());
                System.out.println(Thread.currentThread().getName() + ":" + i);
            }
            countDownLatch.countDown();
        }
    }
Split the large file, using hash-modulo so that duplicate lines go into the same small file:
    /**
     * Hash each line and use the modulus to distribute lines into small files.
     * @param targetFile path of the file to deduplicate
     * @param splitSize  number of small hash-modulo files to split the target file into
     * @return the array of small files
     */
    public static File[] splitFile(String targetFile, int splitSize) {
        File file = new File(targetFile);
        BufferedReader reader = null;
        PrintWriter[] pws = new PrintWriter[splitSize];
        File[] littleFiles = new File[splitSize];
        String parentPath = file.getParent();
        File tempFolder = new File(parentPath + File.separator + "test");
        if (!tempFolder.exists()) {
            tempFolder.mkdir();
        }
        for (int i = 0; i < splitSize; i++) {
            littleFiles[i] = new File(tempFolder.getAbsolutePath() + File.separator + i + ".txt");
            if (littleFiles[i].exists()) {
                littleFiles[i].delete();
            }
            try {
                pws[i] = new PrintWriter(littleFiles[i]);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }
        try {
            reader = new BufferedReader(new FileReader(file));
            String tempString = null;
            // readLine() reads line by line, avoiding loading the whole file at once
            while ((tempString = reader.readLine()) != null) {
                tempString = tempString.trim();
                if (!tempString.isEmpty()) {
                    // The key step: hash each line and take the modulus, so that
                    // strings with the same hash value always land in the same file.
                    int index = Math.abs(tempString.hashCode() % splitSize);
                    pws[index].println(tempString);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
            for (int i = 0; i < splitSize; i++) {
                if (pws[i] != null) {
                    pws[i].close();
                }
            }
        }
        return littleFiles;
    }
Deduplicate each small file and merge the results:
    /**
     * Deduplicate the small files and merge the results.
     * @param littleFiles      the array of small files produced by the split
     * @param distinctFilePath path of the deduplicated output file
     * @param splitSize        number of small files
     */
    public static void distinct(File[] littleFiles, String distinctFilePath, int splitSize) {
        File distinctedFile = new File(distinctFilePath);
        FileReader[] frs = new FileReader[splitSize];
        BufferedReader[] brs = new BufferedReader[splitSize];
        PrintWriter pw = null;
        try {
            if (distinctedFile.exists()) {
                distinctedFile.delete();
            }
            distinctedFile.createNewFile();
            pw = new PrintWriter(distinctedFile);
            Set<String> unicSet = new HashSet<String>();
            for (int i = 0; i < splitSize; i++) {
                if (littleFiles[i].exists()) {
                    System.out.println("Deduplicating small file: " + littleFiles[i].getName());
                    frs[i] = new FileReader(littleFiles[i]);
                    brs[i] = new BufferedReader(frs[i]);
                    String line = null;
                    while ((line = brs[i].readLine()) != null) {
                        if (!line.isEmpty()) {
                            unicSet.add(line);
                        }
                    }
                    for (String s : unicSet) {
                        pw.println(s);
                    }
                    // clear the set so only one small file's lines are in memory at a time
                    unicSet.clear();
                    System.gc();
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        } finally {
            for (int i = 0; i < splitSize; i++) {
                try {
                    if (null != brs[i]) {
                        brs[i].close();
                    }
                    if (null != frs[i]) {
                        frs[i].close();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
                // delete the temporary small files after merging
                if (littleFiles[i].exists()) {
                    littleFiles[i].delete();
                }
            }
            if (null != pw) {
                pw.close();
            }
        }
    }
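Putting it together, a minimal driver might look like the following sketch. The output path E:\\aa_distinct.txt and the split count of 10 are assumptions for illustration; pick a split count large enough that each small file's unique lines fit in the heap.

    // Hypothetical driver, assuming splitFile and distinct are the static
    // methods above; paths and split count are example values only.
    public static void main(String[] args) {
        int splitSize = 10; // assumed bucket count; tune so each small file fits in memory
        File[] littleFiles = splitFile("E:\\aa.txt", splitSize);
        distinct(littleFiles, "E:\\aa_distinct.txt", splitSize);
        System.out.println("Deduplication finished.");
    }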