LoopAnimate、LLM-Seg、DreamScape、LoopGaussian、TransformerFAM

First published on the WeChat official account: 机器感知


Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding


While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts based on the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, consequently reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficien......
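
To make the two-stage draft-then-verify loop concrete, here is a minimal Python sketch of the idea, assuming a greedy LLM exposed as a hypothetical `lm_next_token(tokens)` callable; a real implementation would verify all drafted tokens in a single batched forward pass rather than one call per position.

```python
# Minimal sketch of N-gram draft-then-verify decoding. `lm_next_token` is a
# hypothetical stand-in for the LLM's greedy next-token function.
def build_ngram_table(tokens, n=3):
    """Map each (n-1)-gram seen so far to its most recent continuation."""
    table = {}
    for i in range(len(tokens) - n + 1):
        table[tuple(tokens[i:i + n - 1])] = tokens[i + n - 1]
    return table

def anpd_decode(prompt_tokens, lm_next_token, max_new=32, n=3, draft_len=4):
    out = list(prompt_tokens)
    while len(out) < len(prompt_tokens) + max_new:
        table = build_ngram_table(out, n)      # adapts to the current context
        draft, ctx = [], list(out)
        for _ in range(draft_len):             # rapid drafting phase
            nxt = table.get(tuple(ctx[-(n - 1):]))
            if nxt is None:
                break
            draft.append(nxt)
            ctx.append(nxt)
        verified = list(out)
        for d in draft:                        # verification phase
            t = lm_next_token(verified)
            verified.append(t)                 # LLM output is always kept
            if t != d:                         # first mismatch ends the draft
                break
        if len(verified) == len(out):          # empty draft: plain one-step decode
            verified.append(lm_next_token(verified))
        out = verified
    return out
```

Because every accepted token is exactly what the LLM itself would have produced, the output matches greedy decoding, which is why the acceleration is lossless.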

LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning


Understanding human instructions to identify the target objects is vital for perception systems. In recent years, the advancements of Large Language Models (LLMs) have introduced new possibilities for image segmentation. In this work, we delve into reasoning segmentation, a novel task that enables a segmentation system to reason about and interpret implicit user intention via large language model reasoning and then segment the corresponding target. Our work on reasoning segmentation contributes to both methodological design and dataset labeling. For the model, we propose a new framework named LLM-Seg. LLM-Seg effectively connects the foundational Segment Anything Model and the LLM via mask proposal selection. For the dataset, we propose an automatic data generation pipeline and construct a new reasoning segmentation dataset named LLM-Seg40K. Experiments demonstrate that our LLM-Seg exhibits competitive performance compared with existing methods. Furthermore, our prop......
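
A minimal sketch of the mask-proposal-selection idea, assuming hypothetical mask embeddings and an LLM query embedding (the real framework's interfaces differ): score each SAM proposal against the LLM's reasoning output and keep the best match.

```python
import numpy as np

def select_mask(proposals, mask_embeds, query_embed):
    """Return the proposal whose embedding is most cosine-similar to the
    LLM's query embedding. `proposals` and `mask_embeds` are parallel lists."""
    q = query_embed / np.linalg.norm(query_embed)
    scores = [float(q @ (e / np.linalg.norm(e))) for e in mask_embeds]
    return proposals[int(np.argmax(scores))]
```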

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length


The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including the complex exponential moving average (CEMA), the timestep normalization layer, the normalized attention mechanism, and pre-norm with two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than the Transformer at the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, land......
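
Of the listed components, the complex exponential moving average is easy to illustrate: the usual EMA recurrence h_t = α·x_t + (1−α)·h_{t−1} gets a complex-valued decay, so the hidden state can oscillate as well as decay. A minimal sketch with illustrative parameter values (not Megalodon's actual parameterization):

```python
import numpy as np

def cema(x, alpha=0.3, theta=0.5):
    """Complex EMA over a 1-D sequence; theta rotates the decay in the
    complex plane, adding oscillatory behavior to plain exponential decay."""
    decay = (1 - alpha) * np.exp(1j * theta)
    h, out = 0j, np.empty(len(x), dtype=complex)
    for t, xt in enumerate(x):
        h = alpha * xt + decay * h
        out[t] = h
    return out.real          # project the state back to the reals

y = cema(np.sin(np.linspace(0, 6, 50)))
```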

DeDoDe v2: Analyzing and Improving the DeDoDe Keypoint Detector


In this paper, we analyze and improve the recently proposed DeDoDe keypoint detector. We focus our analysis on some key issues. First, we find that DeDoDe keypoints tend to cluster together, which we fix by performing non-max suppression on the target distribution of the detector during training. Second, we address issues related to data augmentation. In particular, the DeDoDe detector is sensitive to large rotations. We fix this by including 90-degree rotations as well as horizontal flips. Finally, the decoupled nature of the DeDoDe detector makes evaluation of downstream usefulness problematic. We fix this by matching the keypoints with a pretrained dense matcher (RoMa) and evaluating two-view pose estimates. We find that the original long training is detrimental to performance, and therefore propose a much shorter training schedule. We integrate all these improvements into our proposed detector DeDoDe v2 and evaluate it with the original DeDoDe descriptor on the MegaD......
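
The de-clustering fix amounts to non-max suppression on a score map. A minimal sketch of windowed NMS (the window size is an illustrative choice, not the paper's):

```python
import numpy as np

def nms_heatmap(scores, window=3):
    """Keep only scores that are the maximum of their local window."""
    r = window // 2
    padded = np.pad(scores, r, constant_values=-np.inf)
    out = np.zeros_like(scores)
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            if scores[i, j] >= padded[i:i + window, j:j + window].max():
                out[i, j] = scores[i, j]
    return out
```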

Enforcing Paraphrase Generation via Controllable Latent Diffusion


Paraphrase generation aims to produce high-quality and diverse utterances of a given text. Though state-of-the-art generation via diffusion models reconciles generation quality and diversity, textual diffusion suffers from a truncation issue that hinders efficiency and quality control. In this work, we propose the Latent Diffusion Paraphraser (LDP), a novel paraphrase generation method that models a controllable diffusion process over a learned latent space. LDP achieves superior generation efficiency compared to its diffusion counterparts. It uses only input segments to enforce paraphrase semantics, which further improves the results without external features. Experiments show that LDP achieves better and more diverse paraphrase generation than baselines. Further analysis shows that our method is also helpful for similar text generation tasks and domain adaptation. Our code and data are available at https://github.com/NIL-zhuang/ld4pg. ......

LoopGaussian: Creating 3D Cinemagraph with Multi-view Images via Eulerian Motion Field


Cinemagraph is a unique form of visual media that combines elements of still photography and subtle motion to create a captivating experience. However, the majority of videos generated by recent works lack depth information and are confined to the constraints of 2D image space. In this paper, inspired by significant progress in the field of novel view synthesis (NVS) achieved by 3D Gaussian Splatting (3D-GS), we propose LoopGaussian to elevate cinemagraphs from 2D image space to 3D space using 3D Gaussian modeling. To achieve this, we first employ the 3D-GS method to reconstruct 3D Gaussian point clouds from multi-view images of static scenes, incorporating shape regularization terms to prevent blurring or artifacts caused by object deformation. We then adopt an autoencoder tailored to 3D Gaussians to project them into a feature space. To maintain the local continuity of the scene, we devise SuperGaussian for clustering based on the acquired features. By calculating the similarity ......
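
The SuperGaussian step groups Gaussians by their autoencoder features so that motion stays locally coherent. As a rough stand-in for that clustering (the paper's actual procedure differs), a plain k-means over the per-Gaussian features:

```python
import numpy as np

def kmeans(features, k=8, iters=20, seed=0):
    """Cluster per-Gaussian feature vectors; `features` has shape (N, D)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        dists = ((features[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k):
            members = features[labels == c]
            if len(members):
                centers[c] = members.mean(0)
    return labels
```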

LoopAnimate: Loopable Salient Object Animation


Research on diffusion model-based video generation has advanced rapidly. However, limitations in object fidelity and generation length hinder its practical applications. Additionally, specific domains like animated wallpapers require seamless looping, where the first and last frames of the video match seamlessly. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models require the entire video as input during training to encode temporal and positional information at once. However......
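
Seamless looping reduces to a constraint that the first and last generated frames agree. One illustrative way to express that at training time (not the paper's exact formulation) is an L2 penalty between them:

```python
import torch

def loop_consistency_loss(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) decoded video; penalize first/last-frame mismatch."""
    return torch.mean((frames[0] - frames[-1]) ** 2)
```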

TransformerFAM: Feedback attention is working memory


While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length. ......
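
A minimal sketch of the feedback idea, with illustrative dimensions: a small set of memory slots is concatenated to each segment, attends jointly with it, and the updated memory is fed back for the next segment. Consistent with the no-extra-weights claim, the memory here starts as plain zeros; truncating gradients across segments is our assumption.

```python
import torch
import torch.nn as nn

class FeedbackAttentionBlock(nn.Module):
    def __init__(self, d=64, heads=4, mem_len=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mem_len, self.d = mem_len, d

    def forward(self, segments):
        """segments: list of (B, L, d) chunks, processed left to right."""
        mem = torch.zeros(segments[0].size(0), self.mem_len, self.d)
        outs = []
        for x in segments:
            h = torch.cat([x, mem], dim=1)      # memory joins the context
            y, _ = self.attn(h, h, h)
            outs.append(y[:, : x.size(1)])
            mem = y[:, x.size(1):].detach()     # feedback into the next segment
        return torch.cat(outs, dim=1), mem

out, mem = FeedbackAttentionBlock()([torch.randn(2, 16, 64) for _ in range(3)])
```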

DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling


Recent progress in text-to-3D creation has been propelled by integrating the potent prior of Diffusion Models from text-to-image generation into the 3D domain. Nevertheless, generating 3D scenes characterized by multiple instances and intricate arrangements remains challenging. In this study, we present DreamScape, a method for creating highly consistent 3D scenes solely from textual descriptions, leveraging the strong 3D representation capabilities of Gaussian Splatting and the complex arrangement abilities of large language models (LLMs). Our approach involves a 3D Gaussian Guide ($3DG^2$) for scene representation, consisting of semantic primitives (objects) and their spatial transformations and relationships derived directly from text prompts using LLMs. This compositional representation allows for local-to-global optimization of the entire scene. A progressive scale control is tailored during local object generation, ensuring that objects of different sizes and densitie......
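
The compositional representation is essentially a structured layout: semantic primitives plus transforms and relations that an LLM can emit from a prompt. A minimal sketch with a hypothetical JSON schema (not the paper's actual format):

```python
import json

layout = json.loads("""
{
  "objects": [
    {"name": "oak tree",    "position": [0.0, 0.0, 0.0], "scale": 2.5},
    {"name": "stone bench", "position": [1.5, 0.0, 0.5], "scale": 1.0}
  ],
  "relations": [{"a": "stone bench", "rel": "next_to", "b": "oak tree"}]
}
""")

for obj in layout["objects"]:       # each primitive is optimized locally,
    print(obj["name"], obj["position"], obj["scale"])  # then composed globally
```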

A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion


In image fusion tasks, images from different sources possess distinct characteristics. This has driven the development of numerous methods to explore better ways of fusing them while preserving their respective characteristics. Mamba, as a state space model, has emerged in the field of natural language processing. Recently, many studies have attempted to extend Mamba to vision tasks. However, because images differ in nature from causal language sequences, the limited state capacity of Mamba weakens its ability to model image information. Additionally, Mamba's sequence modeling captures only spatial information and cannot effectively exploit the rich spectral information in images. Motivated by these challenges, we customize and improve a vision Mamba network designed for the image fusion task. Specifically, we propose the local-enhanced vision Mamba block, dubbed LEVM. The LEVM block can improve local information perception of the network and simu......
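
A minimal sketch of the local-enhancement idea: a depthwise convolution injects neighborhood context before a toy linear recurrent scan that stands in for the Mamba state-space update (all sizes, and the scan itself, are illustrative simplifications):

```python
import torch
import torch.nn as nn

class LocalEnhancedScan(nn.Module):
    def __init__(self, d=32, kernel=3):
        super().__init__()
        self.local = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.decay = nn.Parameter(torch.full((d,), 0.9))

    def forward(self, x):                      # x: (B, L, d)
        x = x + self.local(x.transpose(1, 2)).transpose(1, 2)  # local context
        h = torch.zeros(x.size(0), x.size(2))
        outs = []
        for t in range(x.size(1)):             # simple recurrent state update
            h = self.decay * h + x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)
```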

In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition


Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research in understanding 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses in the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring the field of 2D hand pose estimation for egocentric action recognition, making two contributions. Firstly, we introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective, capturing interactions between hands and objects. Both methods outperform state-of-the-art models on H2O ......

Self-Selected Attention Span for Accelerating Large Language Model Inference


Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this inefficiency, we capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency. We demonstrate with two specific tasks: (a) evaluating complex arithmetic expressions and (b) summarizing news articles. For both tasks, we create custom datasets to fine-tune an LLM. The goal of fine-tuning is twofold: first, to make the LLM learn to solve the evaluation or summarization task, and second, to train it to identify the minimal attention spans required for each step of the task. As a result, the fine-tuned model is able to convert these self-identified minimal attention spans into sparse attention masks on-the-fly during inference. We develop a custom CUDA kernel to take advantage of the reduced context to attend to. We......
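
A minimal sketch of turning a self-identified span into a sparse attention mask: each generated token sees the prompt plus only its last `span` tokens. Here the span is a plain input; in the paper it comes from the fine-tuned model itself, step by step.

```python
import torch

def span_attention_mask(seq_len, prompt_len, span):
    """Boolean (seq_len, seq_len) mask; True = position may be attended to."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for q in range(seq_len):
        if q < prompt_len:
            mask[q, : q + 1] = True                 # prompt tokens: plain causal
        else:
            mask[q, :prompt_len] = True             # prompt stays visible
            mask[q, max(prompt_len, q - span + 1): q + 1] = True
    return mask
```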

CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting


Gaussian splatting, renowned for its exceptional rendering quality and efficiency, has emerged as a prominent technique in 3D scene representation. However, the substantial data volume of Gaussian splatting impedes its practical utility in real-world applications. Herein, we propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS), which harnesses compact Gaussian primitives for faithful 3D scene modeling with a remarkably reduced data size. To ensure the compactness of Gaussian primitives, we devise a hybrid primitive structure that captures predictive relationships among primitives. Then, we exploit a small set of anchor primitives for prediction, allowing the majority of primitives to be encapsulated into highly compact residual forms. Moreover, we develop a rate-constrained optimization scheme to eliminate redundancies within such hybrid primitives, steering our CompGS towards an optimal trade-off between bitrate consumption and represe......
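
The hybrid-primitive idea can be sketched as anchor-plus-residual coding: keep a few anchors in full precision and store every other primitive as a quantized offset from its nearest anchor. The quantization step here is an illustrative choice, and the real method predicts richer Gaussian attributes than the raw vectors used below.

```python
import numpy as np

def encode(primitives, anchors, step=0.01):
    """primitives: (N, D), anchors: (K, D); returns anchor ids + residuals."""
    idx = ((primitives[:, None] - anchors[None]) ** 2).sum(-1).argmin(1)
    residuals = np.round((primitives - anchors[idx]) / step).astype(np.int16)
    return idx, residuals                      # highly compact residual form

def decode(idx, residuals, anchors, step=0.01):
    return anchors[idx] + residuals.astype(np.float32) * step
```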

Magic Clothing: Controllable Garment-Driven Image Synthesis


We propose Magic Clothing, a latent diffusion model (LDM)-based network architecture for an unexplored garment-driven image synthesis task. Aiming at generating customized characters wearing the target garments with diverse text prompts, image controllability is the most critical issue, i.e., preserving the garment details while maintaining faithfulness to the text prompts. To this end, we introduce a garment extractor to capture the detailed garment features, and employ self-attention fusion to incorporate them into the pretrained LDMs, ensuring that the garment details remain unchanged on the target character. Then, we leverage joint classifier-free guidance to balance the control of garment features and text prompts over the generated results. Meanwhile, the proposed garment extractor is a plug-in module applicable to various finetuned LDMs, and it can be combined with other extensions like ControlNet and IP-Adapter to enhance the diversity and controllability of the g......
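
Joint classifier-free guidance over two conditions can be sketched as combining the unconditional noise prediction with one guided delta per condition, each with its own scale. The `eps_*` inputs stand in for the diffusion model's noise predictions under each conditioning, and the scales are illustrative.

```python
def joint_cfg(eps_uncond, eps_text, eps_garment, s_text=7.5, s_garment=2.5):
    """Blend text and garment guidance around the unconditional prediction."""
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_garment * (eps_garment - eps_uncond))
```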

