M-A-P Daily Paper

The M-A-P daily paper project curates and reviews a selection of new papers published daily on arXiv, providing insightful commentary on cutting-edge research across various scientific disciplines.

🛠️ Papers This Week

01/11/2024

Paper	Comments
Reasons and Solutions for the Decline in Model Performance after Editing	TLDR: The method is not the focal point; rather, the paper identifies two interesting issues with model editing: (1) The explosive growth of the L1 norm in the edited parameter layers correlates strongly with editing accuracy: when the L1 norm grows explosively, model performance declines. (2) The diversity and sequence length of the editing targets have a significant impact on model performance, and higher perplexity in the editing target results in a more severe performance drop. If the L1 norm of the edited layer is a good indicator of catastrophic forgetting, two research questions follow: (1) Can the L1 norm be refined to focus on the features most affected by editing, potentially providing more insight? (2) Since the paper compares perplexity across problem types (true/false, multiple-choice, and generation), is the stability of the model's answer patterns more important for some problem types than the robustness of its memory for individual facts? Editing may unintentionally disrupt the stability of such higher-order patterns.
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents	Highlights the shift of agent benchmarking from niche to mainstream. Presents a smartphone-side agent benchmark; Table 1 shows GPT-4o outperforming Claude by three points, although, interestingly, the highest logical operation rate comes from Gemini-1.5-Pro despite its underwhelming overall performance. This benchmark may be a starting point for integrating simulation-based agent benchmarks.
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments	Proposes an intuitive method for compression without additional training, enabling dynamic size adjustment for large language models (LLMs) in variable memory settings. The approach continuously decomposes weight matrices, observing the residuals' impact with a calibration set, ranks them by importance, and dynamically loads/unloads parameters, which may prove useful in practical applications.
Representative Social Choice: From Learning Theory to AI Alignment	Although not specifically an alignment tool, this sociological model has potential applications in predictive analysis, providing flexibility in setting up population-based preferences across different agenda topics. Extending it to support composite population distributions could be valuable for simulating public opinion dynamics.
Nearest Neighbor Normalization Improves Multimodal Retrieval	Introduces a simple yet potentially effective incremental technique using the embeddings of the k-nearest neighbors to estimate retrieval bias, instead of relying on a global bias. This plug-and-play approach is straightforward to implement.
Constraint Back-translation Improves Complex Instruction Following of Large Language Models	Addresses the practical challenge of following complex composite instructions, highlighting back-translation as a potential solution. This topic lends itself well to academic exploration, as it may yield interesting insights without requiring extensive resources. Future work might consider leveraging CriticGPT-like data to further enhance this approach.
Length-Induced Embedding Collapse in Transformer-based Models	This paper identifies an issue where, as sequence length increases, the self-attention mechanism essentially functions as a low-pass filter, causing embeddings to retain only their low-frequency components. This observation is consistent with recent findings, such as those in Xiaomi's paper "HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation." It suggests that adjustments targeting this low-frequency dominance could be made through relatively cost-effective methods.
Commonsense Knowledge Editing Based on Free-Text in LLMs	Builds on prior knowledge editing research, noting that commonsense knowledge resides in both MLP and attention layers. Unlike structured triples, commonsense knowledge here reflects simple causal reasoning, such as "feeling thirsty, so drink water," suggesting further exploration of which layers contain specific types of commonsense knowledge.
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective	This paper offers an insightful analysis of how model behavior differs under fast versus slow thinking training modes. Using the nuclear norm (the sum of the singular values obtained by SVD) to characterize each gradient matrix, it observes that without Chain of Thought (CoT), or with simplified CoT, gradients in shallow layers are larger and differ notably between layers. In contrast, when detailed CoT is applied in slow thinking mode, gradients become more consistent across layers. A key takeaway is that under the slow thinking mode, gradients can differentiate correct responses from irrelevant ones, with instruction-tuned models aligning more closely with the behavior of the original pretrained model. This analysis suggests that appropriate step-wise division may indeed enhance the robustness of LLM-based agents.
Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress	The approach and concept are intuitive. The proposed STAC essentially analyzes the statistical distance between time steps induced by a policy's action distribution within a simulated environment. Excessive deviations in this distance indicate potential failure. This idea is somewhat analogous to world modeling, albeit a simplified one based on post-action world simulations, and represents a valuable direction.
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration	Using KV-Cache compression for video understanding models seems a reasonable approach. However, the implementation appears loosely related to multimodal processing. It introduces a post-visual attention mechanism to calculate cross-layer sparsity and within-layer token importance, dynamically adjusting the window size to select significant visual and language tokens, thus enhancing cache hit rates. For long video comprehension, one could consider compressing multiple frames within a slot under the same perspective or creating hierarchical structures. Representing continuous frames as sequential depictions of changes in the first frame could be an alternative for video representation.
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts	The authors conducted thorough data collection and modeling of human faces and hands, gathering a dataset of over one million high-quality portrait images in various scenes.
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling	The paper is heavy on formulas without experiments; a detailed review is needed. The claims are quite strong. Prior models assumed diversity in representation and equal dimensions, allowing linearly invertible transformations for distributionally equivalent models. Here, distributional equivalence is based on high-dimensional vectors corresponding to semantic and syntactic patterns. Loosening prior conditions, the paper introduces additional linear properties that may not be intuitively interpretable, showing that equivalency can hold without satisfying both previous requirements. This warrants further analysis.
Learning to Achieve Goals with Belief State Transformers	This seems to be a variant of fill-in-the-middle (FIM) training tailored for long-text generation, differing from the standard FIM loss. It incorporates a forward encoder and a backward encoder to encode prefixes and suffixes, respectively, with heads predicting the next word after the prefix and the preceding word before the suffix. The training objective combines the forward and backward Transformer goals, emphasizing the continuity between prefix and suffix. At inference time, the forward model takes the prefix with an empty suffix and generates text autoregressively.
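
As a rough illustration of the nearest-neighbor normalization idea in the table above, the sketch below subtracts a per-candidate bias from each raw retrieval score, where the bias is estimated from the candidate's k highest similarity scores against a set of reference queries. The function names, the `alpha` weight, and the toy vectors are assumptions made for illustration; this is not the paper's implementation.

```python
def dot(u, v):
    """Plain dot product, used here as the similarity score."""
    return sum(a * b for a, b in zip(u, v))


def knn_bias(candidate, reference_queries, k, alpha=1.0):
    """Bias for one candidate: alpha times the mean of its k largest
    similarity scores against the reference queries (assumed form)."""
    scores = sorted((dot(candidate, q) for q in reference_queries), reverse=True)
    top_k = scores[:k]
    return alpha * sum(top_k) / len(top_k)


def normalized_scores(query, candidates, reference_queries, k=2, alpha=1.0):
    """Raw similarity minus the per-candidate local bias."""
    return [
        dot(query, c) - knn_bias(c, reference_queries, k, alpha)
        for c in candidates
    ]


# Toy data (hypothetical): candidate 0 is close to most reference queries,
# so it behaves like a "hub" and receives a larger bias.
reference_queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
candidates = [[1.0, 0.0], [0.0, 1.0]]
query = [0.6, 0.4]

raw = [dot(query, c) for c in candidates]
debiased = normalized_scores(query, candidates, reference_queries, k=2)
print(raw, debiased)
```

In this toy case the raw scores rank candidate 0 first, while the debiased scores rank candidate 1 first, because candidate 0 scores highly against most reference queries regardless of the actual query.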

31/10/2024

Paper	Comments
Aligning Audio-Visual Joint Representations with an Agentic Workflow	The paper introduces an LLM and Agentic Workflow approach to achieve audio-visual alignment.
Multi-student Diffusion Distillation for Better One-step Generators	The research demonstrates improved generation quality and inference speed by distilling conditional teacher diffusion models into multiple one-step generators. The dimensional decoupling effectively reduces the learning complexity of the generation process.
Predicting Future Actions of Reinforcement Learning Agents	The study explores two approaches for predicting future events: accessing agent internal states and synthetic solutions. Among the three internal state methods examined (most frequently accessed simulation actions, action dependency trees, and LSTM hidden states), the first method showed significant improvements in action and event prediction accuracy. This suggests that despite being RL-trained agents, the prediction accuracy relies more on identifying fixed patterns rather than state activation and action logic relationships.
ML Research Benchmark	This solo-authored paper introduces seven agent tasks: MiniPile, LLM Merging, Edge LLM Compression, Edge LLM Training, Math Reasoning, LLM Efficiency, and BabyLM. The benchmark requires an agent workflow approach, allowing flexibility in model architecture selection while constraining resources to a single A100 40GB GPU and 24-hour time limit. The research indicates Claude-3.5 Sonnet outperforms GPT-4 on most tasks.
Decoupling Semantic Similarity from Spatial Alignment for Neural Networks	The research introduces Semantic Representational Similarity Matrices (RSMs) that decouple localization and semantic information from traditional RSMs. It addresses spatial misalignment through set matching problems and demonstrates the differences between conventional and semantic RSMs using a purpose-built toy dataset of partially overlapping image patches.
Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies	The study utilizes the BabyLM dataset to evaluate fine-grained curriculum learning strategies. Three objective curricula are defined: GROWING, INWARDS, and MMM. The research demonstrates that language acquisition theory principles, particularly the "moderate effect," can be effectively applied to curriculum learning in pre-training datasets. The findings suggest careful consideration of dependency granularity in curriculum design.
VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning	This benchmark from DAMO Academy presents 1,200 mathematical problems with explicit and implicit visual contexts, covering plane geometry, solid geometry, analytic geometry, and calculus/functions. The geometric reasoning components represent particularly valuable contributions to the field, addressing a previous scarcity of such datasets.
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback	The research emphasizes the potential of synthesizing dense rewards from natural language descriptions in reinforcement learning, which is the most valuable point. Irony, refusal to answer, going silent, and long-winded replies all carry some positive or negative signal, and this signal is not even one-dimensional like agreement versus opposition. More complex emotions and many other cues could serve as rewards, which current RLHF systems may not fully capture.
BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference	The paper presents a sparse caching approach combined with a sliding window mechanism to capture recent information, dynamically segmenting historical tokens and prioritizing important tokens within local neighborhoods. Peak identification is achieved through local maximum sampling, which preserves critical information within segments. This could be useful for long video modeling.
Adaptive Paradigm Synergy: Can a Cross-Paradigm Objective Enhance Long-Tailed Learning?	The research examines the relationship between self-supervised and supervised learning, introducing Adaptive Paradigm Synergy (APS) as a novel cross-paradigm objective. The approach addresses long-tail distribution challenges by dynamically adjusting the uniformity of latent space structures.
Testing GPT-4-o1-preview on math and science problems: A follow-up study	The study evaluates o1-preview's performance on advanced scientific computation and mathematical problems, identifying specific weaknesses in spatial reasoning and physical concept understanding. Notable findings include significantly lower performance on "arbitrary number" problems compared to "no calculation" and "motivated number" problems. The interesting part is the discovery of this blind spot in o1.
Machine Unlearning using Forgetting Neural Networks	The research extends MLPs with a multiplicative forgetting function, demonstrating Ebbinghaus-like forgetting curves under variable forgetting rates using MNIST data. Ranking forgetting rates proved most effective among the forgetting function types, with multiple learning-forgetting phases improving test data generalization. Could be a plug-and-play method.
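
A minimal sketch of the selection pattern described for BUZZ in the table above: keep a sliding window of recent tokens, split the older history into segments, and keep the token with the highest importance score in each segment. The function name, the fixed segment size, and the use of a single per-token score are illustrative assumptions rather than the paper's algorithm.

```python
def select_kv_indices(scores, window, segment_size):
    """Return the sorted token indices to keep in the KV cache.

    scores: one importance score per token (e.g. accumulated attention).
    window: number of most recent tokens that are always kept.
    segment_size: length of each segment of older tokens (assumed fixed).
    """
    n = len(scores)
    recent_start = max(0, n - window)

    # Sliding window: always keep the most recent tokens.
    keep = list(range(recent_start, n))

    # Older history: keep the local maximum of each segment.
    for start in range(0, recent_start, segment_size):
        end = min(start + segment_size, recent_start)
        best = max(range(start, end), key=lambda i: scores[i])
        keep.append(best)

    return sorted(keep)


# Hypothetical per-token scores for a 10-token history.
scores = [0.1, 0.9, 0.2, 0.3, 0.05, 0.7, 0.4, 0.6, 0.2, 0.1]
kept = select_kv_indices(scores, window=3, segment_size=3)
print(kept)
```

With these toy scores, the last three tokens are kept by the window, and one peak is kept from each of the older segments.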

30/10/2024

Paper	Comments
Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts	The study analyzes h-space vectors extracted through U-Net layer outputs to observe evolution during the diffusion process. While the parameter analysis isn't particularly robust, it effectively demonstrates that the model learned gender biases related to occupations. This represents a potential latent pattern rather than a definitive higher-order semantic pattern. Visualization of h-space vectors revealed vector clusters containing fixed entity types such as square plates, soup pots, and sandwiches. However, across different clusters, it suggests that higher-order concepts related to "eating" may not have been well-learned.
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading	This infrastructure-focused paper presents logical findings that are particularly relevant for MoE-type models. Key discoveries include significantly reduced GPU memory utilization during update phases and low PCIe link utilization during backpropagation and updates. The solution involves subdividing optimizer states, implementing interleaved offloading of parameter updates on GPUs, overlapping optimizer subgroup movement and execution between GPU and CPU, efficiently placing and moving gradients, and utilizing higher precision PCIe transfers to avoid costly memory allocation. A performance model was developed but not thoroughly examined. A model-side insight regarding MoE relates to its comparison with an extremely wide dense model. In OpenAI's "Scaling Laws for Neural Language Models", the Figure 6 analysis states: "When we exclude embedding parameters, the performance of models with different depths converge to a single trend. Only models with fewer than 2 layers or with extreme depth-to-width ratios deviate significantly from the trend." The exact definition of "extreme" remains unclear. Previous observations suggest that MoE models achieving similar loss don't match the performance of corresponding wide models, which could be verified using HellaSwag. Hyper-Connections technology shows potential for addressing width-to-depth ratio optimization in MoE models.
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding	Presents an OCR-free benchmark for evaluating MLLMs' fine-grained visual perception and reasoning capabilities in document understanding. The benchmark covers text recognition, table recognition, text localization, table cell localization, key information extraction, document forgery detection, document QA, chart QA, and infographic QA.
A Systematic Assessment of OpenAI o1-Preview for Higher Order Thinking in Education	Compares o1-preview with human performance across various educational thinking paradigms, using common datasets for each paradigm. The conclusions lack sufficient credibility due to potential dataset exposure during training. However, the educational thinking patterns summary provides valuable insights, including: Critical Thinking, System Thinking, Computational Thinking, Design Thinking, Metacognition, Data Literacy, Creative Thinking, Collaborative Thinking, Abstract Reasoning, Spatial Reasoning, Quantitative Reasoning, Logical Reasoning, Analogical Reasoning, and Scientific Reasoning.
Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression	Demonstrates the mutual influence between RL and Imitation Learning. LLM pre-training typically develops ICL capabilities, which essentially combines or retrieves stored patterns rather than solving entirely new problems. The paper addresses OOD states in offline RL that can lead to catastrophic failures during online deployment. The proposed solution introduces regularization to map OOD states to their nearest known states, following a similar pattern-matching approach.
Fourier Head: Helping Large Language Models Learn Complex Probability Distributions	Introduces the Fourier Head, which uses linear layers to extract Fourier series coefficients, quantizing them into equidistant intervals. The approach evaluates Fourier PDF values at interval center points to return likelihood values as classification distributions. This method appears more mathematically natural than linear layers for modeling CoT/Diffusion-based LLM states, as it inherently models continuous data distributions closer to semantic spaces.
Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding	Applies speculative decoding to speech synthesis, leveraging the hierarchical structure of codebooks in speech and music (encodec/soundstream).
Cross-Entropy Is All You Need To Invert the Data Generating Process	The key finding suggests that supervised classification models can be transformed to recover latent variables learned by self-supervised/unsupervised models through linear transformation, referencing the ICA theory, which may provide valuable insights into transfer conditions between self-supervised and supervised learning.
Learning and Unlearning of Fabricated Knowledge in Language Models	Initial findings indicate that in CPT learning of new knowledge, facts conflicting with common sense persist longer than ordinary facts or randomly scrambled facts, potentially causing inappropriate triggering effects.
MCPDial: A Minecraft Persona-driven Dialogue Dataset	Presents a dataset containing 250 Minecraft NPC character descriptions with corresponding player character descriptions and 49 hand-crafted dialogues. Introduces a novel pipeline for generating character-driven game dialogues based on collected character descriptions and dialogues, demonstrating application within Minecraft.
How Does Critical Batch Size Scale in Pre-training?	Introduces the concept of Critical Batch Size (CBS), marking the threshold where increased data parallelism no longer yields significant benefits. Experiments using C4 suggest CBS scales primarily with data size rather than model size. Studies included models up to 1.2B parameters, examining CBS patterns by controlling model and data size variations. The conclusions require further verification because the amount of data is not large enough. CBS appears to be an optimizable hyperparameter, though its inclusion in usual scaling law iteration fitting may limit additional value.
L3Ms -- Lagrange Large Language Models	Formalizes SFT and alignment as a constrained optimization problem, aiming to minimize task perplexity while meeting application-specific minimum requirements. Introduces expectation and uniform constraints, applying minimum rewards to generated prompt-response pairs and probability lower bounds for inequality satisfaction. Mathematically, it discourages fixed patterns while minimizing model impact in prompt-to-response conversion, using Lagrange multipliers for constraint handling.
Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse	The conclusions and narrative approach lack robustness. While drawing from cognitive psychology cases where human performance decreases with overthinking, the experiments lack robust control over CoT implementations and their impacts on model performance. The work appears to make claims about cognitive psychology alignment without sufficient investigation of underlying mechanisms connecting model behavior and human cognition.
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness	While the experiments could be more robust, the methodology offers insights. Training simple linear classifiers on pre-trained features and evaluating monosemantic feature performance under various noise conditions may provide an efficient way to observe internal model features, dependent on clean monosemantic decomposition.
Causal Interventions on Causal Paths: Mapping GPT-2's Reasoning From Syntax to Semantics	Reveals that GPT-2's initial 2-3 layers primarily capture syntactic structure, with attention heads showing high focus on causal delimiters. Identified attention heads with increased causal relationship sensitivity. The methodology of replacing key words in causal sentences to create non-causal versions and observing prediction impacts through layer-wise loss calculation could be valuable for future research.
Reducing the Scope of Language Models with Circuit Breakers	Represents a growing trend in parameter-task orthogonality-controlled fine-tuning. The approach identifies and controls minimal relevant parameters while decomposing instruction task requirements, often incorporating orthogonalization definitions. Applicable for implementing selective response rejection or improved format following, showing mechanistic coherence.
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups	Presents a layer grouping approach for efficient sparse autoencoder training, significantly reducing training costs while maintaining reconstruction quality and downstream task performance. Results indicate shared common features between adjacent layers in Pythia. The grouping strategy involves clustering layers based on angular similarity before training an SAE for each group, offering a practical approach to efficient SAE training.
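
To make the Fourier Head description in the table above more concrete, the sketch below turns a handful of Fourier coefficients into a categorical distribution by evaluating a density at equally spaced bin centers on [-1, 1] and normalizing. In the setup described above the coefficients would come from a linear layer; here they are hard-coded. The squared-magnitude construction used to keep the density non-negative is a simplifying assumption and not necessarily the paper's parameterization, and all names are hypothetical.

```python
import cmath
import math


def fourier_density(x, coeffs):
    """Unnormalized, non-negative density |sum_k c_k * exp(i*pi*k*x)|^2."""
    total = sum(c * cmath.exp(1j * math.pi * k * x) for k, c in enumerate(coeffs))
    return abs(total) ** 2


def fourier_head_probs(coeffs, num_bins):
    """Evaluate the density at bin centers on [-1, 1] and normalize.

    Assumes the coefficients are not all zero, so the sum is positive.
    """
    width = 2.0 / num_bins
    centers = [-1.0 + (i + 0.5) * width for i in range(num_bins)]
    values = [fourier_density(x, coeffs) for x in centers]
    z = sum(values)
    return [v / z for v in values]


# Hypothetical coefficients; with [1.0, 0.5] the density is 1.25 + cos(pi*x),
# which peaks in the middle of the interval.
coeffs = [1.0, 0.5]
probs = fourier_head_probs(coeffs, num_bins=8)
print([round(p, 4) for p in probs])
```

Because neighboring bins are evaluated on the same smooth function, nearby classes receive similar probabilities, which is the continuity property the comment above contrasts with an ordinary linear classification head.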

29/10/2024

Paper	Comments
The Geometry of Concepts: Sparse Autoencoder Feature Structure	The study defines three scales (atomic, brain, and galaxy) and analyzes SAE feature structure at each of them. On the atomic scale, it eliminates distracting features, revealing parallel directions in related words, such as Vienna’s alignment with Austria, similar to Bern’s alignment with Switzerland. The brain scale introduces a notable lobe structure, with a prominent emphasis on code and math. At the galaxy scale, the point cloud (each point being an SAE feature) is anisotropic, with feature representations concentrated in certain directions. The study highlights that "the underlying density varies with radius and, for a high-dimensional Gaussian distribution, is strongly concentrated around a relatively thin spherical shell." Additionally, clustering entropy is lower at the intermediate layers. The conclusions at the galaxy level are worth further contemplation. Marked for follow-up.
Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models	This benchmark, designed to evaluate LLMs as shopping assistants, is straightforward and focused. It can serve as a reference for specific downstream tasks, potentially as part of a continual pre-training (CPT) benchmark. However, it is not recommended as a pretraining reference.
Malinowski in the Age of AI: Can large language models create a text game based on an anthropological classic?	A pipeline was developed to explore whether LLMs can independently generate text games based on anthropological classics. Although the book itself is unfamiliar, this study demonstrates a playful approach. Recently, there have been more projects that incorporate LLMs into interactive storytelling, such as Google’s "Unbounded: A Generative Infinite Game of Character Life Simulation." This direction holds appeal as LLMs can significantly enhance engagement and freedom in narrative-based games, like murder mystery and RPG scenarios. Traditional RPG setups often lacked sufficient "Dungeon Master" and other player interactions, leaving an unmet desire for personal adventure within unique story worlds. Compared to companion-type agents, the strength here lies in structured narratives that prevent repetitive dialogues and create engaging, evolving scenarios.
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration	This work provides an ablation study on factors affecting multi-modal in-context learning (MM-ICL), particularly noting the impact of modality ordering on model performance. This issue was previously highlighted in the O1 multimodal pretraining proposal. Multimodal pretraining often employs either "paired" or "interleaved" formats for organizing image-text data, with the interleaved format leading to a sequence such as `[Text][Image][Text][Text][Image][Image][Text]`. Consequently, the model is less exposed to patterns involving multiple consecutive images followed by an instruction, as in `[Image][Image][Image][Image][Instruction]`. This potential mismatch in learned attention patterns could affect performance, although the trick was implemented in production without detailed ICP (input conditioning pattern) analysis. Applying ICP methods from textual contexts could be valuable here as well.
MusicFlow: Cascaded Flow Matching for Text Guided Music Generation	Meta’s music generation model, MusicFlow, stands out as a relatively clean and straightforward approach to music generation, free from an overly complex layered structure commonly seen in similar models. The model compares MERT and HuBERT embeddings, finding HuBERT to be significantly stronger at the semantic level, thus opting for HuBERT. This choice, although somewhat disappointing, is understandable given HuBERT’s superior performance. Planning to listen to the generated music samples tomorrow.
Deep Learning Based Dense Retrieval: A Comparative Study	This paper presents a comparative study of dense retrievers using datasets FiQA, HotpotQA, and Quora, specifically analyzing models like BERT, SimCSE, ANCE, Contriever, and DPR series. The robustness analysis under adversarial attacks appears to lack practical relevance, as the real-world applicability of such attacks remains unclear. ANCE stands out as the most effective across various conditions. Further clarification on the real-life scenarios for these adversarial cases would be valuable.
Maintaining Informative Coherence: Migrating Hallucinations in Large Language Models via Absorbing Markov Chains	This paper utilizes absorbing Markov chains to quantify the importance of contextual information and measure information loss at different distances in generation. Although the benchmark results show modest improvements, the motivation addresses a genuine issue. The approach resembles the one in Professor He Junxian's recent work, "Non-myopic Generation of Language Model for Reasoning and Planning", though this paper does not precisely target the "myopia" concept inherently related to next-token prediction (NTP). Upon reflection, myopia in existing NTP and predictive encoding (PE) methods may not fully capture the hierarchical retrieval needed to support "one-to-many" relationships. This study prompts a rethinking of sentence construction as shaped by the loss weighting, where each dependency space angle learned in beam search remains narrow and "short-sighted." A potential way to evaluate sentence continuity is to measure the probability of direct sequential prediction from sentence beginning to end relative to the search space distribution. By optimizing for an ideal loss based on this probability and comparing it with the actual loss, biases in multi-task learning may become evident. This could be explored experimentally in the near term.
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale	This paper combines masked autoregressive models with diffusion models to achieve scalable video generation, aligning closely with recent implementations and trials in LLM directions. In this setup, the masked autoregressive model component manages the extractable planning signals, potentially corresponding to Chain-of-Thought (CoT) or semantic span information within continuous embeddings. The diffusion model uses the mask-predicted control signals to refine details, reconstructing high-resolution frames. This unified learning objective warrants further scrutiny. The modular organization across video and text generation intuitively relieves the model from needing to pinpoint logical or temporal dependencies within an overwhelming search space. With diffusion models’ noise and denoising mechanisms, this extensive Gaussian distribution search space does not lend itself well to imitation learning, particularly as the semantic meaning of this scale remains somewhat elusive. The Diffusion of Thought study similarly relies on explicit CoT as a temporal sequence within a consistency model framework. At present, it appears that the natural reconstruction, congruent with diffusion, should remain within the diffusion process, while planning and high-level sequence structure benefit from autoregressive masking.
Uncertainty-Penalized Direct Preference Optimization	This paper introduces an uncertainty penalty to reduce overfitting in Direct Preference Optimization (DPO). By incorporating uncertainty-based regularization, it aims to mitigate the model's tendency to overfit during preference optimization, enhancing the generalization of learned preferences.
Understanding Adam Requires Better Rotation Dependent Assumptions	This paper explores Adam optimizer’s sensitivity to rotations in parameter space. The analysis suggests a need for better rotation-dependent assumptions to understand Adam's behavior fully. Requires in-depth reading and analysis; marked for future review.
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models	This paper introduces a dynamic token merging mechanism in byte-level language models to accelerate processing without degrading model performance, significantly reducing inference runtime. Recent research has seen numerous approaches to handling different tokens, suggesting that this area has matured. It is recommended to follow up on this work, and there are plans to compile a reference list related to this topic over the weekend.
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning	This benchmark evaluates the ability of human-like RL agents to rapidly transfer strategies across structurally similar tasks, and the motivation behind it is considered very sensible. It is posited that the current approach of LLMs moving from NTP to SFT and then RLHF is essentially due to the infeasibility of directly scaling RL in the current language space size. There are insufficient signals, and the foundational models may not be robust enough, necessitating the use of imitation learning for cold starts. From an optimization perspective, the initial phase of NTP to SFT focuses on how to effectively imitate the target, while the latter aims to achieve a better verifier, enhancing robustness and general confidence in modeling the distribution of the positive space in the sampling space. Regarding scaling RL, the short-term focus is on leveraging the world knowledge within LLMs, while long-term research into cross-environment high-level strategy and experience generalization, akin to Google's research on Genie and Cross-Game DT, is deemed highly valuable. This belief is predicated on the understanding that efficiently generalizing highly abstract experiences and strategies learned in few-shot scenarios is instructive for scaling RL. Currently, the challenge lies in retaining highly abstract experience generalization, as there is scant research on pure strategy generalization across different Atari games. Much of the existing academic work on generalization focuses on recognizing highly abstract concepts within single games, rather than reaching the level of strategy generalization, leading to potential overclaims about generalization. It is recommended that more benchmarks like this be developed, without restricting them to robotic applications.
SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement	This paper integrates Monte Carlo Tree Search (MCTS) into code agents, resulting in a notable performance increase. However, similar works have emerged recently, and the use of the UCT trick appears somewhat ad hoc, though this impression may stem from the reviewer not fully grasping the mathematical intent behind the formulas presented. The results seem largely empirical. Furthermore, the method introduces many uncontrollable factors, particularly in the evaluation phase: the statement that it “uses all relevant context including trajectory information, file context, and executed tests to provide a quantitative value estimation and qualitative explanation in natural language” feels rather vague. The paper appears to be somewhat supportive of a peer's work.
GPT-4o System Card	The paper does not capture many details, noting that the data organization only mentions "Web Data" and "Code and Math," which is an interesting point. In section 3.1, it appears that the red teaming efforts by OAI and Anthropic may be very intense and extreme, extending beyond just safety cases, which could lead to a qualitative change. However, effectively organizing a red team may involve various techniques. There was a tech blog referenced that raises questions about its credibility, available here: The Information Article. A minor detail in the evaluation section states, “We used Voice Engine to convert text inputs to audio, feed it to the GPT-4o, and score the outputs by the model. We always score only the textual content of the model output, except in cases where the audio needs to be evaluated directly, such as in evaluations for voice cloning.” The expression in section 5.3 seems to imply that their red teamers are also responsible for exploring broader potential scenarios for their models. Other sections feel somewhat vague, and further analysis may be warranted.
HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation	Proposed by Xiaomi, this paper introduces a new positional encoding method. It suggests that the attention pattern exhibits a U-shaped curve and analyzes specific components of RoPE, termed "activation components," which significantly impact the attention learned during the early training phases. The authors argue that low-frequency components are ineffective for representing positional information, advocating for a method that only utilizes high-frequency components. The actual effectiveness of this approach remains to be verified. Further exploration is planned for tomorrow, and the paper has been forwarded to @单勇, as the mathematical intuition behind the definition of "activation components" is not yet fully grasped.
LLMs Can Evolve Continually on Modality for X-Modal Reasoning	This paper presents Huawei's Any2Any model, which integrates single-modality adapters in parallel during the pre-training of a Q-Former. This approach is designed to effectively adapt to new modalities while allowing the adapters to be frozen post-training. The benchmark they established for evaluating continual learning in multimodal settings appears meaningful. Their main selling point is the claim that adding an audio modality to a text-image model can be done without retraining the existing text-image components. While this claim has practical implications, the proposed solution comes across as somewhat convoluted.
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?	This paper explores the concept of having multi-modal large language models (MLLMs) autonomously design evaluation hierarchies and generate questions based on user-defined assessment goals to benchmark other MLLMs. This approach facilitates the creation of a more user-centric Visual Question Answering (VQA) benchmark, which is a valuable perspective given the current scarcity of high-quality MLLM benchmarks.
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior	This paper proposes using an autoregressive generative prior model to act as a video tokenizer, aiming to remove redundant information from video data. The core intuition suggests that if a suitable function f representing the learned consistency model can be identified and combined with keyframes, it could serve as an effective tokenizer for multi-modal large language models (MLLMs). Theoretically, for individual images (or multi-image contexts), this function f represents reconstruction, while for video, the objective is to capture a form of semantic consistency that is learned across frames. This exploration could lead to innovative approaches in video processing using diffusion models, although further refinement of this idea is needed.
AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions	This framework is specifically designed for tabular Kaggle competitions and involves a detailed multi-agent workflow. The process consists of five main steps: understanding the data and planning, cleaning the data, employing a retrieval-augmented generation (RAG) approach to plan specific libraries for each step, feature engineering, and modeling. The framework allows for flexibility and control by specifying external libraries for feature engineering and model fitting, enabling the integration of new libraries. The authors chose a smaller scenario, coinciding with MLE-Bench, which showcases their worldview on the practical application of multi-agent systems. By focusing on a domain-specific approach, they aim to reflect a realistic workflow while maintaining extensibility, ensuring that unnecessary tasks are decoupled from the model's responsibilities. This allows for clean models to make decisions and handle redundant work without overburdening them with inappropriate tasks. Notably, their framework has shown better submission rates and results in tabular settings compared to AIDE, though it may be overly detailed for some contexts. Using a more streamlined model, like o1-mini, resulted in excess noise from unnecessary context, impacting performance. This indicates that o1 has indeed learned a valuable lesson about agent functionalities. Interestingly, AIDE's approach seems simpler, raising questions about its underlying assumptions regarding model strength.
Diff-Instruct`*`: Towards Human-Preferred One-step Text-to-image Generative Models	This paper discusses a diffusion model for text-to-image generation developed by Xiaohongshu. It appears to focus on enhancing human preference in the generative process, likely proposing improvements over existing models to align better with user expectations and aesthetic qualities. It will be interesting to delve into their methodology and findings to understand how they achieve this goal and what differentiates their approach from other models in the field. I'll mark this for further review.
Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models	This paper explores enhancing reasoning capabilities in LLMs through cooperative strategic planning by breaking down reasoning patterns. The approach aligns well with our work on the Comparative Study on O1. The identified reasoning types—Deductive, Inductive, Abductive, Analogical Reasoning, and Contradiction—along with strategies such as Decomposition, Enumeration, Elimination, and Reflection, provide a comprehensive framework for analyzing reasoning processes in LLMs. It would be beneficial to examine how these strategies are operationalized in their experiments and whether they lead to significant improvements in reasoning performance. I'll keep this in mind for further investigation.
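
For reference on the "UCT trick" mentioned in the SWE-Search entry above, here is a minimal sketch of the textbook UCT selection rule (mean value plus a UCB1-style exploration bonus). It is not the variant used in the paper, which the note above describes as ad hoc; the function names and the exploration constant `c` are illustrative assumptions.

```python
import math

def uct_score(value_sum, visits, parent_visits, c=1.41):
    """Textbook UCT score for one child (assumed form, not the paper's variant)."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = value_sum / visits                              # mean value so far
    explore = c * math.sqrt(math.log(parent_visits) / visits)  # exploration bonus
    return exploit + explore

def select_child(children, parent_visits, c=1.41):
    """children: list of (value_sum, visits) pairs. Returns the index to expand next."""
    scores = [uct_score(v, n, parent_visits, c) for v, n in children]
    return max(range(len(children)), key=scores.__getitem__)
```

With `c=0` the rule is purely greedy on mean value; larger `c` favours rarely visited children, which is the knob that agent-specific variants typically modify.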

28/10/2024

Paper	Comments
Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?	The paper presents a concise yet sophisticated Visual Language Model (VLM) test set. While Bongard problems fundamentally focus on identifying graphical classification criteria, their rule patterns primarily rely on image-based pattern recognition features. These features, while not overtly complex, present meaningful challenges even for human subjects. A distinctive characteristic emerges from its relatively modest visual information density: this property circumvents typical vision encoder limitations, enabling effective evaluation of the encoder's global conceptual understanding capabilities. This aligns particularly well with the general optimization objectives of CLIP-like encoders, making the benchmark particularly valuable for assessing vision encoder training quality.
ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning	The paper presents an intuitive and well-structured approach: extracting fixed static reasoning templates from mathematical problems to evaluate model robustness. Rather than optimizing for specific benchmarks like GSM8k or MATH, the methodology generates additional effective samples through template utilization, representing a more systematic approach. A recent proposal suggests extending this methodology to data augmentation, particularly relevant for competitive programming problems (e.g., LeetCode). Given the finite set of problem templates (e.g., knapsack, greedy algorithms, dynamic programming), the approach becomes viable when four conditions are met: 1. Establishment of root templates; 2. Robust template expansion capability for incorporating new elements. 3. Stable prompt transformation mechanisms for template-to-problem conversion. 4. Fixed brute-force algorithms for template-based solution generation. This framework enables generation of apparently out-of-distribution problems while maintaining consistent solution methodologies. The approach appears particularly promising for competitive programming training data generation. Mathematical problems, especially high-school examination problems, present even more straightforward opportunities for template extraction and application.
PDL: A Declarative Prompt Programming Language	The research presents a programmable abstraction language for LLM-to-Agent transformation, comparable to frameworks like Coze and Dify. The implementation demonstrates practical utility with well-designed abstractions.
Offline-to-Online Multi-Agent Reinforcement Learning	The research provides additional validation for the extrapolation of single-agent offline reinforcement learning methodologies to multi-agent online reinforcement learning scenarios. The successful transfer of single-agent offline RL effectiveness to multi-agent online environments suggests numerous potential applications in the agent domain. A particularly promising direction involves verification processes, where polarization in individual agent functionality and feedback mechanisms demonstrates improvements in overall multi-agent collaborative efficiency. While the current implementation remains preliminary, it represents a promising direction for future research development.
EDGE: Enhanced Grounded GUI Understanding	The research presents a scalable pipeline and generalized data synthesis framework capable of automatically generating large-scale, multi-granularity training data from web pages for GUI Agent training. The key insight lies in the extraction of both explicit textual content and latent elements from web data. The study demonstrates the continued value of Common Crawl as a comprehensive data source.
Counting Ability of Large Language Models and Tokenization	This theoretical paper presents three key findings: 1. In theory, RNNs and LSTMs can execute dynamic counting through maintenance of independent counters, while Transformers are constrained to TC0 complexity level. 2. Chain of Thought (CoT) reasoning combined with ideal assumptions enables complete counting capabilities. 3. The combination of imperfect tokenization with CoT performs below theoretical CoT limits, though it appears questionable whether tokenization represents the primary bottleneck in achieving CoT's theoretical maximum performance.
CloserMusicDB: A Modern Multipurpose Dataset of High Quality Music	The research presents a potentially valuable cold-start dataset featuring diverse music label annotations.
Brain-like Functional Organization within Large Language Models	While the paper presents speculative conclusions and methodology requiring further validation, it introduces an intriguing research approach: extracting patterns from LLMs as fixed regressor feature initializations for brain activity prediction. The study demonstrates coupling between these features and specific functional brain networks using a designated dataset. Despite the limited dataset scope, the methodological framework appears theoretically sound. The approach potentially enables identification of functional brain networks not represented in current LLMs, suggesting opportunities for targeted model enhancement.
Scaling Law with Learning Rate Annealing	The research incorporates annealing effects into Scaling Law modeling. Initial examination of the formulation suggests potential theoretical limitations, particularly regarding the lack of comprehensive analysis of annealing's impact on loss functions. This inadequate theoretical foundation may indicate incomplete consideration of these effects in the mathematical modeling.
Stick-breaking Attention	The research, authored by Yikang Shen, presents a theoretically elegant approach: implementing attention through a stick-breaking process where, for each token in a sequence, the model determines the proportion of remaining attention (the 'stick') to allocate, continuing until complete allocation is achieved. This methodology demonstrates two significant advantages over the conventional softmax+RoPE approach: 1. The theoretical framework enables learning of hierarchical paragraph information, avoiding the unnatural point-to-multipoint relationships inherent in RoPE. 2. The sequential allocation mechanism introduces an ingeniously designed ordering constraint. The mathematical formulation warrants further analysis.
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks	The research presents a benchmark for evaluating video comprehension capabilities of long-context multimodal agents.
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark	The research introduces a novel benchmark for multimodal audio understanding and reasoning capabilities. This benchmark merits attention from researchers working on foundation models and general audio architectures, as it comprehensively covers speech, sound effects, and music domains. Notable terminological distinction is made between 'Audio' for general audio content and 'Sound' for sound effects, providing useful nomenclature standardization.
No Free Lunch: Fundamental Limits of Learning Non-Hallucinating Generative Models	-
Can Stories Help LLMs Reason? Curating Information Space Through Narrative	The research investigates narrative-based Chain of Thought approaches to enhance LLM problem-solving capabilities, representing another exploratory implementation of CoT methodology.
Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning	The study presents a KV-Cache optimization methodology utilizing importance score computation and pre-allocation mechanisms. Initial review does not reveal significant novel insights. Further detailed analysis of the specific implementation is warranted.
Applying sparse autoencoders to unlearn knowledge in language models	The research demonstrates that unlearning can be achieved through single Sparse Autoencoder (SAE) features. Key findings indicate that while zero activation of features proves ineffective, negative scaling is necessary for unlearning. However, this negative scaling approach introduces comparable or increased side effects in unrelated multiple-choice tasks. While the methodology lacks robustness, potentially due to suboptimal feature processing, the intuition behind the approach merits consideration.
Flow Generator Matching	-
BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training	-
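
To make the allocation described in the Stick-breaking Attention entry above concrete, here is a minimal single-query sketch. It assumes each past token takes a sigmoid-gated fraction of whatever stick remains, and that allocation walks backward from the most recent token; the exact indexing and any numerical safeguards should be checked against the paper. The function name and return shape are illustrative.

```python
import math

def stick_breaking_weights(logits):
    """logits: scores of past tokens for one query, oldest first.

    Returns (weights, leftover): per-token attention weights and the
    unallocated remainder of the stick. Assumed form: each token takes
    sigmoid(logit) of what remains, most recent token first.
    """
    weights = [0.0] * len(logits)
    remaining = 1.0
    for j in range(len(logits) - 1, -1, -1):        # most recent token first
        beta = 1.0 / (1.0 + math.exp(-logits[j]))   # fraction of the stick to take
        weights[j] = beta * remaining
        remaining *= 1.0 - beta                     # shrink the stick
    return weights, remaining
```

In this sketch the ordering constraint comes from the allocation order itself rather than from an explicit positional encoding: an older token can only receive what nearer tokens left behind, and the weights need not sum to exactly 1.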

If you are interested in the work published by us, please navigate to our full paper list.