Paper | Comments |
---|---|
aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Completion | This is quite an interesting paper. Despite not performing thorough data cleaning, it still demonstrates competitive performance. The main selling point is the use of FIM (fill-in-the-middle). The paper reports an insight about FIM: when predicting the middle given a suffix and prefix, results are best when the {suffix; prefix; middle} and {prefix; suffix; middle} orderings each make up half of the training data (a minimal sketch of the two orderings follows this table). The paper adopts this conclusion directly, but the finding feels rather abrupt, and I'm curious about the underlying rationale. Additionally, I have a long-held view regarding code, especially object-oriented code: many of the relationships between code elements are in fact parallel, and at the repository level code forms an actual tree structure, which is a well-known concept. This leads me to wonder whether there are better learning methods, not just FIM, which still feels somewhat conservative. Is it possible to directly optimize for a dependency tree at a certain level of granularity? They also released a benchmark, though I haven't verified its value. |
An Evolved Universal Transformer Memory | This paper proposes an external Memory mechanism called NAMM. Currently, there seems to be a growing trend to encode Memory in ways that integrate closely with model architecture, facilitating speculative decoding or dense retrieval based on features derived from attention modeling. This remains an intriguing direction. |
Trust but Verify: Programmatic VLM Evaluation in the Wild | This is a recommended read: an evaluation pipeline with a fairly insightful approach, somewhat similar in motivation to HelloBench but more detailed in its implementation. They decouple potentially relevant factual information from the response based on entities, then verify this factual information step by step. This allows for assessing the accuracy of important information within the response. Although the process includes an LLM, it does not entirely rely on an "LLM-as-a-Judge" model. One of the notable insights here is that, in evaluations involving LLMs, a high-performing LLM does not necessarily need to serve as a judge; instead, it can perform information extraction and integration. This significantly broadens the scope of evaluation by structuring ground truth and requiring the LLM to break down and align the answer with critical factual points and scoring criteria. The way scores are ultimately assigned is a small but insightful trick. |
From PINNs to PIKANs: Recent Advances in Physics-Informed Machine Learning | It is recommended to read about this area: Physics-Informed Machine Learning. This article provides a very detailed historical review, but turning PINN into PIKAN lacks significance and is not worth delving into. From my previous experience with PINN and conversations with people studying astronomy, one fascinating aspect is that many complex equations do not require numerous data points to achieve fitting. However, the progress in the field of physics remains relatively slow for now. |
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations | A method for asymmetric KV-Cache Quantization is proposed, highlighting the high sensitivity of the symmetric approach. It is necessary to understand the reasons behind this sensitivity and the rationality of this mechanism. The paper on Qwen may be of more value. |
Estimating the Probabilities of Rare Outputs in Language Models | The issue pointed out here represents a highly valuable research method: specifically sampling to examine potential problems under worst-case scenarios. The main takeaway is that, for estimating extremely low-probability events, importance sampling outperforms activation extrapolation, which in turn outperforms naive sampling (a toy sketch of why follows this table). Sometimes it feels that when analyzing LLMs, simply increasing the temperature setting could reveal significant insights, such as identifying which types of queries have become overly fixed in the model's learning, or which aspects the model itself exhibits high uncertainty about. Those with time could conduct specialized studies on our model in this direction. |
Atomic Calibration of LLMs in Long-Form Generations | This article is similar to Article 3, and it seems likely that many such papers will be published soon. It is recommended to follow up specifically on this topic by establishing a similar system, where answers are broken down into atomic facts, each of which is validated individually, ultimately resulting in a composite factual score. This approach would enable evaluations beyond the constraints of MQA, allowing for a more accurate assessment of the model's performance in providing a free-form Q&A experience that aligns more closely with users' actual reading experiences. |
Learning to Route with Confidence Tokens | This may be part of Apple's multi-level model routing mechanism. Simply put, it requires the model to have a sense of its own uncertainty, passing tasks with high uncertainty on to larger models and, if still uncertain, responding with "not sure." This has practical value for Apple: it lets them store a smaller model on edge devices, minimizing the model's storage footprint and improving the user experience with minimal communication. This is a consideration worth noting for edge model deployment, and the motivation is clear. |
A Little Human Data Goes A Long Way | A clean and concise experimental report; I personally appreciate this type of paper. It solidly validates the collapse effect caused when synthetic data occupies an absolute majority. Recommended reading. |
Tuning Language Models by Mixture-of-Depths Ensemble | During pre-training of music encoders, I gained significant insights into similar methods. The differences in learning across intermediate layers in the representation of music audio are rather pronounced, with substantial variation between layers. Ensembling at the song level may yield far better performance than analyzing individual layers. In text-based models, however, this phenomenon does not appear as pronounced. In Decoder-only LLMs, the relationships between layers may not necessarily require ensembling; instead, there could be certain nested or linear relationships. By using outputs from intermediate layers, it is possible to enhance the performance on complex reasoning tasks without significantly increasing the number of parameters. As LLM development progresses, it truly seems worthwhile to spend more time studying what each part is learning and how it should be learned. |
Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond | For the Long Convolution Sequence Model (LCSM), a basic SISO (Single Input Single Output) unit is abstracted, with input denoted y and output denoted z. Based on relaxed polynomial interpolation, each layer appears to perform a tiling operation on z (without needing to wait for z to "bubble up"), allowing a high degree of parallelism between layers. A quasi-linear-complexity inference algorithm is proposed. The paper's formulas are worth working through by hand. |
Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models | It could be a useful tool, but it may also be unreliable. It evaluates the quality of LLMs' multilingual tokenizers based on sentence-level keyword segmentation quality. Several LLMs perform worse than expected, which is worth noting. |
Improving Instruction-Following in Language Models through Activation Steering | This idea was likely mentioned in the paper summary shared a few days ago: using vector representations of activation differences to achieve model alignment (a minimal sketch follows this table). The paper ablates the guidance weights, steers with multiple instructions simultaneously, and evaluates how well the steering vectors transfer. I believe the most noteworthy aspect here is the multi-instruction steering method. Its effectiveness may suggest that the inability of open-source models to follow fine-grained instructions stems from those fine-grained instruction signals failing to activate the relevant internal mode vectors within the model. This points to two possibilities: either similar formats were not learned well during pre-training, or the model lost this capability during supervised fine-tuning (SFT) elicitation. This could be validated by constructing targeted ICL (In-Context Learning) queries: decouple the format from the content and check whether the model can correctly apply the required format to the data. |
Language Model Preference Evaluation with Multiple Weak Evaluators | A collection of weak evaluators forms a partial order to obtain preferences with relatively high confidence. Around six months ago, we created a similar synthetic dataset and observed highly consistent findings. Our setup at the time was embarrassingly naive: models simply rated different responses. Yet although different models were used, as long as any one model rated a pair of responses with a score difference exceeding two points, the agreement between models was very high, much higher than the agreement between human annotators. However, when we later used such data to train a reward model (RM), we observed mode collapse. There may be some underlying factor here, but it requires deeper investigation, mainly to understand how this mode collapse arises. |
VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models | It seems to be a benchmark highly relevant to user experience, examining subtle differences in the tone of model outputs. |
TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees | This paper is not particularly deep, but it triggered a thought from some time ago that might be worth recording. Recently, I shared an idea that, in most casesāespecially in mathematics and codingāthe reasoning process is not a single path or a tree structure. Instead, its true form is a directed acyclic graph (DAG), where many conclusions and chains of thought (CoT) are on the same level without a partial ordering between them. The data formed by Monte Carlo Tree Search (MCTS) and imitation learning based on CoT often leads the model to learn a partial order. Another insight that I strongly believe in is that, unlike in games like Dota or Go, learning from partially ordered data poses a problem. In these games, the search space is large, and the positive sample space is also extensive, allowing the model to learn a relatively unrestrictive positive distribution with a balanced but limited amount of data. However, I believe reasoning in text does not work this way. If we define effective steps as positive cases, then the positive distribution in reasoning tasks is actually much narrower than in typical RL scenarios. This narrow distribution, combined with only a subset of partial ordering and a very limited number of negative cases, can easily cause the model to learn incorrectly. I think this is an important idea: reasoning should be restored to its original correct form as a directed acyclic graph. Then, we can create a CriticGPT, generating incorrect cases based on the DAG and using the correct DAG as positive cases to optimize through reinforcement learning (RL). |
RecurFormer: Not All Transformer Heads Need Self-Attention | A tradeoff was made between PyramidInfer and vllm, where the KV-Cache was retained, but attention was replaced with Mamba. The defined RR and RA-I metrics are quite interesting, and I recommend reading it. The definition of the k-threshold in the text subtly implies an assumption that the attention distribution is actually block-structured, which I find rather insightful. |
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures | Recommended reading: This is a general method, and it is very solid. The article mentions that the image2text module and Vision Arena achieved over 95% consistency, so it is suggested to follow up. |
Hypothesis Testing the Circuit Hypothesis in LLMs | This is a valuable paper, but it still reflects several core issues in current Mechanistic Interpretability that prevent it from directly yielding conclusions for pretraining. Specifically: 1. Insufficient extremity: the synthetic data is not a strictly complete finite automaton rich in side information. In fact, leaving aside the results they obtained, even the authors themselves do not know what the ideal situation should be (this paper addresses part of this, but the approach is overly simplistic). 2. The authors have not conducted pretraining on their own data and are therefore unsure how to capture the characteristics of textual data. Here are the three aspects that, in my view, matter most: A. Partial observability: synthetic data is inherently more conducive to constructing controlled environments with partial observability, allowing assessment of the scale of partial observability the model can learn. B. Data correlation: synthetic data makes it easier to model strong correlations between data points. C. Controllable incompleteness: synthetic data naturally allows for introducing controlled noise, such as ambiguous operations (symbols), underlying ambiguities, etc., making it possible to systematically determine the effects of various types of toxic data. |
Retrospective Learning from Interactions | While the datasets and methods are somewhat simplified, this work addresses an important issue: how to learn subtle implicit signals across multiple rounds of interaction. The limited availability of high-quality, annotated multi-turn interaction data likely makes this a challenging area to explore in depth, as current resources remain scarce. This type of research may, in the future, largely be conducted by organizations with extensive resources, given the practical challenges. Nonetheless, this is a highly relevant area, as feedback in language is inherently vague. |
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding | Strongly support and appreciate the direction of multi-token prediction. The principle here is that NTP and SFT perform imitation learning, where human output logic is emulated, rather than aiming for data compression. From a perspective of more accurately emulating human output, structuring output in a way closer to natural human language (outputting coherent thought segments rather than word-by-word) aligns better with the nature of imitation learning. Multi-token prediction approximates this, while simply scaling RL on LLMs may not achieve a strategy improvement similar to AlphaGo's. Though results may not yet fully achieve that aim, there's still considerable potential. |
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | DPSK's 4o framework appears flexible; however, it does not introduce particularly new content. |
Optimal Quantization for Matrix Multiplication | |
Looking Inward: Language Models Can Learn About Themselves by Introspection | Explores the idea that models possess introspective capabilities, potentially recognizing what they do or do not know. While this type of research leans toward a more speculative approach, the concept of self-prediction training presents intriguing design elements. Its method of comparing a model's predictions, including using 4o to anticipate its own behavior and testing for robustness, makes it an interesting read. |
Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models | A highly valuable paper worth reading, as it breaks down evaluation into individual skill levels, expanding assessments beyond broad benchmarks to examine specific skill performance. This approach is highly intuitive. |
Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation | Strongly support this work. It uses Chain of Embeddings (CoE) representations across latent space to measure model capabilities. The method is excellent as it can be applied in pretraining and potentially simplified. Validating the consistency of Agentic Workflow connections in pretraining hints at an underlying multi-hop reasoning capacity, making this a very meaningful approach. |
Improving Discrete Optimization Via Decoupled Straight-Through Gumbel-Softmax | |
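
A minimal sketch of the two FIM orderings discussed for aiXcoder-7B above, mixed 50/50. The sentinel token names (`<fim_prefix>`, etc.) are assumptions for illustration, not necessarily aiXcoder's actual special tokens.

```python
import random

def make_fim_example(code: str, rng: random.Random) -> str:
    """Split a document into (prefix, middle, suffix) and emit either the
    PSM ({prefix; suffix; middle}) or SPM ({suffix; prefix; middle}) order."""
    i, j = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    if rng.random() < 0.5:  # PSM ordering, half of the training data
        return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
    else:                   # SPM ordering, the other half
        return f"<fim_suffix>{suffix}<fim_prefix>{prefix}<fim_middle>{middle}"

rng = random.Random(0)
print(make_fim_example("def add(a, b):\n    return a + b\n", rng))
```
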
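For the rare-output estimation paper above, a toy sketch of why importance sampling beats naive sampling: draws from a proposal that upweights the rare region are reweighted by p/q, giving an unbiased estimate with far lower variance. The distributions here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.98, 0.015, 0.004999, 0.000001])  # "model" distribution
q = np.array([0.25, 0.25, 0.25, 0.25])           # proposal: uniform over outputs
rare = 3                                          # index of the rare output
n = 10_000

# Naive Monte Carlo: with p(rare) = 1e-6, n draws almost never hit it.
naive = rng.choice(4, size=n, p=p)
naive_est = np.mean(naive == rare)

# Importance sampling: sample from q, reweight each draw by p(x)/q(x).
draws = rng.choice(4, size=n, p=q)
weights = p[draws] / q[draws]
is_est = np.mean(weights * (draws == rare))

print(f"true={p[rare]:.2e}  naive={naive_est:.2e}  importance={is_est:.2e}")
```
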
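For the activation-steering paper above, a minimal PyTorch sketch of the recipe as I understand it: compute a steering vector as the mean residual-stream difference between prompts with and without an instruction, then add it back (scaled) at inference. The hook point and model interface are placeholders, not the paper's exact setup.

```python
import torch

def steering_vector(model, layer, with_instr, without_instr):
    """Mean activation difference at the last token position, over prompt pairs."""
    acts = {}
    def grab(_module, _inputs, output):
        acts["h"] = output[0] if isinstance(output, tuple) else output
    handle = layer.register_forward_hook(grab)
    diffs = []
    with torch.no_grad():
        for a, b in zip(with_instr, without_instr):
            model(a)
            h_a = acts["h"][:, -1, :].clone()
            model(b)
            h_b = acts["h"][:, -1, :].clone()
            diffs.append(h_a - h_b)
    handle.remove()
    return torch.stack(diffs).mean(dim=0)

def apply_steering(layer, vec, weight=1.0):
    """Add the scaled steering vector to the layer's output at every position.
    Summing several vectors would steer multiple instructions at once."""
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):
            return (output[0] + weight * vec, *output[1:])
        return output + weight * vec
    return layer.register_forward_hook(hook)  # call .remove() to stop steering
```
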
Paper | Comments |
---|---|
OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities | The author emphasizes the importance of effective evaluation data with key characteristics: 1) Avoid coupling between capabilities during model evaluation, focusing instead on areas for architectural or data improvements (e.g., OmnixR's tri-modal understanding-and-reasoning coupling, and MMLU-Pro's intensive coupling of atomic computation/reasoning and knowledge, which may limit optimization). 2) Evaluations should target common, universal issues, including those likely to arise in the future. 3) Cognitive science concepts are essential in shaping effective evaluations. 4) Evaluations should serve as tools to identify issues and encourage improvement, rather than becoming long-term benchmarks cited indefinitely. |
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance | The author suggests the direction of prompting LLMs to actively seek information and create workflows. Once the synthetic environment is robust, workflows with real-world applicability could emerge. However, reinforcement learning (RL) may not be necessary at this stage, as developing reusable workflows is already valuable. |
PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking | This is a discussion of GPT-O1 systems by an MIT PI in material science, including insightful elements like iterative ORPO, dynamic knowledge graphs, and "thinking tokens" that function similarly to discourse markers in O1 systems, offering interesting perspectives overall. |
Revealing the Barriers of Language Agents in Planning | Blocksworld and TravelPlanner datasets are valuable. The commenter concurs with the key finding that open-source models tend to follow instructions on a coarse level, often struggling with complex combinations of instructions. This highlights a needed area of improvement: the capability to disassemble and follow complex, combinatory instructions. The definitions of episodic and parametric memory are also considered interesting. |
JudgeBench: A Benchmark for Evaluating LLM-based Judges | This benchmark, focusing on factuality and complex reasoning, serves as a meaningful complement to RewardBench in analyzing reward model effectiveness. |
DDIL: Improved Diffusion Distillation With Imitation Learning | - |
MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router | - |
Open Domain Question Answering with Conflicting Contexts | Presents a valuable reasoning benchmark that incorporates conflicting context. |
Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Observes that many tasks in MMMU do not require visual world knowledge, underscoring the importance of discerning whether visual inputs and spatial awareness are essential when designing benchmarks. Suggests categorizing tasks as those not needing visuals, those needing visuals but not spatial reasoning, and those requiring spatial reasoning. |
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | - |
Is Complex Query Answering Really Complex? | Differentiating between partial and full reasoning queries provides significant insights into long CoT problems, where the challenge lies in distinguishing actual reasoning from memorized content. This insight could benefit benchmarks such as HotpotQA. |
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines | A multilingual and multimodal benchmark. |
A Scalable Communication Protocol for Networks of Large Language Models | The central insight is that frequent communications should use traditional protocols, infrequent ones structured data, and rare communications natural language (a toy dispatch sketch follows this table). This is a sound approach: not all agents need an active role in guiding other agents, and many agent papers overemphasize this need, creating a divide between sci-fi and practical application. |
Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information | This work may contribute to controlling the multi-task SFT data distribution. |
Counterfactual Generative Modeling with Variational Causal Inference | - |
Exploring Model Kinship for Merging Large Language Models | Introduces an intriguing concept of model kinship for merging models. Although currently small in scope, it has significant potential, especially in exploring static patterns in heads and their semantic similarities across models. The concept's broader scope is worth further investigation. |
Rethinking Visual Counterfactual Explanations Through Region Constraint | - |
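
For the communication-protocol paper above, a toy sketch of the tiered idea: route each message by how often its channel is used. The thresholds and message shapes are assumptions for illustration, not the paper's protocol.

```python
import json
from dataclasses import dataclass

@dataclass
class Channel:
    calls_per_hour: float

def encode(channel: Channel, payload: dict) -> str:
    """Pick an encoding tier based on how frequently this channel is used."""
    if channel.calls_per_hour > 100:      # frequent: fixed traditional protocol
        return f"RPC|{payload['op']}|{','.join(payload['args'])}"
    elif channel.calls_per_hour > 1:      # infrequent: structured data
        return json.dumps(payload)
    else:                                 # rare: natural language for an LLM to parse
        return f"Please perform '{payload['op']}' with arguments {payload['args']}."

print(encode(Channel(500), {"op": "get_price", "args": ["AAPL"]}))
print(encode(Channel(0.1), {"op": "negotiate_contract", "args": ["draft_v2"]}))
```
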
Paper | Comments |
---|---|
TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs | TMGBench includes 144 types of games based on the Robinson-Goforth topology, with each type containing multiple instances. These games can be further organized into sequential, parallel, and nested complex forms. Evaluation metrics designed for these games reflect dynamic and scalable assessments of fluid intelligence, highlighting significant gaps between open-source models and models like Claude and 4o. It is recommended to emphasize the importance of dynamic pattern composition capabilities. |
STACKFEED: Structured Textual Actor-Critic Knowledge Base Editing with FeedBack | This paper presents an interesting reversal of thinking by involving the modification of Knowledge Bases using Actor-Critic approaches, which seems quite reasonable. There appears to be a strong need for such a reverse thought process. |
Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning | This work from Tencent employs OpenCodeInterpreter and Self-Improvement for mathematical applications. While it may seem unremarkable, many methods currently incorporate strong verifiers to support Chain of Thought (CoT). |
AFlow: Automating Agentic Workflow Generation | The author expresses a personal disinterest in the direction of a single agent workflow solving all problems; however, this paper feels like an exception and is quite fundamental. Its theoretical value for frameworks like Coze and Dify is high. It raises an interesting abstraction: if a dedicated workflow definer could generate numerous sequential workflow descriptions from similar inputs and outputs, could it effectively initialize potential workflows? Recommended for reading and following up. |
VideoAgent: Self-Improving Video Generation | This work focuses on self-improving video generation based on external feedback. While it appears to be somewhat effective, it has only been tested in robotic scenarios using datasets such as MetaWorld, iTHOR, and BridgeData V2. The author's familiarity with the field is limited, making it difficult to determine if the title may be somewhat overstated, as it inevitably evokes thoughts of AIGC. |
OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models | This work is framework-oriented. The author hasn't yet reviewed the repository but has read the paper introduction, which suggests comprehensive and solid support. The PRM accommodates both final-step-score and overall-score settings, while the selection strategies support majority vote and maximum score (a minimal sketch follows this table). The formalization of the problem is quite generalizable. Overall it is well summarized, although details of construction and testing at this early stage seem less meaningful. |
Alignment Between the Decision-Making Logic of LLMs and Human Cognition: A Case Study on Legal LLMs | This appears to be a framework that could extrapolate to evaluating the quality of conventional multi-step Chain of Thought (CoT) generation, defining unrelated tokens, erroneous tokens, and correct tokens. It may be beneficial to refine the granularity of analysis along LLM-as-a-judge principles. The interaction-related part of the framework seems superfluous; some of its steps and standards extend beyond standard CoT into specialized evaluation frameworks. |
Mechanistic Interpretability for AI Safety: A Review | The paper offers a basic overview of mechanistic interpretability concepts and history, which may be of interest to some readers. |
Zero-shot Commonsense Reasoning over Machine Imagination | This study generates text QA pairs from a knowledge base and uses text-to-image models to create corresponding images, forming a Visual Question Answering (VQA) dataset that includes text, answer options, and images. An intern of ours did similar work, so there is some sense of being scooped: the intern employed a dual-tower model that aligns text with CLIP embeddings, ultimately creating synthetic image descriptions to match image embeddings. While the idea may seem inexpensive, it does demonstrate an enhancement in commonsense reasoning ability, which is significant. |
Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation | This concept, resembling a sci-fi novel, is quite enjoyable. It involves generating a digital representation for each researcher and simulating how different researchers might collaborate on research. There is curiosity about how the digital representation of one's own researcher persona would appear and how it might analyze the contributions of specific agents. |
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer | This autoregressive diffusion-based image generation model from MIT is highly recommended for thorough reading. The author hasn't had sufficient time yet and intends to mark it for future review. |
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions | Congratulations on this valuable synthetic long video generation dataset! |
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | This benchmark is highly valuable for video understanding. The only concern is brevity: "long" videos are defined as under 10 minutes. A significant takeaway is that model performance tends to saturate between 8 and 16 frames. It is recommended for reading. |
Introducing an Improved Information-Theoretic Measure of Predictive Uncertainty | The author has not yet had time to thoroughly examine the formulas but intends to mark this work for future reference. A preliminary check of the transformations between Equation 1 and Equation 5 did not reveal any immediate issues. This type of metric could play a crucial role in enhancing the data efficiency of preference data, although no solid ideas have emerged yet. |
When Attention Sink Emerges in Language Models: An Empirical View | This paper is highly recommended for reading. Aside from a possible size limitation, the experiments are extensive and rigorous. The author plans to conduct a detailed study tomorrow but notes several key takeaways: (1) weight decay encourages attention sinks; (2) the larger the training data, the more pronounced the model's attention sink behavior; (3) random and repeated sequences significantly impact the emergence of attention sinks. |
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models | This approach is simple and effective, utilizing residual connections and introducing decoupled high-resolution adaptations to address reconstruction accuracy issues. The work appears to widen the information bottleneck. After a long time, it has been a refreshing experience for the author, who has a background in NLP, to read a CV-oriented solution in detail. It is highly recommended for reading. |
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators | This training-free model compression method does not utilize calibration data, focusing instead on weight blocks and incorporating bypass parameters for reconstruction. This approach seems to align well with the idea of compressing fine-grained MoE (Mixture of Experts) strategies. The author intends to review it in detail tomorrow. |
Adapt-∞: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection | This data selection method can also be applied to text SFT (Supervised Fine-Tuning). Essentially, it proposes a pool of SFT data to measure distribution, with new data added continuously and quality selections made based on the distribution. This appears to be a promising system-level demo, although the execution seems rather cursory. It is recommended for reading. |
BookWorm: A Dataset for Character Description and Analysis | This dataset may be quite important for role-playing, as it contains several reliable character descriptions and in-depth analyses. |
Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective | This new benchmark for text-to-image synthesis measures the causal impact of semantic variations on models. The design seems clever, with a significant takeaway that cross-modal alignment plays a key role in UNet or Transformers, indicating that the capabilities of text encoders are not the sole determining factor. |
Predicting from Strings: Language Model Embeddings for Bayesian Optimization | This work from GDM is quite exploratory in nature and seems to stem from an environment that encourages such endeavors. It embeds experimental inputs as feature vectors using language models and applies them in in-context regression models. By pre-training a transformer-based regression model on extensive offline evaluation data, it achieves uncertainty-aware numerical predictions for new objective functions. |
LOBG: Less Overfitting for Better Generalization in Vision-Language Model | This research improves the generalization capabilities of vision-language models by filtering out irrelevant fine-grained information and maintaining structural topology and hierarchical logic during distillation. There has been a recent increase in studies of this nature. If MLLMs are to mirror human visual perception of the external world, redundant information must be sifted out purposefully. The methods in this article seem rather basic, though, and the work lacks depth. |
α-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs | The author has not yet had the opportunity to derive the formulas, but intends to mark this for future formula practice. |
FormalAlign: Automated Alignment Evaluation for Autoformalization | The issues addressed in this paper are significant. The challenges faced in Lean arise primarily from inadequate formalization. The metrics used are not sufficiently persuasive; the cross-entropy approach appears somewhat arbitrary. The difficulties in Lean translation stem from grammatical details, and semantic analysis could reveal the oddity of certain outcomes. |
Scalable Multi-Domain Adaptation of Language Models using Modular Experts | A recommendation for an early work called Deep-ICL, which has room for improvement. Although the initial approach was not very strong, the fundamental ideas align closely with this direction. It might be among the earliest papers in this area. The proposed method involves using a backbone to integrate several trained expert modules that can be activated based on the input information, along with training a routing system for unseen tasks. This direction may be worth researching for edge applications. |
Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code | This benchmark addresses a valuable problem and format. An important takeaway is that "keywords, identifiers, and type identifiers are the most prone to hallucinations." |
Retrieval Instead of Fine-tuning: A Retrieval-based Parameter Ensemble for Zero-shot Learning | This is another paper similar to the previous one. |
Gradient-Free Neural Network Training on the Edge | The gradient-free training approach replaces the gradient with an intermediate tensor that decides whether to flip the relevant nodes. Further detailed study is planned for tomorrow. |
MoIN: Mixture of Introvert Experts to Upcycle an LLM | This paper is also similar to the previous one. |
Can In-context Learning Really Generalize to Out-of-distribution Tasks? | The experiments are quite solid, but the conclusions are similar to those in Jiaoda Li's paper, which also utilized GPT-2. Beyond the unnecessary mathematical details, the main points are: 1. ICL is fundamentally about retrieving the most relevant implicit functions (or patterns) learned during pretraining to solve problems. 2. Learning new input-output mappings is quite challenging. This paper does not extend to cases with multiple pattern combinations or analyze the data intensity required for pattern learning, which is a significant challenge in this direction. |
Reconstructive Visual Instruction Tuning | By introducing reconstructive visual instruction tuning, the fine-grained understanding of LMMs has been significantly enhanced, and hallucination phenomena have been reduced. The focus seems to be on avoiding the introduction of excessive low-level visual information. Balancing this aspect appears to be a delicate matter, bordering on the philosophical. This can be viewed alongside the earlier paper. |
Boosting Deductive Reasoning with Step Signals In RLHF | This paper has not yet been reviewed but is noted for future reference. |
MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models | This paper is forwarded to Yang Guang and Peng Tao for consideration. It serves as a potential benchmark for evaluating fluid capabilities. The key takeaway aligns with the previous paper, highlighting that LLMs essentially learn a set of patterns and perform pattern retrieval and scheduling during inference. |
Fine-grained Attention I/O Complexity: Comprehensive Analysis for Backward Passes | This paper analyzes the I/O complexity of attention mechanisms, indicating it is worth studying further. Noted for future reference. |
SeRA: Self-Reviewing and Alignment of Large Language Models using Implicit Reward Margins | |
Inference and Verbalization Functions During In-Context Learning | |
Nudging: Inference-time Alignment via Model Collaboration | |
Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization | |
One Step at a Time: Combining LLMs and Static Analysis to Generate Next-Step Hints for Programming Tasks | |
The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling | |
Automated Rewards via LLM-Generated Progress Functions | |
ACER: Automatic Language Model Context Extension via Retrieval | |
REDO: Execution-Free Runtime Error Detection for COding Agents | |
Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning | |
DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models |
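
For OpenR above, a minimal sketch of the two PRM scoring settings (final-step score vs. overall score) and the two selection strategies (majority vote vs. maximum score). The types and the `prm` callable are assumptions for illustration, not OpenR's actual API.

```python
from collections import Counter
from typing import Callable, List

def select_answer(candidates: List[List[str]], answers: List[str],
                  prm: Callable[[List[str]], List[float]],
                  prm_mode: str = "last", strategy: str = "max") -> str:
    """candidates[i] is a list of reasoning steps; answers[i] is its final answer."""
    if strategy == "majority":               # majority vote over final answers
        return Counter(answers).most_common(1)[0][0]
    scores = []
    for steps in candidates:
        step_scores = prm(steps)             # one score per reasoning step
        # "last": score of the final step; otherwise: aggregate over all steps
        score = step_scores[-1] if prm_mode == "last" else sum(step_scores) / len(step_scores)
        scores.append(score)
    return answers[max(range(len(scores)), key=scores.__getitem__)]

# Usage with a dummy PRM that scores each step by its length (illustration only).
dummy_prm = lambda steps: [len(s) / 100 for s in steps]
print(select_answer([["step a", "step bb"], ["step c"]], ["42", "43"], dummy_prm))
```
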
Paper | Comments |
---|---|
Editing Massive Concepts in Text-to-Image Diffusion Models | This work is about scalable concept editing in images, but the results show the model lacks robustness. The collection of 1,000 potentially problematic concepts presents a meaningful problem space, but it remains unclear how the model applies to real-world applications. To prevent generating outdated, copyrighted, incorrect, and biased content, it is crucial to catch these errors during generation. Diffusion-based models demonstrate limited concept-level world knowledge, leading to an unsustainable pattern of continuous patching. This approach could potentially be applied to prevent copyrighted image generation, although copyrighted images should not be used in training in the first place. Additionally, the study lacks experimental validation for potential model collapse issues, suggesting room for methodological improvement. The ICEB benchmark for evaluating concept-based image editing represents a significant contribution, offering unprecedented scale in this domain. |
Promptly Yours? A Human Subject Study on Prompt Inference in AI-Generated Art | Figures 11-14 reveal significant insights: Diffusion models demonstrate limited generalization from original prompts to generated images, with both humans and AI showing inability to recall original prompts. |
KV PREDICTION FOR IMPROVED TIME TO FIRST TOKEN | One of the more interesting recent works from Apple. It proposes a small auxiliary model that processes the prompt and produces an approximation of the KV cache used by the base model (a shape-level sketch follows this table). The commenter mentions a related idea that hasn't been realized: build a small model to predict which experts an MoE will activate for the given tokens; the obstacle is the time cost of loading models and swapping them back. A potential benefit of this approach is enabling models with larger total parameter counts under memory constraints. The commenter guesses that the KVP-C and KVP-LP results suggest robust activation patterns across model sizes: models retain similar activation patterns across different sizes and pruning levels. |
UNIQ: Offline Inverse Q-learning for Avoiding Undesirable Demonstrations | Good writing, and the motivation is clear. This research addresses sparse expert data in offline imitation learning. Rather than minimizing the distance between the learning policy and expert (or near-optimal) demonstrations, as traditional imitation does, it maximizes a statistical distance between the learning policy and the undesirable policy (represented by undesirable demonstrations). The idea transfers intuitively to RLHF. |
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? | A great study on Mechanistic Interpretability using synthetic data. A recurring limitation in this research domain is the lack of rigorous data analysis, with investigations primarily focused on qualitative observations rather than quantitative measurements. The research extends previous findings on single-layer Transformers' capability to express multi-step algorithms, demonstrating that this property generalizes to multi-layer Transformer architectures. The research presents a substantial contribution to the field. However, it lacks deep insights into how multi-step algorithms manifest within model parameters. The study validates out-of-distribution (OOD) algorithmic generalization: while training employed identity covariance matrices, the looped transformers maintained low loss when tested on data with different covariance matrices, demonstrating generalization capability across varying distributions. This observation leads to a significant hypothesis regarding pre-training: assuming the model acquires numerous algorithmic circuits (considering many abstract multi-hop reasoning processes as algorithms), two key implications emerge. First, such algorithmic patterns are demonstrably learnable. Second, these patterns show potential for higher-level extrapolation when relevant parameters are activated. For instance, divide-and-conquer methodologies learned from code-related data may generalize to broader data processing tasks, provided proper attention head activation and avoidance of overfitting between attention heads and input patterns. The paper warrants careful consideration. |
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content | Potentially valuable video dataset with significant resource investment. Further investigation recommended. |
Baichuan-Omni Technical Report | The paper demonstrates Baichuan's continued focus on preliminary pre-training research, with suboptimal textual performance. However, their encoder implementations present potentially valuable research directions. The addition of a Projector following the Visual Encoder warrants investigation for its potential to enhance the efficiency and effectiveness of vision-to-semantic token conversion, though the approach appears theoretically sound. The Stage 2 implementation, incorporating synthetic QA and high-quality OCR, aligns with established practices in CLIP retraining. This methodology could prove valuable if it successfully substitutes the need for training new CLIP and caption models. However, their audio processing approach, based on early conceptualizations of audio-to-image dependency learning, raises methodological concerns. The absence of OmniBench evaluation limits comprehensive performance assessment. |
SimpleStrat: Diversifying Language Model Generation with Stratification | The study provides quantitative analysis of model response diversity through ConvergeQA, which yields an average of 28 distinct answers per query (a toy sketch of the stratification idea follows this table). This framework potentially serves as a valuable benchmark for evaluating model tendencies between deep-search patterns with crystallized responses versus exploratory behavior. The methodology presents a potentially useful observational benchmark for assessing model response diversity characteristics. |
Agents Thinking Fast and Slow: A Talker-Reasoner Architecture | While Google's Agent framework paper presents relatively conventional findings without significant metrics, it demonstrates a notable behavioral pattern where a talker directs a reasoner to verify intermediate steps. This pattern shows marked similarities to observations in recent experimental studies across HotpotQA, Collie, AIME, Usaco, and LiveCodeBench benchmarks. Collie presents an exceptional case: analysis of 50 sampled instances revealed no instances of pre-planned thought patterns. However, model behavior across the other four benchmarks consistently demonstrated either divide-and-conquer (CoT) approaches or utilization of known method matching (UKM) for solution retrieval. The most intriguing aspect of these experiments lies in the high convergence of thinking patterns within individual benchmarks. This raises questions about whether these patterns emerge from model learning or result from rigid synthetic data training. Evidence supports the latter hypothesis: 1. The frequency of divide-and-conquer approaches in USACO and LiveCodeBench significantly exceeds that of known method matching (UKM), despite template-based approaches being more common in human problem-solving. 2. High consistency of thought patterns within individual benchmarks. This observation raises the possibility that the talker-reasoner verification pattern in the current paper might represent another instance of fixed pattern generation similar to those observed in other models. |
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | A novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. |
NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models | The research investigates behavioral pattern control through attention head manipulation during inference, presenting an interesting methodological approach. However, the application to Multiple Choice Questions (MCQ) appears to be a suboptimal choice for demonstrating the method's potential. The focus on benchmark performance optimization, particularly with the notable absence of MMLU evaluation, suggests limited scope in experimental validation. |
The structure of the token space for large language models | The research proposes that token subspaces constitute a stratified manifold rather than a conventional manifold. However, the experimental validation appears insufficient to robustly support this claim. |
Towards Cross-Lingual LLM Evaluation for European Languages | A benchmark collection for minority European languages. |
CryoFM: A Flow-based Foundation Model for Cryo-EM Densities | - |
VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding | The study presents a challenging benchmark for precise video frame extraction, advancing the granularity of image description from coarse-grained to fine-grained analysis. The framework enhances task complexity by requiring precise correspondence between detailed visual elements and their descriptions. |
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning | This notable collaboration between Anthropic and Renmin University presents three significant findings: 1. Model parameter scaling does not inherently improve resistance to poisoning attacks. 2. A log-linear relationship exists between attack effectiveness and data poisoning ratio. 3. Data poisoning effects demonstrate generalization to extrapolated triggers beyond the poisoned dataset. The research merits careful consideration due to these significant implications for model security and scaling behavior. |
On the token distance modeling ability of higher RoPE attention dimension | The paper presents a highly valuable analysis of the dimensional contributions to attention heads within RoPE (Rotary Position Embedding), specifically examining their role in identifying Positional Heads. The experimental analysis yields significant insights: Figure 9 demonstrates meaningful contributions from the top 10% of heads, with their masking resulting in notably greater performance degradation than masking the top 5%. This observation calls for further investigation of head-activation sparsity during standard LLM inference, particularly whether longer text sequences require more heads to be active. The research identifies that higher-dimensional components of attention heads generally contribute more significantly to attention scores (a small numeric check of the per-dimension frequencies follows this table). Additionally, the length extrapolation methodology expands the range of the high-dimensional attention distribution. The experimental methodology presents several promising directions for further investigation. |
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression | Initial review indicates that the methodology employs attention scores for dynamic token proportion determination, utilizing only important tokens during the pre-filling phase while maintaining seamless compatibility with existing frameworks. This approach shows particular promise for Long Video Understanding applications, given the inherent redundancy in video data. The direction warrants further investigation. |
Scaling Laws for Predicting Downstream Performance in LLMs | The proposed solution lacks optimization for the practical challenge of downstream performance fitting and demonstrates insufficient understanding of downstream dataset distributions. While Formula 5.1 merits investigation, the research identifies a promising direction: as pre-training data comprehension and segmentation becomes more granular, the development of refined data mixture laws becomes crucial for experimental cost optimization. External commentary suggests that from an inference time efficiency perspective, MLP or more sophisticated regression models present superior alternatives to the current approach, though lightgbm with simulation demonstrates certain limitations. Additionally, more precise control of learning rate and data scheduler could prove valuable. Related work 'Does your data spark joy? Performance gains from domain upsampling at the end of training.' demonstrates performance improvements through increased proportion of high-quality data during the cooling phase of training. Unlike MiniCPM, this approach achieves performance gains through upsampling existing high-quality data proportions rather than introducing new data. This suggests that data mixture ratios may not be static. For those unfamiliar with data mixture laws, this paper, along with D-CPT Law, RegMix, and a recent non-LLM Amazon study, provides valuable insights into the field. |
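
For Apple's KV-prediction paper above, a shape-level PyTorch sketch of the idea as I read it: a small auxiliary model encodes the prompt cheaply, and learned per-layer projections map its hidden states to an approximation of the base model's KV cache. The module names and dimensions are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class KVPredictor(nn.Module):
    """Map a small model's hidden states to a predicted (K, V) pair per base layer."""
    def __init__(self, d_small: int, d_base: int, n_base_layers: int):
        super().__init__()
        self.to_k = nn.ModuleList(nn.Linear(d_small, d_base) for _ in range(n_base_layers))
        self.to_v = nn.ModuleList(nn.Linear(d_small, d_base) for _ in range(n_base_layers))

    def forward(self, small_hidden):          # [batch, seq, d_small]
        # The base model would start decoding from this approximate cache,
        # skipping its own expensive prefill over the prompt.
        return [(k(small_hidden), v(small_hidden))
                for k, v in zip(self.to_k, self.to_v)]

pred = KVPredictor(d_small=256, d_base=1024, n_base_layers=4)
kv = pred(torch.randn(1, 128, 256))
print(len(kv), kv[0][0].shape)                # 4 layers, torch.Size([1, 128, 1024])
```
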
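For SimpleStrat above, a toy sketch of stratified generation: pick a stratum over some dimension of the answer space first, then sample conditioned on it, so samples spread across strata instead of collapsing to the mode. The strata and the `ask` helper are hypothetical.

```python
import random

def ask(prompt: str) -> str:
    # Stand-in for an LLM call; replace with a real completion API.
    return f"<LLM answer to: {prompt!r}>"

def stratified_sample(question: str, strata: list[str], rng: random.Random) -> str:
    # Sampling the stratum uniformly decouples diversity from the model's
    # peaked answer distribution.
    stratum = rng.choice(strata)
    return ask(f"{question}\nConstrain the answer to: {stratum}")

rng = random.Random(0)
strata = ["a physicist born before 1900", "a physicist born after 1900"]
print(stratified_sample("Name a famous physicist.", strata, rng))
```
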
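For the RoPE paper above, a small numeric check of the standard per-dimension frequencies behind its framing: dimension pair i rotates at base^(-2i/d), so higher dimensions have far longer wavelengths and are the ones positioned to model long token distances.

```python
import numpy as np

d, base = 128, 10000.0                   # typical head dim and RoPE base
i = np.arange(d // 2)                    # one frequency per dimension pair
freqs = base ** (-2 * i / d)
wavelengths = 2 * np.pi / freqs          # tokens per full rotation
print(wavelengths[:2])                   # low dims: ~6-7 tokens
print(wavelengths[-2:])                  # high dims: tens of thousands of tokens
```
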
If you are interested in the work published by us, please navigate to our full paper list.