M-A-P Daily Paper

The M-A-P daily paper project curates and reviews a selection of new papers published daily on arXiv, providing insightful commentary on cutting-edge research across various scientific disciplines.

🛠️ Papers This Week

25/10/2024

Paper	Comments
Should We Really Edit Language Models? On the Evaluation of Edited Language Model	Model Editing is a field with many insightful ideas. The earliest ideas about parameter probes likely originated from researchers focused on Model Editing. Model Editing primarily emphasizes reliability, generalization ability, and locality. An interesting takeaway from this paper is that Model Editing has a broadly similar impact on benchmarks across different capability dimensions. The experiments are quite solid; recommended reading.
Multi-agent cooperation through learning-aware policy gradients	1. Introducing a meta-level allows the multi-agent problem to be transformed into a single-agent problem. 2. More "generous" agents may suffer losses; learning-aware agents can achieve cooperation by exploiting naive learners. When two learning-aware agents meet, the exploitation strategy shifts to a cooperative strategy. 3. The formulation of this POMDP is intriguing. Perhaps due to limited prior exposure to similar papers, the modeling and internal derivations seem quite decent.
Aligning CodeLLMs with Direct Preference Optimization	A study of aligning Qwen-based code language models (CodeLLMs) with the DPO algorithm; awaiting data release.
SIKeD: Self-guided Iterative Knowledge Distillation for mathematical reasoning	A self-guided iterative training method that enables a small model to learn and select reasoning strategies suitable for different tasks. To some extent, it can be considered a fusion of various exploratory reasoning paths directed by correct references.
Scaling up Masked Diffusion Models on Text	This paper explores the scalability and effectiveness of Masked Diffusion Models (MDMs) in text processing. Recently, some non-NTP-based LLM papers have made notable progress; a similar one appeared a few days ago: Future Token Prediction - Causal Language Modeling with Per-Token Semantic State Vector for Multi-Token Prediction. In the first half of the year, Kuaishou Technology conducted a solid ablation study on a token-weighted adjustment approach and observed some performance gains. There are also rumors that some researchers are working on DPSK, suggesting it's a direction worth following and merging.
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs	I appreciate the Skywork-RM approach. Personal viewpoints: 1. The focus remains primarily on small datasets, which are mainly selected from the HelpSteer2, OffsetBias, WildGuardMix, and Magpie series datasets. (Personally, I have some reservations about the long-term reference value of this approach.) 2. The dataset has already been released and can be further tracked and analyzed: https://huggingface.co/collections/Skywork/skywork-reward-data-collection-66d7fda6a5098dc77035336d 3. The Bradley-Terry Model demonstrates the best performance, which is worth further investigation.
From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages	When extending to a new language, using the embeddings of an existing language to initialize the embeddings of the new language is a relatively reliable and well-validated approach. This paper takes a somewhat more elaborate approach, but the insight is similar.
Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation	A Polish LLM. When discussing MAP-Neo at ICLR, several researchers from Northern Europe asked how to build a native internet-based corpus for their own countries from scratch. The paper did not include a Hugging Face link, but it’s worth following up to see whether they share the data later.
Efficient Inference for Augmented Large Language Models	This work is dedicated to accelerating the online deployment of Tool-Augmented LLMs, which is a highly valuable issue. In terms of the approach, it seems worthwhile to adopt a model or dictionary to predict, in a fixed format, the possible token count or time required by certain tools. This estimation can then be used to achieve system-level scheduling, making it a feature that can be implemented with minimal cost. It is recommended to follow up on this and deploy it online, as it sounds like there’s little downside to doing it well. It’s worth noting that the paper uses the OPT model, which I personally think can be ignored.
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment	Using gzip compression to directly measure the alignment between potential training data and the target task distribution places greater emphasis on syntactic and structural patterns related to the target task compared to n-gram methods. This characteristic may be advantageous in certain contexts.
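
The compression-based alignment idea in the ZIP-FIT entry can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the normalized compression distance (NCD) as the alignment score, and the helper names `compressed_size`, `ncd`, and `rank_by_alignment` are hypothetical.

```python
import gzip


def compressed_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance; lower means more shared structure."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def rank_by_alignment(candidates, target_examples):
    """Sort candidates by mean NCD to the target examples, most aligned first."""
    def mean_distance(candidate):
        total = sum(ncd(candidate, t) for t in target_examples)
        return total / len(target_examples)

    return sorted(candidates, key=mean_distance)
```

Because gzip rewards repeated byte sequences, candidates that share syntax and layout with the target examples (for instance code versus prose) tend to score as closer, which matches the emphasis on syntactic and structural patterns noted above.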

24/10/2024

Paper	Comments
Bayesian scaling laws for in-context learning	While the experiments are relatively simple with limited insights, the mathematical formulation has some elegant aspects. The decoupling of task sets from finite alphabet sets of symbols provides valuable insight. The study suggests that even LLMs may not converge to a single partially observable formal-language automaton; there could be multiple ones. The assumption that the probabilities of future symbols depend entirely on task posteriors is reasonable in extreme cases. However, the experimental comparison between base and instruct models lacks robustness.
A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration	The paper formalizes the definition of Coherent CoT, which considers all previous reasoning steps in each step. Considering o1 can achieve extreme optimization for Coherent CoT, this represents a valuable optimization target. However, both the theoretical proofs and mechanism comparisons with conventional CoT lack substantial foundation.
Enhancing Two-Player Performance Through Single-Player Knowledge Transfer: An Empirical Study on Atari 2600 Games	The research addresses an intriguing question about adapting single-player policies to two-player environments in the same game scenario. While findings show successful adaptation and reduced total runtime, the analysis lacks depth. The paper would benefit from examining behavior changes learned from the policy, particularly regarding cooperation, interference, and independent operation patterns.
Semantic-guided Search for Efficient Program Repair with Large Language Models	This work presents an alternative to beam search using speculative decoding. The debug datasets Defects4J and HumanEval-Java warrant further investigation for potential integration into sandbox environments and comprehensive code model evaluation, particularly regarding vulnerability detection and thorough debugging tests.
The Scene Language: Representing Scenes with Programs, Words, and Embeddings	The paper presents a valuable perspective on image cognition, conceptualizing images as comprising structure, entities, and visual details. This hierarchical approach to modeling relationships between main entities in images, with visual details bound to entities, could potentially enable a context-aware, efficient information extraction scheme for image encoders.
Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes	While the methodology appears trivial, the research question holds significant value. Although the commenter prefers SAE methods, current probing-based methods remain inefficient. A better story could be improving how weights are located during the forward pass under low-resource constraints. The paper employs weight parameters based on binary classification activation multiplied by absolute values to determine the top-k relevant parameters. (In fact, it would be practical for merely identifying valuable heads without requiring detailed understanding of each head's function.)
Markov Chain of Thought for Efficient Mathematical Reasoning	The paper simplifies CoT to a Markov chain and ensures reasoning accuracy through interaction with CodeInterpreter. The MCoTInstruct dataset, developed by the Qwen team, warrants further exploration.
ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference	The paper presents a model-predicted routing system for expert caching, essentially a MoE version of cache-optimized Speculative Decoding. While computationally intensive, this approach may hold value for edge computing applications, though further expertise in edge deployment is needed for comprehensive evaluation.
SPikE-SSM: A Sparse, Precise, and Efficient Spiking State Space Model for Long Sequences Learning	-
In Context Learning and Reasoning for Symbolic Regression with Large Language Models	The paper explores using LLMs for symbolic regression of scientific formulas, employing constant guessing and function form prediction with Scipy for fitting. Despite simple experiments and questionable ablation studies, the direction merits attention, particularly regarding LLMs' potential to better predict formula patterns for parameter fitting - essentially an enhanced PINN with LLM+tool integration.
Beware of Calibration Data for Pruning Large Language Models	The study reveals that calibration data quality is not the primary factor affecting pruning performance. Instead, similarity between calibration and training data shows greater impact. Suggests considering a "self-generate and sample" strategy, using LLMs to create synthetic calibration data similar to training distribution.
Understanding Layer Significance in LLM Alignment	Proposes learning binary masks for increment weight matrices in LoRA to indicate layer importance during instruction tuning. Figure 2 notably shows crucial layers for instruction fine-tuning concentrated in deeper layers. Findings suggest consistent important layer distribution across datasets, with potential consistency across models. The approach improves model performance and alignment efficiency through selective layer fine-tuning. This increasingly verified understanding among the community warrants further investigation.
Cross-lingual Transfer of Reward Models in Multilingual Alignment	English reward models best maintain initial multilingual LLM representation diversity, while non-English models tend toward stronger representation collapse. This suggests maintaining partial English performance is crucial for cross-lingual/task transfer.
Beyond position: how rotary embeddings shape representations and memory in autoregressive transformers	While well-written, the paper analyzes the RoPE mechanism, which fundamentally remains an engineering trick to reduce "redundant information interference" during training. This information bottleneck primarily optimizes denoising of long-range one-to-one/one-to-many dependencies. The approach may not benefit hierarchical dependency learning effectively, particularly in complex scenarios (e.g., generating a summary in a customized order from multiple contexts). The core limitation lies in the lack of true long-range supervision, with RoPE merely reducing learning difficulty rather than addressing fundamental challenges.
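
For readers less familiar with the RoPE mechanism discussed in the last entry, the sketch below shows the standard rotary formulation in plain Python. It is illustrative only and not taken from the paper; the function names `rope` and `dot` and the default `base` of 10000 are assumptions. It rotates each consecutive pair of dimensions by an angle proportional to the token position, so the query-key dot product depends only on the relative offset between positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair i/2 rotates with frequency base^(-i/d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative offset (2 positions apart), different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
```

The two scores agree up to floating-point error, which is the sense in which RoPE encodes relative rather than absolute position.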

23/10/2024

Paper	Comments
Baichuan Alignment Technical Report	Methodologically, the report contains limited new information, primarily reiterating known concepts. It shows a particular inclination towards Model Merging, which might hold value for further SFT-based explorations. Personally, I also support Model Merging after SFT on multiple models, as some papers have pointed out that it activates different heads within the model. Since sparse activation of heads follows different instructions and patterns, merging multiple models focused on various domains could potentially enhance instruction-following coverage across domains. Exploring domain-specific fine-tuning (purely on the SFT level) and subsequent merging is worth pursuing, and similar strategies were employed in deepseek-2.5. The three benchmarks in this report provide some insights: 1. CFBench indicates that Baichuan also recognizes the issue of following composite instructions, similar to work like Collie by Professor Shunyu Yao, and OAI’s release, which strictly follows JSONL formats, suggesting that this direction is worth further investigation. 2. SysBench studies the impact of system messages, offering a unique perspective that could be practically beneficial for deployment analysis. 3. FB-Bench focuses on multi-turn context understanding capabilities.
OAI - Yang Song’s Consistency Model Topic [Personal post-class assignment, marking for later review]: Improved Techniques for Training Consistency Models; Diffusion Posterior Sampling for Linear Inverse Problem Solving: A Filtering Perspective; Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing; Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Creativity in AI: Progresses and Challenges	Creative writing was once a very popular topic within NLP, but the shift towards LLMs has somewhat diminished interest in this direction. I still remember a few impressive works from recent years, such as Meta’s interactive playwright project, which was genuinely impactful when I first encountered it. This raises a few ideas: 1. Creative writing could be an important application for us, and given the extensive datasets on figurative language and novel writing, it may be worthwhile to review and organize these resources systematically. 2. The paper presents a fairly comprehensive classification of evaluation criteria, which could serve as a direct takeaway: [interestingness, coherence, relevance, human-likeness, fluency, flexibility, originality, elaboration]. 3. Personally, I am very supportive of immersive experiences in role-playing and narrative settings. The paper does not cover any related work, suggesting this area remains largely unexplored.
Few-shot In-Context Preference Learning Using Large Language Models	I am highly optimistic about this direction, where LLMs are used for large-scale applications such as guideline generation and feedback summarization through human annotation. Whether forming candidate reward functions, providing critical questioning, or helping humans with error correction and detail highlighting, there is much to explore and valuable experience to share.
SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning	An MCTS+LLM-driven AutoML framework. As a preview, we will release a similar, flexible, and extensible AutoML framework later this weekend or early next week. Our focus is on reliable submissions, achieving a one-time submission success rate of approximately 90% across various tabular tasks, generally performing at or above the middle level. The framework’s flexibility, achieved through decoupling functions and library support, allows for strong extensibility.
Non-myopic Generation of Language Model for Reasoning and Planning	An insightful paper from Professor Junxian He, whose work is consistently refreshing. The insight here lies in the model’s inherent short-sighted prediction range, which I believe is an inevitable aspect of NTP, though not everyone may agree. This short-sighted range often leads the model to deviate from the globally optimal sequence. Designing a sampling method to estimate an optimal future distribution could enable non-myopic planning. It’s like experiencing an auditory revelation when reading He’s work, reminiscent of “hearing celestial music that briefly clears the mind.” However, I feel this approach could serve broader purposes and more impactful outcomes. A small model naturally possesses this short-sighted range, which could be used to parse pre-training data by leveraging this range—short-sightedness...
Conjuring Semantic Similarity	An imaginative approach that measures textual consistency based on semantic consistency between images evoked by text.
Large Body Language Models	Recently observed a paper on sign language generation by Microsoft, suggesting potential technical breakthroughs in this field, likely driven by the emergence of large-scale datasets. From a basic understanding, the models appear to outperform previous ones, and the methodologies have modernized significantly.
Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective	Not a particularly strong paper, but offers a vague insight: preference data from dialogues could yield more DPO benefits than instruction-based preference data. This might be worth verifying and reflecting on to see if it holds true, and why.
DEAN: Deactivating the Coupled Neurons to Mitigate Fairness-Privacy Conflicts in Large Language Models	There has been a surge of papers on directly controlling activation for managing instruction following and other behaviors, and this one suggests this technique may soon see a significant advancement. This paper specifically addresses the mitigation of fairness-privacy conflicts in LLMs by deactivating neurons linked to both fairness and privacy. It aligns with a recent debate on whether deception could simply be managed as one or more heads within an LLM. Personally, I believe it is precisely that—deception across different scenarios might involve multiple heads, but ultimately, heads remain the core component. Here, fairness and privacy also represent complex definitions that likely correspond to specific heads.
Influential Language Data Selection via Gradient Trajectory Pursuit	Recently saw a Stanford paper critiquing gradient-based data selection, which was humorous and refreshing—GDM responded by publishing a rationale for it, akin to the XLNet-RoBERTa response cycle. Frankly, that earlier paper had a mismatched analysis—the method didn’t align with the stated goals. It mainly highlighted that, beyond reaching a certain dataset quality level, overall distribution is more significant than isolated data point impacts, which the previous paper failed to address comprehensively.
Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models	Highlights a non-robust scenario in MLLMs where sensitivity arises from changing input order, which is natural given the common MLLM training approaches (especially SFT). Addressing this issue doesn’t seem overly complex.
GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks	A rather curious topic—an LLM focused on geospatial code generation. Amusingly, thought it might contain some novel test cases, but upon prompting from another researcher, learned they were manually assessed. An intriguing paper. Their SFT datasets might be worth reviewing for potential expansion into additional tool use scenarios.
Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards	Uses PPO to train an LLM on a Ludii game synthesis task. Well-written, with conclusions showing poor generalization on unseen tasks based on known definitions, which was somewhat expected given the model's small size and limitations. Still, worth reading: a well-executed RL project with clear writing.
A Simple Model of Inference Scaling Laws	-
Pantograph: A Machine-to-Machine Interaction Interface for Advanced Theorem Proving, High Level Reasoning, and Data Extraction in Lean 4	A strong framework-level work, leveraging MCTS for proof search while supporting high-level reasoning steps and straightforward strategy function definitions. Highly recommend future prover development to adopt this to improve data sampling efficiency.
Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge	Mathematically elegant definition showing that model "unlearning" might not equate to actual unlearning but rather concealing knowledge. An intriguing paper, highly recommended for reading; Suhang Wang and Wenpeng Yin bring notable insight here.

22/10/2024

Paper	Comments
SMART: Self-learning Meta-strategy Agent for Reasoning Tasks	This paper introduces an interestingly defined environment and learning objective, aiming for a language model (LM) to learn and choose optimal strategies on the first attempt. It models this process as a Markov Decision Process (MDP) and uses reinforcement learning (RL) for training. The approach to environment definition and learning process is conceptually sound; however, the paper seems primarily to establish a foundational position in the area. The classification of “thoughts” is overly simplistic, consisting of only three categories, and the study lacks exploration into thought hierarchy, particularly the definition of fine-grained thoughts. This aspect suggests potential for further refinement and follow-up. Recommended reading.
Improve Vision Language Model Chain-of-thought Reasoning	The paper demonstrates significant gains on various Visual Question Answering (VQA) benchmarks through Chain-of-Thought (CoT) supervised training on Multimodal Language Models (MLLMs). However, it lacks additional information and currently does not appear to have released the pipeline or dataset used. Marked for future attention pending data release.
Reflection-Bench: probing AI intelligence with reflection	A potential Out-of-Distribution (OOD) benchmark candidate, this paper designs seven tasks suited for evaluating Large Language Models (LLMs). These tasks encompass Perception, Memory, Decision-Making, Prediction, Belief Updating, Counterfactual Thinking, and Meta Reflection.
Are Language Model Logits Calibrated?	The calibration definition and use of the Wasserstein distance in this paper are intriguing, with potential for further extension. Calibration here is defined as the alignment between the output probability of candidate words and their inferred relative likelihood given the context. An important takeaway is that instruction-tuned models exhibit poor calibration and higher relative entropy, with notable mode collapse. This may highlight an important issue for consideration in current model alignment efforts.
InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems	Developed by Shanghai AI Lab, this LEAN model generates proofs for each statement using a combination of best-first search and critic-guided search. In the initial phase, InternLM2-StepProver performs a quick scan to identify proofs, which are then added to the training set while resolved problems and their negative statements are removed. The paper presents solid speculative decoding optimizations and Critic Model updates, with quantitative analysis on resource evaluation yielding intriguing results. Specifically, the generated paths for correct proofs and incorporated mathematical tools tend to be shorter than most erroneous paths—a finding that, while estimated, seems promising. While recent trends have shifted focus toward O1, LEAN remains a worthwhile area, especially for its ability to leverage CPU-intensive computations and produce accurate, extended CoT outputs. Recent insights, partly from discussions with colleagues (@Zhang Yue & @Zhan Tianyang), highlight that OmegaPRM mainly supervises the initial error locations, which makes it less data-efficient compared to methods like CriticGPT due to BoN and MCTS sampling to completion. Given this, scaling CriticGPT with mathematics-focused LEAN efforts would be highly impactful.
How to Build a Pre-trained Multimodal Model for Simultaneously Chatting and Decision-making?	This paper addresses a well-defined, natural question of substantial value: building a Multimodal Language Model (MLLM) that can function as an interactive agent capable of both observation and action. The model receives two types of feedback upon processing information: (1) interaction, and (2) direct action prediction. This approach effectively integrates traditional MLLM functions with task-driven operation, though the method itself is relatively straightforward with a somewhat narrowed scope of application. Nonetheless, this direction holds great potential, suggesting a new multimodal model category where an MLLM-initialized Genie could support embodied actions alongside optional verbal interactions. The problem could be further generalized, meriting further thought.
Chasing Random: Instruction Selection Strategies Fail to Generalize	In summary, this paper suggests that current instruction selection strategies and metrics provide limited utility. Despite its backing from GDM, the approach feels somewhat shaky; the datasets chosen (e.g., FLAN, Dolly) have inherent issues that undermine generalizability. Additionally, the methods for selecting data lack detailed consideration of data distribution. This direction remains valuable as the pool of available instruction data expands, but the focus should shift from quality alone to a stronger emphasis on data distribution considerations.
Long Term Memory: The Foundation of AI Self-Evolution	This paper offers a thought-provoking theoretical experiment and a conceptual system design, though certain development directions proposed seem questionable. Two valuable ideas stand out: (1) Cognitive accumulation is crucial, though defining it as spanning the entire pre-training phase may be misguided. An internal report on Chinese-English transfer suggests that the choice of early training data is especially influential. (2) The paradigm shift from imitation learning to learning from feedback is essential. Current RLHF practices are costly and ad-hoc, with data generation potentially more expensive than model training itself. For scalable model learning, the process should ideally avoid overly expensive data labeling. Additionally, the current reward generation approach feels arbitrary, differing considerably from how rewards naturally arise in human-world interactions. While a clear alternative remains elusive, a more robust solution seems both possible and necessary.
Collaboratively Adding New Knowledge to an LLM	Key takeaway: Full-parameter fine-tuning more readily leads to catastrophic forgetting compared to LoRA, which consistently performs better across various conditions. However, the experiments are limited, making the conclusions tentative. This work, by IBM, is noted here for future reference.
DFlow: Diverse Dialogue Flow Simulation with Large Language Models	This paper proposes a method for generating diverse, multi-turn dialogues that adhere to predefined paths or trees, following task logic and constraints. This approach aims to enhance dialogue understanding capabilities and presents a scalable solution for synthetic data generation.
How to Evaluate Reward Models for RLHF	The paper introduces the valuable RewardBench framework, which merits careful analysis of its distribution.
Truncated Consistency Models	This paper on diffusion models is understood to improve generation quality by reducing the denoising task in early time steps. However, the argument against degrading to trivial functions is not entirely convincing; there is a sense that something may be sacrificed in this approach.
Lossless KV Cache Compression to 2%	This paper presents the CLA from Hunyuan.
Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning	This paper provides a theoretical analysis and empirical evidence demonstrating the suboptimality of the sequential training method using Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). It also introduces two effective joint training methods.
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation	This paper focuses on agent evaluation for smartphones, indicating a trend towards an influx of similar studies in the near future.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark	This paper introduces a multimodal mathematical reasoning benchmark, with Table 1 providing an interesting definition of image classification.
OpenMU: Your Swiss Army Knife for Music Understanding	The paper does not utilize MERT, which is disappointing.
Automated Proof Generation for Rust Code via Self-Evolution	This framework holds significant value as it aims to address the data scarcity issue in automated proof generation for Rust code. The generated data could serve as a robust corrective mechanism similar to CriticGPT, enhancing the model's error-correction capabilities. The potential for scaling CriticGPT is particularly promising. For detailed reasoning, refer to the comments on "InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems."
Pre-training Distillation for Large Language Models: A Design Space Exploration	Pre-training distillation represents a meaningful direction worth exploring. It is marked for future review.
Compute-Constrained Data Selection	This paper by Rush formalizes the data selection problem in Supervised Fine-Tuning (SFT) as a utility function that incorporates cost considerations. It seems to align with a recent trend of drawing motivations from behavioral economics and various physical sciences to address large model challenges, applying models from other disciplines to assess their effectiveness. The analysis suggests that conclusions may not favor more complex methods, arguing that perplexity or gradient information is not useful. However, this conclusion appears shaky, as it likely reflects the distinction between analyzing data distribution versus individual data quality. It seems that distribution effects are more significant, particularly in the context of pre-training.
Self-Explained Keywords Empower Large Language Models for Code Generation	The key takeaway from this paper is that large language models (LLMs) struggle to extract and interpret low-frequency keywords from problem descriptions effectively. A recent insight from reading prompt papers suggests focusing not on the methodologies employed, which often lack significance, but rather on the issues artificially highlighted by the authors. Identifying these potential problems can yield valuable insights.
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling	This paper combines Monte Carlo Tree Search (MCTS) with Best-of-N (BoN) sampling. However, the author is skeptical about this direction, as it primarily relies on BoN. The method involves maintaining a set of parent nodes during the sampling process and iteratively branching and pruning low-quality responses to reduce computational overhead. The author expresses a preference for returning to a form that reflects the actual Directed Acyclic Graph (DAG) of reasoning.
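
The branch-and-prune loop described in the TreeBoN entry can be illustrated with a toy sketch. This is not the paper's implementation: `tree_best_of_n`, `toy_extend`, and all parameter names are hypothetical, and the reward here is a stand-in (the sum of random digits) for a real reward model scoring partial responses.

```python
import random


def tree_best_of_n(prompt, extend, reward, n_roots=4, branch=2, keep=2, depth=3):
    """Layer-wise branch-and-prune search; returns the highest-reward response."""
    # Start from n_roots partial responses sampled from the prompt.
    frontier = [extend(prompt) for _ in range(n_roots)]
    for _ in range(depth - 1):
        # Prune: keep only the best partial responses as parent nodes.
        parents = sorted(frontier, key=reward, reverse=True)[:keep]
        # Branch: each surviving parent spawns `branch` continuations.
        frontier = [extend(p) for p in parents for _ in range(branch)]
    return max(frontier, key=reward)


# Toy stand-ins: a "response" is a list of digits, and its reward is their sum.
rng = random.Random(0)


def toy_extend(partial):
    """Append one random 'token' (a digit) to a partial response."""
    return partial + [rng.randint(0, 9)]


best = tree_best_of_n([], toy_extend, reward=sum)
print(best)
```

With these defaults the search generates 4 + 4 + 4 = 12 segments in total, whereas plain Best-of-N over four full responses would also generate 12 segments but without steering later segments toward the better prefixes.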

21/10/2024

Paper	Comments
Do LLMs "know" internally when they follow instructions?	The study employs linear probes across different layers (early/middle/last) and token positions (first/middle/last token) to identify whether modifying representations along a dimension in the input embedding space is linked to successful instruction-following. This methodology connects with another recent relevant work, 'Improving Instruction-Following in Language Models through Activation Steering.' From the perspective of mechanistic interpretability, the findings demonstrate the capability of linear probing to identify the relevant parameters even in an abstract scenario like instruction-following. This can be effectively generalised to identifying patterns in CoT. It can also be utilised to activate more effective reasoning patterns through activation steering. This is a promising research direction. The value of this kind of parameter-probing methodology appears underappreciated in the field.
Do LLMs estimate uncertainty well in instruction-following?	The methodology for cross-model uncertainty comparison in this paper requires further verification. Some of the proposed methods are based on probability and mean token entropy. The study identifies normalized p(true) as the most reliable evaluation metric. Additional verification is needed to understand the cross-model applicability of these metrics. The evolution of uncertainty during pre-training merits further investigation.
MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts	The introduction of momentum into SMoE has negative effects on computational efficiency and on the bound of the model architecture, as seen in Formula 9 in the paper. The paper lacks clear justification for the crucial role of the dynamics of expert representations in SMoEs.
Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection	The paper presents a novel attention mechanism.
How Does Data Diversity Shape the Weight Landscape of Neural Networks?	Key findings include: 1) Dropout tends to promote a more uniform distribution of the empirical spectral density (ESD), while weight decay leads to heavier tails. 2) The impact of data diversity on weight matrices aligns with the effect of dropout but contrasts with that of weight decay.
Streaming Deep Reinforcement Learning Finally Works	This paper proposes a method to stabilize Streaming DRL.
Supervised Chain of Thought	The paper's primary contribution lies in introducing the concept of prompt search complexity. It proposes that search complexity depends on both the total information in the latent vector and the amount of information each CoT step can extract, defined as C(m,s). This framework offers a better-defined approach to quantifying CoT requirements across different task types than the vaguer concept of hops, since the amount of information is more quantifiable.
Almost-Linear RNNs Yield Highly Interpretable Symbolic Codes in Dynamical Systems Reconstruction	Recommended reading. The motivation proposed in this work is notable for its abstraction of linear subregions and their most parsimonious representation. This framework appears natural for understanding the existence of attention in Chain of Thought (CoT) processes. Suppose we consider language generation not as a word-by-word process, but as a switching-state system in which content is planned, then expressed, before the system switches to the next state. The transitions between states might then correspond to representations of certain subregions. But are these symbolically linear? A pertinent question is whether neural architectures should directly emulate the human brain if Next-Token Prediction (NTP) and Supervised Fine-Tuning (SFT) are forms of imitation learning. Primitive human interaction patterns align more closely with feedback-based learning mechanisms, essentially a scaled implementation of reinforcement learning (RL). From this perspective, even the NTP paradigm can be considered an ad-hoc solution. While the current developmental stage requires imitation learning for acquiring fundamental patterns, this suggests a potential evolutionary trajectory for NTP: a transition from word-level processing to a higher-order, more dynamic level. This hypothesis is supported by the hierarchical transitional logic structures inherent in language output and composition, independent of neurological architecture. These patterns persist in language generation whether the objective is to emulate neural processes or natural human linguistic output.
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training	(!) This work addresses the widely acknowledged information bottleneck issue in MLLM encoders. The approach targets specific token-patch correspondences. Potential improvements could involve dynamic, context-aware sub-image framing based on text embeddings, though training complexity may present challenges.
Speciesism in Natural Language Processing Research	An interesting finding of this work is that recent LLMs exhibit speciesist bias.
Associative memory and dead neurons	(!) This work examines neurons exhibiting activation function saturation. It might be valuable.
Latent Weight Diffusion: Generating Policies from Trajectories	(!) Presents potential benefits for the generalization of cross-game Decision Transformer. The approach models different policy behaviors using latent variable z, deriving target policy function distributions through conditional independence. The policy representation shows promise for cross-game generalization.
On Partial Prototype Collapse in the DINO Family of Self-Supervised Methods	-
Provable Benefits of Complex Parameterizations for Structured State Space Models	This work empirically demonstrates the benefits of complex parameterizations for SSMs. The key finding is that complex SSMs utilize dimension more efficiently, though the experiments remain relatively simple.
In-context learning and Occam's razor	Highly recommended reading. Noteworthy sections include 3.1 and 3.5, which analyse the influence of prequential coding. The prequential code length, as an upper bound on the data and model, is tight. It is very insightful in showing how learning algorithms can be used to compress data through prequential coding, and that minimizing the resulting “prequential code length” achieved by a learning algorithm is equivalent to jointly minimizing the training error and the complexity of the model it fits. This prompted the commenter to consider why data mixture is effective. The thought is that its efficacy stems not from reweighting mechanisms, but rather from partial ordering and pre-training dynamics. A further hypothesis: the fundamental value of mixture approaches lies in their capacity to raise the probability of correctly learning the partial ordering. This necessitates developing a framework for attributing dependencies among different samples. To illustrate, consider acquiring university-level knowledge without proper exposure to the prerequisite knowledge learned in secondary school. In such scenarios, three potential outcomes emerge: the university-level knowledge is merely memorised by rote; only noise is learnt; or the knowledge learnt is non-robust and unstable. This analysis suggests that data scheduler design may be crucial for future pre-training methodologies, particularly in modeling the correct partial ordering.
RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph	The commenter does not find this work interesting, but agrees with modeling code or mathematical algorithms using graphs. Another early attempt is Steiner, a series of reasoning models trained on synthetic data using RL, constructing a DAG in its model.
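
As a rough illustration of the prequential coding idea mentioned in the "In-context learning and Occam's razor" entry above, the sketch below encodes a binary sequence one symbol at a time, paying -log2 of the probability the learner assigned to each symbol before seeing it, and then updating the learner. The learner here is a Laplace-smoothed Bernoulli estimator chosen purely for simplicity; it is an assumption of this sketch, not the setup used in the paper, and the function name `prequential_code_length` and the `alpha` parameter are hypothetical.

```python
import math


def prequential_code_length(bits, alpha=1.0):
    """Bits needed to encode `bits` sequentially (prequentially).

    Assumption: the learner is a Bernoulli model with add-`alpha`
    smoothing, refit on all previously seen symbols.
    """
    counts = [0, 0]          # occurrences of 0 and 1 seen so far
    total_bits = 0.0
    for b in bits:
        # Predict the next symbol from the past only, then pay its code length.
        p = (counts[b] + alpha) / (counts[0] + counts[1] + 2 * alpha)
        total_bits += -math.log2(p)
        # Only now is the learner updated with the revealed symbol.
        counts[b] += 1
    return total_bits


# A highly regular sequence compresses well under this learner.
regular = prequential_code_length([0] * 7)
# A mixed sequence of the same length costs more bits.
mixed = prequential_code_length([0, 1, 0, 1, 0, 1, 0])
print(regular, mixed)
```

With `alpha=1`, the all-zeros sequence of length 7 costs exactly log2(8) = 3 bits, since the predicted probabilities 1/2, 2/3, ..., 7/8 multiply to 1/8. A learner that generalizes quickly from few symbols yields a shorter total code, which is the sense in which the code length reflects both fit and model complexity.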

If you are interested in the work published by us, please navigate to our full paper list.