Paper | Comments |
---|---|
Scaling Laws for Precision | This study investigates the impact of low-precision training and inference on language model quality and cost, and proposes a "precision-aware" scaling law. It primarily focuses on the effects of quantization. The dataset used throughout is Dolma V1.7, and GPTQ is applied for post-training quantization, with comparisons made to other quantization methods. While I feel that the deeper implications of the proposed law still warrant further exploration, some significant takeaways were validated: common post-training quantization techniques can lead to substantial degradation in model performance, particularly after extensive pretraining on large amounts of data. This suggests that more pretraining computation does not necessarily result in a stronger model (though this claim seems somewhat less solid). The study finds that training cost is linearly related to precision, and that low-precision training can reduce computational cost while keeping the loss stable. However, the paper does not delve deeply into the dynamics of using different precisions at different stages; based on the observations made in the study, this could be a promising direction for future research. |
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding | This study is relatively straightforward and easy to understand, not overly ostentatious compared to recent studies on process supervision. The authors propose LaTent Reasoning Optimization (LaTRO), which utilizes the model's own probability estimates as a reward function. By optimizing high-quality reasoning paths, they achieve a smoothing effect on the reward probability distribution when moving from the query to the response through a specific path. This reduces the extremity of rewards corresponding to different reasoning paths. However, if errors are significant or occur later in the process, as mentioned in the paper, the reward will approach zero. The concern here is that errors occurring towards the end might pose a hidden risk, potentially affecting reasoning performance. Currently, most papers in this field include both greedy and self-consistency settings. Recommended reading. |
CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models | This is a study exploring the Multi-Agent framework, with two important takeaways: 1. All tasks performed by the Critic Agent are crucial for tree expansion and solution search, with tasks related to node termination and solution verification showing the most significant impact. Making mistakes is not a concern; debugging is crucial. 2. Exploring diversified strategies is more effective than iterative optimization based on a single solution. |
BitNet a4.8: 4-bit Activations for 1-bit LLMs | BitNet a4.8 is a technique enabling 4-bit activation for 1-bit large language models (LLMs). Specifically, it applies 4-bit quantization to the inputs of the attention and feed-forward network layers, while the intermediate states are sparsified and quantized to 8-bit. During training, a two-phase process is employed to gradually transition from 8-bit to 4-bit, utilizing gradient approximation combined with mixed-precision training to update the parameters. |
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models | This multimodal MoE model differs from MoE in the conventional sense and is more akin to the mixed training of three models. MoT achieves modality-specific processing by decoupling the non-embedding parameters of the model (including the feed-forward network, attention matrices, and layer normalization) while retaining a global self-attention mechanism. In MoT, each modality (text, image, speech) has its own independent set of non-embedding parameters, such as feed-forward networks, attention projection matrices, and layer normalization, while a global self-attention mechanism is applied across all modalities to capture cross-modal relationships. I had actually proposed a similar idea before, and I still think it should be feasible. For the speech modality, models like SoundStream and Encodec use codebooks that could be treated in a similar manner here: construct an MoE model with eight codebooks corresponding to eight experts while maintaining global attention. My assumption is that this approach could save substantial computation, and based on my experience pretraining MERT, I feel the performance loss would be smaller than with some of the common processing methods currently used for speech. |
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models | Similar to MAP-Neo, the greatest value of this work lies in data preparation, creating heuristic filters, and conducting various pre-training data analyses. With this groundwork, the performance is roughly on par with Qwen-2.5-Coder of the same model size across various leaderboards. Today, all of the rules and code are being released, and the processed pre-training data, SFT (supervised fine-tuning) data, and other related data will be gradually made available. Figure 3 in fact reflects a clever personal approach to studying heuristic rules in pre-training, since these transparent LLM projects often lack sufficient GPU resources and credits to run extensive training and ablation studies; figuring out whether adding a particular rule is effective becomes a philosophical question. At that time, I devised a somewhat unconventional but potentially inexpensive validation method (a minimal sketch follows this table): 1. First, generate embeddings for the pre-training dataset before adding new heuristic rules, then visualize the distribution using PCA. 2. Project the dataset filtered with the new heuristic rules onto the previous PCA distribution to observe which data points have been removed [this can be further quantified at a finer granularity]. 3. Randomly inspect clusters whose density has significantly decreased, as shown in Figure 3, to verify whether they align with the expected data removal. 4. Perform sample annotation to assess the rate of false positives. Empirical conclusion: if the false-positive rate is below 5%, the rule can be applied directly without further verification; otherwise, additional consideration is needed. Many of MAP-Neo's rules were adopted in this manner without further training, but at that time MAP-Neo was still somewhat immature and overly aggressive in its filtering. I hold these transparent-model efforts in high regard, as the underlying principles are quite simple, and personally I believe there is no point in hiding many incremental tricks. In the realm of LLMs, model capabilities improve significantly over time, and as for pre-training data, no matter how well the rules are refined, leaderboard performance can easily be surpassed by approaches like DCLM and Fineweb-edu that directly fit downstream tasks using fastText [not that I endorse this method]. Instead, it is better to release the methods for public discussion and critique, while also boosting the reputation and visibility of one's models. There are two major takeaways in this paper regarding two-stage SFT and GitHub stars: 1. Two-stage SFT is effective. 2. GitHub stars are a significant pitfall when it comes to heuristic rules; in short, avoid filtering based on this metric. |
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI | A very meaningful mathematical benchmark, developed through collaboration with more than 60 mathematicians and resulting in hundreds of original and highly challenging problems. These problems cover various branches of modern mathematics and can be automatically validated, with answers typically being integer solutions or SymPy objects. Current accuracy rates are extremely low. It represents a highly valuable dataset for mathematics competitions. |
GUI Agents with Foundation Models: A Comprehensive Survey | The summary of the data source section feels like a rather useful cheat sheet and is quite comprehensive. |
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks | Microsoft's Magentic-One, a general multi-agent system, aims to address complex tasks and utilizes GAIA, AssistantBench, and WebArena as test sets. It employs a leading agent (Orchestrator) to plan, monitor progress, and replan to recover from errors. Other specialized agents perform specific tasks as needed, such as operating web browsers, navigating local files, or writing and executing Python code. |
Analyzing The Language of Visual Tokens | Interesting perspective. Essentially, it involves using tokens extracted from visual data to train GloVe and then observing similarities across various topological structures. It demonstrates that although visual languages to some extent follow Zipf's law, the higher frequency of new tokens and the lower compression rates indicate a more dispersed distribution of information. The lack of grammatical structure and hierarchical organization in visual languages leads to higher perplexity and weaker hierarchical structures. Recommended reading. (But the formatting isn't pleasing; maybe rushed?) |
Clustering in Causal Attention Masking | The general conclusion should be that, theoretically, tokens are proven to converge into a single cluster, and the existence of metastable clusters within the model is confirmed. Here, what is referred to as "causal attention" means that each token can only interact with the preceding tokens, ensuring the correctness of the generation order. |
HourVideo: 1-Hour Video-Language Understanding | This study selected 500 egocentric videos from the Ego4D dataset, with video durations ranging from 20 to 120 minutes, so not all of them exceed 1 hour. In terms of tasks, the coverage feels very comprehensive, including summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval). The average Gemini score is around 30, which leaves significant room for improvement on this benchmark. |
SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference | A rather ingenious take on speculative decoding: each prompt-response pair is segmented into a sequence of token IDs, and all possible suffixes are extracted from these sequences to construct a suffix tree (a toy sketch follows this table). Each node in this tree represents a token, and the path from the root to any given node corresponds to a subsequence that appears in the training data. Given any pattern, the method quickly identifies possible continuation paths in the suffix tree, then extracts a smaller subtree, which is ultimately used to accelerate inference. |
Kwai-STaR: Transform LLMs into State-Transition Reasoners | A preliminary study on Mathematics O1 defined five actions: Formalize, Decompose, Solve Subproblem, Solve Parent, and Summarize. It then used the STaR approach for fine-tuning (LoRA). The overall information content was not extensive, and the performance was satisfactory, but particularly challenging benchmarks were not tested. |
Vision Language Models are In-Context Value Learners | Generative Value Learning (GVL). Essentially, GVL allows the Vision-Language Model (VLM) to generate globally consistent value estimates by providing the entire trajectory as input. However, GVL also requires the VLM to focus on individual frames and output accurate value predictions by randomly shuffling the input frames. The main claim of this paper is that it can overcome temporal bias. I briefly explored the two datasets they used for evaluation: the Open X-Embodiment and ALOHA datasets. Overall, it is quite informative and is recommended for reading. |
Scaling Laws for Pre-training Agents and World Models | This study investigates the impact of scale on world modeling and behavior cloning by utilizing generative pretraining losses on large-scale datasets. Specifically, behavior cloning (predicting actions) and world modeling (predicting the outcomes of actions, i.e., images) are defined as the two tasks. The findings reveal that the trade-off between model and dataset size is influenced by the compression rate of the tokenizer, task type, and architectural choices. Additionally, it was observed that using continuous embeddings (i.e., images) as the objective leads to a rapid increase in the ideal model size for a given dataset, indicating a higher learning difficulty; tokenization, in contrast, reduces the learning difficulty for models. Recommended reading. |
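To make the PCA-based rule check mentioned in the OpenCoder entry above concrete, here is a minimal sketch of how one might implement it; this is my own reconstruction rather than anything released with the paper, and `embed_texts` is a hypothetical embedding function you would supply.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def rule_impact_report(texts, keep_mask, embed_texts, n_clusters=20):
    """Show which regions of the pre-training data a new heuristic rule removes.

    texts:       documents before the new rule is applied
    keep_mask:   boolean array, True if the document survives the new rule
    embed_texts: hypothetical callable mapping a list of texts -> (N, d) embeddings
    """
    emb = embed_texts(texts)                                  # embeddings of the old dataset
    coords = PCA(n_components=2).fit_transform(emb)           # the "previous" PCA distribution
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(coords)

    report = []
    keep_mask = np.asarray(keep_mask)
    for c in range(n_clusters):
        in_c = clusters == c
        removed_frac = float(np.sum(in_c & ~keep_mask)) / max(int(np.sum(in_c)), 1)
        report.append((c, removed_frac, int(np.sum(in_c))))

    # Clusters with a high removed fraction are the ones to sample and hand-check;
    # per the empirical rule above, a false-positive rate below ~5% among those
    # samples means the rule can ship without a training ablation.
    return sorted(report, key=lambda row: -row[1])
```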
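And for the SuffixDecoding entry, a toy version of the token-level suffix lookup: a nested dict stands in for a proper suffix tree, and the continuation choice is arbitrary rather than frequency-weighted. This only illustrates the idea, not the paper's implementation.

```python
class SuffixIndex:
    """Toy suffix index over token-ID sequences for model-free draft generation."""

    def __init__(self):
        self.root = {}

    def add_sequence(self, tokens, max_suffix_len=64):
        # Insert every (length-capped) suffix so any observed pattern can be continued.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + max_suffix_len]:
                node = node.setdefault(tok, {})

    def draft(self, pattern, max_new=8):
        # Walk down the index along the pattern, then follow a branch to propose
        # draft tokens (a real system would follow the most frequent branch).
        node = self.root
        for tok in pattern:
            if tok not in node:
                return []
            node = node[tok]
        out = []
        while node and len(out) < max_new:
            tok = next(iter(node))
            out.append(tok)
            node = node[tok]
        return out

index = SuffixIndex()
index.add_sequence([5, 7, 9, 7, 9, 11])
print(index.draft([7, 9]))  # [7, 9, 11] with this insertion order
```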
Paper | Comments |
---|---|
On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models | This paper proposes decoupling semantic conditions from other conditions and uses cosine weighting to adjust the contribution of low-level control conditions. It introduces weight transfer strategies from pre-trained models to larger datasets and higher resolutions through interpolated positional embeddings, scaled noise scheduling, and stronger data augmentation. However, upon closer examination, the hyperparameters involved appear numerous and potentially difficult to tune, especially without extensive diffusion experiments. Insights from experts on effective tuning tricks in this context would be valuable. |
Discovering Data Structures: Nearest Neighbor Search and Beyond | This work demonstrates that neural networks can learn data structures from scratch that outperform traditional baselines for specific problems. The settings examined include uniform distributions, more challenging distributions, a Zipfian distribution, and a uniform distribution over a 30-dimensional unit hypersphere. Previously, the significance of such ML experiments was unclear, but this research does suggest practical applications, such as measuring data-consumption efficiency. Specifically, compared to traditional baselines like k-d trees or binary search, it remains to be seen how structured data could be used to verify the generalizability of newly learned structures with minimal training. Research in this area still appears sparse but promising. |
DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models | This work combines VAE and diffusion models, employing a learnable Transformer pre-trained language model as the encoder to map structured text data to a latent space. Using reparameterization techniques, input data is encoded into latent features. The latent representations are then denoised in the latent space, and a noise removal network is trained to restore the original latent vectors. These features are finally injected into an LLM decoding process to generate high-quality, controllable synthetic data. Remarkably, Mistral models fine-tuned on synthetic data generated by DiffLM outperform those trained on real data in HumanEval and MBPP benchmarks. This novel approach could facilitate large-scale data generation and rewriting projects. |
Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation? | This paper introduces a benchmark dataset, Interaction2Code, for testing Code Agents in interactive scenarios. The dataset construction includes webpage selection, automated interactions, post-processing, and interaction extraction. If the data quality holds up, it could be a valuable dataset for evaluation purposes. Further assessment might be warranted to determine its utility. |
Inference Optimal VLMs Need Only One Visual Token but Larger Models | This intriguing paper, though not entirely conclusive, establishes a scaling law between LLM size and the number of tokens provided by a Vision Encoder during inference. Two parameters are introduced to represent LLM quality and visual information compression. Observations reveal a logarithmic-linear decline in performance as visual tokens decrease, but LLM parameters have a fivefold greater impact on downstream errors than the number of visual tokens. Thus, minimizing inference FLOPs is more effective by reducing visual tokens than LLM parameters. For visual reasoning, the optimal configuration is a large LLM with minimal visual tokens, while OCR and document understanding tasks require more visual tokens. The ablations on LLaVA-Onevision parameters are thorough and recommended for reading. |
Wave Network: An Ultra-Small Language Model | This paper, though perhaps lacking immediate practical applications, is conceptually interesting. It uses complex vectors to represent each token, encoding both global and local semantics. Specifically, the complex vectors comprise magnitude vectors (global semantics) and phase vectors (relationships between individual tokens and the global semantics). From a signal processing perspective, token embeddings are treated as discrete signals in the frequency domain, with magnitudes summed for global semantics and phase vectors capturing local relationships. Token representations are updated using complex vector operations, simulating wave interference (addition) and modulation (multiplication); a toy illustration follows this table. The claim that token embeddings focus on local semantics and lack a direct global representation is reasonable, but the experimental validation is weak. Nevertheless, the approach is novel. |
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios | This paper reveals MLLMs' vulnerability to misleading instructions. It introduces explicit and implicit misleading prompts, such as instructions like "The correct answer is {incorrect option}," and constructs a new multimodal uncertainty benchmark (MUB) to evaluate susceptibility. Results indicate high susceptibility rates, with an average misleading rate of over 86% across MLLMs, and 27% even for simple explicit misleading scenarios. This method of assessment may be worth further exploration. |
Game Plot Design with an LLM-powered Assistant: An Empirical Study with Game Designers | GamePlot, an LLM-based tool, assists game designers in creating immersive narratives and refining them through collaborative gameplay testing. The most appreciated feature is the ability to modify plots during testing, followed by NPC summaries and multiplayer settings. Participants value content generation, content control, and editing capabilities. Domestically, a similar product is "Caiyun Xiaomeng," but user experience suggests simplification could reduce the entry barrier, especially for role-playing scenarios. In practice, users often want simplified interactive movie-like experiences rather than the complexity of traditional role-playing games. Challenges remain, particularly in RAG+Database construction and narrative immersion. Despite limitations, the concept holds significant potential. |
DroidSpeak: Enhancing Cross-LLM Communication | DroidSpeak cleverly reduces communication overhead in multi-agent LLM frameworks by selectively reusing intermediate data from the sender LLM, eliminating redundant computations. The approach requires multi-agent models to be at least partially homogeneous. |
Mixtures of In-Context Learners | MOICL divides a set of demonstrations into k subsets, trains k ICL experts, and combines their token predictions using a trainable weighting function. This concept is quite intriguing, and Ponti and Minervini's group consistently produces thought-provoking ideas. It is worth following. |
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection | This paper calculates the importance of KV caches for each attention head and introduces a voting mechanism to select a subset of critical KV cache tokens for computation. It designs a cache selection mechanism allowing similar queries to share selection results, reducing selection frequency and ensuring efficiency. |
Textual Aesthetics in Large Language Models | Furu's paper defines textual aesthetics, designs corresponding SFT data, and evaluates the concept. The aesthetics definition remains unclear, appearing to focus more on organization and layout or text coherence for identical semantic content, as illustrated in Figure 1. This is a novel issue worth revisiting. |
SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models | While the idea of this paper is promising, the improvements seem to enhance robustness rather than factuality. The approach leverages the divergence between early-layer and final-layer logits to approximate the KL gradient, selecting tokens based on the early-layer approximations; a weighted average of these estimates then informs the adjustment of the final logits. |
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | This paper introduces a Code-Switching Curriculum Learning method for multilingual generalization, emulating human second-language acquisition through hierarchical training. It pre-trains with word-level code-switching data, advances with sentence-level data, and concludes with monolingual corpora. Consistent with internal findings, the experiments are relatively simple. |
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control | This paper identifies task-relevant components (e.g., attention heads) and leverages sparsity to achieve near-independent task control. LLM computations resemble a Directed Acyclic Graph (DAG), with output changes measured by replacing specific node activations. This perspective aligns with CoT reasoning and introduces a principal component-based trick. |
Fantastic LLMs for Preference Data Annotation and How to (not) Find Them | The paper introduces the "strong-weak hypothesis," suggesting that increasing the preference gap between two LLMs improves the accuracy of the density-ratio reward function; the hypothesis is validated through experiments on 221 LLM pairs. The study uses log density ratios between well-aligned and poorly-aligned LLMs as reward signals to generate preference-aligned data (a minimal sketch follows this table). If effective, this method could yield substantial preference data. |
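As a toy illustration of the Wave Network entry in the table above: a complex-vector token representation in which the magnitude carries a sequence-level "global semantics" signal and the phase is token-specific, with interference as addition and modulation as element-wise multiplication. This is my own loose reading of the mechanism, not the paper's code, and the particular magnitude/phase choices here are illustrative assumptions.

```python
import numpy as np

def to_complex(token_embs):
    """token_embs: (seq_len, d) real embeddings.
    Magnitude: a shared per-dimension statistic over the whole sequence (global semantics).
    Phase:     each token's relation to that global signal (toy choice: arctan2)."""
    global_mag = np.sqrt(np.sum(token_embs ** 2, axis=0) + 1e-8)   # (d,)
    phase = np.arctan2(token_embs, global_mag)                     # (seq_len, d)
    return global_mag * np.exp(1j * phase)                         # (seq_len, d), complex

def interfere(z1, z2):
    return z1 + z2   # wave interference: addition of complex representations

def modulate(z1, z2):
    return z1 * z2   # wave modulation: element-wise complex multiplication

tokens = np.random.randn(5, 16)
z = to_complex(tokens)
updated = modulate(interfere(z, np.roll(z, 1, axis=0)), z)  # arbitrary composition, just to show the ops
print(updated.shape)  # (5, 16)
```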
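And for the "strong-weak hypothesis" entry, a minimal sketch of using a log density ratio between a better-aligned and a worse-aligned model as a preference reward; `logprob_of` is a hypothetical helper that returns the summed token log-probability of a response under a given model, and the pairing-into-preference-data step is only gestured at.

```python
def density_ratio_reward(prompt, response, logprob_of, strong_model, weak_model):
    """Reward = log p_strong(response | prompt) - log p_weak(response | prompt).

    The paper's hypothesis is that a larger alignment gap between the two models
    makes this ratio a more accurate reward signal.
    """
    return (logprob_of(strong_model, prompt, response)
            - logprob_of(weak_model, prompt, response))

def make_preference_pair(prompt, candidates, logprob_of, strong_model, weak_model):
    # Score sampled candidate responses, then take the best/worst as (chosen, rejected)
    # preference data for DPO-style training.
    scored = sorted(candidates,
                    key=lambda r: density_ratio_reward(prompt, r, logprob_of,
                                                       strong_model, weak_model))
    return {"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]}
```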
Paper | Comments |
---|---|
Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment | Recently, many colleagues have discussed a key issue in ORM: if the evaluation is based solely on the final result, the reward signals obtained may be very sparse. Therefore, the Reward Model needs to generate different reward signals for different responses, which encourages learning from some responses that may not be entirely correct but contain reasonable information. The core insight of this paper from Alibaba lies in adaptively identifying important information and converting sample-level supervision into fine-grained, subsequence-level supervision, thereby making the reward and action space density more aligned with the input information density. The optimization goal and path are quite fundamental. However, the paper includes many extraneous elements, such as introducing adaptive masks to dynamically update the threshold for preference judgments and a Schmitt trigger. The author’s personal thought is more straightforward: if we simply focus on refining the reward generation process, for example, since a single reward for an entire response can be vague, why not allow a large model to run a pipeline that dissects the scoring dimensions? If we provide highly annotated CoT and reference scoring weights, and allow a large model to review and score progressively, this would be less about playing with algorithms and more about directly applying computational power to a longer, more detailed reward generation pipeline. Last time, a colleague from GDM mentioned that they scaled up computational power for generating PRM rewards, somewhat like applying self-consistency to RM, which reportedly yielded some benefits, although it is unclear how reliable this rumor is. |
The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation | This Verilog instance dataset appears to be sizable and should be valuable. It could potentially be merged into our evaluation or used to cover an additional corner case. |
A Theoretical Perspective for Speculative Decoding Algorithm | This work by Mengdi Wang provides a theoretical analysis of speculative decoding, abstracting the decoding problem through a Markov chain formalization. The underlying procedure is to generate draft sequences with a small model and then validate the drafted tokens with a large model (the standard accept/reject rule is sketched after this table). The first two claims made by the authors are very strong: one gives the exact formula for the expected number of rejections in speculative decoding, indicating that the acceleration rate is inversely related to the distribution difference; the other proves that, under the condition of keeping the output distribution unbiased, any unbiased algorithm will incur at least as many rejections as speculative decoding, i.e., speculative decoding is optimal within this class of algorithms. The paper also introduces batch speculative decoding, which seems like a solid contribution. |
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models | In this work, the visual aspect merely serves to extend the set of mathematical seed problems; the approach could be applied well beyond mathematics. It includes 501 high-quality seed problems across multiple topics, each represented as a Python program, carefully designed to automatically generate a large number of concrete problem instances covering variants such as numerical changes, geometric transformations, and function-type variations (a toy seed program is sketched after this table). This is similar to the idea shared earlier about generating LeetCode-style problems from a template, which can then be dynamically extended into real LeetCode problems. The methodology seems useful for training models; with some adjustments, it could be leveraged in pre-training to create a small batch of synthetic data, yielding potential benefits. Out of the 501 seed problems, 227 are from existing visual mathematics datasets, while 274 are newly collected or developed. Beyond OOD evaluation, this approach can also support program-based evaluation, where a large collection of related algorithms/templates can be used to test the internal robustness of a single algorithm/template. Additionally, tricks could be applied to these algorithm templates, such as constructing cases like "how many animals are in the cage if there are chickens and rabbits", to test the degree of pattern solidification in the model. This is an effective and low-cost direction that can provide valuable insights. |
A Deep Dive Into Large Language Model Code Generation Mistakes: What and Why? | This paper analyzes code generation errors in large language models, using GPT-4 and Gemini Pro 1.0, and benchmarks such as HumanEval-X and CoderEval. It provides a valuable analysis of errors occurring in language model code generation. The paper identifies seven main categories of errors: conditional errors, garbage code, mathematical formula and logic errors, minor output formatting errors, operational sequence errors, API misuse, and indexing errors. In the cause analysis, apart from corner cases and training gaps, several key insights are offered: 1. Misleading coding conventions and guidelines; 2. The impact of In-Context Learning (ICL). Both 1 and 2 have similar effects: ICL is not necessarily wrong, but it may introduce strange influences in subsequent outputs. There seems to be much potential for further exploration here. 3. Misleading function documentation. One hypothesis is that LLMs somehow learn a pattern in code generation where the function signature is expected to fully align with the implementation. 4. Sensitivity to position. |
Scaling Laws with Hidden Structure | This paper is highly recommended for reading, as its modeling approach is fundamental. The authors argue that neural networks can effectively learn discrete distributions through hidden factorial structures in the data. From my reading, the assumption is that each discrete element (not explicitly framed this way in the paper, but it can intuitively be linked to tokens) is mapped to a learned vector, and any unknown or known factorized embedding can be represented as a nested distribution satisfying the factorial assumption. Additionally, the paper observes that learning speed is related to the statistical complexity χ, suggesting that MLPs can leverage the implicit product form of the target distribution to improve learning efficiency. The paper also argues that generalization ability is related to the connectivity of the factorization graph and its statistical complexity. Although the experiments are somewhat toy-like, the findings can be linked to many phenomena in large language models. From a circuit perspective, I feel that how LLMs learn individual functions is relatively clear, and this research could clarify it further. The most valuable aspect of the paper for mechanistic interpretability is understanding where traditional grammar or CFG assumptions do not align with text grammar, and how to construct CFGs (or multiple CFGs) that resemble text grammar but remain controllable; this would help identify the subtle boundaries and mechanisms of whether and how such learning occurs. |
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models | The quality of this paper is not particularly high. It introduces a single-elimination tournament approach to reduce the number of comparisons required to achieve a robust Elo score. However, this is an emerging direction that I find promising. Recently, I came across another paper that uses multiple non-Elo algorithms to model other statistical significances based on different models' responses to the same prompt. This paper could be considered a pioneering work in the field, opening up a small new area. As for Arena, there are a lot of assumptions that are problematic. For example, it attempts to represent user profiles, but which types of users does it represent? Are different users truly consistent? The paper provides a simple win/loss analysis, but what about clustering and analyzing response patterns? How are user preferences reflected? There's a lot to explore from a statistical perspective. Additionally, the chatbot arena approach itself is not particularly efficient. |
Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation | This paper raises an interesting and important issue. The related work gives the impression that there has been insufficient research into the impact of data ordering in LLM training. The paper utilizes LEAN and HH. The author introduces a new data ordering method, called intuitive ordering, in which the relevant intermediate supervision for each proof step always appears to the left of the proof step. Personally, I feel that this is somewhat of a toy model because it’s difficult to find this kind of intuitive ordering in pretraining data. Nevertheless, I still appreciate articles that introduce new problems. |
Vision-Language Models Can Self-Improve Reasoning via Reflection | This is a relatively good A+B paper that introduces an iterative self-training framework, R3V, to enhance vision-language reasoning abilities through reflection on self-generated CoT (Chain of Thought) reasoning. The method itself is not particularly groundbreaking, but it is still worth reviewing for tuning experiences and the observed improvements in performance. |
GWQ: Gradient-Aware Weight Quantization for Large Language Models | The idea is simple, intuitive, and effective. The proposed GWQ method retains the top 1% of weights with the largest gradient absolute values in FP16 precision, while quantizing the remaining weights to a lower-bit format, achieving low quantization loss (a minimal sketch of the selection rule follows this table). This can be considered a parameter-quantization analogue of speculative decoding. |
Can Large Language Models generalize analogy solving like people can? | This is an out-of-distribution (OOD) task where participants are asked to infer a new letter string based on a given transformation rule. The performance of models and humans is compared, and interestingly, adults and some LLMs (such as GPT-4o and Llama-3.1 405B) outperform children in this task with the Latin alphabet. However, Claude-3.5 and Gemma-2 27B perform slightly worse. This observation highlights the rare lack of robustness in Claude-3.5-Sonnet for OOD tasks, whereas Llama-3.1-405B does not perform poorly. It might be worthwhile to add Llama-3.1-405B as a baseline in our OOD benchmark comparisons. |
Thinking Forward and Backward: Effective Backward Planning with Large Language Models | The paper proposes a backward planning algorithm where the LLM first generates a backward plan, then reverses the sequence and self-validates it. This approach helps LLMs avoid inherent biases in backward planning, generates more diverse candidate plans, and exploits the asymmetry between the forward and backward directions of planning problems. The benchmarks used are limited to three constructed tasks: graph planning, array transformation, and block world tasks. However, the experimental design is quite interesting; it employs breadth-first search (BFS) to compute the number of steps for both forward and backward searches. ING-VP used a similar approach, while many recent reasoning benchmarks have not. It is actually possible to derive the structure of a Reasoning Directed Acyclic Graph (RDAG), where, for well-defined steps, the nesting depth and total step count can be computed exactly, which could provide valuable new insights. |
How Far is Video Generation from World Model: A Physical Law Perspective | This paper explores the ability of video generation models to discover physical laws, particularly the ability to identify these laws purely from visual data. The quick takeaway is that diffusion models and insufficient data alone cannot solve the out-of-distribution (OOD) generalization problem. During the generalization process, the model tends to refer to similar training cases rather than learning universal rules. Future research should focus on improving models to better understand and apply physical laws. This is somewhat similar to the characteristics of LLMs, but it appears that the knowledge learned by diffusion models is shallower (possibly due to the lower information density in visual data, making it harder to extract rules). Earlier this year, during an ICLR discussion with Professor Tan Xu and Xing Chao, we talked about why diffusion models rarely mention "grokking." If one is learning explicit rules or patterns like resolution or extraction, it is relatively smooth. A follow-up paper analyzing diffusion model grokking from the perspective of physical laws could be a very decent contribution. |
Evaluating Creative Short Story Generation in Humans and Large Language Models | This is not a static benchmark, so it cannot be included in existing evaluation systems, but it quantifies some points that might already be known: 1. Stories generated by models tend to have higher vocabulary and syntactic complexity than those generated by humans, but they have lower readability. 2. Human-generated stories exhibit higher vocabulary diversity. They have lower complexity but higher diversity, which could be an interesting point. 3. Humans are more likely to use pronouns and often write from the first or second-person perspective, while models tend to favor the third-person perspective. 4. Humans' story transitions and plot twists create a greater sense of surprise, meaning that humans have more creative twists, whereas models tend to be more mundane and logical. I am also quite curious about how Robert trains his models, as the results still seem intriguing. |
Improving Steering Vectors by Targeting Sparse Autoencoder Features | The paper primarily addresses the issue of steering-vector intervention. Steering vectors control the behavior of language models by adding activation vectors during the model's forward pass (a minimal hook-based sketch follows this table). In this work, the authors measure the effect of a steering vector on Sparse Autoencoder (SAE) features and use that predicted effect to choose vectors, achieving more precise steering control (SAE-Targeted Steering, SAE-TS). This method seems to have significant implications for alignment, especially during supervised fine-tuning (SFT), where predicting the impact of any single data point could be highly important. It is worth considering how to follow up on this approach. |
Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders | Recently, it seems particularly fitting to write a position paper on how to use Sparse Autoencoders (SAE) and techniques like the logit lens to understand model parameters and achieve controllability over model behavior. There seems to be a small technical breakthrough in this area, with slightly novel papers emerging almost every day. This paper proposes framing the token-feature matching problem as a resource allocation problem constrained by a sparsity budget. Existing TopK SAE methods solve this allocation problem under the constraint that each token can match at most K features, but they fail to fully leverage the advantages of adaptive computation. Therefore, they propose two new SAE variants: Feature Choice SAEs and Mutual Choice SAEs. Feature Choice SAEs relax the constraint that each feature can match at most M tokens, addressing the sparse allocation issue. Mutual Choice SAEs remove the constraint on token-feature matching numbers, allowing free allocation within the total sparsity budget. The new loss design they propose is somewhat similar to MoE load balancing. |
TableGPT2: A Large Multimodal Model with Tabular Data Integration | This is a project from Zhejiang University's Jake Zhao Junbo, which focuses on building a comprehensive pipeline for table understanding. It includes pretraining and other components, and the approach is quite detailed. The benchmarks and datasets involved in the paper could be worth reviewing for potential references or reuse. |
Context Parallelism for Scalable Million-Token Inference | The paper introduces Context Parallelism (CP) to optimize long-context LLM inference. It specifically focuses on long contexts and presents two lossless, accurate circular attention variants: pass-KV and pass-Q. Additionally, scalability tests are conducted across multiple nodes. |
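For the speculative decoding theory entry above, the standard accept/reject verification step being analyzed looks roughly as follows (this is the textbook scheme; the paper's contribution is the rejection-count formula and the optimality proof, not this loop).

```python
import numpy as np

def verify_draft(p_large, q_draft, draft_tokens, rng=None):
    """One verification pass over a drafted block of k tokens.

    p_large, q_draft: (k, vocab) arrays of next-token distributions from the
                      large and draft models at each drafted position.
    draft_tokens:     the k token IDs proposed by the draft model.
    Accepting with prob min(1, p/q) and resampling from the residual keeps the
    output distribution identical to the large model's; the expected number of
    rejections grows with the gap between p and q.
    """
    rng = rng or np.random.default_rng()
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_large[i], q_draft[i]
        if rng.random() < min(1.0, p[tok] / max(q[tok], 1e-12)):
            accepted.append(int(tok))
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))  # corrected resample
            break
    return accepted
```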
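The DynaMath-style "seed problem as a program" idea is easy to picture with a small generator like the one below; this is a hypothetical seed in the spirit of the chickens-and-rabbits example from the comment, not one of the benchmark's 501 programs.

```python
import random

def chickens_and_rabbits_seed(rng):
    """Each call emits a fresh, automatically checkable variant of one seed problem
    (numerical variation only; geometric or function-type variants would be
    separate generator programs)."""
    chickens = rng.randint(2, 20)
    rabbits = rng.randint(2, 20)
    heads, legs = chickens + rabbits, 2 * chickens + 4 * rabbits
    question = (f"A cage holds chickens and rabbits. There are {heads} heads "
                f"and {legs} legs in total. How many rabbits are there?")
    return {"question": question, "answer": rabbits}

variants = [chickens_and_rabbits_seed(random.Random(seed)) for seed in range(100)]
print(variants[0]["question"], "->", variants[0]["answer"])
```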
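A minimal sketch of the GWQ selection rule as I read it: keep roughly the top 1% of weights by gradient magnitude in FP16 and fake-quantize the rest to 4 bits. This illustrates the idea only; it is not the authors' kernel, and the symmetric per-tensor quantizer is my own simplification.

```python
import torch

def gradient_aware_quantize(weight, grad, keep_ratio=0.01, n_bits=4):
    """weight, grad: same-shaped tensors (weights and gradients from a calibration pass).
    Returns a mixed tensor where the highest-|grad| weights stay at FP16 precision
    and all remaining weights are symmetrically fake-quantized to n_bits."""
    k = max(1, int(keep_ratio * weight.numel()))
    threshold = grad.abs().flatten().topk(k).values.min()
    keep_mask = grad.abs() >= threshold                      # "salient" weights to protect

    qmax = 2 ** (n_bits - 1) - 1
    scale = weight.abs().max() / qmax
    quantized = torch.clamp((weight / scale).round(), -qmax - 1, qmax) * scale

    mixed = torch.where(keep_mask, weight.half().float(), quantized)
    return mixed, keep_mask

w, g = torch.randn(256, 256), torch.randn(256, 256)
mixed, mask = gradient_aware_quantize(w, g)
print(mask.float().mean().item())  # roughly 0.01
```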
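And for the steering-vector entry, the basic intervention (before any SAE-targeted refinement) is simply adding a fixed vector to the hidden states at one layer during the forward pass. A minimal PyTorch-hook sketch follows; `model.layers[layer_idx]` is an assumed module path, so adjust it to the actual architecture.

```python
import torch

def add_steering_hook(model, layer_idx, steering_vector, scale=4.0):
    """Register a forward hook that adds `steering_vector` to the hidden states
    leaving one transformer block. SAE-TS would additionally choose the vector so
    that its predicted effect on SAE features matches the intended behavior change."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return model.layers[layer_idx].register_forward_hook(hook)  # assumed module path

# handle = add_steering_hook(model, layer_idx=12, steering_vector=v)
# ... generate as usual, with the steering applied ...
# handle.remove()
```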
Paper | Comments |
---|---|
Human-inspired Perspectives: A Survey on AI Long-term Memory | The focus is on several concepts presented in this paper: 1. The paper introduces several types of human memory: episodic memory, semantic memory, and procedural memory. It maps the first two to non-parametric memory and the latter to parametric memory. Within the context of this survey, the authors expect the various episodes and semantics to live within the context window accepted by language models, rather than in associations formed within the model itself; this is not necessarily a correct belief, but it is worth considering. 2. Furthermore, the paper proposes a memory management mechanism, emphasizing that adaptive storage, adaptive retrieval, and adaptive forgetting handle different types of information separately. These three operations are defined very concisely: storage, retrieval, and forgetting. Currently, LLMs (including agents) rarely manage forgetting explicitly; this might be achievable through the circuit-control-based schemes that have appeared in many recent papers. |
WLPlan: Relational Features for Symbolic Planning | |
GPT for Games: An Updated Scoping Review | This survey offers a well-structured perspective, introducing two noteworthy aspects. Firstly, the title clearly defines the scope (2020-2024), providing a focused temporal range that avoids an exhaustive historical review. Secondly, it presents a novel approach to literature selection, suggesting that the process itself can be a relevant research topic. While most current surveys are AutoSurveys, the method used here could inspire studies analyzing how literature for a survey topic is selected, based on previous reviews. The paper is divided into three main areas: 1) Game Generation, 2) Agent Creation in Games, and 3) Game User Research. In the Game Generation section, the study summarizes methodologies that generate entire game content based on frameworks like stories or programs, covering granularity levels from stories and missions to levels and characters. It also discusses design development through user prompts and interaction with large language models (LLMs), where LLMs primarily serve as tools for quickly generating various layouts and mechanisms. In the context of interactive gameplay, the paper likens this approach to tabletop RPGs, where LLMs provide story content, user experience enhancements, and real-time creative support. This field shows significant potential, with only around 30 papers selected for review, and many appear to be standouts in a limited field. Research in game user studies also appears sparse, with only a few papers in this category. |
Project Sid: Many-agent simulations toward AI civilization | This paper proposes a mega-scale Stanford Town. |
GameGen-X: Interactive Open-world Game Video Generation | This paper presents the OGameData dataset, which supports text-to-video generation and video continuation tasks, enabling models to generate high-quality, open-domain game videos with long sequences. It integrates character interaction and scene content control within video generation. As one of the earliest works in China paralleling Google’s Genie, the follow-up results appear promising. After reviewing their promotional video, minor issues in scene transitions were observed, though overall performance is impressive. The keyboard control feature is notably commendable. |
Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models | |
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models | Based on SAE training, an inclination parameter t is introduced to encourage the model to better represent tail concepts. |
Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling | The impact of long-tail data is highlighted, and this phenomenon is indeed quite pronounced. During the model's self-improvement process, oversampling simple queries and undersampling complex ones lead to a concentration of the distribution on high-probability data. High-probability data coincidentally aligns with the model's internal common patterns, ultimately resulting in mode collapse. We have also worked on self-improvement, and we find that this issue is both common and significant. The solution presented in this paper appears intuitive, using various forms of guidance to identify and resample tail data. Based on personal experience, seemingly less elegant methods can often be insightful and practical; in this case, "less elegant" refers to the use of four types of guidance by Professor Huang and Professor Guitao, which lack obvious intrinsic logical connections. The ablation study suggests that the proposed state-reset approach is generally more effective, where this state reset is somewhat similar to reverting to a prior reasoning step after multiple unsuccessful attempts at the current step. |
Physics in Next-token Prediction | Recently, TeleAI has published quite a few works that may not be highly effective but are quite imaginative, such as SentenceVAE and a collaboration with BAAI on continuously scaling model pre-training up to 1 trillion parameters. In this paper, a new formula is proposed to quantify the energy consumption required for information transmission when Next-token Prediction (NTP) is viewed as an information compression process. Consistency with the OAI Scaling Law is also derived. |
Self-Evolved Reward Learning for LLMs | This work proposes a self-evolved reward learning approach. The key innovation compared to SPIN and previous methods is that this approach self-evolves the RM through a feedback loop using the RM itself. The LLM serves as the RM, generating feedback on the dataset that is subsequently used to refine its own learning. This iterative "feedback-then-train" loop allows the RM to self-evolve over time, gradually improving its performance. It can also generate high-quality preference data and reduce reliance on human-annotated data. The "self-improving" topic is finally gaining momentum and becoming more popular in the field. |
Constant Acceleration Flow | The work on Diffusion really involves a lot of explicit physics concepts. |
Generalizability of Memorization Neural Networks | The paper presents a systematic theoretical analysis of generalizability of memorization neural networks. It provides a formula modeling the minimum number of parameters required to memorize any dataset sampled i.i.d. The research demonstrates that some commonly used memorization networks do not have generalizability even if the dataset is drawn i.i.d. from a data distribution and contains a sufficiently large number of samples. This work also provides complexity analysis. Recommended reading for those interested in interpretability studies. |
Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement | The fundamental concept is that after modeling the code trees from the majority of GitHub repositories, GitHub can theoretically serve as a simulation environment for scaling RL. This approach is viable because GitHub contains comprehensive data with natural interaction records, even after filtering. Compared to conventional RL environments, it additionally provides codebase summaries, essentially functioning as a summarization based on global observations. The LingMa paper collected approximately 90,000 PRs from 4,000 repositories. The data underwent filtering to ensure code change quality and relevance, and then followed STaR's approach for training, implementing a fixed three-stage CoT (Chain of Thought) training framework: repository comprehension, fault localization, and patch generation. They employed their classic rejection sampling method, using two metrics - fault localization accuracy and patch similarity - for data filtering to ensure high-quality synthetic data. This suggests promising potential for scaling Decision Transformer/RL using standard code bases as initialization. |
Mastering the Craft of Data Synthesis for CodeLLMs | A comprehensive survey on CodeLLM data processing published by Oracle. |
Interpretable Language Modeling via Induction-head Ngram Models | A notable contribution that builds upon infini-gram, which computes next-word probability distributions through longest-suffix matching in reference corpora (a toy illustration of this matching step follows this table). The research introduces induction heads and employs custom neural similarity metrics to efficiently search for potential next-word completions in the input context. This enables Induction-Gram to provide ngram-level justification for each generated word, and in turn allows a coarse-grained evaluation of how language models predict subsequent words. |
Evolving Alignment via Asymmetric Self-Play | The paper presents a combination of RLHF and evolutionary approaches, essentially layering evolution over RLHF without substantially addressing the inherent preference-modeling issues in RLHF. The work's notable insight lies in simultaneously optimizing both the Creator's generation strategy and the Solver's response strategy. This suggests a broader application beyond Creator roles: one potential direction for scaling RL on pretraining corpora involves a model with basic text comprehension, where a Rewriter & Creator fits a question set to the original pretraining corpus so as to cover all of its essential information with as few questions as possible, while the other component, similar to their Solver, focuses on problem optimization. The paper's implementation of minimax-regret, increasingly referenced in recent multi-agent work, merits review. While the paper's core claimed contribution is evolving previously uncovered prompts to encompass more scenarios, this advancement might be considered incremental. |
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks | A benchmark for evaluating planning and reasoning in human-robot collaboration tasks. |
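As a tiny illustration of the infini-gram-style longest-suffix-match step that Induction-Gram builds on (the induction-head and neural-similarity components are not shown), here is a linear-scan toy; a real implementation would use a suffix array over a huge reference corpus.

```python
from collections import Counter

def longest_suffix_next_token(context, corpus, max_n=8):
    """Return the next-token distribution conditioned on the longest suffix of
    `context` that occurs in `corpus` (both are lists of token IDs)."""
    for n in range(min(max_n, len(context)), 0, -1):
        suffix = context[-n:]
        followers = Counter(
            corpus[i + n]
            for i in range(len(corpus) - n)
            if corpus[i:i + n] == suffix
        )
        if followers:
            total = sum(followers.values())
            return n, {tok: count / total for tok, count in followers.items()}
    return 0, {}

corpus = [1, 2, 3, 4, 2, 3, 5]
print(longest_suffix_next_token([9, 2, 3], corpus))  # matches suffix [2, 3]: {4: 0.5, 5: 0.5}
```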
If you are interested in the work published by us, please navigate to our full paper list.