Paper | Comments |
---|---|
Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models | A benchmark from the IDEA Research Institute focused on the financial domain; it mainly evaluates discriminative ability and static properties. The Stock Movement Prediction section is worth reviewing to decide whether it should be added to our database. For the other sections, simply reviewing the corresponding knowledge would likely yield significant improvements, since they primarily rely on memorization. |
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework | A multimodal long-document benchmark scored on a scale with a maximum of 5. Judging from the reported results of Qwen2-VL, Gemini, Claude-3.5-Sonnet, and GPT-4o, the benchmark appears to have substantial value. It consists of 851 samples, each containing documents hundreds of pages long. During testing, the top 5 pages returned by the retriever are used, and the model must answer based on those 5 pages. In this setting, GPT-4o, Sonnet, and Gemini may have been constrained by the retriever, while Qwen2-VL might genuinely perform poorly; other open-source models might perform even worse. The benchmark appears to have good discriminative ability and could serve as a direction for optimization. |
UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts | A valuable mathematical benchmark crawled from OEIS. Each OEIS entry includes a brief mathematical description, example integers, generation rules, related materials, and the rules for computing the integer sequence. The subjects involved include graph theory, group theory, formal languages, and many others. The benchmark requires the model's generated solutions to pass unit tests, and the probability of passing in a single attempt is evidently very low. It also exercises the model's ability to formalize mathematics and translate it into code. Recommended for incorporation into the inference-rules section of our static OpenBenchmark. |
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios | The first paper I have seen from iQIYI, and it takes a good angle: instruction-following evaluation (in the spirit of IFEval) in long-context scenarios. The benchmark shows clear discriminative ability, with a strong correlation to model size. |
Counterfactual Generation from Language Models | The paper begins by reformulating the language model as a Generalized Structural Equation Model (GSEM). The main addition in the formulation is the noise distribution that connects the model state to the final generated token; Definition 2.1 introduces the variables U and V for this. The paper models this randomness explicitly, separating the random components from the deterministic ones, and derives an expression for the latent noise variables underlying the observed and controllable strings. The noise is then inferred by comparing the logits of different tokens, and, based on the modeled distribution, it is constrained within certain limits and re-injected into the generated logits to produce counterfactual strings (a minimal sketch of this noise-inference step follows this table). The potential of this technique is considerable, as the noise distribution could model a wide range of phenomena. The experiments are somewhat limited, but the theoretical foundation is solid and the approach is ingenious. Recommended reading. |
Continual Memorization of Factoids in Large Language Models | This paper explores a valuable problem: how large language models (LLMs) retain long-tail factual knowledge during continual pretraining or fine-tuning. The importance of this issue lies in how commonly it arises during supervised fine-tuning (SFT). Visualizing the distribution of sentence embeddings from an SFT dataset (OpenHermes) after sampling from the Cosmopedia data shows that a large portion of the data remains uncovered. Covering these edge-case data along with the normal SFT data could yield a model that, similar to Free Lunch, benefits from improved coverage across different QA scenarios. In this paper, the focus is more on preventing forgetting by mixing randomly generated word sequences or pretraining data into each training stage; the intuition seems to be avoiding large distributional differences. Since the experiments mainly use factual datasets, the results show that K-Pile (knowledge-related) performs better. It would be interesting to see a more detailed analysis of the impact of distribution shifts. Beyond common solutions such as current SFT and various alignment techniques, there may be smoother methods for transitioning from base models to SFT. |
More Expressive Attention with Negative Weights | The paper introduces a novel attention mechanism called Cog Attention, which allows attention weights to take negative values, thereby enhancing expressive power. This idea seems to have been mentioned by Huang Wenhao. Two tricks support the approach: subtracting the maximum absolute value inside the exponential to avoid numerical overflow, and normalizing by the sum of the absolute values of the results to avoid NaN errors from division by zero (a small sketch of both tricks follows this table). The authors validate Cog Attention on a 141M-parameter LLM and an image generation model, showing that it outperforms traditional Transformers. The core capability claimed for Cog Attention is that a single head can simultaneously delete, copy, or retain tokens, for example in indirect object identification tasks, which significantly enhances the expressiveness of an individual attention head. Figure 2 explains clearly how this works, and Figure 5 shows a distribution worth examining. [Innocent smile] It's unclear whether there are any plans for a quick scaling experiment or internal reports on similar ideas, but this seems like an interesting direction and aligns with findings in neuroscience, particularly in the paper by Professor Rui Yan from Renmin University. |
The Super Weight in Large Language Models | The paper identifies a phenomenon called "Super Weights" in large language models (LLMs): a small number of parameters that have disproportionately large importance for model quality. The detection method is data-agnostic, avoiding any suspicion of coupling with specific data, and locates Super Weights by detecting peaks in the input and output activation distributions of certain layers during a single forward pass. The paper also emphasizes the strong correlation between Super Weights and Super Activations: "Super weights are often found in an early layer's down projection," and they quickly give rise to Super Activations, which persist throughout the model with the same magnitude and position, carried forward by skip connections. Super Activations help suppress the likelihood of stop words. The paper's focus is on combining quantization with Super Weights. The discovery feels quite significant, as studying what exactly is stored within these heads and weights is also crucial. |
Generative Adapter: Contextualizing Language Models in Parameters with A Single Forward Pass | This paper presents a very clear approach to LLM personalization by training a lightweight Generative Adapter on top of a pre-trained language model (LM) to generate layer-wise additive updates to the parameters. For each Transformer block, the Generative Adapter uses the outer product of the past contextual hidden states from the corresponding base LM layer to generate incremental weights. By generating and accumulating these adapters layer by layer, the model can adapt to new contexts online. The Generative Adapter is trained using two self-supervised tasks, reconstruction and completion, to ensure that the generated adapters can effectively leverage contextual information. |
Quantifying Artificial Intelligence through Algebraic Generalization | The paper introduces the theory of algebraic circuit complexity and provides a concise framework for evaluating the algebraic generalization ability of AI systems. It looks like an excellent sandbox for mechanistic interpretability research. It would be worth thinking further about the quantifiable scope this approach covers and its correspondence with formal grammars. Compared to the work of Professor Allen-Zhu, this framework seems to broaden the scope of studying model generalization considerably. |
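
For the counterfactual-generation entry above, here is a minimal sketch of the kind of noise-inference step it describes, assuming a Gumbel-max formulation of the sampling noise (the function, the toy logits, and the use of NumPy are my own illustration, not the paper's code): infer posterior Gumbel noise consistent with the token that was actually generated, then replay the same noise under intervened logits to obtain the counterfactual token.

```python
import numpy as np

def posterior_gumbel_noise(logits, observed_token, rng):
    """Sample Gumbel noise conditioned on `observed_token` being the argmax of
    logits + noise (the standard truncated-Gumbel construction)."""
    logits = np.asarray(logits, dtype=np.float64)
    z = rng.gumbel(loc=np.logaddexp.reduce(logits))      # value of the maximum
    g = rng.gumbel(loc=logits)                           # unconstrained proposals
    g = -np.log(np.exp(-z) + np.exp(-g))                 # truncate so nothing exceeds z
    g[observed_token] = z                                 # the observed token attains the max
    return g - logits                                     # pure noise u with argmax(logits + u) fixed

rng = np.random.default_rng(0)
p_logits = np.array([2.0, 0.5, -1.0])    # factual next-token logits (toy)
q_logits = np.array([0.5, 2.0, -1.0])    # intervened logits (toy)

observed = 0                              # token actually generated under p_logits
u = posterior_gumbel_noise(p_logits, observed, rng)
assert int(np.argmax(p_logits + u)) == observed    # the noise reproduces the factual choice
print(int(np.argmax(q_logits + u)))                # counterfactual token under the same noise
```

Repeating this token by token with the noise held fixed is what turns an intervention on the model into a counterfactual string rather than merely a fresh sample.
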
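
And for the Cog Attention entry, a small sketch of the two numerical tricks as the comment describes them; this is one plausible reading of a signed attention normalization, not the authors' implementation.

```python
import torch

def signed_softmax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Keep the sign of each score, exponentiate magnitudes after subtracting the
    maximum absolute value (overflow trick), and normalize by the sum of absolute
    values so the denominator cannot reach zero (NaN trick)."""
    sign = torch.sign(scores)
    shifted = scores.abs() - scores.abs().amax(dim=dim, keepdim=True)
    num = sign * torch.exp(shifted)
    den = num.abs().sum(dim=dim, keepdim=True).clamp_min(1e-12)
    return num / den

q, k, v = (torch.randn(2, 4, 8) for _ in range(3))        # (batch, seq, head_dim), toy sizes
weights = signed_softmax(q @ k.transpose(-2, -1) / 8 ** 0.5)
out = weights @ v                                          # some weights are negative
print(weights.min().item() < 0, weights.abs().sum(-1))     # magnitudes still sum to 1
```
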
Paper | Comments |
---|---|
LBPE: Long-token-first Tokenization to Improve Large Language Models | This paper introduces LBPE, an algorithm that prioritizes merging relatively long tokens during the encoding phase. Specifically, LBPE applies merges based on the inverse ranking of token length rather than the vocabulary ranking, which increases the frequency of longer tokens in the final token sequence and thereby balances the frequencies of tokens of different lengths (a toy sketch of the long-token-first idea follows this table). Judging from the reported improvements, the idea seems effective at minimal cost, so a quick merge to verify its impact looks feasible. It may address an existing issue that was previously considered minor. |
Fox-1 Technical Report | A small model trained by Professor Tong Zhang's team at UIUC; more and more researchers are now trying to train their own small-scale LLMs, which is actually a positive sign. The work uses a Curriculum Learning approach similar to Ziya2, and the benefit seems to come from a multi-stage shift in the data distribution. |
How Good is Your Wikipedia? | This paper presents an effective ablation study, noting that Wikipedia may not always be a high-quality resource for low-resource languages. The authors categorize languages into four tiers and examine the impact of initial data filtering (primarily language filtering and deduplication) and heuristic rule filtering. The highlight is that in low-resource language settings, Wikipedia should indeed be filtered carefully. The threshold setting is based on kernel density estimation, which, in my opinion, is not very convincing and could be a potential issue. |
Balancing Pipeline Parallelism with Vocabulary Parallelism | A training-infrastructure paper from Professor Min Lin's group. They propose Vocabulary Parallelism, which essentially partitions the vocabulary layer along the vocabulary dimension and distributes it evenly, allowing the vocabulary layer's computation to be expressed similarly to the forward/backward passes of Transformer layers. Although I'm not an expert in infrastructure, this seems quite novel and sounds fairly plug-and-play. The only question is how severe the performance degradation might be. I agree with finding a student to give it a try. |
Benchmarking Distributional Alignment of Large Language Models | This paper analyzes whether LLMs can accurately simulate the distribution of viewpoints within specific demographic groups. I previously referred to this as empathy, as in GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models. I personally believe this is a highly important capability in practical use: being able to infer a user's identity role and then adapt responses accordingly, essentially emotional intelligence. It seems like a direction worth analyzing and fine-tuning for. Another interesting finding in GIEBench is that the "emotional intelligence" of models does increase with size. The paper provides several useful datasets: OpinionQA, GlobalOpinionQA, and NYT Book Opinions, which are worth reviewing. |
ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles | A challenging video benchmark, comprising a parallel corpus of 254 Wikipedia articles on STEM topics along with over 300 hours of American Sign Language (ASL) interpretation videos. However, it may require models to understand sign language, which might not be a realistic expectation for current MLLMs, as it overly couples them with a specialized domain capability. Scientific explanations in sign language demand extensive understanding of visual details and a certain level of reasoning. With well-provided additional QA, this could be an excellent resource. |
LLM-PySC2: Starcraft II learning environment for Large Language Models | This work presents a gaming environment for LLMs to play StarCraft, with implementation methodology similar to Alibaba's approach of using MLLM for gaming. Two noteworthy observations emerge from this study: 1. While larger LLM variants can generate syntactically correct text actions, they demonstrate suboptimal performance in complex tasks. 2. Although sufficient model parameters are necessary, enhanced reasoning capabilities do not directly translate to improved decision-making outcomes. For instance, GPT-4 exhibited superior performance in most experiments but failed to achieve victory in certain simple tasks. This suggests that pre-trained large models cannot directly handle complex decision-making tasks, and learning within the deployment environment appears to be virtually inevitable. |
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks | This can be considered the audio equivalent of BBH (Big-Bench-Hard), with SUPERB incorporating a substantial addition of music and environmental audio content in this iteration. This development is particularly noteworthy as M-A-P is currently in the process of developing AudioFLAN, which now encompasses over 200 tasks. |
Recycled Attention: Efficient inference for long-context language models | The method alternates between full-context attention and partial-context attention during inference. During partial attention steps, it recycles the full attention pattern from the previous token and focuses only on the K most attended tokens, thereby reducing data movement and attention computation costs. The timing of full attention steps is determined based on query embedding similarity. The fundamental effectiveness of this approach is inherently linked to the natural occurrence of semantic blocks in expression, where tokens within semantic blocks typically exhibit high similarity. The method requires some degree of continued pre-training, though it can function without it at the cost of minor performance degradation. |
Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models | The study presents an automated evaluation approach for image-to-text generation using diffusion models. The core methodology is to regenerate images from the produced descriptions via a diffusion model, extract features from both the original and regenerated images, and measure their cosine similarity (a minimal sketch follows this table). High similarity indicates accurate text descriptions, while low similarity reveals potential weaknesses in model performance. The approach shows promise; if the similarity critic is well calibrated, it could yield significant results. |
VISTA: Visual Integrated System for Tailored Automation in Math Problem Generation Using LLM | The core insight of this work appears to be the potential for generating large-scale reasoning problems requiring visual details and geometric diagrams through the mediation of code and LLMs, despite being framed within a multi-agent narrative. This presents a potentially scalable approach for synthetic data generation. |
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images | Tencent's 3D character generation model introduces the Anime3D++ dataset, comprising 10,811 quality-controlled three-dimensional anime character models, all standardized to fixed poses. Methodologically, the paper emphasizes two key aspects: first, the decoupling of control signals, and second, multi-granular reconstruction, which involves initially generating multiple A-pose RGB images and normal maps, followed by iterative optimization of character quality. |
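
For the LBPE entry above, a toy sketch of the long-token-first idea referenced there. LBPE itself reorders which BPE merges get applied at encoding time by token length; this greedy longest-match tokenizer (vocabulary and example are mine) only illustrates the spirit of preferring long tokens over the default merge ranking.

```python
def long_token_first_encode(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match encoding: at each position, take the longest
    vocabulary entry that matches, falling back to a single character."""
    tokens, i = [], 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        piece = text[i]                                   # fallback: single character
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in vocab:
                piece = candidate
                break
        tokens.append(piece)
        i += len(piece)
    return tokens

vocab = {"un", "relat", "related", "ed", "ly"}            # toy vocabulary
print(long_token_first_encode("unrelatedly", vocab))      # ['un', 'related', 'ly']
```
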
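
And for the Image2Text2Image entry, a minimal sketch of the caption-to-regeneration-to-comparison loop, with Stable Diffusion and CLIP as stand-ins for whatever generator and feature extractor the paper actually uses.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Stand-in checkpoints; the paper does not prescribe these exact models.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image2text2image_score(original: Image.Image, caption: str) -> float:
    """Regenerate an image from the model-produced caption and compare CLIP
    image features of the original and regenerated images via cosine similarity."""
    regenerated = pipe(caption).images[0]
    inputs = processor(images=[original, regenerated], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return torch.nn.functional.cosine_similarity(feats[0], feats[1], dim=0).item()

score = image2text2image_score(Image.open("photo.jpg"), "a dog playing in the snow")
print(score)   # high -> caption likely captured the image; low -> caption may be off
```
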
Paper | Comments |
---|---|
Scaling Laws for Precision | This study investigates the impact of low-precision training and inference on language model quality and cost, and proposes a "precision-aware" scaling law. It primarily focuses on the effects of quantization. The dataset used throughout is Dolma V1.7, and GPTQ is applied for post-training quantization, with comparisons made to other quantization methods. While I feel that the deeper implications of the proposed law still warrant further exploration, some significant takeaways were validated: common post-training quantization techniques can lead to substantial degradation in model performance, particularly after extensive data pretraining. This suggests that more pretraining computation does not necessarily result in a stronger model (though this claim seems somewhat less solid). The study found that training costs are linearly related to precision, and that low-precision training can reduce computational costs while maintaining loss stability. However, the paper does not delve deeply into the dynamics of using different precisions at different stages. Although this aspect wasn't explicitly addressed, based on the observations made in the study, it could be a promising direction for future research. |
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding | This study is relatively straightforward and easy to understand, not overly ostentatious compared to recent studies on process supervision. The authors propose LaTent Reasoning Optimization (LaTRO), which utilizes the model's own probability estimates as a reward function. By optimizing high-quality reasoning paths, they achieve a smoothing effect on the reward probability distribution when moving from the query to the response through a specific path. This reduces the extremity of rewards corresponding to different reasoning paths. However, if errors are significant or occur later in the process, as mentioned in the paper, the reward will approach zero. The concern here is that errors occurring towards the end might pose a hidden risk, potentially affecting reasoning performance. Currently, most papers in this field include both greedy and self-consistency settings. Recommended reading. |
CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models | This is a study exploring the Multi-Agent framework, with two important takeaways: 1. All tasks performed by the Critic Agent are crucial for tree expansion and solution search, with tasks related to node termination and solution verification showing the most significant impact. Making mistakes is not a concern; debugging is crucial. 2. Exploring diversified strategies is more effective than iterative optimization based on a single solution. |
BitNet a4.8: 4-bit Activations for 1-bit LLMs | BitNet a4.8 is a technique enabling 4-bit activation for 1-bit large language models (LLMs). Specifically, it applies 4-bit quantization to the inputs of the attention and feed-forward network layers, while the intermediate states are sparsified and quantized to 8-bit. During training, a two-phase process is employed to gradually transition from 8-bit to 4-bit, utilizing gradient approximation combined with mixed-precision training to update the parameters. |
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models | This multimodal MoE model differs from MoE in the conventional sense and is closer to the mixed training of three models. MoT achieves modality-specific processing by decoupling the non-embedding parameters of the model (including the feed-forward network, attention matrices, and layer normalization) while retaining a global self-attention mechanism. In MoT, each modality (text, image, speech) has its own independent set of non-embedding parameters, such as feed-forward networks, attention projection matrices, and layer normalization, while a global self-attention mechanism is applied across all modalities to capture cross-modal relationships. I had actually proposed a similar idea before, and I still think it should somehow be feasible. For the speech modality, models like SoundStream and Encodec use codebooks that could be treated similarly in this context: build an MoE model with eight codebooks corresponding to eight experts while maintaining global attention. My assumption is that this could save substantial computation, and based on my experience with pretraining MERT, I feel the performance loss would be smaller than with some of the current common processing methods for speech. |
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models | Similar to MAP-Neo, the greatest value of this work lies in data preparation, creating heuristic filters, and conducting various pre-training data analyses. With this groundwork, the performance is roughly on par with Qwen-2.5-Coder of the same model size across various leaderboards. Today, all comprehensive rules and code will be released, and processed pre-training data, SFT (supervised fine-tuning) data, and other related data will be gradually made available. Figure 3, in fact, reflects a clever personal approach to studying heuristic rules in pre-training, since these transparent LLM projects often lack sufficient GPU resources and credits to run extensive training and ablation studies; figuring out whether adding a particular rule is effective thus becomes a philosophical question. At that time, I devised a somewhat unconventional but potentially inexpensive validation method (a minimal sketch follows this table): 1. Generate embeddings for the pre-training dataset before adding new heuristic rules, then visualize the distribution using PCA. 2. Project the dataset filtered with the new heuristic rules onto the previous PCA distribution to observe which data points have been removed [this can be further quantified at a finer granularity]. 3. Randomly inspect clusters whose density has significantly decreased, as shown in Figure 3, to verify whether they align with the expected data removal. 4. Perform sample annotation to assess the rate of false positives. Empirical conclusion: if the false positive rate is below 5%, the rule can be applied directly without further verification; otherwise, additional consideration is needed. Many of MAP-Neo's rules were adopted in this manner without further training, but at that time MAP-Neo was still somewhat immature and overly aggressive in its filtering. I hold the work on these transparent models in high regard, as the underlying principles are quite simple. Personally, I believe there's no point in hiding many incremental tricks: in the realm of LLMs, model capabilities improve significantly over time, and as for pre-training data, no matter how well the rules are refined, leaderboard performance could easily be surpassed by approaches like DCLM and FineWeb-Edu, which directly fit downstream tasks using fastText [not that I endorse this method]. It's better to release the methods for public discussion and critique, while also boosting the reputation and visibility of one's models. There are two major takeaways regarding two-stage SFT and GitHub stars: 1. Two-stage SFT is effective. 2. GitHub stars are a significant pitfall as a heuristic rule; in short, avoid filtering on this metric. |
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI | A very meaningful mathematical benchmark developed in collaboration with more than 60 mathematicians, resulting in hundreds of original and highly challenging problems. The problems cover various branches of modern mathematics and can be automatically validated, with answers typically being integer solutions or SymPy objects. Current accuracy rates are extremely low. It is a highly valuable dataset for competition-style mathematics. |
GUI Agents with Foundation Models: A Comprehensive Survey | The summary of the data source section feels like a rather useful cheat sheet and is quite comprehensive. |
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks | Microsoft's Magentic-One, a general multi-agent system, aims to address complex tasks and utilizes GAIA, AssistantBench, and WebArena as test sets. It employs a leading agent (Orchestrator) to plan, monitor progress, and replan to recover from errors. Other specialized agents perform specific tasks as needed, such as operating web browsers, navigating local files, or writing and executing Python code. |
Analyzing The Language of Visual Tokens | Interesting perspective. Essentially, it trains GloVe on tokens extracted from visual data and then examines similarities across various topological structures. It shows that although visual languages follow Zipf's law to some extent, the higher frequency of novel tokens and the lower compression rates indicate a more dispersed distribution of information. The lack of grammatical structure and hierarchical organization in visual languages leads to higher perplexity and weaker hierarchical structure. Recommended reading. (But the formatting isn't pleasing; maybe rushed?) |
Clustering in Causal Attention Masking | The general conclusion should be that, theoretically, tokens are proven to converge into a single cluster, and the existence of metastable clusters within the model is confirmed. Here, what is referred to as "causal attention" means that each token can only interact with the preceding tokens, ensuring the correctness of the generation order. |
HourVideo: 1-Hour Video-Language Understanding | This study selected 500 egocentric videos from the Ego4D dataset, with durations ranging from 20 to 120 minutes, so not all of them exceed one hour. The task coverage feels very comprehensive, including summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval). The average Gemini score is around 30, which leaves significant headroom and makes this a benchmark worth optimizing toward. |
SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference | A rather ingenious model-free take on speculative decoding: each prompt-response pair is segmented into a sequence of token IDs, and all possible suffixes of these sequences are extracted to construct a suffix tree. Each node in this tree represents a token, and the path from the root to any given node corresponds to a subsequence that appears in the training data. Given any pattern, the method quickly identifies possible continuation paths in the suffix tree, then extracts a smaller subtree, which is ultimately used to accelerate inference (a toy sketch follows this table). |
Kwai-STaR: Transform LLMs into State-Transition Reasoners | A preliminary o1-style study on mathematics. It defines five actions: Formalize, Decompose, Solve Subproblem, Solve Parent, and Summarize, and then uses the STaR approach for fine-tuning (LoRA). The overall information content is not large; the performance is satisfactory, but particularly challenging benchmarks were not tested. |
Vision Language Models are In-Context Value Learners | Generative Value Learning (GVL). Essentially, GVL allows the Vision-Language Model (VLM) to generate globally consistent value estimates by providing the entire trajectory as input. However, GVL also requires the VLM to focus on individual frames and output accurate value predictions by randomly shuffling the input frames. The main claim of this paper is that it can overcome temporal bias. I briefly explored the two datasets they used for evaluation: the Open X-Embodiment and ALOHA datasets. Overall, it is quite informative and is recommended for reading. |
Scaling Laws for Pre-training Agents and World Models | This study investigates the impact of scale on world modeling and behavior cloning by utilizing generative pretraining losses on large-scale datasets. Specifically, behavior cloning, which involves predicting actions, and world modeling, which involves predicting the outcomes of actions [i.e., images], are defined as two tasks. The findings reveal that the trade-off between model and dataset size is influenced by the compression rate of the tokenizer, task type, and architectural choices. Additionally, it was discovered that using continuous embeddings (i.e., images) as the objective leads to a rapid increase in the ideal model size for the corresponding dataset, indicating a higher learning difficulty. In contrast, tokenization reduces the learning difficulty for models. Recommended reading. |
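
For the OpenCoder entry's filter-validation trick, a minimal sketch of the embed, PCA, and density-drop workflow described in the numbered steps above (the embedder and the grid-based density comparison are my own stand-ins; the original relied on visual inspection of the PCA plot).

```python
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

def filter_shift_report(docs, kept, n_bins=50):
    """`docs` is the corpus before the new heuristic rule, `kept` the subset that
    survives it. Returns per-cell density drop in PCA space plus the cells to inspect."""
    model = SentenceTransformer("all-MiniLM-L6-v2")        # stand-in embedder
    emb = model.encode(docs, convert_to_numpy=True)
    pca = PCA(n_components=2).fit(emb)                     # fit on the pre-filter corpus
    before = pca.transform(emb)
    after = pca.transform(model.encode(kept, convert_to_numpy=True))

    edges = [np.linspace(before[:, i].min(), before[:, i].max(), n_bins + 1) for i in (0, 1)]
    h_before, _, _ = np.histogram2d(before[:, 0], before[:, 1], bins=edges)
    h_after, _, _ = np.histogram2d(after[:, 0], after[:, 1], bins=edges)
    drop = (h_before - h_after) / np.maximum(h_before, 1)  # fraction removed per cell
    worst = np.column_stack(np.unravel_index(np.argsort(drop, axis=None)[::-1][:10], drop.shape))
    return drop, worst                                     # sample texts from `worst` cells by hand
```

The cells with the largest density drop are the clusters to sample for false-positive annotation, matching the roughly 5% acceptance threshold mentioned in the comment.
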
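
And for the SuffixDecoding entry, a toy suffix-trie version of the idea; the real system builds a proper suffix tree with frequency statistics and subtree extraction, so the names and the greedy most-frequent-child walk here are my simplification.

```python
from collections import defaultdict

class SuffixTrie:
    def __init__(self):
        self.children = defaultdict(SuffixTrie)
        self.count = 0

    def add_sequence(self, ids):
        """Index every suffix of a past token-ID sequence."""
        for start in range(len(ids)):
            node = self
            for tok in ids[start:]:
                node = node.children[tok]
                node.count += 1

    def propose(self, pattern, max_draft=8):
        """Walk down along `pattern`, then follow the most frequent children to
        build a draft continuation for the target model to verify."""
        node = self
        for tok in pattern:
            if tok not in node.children:
                return []
            node = node.children[tok]
        draft = []
        while node.children and len(draft) < max_draft:
            tok, node = max(node.children.items(), key=lambda kv: kv[1].count)
            draft.append(tok)
        return draft

trie = SuffixTrie()
trie.add_sequence([5, 7, 9, 7, 9, 11])
print(trie.propose([7, 9]))    # -> [7, 9, 11] for this toy sequence
```
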
Paper | Comments |
---|---|
SPARSING LAW: TOWARDS LARGE LANGUAGE MODELS WITH GREATER ACTIVATION SPARSITY | An excellent paper with several key takeaways: 1) Different activation functions (ReLU and SiLU) exhibit similar performance but opposite trends in sparsity during training. ReLU's sparsity increases with more training data, while SiLU's sparsity decreases. 2) Below a certain bottleneck point, activation ratios increase linearly with width-to-depth ratios, indicating deeper architectures have potential advantages under fixed parameter budgets. 3) At similar width-to-depth ratios, the limit of activation sparsity shows weak correlation with parameter scale, implying that LLMs' internal activation patterns are insensitive to parameter size. |
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models | A robust multimodal long-text benchmark developed by AI2, simulating comparative analysis workflows in scientific research. The benchmark requires cross-referencing among relevant documents. The results are highly challenging, and it could be worth considering for integration. |
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models | This paper composes polynomial functions with standard activations (a rough sketch follows this table), demonstrating significant improvements in loss and benchmark performance. It remains to be seen how these results scale. The theoretical sections are somewhat hard to follow. |
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination | Data leakage appears more prevalent in MLLMs than in LLMs. A related empirical observation: MLLMs can often guess related video content from a single frame. When guessing works, they succeed; when dense visual detail is actually needed, they fail even when given more frames. |
AutoGameUI: Constructing High-Fidelity Game UIs via Multimodal Learning and Interactive Web-Based Tool | Tencent leverages multimodal learning and interactive web tools to efficiently build game UIs. Their approach models the latent space of UI/UX design, computes matching probabilities, and applies attention mechanisms and planning algorithms to finalize UI/UX matches. The GAMEUI dataset, comprising 42 game UIs, is noteworthy alongside the RICO dataset, both of which deserve further attention. |
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | This benchmark offers 900 selected YouTube videos across eight categories such as lifestyle, sports, and education, and uses a hybrid annotation pipeline to generate high-quality QA pairs. StreamingBench covers 18 tasks, 900 videos, and 4,500 curated QA pairs, with five questions per video placed at different time points to simulate streaming scenarios. Most videos are under 10 minutes. Moreover, recent YouTube content is likely to contain cases unseen by models during training. |
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level | This work by Huawei and Prof. Jun Wang's group presents an automated data scientist framework. It aligns with a recent surge of similar studies, signalling the resurgence of AutoML. |
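
For the polynomial composition activations entry, a rough sketch of what such an activation could look like, assuming a PolyReLU-style form that sums learnable-weighted powers of ReLU(x); this is my reading of the title, not necessarily the authors' exact definition.

```python
import torch
from torch import nn

class PolyReLU(nn.Module):
    """y = sum_i a_i * relu(x)**i with learnable coefficients a_i."""
    def __init__(self, degree: int = 3):
        super().__init__()
        self.coeff = nn.Parameter(torch.zeros(degree + 1))
        with torch.no_grad():
            self.coeff[1] = 1.0                      # start close to a plain ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r = torch.relu(x)
        powers = torch.stack([r ** i for i in range(self.coeff.numel())], dim=-1)
        return powers @ self.coeff

mlp = nn.Sequential(nn.Linear(16, 64), PolyReLU(degree=3), nn.Linear(64, 16))
print(mlp(torch.randn(2, 16)).shape)                 # torch.Size([2, 16])
```
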
Paper | Comments |
---|---|
On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models | This paper proposes decoupling semantic conditions from other conditions and uses cosine weighting to adjust the contribution of low-level control conditions. It introduces weight transfer strategies from pre-trained models to larger datasets and higher resolutions through interpolated positional embeddings, scaled noise scheduling, and stronger data augmentation. However, upon closer examination, the hyperparameters involved appear numerous and potentially difficult to tune, especially without extensive diffusion experiments. Insights from experts on effective tuning tricks in this context would be valuable. |
Discovering Data Structures: Nearest Neighbor Search and Beyond | This work demonstrates that neural networks can learn data structures from scratch that outperform traditional baselines on specific problems. The settings examined include uniform distributions, more challenging distributions, a Zipfian distribution, and a uniform distribution over a 30-dimensional unit hypersphere. Previously, the significance of such ML experiments was unclear, but this research does suggest practical applications, such as measuring Data Consuming Efficiency. Specifically, compared with traditional baselines like k-d trees or binary search, it remains to be seen how structured data could be used to verify the generalizability of new structures with minimal training. Research in this area still appears sparse but promising. |
DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models | This work combines VAE and diffusion models, employing a learnable Transformer pre-trained language model as the encoder to map structured text data to a latent space. Using reparameterization techniques, input data is encoded into latent features. The latent representations are then denoised in the latent space, and a noise removal network is trained to restore the original latent vectors. These features are finally injected into an LLM decoding process to generate high-quality, controllable synthetic data. Remarkably, Mistral models fine-tuned on synthetic data generated by DiffLM outperform those trained on real data in HumanEval and MBPP benchmarks. This novel approach could facilitate large-scale data generation and rewriting projects. |
Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation? | This paper introduces a benchmark dataset, Interaction2Code, for testing Code Agents in interactive scenarios. The dataset construction includes webpage selection, automated interactions, post-processing, and interaction extraction. If the data quality holds up, it could be a valuable dataset for evaluation purposes. Further assessment might be warranted to determine its utility. |
Inference Optimal VLMs Need Only One Visual Token but Larger Models | This intriguing paper, though not entirely conclusive, establishes a scaling law between LLM size and the number of tokens provided by a Vision Encoder during inference. Two parameters are introduced to represent LLM quality and visual information compression. Observations reveal a logarithmic-linear decline in performance as visual tokens decrease, but LLM parameters have a fivefold greater impact on downstream errors than the number of visual tokens. Thus, minimizing inference FLOPs is more effective by reducing visual tokens than LLM parameters. For visual reasoning, the optimal configuration is a large LLM with minimal visual tokens, while OCR and document understanding tasks require more visual tokens. The ablations on LLaVA-Onevision parameters are thorough and recommended for reading. |
Wave Network: An Ultra-Small Language Model | This paper, though perhaps lacking immediate practical applications, is conceptually interesting. It uses complex vectors to represent each token, encoding both global and local semantics. Specifically, complex vectors comprise magnitude vectors (global semantics) and phase vectors (relationships between tokens and global semantics). From a signal processing perspective, token embeddings are treated as discrete signals in the frequency domain, with magnitudes summed for global semantics and phase vectors for local relationships. Token representations are updated using complex vector operations, simulating wave interference (addition) and modulation (multiplication). The claim that token embeddings focus on local semantics, lacking direct global representation, is reasonable, but experimental validation is weak. Nevertheless, the approach is novel. |
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios | This paper reveals MLLMs' vulnerability to misleading instructions. It introduces explicit and implicit misleading prompts, such as instructions like "The correct answer is {incorrect option}," and constructs a new multimodal uncertainty benchmark (MUB) to evaluate susceptibility. Results indicate high susceptibility rates, with an average misleading rate of over 86% across MLLMs, and 27% even for simple explicit misleading scenarios. This method of assessment may be worth further exploration. |
Game Plot Design with an LLM-powered Assistant: An Empirical Study with Game Designers | GamePlot, an LLM-based tool, assists game designers in creating immersive narratives and refining them through collaborative gameplay testing. The most appreciated feature is the ability to modify plots during testing, followed by NPC summaries and multiplayer settings. Participants value content generation, content control, and editing capabilities. Domestically, a similar product is "Caiyun Xiaomeng," but user experience suggests simplification could reduce the entry barrier, especially for role-playing scenarios. In practice, users often want simplified interactive movie-like experiences rather than the complexity of traditional role-playing games. Challenges remain, particularly in RAG+Database construction and narrative immersion. Despite limitations, the concept holds significant potential. |
DroidSpeak: Enhancing Cross-LLM Communication | DroidSpeak cleverly reduces communication overhead in multi-agent LLM frameworks by selectively reusing intermediate data from the sender LLM, eliminating redundant computations. The approach requires multi-agent models to be at least partially homogeneous. |
Mixtures of In-Context Learners | MOICL divides a set of demonstrations into k subsets, trains k ICL experts, and combines their token predictions using a trainable weighting function. This concept is quite intriguing, and Ponti and Minervini's group consistently produces thought-provoking ideas. It is worth following. |
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection | This paper calculates the importance of KV caches for each attention head and introduces a voting mechanism to select a subset of critical KV cache tokens for computation. It designs a cache selection mechanism allowing similar queries to share selection results, reducing selection frequency and ensuring efficiency. |
Textual Aesthetics in Large Language Models | Furu's paper defines textual aesthetics, designs corresponding SFT data, and evaluates the concept. The aesthetics definition remains unclear, appearing to focus more on organization and layout or text coherence for identical semantic content, as illustrated in Figure 1. This is a novel issue worth revisiting. |
SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models | While the idea of this paper is promising, the improvements seem to enhance robustness rather than factuality. The approach leverages the divergence between early and final layer logits to approximate KL gradient, selecting tokens based on early layer approximations. The weighted average of these estimates informs adjustments to the final logits. |
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | This paper introduces a Code-Switching Curriculum Learning method for multilingual generalization, emulating human second-language acquisition through hierarchical training. It pre-trains with word-level code-switching data, advances with sentence-level data, and concludes with monolingual corpora. Consistent with internal findings, the experiments are relatively simple. |
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control | This paper identifies task-relevant components (e.g., attention heads) and leverages sparsity to achieve near-independent task control. LLM computations resemble a Directed Acyclic Graph (DAG), with output changes measured by replacing specific node activations. This perspective aligns with CoT reasoning and introduces a principal component-based trick. |
Fantastic LLMs for Preference Data Annotation and How to (not) Find Them | The paper introduces the "strong-weak hypothesis," suggesting that a larger preference gap between two LLMs yields a more accurate density-ratio reward function. The hypothesis is validated through experiments on 221 LLM pairs. The study uses log density ratios between well-aligned and poorly-aligned LLMs as reward signals to generate preference-aligned data (a minimal sketch follows this table). If effective, this method could yield substantial preference data. |
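
For the preference-annotation entry above, a minimal sketch of the density-ratio reward under two stand-in checkpoints (the model names and the candidate responses are mine; the paper's contribution lies in how the pair is chosen).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

strong = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # well-aligned
weak = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")          # poorly aligned
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

@torch.no_grad()
def response_logprob(model, prompt: str, response: str) -> float:
    """Sum of log-probabilities of the response tokens given the prompt
    (assumes the prompt/response boundary survives tokenization; fine for a sketch)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1]                      # position t predicts token t+1
    logps = torch.log_softmax(logits, dim=-1).gather(1, ids[0, 1:, None]).squeeze(1)
    return logps[prompt_len - 1:].sum().item()

def density_ratio_reward(prompt: str, response: str) -> float:
    return response_logprob(strong, prompt, response) - response_logprob(weak, prompt, response)

prompt = "How do I sort a list in Python?"
candidates = ["Use sorted(my_list) or my_list.sort(); sorted returns a new list.", "idk"]
ranked = sorted(candidates, key=lambda r: density_ratio_reward(prompt, r), reverse=True)
print(ranked[0], ranked[-1])     # top and bottom become a chosen/rejected preference pair
```
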
Paper | Comments |
---|---|
Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment | Recently, many colleagues have discussed a key issue in ORM: if the evaluation is based solely on the final result, the reward signals obtained may be very sparse. Therefore, the Reward Model needs to generate different reward signals for different responses, which encourages learning from some responses that may not be entirely correct but contain reasonable information. The core insight of this paper from Alibaba lies in adaptively identifying important information and converting sample-level supervision into fine-grained, subsequence-level supervision, thereby making the reward and action space density more aligned with the input information density. The optimization goal and path are quite fundamental. However, the paper includes many extraneous elements, such as introducing adaptive masks to dynamically update the threshold for preference judgments and a Schmitt trigger. The author’s personal thought is more straightforward: if we simply focus on refining the reward generation process, for example, since a single reward for an entire response can be vague, why not allow a large model to run a pipeline that dissects the scoring dimensions? If we provide highly annotated CoT and reference scoring weights, and allow a large model to review and score progressively, this would be less about playing with algorithms and more about directly applying computational power to a longer, more detailed reward generation pipeline. Last time, a colleague from GDM mentioned that they scaled up computational power for generating PRM rewards, somewhat like applying self-consistency to RM, which reportedly yielded some benefits, although it is unclear how reliable this rumor is. |
The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation | This Verilog instance dataset appears to be sizable and should be valuable. It could potentially be merged into our evaluation or used to cover an additional corner case. |
A Theoretical Perspective for Speculative Decoding Algorithm | This work by Mengdi Wang provides a theoretical analysis of speculative decoding, abstracting the decoding problem through a Markov chain formalization. The preliminary process involves generating draft sequences using a small model and then validating the tokens of these draft sequences with a large model. The first two claims made by the author are very strong: one provides the exact formula for the expected number of rejections in speculative decoding, indicating that the acceleration rate is inversely proportional to the distribution difference. The other proves that, under the condition of keeping the distribution unbiased, any unbiased algorithm will have at least as many rejections as speculative decoding, demonstrating that speculative decoding is optimal among this class of algorithms. The paper also introduces batch speculative decoding, which seems like a solid contribution. |
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models | In this work, the visual aspect mainly serves to extend the set of mathematical seed problems; the approach, however, could be applied beyond mathematics. It includes 501 high-quality seed problems across multiple topics, each represented as a Python program. These programs are carefully designed to automatically generate a large number of concrete problem instances, covering variants such as numerical changes, geometric transformations, and function type variations (a hypothetical example of such a seed program follows this table). This is similar to the idea shared earlier about generating LeetCode-style problems from a template, which can then be dynamically expanded into real LeetCode problems. The methodology also seems useful for training models; with some adjustments, it could be leveraged in pre-training to create a small batch of synthetic data, yielding potential benefits. Of the 501 seed problems, 227 come from existing visual mathematics datasets, while 274 are newly collected or developed. Beyond OOD evaluation, this approach can also support program-based evaluation, where a large collection of related algorithms/templates is used to test the internal robustness of a single algorithm/template. Additionally, tricks could be applied to these templates, such as constructing cases like "how many animals are in the cage if there are chickens and rabbits", to test the degree of pattern solidification in the model. This is an effective, low-cost direction that can provide valuable insights. |
A Deep Dive Into Large Language Model Code Generation Mistakes: What and Why? | This paper analyzes code generation errors in large language models, using GPT-4 and Gemini Pro 1.0, and benchmarks such as HumanEval-X and CoderEval. It provides a valuable analysis of errors occurring in language model code generation. The paper identifies seven main categories of errors: conditional errors, garbage code, mathematical formula and logic errors, minor output formatting errors, operational sequence errors, API misuse, and indexing errors. In the cause analysis, apart from corner cases and training gaps, several key insights are offered: 1. Misleading coding conventions and guidelines; 2. The impact of In-Context Learning (ICL). Both 1 and 2 have similar effects: ICL is not necessarily wrong, but it may introduce strange influences in subsequent outputs. There seems to be much potential for further exploration here. 3. Misleading function documentation. One hypothesis is that LLMs somehow learn a pattern in code generation where the function signature is expected to fully align with the implementation. 4. Sensitivity to position. |
Scaling Laws with Hidden Structure | This paper is highly recommended for reading, as its modeling approach is fundamental. The author seems to believe that neural networks can effectively learn discrete distributions through hidden factorial structures in the data. From my reading, the assumption is that each discrete element (though not explicitly mentioned in the paper, it can be intuitively linked to tokens) is mapped to a learned vector, and any unknown or known factorized embedding can be represented as a nested distribution satisfying the factorial assumption. Additionally, the paper observes that the learning speed is related to statistical complexity χ, suggesting that MLPs can leverage the implicit product form of the target distribution to improve learning efficiency. The paper also argues that generalization ability is related to the connectivity of the factorization graph and its statistical complexity. Although the experiments are somewhat toy-like, the findings can be linked to many phenomena in large language models (LLMs). From a circuit perspective, I feel that it is relatively clear how LLMs learn individual functions, and this research could clarify this further. The most valuable aspect of the paper in terms of mechanistic interpretability is understanding where traditional grammar or CFG assumptions do not align with text grammar and how to construct CFGs (or multiple CFGs) that resemble text grammar but are controllable. This would help identify the subtle boundaries and mechanisms of whether and how the learning occurs. |
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models | The quality of this paper is not particularly high. It introduces a single-elimination tournament approach to reduce the number of comparisons required to achieve a robust Elo score. However, this is an emerging direction that I find promising. Recently, I came across another paper that uses multiple non-Elo algorithms to model other statistical significances based on different models' responses to the same prompt. This paper could be considered a pioneering work in the field, opening up a small new area. As for Arena, there are a lot of assumptions that are problematic. For example, it attempts to represent user profiles, but which types of users does it represent? Are different users truly consistent? The paper provides a simple win/loss analysis, but what about clustering and analyzing response patterns? How are user preferences reflected? There's a lot to explore from a statistical perspective. Additionally, the chatbot arena approach itself is not particularly efficient. |
Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation | This paper raises an interesting and important issue. The related work gives the impression that there has been insufficient research into the impact of data ordering in LLM training. The paper utilizes LEAN and HH. The author introduces a new data ordering method, called intuitive ordering, in which the relevant intermediate supervision for each proof step always appears to the left of the proof step. Personally, I feel that this is somewhat of a toy model because it’s difficult to find this kind of intuitive ordering in pretraining data. Nevertheless, I still appreciate articles that introduce new problems. |
Vision-Language Models Can Self-Improve Reasoning via Reflection | This is a relatively good A+B paper that introduces an iterative self-training framework, R3V, to enhance vision-language reasoning abilities through reflection on self-generated CoT (Chain of Thought) reasoning. The method itself is not particularly groundbreaking, but it is still worth reviewing for tuning experiences and the observed improvements in performance. |
GWQ: Gradient-Aware Weight Quantization for Large Language Models | The idea is simple, intuitive, and effective. The proposed GWQ method retains the top 1% of weights with the largest gradient absolute values in FP16 precision, while quantizing the remaining weights to a lower-bit format, achieving low quantization loss (a minimal sketch follows this table). This can be thought of as a parameter-quantization analogue of Speculative Decoding. |
Can Large Language Models generalize analogy solving like people can? | This is an out-of-distribution (OOD) task where participants are asked to infer a new letter string based on a given transformation rule. The performance of models and humans is compared, and interestingly, adults and some LLMs (such as GPT-4o and Llama-3.1 405B) outperform children in this task with the Latin alphabet. However, Claude-3.5 and Gemma-2 27B perform slightly worse. This observation highlights the rare lack of robustness in Claude-3.5-Sonnet for OOD tasks, whereas Llama-3.1-405B does not perform poorly. It might be worthwhile to add Llama-3.1-405B as a baseline in our OOD benchmark comparisons. |
Thinking Forward and Backward: Effective Backward Planning with Large Language Models | The paper proposes a backward planning algorithm where the LLM first generates a backward plan, then reverses the sequence and self-validates it. This approach helps LLMs avoid inherent biases in backward planning, generates more diverse candidate plans, and utilizes the asymmetry between the forward and backward directions of planning problems. The benchmarks used are limited to three constructed tasks: graph planning, array transformation, and block world tasks. However, the experimental design is quite interesting; it employs breadth-first search (BFS) to compute the number of steps for both forward and backward searches. ING-VP used a similar approach, and while many recent reasoning benchmarks have not, it is actually possible to derive the structure of a Reasoning Directed Acyclic Graph (RDAG), where for steps that are well-defined, the nesting depth and total steps can be clearly calculated, which can provide valuable new insights. |
How Far is Video Generation from World Model: A Physical Law Perspective | This paper explores the ability of video generation models to discover physical laws, particularly the ability to identify these laws purely from visual data. The quick takeaway is that diffusion models and insufficient data alone cannot solve the out-of-distribution (OOD) generalization problem. During the generalization process, the model tends to refer to similar training cases rather than learning universal rules. Future research should focus on improving models to better understand and apply physical laws. This is somewhat similar to the characteristics of LLMs, but it appears that the knowledge learned by diffusion models is shallower (possibly due to the lower information density in visual data, making it harder to extract rules). Earlier this year, during an ICLR discussion with Professor Tan Xu and Xing Chao, we talked about why diffusion models rarely mention "grokking." If one is learning explicit rules or patterns like resolution or extraction, it is relatively smooth. A follow-up paper analyzing diffusion model grokking from the perspective of physical laws could be a very decent contribution. |
Evaluating Creative Short Story Generation in Humans and Large Language Models | This is not a static benchmark, so it cannot be included in existing evaluation systems, but it quantifies some points that might already be known: 1. Stories generated by models tend to have higher vocabulary and syntactic complexity than those generated by humans, but they have lower readability. 2. Human-generated stories exhibit higher vocabulary diversity. They have lower complexity but higher diversity, which could be an interesting point. 3. Humans are more likely to use pronouns and often write from the first or second-person perspective, while models tend to favor the third-person perspective. 4. Humans' story transitions and plot twists create a greater sense of surprise, meaning that humans have more creative twists, whereas models tend to be more mundane and logical. I am also quite curious about how Robert trains his models, as the results still seem intriguing. |
Improving Steering Vectors by Targeting Sparse Autoencoder Features | The paper primarily addresses the issue of steering vector intervention. They control the behavior of language models by adding steering vectors, which are implemented by inserting activation vectors during the model's forward propagation process. In this work, they predict the impact of inserting a steering vector and use that prediction to achieve a degree of controllability (SAE-Targeted Steering, SAE-TS); the aim is more precise steering control by measuring the effect of steering vectors on Sparse Autoencoder (SAE) features. This method seems to have significant implications for alignment, especially during supervised fine-tuning (SFT), where predicting the impact of any single data point could be highly important. It is worth considering how to follow up on this approach. |
Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders | Recently, it seems particularly fitting to write a position paper on how to use Sparse Autoencoders (SAE) and techniques like the logit lens to understand model parameters and achieve controllability over model behavior. There seems to be a small technical breakthrough in this area, with slightly novel papers emerging almost every day. This paper proposes framing the token-feature matching problem as a resource allocation problem constrained by a sparsity budget. Existing TopK SAE methods solve this allocation problem under the constraint that each token can match at most K features, but they fail to fully leverage the advantages of adaptive computation. Therefore, they propose two new SAE variants: Feature Choice SAEs and Mutual Choice SAEs. Feature Choice SAEs relax the constraint that each feature can match at most M tokens, addressing the sparse allocation issue. Mutual Choice SAEs remove the constraint on token-feature matching numbers, allowing free allocation within the total sparsity budget. The new loss design they propose is somewhat similar to MoE load balancing. |
TableGPT2: A Large Multimodal Model with Tabular Data Integration | This is a project from Zhejiang University's Jake Zhao Junbo, which focuses on building a comprehensive pipeline for table understanding. It includes pretraining and other components, and the approach is quite detailed. The benchmarks and datasets involved in the paper could be worth reviewing for potential references or reuse. |
Context Parallelism for Scalable Million-Token Inference | The paper introduces Context Parallelism (CP) to optimize long-context LLM inference. It focuses specifically on long contexts and presents two lossless, exact ring attention variants: pass-KV and pass-Q. Scalability tests are also conducted across multiple nodes. |
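
As a companion to the SAE-Targeted Steering entry above, here is a minimal sketch of the generic activation-steering mechanism it builds on: a forward hook adds a fixed vector to one transformer layer's hidden states during generation. The model name, layer index, steering scale, and the random steering vector are all placeholder assumptions for illustration; this is not the paper's SAE-TS construction, which additionally predicts a vector's effect on SAE features in order to choose it.

```python
# Minimal activation-steering sketch (PyTorch + Hugging Face transformers).
# Shows the generic mechanism (adding a vector to one layer's hidden states
# during the forward pass); NOT the paper's SAE-TS procedure, which selects
# the vector by predicting its effect on SAE features.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model
LAYER_IDX = 6         # placeholder layer to steer
STEER_SCALE = 4.0     # placeholder steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

hidden_size = model.config.hidden_size
# Placeholder steering direction; SAE-TS would construct this to target
# specific SAE features rather than drawing it at random.
steering_vector = torch.randn(hidden_size)
steering_vector = steering_vector / steering_vector.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden_size); add the vector to every position.
    hidden = output[0] + STEER_SCALE * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)
try:
    ids = tokenizer("The weather today is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

For the Mutual Choice / Feature Choice SAE entry, below is a toy numpy sketch of the allocation idea as I read it (not the authors' code): three ways of spending a sparsity budget on a tokens-by-features activation matrix, namely per-token TopK (each token keeps its K largest activations), a per-feature cap in the spirit of Feature Choice (each feature matched to at most M tokens), and a free global budget in the spirit of Mutual Choice (keep the TOTAL largest activations anywhere). All sizes are made up.

```python
# Toy illustration of three ways to allocate a sparsity budget over a
# (tokens x features) activation matrix. A reading sketch of the allocation
# idea only, not the paper's SAE architecture or training loss.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_features = 8, 16
acts = rng.standard_normal((n_tokens, n_features))  # pre-activation scores

K = 4        # per-token budget (standard TopK SAE)
M = 2        # per-feature cap (Feature Choice flavour)
TOTAL = 32   # global budget (Mutual Choice flavour), = n_tokens * K here

def topk_per_token(a, k):
    """Each token keeps its k largest activations."""
    mask = np.zeros_like(a, dtype=bool)
    idx = np.argpartition(-a, k - 1, axis=1)[:, :k]
    np.put_along_axis(mask, idx, True, axis=1)
    return np.where(mask, a, 0.0)

def cap_per_feature(a, m):
    """Each feature keeps its m largest activations across tokens."""
    mask = np.zeros_like(a, dtype=bool)
    idx = np.argpartition(-a, m - 1, axis=0)[:m, :]
    np.put_along_axis(mask, idx, True, axis=0)
    return np.where(mask, a, 0.0)

def global_budget(a, total):
    """Keep the `total` largest activations anywhere in the matrix."""
    flat = a.ravel()
    keep = np.argpartition(-flat, total - 1)[:total]
    mask = np.zeros_like(flat, dtype=bool)
    mask[keep] = True
    return np.where(mask.reshape(a.shape), a, 0.0)

for name, sparse in [("per-token TopK", topk_per_token(acts, K)),
                     ("per-feature cap", cap_per_feature(acts, M)),
                     ("global budget", global_budget(acts, TOTAL))]:
    print(f"{name:16s} nonzeros={np.count_nonzero(sparse):3d} "
          f"max per token={int((sparse != 0).sum(axis=1).max())} "
          f"max per feature={int((sparse != 0).sum(axis=0).max())}")
```
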
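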
Paper | Comments |
---|---|
Human-inspired Perspectives: A Survey on AI Long-term Memory | The focus is on several concepts presented in this paper: 1. The paper introduces several types of human memory: episodic memory, semantic memory, and procedural memory, mapping the first two to non-parametric memory and the latter to parametric memory. Within this survey's framing, the authors expect various episodes and semantics to live in the context consumed by language models, rather than in associations formed inside the model itself; this is not necessarily a correct belief, but it is worth considering. 2. The paper also proposes a memory management mechanism, emphasizing that adaptive storage, adaptive retrieval, and adaptive forgetting handle different types of information separately. These three operations are defined very concisely: storage, retrieval, and forgetting. Currently, LLMs (including agents) rarely manage forgetting explicitly; this might be achievable through the circuit-control-style schemes that have appeared in many recent papers. |
WLPlan: Relational Features for Symbolic Planning | |
GPT for Games: An Updated Scoping Review | This survey offers a well-structured perspective, introducing two noteworthy aspects. Firstly, the title clearly defines the scope (2020-2024), providing a focused temporal range that avoids an exhaustive historical review. Secondly, it presents a novel approach to literature selection, suggesting that the process itself can be a relevant research topic. While most current surveys are AutoSurveys, the method used here could inspire studies analyzing how literature for a survey topic is selected, based on previous reviews. The paper is divided into three main areas: 1) Game Generation, 2) Agent Creation in Games, and 3) Game User Research. In the Game Generation section, the study summarizes methodologies that generate entire game content based on frameworks like stories or programs, covering granularity levels from stories and missions to levels and characters. It also discusses design development through user prompts and interaction with large language models (LLMs), where LLMs primarily serve as tools for quickly generating various layouts and mechanisms. In the context of interactive gameplay, the paper likens this approach to tabletop RPGs, where LLMs provide story content, user experience enhancements, and real-time creative support. This field shows significant potential, with only around 30 papers selected for review, and many appear to be standouts in a limited field. Research in game user studies also appears sparse, with only a few papers in this category. |
Project Sid: Many-agent simulations toward AI civilization | This paper is essentially a mega-scale version of the Stanford generative-agents town. |
GameGen-X: Interactive Open-world Game Video Generation | This paper presents the OGameData dataset, which supports text-to-video generation and video continuation tasks, enabling models to generate high-quality, open-domain game videos with long sequences. It integrates character interaction and scene content control within video generation. As one of the earliest works in China paralleling Google’s Genie, the follow-up results appear promising. After reviewing their promotional video, minor issues in scene transitions were observed, though overall performance is impressive. The keyboard control feature is notably commendable. |
Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models | |
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models | Based on SAE training, an inclination parameter t is introduced to encourage the model to better represent tail concepts. |
Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling | The impact of long-tail data is highlighted, and the phenomenon is indeed quite pronounced: during self-improvement, oversampling simple queries and undersampling complex ones concentrates the distribution on high-probability data, which happens to align with the model's internal common patterns and ultimately leads to mode collapse. We have also worked on self-improvement and find this issue to be both common and significant. The solution presented here is intuitive, using various forms of guidance to identify and resample tail data. In my experience, seemingly less elegant methods can often be insightful and practical; in this case, "less elegant" refers to the four types of guidance used by Professor Huang and Professor Guitao, which lack an obvious intrinsic logical connection. The ablation study suggests that the proposed state-reset approach is generally more effective; this state reset is somewhat like reverting to an earlier reasoning step after multiple unsuccessful attempts at the current one. |
Physics in Next-token Prediction | Recently, TeleAI has published quite a few works that may not be highly effective but are quite imaginative, such as SentenceVAE and a collaboration with BAAI on continuously scaling model pre-training up to 1 trillion parameters. In this paper, a new formulation is proposed to quantify the energy required for information transfer when Next-token Prediction (NTP) is viewed as an information compression process, and consistency with the OpenAI scaling law is also derived. |
Self-Evolved Reward Learning for LLMs | This work proposes a self-evolved reward learning approach. The key innovation compared to SPIN and previous methods is that the RM is self-evolved through a feedback loop using the RM itself: the LLM serves as the RM, generating feedback on the dataset that is subsequently used to refine its own learning. This iterative "feedback-then-train" loop allows the RM to self-evolve over time, gradually improving its performance. It can also generate high-quality preference data and reduce reliance on human-annotated data. The "Self-Improving" topic is finally gaining momentum and becoming more popular in the field. |
Constant Acceleration Flow | The work on Diffusion really involves a lot of explicit physics concepts. |
Generalizability of Memorization Neural Networks | The paper presents a systematic theoretical analysis of generalizability of memorization neural networks. It provides a formula modeling the minimum number of parameters required to memorize any dataset sampled i.i.d. The research demonstrates that some commonly used memorization networks do not have generalizability even if the dataset is drawn i.i.d. from a data distribution and contains a sufficiently large number of samples. This work also provides complexity analysis. Recommended reading for those interested in interpretability studies. |
Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement | The fundamental concept is that after modeling the code trees from the majority of GitHub repositories, GitHub can theoretically serve as a simulation environment for scaling RL. This approach is viable because GitHub contains comprehensive data with natural interaction records, even after filtering. Compared to conventional RL environments, it additionally provides codebase summaries, essentially functioning as a summarization based on global observations. The LingMa paper collected approximately 90,000 PRs from 4,000 repositories. The data underwent filtering to ensure code change quality and relevance, and then followed STaR's approach for training, implementing a fixed three-stage CoT (Chain of Thought) training framework: repository comprehension, fault localization, and patch generation. They employed their classic rejection sampling method, using two metrics - fault localization accuracy and patch similarity - for data filtering to ensure high-quality synthetic data. This suggests promising potential for scaling Decision Transformer/RL using standard code bases as initialization. |
Mastering the Craft of Data Synthesis for CodeLLMs | A comprehensive survey on CodeLLM data processing published by Oracle. |
Interpretable Language Modeling via Induction-head Ngram Models | A notable contribution that builds upon Infini-gram, which computes next-word probability distributions through longest-suffix matching in a reference corpus. The work adds induction heads and custom neural similarity metrics to efficiently search the input context itself for potential next-word completions, enabling Induction-Gram to provide ngram-level justification for each generated word and offering a coarse-grained view of how language models predict subsequent words. A small suffix-matching sketch of the underlying Infini-gram idea appears after this table. |
Evolving Alignment via Asymmetric Self-Play | The paper presents a combination of RLHF and evolutionary approaches, essentially layering evolution over RLHF without substantially addressing the inherent preference-modeling issues of RLHF. The notable insight lies in simultaneously optimizing the Creator's generation strategy and the Solver's response strategy. This suggests applications beyond the Creator role: one potential direction for scaling RL on pretraining corpora is to use a model with basic text comprehension as a Rewriter & Creator that fits the information distribution between a question set and the original pretraining corpus, aiming to cover all essential information in the corpus with as few questions as possible; the other component, analogous to their Solver, focuses on problem optimization. The paper's implementation of minimax regret, increasingly referenced in recent multi-agent work, merits review. While the paper's core claimed contribution is evolving previously uncovered prompts to encompass more scenarios, this advancement might be considered incremental. |
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks | A benchmark for evaluating planning and reasoning in human-robot collaboration tasks. |
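
For the Induction-Gram entry above, here is a small sketch of the Infini-gram-style backbone it builds on: find the longest suffix of the current context that occurs in a reference corpus and read the next-word distribution from the tokens that follow each occurrence. The toy corpus and whitespace tokenization are placeholder assumptions; the paper's own contributions (induction heads and neural similarity search over the input context) are not reproduced here.

```python
# Toy longest-suffix-match next-word predictor (Infini-gram-style idea).
# Reference corpus and whitespace tokenization are placeholders; this is a
# reading sketch of the mechanism, not the paper's implementation.
from collections import Counter

corpus = ("the cat sat on the mat . the cat sat on the sofa . "
          "the dog sat on the mat .").split()

def next_word_distribution(context, reference, max_suffix=8):
    """Return (suffix_len, Counter of next words) for the longest context
    suffix that appears in the reference corpus."""
    tokens = context.split()
    for n in range(min(max_suffix, len(tokens)), 0, -1):
        suffix = tokens[-n:]
        counts = Counter()
        for i in range(len(reference) - n):
            if reference[i:i + n] == suffix:
                counts[reference[i + n]] += 1
        if counts:
            return n, counts
    return 0, Counter()

n, dist = next_word_distribution("yesterday the cat sat on the", corpus)
total = sum(dist.values())
print(f"matched suffix length = {n}")
for word, c in dist.most_common():
    print(f"  P({word!r}) = {c}/{total}")
```
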
If you are interested in our published work, please navigate to our full paper list.