M-A-P Daily Paper


The M-A-P daily paper project curates and reviews a selection of new papers published daily on arXiv, providing insightful commentary on cutting-edge research across various scientific disciplines.




🛠️ Papers This Week


11/10/2024
Paper Comments
**Zero-Shot Generalization of Vision-Based RL Without Data Augmentation** This paper explores the generalization of vision-based agents in new environments (specifically across game environments), which is a highly valuable long-term research topic. Humans can quickly adapt to a new game by reading guides and practicing, while models often require time for training or rely on strong "external knowledge" supervision. The common approach involves some form of memory module. This memory can be trained through disentangled vision representations (as in this paper), or based on text or a well-defined shared state space (such as controller inputs). About a year ago, work was done on this topic, including projects like MORE-3S: Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces and Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction. Earlier, Google had also worked on Multi-game Decision Transformers, though all of these are based on Decision Transformers (DT). This could also be explored using multimodal large language models (MLLMs), though significant progress has not been seen in this area for over a year, which is somewhat disappointing.
**Can Transformers Reason Logically? A Study in SAT Solving** This is an interesting small experiment, using SAT to show that LLMs combined with Chain-of-Thought (CoT) reasoning can generalize to out-of-distribution (OOD) cases within the same task. However, it does not demonstrate how well models handle task-level OOD generalization. The mechanistic interpretability aspect is not very solid, but the main takeaway is that the format and length of supervised fine-tuning (SFT) data are crucial. The rest of the findings are less impactful.
**DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models** This paper presents a data analysis agent benchmark. The benchmarks listed in Table 1 are worth considering for potential integration into specialized areas of code evaluation. This paper is recommended for review.
**MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts** This method introduces zero-experts and copy-experts, though it doesn't fully qualify as a heterogeneous Mixture-of-Experts (MoE) model since the expert sizes appear uniform. Further evaluation is needed to assess its true value.
**The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making** A new moral dilemma benchmark developed by GDM.
**A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models** This is a survey introducing hardware-based model architecture design. It was quickly scanned and seems decent but requires more in-depth study.
**Agent S: An Open Agentic Framework that Uses Computers Like a Human** This paper introduces a GUI agent framework. A brief scan didn't reveal anything particularly innovative.
**Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines** This is a fascinating paper with an interesting design: the authors built a pseudo-Turing machine capable of performing seven types of arithmetic operations, and the key achievement is that it works well, with quite high accuracy. If the design and memory capabilities can be extended, this could potentially serve as another route for inference-time scaling. If LLM agents can be made reliable, the command set could be defined as finite, consisting of Chain-of-Thought (CoT) operations alongside a few essential control actions such as deciding when to stop or shortening the history context (see the sketch below). This paper is highly recommended reading and may inspire new research directions.
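To make the finite-command-set idea above concrete, here is a minimal sketch, not the paper's actual method: it assumes a hypothetical `llm` callable that, given the current context, returns the next command and a text payload, and it uses three illustrative commands (one CoT step, a history-compression action, and a stop action).

```python
from enum import Enum

class Command(Enum):
    """Hypothetical finite command set: CoT steps plus control actions."""
    COT_STEP = "cot_step"          # emit one reasoning step
    SHORTEN_HISTORY = "shorten"    # replace the history with a summary
    STOP = "stop"                  # terminate and return the final answer

def run_agent(llm, task: str, max_steps: int = 32) -> str:
    """Drive an LLM as a machine over the finite command set above.

    `llm(context) -> (Command, str)` is an assumed interface: it reads the
    concatenated history and returns the next command plus its payload.
    """
    history = [task]
    for _ in range(max_steps):
        command, payload = llm("\n".join(history))
        if command is Command.STOP:
            return payload                    # payload = final answer
        if command is Command.SHORTEN_HISTORY:
            history = [task, payload]         # payload = summary of history
        else:
            history.append(payload)           # payload = one CoT step
    return history[-1]                        # budget exhausted; best effort
```

The appeal of such a design is that once the action space is finite and explicit, reliability can be audited per command rather than per free-form generation.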
**The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks** This paper presents a potentially valuable out-of-distribution (OOD) benchmark similar to IQ tests.
**COMMA: A Communicative Multimodal Multi-Agent Benchmark** This benchmark evaluates multimodal multi-agent frameworks through interaction. The research direction holds significant potential, though it might become less impactful if too many similar puzzles flood the field in the future.
**WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents** This work is somewhat similar to frameworks based on potential rules. However, the rule format feels trivial, and the limited action space makes the work seem unimpressive compared to other agents based on the Minecraft environment.
**LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts** A multimodal synthetic data generation scheme based on MLLMs.
**From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions** LLMs understand tool documentation differently from humans; having the LLM refine the documentation from the model's own perspective allows it to better understand how to use the tools.
**DemoShapley: Valuation of Demonstrations for In-Context Learning** The value (high or low) of in-context learning (ICL) demonstrations has a significant impact on the overall performance of the model (see the sketch below for what this style of valuation involves).
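As an illustration of what Shapley-style valuation of demonstrations involves, here is a generic Data-Shapley-style sketch, not necessarily the paper's exact algorithm: each demonstration's value is estimated as its average marginal contribution over random orderings, and `evaluate` is an assumed black box returning validation accuracy when the model is prompted with a given subset of demonstrations.

```python
import random

def demo_shapley(demos, evaluate, num_permutations=200):
    """Monte Carlo (permutation-sampling) Shapley valuation of ICL demos.

    demos:    list of candidate demonstrations
    evaluate: assumed black box, evaluate(subset) -> validation accuracy
              when the model is prompted with that subset of demos
    Returns a dict mapping demo index -> estimated Shapley value.
    """
    values = {i: 0.0 for i in range(len(demos))}
    for _ in range(num_permutations):
        order = random.sample(range(len(demos)), len(demos))
        prefix, prev_score = [], evaluate([])
        for i in order:
            prefix.append(demos[i])
            score = evaluate(prefix)
            # marginal contribution of demo i under this ordering
            values[i] += (score - prev_score) / num_permutations
            prev_score = score
    return values  # higher value -> more helpful demonstration
```

Demonstrations with low or negative estimated value are exactly the ones the comment above flags as dragging down overall performance.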
**Evolutionary Contrastive Distillation for Language Model Alignment** This study observed that synthetic preferences constructed by combining instructions (which impose additional requirements on format) significantly benefit model learning. A potential reason is that this type of preference data connects patterns that are intrinsically related but would not otherwise be linked. This is a promising area for further exploration; recommended reading.
**Benchmarking Agentic Workflow Generation** The name is aptly chosen and the direction is solid; however, data diversity and quality are somewhat lacking. Workflow-based evaluation is indeed a highly valuable evaluation topic. Taking Coze and Dify as examples, if we consider LLMs as productivity tools, two key points emerge for further validation: 1) for various professional workflows, can an LLM create a rational "workflow" and a continuous working space for tasks commonly encountered in real-life settings? 2) following the rational workflow established in point 1, can the LLM reason correctly, provided the workflow itself is reasonable? If an LLM can accomplish both, it can be considered a valuable productivity tool for that specific profession and context. If data could be collected following this pattern, it would make for an excellent benchmark. Unfortunately, the domain categorization in this work is quite limited and remains focused on academic topics related to agents, rather than expanding to real-world professional scenarios.
**AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories** A large-scale CoT-enabled agent instruction dataset with reformalization potential, which could strengthen the reliability of the model's deep search behavior.
**MKGL: Mastery of a Three-Word Language** A noteworthy work where LLMs leverage KG-based vocabulary to generalize over knowledge graphs, predicting reasoning trajectories on KGs.
10/10/2024
Paper Comments
**Identifying and Addressing Delusions for Target-Directed Decision-Making** This paper by Doina Precup and Yoshua Bengio discusses delusions in RL agents, caused by extreme data distributions (defined in the paper as inappropriate training data distributions) and unsuitable update rules that lead to false beliefs. The paper summarizes multiple causes of delusions; from the commenter's perspective, two main types of targets can be problematic: globally unrealistic goals, and globally possible but overly sparse goals, where improper sampling and rule updates lead to misjudgments and incorrect learning. The paper also emphasizes that mis-evaluation of non-delusional targets still occurs, and such errors, along with misanalysis of sparse targets, significantly harm RL agents' generalization. When applying the idea to RLHF, overly difficult preference data might face similar issues. For LLMs, if a preference pair represents either a sparse target (with long, hard-to-identify reasoning chains) or an OOD case for the SFT-trained model, it might lead to incorrect reasoning shortcuts. The paper proposes two potential solutions: 1) letting the estimator know the potential strategies adopted by the generator, essentially using model self-generated targets to replace the existing given targets - for RLHF, this suggests not using pairs from other models to train your LLM; 2) letting the model know the experienced targets, which in RLHF terms would mean having the LLM generate diverse response candidates.
**Falcon Mamba: The First Competitive Attention-free 7B Language Model** Claims to be the first competitive SSM-based pretrained model.
**Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning** -
**Pyramidal Flow Matching for Efficient Video Generative Modeling** -
**Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate** Highly recommended paper, which introduces an FID-based normalized metric for evaluating modal alignment learning effects in LVLM pretraining. Notably, panels 1 and 2 of Figure 1 show that after MIR decreases (and it decreases early), although PPL continues to decrease, model performance does not increase. The evaluation uses MME, MMBench, and SEED-Img, providing a valuable assessment of visual capabilities and potentially indicating a vision-semantic alignment bound.
**Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation** Presents a valuable benchmark for assessing the physical reliability of video generation. Previous discussions between the commenter and video generation researchers at ICLR 2024 revealed no significant grokking observations in diffusion models. A hypothesis was proposed that learning physical commonsense rules might be a form of grokking, which this benchmark might help verify.
**Aria: An Open Multimodal Native Mixture-of-Experts Model** The MLLM shows good results. While the report might lack extensive details, section 4.2 presents interesting expert visualizations.
**Pixtral 12B** Based on the reported benchmarks (lacking video), both methodology and results are inferior to Aria. Performance is suboptimal.
**IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation** Worth reviewing for its potentially useful data and incorporation of rewards in later diffusion steps. Connects to the decision-making paper mentioned above - might benefit from generating multiple incorrect guidances to create pairs more aligned with the model's capabilities.
**Personalized Visual Instruction Tuning** Addresses interesting research questions and introduces P-Bench, a significant MLLM benchmark. The definition of "personalization" appears somewhat limited; could benefit from expanding to examine how different contexts affect image question responses. While the methods and pipeline might not be optimal, Figure 2 presents interesting examples.
**Does Spatial Cognition Emerge in Frontier Models?** Recommended as a specialized observation category that does not affect general LLM evaluation; useful for detecting the emergence of spatial cognition in LLMs.
**MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering** From OpenAI; it is similar to one of M-A-P's ongoing multi-agent projects, which appears stronger than the baselines in MLE-bench. Kaggle competitions present a valuable testing scenario for agents, offering generalizable data analysis tasks.
**Retrieval-Augmented Decision Transformer: External Memory for In-context RL** Methodology similar to the R2-Play paper from six months prior.
**MM-Ego: Towards Building Egocentric Multimodal LLMs** First-person perspective multimodal model addressing embodiment needs, though the approach may not meet the precision requirements for spatial information in embodiment applications. Successfully identifies a meaningful problem and scenario.
**Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models** -
**CursorCore: Assist Programming through Aligning Anything** Potentially valuable code benchmark.
**Emergent properties with repeated examples** Examines the relationship between sample repetition and model memorization/generalization. The conclusions in the abstract appear counterintuitive.
**Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates** Slightly overclaimed for attention. To quote an apt comment: "Uses random string searches and SQL injection-like methods to exploit prompts for each benchmark". Unlikely to significantly impact benchmark usage unless models are kept private or API-only. Highlights ongoing concerns about the robustness and reliability of LLM-based evaluation.
**Self-Boosting Large Language Models with Synthetic Preference Data** -
**VHELM: A Holistic Evaluation of Vision Language Models** An MLLM evaluation from Percy Liang.
**ReIFE: Re-evaluating Instruction-Following Evaluation** Presents interesting perspectives backed by costly experiments. Key takeaways: 1) the consistency of results under an LLM-based evaluation protocol correlates highly with the base LLM; 2) large models perform better as evaluators, but complex evaluation protocols can improve smaller models' evaluation capabilities; 3) meta-evaluation datasets should be selected more diversely and carefully.
**Temporal Reasoning Transfer from Text to Video** Transfers "coarse-grained" temporal reasoning capabilities from text to visual understanding. Shows some effectiveness and addresses a valuable problem, though the methods are somewhat rough.
**Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling** -
**Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond** Relatively simple approach but potentially valuable for LLM SFT with controlled ROI.
**One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation** Presents a generalizable LoRA parameter initialization method.
**Do better language models have crisper vision?** Examines whether language models acquire visual cognition during pretraining.
**Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning** -
**Neural Circuit Architectural Priors for Quadruped Locomotion** -
**Data Selection via Optimal Control for Language Models** A pretraining data selection methodology. It appears more rigorous than a previous similar work from Pengfei Liu: 1) the Pengfei Liu paper does not analyze the effects brought by duplication across FineWeb-Edu data dumps; 2) that method intuitively degrades data diversity. Key considerations for this kind of method include: 1) scaling extent; 2) data repetition / data-constrained scaling laws; 3) impact on data diversity; 4) downstream dependence. Worth dedicated exploration.
**Response Tuning: Aligning Large Language Models without Instruction** Demonstrates the emergence of instruction-following capabilities through response-only training. Complements an earlier M-A-P paper, I-SHEEP, which shows that instruction-only training also leads to emergence. Key takeaway: instruction following may be an existing internal capacity, suggesting alignment might not require a strong distribution shift (which implies higher cost). It may be a good idea to fine-tune only the specific weights related to instruction following.
**ING-VP: MLLMs cannot Play Easy Vision-based Games Yet** Shows that although GPT-4o and Claude-3.5-Sonnet significantly outperform other models, they still struggle with multi-round vision-based game reasoning.
**T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design** -
**KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks** Recommending another M-A-P paper. It offers more thorough ablation than the contemporary ZebraLogic and LogicGame, emphasizing knowledge-orthogonal characteristics. The commenter promotes knowledge-orthogonal reasoning capacity evaluation as a future direction, focusing on models' ability to leverage existing knowledge and perform task composition. Evaluation should consider knowledge capabilities alongside specific domain performance.

If you are interested in the work we have published, please navigate to our full paper list.