M-A-P Daily Paper


The M-A-P daily paper project curates and reviews a selection of new papers published daily on arXiv, providing insightful commentary on cutting-edge research across various scientific disciplines.



back to main page

🛠️ Papers This Week

(Expand to View)

20/9/2024

| Paper | Comments |
| --- | --- |
| Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case | Conclusion 1: the VL-Agent powered by GPT-4o and Gemini outperforms human players in Black Myth: Wukong. Conclusion 2: GPT-4o is significantly better than Gemini at learning planning skills. The collected Black Myth: Wukong dataset could be quite useful. |
| KnowFormer: Revisiting Transformers for Knowledge Graph Reasoning | |
| How the (Tensor-) Brain uses Embeddings and Embodiment to Encode Senses and Decode Symbols | Proposes a cognitive model that takes an exploratory approach: a representation layer serves as a subsymbolic global workspace, and an index layer (containing symbols for concepts, time instances, and predicates) handles coordination. The experimental validation remains preliminary. |
| Autoformalization of Game Descriptions using Large Language Models | An automated tool for converting text-based games into decision problems; appears promising, with potential for further development into a dynamic agentic benchmark. |
| RAG-Modulo: Solving Sequential Tasks using Experience, Critics, and Language Models | Uses RAG to assist LLM-based agents with decision-making in interactive environments, tested on BabyAI and AlfWorld; if reproducible, the results would be quite impressive. |
| MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines | Congratulations on the release! [Salute] A benchmark combining multimodal and search scenarios; it appears to align with user-oriented use cases more closely than prior multimodal benchmarks. |
| Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner | |
| JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images | |
| Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm | Recommended reading. TL;DR: subtracting the mean of a vector's components is equivalent to removing its projection onto the uniform vector, so layer normalization is the normalization of the component orthogonal to the uniform vector, followed by a scaling factor. The information discarded by layer normalization cannot be recovered by the scale-and-shift step. (See the note below this table.) |
| Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization | |
| Assessing the Zero-Shot Capabilities of LLMs for Action Evaluation in RL | Uses LLMs for value-function evaluation in RL. Recommended references for this domain: Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models; Contextual Transformer for Offline Meta Reinforcement Learning; Natural Language Reinforcement Learning. |
| MEXMA: Token-level objectives improve sentence representations | Sentence encoders also benefit from token-level objectives. |
| Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning | Another instance of XoT, this time introducing iteration. |
| PersonaFlow: Boosting Research Ideation with LLM-Simulated Expert Personas | Could be adapted into a more elaborate version of a Cosmopedia/persona synthetic-data pipeline. |
| Scaling FP8 training to trillion-token LLMs | |
| Prompts Are Programs Too! Understanding How Developers Build Software Containing Prompts | |
| Neural Networks Generalize on Low Complexity Data | |
| Mixture of Diverse Size Experts | |

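A quick restatement of the LayerNorm note in the table above, in notation of my own choosing rather than the paper's: mean subtraction is exactly a projection, and the normalization fixes the length of what remains.

```latex
% Mean subtraction removes the projection onto the uniform vector 1 = (1, ..., 1)^T:
% it is multiplication by the projector (I - 11^T / d). LayerNorm then rescales the
% surviving orthogonal component to norm sqrt(d) before the learned scale-and-shift.
\[
  x - \mu\mathbf{1} \;=\; \Bigl(I - \tfrac{1}{d}\,\mathbf{1}\mathbf{1}^{\top}\Bigr)x,
  \qquad
  \mathrm{LN}(x) \;=\; \gamma \odot \frac{x - \mu\mathbf{1}}{\sigma} + \beta,
  \qquad
  \sigma \;=\; \frac{\lVert x - \mu\mathbf{1}\rVert_2}{\sqrt{d}} .
\]
```

Both the mean component along the uniform vector and the original norm are discarded, which is the irreversibility the paper highlights: the scale-and-shift step cannot recover them.
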
19/9/2024

| Paper | Comments |
| --- | --- |
| AIvril: AI-Driven RTL Generation With Verification In-The-Loop | |
| RTLRewriter: Methodologies for Large Models aided RTL Code Optimization | Similar to the previous paper, this uses LLMs for RTL design. Performance in this domain appears significantly improved compared to earlier this year. |
| Self-Contrastive Forward-Forward Algorithm | |
| Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation | This work offers a useful image-generation preference dataset. It aligns with our work on VideoScore, which builds automatic metrics to simulate fine-grained human feedback for video generation. Model-based AIGC metrics seem promising, particularly for music and video, though image-focused research in this area is relatively established. |
| Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent | Integrates a validator agent into the ToT framework to verify each thought branch. |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Intuitively, native support for dynamic resolution might not entirely make sense for CLIP's semantic image-text mapping. (Personal view: for similar regions, does higher resolution simply imply more tokens per image?) |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | Better data and annotation is all you need. |
| Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey | |
| Mamba Fusion: Learning Actions Through Questioning | |
| GRIN: GRadient-INformed MoE | Figure 6 is particularly interesting: GRIN exhibits domain-level expert clustering. However, the model uses a top-2-of-16 expert selection rather than a fine-grained expert structure. (Previously, Meta's BTX observed that experts trained together became more similar over time; from a dense-modeling perspective, expert differentiation is relatively natural. Reference: Branch-Train-MiX, which combines expert LLMs into a Mixture-of-Experts LLM.) A sketch of plain top-2 routing follows this table. |
| Computational Dynamical Systems | Yet to be fully understood. |
| To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning | Recommended reading: it compares CoT's benefits across different downstream tasks and confirms that CoT's primary advantage lies in symbolic execution. |

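For readers unfamiliar with the top-2-of-16 selection mentioned in the GRIN row above, here is a minimal sketch of plain top-k MoE routing. This illustrates only the generic routing pattern, not GRIN's gradient-informed estimator; the class name, sizes, and expert definition are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2Router(nn.Module):
    """Generic top-2-of-16 MoE layer (illustrative; not GRIN's method)."""

    def __init__(self, d_model=512, n_experts=16, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # Toy experts; real MoE layers use much larger FFNs.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):
        # x: (tokens, d_model). Score all 16 experts, keep the top 2 per token.
        logits = self.gate(x)                          # (tokens, n_experts)
        top_logits, idx = logits.topk(self.k, dim=-1)  # (tokens, k)
        weights = F.softmax(top_logits, dim=-1)        # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = Top2Router()
print(layer(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```

The point of contrast in the comment above: a fine-grained design would route each token through many smaller experts with a larger k, whereas this pattern sends each token through just two relatively large experts.
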
18/9/2024

| Paper | Comments |
| --- | --- |
| When Context Leads but Parametric Memory Follows in Large Language Models | A highly interesting evaluation and analysis paper examining how models balance knowledge embedded in their internal parameters against information from the context. It notes that all models follow the same pattern in using context, specifically "a consistent reliance on both contextual (around 70%) and parametric (around 30%) knowledge." The WikiAtomic dataset used here seems somewhat similar to HotpotQA; synthetic data for wiki-based multi-hop reasoning is being used once again. |
| Scores as Actions: a framework of fine-tuning diffusion models by continuous-time reinforcement learning | Establishes a framework for fine-tuning diffusion models through continuous-time reinforcement learning. |
| AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents | Investigates the balance between utility and truthfulness in LLM agents. |
| CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks | Uses MCTS for planning rather than single-step approaches. |
| Explaining Datasets in Words: Statistical Models with Natural Language Parameters | Explores statistical models that incorporate natural-language parameters for dataset explanation. |
| NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training | A study of next-token prediction for self-supervised pre-training in speech. |
| Text-To-Speech Synthesis In The Wild | Large-scale synthetic TTS dataset. |
| jina-embeddings-v3: Multilingual Embeddings With Task LoRA | Provides multilingual embeddings using task-specific LoRA. |
| Kolmogorov-Arnold Transformer | Not particularly convinced by KAN. |
| Seed-Music: A Unified Framework for High Quality and Controlled Music Generation | Congratulations to the authors! |
| On the limits of agency in agent-based models | Investigates the limits of agency within agent-based models. |
| Autoregressive + Chain of Thought (CoT) ≃ Recurrent: Recurrence's Role in Language Models and a Revisit of Recurrent Transformer | An important paper that theoretically shows CoT serves a function similar to recurrence in models. |
| Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding | HCoT activates multi-step reasoning. |

17/9/2024

| Paper | Comments |
| --- | --- |
| SelECT-SQL: Self-correcting ensemble Chain-of-Thought for Text-to-SQL | |
| On the Diagram of Thought | Treats reasoning as a directed acyclic graph and introduces role-specific tokens. |
| Can GPT-O1 Kill All Bugs? An Evaluation of GPT-Family LLMs on QuixBugs | A preliminary, rough evaluation report of O1's debugging ability. |
| Towards Explainable Automated Data Quality Enhancement without Domain Knowledge | |
| An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems | |
| Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition | |
| Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison | Proposes a training-free method for assessing the quality of preference datasets, introducing three metrics based on scale, label noise, and information content. Though coarse, the approach offers valuable insights. |
| Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions | |
| Self-Attention Limits Working Memory Capacity of Transformer-Based Models | Uses an N-back dataset to verify that large language models solve such problems by learning a specific attention paradigm. This physics-of-LLMs-style experiment is limited by its use of toy data, making its conclusions hard to generalize to real LLM pre-training. (A sketch of the N-back task format follows this table.) |
| Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs | |
| Fast Analysis of the OpenAI O1-Preview Model in Solving Random K-SAT Problem: Does the LLM Solve the Problem Itself or Call an External SAT Solver? | Another rough, quick evaluation report, this time of the O1 model's SAT-solving capability. |
| Learning Source Disentanglement in Neural Audio Codec | |
| LOLA -- An Open-Source Massively Multilingual Large Language Model | |
| OmniGen: Unified Image Generation | |
| NVLM: Open Frontier-Class Multimodal LLMs | |
| Apollo: Band-sequence Modeling for High-Quality Audio Restoration | |

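To make the N-back row above concrete, here is a minimal generator for N-back items. The format is my own assumption for illustration, not the paper's actual dataset, and the function name is hypothetical.

```python
import random

def make_n_back_trial(n, length=12, vocab="ABCDEFGH", match_rate=0.3, seed=None):
    """Generate one N-back sequence and its labels.

    For each position i >= n, the label is True iff seq[i] == seq[i - n],
    i.e. the model must hold the last n symbols in working memory.
    (Illustrative format only, not the paper's exact data.)
    """
    rng = random.Random(seed)
    seq = [rng.choice(vocab) for _ in range(n)]
    labels = []
    for i in range(n, length):
        if rng.random() < match_rate:
            seq.append(seq[i - n])  # plant a match
        else:
            # any symbol except the one n steps back, avoiding accidental matches
            seq.append(rng.choice([c for c in vocab if c != seq[i - n]]))
        labels.append(seq[i] == seq[i - n])
    return "".join(seq), labels

seq, labels = make_n_back_trial(n=2, seed=0)
print(seq, labels)  # 12 symbols plus 10 match/no-match labels
```

Accuracy on such probes degrades as n grows, which is the working-memory limit the paper's title refers to.
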
16/9/2024

| Paper | Comments |
| --- | --- |
| United in Diversity? Contextual Biases in LLM-Based Predictions of the 2024 European Parliament Elections | Examines hidden political biases within LLMs using data from the Brexit referendum. It supports the consensus that LLMs may be unsuitable for computational social science research in certain contexts, because prompting cannot elicit the political views and voting tendencies of different demographic groups. Moreover, a detailed analysis of how internal biases are embedded within the parameters was not conducted. |
| Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis | Tests the coding abilities of LLMs on the Romanian Informatics Olympiad, which provides test sets that can be merged in directly. |
| Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU | The analysis of four attention patterns in MLLMs is intriguing. Pattern 1: recent tokens exhibit high attention scores. Pattern 2: tokens derived from videos typically receive high attention scores. Pattern 3: positions with high attention scores appear as vertical lines. Pattern 4: high attention scores shift forward as multi-round inference progresses. Based on these patterns, the study introduces an attention bias to evict irrelevant entries from the KV cache. (A rough sketch follows this table.) |
| Cross-Entropy Optimization for Hyperparameter Optimization in Stochastic Gradient-based Approaches to Train Deep Neural Networks | - |
| Robust Training of Neural Networks at Arbitrary Precision and Sparsity | - |
| Unleash LLMs Potential for Recommendation by Coordinating Twin-Tower Dynamic Semantic Token Generator | Introduces instructions and user representations into a single-tower generative recommendation system. |
| What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | LLM-based criteria and validation for data-slicing hypotheses. |
| Language Models "Grok" to Copy | The authors make three claims: 1. the ability to copy emerges via grokking during training; 2. the ability to copy is independent of the total number of training tokens; 3. induction heads form from shallower to deeper layers, potentially indicating that copying evolves into a higher-order intrinsic capability in later stages and that its grokking is related to the deeper layers. |
| Symbolic Regression with a Learned Concept Library | Proposes a concept library, called LaSR, potentially suitable for LLM physics research. |
| Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens | Discusses planning/pausing tokens in the Decision Transformer. |
| TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings | Enables the vision encoder to extract more contextually relevant information from the text. |
| ValueCompass: A Framework of Fundamental Values for Human-AI Alignment | - |
| MindScape Study: Integrating LLM and Behavioral Sensing for Personalized AI-Driven Journaling Experiences | - |
| Explore the Hallucination on Low-level Perception for MLLMs | An interesting multimodal test assessing models' recognition of relative spatial relations among entities in images; MLLMs perform relatively poorly on it. This highlights a significant open issue in MLLMs. The benchmark is practically meaningful, and the issue could likely be mitigated with better data. |
| Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges | Proposes a contamination-detection method based on a local-order quiz, finding notably high contamination levels in GPT-4, particularly on the HumanEval and GSM8k training sets. |

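The attention-bias idea in the Inf-MLLM row above can be pictured as scoring cached tokens and evicting the weakest. The sketch below is my own rough reconstruction of the general pattern (accumulated attention plus a recency bias), not the paper's actual algorithm; all names and the bias form are hypothetical.

```python
import numpy as np

def select_kv_to_keep(attn_history, budget, recency_bias=0.1):
    """Choose which KV-cache positions to keep.

    attn_history: (recent_queries, cache_len) attention weights.
    Scores each cached position by its accumulated attention, adds a small
    bias favoring recent positions (pattern 1 above), and keeps the top
    `budget` positions. A rough illustration, not Inf-MLLM's algorithm.
    """
    cache_len = attn_history.shape[1]
    scores = attn_history.sum(axis=0)                      # accumulated attention
    scores = scores + recency_bias * np.arange(cache_len) / cache_len
    return np.sort(np.argsort(scores)[-budget:])           # top-k, in position order

rng = np.random.default_rng(0)
attn = rng.random((4, 32))               # 4 recent queries over 32 cached tokens
attn /= attn.sum(axis=1, keepdims=True)  # rows normalized like softmax outputs
print(select_kv_to_keep(attn, budget=16))  # indices of the 16 positions to keep
```

Evicting by such a score keeps the cache bounded during streaming inference, which is what makes single-GPU deployment feasible.
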

If you are interested in the work we have published, please navigate to our full paper list.