M-A-P Daily Paper

The M-A-P daily paper project curates and reviews a selection of new papers published daily on arXiv, providing insightful commentary on cutting-edge research across various scientific disciplines.

🛠️ Papers This Week

(Expand to View)

28/9/2024

Paper	Comments
MIO: A Foundation Model on Multimodal Tokens	Focuses on joint modeling of multimodal tokens. While the paradigm may not introduce fundamentally unique aspects, the summary and organization of Pretrain and SFT data are among the most solid efforts in current open-source Any2Any models. This is an initial version, with plans for further ablation results. Also relevant is OmniBench: Towards The Future of Universal Omni-Language Models, as OLM represents an imaginative research direction.
Emu3: Next-Token Prediction is All You Need	From BAAI's Any2Any, with several points worth noting in data processing: (1) using optical flaw removal for minimal/extreme motion between frames, (2) adding supplementary data specifically for image understanding during pretraining, and (3) introducing [SOV], [SOT], and [EOV] tokens for parsing meta text and vision tokens. The training flow incorporates DPO.
Infer Human's Intentions Before Following Natural Language Instructions	A straightforward intuition and method, with the core takeaway being that instructions may contain ambiguities, potentially diverging from what humans intend. By analyzing instructions to infer what a person truly intends for the robot to do in a given context, this additional step can help improve performance.
Compositional Hardness of Code in Large Language Models -- A Probabilistic Perspective	Recommended reading. The assumptions do not appear overly strong, providing probabilistic insights into why decomposing a problem into multiple parts can be beneficial. A closer examination of the proof could be worthwhile to assess the soundness.
StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?	The takeaway is that, similar to humans, models may perform better under moderate stress (as suggested by the Yerkes-Dodson law). This type of research provides an interesting perspective on model behavior, aligning with recent studies in the field that yield intriguing findings, although the reliability of such insights may still warrant further validation.
Enhancing Elusive Clues in Knowledge Learning by Contrasting Attention of Language Models	An intriguing mechanism that contrasts the attention distributions of small and large models on knowledge-intensive documents, selectively dropping tokens to encourage learning of non-obvious but important clues and improving performance. Interestingly, this effect is also observed in larger models, suggesting that even larger models benefit from filtering out non-essential information. However, the experimental evidence could be more convincing.
Explanation Bottleneck Models	Preliminary concept of a VAE Explanation Model freed from strict concept sets.
FactorSim: Generative Simulation via Factorized Representation	Recommended reading. This approach seems to emulate the human abilities of environmental and spatial imagination, complementing RL's methods for learning to adapt to environments. It is reminiscent of a paper from nearly two years ago, Mind's Eye: Grounded Language Model Reasoning through Simulation. Simulated environments like these, aside from code/math, are promising areas for exploration. For instance, a simulated chemistry lab where the model decides on actions and observes the outcomes, or similar environments like Atari games, where the model can apply learned tutorials to evaluate generalization of its skills. This seems a highly valuable research direction.
Just Say What You Want: Only-Prompting Self-Rewarding Online Preference Optimization
Human Mobility Modeling with Limited Information via Large Language Models	This paper explores an interesting concept, particularly with the V's A and B human behavior monitoring datasets, which offer valuable insights. It may be worthwhile to consider how similar approaches could be applied to collect daily behavior patterns across various fields. Relevant datasets include the National Household Travel Survey and the Activity-Based Model Dataset from the Southern California Association of Governments.
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models	Recommended reading. This work introduces learnable masks to achieve domain-specific adaptation, revealing key insights, such as whether rare downstream patterns learned in pretraining are fully aligned. The approach is quite novel.
Exploring Semantic Clustering in Deep Reinforcement Learning for Video Games	Examines which types of video games have similarities.
Post-hoc Reward Calibration: A Case Study on Length Bias	Recommended reading. This approach intuitively edits rewards by estimating and removing biases from the reward, making it quite adaptable.
Search for Efficient Large Language Models
FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction
HydraViT: Stacking Heads for a Scalable ViT
Why Companies "Democratise" Artificial Intelligence: The Case of Open Source Software Donations	Systematically analyzes the benefits for companies that fund open-source projects, concluding that it yields numerous advantages, including cost savings, visibility, and increased engagement.
Inference-Time Language Model Alignment via Integrated Value Guidance	Aligns or enhances alignment during inference, raising the possibility that instruction-following in LLMs, if viewed as a mechanical pattern, could be achieved training-free through controlling attention pattern composition. Stanford recently had related work in this area.
Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models	Recommended reading, claiming SOTA performance for LLMs used as gradient priors.
Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness	Recommended reading. This paper introduces a straightforward yet effective concept, positing that human preferences are nuanced rather than binary, making it worth consideration.
What Would You Ask When You First Saw a2+b2=c2? Evaluating LLM on Curiosity-Driven Questioning	Explores an intriguing question: how well LLMs exhibit curiosity-driven behavior when seeking knowledge and completing prerequisites.
HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows
CSCE: Boosting LLM Reasoning by Simultaneous Enhancing of Casual Significance and Consistency	This paper provides a compelling perspective on problem-solving, emphasizing that each inference step should be evaluated for its relevance to the overall solution. It suggests that context can sometimes serve as noise in relation to the primary optimization goal, advocating for a focus on dependencies and assessing whether addressing a subproblem contributes to the main objective. While the method could be further refined, the underlying perspective is valuable.

26/9/2024

Paper	Comments
Beyond Following: Mixing Active Initiative into Computational Creativity	This is an HCI study on user-AI collaborative creativity. There were many similar studies one or two years ago. It seems that apart from instruction following, this represents a creative pattern that remains to be explored.
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale	A Multi-Agent SE framework, introducing Feature Localization and Edition, which are distinct from previous frameworks. It seems more aligned with an agile development workflow.
Task-oriented Prompt Enhancement via Script Generation	SoT+PoT. The most obvious value of such XoT approaches, especially after o1, is how to extract a large set of scalable high-confidence data for training the model.
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models	-
Harnessing Diversity for Important Data Selection in Pretraining Large Language Models	A framework for estimating pretraining data quality based on clustering and downstream performance. From a pretraining perspective, this approach doesn't seem particularly useful. However, it might be valuable for extracting data sources for producing SFT by leveraging the clustering method.
Algorithmic Drift: A Simulation Framework to Study the Effects of Recommender Systems on User Preferences	-
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization	-
Demystifying Issues, Causes and Solutions in LLM Open-Source Projects	An interesting read. It analyzes issues in LLM open-source projects.
Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification	Figure 5 demonstrates the actual impact on distribution. There are many unnatural aspects within CLIP, for example, humans decide what to focus on based on their own context, which is evident in several benchmarks. The bottleneck in MLLM's visual information during image encoder pretraining is quite severe, and there is significant room for improvement.
Uncertainty Representations in State-Space Layers for Deep Reinforcement Learning under Partial Observability	-
Enhancing Temporal Sensitivity and Reasoning for Time-Sensitive Question Answering	The methods used here are questionable and can be ignored. However, the four TSQA benchmarks used might be valuable as corner cases that LLM users could experience.
Towards User-Focused Research in Training Data Attribution for Human-Centered Explainable AI	-
Counterfactual Token Generation in Large Language Models	-
INT-FlashAttention: Enabling Flash Attention for INT8 Quantization	Baichuan's INT8 + Flash Attention approach.
How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not	-
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale	LLM-based data pretrain filtering, with a related training dataset released. This is indeed the direction for the future and is recommended reading.
FineZip : Pushing the Limits of Large Language Models for Practical Lossless Text Compression	-
AXCEL: Automated eXplainable Consistency Evaluation using LLMs	-
Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents	Shorter CoT combined with tool-using calls can produce many useful agents, rather than relying on excessively long CoTs (especially considering potential redundancy). Recommended reading. An interesting piece of agent work. Creating new APIs is akin to creating reasoning shortcuts, similar to encapsulating functions in code.
Tell Me What You Don't Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing	-
Multi-objective Evolution of Heuristic Using Large Language Model	-
Attention Prompting on Image for Large Vision-Language Models	Recommended reading. It employs an embarrassingly naive yet effective trick. When humans look at images, they decide what to focus on based on context. For MLLMs, adjusting the CLIP embedding at inference-time based on context is a highly meaningful research topic.
A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms	-
Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference	-
Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval	-

25/9/2024

Paper	Comments
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models	Developed with interns, currently the best open-source long-text generation benchmark on the market. The sections on Open-Ended QA and Heuristic Text Generation are particularly valuable.
MonoFormer: One Transformer for Both Diffusion and Autoregression	A surprisingly simplistic approach combining Diffusion and Autoregressive techniques.
EuroLLM: Multilingual Language Models for Europe	Highly recommended! The Joint Scaling Law is intriguing and may be extrapolated to domain-specific ratios. Once the number of domains increases, it could yield a higher ROI than the D-CPT Law or similar optimization functions.
MaskBit: Embedding-free Image Generation via Bit Tokens	-
CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data	Recommended for reading. It's a high-quality benchmark. Notably, they mention that Tencent annotated over 20,000 CoT questions, but it is unclear how much that cost. It might have been better used for training on o1 tasks.
LLM Echo Chamber: Personalized and Automated Disinformation	-
On the Complexity of Neural Computation in Superposition	-
Watch Your Steps: Observable and Modular Chains of Thought	This is CMU's version of DoT. It feels similar to a recent THU paper, but this one is written more clearly and is easier to understand. There are some differences in the mechanism, but the approach of identifying and naming steps, defining the input/output behavior, and replacing CoT explanations with formalized steps for the same examples is explained more comprehensibly here.
Reward-Robust RLHF in LLMs	-
Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models	-
ControlMath: Controllable Data Generation Promotes Math Generalist Models	By retaining only the more difficult problems during each round of synthetic data iteration, the difficulty of the math problems generated by the model is improved.
Parse Trees Guided LLM Prompt Compression	-
Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models	No comment provided.
In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models	Raises a valuable issue: LLMs are not sensitive to trivial context changes, leading to the A-Not-B problem.
VLMine: Long-Tail Data Mining with Vision Language Models	-
TFG: Unified Training-Free Guidance for Diffusion Models	-
Empirical Insights on Fine-Tuning Large Language Models for Question-Answering	Finding 1 and Figure 4 are interesting.
Adaptive Learn-then-Test: Statistically Valid and Efficient Hyperparameter Selection	An interesting read, with a compelling automated hyperparameter search mechanism.
Fine-Tuning is Fine, if Calibrated	-
Steward: Natural Language Web Automation	-
CLSP: High-Fidelity Contrastive Language-State Pre-training for Agent State Representation	-

24/9/2024

Paper	Comments
OmniBench: Towards The Future of Universal Omni-Language Models	The authors put significant effort into data annotation for this paper. In OmniBench, each data sample requires simultaneous use of audio, image, and textual information to answer during inference. Every sample was manually checked by coauthors, and missing one modality reduces the accuracy to guessing from 2-3 options. There are two key takeaways: 1) Currently, none of the open-source Omni models can process all three modalities simultaneously, and even closed-source models have limited capabilities with no cross-modal generalization. 2) The performance of Multimodal Large Language Models (MLLMs) is constrained under image description, audio transcription, and text response conditions, with models like Gemini reaching this upper bound and Reka-core trailing by about 10 points. However, the encoders still lose a considerable amount of information. It is rumored that GDM has a similar system, which could become a future benchmark for Omni-Language Models (OLM).
Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models	A potentially useful embodied multimodal benchmark.
LLMs are One-Shot URL Classifiers and Explainers	The same discovery was made about six months ago. Our experiments were slightly more extreme, showing that GPT-3.5 and Claude are excellent URL classifiers, capable of determining the source based on the root domain. This is significant because this ability can be leveraged to gather seed data for various domains, enabling the cold-start training of models like fastText and BERT. This can further assist in classifying already pre-collected training data. We plan to release a dataset with 67 domain-specific categories within 2-3 weeks.
A-VL: Adaptive Attention for Large Vision-Language Models	In MLLMs, images and text have distinct self-attention patterns. Therefore, different caching modes are designed for each modality during inference. The visual component “stores the cache of potentially useful information but only computes the most critical parts,” while the language part prioritizes local information.
VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models	A benchmark for evaluating the generalization of text-to-image models.
Distribution-Level Feature Distancing for Machine Unlearning: Towards a Better Trade-off Between Model Utility and Forgetting	Introduces Optimal Transport to prevent the model from being misled by forgetting. While the author does not frequently read such papers, this approach appears to make sense.
Target-Aware Language Modeling via Granular Data Sampling	Introduces n-gram feature recognition related to downstream tasks to identify relevant data, aiming to enhance downstream task performance without sacrificing the performance of other tasks.
Past Meets Present: Creating Historical Analogy with Large Language Models	-
Do Large Language Models have Problem-Solving Capability under Incomplete Information Scenarios?	A good design concept for an interactive benchmark, though the data quality and differentiation seem to be lacking, and the hard mode is not particularly challenging. Examples include "Who is Undercover" and "Twenty Questions".
VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models	1) VLMs exhibit varying sensitivity to different colors but consistently show insensitivity to green across various VLMs. 2) They demonstrate differing shape sensitivity and semantic recognition based on the LLM’s capacity, despite using the same fixed visual encoder.
Identify As A Human Does: A Pathfinder of Next-Generation Anti-Cheat Framework for First-Person Shooter Games	A framework for anti-cheat mechanisms in games.
Orthogonal Finetuning for Direct Preference Optimization	-
Inference-Friendly Models With MixAttention	-
Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task	No comment provided.
Efficiently Dispatching Flash Attention For Partially Filled Attention Masks	No comment provided.
Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking	Analyzes judgment biases in LLM-based benchmarking.
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?	A brief review of AI applications in medicine, with little improvement noted.
VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning	A benchmark similar to MMMU, potentially useful.
Drift to Remember	-
You Only Use Reactive Attention Slice For Long Context Retrieval	A Reactive Attention mechanism applied to retrieval-augmented generation (RAG). This work by Meta seems intuitively reasonable and is recommended for further reading.
Prompt Baking	A lightweight method for model parameter personalization based on prompts. This is an interesting small trick, recommended for further reading.
RNR: Teaching Large Language Models to Follow Roles and Rules	A potentially useful benchmark for role-playing scenarios.
Language agents achieve superhuman synthesis of scientific knowledge	-
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model	-
A Multi-LLM Debiasing Framework	-
One-shot World Models Using a Transformer Trained on a Synthetic Prior	-
ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models	A useful domain-specific benchmark.
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders	Recommended reading. Describes a phenomenon called feature absorption in sparse autoencoders, which may affect the reliability of SAE.
Proof Automation with Large Language Models	Recommended reading. It decouples low-level proof from high-level speculative decoding in formal proof systems.
TracrBench: Generating Interpretability Testbeds with Large Language Models	A high-quality benchmark.

23/9/2024

Paper	Comments
Guided Profile Generation Improves Personalization with LLMs	This paper itself holds limited value, but several recent papers have proposed similar approaches, either generating synthetic data or using CoT (Chain-of-Thought) guidance during inference. The core idea is similar to requiring the model to link Hotpot QA-like data or induce the model to connect internal patterns across different data, forming coherent reasoning chains. This helps the model convert unfamiliar reasoning shortcuts into familiar, simple reasoning combinations. Such approaches have potential for scaling, and since the assumption is that the original data is high quality or at least factually correct, the generated data does not necessarily require strong domain-specific verifiers. Recommended reading: EntiGraph.
AutoVerus: Automated Proof Generation for Rust Code	This paper presents a valuable benchmark and dataset work focused on Rust code.
LLM Surgery: Efficient Knowledge Unlearning and Editing in Large Language Models	-
Can we only use guidelines instead of shots in prompt?	This is a potential solution for generating CoT in batches, focusing on mass production and modification of guidelines so that questions of the same type can be answered correctly using these guidelines. However, the merging of different guidelines appears somewhat simplistic. Recommended reading.
The Impact of Large Language Models in Academia: from Writing to Speaking	The reverse impact of GPT-like models on humans has become a popular topic in HCI since mid-last year, and recently, many NLP researchers have been exploring it as well. It appears to be an emerging direction. (This is more about educating GPT users, not models).
The FIX Benchmark: Extracting Features Interpretable to eXperts	"We find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts."
YesBut: A High-Quality Annotated Multimodal Dataset for Evaluating Satire Comprehension Capability of Vision-Language Models	Image-based satire recognition is an interesting benchmark scenario with potential for improvement. This is an area where certain improvements can be made at the data level. Recommended benchmark: II-Bench, an Image Implication Understanding Benchmark for Multimodal Large Language Models.
Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Tensorflow Pretrained Models	-
Generating Visual Stories with Grounded and Coreferent Characters	-
LLMS STILL CAN’T PLAN; CAN LRMS? A PRELIMINARY EVALUATION OF OPENAI’S O1 ON PLANBENCH	O1 PlanBench quick evaluation.
Imagine yourself: Tuning-Free Personalized Image Generation	-
EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models	Empathy, surprisingly, is an internal mechanism of models. The larger the model, the more it attempts to interpret the implicit information represented by user input and adapt its responses accordingly. A few months ago, an internship project worked on a similar benchmark. Recommended benchmark: GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models. Some cases from this could be considered for inclusion in safety testing.

If you are intereted in the work published by us, please navigate to our full paper list.