data subTLDR week 52 year 2025

r/MachineLearning • r/dataengineering • r/SQL

Rewriting Nightmare Databases: Solutions and Suggestions, Converting CSV to SQL Efficiently, The Art of Showcasing SQL Skills, The Unseen Value of Data Engineers, Kafka: Overkill or Essential?

December 28, 2025 • Week 52, 2025

Posted in r/dataengineering by u/Different_Pain5781 • 12/23/2025

267

Most data engineers would be unemployed if pipelines stopped breaking

Discussion

Data engineers widely agree that troubleshooting and fixing issues is the most visible part of their job, and that it often overshadows the preventative work of building and maintaining robust systems. Beyond problem-solving, the role involves less noticeable tasks such as data contracts, upstream alignment, cost control, schema evolution, access rules, and preemptive quality checks. In mature teams, the focus shifts from firefighting to prevention, and success is measured by how quietly and smoothly the systems run. This work is often undervalued because it is less visible, yet it is critical in large, well-established companies where pipelines are typically stable. Commenters pushed back on the suggestion that engineers merely act as a retry button, with many emphasizing the complexity and value of their roles.

105 comments


Posted in r/dataengineering by u/Vodka-_-Vodka • 12/25/2025

238

Am I crazy or is kafka overkill for most use cases?

Discussion

There's a consensus that Kafka may be overkill for smaller-scale projects, such as processing 10k events per day, with the setup and maintenance potentially being time-consuming and labor-intensive. A simple alternative solution or a managed service could be more efficient. However, if the volume is expected to increase considerably, Kafka might be justified. Some participants criticize the 'resume-driven' approach, emphasizing the importance of relevant skills and problem-solving. There's also a warning about potential data quality issues and pipeline failures due to excessive data movement. Overall, the sentiment leans towards caution before implementing complex systems unnecessarily.

119 comments


Posted in r/MachineLearning by u/ArtisticHamster • 12/25/2025

231

[D] Best papers of 2025

Discussion

According to commenters, the top papers of 2025 focused on advances in language modeling and machine learning more broadly. DeepSeek R1 and V3 were recognized for their impact on open-source large language models (LLMs) and as Chinese contributions to the field. Large language diffusion models were praised for being faster, more controllable, and breaking the reversal curse. Vision Language Action models were lauded for their potential in robotics. Recurrent/Latent Reasoning Models shook up ARC this year, reviving interest in RNNs. The community also appreciated work on efficient LLMs that lowers accessibility barriers. However, some noted the absence of breakthroughs in standard reinforcement learning or computer vision. The overall sentiment was positive and acknowledged the rapid pace of progress in the field.

30 comments


Posted in r/dataengineering by u/EarthGoddessDude • 12/25/2025

180

New table format announced: Oveberg

Meme

The announcement of the new table formats, Oveberg and FuckLake, hosted on ASS, has generated a mix of humor and genuine interest. A majority of users appreciate the humorous tone of the announcement, while others query its practical applications. There's curiosity about the ability of Oveberg to process large data sets, and questions about importing data into these new systems. A few commenters shared job requirements for managing these new data platforms, highlighting the need for experience in data engineering. Despite the playful tone, there's a strong underlying interest in the potential of these new data formats.

31 comments


Posted in r/MachineLearning by u/0ZQ0 • 12/28/2025

141

How do you as an AI/ML researcher stay current with new papers and repos? [D]

Discussion

AI/ML researchers stay current with new papers and repositories by setting up author notifications on Google Scholar, searching conference papers for relevant keywords, tracking citations to seminal papers in their niche, and tuning recommendation algorithms on platforms like Scholar Inbox. Reading paper titles and abstracts from top conferences is another common practice, while social media such as Twitter provides a quick, if sometimes misleading, source of information. Tools such as Semantic Scholar, which offer semantic search and updates on new papers matching the user's profile, are also used. However, the abundance of new material often leads to a backlog that takes significant time to read and review.

61 comments


Posted in r/MachineLearning by u/moji-mf-joji • 12/24/2025

118

[D]2025 Year in Review: The old methods quietly solving problems the new ones can't

Discussion

The thread emphasizes a return to pre-Transformer principles in NLP research to solve problems that newer methods like Transformers can't handle efficiently. Continuous HMMs, Viterbi search, and n-gram smoothing have all resurfaced to tackle these issues. The discussion also highlights the limits of scale in delivering efficiency and reliability, suggesting a need to reintroduce structure. However, some argue that much of AI research is incremental and potentially deceptive. There's a call to define the right structure for the desired AI systems in order to increase data efficiency. Some criticize LLMs for rewriting text rather than polishing it. The sentiment is mixed.

33 comments


Posted in r/MachineLearning by u/Everlier • 12/27/2025

104

[D] r/MachineLearning - a year in review

Discussion

In 2025, r/MachineLearning's central discussions revolved around open-source accessibility, conference sustainability, visa and access barriers, and research integrity. DeepSeek's decision to open-source its work, despite reported 45x training efficiency gains, was widely discussed, alongside a trend towards monetization models similar to Meta's Llama strategy. The surge in conference submissions led to concerns about acceptance processes and the overall quality of work presented. International researchers faced visa challenges, prompting calls for venue relocation. Issues with peer review and publishing integrity were exposed, with a focus on declining review quality at top ML conferences. Other key topics included Mamba's lack of adoption, the debate between Vision Transformers and CNNs, and the economic impact of AI.

7 comments


Posted in r/SQL by u/LessAccident6759 • 12/22/2025

boss rewrites the database every night. Suggestions on 'data engineering? (update on the nightmare database)

Discussion

The discussion revolves around a company's dysfunctional database system, which lacks a central logic and clearly defined dependencies, making navigation difficult and scalability an issue. The daily practice of dropping and rewriting tables is criticized. A majority of commenters agree that the system, involving 15-20 table joins, is too complex and unsustainable. Some suggest it resembles a poorly designed data warehouse, and urge the author to document, simplify, and incrementally load tables. Others suggest that the situation might not be as dire, depending on the company's data requests and the specific business area. The overall sentiment is a mix of concern and constructive advice.

63 comments
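
The advice in the thread above to load tables incrementally, rather than dropping and rewriting them every night, can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the setup described in the thread: the table names `orders_source` and `orders_report`, their columns, and the `updated_at` watermark column are all hypothetical.

```python
import sqlite3

# Hypothetical schema: a source table and a reporting table keyed by id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS orders_source (
    id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT
);
CREATE TABLE IF NOT EXISTS orders_report (
    id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT
);
"""


def incremental_load(conn):
    """Copy only source rows newer than the report's latest timestamp.

    Returns the number of source rows that were picked up.
    Assumes updated_at holds ISO-8601 strings, which sort chronologically.
    """
    cur = conn.cursor()
    # The watermark is the newest timestamp already loaded ('' on first run).
    (watermark,) = cur.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM orders_report"
    ).fetchone()
    (pending,) = cur.execute(
        "SELECT COUNT(*) FROM orders_source WHERE updated_at > ?", (watermark,)
    ).fetchone()
    # Upsert: insert new ids, overwrite rows whose id already exists.
    # Note: the strict '>' can miss late rows that share the watermark timestamp.
    cur.execute(
        """
        INSERT INTO orders_report (id, amount, updated_at)
        SELECT id, amount, updated_at FROM orders_source
        WHERE updated_at > ?
        ON CONFLICT(id) DO UPDATE SET
            amount = excluded.amount,
            updated_at = excluded.updated_at
        """,
        (watermark,),
    )
    conn.commit()
    return pending
```

Each run touches only rows changed since the previous run, so the reporting table stays queryable during the load. Deleted source rows are not handled by this sketch and would need a separate mechanism.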


Posted in r/SQL by u/ImpossibleAlfalfa783 • 12/25/2025

Does anyone know a tool to convert CSV file to "SQL statements"?

SQLite

The top suggestions to convert a CSV file into SQL statements include writing a Python script using pandas to determine types and create the table statement, and using the CSV Lint plugin for Notepad++. Other popular recommendations involve using DuckDB, which can infer types automatically and create the table and import data, or crafting a string in Excel. Some users recommend direct data loading into a table in various database dialects, but note that the table would first need to be created by another method. Python is widely hailed as a versatile language for achieving this task.

85 comments
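
The script-based approach suggested in the thread above (infer column types, emit a CREATE TABLE statement, then emit the inserts) might look something like the sketch below. It swaps pandas for Python's standard csv module so that it runs without extra dependencies. The function names are hypothetical, and the type inference is deliberately naive: it only distinguishes INTEGER, REAL, and TEXT, and it assumes a header row and rows of equal length.

```python
import csv
import io


def infer_sql_type(values):
    """Return INTEGER, REAL, or TEXT for a column of strings (blanks ignored)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in non_empty:
                cast(v)
        except ValueError:
            continue  # at least one value does not fit; try the next type
        return sql_type
    return "TEXT"


def quote_ident(name):
    """Quote a table or column name, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def sql_literal(value, sql_type):
    """Render one CSV cell as a SQL literal; empty cells become NULL."""
    if value == "":
        return "NULL"
    if sql_type == "INTEGER":
        return str(int(value))
    if sql_type == "REAL":
        # Caveat: values such as 'nan' or 'inf' pass float() but are not valid SQL.
        return repr(float(value))
    return "'" + value.replace("'", "''") + "'"


def csv_to_sql(csv_text, table_name):
    """Return a list of SQL statements: one CREATE TABLE, then one INSERT per row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data)) if data else [() for _ in header]
    types = [infer_sql_type(col) for col in columns]
    table = quote_ident(table_name)
    col_defs = ", ".join(f"{quote_ident(h)} {t}" for h, t in zip(header, types))
    statements = [f"CREATE TABLE {table} ({col_defs});"]
    for row in data:
        values = ", ".join(sql_literal(v, t) for v, t in zip(row, types))
        statements.append(f"INSERT INTO {table} VALUES ({values});")
    return statements
```

For example, `csv_to_sql("id,name\n1,Ada\n", "people")` returns a CREATE TABLE statement followed by a single INSERT. For large files, loading the data directly with the database's own import tooling, as other commenters suggest, is likely to be faster than generating one statement per row.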

