data subTLDR week 41 year 2025

r/MachineLearning • r/dataengineering • r/SQL

Demystifying Advanced Work Concepts: CTEs vs Temp Tables, Case-Sensitivity in SQL, Monthly Cumulative Balance Techniques, Understanding 'Real-Time' in Data Engineering, and Unseen Challenges in Production Data Pipeline Rollouts

October 12, 2025 • Week 41, 2025

Posted in r/dataengineering by u/wtfzambo • 10/9/2025

457

I'm sick of the misconceptions that laymen have about data engineering

Discussion

The thread discusses widespread frustration with the misuse of data engineering, particularly the trend of building expensive real-time data systems without understanding the business case. Many agree that real-time often erroneously makes it onto requirements lists, despite few businesses needing instant data updates. This results in over-engineered systems that are costly to build and maintain. Several comments emphasize the importance of clarifying what real-time means to stakeholders, with one user suggesting that the term is often used to mean daily updates rather than instantaneous ones. Overall sentiment leans toward a need for better communication and understanding between engineers, project managers, and stakeholders.

188 comments

Posted in r/MachineLearning by u/Foreign_Fee_5859 • 10/8/2025

233

[D] Bad Industry research gets cited and published at top venues. (Rant/Discussion)

Discussion

There's a growing concern about the undue weight and attention given to research papers from major tech companies like Meta, DeepMind, and Apple, despite potential flaws and lack of novelty. Some argue that such papers get accepted to top conferences and cited heavily mainly because of the industry name attached. The process is criticized as a publicity-driven cycle in which papers are published, selectively impressive results are highlighted, and the work gets overhyped, leading to widespread acceptance. Many believe that the significant financial resources available in the AI industry also play a role. However, others point out that certain heavily cited works do contribute to the field, even if their initial proposals were flawed or incremental. Overall, the sentiment leans toward a call for more rigorous standards in accepting and citing industry research.

57 comments

Posted in r/dataengineering by u/Flat_Direction_7696 • 10/7/2025

196

I just rolled out my first production data pipeline, and I expected the hardest things would be writing ETL scripts or managing schema changes. I soon discovered the hardest things were usually things that had not crossed my mind:

Help

The real-world challenges in data engineering extend far beyond writing ETL scripts or managing schema changes. Practitioners often grapple with dirty or inconsistent data, ensuring pipelines are idempotent to avoid data duplication or poisoning, and setting up effective monitoring and alerts. Working with inexperienced teams can further complicate these tasks. Educational programs often fail to prepare students for these realities, with clean datasets providing an unrealistically smooth learning experience. Managers' technical decisions, often made without consulting data engineers, and shifting priorities can exacerbate these challenges. The crux of data engineering lies in ensuring visibility into data changes, understanding failures, ensuring safe reruns, and aligning teams on good data definitions.
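The point about idempotent pipelines and safe reruns can be illustrated with a small sketch. It assumes a hypothetical `orders` table keyed on `order_id` and uses SQLite through Python's standard library purely for convenience; none of these names come from the thread. The load step upserts on the key, so running the same batch twice leaves the table unchanged instead of duplicating rows.

```python
import sqlite3


def load_batch(conn, rows):
    """Upsert rows keyed on order_id so a rerun overwrites instead of duplicating."""
    conn.executemany(
        """
        INSERT INTO orders (order_id, amount)
        VALUES (?, ?)
        ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount
        """,
        rows,
    )
    conn.commit()


# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

batch = [(1, 10.0), (2, 25.5), (3, 7.25)]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated rerun after a partial failure

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(row_count, total)
```

A plain `INSERT` in the same position would fail or duplicate on the second run, depending on whether the key is enforced, which is the kind of rerun hazard the thread describes.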

35 comments

Posted in r/dataengineering by u/NoGanache5113 • 10/7/2025

176

I can’t* understand the hype on Snowflake

Discussion

A majority of users appreciate Snowflake for its convenience and simplicity, emphasizing its user-friendly experience and robust performance. They note that the data warehouse solution requires less maintenance and generally just works, making it a preferred choice for many professionals. Despite criticisms about cost and vendor lock-in, users argue that the benefits outweigh these drawbacks. Recent improvements, including solutions around machine learning and governance, were noted, and some users highlighted Snowflake's evolution to an end-to-end cloud platform. The sentiment is generally positive, with users recognizing Snowflake as a valuable tool for data analysts and engineers, particularly in smaller teams or solo roles where simplicity and efficiency are key.

122 comments

Posted in r/MachineLearning by u/random_sydneysider • 10/10/2025

119

[R] DeepSeek 3.2's sparse attention mechanism

Research

The novel sparse attention mechanism in DeepSeek 3.2 has sparked interest for its dynamic sparsity, where the model learns which tokens to focus on, offering more flexibility than fixed sparsity patterns. While implementing the FlashMLA kernel is complex, the community is developing simpler, Triton-based implementations. There's also interest in Multi-Head Latent Attention (MLA) for its speed and high performance. Some users mention minor quality degradation and limitations with parsing long files in DeepSeek web, but express overall satisfaction with the cost reductions brought by sparse attention. There's a call for accessible resources explaining the differences between this mechanism and regular multi-head attention (MHA).

8 comments

Posted in r/MachineLearning by u/blank_waterboard • 10/9/2025

[D] Anyone using smaller, specialized models instead of massive LLMs?

Discussion

The consensus is that smaller, specialized models can be more efficient and cost-effective than large language models (LLMs). Businesses are successfully deploying custom models post-trained with techniques such as parameter-efficient fine-tuning (PEFT), which commenters describe as reliable and cost-efficient. Fine-tuning for specific tasks allows the use of smaller models, and distilling large teacher models into small student models is a common practice. Small language models (SLMs), with fewer parameters than typical LLMs, are being successfully trained on high-quality curated datasets. However, some users caution that smaller models can be brittle and may struggle with out-of-distribution inputs or complex topics. Overall sentiment is positive toward smaller, specialized models.

52 comments

Posted in r/SQL by u/RutabagaJumpy3956 • 10/7/2025

Which advanced concepts do you use at work?

Discussion

In a discussion about advanced concepts used at work, the majority of commenters emphasized the importance of Common Table Expressions (CTEs), citing readability and simplicity as the main reasons. While all techniques have their place and can be similarly performant, the consensus was that code should be easy for other people to understand. Temp tables also found favor, particularly when dealing with large queries or with databases that struggle with joins. Context was deemed critical in deciding between techniques, with how often the code runs and the size of the query being key considerations. The sentiment was generally positive and constructive.
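To make the CTE versus temp table comparison concrete, here is a minimal sketch that computes the same intermediate result both ways. The `sales` table, its columns, and the threshold are hypothetical, and SQLite (via Python's standard library) stands in for whatever database the commenters use, so syntax and performance will differ elsewhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);
INSERT INTO sales VALUES ('north', 100), ('north', 50), ('south', 70), ('south', 10);
""")

# CTE: the intermediate result is named inline and exists only for this statement.
cte_rows = conn.execute("""
WITH region_totals AS (
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
)
SELECT region, total FROM region_totals WHERE total > 100 ORDER BY region
""").fetchall()

# Temp table: the intermediate result is materialized once and can be
# reused by later statements on the same connection.
conn.execute("""
CREATE TEMP TABLE region_totals AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region
""")
temp_rows = conn.execute(
    "SELECT region, total FROM region_totals WHERE total > 100 ORDER BY region"
).fetchall()

print(cte_rows, temp_rows)
```

Both queries return the same rows. The CTE keeps everything in one readable statement, while the temp table trades an extra step for a result that can be reused or inspected between steps.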

58 comments

Posted in r/SQL by u/FewNectarine623 • 10/8/2025

SQL Server treating 'Germany' and 'gErmany' the same — is it really case-sensitive?

SQL Server

The discussion centers on case sensitivity in SQL Server and why strings that differ only in case compare as equal. The top-voted view is that SQL Server's default collation is case-insensitive and that changing it is unusual unless there is a specific need. Commenters clarified that SQL Server always preserves case in storage (it is case-aware), but whether comparisons are case-sensitive is governed by the collation. Oracle and Postgres were mentioned as databases that default to case-sensitive comparisons. A few commenters described problems caused by inconsistent case handling, and others suggested resources for understanding collation better. Overall, the sentiment was mixed, with some expressing frustration over case-sensitivity issues.
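The distinction between stored case and compared case can be shown with a small sketch. Note the loud assumption: this uses SQLite through Python's standard library, not SQL Server, and SQLite's defaults are the reverse of what the thread describes (its default `BINARY` collation is case-sensitive, and `NOCASE` is opt-in). The `customers` table is hypothetical. On SQL Server the same idea would be expressed with a `COLLATE` clause naming a case-sensitive collation such as `Latin1_General_CS_AS`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)", [("Germany",), ("gErmany",)])

# Case-insensitive comparison: both spellings match.
insensitive = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE country = 'germany' COLLATE NOCASE"
).fetchone()[0]

# Case-sensitive (byte-wise) comparison: only the exact spelling matches.
sensitive = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE country = 'Germany' COLLATE BINARY"
).fetchone()[0]

# Storage preserves the original casing regardless of how comparisons behave.
stored = [row[0] for row in conn.execute("SELECT country FROM customers ORDER BY rowid")]

print(insensitive, sensitive, stored)
```

The stored values keep their original casing in both cases; only the collation applied to the comparison decides whether 'Germany' and 'gErmany' are treated as equal.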

49 comments

Posted in r/SQL by u/datascientist933633 • 10/9/2025

How do I do a cumulative balance/running total in SQL by month?

Discussion

The majority opinion is that SQL supports a cumulative balance or running total by month directly. Users recommend a window function, specifically `SUM() OVER (...)` with the rows ordered by fiscal year and month. While there is some debate over whether it is necessary to spell out the frame as `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, the consensus leans toward stating it explicitly for clarity and to avoid surprises. The overall sentiment of the thread is positive and informative, with a helpful exchange of coding solutions.
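A minimal sketch of the window-function approach follows. The `monthly_activity` table and its columns are hypothetical, it assumes one row per fiscal month (aggregate first if there are several), and it runs on SQLite through Python's standard library, which needs SQLite 3.25 or newer for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_activity (
    fiscal_year INTEGER,
    fiscal_month INTEGER,
    net_change INTEGER
);
INSERT INTO monthly_activity VALUES
    (2025, 1, 100), (2025, 2, -30), (2025, 3, 50), (2026, 1, 20);
""")

# Running balance: sum every row from the start up to and including the current one.
rows = conn.execute("""
SELECT fiscal_year,
       fiscal_month,
       SUM(net_change) OVER (
           ORDER BY fiscal_year, fiscal_month
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_balance
FROM monthly_activity
ORDER BY fiscal_year, fiscal_month
""").fetchall()

for row in rows:
    print(row)
```

On the frame debate: when `ORDER BY` is present and no frame is given, standard SQL defaults to `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, which treats rows with equal sort keys as peers and sums them together. With unique year and month pairs the result is the same, but the explicit `ROWS` frame avoids that surprise if duplicates appear.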

16 comments
