data subTLDR week 23 year 2025

r/MachineLearning, r/dataengineering, r/SQL

Crashing Production with an Open Transaction, Debating SQL Practices for DAs, Unraveling the 'Partition By' Magic, European Databricks Event: A Letdown, Keeping Up with Industry Jargon

June 8, 2025 • Week 23, 2025

Posted in r/dataengineering by u/BadBouncyBear • 6/6/2025

837

I attended a databricks event in Europe

Meme

The overall sentiment towards the Databricks event in Europe was negative, with the original poster rating it a 3/10, largely due to unmet expectations around event perks. Commenters agreed, characterizing it as a typical, somewhat disappointing conference experience. There was also confusion and humor around the phrase "data bricked the fuck up", which sparked a discussion about its provocative, possibly inappropriate use in a professional setting. Some users found the phrase amusing, while others questioned its meaning, highlighting the potential for misunderstanding when using slang or jargon.

52 comments

Posted in r/dataengineering by u/PossibilityRegular21 • 6/3/2025

586

When you miss one month of industry talk

Meme

The discussion reveals uneven familiarity with industry terms such as DuckDB, data lakes, and related concepts. Some participants grasp these concepts well, while others express confusion. One insightful comment explains the different data storage methods, mapping API-backed, SQL database, and inline object store to Iceberg, DuckLake, and Delta Lake, respectively. There's also a mention of DuckLake as an open table format from DuckDB Labs that doesn't require the use of DuckDB itself. Overall, the conversation is a blend of awareness and puzzlement over industry jargon.

30 comments

Posted in r/MachineLearning by u/jamesvoltage • 6/6/2025

212

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability

Research

The research presents an alternative approach to LLM interpretability, demonstrating that LLMs can be converted into locally linear systems, described by the author as exactly equivalent, without changing outputs or weights. This method allows nearly exact token attribution and works across various models. However, the detached Jacobian linear system is valid only for the specific input sequence it was computed for, and the computation can be slow and VRAM-intensive. Despite some skepticism about the claim of exact equivalence without mathematical proof, the research was generally well received, sparking discussions about its potential for steering, safety analysis, and a better understanding of transformer decoder functions. The sentiment was mostly positive, with commendations for the novel approach.
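To make the "valid only for each specific input" point concrete, here is a minimal sketch on a toy model. It is not the paper's method: it uses a hypothetical two-layer, bias-free ReLU network (the weights `W1` and `W2` are made up), where treating the ReLU gate as a constant gives a matrix `J` that reproduces the output exactly at that input and generally fails at other inputs.

```python
# Toy stand-in, NOT the paper's procedure: a bias-free ReLU network is
# piecewise linear, so at a fixed input x we have f(x) = J(x) @ x exactly,
# where J(x) treats the ReLU on/off pattern as a constant.

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Hypothetical weights chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]   # 3x2
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]      # 2x3

def forward(x):
    """Nonlinear forward pass: W2 @ relu(W1 @ x)."""
    hidden = matvec(W1, x)
    return matvec(W2, [max(0.0, h) for h in hidden])

def local_jacobian(x):
    """Return W2 @ diag(mask) @ W1, with the ReLU mask frozen at input x."""
    hidden = matvec(W1, x)
    mask = [1.0 if h > 0 else 0.0 for h in hidden]
    gated = [[mask[i] * w for w in W1[i]] for i in range(len(W1))]
    return [
        [sum(W2[r][k] * gated[k][c] for k in range(len(gated)))
         for c in range(len(x))]
        for r in range(len(W2))
    ]

x = [1.0, 2.0]
J = local_jacobian(x)
y = forward(x)          # nonlinear output
y_lin = matvec(J, x)    # linear reconstruction at the same input

x_other = [2.0, -1.0]
y_other = forward(x_other)
y_other_lin = matvec(J, x_other)  # reuses J from x, so it need not match
```

In this toy case `y_lin` matches `y`, while `y_other_lin` differs from `y_other` because the second input activates a different set of ReLU units. Real transformers have more kinds of nonlinearity, so this only illustrates the general idea of a per-input linear map.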

42 comments

Posted in r/dataengineering by u/kerokero134340 • 6/5/2025

197

A disaster waiting to happen

Discussion

The sentiment in the thread is primarily negative towards the company's decision to replace existing pipelines with an AI-driven platform. Skepticism about the AI platform stems from concerns about data inconsistencies, lack of audit trails, and unrealistic expectations from management. While some suggest documenting concerns or trying to communicate the risks to management, the most upvoted comments suggest a more pragmatic approach of understanding management's desired outcomes and offering alternative solutions. There's also a strong sentiment that the original poster should prepare to leave the company if the situation doesn't improve. Several comments emphasize the importance of protecting oneself by documenting everything and understanding the potential risks and costs associated with the new AI tool.

94 comments

Posted in r/SQL by u/frothymonk • 6/5/2025

189

I crashed production today by not closing a BEGIN; transaction block

PostgreSQL

The thread centers on a developer who accidentally crashed their production system by leaving a BEGIN transaction block open in a database query. The community views these mishaps as a rite of passage, offering supportive anecdotes and relatable experiences. There's a shared sentiment that mistakes like these are learning opportunities and can even signify a level of seniority in the field. The importance of backups is also highlighted, with one user sharing a story of data loss due to a mistaken query. A database administrator (DBA) suggests that automated scripts can prevent such issues, pointing to the value of proactive measures.
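As a rough illustration of the failure mode and one way to guard against it, the sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made so the example runs anywhere; the `accounts` table is hypothetical). It shows a transaction left open after a write, then the same write wrapped in a context manager that always commits or rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

# Risky pattern: the UPDATE implicitly opens a transaction that nobody closes.
conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
left_open = conn.in_transaction  # the transaction is still pending here
conn.rollback()                  # clean up before the safer variant

# Safer pattern: "with conn" commits on success and rolls back on an exception,
# so the transaction cannot be left dangling.
try:
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

closed_after_error = not conn.in_transaction
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

On a server database such as PostgreSQL, a forgotten open transaction can also keep holding locks that block other sessions, which is why scoping transactions tightly matters more there than this single-connection demo can show.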

80 comments

Posted in r/MachineLearning by u/tibetbefree • 6/2/2025

163

[D] TMLR paper quality seems better than CVPR, ICLR.

Discussion

The majority sentiment suggests that TMLR papers generally exhibit better quality and correctness due to the journal's emphasis on value to the community over novelty or state-of-the-art results. This contrasts with conferences like CVPR and ICLR, which lack clear definitions of novelty and often focus on whether a paper achieves cutting-edge results, leading to subjectivity in the review process. Authors appreciated TMLR's longer review process and lower "sales pressure", which let them focus more on the science. However, some users pointed out that this doesn't diminish the value of big conferences, as major breakthroughs often land there, indicating a need to balance intrinsic and extrinsic motivations in research.

17 comments

Posted in r/SQL by u/ervisa_ • 6/2/2025

What I Wish I Knew About SQL When I Started as a DA

MySQL

The article on SQL best practices drew a mixed response. Many commented on the use of aliases, with some questioning the usefulness of aliases identical to the table name, while others defended them for code consistency. Some users disagreed with the use of SELECT * in CTEs. Trailing commas and SQL keywords used as aliases were also unpopular, and the handling of integer division was flagged as an issue. Critics described the article as polished but simplistic and not comprehensive, and noted that it did not specify the database engine. The claim that CTEs improve performance was contested. Nevertheless, some found the article enjoyable.

27 comments

Posted in r/SQL by u/intimate_sniffer69 • 6/6/2025

Can someone explain the magic of partition by to me and when to use it instead of group by?

Discussion

The discussion revolves around the use of SQL's 'partition by' versus 'group by' in pre-computing data for Power BI. A significant number of commenters supported the idea of pushing computation upstream into the data warehouse to keep Power BI data models lean and simple. They pointed out the efficiency of 'partition by' for computing subtotals and doing month-to-date, quarter-to-date, and year-to-date calculations. Some, however, criticized the complexity of the SQL code and suggested using 'group by' instead. The overall sentiment was mixed, with some favoring the 'partition by' approach for its efficiency, while others advocated for a simpler solution.
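The difference between the two clauses is easiest to see side by side. The sketch below is an illustration only, not code from the thread: it uses Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer) and a hypothetical `sales` table. `GROUP BY` collapses each region to one row, while `PARTITION BY` keeps every row and attaches the subtotal and a running total to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("EU", "2025-01", 100),
        ("EU", "2025-02", 150),
        ("US", "2025-01", 200),
        ("US", "2025-02", 50),
    ],
)

# GROUP BY: one output row per region, the detail rows are gone.
grouped = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# PARTITION BY: every detail row survives, with the regional subtotal and a
# running (to-date style) total computed alongside it.
windowed = conn.execute(
    """
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
    ORDER BY region, month
    """
).fetchall()

for row in windowed:
    print(row)
```

A reasonable rule of thumb drawn from this: reach for `GROUP BY` when only the aggregate is needed, and for `PARTITION BY` when the aggregate has to sit next to the original rows, as with subtotals or to-date calculations.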

48 comments
