data subTLDR week 23 year 2025
r/MachineLearning, r/dataengineering, r/SQL
Crashing Production with an Open Transaction, Debating SQL Practices for DAs, Unraveling the 'Partition By' Magic, European Databricks Event: A Letdown, Keeping Up with Industry Jargon
Posted in r/dataengineering by u/BadBouncyBear • 6/6/2025
837
I attended a databricks event in Europe
Meme
Overall sentiment towards the Databricks event in Europe was negative: the original poster rated it 3/10, largely because of unmet expectations around event perks. Commenters agreed, characterizing it as a typical, somewhat disappointing conference experience. There was also confusion and humor around the phrase "data bricked the fuck up", which sparked a discussion about its provocative, possibly inappropriate use in a professional setting. Some users found the phrase amusing, while others questioned its meaning, highlighting the potential for misunderstanding when using slang or jargon.
Posted in r/dataengineering by u/PossibilityRegular21 • 6/3/2025
586
When you miss one month of industry talk
Meme
The discussion reveals uneven familiarity with industry terms such as DuckDB, data lake, and related concepts. Some participants grasp these concepts well, while others express confusion. One insightful comment explains the different approaches by where each keeps its metadata: API-backed, SQL database, and inline in the object store, corresponding to Iceberg, DuckLake, and Delta Lake, respectively. There's also a mention of DuckLake as an open table format from DuckDB Labs that doesn't require the use of DuckDB itself. Overall, the conversation is a blend of awareness and puzzlement over industry jargon.
Posted in r/MachineLearning by u/jamesvoltage • 6/6/2025
212
[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability
Research
The research presents an alternative approach to LLM interpretability, demonstrating that LLMs can be converted into nearly equivalent linear systems without changing outputs or weights. This method allows nearly exact token attribution and works across various models. However, the detached Jacobian linear system is only valid for the specific input sequence it was computed on, and the computation can be slow and VRAM-intensive. Despite some skepticism about the claim of exact equivalence without mathematical proof, the research was generally well received, sparking discussions about its potential for steering, safety analysis, and a better understanding of transformer decoder functions. The sentiment was mostly positive, with commendations for the novel approach.
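To make "a linear system that is only valid for one specific input" concrete, here is a minimal toy sketch. It is not the paper's detached-Jacobian procedure for transformers; it assumes a tiny bias-free ReLU network with made-up weights (`W1`, `W2` are hypothetical), where freezing the ReLU gates at a given input yields a matrix that reproduces the network's output for that input only.

```python
# Toy illustration only: a bias-free two-layer ReLU network, not an LLM.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]  # hypothetical 3x2 weights
W2 = [[1.0, 2.0, -1.0], [0.0, 1.0, 3.0]]     # hypothetical 2x3 weights


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def forward(x):
    """Ordinary nonlinear forward pass: W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def local_linear_map(x):
    """Freeze the ReLU gates at input x and fold them into a single matrix."""
    gates = [1.0 if h > 0 else 0.0 for h in matvec(W1, x)]
    gated_w1 = [[g * w for w in row] for g, row in zip(gates, W1)]
    return matmul(W2, gated_w1)


x = [1.0, 2.0]
J = local_linear_map(x)
# J reproduces the network at x, because the gates were computed at x.
same_input_matches = matvec(J, x) == forward(x)

y = [-1.0, 0.0]
# Reusing J at a different input generally fails: y activates other gates.
other_input_matches = matvec(J, y) == forward(y)
```

Each column of `J` says how much one input coordinate contributes to each output, which is the toy analogue of per-token attribution; the matrix has to be recomputed for every new input.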
Posted in r/dataengineering by u/kerokero134340 • 6/5/2025
197
A disaster waiting to happen
Discussion
The sentiment in the thread is primarily negative towards the company's decision to replace existing pipelines with an AI-driven platform. Skepticism about the platform stems from concerns about data inconsistencies, lack of audit trails, and unrealistic expectations from management. While some suggest documenting concerns or trying to communicate the risks to management, the most upvoted comments recommend a more pragmatic approach: understand the outcomes management wants and offer alternative solutions. Many commenters also advise the original poster to prepare to leave the company if the situation doesn't improve. Several comments emphasize the importance of protecting oneself by documenting everything and understanding the potential risks and costs associated with the new AI tool.
Posted in r/SQL by u/frothymonk • 6/5/2025
189
I crashed production today by not closing a BEGIN; transaction block
PostgreSQL
The thread centers on a developer who accidentally crashed their production system by leaving a BEGIN transaction block open in a database query. The community views these mishaps as a rite of passage, offering supportive anecdotes and relatable experiences. There's a shared sentiment that mistakes like these are learning opportunities and can even signify a level of seniority in the field. The importance of backups is also highlighted, with one user sharing a story of data loss due to a mistaken query. A database administrator (DBA) suggests that automated scripts can prevent such issues, pointing to the value of proactive measures.
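The summary does not say exactly how the open transaction brought production down. One common mechanism is that an uncommitted transaction keeps holding locks while other sessions wait on them. The sketch below illustrates that idea using SQLite from Python's standard library as a stand-in for PostgreSQL. This is an assumption made for runnability: SQLite locks the whole database file, whereas PostgreSQL locks at row and table level. The `accounts` table is hypothetical.

```python
import os
import sqlite3
import tempfile


def demo_open_transaction():
    """Return (blocked, final_balance) for a writer that forgets to COMMIT."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None puts the connection in autocommit mode, so
    # transactions only start and end where BEGIN/COMMIT are written out.
    forgetful = sqlite3.connect(path, isolation_level=None)
    other = sqlite3.connect(path, isolation_level=None, timeout=0.1)

    forgetful.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)"
    )
    forgetful.execute("INSERT INTO accounts VALUES (1, 100)")

    # Session 1 opens a transaction, writes, and never commits.
    forgetful.execute("BEGIN IMMEDIATE")
    forgetful.execute("UPDATE accounts SET balance = 0 WHERE id = 1")

    # Session 2 now cannot write; it gives up after the 0.1 s timeout.
    try:
        other.execute("UPDATE accounts SET balance = 50 WHERE id = 1")
        blocked = False
    except sqlite3.OperationalError:
        blocked = True

    # Closing the transaction releases the lock and unblocks everyone else.
    forgetful.execute("COMMIT")
    other.execute("UPDATE accounts SET balance = 50 WHERE id = 1")
    final_balance = other.execute(
        "SELECT balance FROM accounts WHERE id = 1"
    ).fetchone()[0]

    forgetful.close()
    other.close()
    return blocked, final_balance
```

In a real service the waiting sessions usually have much longer timeouts than 0.1 seconds, so they pile up instead of failing fast, which is how one forgotten `COMMIT` or `ROLLBACK` can stall an application.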
Posted in r/MachineLearning by u/tibetbefree • 6/2/2025
163
[D] TMLR paper quality seems better than CVPR, ICLR.
Discussion
The majority sentiment is that TMLR papers generally exhibit better quality and correctness because the journal emphasizes value to the community over novelty or state-of-the-art results. This contrasts with conferences like CVPR and ICLR, which lack clear definitions of novelty and often focus on whether a paper achieves cutting-edge results, leading to subjectivity in the review process. Authors appreciated TMLR's longer review process and lower "sales pressure", which let them focus more on the science. However, some users pointed out that this doesn't diminish the value of big conferences, as major breakthroughs often land there, indicating a need to balance intrinsic and extrinsic motivations in research.
Posted in r/SQL by u/ervisa_ • 6/2/2025
83
What I Wish I Knew About SQL When I Started as a DA
MySQL
The article on SQL best practices drew a mixed response. Many commented on its use of aliases: some questioned the point of an alias identical to the table name, while others defended it for code consistency. Some users disagreed with using SELECT * in CTEs. Trailing commas and SQL keywords used as aliases were also unpopular. The issue of integer division was flagged. Critiques included that the article was polished but simplistic and not comprehensive, and that it did not specify the database engine. The claim that CTEs improve performance was contested. Nevertheless, some found the article enjoyable.
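The integer-division point is easy to reproduce. The sketch below uses SQLite through Python's standard library, with a hypothetical `orders` table. Behavior depends on the engine, which is why the missing engine detail matters: SQLite and PostgreSQL truncate when both operands are integers, while MySQL's `/` operator returns a decimal and uses `DIV` for integer division.

```python
import sqlite3


def division_results():
    """Return (truncated, exact) return rates for 3 returns out of 8 orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (total_orders INTEGER, returned_orders INTEGER)"
    )
    conn.execute("INSERT INTO orders VALUES (8, 3)")
    truncated, exact = conn.execute(
        """
        SELECT returned_orders / total_orders,        -- integer / integer
               returned_orders * 1.0 / total_orders   -- forced to floating point
        FROM orders
        """
    ).fetchone()
    conn.close()
    return truncated, exact
```

Multiplying by `1.0` (or casting one operand to a decimal or float type) before dividing is a portable way to get the fractional result.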
Posted in r/SQL by u/intimate_sniffer69 • 6/6/2025
57
Can someone explain the magic of partition by to me and when to use it instead of group by?
Discussion
The discussion revolves around the use of SQL's 'partition by' versus 'group by' in pre-computing data for Power BI. A significant number of commenters supported the idea of pushing computation upstream into the data warehouse to keep Power BI data models lean and simple. They pointed out the efficiency of 'partition by' for computing subtotals and doing month-to-date, quarter-to-date, and year-to-date calculations. Some, however, criticized the complexity of the SQL code and suggested using 'group by' instead. The overall sentiment was mixed, with some favoring the 'partition by' approach for its efficiency, while others advocated for a simpler solution.
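The core difference can be shown in a few lines: `GROUP BY` collapses each group into one row, while a window function with `PARTITION BY` keeps every row and attaches the aggregate alongside it. This is a minimal sketch with a hypothetical `sales` table, run on SQLite through Python's standard library (window functions need SQLite 3.25 or newer). It is not the query from the thread.

```python
import sqlite3

# Hypothetical sample data: (region, month, amount)
ROWS = [
    ("north", "2025-01", 100),
    ("north", "2025-02", 150),
    ("south", "2025-01", 80),
]


def group_vs_partition():
    """Return (grouped_rows, windowed_rows) for the same sales data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ROWS)

    # GROUP BY: one output row per region, row-level detail is gone.
    grouped = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()

    # PARTITION BY: every input row survives, with a subtotal and a
    # to-date running total computed within its region.
    windowed = conn.execute(
        """
        SELECT region, month, amount,
               SUM(amount) OVER (PARTITION BY region) AS region_total,
               SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
        FROM sales
        ORDER BY region, month
        """
    ).fetchall()
    conn.close()
    return grouped, windowed
```

The `running_total` column is the same pattern used for month-to-date, quarter-to-date, and year-to-date figures: partition by the period, order by date, and sum.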