data subTLDR week 19 year 2025
r/MachineLearning · r/dataengineering · r/SQL
Unveiling the SQL Debate: Acceptance Despite Flaws, Database Transaction Nightmares, Time-Travelling DWH Developers, Azure Synapse's 'Nightmare' Scenario, and the AI in Art Conundrum
Week 19, 2025
Posted in r/SQL by u/AFRIKANIZ3D • 5/11/2025
875
When it finally executes (my first data meme)
MySQL
The discussion centers on Common Table Expressions (CTEs) in SQL. The community suggested moving complex CTEs into temporary tables to speed up execution, and several participants noted that adding an index to a temp table can provide a further performance boost. Others were skeptical that temporary tables beat CTEs unless data persistence is actually needed. Comments also touched on the difficulty of maintaining large, complex legacy SQL code and the need for optimization and rebuilding. The overall tone was constructive and informative.
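The thread's core suggestion is easy to sketch. The Python/sqlite3 example below (table name, data, and the spending threshold are all invented for illustration) materializes what would otherwise be a CTE into an indexed temp table and confirms both forms return the same result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
cur.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                [(i % 100, float(i)) for i in range(1000)])

# CTE version: the expression is inlined into the query plan wherever it is referenced.
cte_sql = """
WITH big_spenders AS (
    SELECT customer_id, SUM(total) AS spent
    FROM orders GROUP BY customer_id
)
SELECT COUNT(*) FROM big_spenders WHERE spent > 4950
"""
cte_count = cur.execute(cte_sql).fetchone()[0]

# Temp-table version: materialize the intermediate result once,
# then index the column used for filtering or joining.
cur.execute("""CREATE TEMP TABLE big_spenders AS
               SELECT customer_id, SUM(total) AS spent
               FROM orders GROUP BY customer_id""")
cur.execute("CREATE INDEX idx_spent ON big_spenders (spent)")
tmp_count = cur.execute("SELECT COUNT(*) FROM big_spenders WHERE spent > 4950").fetchone()[0]

print(cte_count, tmp_count)  # both queries agree
```

The payoff the thread describes shows up when the materialized result is referenced several times or joined on, so the expensive aggregation runs once instead of per reference.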
Posted in r/dataengineering by u/wtfzambo • 5/5/2025
765
I f***ing hate Azure
Discussion
Sentiment toward Azure Synapse among Reddit users is broadly negative, with many expressing frustration at the platform's features and reliability. Users are particularly critical of Spark being the sole available runtime, of notebooks in production, and of the claim that data engineers are unnecessary. Some share experiences of lost data and unexpectedly high charges. While a few suggest workarounds, such as using definition files instead of notebooks, the overall consensus is that a poorly set-up Synapse creates a nightmare scenario. The comments also reflect broader dissatisfaction with the challenges and complexity of the data engineering field.
Posted in r/MachineLearning by u/Arqqady • 5/11/2025
536
[D] POV: You get this question in your interview. What do you do?
Discussion
The thread discusses a tricky interview question: calculating hardware utilization for a 37B-parameter transformer model. The majority sentiment is critical, with users noting that the question doesn't reflect real-world considerations and offers limited insight into a candidate's likely performance in the role. Some participants worked through detailed calculations, arriving at a rough consensus of roughly 21.6–21.7%. A few users also highlighted factors the question seemingly ignores, such as communication overhead, architecture, and interconnect speeds. Overall, the thread favors interview questions that better assess practical skills and knowledge.
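For context, the usual back-of-envelope approach to such questions is model FLOPs utilization (MFU): achieved FLOP/s divided by the hardware's peak FLOP/s, with roughly 6N training FLOPs per token for an N-parameter transformer. The throughput and peak figures below are assumptions for illustration, not the interview's actual numbers:

```python
# MFU sketch: fraction of peak hardware FLOP/s actually spent on the
# model's forward + backward pass. All concrete numbers below are
# illustrative assumptions, not the interview's figures.

n_params = 37e9            # 37B-parameter transformer (from the post title)
tokens_per_sec = 1_000     # assumed measured training throughput per GPU
peak_flops = 1.0e15        # assumed GPU peak, 1 PFLOP/s

flops_per_token = 6 * n_params          # ~6N: forward (~2N) + backward (~4N)
achieved = flops_per_token * tokens_per_sec
mfu = achieved / peak_flops
print(f"MFU ≈ {mfu:.1%}")
```

As the commenters note, this ignores communication overhead, attention FLOPs, and interconnect limits, which is precisely why the real-world answer is messier than the formula.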
Posted in r/dataengineering by u/SureResort6444 • 5/6/2025
439
Fiverr, Duolingo, Shopify etc..
Meme
The discussion revolves around generative AI in fields such as art and financial reporting, with mixed sentiment. Many are skeptical, fearing that full automation invites problems in areas that demand accuracy and accountability. Others highlight potential benefits in contexts where there is room for error or where fact-checking against hallucinations is feasible. There is also debate about AI's role in art: critics argue that AI-produced art lacks a human touch, suffering, or story, while others point out that many people cannot tell AI art from human art, questioning the practical value of the distinction.
Posted in r/MachineLearning by u/turhancan97 • 5/11/2025
424
[D] What Yann LeCun means here?
Discussion
The discussion revolves around Yann LeCun's claim that the sensory data a human child takes in over 4 years is roughly equal in volume to 30 minutes of YouTube uploads. Participants agree that LeCun is highlighting how differently humans and AI models acquire data: humans process far less data yet show superior spatial and visual intelligence. Top comments discuss the human eye's data compression and the sheer volume of video uploaded to YouTube. Several commenters suggest that LeCun may be counting all sensory information in his comparison. Some raise questions about biological efficiency and possible genetic encoding of world models, inviting counterarguments. The thread mixes fascination with concern about the implications for AI development.
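A rough version of the arithmetic behind the claim, using figures commonly attributed to LeCun's argument (both inputs are assumptions for illustration, not quotes from the thread):

```python
# Back-of-envelope estimate of a child's visual data intake by age 4.
# Both figures are rough assumptions in the spirit of LeCun's argument.

optic_nerve_bytes_per_sec = 2e7      # assumed ~20 MB/s across the optic nerves
waking_hours_by_age_4 = 16_000       # assumed ~16k waking hours in 4 years

child_bytes = optic_nerve_bytes_per_sec * waking_hours_by_age_4 * 3600
print(f"Visual data by age 4: ~{child_bytes:.1e} bytes")  # on the order of a petabyte
```

The comparison then rests on YouTube receiving a similar volume of upload data in roughly half an hour, which is where the "30 minutes" figure comes from.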
Posted in r/SQL by u/getflashboard • 5/5/2025
164
Uncle Bob Martin: "SQL was never intended to be used by computer programs. It was a console language for printing reports. Embedding it into programs was one of the gravest errors of our industry."
Discussion
The discussion of Uncle Bob Martin's statement that embedding SQL into programs was a grave error evoked mixed reactions. The most popular sentiment accepts SQL's usage as a necessary evolution, with users acknowledging how rare it is for a single language to standardize the data field for decades. Some questioned the viability of alternatives, such as direct database function calls. Others argued that issues like SQL injection attacks reflect carelessness rather than shortcomings of the language. A few participants pointed out the trade-off that making SQL more machine-friendly could reduce its human-friendliness. Overall, the sentiment veers toward accepting SQL despite its perceived flaws.
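On the injection point, the standard fix is parameterization rather than abandoning embedded SQL. A minimal Python/sqlite3 contrast (table and input are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

malicious = "x' OR '1'='1"

# Careless: interpolating user input into the SQL text lets the input
# rewrite the query -- the classic injection.
unsafe = f"SELECT COUNT(*) FROM users WHERE name = '{malicious}'"
unsafe_count = conn.execute(unsafe).fetchone()[0]
print(unsafe_count)  # the OR clause matches every row

# Careful: a placeholder treats the input strictly as a value, not as SQL.
safe_count = conn.execute(
    "SELECT COUNT(*) FROM users WHERE name = ?", (malicious,)).fetchone()[0]
print(safe_count)  # no user actually has that literal name
```

This is the commenters' point in miniature: the vulnerability lives in string-building habits, not in SQL itself.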
Posted in r/dataengineering by u/itty-bitty-birdy-tb • 5/8/2025
143
We benchmarked 19 popular LLMs on SQL generation with a 200M row dataset
Open Source
In a study comparing 19 large language models' (LLMs) SQL generation on a 200M-row dataset, Claude 3.7 stood out for accuracy, while GPT models were praised for their all-around performance. However, many noted that LLM-generated queries typically read more data than a human-written query would, leading to inefficiencies. Multiple comments pointed out the limited usefulness of LLMs for complex data modeling, as they struggle to generate expansive queries and handle multi-table joins. The potential benefits of a human-in-the-loop system were also discussed. Overall, despite some promising results, there is consensus that LLM SQL generation needs further optimization and enhancement.
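The "reads more data than necessary" complaint usually shows up in the query plan. A small Python/sqlite3 sketch (schema invented for illustration) of how a predicate's shape decides between a full table scan and an index search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts TEXT)")
conn.execute("CREATE INDEX idx_user ON events (user_id)")

# Wrapping the indexed column in an expression defeats the index,
# forcing a scan of every row...
plan_scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id + 0 = 42").fetchall()

# ...while the straightforward predicate lets SQLite use the index.
plan_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42").fetchall()

print(plan_scan[0][-1])   # e.g. "SCAN events"
print(plan_index[0][-1])  # e.g. "SEARCH events USING INDEX idx_user (user_id=?)"
```

Subtle shape differences like this are exactly what generated SQL tends to get wrong, which is why it can be correct yet still touch far more data than a hand-written query.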
Posted in r/MachineLearning by u/RedRhizophora • 5/5/2025
138
[D] Fourier features in Neural Networks?
Discussion
The use of Fourier features in neural networks is debated in the deep learning community. Some argue the methods aren't gaining popularity due to computational inefficiency or a lack of empirical benefit. Others point to their widespread use in Graph Neural Networks (GNNs) and as filters in the frequency domain. Critics suggest Fourier analysis matters most when texture is the dominant feature of an image, and less for problems like visual understanding or object recognition. Proponents counter that localized Fourier transforms are useful in image processing and for achieving good perceptual performance. The overall sentiment is mixed.
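One concrete form the thread alludes to is the random Fourier feature mapping popularized by Tancik et al. (2020), which lifts low-dimensional coordinates into sinusoids so an MLP can fit high-frequency detail. A minimal NumPy sketch (the dimensions and frequency scale are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n_features, scale = 2, 64, 10.0   # assumed sizes for illustration
B = rng.normal(0.0, scale, size=(n_features, d_in))  # fixed random frequencies

def fourier_features(v):
    """Map (..., d_in) coordinates to (..., 2*n_features) sinusoidal features:
    gamma(v) = [cos(2*pi*B v), sin(2*pi*B v)]."""
    proj = 2 * np.pi * v @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

coords = rng.uniform(0, 1, size=(5, d_in))  # e.g. pixel coordinates in [0, 1]^2
feats = fourier_features(coords)
print(feats.shape)  # each 2-D coordinate becomes a 128-D feature vector
```

The `scale` of the random frequency matrix `B` controls the bandwidth of what the downstream network can represent, which is the knob the perceptual-quality argument in the thread turns on.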
Posted in r/MachineLearning by u/KoOBaALT • 5/8/2025
125
[D] Why is RL in the real-world so hard?
Discussion
The challenges of applying reinforcement learning (RL) to real-world problems like energy systems and supply chain optimization are primarily due to limited exploration of action space, limited data, and data noise. Most comments suggest that the lack of data and inability to further explore can only be solved by acquiring more data or utilizing domain knowledge. Some propose the creation of simulators for efficient training, but acknowledge the difficulty of perfecting these. Hybrid methods of offline-online learning and off-policy RL were also mentioned as potential solutions. The overall sentiment leans towards the necessity of more data and the exploration of alternative methods beyond RL.
Posted in r/MachineLearning by u/we_are_mammals • 5/7/2025
109
Absolute Zero: Reinforced Self-play Reasoning with Zero Data [R]
Research
The Absolute Zero paper on reinforced self-play reasoning with zero external data sparked a lively discussion. Many participants found the concept intriguing, though some were unsettled by a phrase in the paper about outsmarting humans and intelligent machines to become the 'brains behind the future'. Several users pointed out that the approach isn't as 'zero data' as advertised, since it starts from a pre-trained base model. Critics argued that the method favors larger companies with more resources, and that while technically compelling, it may introduce alignment risks and amplify bias. The overall sentiment was mixed.
Posted in r/SQL by u/Adela_freedom • 5/9/2025
84
Sleep? Not when there's an uncommitted transaction haunting you. 😴 👻
Discussion
In a spirited discussion about managing database transactions, the consensus is that leaving transactions uncommitted can cause serious problems, especially in production. Top contributors recommend setting clients to auto-commit to avoid forgotten transactions and accidents like dropped tables, and suggest that database administrators monitor for long-open transactions, with some advocating killing such transactions to maintain system integrity. The importance of committing and pushing changes in Git was also highlighted, since forgetting that step can mean lost work and project delays. The overall sentiment was cautionary, urging meticulousness in managing transactions.
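The core hazard is easy to reproduce: a second session simply cannot see uncommitted work, and in stricter databases it may block on it. A minimal Python/sqlite3 sketch (file path and table are invented), with the writer managing its own explicit transaction:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None puts the writer in autocommit mode, so the
# BEGIN/COMMIT below are fully explicit -- and forgettable.
writer = sqlite3.connect(path, isolation_level=None)
reader = sqlite3.connect(path)

writer.execute("CREATE TABLE jobs (id INTEGER)")
writer.execute("BEGIN")
writer.execute("INSERT INTO jobs VALUES (1)")

# The transaction is open but uncommitted: another session sees nothing.
before = reader.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(before)  # the insert is invisible to the reader

writer.execute("COMMIT")
after = reader.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(after)   # now the row is visible
```

Walk away between the `BEGIN` and the `COMMIT` and you have the thread's haunting in miniature, which is why auto-commit defaults and monitoring for long-open transactions were the top recommendations.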
Posted in r/SQL by u/Salt_Anteater3307 • 5/5/2025
60
Started as a DWH Dev in a Massive Company. Feels Like I've Time-Traveled to 2005
Oracle
The user who recently started as a DWH developer at a large enterprise describes feeling overwhelmed by the outdated processes and systems, comparing it to time-traveling to 2005. Despite the company's plans to migrate to Azure, there's a lack of a specific migration plan and numerous delays. Fellow Redditors suggest embracing the challenge, viewing it as an opportunity to become a valued expert by learning and documenting the system. Some argue that the problem lies not in the platform, but in the lack of clear documentation and processes. Others express that such situations are common in large enterprises, and overcoming them requires time and patience. Overall sentiment is a mix of sympathy and constructive advice.