data subTLDR week 27 year 2025

r/MachineLearning, r/dataengineering, r/SQL

Optimizing SQL Queries with 'EXPLAIN ANALYZE', Debating AI's Guessing Game, Navigating SQL Interview Questions, Data Cleaning Woes, and the 2025 Open Source Tech Stack

July 6, 2025 • Week 27, 2025

Posted in r/SQL by u/nikkiinit • 7/3/2025

888

I wrote one SQL query. It ran for 4 hours. I added a single index. It ran in 0.002 seconds.

PostgreSQL

The discussion highlights the value of running 'EXPLAIN ANALYZE' to diagnose and optimize slow queries. The consensus is that adding an index often speeds up a query but is not a guaranteed fix. Understanding the execution plan is crucial, and sometimes the real solution is rewriting a poorly written query or re-architecting the process and logic. Patience also came up, with some participants recalling queries they let run for hours or even days. Resources for further learning, such as 'use-the-index-luke.com', were recommended. The overall sentiment is instructive and positive.
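As a rough illustration of reading a plan before and after adding an index, here is a minimal sketch. It makes several assumptions: it uses Python's built-in sqlite3 module and SQLite's 'EXPLAIN QUERY PLAN' as a stand-in for PostgreSQL's 'EXPLAIN ANALYZE' (the two report different details, and SQLite's version does not execute the query), and the `orders` table, its columns, and the index name are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption, for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = 42"


def plan(connection, sql):
    """Return the human-readable plan steps SQLite reports for a statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the detail text


# Without an index the plan should describe a scan of the whole table.
plan_before = plan(conn, QUERY)

# Hypothetical index on the filtered column.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index the plan should describe a search that uses it.
plan_after = plan(conn, QUERY)

print(plan_before)
print(plan_after)
```

In PostgreSQL the equivalent check would be prefixing the query with 'EXPLAIN ANALYZE' and comparing the reported plan nodes and timings. As the thread notes, the plan is what tells you whether an index is actually being used.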

120 comments


Posted in r/dataengineering by u/victorviro • 7/5/2025

517

When data cleaning turns into a full-time chase

Meme

The discussion revolves around the challenges and frustrations of data cleaning, which many consider a time-consuming but essential task. Respondents widely agree that clean data is necessary for reliable analysis, and they emphasize the need for better tools and methodologies to streamline the process. The sentiment is largely negative because the work is tedious, although its role in producing accurate results is recognized. Despite the grumbles, there's a shared understanding that data cleaning is a fundamental part of data analysis that can't be overlooked.

31 comments


Posted in r/dataengineering by u/DataCraftsman • 7/4/2025

464

2025 Open Source Tech Stack

Open Source

Several technical professionals engaged in a lively discussion about recommended open-source tools for various use cases. A majority of users, based on upvote count, advocated for FastAPI over Flask due to its better support for async work and built-in validation and documentation features. There were differing opinions on Docker's open-source status, with some suggesting a switch to Podman. However, others clarified that Docker itself is open source, while only some related software like Docker Desktop is proprietary. Some participants criticized the recommendations, questioning the inclusion of certain tools in the machine learning stack and expressing confusion over the map. The general sentiment was mixed, with useful insights interspersed with criticism and suggestions for improvement.

71 comments


Posted in r/MachineLearning by u/moji-mf-joji • 7/6/2025

201

[D] Remembering Felix Hill and the pressure of doing AI research

Discussion

The AI research community is reflecting on the pressures of academia, amplified by a detailed account of personal experiences in graduate school for Natural Language Processing (NLP). The post sparked a vulnerable conversation, with many users resonating with the struggle and expressing appreciation for the shared story. Meanwhile, some highlighted the potential of AI in mental health, urging that its implementation must be undertaken ethically, prioritizing safety and augmenting, not replacing, human care. The loss of Felix Hill, a figure in the field, was mourned, with users recognizing the importance of addressing mental health issues in academia. Overall, the sentiment was mixed but leaned towards supportive and empathetic.

22 comments


Posted in r/SQL by u/Analyst2163 • 7/2/2025

147

AI is basically guessing, and doesn't really know the answer

Discussion

AI tools, such as Claude, while praised for their eloquence and speed, are seen as fundamentally flawed due to their lack of comprehension and accuracy. They are perceived as relying heavily on sounding correct rather than being correct, which can lead to miscommunication and misinformation. The community suggests that AI should not be blindly trusted, especially with detailed and specific queries, and its responses need verification. Despite this, some users believe there's potential for AI in tasks like structuring long code or comments. However, the AI's tendency to change the code itself is seen as problematic. The sentiment is predominantly critical.

79 comments


Posted in r/MachineLearning by u/guohealth • 7/3/2025

132

[D] AI/ML interviews being more like SWE interviews

Discussion

AI/ML/DS job interviews are increasingly resembling software engineering (SWE) interviews, focusing more on data structures, algorithms, and coding skills. This reflects a shift in the industry towards seeing AI as an integral part of software, rather than a separate field for creative research. The primary role of AI, ML, and DS professionals is now the execution and integration of learning models into existing systems. However, some roles, particularly in leading industrial research labs, still involve significant research and deep theoretical knowledge. This trend has been met with mixed responses, with some expressing concern over the potential dilution of specialized AI/ML roles.

40 comments


Posted in r/MachineLearning by u/Nice-Comfortable-650 • 7/6/2025

127

[P] We built this project to increase LLM throughput by 3x. Now it has been adopted by IBM in their LLM serving stack!

Project

The LMCache project, an open-source initiative aimed at reducing repetitive computations in LLM inference systems, has been adopted by IBM. According to the post, it triples throughput in chat applications by efficiently offloading and loading large KV cache data to and from DRAM and disk. This matters most in multi-round QA settings, where context reuse is important but GPU memory is limited. Some users questioned whether the project could affect task performance or accuracy, but the consensus is that it primarily saves time without changing the inference results. Commenters also asked how efficient this caching framework is compared to others, and what performance penalty comes with offloading to RAM or disk.
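The general idea of spilling cached data from a small fast tier to a larger slow tier, then reloading it on reuse, can be sketched as a toy. This is not LMCache's API or implementation: the `TieredCache` class, the `prefix_key` helper, the pickle-on-disk format, and the placeholder values are all hypothetical, and the in-memory dictionary only stands in for the limited fast memory described above.

```python
from collections import OrderedDict
import hashlib
import os
import pickle
import tempfile


def prefix_key(text):
    """Derive a filesystem-safe cache key from a conversation prefix."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TieredCache:
    """Toy two-tier cache: a small in-memory LRU tier that spills to disk."""

    def __init__(self, memory_slots, spill_dir):
        self.memory_slots = memory_slots
        self.spill_dir = spill_dir
        self.memory = OrderedDict()  # key -> value, most recently used last

    def _path(self, key):
        return os.path.join(self.spill_dir, key + ".pkl")

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        # Offload least recently used entries once the fast tier is full.
        while len(self.memory) > self.memory_slots:
            old_key, old_value = self.memory.popitem(last=False)
            with open(self._path(old_key), "wb") as fh:
                pickle.dump(old_value, fh)

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as fh:
                value = pickle.load(fh)
            os.remove(path)
            self.put(key, value)  # promote back into the fast tier
            return value
        return None  # miss: the caller would have to recompute


spill_dir = tempfile.mkdtemp()
cache = TieredCache(memory_slots=2, spill_dir=spill_dir)
for name in ("chat-a", "chat-b", "chat-c"):
    cache.put(prefix_key(name), [name, "kv-placeholder"])

on_disk = os.listdir(spill_dir)              # the oldest entry was offloaded
reloaded = cache.get(prefix_key("chat-a"))   # loaded back from disk
missing = cache.get(prefix_key("chat-z"))    # never cached
```

A hit in either tier avoids recomputation at the cost of a load, which is the trade-off commenters raised when asking about the penalty of offloading to RAM or disk.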

5 comments


Posted in r/SQL by u/Various_Candidate325 • 7/1/2025

Got this SQL interview question and how you'd answer it

Discussion

The Reddit community shared diverse perspectives on the SQL interview question about investigating a 0% conversion rate for a product. The top-rated comment emphasized the importance of not immediately assuming the issue lies in the SQL. Other top comments suggested first verifying whether the 0% rate is accurate based on the available data and understanding the context of 'conversion' in the scenario. A step-by-step, methodical approach was also suggested, including tracing the issue backwards, checking for anomalies in the data, and asking clarifying questions to better understand the issue. The discussion was largely constructive, with users debating whether the question tests SQL logic or communication structure.
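One way to act on the "verify the 0% first" advice is to inspect the numerator and denominator separately, then look at the raw values behind them. The sketch below is illustrative only: the `events` table, its columns, the product id, and the mis-capitalized label are invented for the example, and it uses Python's built-in sqlite3 module rather than whatever database the interview scenario assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, product_id INTEGER, event_type TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, 7, "view"),
        (2, 7, "view"),
        (3, 7, "view"),
        (1, 7, "Purchase"),  # deliberately inconsistent capitalization
    ],
)


def count_events(connection, product_id, event_type):
    """Count rows for one product and one exact event label."""
    row = connection.execute(
        "SELECT COUNT(*) FROM events WHERE product_id = ? AND event_type = ?",
        (product_id, event_type),
    ).fetchone()
    return row[0]


# Step 1: look at the two counts behind the rate, not just the rate.
views = count_events(conn, 7, "view")
purchases = count_events(conn, 7, "purchase")
reported_rate = purchases / views if views else None

# Step 2: trace backwards and list the labels that actually exist.
labels = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT event_type FROM events WHERE product_id = ? "
        "ORDER BY event_type",
        (7,),
    )
]

print(views, purchases, reported_rate, labels)
```

Here the rate comes out as zero because the filter does not match the label stored in the data, not because nobody bought the product. That is the kind of anomaly the commenters suggest ruling out before blaming the SQL or the product.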

54 comments


Posted in r/SQL by u/Independent-Sky-8469 • 7/6/2025

In terms of SQL projects

Discussion

The consensus among experienced SQL users is that while the basics may seem simple, real-world application of SQL can be complex and challenging. Learning SQL in depth truly begins when dealing with actual stakeholders, business requirements, and deadlines. Dealing with real-world constraints, integrating with existing data or code, and optimizing and debugging are common in the field. Users also highlight the importance of understanding datasets, as they often contain irregularities. Furthermore, working in large corporations with complex business needs provides a unique challenge that can make SQL seem less simple. There's also a need to interpret stakeholder requests accurately, a skill that requires practice and cannot be acquired through pet projects or online courses alone.

22 comments
