data subTLDR week 34 year 2025

r/MachineLearning, r/dataengineering, r/SQL

Visualizing SQL Queries: A Game Changer or Just a Gimmick?, Serverless and OLAP: Hype or Efficiency Boosters?, Right Joins in SQL: Unnecessary or Underappreciated?, The Chaos of Inherited Data Infrastructures, Celebrating Job Offers Amid Market Fluctuations and AI Growth

August 24, 2025 • Week 34, 2025

Posted in r/dataengineering by u/UnusualRuin7916 • 8/21/2025

3422

My friend just inherited a data infrastructure built by a guy who left 3 months ago… and it’s pure chaos

Meme

A common problem in data infrastructure is missing documentation and version control, which produces chaotic systems that are difficult to manage. The problem is worst when the system's architect leaves the company and successors must reverse-engineer complex, undocumented processes. Many professionals complain about having to manage everything themselves with insufficient support or compensation, which feeds a cycle of turnover and system breakdown. Commenters also describe non-technical managers undervaluing data roles until they struggle to obtain crucial data or reports. Some individuals, however, enjoy untangling and streamlining these messes, as long as they are given time to do so. The general sentiment is a mix of frustration and resignation, with a hint of humor.

218 comments

Posted in r/SQL by u/Adela_freedom • 8/22/2025

341

Different databases, different hurdles 🏁😉

Discussion

Despite the hype around serverless and OLAP (Online Analytical Processing) databases, the discussion reveals mixed feelings. Some users note that serverless can potentially match traditional DBMS speeds, while others argue it is mostly marketing and does not inherently improve efficiency. One point raised was that OLAP and OLTP (Online Transaction Processing) describe workloads rather than database types. Commenters also recognized that OLAP does not require a paid cloud environment and can be set up independently. The perceived cost of OLAP was mentioned as well, which runs against the common belief that serverless saves money. Overall sentiment toward serverless and OLAP appears mixed to negative.
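
The idea that OLAP and OLTP describe what is asked of a database, rather than what kind of database it is, can be sketched with a single local engine. The example below is only an illustration: the `orders` table, its columns, and its values are made up, and SQLite is used as a convenient stand-in, not as a recommendation from the thread.

```python
import sqlite3

# Hypothetical table and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, region, amount) VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 25.0), (3, "east", 5.0), (4, "west", 15.0)],
)

# OLTP-style request: fetch one row by its key.
oltp_row = conn.execute(
    "SELECT region, amount FROM orders WHERE id = ?", (3,)
).fetchone()

# OLAP-style request: scan the table and aggregate across many rows.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(oltp_row)
print(olap_rows)
```

Both requests run against the same local, free, embedded database. Engines tuned for one workload will usually handle it better at scale, which this toy example does not attempt to measure.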

17 comments

Posted in r/dataengineering by u/Optimal-Finish8744 • 8/19/2025

339

Finally Got a Job Offer

Career

Securing a job offer after numerous applications and rejections was applauded, and insights on the application process were shared. Recommended tools included GPT for CV creation and the Job Analytics Chrome extension for keyword optimization. Persistence and speed were emphasized: keep applying until an offer arrives, and target jobs posted within the past week to improve the odds. Technical proficiency in SQL, Python, PySpark, Databricks, and dbt was highlighted, with LinkedIn suggested as a platform. Despite the experiences of rejection, the sentiment was mostly positive, acknowledging job market fluctuations and the growth of AI.

80 comments

Posted in r/SQL by u/Various_Candidate325 • 8/24/2025

240

Writing beautiful CTEs that nobody will ever appreciate is my love language

Discussion

Some users describe the satisfaction of crafting well-structured, efficient SQL, even though the effort often goes unnoticed by others. Others argue against Common Table Expressions (CTEs), especially on large datasets, and propose temp tables as a more efficient alternative that allows incremental testing and reduces server load. A third group holds that the problem lies not with CTEs but with their misuse, and argues for a balanced approach using all available tools. A debate over capitalization in code also emerges: some favor it for tradition and clarity, while others see it as outdated now that modern IDEs color-code syntax. Overall sentiment is mixed, reflecting diverse experiences and perspectives.
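
The two approaches being compared can be sketched side by side. This is a minimal illustration, assuming a made-up `sales` table and using SQLite through Python's `sqlite3` module. It shows the structural difference only and makes no performance claim, since which form is faster depends on the engine and the data.

```python
import sqlite3

# Hypothetical table and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("ann", 120.0), ("bob", 30.0), ("ann", 80.0), ("cat", 200.0)],
)

# Version 1: a CTE names the intermediate result inside one statement.
cte_rows = conn.execute(
    """
    WITH totals AS (
        SELECT customer, SUM(amount) AS total
        FROM sales
        GROUP BY customer
    )
    SELECT customer, total FROM totals WHERE total >= 100 ORDER BY customer
    """
).fetchall()

# Version 2: a temp table stores the intermediate result as its own step,
# so it can be inspected before the next query is written.
conn.execute(
    """
    CREATE TEMP TABLE totals_tmp AS
    SELECT customer, SUM(amount) AS total
    FROM sales
    GROUP BY customer
    """
)
intermediate_count = conn.execute("SELECT COUNT(*) FROM totals_tmp").fetchone()[0]
temp_rows = conn.execute(
    "SELECT customer, total FROM totals_tmp WHERE total >= 100 ORDER BY customer"
).fetchall()

print(cte_rows)
print(intermediate_count, temp_rows)
```

Both versions return the same rows. The temp table version trades an extra statement for a checkpoint that can be queried on its own, which is the incremental testing benefit commenters mention.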

83 comments

Posted in r/dataengineering by u/EdgeCautious7312 • 8/18/2025

231

Thing that destroys your reputation as a data engineer

Discussion

The discussion covers mistakes that can tarnish a data engineer's reputation. Many comments stress handling data correctly, the main example being ZIP codes stored as numbers, which strips leading zeros and can cause significant disruption. Commenters also highlight the risk of storing sensitive data insecurely, such as unhashed Social Security Numbers. Professional conduct matters too, particularly in interactions with colleagues and clients. Some comments suggest ways to correct erroneous data, underscoring the need for problem-solving skills in this profession. The sentiment is largely cautionary, emphasizing meticulousness and professionalism in data engineering.
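
The ZIP code pitfall is easy to reproduce. The sketch below uses made-up sample rows and the standard `csv` module. The zero-padding repair at the end is an assumption for illustration, valid only when every value is known to be a five-digit US ZIP code, and is not necessarily the fix commenters proposed.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
raw = "name,zip\nBoston office,02134\nSeattle office,98101\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Pitfall: casting to an integer silently drops the leading zero.
as_number = [int(r["zip"]) for r in rows]

# Safer: keep the value as text, since a ZIP code is an identifier, not a quantity.
as_text = [r["zip"] for r in rows]

# Possible repair for already-damaged data (assumes five-digit US ZIP codes).
repaired = [str(n).zfill(5) for n in as_number]

print(as_number)
print(as_text)
print(repaired)
```

Keeping the column as text from the start avoids the need for any repair step.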

165 comments

Posted in r/MachineLearning by u/AnyIce3007 • 8/18/2025

195

[D] Conferences need to find better venues

Discussion

The scientific community is increasingly concerned that visa issues restrict international researchers' access to U.S. conferences. This has sparked a push toward broader geographic distribution of conference venues, with Singapore, Canada, and Brazil emerging as successful alternatives. Some suggest dual-location conferences, although others worry this could dilute their value. Visa denials and the fear of being unable to reenter the U.S. affect not only researchers outside the U.S. but also non-U.S. PhD students currently based in the country. Overall, the sentiment reflects a need for a more inclusive, globally considerate approach to conference planning.

50 comments

Posted in r/SQL by u/Garvinjist • 8/23/2025

What is the point of a right join?

MySQL

The consensus among commenters is that right joins in SQL have their uses, despite seeming unnecessary at first glance. They are particularly useful in complex queries where rewriting the join order would hurt the readability and logic of the code. A right join also helps when rows on one side have no match: it keeps every row of the right-hand table and fills the unmatched left-hand columns with NULL, so that data is not lost. However, many agreed that the choice between left and right joins often comes down to personal preference and the specific context of the analysis task. Overall, sentiment was mixed but leaned positive.
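
The equivalence behind the "personal preference" point can be sketched as follows. The tables and rows are made up, and SQLite stands in for MySQL here. Because SQLite only gained `RIGHT JOIN` in version 3.39.0, the sketch runs that query only when the installed library supports it.

```python
import sqlite3

# Hypothetical tables and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 10);
    INSERT INTO customers VALUES (10, 'ann'), (20, 'bob');
    """
)

# LEFT JOIN: customers is written first and is the preserved table.
left_rows = conn.execute(
    """
    SELECT c.name, o.id
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.id
    """
).fetchall()

# RIGHT JOIN: orders is written first, but customers is still preserved.
right_rows = None
if sqlite3.sqlite_version_info >= (3, 39, 0):
    right_rows = conn.execute(
        """
        SELECT c.name, o.id
        FROM orders AS o
        RIGHT JOIN customers AS c ON o.customer_id = c.id
        ORDER BY c.name, o.id
        """
    ).fetchall()

print(left_rows)
print(right_rows)
```

In both forms the customer with no orders is kept, with NULL (Python `None`) in the order column. The only difference is which table is written first.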

79 comments

Posted in r/MachineLearning by u/bjjonin • 8/21/2025

[P] Language Diffusion in <80 Lines of Code

Project

The poster built a language diffusion model in fewer than 80 lines of code using Hugging Face's Transformers, fine-tuning DistilBERT on the TinyStories dataset. The work received mixed reactions: some criticized the reliance on libraries and the quality of the output, while others defended the usefulness of concise examples built on existing libraries. A few users expressed interest in implementing smaller language diffusion models and asked about datasets suited to more practical applications. Questions were also raised about using metric-based unmasking or remasking techniques during inference. In general, the mood was supportive, with suggestions for improvement.
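
For readers unfamiliar with the unmasking question, the sketch below shows one way a confidence-based unmasking loop could look. It is not the poster's code and does not use Transformers: `toy_predict` is a hypothetical stand-in for a masked language model, with hard-coded tokens and confidence scores, and the choice of confidence as the metric is an assumption.

```python
MASK = "[MASK]"


def toy_predict(tokens):
    """Hypothetical stand-in for a masked language model.

    Returns {position: (token, confidence)} for each masked position.
    The tokens and scores are hard-coded for a sequence of length 4.
    """
    target = ["the", "cat", "sat", "down"]
    confidence = [0.9, 0.4, 0.7, 0.2]
    return {
        i: (target[i], confidence[i])
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def unmask(length, per_step):
    """Start fully masked and reveal the most confident positions each step."""
    tokens = [MASK] * length
    history = [list(tokens)]
    while MASK in tokens:
        predictions = toy_predict(tokens)
        # Rank masked positions by confidence, highest first, and keep the top few.
        ranked = sorted(predictions, key=lambda i: predictions[i][1], reverse=True)
        for i in ranked[:per_step]:
            tokens[i] = predictions[i][0]
        history.append(list(tokens))
    return history


history = unmask(length=4, per_step=2)
for step in history:
    print(" ".join(step))
```

A real model would recompute its predictions from the partially revealed sequence at every step. A remasking variant would additionally re-hide low-confidence tokens, which this sketch leaves out.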

29 comments
