data subTLDR week 24 year 2025

r/MachineLearning • r/dataengineering • r/SQL

Unlocking the Mystery of Diagram Tools, Securing SQL Queries with SoarSQL, AI Impersonators in Job Interviews, and Surviving Ancient Stored Procedures: Your Guide to Navigating the Technical Terrain

June 15, 2025 • Week 24, 2025

Posted in r/dataengineering by u/iknewaguytwice • 6/12/2025

1270

AI is literally coming for you job

Discussion

Professionals are increasingly concerned about AI in job interviews, with reports of 'scammer AI agents' making it past HR screenings. These agents give detailed, technical answers but fail to address specific problems or follow instructions outside their programming. One user reported a significant number of such cases while hiring for a data engineering role. Some see this as an unintended consequence of increasingly automated hiring processes, while others recommend unconventional ways to spot AI, such as asking obscure questions. Sentiment is mixed: some find the idea amusing, while others are frustrated by the wasted time and potential security risks.

211 comments

Posted in r/dataengineering by u/Adela_freedom • 6/13/2025

387

You haven’t truly suffered until you’ve debugged a multi-thousand-line stored procedure from 2009 👹

Meme

Sentiment in the discussion is mixed, leaning towards frustration. Many programmers have struggled to debug complex, multi-thousand-line stored procedures, a task made harder when the business side expects a flawed output to be replicated. Further complications arise when the procedure uses dynamic SQL or calls other lengthy stored procedures. The community suggests tools like sqlformat.org and its underlying Python library, sqlparse, to help with these tasks. A recurring pain point is nested stored procedures whose error-handling routines obfuscate the underlying errors. Some commenters are migrating to newer stacks such as Snowflake/DBT/Airflow, where the workload is hard to estimate.

74 comments

Posted in r/dataengineering by u/EzPzData • 6/12/2025

347

Databricks forgot to renew their websites certification

Meme

An apparent lapse in the Databricks website's certificate led to a lively discussion. The most upvoted comments pointed to the Cloudflare and Google Cloud Platform (GCP) outages as the probable cause. The common misconception that the certificate had expired was debunked, with users pointing out that the error code did not indicate expiration. Some comments highlighted the difficulty of selling products like Cloudflare, which only get attention when they malfunction. There was also a brief mention that Databricks on Azure uses an entirely different infrastructure, underscoring the complexity of cloud services. The overall tone was mixed, blending clarification and criticism.
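
For readers who want to rule expiration in or out themselves, here is a minimal sketch using only Python's standard library. The helper names `days_until_expiry` and `fetch_not_after` are hypothetical, and nothing here reproduces the checks commenters actually ran; it only shows how a certificate's `notAfter` date can be compared with the current time.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(), for example
    "Jan  1 00:00:00 2030 GMT". `now` should be timezone-aware.
    A negative result means the certificate has already expired.
    """
    expiry_seconds = ssl.cert_time_to_seconds(not_after)
    return (expiry_seconds - now.timestamp()) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served by `host`.

    With the default context, the handshake itself fails with
    ssl.SSLCertVerificationError if the certificate is already expired,
    so an exception here is also informative.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(days_until_expiry("Jan  1 00:00:00 2030 GMT", now))
```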

22 comments

Posted in r/MachineLearning by u/Striking-Warning9533 • 6/14/2025

213

[D] Machine Learning, like many other popular field, has so many pseudo science people on social media

Discussion

Most Reddit users agree that pseudo-scientific discussions of AI can mislead the public, and that they often stem from a poor understanding of the field or a desire to sound knowledgeable. Some suggest the misinformation originates from pop culture, oversimplified media portrayals, and the complexity of the subject, which can deter people from engaging with the actual science. Users emphasize the importance of fact-checking and promoting educational resources. They also highlight the need for AI experts to communicate more effectively with the wider public, suggesting that clear explanations in layman's terms could counteract misinformation. The overall sentiment is mixed but leans towards concern and a search for solutions.

70 comments

Posted in r/MachineLearning by u/tanishqkumar07 • 6/12/2025

213

[P]: I reimplemented all of frontier deep learning from scratch to help you learn

Project

The Reddit community has shown mixed sentiment towards a user's open-source project, which aims to help AI beginners by providing an implementation of modern deep learning techniques. Users appreciate the initiative but criticize the claim of having implemented all frontier ML research. Concerns are raised about the lack of tests and potential inaccuracies in the implementations. The creator acknowledges the need for more thorough testing and clarifies that certain boilerplate parts may have been AI-generated. Some users find the resource educational, while others suggest areas for improvement, like adjusting the router in the MoE implementation.

18 comments

Posted in r/dataengineering by u/caiopizzol • 6/15/2025

192

Processing 50 Million Brazilian Companies: Lessons from Building an Open-Source Government Data Pipeline

Open Source

The open-source CNPJ Data Pipeline was created to process Brazil's company registry, a massive 21GB government dataset. The Python-based pipeline handles large files, unstable server connections, and monthly updates. The lessons shared include that the database is often the bottleneck, that government data is imperfect, and that memory-aware processing matters. The pipeline has proven to be a valuable resource for Brazilian startups, researchers, and data scientists, offering a solution to a common challenge. Commenters appreciated the project, offered technical suggestions, and discussed how the MEI (Micro Empreendedor Individual) program inflates the number of registered companies in Brazil. The overall sentiment was positive.
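
The memory-aware processing lesson can be sketched as follows. This is not the project's code: the `companies` table, the two-column layout, the `;` delimiter, and the helper names are all assumptions made for illustration, and SQLite stands in for whatever database the pipeline targets. The idea is to stream the file in fixed-size batches, skip malformed rows, and commit once per batch so memory use stays flat and database round trips stay few.

```python
import csv
import io
import sqlite3
from itertools import islice
from typing import Iterator, List, TextIO


def batched_rows(reader: Iterator[List[str]], batch_size: int) -> Iterator[List[List[str]]]:
    """Yield lists of at most `batch_size` rows without loading the whole file."""
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def load_csv(stream: TextIO, conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Insert well-formed rows batch by batch; return how many rows were loaded."""
    reader = csv.reader(stream, delimiter=";")
    loaded = 0
    for batch in batched_rows(reader, batch_size):
        # Imperfect source data: keep only rows with the expected two columns.
        good = [row for row in batch if len(row) == 2]
        conn.executemany("INSERT INTO companies (cnpj, name) VALUES (?, ?)", good)
        conn.commit()  # one commit per batch, not per row
        loaded += len(good)
    return loaded


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE companies (cnpj TEXT, name TEXT)")
    sample = io.StringIO("001;Alpha\n002;Beta\nbroken-row\n003;Gamma\n")
    print(load_csv(sample, conn, batch_size=2))
```

Batch size is the main tuning knob: larger batches mean fewer commits but more rows held in memory at once.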

25 comments

Posted in r/MachineLearning by u/NOAMIZ • 6/9/2025

178

[D] What underrated ML techniques are better than the defaults

Discussion

The discussion reveals a consensus on the value of feature engineering and ensemble methods in machine learning. These techniques reportedly outperform model selection or tuning alone. Kaggle, particularly the winners' write-ups and top notebooks from completed competitions, is identified as an excellent resource for practical tips. There is skepticism around hyperparameter tuning, with suggestions that its popularity is declining. However, Optuna, a hyperparameter optimization library, is mentioned as a widely used tool. A minority view suggests that sometimes a simple t-test might suffice. There is also an ongoing debate about training the final model on validation and test data before deployment.
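
To make the t-test remark concrete, here is a small sketch of Welch's t statistic using only the standard library. The function name and the sample scores are hypothetical, and the thread does not say which variant of the test was meant; Welch's version is used here because it does not assume equal variances.

```python
import math
import statistics
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Squared standard error contributed by each sample (sample variance / n).
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


if __name__ == "__main__":
    # Hypothetical scores for two groups.
    group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
    print(welch_t(group_a, group_b))
```

Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; a statistics package would be the usual route.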

82 comments

Posted in r/SQL by u/ExoticArtemis3435 • 6/13/2025

146

You guys use this feature? or is there better way to do it

SQL Server

Many Reddit users find the tables feature highly useful, but the diagrams feature is generally under-utilized and considered subpar, especially in SSMS. ER diagrams are praised, but the quality and utility of diagram tools vary widely. Some users note that they rarely receive the diagrams they request from vendors. A common sentiment is that many software developers avoid creating foreign key constraints because of error-handling concerns. Despite these issues, the diagram feature is still seen as a quick and simple way to understand relationships in large schemas, which helps with data troubleshooting and issue escalation.
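
When a diagram tool is unavailable, declared foreign keys can still be listed as text. The sketch below uses SQLite from Python's standard library as a stand-in, since the thread concerns SQL Server, where catalog views such as `sys.foreign_keys` hold similar metadata. The schema and the `list_foreign_keys` helper are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def list_foreign_keys(conn: sqlite3.Connection) -> List[Tuple[str, str, str, str]]:
    """Return (child_table, child_column, parent_table, parent_column) tuples."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    relationships = []
    for table in tables:
        # PRAGMA foreign_key_list columns:
        # id, seq, table, from, to, on_update, on_delete, match
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            relationships.append((table, row[3], row[2], row[4]))
    return relationships


def build_demo_schema() -> sqlite3.Connection:
    """Create a tiny two-table schema with one declared foreign key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customers(id))"
    )
    return conn


if __name__ == "__main__":
    for child, column, parent, parent_column in list_foreign_keys(build_demo_schema()):
        print(f"{child}.{column} -> {parent}.{parent_column}")
```

This only works when the constraints are actually declared, which ties back to the complaint about vendors who skip them.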

78 comments

Posted in r/SQL by u/rahulsingh_ca • 6/10/2025

132

SQL 🤝 Google Sheets

Discussion

SoarSQL's ability to connect to Google Sheets and run SQL queries on the data has sparked discussion. Data security was a top concern, though reassurances were given that soarSQL does not store data or expose it to the web. Users also mentioned the DuckDB extension for querying Google Sheets and, given the absence of a Linux option, asked whether the software is open source. Others asked whether records can be inserted or deleted. Overall, the update was well received, though users want more detail about its security and compatibility.

14 comments

Posted in r/SQL by u/op3rator_dec • 6/9/2025

102

onlyProdBitesBack

Discussion

The thread leans towards recognizing the challenges that 'prod' faces in dealing with everyone's 'garbage', a view supported by most of the upvotes. However, the conversation lacks depth, with only two comments, one of which is unrelated to the topic.

2 comments
