data subTLDR week 22 year 2026
r/MachineLearning, r/dataengineering, r/SQL
Unlocking SQL Query Challenges, Discovering Fraud Rings as Graph Problems, Decoding BigQuery Anomalies, Exploring the Future of Data Engineering, and Redefining the Semantic Layer
Posted in r/MachineLearning by u/laginimaineb • 5/27/2026
253
AI-generated CUDA kernels silently break training and inference [R]
Research
Several top-ranked AI-generated CUDA kernels from NVIDIA's SOL-ExecBench have been found to fail in production workloads, causing significant debugging challenges. The main issue appears to be the accumulation of the embedding-gradient half of the kernel in bf16 instead of fp32, causing the loss to diverge and never recover. This problem is particularly difficult to detect as it mimics the symptoms of unsuccessful research ideas. The issue becomes invisible under AdamW's per-parameter normalization. Other submissions had different bugs. Commenters stressed the need for rigorous testing of numerical accuracy in kernels and more comprehensive validation methods. The overall sentiment was mixed, oscillating between frustration and optimism for potential fixes.
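The bf16-versus-fp32 accumulation problem described above can be illustrated with a toy sketch. This is not the kernel from the post; it only simulates bf16 rounding in pure Python (the helpers `to_bf16`, `to_fp32`, and `accumulate` are hypothetical names introduced here) and sums many small, gradient-like values with each accumulator precision.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the top 16 bits of a float32, so we round away the
    low 16 bits. NaN, infinity, and overflow are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even neighbour
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(values, round_fn):
    """Running sum where the accumulator is rounded after every addition."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


# 10,000 small contributions whose exact sum is 10.0 (illustrative data).
grads = [1e-3] * 10_000

fp32_sum = accumulate(grads, to_fp32)
bf16_sum = accumulate(grads, to_bf16)

print(f"fp32 accumulator: {fp32_sum}")
print(f"bf16 accumulator: {bf16_sum}")
```

In this sketch the bf16 accumulator stops growing once each addend falls below half the spacing between adjacent bf16 values, so later contributions are silently dropped, while the fp32 accumulator stays close to the exact sum. Comparing a kernel's output against a higher-precision reference is one way to catch this class of bug.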
Posted in r/dataengineering by u/Alternative-Guava392 • 5/26/2026
161
Future of data engineering
Discussion
The consensus among Reddit's data engineering community is that the rise of AI won't render them obsolete but will instead act as an efficient tool that elevates their work. While AI can write code, human engineers are still needed to build pipelines, resolve outages, and manage business requirements. This mirrors past tech revolutions, like the advent of spreadsheets, that initially sparked job-loss fears but ultimately increased demand for skilled workers. Data engineering is seen as safer than generic software engineering due to the complexities of large data. However, commenters warn against complacency, suggesting that those who don't adapt to advancements like AI may be left behind.
Posted in r/dataengineering by u/cyamnihc • 5/30/2026
149
Semantic layer
Discussion
The semantic layer is essentially metadata and context, critical to making sense of underlying data. It involves representing data in a way that reflects business language, likely through a structured dimensional model and field names that are not cryptic. This is not a new concept, but AI is bringing it to the forefront as a solution for making sense of complex data. Some commenters argued that AI can be used to inspect data, describe it, and verify its accuracy, while others insisted on the irreplaceable value of human data engineers. The semantic layer isn't a one-off task but evolves with the business, requiring regular updates.
Posted in r/MachineLearning by u/dh7net • 5/28/2026
121
A new dataset with more than 100M hi-quality, curated images, with captions and meta data! [P]
Project
The newly released MONET dataset, an Apache 2.0-licensed image–text dataset built from 2.9 billion images and refined to 104.9 million high-quality samples, has been well-received in the online community. Users appreciate the vast amount of data, some reflecting on how difficult it was to gather such datasets just five years ago. The dataset's inclusion of almost 15 million synthetic images was also highlighted, with clear metadata to distinguish real from synthetic. The dataset's size, at 68TB, impressed many, and users are eager to share the news in their research circles. Overall, the sentiment is positive, viewing MONET as a valuable contribution to research.
Posted in r/dataengineering by u/blu_lazr • 5/28/2026
113
Well played Dagster
Meme
Discussion revolved around comparing Dagster, a data orchestration tool, with similar tools like Airflow. Many users expressed a preference for Dagster's user interface, although some found it to be a cheap imitation of other tools. Others noted the adoption of Dagster's asset-based approach by Airflow, but highlighted the differences in implementation. One user, apparently involved with Airflow's development, praised its lightweight nature and compatibility with modern infrastructure. Some criticized Dagster's freemium model as a barrier to usage. Overall, opinions were mixed, reflecting diverse experiences and preferences in the community.
Posted in r/SQL by u/Wise_Safe2681 • 5/28/2026
42
What’s the most challenging SQL query you’ve ever written, and how did you optimize it for better performance?
Discussion
SQL query challenges often arise from poor data model design and lack of data governance. The most difficult queries typically involve reporting or analysis, requiring innovative solutions to complex problems. However, optimization isn't always critical, provided that processing times remain reasonable. High costs can result from inefficient queries, such as one that ran hourly, resulting in a $24,000 daily expense. The difficulty of writing SQL queries frequently lies not in the coding, but in extracting clear business requirements from stakeholders, complicated by differing definitions and undocumented practices. Consequently, better communication and documentation practices may alleviate some query-related challenges.
Posted in r/SQL by u/FixelSmith • 5/27/2026
16
Detecting fraud rings: the social-graph problem in disguise
Discussion
Fraud ring detection is a complex graph problem often mistaken for a SQL problem. A crucial part of the work lies in the shared_attributes table and the recursive CTE, a join that walks the data. This method suits components of a few thousand members on a modern warehouse, after which costs increase. Streaming or real-time ring detection is a separate design. Often, strong signals like device fingerprint and payment-method hash collapse a ring to one or two degrees, making the recursive CTE mostly useful for extra checks. Third-level joins can generate false positives due to low-signal connections, which can be mitigated by weighting attributes accordingly.
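A minimal sketch of the recursive-CTE walk summarized above, run against an in-memory SQLite database rather than a warehouse. Only the `shared_attributes` table name comes from the post; the column names, the sample rows, the `find_ring` helper, and the depth cap are assumptions made for illustration, and the original query may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: one row per (account, attribute) observation.
    CREATE TABLE shared_attributes (
        account_id TEXT NOT NULL,
        attr_type  TEXT NOT NULL,
        attr_value TEXT NOT NULL
    );
    INSERT INTO shared_attributes VALUES
        ('a1', 'device_fingerprint',  'dev-1'),
        ('a2', 'device_fingerprint',  'dev-1'),
        ('a2', 'payment_method_hash', 'pm-9'),
        ('a3', 'payment_method_hash', 'pm-9'),
        ('a4', 'device_fingerprint',  'dev-7');
    """
)

RING_QUERY = """
WITH RECURSIVE
edges AS (
    -- Two accounts are linked when they share the same attribute value.
    SELECT DISTINCT s1.account_id AS src, s2.account_id AS dst
    FROM shared_attributes s1
    JOIN shared_attributes s2
      ON s1.attr_type = s2.attr_type
     AND s1.attr_value = s2.attr_value
     AND s1.account_id <> s2.account_id
),
ring(account_id, depth) AS (
    SELECT ?, 0                       -- seed account
    UNION
    SELECT e.dst, r.depth + 1         -- walk one more hop
    FROM ring r
    JOIN edges e ON e.src = r.account_id
    WHERE r.depth < ?                 -- depth cap keeps cycles finite
)
SELECT account_id, MIN(depth) AS degree
FROM ring
GROUP BY account_id
ORDER BY degree, account_id
"""


def find_ring(connection, seed_account, max_depth=3):
    """Return (account_id, degree) pairs reachable from the seed account."""
    return connection.execute(RING_QUERY, (seed_account, max_depth)).fetchall()


for account_id, degree in find_ring(conn, "a1"):
    print(account_id, degree)
```

The depth cap doubles as a cost control: each extra hop is another self-join over the edge set. Filtering `edges` to high-signal attribute types before the walk would be one way to apply the weighting idea from the post.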
Posted in r/SQL by u/JLTDE • 5/28/2026
13
Absolutely puzzled with this Bigquery result
BigQuery
The BigQuery issue discussed revolves around a query returning unexpected results. The top-voted comment suggests that BigQuery, due to its design for large-scale data analysis, might be converting the COUNT(*) / HAVING construction into a COUNT DISTINCT, thus affecting the result. Others suggested checking the actual code against the pseudocode to rule out errors, and considering that BigQuery never physically deletes anything, which could result in multiple counts. An alternative query structure was also proposed to address the issue. The sentiment was mixed, with users providing various potential explanations and solutions.