data subTLDR week 7 year 2026

r/MachineLearning, r/dataengineering, r/SQL

Visualizing SQL with open-source tool sql-crack, Strategies for maintaining discoverable SQL queries, Tracing foreign-key relationships in SQL schemas, Reviews on O'Reilly's Data Engineering Design Patterns, Reflecting on changes in Data Engineering over the past 9 years

Week 7, 2026
Posted in r/MachineLearning by u/Hope999991, 2/10/2026
438

[D] Ph.D. from a top Europe university, 10 papers at NeurIPS/ICML, ECML— 0 Interviews Big tech

Discussion
Despite a strong academic record and relevant industry experience, a Ph.D. holder is struggling to secure interviews at big tech companies. Commenters overwhelmingly attribute this to the importance of networking and personal connections in the job search process. They advise leveraging relationships formed during conferences, internships, or collaborations for job opportunities, stating that relying solely on cold applications can be ineffective. Some also suggest the field of Natural Language Processing (NLP) might be oversaturated, making job hunting more competitive. The job market's current state is also cited as a factor, with even highly qualified candidates facing numerous rejections. The sentiment is mixed, acknowledging the job market realities while encouraging proactive networking.
143 comments
Posted in r/MachineLearning by u/Working-Read1838, 2/13/2026
399

[D] ICML: every paper in my review batch contains prompt-injection text embedded in the PDF

Research
The discovery of prompt-injection text in papers under ICML review sparked a debate about its purpose and ethics. Some participants defended the practice as a deterrent against reviewers who use large language models (LLMs) to automate their reviews, arguing that it keeps responsibility for the evaluation with the reviewer. Others raised concerns about possible accusations of ethics violations, especially in light of previous scandals involving prompt injection. Some suspected the text had been added by the program chairs to monitor policy compliance. There was also discussion of whether such techniques are fair depending on whether the authors had allowed LLM use in the review process.
54 comments
Posted in r/MachineLearning by u/Tough_Ad_6598, 2/9/2026
278

[P] A Python library processing geospatial data for GNNs with PyTorch Geometric

Project
City2Graph, a Python library for processing geospatial data for PyTorch Geometric, has been introduced. The library constructs heterogeneous graphs from several domains, including morphology, transportation, mobility, and proximity. Users responded positively, highlighting its representation of geographic data and its potential usefulness in geospatial projects. Some suggested enhancements, such as integrating spatio-temporal properties and including factors like semaphore (traffic signal) state and cycle duration. The author is mainly looking for feedback from a machine learning perspective rather than a geography one, which suggests an interest in applying the tool more broadly.
10 comments
Posted in r/MachineLearning by u/Playful-Fee-4318, 2/15/2026
250

Can we stop these LLM posts and replies? [D]

Discussion
There is growing frustration among Reddit users about the increase in posts and comments generated by large language models (LLMs), which they believe add no value and create unnecessary noise. The problem is seen across machine learning and programming subreddits and is undermining confidence in Reddit as a platform for genuine discussion. Moderators are working to combat it, and users are encouraged to report such content so it can be removed. Users also noted that higher-quality bot-generated content is hard to tell apart from human writing, which complicates detection. There is also a call for additional moderators to help manage the problem.
46 comments
Posted in r/dataengineering by u/xean333, 2/13/2026
175

Has anyone read O’Reilly’s Data Engineering Design Patterns?

Discussion
O'Reilly's Data Engineering Design Patterns is generally seen as a solid resource for beginners in data engineering, though experienced professionals consider the content somewhat basic. Readers appreciate its practical advice and code examples. Some find it particularly useful for understanding how to build scalable systems and how databases work. Others say the knowledge doesn't always apply to non-FAANG jobs. One alternative recommendation is 'Designing Data-Intensive Applications' by Kleppmann. The sentiment towards the book is largely positive, though it's not viewed as the absolute best in the field.
39 comments
Posted in r/dataengineering by u/rmoff, 2/11/2026
159

It's nine years since 'The Rise of the Data Engineer'…what's changed?

Discussion
The data engineering field has seen major shifts over the last nine years. Infrastructure abstraction is now common, eliminating the need for teams to manage clusters. The emergence of analytics engineering as a discipline and the rise of the 'modern data stack' hype cycle have reshaped the industry. Despite these changes, the gap between having data and understanding the business domain remains, alongside pipeline maintenance burdens and stakeholder expectations versus data quality reality. However, starting a project has become significantly easier, there's improved tooling for testing and observability, and version control for transformations is standard. Opinions vary on the role of a data engineer and the cyclical nature of software.
37 comments
Posted in r/dataengineering by u/alphter, 2/12/2026
142

I built a website to centralize articles, events and podcasts about data

Personal Project Showcase
There's a positive sentiment towards a new website called dataaaaa, designed as a central hub for data-related articles, events, and podcasts. The site has been appreciated for its unique concept, especially for making data research significantly faster. However, users have expressed a desire for an RSS feed feature to integrate the website into their feed readers. They also suggested an 'exclude list' in the site's filter settings. The developer mentioned that the site sources articles mainly through RSS feeds coupled with AI post-processing for summaries and tagging, and is considering expanding the site's features in the future.
17 comments
Posted in r/SQL by u/Friendly_Cold1349, 2/15/2026
127

SQL advice to yourself 5 years ago

Discussion
Experienced SQL users recommend documenting thoroughly and saving all code. They emphasize well-structured queries and detailed comments within the code. Common Table Expressions (CTEs), window functions, and the QUALIFY clause are singled out as invaluable tools. Users are advised not to fall back on SELECT * in joins and to name the columns they need instead. Creating views is also recommended. There's a general suggestion to understand SQL's underlying mechanics, such as how pages and indexes work. Commenters acknowledge that these practices take effort, but the consensus is that they are key to effective SQL usage.
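The advice above can be sketched as a small runnable example. It uses Python's built-in sqlite3 module and a made-up customers/orders schema (every table and column name here is an assumption for illustration, not something from the thread). The query combines a CTE, a window function, and an explicit column list. SQLite has no QUALIFY clause, so the filter on the window function is written as an outer WHERE instead.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        order_date  TEXT,
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES
        (10, 1, '2026-01-05', 40.0),
        (11, 1, '2026-02-01', 25.0),
        (12, 2, '2026-01-20', 90.0);
""")

LATEST_ORDER_SQL = """
WITH ranked_orders AS (
    -- Name only the columns that are needed instead of SELECT *.
    SELECT
        c.name,
        o.order_id,
        o.order_date,
        o.amount,
        -- Window function: number each customer's orders, newest first.
        ROW_NUMBER() OVER (
            PARTITION BY o.customer_id
            ORDER BY o.order_date DESC
        ) AS rn
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
)
SELECT name, order_id, order_date, amount
FROM ranked_orders
WHERE rn = 1   -- engines that support QUALIFY can filter on the window function directly
ORDER BY name;
"""

# One row per customer: their most recent order.
latest_orders = conn.execute(LATEST_ORDER_SQL).fetchall()
for row in latest_orders:
    print(row)
```

Window functions need SQLite 3.25 or newer, which recent Python builds bundle.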
77 comments
Posted in r/SQL by u/iambuv, 2/13/2026
78

Visualizes SQL as interactive flow diagrams, open source tool

Snowflake
The open-source tool 'sql-crack', designed to visualize complex SQL queries as interactive flow diagrams, has received positive feedback from users. It is appreciated for working offline, with no network calls or telemetry. However, some users reported that the tool failed to process certain SQL Server queries or scripts with heavy CTE usage. The creator responded positively to these reports and promised to address them in future updates. A suggestion to create a JetBrains version was noted for future consideration. Users also praised the creator's work, and some said they intend to support it through donations.
19 comments
Posted in r/SQL by u/SIR_DONALDY, 2/11/2026
31

How do you keep SQL queries discoverable + understandable (maybe resharable)?

Discussion
The Reddit community recommends several ways to organize and document SQL queries. Many suggest using a data warehouse or implementing queries as views or stored procedures to maintain a single source of truth. Commenters emphasize inline comments for documentation and storing queries in a code repository such as GitHub for easy access and change tracking. Some recommend parameterizing code and using tools like Power BI for easy query access and visualization. Despite differing opinions on tools and methods, the consensus is clear: systematic organization, thorough documentation, and a single source of truth are key to efficient SQL query management. The sentiment is largely constructive and solution-oriented.
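As a rough illustration of two of these suggestions, the sketch below stores one shared definition as a view and reads it through a parameterized query. It uses Python's built-in sqlite3 module. The orders table, the revenue_by_region view, and the revenue_for helper are all hypothetical names invented for this example.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 40.0), (2, 'EU', 60.0), (3, 'US', 90.0);

    -- One shared, commented definition of "revenue by region".
    -- Reports query the view instead of copying the aggregation logic.
    CREATE VIEW revenue_by_region AS
    SELECT region,
           SUM(amount) AS total_revenue,
           COUNT(*)    AS order_count
    FROM orders
    GROUP BY region;
""")


def revenue_for(region: str):
    """Return (region, total_revenue, order_count) for one region, or None."""
    # The ? placeholder keeps the query parameterized rather than string-built.
    return conn.execute(
        "SELECT region, total_revenue, order_count "
        "FROM revenue_by_region WHERE region = ?",
        (region,),
    ).fetchone()


eu_revenue = revenue_for("EU")
print(eu_revenue)
```

Keeping the CREATE VIEW statement in a version-controlled file would also cover the repository and change-tracking suggestions.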
24 comments
Posted in r/SQL by u/BearComprehensive643, 2/9/2026
30

Visual foreign-key relationship tracing for SQL schemas

Discussion
The discussion centers on the challenges of understanding large SQL schemas and on whether ER diagrams or relationship tracing is more helpful. The sentiment is mixed: some find ER diagrams not particularly useful and favor a more hands-on approach, such as using pen and paper or tracing relationships directly. Some suggested presenting the data in 3D space or as a series of connected graphs to manage the complexity. There were also suggestions for improving readability, such as showing cardinality in diagrams, using proper data types, and listing links in a tabular format. The need for collaboration during schema design and understanding was also highlighted.
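One of the suggestions above, listing links in a tabular format, can be sketched in a few lines. This is not the tool discussed in the thread. It is a minimal example that assumes a SQLite database and a made-up three-table schema, and it reads the declared foreign keys with SQLite's PRAGMA foreign_key_list. The foreign_key_links helper is a hypothetical name.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    CREATE TABLE order_items (
        item_id  INTEGER PRIMARY KEY,
        order_id INTEGER REFERENCES orders(order_id)
    );
""")


def foreign_key_links(connection):
    """Return sorted (child_table, child_column, parent_table, parent_column) tuples."""
    tables = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    links = []
    for table in tables:
        # PRAGMA foreign_key_list rows are:
        # (id, seq, table, from, to, on_update, on_delete, match)
        for fk in connection.execute(f'PRAGMA foreign_key_list("{table}")'):
            links.append((table, fk[3], fk[2], fk[4]))
    return sorted(links)


# Print the links as a simple table: child.column -> parent.column
for child, child_col, parent, parent_col in foreign_key_links(conn):
    print(f"{child}.{child_col:<12} -> {parent}.{parent_col}")
```

Other engines expose the same information differently, typically through information_schema views, so the PRAGMA call is specific to SQLite.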
10 comments