
data subTLDR week 6 year 2026

r/MachineLearning, r/dataengineering, r/SQL

Fresh Grad's Strategy for Sales Data Integration, Testing Complex SQL Queries Pre-Production, Querying Local Spreadsheets: Tips and Tools, Frustration Over AI Misuse in Data Engineering, Notebooks vs. Spark Jobs: Weighing Convenience Against Control

Week 6, 2026
Posted in r/dataengineering by u/uncertainschrodinger, 2/4/2026
499

Data Engineering as an After Thought

Meme
Commenters are largely negative about data engineering being treated as an afterthought. Many express frustration with the misuse of AI in data engineering projects, calling such projects costly and ineffective. Some highlight how hard these projects are to manage, especially when higher-ups have unrealistic expectations of what AI can do. There is also a consensus that executives often use such projects to improve the company's image rather than to deliver tangible benefits. The lack of support from consulting firms and the burden of maintenance are also common complaints. Few commenters see any value in these projects, and some question the legitimacy of the consulting firms involved.
22 comments
Posted in r/dataengineering by u/mwc360, 2/5/2026
382

Notebooks, Spark Jobs, and the Hidden Cost of Convenience

Blog
The discussion revolves around the use of notebooks versus Spark jobs in production environments. Despite the convenience of notebooks, many users express concern over their limitations in version control, change control, and rollback processes. However, some defend their use, citing benefits in debugging and explaining pipeline failures, particularly in machine learning workloads. The consensus leans towards a balanced approach: the choice between notebooks and Spark jobs should depend on the specific pipeline. Uncontrolled deployment into production, regardless of tool, is generally criticized, which underlines the importance of proper checks and controls. The overall sentiment is mixed.
88 comments
Posted in r/dataengineering by u/Outside_Reason6707, 2/3/2026
196

DoorDash Sr Data Engineer

Career
The interview experience at DoorDash for a Senior Data Engineer position was marked by rigorous technical rounds and what the candidate perceived as lackluster interaction with the recruiter. The rounds covered system design, data modeling, business partnership, and leadership. The system design question required in-depth knowledge of Databricks, while data modeling included advanced graph visualizations. The overall sentiment was mixed, with some feeling the process was overly complex and others attributing the high standards to the competitive market. Frustration arose from a lack of feedback post-interview, despite the candidate feeling they performed well. The sentiment towards recruiters was generally negative due to perceived coldness and unavailability.
54 comments
Posted in r/MachineLearning by u/Striking-Warning9533, 2/6/2026
123

[D] Saw this papaer from ICLR with scores 2,2,2,4 and got accepted, HOW

Discussion
The acceptance of an ICLR paper with low review scores (2,2,2,4) has sparked debate. Key criticisms include severe format violations, such as reduced page margins and incorrect LaTeX formatting, and perceived unfairness in the acceptance process. Some participants suggest exceptions are made for well-known authors, while others highlight the arbitrary role of area chairs. A few voices argue the paper's substantial benchmark improvement may justify the area chair's override of low scores. Although the authors rectified the formatting issue, the incident has underscored concerns about the transparency and consistency of the academic review process. The sentiment is largely negative.
53 comments
Posted in r/MachineLearning by u/Hopeful-Reading-6774, 2/5/2026
122

[D] What to do with an ML PhD

Discussion
The discussion revolves around the future career prospects for a soon-to-graduate ML PhD student without a remarkable publication record. The most upvoted advice suggests applying to mid-tier companies with R&D departments, participating in relevant online communities, and honing both general AI and coding skills. Additional suggestions include studying for specific interview types and seeking advice from a PhD supervisor. Some commenters also propose roles such as a regular software engineer, ML engineer, or a postdoc position, possibly in Europe. The sentiment leans towards the importance of practical skills and real-world experience, rather than relying solely on academic credentials.
49 comments
Posted in r/MachineLearning by u/ternausX, 2/3/2026
89

[D] Where is modern geometry actually useful in machine learning? (data, architectures, optimization)

Discussion
The application of modern geometry in machine learning (ML) is a topic of interest, with users highlighting the use of geometric deep learning and the Muon optimizer as key examples. Many feel that geometric concepts have not significantly influenced model or optimizer design beyond basic settings. Riemannian or manifold-aware optimization is seen as useful but often behaves like fancy preconditioning. Topology, especially persistent homology, is considered a powerful analysis tool, but integrating it into model training poses challenges. The sentiment is mixed, with some optimism that geometry could remove certain pathologies in ML and others expressing skepticism about its practical utility. Some users are exploring the implications of geometry on ML primitives and argue for the prescriptive use of symmetry. The discussion also touched on the limitations of traditional computer science theory in studying large parallel programs like neural networks.
32 comments
Posted in r/SQL by u/OriginalAssignment19, 2/5/2026
19

Fresh grad tackling sales data integration project. Need advice

PostgreSQL
The fresh graduate's plan for a sales data integration project at a small manufacturing firm is seen as a solid approach: set up a local PostgreSQL database, load CSV/Excel files into staging tables, transform the data, and connect it to Power BI for reporting. PostgreSQL is recommended for small-scale setups due to its capacity and automation capabilities. For file ingestion, a simple Python script is suggested to monitor a folder and load new files. For modeling, commenters prefer a thin star schema without over-normalizing. Further advice includes fetching directly from the ERP database to avoid loading CSVs, and ensuring the system is config-driven to avoid duplicates. Overall, sentiment is positive.
6 comments
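
The folder-watching ingestion idea from this thread could look roughly like the sketch below. It is an illustration only, not the poster's code: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server, and the `CONFIG` dictionary, the `stg_sales` and `load_log` tables, and the column names are all hypothetical. Duplicate loads are avoided by recording a content hash for every file that has been ingested.

```python
import csv
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical settings; a real setup would read these from a config file.
CONFIG = {"staging_table": "stg_sales", "columns": ["order_id", "customer", "amount"]}


def init_db(conn):
    """Create the staging table and a log of files already ingested."""
    cols = ", ".join(f"{c} TEXT" for c in CONFIG["columns"])
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {CONFIG['staging_table']} ({cols}, source_file TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_log (file_hash TEXT PRIMARY KEY, file_name TEXT)"
    )


def load_new_files(conn, folder):
    """Load every CSV in `folder` whose content hash is not in load_log yet."""
    loaded = []
    marks = ", ".join("?" for _ in CONFIG["columns"])
    for path in sorted(Path(folder).glob("*.csv")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        seen = conn.execute(
            "SELECT 1 FROM load_log WHERE file_hash = ?", (digest,)
        ).fetchone()
        if seen:
            continue  # identical content was ingested before, skip it
        with path.open(newline="") as fh:
            rows = [
                tuple(rec[c] for c in CONFIG["columns"]) + (path.name,)
                for rec in csv.DictReader(fh)
            ]
        with conn:  # one transaction per file: rows and log entry commit together
            conn.executemany(
                f"INSERT INTO {CONFIG['staging_table']} VALUES ({marks}, ?)", rows
            )
            conn.execute("INSERT INTO load_log VALUES (?, ?)", (digest, path.name))
        loaded.append(path.name)
    return loaded
```

Calling `load_new_files` on a schedule (cron or Task Scheduler, for example) gives the "monitor a folder" behavior without a long-running process. Against PostgreSQL the same shape would apply with a PostgreSQL driver and the server's bulk-load facilities in place of row-by-row inserts.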
Posted in r/SQL by u/Brian_MPLS, 2/2/2026
13

Query a local spreadsheet?

SQL Server
The discussion focuses on solutions for extracting data from a third-party app without tech support or system access. The most supported idea is to save the extracted data as a CSV file, which is easier to query. Further suggestions include executing SQL on a workbook using ADODB via VBA, using DuckDB to query spreadsheets, and automating the process with a lightweight ETL tool such as Epitech Integrator. The thread emphasizes the importance of understanding the available tools and output needs for the specific situation in order to find a feasible solution. Overall, the sentiment is positive, with users actively offering various solutions.
20 comments
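
As a rough illustration of the "save it as CSV, then query it" suggestion, the sketch below loads a CSV export into an in-memory database and makes it queryable with SQL. It deliberately swaps in Python's built-in `csv` and `sqlite3` modules rather than the DuckDB or ADODB routes mentioned in the thread, so it runs with no extra installs. The function name `csv_to_table` is hypothetical, and every column is loaded as text.

```python
import csv
import sqlite3


def csv_to_table(conn, csv_path, table):
    """Copy a CSV file (with a header row) into a new SQLite table of TEXT columns."""
    with open(csv_path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        cols = ", ".join(f'"{name}" TEXT' for name in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        # Skip blank lines, which csv.reader yields as empty lists.
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({marks})',
            (row for row in reader if row),
        )
    conn.commit()
```

After `csv_to_table(conn, "export.csv", "export")` on a connection from `sqlite3.connect(":memory:")`, the file can be queried like any table, for example `SELECT region, SUM(CAST(amount AS REAL)) FROM export GROUP BY region`. The file name and the `region` and `amount` columns here are made up for illustration; the `CAST` is needed because the sketch stores everything as text.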
Posted in r/SQL by u/Historical-Hand8091, 2/3/2026
13

How do you validate complex queries before running them on production?

Discussion
The Reddit community offered various strategies to test complex SQL queries before running them on production environments. The most popular suggestion was to simply wait for issues to be reported post-production, indicating a tendency to learn from mistakes rather than preemptively avoiding them. Others emphasized the importance of having a rollback plan in place. A common practice mentioned includes creating temporary tables and replicating the source query's actions, allowing for safe testing within the production warehouse. Some users advised against running intricate queries on production due to potential resource strain. Overall, the sentiment leaned towards learning from errors and ensuring contingencies are in place.
22 comments
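
Two of the safer practices from this thread, running a change inside a transaction that is always rolled back and rehearsing it on a temporary copy of the data, can be sketched as follows. This is a minimal illustration under stated assumptions, not any commenter's actual workflow: it uses Python's built-in `sqlite3`, the helper names `dry_run` and `run_on_temp_copy` and the temp table name `scratch` are hypothetical, and the connection is assumed to be opened with `isolation_level=None` so the explicit `BEGIN` and `ROLLBACK` are the only transaction control.

```python
import sqlite3


def dry_run(conn, change_sql, check_sql):
    """Run change_sql in a transaction, evaluate check_sql, then always roll back.

    Assumes conn was opened with isolation_level=None (no implicit transactions).
    """
    conn.execute("BEGIN")
    try:
        affected = conn.execute(change_sql).rowcount
        check = conn.execute(check_sql).fetchall()
    finally:
        conn.execute("ROLLBACK")  # nothing from change_sql is kept
    return affected, check


def run_on_temp_copy(conn, source_table, change_sql, check_sql, sample_where="1=1"):
    """Copy rows into a TEMP table named scratch, run the change there, return check rows.

    change_sql and check_sql are expected to reference scratch, not source_table.
    """
    conn.execute("DROP TABLE IF EXISTS temp.scratch")
    conn.execute(
        f"CREATE TEMP TABLE scratch AS SELECT * FROM {source_table} WHERE {sample_where}"
    )
    try:
        conn.execute(change_sql)
        return conn.execute(check_sql).fetchall()
    finally:
        conn.execute("DROP TABLE IF EXISTS temp.scratch")
```

The first helper reports how many rows a statement would touch and what a sanity query would return afterwards, without keeping the change. The second leaves the source table untouched entirely, at the cost of copying data, which relates to the resource-strain concern some commenters raised. Transaction and temp-table semantics differ between engines, so the same pattern would need checking against the warehouse actually in use.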
