data subTLDR week 6 year 2026
r/MachineLearning, r/dataengineering, r/SQL
Fresh Grad's Strategy for Sales Data Integration, Testing Complex SQL Queries Pre-Production, Querying Local Spreadsheets: Tips and Tools, Frustration Over AI Misuse in Data Engineering, Notebooks vs. Spark Jobs: Weighing Convenience Against Control
Posted in r/dataengineering by u/uncertainschrodinger • 2/4/2026
499
Data Engineering as an After Thought
Meme
Sentiment toward treating data engineering as an afterthought is largely negative. Many users express frustration with the misuse of AI in data engineering projects, describing such projects as costly and ineffective. Some highlight how hard these projects are to manage, especially when higher-ups have unrealistic expectations about AI capabilities. There is also a consensus that executives often use such projects to improve the company's image rather than to deliver tangible benefits. The lack of support from consulting firms and the burden of maintenance are also common complaints. Few commenters see any value in these projects, and some question the legitimacy of the consulting firms involved.
Posted in r/dataengineering by u/mwc360 • 2/5/2026
382
Notebooks, Spark Jobs, and the Hidden Cost of Convenience
Blog
The discussion revolves around the use of notebooks versus Spark jobs in production environments. Despite the convenience of notebooks, many users express concern over their limitations in version control, change control, and rollback. However, some defend their use, citing benefits in debugging and explaining pipeline failures, particularly in machine learning workloads. The consensus leans towards a balanced approach: the choice between notebooks and Spark jobs should depend on the specific pipeline. Uncontrolled deployment into production, regardless of tool, is generally criticized, which underlines the importance of proper checks and controls. The overall sentiment is mixed.
Posted in r/dataengineering by u/Outside_Reason6707 • 2/3/2026
196
DoorDash Sr Data Engineer
Career
The interview experience at DoorDash for a Senior Data Engineer position was marked by rigorous technical rounds and what the candidate perceived as lackluster interaction with the recruiter. The rounds covered system design, data modeling, business partnership, and leadership. The system design question required in-depth knowledge of Databricks, while data modeling included advanced graph visualizations. The overall sentiment was mixed, with some feeling the process was overly complex and others attributing the high standards to the competitive market. Frustration arose from a lack of feedback post-interview, despite the candidate feeling they performed well. The sentiment towards recruiters was generally negative due to perceived coldness and unavailability.
Posted in r/MachineLearning by u/Striking-Warning9533 • 2/6/2026
123
[D] Saw this papaer from ICLR with scores 2,2,2,4 and got accepted, HOW
Discussion
The acceptance of an ICLR paper with low review scores (2,2,2,4) has sparked debate. Key criticisms include severe formatting violations, such as reduced page margins and incorrect LaTeX formatting, and perceived unfairness in the acceptance process. Some participants suggest exceptions are made for well-known authors, while others highlight the arbitrary role of area chairs. A few voices argue the paper's substantial benchmark improvement may justify the area chair's override of low scores. Although the authors rectified the formatting issue, the incident has underscored concerns about the transparency and consistency of the academic review process. The sentiment is largely negative.
Posted in r/MachineLearning by u/Hopeful-Reading-6774 • 2/5/2026
122
[D] What to do with an ML PhD
Discussion
The discussion revolves around the future career prospects for a soon-to-graduate ML PhD student without a remarkable publication record. The most upvoted advice suggests applying to mid-tier companies with R&D departments, participating in relevant online communities, and honing both general AI and coding skills. Additional suggestions include studying for specific interview types and seeking advice from a PhD supervisor. Some commenters also propose roles such as a regular software engineer, ML engineer, or a postdoc position, possibly in Europe. The sentiment leans towards the importance of practical skills and real-world experience, rather than relying solely on academic credentials.
Posted in r/MachineLearning by u/ternausX • 2/3/2026
89
[D] Where is modern geometry actually useful in machine learning? (data, architectures, optimization)
Discussion
The application of modern geometry in machine learning (ML) is a topic of interest, with users highlighting the use of geometric deep learning and the Muon optimizer as key examples. Many feel that geometric concepts have not significantly influenced model or optimizer design beyond basic settings. Riemannian or manifold-aware optimization is seen as useful but often behaves like fancy preconditioning. Topology, especially persistent homology, is considered a powerful analysis tool, but integrating it into model training poses challenges. The sentiment is mixed, with some optimism that geometry could remove certain pathologies in ML and others expressing skepticism about its practical utility. Some users are exploring the implications of geometry on ML primitives and argue for the prescriptive use of symmetry. The discussion also touched on the limitations of traditional computer science theory in studying large parallel programs like neural networks.
Posted in r/SQL by u/OriginalAssignment19 • 2/5/2026
19
Fresh grad tackling sales data integration project. Need advice
PostgreSQL
The fresh graduate's plan for a sales data integration project at a small manufacturing firm is seen as a solid approach: set up a local PostgreSQL database, load CSV/Excel files into staging tables, transform the data, and connect it to Power BI for reporting. PostgreSQL is recommended for small-scale setups due to its capacity and automation capabilities. For file ingestion, a simple Python script is suggested to monitor a folder and load new files. For modeling, commenters prefer a thin star schema without over-normalizing. Further advice includes fetching directly from the ERP database to avoid loading CSVs, and keeping the system config-driven to avoid duplicates. Overall, sentiment is positive.
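The folder-based ingestion suggested in the thread might look roughly like the sketch below. It is not the poster's script. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs without a server, and the table names (stg_sales, load_log), the column names (order_id, amount), and the content-hash check used to skip files that were already loaded are all illustrative assumptions.

```python
import csv
import hashlib
import sqlite3
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return a SHA-256 hex digest of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_new_files(conn: sqlite3.Connection, inbox: Path) -> int:
    """Load each not-yet-seen CSV in `inbox` into staging; return files loaded.

    Assumes every CSV has `order_id` and `amount` columns (hypothetical schema).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_log "
        "(digest TEXT PRIMARY KEY, filename TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stg_sales "
        "(order_id TEXT, amount TEXT, source_file TEXT)"
    )
    loaded = 0
    for path in sorted(inbox.glob("*.csv")):
        digest = file_digest(path)
        seen = conn.execute(
            "SELECT 1 FROM load_log WHERE digest = ?", (digest,)
        ).fetchone()
        if seen:
            continue  # identical content already loaded; skip to avoid duplicates
        with path.open(newline="", encoding="utf-8") as handle:
            rows = [
                (row["order_id"], row["amount"], path.name)
                for row in csv.DictReader(handle)
            ]
        with conn:  # one transaction per file: rows and log entry commit together
            conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", rows)
            conn.execute(
                "INSERT INTO load_log VALUES (?, ?)", (digest, path.name)
            )
        loaded += 1
    return loaded
```

Calling load_new_files on a schedule (for example from cron or Task Scheduler) gives the "monitor a folder" behavior without a long-running process. Values are kept as text in staging on purpose, so type casting and cleanup happen in the later transform step.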
Posted in r/SQL by u/Brian_MPLS • 2/2/2026
13
Query a local spreadsheet?
SQL Server
The discussion focuses on solutions for extracting data from a third-party app without tech support or system access. The most supported idea is to save the extracted data as a CSV file, which is easier to query. Further suggestions include executing SQL on a workbook using ADODB via VBA, using DuckDB to query spreadsheets, and automating the process with a lightweight ETL tool such as Epitech Integrator. The thread emphasizes the importance of understanding the available tools and output needs for the specific situation in order to find a feasible solution. Overall, the sentiment is positive, with users actively offering various solutions.
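As a rough illustration of the "save it as CSV, then query it" idea, the sketch below loads a CSV export into an in-memory database and runs plain SQL against it. The thread mentions DuckDB and ADODB; this example instead uses Python's built-in sqlite3 so it has no dependencies. The table name (export), the column names (region, units), and the sample rows are made up for the example.

```python
import csv
import io
import sqlite3


def csv_to_table(conn: sqlite3.Connection, table: str, handle) -> None:
    """Create `table` from a CSV file object, using the header row as column names.

    Assumes simple header names with no double quotes in them.
    """
    reader = csv.reader(handle)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


if __name__ == "__main__":
    # Hypothetical export; in practice this would be open("export.csv", newline="").
    sample = io.StringIO("region,units\nNorth,3\nSouth,5\nNorth,4\n")
    conn = sqlite3.connect(":memory:")
    csv_to_table(conn, "export", sample)
    query = (
        "SELECT region, SUM(CAST(units AS INTEGER)) AS total_units "
        "FROM export GROUP BY region ORDER BY region"
    )
    for row in conn.execute(query):
        print(row)
```

All values arrive as text, which is why the query casts units before summing. An engine that infers column types from the file would not need that cast.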
Posted in r/SQL by u/Historical-Hand8091 • 2/3/2026
13
How do you validate complex queries before running them on production?
Discussion
The Reddit community offered various strategies to test complex SQL queries before running them on production environments. The most popular suggestion was to simply wait for issues to be reported post-production, indicating a tendency to learn from mistakes rather than preemptively avoiding them. Others emphasized the importance of having a rollback plan in place. A common practice mentioned includes creating temporary tables and replicating the source query's actions, allowing for safe testing within the production warehouse. Some users advised against running intricate queries on production due to potential resource strain. Overall, the sentiment leaned towards learning from errors and ensuring contingencies are in place.
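One way to picture the rollback-plan advice is to run a change inside an explicit transaction, check a sanity condition, and commit only if it holds. The sketch below is an assumption-laden illustration rather than anything taken from the thread: it uses Python's built-in sqlite3, the helper name apply_if_valid is hypothetical, and the check is a single scalar query compared against an expected value.

```python
import sqlite3


def apply_if_valid(
    conn: sqlite3.Connection, change_sql: str, check_sql: str, expected
) -> bool:
    """Run `change_sql` in a transaction; commit only if `check_sql` yields `expected`.

    `conn` must be opened with isolation_level=None so that BEGIN, COMMIT,
    and ROLLBACK are controlled explicitly here.
    """
    conn.execute("BEGIN")
    try:
        conn.execute(change_sql)
        actual = conn.execute(check_sql).fetchone()[0]
        if actual != expected:
            conn.execute("ROLLBACK")  # check failed: undo the change
            return False
        conn.execute("COMMIT")
        return True
    except sqlite3.Error:
        if conn.in_transaction:
            conn.execute("ROLLBACK")  # error mid-change: leave the data untouched
        raise
```

The same pattern can be pointed at a temporary copy first (for example a table created with CREATE TEMP TABLE ... AS SELECT from the source), which matches the temporary-table practice mentioned in the thread. Note that not every warehouse supports transactional rollback of all statement types, so this is worth confirming for the engine in use.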