data subTLDR week 19 year 2026
r/MachineLearning, r/dataengineering, r/SQL
Competitive SQL Gaming Sparks Interest, SQL Query Optimization for Large Datasets Explored, Merge Statements in ETL Pipelines Debated, Dissatisfaction with Databricks Prompts Migration Talks, Gamified Learning Platforms for Data Engineers Praised
Posted in r/dataengineering by u/zoso • 5/7/2026
280
Is anyone migrating away from Databricks?
Discussion
Many users express dissatisfaction with Databricks, citing high costs and slow testing. Some suggest the platform is better suited to larger-scale, ML-heavy workloads than to small data engineering tasks. Several recommend migrating to AWS native services as a cost-effective alternative. While some find Databricks' testing process cumbersome, others share strategies to streamline testing within the platform. The overall sentiment is mixed: many users are frustrated and considering a move away from Databricks, while acknowledging its benefits for specific use cases.
Posted in r/MachineLearning by u/Pure-Ad9079 • 5/6/2026
173
Stop letting LLMs edit your .bib [D]
Discussion
There is significant dissatisfaction with incorrect citations of prior literature, often blamed on the use of Large Language Models (LLMs) to edit .bib files. Many agree that manually cross-checking citations is tedious but essential. Tools like Zotero and Overleaf, the Google Scholar browser plug-in, and other resources that generate .bib entries from DOI or arXiv links are highly recommended for accuracy. Some users also suggest penalties for incorrect citations. The sentiment leans towards combining automated tools with manual checks to ensure citation integrity, and rejects relying solely on LLMs despite their convenience.
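As a rough illustration of the "generate .bib from a DOI" approach mentioned above, the sketch below asks the doi.org resolver for a BibTeX record through HTTP content negotiation instead of having an LLM write or edit the entry. The function names `bibtex_request` and `fetch_bibtex` are hypothetical helpers made up for this example, and the sketch assumes the DOI's registration agency serves the `application/x-bibtex` format. The returned entry still deserves a manual check.

```python
import urllib.request


def bibtex_request(doi: str) -> urllib.request.Request:
    """Build a request asking the DOI resolver for a BibTeX record.

    Hypothetical helper for illustration; assumes the registration agency
    behind the DOI supports the application/x-bibtex content type.
    """
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
    )


def fetch_bibtex(doi: str, timeout: float = 10.0) -> str:
    """Send the request and return the BibTeX text (needs network access)."""
    with urllib.request.urlopen(bibtex_request(doi), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (not executed here because it needs network access):
#   entry = fetch_bibtex("<a DOI copied from the publisher page>")
#   print(entry)
```

Because the entry comes from the registry metadata rather than from generated text, any remaining errors are at least traceable to the source record.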
Posted in r/dataengineering by u/sham_nt • 5/8/2026
171
Leetcode for Data Engineering?
Discussion
Many data engineers prefer a gamified approach to learning, recommending platforms like SQL Murder Mystery for engaging SQL practice. Other popular resources include StrataScratch, which offers curated problem sets, and DataLemur for interview-style problems. Zillacode was suggested for practicing PySpark problems. Some users emphasized the importance of mastering fundamentals across languages like Python, SQL, and Bash, suggesting LeetCode as a good resource. Overall, the sentiment was positive towards using these platforms to learn and hone data engineering skills while also preparing for interviews.
Posted in r/MachineLearning by u/akardashian • 5/10/2026
157
PhD students in ML, how many hours on average do you work? [D]
Discussion
Most PhD students in Machine Learning seem to work between 6 and 10 hours a day, spending around 3-5 hours on cognitively demanding tasks and the rest on coding, writing, or less intense work. Commenters emphasize maintaining a healthy work-life balance, including regular breaks and leisure activities, to avoid burnout and increase productivity. Some students noted that long working hours are deceptive, since they often lead to mental fatigue rather than increased output. Extreme cases were mentioned, such as students in AI labs in China reportedly working over 15 hours a day, but these were outliers and generally seen as unhealthy. Overall sentiment leans towards valuing efficient, focused work over prolonged hours.
Posted in r/MachineLearning by u/snekslayer • 5/8/2026
112
Getting harassed by an aggressive “independent researcher” demanding very specific citations and phrasing in my paper [D]
Discussion
The post discusses a researcher's experience of being persistently badgered for specific citations by an independent researcher. Commenters largely support the original poster's frustration, suggesting they block the offender. Some suggest this behavior is characteristic of independent researchers who lack academic affiliations, while others defend the legitimacy of independent researchers but condemn the behavior at hand. A few comments also denounce the pressure to control how one's work is characterized in other papers, labeling it reputation management. The overall sentiment is mixed, with sympathy towards the poster's predicament and broader discussions on the dynamics of independent and academic research.
Posted in r/dataengineering by u/compass-now • 5/7/2026
104
Do you really need spark?
Discussion
Many Reddit users argue that Apache Spark, while powerful, is often not necessary for data sets under 100GB. Many suggest using other tools like Postgres or DuckDB until they no longer suffice, and only then considering Spark. Some also draw attention to unnecessary spending on Spark platforms when dealing with smaller data sets. However, others argue for Spark's versatility, citing its widespread support, stable development, and extensive documentation. They argue that it's a safe bet for businesses, even if not strictly necessary, due to its scalability and the availability of affordable managed solutions like Databricks. Sentiment is mixed, reflecting a divide based on data size and cost considerations.
Posted in r/SQL by u/Far-Round2092 • 5/6/2026
42
Real-time SQL PVP. Same prompt, same data, fastest correct query wins.
MySQL
In a recently released 1v1 PvP SQL game, players are challenged to write the quickest correct query based on a shared prompt and schema. The game is praised for its unique, competitive approach to learning SQL quickly, drawing comparisons to CSS Battle. Some players have suggested improvements such as enabling WASD directional controls, auto-placement of the cursor in the query text-box, and showing results instead of auto-completion. Despite some reservations about the game's design, the overall sentiment is positive, with players expressing interest and curiosity.
Posted in r/SQL by u/Effective_Ocelot_445 • 5/7/2026
27
How do you optimize SQL queries when working with millions of rows in production databases?
MySQL
To optimize SQL queries on large datasets, commenters suggest building and using the right indexes, checking execution plans, filtering out unneeded data early, and keeping statistics up to date. Pre-computing aggregates and materializing them into tables or materialized views can improve performance. Temp tables can prefilter data, improving efficiency in some cases, and partitioning can enable parallel processing. For analytical workloads, a columnar database such as ClickHouse is recommended. SQL Server 2019+ supports UDF inlining, which can help avoid extra calculations per row. Overall sentiment is constructive, with a focus on sharing practical and effective strategies.
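Three of the suggestions above (check the execution plan, add an index, pre-compute aggregates) can be sketched in a few lines. This is a minimal illustration only: it uses Python's built-in SQLite as a stand-in for a production database, and the `orders` and `customer_totals` tables, the `idx_orders_customer` index, and the `query_plan` helper are all hypothetical names invented for the example. Plan output and tuning behaviour differ on MySQL, SQL Server, and other engines.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the detail text


# Hypothetical table with 100,000 synthetic rows spread over 1,000 customers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, float(i % 50)) for i in range(100_000)],
)

sql = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"

# 1. Inspect the plan before indexing: the filter forces a full table scan.
plan_before = query_plan(conn, sql, (42,))

# 2. Add an index on the filter column and refresh the planner statistics.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.execute("ANALYZE")
plan_after = query_plan(conn, sql, (42,))

# 3. Pre-compute the aggregate once and materialize it into its own table.
conn.execute(
    "CREATE TABLE customer_totals AS "
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"
)
live_total = conn.execute(sql, (42,)).fetchone()[0]
cached_total = conn.execute(
    "SELECT total FROM customer_totals WHERE customer_id = ?", (42,)
).fetchone()[0]

print(plan_before)
print(plan_after)
print(live_total, cached_total)
```

Comparing the two plans shows the query switching from a scan to an index search, and the materialized table answers the same question without touching `orders` again. The trade-off is that a pre-computed table must be refreshed when the underlying rows change.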