data subTLDR week 18 year 2026
r/MachineLearning, r/dataengineering, r/SQL
SQL Protocol Game Evolves with Social Features, Common Mistake Juniors Make with Execution Plans, Prepping for SQL Interviews with Common Patterns, LinkedIn Trends Tool for Data Engineering Jobs, and SQLGlot Now 5x Faster in Python
Posted in r/dataengineering by u/Dubinko • 4/28/2026
369
I scan LinkedIn daily for Data Engineering Job trends
Personal Project Showcase
The tool, which scans LinkedIn for data engineering job trends, was well received, with users appreciating its potential for guiding learning and job hunting. Commenters suggested expanding its scope by segmenting data by industry and refining the filter function. Some questioned why tools like MongoDB and Scala ranked unexpectedly high while others like Dagster and dbt seemed underrepresented. There was also a discussion about the best ways to scrape LinkedIn, with recommendations for specific APIs and proxies. Overall, the sentiment was positive, with interest in further development and refinement.
Posted in r/MachineLearning by u/icannotchangethename • 4/29/2026
231
An interactive semantic map of the latest 10 million published papers [P]
Project
The interactive semantic map of the latest 10 million published papers received positive feedback, with users praising its exploration-friendly design and execution. Some users were particularly interested in the Voronoi partitioning procedure, with suggestions to consider density-aware clustering methods like HDBSCAN for more organic cluster boundaries. Questions were also raised about the labelling process and the use of SPECTER 2. The creator clarified that the labelling was partly inspired by the OpenAlex/CWTS Leiden pipeline. Concerns were raised about unlabelled empty spaces, and queries about the processing of the 10 million papers were also made. The thread was generally positive and constructive.
Posted in r/dataengineering by u/captaintobs • 5/1/2026
229
sqlglot is now 5x faster while still being written in python
Open Source
SQLGlot, an open-source SQL parser, transpiler, and analysis tool written in Python, now runs 5x faster because it is compiled into C extensions with mypyc. Users appreciate the faster SQL parsing for data pipelines, praising its utility in code analysis, lineage extraction, and SQL dialect translation. Some have used it for semantic chunking when translating large report SQL and for anti-pattern detection in BigQuery. Users should install the C extensions to get the full benefit of the improvements. Those who run into difficulties are encouraged to contribute back to the community. Overall sentiment: highly positive.
Posted in r/MachineLearning by u/ZeusZCC • 4/29/2026
176
Why isn’t LLM reasoning done in vector space instead of natural language? [D]
Discussion
The discussion revolves around whether large language models (LLMs) should reason in vector space instead of in natural language. While vector-based reasoning could be faster and more intuitive, it may also make the process more opaque and less reliable for logical tasks. Commenters saw a trade-off between the opacity of latent reasoning and interpretability: fully vectorizing the process could mean losing control and debuggability, which matters especially for AI safety. Current LLMs already perform their computations in latent space, with the text reasoning serving as a trace. Vector-space reasoning remains an active research area, and commenters agreed that it needs further development and evidence that it adds value beyond linguistic reasoning. The sentiment is largely mixed, with researchers acknowledging the potential of vector-based reasoning but also raising concerns about its practicality and safety.
Posted in r/MachineLearning by u/Hope999991 • 5/3/2026
166
Are modern ML PhDs becoming too incremental, or is this just what research looks like now? [D]
Discussion
Many believe that modern Machine Learning (ML) PhDs are increasingly incremental due to the competitive nature of the environment and the pressure to publish. This incrementalism is considered normal for PhD research. However, certain problems are identified, including the lack of appreciation for fundamental research unless it comes from a prominent lab. Further, the influence of industry on ML research creates incentives to focus on improvements that are valuable to industry. It's acknowledged that the strength of a PhD student's scientific ability varies, with some making substantial contributions and others coasting. The discussion emphasizes focusing on one's own work rather than comparing with others.
Posted in r/MachineLearning by u/Hackerstreak • 4/28/2026
162
Visualizing Loss Landscapes of Neural Networks [P]
Project
Visualization of neural network loss landscapes is a complex task due to their multi-dimensional nature. An interactive tool has been developed to help better understand this process, showing how different optimizers navigate these spaces. The tool allows the adjustment of architectures and the use of synthetic or real image datasets. However, the tool does have limitations. For example, 2D/3D projections may create geometric surfaces that don't exist in the high-dimensional space. Users have found the tool educational but caution that determining the usefulness of these visualizations can be challenging, especially given that higher dimensional spaces may behave differently than 3D visualizations suggest.
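To make the projection caveat concrete, here is a minimal sketch of one common way to draw such a picture: evaluate the loss on a grid in the plane spanned by two random directions around a fixed parameter vector. This is not necessarily what the tool in the thread does. The quadratic `loss`, the 50-dimensional parameter vector, and all function names are hypothetical stand-ins chosen for illustration.

```python
import random


def loss(theta, target):
    """Toy quadratic loss standing in for a network's training loss."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))


def random_direction(dim, rng):
    """Draw a random unit vector in parameter space."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def loss_slice(theta, d1, d2, target, steps=5, radius=1.0):
    """Evaluate the loss on a (2*steps+1) x (2*steps+1) grid in the d1/d2 plane."""
    offsets = [radius * i / steps for i in range(-steps, steps + 1)]
    return [
        [
            loss([t + a * u + b * v for t, u, v in zip(theta, d1, d2)], target)
            for b in offsets
        ]
        for a in offsets
    ]


rng = random.Random(0)
dim = 50  # assumed parameter count; real networks have far more
target = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
theta = list(target)  # pretend training converged exactly to the minimum
d1 = random_direction(dim, rng)
d2 = random_direction(dim, rng)
grid = loss_slice(theta, d1, d2, target)
```

The grid covers only two of the 50 directions, so any shape it shows is a property of that particular slice; a different pair of directions can produce a different surface.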
Posted in r/dataengineering by u/Relative-Cucumber770 • 4/28/2026
161
"Junior" role asking for +5 years...
Meme
There's widespread frustration with job postings for junior roles that require extensive experience. Many suggest ignoring the stated qualifications and applying regardless, while others note the increasing demand for experienced candidates in junior roles, particularly in data engineering. Some speculate that these postings aim to secure highly qualified employees at lower salaries, or are mere formalities when a candidate has already been selected. To gain experience, one suggestion is to offer data engineering services to smaller companies that cannot afford full-time positions, which can provide valuable experience and potentially lead to more opportunities. Overall sentiment is mixed, leaning towards negative.
Posted in r/SQL by u/Far-Round2092 • 4/27/2026
70
SQL Protocol update: the SQL game now has a shared world and chat, so you can study with other people
MySQL
The SQL Protocol game, which combines SQL queries with gameplay, has been updated with a shared world and a chat feature, so players can interact and study together in real time. The update was well received, with users praising the creator's work and the innovative approach to learning SQL. However, some users suggested it could be more beginner-friendly and asked for the ability to indent queries within the game. The game is free and can be accessed via Google sign-in on a desktop. Overall, the sentiment towards this update is enthusiastic and supportive.
Posted in r/SQL by u/Boring-Metal-7672 • 5/3/2026
27
In my ETL pipeline I used a Merge statement. When I asked Copilot to critique the pipeline it said Merge statements were not recommended by Microsoft. Why is this?
SQL Server
The discussion revolves around the use of MERGE statements in ETL pipelines, after Copilot told the poster that Microsoft does not recommend them. Most users agreed with avoiding MERGE and suggested separate INSERT and UPDATE statements instead, citing its potential for data corruption, concurrency issues, and performance problems when a single statement handles insert, update, and delete together. Some pointed out that MERGE shipped with many bugs that were never fixed and that can produce incorrect data, which might be why Copilot discouraged its use. Though the issues with MERGE are often corner cases, their sheer number is concerning. The overall sentiment was mixed but leaned towards avoiding MERGE.
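The alternative suggested in the thread, separate UPDATE and INSERT statements in place of one MERGE, can be sketched as follows. The thread is about SQL Server; this sketch uses Python's built-in `sqlite3` module as a stand-in so it runs anywhere, and the `customers` and `staging` tables and the `upsert_customers` helper are hypothetical names. It does not address concurrency, which on SQL Server would also depend on isolation level and locking choices.

```python
import sqlite3


def upsert_customers(conn, rows):
    """Apply (id, name) rows with UPDATE then INSERT inside one transaction."""
    with conn:  # commits on success, rolls back on any exception
        conn.execute("DELETE FROM staging")
        conn.executemany("INSERT INTO staging (id, name) VALUES (?, ?)", rows)
        # Step 1: update target rows that already exist.
        conn.execute(
            """
            UPDATE customers
            SET name = (SELECT s.name FROM staging AS s WHERE s.id = customers.id)
            WHERE id IN (SELECT id FROM staging)
            """
        )
        # Step 2: insert staged rows that have no match in the target.
        conn.execute(
            """
            INSERT INTO customers (id, name)
            SELECT s.id, s.name
            FROM staging AS s
            WHERE NOT EXISTS (SELECT 1 FROM customers AS c WHERE c.id = s.id)
            """
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.commit()

upsert_customers(conn, [(2, "Robert"), (3, "Cy")])
result = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

Keeping the two steps as separate statements makes each one easy to inspect and test on its own, at the cost of touching the staging data twice.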
Posted in r/SQL by u/ElixirStylish • 4/27/2026
18
What execution plan mistake do juniors make most often when they analyze queries?
Discussion
Junior developers often make the mistake of focusing on the most expensive-looking step when analyzing query execution plans, instead of identifying the actual problem such as bad row estimates or missing indexes. Many juniors also over-rely on cost percentages, without considering actual vs estimated rows or data flow. Developers expressed concerns that a lack of understanding of execution plans is prevalent, and that overpowered servers may disguise inefficient queries. They advocate improving this by focusing on understanding row counts and joins, as well as considering the original query for potential inefficiencies before jumping to the execution plan.
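As a small illustration of reading a plan for the access pattern and not just for a cost figure, the sketch below compares the plan for the same query before and after adding an index. It uses SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module as a stand-in; the thread is not specific to SQLite, and SQLite's output does not include the estimated-versus-actual row counts that commenters recommend checking. The `orders` table, the index name, and the `plan` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)
conn.commit()

QUERY = "SELECT total FROM orders WHERE customer_id = ?"


def plan(conn, sql, params=()):
    """Return the human-readable detail column of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Without an index on customer_id, the filter forces a scan of the whole table.
plan_before = plan(conn, QUERY, (42,))

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index in place, the same query can seek directly to matching rows.
plan_after = plan(conn, QUERY, (42,))
```

Printing `plan_before` and `plan_after` side by side shows the change from a full scan to an index search, which is the kind of root cause a cost percentage alone would not explain.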
Posted in r/SQL by u/Notalabel_4566 • 4/27/2026
14
What are the most commonly asked SQL interview questions and patterns?
SQL Server
SQL interview preparation should focus on understanding and practicing common patterns rather than memorizing questions. Key areas include joins (especially self-joins and how NULL values behave), GROUP BY with aggregations, window functions (such as RANK(), ROW_NUMBER(), LEAD(), and LAG()), and filtering (WHERE vs HAVING). Practical scenarios might involve finding the top N per group, spotting duplicates, calculating conversion rates, or handling dates. It is also important to know when to use a subquery, a single CTE, or several chained CTEs. For senior roles, expect deeper dives into views, stored procedures, and dynamic SQL scripting. Regular practice (30-40 questions weekly) is recommended to build speed and clarity in writing queries.
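Two of the patterns named above, top N per group and WHERE versus HAVING, can be sketched as follows. The example runs the SQL through Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer); the `sales` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "ann", 300), ("east", "bob", 500), ("east", "cy", 100),
        ("west", "dee", 200), ("west", "eli", 400), ("west", "fay", 50),
    ],
)
conn.commit()

# Top N per group: rank rows inside each region in a CTE, then filter on the rank.
TOP_TWO_PER_REGION = """
WITH ranked AS (
    SELECT region, rep, amount,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn
    FROM sales
)
SELECT region, rep, amount
FROM ranked
WHERE rn <= 2
ORDER BY region, rn
"""
top_two = conn.execute(TOP_TWO_PER_REGION).fetchall()

# WHERE filters individual rows before grouping; HAVING filters groups afterwards.
BIG_REGIONS = """
SELECT region, SUM(amount) AS total
FROM sales
WHERE amount >= 100
GROUP BY region
HAVING SUM(amount) > 700
ORDER BY region
"""
big_regions = conn.execute(BIG_REGIONS).fetchall()
```

In the second query the 50-unit sale is dropped by WHERE before any sums are computed, and HAVING then removes the west region because its remaining total of 600 does not exceed 700.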