data subTLDR week 20 year 2026

r/MachineLearning • r/dataengineering • r/SQL

Unlocking SQL's Potential: A Must-Know Guide on Transaction Fraud Detection, The Transformative Power of Window Functions, Insights for Entry-Level Data Roles, Reality Check on AI's Role in Data Validation, and a Tale of Revenge via Database Deletion

May 17, 2026 • Week 20, 2026

Posted in r/SQL by u/FixelSmith • 5/14/2026

841

Six SQL patterns I use to catch transaction fraud

Discussion

Commenters mainly praise the shared guide to SQL patterns for catching transaction fraud, citing its practicality and clarity. Many found it refreshing amid a flood of AI-generated content and asked for more articles like it, with particular excitement for upcoming posts on window functions and fraud logic. A few were confused about SQL's BETWEEN operator, and beginners said the guide helped them learn SQL. Overall sentiment is highly positive. A handful of comments suggested improvements or shared related experiences, but no clear trend emerged from them.
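The digest does not say what the BETWEEN confusion was about. One common stumbling block is that BETWEEN includes both endpoints, and the sketch below illustrates only that point. The txns table and its values are hypothetical and are not taken from the original guide.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO txns VALUES (?, ?)",
    [(1, 99), (2, 100), (3, 150), (4, 200), (5, 201)],
)

# BETWEEN is inclusive on both ends: amounts 100 and 200 both match.
between_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM txns WHERE amount BETWEEN 100 AND 200 ORDER BY id"
    )
]

# The equivalent explicit form uses >= and <=.
explicit_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM txns WHERE amount >= 100 AND amount <= 200 ORDER BY id"
    )
]

print(between_ids)   # ids 2, 3 and 4
print(explicit_ids)  # same rows as the BETWEEN query
```

When a half-open range is wanted, writing the two comparisons out explicitly (>= and <) avoids the ambiguity.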

102 comments

Posted in r/MachineLearning by u/Nunki08 • 5/15/2026

612

arXiv implements 1-year ban for papers containing incontrovertible evidence of unchecked LLM-generated errors, such as hallucinated references or results. [N]

News

arXiv is implementing a one-year ban for papers that contain unchecked errors generated by large language models (LLMs), such as fabricated references or results. Commenters widely support the move, and some advocate even harsher penalties given how seriously such errors breach scientific integrity. The consensus is that unchecked LLM usage amounts to data falsification and undermines trust in the entire paper. Some liken the current volume of submissions to a DDoS attack on the scientific community. The policy also implies that banned authors must get a paper accepted at a peer-reviewed venue before they may resubmit to arXiv, which commenters see as a significant deterrent.

64 comments

Posted in r/SQL by u/Purple_Lobster686 • 5/12/2026

510

I finally understood window functions and I'm not the same person anymore

Discussion

Many SQL users appreciate the efficiency and utility of window functions, especially when they realize the functions' potential to shorten and simplify code. Users also recommend using QUALIFY, particularly with Snowflake, as it allows for filtering based on the output of window functions. Some warn against overuse, as window functions may not always be the most efficient solution. A few users acknowledged the tendency to postpone learning new concepts like window functions until a specific problem necessitates their use. Overall, the sentiment is positive towards window functions, with frequent mentions of their value in improving SQL programming skills.
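As a rough illustration of filtering on a window function's output, here is a minimal sketch using Python's built-in sqlite3 module. SQLite has no QUALIFY clause, so the filter goes in an outer query; in Snowflake the same condition could sit in a QUALIFY clause without the subquery. The orders table and its rows are hypothetical, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ana", "2026-05-01", 40),
        ("ana", "2026-05-09", 75),
        ("ben", "2026-05-03", 20),
        ("ben", "2026-05-04", 55),
        ("ben", "2026-05-10", 10),
    ],
)

# Rank each customer's orders from newest to oldest, then keep rank 1.
# With QUALIFY (e.g. in Snowflake) the outer SELECT would be unnecessary:
#   ... FROM orders QUALIFY ROW_NUMBER() OVER (...) = 1
latest_per_customer = conn.execute(
    """
    SELECT customer, order_day, amount
    FROM (
        SELECT customer, order_day, amount,
               ROW_NUMBER() OVER (
                   PARTITION BY customer ORDER BY order_day DESC
               ) AS rn
        FROM orders
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer
    """
).fetchall()

for row in latest_per_customer:
    print(row)
```

The same "latest row per group" result without a window function usually takes a self-join or a correlated subquery, which is the kind of shortening the thread describes.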

62 comments

Posted in r/MachineLearning by u/NeighborhoodFatCat • 5/16/2026

491

Backlash against Arxiv's proposed 1 year ban is genuinely perplexing. [D]

Discussion

The discussion reveals a strong backlash against arXiv's proposed one-year ban on authors who publish papers with unverified references. A significant number of academics feel it is unreasonable to expect them to fact-check every reference meticulously, given the sheer volume of work and the size of some research teams. Some argue that this is a byproduct of the age of AI and that arXiv should adapt. The sentiment is mixed: some are frustrated with and resist the proposed policy, while others are concerned about the lack of thoroughness in academic referencing practices.

142 comments

Posted in r/dataengineering by u/MysteriousShoulder35 • 5/16/2026

238

honestly just so tired of explaining why we can't use LLMs for data validation

Discussion

The thread reveals a shared frustration among tech professionals about pressure to use generative AI such as LLMs for tasks that require strict logic, like data validation. The pressure often comes from upper management who, many believe, are swayed by hype rather than an understanding of the technology. Several commenters stress the importance of deterministic systems in critical infrastructure, where even a small error rate is unacceptable. Some observe the conversation among tech giants shifting toward deterministic AI and energy-based models. The sentiment is largely negative, with professionals hoping non-technical decision-makers get a reality check.
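To make "deterministic" concrete, here is a minimal sketch of rule-based validation in Python: the same row always yields the same verdict, and every failure can be traced to a named rule. The rule names, field names, and currency list are hypothetical and are not drawn from the thread.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the row passes


# Hypothetical rules for an illustrative payments row.
RULES = [
    Rule("id_present", lambda row: row.get("id") is not None),
    Rule(
        "amount_non_negative",
        lambda row: isinstance(row.get("amount"), (int, float))
        and row["amount"] >= 0,
    ),
    Rule("currency_known", lambda row: row.get("currency") in {"USD", "EUR", "GBP"}),
]


def validate(row: dict) -> list[str]:
    """Return the names of every rule the row fails, in rule order."""
    return [rule.name for rule in RULES if not rule.check(row)]


good = {"id": 1, "amount": 12.5, "currency": "USD"}
bad = {"id": None, "amount": -3, "currency": "XYZ"}

print(validate(good))  # no failures
print(validate(bad))   # all three rules fail
```

Because the checks are plain predicates, rerunning them on the same input cannot produce a different answer, which is the property commenters say critical pipelines need.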

42 comments

Posted in r/MachineLearning by u/Skye7821 • 5/17/2026

230

Slop is making me feel disconnected from AI Research [D]

Discussion

The AI research community is increasingly concerned about research quality as the field favors quantity over quality. Many participants lament the loss of creativity and hands-on experience, citing the rise of coding agents like Opus 4.7 and 5.5 as culprits. They argue that these tools, while efficient, limit cognitive abilities and make research feel robotic. The pressure to publish frequently and to work on trendy topics is also said to discourage unique, in-depth research. However, some researchers appreciate the efficiency gains and argue that AI tools can aid focus. The sentiment is mixed, with frustration about the system but appreciation for certain advancements.

80 comments

Posted in r/dataengineering by u/kudika • 5/14/2026

193

Twin brothers wipe 96 gov’t databases minutes after being fired

Discussion

Disgruntled twin brothers deleted 96 government databases after being fired, sparking a lively discussion about workplace ethics, IT security, and personal responsibility. Most users criticized the lack of security measures that allowed the breach and suggested cutting off access before termination meetings and granting only minimal, supervised permissions during transition periods. Some debated the morality of the brothers' actions, with opinions ranging from seeing no issue with dismantling corrupt organizations to stressing the personal and legal repercussions. Overall, the sentiment leaned toward the importance of better security protocols and ethical conduct.

39 comments

Posted in r/dataengineering by u/Driftcoin • 5/15/2026

172

Pyspark cheat sheet

Personal Project Showcase

The PySpark cheat sheet shared on GitHub drew mixed responses. Users appreciated the resource but were unhappy with the image's low resolution, and many requested a higher-quality version, such as a PDF. Some questioned why a cheat sheet is needed when AI agents are doing the work, indicating a potential disconnect in the original post's context. Overall sentiment still leaned positive, with users expressing gratitude for the shared resource.

17 comments

Posted in r/SQL by u/RevenuePresent9464 • 5/15/2026

Entry level jobs

Discussion

For those entering data roles, understanding the basics of SQL is crucial, particularly joins, aggregates, the difference between WHERE and HAVING, and calculations in the SELECT list. Regular practice helps you write fast, presentable queries during interviews. Recognizing bad data, understanding how data flows through your database, and making queries more efficient are also valuable skills. Deep SQL specialization is useful, but complementary skills such as scripting, reporting, or visualization are often more beneficial at the entry level. The sentiment is positive and emphasizes understanding and applying SQL basics.
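A minimal sketch of the WHERE versus HAVING distinction, again using Python's built-in sqlite3 module: WHERE filters individual rows before grouping, while HAVING filters the groups after aggregation. The sales table, its columns, and the threshold of 100 are hypothetical.

```python
import sqlite3

# Hypothetical data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, status TEXT, qty INTEGER, unit_price INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("north", "paid", 2, 50),      # revenue 100
        ("north", "paid", 1, 30),      # revenue 30
        ("north", "refunded", 5, 50),  # dropped by WHERE
        ("south", "paid", 1, 40),      # revenue 40
        ("south", "refunded", 9, 90),  # dropped by WHERE
    ],
)

rows = conn.execute(
    """
    SELECT region,
           SUM(qty * unit_price) AS revenue   -- calculation in the SELECT list
    FROM sales
    WHERE status = 'paid'                     -- row filter, applied before GROUP BY
    GROUP BY region
    HAVING SUM(qty * unit_price) > 100        -- group filter, applied after aggregation
    ORDER BY region
    """
).fetchall()

print(rows)  # only regions whose paid revenue exceeds 100
```

Moving the status condition into HAVING would fail or change the result, because status is not an aggregate and is not part of the GROUP BY.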

20 comments
