
data subTLDR week 47 year 2025

r/MachineLearning, r/dataengineering, r/SQL

Unraveling SQL Interviews, Accelerating Oracle DB Extractions, AI-Generated SQL Skepticism, Real-World Data Engineering Projects, AI Integration Frustrations

Posted in r/MachineLearning by u/fourDnet, 11/18/2025
329

[D] Tsinghua ICLR paper withdrawn due to numerous AI generated citations

Discussion
Concerns have been raised about the quality of research published by top institutions, following the withdrawal of a Tsinghua University paper from the International Conference on Learning Representations (ICLR) due to AI-generated citations. The incident sparked a discussion about academic misconduct, particularly in China, where there is significant pressure on academics to publish frequently. Commenters highlighted evidence of fake citations and data manipulation. This has led to growing skepticism about the credibility of research emerging from certain institutions and underscores the need for rigorous review processes and academic integrity in the scientific community. The overall sentiment is negative, with worries about the impact on the field's research culture.
63 comments
Posted in r/dataengineering by u/PolicyDecent, 11/20/2025
169

Data engineers who are not building LLM to SQL. What cool projects are you actually working on?

Discussion
Many data engineers are focusing on practical, hands-on projects rather than theoretical or overly complicated ones. The highest-rated comment humorously emphasizes the reality of day-to-day work: often grappling with SQL and continuous integration and continuous deployment (CI/CD) pipelines. The second most upvoted comment comes from an engineer who contributes to GraphFrames, an open-source project that aids in handling large-scale identity resolution tasks, providing a refreshing contrast to routine data movement and ETL. Other engineers are working on improving schema evolution to prevent pipeline breakages and building internal tools to identify contract changes before they cause failures. The overall sentiment leans towards a preference for practical, real-world problem-solving over innovative yet untested solutions.
138 comments
Posted in r/MachineLearning by u/KateSaenko, 11/19/2025
143

[R] Segment Anything Model 3 (SAM 3) is released

Research
The newly released Segment Anything Model 3 (SAM 3), which detects, segments, and tracks objects in images and videos, has garnered positive feedback from the community. Users highlighted its advancements over previous versions, noting its integrated text prompting and improved segmentation capabilities. The model's rapid evolution was commended, though concerns about potential oversegmentation in cluttered environments were raised. The change to a 'vanilla' Vision Transformer encoder over a hierarchical one was questioned, but it was clarified that the selected Perception Encoder offered better semantic understanding and robustness. Separately, commenters expressed mixed feelings about Meta's recent layoffs within the team.
20 comments
Posted in r/dataengineering by u/HowSwayGotTheAns, 11/20/2025
140

Unpopular opinion (to investors) - this current zeitgeist of force AI into everything sucks

Career
The increasing obsession with integrating AI into all aspects of technology is causing frustration and disillusionment among data engineers. Many feel that the trend, driven by corporate leadership and investors, seems misguided and is not solving meaningful problems. The forced implementation often leads to inefficiency and resource waste, as many companies lack a clear plan for leveraging AI. There's a sentiment that the push for AI is more about hype and less about product enhancement. Some also criticize the replacement of algorithmic solutions with statistical models. Overall, the sentiment leans negative, expressing dissatisfaction with the current state of AI usage in tech.
49 comments
Posted in r/MachineLearning by u/kepoinerse, 11/18/2025
113

[P] PapersWithCode's new open-source alternative: OpenCodePapers

Project
The open-source project OpenCodePapers is well-received for reviving the core functionality of the now-defunct PapersWithCode (PwC). The project is praised for its open list of tasks, a marked improvement over PwC's often-criticized sorting system. However, the community suggests acquiring a proper domain for credibility and ease of access. One user has already purchased a domain as a contribution, with plans to transfer it to the project. Despite initial hurdles in promoting the project, the community recognizes its value, especially given the perceived lack of interest in benchmarks within the machine learning community. The sentiment is overwhelmingly positive.
19 comments
Posted in r/dataengineering by u/Few_Noise2632, 11/19/2025
105

why all data catalogs suck?

Discussion
The discussion centers on the inefficiencies of data catalogs. Most participants agree that the root of the problem lies in poor data management practices rather than the tools themselves. A key insight is that data should be cataloged at the point of development, with documentation integrated into the development process. Some participants have found success with self-hosted OpenMetadata, but acknowledge it requires rigorous enforcement of documentation rules and substantial ETL work. Others note that despite investment in extensive data catalogs, the tools often remain underused, suggesting the need for better user engagement. Overall sentiment is mixed, with an inclination towards solutions centered on improved data management practices.
47 comments
Posted in r/SQL by u/fokass, 11/19/2025
71

Had a sql interview today

Discussion
In a discussion about SQL job interviews, many participants shared their experiences and insights. There was a consensus that not being able to answer every question doesn't necessarily result in a failed interview. Interviewers often look beyond technical skills, focusing on the candidate's thought process, problem-solving approach, and how they handle stress or failure. Clean, economical query structures were highlighted as more important than 100% accuracy. Participants also noted that even experienced SQL professionals frequently consult documentation, implying that perfect recall isn't expected. Some cautioned about companies that slow down candidates during timed quizzes. The overall sentiment was mixed but leaned towards positive.
48 comments
Posted in r/SQL by u/Leather-Pin-9154, 11/17/2025
68

Need advice: Extracting 1 TB table → CSV is taking 10+ hours… any faster approach?

Oracle
There's a lively debate on optimizing Oracle DB extraction into CSV. Suggestions include avoiding a single full-table scan and instead chunking by rowid ranges while running parallel sessions. Pre-staging data in a binary format and then converting to CSV after extraction reportedly cuts runtime by ~70%. If possible, unload on the DB server using Data Pump, which has parallel, server-side unload capabilities. If CSV format is necessary, run parallel queries, each writing its own CSV. There's also advice to clarify the end goal, since alternative solutions may exist. The overall sentiment is constructive, with users eager to provide various solutions.
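The "parallel queries, each writing its own CSV" suggestion can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the approach any commenter posted: Python's built-in sqlite3 stands in for Oracle so the example runs anywhere, integer primary-key ranges stand in for rowid ranges, and the table name big_table, its columns, and all function names are hypothetical.

```python
import csv
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def key_ranges(lo, hi, chunks):
    """Split the inclusive key range [lo, hi] into contiguous sub-ranges."""
    step = -(-(hi - lo + 1) // chunks)  # ceiling division
    return [(start, min(start + step - 1, hi)) for start in range(lo, hi + 1, step)]


def export_chunk(db_path, out_dir, lo, hi):
    """Open a private connection and stream one key range into its own CSV."""
    out_file = Path(out_dir) / f"part_{lo}_{hi}.csv"
    conn = sqlite3.connect(db_path)  # one connection (session) per worker
    try:
        cursor = conn.execute(
            # big_table, id and payload are hypothetical stand-in names
            "SELECT id, payload FROM big_table WHERE id BETWEEN ? AND ? ORDER BY id",
            (lo, hi),
        )
        with open(out_file, "w", newline="") as handle:
            csv.writer(handle).writerows(cursor)  # rows are streamed, not buffered
    finally:
        conn.close()
    return out_file


def parallel_export(db_path, out_dir, lo, hi, workers=4):
    """Export [lo, hi] as one CSV per range, with ranges processed concurrently."""
    ranges = key_ranges(lo, hi, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(export_chunk, db_path, out_dir, a, b) for a, b in ranges]
        return [future.result() for future in futures]


def demo():
    """Build a 10-row stand-in table, export it in chunks, and read the CSVs back."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = str(Path(tmp) / "demo.db")
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
        conn.executemany(
            "INSERT INTO big_table VALUES (?, ?)",
            [(i, f"row-{i}") for i in range(1, 11)],
        )
        conn.commit()
        conn.close()
        paths = parallel_export(db_path, tmp, lo=1, hi=10, workers=4)
        rows = []
        for path in paths:
            with open(path, newline="") as handle:
                rows.extend(tuple(record) for record in csv.reader(handle))
        return [path.name for path in paths], rows
```

The point of the design is that each worker holds its own connection and its own output file, so no session waits on another and the parts can be concatenated or loaded independently afterwards. Against a real Oracle database, the range predicate and the driver would differ, and whether threads or separate processes work better depends on the client library.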
50 comments
Posted in r/SQL by u/Crust_Issues1319, 11/21/2025
37

Do you trust AI-generated SQL? Tell me your horror stories.

Discussion
There's widespread skepticism towards AI-generated SQL. Users trust it for drafting queries or boilerplate joins and case statements, but not enough to run the output unreviewed, because the results are occasionally inaccurate. Even experienced users report frustration and wasted time when using AI for SQL, particularly for complex tasks. The AI occasionally alters logic without notice, introducing errors that are difficult to track down and fix. Many see AI as useful only for simple tasks like building case statements from a table or formatting syntax. However, a few find it helpful for specific tasks like data transformation, even though the code isn't always efficient. Overall sentiment is mixed, leaning negative.
87 comments