data subTLDR week 47 year 2025
r/MachineLearning, r/dataengineering, r/SQL
Unraveling SQL Interviews, Accelerating Oracle DB Extractions, AI-Generated SQL Skepticism, Real-World Data Engineering Projects, AI Integration Frustrations
Posted in r/MachineLearning by u/fourDnet • 11/18/2025
329
[D] Tsinghua ICLR paper withdrawn due to numerous AI generated citations
Discussion
Concerns have been raised about the quality of research published by top institutions, following the withdrawal of a Tsinghua University paper from the International Conference on Learning Representations (ICLR) due to AI-generated citations. The incident sparked a discussion about academic misconduct, particularly in China, where academics face significant pressure to publish frequently. Commenters highlighted evidence of fake citations and data manipulation. The result is growing skepticism about the credibility of research from certain institutions, and renewed calls for rigorous review processes and academic integrity. The overall sentiment is negative, with worries about the impact on the field's research culture.
Posted in r/dataengineering by u/PolicyDecent • 11/20/2025
169
Data engineers who are not building LLM to SQL. What cool projects are you actually working on?
Discussion
Many data engineers are focusing on practical, hands-on projects rather than theoretical or overly complicated ones. The highest-rated comment humorously emphasizes the reality of day-to-day work: mostly grappling with SQL and continuous integration and continuous deployment (CI/CD) pipelines. The second most upvoted comment comes from an engineer who contributes to GraphFrames, an open-source project that helps with large-scale identity resolution, a refreshing contrast to routine data movement and ETL. Other engineers are working on improving schema evolution to prevent pipeline breakages and building internal tools that detect contract changes before they reach production. The overall sentiment leans towards practical, real-world problem-solving over innovative yet untested solutions.
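To make the idea of catching contract changes early more concrete, here is a minimal sketch in Python. It is not taken from the thread: the function names `diff_schema` and `is_breaking`, the column names, and the rule that removed or retyped columns count as breaking are all assumptions chosen for illustration.

```python
def diff_schema(expected, observed):
    """Compare two {column_name: type_name} mappings.

    `expected` is the contract a downstream consumer relies on;
    `observed` is what the upstream source currently exposes.
    """
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        "removed": sorted(expected_cols - observed_cols),
        "added": sorted(observed_cols - expected_cols),
        "retyped": sorted(
            col
            for col in expected_cols & observed_cols
            if expected[col] != observed[col]
        ),
    }


def is_breaking(diff):
    """Assumed policy: added columns are tolerated, removed or retyped ones are not."""
    return bool(diff["removed"] or diff["retyped"])


# Hypothetical contract and a hypothetical upstream change.
contract = {"id": "int", "email": "str", "created_at": "timestamp"}
upstream = {"id": "int", "email": "str", "created_at": "str", "plan": "str"}

report = diff_schema(contract, upstream)
print(report, "breaking" if is_breaking(report) else "compatible")
```

A check like this could run in CI against a staging copy of the source, so a retyped column fails a build instead of a nightly pipeline.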
Posted in r/MachineLearning by u/KateSaenko • 11/19/2025
143
[R] Segment Anything Model 3 (SAM 3) is released
Research
The newly released Segment Anything Model (SAM) 3, which detects, segments, and tracks objects in images and videos, has garnered positive feedback from the community. Users highlighted its advancements over previous versions, noting its integrated text prompting and improved segmentation capabilities. The model's rapid evolution was commended, though concerns about potential oversegmentation in cluttered environments were raised. The change to a 'vanilla' Vision Transformer encoder over a hierarchical one was questioned, but it was clarified that the selected Perception Encoder offered better semantic understanding and robustness. Nevertheless, the discussion revealed mixed feelings about Meta's recent layoffs within the team.
Posted in r/dataengineering by u/HowSwayGotTheAns • 11/20/2025
140
Unpopular opinion (to investors) - this current zeitgeist of force AI into everything sucks
Career
The increasing obsession with integrating AI into all aspects of technology is causing frustration and disillusionment among data engineers. Many feel that the trend, driven by corporate leadership and investors, seems misguided and is not solving meaningful problems. The forced implementation often leads to inefficiency and resource waste, as many companies lack a clear plan for leveraging AI. There's a sentiment that the push for AI is more about hype and less about product enhancement. Some also criticize the replacement of algorithmic solutions with statistical models. Overall, the sentiment leans negative, expressing dissatisfaction with the current state of AI usage in tech.
Posted in r/MachineLearning by u/kepoinerse • 11/18/2025
113
[P] PapersWithCode's new open-source alternative: OpenCodePapers
Project
The open-source project OpenCodePapers is well-received for reviving the core functionality of the now-defunct PapersWithCode (PwC). The project is praised for its open list of tasks, a marked improvement over PwC's often-criticized sorting system. However, the community suggests acquiring a proper domain for credibility and ease of access. One user has already purchased a domain as a contribution, with plans to transfer it to the project. Despite initial hurdles in promoting the project, the community recognizes its value, especially given a perceived lack of interest in benchmarks within the machine learning community. The sentiment is overwhelmingly positive.
Posted in r/dataengineering by u/Few_Noise2632 • 11/19/2025
105
why all data catalogs suck?
Discussion
The discussion centers on the shortcomings of data catalogs. Most participants agree that the root of the problem lies in poor data management practices rather than the tools themselves. A key insight is that data should be cataloged at the point of development, with documentation integrated into the development process. Some participants have found success with self-hosted OpenMetadata, but acknowledge it requires rigorous enforcement of documentation rules and substantial ETL work. Others report that despite heavy investment in data catalogs, the tools often remain underused, suggesting a need for better user engagement. Overall sentiment is mixed, with an inclination towards solutions centered on improved data management practices.
Posted in r/SQL by u/fokass • 11/19/2025
71
Had a sql interview today
Discussion
In a discussion about SQL job interviews, many participants shared their experiences and insights. There was a consensus that not being able to answer every question doesn't necessarily result in a failed interview. Interviewers often look beyond technical skills, focusing on the candidate's thought process, problem-solving approach, and how they handle stress or failure. Clean, economical query structures were highlighted as more important than 100% accuracy. Participants also noted that even experienced SQL professionals frequently consult documentation, implying that perfect recall isn't expected. Some cautioned about companies that slow down candidates during timed quizzes. The overall sentiment was mixed but leaned towards positive.
Posted in r/SQL by u/Leather-Pin-9154 • 11/17/2025
68
Need advice: Extracting 1 TB table → CSV is taking 10+ hours… any faster approach?
Oracle
There's a lively debate on optimizing Oracle DB extraction into CSV. Suggestions include avoiding a single full-table scan and instead chunking the table by rowid ranges, with parallel sessions each handling one chunk. Pre-staging data in a binary format and converting to CSV after the extract reportedly cuts runtime by ~70%. If possible, unload on the DB server using Data Pump, which has parallel, server-side unload capabilities. If CSV format is necessary, run parallel queries that each write their own CSV file. There's also advice to clarify the end goal, since an alternative solution may make the CSV export unnecessary. The overall sentiment is constructive, with users eager to provide various solutions.
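The "chunk by rowid ranges, one CSV per parallel session" suggestion can be sketched as follows. This is an illustration under stated assumptions, not the thread's code: it uses Python's built-in `sqlite3` as a stand-in for Oracle so the example runs anywhere, and the helper names `rowid_ranges`, `export_chunk`, and `parallel_export` are hypothetical. SQLite rowids are plain integers, whereas Oracle ROWIDs are not, so the range computation would differ on a real Oracle table.

```python
import csv
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def rowid_ranges(lo, hi, chunks):
    """Split the inclusive range [lo, hi] into at most `chunks` contiguous ranges."""
    step = max(1, -(-(hi - lo + 1) // chunks))  # ceiling division
    return [(start, min(start + step - 1, hi)) for start in range(lo, hi + 1, step)]


def export_chunk(db_path, table, bounds, out_dir):
    """Export one rowid range to its own CSV file over a dedicated connection."""
    lo, hi = bounds
    out_path = os.path.join(out_dir, f"{table}_{lo}_{hi}.csv")
    conn = sqlite3.connect(db_path)  # one session per worker
    try:
        # `table` is interpolated directly: only pass trusted identifiers.
        cur = conn.execute(
            f"SELECT * FROM {table} WHERE rowid BETWEEN ? AND ?", (lo, hi)
        )
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row
            while True:
                rows = cur.fetchmany(10_000)  # stream in batches, not all at once
                if not rows:
                    break
                writer.writerows(rows)
    finally:
        conn.close()
    return out_path


def parallel_export(db_path, table, out_dir, workers=4):
    """Export `table` as one CSV file per rowid range, using `workers` threads."""
    conn = sqlite3.connect(db_path)
    try:
        lo, hi = conn.execute(f"SELECT MIN(rowid), MAX(rowid) FROM {table}").fetchone()
    finally:
        conn.close()
    if lo is None:  # empty table
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(
                lambda bounds: export_chunk(db_path, table, bounds, out_dir),
                rowid_ranges(lo, hi, workers),
            )
        )
```

Each worker opens its own connection and writes its own file, so no session waits on another's output; the per-chunk files can be concatenated or loaded independently afterwards.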
Posted in r/SQL by u/Crust_Issues1319 • 11/21/2025
37
Do you trust AI-generated SQL? Tell me your horror stories.
Discussion
There's widespread skepticism towards AI-generated SQL. Users trust it for drafting queries or boilerplate joins and case statements, but not enough to run the output blindly, because the results are occasionally wrong. Even experienced users report frustration and wasted time when using AI for SQL, particularly for complex tasks. The AI occasionally alters logic without notice, introducing errors that are difficult to find and fix. Most see AI as useful only for simple tasks like building case statements from a table or formatting syntax. A few do find it helpful for specific tasks like data transformation, even though the code isn't always efficient. Overall sentiment is mixed, leaning negative.
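One way to act on the "don't run it blindly" advice is to compare a generated query against a trusted one on a small fixture before using it. The sketch below is an assumption-laden illustration rather than anything proposed in the thread: the helper `same_result`, the tables, and both queries are hypothetical, and it uses an in-memory SQLite database for portability.

```python
import sqlite3
from collections import Counter


def same_result(conn, trusted_sql, candidate_sql):
    """Return True if both queries yield the same multiset of rows (order ignored)."""
    trusted = Counter(conn.execute(trusted_sql).fetchall())
    candidate = Counter(conn.execute(candidate_sql).fetchall())
    return trusted == candidate


# Small hypothetical fixture: one customer has no orders.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'a'), (2, 'b');
    INSERT INTO orders VALUES (1, 1, 10.0);
    """
)

# Trusted query: every customer with an order count, including zero.
trusted = """
    SELECT c.id, COUNT(o.id) FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id GROUP BY c.id
"""
# Candidate with silently altered logic: INNER JOIN drops customers without orders.
candidate = """
    SELECT c.id, COUNT(o.id) FROM customers c
    JOIN orders o ON o.customer_id = c.id GROUP BY c.id
"""

print(same_result(conn, trusted, candidate))
```

The check only proves agreement on the fixture, so the fixture needs to cover the edge cases that matter, such as NULLs, duplicates, and unmatched rows.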