data subTLDR week 18 year 2025

r/MachineLearning, r/dataengineering, r/SQL

Decoding XKCD's SQL Injection Humor, Mastering SQL Queries for Beginners, Debunking Database Myths, The Great Tech Stack Debate in Hiring, Spark vs. Rust: The Future of Data Software

May 4, 2025 • Week 18, 2025

Posted in r/dataengineering by u/vitocomido • 5/1/2025

878

Guess skills are not transferable

Meme

Sentiment in the thread is mixed, but most commenters believe hiring managers should prioritize experience with the specific tech stack when recruiting for key data engineering roles, particularly when the new hire will be the first data engineer and is expected to make critical decisions. Some argue that skills generally transfer between cloud platforms such as Azure, AWS, and GCP, while acknowledging that each platform has its own nuances and that becoming proficient in a new one takes time. Others are concerned about unrealistic expectations: they consider it unreasonable to expect immediate full proficiency and see such hiring practices as a potential red flag.

154 comments


Posted in r/SQL by u/Original_Garbage8557 • 4/28/2025

717

Who can explain this XKCD comic for me?

Discussion

The XKCD comic in question humorously illustrates a SQL injection attack, a common security vulnerability. The character's name contains SQL syntax, so when an application inserts it into a query without safeguards, the database executes an additional command, in this case one that deletes all student records. The comic lampoons systems that fail to sanitize database inputs. Commenters responded with fond reminiscences and appreciation for its cleverness, underscoring the ongoing relevance of cybersecurity education.
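The mechanism can be reproduced in a few lines. The following is a minimal sketch, not the code of any real school system: it uses an in-memory SQLite database through Python's standard `sqlite3` module, and the helper names (`make_db`, `insert_unsafe`, `insert_safe`, `table_exists`) are hypothetical. The payload is the student name as it is widely quoted from the comic.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a Students table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Students (name TEXT)")
    return conn


def table_exists(conn, table):
    """Return True if a table with this name is present in the schema."""
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0] == 1


def insert_unsafe(conn, name):
    # Vulnerable: the input is pasted into the SQL text, so a quote in the
    # name ends the string literal and the rest is parsed as SQL.
    # executescript() runs every statement it finds in the string.
    conn.executescript(f"INSERT INTO Students (name) VALUES ('{name}');")


def insert_safe(conn, name):
    # Parameterized: the driver passes the name as data, never as SQL.
    conn.execute("INSERT INTO Students (name) VALUES (?)", (name,))
    conn.commit()


payload = "Robert'); DROP TABLE Students;--"

unsafe = make_db()
insert_unsafe(unsafe, payload)
unsafe_table_survived = table_exists(unsafe, "Students")

safe = make_db()
insert_safe(safe, payload)
safe_table_survived = table_exists(safe, "Students")
safe_rows = safe.execute("SELECT name FROM Students").fetchall()

print("unsafe insert, table still exists:", unsafe_table_survived)
print("safe insert, table still exists:", safe_table_survived)
```

With the parameterized version the odd name is simply stored as a string, which is why bound parameters are generally preferred over hand-rolled escaping.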

37 comments


Posted in r/dataengineering by u/rocketinter • 4/30/2025

310

Spark is the new Hadoop

Blog

Sentiment towards Spark and its future is mixed. Some argue that newer tools built in Rust, which offer interoperability and run natively in-process, are easier to work with and will overtake Spark. Daft, for example, can run in-process with Python or across a Ray cluster, which lets the same code scale from a single machine to a cluster. Many others believe Spark will adapt by changing its internals, pointing to projects like Apache DataFusion Comet, which offers a Rust-native replacement engine for Spark. Some commenters are skeptical that Rust will dominate and urge readers not to count out Spark or Databricks anytime soon, while others argue that future data software should be Python-first, work at any scale, and handle any modality. Databricks' growth into a mature cloud solution is also recognized, along with a call for it to keep up with these changes.

135 comments


Posted in r/MachineLearning by u/Classic_Eggplant8827 • 5/1/2025

[R] Meta releases synthetic data kit!!

News

Meta has launched a Synthetic Data Kit, a command-line interface tool designed to enhance the data preparation stage of LLM fine-tuning. The tool's four-command workflow enables the generation of high-quality synthetic training data, especially useful for task-specific reasoning in Llama-3 models. On Reddit, users shared the official repository for the Synthetic Data Kit and a Colab notebook to facilitate its use. One user queried the possibility of using Ollama for local data synthesis. The overall sentiment towards the tool was positive, with users expressing enthusiasm and practical interest in its application.

4 comments


Posted in r/SQL by u/ThrowRAhelpthebro • 5/3/2025

Help! Beginner here. How to

PostgreSQL

The discussion emphasizes SQL syntax and structure, in particular the correct use of the GROUP BY and ORDER BY clauses. Participants recommend grouping by category, counting the rows in each group, and ordering the results by that count in descending order to find the top category for R-rated movies. The sentiment is supportive, with suggestions for further improvement and learning resources. There's a consensus that practice and a solid grasp of the fundamentals are key to mastering SQL queries.
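The recommended query shape can be sketched as follows. The original post's schema is not shown in this summary, so the single `movies` table, its columns, and the sample rows are all assumptions made for illustration; the example runs against an in-memory SQLite database via Python's standard `sqlite3` module rather than PostgreSQL, although the query itself uses only standard clauses plus `LIMIT`, which both engines accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema: one row per movie.
conn.execute("CREATE TABLE movies (title TEXT, category TEXT, rating TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", "R"),
        ("Movie B", "Action", "R"),
        ("Movie C", "Action", "R"),
        ("Movie D", "Action", "PG"),
        ("Movie E", "Horror", "R"),
        ("Movie F", "Horror", "R"),
        ("Movie G", "Comedy", "R"),
        ("Movie H", "Comedy", "PG"),
        ("Movie I", "Comedy", "PG"),
    ],
)

query = """
SELECT category, COUNT(*) AS movie_count
FROM movies
WHERE rating = 'R'            -- keep only R-rated movies
GROUP BY category             -- one output row per category
ORDER BY movie_count DESC     -- largest count first
LIMIT 1                       -- keep only the top category
"""

top_category = conn.execute(query).fetchone()
print(top_category)
```

Dropping the `LIMIT 1` line returns the full ranking, which is a convenient way to check the counts before trusting the top row.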

29 comments


Posted in r/MachineLearning by u/Classic_Eggplant8827 • 5/2/2025

[R] Leaderboard Hacking

Research

A research paper titled "The Leaderboard Illusion", by researchers from Cohere and other top institutions, reports bias in large language model (LLM) benchmark evaluations. The paper details how labs test privately and selectively publish results, suggesting that Chatbot Arena rankings may be manipulated. There's a call for greater transparency in how these rankings are established. The issue has also been highlighted previously by AI researcher Andrej Karpathy. The sentiment leans towards skepticism of current practices and a desire for more reliable, unbiased measurements in AI development.

10 comments


Posted in r/SQL by u/derjanni • 4/30/2025

Do You Really Know How To SQL? What Database Engineers Actually Recommend You Should Do.

Discussion

The discussion centers on SQL practices and Postgres functionality. Participants generally rejected the post's assertions, pointing out that Postgres supports job scheduling through the pg_cron extension, which most major cloud providers offer. Many consider the extension as integral as core functionality. They also discussed Postgres' transactional DDL as a time-saving feature, and the use of DDL triggers (called event triggers in Postgres) for linting interfaces and enforcing database rules. The conversation touched on the risk of cyclic trigger patterns and the importance of teaching junior developers about it. There was some concern that increasing reliance on AI is eroding manual SQL skills.
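Transactional DDL, mentioned above, means that schema changes can be rolled back like ordinary writes. A minimal sketch follows, with one loud assumption: it uses SQLite through Python's standard `sqlite3` module as a stand-in so that it runs without a database server. SQLite also rolls back `CREATE TABLE` inside a transaction, but this is not Postgres and it says nothing about Postgres-specific behavior. The table name `audit_log` and the helper `table_exists` are hypothetical.

```python
import sqlite3

# isolation_level=None leaves transaction control to the explicit
# BEGIN / ROLLBACK statements below.
conn = sqlite3.connect(":memory:", isolation_level=None)


def table_exists(conn, table):
    """Return True if a table with this name is present in the schema."""
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0] == 1


conn.execute("BEGIN")
conn.execute("CREATE TABLE audit_log (id INTEGER PRIMARY KEY, note TEXT)")
exists_inside_transaction = table_exists(conn, "audit_log")

# Something went wrong in the migration, so undo the schema change too.
conn.execute("ROLLBACK")
exists_after_rollback = table_exists(conn, "audit_log")

print("inside transaction:", exists_inside_transaction)
print("after rollback:", exists_after_rollback)
```

The time saving commenters describe comes from this property: a migration that fails halfway leaves no partially applied schema to clean up by hand.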

7 comments


