data subTLDR week 18 year 2025
r/MachineLearning, r/dataengineering, r/SQL
Decoding XKCD's SQL Injection Humor, Mastering SQL Queries for Beginners, Debunking Database Myths, The Great Tech Stack Debate in Hiring, Spark vs. Rust: The Future of Data Software
Week 18, 2025
Posted in r/dataengineering by u/vitocomido • 5/1/2025
878
Guess skills are not transferable
Meme
Sentiment in the discussion is mixed, but most commenters believe hiring managers should prioritize experience with the specific tech stack when recruiting for key data engineering roles, particularly when the new hire will be the first data engineer and is expected to make critical decisions. Some argue that skills are generally transferable between cloud platforms such as Azure, AWS, and GCP, while acknowledging that each platform has its own nuances and that becoming proficient in a new one takes time. Others worry about unrealistic expectations for new hires: they consider it unreasonable to expect immediate full proficiency and see such hiring practices as a potential red flag.
Posted in r/SQL by u/Original_Garbage8557 • 4/28/2025
717
Who can explain this XKCD comic for me?
Discussion
The XKCD comic in question humorously illustrates a SQL injection attack, a common security vulnerability. The highlighted character's name contains a fragment of SQL; when the name is inserted unescaped into a query, the database executes it as an additional command, in this case one that deletes all student records. The comic lampoons systems that lack proper safeguards or fail to sanitize database inputs. Commenters responded with fond reminiscences and appreciation for its cleverness, underscoring the ongoing relevance of security education.
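The mechanism behind the joke can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module, not the setup from the comic or the thread: the `students` table, its single `name` column, and the helper functions are hypothetical, and `executescript` is used only because it allows several statements in one string, which makes the unsafe path easy to demonstrate.

```python
import sqlite3

# The name from the comic, adapted to a hypothetical lowercase table name.
PAYLOAD = "Robert'); DROP TABLE students;--"


def fresh_db():
    """Create an in-memory database with a hypothetical students table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (name TEXT)")
    return conn


def table_exists(conn, table):
    """Return True if a table with this name is present in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def insert_unsafe(conn, name):
    # Vulnerable: the name is pasted into the SQL text, so a quote in the
    # input ends the string literal and the rest runs as SQL.
    conn.executescript(f"INSERT INTO students (name) VALUES ('{name}');")


def insert_safe(conn, name):
    # Parameterized: the driver passes the name as data, never as SQL text.
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))
    conn.commit()


if __name__ == "__main__":
    unsafe = fresh_db()
    insert_unsafe(unsafe, PAYLOAD)
    print("table survives unsafe insert:", table_exists(unsafe, "students"))

    safe = fresh_db()
    insert_safe(safe, PAYLOAD)
    print("table survives safe insert:", table_exists(safe, "students"))
```

With the unsafe helper, the payload closes the string literal and appends a `DROP TABLE` statement; with the parameterized helper, the same text is stored as an ordinary, if unusual, name. Bound parameters are generally a more robust defense than trying to escape input by hand.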
Posted in r/dataengineering by u/rocketinter • 4/30/2025
310
Spark is the new Hadoop
Blog
Sentiment towards Spark and its future is mixed. Some argue that newer tools built in Rust, which offer interoperability and run natively in-process, are easier to work with and will overtake Spark. Daft, for example, can run in-process with Python or across a Ray cluster, which lets the same code scale from a single machine to many. Others expect Spark to adapt by changing its internals, pointing to Apache DataFusion Comet, which offers a Rust-native replacement engine for Spark. Some commenters are skeptical that Rust will dominate and urge not to count out Spark or Databricks anytime soon, while others stress that future data software should be Python-first, work at any scale, and handle any modality. Databricks' growth into a mature cloud solution is also recognized, though there is a call for it to keep pace with these changes.
Posted in r/MachineLearning by u/Classic_Eggplant8827 • 5/1/2025
84
[R] Meta releases synthetic data kit!!
News
Meta has launched a Synthetic Data Kit, a command-line interface tool designed to enhance the data preparation stage of LLM fine-tuning. The tool's four-command workflow enables the generation of high-quality synthetic training data, especially useful for task-specific reasoning in Llama-3 models. On Reddit, users shared the official repository for the Synthetic Data Kit and a Colab notebook to facilitate its use. One user queried the possibility of using Ollama for local data synthesis. The overall sentiment towards the tool was positive, with users expressing enthusiasm and practical interest in its application.
Posted in r/SQL by u/ThrowRAhelpthebro • 5/3/2025
78
Help! Beginner here. How to
PostgreSQL
The discussion emphasizes SQL syntax and structure, with a focus on using the GROUP BY and ORDER BY clauses correctly. To find the top category for R-rated movies, participants recommend grouping by category, counting the rows in each group, and ordering the results by that count in descending order. The tone is supportive, with suggestions for improvement and pointers to learning resources. The consensus is that practice and a solid grasp of the fundamentals are the keys to mastering SQL queries.
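The recommended pattern can be sketched as follows. The original poster's schema is not shown in this summary, so the single `movies` table, its columns, and the sample rows are assumptions made for illustration, and the `LIMIT 1` is added here to return only the top row. The query runs on SQLite through Python's built-in sqlite3 module, and the same statement should also be valid in PostgreSQL.

```python
import sqlite3

# Hypothetical, simplified schema: one row per movie.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, rating TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "R", "Action"),
        ("Movie B", "R", "Action"),
        ("Movie C", "R", "Drama"),
        ("Movie D", "PG", "Drama"),
        ("Movie E", "PG", "Drama"),
    ],
)

# Filter to R-rated rows, group by category, count each group,
# then sort by the count in descending order and keep the first row.
TOP_R_CATEGORY = """
SELECT category, COUNT(*) AS movie_count
FROM movies
WHERE rating = 'R'
GROUP BY category
ORDER BY movie_count DESC
LIMIT 1
"""

top = conn.execute(TOP_R_CATEGORY).fetchone()
print(top)  # ('Action', 2)
```

The `WHERE` clause filters rows before grouping, so PG-rated dramas do not inflate the Drama count. If ties matter, a second sort key such as `category` makes the result deterministic.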
Posted in r/MachineLearning by u/Classic_Eggplant8827 • 5/2/2025
73
[R] Leaderboard Hacking
Research
A research paper titled "The Leaderboard Illusion", by Cohere and other top institutions, reveals bias in large language model (LLM) benchmark evaluations. The paper details how labs test models privately and selectively publish results, suggesting that Chatbot Arena rankings may be skewed. It calls for greater transparency in how these rankings are established. The issue has also been highlighted previously by AI researcher Andrej Karpathy. Sentiment leans towards skepticism of current practices and a desire for more reliable, unbiased measurements in AI development.
Posted in r/SQL by u/derjanni • 4/30/2025
60
Do You Really Know How To SQL? What Database Engineers Actually Recommend You Should Do.
Discussion
The discussion centers on SQL practices and Postgres functionality. Participants generally rejected the post's assertions, pointing out that scheduling is available in Postgres through the pg_cron extension, which most major cloud providers support. Many regard the extension as being as integral as core functionality. They also discussed Postgres' transactional DDL as a time-saving feature, and the use of DDL triggers for linting interfaces and enforcing database rules. The conversation touched on the risk of cyclic trigger patterns and the importance of educating junior developers about them. There was some concern that manual SQL skills are being lost as reliance on AI increases.