Hi HN! I'm currently a Master's student at USTC (University of Science and Technology of China). I've been diving deep into Data Engineering, especially in the context of Large Language Models (LLMs).

The Problem:
I found that learning resources for modern data engineering are often fragmented and scattered across hundreds of Medium articles or disjointed tutorials. It's hard to piece everything together into a coherent system.

The Solution:
I decided to open-source my learning notes and build them into a structured book. My goal is to help developers fast-track their learning curve.

Key Features:
- LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.
- Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").
- Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.

This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!

Check it out:
Online: https://datascale-ai.github.io/data_engineering_book/
GitHub: https://github.com/datascale-ai/data_engineering_book
I am a complete novice in training LLMs, and have been trying to train a novel architecture for Python code generation, using Apple Silicon.
To be honest, I've been a bit frustrated that the data tools don't seem to have any focus on code; their modalities are generic text and images. For synthetic data generation I would love to use EBNF-constrained outputs, but SGLang is not available on macOS. So I feel a bit stuck: downloading a large corpus of Python code, running into APFS issues, then sharding, custom classifying, custom cleaning, custom mixing, etc. Maybe I've missed a tool, but I'm surprised there aren't pre-tagged, pre-categorized, pre-filtered datasets for code where I can just tune the curriculum and filters that feed into training.
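For illustration, here is a minimal sketch of the kind of tag-then-filter step described above, using only Python's standard `ast` module. The tag set (`CodeTags`), the helper names (`tag_source`, `keep`), and the thresholds are all hypothetical choices made for this example, not part of any existing dataset tool.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeTags:
    """Hypothetical per-file tags; a real pipeline would likely track more."""
    parses: bool          # does the file parse into an AST at all?
    n_functions: int      # function and method definitions (sync and async)
    n_classes: int        # class definitions
    has_docstrings: bool  # at least one module/class/function docstring
    n_lines: int          # number of source lines


def tag_source(source: str) -> CodeTags:
    """Compute cheap, syntax-level tags for one Python source string."""
    n_lines = len(source.splitlines())
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # Unparseable files are tagged rather than dropped, so the
        # decision to exclude them stays in the filter step.
        return CodeTags(False, 0, 0, False, n_lines)

    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    n_functions = n_classes = 0
    has_docstrings = ast.get_docstring(tree) is not None
    for node in ast.walk(tree):
        if isinstance(node, func_types):
            n_functions += 1
        elif isinstance(node, ast.ClassDef):
            n_classes += 1
        else:
            continue
        if ast.get_docstring(node) is not None:
            has_docstrings = True
    return CodeTags(True, n_functions, n_classes, has_docstrings, n_lines)


def keep(tags: CodeTags, min_functions: int = 1,
         require_docstrings: bool = False, max_lines: int = 2000) -> bool:
    """Tunable filter over precomputed tags (thresholds are assumptions)."""
    return (
        tags.parses
        and tags.n_functions >= min_functions
        and tags.n_lines <= max_lines
        and (tags.has_docstrings or not require_docstrings)
    )


# Tiny in-memory stand-in for a corpus shard.
corpus = {
    "good.py": 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n',
    "broken.py": "def oops(:\n    pass\n",
    "script.py": "print('hello')\n",
}
tagged = {name: tag_source(src) for name, src in corpus.items()}
kept = [name for name, tags in tagged.items() if keep(tags)]
print(kept)
```

Because tagging is separated from filtering, the tags can be computed once per shard and stored, and only the arguments to `keep` need to change when the curriculum or filters are tuned.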