Advent of Haystack

Try out Haystack 2.0-Beta to discover what’s coming in the next major release, with 10 challenges in the month of December 🎉

Every few days, one of the doors on this page will open to reveal a new challenge

Submit your results and discuss solutions with the community 🎄

The Haystack elves live in the forest. Every year, after winter, Elf Bilge writes a detailed report on their winter preparations, food collection, memorable moments, and the lessons learned. Each year, other curious elves seek her guidance, asking questions like “Which foods should we collect?” or “What should we do against water scarcity?” 🌲

This year, Elf Bilge has an idea: build a generative system that stands in for her, so elves can shoot questions and get elf-style answers. As she plays with LLMs, she realizes these winter reports are too big to just throw at an LLM. Also, not every part of a report is relevant to a given question. Being a Haystack elf, she knows how to solve this issue: PREPROCESSING! 💡

So, she comes up with a plan. Elf Bilge will convert all report files into Haystack Documents, break them into smaller bits, create semantic doodads (embeddings), and toss them into a document store. That way, she can later use these docs in the RAG pipeline for her generative system. 🌟

For this challenge, you must help Elf Bilge create a pipeline to preprocess documents and index them to the document store with their embeddings.

🎯 Requirements:

  • Each split should have 200 words, with an overlap of 50 words between consecutive splits.
  • Use all winter reports (winter_report_one.txt, winter_report_two.pdf, winter_report_three.md)
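To picture what the first requirement means, here is a minimal plain-Python sketch of word-based splitting with overlap. It is not Haystack code: `split_words` is a hypothetical helper written only for illustration, and the splitter component you pick in Haystack may treat the final, shorter chunk differently.

```python
def split_words(text: str, split_length: int = 200, split_overlap: int = 50) -> list[str]:
    """Split text into chunks of `split_length` words, where each chunk
    repeats the last `split_overlap` words of the previous one.

    Hypothetical helper for illustration only, not part of Haystack.
    """
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # each chunk starts this many words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# A made-up 500-word "report": chunks start at word 0, 150, and 300.
report = " ".join(f"w{i}" for i in range(500))
chunks = split_words(report)
```

With 200-word splits and a 50-word overlap, each new chunk starts 150 words after the previous one, so neighbouring chunks share 50 words of context.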

🧑 Some Hints:

  • Use FileTypeRouter to route files to the correct converters
  • Use DocumentJoiner to join documents from multiple converters into one list of documents.
  • You have seen how to connect components in Day 1.
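The first two hints describe a "fan out, then merge" shape: files are sorted by type, each type goes to its own converter, and the converted documents are merged back into one list. The sketch below mimics that shape in plain Python so the data flow is easy to see. It does not use Haystack; `SUFFIX_TO_MIME`, `route_by_type`, and `join_documents` are hypothetical names, and the strings standing in for converted documents are placeholders.

```python
from collections import defaultdict
from pathlib import Path

# Assumed mapping for this illustration: file suffix -> MIME type.
SUFFIX_TO_MIME = {
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".md": "text/markdown",
}


def route_by_type(paths):
    """Group file paths by MIME type (the 'router' step)."""
    routed = defaultdict(list)
    for path in map(Path, paths):
        mime = SUFFIX_TO_MIME.get(path.suffix.lower(), "unclassified")
        routed[mime].append(path)
    return dict(routed)


def join_documents(*document_lists):
    """Merge several lists of documents into one flat list (the 'joiner' step)."""
    joined = []
    for documents in document_lists:
        joined.extend(documents)
    return joined


reports = ["winter_report_one.txt", "winter_report_two.pdf", "winter_report_three.md"]
routed = route_by_type(reports)

# Placeholder "converters": a real pipeline would produce Documents here.
text_docs = [f"text:{p.name}" for p in routed.get("text/plain", [])]
pdf_docs = [f"pdf:{p.name}" for p in routed.get("application/pdf", [])]
md_docs = [f"md:{p.name}" for p in routed.get("text/markdown", [])]

all_docs = join_documents(text_docs, pdf_docs, md_docs)
```

In the actual challenge, the router, converters, and joiner are Haystack components that you connect inside a pipeline, as in Day 1, rather than functions you call by hand.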

💚 Here is the Starter Colab