Warsaw Events: Scrape Waw4free.pl

by Admin
Warsaw Event Scraper: Adding waw4free.pl

Executive Summary

Let's get this party started, guys! We want to add waw4free.pl as a new source for our event scraper, covering free events in Warsaw, Poland. This is a big win because it helps us find even more cool stuff for you to do. Think of it as a local guide just for Warsaw, similar to what Karnet already gives us for Kraków. The site is in Polish, so we'll need Polish date parsing and category mapping, but don't worry, we've got a plan! We estimate a medium level of work, comparable to the Karnet scraper.

  • Website: https://waw4free.pl/
  • Target: Free Events in Warsaw, Poland
  • Language: Polish
  • Priority: 30-40 (Local/Regional)
  • Pattern: Multi-stage HTML scraper
  • Complexity: Medium

waw4free.pl is the go-to spot for free events in Warsaw: think concerts, workshops, exhibitions, theater, sports, and family fun. Each listing has the details you need: where, when, what, and even a link to the event organizer. Adding it will be super helpful for everyone looking for free things to do in Warsaw.


Website Technical Deep Dive

Alright, let's dive into the technical details, so you know how we'll get this done.

URL Structure

It's all about how the website is set up, so we can grab all the juicy event details.

  • Homepage: https://waw4free.pl/
  • Event Detail Pages: /wydarzenie-{id}-{slug} (e.g., /wydarzenie-144172-black-maze-4-labirynt-strachu)
  • Category Listing: /warszawa-darmowe-{category} (e.g., /warszawa-darmowe-koncerty)
  • Event ID: We'll grab this from the URL (like 144172).
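To make the URL pattern concrete, here is a minimal Python sketch of pulling the ID and slug out of an event detail URL. It is an illustration only: the real scraper would live in the Elixir modules described later, and the function name `parse_event_url` and the exact regex are assumptions based on the single example URL above.

```python
import re
from urllib.parse import urlparse

# Pattern taken from the URL structure above: /wydarzenie-{id}-{slug}
# Assumption: the ID is purely numeric and the slug contains no slashes.
EVENT_PATH = re.compile(r"^/wydarzenie-(\d+)-([^/]+)/?$")


def parse_event_url(url):
    """Return (event_id, slug) for an event detail URL, or None otherwise."""
    path = urlparse(url).path
    match = EVENT_PATH.match(path)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

Calling `parse_event_url` on a category listing URL returns `None`, which gives the caller a cheap way to tell the two page types apart.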

Data Fields We'll Be Snagging

Here's what info we'll be pulling from each event page. It's the good stuff!

Event Detail Pages Provide:

  • ✅ Title (the headline)
  • ✅ Category (like "concerts" or "workshops")
  • ✅ Date (in Polish, of course: "poniedziałek, 3 listopada 2025")
  • ✅ Time (24-hour format: "15:00")
  • ✅ Venue name and address (e.g., "Galeria Północna, ul. Światowida 17")
  • ✅ District (Warsaw areas: Białołęka, Praga-Południe, Śródmieście, etc.)
  • ✅ Full description (in HTML, so we get all the formatting)
  • ✅ Event image (a picture is worth a thousand words!)
  • ✅ Source URL (link to the event organizer's website)
  • ✅ Google Maps link (for finding the place)
  • ⚠️ Sometimes, there's info about voluntary donations: "(dobrowolna zrzutka za udział)"

Category Listing Pages:

  • Event cards with titles, categories, dates, times, districts
  • 60+ events per category page
  • No pagination (all events load on one page)
  • Multiple categories per event possible
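Since a listing page carries every event card on a single page, collecting event URLs comes down to one pass over the anchors. The sketch below shows one way to do that in Python using only the standard library. The markup of the real page has not been verified here, so it simply assumes that each card links to its detail page through an `<a href>` matching the event URL pattern. The names `EventLinkCollector` and `extract_event_links` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: every event card links to /wydarzenie-{id}-{slug}.
EVENT_HREF = re.compile(r"/wydarzenie-(\d+)-[^/\"'?#]+")


class EventLinkCollector(HTMLParser):
    """Collects event detail links, keeping one link per event ID."""

    def __init__(self):
        super().__init__()
        self.links = {}  # event ID -> path, in first-seen order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = EVENT_HREF.search(href)
        if match:
            # A card may link to the same event twice (image and title).
            self.links.setdefault(match.group(1), match.group(0))


def extract_event_links(html):
    """Return the unique event detail paths found in a listing page."""
    collector = EventLinkCollector()
    collector.feed(html)
    return list(collector.links.values())
```

Keying on the event ID rather than the full href means an event that appears more than once on the page is only queued once.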

Event Categories (Polish)

These are the types of events we can find, all in Polish, naturally.

  • koncerty (concerts)
  • warsztaty (workshops)
  • wystawy (exhibitions)
  • teatr (theater)
  • sport (sports)
  • dla-dzieci (for children)
  • festiwale (festivals)
  • inne (other)

Warsaw Districts

We'll make sure to note which part of Warsaw the event is in.

Białołęka, Praga-Południe, Śródmieście, Wawer, Wilanów, Żoliborz, Mokotów, Ursynów, Wola, Targówek, Bemowo, Bielany, Ochota, Rembertów, Wesoła, Włochy, Ursus


The Nitty-Gritty: Technical Requirements

Here’s how we'll build this thing, step by step.

1. Polish Language Support

Date Parser Plugin

We need a way to read those Polish dates. It’s a whole new file we'll need to create!

New file needed: lib/eventasaurus_discovery/shared/parsers/date_patterns/polish.ex

What it needs to do:

  • Read Polish dates like: "poniedziałek, 3 listopada 2025"
  • Know the Polish names for days: poniedziałek (Monday), wtorek (Tuesday), środa (Wednesday), czwartek (Thursday), piątek (Friday), sobota (Saturday), niedziela (Sunday)
  • Know the Polish names for months: stycznia, lutego, marca, kwietnia, maja, czerwca, lipca, sierpnia, września, października, listopada, grudnia
  • Work with our existing MultilingualDateParser (which already handles French and English)
  • Give us the DateTime in UTC, treating the parsed date and time as local time in the "Europe/Warsaw" timezone
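The requirements above can be sketched in a few lines. This is a Python illustration of the parsing logic only, not the Elixir plugin itself or the MultilingualDateParser interface. The function name `parse_polish_datetime` is made up for this example, and it assumes the date and time arrive as two separate strings in the formats shown earlier.

```python
import re
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Genitive month names, as they appear in dates like "3 listopada 2025".
POLISH_MONTHS = {
    "stycznia": 1, "lutego": 2, "marca": 3, "kwietnia": 4,
    "maja": 5, "czerwca": 6, "lipca": 7, "sierpnia": 8,
    "września": 9, "października": 10, "listopada": 11, "grudnia": 12,
}
WARSAW = ZoneInfo("Europe/Warsaw")

# day, month name, year; a leading weekday such as "poniedziałek," is skipped.
DATE_RE = re.compile(r"(\d{1,2})\s+(\w+)\s+(\d{4})")


def parse_polish_datetime(date_text, time_text="00:00"):
    """Parse a Polish date plus an HH:MM time and return an aware UTC datetime."""
    match = DATE_RE.search(date_text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized date: {date_text!r}")
    day, month_name, year = match.groups()
    month = POLISH_MONTHS.get(month_name)
    if month is None:
        raise ValueError(f"unknown month name: {month_name!r}")
    hour, minute = (int(part) for part in time_text.split(":"))
    # Interpret the wall-clock time in Warsaw, then convert to UTC.
    local = datetime(int(year), month, int(day), hour, minute, tzinfo=WARSAW)
    return local.astimezone(timezone.utc)
```

Note that the weekday name is ignored rather than validated. The day, month, and year already pin down the date, so a mismatched weekday on the page would not block the import.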

Need a reference? Check out lib/eventasaurus_discovery/shared/parsers/date_patterns/french.ex to see how it's done.

Category Mapping

We also need to translate the Polish categories into something we understand. Time for a new file!

New file needed: priv/category_mappings/waw4free.yml

Polish → Internal Taxonomy Mapping:

# waw4free.pl category mappings
concerts:
  - koncerty
workshops:
  - warsztaty
exhibitions:
  - wystawy
theater:
  - teatr
sports:
  - sport
family:
  - dla-dzieci
festivals:
  - festiwale
other:
  - inne

Need a reference? See how we did it for priv/category_mappings/karnet.yml and priv/category_mappings/sortiraparis.yml.
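Because the mapping file is keyed by internal category, a scraper that starts from a Polish slug needs the reverse lookup. The Python sketch below inverts the mapping shown above. The dictionary is copied by hand from the YAML, and the helper names and the fallback to "other" for unknown slugs are assumptions, not existing project behavior.

```python
# Hand-copied from the waw4free.yml sketch above: internal -> Polish slugs.
CATEGORY_MAPPING = {
    "concerts": ["koncerty"],
    "workshops": ["warsztaty"],
    "exhibitions": ["wystawy"],
    "theater": ["teatr"],
    "sports": ["sport"],
    "family": ["dla-dzieci"],
    "festivals": ["festiwale"],
    "other": ["inne"],
}


def build_lookup(mapping):
    """Invert internal -> [polish] into polish -> internal."""
    lookup = {}
    for internal, polish_slugs in mapping.items():
        for slug in polish_slugs:
            lookup[slug] = internal
    return lookup


LOOKUP = build_lookup(CATEGORY_MAPPING)


def map_categories(polish_slugs, default="other"):
    """Map an event's Polish category slugs to unique internal categories."""
    mapped = []
    for slug in polish_slugs:
        internal = LOOKUP.get(slug.strip().lower(), default)
        if internal not in mapped:
            mapped.append(internal)
    return mapped
```

Taking a list as input matters because, as noted earlier, a single event can belong to several categories.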

2. Scraper Architecture

Directory Structure

Here’s how we'll organize all the code.

lib/eventasaurus_discovery/sources/waw4free/
├── source.ex              # Configuration & metadata (Priority 30-40)
├── config.ex              # Runtime settings (base_url, rate limits)
├── transformer.ex         # Data transformation to unified format
├── client.ex              # HTTP client with rate limiting
├── html_parser.ex         # HTML parsing utilities
├── jobs/
│   ├── sync_job.ex       # Index job: Scrape category listings
│   └── event_detail_job.ex  # Detail job: Fetch individual events
└── README.md             # Documentation

priv/category_mappings/waw4free.yml  # Category mapping

Pattern: It’s a two-step process, like the Karnet scraper:

  • Stage 1 (SyncJob): Grab all the category listing pages and find event URLs.
  • Stage 2 (EventDetailJob): Go to each event page and get all the details.
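The two stages can be sketched as plain functions to show how they hand off work. This Python version is only a model of the control flow: the real jobs would be the Elixir modules in the tree above, and the `fetch` callable, the function names, and the slug regex are all assumptions made for the example.

```python
import re

BASE_URL = "https://waw4free.pl"
# Assumption: slugs are lowercase ASCII letters, digits, and hyphens.
EVENT_PATH = re.compile(r"/wydarzenie-(\d+)-[a-z0-9-]+")


def sync_job(fetch, categories):
    """Stage 1: scan each category listing and collect unique event URLs."""
    jobs = {}  # event ID -> absolute detail URL
    for category in categories:
        html = fetch(f"{BASE_URL}/warszawa-darmowe-{category}")
        for match in EVENT_PATH.finditer(html):
            # The same event can show up under several categories.
            jobs.setdefault(match.group(1), BASE_URL + match.group(0))
    return jobs


def event_detail_job(fetch, event_id, url):
    """Stage 2: fetch one event page; parsing would happen here."""
    return {"event_id": event_id, "url": url, "html": fetch(url)}


def run(fetch, categories):
    """Run stage 1, then one stage 2 job per discovered event."""
    jobs = sync_job(fetch, categories)
    return [event_detail_job(fetch, eid, url) for eid, url in jobs.items()]
```

Passing `fetch` in as a parameter keeps the sketch free of network calls, so the flow can be exercised with canned HTML before any rate limiting or HTTP client is wired in.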

3. External ID Generation

How we’ll identify each event:

  • We'll use the event ID from the URL as a unique external_id.
  • URL format: /wydarzenie-{id}-{slug}
  • Extract ID: Get the 144172 from /wydarzenie-144172-black-maze-4-labirynt-strachu.
  • External ID: We'll create something like `