SUB Importer: Tasks, Structure, And Implementation
Hey guys,
This article outlines the creation of a new SUB importer within the text_preparation.importer module. We're tasking @maslionok with this, as it's a fairly self-contained project.
General Needs
Essentially, we need to build classes that import SUB's OCR/OLR data into our canonical format on S3. The text_preparation.importer module handles this in two key areas:
- Classes and detect functions (in classes.py and detect.py within the importers.sub module)
- Orchestrators (generic_importer.py and core.py)
Let's dive deeper into each of these areas.
Classes and Detect Functions
In the importers.sub module, you'll find classes.py and detect.py. These scripts are crucial for defining how we handle SUB data.
- classes.py: This script is where we define the Canonical Issue and Canonical Page classes specifically for the SUB case. It's important to consider all the unique aspects of this particular version of METS/ALTO OCR/OLR: how the data is structured, which specific fields need to be extracted, and how everything maps onto our canonical format. By creating specialized classes, we ensure that the SUB data is properly represented and processed within our system.
- detect.py: This script identifies the issues to ingest from the file system. Its functions build "IssueDir" objects, which are named tuples holding the essential information identifying an issue. These objects act as signposts, guiding the importer to the correct data.
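To make the detect.py idea concrete, here is a minimal sketch of what issue detection could look like for the SUB directory layout. The field names of IssueDir and the function name are illustrative assumptions, not the actual impresso definitions; the real named tuple in the codebase may differ.

```python
from collections import namedtuple
from datetime import date
from pathlib import Path

# Hypothetical IssueDir shape; the real impresso importer defines its own fields.
IssueDir = namedtuple("IssueDir", ["alias", "date", "edition", "path"])

def detect_issues(base_dir: str, alias: str) -> list:
    """Walk <title>/<year>/<month>/<day>/<edition> and build one IssueDir per edition."""
    issues = []
    for edition_dir in sorted(Path(base_dir).glob("*/*/*/*")):
        if not edition_dir.is_dir():
            continue
        # The three path components above the edition are year/month/day.
        year, month, day = edition_dir.parts[-4:-1]
        issues.append(
            IssueDir(
                alias=alias,
                date=date(int(year), int(month), int(day)),
                edition=edition_dir.name,
                path=str(edition_dir),
            )
        )
    return issues
```

A real implementation would also need to skip malformed directories and decide how to order editions within a day, but the traversal pattern stays the same.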
These scripts share a lot of similarities across different importers. Don't hesitate to use code from other METS/ALTO-based importers (especially BL, BNF, and BNE-EN) as a guide and example. Copying, pasting, and adapting code snippets can save you time and effort.
Orchestrators
The orchestrators, generic_importer.py and core.py, are the masterminds behind the import process. They manage the overall flow and ensure that everything runs smoothly.
- generic_importer.py: The main orchestrator of the imports. It coordinates the overall import process, managing the different stages from identifying the issues to ingesting the data. It is worth reading this script carefully, as it will help you understand how everything fits together.
- core.py: Provides the core functionality for the import process: functions for reading data, processing it, and writing it to the canonical format.
Important: You shouldn't need to modify these orchestrators too much, as they're designed to work consistently across all importers. However, reading through them can provide valuable insights into how everything works: issue serialization, the lazy behavior of page objects, error logging, and so on. If you spot any potential errors in these scripts, it's best to flag them with me, as they could affect all canonical imports.
For more detailed information, check out the documentation for the project. And of course, feel free to ask any questions you have along the way!
Current Situation and File Structure
Currently, we have the complete contents of the "Hamburger Echo" newspaper (1887-1933) from the SUB, all in METS/ALTO format.
The file structure looks like this:
Hamburger_Echo/                                  # Root directory (newspaper title)
├── 1919/                                        # Year directory
│   ├── 02/                                      # Month directory (February)
│   │   ├── 19/                                  # Day directory (19th)
│   │   │   ├── Morgenausgabe/                   # Morning edition
│   │   │   │   ├── 00000001.tif                 # Page image (facsimile)
│   │   │   │   ├── 00000001.xml                 # Page OCR (ALTO or PAGE XML)
│   │   │   │   ├── 00000002.tif
│   │   │   │   ├── 00000002.xml
│   │   │   │   ├── [...]                        # More page pairs (.tif/.xml)
│   │   │   │   └── PPN1754726119_19190219MO.xml # METS file (Morgenausgabe, "MO")
│   │   │   │       # Format: PPN[titleID]_[YYYYMMDD][edition_code].xml
│   │   │   │       # Here: PPN1754726119 = newspaper ID
│   │   │   │       #       19190219 = date in YYYYMMDD format
│   │   │   │       #       MO = morning edition (Morgenausgabe)
│   │   │   │
│   │   │   ├── Abendausgabe/                    # Evening edition (same day)
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]                        # More pages
│   │   │   │   └── PPN1754726119_19190219AB.xml # METS file (Abendausgabe, "AB")
│   │   │   │
│   │   │   └── [other editions or none for this day...]
│   │   │
│   │   ├── 20/                                  # Next day (example of multiple evening editions)
│   │   │   ├── A1-Abendausgabe/
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]
│   │   │   │   └── PPN1754726119_19190220A1.xml # METS file (first evening edition)
│   │   │   │
│   │   │   └── A2-Abendausgabe/
│   │   │       ├── 00000001.tif
│   │   │       ├── 00000001.xml
│   │   │       ├── [...]
│   │   │       └── PPN1754726119_19190220A2.xml # METS file (second evening edition)
│   │   │
│   │   ├── 21/                                  # Example of a single daily edition
│   │   │   └── Ausgabe/
│   │   │       ├── 00000001.tif
│   │   │       ├── 00000001.xml
│   │   │       ├── [...]
│   │   │       └── PPN1754726119_19190221.xml   # METS file (single daily edition)
│   │   │
│   │   └── [other days...]
│   │
│   └── [other months...]
│
└── [other years...]
Each level in this structure represents:
- Title: Hamburger_Echo (newspaper)
- Year: e.g., 1919
- Month: e.g., 02
- Day: e.g., 19, 20, 21
- Edition: This can be one of the following:
- Ausgabe: Single daily edition (it seems)
- Morgenausgabe: Morning edition (MO)
- Abendausgabe: Evening edition (AB)
- A1-Abendausgabe, A2-Abendausgabe: Multiple evening editions (A1, A2); likely also possible for morning editions.
Important Note: It's possible that the morning or evening edition is the only one for a given day. From what I've seen, "Ausgabe" typically indicates the only edition of the day, but this might not hold true for all the data.
This structure makes it easy to create Impresso IDs for each issue and page:
[media-alias]-[YYYY]-[MM]-[DD]-[edition letter(s)] (for issues)
[media-alias]-[YYYY]-[MM]-[DD]-[edition letter(s)]-p[page number filled to 4 digits] (for pages).
The editions are simply assigned a letter, starting with 'a' for the first, 'b' for the second, and so on.
We'll use the alias "hamb_echo" for the Hamburger Echo. We can decide on aliases for other titles later.
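The ID scheme above can be sketched in a few lines. The function names and the 0-based edition index are my own assumptions for illustration; only the resulting format ([media-alias]-[YYYY]-[MM]-[DD]-[edition letter] and the -p[4-digit page] suffix) comes from the scheme described above.

```python
import string

def issue_id(alias: str, year: int, month: int, day: int, edition_index: int) -> str:
    """Build an issue ID like 'hamb_echo-1919-02-19-a'.

    edition_index is 0-based: the first edition of the day gets 'a',
    the second 'b', and so on.
    """
    letter = string.ascii_lowercase[edition_index]
    return f"{alias}-{year:04d}-{month:02d}-{day:02d}-{letter}"

def page_id(issue: str, page_number: int) -> str:
    """Build a page ID like 'hamb_echo-1919-02-19-a-p0001'."""
    return f"{issue}-p{page_number:04d}"

# Example: first edition of 19 February 1919, page 1.
# issue_id("hamb_echo", 1919, 2, 19, 0)      -> "hamb_echo-1919-02-19-a"
# page_id("hamb_echo-1919-02-19-a", 1)       -> "hamb_echo-1919-02-19-a-p0001"
```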
Specific Tasks and Open Questions
I've already set up some basic structures in the feature/sub-importer branch for you to build upon:
- [ ] Define the SubNewspaperIssue object. What attributes should it have? How will it represent the key information about each newspaper issue, such as date, edition, and unique identifiers? Consider how this object will interact with other parts of the importer.
- [ ] Define the SubNewspaperPage object. Similar to the issue object, what attributes are essential for representing a newspaper page? This might include page number, image location, OCR data, and any metadata associated with the page. Think about how this object will be used during the import process, especially when extracting and processing the text content.
- [ ] Define the functions in the detect.py script. How will these functions traverse the file system, identify relevant files (METS, ALTO, images), and create IssueDir objects? Consider the different possible file structures and naming conventions you might encounter.
- [ ] Modularize any helper functions in a helpers.py script. As you develop the importer, you'll likely find yourself writing reusable code snippets. Create a helpers.py file to house these functions, promoting code organization and maintainability. Useful helpers could include functions for parsing dates, extracting information from file names, or handling errors.
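As one example of a helper worth putting in helpers.py: parsing the METS file name format PPN[titleID]_[YYYYMMDD][edition_code].xml documented in the file structure above. This is a sketch under the assumption that edition codes are short uppercase/alphanumeric suffixes (MO, AB, A1, A2) or absent; the function and pattern names are hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "PPN1754726119_19190219MO.xml" or "PPN1754726119_19190221.xml".
METS_PATTERN = re.compile(
    r"PPN(?P<ppn>\d+)_(?P<date>\d{8})(?P<edition>[A-Z0-9]*)\.xml$"
)

def parse_mets_filename(filename: str):
    """Return (ppn, issue_date, edition_code) from a SUB METS file name.

    edition_code is None for single-edition days; returns None entirely
    if the name does not match the expected pattern.
    """
    match = METS_PATTERN.match(filename)
    if match is None:
        return None
    issue_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    return match.group("ppn"), issue_date, match.group("edition") or None
```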
Let me know if you have any questions. Good luck!