SUB Importer: Tasks, Structure, And Implementation
Hey guys,
This article outlines the creation of a new SUB importer within the text_preparation.importer module. We're tasking @maslionok with this, as it's a fairly self-contained project.
General Needs
Essentially, we need to build classes that import SUB's OCR/OLR data into our canonical format on S3. The text_preparation.importer module handles this in two key areas:
- Classes and detect functions (in classes.py and detect.py within the importers.sub module)
- Orchestrators (generic_importer.py and core.py)
Let's dive deeper into each of these areas.
Classes and Detect Functions
In the importers.sub module, you'll find classes.py and detect.py. These scripts are crucial for defining how we handle SUB data.
- classes.py: This script is where we define the Canonical Issue and Canonical Page classes specifically for the SUB case. It's important to consider all the unique aspects of this particular version of METS/ALTO OCR/OLR: how the data is structured, which specific fields need to be extracted, and how everything maps onto our canonical format. By creating specialized classes, we ensure that the SUB data is properly represented and processed within our system.
- detect.py: This script identifies the issues to ingest from the file system. Its functions build "IssueDir" objects, which are named tuples holding the essential information identifying an issue. These objects act as signposts, guiding the importer to the correct data.
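To make the detect.py idea concrete, here is a minimal sketch of what issue detection could look like for the SUB directory layout. The field names of IssueDir and the function name are illustrative assumptions, not the actual impresso definitions; the real named tuple in the codebase may differ.

```python
from collections import namedtuple
from datetime import date
from pathlib import Path

# Hypothetical IssueDir shape; the real impresso importer defines its own fields.
IssueDir = namedtuple("IssueDir", ["alias", "date", "edition", "path"])

def detect_issues(base_dir: str, alias: str) -> list:
    """Walk <title>/<year>/<month>/<day>/<edition> and build one IssueDir per edition."""
    issues = []
    for edition_dir in sorted(Path(base_dir).glob("*/*/*/*")):
        if not edition_dir.is_dir():
            continue
        # The three path components above the edition are year/month/day.
        year, month, day = edition_dir.parts[-4:-1]
        issues.append(
            IssueDir(
                alias=alias,
                date=date(int(year), int(month), int(day)),
                edition=edition_dir.name,
                path=str(edition_dir),
            )
        )
    return issues
```

A real implementation would also need to skip malformed directories and decide how to order editions within a day, but the traversal pattern stays the same.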
These scripts share a lot of similarities across different importers. Don't hesitate to use code from other METS/ALTO-based importers (especially BL, BNF, and BNE-EN) as a guide and example. Copying, pasting, and adapting code snippets can save you time and effort.
Orchestrators
The orchestrators, generic_importer.py and core.py, are the masterminds behind the import process. They manage the overall flow and ensure that everything runs smoothly.
- generic_importer.py: The main orchestrator of the imports. It coordinates the overall import process, managing the different stages from identifying the issues to ingesting the data. It is worth reading this script carefully, as it will help you understand how everything fits together.
- core.py: Provides the core functionality for the import process: functions for reading data, processing it, and writing it to the canonical format.
Important: You shouldn't need to modify these orchestrators too much, as they're designed to work consistently across all importers. However, reading through them can provide valuable insights into how everything works: issue serialization, the lazy behavior of page objects, error logging, and so on. If you spot any potential errors in these scripts, it's best to flag them with me, as they could affect all canonical imports.
For more detailed information, check out the documentation for the project. And of course, feel free to ask any questions you have along the way!
Current Situation and File Structure
Currently, we have the complete contents of the "Hamburger Echo" newspaper (1887-1933) from the SUB, all in METS/ALTO format.
The file structure looks like this:
Hamburger_Echo/                                  # Root directory (newspaper title)
├── 1919/                                        # Year directory
│   ├── 02/                                      # Month directory (February)
│   │   ├── 19/                                  # Day directory (19th)
│   │   │   ├── Morgenausgabe/                   # Morning edition
│   │   │   │   ├── 00000001.tif                 # Page image (facsimile)
│   │   │   │   ├── 00000001.xml                 # Page OCR (ALTO or PAGE XML)
│   │   │   │   ├── 00000002.tif
│   │   │   │   ├── 00000002.xml
│   │   │   │   ├── [...]                        # More page pairs (.tif/.xml)
│   │   │   │   └── PPN1754726119_19190219MO.xml # METS file (Morgenausgabe, "MO")
│   │   │   │       # Format: PPN[titleID]_[YYYYMMDD][edition_code].xml
│   │   │   │       # Here: PPN1754726119 = newspaper ID
│   │   │   │       #       19190219 = date in YYYYMMDD format
│   │   │   │       #       MO = morning edition (Morgenausgabe)
│   │   │   │
│   │   │   ├── Abendausgabe/                    # Evening edition (same day)
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]                        # More pages
│   │   │   │   └── PPN1754726119_19190219AB.xml # METS file (Abendausgabe, "AB")
│   │   │   │
│   │   │   └── [other editions or none for this day...]
│   │   │
│   │   ├── 20/                                  # Next day (example of multiple evening editions)
│   │   │   ├── A1-Abendausgabe/
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]
│   │   │   │   └── PPN1754726119_19190220A1.xml # METS file (first evening edition)
│   │   │   │
│   │   │   └── A2-Abendausgabe/
│   │   │       ├── 00000001.tif
│   │   │       ├── 00000001.xml
│   │   │       ├── [...]
│   │   │       └── PPN1754726119_19190220A2.xml # METS file (second evening edition)
│   │   │
│   │   ├── 21/                                  # Example of a single daily edition
│   │   │   └── Ausgabe/
│   │   │       ├── 00000001.tif
│   │   │       ├── 00000001.xml
│   │   │       ├── [...]
│   │   │       └── PPN1754726119_19190221.xml   # METS file (single daily edition)
│   │   │
│   │   └── [other days...]
│   │
│   └── [other months...]
│
└── [other years...]
Each level in this structure represents:
- Title: Hamburger_Echo (newspaper)
- Year: e.g., 1919
- Month: e.g., 02
- Day: e.g., 19, 20, 21
- Edition: This can be one of the following:
- Ausgabe: Single daily edition (it seems)
- Morgenausgabe: Morning edition (MO)
- Abendausgabe: Evening edition (AB)
- A1-Abendausgabe, A2-Abendausgabe: Multiple evening editions (A1, A2); likely also possible for morning editions.
Important Note: It's possible that the morning or evening edition is the only one for a given day. From what I've seen, "Ausgabe" typically indicates the only edition of the day, but this might not hold true for all the data.
This structure makes it easy to create Impresso IDs for each issue and page:
[media-alias]-[YYYY]-[MM]-[DD]-[edition letter(s)] (for issues)
[media-alias]-[YYYY]-[MM]-[DD]-[edition letter(s)]-p[page number filled to 4 digits] (for pages).
The editions are simply assigned a letter, starting with 'a' for the first, 'b' for the second, and so on.
We'll use the alias "hamb_echo" for the Hamburger Echo. We can decide on aliases for other titles later.
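The ID scheme above can be sketched in a few lines. The function names and the 0-based edition index are my own assumptions for illustration; only the resulting format ([media-alias]-[YYYY]-[MM]-[DD]-[edition letter] and the -p[4-digit page] suffix) comes from the scheme described above.

```python
import string

def issue_id(alias: str, year: int, month: int, day: int, edition_index: int) -> str:
    """Build an issue ID like 'hamb_echo-1919-02-19-a'.

    edition_index is 0-based: the first edition of the day gets 'a',
    the second 'b', and so on.
    """
    letter = string.ascii_lowercase[edition_index]
    return f"{alias}-{year:04d}-{month:02d}-{day:02d}-{letter}"

def page_id(issue: str, page_number: int) -> str:
    """Build a page ID like 'hamb_echo-1919-02-19-a-p0001'."""
    return f"{issue}-p{page_number:04d}"

# Example: first edition of 19 February 1919, page 1.
# issue_id("hamb_echo", 1919, 2, 19, 0)      -> "hamb_echo-1919-02-19-a"
# page_id("hamb_echo-1919-02-19-a", 1)       -> "hamb_echo-1919-02-19-a-p0001"
```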
Specific Tasks and Open Questions
I've already set up some basic structures in the feature/sub-importer branch for you to build upon:
- [ ] Define the SubNewspaperIssue object. What attributes should it have? How will it represent the key information about each newspaper issue, such as date, edition, and unique identifiers? Consider how this object will interact with other parts of the importer.
- [ ] Define the SubNewspaperPage object. Similar to the issue object, what attributes are essential for representing a newspaper page? This might include page number, image location, OCR data, and any metadata associated with the page. Think about how this object will be used during the import process, especially when extracting and processing the text content.
- [ ] Define the functions in the detect.py script. How will these functions traverse the file system, identify relevant files (METS, ALTO, images), and create IssueDir objects? Consider the different possible file structures and naming conventions you might encounter.
- [ ] Modularize any helper functions in a helpers.py script. As you develop the importer, you'll likely find yourself writing reusable code snippets. Create a helpers.py file to house these functions, promoting code organization and maintainability. Useful helpers could include functions for parsing dates, extracting information from file names, or handling errors.
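As one example of a helper worth putting in helpers.py: parsing the METS file name format PPN[titleID]_[YYYYMMDD][edition_code].xml documented in the file structure above. This is a sketch under the assumption that edition codes are short uppercase/alphanumeric suffixes (MO, AB, A1, A2) or absent; the function and pattern names are hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "PPN1754726119_19190219MO.xml" or "PPN1754726119_19190221.xml".
METS_PATTERN = re.compile(
    r"PPN(?P<ppn>\d+)_(?P<date>\d{8})(?P<edition>[A-Z0-9]*)\.xml$"
)

def parse_mets_filename(filename: str):
    """Return (ppn, issue_date, edition_code) from a SUB METS file name.

    edition_code is None for single-edition days; returns None entirely
    if the name does not match the expected pattern.
    """
    match = METS_PATTERN.match(filename)
    if match is None:
        return None
    issue_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    return match.group("ppn"), issue_date, match.group("edition") or None
```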
Let me know if you have any questions. Good luck!