Convert plain text to XML source

Starting with plain text, create an XML file that can be used with the ppxml code from https://bookcove.net to generate text, HTML and EPUB output.

starting point

This is a walkthrough that goes from start to finish to create and post a book to BookCove. The book chosen for this is Bill of the Wild Streak, a short story published in Argosy All-Story Weekly magazine in 1925. The commands shown are for MacOS/Linux.

A zip file has been prepared with the files referred to in this writeup. Download it from this link and unzip it into a directory named files. You should also create your own working directory called working if you want to duplicate the steps in this walkthrough.

This walkthrough provides links to the project at many different stages. The starting book is wildstreak-src.txt. It has no special markup, nor has it been smoothread by human or machine. That comes later in this writeup.

convert to starting XML file

extension to .xml

The starting file is in the download set in the files directory. Copy that into your working directory:

cd working
cp ../files/wildstreak-src.txt book.xml

This has no changes yet, only the extension. We will convert it into a ready-to-build XML file.

Your book.xml should match files/book-01.xml. From here on, you will have many opportunities to check your work against the examples in the files directory.

basic editing

Make everything into paragraphs to start.

With regular expressions enabled in your editor, search for the first pattern and replace it with the second:

(\P{Z})\n\n(\P{Z})
\1</p>\n\n<p>\2

or, if your editor doesn’t support advanced regexes:

(\S)\n\n(\S)
\1</p>\n\n<p>\2

Manually fix the first and last paragraphs so that each has both a <p> and a </p>.

Your book.xml should match files/book-02.xml at this point.
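
The same step can be scripted. The sketch below is only an illustration and is not part of ppxml: the function name wrap_paragraphs is hypothetical, and it assumes paragraphs are separated by exactly one blank line, as the patterns above do. It uses lookarounds rather than capture groups so that one-character paragraphs are also handled, and it adds the opening and closing tags that otherwise need the manual fix.

```python
import re


def wrap_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks of text in <p>...</p> tags."""
    body = text.strip()
    # A blank line between two non-space characters marks a paragraph break.
    body = re.sub(r"(?<=\S)\n\n(?=\S)", "</p>\n\n<p>", body)
    # Supply the tags that the search-and-replace cannot add at either end.
    return "<p>" + body + "</p>\n"
```

Reading wildstreak-src.txt, passing its contents through wrap_paragraphs, and writing the result to book.xml should give much the same file as the editor approach.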

entities

Convert XML entities. First make sure there are no HTML entities in the source file. There should be none. Check it, though, either by searching with your editor or by running this command:

grep "&" book.xml

There should be no matches. Now convert ampersands and em dashes (written as -- in the source) to their proper XML representation. Do this in your editor or use perl one-liners:

perl -pi -e 's|&|7X8W|g' book.xml
perl -pi -e 's|7X8W|&amp;|g' book.xml
perl -pi -e 's|--|&#8212;|g' book.xml

Your book.xml should match files/book-03.xml at this point.
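
For anyone who prefers Python to perl, here is a rough equivalent. It is a sketch under the same assumptions as the one-liners (dashes appear as two hyphens, and no entities are already present); the function name escape_entities is hypothetical. Because ampersands are escaped before any entity is added, no placeholder string is needed.

```python
def escape_entities(text: str) -> str:
    """Escape ampersands, then turn double hyphens into em-dash entities."""
    # Ampersands first, so the "&" in "&#8212;" is not escaped a second time.
    text = text.replace("&", "&amp;")
    return text.replace("--", "&#8212;")
```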

chapters

Mark the chapters in the source file. This will be important for EPUBs and for a Table of Contents, if used. This book has “II.” which is now <p>II.</p> because we made everything into paragraphs. That is one example of a chapter start.

Wrap each chapter with starting and ending tags as shown in the next code block. Here is what the first two chapters should look like after this step:

<div type="chapter" n="I" xml:id="chI">
<head>I.</head>
(text of chapter one)
</div>

<div type="chapter" n="II" xml:id="chII">
<head>II.</head>
(text of chapter two)
</div>

Your book.xml should match files/book-04.xml at this point.
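
With only a few chapters, hand editing is quick. For longer books the wrapping could be scripted along these lines. This is a sketch, not ppxml functionality: wrap_chapters is a hypothetical name, and it assumes every chapter, including the first, begins with a paragraph containing only a Roman numeral and a period. If the first chapter has no such heading in your source, add one by hand first.

```python
import re

# A paragraph that holds nothing but a Roman numeral and a period, e.g. <p>II.</p>
HEADING = re.compile(r"^<p>([IVXLC]+)\.</p>$", re.MULTILINE)


def wrap_chapters(body: str) -> str:
    """Wrap each numbered chapter in a <div type="chapter"> element."""
    # split() with one capture group yields:
    # [text before first heading, numeral, chapter text, numeral, chapter text, ...]
    parts = HEADING.split(body)
    out = [parts[0].strip()] if parts[0].strip() else []
    for numeral, text in zip(parts[1::2], parts[2::2]):
        out.append(
            f'<div type="chapter" n="{numeral}" xml:id="ch{numeral}">\n'
            f"<head>{numeral}.</head>\n"
            f"{text.strip()}\n"
            "</div>"
        )
    return "\n\n".join(out) + "\n"
```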

other markup

Chapters are one of many constructions available with the bookcove subset of TEI markup as expressed in the book’s XML source file. To see the other markup, visit https://bookcove.net/resources/ppxml/element_set.html.

Another good way to learn how to use the markup is to look at examples. All the books on bookcove have a link to their XML source. One book already in the collection includes left and right floated illustrations, a complex title page, a Table of Contents, poetry, a table, and other XML markup. Visit https://bookcove.net/books/bc3847/bc3847.xml for those examples.

quotes

I forgot to convert the quotes when creating this writeup. TEI/XML is quite happy to use <quote> and </quote> for quotations. Remember TEI marks up what something is. The ppxml generator will happily convert those tags into smart quotes in output formats. For this book, converting to quote tags is as simple as replacing all open double quotes (“) with <quote> and all close double quotes (”) with </quote>. There are no nested quotes. If there were, they would also use the <quote> and </quote> tags. The generator keeps track so you don’t have to.

The smart quote characters are perfectly acceptable in XML. Still, using <quote> is semantic and should have been done. There is no difference in the generated outputs either way.
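
If you do want to make the conversion, it is a two-step replace. The sketch below (tag_quotes is a hypothetical name) also refuses to run when the opening and closing quote counts differ, since unbalanced tags would leave the XML ill-formed. That check is an added assumption about what is useful here, and it would need relaxing for texts where a quotation runs across several paragraphs without closing marks.

```python
def tag_quotes(text: str) -> str:
    """Replace curly double quotes with <quote> and </quote> tags."""
    opens = text.count("\u201c")   # left double quotation mark
    closes = text.count("\u201d")  # right double quotation mark
    if opens != closes:
        raise ValueError(f"unbalanced quotes: {opens} opening, {closes} closing")
    return text.replace("\u201c", "<quote>").replace("\u201d", "</quote>")
```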

scaffolding

There is a required structure to any TEI file, and that includes the XML file we are creating. I’ll show the whole structure here, indented for clarity:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Bill of the Wild Streak</title>
      </titleStmt>
      <publicationStmt>
        <p>bookcove.net</p>
      </publicationStmt>
      <sourceDesc>
        <p>Argosy-Allstory Weekly magazine, April 18, 1925</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>

  <text>

    <front>
    </front>
    
    <body>
    (everything we have so far)
    </body>
    
    <back>
    </back>

  </text>
</TEI>

The “everything we have so far” is your existing book.xml file. Take what you have and add what comes before it and after it in the outline above. Text indentation doesn’t matter. I usually left-align everything.

If you’ve done it right, there should be no errors in the XML, though of course it isn’t complete yet. Check at any time with: xmllint --noout book.xml

Your book.xml should match files/book-05.xml at this point.
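
The scaffolding and the well-formedness check can also be sketched in Python using only the standard library. The names wrap_in_tei and is_well_formed are hypothetical and are not part of ppxml. Like xmllint --noout, this only confirms that the file parses; it does not validate against any TEI schema.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

TEI_NS = "http://www.tei-c.org/ns/1.0"


def wrap_in_tei(body: str, title: str, source: str) -> str:
    """Place already-marked-up body XML inside the TEI scaffolding shown above."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<TEI xmlns="{TEI_NS}">\n'
        "<teiHeader>\n<fileDesc>\n"
        f"<titleStmt>\n<title>{escape(title)}</title>\n</titleStmt>\n"
        "<publicationStmt>\n<p>bookcove.net</p>\n</publicationStmt>\n"
        f"<sourceDesc>\n<p>{escape(source)}</p>\n</sourceDesc>\n"
        "</fileDesc>\n</teiHeader>\n"
        "<text>\n<front>\n</front>\n"
        f"<body>\n{body}\n</body>\n"
        "<back>\n</back>\n</text>\n</TEI>\n"
    )


def is_well_formed(document: str) -> bool:
    """Return True if the document parses as XML."""
    try:
        ET.fromstring(document.encode("utf-8"))
    except ET.ParseError:
        return False
    return True
```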

first looks

At this point, book.xml can be built into an HTML file and a text file. If you have the ppxml code installed (from https://github.com/rbfrank/ppxml), you can run commands to build text and HTML.

The ppxml code, which is Python, needs the lxml library. If it isn’t already installed, create a virtual environment, install lxml, and use that to run the commands.

To install a virtual environment right in your working directory:

python3 -m venv venv
source venv/bin/activate
pip3 install lxml
deactivate

Use the Python interpreter in the venv:

venv/bin/python3 ppxml.py book.xml book.txt
venv/bin/python3 ppxml.py book.xml book.html

front matter

This short story has a simplified title page. View the source code to any full book published on BookCove for a more complete title page. We will include an illustration and a small title block. Here is the XML for that:

<front>
<div type='frontispiece'>
<figure rend="center">
  <graphic url="images/illus-fpc.jpg" width="75%"/>
  <figDesc>A dog holding a cougar by the neck.</figDesc>
</figure>
</div>

<div type='titlepage'>
<lg>
<l rend='fs14'>BILL OF THE WILD STREAK</l>
<l>BY</l>
<l rend='fs12 mb10'>Howard E. Morgan</l>
</lg>
</div>
</front>

As shown, that markup goes between the <front>...</front> tags.

Note: fs14 will be defined in CSS as font-size: 1.4em and mb10 will be margin-bottom: 1.0em, etc.
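
To make the naming pattern concrete, here is a small sketch that turns such a rend token into the CSS rule it stands for. It is an illustration only: rend_to_css is a hypothetical helper, not something ppxml provides, and it knows just the two prefixes mentioned in the note above. You still write these rules in your own CSS file.

```python
import re

# Only the two prefixes named in this walkthrough; no others are assumed.
PROPERTIES = {"fs": "font-size", "mb": "margin-bottom"}


def rend_to_css(token: str) -> str:
    """Return the CSS rule implied by a rend token such as fs14 or mb10."""
    match = re.fullmatch(r"([a-z]+)(\d+)", token)
    if not match or match.group(1) not in PROPERTIES:
        raise ValueError(f"unrecognized rend token: {token}")
    prop = PROPERTIES[match.group(1)]
    ems = int(match.group(2)) / 10  # "14" means 1.4em
    return f".{token} {{ {prop}: {ems:.1f}em; }}"
```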

Before we move on, bring the images folder over from the files directory to your working directory:

cp -r ../files/images .

CSS

The XML markup, based on TEI principles, marks up what the book is, not what it looks like. A block quote is a “block quote” and the markup says nothing about how it is styled. You add that with the CSS file.

For this book, styles are relatively simple. The ppxml generator provides defaults for most of the defined markup. Your CSS can either add styling to an existing tag, as the h1, h2, and p rules below do, or define new classes for things such as font-size changes.

Here is the CSS file styles.css for this book:

body { margin-left: 11%; margin-right: 10%; line-height: 1.25; }
h1 { font-weight: normal; text-align: center;
     font-size: 1.4em; margin-bottom: 0; }
h2 { text-align: center; font-weight: normal; page-break-before: always;
     font-size: 1.25em; margin-top: 3em; margin-bottom: 1em;
     margin-left: auto; margin-right: auto; }
p { text-indent: 1.15em; margin-top: 0.1em; margin-bottom: 0.1em;
    text-align: justify; }
p.no-indent { text-indent: 0; }
/* title page styling */
.mb10 { margin-bottom: 1.0em; }
.fs12 { font-size: 1.2em; }
.fs14 { font-size: 1.4em; }
/* centering */
.titlepage
  { text-align: center; margin: 2em 0; }
.titlepage p
  { margin: 0.5em 0; }
/* front page separators */
.titlepage {
  border-bottom: 1px solid #999;
  padding-bottom: 2em;
}

That’s all standard HTML CSS. It all goes in the styles.css file at the same level as your book.xml file. It is available in the files folder if you want to quickly copy it over.

You can rebuild the HTML and text to show the effect of the CSS.

Transcriber’s Note

This is a magazine story, so it’s conventional at bookcove to include a Transcriber’s Note indicating the original issue. Two changes are needed. First, add this XML inside the <back>...</back> section:

<div type="notes">
  <div type="transcriber">
    <p>Transcriber’s note: This story appeared in the April 18, 1925 issue
    of <hi rend="italic">Argosy-Allstory Weekly</hi> magazine.</p>
  </div>
</div>

Second, the transcriber class needs to be defined, so add this to the styles.css file:

/* transcriber's note */
.transcriber { 
  font-size: 0.9em; 
  border: 1px solid silver; 
  margin: 1.8em 8% 0;
  padding: 0.3em 2%; 
  background-color: #DDDDEE; 
}
.transcriber p { text-indent: 0; margin: 0; }

HTML+text generation

This XML file has everything it needs to generate a complete, publishable HTML and text file. Making the EPUB takes a little more setup, which is covered in the next section.

To make the production-ready HTML and text, run those same commands as before. Generation of the HTML file will append your styles.css to the internal CSS and generate a standalone file. Here are the commands:

venv/bin/python3 ppxml.py book.xml book.txt
venv/bin/python3 ppxml.py book.xml book.html

EPUB

metadata.json

EPUB generation needs metadata about the book. This is stored at the top level in a metadata.json file. Here is the basic format for it:

{
  "title": "Bill of the Wild Streak",
  "contributors": [
    {
      "name": "Howard E. Morgan",
      "role": "author"
    }
  ],
  "language": "en",
  "original_publisher": "Frank A. Munsey Company",
  "original_publication_date": "1925",
  "description": "",
  "notes": "",
  "keywords": [],
  "cover_image": "cover.jpg"
}

Notice there is no description yet. There are no keywords. I’ll use AI to help me with those.

I asked AI (ChatGPT in this case):

Based on the story excerpt I will provide next, write a concise, back-cover–style flyleaf description (3–5 sentences). Then generate a list of no more than 10 SEO-optimized keywords that reflect the story’s themes, setting, and genre. Avoid spoilers. Optimize keywords for ebook and search engine discoverability.

Output the result as valid JSON with two fields:

• “description”: a single string
• “keywords”: an array of 10 keyword strings

Do not include any explanatory text outside the JSON. Use clear, professional language suitable for literary or archival publication.

I gave it the starting section and it came up with this for the “description”:

On a moonlit hillside at the edge of the wilderness, Bill—a powerful mongrel sheepdog with wolf blood in his veins—stands guard over his master’s flock. Bound by fierce loyalty yet haunted by an inherited hunger for the kill, Bill lives in constant tension between duty and instinct. When danger comes in the night and blood is spilled, his restraint is tested as never before. This quiet, gripping tale explores devotion, trust, and the thin, perilous line between civilization and the wild that still glimmers in a dog’s eyes.

I put that description, in quotes, into the metadata.

It came up with a list of ten keywords, which I put into the keywords field of the JSON file. The final JSON looked like this:

{
  "title": "Bill of the Wild Streak",
  "contributors": [
    {
      "name": "Morgan, Howard E.",
      "role": "author"
    }
  ],
  "language": "en",
  "original_publisher": "Frank A. Munsey Company",
  "original_publication_date": "1925",
  "description": "On a moonlit hillside at the edge of the wilderness, Bill—a powerful mongrel sheepdog with wolf blood in his veins—stands guard over his master’s flock. Bound by fierce loyalty yet haunted by an inherited hunger for the kill, Bill lives in constant tension between duty and instinct. When danger comes in the night and blood is spilled, his restraint is tested as never before. This quiet, gripping tale explores devotion, trust, and the thin, perilous line between civilization and the wild that still glimmers in a dog’s eyes.",
  "notes": "",
  "keywords": [
    "sheepdog",
    "animal short story",
    "wilderness fiction",
    "dog protagonist",
    "loyalty and instinct",
    "man and dog",
    "frontier life",
    "nature vs civilization",
    "wolf ancestry",
    "classic animal fiction"
  ],
  "cover_image": "cover.jpg"
}

Note: the description must be on one line for JSON. Use \n for a line break or \n\n for a paragraph break, if desired.
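
If you would rather not escape line breaks by hand, Python's json module does it for you. The following is a sketch, and fill_metadata is a hypothetical helper rather than part of ppxml. It takes the metadata as a dictionary, adds the description and keywords, and returns JSON text in which any line break inside the description is written as \n.

```python
import json


def fill_metadata(metadata: dict, description: str, keywords: list) -> str:
    """Return metadata as JSON text with description and keywords filled in."""
    updated = dict(metadata)  # leave the caller's dictionary untouched
    updated["description"] = description
    updated["keywords"] = list(keywords)
    # ensure_ascii=False keeps characters such as curly quotes readable.
    return json.dumps(updated, indent=2, ensure_ascii=False) + "\n"
```

Load the existing metadata.json with json.load, pass the result through fill_metadata, and write the returned string back to the file.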

cover image

Right now, the cover.jpg image is in the images folder. It isn’t used in the HTML but it is needed for the EPUB. Copy the cover image to the top level, where book.xml is. That’s where the EPUB generator looks for it. Note that you can also have a cover.jpg in the HTML images/ directory if you want to use it in the generated HTML. The two cover.jpg files in this case are independent.

Now the EPUB can be created. Following the pattern earlier, it’s the same command but with a different extension on the output file.

To review, for HTML and text:

venv/bin/python3 ppxml.py book.xml book.txt
venv/bin/python3 ppxml.py book.xml book.html

and now for EPUB:

venv/bin/python3 ppxml.py book.xml book.epub
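
Since the output format is chosen by the extension of the output file, all three builds follow one pattern. The sketch below only assembles the command lines shown above; ppxml_commands is a hypothetical helper, and the interpreter and script paths are the ones used in this walkthrough.

```python
from pathlib import Path


def ppxml_commands(source="book.xml", python="venv/bin/python3", script="ppxml.py"):
    """Build the text, HTML and EPUB command lines for one XML source file."""
    stem = Path(source).stem
    return [[python, script, source, f"{stem}.{ext}"] for ext in ("txt", "html", "epub")]
```

Each list can be passed to subprocess.run(command, check=True) to run the build.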

This completes the sample book generation.

Posting to bookcove

If this book will be posted at bookcove, there are some additional steps to take. None of the generation code changes, but additional files are needed to create a complete GitHub repository.

Books on bookcove are all “Born as Git”, which means the only source of truth for them is the GitHub repository.

starting the book’s repository

First, choose an unused bookcove book identifier. It is of the form bcNNNN where NNNN is an unused 4-digit number. For this example, I’ll choose bc2914. Start the repository directory with these commands:

rm -rf bc2914 && mkdir bc2914
cd bc2914
cp ../book.xml bc2914.xml
cp -r ../images ../metadata.json ../cover.jpg ../styles.css .

Two more files are needed, a README.md and a RIGHTS.md.

README.md

The README.md file goes in at the top level. It contains:

# Bill of the Wild Streak

Author: Morgan, Howard E.  
Original Publisher: Frank A. Munsey Company  
Original Publication Date: 1925

## About This Book

On a moonlit hillside at the edge of the wilderness, Bill—a powerful
mongrel sheepdog with wolf blood in his veins—stands guard over his
master’s flock. Bound by fierce loyalty yet haunted by an inherited
hunger for the kill, Bill lives in constant tension between duty and
instinct. When danger comes in the night and blood is spilled, his
restraint is tested as never before. This quiet, gripping tale explores
devotion, trust, and the thin, perilous line between civilization and
the wild that still glimmers in a dog’s eyes.

## About This Repository

This repository contains source files to generate this book in several
output formats for the [bookcove](https://bookcove.net) collection.

### Contents

- `metadata.json` Metadata (title, author, publication info, subjects)
- `<filename>.xml` Book content in XML format
- `css/` Stylesheets for different output formats
- `images/` Illustrations and figures
- `cover.jpg` Cover image
- other subdirectories as needed (e.g. `music/` or `fonts/`)
- `RIGHTS.md` and `README.md`

### Output Formats

The source XML is a proper subset of TEI markup. It can be built into
HTML, plain text, EPUB3 or PDF using standard TEI conversion utilities
or the software tools at bookcove.net.

### Part of bookcove

Visit [bookcove.net](https://bookcove.net) for more public domain books or to join our community.

RIGHTS.md

The RIGHTS.md file is also at the top level and contains:

This work is believed to be in the public domain in the United States. Copyright status in other countries may vary. Users are responsible for verifying the copyright status of this work in their jurisdiction.

Making the repository and pushing to GitHub

At this point, you should have this in your bc2914 directory:

bc2914/
├── bc2914.xml
├── cover.jpg
├── images
│   └── illus-fpc.jpg
├── metadata.json
├── README.md
├── RIGHTS.md
└── styles.css
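
Before creating the repository, it can help to confirm nothing is missing. Here is a minimal sketch of such a check. The function name missing_items is hypothetical, and the list of required items is simply read off the tree above; it accepts any .css file rather than one particular stylesheet name.

```python
from pathlib import Path


def missing_items(repo_dir: str, book_id: str) -> list:
    """List the expected files and folders that are absent from repo_dir."""
    root = Path(repo_dir)
    required = [f"{book_id}.xml", "cover.jpg", "metadata.json", "README.md", "RIGHTS.md"]
    missing = [name for name in required if not (root / name).is_file()]
    if not (root / "images").is_dir():
        missing.append("images/")
    if not any(root.glob("*.css")):
        missing.append("*.css")
    return missing
```

An empty list from missing_items(".", "bc2914"), run inside the bc2914 directory, means the layout matches the tree.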

To turn this into a repository, run these commands (or have a bookcove administrator run them for you):

rm -rf .git .gitignore
git init
git add .
git commit -m "Initial commit"
gh repo create bookcovebooks/bc2914 --source=. --push --public

At this point you are done. A periodic script on bookcove will notice the new repository and build everything from that, including the catalog page, the detail page, all media formats, etc.

administrator note

To post immediately and not wait for the periodic update, run these commands:

(log in to c57d.io as an administrator)
cd /var/www/bookcove/tools
sudo php import_from_github.php

It will immediately be posted.

Smoothreading

At the time of this writing, AI cannot find everything that needs to be fixed. Tests have shown that if a project gets “smoothread” by AI, a very good human proofreader will probably still find something else worth checking.

For this book, I had AI proofread the text that this walkthrough generated. It found about 30 corrections to make, all of which were valid.

The magic is in the prompt given. Here’s the prompt I currently use for ChatGPT:

I have completed OCR text recognition of a short story from 1925. I want you to help me spot grammar and punctuation errors, paying special attention to ensuring that all quotation marks are properly paired (every opening quote has a matching closing quote). Please note that italics and emphasis are marked with underscores. Please read the text twice before reporting errors. I would like to give it to you in chunks. If an error is found, please report only the error. List errors plainly.

Here is the prompt I use if I choose to have Claude smoothread the text:

I will provide blocks of text after these instructions. For each of them, please identify only grammatical errors, spelling mistakes, obvious OCR errors (including common letter substitutions like ‘rn’ for ‘m’, ‘cl’ for ‘d’, ‘u’ for ‘n’, etc.), incorrect spacing around punctuation, smart quote direction errors, missing punctuation (especially after dialogue tags), and incomplete sentences or incorrect punctuation in this text. Don’t suggest stylistic improvements or flag unusual but potentially correct word choices from the original era. Please review the text twice to ensure accuracy and pay special attention to missing periods at the end of sentences.

Note: I also have API versions of these prompts that I use for book-length projects. Having AI smoothread an entire book through the API costs about seven cents. Using the prompts above is free, but the story needs to be fed in about 100 lines at a time.
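
Splitting the text into pieces of that size is easy to script. The sketch below is one way to do it; chunk_lines is a hypothetical helper, and the default of 100 lines is just the rough figure mentioned above.

```python
def chunk_lines(text: str, size: int = 100) -> list:
    """Split text into consecutive chunks of at most `size` lines each."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]
```

Each chunk can then be pasted in after the prompt, one at a time.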

Final Form

To see this book in its final form, visit its page on the bookcove site by clicking on this link.