Starting with plain text, create an XML file that can be used with
the ppxml code from https://bookcove.net to
generate text, HTML and EPUB output.
This is a walkthrough that goes from start to finish to create and post a book to BookCove. The book chosen for this is Bill of the Wild Streak, a short story published in Argosy All-Story Weekly magazine in 1925. The commands shown are for MacOS/Linux.
A zip file has been prepared with the files referred to in this writeup.
Download it from this link and unzip it
into a directory named files. You should also create your
own working directory called working if you want to
duplicate the steps in this walkthrough.
This walkthrough provides links to the project at many different
stages. The starting book is wildstreak-src.txt. It has no
special markup, nor has it been smoothread by human or machine. That
comes later in this writeup.
The starting file is in the download set in the files
directory. Copy that into your working directory:
cd working
cp ../files/wildstreak-src.txt book.xml
This has no changes yet, only the extension. We will convert it into a ready-to-build XML file.
Your book.xml should match
files/book-01.xml. From here on, you will have many
opportunities to check your work against the examples in the
files directory.
Make everything into paragraphs to start.
Search and replace the first line with the second line, with regular expressions enabled.
(\P{Z})\n\n(\P{Z})
\1</p>\n\n<p>\2
or, if your editor doesn’t support advanced regular expressions:
(\S)\n\n(\S)
\1</p>\n\n<p>\2
Manually fix the first and last paragraph so they have both a
<p> and a </p>.
Your book.xml should match
files/book-02.xml at this point.
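For readers who would rather script this step than use an editor, the same substitution can be sketched in Python. This is only an illustration and not part of ppxml; the function name wrap_paragraphs is made up here. The pattern uses lookarounds instead of capture groups so that one-character paragraphs are also handled, and the function supplies the first opening tag and last closing tag that the manual fix above adds.

```python
import re

def wrap_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks of plain text in <p>...</p>."""
    body = text.strip()
    # A blank line between two non-space characters is a paragraph boundary.
    body = re.sub(r"(?<=\S)\n\n(?=\S)", "</p>\n\n<p>", body)
    # Add the opening tag of the first paragraph and the closing tag of the last.
    return "<p>" + body + "</p>\n"

sample = "First paragraph.\n\nII.\n\nLast paragraph.\n"
print(wrap_paragraphs(sample))
```

Single line breaks inside a paragraph are left alone, so the original line wrapping of the source text is preserved.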
Convert XML entities. First make sure there are no HTML entities in the source file. There should be none. Check it, though, either by searching with your editor or by running this command:
grep "&" book.xml
There should be no matches. Now convert ampersand and mdash characters to their proper XML representation. Do this in your editor or use perl one-liners:
perl -pi -e 's|&|7X8W|g' book.xml
perl -pi -e 's|7X8W|&amp;|g' book.xml
perl -pi -e 's|--|—|g' book.xml
Your book.xml should match
files/book-03.xml at this point.
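The check and the conversion can also be combined in a few lines of Python. This is a rough sketch under stated assumptions, not the method ppxml uses: the name convert_entities and the entity pattern are invented for illustration. It refuses to run if anything that looks like an HTML entity is present, then escapes bare ampersands as &amp;amp; and turns double hyphens into em dash characters.

```python
import re

# Anything shaped like &name; or &#123; is treated as a leftover HTML entity.
ENTITY = re.compile(r"&#?\w+;")

def convert_entities(text: str) -> str:
    found = ENTITY.findall(text)
    if found:
        # Mirrors the grep check: look at these by hand before converting.
        raise ValueError(f"resolve these entities first: {found}")
    # Escape ampersands first, then convert double hyphens to U+2014.
    text = text.replace("&", "&amp;")
    return text.replace("--", "\u2014")

print(convert_entities("<p>Bill & the flock--alone.</p>"))
```

The angle brackets are deliberately left untouched because the file already contains the paragraph tags added in the previous step.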
Mark the chapters in the source file. This will be important for
EPUBs and for a Table of Contents, if used. This book has “II.” which is
now <p>II.</p> because we made everything into
paragraphs. That is one example of a chapter start.
Wrap each chapter with starting and ending tags as shown in the next code block. Here is what the first two chapters should look like after this step:
<div type="chapter" n="I" xml:id="chI">
<head>I.</head>
(text of chapter one)
</div>
<div type="chapter" n="II" xml:id="chII">
<head>II.</head>
(text of chapter two)
</div>
Your book.xml should match
files/book-04.xml at this point.
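If a book has many chapters, the wrapping can be scripted. The sketch below assumes that every chapter heading is a paragraph containing only a Roman numeral followed by a period, as in the II. example above; the function name wrap_chapters is hypothetical. Text that comes before the first matched heading is passed through unchanged, so a first chapter without its own heading still has to be wrapped by hand.

```python
import re

# Assumption: a chapter heading is a paragraph holding only a Roman numeral and a period.
HEADING = re.compile(r"<p>([IVXLC]+)\.</p>")

def wrap_chapters(body: str) -> str:
    # split() with one capture group yields [preamble, numeral, text, numeral, text, ...]
    parts = HEADING.split(body)
    blocks = [parts[0].strip()] if parts[0].strip() else []
    for numeral, text in zip(parts[1::2], parts[2::2]):
        blocks.append(
            f'<div type="chapter" n="{numeral}" xml:id="ch{numeral}">\n'
            f"<head>{numeral}.</head>\n"
            f"{text.strip()}\n"
            "</div>"
        )
    return "\n\n".join(blocks) + "\n"

sample = "<p>I.</p>\n\n<p>One.</p>\n\n<p>II.</p>\n\n<p>Two.</p>\n"
print(wrap_chapters(sample))
```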
Chapters are one of many constructions available with the bookcove subset of TEI markup as expressed in the book’s XML source file. To see the other markup, visit https://bookcove.net/resources/ppxml/element_set.html.
Another good way to learn how to use the markup is to look at
examples. All the books on bookcove have a link to their
XML source. One book already in the collection includes left and right
floated illustrations, a complex title page, a Table of Contents,
poetry, a table, and other XML markup. Visit
https://bookcove.net/books/bc3847/bc3847.xml for those
examples.
I forgot to convert the quotes when creating this writeup. TEI/XML is
quite happy to use <quote> and
</quote> for quotations. Remember TEI marks up what
something is. The ppxml generator will happily
convert those tags into smart quotes in output formats. For this book,
converting to quote tags is as simple as replacing all open double
quotes (“) with <quote> and all close double quotes
(”) with </quote>. There are no nested quotes. If
there were, they would also use the <quote> and
</quote> tags. The generator keeps track so you don’t
have to.
The smart quote characters are perfectly acceptable in XML. Still,
using <quote> is semantic and should have been done.
There is no difference in the generated outputs either way.
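The quote replacement described above is a plain character substitution, which can be sketched in Python as follows. The function names are made up for this example. Because an element cannot span two paragraphs in well-formed XML, the sketch also reports any paragraph whose opening and closing quote counts differ; it assumes paragraphs are separated by blank lines, as they are after the paragraph step.

```python
OPEN, CLOSE = "\u201c", "\u201d"  # left and right double quotation marks

def unbalanced_paragraphs(text: str) -> list:
    """Return 1-based numbers of paragraphs with unpaired double quotes."""
    return [
        number
        for number, para in enumerate(text.split("\n\n"), start=1)
        if para.count(OPEN) != para.count(CLOSE)
    ]

def tag_quotes(text: str) -> str:
    """Replace curly double quotes with <quote> and </quote> tags."""
    return text.replace(OPEN, "<quote>").replace(CLOSE, "</quote>")

sample = "<p>\u201cDown, Bill,\u201d he said.</p>"
print(unbalanced_paragraphs(sample))
print(tag_quotes(sample))
```

Running the balance check first and fixing whatever it reports keeps the tagged file well-formed.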
There is a required structure to any TEI file, and that includes the XML file we are creating. I’ll show the whole structure here, indented for clarity:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Bill of the Wild Streak</title>
</titleStmt>
<publicationStmt>
<p>bookcove.net</p>
</publicationStmt>
<sourceDesc>
<p>Argosy-Allstory Weekly magazine, April 18, 1925</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<front>
</front>
<body>
(everything we have so far)
</body>
<back>
</back>
</text>
</TEI>
The “everything we have so far” is your existing
book.xml file. Take what you have and add what comes before
it and after it in the outline above. Text indentation doesn’t matter. I
usually left adjust.
If you’ve done it right, there should be no errors in the XML, though
of course it isn’t complete yet. Check at any time with:
xmllint --noout book.xml
Your book.xml should match
files/book-05.xml at this point.
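Where xmllint is not available, Python's standard library can run a comparable well-formedness check. This is a minimal sketch, with a hypothetical helper name; like the xmllint command above, it only confirms that the XML parses and says nothing about whether the file is valid TEI.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        # Parse from bytes so the encoding declaration is honored.
        ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError:
        return False
    return True

sample = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text><body><p>Hello.</p></body></text>
</TEI>"""

print(is_well_formed(sample))
print(is_well_formed("<p>missing close tag"))
```

To check the real file, read book.xml as UTF-8 text and pass its contents to the function.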
This can build an HTML file and a text file now. If you have the
ppxml code installed (from
https://github.com/rbfrank/ppxml), you can run commands to
build text and HTML.
The ppxml code (which is Python) needs the
lxml library. If it isn’t already installed, create a
virtual environment, install lxml, and use that to run the
commands.
To install a virtual environment right in your working
directory:
python3 -m venv venv
source venv/bin/activate
pip3 install lxml
deactivate
Use the Python interpreter in the venv:
venv/bin/python3 ppxml.py book.xml book.txt
venv/bin/python3 ppxml.py book.xml book.html
This short story has a simplified title page. View the source code to any full book published on BookCove for a more complete title page. We will include an illustration and a small title block. Here is the XML for that:
<front>
<div type='frontispiece'>
<figure rend="center">
<graphic url="images/illus-fpc.jpg" width="75%"/>
<figDesc>A dog holding a cougar by the neck.</figDesc>
</figure>
</div>
<div type='titlepage'>
<lg>
<l rend='fs14'>BILL OF THE WILD STREAK</l>
<l>BY</l>
<l rend='fs12 mb10'>Howard E. Morgan</l>
</lg>
</div>
</front>
As shown, that markup goes between the
<front>...</front> tags.
Note: fs14 will be defined in CSS as
font-size: 1.4em and mb10 will be
margin-bottom: 1.0em, etc.
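The naming convention in the note above (two letters for the property, two digits for tenths of an em) is regular enough to express in code. The sketch below is purely illustrative: ppxml does not generate these rules for you, and only the fs and mb prefixes are confirmed by this walkthrough, so no others are guessed.

```python
import re

# Only the two prefixes named in this walkthrough are mapped.
PROPERTIES = {"fs": "font-size", "mb": "margin-bottom"}

def css_rule(class_name: str) -> str:
    """Turn a class name such as fs14 into the CSS rule it stands for."""
    match = re.fullmatch(r"([a-z]+)([0-9]+)", class_name)
    if match is None or match.group(1) not in PROPERTIES:
        raise ValueError(f"unrecognized class name: {class_name}")
    prefix, digits = match.groups()
    value = int(digits) / 10  # the digits are tenths of an em
    return f".{class_name} {{ {PROPERTIES[prefix]}: {value:.1f}em; }}"

for name in ("fs14", "fs12", "mb10"):
    print(css_rule(name))
```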
Before we move on, bring the images folder over from the
files directory to your working directory:
cp -r ../files/images .
The XML markup, based on TEI principles, marks up what the book is, not what it looks like. A block quote is a “block quote” and that says nothing about how it is styled. You add the styling with the CSS file.
For this book, styles are relatively simple. The ppxml
generator provides defaults for most of the defined markup. Your CSS can
either restyle an existing tag, such as
<p>, or define new classes for
things such as font-size changes.
Here is the CSS file styles.css for this book:
body { margin-left: 11%; margin-right: 10%; line-height: 1.25; }
h1 { margin-bottom: 2em; font-weight: normal; text-align: center;
font-size: 1.4em; margin-bottom: 0; }
h2 { text-align: center; font-weight: normal; page-break-before: always;
font-size: 1.25em; margin-top: 3em; margin-bottom: 1em;
margin-left: auto; margin-right: auto; }
p { text-indent: 1.15em; margin-top: 0.1em; margin-bottom: 0.1em;
text-align: justify; }
p.no-indent { text-indent: 0; }
/* title page styling */
.mb10 { margin-bottom: 1.0em; }
.fs12 { font-size: 1.2em; }
.fs14 { font-size: 1.4em; }
/* centering */
.titlepage
{ text-align: center; margin: 2em 0; }
.titlepage p
{ margin: 0.5em 0; }
/* front page separators */
.titlepage {
border-bottom: 1px solid #999;
padding-bottom: 2em;
}
That’s all standard HTML CSS. It all goes in the
styles.css file at the same level as your
book.xml file. It is available in the files
folder if you want to quickly copy it over.
You can rebuild the HTML and text to show the effect of the CSS.
This is a magazine story, so it’s conventional at bookcove to include
a Transcriber’s Note indicating the original issue. Two changes are
needed. First, add this XML inside the
<back>...</back> section:
<div type="notes">
<div type="transcriber">
<p>Transcriber’s note: This story appeared in the April 18, 1925 issue
of <hi rend="italic">Argosy-Allstory Weekly</hi> magazine.</p>
</div>
</div>
Second, that markup introduces a class that needs styling, so add this to the
styles.css CSS file:
/* transcriber's note */
.transcriber {
font-size: 0.9em;
border: 1px solid silver;
margin: 1.8em 8% 0;
padding: 0.3em 2%;
background-color: #DDDDEE;
}
.transcriber p { text-indent: 0; margin: 0; }
This XML file has everything it needs to generate a complete, publishable HTML and text file. Making the EPUB takes a little more setup, which is covered in the next section.
To make the production-ready HTML and text, run those same commands
as before. Generation of the HTML file will append your
styles.css to the internal CSS and generate a standalone
file. Here are the commands:
venv/bin/python3 ppxml.py book.xml book.txt
venv/bin/python3 ppxml.py book.xml book.html
EPUB generation needs metadata about the book. This is stored at the
top level in a metadata.json file. Here is the basic format
for it:
{
"title": "Bill of the Wild Streak",
"contributors": [
{
"name": "Morgan, Howard E.",
"role": "author"
}
],
"language": "en",
"original_publisher": "Frank A. Munsey Company",
"original_publication_date": "1925",
"description": "",
"notes": "",
"keywords": [],
"cover_image": "cover.jpg"
}
Notice there is no description yet. There are no keywords. I’ll use AI to help me with those.
I asked AI (ChatGPT in this case):
Based on the story excerpt I will provide next, write a concise, back-cover–style flyleaf description (3–5 sentences). Then generate a list of no more than 10 SEO-optimized keywords that reflect the story’s themes, setting, and genre. Avoid spoilers. Optimize keywords for ebook and search engine discoverability.
Output the result as valid JSON with two fields: • “description”: a single string • “keywords”: an array of 10 keyword strings
Do not include any explanatory text outside the JSON. Use clear, professional language suitable for literary or archival publication.
I gave it the starting section and it came up with this for the “description”:
On a moonlit hillside at the edge of the wilderness, Bill—a powerful mongrel sheepdog with wolf blood in his veins—stands guard over his master’s flock. Bound by fierce loyalty yet haunted by an inherited hunger for the kill, Bill lives in constant tension between duty and instinct. When danger comes in the night and blood is spilled, his restraint is tested as never before. This quiet, gripping tale explores devotion, trust, and the thin, perilous line between civilization and the wild that still glimmers in a dog’s eyes.
I put that description, in quotes, into the metadata.
It came up with a list of ten keywords, which I put into the
keywords field of the JSON file. The final JSON looked like
this:
{
"title": "Bill of the Wild Streak",
"contributors": [
{
"name": "Morgan, Howard E.",
"role": "author"
}
],
"language": "en",
"original_publisher": "Frank A. Munsey Company",
"original_publication_date": "1925",
"description": "On a moonlit hillside at the edge of the wilderness, Bill—a powerful mongrel sheepdog with wolf blood in his veins—stands guard over his master’s flock. Bound by fierce loyalty yet haunted by an inherited hunger for the kill, Bill lives in constant tension between duty and instinct. When danger comes in the night and blood is spilled, his restraint is tested as never before. This quiet, gripping tale explores devotion, trust, and the thin, perilous line between civilization and the wild that still glimmers in a dog’s eyes.",
"notes": "",
"keywords": [
"sheepdog",
"animal short story",
"wilderness fiction",
"dog protagonist",
"loyalty and instinct",
"man and dog",
"frontier life",
"nature vs civilization",
"wolf ancestry",
"classic animal fiction"
],
"cover_image": "cover.jpg"
}
Note: the description must be on one line for JSON. Use
\n for a line break or \n\n for a paragraph
break, if desired.
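One way to avoid hand-escaping line breaks is to let a JSON library write the file. The sketch below uses Python's standard json module; the helper name metadata_json is invented for illustration, and the dictionary shows only two of the fields from the file above. Line breaks in the description are written as \n escapes automatically, and ensure_ascii=False keeps characters such as curly apostrophes readable.

```python
import json

def metadata_json(metadata: dict) -> str:
    """Serialize metadata; newlines inside strings become \\n escapes."""
    return json.dumps(metadata, indent=2, ensure_ascii=False)

example = {
    "title": "Bill of the Wild Streak",
    "description": "First paragraph.\n\nSecond paragraph.",
}
text = metadata_json(example)
print(text)

# Reading it back restores the original line breaks.
assert json.loads(text) == example
```

Loading the finished metadata.json with json.loads is also a quick way to confirm that the file is valid JSON before building the EPUB.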
Right now, the cover.jpg image is in the images folder.
It isn’t used in the HTML but it is needed for the EPUB. Copy the cover
image to the top level, where book.xml is. That’s where the
EPUB generator looks for it. Note that you can have a
cover.jpg in the HTML images/ directory also
if you want to use it in the generated HTML. The two
cover.jpg files in this case are independent.
Now the EPUB can be created. Following the pattern earlier, it’s the same command but with a different extension on the output file.
To review, for HTML and text:
venv/bin/python ppxml.py book.xml book.txt
venv/bin/python ppxml.py book.xml book.html
and now for EPUB:
venv/bin/python ppxml.py book.xml book.epub
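Since the three builds differ only in the output extension, they can be driven from one small script. This is a sketch under assumptions: it supposes ppxml.py sits in the current directory, as the commands above imply, and that the script itself is started with the interpreter from the virtual environment. The function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

def build_commands(xml_file="book.xml", formats=("txt", "html", "epub"),
                   script="ppxml.py"):
    """Return one command line per output format, without running anything."""
    source = Path(xml_file)
    return [
        # sys.executable is the interpreter running this script (ideally the venv one).
        [sys.executable, script, str(source), str(source.with_suffix("." + ext))]
        for ext in formats
    ]

def build_all(xml_file="book.xml"):
    """Run every build in turn and stop at the first failure."""
    for command in build_commands(xml_file):
        subprocess.run(command, check=True)

for command in build_commands():
    print(" ".join(command))
```

Calling build_all() performs the builds; the loop at the end only prints the commands so they can be inspected first.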
This completes the sample book generation.
If this book will be posted at bookcove, there are some additional steps to take. None of the generation code changes, but additional files are needed to create a complete GitHub repository.
Books on bookcove are all “Born as Git”, which means the only source of truth for them is the GitHub repository.
First, choose an unused bookcove book identifier. It is of the form
bcNNNN where NNNN is an unused 4-digit number.
For this example, I’ll choose bc2914. Start the repository
directory with these commands:
rm -rf bc2914 && mkdir bc2914
cd bc2914
cp ../book.xml bc2914.xml
cp -r ../images ../metadata.json ../cover.jpg ../styles.css .
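The bcNNNN identifier format described above can be checked mechanically. The sketch below is illustrative only: this walkthrough does not say how to find out which identifiers are already taken, so the set of used identifiers is simply passed in by the caller, and both function names are made up.

```python
import re

def is_book_id(candidate: str) -> bool:
    """True for identifiers of the form bc followed by exactly four digits."""
    return re.fullmatch(r"bc[0-9]{4}", candidate) is not None

def is_available(candidate: str, used: set) -> bool:
    """True if the identifier is well formed and not in the used set."""
    return is_book_id(candidate) and candidate not in used

used_ids = {"bc3847"}  # example data; the real list comes from bookcove
print(is_available("bc2914", used_ids))
print(is_available("bc3847", used_ids))
```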
Two more files are needed, a README.md and a
RIGHTS.md.
The README.md file goes in at the top level. It contains:
# Bill of the Wild Streak
Author: Morgan, Howard E.
Original Publisher: Frank A. Munsey Company
Original Publication Date: 1925
## About This Book
On a moonlit hillside at the edge of the wilderness, Bill—a powerful
mongrel sheepdog with wolf blood in his veins—stands guard over his
master’s flock. Bound by fierce loyalty yet haunted by an inherited
hunger for the kill, Bill lives in constant tension between duty and
instinct. When danger comes in the night and blood is spilled, his
restraint is tested as never before. This quiet, gripping tale explores
devotion, trust, and the thin, perilous line between civilization and
the wild that still glimmers in a dog’s eyes.
## About This Repository
This repository contains source files to generate this book in several
output formats for the [bookcove](https://bookcove.net) collection.
### Contents
- `metadata.json` metadata (title, author, publication info, subjects)
- `<filename>.xml` Book content in XML format
- `css/` Stylesheets for different output formats
- `images/` Illustrations and figures
- `cover.jpg` Cover image
- other subdirectories as needed (i.e. `music/` or `fonts/`)
- `RIGHTS.md` and `README.md`
### Output Formats
The source XML is a proper subset of TEI markup. It can be built into
HTML, plain text, EPUB3 or PDF using standard TEI conversion utilities
or the software tools at bookcove.net.
### Part of bookcove
Visit [bookcove.net](https://bookcove.net) for more public domain books or to join our community.
The RIGHTS.md file is also at the top level and
contains:
This work is believed to be in the public domain in the United States. Copyright status in other countries may vary. Users are responsible for verifying the copyright status of this work in their jurisdiction.
At this point, you should have this in your bc2914
directory:
bc2914/
├── bc2914.xml
├── cover.jpg
├── images
│ └── illus-fpc.jpg
├── metadata.json
├── README.md
├── RIGHTS.md
└── styles.css
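Before creating the repository, the layout can be checked with a few lines of Python. This is a sketch with a hypothetical function name. The book identifier is taken from the directory name, and the stylesheet name is passed in explicitly so that it matches whatever name your copy of the file actually has.

```python
from pathlib import Path

def missing_entries(repo: Path, stylesheet: str) -> list:
    """List the expected top-level files and folders that are absent."""
    book_id = repo.name  # for example bc2914
    expected = [
        f"{book_id}.xml",
        "cover.jpg",
        "images",
        "metadata.json",
        "README.md",
        "RIGHTS.md",
        stylesheet,
    ]
    return [name for name in expected if not (repo / name).exists()]
```

For example, missing_entries(Path("bc2914"), "styles.css") returns an empty list when everything is in place, and otherwise names what still has to be copied or written.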
To turn this into a repository, run these commands (or have a bookcove administrator run them for you):
rm -rf .git .gitignore
git init
git add .
git commit -m "Initial commit"
gh repo create bookcovebooks/bc2914 --source=. --push --public
At this point you are done. A periodic script on bookcove will notice the new repository and build everything from that, including the catalog page, the detail page, all media formats, etc.
To post immediately and not wait for the periodic update, run these commands:
(log in to c57d.io as an administrator)
cd /var/www/bookcove/tools
sudo php import_from_github.php
It will immediately be posted.
As of the time of this writing, AI cannot find everything that needs to be fixed. Tests have shown that if a project gets “smoothread” by AI, a very good human proofreader will probably find something else worth checking.
For this book, I had AI proofread the text that this walkthrough generated. It found about 30 corrections to make, all of which were valid.
The magic is in the prompt given. Here’s the prompt I currently use for ChatGPT:
I have completed OCR text recognition of a short story from 1925. I want you to help me spot grammar and punctuation errors, paying special attention to ensuring that all quotation marks are properly paired (every opening quote has a matching closing quote). Please note that italics and emphasis are marked with underscores. Please read the text twice before reporting errors. I would like to give it to you in chunks. If an error is found, please report only the error. List errors plainly.
Here is the prompt I use if I choose to have Claude smoothread the text:
I will provide a blocks of text after these instructions. For each of them, please identify only grammatical errors, spelling mistakes, obvious OCR errors (including common letter substitutions like ‘rn’ for ‘m’, ‘cl’ for ‘d’, ‘u’ for ‘n’, etc.), incorrect spacing around punctuation, smart quote direction errors, missing punctuation (especially after dialogue tags), and incomplete sentences or incorrect punctuation in this text. Don’t suggest stylistic improvements or flag unusual but potentially correct word choices from the original era. Please review the text twice to ensure accuracy and pay special attention to missing periods at the end of sentences.
Note: I also have API versions of these that I use for book-length projects. To have AI smoothread an entire book through the API costs about seven cents. Using the prompts above is free but the story needs to be fed in about 100 lines at a time.
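Feeding the story in by hand is easier if the text is cut into pieces first. The sketch below is one possible approach, not the author's tooling: it assumes plain text with paragraphs separated by blank lines (such as the generated book.txt) and splits only at paragraph boundaries, so no paragraph is cut in half. The function name and the default of 100 lines are illustrative.

```python
def chunk_text(text: str, max_lines: int = 100):
    """Yield pieces of about max_lines lines, breaking only between paragraphs."""
    batch, count = [], 0
    for para in text.strip().split("\n\n"):
        lines = para.count("\n") + 1
        # Start a new chunk if adding this paragraph would exceed the limit.
        if batch and count + lines > max_lines:
            yield "\n\n".join(batch)
            batch, count = [], 0
        batch.append(para)
        count += lines + 1  # the extra 1 is the blank separator line
    if batch:
        yield "\n\n".join(batch)

story = "\n\n".join(f"Paragraph {n}, line one.\nLine two.\nLine three." for n in range(5))
for number, chunk in enumerate(chunk_text(story, max_lines=8), start=1):
    print(f"chunk {number}: {chunk.count(chr(10)) + 1} lines")
```

A paragraph longer than the limit is emitted as a chunk of its own instead of being split.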
To see this book in its final form, visit its page on the bookcove site by clicking on this link.