Duncan’s Word Data Generator for velut
Testing
Open your browser’s console, then click “Run tests”. The tests confirm that the behaviour of my data-generating functions is as I expect.
About
Note: If you’re not me, you’re unlikely to have much use for this Word Data Generator.
The Word Data Generator generates the data for my Latin rhyming dictionary, velut; specifically it generates the “words” data. For each word and corresponding list of lemmata, this generates fields such as the phonetic representation, the number of syllables, the rhyming-part, and strings that words are sorted on.
This webpage is nice for showing you what the generator does. You can input several words (and the lemma or lemmata for each) in the first box, click “Generate Json”, and see the output in the second box. The resultant Json can be downloaded or copied to the clipboard.
But the generator can also be used by simply running one JavaScript file named generator.js in Node, outside of the browser. If it’s running in Node, it reads from a hardcoded filepath and saves its output to another hardcoded filepath. (It also saves the output to a smaller file for each batch of 50,000 words.) Once the output is on my hard drive, I have a script to upload it to the MongoDB database that the velut website uses.
Input and output format
Each line of input must be a word, whitespace, and the space-separated list of lemmata. The “Load sample” button will give you some examples; the examples use a tab for the whitespace between the word and the lemmata, because you get tabs when pasting from Excel cells. But you can use a normal space (or several) if you prefer.
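As a rough illustration of that input format, a line can be split on any run of whitespace, so tabs and spaces are treated alike. This is a minimal sketch, not the code in generator.js; the function name parseInputLine and the sample lines are assumptions made for the example.

```javascript
// Hypothetical helper: parseInputLine is not a name taken from generator.js.
function parseInputLine(line) {
  // Split on any run of whitespace (a tab pasted from Excel, or one or more spaces).
  const [word, ...lemmata] = line.trim().split(/\s+/);
  return { word, lemmata };
}

// Illustrative input: the first line uses a tab, the second uses spaces.
const sampleInput = "coiēns\tcoeō\namor   amō amor";
const parsedLines = sampleInput.split("\n").map(parseInputLine);
console.log(parsedLines);
```

The first token on each line is taken as the word and everything after it as the lemmata, which matches the description above but ignores edge cases such as blank lines.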
The Json generated does not have commas separating the objects, or square brackets around the entire array. This is not the standard Json format, but is the format required by mongoimport (which is the tool my script uses to import into the database).
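The layout described above can be sketched as follows. This assumes one object per line, which is an assumption rather than something this page states; the helper name toMongoimportJson and the shape of the Lemmata value are likewise made up for illustration.

```javascript
// Hypothetical helper: writes objects with no separating commas and no enclosing brackets.
// Putting one object per line is an assumption of this sketch.
function toMongoimportJson(objects) {
  return objects.map((object) => JSON.stringify(object)).join("\n");
}

// Illustrative records using only a few of the fields; the Lemmata shape is assumed.
const sampleRecords = [
  { Ord: 1, Word: "iēns", Lemmata: ["eō"] },
  { Ord: 2, Word: "coiēns", Lemmata: ["coeō"] },
];
console.log(toMongoimportJson(sampleRecords));
// {"Ord":1,"Word":"iēns","Lemmata":["eō"]}
// {"Ord":2,"Word":"coiēns","Lemmata":["coeō"]}
```

Calling JSON.stringify on the whole array instead would add the brackets and commas that the text above says are absent.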
The velut Excel file & how I’ve replaced parts of it
Although the velut website uses a MongoDB database, and this page produces Json data for the MongoDB database, I privately have a large Excel file for generating and storing the data in velut. This webpage is intended to replace a sheet called “wordsform” in that Excel file. The sheet can generate all the data from words and lemmata entered into the second and third columns — which is why the second and third fields of the output are Word and Lemmata, the two columns that don’t have Excel formulae.
One of the differences between the “wordsform” sheet and this Word Data Generator is that in the sheet the output data are in Excel cells, but in the generator they’re in Json format. Copying data from Excel makes them tab-delimited. To convert the tab-delimited data to Json, I use my Json Generator, which is a separate webpage. But I need it less now that I have this Word Data Generator, because the data from this are already in Json format. (The Json Generator is still useful for other sheets in the Excel file.)
The benefit of running generator.js in Node is that I can process all my Latin words (more than 120 thousand) in about twenty seconds. If I tried to process that quantity of data in the browser, my browser would freeze, unsurprisingly! Likewise, Excel would probably crash if I tried to use the “wordsform” sheet to regenerate all the data.
Version control
I track the data-files in Git so I can check whether a change to my code has (inadvertently or deliberately) altered the output. But I don’t track the file that contains all the output — it’s huge. Instead, the Node-only code splits the data into batches of 50,000 words and saves the batches as files, and Git tracks those files.
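The batching step can be sketched like this. It is a minimal illustration under the assumption that the records are held in an array; splitIntoBatches is a hypothetical name, and the real code in generator.js may work differently.

```javascript
// Hypothetical helper; the actual batching code in generator.js may differ.
function splitIntoBatches(items, batchSize = 50000) {
  const batches = [];
  for (let start = 0; start < items.length; start += batchSize) {
    // slice() stops at the end of the array, so the last batch may be shorter.
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// Small demonstration with a batch size of 2 instead of 50,000.
const demoBatches = splitIntoBatches(["a", "b", "c", "d", "e"], 2);
console.log(demoBatches.map((batch) => batch.length)); // [ 2, 2, 1 ]
```

With a little over 120 thousand words and the default batch size, this would give three batches, each of which could then be written to its own file.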
Checking the output in Node
I can also use Node to check the output against all the “words” data I previously generated. The code for this check is at the end of generator.js. When I ran it against all the “words” data I had from Excel, everything matched, except for some cases where I had bugs in the Excel which I have corrected in the JavaScript. These changes of behaviour are listed in the next section.
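A check of this kind could look something like the sketch below. It is not the code at the end of generator.js; findMismatches is a hypothetical name, and “Scansion” is used here as a made-up field name to carry the coiēns example.

```javascript
// Hypothetical helper: compares two arrays of records position by position.
// JSON.stringify is order-sensitive, so this assumes both sides list fields in the same order.
function findMismatches(generated, expected) {
  const mismatches = [];
  const length = Math.max(generated.length, expected.length);
  for (let index = 0; index < length; index++) {
    if (JSON.stringify(generated[index]) !== JSON.stringify(expected[index])) {
      mismatches.push({ index, generated: generated[index], expected: expected[index] });
    }
  }
  return mismatches;
}

// Illustrative data: "Scansion" is an assumed field name, not necessarily one in the real output.
const fromJavaScript = [{ Word: "iēns", Scansion: "⏑–" }, { Word: "coiēns", Scansion: "⏑⏑–" }];
const fromExcel = [{ Word: "iēns", Scansion: "⏑–" }, { Word: "coiēns", Scansion: "––" }];
console.log(findMismatches(fromJavaScript, fromExcel));
```

Each reported mismatch is either a regression in the JavaScript or a bug in the older data, which still has to be decided by hand.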
Behaviour changes between my Excel and JavaScript code
Some of the bugs were visible as inaccuracies on the velut website’s pages for the affected words. For example, coiēns was scanned as –– instead of ⏑⏑–.
- Present participles of compounds of eō that have “i” between two vowels (coiēns, deiēns, introiēns) should have the “i” as vocalic, I believe. But in my Excel file, this “i” was treated as consonantal. (Compare with iēns, where both my Excel formulae and my JavaScript functions correctly handle the initial vowel.)
- iūsiūrandum has both “i”s consonantal. My Excel formulae had kept the first “i” as a vowel (as if it were ïūsjūrandum instead of jūsjūrandum).
Other bug-fixes do not change anything displayed on the velut website. But I wanted all the code here to be correct, even when not (yet) used directly.
- Forms of coiciō beginning with “coic-” are now treated as if they have “cojic-” at their start. This doesn’t really affect anything — the first syllable is still short. (Contrast with coniciō, pronounced conjiciō, with a long first syllable.)
- My Excel formula for RhymeConsonants interpreted “nf” as “mf”. This is because (at least for classical Latin according to velut) you don’t pronounce “-m” at the end of a word, or “-n-” between a vowel and “f” or “s”, but the previous vowel becomes nasalised. The bug was in how the phonetic value of a word got converted back into the consonants as written.
- My Excel formula for IsFitForDactyl was meant to determine whether a word could fit into a dactylic hexameter. But the formula I wrote was too simplistic and it was wrong for many words. The JavaScript version is (to my knowledge) reasonable for all words. The velut website has never used IsFitForDactyl and it doesn’t affect other functions.
Behaviour I might change in the future
Excel (and my lack of formal training in software development) led to me doing some things with the Word Data Generator that I wouldn’t have done if I hadn’t used Excel to create velut.
- You might expect the functions IsLemma, IsNonLemma, and IsFitForDactyl to give simple booleans, true or false. But they give 1 or 0 instead. The word false is valid in Latin, and I didn’t want Excel converting it to a boolean in whatever columns it might appear. Similarly, falsē also becomes a boolean, despite the macron. And my phonetic representation of truae in ecclesiastical pronunciation is true. So I decided that I shouldn’t let TRUE/FALSE appear in the “words” data (in any column, for simplicity’s sake).
  (Extra technical note: These fields are not used on the velut website. But if they were, they would be treated correctly as boolean. I use Mongoose to connect to the database, and I’ve specified in the Mongoose schema that these fields are boolean, so it would convert from 1/0 to true/false automatically.)
- The Ord function gives a serial number to each word, starting at one. I named it years ago as an abbreviation of “ordinal”. If I were starting from scratch, I’d probably call it “Id”, and start it from zero.
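The 1/0 convention can be sketched in plain JavaScript as below. The helper names toFlag and fromFlag are hypothetical and do not come from generator.js; on the website the conversion back to a boolean would be left to the Mongoose schema, as described above, so fromFlag only stands in for that step.

```javascript
// Hypothetical helpers illustrating the 1/0 convention; these names are not from generator.js.
function toFlag(condition) {
  // Store 1 or 0 so that the literal values true and false never appear in the data.
  return condition ? 1 : 0;
}

function fromFlag(flag) {
  // Stand-in for the conversion that a boolean field in a schema would perform.
  return flag === 1;
}

console.log(toFlag(true), toFlag(false)); // 1 0
console.log(fromFlag(1), fromFlag(0)); // true false
```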
Testing in the browser
If you’re not me, you won’t have access to all the input data, nor will you have access to the data from Excel that I compare the output to in Node. But you can run some tests yourself in your browser’s console by clicking the “Run tests” button above. These tests run the following:
- all my JavaScript functions against some words,
- the two fiddliest functions (Phonetic and Stress) against a lot of the words that could befuddle them, and
- a couple more functions against a few words.
My current workflow for managing velut
The Word Data Generator is reliable enough that I’ve begun using it for real. The velut website uses the data generated.
Nonetheless, I’m still using the “wordsform” Excel sheet within Excel, because I don’t want to break other parts of the Excel file. Much of my Excel file still relies on me having all the data in Excel — not in the Json format that this page produces. Eventually enough of the Excel file will be obsolete that I can stop storing all my data there in Excel. Only then will I stop using the “wordsform” sheet.
It’s all part of my long-term project of converting my Excel file into websites and webpages that are easier to share and maintain. I’m very much in a transition period of using the Excel file for some things and my newer websites/webpages for others. But the Word Data Generator is another step in the process. At the moment, the whole velut project is very convoluted; in the future, it won’t be as bad.