About SFM2Web V0.46

These web pages were produced by an alpha (preliminary) version of the “SFM2Web” program, version V0.46, written by Robert Hunt. See here for more information.

SFM2Web is available for free. It is licensed under the GNU GPL licence, version 3.0 or later. See here for more information about GPLv3.

You can download SFM2Web from here.
 

The release notes are immediately below, followed by the "read me" file, and the tutorial, and then the release history list. Finally the "to do" list is at the end.
 


ReleaseNotes.txt

[Back to top of page]


ReadMe.txt

ReadMe.txt for SFM2Web Program V0.46

Last modified by: Robert Hunt    Email: <[email protected]>



    Copyright (C) 2009-2011 Robert Hunt
    Author: Robert Hunt    Email: <[email protected]>
    License: See gpl-3.0.txt


Contents



0. Quickstart

To run SFM2Web to convert SFM files into web pages, you need to do the following:
    a/ Ensure that you have Python3 installed on your computer
    b/ Install SFM2Web
    c/ Run SFM2Web
        This should produce the web pages plus other output files
            for the sample SFM database included

To now run SFM2Web on your own data:
    d/ Edit Projects.txt
            Change UseSampleData from True to False
            Change other settings as desired
    e/ Edit the other control files in the control files folders for each project
            to suit your system and your needs
    f/ Run SFM2Web
        Check Logfiles/SFM2Web_log.txt for errors, correct any errors in your data
            and see if any controls need changing, then repeat steps e and f above
    This should produce the web pages plus other output files
        depending on what you have specified

Once you have it working correctly, if you update your data:
    g/ Just run SFM2Web again
            to back up the existing web pages (if requested)
            and then produce updated pages

[Back to ReadMe top]


1. Introduction

SFM2Web is a Python program that converts materials encoded using SIL's Standard Format Marker format (see a very brief description at https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=glossary#sfm) into web pages. The program can convert a lexicon, interlinear texts, Bible books, a multilanguage phrasebook, and language lessons, etc. into xHTML pages with formatting determined by cascading stylesheets (CSS).

These web pages may then be copied to a web server for publication to the Internet or to an intranet. But these aren't the only uses for the web pages. They can also be viewed on a local computer to view and/or check one's own work. Or they can be sent to a dictionary or translation consultant as a convenient way for him/her to review one's work, especially since the consultant can take advantage of the formatted dictionary and text pages, etc., and the side-by-side or interlinear displays of the Scriptures with back translation, complete with live links into the dictionary/lexicon. They can also be put onto a thumbdrive or burned onto optical media for distribution and viewing on computers without broadband internet access.

SFM2Web is not intended to be a multipurpose program. It is only intended to do one thing: to process SFM (text) files into cross-linked web pages. However, it can do a number of side jobs including:
    a/ validating SFM files for missing fields, fields out of order, etc.
        (This information can be found in the logfile)
    b/ finding vernacular words (e.g., in lexicon examples, texts, and/or Bible)
        that are misspelt or missing from the lexicon
        (This information can be found in a created wordlist in the logfile folder)
    c/ finding glosses that are misspelt or missing from the lexicon gloss language index
        (This information can be found in a separate created wordlist)

We have found many inconsistencies in our dictionary and Scripture files which this program has helped us to track down and correct.
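As a rough illustration of side job b/, finding words that are missing from the lexicon amounts to comparing the words used in the texts against the set of lexicon headwords. The sketch below is not taken from the SFM2Web source: the function name, the tokenising pattern and the sample words are all made up for the example.

```python
import re

def words_not_in_lexicon(text, headwords):
    """Return the sorted words in text that are missing from headwords."""
    # Hypothetical tokeniser: runs of word characters, apostrophes and hyphens
    words = re.findall(r"[\w'-]+", text.lower())
    known = {headword.lower() for headword in headwords}
    return sorted(set(words) - known)

# Invented sample data (not from the SFM2Web sample database)
missing = words_not_in_lexicon("Bata sa balay, bataa sa uma.",
                               ["bata", "sa", "balay", "uma"])
print(missing)
```

A word reported here may be a typing error in the text or a genuine gap in the lexicon, so each one still needs a human decision.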

The SFM2Web program is controlled by a number of control files which can be edited by a text editor, e.g., gedit or kwrite in Linux, or Wordpad in Windows. Once properly set up, the program can be run in a single click without operator decisions (apart from handling any new errors in the data) in order to produce an updated set of web pages.

NOTE: The user of the program is responsible for the copyright of any data which is published by running it through this program and then placing it on a web server. The author of SFM2Web takes no responsibility for the way that this program is used.

[Back to ReadMe top]


2. Intended users

SFM2Web is currently intended for medium-level computer users with a working knowledge of English.

However, if it needed to be set up for a less skilled or educated user, a linguist or computer technician could conceivably set up and test all of the control files, leaving the "naive" user a simple link/shortcut to click to run the program as lexical and/or other SFM data is updated.

[Back to ReadMe top]


3. Control Files

The SFM2Web control files are just simple text files (with the .txt file extension) that are usually found in the ControlFiles folder. You can edit them with any suitable text editor. (If the lines appear all messed up in MS-Notepad, try using MS-Wordpad or another editor that can also handle non-Windows linebreak conventions.)

The control files control the execution of the program. Each file contains a number of named fields that can contain True or False (or On or Off) or a text entry (such as a heading or a language name) or a number of fields separated by spaces (such as words to be ignored).
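To make the idea of named fields concrete, here is a minimal sketch of how such a file could be read. It assumes a "Name = value" layout on each line (as in the Projects.txt examples elsewhere in this ReadMe). The function name and the IgnoreWords field are hypothetical, and the real program may well parse its control files differently.

```python
def parse_control_text(text):
    """Parse 'Name = value' lines into a dict (assumed layout, see above)."""
    truthy, falsy = {"true", "on"}, {"false", "off"}
    controls = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and anything that is not a field
        name, value = (part.strip() for part in line.split("=", 1))
        if value.lower() in truthy:
            controls[name] = True
        elif value.lower() in falsy:
            controls[name] = False
        else:
            controls[name] = value  # plain text; use .split() for word lists
    return controls

# IgnoreWords is an invented field name used only for this example
sample = "UseSampleData = False\nIgnoreWords = the a an\n"
controls = parse_control_text(sample)
print(controls["UseSampleData"], controls["IgnoreWords"].split())
```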

...to be expanded...

Note that SFM2Web can handle multiple sets of control files for different data sets or web site output styles. See below for more information about doing this (using the -c command line parameter).

[Back to ReadMe top]


4. Style templates

The style templates are just simple text files (with the .css file extension) that are usually found in the Templates/Stylesheets folder. You can edit them with any suitable text editor. (If the lines appear all messed up in MS-Notepad, try using MS-Wordpad or another editor that can also handle non-Windows linebreak conventions.)

...to be expanded...

Currently SFM2Web only handles one set of stylesheets for all generated web sites.

[Back to ReadMe top]


5. Polish

This program is not finished nor polished. You should regard it as a "proof-of-concept" only at this stage. It is not even an alpha version. Not all controls do anything in the program, some simply don't work at all or don't work properly yet, and most combinations of controls have not been tested.

If you see room for improvement in the program design, controls, algorithms, layout, styles, colours, etc., see CONTRIBUTIONS below.

You should also be aware that the techniques for guessing roots by removing combinations of affixes are very rudimentary, and will not work at all for some types of languages. What's more, even when they do work reasonably well, they are likely to give some incorrect results.
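To show why this kind of guessing is rudimentary, here is a small sketch of the general technique (stripping one optional prefix and one optional suffix). It is not the algorithm used inside SFM2Web, and the affix lists and sample word are invented purely for illustration.

```python
def guess_roots(word, prefixes, suffixes):
    """Return candidate roots made by stripping one optional prefix and suffix."""
    candidates = set()
    for prefix in [""] + list(prefixes):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + list(suffixes):
            if suffix and not rest.endswith(suffix):
                continue
            root = rest[:len(rest) - len(suffix)]
            if root and root != word:
                candidates.add(root)
    return sorted(candidates)

# Invented affixes and word, for illustration only
print(guess_roots("naglakaw", ["nag", "mag"], ["an"]))
```

A blind approach like this knows nothing about infixes, reduplication or sound changes at morpheme boundaries, which is why some of its guesses are bound to be wrong.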

SFM2Web is developed on Linux with Python 3.1, and has not been tested on other operating systems yet. However, it has been designed in such a way that it should theoretically require few, if any, changes to run on other operating systems.

SFM2Web loads all of its data into memory for processing. If your computer has limited memory, the operating system might need to use a lot of temporary files on the hard disk and run quite slowly. There are no real plans to change this behaviour, since the program is considered to be a "server" type program, rather than a "data entry" type program. So if you have problems, copy your SFM files to a more powerful computer, and install and run SFM2Web from there. (Then you have to move the output files to your web server to make them "live" for the world to access.)

[On my computer with a 2.2GHz dual processor and 4GB of RAM, SFM2Web takes around ten minutes to run with a full lexicon and New Testament, using around 4% (200MB) of memory.]

The format of control files is not guaranteed to be compatible with future releases until V1.0 is released. After that point, the program will automatically issue warnings and helpful instructions concerning changes to the format of the control files.

SFM2Web output has only been tested on Firefox 3 thus far.

[Back to ReadMe top]


6. Use of log files

As mentioned in the introduction above, the three main log files can be used to help clean up the SFM data. Most errors related to the running of the program and the control file settings are stored in SFM2Web_log.txt (stored by default in the 'Logfiles' folder). This includes errors from validation and hierarchical checking if enabled. Vernacular words which are used in example sentences, etc. but which couldn't be located in the lexicon are listed in the 'WordsNotInLexicon.txt' file in the same folder, and gloss language words (e.g., from the sentence translations) which couldn't be found in the reversal index are listed in the 'GlossesNotInLexicon.txt' file.

However, it's easily possible for these files to contain thousands – or even tens of thousands – of lines, especially for the lexicon. This is where it's helpful to filter the logfile(s) in order to view only the lines of interest. The 'grep' program (which is always included with GNU/Linux and which can be downloaded for Windows from <gnuwin32.sourceforge.net/packages/grep.htm>) can be used for filtering.

For example, to only show lines from the log file containing validation errors, use:
        grep -i validation SFM2Web_log.txt
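If grep is not available, the same kind of case-insensitive filtering can be done with a few lines of Python. This is only a sketch: the function name is made up, and the sample log lines are invented (real SFM2Web messages may look different).

```python
def filter_lines(lines, keyword):
    """Return the lines containing keyword, ignoring case (like grep -i)."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]

# Invented log lines; real SFM2Web messages may look different
sample_log = [
    "Validation error: missing field in record 12",
    "Loaded 3 control files",
    "VALIDATION warning: fields out of order",
]
for line in filter_lines(sample_log, "validation"):
    print(line)
```

To filter a real log, pass an open file object instead of the list, since Python iterates over a file line by line.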

More information on using SFM2Web for checking language data can be found below.

[Back to ReadMe top]


7. Python installation

This program requires Python 3.1 or better, which can be freely downloaded from
        https://www.python.org/download/releases/3.1/

NOTE: Many Linux systems have Python 2 already installed. You can determine your Python
    version by typing
            python -V
    from the command line. If you already have Python 2 and are installing Python 3
    from source, you should use the ALTERNATE INSTALL instructions specified in the
    ReadMe file (so that Python 2 and 3 can coexist without breaking your system).
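The version can also be checked from inside Python itself, which is handy in a start-up script. The helper below is a hypothetical sketch and is not part of SFM2Web:

```python
import sys

def python_is_new_enough(version_info=sys.version_info, minimum=(3, 1)):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum

print(python_is_new_enough())
```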

[Back to ReadMe top]


8. Program installation

I haven't ever written an installer for a Python program yet! Maybe you can do it for me?

Currently, place the enclosed files and subfolders in a folder somewhere, open a command window, navigate to your folder and run:
    python3.1 SFM2Web.py

This should do a successful run using the test data. Now navigate to the ControlFiles folder and edit MainControls.txt, changing the setting for UseSampleData from True to False. Now edit all of the control files to suit your project. Run the program again exactly as above, looking at both the terminal output and the log file (in a folder called LogFiles by default) for issues that need to be addressed.

Once the setup files are customised for your situation, you shouldn't need to alter them to run the program again. If your data files have been updated, just run the program again and it will generate new web pages. (The old ones will be saved in a backup folder if you have specified this.) Just move the new pages over to your web server, and it will now display your most up-to-date data.

[Back to ReadMe top]


9. Command line parameters

SFM2Web can accept parameters from the command line. A list can be printed with:
    python3.1 SFM2Web.py --help

Other parameters include:
    --base=FOLDER       To set the base directory for the program
                            so the program can easily be run from a different directory
                            (especially if running it via a script or batch file)
    --folder=FOLDER     To look for the project file in a different location
    --project=PROJECT   To use a different file instead of the default Projects.txt
    --silent            To produce minimal output (can also be set in Projects.txt)
    --quiet             To produce less output (can also be set in Projects.txt)
    --informative       To produce more output (can also be set in Projects.txt)
    --verbose           To produce even more output (can also be set in Projects.txt)
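For anyone scripting around the program, here is a sketch of how an option set like this could be declared with Python's argparse module. It assumes the long options take the usual two-hyphen form and that -p is the short form of the project option. It is not the parsing code from SFM2Web itself (argparse only arrived in Python 3.2), and treating the four output levels as mutually exclusive is my own assumption.

```python
import argparse

def build_parser():
    """Build a parser mirroring the options listed above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="SFM2Web.py")
    parser.add_argument("--base", metavar="FOLDER",
                        help="base directory for the program")
    parser.add_argument("--folder", metavar="FOLDER",
                        help="where to look for the project file")
    parser.add_argument("-p", "--project", metavar="PROJECT",
                        default="Projects.txt", help="project file to use")
    # Assumption: only one of the four output levels may be given at a time
    level = parser.add_mutually_exclusive_group()
    for name in ("silent", "quiet", "informative", "verbose"):
        level.add_argument("--" + name, dest="verbosity",
                           action="store_const", const=name)
    return parser

args = build_parser().parse_args(["-p", "Projects.Site2.txt", "--quiet"])
print(args.project, args.verbosity)
```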

[Back to ReadMe top]


10. Internationalisation

Sorry, I don't really know much about internationalisation of the program itself.

The meta language for the control files is English. However, if a graphical front-end were to be created for editing the control files (see below), the internationalisation could be built-in at that level.

Program error messages and the program log are all in English. I don't plan to change this at this stage.

All HTML classnames and Id fields are in English and this is not planned to change.

Concerning the HTML generation, I've tried not to always assume that English will be the meta language of the created web pages. Specialised fonts can be specified by changing the stylesheet(s), either changing the templates in the Stylesheets folder of the program for permanent changes, or in the Stylesheets folder in the output folder for testing temporary changes. Some English text IS still within the program code for creating HTML but it is planned to remove all of this (moving it into setup files) as the code is reviewed and cleaned up.

Concerning the internationalisation of the input data handling, perhaps cleaning up the handling of punctuation in the SFM files would be a good start as different languages use different symbols. Maybe the control files should specify character sets, so these can be checked against the data. And no doubt, as soon as one or two people start trying the program on their data, dozens of issues are likely to arise.

[Back to ReadMe top]


11. Graphical interface

SFM2Web is a command line program. However, future add-ons could conceivably include a graphical editor for the control parameters that then runs the program automatically. This might be more intuitive for some users.

[Back to ReadMe top]


12. Standards

SFM2Web uses standards as much as possible, both in interpreting its input files, and in creating its output files. However, by use of control files, it also has the flexibility to handle non-standard input files to some degree.

It should also be relatively easy for a Python programmer to adjust the scripts to handle special cases where the designed-in flexibility is still insufficient. (Of course, if the adjustments make the code more generally useful, it would be nice to submit them to the author so they can be added to the main code base for wider distribution – see below.)

The control files are UTF-8 text files. The encoding can be specified for the user's input files. The created web pages should all be valid xHTML 1.0 "strict" and the created stylesheets should all be valid CSS – verifiable for online pages by clicking the relevant icons included at the bottom of every page.

SFMs (Standard Format Markers) are a SIL standard for text files. SFM files use a backslash (\) followed immediately by a field name and some optional text. The fieldname must be terminated by a space, tab, or newline character (or in some cases, an asterisk). Most often, "record or paragraph level" fields will begin with the backslash in the first column of the line; however, some SFM documents or databases will allow backslash fields within the line, often to indicate "character" formatting. SFM is only a very weak standard in that it only specifies the markup format and doesn't necessarily specify the record or field structure of the file. Thus very often users freely define their own record structure plus many use their own "character" level markup schemes. Very often the closure of a field is not explicitly marked and is determined by the user to occur at the occurrence of the next SFM, or at the end of the record, etc.
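The line-level format described above is simple enough to sketch in a few lines of Python. This is only an illustration of the format, not the reader used by SFM2Web: the function name is made up, markers inside a line are ignored, and a line without a leading marker is assumed to continue the previous field.

```python
import re

# Backslash, then a field name ended by a space, tab, or end of line
SFM_LINE = re.compile(r"^\\(\S+)[ \t]?(.*)$")

def parse_sfm(text):
    """Split SFM text into (marker, content) pairs, one per line-initial marker."""
    fields = []
    for line in text.splitlines():
        match = SFM_LINE.match(line)
        if match:
            fields.append([match.group(1), match.group(2)])
        elif fields and line.strip():
            # A line without a leading marker continues the previous field
            fields[-1][1] += " " + line.strip()
    return [tuple(field) for field in fields]

# A tiny invented lexicon record using common dictionary markers
sample = "\\lx balay\n\\ps n\n\\ge house\n\\xv Dako ang\nbalay.\n"
print(parse_sfm(sample))
```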

MDF (Multi-Dictionary Formatter) is a SIL standard for dictionaries/lexicon databases. It is more specialised and more specified than SFM format as it clearly specifies which standard format markers must be used for which field, and somewhat specifies a hierarchical structure that indicates which SFMs must occur in each record and in what order. MDF files contain a few database controls at the top of the file, and then following that, a "record" structure where each record must begin with the same SFM.

USFM (Unified Standard Format Markers) is a UBS/SIL standard for Scripture markup. (See <https://ubs-icap.org/usfm>.) It is more specialised and more specified than SFM format as it specifies a large set of allowable "paragraph" markers, plus "character" markers which can occur within a "paragraph" field. Character markers are usually expected to be closed with a repeat of the marker followed by an asterisk, but may also be assumed to close at the end of the "paragraph" level field.
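The explicit closing convention for character markers can be illustrated with a regular expression. This sketch only finds spans that are closed with the marker plus an asterisk. It does not handle spans that are left to close at the end of the paragraph, and it is not how SFM2Web itself reads USFM. The function name and sample verse are invented.

```python
import re

# \marker text\marker*  (the closing marker repeats the opening one plus an asterisk)
CHAR_SPAN = re.compile(r"\\(\w+) (.*?)\\\1\*")

def character_spans(paragraph):
    """Return (marker, text) for each explicitly closed character span."""
    return CHAR_SPAN.findall(paragraph)

verse = r"\v 1 In the beginning \nd God\nd* created \add the\add* heavens."
print(character_spans(verse))
```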

The output of SFM2Web can be automatically packed and compressed into a ZIP archive ready to be transferred to your web server.
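For reference, packing a folder of generated pages into a ZIP archive takes only a few lines with Python's standard zipfile module. The sketch below shows the general idea on a throwaway folder. The function and file names are made up, and this is not necessarily how SFM2Web builds its own archive.

```python
import os
import tempfile
import zipfile

def zip_folder(folder, zip_path):
    """Pack every file under folder into a compressed ZIP archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to the folder so the archive unpacks cleanly
                archive.write(full_path, os.path.relpath(full_path, folder))
    return zip_path

# Demonstration on a throwaway folder (all names here are made up)
with tempfile.TemporaryDirectory() as work:
    site = os.path.join(work, "site")
    os.makedirs(site)
    with open(os.path.join(site, "index.html"), "w", encoding="utf-8") as page:
        page.write("<html></html>")
    archive_path = zip_folder(site, os.path.join(work, "site.zip"))
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
print(names)
```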

[Back to ReadMe top]


13. Compatibility

While this remains a "proof of concept" version (not even an alpha version), no effort will be made to retain compatibility of control files with older versions. Thus control fields may be renamed without any warning other than that either the program will produce a fatal error, or your web output simply might not look correct. Sorry, but that's life.

If ever V1.0 is released, any future changes to the control files will be documented and the program will automatically give warnings to the user.

[Back to ReadMe top]


14. Licensing

SFM2Web is licensed under the GNU General Public License V3.0. More details can be found in the file 'gpl-3.0.txt' included with this program. But you should at least be aware of the following:


      This program is free software: you can redistribute it and/or modify
      it under the terms of the GNU General Public License as published by
      the Free Software Foundation, either version 3 of the License, or
      (at your option) any later version.


      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
      GNU General Public License for more details.


      You should have received a copy of the GNU General Public License
      along with this program. If not, see <www.gnu.org/licenses/>.

The sample data files are all Copyright (C) by SIL Philippines and are included only to demonstrate the use of the program. They may not be published on an Intranet except to demonstrate the output of the program, and they may certainly not be published on the Internet. Please note that the sample lexicon is only a portion of the real file.

[Back to ReadMe top]


15. Contributions

Contributions which make the code or control files more useful or flexible are welcomed, as are improvements to the xHTML templates and the CSS stylesheets. If you are willing to submit more attractive templates and/or stylesheets, these could possibly be included in the distribution as alternatives.

I don't know anything about merging "patches" to program files yet, so if you submit a "patch", either you have to be prepared to include instructions on how to use it, or else I will probably just merge it manually.

Please note that the structure of the prototype program is very tentative, as it will soon be tidied up and modularised. So don't try to do that yet. And once that's done, a test program suite should be written for quality control.

Also, if you are willing to contribute databases to be tested, particularly if they currently fail to be processed yet follow a well-known or standard format, then that would be helpful. These can be treated confidentially. (However, the author is updating this program as a hobby / spare-time project, so no promises or deadlines can be given to any database contributors.)

[Back to ReadMe top]


16. Future plans

Making sure that it works easily for MDF lexicons is a priority. So is including more USFM Bible fields – so far it only contains the subset that the author has used in his project.

If someone would volunteer to donate some of their artistic talent to the xHTML templates and CSS stylesheets, that would be great. Thus far the main focus has been on getting the program working and making it more generally useful, rather than on making it pretty.

It might perhaps be nice to include an option to produce output in ISO standard Open Document Format for offline use on an Intranet or on personal computers (but then I'd probably have to change the name of the program). That would also give a path for easy production of nicely formatted PDF files for printing or for hosting on the Internet.

[Back to ReadMe top]


17. Inspiration

This program was largely inspired by the SIL Lexique Pro software. (See <www.lexiquepro.com>.) Lexique Pro is able to take a SFM lexicon and index and display it, as well as export it in many forms, including MS-Word (TM) files, PDF files, and HTML files. It can also encrypt a lexical database for limited distribution.

Being also inspired by the Free Software Foundation, SFM2Web does not include encryption or other facilities to limit the distribution of lexical data. It is mainly intended for users or communities who are interested in making their data collections available to as many people as possible.

Finally, this program was inspired by the great prophet and teacher, Jesus Christ, who said, "Freely you have received, freely give." (Matthew 10:8b, NIV)

[Back to ReadMe top]


18. Comparison with other programs

Other programs which have overlapping functions with SFM2Web include:
    a/ Lexique Pro (LexPro) from SIL (See <www.lexiquepro.com>.)
        Lexique Pro is able to take a SFM lexicon and index and display it, as well
        as export it in many forms, including MS-Word (TM) files, PDF files, and HTML
        files. It can also encrypt a lexical database for limited distribution.
    b/ Prophero from SIL (in development)
        Prophero is able to take OurWord or USFM Bible files through an OSIS (XML) path
        and produce an HTML or xHTML website with full concordancing and nice
        highlighting.
    c/ Paratext 7 (PT7) from UBS (in development)
        Paratext is a full USFM Bible-editing and checking environment, but it can also
        do an export to HTML (Use Tools/Export Project to HTML...). The user can choose
        either the current book or all books and select the output folder. Everything
        else is automatic.

Here is an attempt at a comparison between these four programs:

                                         LexPro   Prophero     PT7      SFM2Web 
    Free (in cost)                          Y         ?         N         Y 
    Freely available to all users           Y         ?         N         Y 
    Downloadable now                        Y         N         Y         N (coming on sfm2web.sourceforge.net) 
    Full Unicode (with UTF-8 output)        Y         Y         Y         Y 
  INTERFACE 
    Graphical interface                     Y         Y         Y         N (but could be built on as an additional program) 
    One click site update (once setup)      N         ?         N         Y 
  INPUT 
    Handles lexicon                         Y         N         N         Y 
        MDF                                 Y         -         -         Y (hyphen/dash means not applicable) 
        Custom                              Y         -         -         Y 
        Photos in lexicon                Manual       -         -       Auto 
        Sound files in lexicon           Manual       -         -       Auto 
        Classification system            Moe-DDP      -         -   Moe-DDP/HRAF-OCM 
        Aware of morphology                 ?         -         -       Basic 
        Handles two lexicons for same lg    N         -         -         Y 
    Handles phonology                       N         N         N       coming 
        Does some automatic analysis        -         -         -       coming 
    Handles grammar sketch                  N         N         N     hopefully 
    Handles readers / literacy materials    N         N         N         Y 
    Handles interlinear texts               N         N         N         Y 
        Links texts to lexicon              -         -         -         Y 
        Links lexicon to texts              -         -         -         Y 
    Handles multilingual phrasebook         N         N         N         Y 
        Drawings in phrasebook              -         -         -     hopefully 
        Links phrasebook to lexicon         -         -         -         Y 
        Links lexicon to phrasebook         -         -         -         Y 
    Handles language lessons                N         N         N         Y 
        Sound clips in language lessons     -         -         -       coming 
        Links language lessons to lexicon   -         -         -         Y 
        Links lexicon to language lessons   -         -         -         Y 
    Handles health books                    N         N         N       coming 
    Handles Bibles                          N         Y         Y         Y 
        Inputs USFM files                   -         Y         Y         Y 
            Handles one chapter per file    -         Y         Y         N 
        Inputs OSIS format files            -         Y         N         N 
        Inputs OurWord format files         -         Y         N         N 
        Does side-by-side version           -         N         N         Y 
        Does interlinear version            -         N         N         Y 
        Includes back matter                -         Y        ???        Y (glossary) 
        Makes complete concordance          -         Y         N         Y 
        Makes topical concordance           -         N         N         Y 
        Links Bible text to lexicon         -         -         -         Y 
    Handles songbook                        N         N         N         Y 
    Handles multiple languages on one site  N         Y         N         Y 
  OUTPUT 
    Automatically gathers statistics        N         N         N         Y 
    Uses frames / Javascript               Y/N        Y         N        Y/N 
    HTML output validates with w3.org       N         ?         ?     Strict xHTML 
    All formatting in CSS (Stylesheets)    Most       ?         N         Y 
    Can produce PDF files                   Y         N         N         N 
    Can produce OpenOffice Writer files     N         N         N         N 
    Can produce MS-Word files               Y         N         N         N 
    Can encrypt output                      Y         N         N         N 
    Can produce ZIP file for uploading      N         N         N         Y 
    Can produce ZIP files for downloading   N         N         N         Y 
    Auxiliary programs for CD autorun       N         Y         N         N 
  PROGRAM 
    Open source code (freely customizable)  N         N         N         Y 
    Platforms                              WIN       WIN       WIN       ALL 
    Source language                       Delphi     ???       ???     Python3 

[Back to ReadMe top]


19. Using SFM2Web for checking Lexicons

If you use the SIL Toolbox program as your dictionary editor, it does not necessarily impose limitations and restrictions on the contents or orders of fields. This allows SFM databases to contain many kinds of format and structure errors. SFM2Web can help to discover and list these errors (for you to correct).

However, please be warned that a typical language learning dictionary with two to ten thousand entries can generate many tens of thousands of warning messages. You should study section 6 above about how to use 'grep' as a filter to only show lines of particular interest out of the log and wordlist files.

Note that SOLID from <https://projects.palaso.org> is also a tool for checking and cleaning up SFM databases and can be obtained from the SOLID home page.

More coming...

[Back to ReadMe top]


20. Using SFM2Web for checking Bibles

If you use the Paratext or Bibledit programs as Bible editors, they do not necessarily check your entered data—various checks have to be manually run. Also, they make no attempt to check across projects, e.g., to check that a Back Translation project conforms to the corresponding Vernacular project. SFM2Web can do many Scripture checks and can also be used to check that a Back Translation matches its vernacular source. It can also check that the book name abbreviations given in section and cross references are valid.

However, the same warning applies as for lexicons—the first time you run a check you may be discouraged to receive many hundreds of warning or error messages. The good news is that as you gradually reduce these warnings, the quality of your publication is improving.

More coming...

[Back to ReadMe top]


21. Handling multiple projects (on one site)

If you are using SFM2Web to combine several independent language projects onto one web site, SFM2Web can do this and even make an overall cover page. Follow these steps:
    a/ Edit the Projects.txt file and add lines for
        Project2Name = (your project name here—will show on cover page)
        Project2ControlFileFolder = (folder path here)
    b/ You can go up to project #9 but the numbers must be consecutive (starting from #1)
    c/ Adjust the block of controls under "Overall cover pages" to your requirements
    d/ Adjust each project's control files to your requirements
    e/ Just run the program as normal and it should process each project in turn,
            and then create the cover page(s).
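The numbering rule in step b/ (consecutive numbers starting from 1, up to 9) is easy to get wrong when projects are added or removed, so here is a small sketch of a check for it. The function is hypothetical and not part of SFM2Web, and it assumes that the first project's field is named Project1Name by analogy with Project2Name above.

```python
import re

def project_numbers_ok(control_names, maximum=9):
    """Check that ProjectNName fields are numbered 1, 2, 3... without gaps."""
    matches = (re.fullmatch(r"Project(\d+)Name", name) for name in control_names)
    numbers = sorted(int(match.group(1)) for match in matches if match)
    consecutive = numbers == list(range(1, len(numbers) + 1))
    return bool(numbers) and consecutive and len(numbers) <= maximum

print(project_numbers_ok(["Project1Name", "Project2Name", "Project2ControlFileFolder"]))
print(project_numbers_ok(["Project1Name", "Project3Name"]))
```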

[Back to ReadMe top]


22. Handling multiple sites (on one computer)

If you are using SFM2Web to make several independent web sites, this can be done by following these steps:
    a/ Copy the Projects.txt file to a file of another name
        e.g. Projects.Site2.txt
    b/ Edit the new file to suit the new site
    c/ Be sure to change the LogfileFolder control if you want the logfiles for each site
        to be saved separately
    d/ Run the program with the following command line
        python3.1 SFM2Web.py -p Projects.Site2.txt
        (python3.1 SFM2Web.py --help lists all available command line options)

[Back to top of page]


Tutorial.txt

[Back to top of page]


ReleaseHistory.txt

[Back to top of page]


ToDo.txt

[Back to top of page]