General
Scope
This document describes a set of rules, not specific to any project, for creating and maintaining documentation and other kinds of materials. This document itself also conforms to these rules.
Notation
Specific formats may be used. The visual output may depend on the method of rendering.
Texts intended to be handled differently (for example, portions of program source code) may be in some specific format. Other texts are considered normal.
Hyperlinks may be used in normal texts pointing to external references, local pages, or anchors in specific documents.
Normal text emphasized in general is in a specific format, usually (visually) bold.
Local terms in normal text are emphasized at their first appearance in a specific format, usually (visually) italic.
For terms used globally in this document and any derived documents, see below.
Terms and definitions
Resources
Contents of materials are split as resources (e.g. files) in possibly nested namespaces (e.g. directories). A namespace is also considered as a resource for convenience.
Paths and identifiers
A path is used to identify or locate a resource, and can be in various forms (e.g. filesystem path or URL).
A path may have several components denoting different levels of namespace or the last-level non-namespace resource.
An empty path is a path without any components.
A path with more than one component shall have syntactic separators (e.g. a slash(/) or whitespace) to split different components.
An identifier is a path with exactly one component and without any separators, which can be used to differentiate resources in the same namespace or to collectively name some sets of resources in various namespaces.
A resource may be denoted by an identifier or path that is not necessarily unique. However, all resources discussed below in this document are named.
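The definitions of paths, components and identifiers above can be illustrated with a minimal sketch. The function names `split_path` and `is_identifier` are hypothetical, and the slash is only assumed as the separator for illustration; other separators are equally allowed by the rules.

```python
def split_path(path: str, separator: str = "/") -> list[str]:
    """Split a path into its components; an empty path has no components."""
    return [part for part in path.split(separator) if part]


def is_identifier(path: str, separator: str = "/") -> bool:
    """An identifier is a path with exactly one component and no separator."""
    return separator not in path and len(split_path(path, separator)) == 1
```

For example, `split_path("doc/en/Readme.md")` yields 3 components, of which only the last one denotes a non-namespace resource.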
Languages
Rules of natural languages are specified in this subclause. They have effects on normal texts.
Noun phrases in normal text may be followed, at their first occurrence, by embedded translations for different natural languages or by more detailed descriptions, in parentheses (( and )).
Different letter cases (if appropriate) may be used for sentences, acronyms and words in the titles of clauses.
Editions in languages
A set of documentation may be in one (natural) language. In that case, the IETF language tag with at least one subtag and an additional prefix dot(.) shall be placed at the end of the identifier of the resource, before the dot and the extension name (if any). Otherwise the documentation shall be in multiple languages or without text contents (e.g. containing only ideographic images), and no language code shall be in the identifier of the resource.
When the additional dot and tag are removed, all distinct resources with the same name shall refer to the same set of contents, differing only in language; otherwise at least one of them shall be incomplete, which means that it is to be completed as in the former case. Each such resource is one edition of the documentation in the specific language.
Unless explicitly specified, when the meaning is in conflict among multiple editions in different languages, the complete one shall be valid over the others. If there is not exactly one complete edition, validity is determined in the following order:
- en-US
- en
- zh-CN
- zh
NOTE The form of these literals follows the recommendation for IETF language tags, specifically, the "language" and "region" syntax elements in RFC 5646.
If no edition in the above languages is complete, the documentation is defective.
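The naming and precedence rules for editions can be sketched as follows. This is only an illustration under stated assumptions: the names `PRECEDENCE`, `edition_tag` and `pick_valid_edition` are hypothetical, and a tag is recognized only when it is one of the 4 tags in the precedence list (a real implementation would need a proper language tag parser to tell a tag from an extension name).

```python
# Precedence of editions, as listed in this subclause.
PRECEDENCE = ["en-US", "en", "zh-CN", "zh"]


def edition_tag(identifier: str, known_tags=frozenset(PRECEDENCE)):
    """Return the language tag embedded in a resource identifier, or None."""
    parts = identifier.split(".")
    candidates = []
    if len(parts) >= 3:
        candidates.append(parts[-2])  # form: name.TAG.ext
    if len(parts) >= 2:
        candidates.append(parts[-1])  # form: name.TAG (no extension name)
    for candidate in candidates:
        if candidate in known_tags:
            return candidate
    return None


def pick_valid_edition(complete_tags):
    """Pick the tag of the valid edition among the complete ones.

    Returns None when the documentation is defective by this rule.
    """
    tags = set(complete_tags)
    if len(tags) == 1:
        return next(iter(tags))  # the only complete edition is valid
    for tag in PRECEDENCE:
        if tag in tags:
            return tag
    return None
```

With these assumptions, `Readme.en-US.md` and `Readme.zh-CN.md` are 2 editions of the same documentation, and the en-US one takes precedence when both are complete.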
A language tag may be used to annotate one or more words in text. An annotation of such use is a language tag annotation, which consists of a tag combined with one pair of enclosing parentheses (namely, ( and )).
Hyperlinks in pages should preferably link to localized contents corresponding to the language or one of the major languages used in the page (if any) when suitable. If the contents of the linked target are in other languages (esp. when there is more than one semantically identical edition in multiple languages), at least one language tag for the majority of the contents should be noted immediately after the hyperlink; otherwise, the tag should be omitted.
For compatibility with client programs, each URI in a link should be encoded in the normalized percent-encoded form defined in RFC 3986.
Additionally, certain hyperlinks are normalized to the same form for a specific language. Currently the rule consists of the following cases:
- For Chinese Wikipedia(zh-CN), the link shall follow the rule in the section in the language conversion help page(zh-CN), with lowercase "cn" in the link.
In English
Stylistic usage of letter cases shall be respected in the following precedence:
- All uppercase should not be used normally.
- Acronyms and other proper nouns (or noun phrases) shall be in the appropriate styles.
- The title case style shall be used for page or document titles.
- Either the title case or the sentence case shall be used in the titles in a page. This shall be consistent within a document.
- Either the title case or the sentence case shall be used in the detailed descriptions for acronyms in parentheses. This may vary in the same page.
- Detailed descriptions for acronyms in parentheses may use title case or sentence case.
- All lowercase style shall be used for words in the embedded translations or detailed descriptions in parentheses in other cases.
- Sentence case should be used otherwise.
English wording in documentation is intended to conform to the ISO/IEC Directives, Part 3.
NOTE The use of modal verbs differs from that in RFC 2119.
For wording referenced from RFC documents, RFC 2119 is preferred; the clarification on letter case (i.e. RFC 8174) is not necessarily applied to documents published earlier than RFC 8174, due to compatibility issues.
The following grammatical forms of English (with en or en-US tags) are considered idiomatic, and the application of such forms may be preferred:
- answer ellipsis to elide the subject in the summary of commit messages where a question for the topic of the log message is assumed
- bare passive clause omitting the auxiliary verb for short descriptive notes (e.g. commit messages in repositories and assertion messages in programs)
- null subject and pronoun dropping in imperative forms
- zero article for singular form of a countable noun denoting a specialized term being referenced, usually used in a terse-style title or in a list term (like this line)
Informative notes: The tense and mood used in the logs in version control systems are opinion-based. However, the rules implied here are chosen to avoid imperative forms by default, because:
- First, as in any information processing system, it should be made clear whom the messages in the logs serve.
- Version control systems are capable of both reading and writing operations on the version history, with asymmetric operational frequency in general.
- For most stakeholders of a repository in most cases, read-only accesses to the version history are more frequent than changing operations.
- This is also consistent with the idiom used in programming: do not abuse imperative updates with side effects.
- For most users, commit logs are entries in a journal of the version history.
- They do not and should not care about imperative changes from the logical perspective.
- Unconstrained changes in the version history, as effectful operations, can easily make a mess.
- They are usually well-behaved only within some local context (e.g. in a single branch of a reliable instance of the version history).
- They often cause trouble in other cases (e.g. when extracted as patches that are possibly reordered).
- Messages in the logs may have to cooperate with other instances of the version history.
- No message in the imperative mood can safely assume that the changes described will always be applied in exactly the same way.
- As mentioned above, out-of-order changes make a mess. If the messages are precise, they make a mess as well, like other changed contents.
- In general, messages in the logs work for distributed repositories.
- There is simply no standpoint for a global view of the universe of the version history by default.
- Messages should be ready to be audited by random access, besides being applied in sequence in some replays.
- These facts further undermine the necessity of imperative changes.
Format-specific rules
Text files
Unless otherwise specified, all text files should be encoded as UTF-8 with BOM enabled.
Any use of an encoding which may not be converted verbatim and losslessly in binary form to UTF-8 shall be explicitly specified in documentation.
No UTF BOM shall exist in a derived text format whose normative specification explicitly excludes the BOM.
- Rationale The specification of such a text format may assume that the contents are all payload in a known UTF encoding.
BOM should be omitted for text files dedicated to tools incapable of handling it properly. Otherwise, BOM shall be used where possible when it can clarify the encoding being used.
Unless definitely intended and explicitly specified in documentation, newlines shall be consistent. The default newline is CR+LF.
Two consecutive newlines logically indicate an EOF. Consecutive newlines outside verbatim quoted text (including source code) should only be used at EOF.
There are also some default rules on typography implemented by ordinary characters in plain texts:
- No space characters should be at EOL.
- For text other than verbatim quoted text, no more than one whitespace character should be used to represent a single indent, except where there is a preferred combination in the language.
- Rationale By default, no more than one whitespace character should be used to represent an indent, because there should be no chance to insert a character in the middle of an indent.
- NOTE An example of a preferred exceptional case is the hanging indent (in the first line of a paragraph) in East Asian languages, where a dedicated combination of fullwidth whitespace characters is preferred. Typically, the sequence consists of 2 ideographic spaces (U+3000).
- NOTE To keep the semantic rules clear, when possible (in horizontal texts and outside the context of making tables) and when no other form is preferred by the rules of the language, use the horizontal tab character(U+0009) instead of other spaces (i.e. U+0020) to indent.
- For Western languages, except at the beginning of a line, each word consisting of alphanumeric characters should be separated from other words by a single space character (U+0020).
- Space characters (U+0020) should be used for alignment when portability is required.
- Rationale This makes the visual effect easy to predict in the usual settings with monospaced fonts in contexts like source code of programs.
- NOTE Other spaces like non-breaking space (U+00A0) may be better in specific uses, but are not as portable as U+0020.
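Some of the default rules for text files above are mechanical enough to be checked by a script. The following is a minimal sketch rather than a complete checker: the function name `check_text` and the wording of the reported problems are hypothetical, and only the default rules on BOM, newlines and spaces at EOL are covered.

```python
import codecs
import re


def check_text(data: bytes) -> list[str]:
    """Report violations of a few default text-file rules in raw file bytes."""
    problems = []
    if not data.startswith(codecs.BOM_UTF8):
        problems.append("missing UTF-8 BOM")
    # "utf-8-sig" drops a leading BOM if present; invalid UTF-8 raises an error.
    text = data.decode("utf-8-sig")
    # After removing every CR+LF pair, any CR or LF left over is inconsistent.
    leftover = text.replace("\r\n", "")
    if "\r" in leftover or "\n" in leftover:
        problems.append("newline other than CR+LF")
    for number, line in enumerate(re.split(r"\r\n|\r|\n", text), start=1):
        if line.endswith(" "):
            problems.append(f"line {number}: space at EOL")
    return problems
```

An empty result only means that none of these particular defaults is violated. Files intentionally deviating from the defaults (e.g. files without BOM for tools which cannot handle it) would have to be excluded by the caller.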
Markdown
Names of Markdown files should have the .md extension.
NOTE Several rules here can be enforced by markdownlint tools like markdownlint-cli2. A compatible configuration file is included in this project.
NOTE Markdown is a format derived from text files. The rule on consecutive newlines is covered by a modified setting of markdownlint rule MD009 with "strict" mode.
Dialects
Unless explicitly specified elsewhere, only common dialects are to be used. Currently this should be GFM (GitHub Flavored Markdown).
NOTE This is not GLFM (GitLab Flavored Markdown), which was also formerly abbreviated as GFM.
If the content may be presented on Bitbucket wiki, stricter rules apply, notably:
- There is currently no inline HTML support.
- There is currently no anchor support.
NOTE This repository is not currently intended to be deployed on Bitbucket wiki. The stricter rules for Bitbucket wiki above are not applicable here.
Syntactic restrictions
As text files, Markdown files shall obey the same rules above. The indentation rule is necessary to avoid some compatibility issues, e.g. this.
As specified, reserved characters defined by RFC 3986 should be percent-encoded. Notably, the parentheses(()) in hyperlinks shall be encoded to make the links more fault-tolerant for some editors.
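As an illustration of the encoding rule, the following sketch uses `urllib.parse.quote` from the Python standard library to percent-encode one path segment of a link target. The helper names and the base URL `https://example.org/wiki/` are hypothetical. Encoding every reserved character with `safe=""` is an assumption that only fits a single path segment, not a complete URI.

```python
from urllib.parse import quote


def encode_path_segment(segment: str) -> str:
    """Percent-encode one path segment, including parentheses."""
    # safe="" leaves only unreserved characters (letters, digits, "-._~") as-is.
    return quote(segment, safe="")


def markdown_link(text: str, base: str, segment: str) -> str:
    """Build a Markdown link whose target has no raw parentheses."""
    return f"[{text}]({base}{encode_path_segment(segment)})"
```

For example, the segment `Foo_(bar)` becomes `Foo_%28bar%29`, so the closing parenthesis of the Markdown link syntax is the only one left in the link.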
Headers should be prefixed by #s.
There should be no redundant characters (esp. whitespace characters) between the annotated words and the annotation, even if there are whitespaces in the words. The annotation in this rule includes any language tag annotation defined in the previous subclause.
Rationale This is for the sake of compact annotation representations, as well as to ease the transition from plain texts to structured ones (e.g., to align with the Markdown links without redundant whitespaces between [] and ()).
NOTE The whitespace rules for the language tag annotation are also applicable. Alternatively, a word combination may be used instead of the annotation when grammatically correct, in which case this rule does not apply.
Code block names
Language names are required on code blocks.
NOTE This is covered by markdownlint rule MD040.
The following requirements hold for the names:
- Use the language names in a well-known style, unless explicitly noted here.
- Currently, this is specified in the list of supported languages of highlight.js.
- For all other cases (including program output), use `text`.
- Rationale Proper language identifiers make syntax highlighting and other consumers of Markdown work appropriately.
- As an exception, custom names can be used only when they are either explicitly defined or mentioned unambiguously in the document contents.
- Rationale This is only for human readers of the Markdown source. Otherwise it should be like `text`.
- Use the most proper language names that actually reflect the contents of the code with maximum portability in the intended use cases, even if they are equivalent in some implementations of the consumers of the Markdown code.
- Use `shell` when the code conforms to the POSIX shell language.
- Use `console` when the code conforms to both shell and DOS/Windows batch scripts.
- Use `dos`/`bat`/`cmd` contextually. The differences between batch commands in an interactive script environment and in files shall be respected.
- Rationale These rules make the intent clear to the human readers of the Markdown source.
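The requirement of language names on code blocks is normally checked by markdownlint (rule MD040). For illustration only, the idea can be sketched in a few lines. The function name `unnamed_code_blocks` is hypothetical, and the sketch assumes that all fences consist of exactly 3 backticks and are properly paired, which is simpler than the real GFM rules.

```python
FENCE = "`" * 3  # built indirectly so that this sketch can sit in a fenced block


def unnamed_code_blocks(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences lacking a language name."""
    missing = []
    inside = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside:
            inside = False  # this fence closes the current block
        else:
            inside = True  # this fence opens a new block
            if not stripped[len(FENCE):].strip():
                missing.append(number)
    return missing
```

A block holding program output would pass this check once its opening fence is named `text`.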
Cross references
This document is used by the YSLib project. It may also be referenced by other repositories.
Except for the following list, do not edit unless absolutely necessary.
Known referenced by:
Annex (Informative)
Alternative incompatible rules
Usually the documentation rules here are compatible with other rules in various specifications. However, some well-known rules are considered overspecified (albeit not rigorous) and of insufficient quality as specifications. Thus, these rules are deliberately kept incompatible and are never accepted here:
- The specification may be too vague because it fails to separate the conformance rules from the suggestions, so it is difficult to manually verify conformance just from the specification text.
- Some confusion may come from the lack of rules on modal verbs.
- Some rules may be underspecified for external resources. For example, the claim of "be a valid Markdown file" is unclear without further notes, because there is no unique standard to determine the definition of "valid", since there are multiple dialects of the Markdown language and no flavor is definitely more representative than others.
- The rules mandating letter cases (in particular, capitalization) may be too restrictive.
- This may be generally too subjective. It can be good to stick to a well-known name to ease use cases with technical merits (such as machine verification), but a fixed letter case in the spelling may be overspecified.
- The exception is when the name is standardized and machine-oriented by default.
- As a notable instance, RFC 5646 recommends but does not mandate the capitalization for the codes from ISO 639-1 and ISO 3166-1, while the preferred capitalization diverges in the 2 standards.
- On the other hand, mandating "README" instead of "Readme" is too restrictive. It is problematic when contents are transferred between case-insensitive environments and case-sensitive environments (e.g. names in filesystems), where one environment may allow entries of `README` and `Readme` coexisting but another may not.
- When it is technically feasible to have different cases coexisting, "README" and "Readme" are symmetric, i.e. neither is definitely preferred over the other for machines. It is then not intuitive to reason why "README" must be preferred to "Readme" instead of the exact opposite in the specification, in particular given that such an entry is mainly created for human readers rather than machines.
- Instead, keeping the spelling overridable, with a recommended default form (which is not necessarily all capitalized), is better for both portability and other needs.
- Further, names like "README.md" are less consistent than "README.MD". The latter is at least required on some ancient systems that do not support lowercase letters, hence even more preferred for portability (in extreme cases).
- Prioritizing non-regional subtags for languages should not be recommended normally, because this is less accurate, and the confusion may even be offensive to specific cultures, since there can be a lack of consensus that one subtag can override another without changing the meaning of the text (which is not the case of the relationship between tags and subtags).
- It should be acknowledged that validation of hyperlinks is not always possible when the linked resource is out of the control of the document (i.e. external).
- Anyway, there is no persistency guarantee for most hyperlinks on the Web.
- Mandating the state of the referenced resource of hyperlinks unconditionally makes any verification result one-time, because the external links may be broken immediately after the verification. Then the conformance is non-deterministic.
- Such a mandate is applicable only to hyperlinks provable to be persistent. But this is infeasible with automatic methods, at least for external links on the Web, because the test of persistency may be unreliable until the link is broken.
- So, unless external links are disallowed (which seems to be overkill), rules with impractical assumptions about the validation process should not be in the specification.
An example of most bullets above can be found in the specification of standard-readme.
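The portability concern about names differing only in letter case (such as "README" and "Readme") can be made concrete with a small sketch. The function name `case_collisions` is hypothetical, and `str.casefold` is only an approximation of how a case-insensitive filesystem compares names.

```python
from collections import defaultdict


def case_collisions(names):
    """Group the names that would collide in a case-insensitive namespace."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

A non-empty result indicates entries which can coexist in a case-sensitive environment but not in a case-insensitive one.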
Notes on the overall writing style
This subclause provides an informative supplement (guidelines and rationales) on the writing style used across the materials in the repository. The "adoptions" in the text below shall be interpreted as suggestions, not mandatory rules.
There can be some stylistic emphasis in the use of the language for personality, say, this. However, a boundary should be drawn to prevent the writing from being overly stylish. That is, it should be generally grammatical, without drafting a new dialect that is not well-known enough. Dialects that are not well-known enough have trouble being adopted, and they carry (and increase) confusion. Even without the adoption problems, considering dialects as specifications or standards, the divergence would not be eliminated without extraordinary effort; see also xkcd 927. Hence, for a context outside conlang, avoid it.
Many rules in the writing here are actually shared with the link above, tailored by the rule. They are analyzed one by one and then categorized below, with the reasons attached:
- Rejected:
- Article “an” vs “a”
- This is rejected because it is ungrammatical, and it significantly damages the reading experience in terms of phonetic smoothness.
- Pronoun “I” vs “i”
- Rejected for being ungrammatical. There are not enough benefits to justify adopting ad-hoc rules over and over.
- Saxon Genitive: “James's car” vs “James' car”
- Ditto.
- Symbol Used for the Quotation Mark
- Straight quotes should be used for English; otherwise there would be no place to use them.
- Curly double quotes are in the orthography of some languages, but not of English.
- Quoting Computer Code
- It is non-idiomatic to use `「` and `」` in English. It is even more non-idiomatic to use them for quotation of code.
- It is non-idiomatic to use `〔` and `〕` in English. It is even more non-idiomatic to use them for paths.
- There are more conventional semantic styles for such quotation, e.g. Markdown ``` quotes. The styles can also be customized by style sheets.
- Period in Person's Names
- Non-idiomatic in the majority of cases.
- Though uncommon, there is an ambiguity in the shortened form with a name having only one letter without the period.
- There is no rationale. If the ambiguity between the final punctuation of a sentence and this use is a concern, why limit it to "person's names"? E.g., "e.g. " with "." (and a space character after the 2nd ".") is actually used by the text. This is not logical at all.
- Citation Format
- The requirements on the precise format are intentionally omitted here globally, to allow better interoperability with the contexts.
- Adaptation to the target context is more important than sticking to a concrete known format. However, being internally consistent is also important. These two points should coexist in peace.
- For multiple targets, more than one format for the same content can coexist. Notice that a machine-readable format can be more efficient for specific uses, but not that readable for humans. So there can often be a mixture.
- After all, too many aspects can be customized here.
- Weakly rejected:
- Use Logical Spelling Variant
- It is somewhat subjective to identify what is "logical" enough.
- Although some are occasionally the same (e.g. "color" instead of "colour"), this has nothing to do with being "logical". Just apply the rule "en-US is preferred to other English dialects" by default.
- No “and” for Last Item in Sequence
- Although abuse of conjunctions can be a problem, it usually does not happen without some deliberate effort.
- By not eliminating the "too odd" exceptional cases, this rule is less beneficial.
- Simply keeping "and" is still the simplest.
- “My favorite fruits are: {peach, banana, cherry}” is not that logical. Reduced by some substitution, it has the form "My favorite fruits are: a set of fruits", which does not look grammatical due to the non-canonical use of the verb "are". A more grammatical form should have "are in" instead of a plain "are" here.
- No “respectively” in Parallel Sequences
- Good for preventing redundancy, but when used, the emphasis is not redundant.
- “{peach, banana, cherry}, colored {pink, yellow, red}” is confusing due to the possible ambiguity in the composition of multiple set notations (curly brace pairs), given that it is not an idiomatic way of expression in natural languages.
- Why does this mean a parallel sequence instead of a Cartesian product?
- It is perfectly "grammatical" to the nature of the extended syntax of "{}".
- The phrase "colored {pink, yellow, red}" can be viewed as the result of the predicate "colored" being applied to the set {pink, yellow, red}.
- Before semantic analysis of the precise meaning of this sentence (which renders {peach, banana, cherry} × colored {pink, yellow, red} nonsense), the notation alone gives no hint for telling the readings apart.
- Instead, a "respectively" clearly rules the ambiguity out by syntax.
- Weakly adopted (contextually):
- Short Sentence Length
- This is mildly adopted, but not always good in reality, because it is often more difficult to split the words correctly and precisely without redundancy using short sentences.
- Basically it should be forgotten, and when there are sentences looking uncomfortably lengthy, revive and champion this as a rule.
- Avoid Idioms in Tech Writing
- Good in general.
- But it incurs a mental cost to identify the idioms and to pick alternatives intentionally. This is actually acknowledged as "it's very time consuming".
- Also, without very careful treatment of the validity of the alternatives, it can lead to words not in basic English (thus clashing with the "Diction" section). It is difficult for non-specialists in English education to tell whether a particular word is in the vocabulary of basic English.
- No Right Justification
- There should be the exceptional cases for typographic uses, esp. per sentence alignment like this.
- Otherwise, what on earth is that?
- Use Metric System
- Sometimes idiomatic. However, use "USD 1M" instead of "USD$1M", to avoid the "dollar" redundancy between the "D" in "USD" and the dollar sign ("$"). Even then, the reason is not strong enough: it is not idiomatic to use the prefix "M" actually as a suffix. The common practice of abbreviating "MB" to "M" is informal, and the omission of other basic units is non-idiomatic.
- Distinguishing "k" from "ki" is good. Likewise "M" from "Mi".
- Person's Name, Title
- Order of First Name, Last Name
- Alphabetical order can be the default convention, but not the only one.
- No Honorary Title
- It is plausible to prefer name to title for accuracy.
- Degrees come from different majors and different places, so there may be confusion. "PhD" is most likely not about "philosophy", but what if "philosophy" is exactly what is needed?
- An exception is to cause confusion intentionally. That is, punning.
- Adopted (without exceptional cases):
- Omitting Articles: {a, the}
- This is adopted as non-mandatory.
- Diction
- Good for clarity.
- Hyphen, Dash
- Ditto.
- Apostrophe
- Grammatical. Orthographic.
- Punctuation in List Items
- Conventional.
- The only exception is the phrase itself: "full stop" may have a preferred form "period" in en-US, although there is not a big difference nowadays (also see below). This does not change the status of adoption, though.
- Sentences Should Always End in a Period
- Idiomatic.
- The title is problematic in including the exclamation mark and the question mark under "period", as a period is just the period symbol in normal use. The detailed text is correct, though.
- The exceptional cases like "e.g." and "i.e." are explained as "haven't fully thought out" here, but actually, they should not be a problem, because a period is not necessarily the end of a sentence. The older usage, which differentiates "full stop" and "period", uses the former for this case, but nowadays it is conventional to treat them the same.
- Strongly adopted (with even more and stricter rules):
- Active vs Passive Voice
- Passive voice is almost always preferred (for technical writings).
- Almost no subjects denoting readers are even used.
- Cold, inhuman and precise statements are preferred, except when the opposite is the object being discussed (e.g. metaphors, for GUI).
- The first-person subject pronouns ("I" and "we") are also rarely used. (This also largely prevents the concern of “I” vs “i” mentioned above.)
- Moreover, this also covers the instance of commit messages.
- Do Not Spell Out Numbers
- This is also adopted. Moreover, numbers are treated the same even at the beginning of a sentence. Furthermore, ordinal numbers are covered as well as cardinal numbers, for cases that just need a number (e.g. "1st" is preferred over "first", with exceptions when used as an adverb or in phrases like "at first").
- Date format
- Idiomatic.
- This prevents the ambiguity between "dd-mm-yyyy" and "mm-dd-yyyy" in particular.
- Moreover, in some contexts, more detailed ISO 8601 forms are required (e.g. with time zone information).
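For the date format item above, and assuming the adopted format is the ISO 8601 calendar date (which the mention of more detailed ISO 8601 forms suggests), the forms can be produced with the `datetime` module of the Python standard library. The helper names and the sample values below are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def iso_date(year: int, month: int, day: int) -> str:
    """Format a calendar date in the unambiguous yyyy-mm-dd form."""
    return date(year, month, day).isoformat()


def iso_datetime(moment: datetime) -> str:
    """Format a date and time, with the UTC offset when the value has one."""
    return moment.isoformat()


# A more detailed form carrying time zone information (UTC+08:00 here).
sample = datetime(2024, 3, 5, 14, 30, tzinfo=timezone(timedelta(hours=8)))
```

Here `iso_date(2024, 3, 5)` gives `2024-03-05`, which cannot be misread as either "dd-mm-yyyy" or "mm-dd-yyyy".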