Discussion

Dirk Roorda edited this page Oct 18, 2017 · 11 revisions

Main idea

The meaning of a verb depends critically on the number and nature of the complements found in its neighbourhood. It is possible to write a flowchart that lists those meanings as a function of the patterns of complements.

Having a linguistic text database of the Hebrew Bible at our disposal, we can implement those flowcharts as an algorithm and apply that algorithm to all verb occurrences in the Hebrew Bible.

Complications

While the idea expressed above is simple, its execution meets a number of challenges.

Data correction

The ETCBC data has been encoded in an ongoing effort spanning decades, during which principles of encoding and linguistic theories were subject to change. This has led to several inconsistencies in the details of the encoding, in particular in the assignment of functions to phrases.

We have met this challenge by implementing a data correction workflow, where faulty phrase function assignments have been replaced by better ones.

Data enrichment

Moreover, the stance of the ETCBC encoders is objectivistic: a piece of markup is applied only when there is objective, measured evidence for it. On the other hand, the traits on which the flowcharts base their decisions are often at a higher level of interpretation. For example, the notion of indirect object is not present in the ETCBC encoding.

We have met this challenge by enriching the encoding with a number of higher-level features that we compute from the ensemble of lower-level features. We also have a workflow in place to manually adjust the outcome of this process.

However, the present results do not rely on manual enrichment, because the enrichment algorithm is still a work in progress. For now, we bet on improving the algorithm rather than supplying manual enhancements. At this stage, manual enhancements are counterproductive: re-checking them after each update to the algorithm adds an extra layer of complexity, especially while the vocabulary of the enrichments is in development as well.

Target language dependency

When talking about the meanings of Hebrew verbs in English, the verbal valence patterns in the English language interfere with those in the Hebrew language. This hampers a clear organization of meanings of Hebrew verbs in their own terms, and may camouflage complexities in Hebrew meanings, or complicate simple meaning structures.

We deal with this challenge by decoupling the valence patterns from the meanings. So, for each verb context, we annotate its valence pattern, categorize the relevant bits of context, and leave it there. For selected verbs, we have an explicit map from valence patterns to meanings in English, a.k.a. a flowchart. For those verbs, we will insert a link to the flowchart of that verb.

Direct objects

We need to discriminate between various types of direct objects, especially when there are multiple direct objects in a clause.

Multiple direct objects

If there are multiple direct objects in a clause, we will compute which is the principal one. The others are deemed secondary ones. If there is only one direct object, we do not mark it as principal.

An object can be a phrase or a clause.

Clauses as objects

We will treat clauses marked as Objc by feature rela as direct objects. Additionally, we identify clauses marked as InfC by feature typ as direct objects if they are preceded by the preposition L and if there is a direct object phrase elsewhere in the clause.

Whether such an object clause counts as a principal direct object is decided by the rules stated later on; not all of them will be marked as principal.
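The two conditions for treating a clause as a direct object can be sketched as a small predicate. This is only an illustration: the function name and the two boolean parameters are hypothetical, and how "preceded by L" and "a direct object phrase elsewhere" are detected in the data is left out.

```python
def is_direct_object_clause(rela: str, typ: str,
                            preceded_by_l: bool,
                            object_phrase_elsewhere: bool) -> bool:
    """Return True if a clause is treated as a direct object.

    rela, typ: values of the ETCBC features `rela` and `typ` of the clause.
    preceded_by_l: hypothetical flag, True if the clause is preceded by
        the preposition L.
    object_phrase_elsewhere: hypothetical flag, True if there is a direct
        object phrase elsewhere in the clause.
    """
    # Case 1: the clause is marked as an object clause by feature rela.
    if rela == "Objc":
        return True
    # Case 2: an infinitive construct clause with L, next to an object phrase.
    return typ == "InfC" and preceded_by_l and object_phrase_elsewhere
```

Both extra conditions must hold for an InfC clause; an InfC clause with L but without an accompanying object phrase is not counted.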

Implied objects

There are many cases where there is a direct object without it being marked as such in the data. Those are cases where there are no objective, unambiguous signals for a direct object. We call them implied objects. Examples:

  • the relative pronoun in relative clauses
  • complements starting with MN (from) or L (to)
  • assumed objects: objects that figure in the context, but do not have a concrete presence in the clause under consideration

In the case of implied objects we have to guess. Initially we assume that there are no implied objects.

Later, when we inspect individual cases, we can mark principal objects and implied objects manually for those cases where these rules do not suffice.

Finding the principal direct object

When there are multiple direct objects, we use the rules formulated by (Janet Dyk, Reinoud Oosting and Oliver Glanz, 2014) to determine which one is the principal one. The rules are stated below, together with some remarks about how we apply them to our data.

Interpretation

When looking for principal direct objects, we restrict ourselves to direct objects at the phrase level, either being complete phrases, or pronominal suffixes within phrases. The following rules express a preference for the principal direct object. In a given context, we apply those rules and select the preferred direct object as the principal one. We only apply these rules if there are at least two direct objects. If there is only one direct object, it is not marked as principal.

Rule 1: pronominal suffixes > marked objects > unmarked objects

In a given clause, we collect all phrases with function PreO or PtcO. If this collection is non-empty, we pick the one that is textually first (by rule 3 below) and stop applying rules. Otherwise, we proceed as follows.

We collect all the phrases with function Objc. If this collection is empty, there will not be a principal object. Otherwise, we split it into marked and unmarked object phrases.

An object phrase is marked if and only if it contains, somewhere, the object marker >T. If there are marked object phrases, we pick the one that is textually first (by rule 3 below) and stop applying rules. Otherwise we proceed with the next rule.

Rule 2: determined phrases > undetermined phrases

We only arrive here if there are multiple Objc phrases, none of which is marked. In this case, we take the textually first one (by rule 3) which has the value det for its feature det, if there is one, and stop applying rules. Otherwise we proceed with the next rule.

Rule 3: earlier phrases > later phrases (by textual order)

This rule is implicitly applied if one of the rules before yielded more than one candidate for the principal object. Furthermore, we arrive here if the previous rules have not selected any principal direct object, while we do have more than one Objc phrase.

In this case, we pick the textually first Objc phrase.
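The three rules above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the Phrase class, its field names and the way direct objects are counted (phrases with function PreO, PtcO or Objc) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Phrase:
    position: int      # textual order within the clause (hypothetical field)
    function: str      # phrase function, e.g. "PreO", "PtcO", "Objc"
    words: List[str]   # transliterated words of the phrase
    det: str = "und"   # "det" for determined phrases


def first(phrases: List[Phrase]) -> Phrase:
    """Rule 3: prefer the textually first phrase."""
    return min(phrases, key=lambda p: p.position)


def principal_direct_object(clause: List[Phrase]) -> Optional[Phrase]:
    suffixed = [p for p in clause if p.function in ("PreO", "PtcO")]
    objects = [p for p in clause if p.function == "Objc"]
    # Assumption: the rules only apply with at least two phrase-level objects.
    if len(suffixed) + len(objects) < 2:
        return None
    # Rule 1a: pronominal suffixes win.
    if suffixed:
        return first(suffixed)
    # Rule 1b: marked objects (containing the object marker >T) come next.
    marked = [p for p in objects if ">T" in p.words]
    if marked:
        return first(marked)
    # Rule 2: determined phrases before undetermined ones.
    determined = [p for p in objects if p.det == "det"]
    if determined:
        return first(determined)
    # Rule 3: fall back to textual order.
    return first(objects)
```

With two unmarked Objc phrases of which only the second is determined, the function returns the second; with a single Objc phrase it returns None, since a lone direct object is not marked as principal.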

Non principal objects

In case there is a principal object, we divide the other objects into two kinds:

  • clause objects
  • phrase objects

We will give the phrase objects the grammatical label NP_direct_object.

Complements as L/K objects

In some cases, a complement functions as an object, such as in Genesis 21:13: "I make him (into) a people".

Candidates are those complements that:

  • start with either preposition L or K,
  • where the L or K in question does not carry a pronominal suffix,
  • and where the L or K is not followed by a body part.

We generate the grammatical labels L_object and K_object in these cases. The flowchart will make a distinction between L_object and K_object.

An L/K object is never a principal direct object.
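The candidate test for L/K objects can be sketched as follows. The word representation (dicts with the keys "lex" and "has_suffix") is hypothetical, and BODY_PARTS is an illustrative placeholder set, not the actual list used in the project.

```python
from typing import Dict, List, Optional

# Placeholder values only: the real list of body part lexemes is not given here.
BODY_PARTS = {"JD", "<JN", "PNH"}


def lk_object_label(words: List[Dict]) -> Optional[str]:
    """Return "L_object" or "K_object" for a candidate complement, else None.

    Each word is a dict with a "lex" string and a "has_suffix" boolean
    (hypothetical representation of the words in a complement).
    """
    if not words:
        return None
    head = words[0]
    # The complement must start with the preposition L or K ...
    if head["lex"] not in ("L", "K"):
        return None
    # ... that does not carry a pronominal suffix ...
    if head["has_suffix"]:
        return None
    # ... and is not followed by a body part.
    if len(words) > 1 and words[1]["lex"] in BODY_PARTS:
        return None
    return head["lex"] + "_object"
```

A complement such as L followed by a common noun yields "L_object"; the same complement with a suffixed L, or with L followed by a body part, yields None.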

Indirect objects

Finding indirect objects

The ETCBC database has no feature that marks indirect objects. We will use computation to determine whether a complement is an indirect object or a locative. This computation is just an approximation.

Cues for a locative complement

  • # loc lexemes: how many distinct lexemes with a locative meaning occur in the complement (given by a fixed list)
  • # topo: how many lexemes with nametype = topo occur in the complement (nametype is a feature of the lexicon)
  • # prep_b: how many occurrences of the preposition B occur in the complement
  • # h_loc: how many H-locales are carried on words in the complement
  • body_part: 2 if the phrase starts with the preposition L followed by a body part, else 0
  • locativity ($loc$): a crude measure of the locativity of the complement, just the sum of # loc lexemes, # topo, # prep_b, # h_loc and body_part

Cues for an indirect object

  • # prep_l: how many occurrences of the preposition L or >L with a pronominal suffix on it occur in the complement
  • # L prop: how many occurrences of L or >L plus proper name or person reference word occur in the complement
  • indirect object ($ind$): a crude indicator of whether the complement is an indirect object, just the sum of # prep_l and # L prop
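As a rough sketch, the two scores are plain sums over the cue counts listed above. The Cues class and its field names are hypothetical; how the individual counts are obtained from the data is not shown.

```python
from dataclasses import dataclass


@dataclass
class Cues:
    # Cues for a locative complement
    n_loc_lexemes: int = 0     # distinct lexemes with a locative meaning
    n_topo: int = 0            # lexemes with nametype = topo
    n_prep_b: int = 0          # occurrences of the preposition B
    n_h_loc: int = 0           # H-locales carried on words
    l_body_part: bool = False  # phrase starts with L followed by a body part
    # Cues for an indirect object
    n_prep_l: int = 0          # L or >L with a pronominal suffix
    n_l_prop: int = 0          # L or >L plus proper name or person reference


def locativity(c: Cues) -> int:
    """The crude locativity measure: the sum of the locative cues."""
    body_part = 2 if c.l_body_part else 0
    return c.n_loc_lexemes + c.n_topo + c.n_prep_b + c.n_h_loc + body_part


def indirectness(c: Cues) -> int:
    """The crude indirect object indicator: the sum of the two L cues."""
    return c.n_prep_l + c.n_l_prop
```

Note that a body part after L contributes 2 to the locativity, so it weighs as much as two other locative cues together.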

The decision

We take a decision as follows, where L stands for locative and I for indirect object (not to be confused with the preposition L).

In words:

  • if there are positive signals for L or I and none for the other, we choose the one for which there are positive signals;
  • if there are positive signals for both L and I, we follow the majority count, but only if the difference is at least two;
  • in all other cases we leave it at C: not necessarily locative and not necessarily indirect object.

See the enrich notebook for a more formal account of the decision.
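The verbal description of the decision can be sketched as a small function. This follows only the wording on this page, not the notebook itself; the function name and the parameters loc and ind (the locativity and indirect object scores) are chosen for the example.

```python
def decide(loc: int, ind: int) -> str:
    """Classify a complement as "L" (locative), "I" (indirect object) or "C".

    loc: the locativity score, ind: the indirect object score.
    """
    # Signals for only one of the two: choose that one.
    if loc > 0 and ind == 0:
        return "L"
    if ind > 0 and loc == 0:
        return "I"
    # Signals for both: follow the majority, but only with a margin of two.
    if loc > 0 and ind > 0:
        if loc - ind >= 2:
            return "L"
        if ind - loc >= 2:
            return "I"
    # Everything else stays undecided.
    return "C"
```

For instance, decide(3, 1) gives "L", while decide(2, 1) gives "C" because the difference is less than two.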

Flowcharts

  • BR> - ברא - create
  • DBQ - דבק - cling
  • NTN - נתן - give
  • <FH - עשׂה - make
  • QR> - קרא - call
  • CJT - שׁית - set
  • FJM - שׂים - put
  • ZQN - זקן - be old