by thaumasiotes 261 days ago
> but English has what 30-40% of its vocabulary from French?

You have to be careful what you're counting when you quote figures like that. Here is your comment, but including only the words derived from French:

-----

... basic grammar sure, ............. influence ... Latin ...... just .... cultural .......... exposed† ..... Romance languages .................

† "Exposed" is unlike "normal" French-derived words in English in that it is not derived from Old French; the equivalent from Old French is "expound(ed)", and even there I'm not sure why we have "ex-" instead of "es-". I might credit "exposed" more to Latin than French.

-----

Here's English:

-----

for xxxx xxxx xxxx, but English has what 30 to 40 xxxx of its xxxx from French? There's also a lot of xxxx from xxxx and xxxx in English as well.

Likely it's xxxx less xxxx-xxxx sharing from Welsh into English. We xxxx much more xxxx to more tidbits from xxxx xxxx or xxxx in English than we do Welsh or xxxx.

xxxx! Something to read up on.

-----

53 / 71 words (including Welsh, but not Gaelic) are native English.

("Welsh" ultimately derives from the name of a Celtic tribe known to us from Roman writers. In Germanic, the name became a generic word for foreigners. I think it's fair to call it English; it was already like that in Proto-Germanic. "Gaelic" is more recent.)

10 / 71 words, including the somewhat questionable exposed, are from French.

Five are Latin, two are Norse, and then there's "Gaelic". Greek is not represented except in the "-ic" ending on "Gaelic" (or "basic").
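For anyone who wants to check the tally, here is a minimal sketch that turns the counts above into shares. The numbers are the ones given in this comment (with Gaelic counted as a single word); the variable names are just for illustration.

```python
# Word counts by origin, as tallied in the comment above.
# Gaelic is taken to be one word ("and then there's Gaelic").
counts = {"English": 53, "French": 10, "Latin": 5, "Norse": 2, "Gaelic": 1}

# The categories should account for every word in the quoted comment.
total = sum(counts.values())

# Share of each origin as a percentage of all words, to one decimal place.
shares = {origin: round(100 * n / total, 1) for origin, n in counts.items()}

for origin, pct in shares.items():
    print(f"{origin}: {counts[origin]}/{total} = {pct}%")
```

The categories add up to 71, and French comes out at about 14% of the running text, well under the 30-40% figure quoted for the vocabulary as a whole.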

If you're listening to someone speak English, knowing French is unlikely to be worth much.


Nice observation but it just illustrates what the GP is saying: the basic grammar is English while a huge proportion of the vocabulary comes from French. If you remove the grammatical words from the English selection you made, there's hardly anything left.

> If you're listening to someone speak English, knowing French is unlikely to be worth much.

It can help a lot when learning, because of the huge vocabulary overlap: e.g. for more or less every word ending in -tion, you just learn to pronounce it differently.

I thought this was an interesting idea.

I rated each word in the comment for how much I felt it represented grammar vs semantics (total adding to 1 for each word; ratings in increments of 0.1).

The ratings divided into 31.5 words worth of syntax and 37.5 words worth of semantics, adding up to 69 instead of 71 because I combined "a lot" and "as well" into one word each for this purpose.

French accounted for 6% of the grammar (reflecting my rating of "sure" and "just" as 90% "grammatical" each), and 22% of the semantics.

English got 91% of the grammar and 59% of the semantics. The point you might be most likely to disagree with is that I rated many prepositions as 50% semantic. (For example, "to" in the phrase "thirty to forty" got that rating, although "to" in "get exposed to" and "something to read up on" were rated 0% semantic.) The second point, cutting in the other direction, is that I rated all pronouns as 0% semantic; realistically they should rate a bit higher. In a better model, I'd probably like to rate them 100% grammatical and also ~30% semantic.

(The residual ~3% of grammar is the passive marker "get", from Norse.)
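The weighting scheme can be sketched roughly as follows. This is only an illustration: the list holds just the handful of ratings actually stated above, not the full table, and the layout (weights stored in tenths, the function name `totals_by_origin`) is my own assumption about how one might set it up.

```python
from collections import defaultdict

# (word, origin, grammar weight in tenths); the semantic weight is 10 minus that,
# so the two always add up to one whole word.
# Only ratings stated in the comment are listed; this is not the full table.
ratings = [
    ("sure", "French", 9),
    ("just", "French", 9),
    ("to (thirty to forty)", "English", 5),
    ("to (get exposed to)", "English", 10),
    ("to (something to read up on)", "English", 10),
]

def totals_by_origin(rows):
    """Sum grammar and semantic weight (in tenths of a word) per origin."""
    grammar = defaultdict(int)
    semantics = defaultdict(int)
    for _word, origin, g_tenths in rows:
        grammar[origin] += g_tenths
        semantics[origin] += 10 - g_tenths
    return dict(grammar), dict(semantics)

grammar, semantics = totals_by_origin(ratings)

# Cross-check the reported French grammar share against the stated total of
# 31.5 words' worth of grammar (315 tenths), assuming "sure" and "just" are
# the only French words carrying grammatical weight.
french_grammar_pct = 100 * grammar["French"] / 315
print(f"French share of grammar: {french_grammar_pct:.1f}%")
```

Under that assumption the two words contribute 1.8 of the 31.5 grammar "words", which rounds to the 6% reported.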

If this is the kind of thing you enjoy, I'd be interested in your evaluation.

I'd say I'm quite sceptical about that kind of evaluative scheme because it seems to add a degree of subjectivity and arbitrariness in how things are rated.

At a first pass I'd just say that adjectives, nouns, and adverbs are "vocabulary", and everything else is grammar.

That won't work as a first pass. That gets you results like "there's also a lot of influence from French" being 2/3 semantics and 1/3 grammar†, with "there" holding just as much semantic content as "influence" does. It also disqualifies pronouns from counting as grammar at all, which is much more defensible than disqualifying semantically empty words, but not a common perspective.

I tend to take the perspective that if a foreign speaker is unlikely to have any trouble learning how to use a word correctly, that word is semantic, and otherwise, the word is grammatical.

† Assuming that the omission of verbs from your list of semantic words was a mistake. Otherwise you're up to 44% grammar. I did count "is" as being grammar, but I would certainly not extend that judgment to all verbs.
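To make the 2/3 vs 1/3 and 44% figures concrete, here is a small sketch of the part-of-speech first pass applied to that one sentence. The tags are hand-assigned for illustration (with "there" counted as a noun and "also" as an adverb), and the function name `grammar_share` is made up for this example.

```python
from fractions import Fraction

# "there is also a lot of influence from French", tagged by hand.
# These tags are an assumption for illustration, not the output of a tagger.
tagged = [
    ("there", "noun"), ("is", "verb"), ("also", "adverb"), ("a", "determiner"),
    ("lot", "noun"), ("of", "preposition"), ("influence", "noun"),
    ("from", "preposition"), ("French", "noun"),
]

def grammar_share(tokens, semantic_pos):
    """Fraction of words whose part of speech is NOT in the semantic set."""
    grammatical = sum(1 for _word, pos in tokens if pos not in semantic_pos)
    return Fraction(grammatical, len(tokens))

# Verbs counted as vocabulary (the corrected rule).
with_verbs = grammar_share(tagged, {"noun", "adjective", "adverb", "verb"})

# Verbs left out of the vocabulary list (the rule as originally written).
without_verbs = grammar_share(tagged, {"noun", "adjective", "adverb"})

print(with_verbs, without_verbs)
```

With verbs counted as vocabulary, 3 of the 9 words are grammar (1/3); without them it is 4 of 9, or about 44%.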

--- results ---

By your standard, English is 61% of the semantics and 91% of the grammar (if verbs have no semantics), or 62% of the semantics and 96% of the grammar (if verbs do have semantics).

French is 21% of the semantics and 6% of the grammar (if verbs have no semantics), or 20% of the semantics and 4% of the grammar (if verbs do have semantics).

I don't think much of your methodology, but it's worth noting that your overall numbers are almost identical to mine. (When verbs are meaningless; still very close but distinguishable otherwise.)

In reality, of course, many verbs such as "sharing" are rich in semantics, and many others such as "do" are more or less empty.

Oh, true, it was just a mistake to exclude verbs. Of course they should be vocabulary.

But I think of pronouns as grammatical, as well as the auxiliary particles in verb forms like "there is", "to go to", etc. So "have" and "is" can function grammatically when they're part of the verb form of another root verb, like "have been seen" and so on.

"Do" is obviously semantic when it's the main verb, e.g. "I'm doing my job" versus "I'm leaving my job". In the selection you quoted it's also playing a grammatical role which is just to point to the main verb form of the sentence, i.e. it could be replaced by repeating "get exposed to (titbits from)" without changing the meaning of the sentence.

So in "there is also a lot of influence from French", I would put "there _ also a _ of _ from _" as grammatical.

I'm sure my way is naive, but it's based I think on well-established categories. I'm not sure how linguists would distinguish grammatical words or even if they categorize based on words at all. e.g. "a lot of" as a quantifier might be completely grammatical, same as "more", "less", "thirty", etc.

> I think of pronouns as grammatical, as well as the auxiliary particles in verb forms like "there is", "to go to", etc.

"There" would not usually be considered a particle. It is a noun, but one that has no semantics whatever; it is there only to satisfy the grammatical rule requiring the verb in that clause to have a subject. (The term of art here is, straightforwardly enough, "dummy subject".)

You could ask questions about extraposition (as in "it's tragic that XXXXX", which is equivalent to "that XXXXX is tragic"); "there is [noun]" is obviously similar in some ways and less similar in other ways. One way in which it's gotten less similar over time is that the verb used to agree in number with [noun], but today it is more commonly always "is", appearing to agree with "there" regardless of whether [noun] is singular or plural.

> "Do" is obviously semantic when it's the main verb, e.g. "I'm doing my job" versus "I'm leaving my job".

I don't think this is so obvious. "Do" (as a primary verb) is a verb in the same way that "thing" is a noun - it has all the same grammatical properties, and usually no semantic content. (Technically, since we have two meaningfully distinct classes of noun, we need more than one empty noun. The counterpart to "thing" is "stuff". These do technically differ in their semantics, conveying the speaker's idea of how divisible the objects or materials in question are.)

In your example, I would say that "doing" is closely related to "job" and the semantics (still pretty weak) arise from the pairing. You can do many things by taking advantage of conventional fixed expressions. But if I were to remark to you that my friend was "doing a book", I suspect that you wouldn't know what that meant. Maybe my friend is an author. Maybe he's an illustrator. Maybe he's an editor. Maybe he's a press. Some words are vaguer than others; "do" is maximally vague.

> I'm sure my way is naive, but it's based I think on well-established categories.

Mostly, yes. Adverbs can be a bit hazier than nouns, verbs, and adjectives. You did yourself a big favor by defining a miscellaneous "other" category.

I will note that I excluded "more" (in "more tidbits", but not in "more exposed" where it's an adverb) from the semantic category on the grounds that it is a determiner (same part of speech as "the"). This is something I think you might not have anticipated. I should also note that "also" is an adverb (adverbs are very broadly defined), so your methodology rated it as semantic. I think I rated it as 70% grammatical.

Prepositions are difficult to deal with. (This is generally true of almost every language.) For "there is a lot of influence from French", my view is the following:

(1) "From" has fundamental semantics involving something being in a certain location and then moving out of that location;

(2) in this specific use, those semantics are close to the surface. A foreigner putting this phrase together would likely be able to guess that "from" was the right preposition to use.

Contrast something like "refrain [from]", where the semantics are still not entirely gone, but the foreigner is going to have a much harder time.

I didn't want to think very hard about exactly how much the semantics were present in prepositions, so if I thought they were present in a nontrivial way, I gave them 50%.

> "a lot of" as a quantifier might be completely grammatical, same as "more", "less", "thirty", etc.

I had a lot of trouble with "thirty" and ended up scoring it as an adjective for the unprincipled reason that that would make it count as semantic. Grammatically the least we can say is that it's not a normal adjective. This is also true of "more" and "less" (where we can say more), so good eye.

"A lot [of]" is heavily grammaticalized and this process appears to be continuing. Here's a blog post observing that native speakers often think of "a lot" as a single word: https://hyperboleandahalf.blogspot.com/2010/04/alot-is-bette...

It's not quite the same thing as "more" and "less", though. They can substitute for it:

A lot of the students...

More of the students...

But it can't substitute for them:

More students...

*A lot students...

This problem won't go away if we include the "of"; then we'd get

*More the students...

I think it's better not to include the "of".

> I'm not sure how linguists would distinguish grammatical words or even if they categorize based on words at all.

Linguists use "word" to mean an atomic element. Exactly which parts of a certain stretch of speech are atomic depends on the analysis you're trying to do, and linguists have explicit terms for elements that are atomic at different levels or in different ways. By default a "word" would probably be taken to mean a lexeme, which is something that requires its own dictionary entry. A "morpheme" is something like "the smallest element to which we can assign independent significance" and might rarely be considered a "word". At this level you might observe that "fascinate" derives from Latin but its "-ing" ending, a separate morpheme, does not. A "phoneme" is a sound that is meaningfully distinct from other sounds, and would never be called a "word".

There is a concept of a "clitic", which is something that behaves like an independent word in some ways and like a dependent particle or inflection in other ways. This is almost always a lexeme that is pronounced as if it is part of a nearby word. I don't know of a term for "pronunciational atom", but I wouldn't be surprised if there is one.

Linguists make all kinds of observations about how certain words are semantically weak or in the process of losing their semantics ("semantic bleaching"). And of course they also make all kinds of observations about grammatical rules. So "how grammatical is this word" is definitely a question close to the heart of linguistics, but as you note the concepts are difficult to define and the question often cannot be answered rigorously as applied to particular words.