These things will make your grammar more complicated, but you can expect them to show up in any community-generated corpus, almost immediately:
Onomatopoeia. Fart noises and the like. In English, they get italicized and I’m sure someone has written a paper on what part of speech they are.
Direct speech. Thought you didn’t have embedded sentences? Now you do.
Fragments. If a sentence is cut off, either on purpose or by accident, what can you do with it grammatically? If understanding a sentence is a process of parsing by applying syntactic rules, what are the rules for dealing with fragments?
Diglossia (code-switching). If you mix English and another language in the same sentence, what are the rules for swapping parts out?
Errors. This one stumps me. When a computer parses a program and one word is off, the parse fails completely. It can’t do anything with that program. But with human speech, if ten things are wrong, we apply a set of syntax-like rules to fix it up and we don’t even notice. Sound absurd? This is essentially how modems worked, with error correction and checksums. If you’ve ever used an application called ReSharper, it does a similar thing for programming languages: it uses static analysis to find syntax mistakes and suggest corrections.
Punctuation. Think you don’t need punctuation? Think again: a simple grammar can yield dozens of alternative parsings for a single sentence. Punctuation brings that down to a manageable level. If it needs to be audible punctuation, like Lojban’s, that’s another story.
Compound words (and neologisms). Think you only have 1000 words? Almost immediately, phrasal compound words will appear. Interestingly, if you treat them as compound words, the parser produces better glosses and you have fewer alternative (wrong) parsings. Toki pona and Klingon both have this issue, since both languages have a fixed number of bound and unbound morphemes (one by design, one by community choice). The alternative of just imagining all these common word pairings to be “ad hoc” phrases is really just dishonest.
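As a rough illustration of the compound-word point, here is a minimal Python sketch of a pre-pass that joins known compounds into single tokens before the parser sees them. The lexicon, the glosses, the hyphen-joining convention, and the function name merge_compounds are all assumptions made up for this example, not how any particular parser does it.

```python
# Hypothetical compound lexicon. A real one would probably be built from
# corpus frequency counts rather than written by hand.
COMPOUNDS = {
    ("jan", "pona"): "friend",
    ("tomo", "tawa"): "vehicle",
    ("telo", "nasa"): "alcohol",
}
MAX_LEN = max(len(key) for key in COMPOUNDS)


def merge_compounds(tokens):
    """Greedily join the longest known compound starting at each position."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word compounds.
        for size in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in COMPOUNDS:
                merged.append("-".join(candidate))
                i += size
                break
        else:
            # No compound starts here, so keep the single word as it is.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_compounds("mi tawa tomo tawa".split()))
```

The parser then treats "tomo-tawa" as one head word instead of weighing every way "tawa" could modify "tomo", which is where the fewer wrong parsings come from.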
I’m running into all of these issues when trying to machine parse a toki pona corpus. That language only has about 10 rules in its formal grammar, but my parser just keeps getting more and more lines of code to deal with issues like the above.
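To put a rough number on the punctuation point, here is a toy count in Python. It assumes the worst case, where every binary bracketing of a word sequence is a legal parse, and it assumes a punctuation mark forces a constituent boundary. Real grammars rule out many bracketings, so treat this as a back-of-the-envelope model; the function names are invented for the example.

```python
from math import comb


def catalan(n):
    """n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)


def bracketings(word_count):
    """Number of binary trees over word_count leaves (worst-case parse count)."""
    return catalan(word_count - 1) if word_count > 0 else 0


def bracketings_with_breaks(chunk_sizes):
    """Parse count when punctuation splits the sentence into fixed chunks.

    Each chunk is bracketed on its own, then the chunks are bracketed
    together as single units.
    """
    total = bracketings(len(chunk_sizes))
    for size in chunk_sizes:
        total *= bracketings(size)
    return total


print(bracketings(6))                   # six words, no punctuation: 42
print(bracketings_with_breaks([3, 3]))  # same six words, one comma in the middle: 4
```

Under those assumptions, a single comma takes a six-word sentence from 42 candidate parses to 4.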