
Any interest in loading emoji data from unicode data files? #145

Open
yob opened this issue Dec 29, 2020 · 6 comments

Comments

@yob

yob commented Dec 29, 2020

I was interested in using emojis released over 2019/2020, but I wasn't sure of the correct way to edit the data files, so I created a fork that loads emoji data from Unicode Consortium data files.

A nice side effect of this approach is that updating to support newly released emoji looks like this:

$ npm update emoji-datasource
$ npm run gen-emojis
$ git commit [email protected]/data/emoji.js -m "updated emoji data"

A downside is that customising keywords and names gets harder (in my fork, at least, I've skipped any customisations to keep things simple).
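
For illustration, here is a minimal sketch of what a generation step such as `gen-emojis` might do. It is not the fork's actual code. It assumes records shaped roughly like those in emoji-datasource's `emoji.json` (a hyphen-separated hex `unified` field plus `name` and `category`); the sample records and the `toEmoji` and `buildEntries` helpers are hypothetical.

```javascript
// Hypothetical sample records. The field names are an assumption about the
// shape of emoji-datasource's emoji.json and should be checked against it.
const sampleRecords = [
  { unified: "1F600", name: "GRINNING FACE", category: "Smileys & Emotion" },
  { unified: "1F468-200D-1F4BB", name: "MAN TECHNOLOGIST", category: "People & Body" },
];

// Convert a hyphen-separated list of hex code points into the emoji string.
function toEmoji(unified) {
  const codePoints = unified.split("-").map((hex) => parseInt(hex, 16));
  return String.fromCodePoint(...codePoints);
}

// Reduce each upstream record to the fields a picker would need.
function buildEntries(records) {
  return records.map((record) => ({
    emoji: toEmoji(record.unified),
    name: record.name.toLowerCase(),
    category: record.category,
  }));
}

console.log(buildEntries(sampleRecords));
```

Because nothing here is edited by hand, rerunning the step after a package update would be enough to pick up new emoji, which is also why hand-made customisations have no obvious place to live.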

It works pretty well, but it's a hacky solution that works for me, and I didn't put much consideration into making it suitable for merging upstream. I'd be happy to polish it up and help resolve issue #28 if you're interested. If so, I'd love some guidance on your preferred approach. It's also completely fine if this approach doesn't work for you; I'm happy to run a customised fork for now.

Thanks for a great extension 🫀🪅🪠🪃

@xurizaemon
Contributor

xurizaemon commented Nov 19, 2021

It would be a shame to lose the customised naming. It would be great to make aliasing / customised naming easier and simplify updating with new emoji too!

Can we lay our own data over the top, starting with "official" emoji & names from the built data and overlaying useful extras like the categories and aliases already in this package? Then the autopopulated data / default emoji names can be safely updated beneath the Emoji Selector additions, and we get the best of both worlds.

{
  "😀": { "categories": ["people"], "aliases": ["grinning face", "grin"] },
  "😁": { "categories": ["people"], "aliases": ["grinning face with smiling eyes", "grin", "smile"] },
  "🤡": { "categories": ["people", "git"], "aliases": ["clown", "mocking"] },
  "🚨": { "categories": ["whatever", "git"], "aliases": ["police car light", "police", "revolving light", "rotating light", "linter", "tests"] }
}

(NB: Quick examples above only - I do recall from #80 that we don't want a category just for emoji commit things and am not proposing a category change here)
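
The overlay idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `generated` and `overrides` objects and the `overlay` function are hypothetical names, and the shape of the generated data is a guess rather than anything this package produces today.

```javascript
// Hypothetical base data generated from the upstream source, keyed by emoji.
const generated = {
  "😀": { name: "grinning face", categories: ["people"], aliases: [] },
  "🤡": { name: "clown face", categories: ["people"], aliases: [] },
};

// Hypothetical hand-maintained additions, in the shape of the example above.
const overrides = {
  "🤡": { categories: ["git"], aliases: ["clown", "mocking"] },
  "🚨": { aliases: ["linter", "tests"] }, // not in the base data, so ignored
};

// Lay the custom data over the generated data without mutating either input.
function overlay(base, extras) {
  const merged = {};
  for (const [emoji, entry] of Object.entries(base)) {
    const extra = extras[emoji] ?? {};
    merged[emoji] = {
      ...entry,
      // Sets drop duplicates while keeping first-seen order.
      categories: [...new Set([...entry.categories, ...(extra.categories ?? [])])],
      aliases: [...new Set([entry.name, ...entry.aliases, ...(extra.aliases ?? [])])],
    };
  }
  return merged;
}

console.log(overlay(generated, overrides));
```

In this sketch the official name is always kept as the first alias, and overrides for emoji missing from the generated data are silently skipped; a real implementation might prefer to warn about them instead.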

@maoschanz
Owner

Oh sorry it looks like i didn't see this issue

That's a great idea and i had plans to do something similar, however:

  • as you say there is the customization problem, and i should change the code handling it before merging your approach
  • this data source has numerous languages, which is a great opportunity to finally translate the keywords
  • ...so it would be great if the extension used a default "english" file, but was able to generate the data for emojis in any language
  • to do this on the end-user machine, i think it shouldn't use insane bloatware like npm

@xurizaemon
Contributor

> to do this on the end-user machine, i think it shouldn't use insane bloatware like npm

Would there be any reason to run those commands on the end user's machine? GitHub Actions or a developer task could do that occasional work when updates ship and trigger a release, I believe.

@maoschanz
Owner

a single one of these files is already quite big, so how big would the entire extension be if i shipped all the possible translations? An extension shouldn't be dozens of megabytes big.

Also, a big potential pro of relying on an external data source would be that users don't have to wait for updates from me when new emojis are released by Unicode

@zelch

zelch commented Dec 5, 2022

I note that this discussion has been idle for the past year, but I'm pretty interested in it.

From a 'how' perspective, I suggest that we start with GitHub Actions, and then figure out how to optimize the process and experience once we know what the download sizes, processed file sizes, processing time, and the like work out to be.

On the source side, hashes keyed by the Unicode character for the emoji, containing things like the entry's language, categories, aliases, etc., would make it pretty easy to take the current list of stuff, plus future customizations, and merge them with the upstream Unicode data.

What that turns into after processing could easily be the same structures that we have today, or something else. What makes sense for the compilation of the data, and what makes sense for using the data, are almost exact opposites. After all, the aliases are what we want to search by, not the raw unicode of the emoji.
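
To make that contrast concrete, here is a small sketch of the processing step: the source data is keyed by emoji, and the output is a lookup keyed by alias. The `byEmoji` sample and the `buildAliasIndex` name are hypothetical and not part of the extension.

```javascript
// Hypothetical source-side data: easy to edit and merge, keyed by emoji.
const byEmoji = {
  "😀": { aliases: ["grinning face", "grin"] },
  "😁": { aliases: ["grinning face with smiling eyes", "grin", "smile"] },
};

// Invert the emoji-keyed data into an alias-keyed index for searching.
function buildAliasIndex(data) {
  const index = new Map();
  for (const [emoji, entry] of Object.entries(data)) {
    for (const alias of entry.aliases) {
      const key = alias.toLowerCase();
      if (!index.has(key)) {
        index.set(key, []);
      }
      index.get(key).push(emoji);
    }
  }
  return index;
}

const aliasIndex = buildAliasIndex(byEmoji);
console.log(aliasIndex.get("grin")); // both emoji share this alias
```

This only handles exact matches; prefix or substring search would need a different structure, but the same split between editable source data and a generated index would still apply.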

Thoughts?

@maoschanz
Owner

sorry, as you point out i didn't have any thought about any of this for the past year, and i will need to go back to it before saying anything

i need the silly 🥸 emoji so bad so i think i'll do it this winter

