March 2009

Wiktionary for automated lookup of translations and synonyms

Hi, I'm building software that needs to look up information about words, and I'm not sure if Wiktionary is the right source for it.

It's not dictionary-like software for users specially interested in words. Think of it like a search engine where the user enters a word like dog and it asks him if he means an animal or a morally reprehensible person and may also search for hound, canine or cad, bounder, blackguard, fool, hound, heel, scoundrel depending on his choice. It should also offer translations, e.g. search for the German words Hund, Köter, Töle automatically. It's not exactly what I am building, but it's the easiest way to describe my needs, so please don't debate whether it is reasonable to build this piece of software...

So my question is: Do you think Wiktionary is the right source for that kind of information? Are there any other digital dictionaries that may be better suited to that problem? If Wiktionary is the right source, are there any APIs that are recommended for those lookups and give me structured output in XML, or should I just get the unprocessed wiki text and parse it?

Thanks in advance for your answers, Prauch 13:42, 3 March 2009 (UTC)

I'm afraid Wiktionary isn't quite there yet (though we're working on it, and you'd be welcome to help). It sounds like WordNet would be more suited to your needs, at least in terms of synonyms. You could then try to integrate that with the translation data from Wiktionary, I suppose. -- Visviva 14:25, 3 March 2009 (UTC)[reply]

Thanks, that's looking very promising. Too bad that WordNet is only available in English, while Wiktionary is multilingual. Maybe I'll try your advice and combine the two. Prauch 10:18, 5 March 2009 (UTC)

I'd also like to get some of the Wiktionary data into a database (I'm working on a word quizzer as a hobby project) and am wondering if there are any good parsers out there. If not I'll write my own, but I'd rather use one that's out there already if possible. Could anyone let me know how people are currently doing this? Thanks in advance. --Gyroidben 06:10, 22 August 2009 (UTC)

I have some parsers, and I have access to a particularly good one (in Python) for translations, written by User:Polyglot (who seems to be afk at the moment), which, although not quite finished, can extract translations from many of the Wiktionaries. I have some guess-work parsers that extract translations (Python), synonyms and derived terms, and some methods to determine whether definitions are form-of or not (awk/Perl), all used in the indices. If you let me know what you're interested in doing, I'd be very happy to help. Mapping dictionary data to a relational database is something I've tried and failed at several times; my latest idea was to have a go using Neo4J, which is a graph database, but I haven't had the time yet. (While the simple stuff is reasonably easy, a lot of the annotations I feel we should have structured too end up exploding in size.) Conrad.Irwin 21:07, 22 August 2009 (UTC)
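To make the "guess-work" extraction of translations concrete, here is a minimal sketch. It is not Polyglot's or Conrad.Irwin's parser. It assumes translations are written with templates of the form {{t|lang|word}}, {{t+|lang|word}} or {{t-|lang|word}}; real entries contain many other variants that this pattern will miss. The names TRANSLATION_RE and extract_translations are made up for the illustration.

```python
import re

# Assumption: translations look like {{t|de|Hund|m}}, {{t+|de|Hund}} or
# {{t-|de|Hund}}. Anything after the word (gender, script, ...) is ignored.
TRANSLATION_RE = re.compile(r"\{\{t[+-]?\|([^|}]+)\|([^|}]+)[^}]*\}\}")


def extract_translations(wikitext):
    """Return a dict mapping language code -> list of translated words."""
    result = {}
    for lang, word in TRANSLATION_RE.findall(wikitext):
        result.setdefault(lang.strip(), []).append(word.strip())
    return result


sample = (
    "* German: {{t+|de|Hund|m}}, {{t|de|Köter|m}}, {{t-|de|Töle|f}}\n"
    "* French: {{t+|fr|chien|m}}"
)
translations = extract_translations(sample)
print(translations)
```

Pattern matching like this is easy to write but, as noted in this thread, tends to get stumped on a significant fraction of entries, so it is only a starting point.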

Cool. I'm interested in getting as much of the data out into a database as possible. I'd rather go with a conventional database than Neo4J, since I don't have any experience with that. A few months ago I made a simple parser for the dewiktionary XML dump which worked by simple pattern matching but wasn't very robust and got stumped on a significant fraction of the words. I thought this time I'd like to try something that was a bit more reliable and could be easily extended to different languages. It sounds like there's nothing out there already that does exactly what I want, so I'll have a crack at it myself. Do you Wiktionary guys have a preferred site to host code? If not I'll default to code.google.com. I'd be interested in seeing any of the parsers you've got and, if the authors don't mind, building upon them. Are they available online? I'll send you an email with my email address. Cheers, Gyroidben 19:35, 23 August 2009 (UTC)
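For the "conventional database" route, a minimal sketch of one possible relational layout follows, using Python's built-in sqlite3 module. The table and column names (entry, sense, relation) and the helper functions are assumptions made up for this illustration, not an existing Wiktionary schema. It only covers headwords, senses, synonyms and translations; the richer annotations mentioned above are exactly what this simple layout leaves out.

```python
import sqlite3

# Hypothetical schema: one row per headword, one per sense, and one per
# synonym/translation attached to a sense.
SCHEMA = """
CREATE TABLE entry (
    id INTEGER PRIMARY KEY,
    headword TEXT NOT NULL,
    lang TEXT NOT NULL
);
CREATE TABLE sense (
    id INTEGER PRIMARY KEY,
    entry_id INTEGER NOT NULL REFERENCES entry(id),
    gloss TEXT NOT NULL
);
CREATE TABLE relation (
    sense_id INTEGER NOT NULL REFERENCES sense(id),
    kind TEXT NOT NULL,          -- 'synonym' or 'translation'
    target_lang TEXT NOT NULL,
    target_word TEXT NOT NULL
);
"""


def add_entry(conn, headword, lang):
    cur = conn.execute(
        "INSERT INTO entry (headword, lang) VALUES (?, ?)", (headword, lang))
    return cur.lastrowid


def add_sense(conn, entry_id, gloss, relations):
    """relations is a list of (kind, target_lang, target_word) tuples."""
    cur = conn.execute(
        "INSERT INTO sense (entry_id, gloss) VALUES (?, ?)", (entry_id, gloss))
    sense_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO relation (sense_id, kind, target_lang, target_word) "
        "VALUES (?, ?, ?, ?)",
        [(sense_id, kind, lang, word) for kind, lang, word in relations])
    return sense_id


def related(conn, sense_id, kind):
    rows = conn.execute(
        "SELECT target_word FROM relation "
        "WHERE sense_id = ? AND kind = ? ORDER BY rowid", (sense_id, kind))
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
dog = add_entry(conn, "dog", "en")
animal = add_sense(conn, dog, "animal", [
    ("synonym", "en", "hound"), ("synonym", "en", "canine"),
    ("translation", "de", "Hund"), ("translation", "de", "Köter"),
])
person = add_sense(conn, dog, "morally reprehensible person", [
    ("synonym", "en", "cad"), ("synonym", "en", "scoundrel"),
])
print(related(conn, animal, "translation"))
```

Attaching synonyms and translations to a sense rather than to the headword is what lets a search tool ask "animal or person?" first and then expand only the chosen meaning.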

I think all the stuff I use for indexing, which includes some guess-work translation extraction, is at http://jelzo.com/indexing.zip and a checkout of the svn repository containing Polyglot's stuff and some other misc Wiktionary stuff is at http://jelzo.com/svn.zip . There's also User:Conrad.Irwin/parser.js, which deals mainly with the HTML parsing instead of wikitext, but only at a high level. It's all one huge mess, so you may be better off writing your own. Conrad.Irwin 22:38, 23 August 2009 (UTC)

Thanks a bunch. Will let you know when I make some progress. Gyroidben 03:05, 24 August 2009 (UTC)[reply]

Converting Wiktionary data to MDF/Shoebox format

Has anybody worked on converting slices of Wiktionary into MDF format? I vaguely recall somebody mentioning something about this, but can't remember where it was. I was thinking of doing this, mostly to take advantage of that sweet Lexique Pro interface. Seems like a great deal of reprocessing will be required, particularly where templates are involved, but it should be doable... -- Visviva 06:12, 5 March 2009 (UTC)

For what it's worth, many MDF codes don't work in Lexique Pro. I don't remember all the details, since it's been about a year since I fiddled with it (I was doing conlanging at the time), but I doubt it'd be easy to automate. Circeus 04:23, 13 March 2009 (UTC)

Well, my initial test on the Italian-English slice of Wiktionary went OK, except that Lexique Pro insisted on inserting question marks as spacing characters between the headword and the definitions (very weird and unsatisfactory; can't figure out why that happened or how to make it stop). However, I was only using definitions and POS's. You're probably right that anything more advanced would run into problems.

Lately I've been looking at the LIFT XML format as a better alternative. There aren't many applications that support it natively (aside from the 3.0 beta of Lexique Pro, which seems to have all the same issues as the stable version), but there are lots of generic XML parsers, so if Wiktionary data could once be ported to LIFT (even imperfectly), it would be trivial for downstreamers to extract specific information of interest. Haven't yet found the time to make a serious attempt at this, though. -- Visviva 10:29, 13 March 2009 (UTC)[reply]

Usually it's an anon wondering why we do things the way we do, then throwing up their hands and leaving, thinking our little project won't get anywhere useful. DAVilla 04:43, 23 March 2009 (UTC)[reply]

Generally, the reason we use mere wikitext and templates can be summarised in saying that using a "proper" dictionary information encoding format would increase the complexity of even a simple entry above and beyond what the average wiki user is willing to go to. Plus we'd still need to figure a way to convert between that and wiki markup, or have a MediaWiki extension made. Circeus 14:18, 23 March 2009 (UTC)[reply]

LST limits?

Hi all,

Does anyone know what the limits on the proper use of Labeled Section Transclusion are, particularly how many labeled sections it is wise and/or possible to have on a page? I'm wondering specifically how far I can go with something like User:Visviva/Linkeration (where the relevant section is transcluded in the edit-intro when you click on the "edit" link). How many sections of this kind can there be on one page before the burden on the server to process the page becomes unacceptable, or the extension just stops working? Any guesses?

I glanced over mw:Extension:Labeled_Section_Transclusion and its talk page, but didn't see anything informative. -- Visviva 18:42, 6 March 2009 (UTC)[reply]

In answer to my own question, whatever the system limit is, it is quite high; even a 600-KB page with hundreds of labeled sections is processed smoothly. And I suppose, assuming the extension uses a regex match or similar, the server load would actually be minimal (less than rendering the source page in the first place). Nothing to see here, move along. :-) -- Visviva 03:58, 8 March 2009 (UTC)[reply]

I wouldn't expect any particular limit on the page with the tagged sections, those tags have no effect on the page itself when rendered, so just disappear during parsing. On the page doing the transcluding, I'd expect the usual 2Mb limit to apply. Robert Ullmann 14:20, 12 March 2009 (UTC)[reply]

It does do a regex on the "target" page (which is why generating the section tags with #tag or other templates doesn't work). For the gory details look here. The code is pretty fragile, you have to get the section tags just right. All of which is to say that what you are doing is perfectly reasonable (:-) Robert Ullmann 14:28, 12 March 2009 (UTC)[reply]
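As a rough illustration of why the section tags have to be literally present in the target page's wikitext, here is a sketch of regex-based section extraction. This is not the extension's actual code; the function name get_section and the exact pattern are assumptions, and the sketch only recognises tags of the form <section begin=NAME /> and <section end=NAME />.

```python
import re

OPTIONAL_QUOTE = "[\"']?"  # section names may or may not be quoted


def get_section(wikitext, label):
    """Return the text between the begin/end tags for `label`.

    Repeated sections with the same label are concatenated; a label that
    does not occur yields an empty string.
    """
    name = OPTIONAL_QUOTE + re.escape(label) + OPTIONAL_QUOTE
    pattern = re.compile(
        r"<section\s+begin\s*=\s*" + name + r"\s*/>"
        r"(.*?)"
        r"<section\s+end\s*=\s*" + name + r"\s*/>",
        re.DOTALL,
    )
    return "".join(m.group(1) for m in pattern.finditer(wikitext))


page = (
    "intro <section begin=a />Alpha<section end=a /> middle "
    '<section begin="b" />Beta<section end="b" /> outro'
)
print(get_section(page, "a"))
print(get_section(page, "b"))
```

Because the match runs over the raw text, a tag produced by a template or by #tag is never seen by the pattern, which is consistent with the behaviour described above.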

Hidden content in Navframe's

In an effort to get collapsible boxes to "play well" with right-hand side elements, I removed the 'style="clear:both"' attribute from the div tags in some collapsible box templates. Predictably, they then shared the width with R-side elements. The problem is that these boxes aren't scrollable, so that when content is wide relative to the usable area, some content is hidden and unretrievable. This was always a problem, but now it's more of a problem since sharing the width with R-side elements means less width for the content frame. Is there a good solution to this? Can we make these boxes scrollable? I don't think the final solution should be to insert the 'clear' styles again, as this display problem will still affect people with small displays and pages with really wide tables (though this can be mitigated by re-formatting tables), plus we'd leave many pages unsightly if the collapsible boxes and R-hand sides can't share the page horizontally. --Bequw → ¢ • τ 21:24, 7 March 2009 (UTC)

Can you post an example? AFAICR, when I happen to have the browser window set very narrow, the nav content just wraps. -- Visviva 04:01, 8 March 2009 (UTC)[reply]

It's noticeable when there's a table in the box. See circumvenio, where there's not much to wrap, so when it gets shrunk it just clips content. For boxes that share the width with right-hand side elements, see for instance botar with the right-side ToC preference. There were more obvious examples with the Latin conjugation boxes before EP put back in the 'clear' style attribute. --Bequw → ¢ • τ 05:05, 8 March 2009 (UTC)

I think I've got something that works. I've added the style "overflow:auto;", and it seems to allow the navframe to scroll. To see the difference, alternate expanding each of the conjugation templates below. The second one should allow you to scroll right to see the hidden content, while the top one won't.

Template:es-conj-ar Template:es-conj-ar (errar)

It works for my setups (IE 7, Firefox, and Chrome on Vista); does it not work for anyone? Assuming this works, would this be useful to add to all the collapsible boxes, so that no content is ever hidden? --Bequw → ¢ • τ 04:08, 12 March 2009 (UTC)

Yes, while thinking about it in the middle of the night (I don't get normal sleep since a drug given to me years ago...) I realized that the behaviour noted must mean that there is an "overflow:hidden;" somewhere, and there is (was). I've fixed the style sheet, so they should all work properly (probably breaking your example ;-) It ("overflow:auto;") should not be in individual templates. Robert Ullmann 14:10, 12 March 2009 (UTC)[reply]

(Oh, and just for the record: I'm the one who introduced this bug in the first place ;-) Robert Ullmann 14:14, 12 March 2009 (UTC)[reply]

automated adding of pinyin to Chinese entries

I have noticed there is heaps of work to do on the Chinese entries and perhaps not so many editors adding them, so perhaps it would be good for editors to focus on what computers/programs cannot really do, like adding definitions. [While definitions can be imported too, that still wouldn't provide present-day words unless somebody donated his or her book, and good books are hard to find anyway.]

Once, say, a transliteration is provided, the corresponding IPA can be created automatically [not that I would know how to do it in practice, but I can think about and understand what computers can do]. This would work the other way too, only the transliteration is easier to enter. Actually, a computer could already make an educated guess at each transcription system's specific forms from the characters alone, apart from characters that have several pronunciations, and even then it could guess from context, the opposite of how, say, my pinyin input system for Chinese characters works.

The reason I ask is that, as a newcomer who has spent a substantial amount of time looking at the wiki code, I find it a really painstaking effort to create even a single entry, even the simplest of entries, like, say, a country name, as I did for Gabon [Chinese entry; the layout was improved by an experienced Wikipedian].

A related thing would be to have a bot add things like the animated stroke order diagrams, which are a tremendous help to people starting to learn Chinese [so that's one I am not asking for myself]. I recognized the corresponding wikicode, and it's relatively straightforward and simple, but I really don't think that I, especially with my RSI-afflicted arms, should even start adding that template to 10,000 or so character entries.

As they say in Chinese, I'm throwing in a "brick", a crude, unrefined, imperfect remark, in the hope of getting back jade from people more knowledgeable about computers than I am :-)

I do feel that IPA is one of the single most helpful tools for learning a new language, and it is often overlooked or presented in a confusing way, and here once again Wiktionary could make such a difference!! Thank you in advance. 219.69.81.128 03:56, 10 March 2009 (UTC) I somehow got logged out, sorry. 史凡 04:00, 10 March 2009 (UTC)

The same could be done for Japanese and Korean, from which I would benefit IPA-wise.

When entering Chinese entries, I can input numbered pinyin with my speech recognition, but not the version with diacritical signs, and it is quite cumbersome to do the latter in my search mask: after entering each special symbol I have to click on the mask again to enter the regular letters. Such a thing might not matter too much for healthy arms, but with my RSI it pushes my arms over the edge, making the difference between potentially quite a few edits daily and just one or two, to keep the pain in my arms in check. So please, please, could somebody figure out a bot or macro or the like so that, after I have entered the numbered pinyin, the regular pinyin and IPA appear? [My speech recognition works halfway decently right now; dictating this is almost a breeze :-)] 史凡 16:56, 11 March 2009 (UTC)
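The numbered-to-diacritic half of this request is mechanical enough to sketch. The following is a minimal illustration, not an existing bot or gadget. It assumes lowercase input, one tone digit (1 to 5, with 5 meaning neutral tone) directly after each syllable, and "v" or "u:" as stand-ins for ü; the function names are made up. It applies the usual placement rule: the mark goes on a or e if present, on the o of ou, and otherwise on the last vowel. It does not attempt IPA.

```python
import re

# Tone-marked forms of each vowel, indexed by tone number minus one.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def mark_syllable(syllable, tone):
    """Place the tone mark for `tone` (1-4) on one pinyin syllable."""
    syllable = syllable.replace("u:", "ü").replace("v", "ü")
    if tone not in (1, 2, 3, 4):
        return syllable  # neutral tone: no mark
    if "a" in syllable:
        target = "a"
    elif "e" in syllable:
        target = "e"
    elif "ou" in syllable:
        target = "o"
    else:
        vowels = [c for c in syllable if c in TONE_MARKS]
        if not vowels:
            return syllable
        target = vowels[-1]
    idx = syllable.rindex(target)
    return syllable[:idx] + TONE_MARKS[target][tone - 1] + syllable[idx + 1:]


def numbered_to_pinyin(text):
    """Convert e.g. 'zhong1guo2' to tone-marked pinyin."""
    return re.sub(
        r"([a-zü:]+)([1-5])",
        lambda m: mark_syllable(m.group(1), int(m.group(2))),
        text,
    )


print(numbered_to_pinyin("zhong1guo2"))
print(numbered_to_pinyin("nv3"))
```

A bot or edit-time script could run a conversion like this over a template parameter, so that only the numbered form ever has to be dictated or typed.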

`[[foo#bar|]]`

Unless my memory is playing tricks on me, at one time [[foo#bar|]] could be used as a shortcut to enter [[foo#bar|foo]] (foo). Now [[foo (bar)|]] works as a shortcut for [[foo (bar)|foo]] (foo), but that isn't much use here on Wiktionary. If my memory is correct and it once worked, does anyone know why support for [[foo#bar|]] was dropped? It certainly would be an alternative to {{l|bar|foo}} that is less burdensome to the server in many instances where {{l}} is being used primarily to ensure that both foos are identical. (For those not aware of this trick, if one uses [[foo (bar)|]], [[foo (bar)|foo]] is what is saved in the wikitext.) Carolina wren 02:34, 20 March 2009 (UTC)

I thought it was requested but never implemented. The feature request I read, if memory serves, asked initially that [[foo#bar|]] be expanded into [[foo#bar|bar]] (again more useful on the 'pedia), but this didn't happen because elsewhere the other way round would be preferred. Conrad.Irwin 15:47, 20 March 2009 (UTC)

Yes, see bugzilla:845 (and also bugzilla:14734, bugzilla:17675), the section anchor was the preferred display, not the article name. Robert Ullmann 17:33, 20 March 2009 (UTC)[reply]
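For readers unfamiliar with the pipe trick, here is a small sketch of the save-time expansion described in this thread. It is an approximation written for illustration, not MediaWiki's parser code: it handles only the trailing-parenthetical case and deliberately leaves links containing a section anchor untouched, matching the behaviour reported above. The names PIPE_TRICK_RE and expand_pipe_trick are made up.

```python
import re

# Matches [[title|]] or [[title (disambiguator)|]], but not links that
# contain '#', which are left alone as described in the thread.
PIPE_TRICK_RE = re.compile(r"\[\[([^\[\]|#]+?)(\s*\([^()]*\))?\|\]\]")


def expand_pipe_trick(wikitext):
    """Fill in the empty link text the way the saved wikitext would show it."""
    def repl(match):
        base = match.group(1)
        target = base + (match.group(2) or "")
        return "[[%s|%s]]" % (target, base)
    return PIPE_TRICK_RE.sub(repl, wikitext)


print(expand_pipe_trick("See [[foo (bar)|]]."))
print(expand_pipe_trick("See [[foo#bar|]]."))
```

The second call returns its input unchanged, which is the gap this thread is about: the software cannot know whether the page name or the section anchor is the wanted display text.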

Bug: case translation in search box

If you type WT:RFD#cooperation into the search box, you are taken to the expected subsection, but if you type wt:rfd#cooperation, you are not. In converting wt:rfd to WT:RFD, the search box also (incorrectly) capitalises the anchor to COOPERATION. Equinox ◑ 15:37, 20 March 2009 (UTC)[reply]

Yes. But can't be fixed, as the UC: string function applies to the whole string. If this particular case is irritating, you might create WT:rfd to redirect to the same target. Robert Ullmann 16:59, 20 March 2009 (UTC)[reply]

Re: "But can't be fixed, as the UC: string function applies to the whole string.": Searching uses the UC: string function? —Ruakh_TALK 21:13, 20 March 2009 (UTC)[reply]

Um, assuming that it is the result of autoredirection from the "didyoumean" generated possibilities, via the javascript. But it did say "the search box" ... I was thinking of http://en.wiktionary.org/wiki/wt:rfd#cooperation but that in fact loses the section link (!)

Yes, the search function tries case as given, all lc, all uc, initial cap, and title case (in two variants), but pays no attention to treating a section link differently. You might add it to bugzilla. Robert Ullmann 08:08, 22 March 2009 (UTC)[reply]

Gory details: http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/SearchEngine.php?revision=47776&view=markup Robert Ullmann 08:31, 22 March 2009 (UTC)[reply]
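A sketch of what "treating a section link differently" could look like: split off the anchor first, vary the case of the page title only, and re-attach the anchor unchanged. This is an illustration in Python, not the code in SearchEngine.php; the function name is hypothetical and it generates only one title-case variant rather than two.

```python
def case_variants(query):
    """Return candidate titles in the order they would be tried.

    Only the part before '#' is case-folded; the section anchor is kept
    exactly as the user typed it.
    """
    title, sep, anchor = query.partition("#")
    candidates = []
    for variant in (
        title,                            # as given
        title.lower(),                    # all lower case
        title.upper(),                    # all upper case
        title[:1].upper() + title[1:],    # initial capital
        title.title(),                    # title case
    ):
        candidate = variant + sep + anchor
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates


variants = case_variants("wt:rfd#cooperation")
print(variants)
```

With this approach wt:rfd#cooperation would yield WT:RFD#cooperation as a candidate, never WT:RFD#COOPERATION.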

when a page needs attention by someone versed in a particular field

It would be nice to have a way to say "this needs attention from someone who knows [e.g.] mathematics", much as we have {{attention}} to say "this needs attention from someone who knows [e.g.] French". Doing it by categories, as we do it for languages, is unwieldy at best, as there would be numerous such possible categories (shall I keep an eye on the math category? the topology one? the algebra? arithmetic? geometry?). It would be nice to have some other solution, though. Perhaps this is one: Add {{{topic}}} parameter to {{attention}}, and have a bot list on some page all the pages so tagged, along with what they're tagged with, arranged by topic, with the topics as headers, and with FORCETOC set. This will enable people to galnce down the list to see what's needed. Thoughts?—msh210℠ 17:34, 20 March 2009 (UTC)[reply]

Watching a parent category would go some way toward solving the problem of numerousness. For example, Trigonometry is a subcategory of Mathematics, and Marketing (IIRC) a subcategory of Business. I don't know to what extent this is possible: whether watching a category is just like watching one page, or whether viewing alerts for all the subcategories could work. Equinox ◑ 21:20, 20 March 2009 (UTC)[reply]

I'm with Equinox on this one. Watching cats is the way to go, but it really needs to be automated by the software. There are so many categories that I want to watch, but there is no practical way to do it, other than checking related changes for each one (and I'm far too lazy to do that on any kind of regular basis). Are there any programmers who could institute such a change in the software? If this is not possible, then I'm down with a bot doing it. -Atelaes λάλει ἐμοί 22:00, 20 March 2009 (UTC)[reply]

Are we saying something like recent changes by category? That would be great. I'd sign up for a few. DCDuring TALK 22:53, 20 March 2009 (UTC)[reply]

While this is certainly off-topic, I was struck by a typo msh made, and decided to check Google to see how common it was. I put my results on Talk:galnce. Hope this isn't a problem! 75.214.219.198 (really, User:JesseW/not logged in) 22:29, 20 March 2009 (UTC)[reply]

Since there is no means of watching cats, I've effected the suggestion I made above, or something similar. See template talk:attention for more info, but, roughly, {{{topic}}}, used in {{attention}}, now adds the entry to category:Entries needing topical attention, and a bot can patrol that category (or special:whatlinkshere/template:attention) for pages that use {{{topic}}} and list the pages by topic on Wiktionary:Entries needing topical attention. A bot that can do so, using pywikipedia, is at Wiktionary:Entries needing topical attention/bot code. Since I have no permanent connection to the Internet, I cannot promise to run that bot periodically (which is one reason I posted the code: so others may run it) — but I do hope to run it from time to time. Please fix anything you see wrong, or that can be better, of course. (That's the other reason for posting the bot code.)—msh210℠ 20:53, 25 March 2009 (UTC)[reply]

I've added the topic= parameter to the templates rfc, rfv, rfv-sense, and rfdef, in addition to attention.—msh210℠ 19:24, 1 June 2009 (UTC)[reply]
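The listing step of such a bot can be sketched independently of pywikipedia. The snippet below is not the code posted at Wiktionary:Entries needing topical attention/bot code; it is a hypothetical helper that assumes the bot has already collected (page title, topic) pairs, and it only builds the wikitext for the listing page, with topics as headers and FORCETOC set as proposed above.

```python
from collections import defaultdict


def build_listing(tagged_pages):
    """Build wikitext listing pages by topic.

    tagged_pages is an iterable of (page_title, topic) pairs, as a bot
    might gather them from pages that use the topic parameter.
    """
    by_topic = defaultdict(list)
    for page, topic in tagged_pages:
        by_topic[topic.strip().lower()].append(page)

    lines = ["__FORCETOC__"]
    for topic in sorted(by_topic):
        lines.append("== %s ==" % topic)
        for page in sorted(by_topic[topic]):
            lines.append("* [[%s]]" % page)
    return "\n".join(lines)


listing = build_listing([
    ("manifold", "mathematics"),
    ("torus", "Mathematics"),
    ("brand", "marketing"),
])
print(listing)
```

Normalising the topic names before grouping keeps "Mathematics" and "mathematics" under one header, which matters when the parameter is typed by hand.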

edit toolbar problem

The customized edit toolbar has a problem with its enhancement for BOLD:

I suspect what was intended was that, when nothing was selected, it would produce:

'''{{subst:PAGENAME}}'''

but instead it is substituting PAGENAME too soon, producing for this page:

'''{{subst:Grease pit}}'''

which does nothing as the parser is apparently smart enough to avoid this basic level of recursion. Carolina wren 04:29, 24 March 2009 (UTC)[reply]

Overlapping content in Chrome

While trying to adjust "1" (removing a large blank space shown by Firefox and Chrome), I came across a rendering inconsistency that became apparent in another part of the page. I made a simple example here that uses just a right-hand side image and a table that specifies a width of 100% (like {{top3}}). IE 8 and Firefox both put the table below the image, whereas Chrome puts it behind the image (covering up content). Neither is great, but arguably Chrome's presentation is more problematic (I'm not sure if one layout is "wrong" or not). I've seen this problem on other, "mature" pages with lots of content. Is there a simple fix? Why do the {top*} templates force a width of 100% (which causes the problems)? Is it just so that the columns of one table match up with others vertically? --Bequw → ¢ • τ 05:14, 25 March 2009 (UTC)