Multimedia:Meeting in Paris/Notes/Metadata

From Wikimedia Usability Initiative
  • Saturday, November 7, 2009 - room 2
  • Moderator: Duesentrieb
  • Note-taker: Brion

Structured data!

  • metadata about pages/revisions
  • metadata about media files
    • from MediaWiki [categories etc]
    • embedded
      • inherent: file size, page count etc
      • additional data [EXIF, IPTC, XMP]
    • supplied on description page
    • external data
      • For batch uploads... would these come into the page?

What's the authoritative source?

  • Wikitext as primary source? [editable]
    • -> cache in index [searchable, queryable]
    • -> can override other sources [exif, file-inherent data]
    • External (read-only) authority databases?
      • (Do we need/want such or just import things per-source when doing batch uploads?)

Structured data use cases

  • media file metadata export for re-use
    • api, embedded
      • [Modify EXIF or XMP data to embed extra stuff?] [Recommend pushing that to later -brion]
      • orig artists, repo, institution...
    • Including info from authority files
    • using structured data about things+files for search and research

Let's try to keep this minimal?

  • Real use cases?
    • search
      • license, keywords etc
    • reuse
      • -> Generate byline
        • author(s)
        • license
    • Main fields... (many of these can be tied to Dublin Core fields, so import/export for academic collections can hit common denominators; see the sketch after this list)
      • Fixed data formats
        • license
          • license features
          • dc:rights, creative commons extensions
        • geo coordinates
        • file format
          • jpeg vs tiff vs ogg
        • media type
          • drawing vs photo vs map
        • filesize / resolution
        • content language
          • dc:language
        • source collection + identifier
        • source URL
          • dc:source
        • creation date
          • dc:date
      • Freetext [can have multilingual variants]
        • author
          • dc:creator
        • description
          • dc:description
        • source
          • dc:source
        • "other fields" from foreign source
          • keywords
            • (problems with consistency? some sources use controlled vocabularies, but they don't match up reliably)
          • people shown in picture
          • etc
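
A rough sketch of the two reuse ideas above (the Dublin Core mapping and byline generation), in Python; the field names and the dc: mapping here are illustrative, not a decided schema:

  # Illustrative only: a flat dict of extracted fields, a partial mapping to
  # Dublin Core terms, and a byline built from author(s) + license.
  DC_MAPPING = {
      "license": "dc:rights",
      "content_language": "dc:language",
      "source_url": "dc:source",
      "creation_date": "dc:date",
      "author": "dc:creator",
      "description": "dc:description",
  }

  def to_dublin_core(fields):
      """Re-key extracted fields to Dublin Core terms where a mapping exists."""
      return {DC_MAPPING[k]: v for k, v in fields.items() if k in DC_MAPPING}

  def byline(fields):
      """Byline for reuse: author(s) first, then license."""
      authors = fields.get("author") or []
      if isinstance(authors, str):
          authors = [authors]
      parts = [", ".join(authors)] if authors else []
      if fields.get("license"):
          parts.append(fields["license"])
      return " / ".join(parts)

  example = {"author": ["Jane Doe"], "license": "CC BY-SA 3.0", "creation_date": "2009-11-07"}
  print(to_dublin_core(example))  # {'dc:creator': ['Jane Doe'], 'dc:rights': 'CC BY-SA 3.0', 'dc:date': '2009-11-07'}
  print(byline(example))          # Jane Doe / CC BY-SA 3.0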

How to source/edit the stuff...

  • source fields from file inherent/EXIF
  • scary templates
    • Nesting is required for extracting triples sanely...
      • {{Information|desc={{en|hello}}{{de|hallo}}}}
        • We _think_ we can break this out
        • "template" "Information@desc/en" "hello"
        • "template" "Information@desc/de" "hallo"

Normalization of property specs/extraction most important

Normalization of property values can be left for later

Metadata extraction spec

Possibility...

  • Define sourcing for <dc:description> from:
    • "template" "Information@desc/<lang>"
    • "exif" "Description"
  • Gives us these triples:
    • "File:Hallo.jpg" "dc:description@lang=de" "hallo"
    • "File:Hallo.jpg" "dc:description@lang=en" "hello"
    • "File:Hallo.jpg" "dc:description" "Created with The Gimp"
  • also other sources of info available:
    • revision data / uploader / editor
    • external authority data [rdf triples given from bulk uploader?]
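
One way the spec could look as plain data, with an extraction pass that produces exactly the triples listed above; the function and the spec layout are a sketch, not a proposed format:

  # Hypothetical extraction spec: each target property lists its sources in order.
  SPEC = {
      "dc:description": [
          ("template", "Information@desc/<lang>"),
          ("exif", "Description"),
      ],
  }

  def run_spec(page, spec, sources):
      """Turn pre-extracted per-source data into (subject, property, value) triples."""
      triples = []
      for prop, rules in spec.items():
          for source, key in rules:
              data = sources.get(source, {})
              if "<lang>" in key:
                  prefix = key.split("<lang>")[0]
                  for k, v in data.items():
                      if k.startswith(prefix):
                          triples.append((page, "%s@lang=%s" % (prop, k[len(prefix):]), v))
              elif key in data:
                  triples.append((page, prop, data[key]))
      return triples

  sources = {
      "template": {"Information@desc/de": "hallo", "Information@desc/en": "hello"},
      "exif": {"Description": "Created with The Gimp"},
  }
  for triple in run_spec("File:Hallo.jpg", SPEC, sources):
      print(triple)
  # ('File:Hallo.jpg', 'dc:description@lang=de', 'hallo')
  # ('File:Hallo.jpg', 'dc:description@lang=en', 'hello')
  # ('File:Hallo.jpg', 'dc:description', 'Created with The Gimp')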

Possibly keep the spec in a special page, a MediaWiki: namespace message page (horrid), or a config file. ;)

Whatever we choose, consider that it may change, so we need to be able to re-run metadata extraction in bulk.

Backend storage & search...

  • mysql? (page/field/keyword triples ... or searchindex?)
  • lucene? (for each page: field/keyword pairs)
  • specialized triple store? (sparql)
    • potentially really cool searches possible!
    • recommend leaving this for later; third parties like DBpedia and Freebase cover this area well
  • Look at what Semantic MediaWiki does for its storage
  • Minor note: with RDF triples, subject can be tricky (page vs image vs subject of image)
    • SMW -- talk page's subject is the page, so you can put data about a page there ;)
    • This seems simpler
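
A minimal sketch of the page/field/value-triples-in-SQL option (using SQLite here purely for illustration; table and column names are assumptions):

  import sqlite3

  # One row per triple; an index on (field, value) covers simple
  # "all pages where license = X" style searches.
  db = sqlite3.connect(":memory:")
  db.execute("""
      CREATE TABLE file_metadata (
          page  TEXT NOT NULL,   -- e.g. 'File:Hallo.jpg'
          field TEXT NOT NULL,   -- e.g. 'dc:description@lang=de'
          value TEXT NOT NULL
      )
  """)
  db.execute("CREATE INDEX md_field_value ON file_metadata (field, value)")

  db.executemany("INSERT INTO file_metadata VALUES (?, ?, ?)", [
      ("File:Hallo.jpg", "dc:description@lang=de", "hallo"),
      ("File:Hallo.jpg", "dc:description@lang=en", "hello"),
      ("File:Hallo.jpg", "dc:rights", "CC BY-SA 3.0"),
  ])

  rows = db.execute(
      "SELECT page FROM file_metadata WHERE field = ? AND value = ?",
      ("dc:rights", "CC BY-SA 3.0"),
  ).fetchall()
  print(rows)  # [('File:Hallo.jpg',)]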

Horrible questions: licensing?

  • metadata licensing is a scary world out there...
  • Let's pretend the problem doesn't exist for now ;)
    • [but database protection rules in Europe may need to be looked at... we think that mostly hits on our side]

Points of action!

  • Good news: exif extraction is already here
  • Work: template extraction sounds feasible and not too insane
    • [brion: see about a proof of concept to make sure this works like we think :D]
  • Work: need to specify an extraction spec language
  • Work: core DB storage + query/dump export
  • Work: need to add interface to add the fields to search backends
    • Lucene <- key for live
    • MySQL fulltext <- good for third-parties

Other considerations

  • sometimes metadata is about *subject of image* instead of *image*. Ughhhhh?
    • at minimum, need to be able to spec for the subject... maybe
  • Location: for paintings, do we record the location of the painting or of the subject of the painting?

Export methods

  • raw RDF via API
  • XML RDF in special:export and dump XML
    • ^if included in dumpBackup and OAI export, our Lucene indexer can easily ingest that!
  • RDFa embedded in HTML pages
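
A rough sketch of the RDFa option, producing a small annotated HTML fragment from a description page's triples; the exact markup MediaWiki would emit is not decided here:

  from xml.sax.saxutils import escape, quoteattr

  # Illustrative RDFa 1.0 fragment: one span per dc: property, all scoped to
  # the file page via the about attribute.
  def rdfa_fragment(page_url, triples):
      lines = ['<div about=%s xmlns:dc="http://purl.org/dc/elements/1.1/">'
               % quoteattr(page_url)]
      for prop, value in triples:
          lines.append('  <span property=%s>%s</span>' % (quoteattr(prop), escape(value)))
      lines.append("</div>")
      return "\n".join(lines)

  print(rdfa_fragment(
      "https://commons.wikimedia.org/wiki/File:Hallo.jpg",
      [("dc:description", "hello"), ("dc:creator", "Jane Doe")],
  ))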