Multimedia:Meeting in Paris/Notes/Metadata

From Wikimedia Usability Initiative
  • Saturday, November 7, 2009 - room 2
  • Moderator: Duesentrieb
  • Note-taker: Brion

Structured data!

  • metadata about pages/revisions
  • metadata about media files
    • from MediaWiki [categories etc]
    • embedded
      • inherent: file size, page count etc
      • additional data [EXIF, IPTC, XMP]
    • supplied on description page
    • external data
      • For batch uploads... would these come into the page?

What's the authoritative source?

  • Wikitext as primary source? [editable]
    • -> cache in index [searchable, queryable]
    • -> can override other sources [exif, file-inherent data]
    • External (read-only) authority databases?
      • (Do we need/want such or just import things per-source when doing batch uploads?)

Structured data use cases

  • media file metadata export for re-use
    • api, embedded
      • [Modify EXIF or XMP data to embed extra stuff?] [Recommend pushing that to later -brion]
      • orig artists, repo, institution...
    • Including info from authority files
    • using structured data about things+files for search and research

Let's try to keep this minimal?

  • Real use cases?
    • search
      • license, keywords etc
    • reuse
      • -> Generate byline
        • author(s)
        • license
    • Main fields... (many of these can be tied to Dublin Core fields, so import/export for academic collections can hit common denominators; see the sketch after this list)
      • Fixed data formats
        • license
          • license features
          • dc:rights, creative commons extensions
        • geo coordinates
        • file format
          • jpeg vs tiff vs ogg
        • media type
          • drawing vs photo vs map
        • filesize / resolution
        • content language
          • dc:language
        • source collection + identifier
        • source URL
          • dc:source
        • creation date
          • dc:date
      • Freetext [can have multilingual variants]
        • author
          • dc:creator
        • description
          • dc:description
        • source
          • dc:source
        • "other fields" from foreign source
          • keywords
            • (problems with consistency? some sources use controlled vocabularies, but they don't match up reliably)
          • people shown in picture
          • etc
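
A rough sketch of the two reuse ideas above (the Dublin Core mapping and byline generation), in Python; the field names and the dc: mapping here are illustrative, not a decided schema:

  # Illustrative only: a flat dict of extracted fields, a partial mapping to
  # Dublin Core terms, and a byline built from author(s) + license.
  DC_MAPPING = {
      "license": "dc:rights",
      "content_language": "dc:language",
      "source_url": "dc:source",
      "creation_date": "dc:date",
      "author": "dc:creator",
      "description": "dc:description",
  }

  def to_dublin_core(fields):
      """Re-key extracted fields to Dublin Core terms where a mapping exists."""
      return {DC_MAPPING[k]: v for k, v in fields.items() if k in DC_MAPPING}

  def byline(fields):
      """Byline for reuse: author(s) first, then license."""
      authors = fields.get("author") or []
      if isinstance(authors, str):
          authors = [authors]
      parts = [", ".join(authors)] if authors else []
      if fields.get("license"):
          parts.append(fields["license"])
      return " / ".join(parts)

  example = {"author": ["Jane Doe"], "license": "CC BY-SA 3.0", "creation_date": "2009-11-07"}
  print(to_dublin_core(example))  # {'dc:creator': ['Jane Doe'], 'dc:rights': 'CC BY-SA 3.0', 'dc:date': '2009-11-07'}
  print(byline(example))          # Jane Doe / CC BY-SA 3.0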

How to source/edit the stuff...

  • source fields from file inherent/EXIF
  • scary templates
    • Nesting is required for extracting triples sanely...
      • {{Information|desc={{en|hello}}{{de|hallo}}}}
        • We _think_ we can break this out
        • "template" "Information@desc/en" "hello"
        • "template" "Information@desc/de" "hallo"

Normalization of property specs/extraction most important

Normalization of property values can be left for later

Metadata extraction spec

Possibility...

  • Define sourcing for <dc:description> from:
    • "template" "Information@desc/<lang>"
    • "exif" "Description"
  • Gives us these triples:
    • "File:Hallo.jpg" "dc:description@lang=de" "hallo"
    • "File:Hallo.jpg" "dc:description@lang=en" "hello"
    • "File:Hallo.jpg" "dc:description" "Created with The Gimp"
  • also other sources of info available:
    • revision data / uploader / editor
    • external authority data [rdf triples given from bulk uploader?]
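
One way the spec could look as plain data, with an extraction pass that produces exactly the triples listed above; the function and the spec layout are a sketch, not a proposed format:

  # Hypothetical extraction spec: each target property lists its sources in order.
  SPEC = {
      "dc:description": [
          ("template", "Information@desc/<lang>"),
          ("exif", "Description"),
      ],
  }

  def run_spec(page, spec, sources):
      """Turn pre-extracted per-source data into (subject, property, value) triples."""
      triples = []
      for prop, rules in spec.items():
          for source, key in rules:
              data = sources.get(source, {})
              if "<lang>" in key:
                  prefix = key.split("<lang>")[0]
                  for k, v in data.items():
                      if k.startswith(prefix):
                          triples.append((page, "%s@lang=%s" % (prop, k[len(prefix):]), v))
              elif key in data:
                  triples.append((page, prop, data[key]))
      return triples

  sources = {
      "template": {"Information@desc/de": "hallo", "Information@desc/en": "hello"},
      "exif": {"Description": "Created with The Gimp"},
  }
  for triple in run_spec("File:Hallo.jpg", SPEC, sources):
      print(triple)
  # ('File:Hallo.jpg', 'dc:description@lang=de', 'hallo')
  # ('File:Hallo.jpg', 'dc:description@lang=en', 'hello')
  # ('File:Hallo.jpg', 'dc:description', 'Created with The Gimp')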

Possibly keep the spec in a special page, a MediaWiki: namespace message page (horrid), or a config file. ;)

Whatever we choose, consider that it may change, so we need to be able to re-run metadata extraction in bulk.

Backend storage & search...

  • mysql? (page/field/keyword triples ... or searchindex?)
  • lucene? (for each page: field/keyword pairs)
  • specialized triple store? (sparql)
    • potentially really cool searches possible!
    • recommend leaving this for later; third parties like DBpedia and Freebase cover this area well
  • Look at what Semantic MediaWiki does for its storage
  • Minor note: with RDF triples, subject can be tricky (page vs image vs subject of image)
    • SMW -- talk page's subject is the page, so you can put data about a page there ;)
    • This seems simpler
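
A minimal sketch of the page/field/value-triples-in-SQL option (using SQLite here purely for illustration; table and column names are assumptions):

  import sqlite3

  # One row per triple; an index on (field, value) covers simple
  # "all pages where license = X" style searches.
  db = sqlite3.connect(":memory:")
  db.execute("""
      CREATE TABLE file_metadata (
          page  TEXT NOT NULL,   -- e.g. 'File:Hallo.jpg'
          field TEXT NOT NULL,   -- e.g. 'dc:description@lang=de'
          value TEXT NOT NULL
      )
  """)
  db.execute("CREATE INDEX md_field_value ON file_metadata (field, value)")

  db.executemany("INSERT INTO file_metadata VALUES (?, ?, ?)", [
      ("File:Hallo.jpg", "dc:description@lang=de", "hallo"),
      ("File:Hallo.jpg", "dc:description@lang=en", "hello"),
      ("File:Hallo.jpg", "dc:rights", "CC BY-SA 3.0"),
  ])

  rows = db.execute(
      "SELECT page FROM file_metadata WHERE field = ? AND value = ?",
      ("dc:rights", "CC BY-SA 3.0"),
  ).fetchall()
  print(rows)  # [('File:Hallo.jpg',)]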

Horrible questions: licensing?

  • metadata licensing is a scary world out there...
  • Let's pretend the problem doesn't exist for now ;)
    • [but database protection rules in Europe may need to be looked at... we think that mostly hits on our side]

Points of action!

  • Good news: exif extraction is already here
  • Work: template extraction sounds feasible and not too insane
    • [brion: see about a proof of concept to make sure this works like we think :D]
  • Work: need to specify an extraction spec language
  • Work: core DB storage + query/dump export
  • Work: need to add interface to add the fields to search backends
    • Lucene <- key for live
    • MySQL fulltext <- good for third-parties

Other considerations

  • sometimes metadata is about *subject of image* instead of *image*. Ughhhhh?
    • at minimum, need to be able to spec for the subject... maybe
  • Location: for paintings, do we record the location of the painting or of the subject of the painting?

Export methods

  • raw RDF via API
  • XML RDF in special:export and dump XML
    • ^if included in dumpBackup and OAI export, our Lucene indexer can easily ingest that!
  • RDFa embedded in HTML pages
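
A rough sketch of the RDFa option, producing a small annotated HTML fragment from a description page's triples; the exact markup MediaWiki would emit is not decided here:

  from xml.sax.saxutils import escape, quoteattr

  # Illustrative RDFa 1.0 fragment: one span per dc: property, all scoped to
  # the file page via the about attribute.
  def rdfa_fragment(page_url, triples):
      lines = ['<div about=%s xmlns:dc="http://purl.org/dc/elements/1.1/">'
               % quoteattr(page_url)]
      for prop, value in triples:
          lines.append('  <span property=%s>%s</span>' % (quoteattr(prop), escape(value)))
      lines.append("</div>")
      return "\n".join(lines)

  print(rdfa_fragment(
      "https://commons.wikimedia.org/wiki/File:Hallo.jpg",
      [("dc:description", "hello"), ("dc:creator", "Jane Doe")],
  ))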