Friday, September 11, 2015

Music21 v2.0.10 beta -- MusicXML Reading

This post announces the v.2.0.10 beta release of music21, which is moving quickly to the official v.2 release, v.2.1.  Some of the changes have already been announced on the music21list Google Groups mailing list.

Upgrade by downloading the new release or by running "pip install --upgrade music21".

The major changes include:

  • New parsing engine for MusicXML (see below)
  • DurationTuples replace DurationUnits
  • Percussion clefs and No Clefs now are supported properly in musicxml output
  • Improvements to the RomanText and clercqTemperley formats (thanks DT!)
  • Some obscure modules removed from the main namespace:
    1. intervalNetwork becomes scale.intervalNetwork and BoundIntervalNetwork becomes simply IntervalNetwork.
    2. scala becomes scale.scala
    3. chord becomes a package and chordTables becomes chord.tables 
  • In the next version, expect languageExcerpts to become text.languageDetection and the "xmlnode" module to disappear.
  • Environment and CapellaXML, which depended on XMLNode, no longer do.  CapellaXML processing is 10x faster.
  • jsonpickling is upgraded and safer.
  • Building documentation now works on IPython 4/Jupyter 4.0
  • MusicXML output with Unicode now works on Py3 (thanks Sarig!)
  • Spanners on Rests now export properly in MusicXML
  • VexFlow now supports only the music21j-based output. More bug fixes there to come (or it will be moved to alpha support)
  • Everything overall is about 30% faster than a month ago.

The biggest change in this version is how MusicXML is processed.  When Christopher Ariza joined the music21 team in 2008, music21 had a tiny limitation: it didn't work with MusicXML, at all. Whoops! It was just too big a task for me to tackle when I was still figuring out how Streams, Sites, Durations, etc. would work. Thankfully Chris took it on and extremely quickly produced a great parser for MusicXML.  The problem back then was that few people were on the latest, greatest version of Python, 2.5, and music21 aimed to support at least back to Python 2.1. Only the newest Python 2.5 had the brand new "ElementTree" XML processing module (and there were still substantial bugs in that module before Python 2.6).  We were determined not to make MusicXML parsing require an external library such as "lxml", so that left two choices: xml.dom.minidom and xml.sax.

Anyone who knows anything about the structure of MusicXML and the differences in philosophy between DOM and SAX will know that DOM is the logical choice for MusicXML parsing -- it allows nodes to look at their neighbors, parents, children, and make logical decisions (am I a note, rest, or chord?) based on the context.  SAX on the other hand is built on calling functions whenever a particular tag start is encountered, whenever data is encountered, and whenever a stop tag is encountered. Great for certain types of text formatting, insanely difficult for a format like MusicXML (or MEI or just about any music format besides perhaps MIDI).  So, if memory serves, Chris wrote a quick DOM processor for MusicXML and it was getting notes, durations, measures, beautifully.
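To make the contrast concrete, here is a minimal sketch using only the Python standard library. It is not music21's parser: the tiny MusicXML-like fragment, the classify function, and the NoteKindHandler class are all made up for illustration. With a tree, each <note> can simply inspect its own children; with SAX, the same decision requires state carried from one callback to the next.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified MusicXML-like fragment (illustration only).
FRAGMENT = """<measure number="1">
  <note><pitch><step>C</step><octave>4</octave></pitch><duration>4</duration></note>
  <note><chord/><pitch><step>E</step><octave>4</octave></pitch><duration>4</duration></note>
  <note><rest/><duration>4</duration></note>
</measure>"""


def classify(note_element):
    """Tree style: a <note> decides what it is by looking at its children."""
    if note_element.find('rest') is not None:
        return 'rest'
    if note_element.find('chord') is not None:
        return 'chord member'
    return 'note'


tree_kinds = [classify(n) for n in ET.fromstring(FRAGMENT).iter('note')]


class NoteKindHandler(xml.sax.ContentHandler):
    """SAX style: the same decision needs state kept between callbacks."""

    def __init__(self):
        super().__init__()
        self.kinds = []
        self._current = None  # kind of the <note> we are inside, if any

    def startElement(self, name, attrs):
        if name == 'note':
            self._current = 'note'  # provisional until a child says otherwise
        elif name == 'rest' and self._current is not None:
            self._current = 'rest'
        elif name == 'chord' and self._current is not None:
            self._current = 'chord member'

    def endElement(self, name):
        if name == 'note':
            self.kinds.append(self._current)
            self._current = None


handler = NoteKindHandler()
xml.sax.parseString(FRAGMENT.encode('utf-8'), handler)
sax_kinds = handler.kinds

print(tree_kinds)
print(sax_kinds)
```

Both approaches reach the same answer here, but note that the first <note> is also part of the chord, which can only be learned by looking at its following sibling. That is easy with a tree and awkward with SAX, where the first note has already been reported by the time the <chord/> tag arrives.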

But Chris Ariza is also probably the best programmer I've ever met and before going further he profiled the system and extrapolated what it would be like to work with a large corpus of MusicXML files using it.  Slow as slime.  The minidom was implemented entirely in Python, not highly optimized, and was not going to make anyone want to use MusicXML in the toolkit.

So, he basically did the impossible: implemented a blazingly fast SAX processor for MusicXML that built a close-to-the-original representation of the file (musicxml.mxObjects) and then processed that in a much more friendly format.  Bam! Speed went up by an order of magnitude, and everything that music21 could do with MusicXML was born.  In the dozens of releases since he moved on from the project, I've barely had to touch the internals at all even as the rest of the system has expanded and changed dramatically. And there was a system for caching the mxObjects representation for a speedup in the next parse.

Fast forward 7 years.  Python has changed.  Version 2.7 is now the minimum requirement (it's over five years old already; we just found a check for Python > 2.2 somewhere in the system and removed it!).  Versions 3.3 and 3.4 are supported (3.5 should be out this week and of course will be supported).  And everyone has access to xml.etree.ElementTree now. And the final representation of all parsed formats is now cached, so there is no need for the mxObjects cache.  So in the interest of simplifying parsing (and getting a 40% speedup over SAX + mxObjects), it made sense to rewrite the MusicXML parsing engine.

The new version is called musicxml.xmlToM21.  There are a few miscellaneous helpers in a new musicxml.xmlObjects file, but basically all the parsing takes place in the xmlToM21 file.  Every tag in MusicXML is now written directly into the file to make it easier to see exactly which tag is causing any particular problem. (Line number properties may be possible to add soon).  Because the format of the parser is now much closer to the format of the MusicXML document, a TODO has been added for every missing tag or attribute.  Expect music21 to support every tag and attribute in MusicXML 3.0 sometime soon.  If you've ever wanted to hack additional support into music21's MusicXML parsing but it seemed too daunting, take another look at the code now.
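As a rough illustration of the "one explicit branch per tag" style described above, here is a sketch built on xml.etree.ElementTree. It is not the code in musicxml.xmlToM21: the parse_note function, the dictionary it returns, and the sample <note> are hypothetical and exist only to show how unsupported tags become easy to spot.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample <note>; real MusicXML notes carry many more children.
NOTE_XML = """<note>
  <pitch><step>G</step><alter>1</alter><octave>4</octave></pitch>
  <duration>2</duration>
  <type>half</type>
  <stem>up</stem>
</note>"""


def parse_note(mx_note):
    """Read one <note> element into a plain dict, one branch per known tag."""
    result = {'unhandled': []}
    for child in mx_note:
        if child.tag == 'pitch':
            result['step'] = child.findtext('step')
            # <alter> is optional and may be fractional, so read it as a float.
            result['alter'] = float(child.findtext('alter', default='0'))
            result['octave'] = int(child.findtext('octave'))
        elif child.tag == 'duration':
            result['divisions'] = int(child.text)
        elif child.tag == 'type':
            result['type'] = child.text
        else:
            # TODO: tag not supported in this sketch; record it so gaps show up.
            result['unhandled'].append(child.tag)
    return result


parsed = parse_note(ET.fromstring(NOTE_XML))
print(parsed)
```

Because every tag name appears literally in the code, searching the file for a tag such as "stem" shows immediately whether it is handled or still falls through to the TODO branch.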

This is a major change to the most-used format in music21. Thankfully, Ariza wrote so many tests into the system that I am relatively confident that everything now works exactly like before.  The exceptions are: non-printed notes are no longer skipped (this was to prevent the next bug), notes with incorrect divisions are now corrected rather than skipped, and spanners preceding rests are now attached to the rest rather than the next adjacent note.  (My intention was to be 100% compatible with before, but it would've been very hard to replicate this incorrect behavior).  The one negative side effect you will see is that parsing some of the Beethoven files is now slower (rather than 40% faster) because some of those files used a large number of incorrectly notated, non-printing notes to represent playback of trills.  For certain files (such as the Große Fuge) the number of notes in the score will almost double with the new system.

Because this change is major, for now you can still use the old parsing system via converter.parse('filename.xml', format='oldmusicxml').  I suggest also adding "forceSource=True" to make sure that you are reading the file from disk and not from the cache.

I'm extremely excited by this change -- we will get the writing of music21 files to use the new system by the next release (a much easier task).

As always, music21 has been supported by the Seaver Institute, the NEH Digging into Data grant, and MIT Music and Theater Arts/SHASS.
