Dealing with MathML

I’ve been copy-pasting into Notes from web-based lectures, usually by converting HTML source grabbed from the lecture to RTF and then pasting it into Notes, based on several scripts floating around here.

However, MathML seems to gum up all the works. As a small snippet:

<math<msub<mn1011</mn<mn2</mn</msub</math

This should turn into:

10112

Is there a way to have things display properly? A direct copy-paste does this:
1011
2

I hope this makes sense…rather frustrated with it.

Edit: I see the forum formats the numbers as this: 10112, so I took out the closing ">"s

Perhaps an illustrative example is this?

    use AppleScript version "2.4" -- Yosemite (10.10) or later
    use framework "Foundation"
    use framework "AppKit" -- needed for initWithHTML:
    use scripting additions

    -- classes, constants, and enums used
    property NSAttributedString : a reference to current application's NSAttributedString
    property NSUTF8StringEncoding : a reference to 4
    property NSString : a reference to current application's NSString
    property NSHTMLTextDocumentType : a reference to current application's NSHTMLTextDocumentType
    property NSRTFTextDocumentType : a reference to current application's NSRTFTextDocumentType
    set theHTML to "<p><math xmlns=\"http://www.w3.org/1998/Math/MathML\" id=\"a19faca6-7934-3f6f-abf5-1c58fa6075c4\"><msub id=\"e5fc92dd-8623-339b-8fc8-f45eb184a945\"><mn id=\"ha3a726c-42ed-3925-b47e-7ce84f6444d1\">1011</mn><mn id=\"dca6ded6-ada6-319a-9918-06a3587b7480\">2</mn></msub><mo id=\"tfc14090-e7ba-3dde-9fcf-5933183ca65f\">=</mo><mn id=\"h59ae6c7-90f1-3234-b3e2-b29d21b2b256\">1</mn><mo id=\"z3254b73-ebb0-3979-8b2e-0b96217e9d4d\">*</mo><msup id=\"c7a70f6a-720d-3e30-a6f1-1040ffb44f8b\"><mn id=\"n6e38293-3aae-3b81-a244-be1245742bf0\">2</mn><mn id=\"f4eb32d5-8b2a-37f9-bd11-8c17f19f2be3\">0</mn></msup><mo id=\"qcd31fa7-dda4-3362-a3b9-9266a0fe66bb\">+</mo><mn id=\"l429697d-dc69-3f18-a8b9-b442e05dce4a\">1</mn><mo id=\"rafe8d37-04a7-39a6-adac-8ca23a84baa8\">*</mo><msup id=\"g50eddbe-9539-3b66-a921-d9a7a869b051\"><mn id=\"w32b53bb-3974-389b-a45d-27eceec1404b\">2</mn><mn id=\"r0e37325-3d3c-3cfd-9ba1-253d643dbb30\">1</mn></msup><mo id=\"bf346158-5559-3716-8e52-395c58e893f9\">+</mo><mn id=\"t4cb18aa-6282-37a9-85cb-bb498bad5738\">0</mn><mo id=\"c928c995-a1dc-307e-a74e-fef1c1a6fe0d\">*</mo><msup id=\"a41125ea-b166-3ef7-8f37-45eb166dd287\"><mn id=\"l2afde10-3c8e-36ac-96ff-3c05910dfd58\">2</mn><mn id=\"sa28af24-1d13-3908-b89f-1f32ddf8d10c\">2</mn></msup><mo id=\"q09b22ec-95d6-3cab-950e-55bf72064e72\">+</mo><mn id=\"adf25973-391c-35ae-9dfd-902e9a429c2b\">1</mn><mo id=\"l5bccfd0-721f-378a-b2e2-dff8f064da82\">*</mo><msup id=\"fcb47f6d-db64-3e33-8801-f41ba8dc9c14\"><mn id=\"x070c131-eca5-391c-ad0c-90d07a11f0da\">2</mn><mn id=\"u6da6f6a-0d4d-30bb-a30d-9409506d9ba3\">3</mn></msup><mo id=\"t0d0e4e9-0ae7-3e6a-bead-ef5f98c024a9\">=</mo><mn id=\"oe230960-4c18-3757-ad4b-e7f116c024ab\">11</mn></math></p>
    "

    set theHTML to NSString's stringWithString:theHTML
    set theData to theHTML's dataUsingEncoding:NSUTF8StringEncoding
    set theATS to NSAttributedString's alloc()'s initWithHTML:theData documentAttributes:(missing value)
    set {htmlData, theError} to theATS's dataFromRange:{0, theATS's |length|()} documentAttributes:{DocumentType:NSHTMLTextDocumentType} |error|:(reference)
    if htmlData = missing value then error theError's localizedDescription() as text
    set theString to (NSString's alloc()'s initWithData:htmlData encoding:NSUTF8StringEncoding) as text

    set {rtfData, err} to theATS's dataFromRange:{0, theATS's |length|()} documentAttributes:{DocumentType:"NSRTF"} |error|:(reference)
    set RTFString to (NSString's alloc()'s initWithData:rtfData encoding:NSUTF8StringEncoding) as text
    set RTFAttr to NSAttributedString's alloc()'s initWithRTF:rtfData documentAttributes:(missing value)

Your code is going in circles a bit there, but that’s probably not the issue. When I try this after your set theATS to ... line:

set rtfData to theATS's RTFFromRange:{0, theATS's |length|()} documentAttributes:{DocumentType:"NSRTF"}
rtfData's writeToFile:"/users/shane/desktop/Out.rtf" atomically:true

I see the problem you describe. So it looks like it’s occurring in the conversion from HTML, and I don’t see any easy workaround for that.

The only alternative is to create a Web view and try to work from that — not exactly simple.

Yes, very circular I admit!

I did end up using regex on a mutableCopy, following another post on this forum, which took out the line breaks, but the sub/superscripts don’t translate, I guess.

I’m working on an XML solution at the moment, slow going.

After probably my most productive exploration of the NSXML documentation, I think I came up with something that works for at least my test snippet. I’ve tidied it up to the essentials, but I would welcome any pointers for further tidiness!

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "AppKit" -- needed for initWithHTML: and NSPasteboard
use scripting additions

property NSUTF8StringEncoding : a reference to 4
property NSString : a reference to current application's NSString
property NSAttributedString : a reference to current application's NSAttributedString
property NSRTFDTextDocumentType : a reference to current application's NSRTFDTextDocumentType
property NSPasteboardTypeRTFD : a reference to current application's NSPasteboardTypeRTFD

---MathML From Lecture HTML---
set theMathML to "<math xmlns=\"http://www.w3.org/1998/Math/MathML\" id=\"a19faca6-7934-3f6f-abf5-1c58fa6075c4\"><msub id=\"e5fc92dd-8623-339b-8fc8-f45eb184a945\"><mn id=\"ha3a726c-42ed-3925-b47e-7ce84f6444d1\">1011</mn><mn id=\"dca6ded6-ada6-319a-9918-06a3587b7480\">2</mn></msub><mo id=\"tfc14090-e7ba-3dde-9fcf-5933183ca65f\">=</mo><mn id=\"h59ae6c7-90f1-3234-b3e2-b29d21b2b256\">1</mn><mo id=\"z3254b73-ebb0-3979-8b2e-0b96217e9d4d\">*</mo><msup id=\"c7a70f6a-720d-3e30-a6f1-1040ffb44f8b\"><mn id=\"n6e38293-3aae-3b81-a244-be1245742bf0\">2</mn><mn id=\"f4eb32d5-8b2a-37f9-bd11-8c17f19f2be3\">0</mn></msup><mo id=\"qcd31fa7-dda4-3362-a3b9-9266a0fe66bb\">+</mo><mn id=\"l429697d-dc69-3f18-a8b9-b442e05dce4a\">1</mn><mo id=\"rafe8d37-04a7-39a6-adac-8ca23a84baa8\">*</mo><msup id=\"g50eddbe-9539-3b66-a921-d9a7a869b051\"><mn id=\"w32b53bb-3974-389b-a45d-27eceec1404b\">2</mn><mn id=\"r0e37325-3d3c-3cfd-9ba1-253d643dbb30\">1</mn></msup><mo id=\"bf346158-5559-3716-8e52-395c58e893f9\">+</mo><mn id=\"t4cb18aa-6282-37a9-85cb-bb498bad5738\">0</mn><mo id=\"c928c995-a1dc-307e-a74e-fef1c1a6fe0d\">*</mo><msup id=\"a41125ea-b166-3ef7-8f37-45eb166dd287\"><mn id=\"l2afde10-3c8e-36ac-96ff-3c05910dfd58\">2</mn><mn id=\"sa28af24-1d13-3908-b89f-1f32ddf8d10c\">2</mn></msup><mo id=\"q09b22ec-95d6-3cab-950e-55bf72064e72\">+</mo><mn id=\"adf25973-391c-35ae-9dfd-902e9a429c2b\">1</mn><mo id=\"l5bccfd0-721f-378a-b2e2-dff8f064da82\">*</mo><msup id=\"fcb47f6d-db64-3e33-8801-f41ba8dc9c14\"><mn id=\"x070c131-eca5-391c-ad0c-90d07a11f0da\">2</mn><mn id=\"u6da6f6a-0d4d-30bb-a30d-9409506d9ba3\">3</mn></msup><mo id=\"t0d0e4e9-0ae7-3e6a-bead-ef5f98c024a9\">=</mo><mn id=\"oe230960-4c18-3757-ad4b-e7f116c024ab\">11</mn></math>"
-----------------

set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:theMathML options:0 |error|:(reference)
if theXMLDoc is missing value then error (theError's localizedDescription() as text)
set Zeus to theXMLDoc's childAtIndex:0 --Child 0 and the XMLDoc are the same?
set ZeusChildren to Zeus's children() --The actual child nodes of MathML
set HTMLCodeString to ""

repeat with ZeusChild in ZeusChildren --Loop through nodes
	set ChildNum to ZeusChild's childCount as integer --Check if there are nested elements within the node
	if ChildNum is greater than 1 then --Means subscripts or superscripts for this MathML snippet
		set SubSup to ZeusChild's localName as text
		if SubSup is "msup" then
			set HTMLCode to ((ZeusChild's childAtIndex:0)'s stringValue as text) & "<sup>" & ((ZeusChild's childAtIndex:1)'s stringValue as text) & "</sup>"
		else if SubSup is "msub" then
			set HTMLCode to ((ZeusChild's childAtIndex:0)'s stringValue as text) & "<sub>" & ((ZeusChild's childAtIndex:1)'s stringValue as text) & "</sub>"
		end if
		set HTMLCodeString to HTMLCodeString & HTMLCode
	else
		set ZeusString to ZeusChild's stringValue as text
		set HTMLCodeString to HTMLCodeString & ZeusString
	end if
end repeat
set HTMLCodeString to "<div><pre>" & HTMLCodeString & "</pre></div>"

---Convert HTML to Attributed String
set HTMLString to NSString's stringWithString:HTMLCodeString
set htmlData to HTMLString's dataUsingEncoding:NSUTF8StringEncoding
-- make attributed string
set attString to NSAttributedString's alloc()'s initWithHTML:htmlData documentAttributes:(missing value)

---Convert Attributed String to RTF then copy that to clipboard for Notes paste
set rtfData to attString's RTFDFromRange:{0, attString's |length|()} documentAttributes:{DocumentType:NSRTFDTextDocumentType}
set pb to current application's NSPasteboard's generalPasteboard() -- get pasteboard
pb's clearContents()
pb's writeObjects:{attString}
pb's setData:rtfData forType:NSPasteboardTypeRTFD

You don’t need the RTF stuff there; putting the attributed string on the clipboard handles it for you. So:

-- make attributed string
set attString to NSAttributedString's alloc()'s initWithHTML:htmlData documentAttributes:(missing value)
set pb to current application's NSPasteboard's generalPasteboard() -- get pasteboard
pb's clearContents()
pb's writeObjects:{attString}

The other approach would be to apply an XSLT stylesheet to the HTML.

It’s also generally a good idea to include the parens with ASObjC properties and methods that don’t take parameters. So childCount(), localName() and so on. And you might replace this:

set Zeus to theXMLDoc's childAtIndex:0

with this:

set Zeus to theXMLDoc's rootElement()

to make it a tad more readable/logical.
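Put together, those two suggestions might look something like the sketch below. It uses a shortened version of the test snippet with the ids dropped, and the handler name mathMLToHTML is made up for illustration. Like the original loop, it only copes with msub/msup elements sitting directly under the math element.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Shortened test snippet (ids dropped for readability)
set theMathML to "<math xmlns=\"http://www.w3.org/1998/Math/MathML\"><msub><mn>1011</mn><mn>2</mn></msub><mo>=</mo><mn>1</mn><mo>*</mo><msup><mn>2</mn><mn>0</mn></msup></math>"
my mathMLToHTML(theMathML) -- result: "1011<sub>2</sub>=1*2<sup>0</sup>"

-- Hypothetical helper: turns top-level msub/msup into HTML sub/sup tags
on mathMLToHTML(theMathML)
	set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:theMathML options:0 |error|:(reference)
	if theXMLDoc is missing value then error (theError's localizedDescription() as text)
	set theRoot to theXMLDoc's rootElement() -- the <math> element
	set theChildren to theRoot's children()
	set theHTML to ""
	repeat with aChild in theChildren
		set theName to (aChild's localName()) as text
		if theName is "msub" or theName is "msup" then
			set theTag to text 2 thru -1 of theName -- "sub" or "sup"
			set theBase to ((aChild's childAtIndex:0)'s stringValue()) as text
			set theIndex to ((aChild's childAtIndex:1)'s stringValue()) as text
			set theHTML to theHTML & theBase & "<" & theTag & ">" & theIndex & "</" & theTag & ">"
		else
			-- plain <mn> or <mo>: just take its text
			set theHTML to theHTML & ((aChild's stringValue()) as text)
		end if
	end repeat
	return theHTML
end mathMLToHTML
```

Checking the element name rather than the child count means an element that merely happens to have two children is not mistaken for a subscript or superscript.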

Here is an almost “production” level test script that I put together based around your XSLT suggestion, incorporating ideas/code I found on this forum. The source HTML is taken out of a Canvas LMS Interactive Lecture iframe, which I have previously banged my head against extracting with JavaScript etc., but now just save through Safari’s “Save Frame As”. Hopefully this will be used with a Folder Action.

use AppleScript version "2.5" -- El Capitan (10.11) or later
use framework "Foundation"
use framework "AppKit"
use scripting additions
use script "RegexAndStuffLib" version "1.0.6"

-- classes, constants, and enums used
property NSXMLDocumentTidyXML : a reference to 1024
property NSRegularExpressionSearch : a reference to 1024
property NSPasteboard : a reference to current application's NSPasteboard
property NSXMLDocument : a reference to current application's NSXMLDocument
property NSXMLElement : a reference to current application's NSXMLElement
property NSUTF8StringEncoding : a reference to 4
property NSISOLatin1StringEncoding : a reference to 5 --Needed for old W3 XSLT file I found
property NSString : a reference to current application's NSString
property NSAttributedString : a reference to current application's NSAttributedString

-----Load Lecture source file and XSLT---
set {theLecture, theError} to NSString's stringWithContentsOfFile:"/Users/ianday-gennett/Documents/School/Module 2 Interactive Lecture.html" encoding:NSUTF8StringEncoding |error|:(reference)
if theLecture is equal to missing value then error (theError's localizedDescription() as text)

set {theXSLT, theError} to NSString's stringWithContentsOfFile:"/Users/ianday-gennett/Downloads/pmathmlcss.xsl" encoding:NSISOLatin1StringEncoding |error|:(reference)
if theXSLT is equal to missing value then error (theError's localizedDescription() as text)
---End Load----

---Chop Unnecessary Beginning and End off---
set theResult to "<h2" & item 2 of (((NSString's stringWithString:theLecture)'s componentsSeparatedByString:"<h2 class=\"Intro-moduleTitle\"") as list)
set theResult to item 1 of (((NSString's stringWithString:theResult)'s componentsSeparatedByString:"<!--<script") as list)
---End Chop----

---Take Internal Pagination out and rectify header damage from chop---
set theResult to regex batch theResult change pairs {{"(<div class=\"mount-section-footer u-mt16\"(?s).*?)(<h3)", "$2"}, {"<h2", "<h1"}, {"</h2>", "</h1>"}, {"<h3", "<h2"}, {"</h3>", "</h2>"}, {"<h4", "<h3"}, {"</h4>", "</h3>"}}
--End Batch Regex---

---Precision XSLT needed because otherwise I get odd characters---
set FindMath to regex search theResult search pattern "<math.*?</math>" capture groups 0 --Look for MathML sections
set HTMLString to NSString's stringWithString:theResult --Work on an NSString; also covers lectures with no MathML
if length of FindMath is greater than 0 then
	repeat with i from (count FindMath) to 1 by -1 --Order does not matter much, the range is looked up afresh each pass
		set theMatch to item i of FindMath
		set MatchRange to (HTMLString's rangeOfString:theMatch) --Get just the MathML
		set {MathXML, theError} to (NSXMLDocument's alloc()'s initWithXMLString:theMatch options:NSXMLDocumentTidyXML |error|:(reference))
		if MathXML is missing value then error (theError's localizedDescription() as text)
		set {theTransform, theError} to (MathXML's objectByApplyingXSLTString:theXSLT arguments:(missing value) |error|:(reference))
		if theTransform is missing value then error (theError's localizedDescription() as text)
		set TransformString to theTransform's XMLString() --Get useable output
		--Store the result back in HTMLString so earlier replacements are not thrown away on the next pass
		set HTMLString to (HTMLString's stringByReplacingOccurrencesOfString:"<math.*?</math>" withString:TransformString options:NSRegularExpressionSearch range:MatchRange)
	end repeat
end if

--Convert HTML to Attributed String
set htmlData to HTMLString's dataUsingEncoding:NSUTF8StringEncoding
-- Make attributed string
set attString to NSAttributedString's alloc()'s initWithHTML:htmlData documentAttributes:(missing value)
set pb to NSPasteboard's generalPasteboard() -- get pasteboard
pb's clearContents()
pb's writeObjects:{attString} --This way preserves the "typed-in" formatting of Notes.app unlike setting the body of a note

tell application "Notes"
	activate
	repeat while frontmost is false
		delay 1
	end repeat
	tell application "System Events"
		tell application process "Notes"
			keystroke "n" using command down
			delay 1
			keystroke "v" using command down
		end tell
	end tell
end tell

Is there a faster way to accomplish this? I am also struggling with formatting the HTML so it looks more the way I like it, mainly because adding more regex just slows it down. It would probably be better if I could take the time to learn/write my own XSLT, I guess. Thanks for all the help so far.

Edit Note. Now I’m thinking that the MathML DTD may be helpful? Just rambling thoughts…

There is: write your own XSLT. Not a simple task if you’re not familiar with it, though.
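As a rough starting point, a stylesheet that only deals with subscripts and superscripts can be quite short. The sketch below is an illustration and not a replacement for pmathmlcss.xsl: it assumes the MathML uses the standard namespace, wraps the result in a span (an arbitrary choice), and relies on XSLT's built-in rules to copy the text of mn and mo elements through unchanged.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Minimal illustrative stylesheet: msub/msup become HTML sub/sup, everything else passes through as text
set theXSLT to "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" xmlns:m=\"http://www.w3.org/1998/Math/MathML\" exclude-result-prefixes=\"m\">
<xsl:template match=\"m:math\"><span><xsl:apply-templates/></span></xsl:template>
<xsl:template match=\"m:msub\"><xsl:apply-templates select=\"*[1]\"/><sub><xsl:apply-templates select=\"*[2]\"/></sub></xsl:template>
<xsl:template match=\"m:msup\"><xsl:apply-templates select=\"*[1]\"/><sup><xsl:apply-templates select=\"*[2]\"/></sup></xsl:template>
</xsl:stylesheet>"

-- Shortened test snippet (ids dropped for readability)
set theMathML to "<math xmlns=\"http://www.w3.org/1998/Math/MathML\"><msub><mn>1011</mn><mn>2</mn></msub><mo>=</mo><mn>1</mn><mo>*</mo><msup><mn>2</mn><mn>0</mn></msup></math>"

set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:theMathML options:0 |error|:(reference)
if theXMLDoc is missing value then error (theError's localizedDescription() as text)
set {theTransform, theError} to theXMLDoc's objectByApplyingXSLTString:theXSLT arguments:(missing value) |error|:(reference)
if theTransform is missing value then error (theError's localizedDescription() as text)
-- Taking the root element's XMLString() leaves out any XML declaration
set theHTML to (theTransform's rootElement()'s XMLString()) as text
```

Because the templates use apply-templates rather than pulling out values directly, nested cases such as a superscript inside a subscript should fall out of the same three rules. Fractions, roots and the like would each need a template of their own.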