Converting HTML (or XML) table to AppleScript list

Below is a raw text table I’m pulling out of a message in Mail.

What I need from that is just the data in an AppleScript list (which will be displayed in Myriad Tables). This example has just two rows; the full tables usually have about 50.

I suppose I could write a long routine to strip out all the formatting and the row/data delimiters, but I’m hoping either someone has already done that and has an ExtractListFromHTMLTable handler, or there’s an ASObjC command that could do it.

Any suggestions?

<tbody>
<tr>
<td nowrap=\"nowrap\" style=\"width: 47px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>5070</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>1/29/2017</strong></td>
<td nowrap=\"nowrap\" style=\"width: 60px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>11:00 AM</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88-14-Glendale SC#5</strong></td>
<td nowrap=\"nowrap\" style=\"width: 45px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS</strong></td>
<td nowrap=\"nowrap\" style=\"width: 93px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-88-Grakasian</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-13-Way</strong></td>
<td nowrap=\"nowrap\" style=\"width: 42px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 114px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 102px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
</tr>
<tr>
<td nowrap=\"nowrap\" style=\"width: 47px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>6022</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>1/29/2017</strong></td>
<td nowrap=\"nowrap\" style=\"width: 60px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>12:30 PM</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88-14-Glendale SC#5</strong></td>
<td nowrap=\"nowrap\" style=\"width: 45px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14GA</strong></td>
<td nowrap=\"nowrap\" style=\"width: 93px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14GA-88A-Boehm</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14GA-13B-Baer</strong></td>
<td nowrap=\"nowrap\" style=\"width: 42px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 114px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-88-Grakasian(HELP)</strong></td>
<td nowrap=\"nowrap\" style=\"width: 102px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-13-Way</strong></td>
</tr>
</tbody>

Ed, sorry I don’t have an immediate answer for you.
But if you want to display HTML code in Discourse, you can use the code tag with “html” as the language:

 ```html
 <table style="width:100%">
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
 </table>
 ```

Thanks, much better!

@estockly, lots of ways to skin this cat.
Have you done an Internet search for something like “AppleScript extract html table”? I’m betting someone has already done this.

Parsing HTML can be very tricky. The best way is to use JavaScript in a browser with the built-in DOM and JavaScript commands. You can easily get just the plain text from the HTML code using this approach. But you would need to do this using JavaScript (from AppleScript), and then pass the data to an AppleScript list.

Another approach is to use RegEx. Here’s a simple, brute-force RegEx that expects all HTML to be EXACTLY like you posted:

EDIT: 2017-01-26 7:07 PM CT

  • Here’s a better RegEx, that should work (but no promises) with any HTML <td>. . .</td> code:
<td.+>(?:<.*>)*?(.+?)(?:<.*>)*?<\/td>

If you want to see how this works, see this RegEx101 example.

To use this, you would split the HTML code into an AppleScript list using “<tr>” and “</tr>” as the delimiters.

Then apply the RegEx for each item in the list to extract the cell data for that HTML row. I’d use the satimage.osax find command to do that. I think it would return a list of values for that HTML row, just like you would need it.

Pretty easy I think, just a bit tedious to flesh out the details.
Again, note that if the HTML code is not exactly like your example, then the RegEx may fail. You could make it more accommodating, but I didn’t take the time to do so. I just updated this post to make it a more general solution.
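
The split-then-extract steps above can be sketched roughly as follows. This is only a minimal sketch: it swaps the RegEx and the satimage.osax find command for plain text item delimiters so that it has no dependencies, and every handler name (splitText, joinText, stripTags, extractRows) is made up for illustration. It assumes the cells are td elements and that attribute values never contain ">".

```applescript
-- All handler names here are hypothetical. Plain AppleScript, no scripting additions.

on splitText(theText, theDelimiter)
	if theText is "" then return {""}
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set theItems to text items of theText
	set AppleScript's text item delimiters to savedDelimiters
	return theItems
end splitText

on joinText(theItems, theDelimiter)
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set theText to theItems as text
	set AppleScript's text item delimiters to savedDelimiters
	return theText
end joinText

on stripTags(theText)
	set chunks to splitText(theText, "<")
	set theResult to item 1 of chunks -- text before the first tag
	repeat with i from 2 to count of chunks
		set pieces to splitText(item i of chunks, ">")
		-- item 1 of pieces is the inside of the tag; the rest is the text that follows it.
		if (count of pieces) > 1 then set theResult to theResult & joinText(items 2 thru -1 of pieces, ">")
	end repeat
	return theResult
end stripTags

on extractRows(theHTML)
	set listOfRowLists to {}
	set rowChunks to splitText(theHTML, "<tr")
	repeat with r from 2 to count of rowChunks
		-- Keep only the part of the chunk before the row's closing tag.
		set rowHTML to item 1 of splitText(item r of rowChunks, "</tr>")
		set cellChunks to splitText(rowHTML, "<td")
		set rowValues to {}
		repeat with c from 2 to count of cellChunks
			set cellHTML to item 1 of splitText(item c of cellChunks, "</td>")
			-- cellHTML begins with the rest of the opening tag, so restore "<td" before stripping.
			set end of rowValues to stripTags("<td" & cellHTML)
		end repeat
		set end of listOfRowLists to rowValues
	end repeat
	return listOfRowLists
end extractRows

set sampleHTML to "<table><tr><td style=\"width: 47px;\"><strong>5070</strong></td><td><strong>11:00 AM</strong></td></tr></table>"
extractRows(sampleHTML)
```

Each row comes back as its own list of cell strings, which is the list-of-lists shape asked for in the original question. Like the RegEx, this will break on HTML that differs much from the sample.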

Good luck. If you don’t find an existing handler for this, please publish your solution. I’m sure it would be useful to many.

@estockly, just updated my above post for a more general RegEx.

Cocoa can take HTML and produce an attributed string, which you can then get the plain text from. It’s a quick and simple way to remove HTML code – however, it makes a right mess of tables.
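
For reference, the attributed-string route mentioned above might look something like this. It is a hedged sketch rather than a recommendation: plainTextFromHTML is a made-up handler name, the code assumes it runs on the main thread (as it normally does in a script editor), and non-ASCII text may need the encoding declared in the HTML itself.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "AppKit" -- the HTML initialiser for NSAttributedString is an AppKit addition
use scripting additions

-- Hypothetical helper: HTML in, plain text out, via an attributed string.
on plainTextFromHTML(theHTML)
	set theString to current application's NSString's stringWithString:theHTML
	set theData to theString's dataUsingEncoding:(current application's NSUTF8StringEncoding)
	set attributedString to current application's NSAttributedString's alloc()'s initWithHTML:theData documentAttributes:(missing value)
	if attributedString is missing value then error "The HTML could not be converted."
	return (attributedString's |string|()) as text
end plainTextFromHTML

plainTextFromHTML("<p>Hello <strong>world</strong></p>")
```

The result is a single run of text with the markup gone, so the row and cell boundaries of a table are not preserved in any reliable way.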

But there’s no reason you can’t just treat the table HTML as XML. So:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theTable to "<table>
<tbody>
<tr>
<td nowrap=\"nowrap\" style=\"width: 47px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>5070</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>1/29/2017</strong></td>
<td nowrap=\"nowrap\" style=\"width: 60px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>11:00 AM</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88-14-Glendale SC#5</strong></td>
<td nowrap=\"nowrap\" style=\"width: 45px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS</strong></td>
<td nowrap=\"nowrap\" style=\"width: 93px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-88-Grakasian</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-13-Way</strong></td>
<td nowrap=\"nowrap\" style=\"width: 42px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 114px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 102px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
</tr>
<tr>
<td nowrap=\"nowrap\" style=\"width: 47px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>6022</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>1/29/2017</strong></td>
<td nowrap=\"nowrap\" style=\"width: 60px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>12:30 PM</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88-14-Glendale SC#5</strong></td>
<td nowrap=\"nowrap\" style=\"width: 45px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14GA</strong></td>
<td nowrap=\"nowrap\" style=\"width: 93px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14GA-88A-Boehm</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14GA-13B-Baer</strong></td>
<td nowrap=\"nowrap\" style=\"width: 42px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 114px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-88-Grakasian(HELP)</strong></td>
<td nowrap=\"nowrap\" style=\"width: 102px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-13-Way</strong></td>
</tr>
</tbody>
</table>"
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:theTable options:0 |error|:(reference)
if theXMLDoc is missing value then error (theError's localizedDescription() as text)
set {theRows, theError} to theXMLDoc's nodesForXPath:("//tr") |error|:(reference)
if theRows is missing value then error (theError's localizedDescription() as text)
set listOfRowLists to {}
repeat with aRow in theRows
	set theEntries to (aRow's elementsForName:"td")
	set end of listOfRowLists to (theEntries's valueForKey:"stringValue") as list
end repeat
return listOfRowLists

Hey Ed,

Firstly – why are you using the message source when you can get just the content of the message from Mail?

tell application "Mail"
   set theMessage to first item of (get selection)
   
   tell theMessage
      # Content:
      set msgContent to content
      # Source:
      set msgSource to source
   end tell
   
end tell

Content gets you the text of the message.

Secondly – if you have to work with the source of the table it’s dead-simple to turn into text:

-------------------------------------------------------------------------
# Satimage.osax Dependent { http://tinyurl.com/dc3soh }
-------------------------------------------------------------------------

set _text to bbeditFrontWinText() -- Pulling the source from a BBEdit window.

# Strip Tags:
set _text to cng("<[^>]+>", "", _text) of me
# Pull Content into a list:
set theList to fnd("^.+", _text, true, true) of me

-------------------------------------------------------------------------
--» HANDLERS
-------------------------------------------------------------------------
on cng(_find, _replace, _data)
   change _find into _replace in _data with regexp without case sensitive
end cng
-------------------------------------------------------------------------
on fnd(_find, _data, _all, strRslt)
   try
      find text _find in _data all occurrences _all string result strRslt with regexp without case sensitive
   on error
      return false
   end try
end fnd
-------------------------------------------------------------------------
on bbeditFrontWinText()
   tell application "BBEdit"
      tell front document to its text
   end tell
end bbeditFrontWinText
-------------------------------------------------------------------------

** I can turn this into ASObjC if you need it.
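
For anyone without the Satimage.osax, an ASObjC take on the same two steps could look roughly like the sketch below. It is an assumption about how such a port might be written, not the version offered above: listFromTableSource is a made-up handler name, and the BBEdit step is left out so the source is simply passed in as text.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical helper: strip the tags, then return the non-empty lines as a list.
on listFromTableSource(theSource)
	set theString to current application's NSString's stringWithString:theSource
	-- Strip tags: same pattern as the cng() call, applied with NSString's regex search option.
	set theString to theString's stringByReplacingOccurrencesOfString:"<[^>]+>" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
	-- Pull content into a list: split on line breaks, then drop the empty lines.
	set theLines to theString's componentsSeparatedByCharactersInSet:(current application's NSCharacterSet's newlineCharacterSet())
	set thePredicate to current application's NSPredicate's predicateWithFormat:"length > 0"
	return (theLines's filteredArrayUsingPredicate:thePredicate) as list
end listFromTableSource

listFromTableSource("<tr>
<td><strong>5070</strong></td>
<td><strong>11:00 AM</strong></td>
</tr>")
```

As with the original, the result is one flat list of cell values, and a cell that contains a line break ends up as two items.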

-Chris

Great solution Shane! :thumbsup:
I’ve added your script to my library.

Chris, thanks for sharing yet another great solution! :thumbsup:
Also added to my library.

However, I do have one point to pick with you:
It may be “dead-simple” for you, but it ain’t for most of us (at least not for me.) :wink:

But as the RegEx master, I’d expect no less from you. :wink:

Thanks Chris!

Firstly – why are you using the message source when you can get just the content of the message from Mail?

The content breaks every cell into its own paragraph. I had been trying this, but sometimes the cells will have linefeeds and that screws everything up.

In what way does it make a mess of the tables? Seems to work pretty well. (Thanks!)

I was talking about two different approaches. The XML approach works fine; the approach going via an attributed string is the one that messes things up.

Hey Ed,

Aha! Okay, that makes perfect sense.

This problem is still easy to solve with regex – especially by using some tricks the Satimage.osax has up its sleeve.

** For testing I’ve taken the liberty of adding a linefeed with an extra entry in one of the cells.

-------------------------------------------------------------------------
# Satimage.osax-Dependent { http://tinyurl.com/dc3soh }
-------------------------------------------------------------------------

set theTable to "<table>
<tbody>
<tr>
<td nowrap=\"nowrap\" style=\"width: 47px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>5070
I AM NUTS</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>1/29/2017</strong></td>
<td nowrap=\"nowrap\" style=\"width: 60px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>11:00 AM</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88-14-Glendale SC#5</strong></td>
<td nowrap=\"nowrap\" style=\"width: 45px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS</strong></td>
<td nowrap=\"nowrap\" style=\"width: 93px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-88-Grakasian</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-13-Way</strong></td>
<td nowrap=\"nowrap\" style=\"width: 42px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 114px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 102px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
</tr>
<tr>
<td nowrap=\"nowrap\" style=\"width: 47px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>6022</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>1/29/2017</strong></td>
<td nowrap=\"nowrap\" style=\"width: 60px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>12:30 PM</strong></td>
<td nowrap=\"nowrap\" style=\"width: 78px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88-14-Glendale SC#5</strong></td>
<td nowrap=\"nowrap\" style=\"width: 45px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14GA</strong></td>
<td nowrap=\"nowrap\" style=\"width: 93px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14GA-88A-Boehm</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14GA-13B-Baer</strong></td>
<td nowrap=\"nowrap\" style=\"width: 42px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 96px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>Region 88</strong></td>
<td nowrap=\"nowrap\" style=\"width: 114px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-88-Grakasian(HELP)</strong></td>
<td nowrap=\"nowrap\" style=\"width: 102px;height: 26px;mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;\"><strong>U14BS-13-Way</strong></td>
</tr>
</tbody>
</table>"

set theTable to cng("(?m)<td[^>]+>|</td>", "•", theTable) of me
set theTable to cng("<[^>]+>", "", theTable) of me
set theList to fndUsing("(?m)•(.+?)•", "\\\1", theTable, true, true) of me

# Remove Vertical Whitespace from List Items if Necessary:
# set theList to cng("[\\n\\r]+", " ", theList) of me

-------------------------------------------------------------------------
--» HANDLERS
-------------------------------------------------------------------------
on cng(_find, _replace, _data)
   change _find into _replace in _data with regexp without case sensitive
end cng
-------------------------------------------------------------------------
on fndUsing(_find, _capture, _data, _all, strRslt)
   try
      set findResult to find text _find in _data using _capture all occurrences _all ¬
         string result strRslt with regexp without case sensitive
   on error
      false
   end try
end fndUsing
-------------------------------------------------------------------------

Three whole lines of code, and I could have done it in two:

set theList to fndUsing("(?m)<td[^>]+>(.+?)</td>", "\\\1", theTable, true, true) of me
set theList to cng("<[^>]+>", "", theList) of me

When parsing complex text it is sometimes much easier to remove segments or replace them with a placeholder, as I did with the bullet in the first script.

This methodology can often let you visualize what you’re doing much more easily than trying to get to your objective in one fell swoop.

Ultimately a real HTML or XML parser is going to be a more robust solution than manually parsing with regular expressions, but as you can see they have their place.

One of the cool things about the Satimage.osax is that it will find and change text directly in an AppleScript list object.

-Chris


P.S. Confound it! Discourse’s code blocks have a bad bug in them that hasn’t been fixed for years now.

This text “\\1” turns into this mush “\`” (minus the quotes).

To get around that in the code I’ve written "\\\1", and you the user must remove one of those backslashes.

Chris, thanks for sharing this.
That feature of Satimage.osax is very powerful, well worth noting.

Hi, has anybody written a script that parses an HTML table while taking the “rowspan” attribute into account?
It is a headache…

Before posting this, I had some ideas for this kind of processing:

– Fill a 2D list serially in the horizontal (xFill) or vertical (yFill) direction
http://piyocast.com/as/archives/7627

– Write out the HTML tables visible in a Safari window
http://piyocast.com/as/archives/7635
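
One possible starting point for the rowspan case, building on the NSXMLDocument approach earlier in this thread, is sketched below. It is an untested-in-the-wild assumption rather than a known solution: rowsExpandingRowspan is a made-up handler name, the table must parse as well-formed XML, only td cells are read, and colspan is ignored. A cell with rowspan="n" is repeated in the same column of the following n - 1 rows, and short rows are padded with empty strings.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical helper: returns a list of row lists with rowspan cells repeated downwards.
on rowsExpandingRowspan(theTable)
	set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:theTable options:0 |error|:(reference)
	if theXMLDoc is missing value then error (theError's localizedDescription() as text)
	set {theRows, theError} to theXMLDoc's nodesForXPath:"//tr" |error|:(reference)
	if theRows is missing value then error (theError's localizedDescription() as text)
	set listOfRowLists to {}
	set spanLeft to {} -- per column: how many more rows a spanning cell still covers
	set spanText to {} -- per column: the text of that spanning cell
	repeat with aRow in theRows
		set theCells to (aRow's elementsForName:"td")
		set cellCount to (theCells's |count|()) as integer
		set cellIndex to 0
		set columnIndex to 0
		set rowValues to {}
		repeat while (cellIndex < cellCount) or (columnIndex < (count of spanLeft))
			set columnIndex to columnIndex + 1
			if columnIndex > (count of spanLeft) then
				set end of spanLeft to 0
				set end of spanText to ""
			end if
			if (item columnIndex of spanLeft) > 0 then
				-- This column is still covered by a cell from an earlier row.
				set end of rowValues to item columnIndex of spanText
				set item columnIndex of spanLeft to (item columnIndex of spanLeft) - 1
			else if cellIndex < cellCount then
				set aCell to (theCells's objectAtIndex:cellIndex)
				set cellIndex to cellIndex + 1
				set cellText to (aCell's stringValue()) as text
				set spanCount to 1
				set spanAttribute to (aCell's attributeForName:"rowspan")
				if spanAttribute is not missing value then set spanCount to (spanAttribute's stringValue()'s integerValue())
				if spanCount < 1 then set spanCount to 1
				set end of rowValues to cellText
				set item columnIndex of spanLeft to spanCount - 1
				set item columnIndex of spanText to cellText
			else
				set end of rowValues to "" -- no cell and no span: pad the row
			end if
		end repeat
		set end of listOfRowLists to rowValues
	end repeat
	return listOfRowLists
end rowsExpandingRowspan

rowsExpandingRowspan("<table><tr><td rowspan=\"2\">A</td><td>B</td></tr><tr><td>C</td></tr></table>")
-- For this sample the second row should come back as {"A", "C"}.
```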