Developer
justindutoit
Posts: 400
Registered: ‎05-31-2009
My Device: Not Specified

Tough- fastest way to strip html & convert entity references

Hey. I'd like to learn best practices for code that removes html tags and converts entity references (ERs) like &amp;amp; to &amp; or to their equivalent chars. My goal is legible text from an html page.

 

- First, if there is a utility I'm not aware of which would be best to do this

- Second, which is faster, using e.g.

 

  indexOf("<div", index) == 0  or

  substring(index, index + 4).equals("<div")

?

 

- Third, anything which reduces work for the processor would be great

 

I haven't yet gotten away from the basic way of jumping from one ampersand/less-than sign to the next and checking over and over for html tags or &quot; and other ERs. It's inefficient. Rather than starting from my current effort, please give advice on how to redo this.

 

Cheers,

Justin D.

 

 

 

 

Developer
Ted_Hopp
Posts: 1,305
Registered: ‎01-21-2009
My Device: Not Specified

Re: Tough- fastest way to strip html & convert entity references

  1. If the HTML is properly structured (as in, valid XML), then you can parse it as XML, which will resolve all those entity references. If the HTML is not well-formed, you can always just strip out all the HTML tags except <html> and </html> to get a well-structured XML doc to parse.
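  2. If the string starts with "<div", then there's not a lot of difference. But if it does not, the second method (using substring) will tell you faster, because indexOf keeps scanning the rest of the string for a match while substring compares only four characters (unless the string is only 4 characters long).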
  3. Can you be more specific? We'd all like to reduce work for the processor.
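
To make point 2 concrete, here is a minimal sketch of the two checks from the question next to a third option, String.startsWith(prefix, offset). The class and method names are made up for illustration, and it is assumed that startsWith with an offset is available on the target platform. Note that indexOf(str, fromIndex) returns the absolute position of the match, so the result has to be compared with index rather than 0.

```java
// Hypothetical helper: three ways to ask
// "does html contain "<div" exactly at position index?" (index >= 0 assumed)
class PrefixCheck {

    // indexOf returns the ABSOLUTE position of the first match at or after
    // index, so compare with index, not 0. On a miss it scans to the end.
    static boolean viaIndexOf(String html, int index) {
        return html.indexOf("<div", index) == index;
    }

    // substring builds a new String first; the length guard avoids an
    // exception when fewer than 4 characters remain.
    static boolean viaSubstring(String html, int index) {
        return index + 4 <= html.length()
                && html.substring(index, index + 4).equals("<div");
    }

    // startsWith(prefix, offset) compares in place: no scan past the
    // prefix length and no temporary String.
    static boolean viaStartsWith(String html, int index) {
        return html.startsWith("<div", index);
    }
}
```

All three agree on the answer; they differ only in how much work they do when the tag is not at that position.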



Solved? click "Accept as solution". Helpful? give kudos by clicking on the star.
Developer
justindutoit
Posts: 400
Registered: ‎05-31-2009
My Device: Not Specified

Re: Tough- fastest way to strip html & convert entity references

Hi again Ted. Well, to parse it as xml, I first need to strip out the html tags as you say. Is there any utility or tutorial out there on this? I am finding that the html I strip is very unpredictable. I guess what I need is a way to strip which isn't too 'low level'. Less code. My current code works, but it has to strip a lot of tags and people won't wait more than 15 seconds (it doesn't download images). Thanks for your time :)
Developer
geeneeus
Posts: 80
Registered: ‎09-12-2009
My Device: Bold 9700
My Carrier: Vodafone UK

Re: Tough- fastest way to strip html & convert entity references

Just a quick suggestion that may (or may not) help.

 

Why not parse the information on a web server (there are plenty of free hosts) and then pull the data from there, already parsed? That reduces data traffic as well if you are only pulling the information you need.

 

I've written a few parsers for BlackBerry and they work, but they do take a few seconds depending on the size of the file.

Genius Development Scotland
Website: http://www.genius-dev.co.uk
Developer
justindutoit
Posts: 400
Registered: ‎05-31-2009
My Device: Not Specified

Re: Tough- fastest way to strip html & convert entity references

Hi. The urls are the full articles for news stories for an rss feed. They change constantly so I would have to send a BB request to my web app page which would do the actual call and strip. But I don't think I will as it means maintaining a web app/server and so on. Thanks anyway :)

Any utilities for java out there?

Justin
Developer
geeneeus
Posts: 80
Registered: ‎09-12-2009
My Device: Bold 9700
My Carrier: Vodafone UK

Re: Tough- fastest way to strip html & convert entity references

But if they constantly change, would it not be better to update the web app that strips excess data rather than having to release an app update to fix changes to the RSS feed?

 

I could have completely misunderstood you there, but I'm just curious.

 

I use a PHP based web app and have the web app format the information it pulls from a database before it is read by the BB, that way if I wanted to re-format the info in any way I could do so from the web app.

 

Again, this solution may not work for you, but I thought it would be a good suggestion.

 

As for writing your own I can suggest the following and hope it works out.

 

Try reading character by character.

If you encounter a <, start ignoring characters until you reach a >.

Otherwise keep adding characters to the StringBuffer, and repeat until you reach the end of the document.
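
The steps above can be sketched roughly as follows. This is only an illustration, not a tested BlackBerry utility: the class name HtmlText, the method toPlainText, and the small entity table are all made up for the example, and the code does not handle comments, script/style content, or a > inside an attribute value. It does both jobs from the original question (dropping tags and converting a few entity references) in a single pass over the input.

```java
// Hypothetical single-pass tag stripper and entity decoder.
class HtmlText {
    // Deliberately tiny entity table (an assumption); extend as needed.
    // &nbsp; is mapped to a plain space here for legibility.
    private static final String[] NAMES = {"amp", "lt", "gt", "quot", "apos", "nbsp"};
    private static final char[] CHARS = {'&', '<', '>', '"', '\'', ' '};

    static String toPlainText(String html) {
        int n = html.length();
        StringBuffer out = new StringBuffer(n);
        boolean inTag = false;
        for (int i = 0; i < n; i++) {
            char c = html.charAt(i);
            if (inTag) {
                // Ignore everything until the tag closes.
                if (c == '>') inTag = false;
            } else if (c == '<') {
                inTag = true;
            } else if (c == '&') {
                // Treat it as an entity only if a ';' follows within 10 chars.
                int semi = html.indexOf(';', i + 1);
                char decoded = 0;
                if (semi > i && semi - i <= 10) {
                    decoded = decode(html.substring(i + 1, semi));
                }
                if (decoded != 0) {
                    out.append(decoded);
                    i = semi; // skip past the whole reference
                } else {
                    out.append(c); // unknown or bare '&': keep it as text
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Returns the decoded character, or 0 if the name is not recognised.
    private static char decode(String name) {
        for (int k = 0; k < NAMES.length; k++) {
            if (NAMES[k].equals(name)) return CHARS[k];
        }
        if (name.length() > 1 && name.charAt(0) == '#') {
            try {
                char second = name.charAt(1);
                int code = (second == 'x' || second == 'X')
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1));
                if (code > 0 && code <= 0xFFFF) return (char) code;
            } catch (NumberFormatException e) {
                // Not a valid numeric reference; fall through.
            }
        }
        return 0;
    }
}
```

Calling HtmlText.toPlainText(pageSource) returns the text with tags dropped; each character of the input is examined once, and the only temporary strings are the short entity names.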

 

Hope this helps.

Developer
justindutoit
Posts: 400
Registered: ‎05-31-2009
My Device: Not Specified

Re: Tough- fastest way to strip html & convert entity references

Hi thanks. How much faster is it to append strings to a stringbuffer rather than this: mywholehtml = mywholehtmlbeforetag + mywholehtmlaftertag; ? That's what it does now. Cheers, J.
Developer
Ted_Hopp
Posts: 1,305
Registered: ‎01-21-2009
My Device: Not Specified

Re: Tough- fastest way to strip html & convert entity references

The compiler basically converts String addition to StringBuffer appends. If you're doing all your appending in one expression, there's no difference in speed. On the other hand, if you're doing something like this:

String x = a + b;

String y = x + c;

String z = y + d;

then using your own explicit StringBuffer is more efficient. The reason is that the compiler will create a new StringBuffer for each of the above statements.

 

I should point out that performance gains at this level are going to be minuscule unless you're doing this thousands of times (such as in a loop of some sort).
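
As a rough illustration of the loop case, the sketch below joins an array of strings both ways. The class and method names are invented for the example, and exactly how a given compiler lowers the + operator varies by version, so treat the comments as describing the general pattern rather than any one toolchain.

```java
// Hypothetical comparison of repeated concatenation vs. one explicit buffer.
class JoinDemo {

    // Each += produces a brand-new String, which means copying
    // everything accumulated so far on every iteration.
    static String joinWithConcat(String[] parts) {
        String s = "";
        for (int i = 0; i < parts.length; i++) {
            s += parts[i];
        }
        return s;
    }

    // One buffer is reused for all appends; a single String is
    // created at the end.
    static String joinWithBuffer(String[] parts) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < parts.length; i++) {
            sb.append(parts[i]);
        }
        return sb.toString();
    }
}
```

Both methods return the same result; the difference only becomes noticeable when the loop runs many times or the accumulated text is large, as with a whole html page.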



