Making an HTML parser for a website

7 replies to this topic
David.
  • David.

    Street Cat

  • Members
  • Joined: 29 Apr 2009
  • India

#1

Posted 01 June 2013 - 09:51 AM

Hey guys,

Let's say I'm trying to parse data from a website that has no JSON or RSS feeds. For instance, this forum website. If I were to make an app for this website, would it be wise to parse the raw HTML, or is there a better way?


Thanks!

OzzySM12
  • OzzySM12

    Hmm...

  • Members
  • Joined: 07 Nov 2004

#2

Posted 03 June 2013 - 08:38 AM

It will depend on what you can get from the site.

Does the site have a print version or a basic/mobile template you can request? If you can get those then it will send more of the data you need and less of the added sh*t.

Also try asking the folk to set up an RSS feed. I'd imagine they would if it is of benefit to them.

fastman92
  • fastman92

    фастман92 | ف

  • Members
  • Joined: 28 Jul 2009

#3

Posted 03 June 2013 - 12:04 PM

If you want to parse an HTML document and read the necessary information, there are a few decent HTML parsers.
For C#, I have used Html Agility Pack.

Using regex is not a good way to extract information from an HTML document, because the document may be poorly written.
You can't write a complete HTML parser unless you have solid programming skills.

eggburt
  • eggburt

    Lil' G Fizzle

  • Members
  • Joined: 23 Jan 2003
  • United-Kingdom

#4

Posted 03 June 2013 - 04:59 PM

What language are you using? The best ones I know of are

(Python) Beautiful Soup
(PHP) PHP DOM
(JavaScript) jQuery

If you clarify what language you're after I'd be happy to provide a few more hints :-)

David.
  • David.

    Street Cat

  • Members
  • Joined: 29 Apr 2009
  • India

#5

Posted 03 June 2013 - 05:06 PM

QUOTE (OzzySM12 @ Monday, Jun 3 2013, 11:38)
It will depend on what you can get from the site.

Does the site have a print version or a basic/mobile template you can request? If you can get those then it will send more of the data you need and less of the added sh*t.

Also try asking the folk to set up an RSS feed. I'd imagine they would if it is of benefit to them.

Well, for GTAforums.com there is no mobile version, and I don't think we'll have an RSS feed set up anytime soon.

@fastman92

Since I'm coding in Java, I've found that TagSoup and HtmlCleaner are pretty good.

I guess what I'm asking is: is this really a good idea? Just parsing the raw HTML documents to make a mobile app?

fastman92
  • fastman92

    фастман92 | ف

  • Members
  • Joined: 28 Jul 2009

#6

Posted 05 June 2013 - 09:12 PM

QUOTE (d4v1d @ Monday, Jun 3 2013, 19:06)
Well, for GTAforums.com there is no mobile version, and I don't think we'll have an RSS feed set up anytime soon.

@fastman92

Since I'm coding in Java, I've found that TagSoup and HtmlCleaner are pretty good.

I guess what I'm asking is: is this really a good idea? Just parsing the raw HTML documents to make a mobile app?

If the server doesn't provide any data other than pages, then you have to parse the HTML pages to get the information.
You can do that by parsing the HTML with a library and selecting the elements you need by ID, class, name, tag, or some other criterion.
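
The select-by-class approach described above can be sketched in Java, the language used elsewhere in this thread. This is only a minimal sketch: it uses the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) so that it runs without extra jars, which is a swap-in and not a library anyone in the thread recommended. The class name `ClassTextExtractor`, the method `textByClass`, and the sample markup are all made up for illustration; a real forum page will use different tags and class names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text inside every element that has
// the given tag and exactly the given class attribute value.
class ClassTextExtractor {

    static List<String> textByClass(String html, HTML.Tag tag, String cssClass) {
        List<String> results = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private StringBuilder current; // non-null while inside a matching element
            private int depth;             // nesting depth of the same tag inside the match

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet attrs, int pos) {
                if (current != null) {
                    if (t == tag) {
                        depth++;
                    }
                    return;
                }
                Object cls = attrs.getAttribute(HTML.Attribute.CLASS);
                // Exact match only; a value such as "post first" would not match "post".
                if (t == tag && cls != null && cssClass.equals(cls.toString())) {
                    current = new StringBuilder();
                    depth = 1;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (current != null && t == tag && --depth == 0) {
                    results.add(current.toString().trim());
                    current = null;
                }
            }
        };

        try {
            // The last argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for a downloaded page.
        String html = "<html><body>"
                + "<div class=\"post\">Hello</div>"
                + "<div class=\"sig\">Bye</div>"
                + "</body></html>";
        System.out.println(textByClass(html, HTML.Tag.DIV, "post"));
    }
}
```

The JDK parser dates from the HTML 3.2 era, so for a real app a dedicated library such as the TagSoup or HtmlCleaner mentioned earlier in the thread is probably the safer choice; the callback idea stays the same.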

eggburt
  • eggburt

    Lil' G Fizzle

  • Members
  • Joined: 23 Jan 2003
  • United-Kingdom

#7

Posted 06 June 2013 - 11:10 AM

QUOTE (d4v1d @ Monday, Jun 3 2013, 17:06)
I guess what I'm asking is: is this really a good idea? Just parsing the raw HTML documents to make a mobile app?

Purely from a support point of view, I'd try to find another option. Generally speaking, if the site changes its layout in any way, your app will break.

What site is it you're looking to app-ify? If it is indeed GTAF, your best bet would be the printable version of each page.

For example this topic is
http://www.gtaforums...howtopic=558931

You'd take the showtopic value and use it for the t value here
http://www.gtaforums...rinter&t=558931

That page is more likely to keep the same layout for a long time.
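
The showtopic-to-t mapping described above could be sketched like this in Java. Both URLs in the post are truncated, so the sketch does not try to reconstruct them: `forum.example`, the `view=printable` query string, and the names `PrintableUrl`, `topicId`, and `printableUrl` are placeholders invented for illustration. The real printable address has to be copied from the site itself.

```java
import java.net.URI;

// Hypothetical helper: moves the topic id from a "showtopic" parameter
// into the "t" parameter of a printable-page URL.
class PrintableUrl {

    /** Returns the value of the showtopic query parameter, or null if it is absent. */
    static String topicId(String topicUrl) {
        String query = URI.create(topicUrl).getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("showtopic")) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    /**
     * Appends "&t=<id>" to printableBase. printableBase is an assumption:
     * it must already contain whatever query string the real site expects.
     */
    static String printableUrl(String topicUrl, String printableBase) {
        String id = topicId(topicUrl);
        if (id == null || !id.matches("\\d+")) {
            throw new IllegalArgumentException("No numeric showtopic value in: " + topicUrl);
        }
        return printableBase + "&t=" + id;
    }

    public static void main(String[] args) {
        // Placeholder URLs, not the forum's real ones.
        String topic = "http://forum.example/index.php?showtopic=558931";
        String base = "http://forum.example/index.php?view=printable";
        System.out.println(printableUrl(topic, base));
    }
}
```

Checking that the id is numeric before building the URL keeps a malformed link from turning into a request for some unrelated page.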


Something else to watch out for, however, is that in this case the printable page shows the entire topic. If your app becomes popular, it may not last long because of the load it puts on the site and, depending on the site's hosting plan, the bandwidth it uses up.
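
One way to keep the load down is to avoid fetching the same page again within a short window. Below is a minimal sketch of such a cache in Java. It is an illustration, not something proposed in the thread: the `PageCache` class, its `fetcher` callback (standing in for whatever HTTP code the app uses), and the five-minute window are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical time-based cache: a page is re-fetched only when the
// cached copy is older than maxAgeMillis.
class PageCache {

    private static final class Entry {
        final String body;
        final long fetchedAt;

        Entry(String body, long fetchedAt) {
            this.body = body;
            this.fetchedAt = fetchedAt;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final Function<String, String> fetcher; // does the real download
    private final long maxAgeMillis;
    private final LongSupplier clock;               // injectable so it can be tested

    PageCache(Function<String, String> fetcher, long maxAgeMillis, LongSupplier clock) {
        this.fetcher = fetcher;
        this.maxAgeMillis = maxAgeMillis;
        this.clock = clock;
    }

    synchronized String get(String url) {
        long now = clock.getAsLong();
        Entry cached = entries.get(url);
        if (cached != null && now - cached.fetchedAt < maxAgeMillis) {
            return cached.body; // still fresh, no network request
        }
        String body = fetcher.apply(url);
        entries.put(url, new Entry(body, now));
        return body;
    }

    public static void main(String[] args) {
        int[] fetches = {0};
        PageCache cache = new PageCache(
                url -> {
                    fetches[0]++; // stand-in for a real HTTP request
                    return "<html>stub for " + url + "</html>";
                },
                5 * 60 * 1000L,
                System::currentTimeMillis);
        cache.get("http://forum.example/topic/1");
        cache.get("http://forum.example/topic/1"); // served from the cache
        System.out.println("network fetches: " + fetches[0]);
    }
}
```

A cache like this only reduces repeat requests from a single device; it does not replace asking the site for permission.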

I'd definitely recommend seeking permission from the site first. They may even have a better option for you that you don't know about.


David.
  • David.

    Street Cat

  • Members
  • Joined: 29 Apr 2009
  • India

#8

Posted 07 June 2013 - 03:25 AM

Thanks for all the help guys!


I guess I'll try asking Mr. Tank in a mail or something if there are any additional options. Otherwise I'll just go ahead with the printable version like eggburt suggested.

I'm rather new to Android development, and I thought it would be a fun project to try out: making an app for GTAF.



