I have a great ubotter, but we seem to be running into some issues scraping a site: font errors, trouble scraping legal symbols, and even apostrophes within the text. I need a modular, multi-bot setup to help with speed and efficiency. The first bot:
1: Scrape articles from HG.org: https://www.hg.org/articles-for-260-areas-of-law.asp
2: User input: Articles By Area of Practice: Estate Planning etc…
3: All scraped data goes into an XAMPP db.
4: When scraping, the components of the article must be stripped and separated. (The website has <script> tags mixed in for ads and whatnot; they need to be removed.)
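
For step 4, here is a minimal Python sketch of one way to drop <script> blocks (and <style> blocks, which is an added assumption) while leaving the rest of the markup alone. It uses only the standard library; the names ScriptStripper and strip_scripts are hypothetical, and character references such as &#39; are passed through untouched so apostrophes and legal symbols are not mangled at this stage.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script>/<style> elements and their contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=False keeps entities like &sect; exactly as written.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through verbatim.
        if self.skip_depth == 0 and tag not in self.SKIPPED:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")


def strip_scripts(html_text: str) -> str:
    """Return html_text without script/style elements (comments are dropped too)."""
    parser = ScriptStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Called as strip_scripts(page_html), this returns the article markup with the ad scripts gone, ready for the heading and body split described in the component list.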
Components for DB:
A: Scraped URL
E: H2 article body
F: H2 - bot needs to make this an H3
G: H3 article body
H: H2 - bot needs to make this an H4
I: H4 article body
J: H2 - bot needs to make this an H5
K: H5 article body
L: H2 - bot needs to make this an H6
M: H6 article body
** For any articles that go past this, just keep the remaining headings at H6.
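
The renumbering rule in the component list could be sketched in Python roughly as follows. This is an illustration, not a finished bot: it assumes the article HTML has already had its scripts removed, that headings are plain <h2> elements, and that the first H2 stays an H2 while each later one drops a level until H6. The function name split_sections and the tuple layout are hypothetical.

```python
import re

# Matches one <h2 ...>heading</h2> element; attributes are ignored.
H2_PATTERN = re.compile(r"<h2\b[^>]*>(.*?)</h2\s*>", re.IGNORECASE | re.DOTALL)


def split_sections(article_html, first_level=2, max_level=6):
    """Split an article into (level, heading, body) tuples.

    The nth H2 on the page (counting from zero) gets level first_level + n,
    capped at max_level, so anything past the fifth heading stays at H6.
    """
    matches = list(H2_PATTERN.finditer(article_html))
    sections = []
    for index, match in enumerate(matches):
        level = min(first_level + index, max_level)
        # The body runs from the end of this heading to the start of the next.
        if index + 1 < len(matches):
            body_end = matches[index + 1].start()
        else:
            body_end = len(article_html)
        body = article_html[match.end():body_end].strip()
        sections.append((level, match.group(1).strip(), body))
    return sections
```

Each tuple lines up with one heading/body pair from the list (E through M), which keeps the database step simple: one heading, one body, one level per entry.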
***** If possible: some articles do not have any H2 tags etc…
Make the bot smart enough that if a piece of text is 8 words or fewer, with a double <br><br>, it is turned into an H tag in the corresponding order.
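
A rough Python sketch of that fallback, for articles with no H2 tags. It assumes the short text comes before the double <br><br> (the reading taken here), treats the final chunk of the article as body text, and returns labeled pieces rather than rewritten HTML. The name promote_short_lines is hypothetical.

```python
import re

# Two or more consecutive <br> tags, in any of the <br>, <br/>, <br /> spellings.
DOUBLE_BREAK = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def promote_short_lines(article_html, max_words=8, first_level=2, max_level=6):
    """Return a list of (kind, text) pairs, kind being 'h2'..'h6' or 'body'.

    A chunk becomes a heading when it has max_words words or fewer and is
    followed by a double break. Levels go h2, h3, ... and stop at h6.
    """
    chunks = DOUBLE_BREAK.split(article_html)
    last_index = len(chunks) - 1
    level = first_level
    pieces = []
    for index, chunk in enumerate(chunks):
        # Count words on the visible text only, ignoring inline tags.
        words = ANY_TAG.sub(" ", chunk).split()
        if not words:
            continue
        if index < last_index and len(words) <= max_words:
            pieces.append((f"h{level}", " ".join(words)))
            level = min(level + 1, max_level)
        else:
            pieces.append(("body", chunk.strip()))
    return pieces
```

One caveat with this heuristic: a genuinely short paragraph (for example a one-line disclaimer) would also be promoted, so a sample of results is worth checking by hand before trusting it across the whole site.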
Other known issues: this website blocks an IP address after 100 page views, so we need to swap proxies after that. The IP will be dead for 30 days after 100 page views. I have plenty of proxies to use.
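
The proxy swap could be handled with a small counter like the Python sketch below. It is only an outline under stated assumptions: proxies are given as "http://host:port" strings, a used-up proxy is never reused (since the IP stays blocked for 30 days), and the names ProxyRotator, ProxyPoolExhausted and fetch are hypothetical. Setting max_views a little under 100 would leave a safety margin.

```python
import urllib.request


class ProxyPoolExhausted(RuntimeError):
    """Raised when every proxy has used up its page-view budget."""


class ProxyRotator:
    def __init__(self, proxies, max_views=100):
        self._proxies = list(proxies)
        self._max_views = max_views
        self._index = 0   # which proxy is currently in use
        self._views = 0   # page views already spent on that proxy

    def acquire(self):
        """Return the proxy for the next page view and count that view."""
        if self._views >= self._max_views:
            # Budget spent: move on and never come back to this proxy.
            self._index += 1
            self._views = 0
        if self._index >= len(self._proxies):
            raise ProxyPoolExhausted("no unused proxies left")
        self._views += 1
        return self._proxies[self._index]


def fetch(url, rotator, timeout=30):
    """Download one page through whichever proxy the rotator hands out."""
    proxy = rotator.acquire()
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Because every call to acquire() counts as a page view, failed requests are counted too, which errs on the side of retiring a proxy early rather than getting it blocked.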
** GETTING THE DATA PROPERLY INTO THE DB IS THE HIGHEST PRIORITY **
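
Since the database step is the priority, here is a hedged Python sketch of a storage routine that should not choke on apostrophes or legal symbols. It uses the standard-library sqlite3 module purely as a stand-in so the example runs anywhere; with the MySQL/MariaDB server that ships with XAMPP the same pattern would go through a MySQL driver, with the connection and table character set on utf8mb4 and the driver's own placeholder style. The table name, columns, and save_sections function are hypothetical, and the text is assumed to be tag-free by this point.

```python
import html
import sqlite3


def save_sections(conn, url, sections):
    """Store (level, heading, body) tuples for one article, one row per section."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS article_sections (
            url      TEXT    NOT NULL,
            position INTEGER NOT NULL,
            level    INTEGER NOT NULL,
            heading  TEXT    NOT NULL,
            body     TEXT    NOT NULL,
            PRIMARY KEY (url, position)
        )
        """
    )
    rows = [
        # html.unescape turns &#39;, &sect;, &copy; etc. into real characters.
        (url, position, level, html.unescape(heading), html.unescape(body))
        for position, (level, heading, body) in enumerate(sections)
    ]
    # Placeholders (?) let the driver handle quoting, so an apostrophe in the
    # text can never break the SQL statement. Re-scraping a URL overwrites it.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO article_sections VALUES (?, ?, ?, ?, ?)",
            rows,
        )
```

Building the INSERT by string concatenation is the usual reason apostrophes break a scraper's database writes, and a non-UTF-8 column is the usual reason symbols like § come out as garbage, so those are the two things this sketch is meant to rule out.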