All Questions

3 votes · 1 answer · 2k views

Scrapy crawl extracted links

I need to crawl a website and follow every URL from that site found on a specific xpath. For example: I need to crawl "http://someurl.com/world/", which has 10 links in the container (xpath("//div[@class='...
asked by Nikola Niko
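The excerpt cuts off mid-xpath, so the container class below is a placeholder. This is only a sketch of how such a spider is usually structured in Scrapy: collect the links inside one container, follow each of them, and parse the target pages in a second callback.

```python
import scrapy


class WorldSpider(scrapy.Spider):
    # Spider name, start URL, and container class are placeholders based on
    # the truncated example in the question.
    name = "world"
    start_urls = ["http://someurl.com/world/"]

    def parse(self, response):
        # Collect every link inside the container div and follow each one.
        for href in response.xpath("//div[@class='container']//a/@href").getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # Extract whatever is needed from each followed page.
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}
```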
2 votes · 2 answers · 3k views

How to get Python to load the right library (a .dylib, not .so.3) on OS X

I'm using the extractor module in python 2.7 via pip install extractor. I'm on OS X using homebrew, and I have previously run homebrew install libextractor. This creates files with extensions .a ...
asked by FrobberOfBits
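The usual symptom here is that the binding looks for a Linux-style shared-object name while Homebrew only installs a .dylib. As a rough diagnostic (not the extractor module's own loading code, and with an assumed Homebrew prefix), ctypes can confirm whether the library itself loads on macOS:

```python
import ctypes
import ctypes.util

# find_library searches the standard macOS locations for libextractor.dylib;
# if it returns None, fall back to an explicit path.  The /usr/local prefix is
# an assumption (the older Homebrew default); adjust it for your install.
path = ctypes.util.find_library("extractor") or "/usr/local/lib/libextractor.dylib"
lib = ctypes.CDLL(path)
print("loaded %s" % path)
```

If this loads fine but the pip-installed module still fails, the problem is in how the binding resolves the library name rather than in the library itself.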
1 vote · 2 answers · 4k views

Wikipedia extractor problem: ValueError: cannot find context for 'fork'

My aim is to get plain text (without links, tags, parameters and other trash, only the article text) from Wikipedia XML dumps (https://dumps.wikimedia.org/backup-index.html). I found the WikiExtractor Python ...
asked by Shurup
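The traceback text matches what multiprocessing raises when the 'fork' start method is requested on a platform that does not provide it, typically Windows. A small check, independent of WikiExtractor itself, makes the cause visible; the 'spawn' fallback is only a sketch of one workaround and assumes the calling script can use it.

```python
import multiprocessing

# 'fork' exists only on Unix-like systems; on Windows,
# multiprocessing.get_context('fork') raises
# "ValueError: cannot find context for 'fork'".
print(multiprocessing.get_all_start_methods())

try:
    ctx = multiprocessing.get_context("fork")
    print("fork is available")
except ValueError:
    # Typical on Windows: run the extractor under WSL, a Linux VM, or Docker,
    # or patch the script to use the 'spawn' start method instead.
    ctx = multiprocessing.get_context("spawn")
    print("fork is not available; falling back to spawn")
```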
0 votes · 0 answers · 14 views

Why can't my regex pick up the phone #s on the web page? [duplicate]

I am building a phone and email extractor using Python regex, and while it works for the emails, it won't work for the phone numbers. The code for finding phone number matches on the ...
asked by Marcelino Velasquez
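The question's code is cut off, so this is only a standalone sketch of a phone pattern that does match common US-style numbers; the sample text and number format are assumptions, not the asker's data.

```python
import re

# Made-up sample text for illustration.
text = "Call 415-555-1234 or (415) 555-9999 ext 42, or email foo@example.com"

# US-style phone pattern: optional area code, separator, three digits,
# separator, four digits, optional extension.  re.VERBOSE lets the pattern
# be spread over several commented lines.
phone_regex = re.compile(r"""(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
)""", re.VERBOSE)

# findall returns one tuple per match because the pattern contains groups;
# the first element of each tuple is the whole matched number.
for groups in phone_regex.findall(text):
    print(groups[0])
```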
0 votes · 1 answer · 192 views

How can I resolve "recursion depth exceeded" (Goose-extractor)?

I have a problem with goose-extractor. This is my code: for resultado in soup.find_all('a', href=True, text=re.compile(llave)): url = resultado['href'] article = g.extract(url=url) ...
asked by papabomay
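Raising Python's recursion limit and skipping pages that still overflow is a common workaround for this error, though not necessarily the asker's eventual fix. The sketch below rebuilds the truncated loop around that idea; the target URL and the llave keyword are placeholders, and on Python 2 the overflow surfaces as a RuntimeError.

```python
import re
import sys

import requests
from bs4 import BeautifulSoup
from goose import Goose

# Placeholders: the question does not show how soup or llave were built.
llave = "noticias"
soup = BeautifulSoup(requests.get("http://example.com/noticias/").text, "html.parser")

# A higher limit helps with deeply nested pages; the try/except keeps one
# pathological page from aborting the whole crawl.
sys.setrecursionlimit(10000)
g = Goose()

for resultado in soup.find_all('a', href=True, text=re.compile(llave)):
    url = resultado['href']
    try:
        article = g.extract(url=url)
        print("%s -> %s" % (url, article.title))
    except RuntimeError as exc:  # "maximum recursion depth exceeded" on Python 2
        print("skipped %s: %s" % (url, exc))
```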